Big Data Stewardship is a Big Deal
Digital Preservation 2012 Day 2 began with a panel, Big Data Stewardship is a Big Deal, featuring short talks and a discussion by:
- Trisha Cruse, University of California Curation Center, California Digital Library
- Tom Kalil, Office of Science and Technology Policy, EOP
- Myron Gutmann, Social, Behavioral and Economic Sciences, National Science Foundation
- Leslie Johnston, NDIIPP (presenter and moderator)
Highlights from these presentations include:
- “By 2015 the world will generate the equivalent of 93 million Libraries of Congress”
- Characteristics of big data include volume, heterogeneity, diversity, and complexity
- To be useful, we must “get from data to knowledge to action”
- We need more tools to manage data
- We have more data than we can store
- Data sets are too large to download
- The long tail
- Two services used to mitigate the long tail challenge are:
- Data Management Planning Tool (DMPTool)
- Gets researchers to think about how to manage their data upfront
- Open source
- Facilitates sharing, preserving, using, citing, and reusing tabular data
- Available as a web app and as software
- Can use it to create a citation
- Three case studies of digital collections as big data:
- Web archives
- Historic newspapers
- The ingest and inventory of these collections are understood with the exception of scale
- How do we process these collections? Do we utilize the infrastructure of the LOC, or do we make the collections available as is?
- The service model is changing and presents issues, and increasingly we are looking towards a self-service model
- Analyze the (big) data where it resides
- We can never anticipate how users will use data – we can facilitate and should strive to provide data as simply as possible and let researchers at it.
- Researchers don’t work within their institution; they work within their discipline
Preserving Digital Culture
The day’s second presentation, Preserving Digital Culture, included talks and discussion by:
- Jim Boulton, Story Worldwide
- Doug Reside, New York Public Library
- Ben Fino-Radin, Rhizome
- Megan Winget, University of Texas
- Kari Kraus, University of Maryland (moderator)
Jim’s presentation on digital archaeology covered a range of material, from his exhibits of old websites on old browsers and computers (InternetWeek) to the three types of ownership of web content: physical, legal, and emotional. Central to Doug’s presentation was the concept that we should see scholars as collaborators who can assist in crowdsourcing.
Ben indicated that, because 75% of Rhizome’s works are stored externally, the best way for the individuals at Rhizome to ensure the preservation of electronic art content was to become archivists themselves. Link rot is only part of the preservation problem for digital art; authentic rendering is just as important. Indeed, Ben stated that electronic art changes as technology evolves. Emulation thus becomes important for preserving the aesthetics of a work, because it does not modify the bit stream and presents the work as it was originally intended.
Finally, Megan made the case that digital preservation is a wicked, rather than tame, problem. It looks like a tame problem, but it is not, because one cannot know when it has been completely solved. A key challenge is how to preserve content on sites such as Pinterest and Polyvore, which present different content to each person. However, individuals and groups are working on preserving such content, including Jason Scott and the Archive Team, and the Internet Archive. Megan stressed that while we need to address the technical aspects, we must also remember to address the cultural concerns.
The panel as a whole addressed concerns such as how one can know whether an emulation is rendering a work correctly. Panel members likened emulation to oral history: one must sit with the authors and talk to those who have seen the work, as well as preserve the bits and metadata. Also addressed was the usage of the term “archaeology” in this context. It was determined that “archaeology” is used because the work sits within the cultural sphere, where we are interested in both the media and the data.
Funding the Digital Preservation Agenda: A Status Report and Open Discussion with Major U.S. Granting Institutions
The final all-group presentation was Funding the Digital Preservation Agenda: A Status Report and Open Discussion with Major U.S. Granting Institutions, presented by:
- Bob Horton, Institute of Museum and Library Services
- Kathleen Williams, National Historical Publications and Records Commission
- Joel Wurl, National Endowment for the Humanities
- Bill LeFurgy, Library of Congress (moderator)
Highlights from these presentations included:
- NHPRC is a part of the National Archives: http://www.archives.gov/nhprc/announcement/electronic.html
- Digital preservation and fixity do not always go together
- Long term access is a continuous process – “dynamic nature of digital preservation”
- “Paradigm shift from project to program – think about digital preservation as an institutional program”
Planning Digital Preservation at Different Scales for Smaller Institutions
The afternoon consisted of breakout sessions, so our WSU NDSA contingent split up, with each of us attending different sessions. The first session, Planning Digital Preservation at Different Scales for Smaller Institutions, included three presentations by:
- Jessica Branco Colati, Northeast Document Conservation Center
- Jennifer Gunter Kind, Hampshire College
- Deborah J. Rossum, SCOLA
Colati’s presentation, Digital Preservation Pathways, discussed her experience at the Northeast Document Conservation Center (NDCC), a non-collecting, regional conservation center focused on providing digital preservation education through conference presentations and webinars. This education centers on digital collections care, ranging from reformatting to security. Colati discussed how the NDCC communicates and provides guidance, stating that the biggest challenge is understanding: understanding clients, collaborators, and even themselves. In her words, they must “drink our own champagne.” Three things the NDCC remains cognizant of are:
- Be aware of special needs
- Remember people want to do the right thing and preserve for the long term
- Start with the basics and ask for help when needed
The primary digital preservation questions that the NDCC faces are:
- Should I _____?
- How do I _____?
- What’s the right _____?
- Who should _____?
- How do I talk to _____ about doing this?
- How do I know if I made the right choice?
The primary areas of concern are:
- Standards and best practices
- Workflows and staffing
- Storage for preservation
- Rights and usage
Colati stated that the NDCC strives to “put clients safely on the right path” by offering good, better, and best digital preservation advice, which allows the client to choose the option that is realistically best for them.
Colati also listed the following resources:
- NISO’s A Framework of Guidance for Building Good Digital Collections
- CRL’s Trustworthy Repositories Audit & Certification: Criteria and Checklist
- FADGI (a good decision making matrix)
- LOC – Sustainable Formats
- Check out Google to see what else is out there for your needs
- New Roles for New Times: Digital Curation for Preservation
- ARL – Code of Best Practices
- Seeing Standards: A Visualization of Metadata Standards
Kind’s presentation, Planning Digital Preservation and Different Scales for Smaller Institutions: The Mount Holyoke Story, discussed the implementation of a records retention policy for born-digital materials.
Kind stated that born-digital material caused Mount Holyoke difficulty because the workflows previously used for analog material did not translate well to preserving born-digital content. The records retention policy put into place on January 1, 2007 required the same treatment of both paper and electronic records, which strained the mere 2 FTE on staff, so a grant proposal was sent to the NHPRC. Although the program began small, with publications and meeting minutes, starting this way provided the tools to grow larger and more pragmatic by learning through doing. Software, applications, and repositories within the electronic records workflow include:
- Duke Data Accessioner
Kind concluded that although “born digital materials are undercollected, undercounted, undermanaged, and [often] inaccessible,” strides are being made to reverse this.
The final presenter, Rossum, presented Best Practices for Digital Preservation in Smaller Institutions: The SCOLA Model. SCOLA is a small, educational non-profit located in Iowa, specializing in less-commonly taught languages. Rossum noted that their archival process included four steps: ingest, publish, storage, and use. Their end users are primarily language educators and linguists. Finally, SCOLA’s “original partnership with the NDIIPP stemmed from SCOLA’s role as a nexus of foreign news broadcasts and was part of the LOC’s ‘initiative to collect and preserve.’”
Defining Levels of Preservation (Sponsored by the NDSA Innovation Working Group)
The final session attended was Defining Levels of Preservation (Sponsored by the NDSA Innovation Working Group), presented by members of the NDSA Innovation Working Group. This project was chartered as an NDSA action team, with the goal “to develop concise and easy to use rubric to help organizations manage and mitigate digital preservation risks.” As such, the Working Group presented their draft rubric, asking for feedback. Key points about the rubric were:
- It helps organizations develop plans, although it is not a plan in itself
- Levels within the rubric are nonjudgmental
- Levels can be applied to collection(s) or system(s)
- It is designed to be content and system agnostic
The rubric consists of four levels for data: protect, know, monitor, and fix. Each level adds further risk mitigation and is inclusive of the levels below it (e.g., Level 2 includes Level 1). Additionally, there are six categories, each of which is rated against these levels. The rubric (and project) is not yet finished, and the Working Group welcomes ideas.
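The inclusive structure of the levels can be illustrated with a minimal Python sketch. This is not the Working Group's rubric, which was still in draft and is not reproduced here. Only the four level names come from the session; the `Level` enum and the helper functions `included_levels` and `achieved_level` are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four draft levels named in the session, lowest first."""
    PROTECT = 1
    KNOW = 2
    MONITOR = 3
    FIX = 4


def included_levels(level):
    """Return every level that `level` covers, lowest first.

    Levels are inclusive, so Level 2 also covers Level 1.
    (Hypothetical helper, not part of the rubric.)
    """
    return [lower for lower in Level if lower <= level]


def achieved_level(met):
    """Return the highest level reached given the set of levels whose
    requirements are met, or None if Level 1 is not met.

    Because levels are inclusive, a gap at a lower level caps the result.
    (Hypothetical helper, not part of the rubric.)
    """
    achieved = None
    for level in Level:
        if level not in met:
            break
        achieved = level
    return achieved


if __name__ == "__main__":
    print(included_levels(Level.KNOW))
    print(achieved_level({Level.PROTECT, Level.KNOW, Level.FIX}))
```

In this sketch, an organization that meets the protect, know, and fix requirements but not monitor would sit at the know level, since a higher level only counts once every level beneath it is satisfied.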