2016 Web Archiving RT Meeting Agenda!

Web Archiving Roundtable Meeting
Wednesday, August 3
4-5:30 PM, Salon D

4:00-4:15 Welcome and General Business Meeting (Kate Stratton and John Bence)
4:20-4:35 NDSA Survey update (Nicholas Taylor)
4:40-4:55 Internet Archive WASAPI project update (Jefferson Bailey)
5:00-5:30 OCLC Research Web Archiving and Metadata Working Group update and discussion (Jackie Dooley)

We’re looking forward to seeing you there!

Guest Post on Web Archiving: Andrea Goethals

Variations of this post have also been published on the Harvard Library website, the Library of Congress’ Signal blog, and the IIPC’s blog.

In the last couple of years, managing born-digital material, including content that originated on the Web, has been one of Harvard Library’s strategic priorities. Although the Library had been engaged in the collection, preservation and delivery of web content for several years, a strategy was needed to make this activity more scalable and sustainable at the university. The Library formed a Web Archive Working Group to gather information and make recommendations for a web archiving strategy for Harvard Library. One of the information-gathering activities the Working Group engaged in over the last year was an environmental scan of the current practices, issues and trends in web archiving nationally and internationally. Two members of the Working Group, Andrea Goethals and Abigail Bordeaux, worked closely with a consultant, Gail Truman of Truman Technologies, to conduct the five-month study and write the report. The study began in August 2015 and was made possible by the generous support of the Arcadia Fund. The final report is now available from Harvard’s open access repository, DASH.

The study included a series of interviews with web archiving practitioners from archives, museums and libraries worldwide; web archiving service providers; and researchers who use web archives. The interviewees were selected from the membership of several organizations, including the International Internet Preservation Consortium (IIPC), the Web Archiving Roundtable at the Society of American Archivists (SAA), the Internet Archive’s Archive-It Partner Community, the Ivy Plus institutions, Working with Internet archives for REsearch (Rutgers/WIRE Group), and the Research infrastructure for the Study of Archived Web materials (RESAW).

The interviews of web archiving practitioners covered a wide range of areas: everything from how institutions maintain their web archiving infrastructure (e.g. outsourcing, staffing, location in the organization) to how they are (or aren’t) integrating their web archives with their other collections. From this data, profiles were created for 23 institutions, and the data was aggregated and analyzed to identify common themes, challenges and opportunities.

In the end, the environmental scan revealed 22 opportunities for future research and development. These opportunities are listed in Table 1 and described in more detail in the report. At a high level, these opportunities fall under four themes: (1) increase communication and collaboration, (2) focus on “smart” technical development, (3) focus on training and skills development, and (4) build local capacity.


22 Opportunities to Address Common Challenges

(the order has no significance)

1. Dedicate full-time staff to work in web archiving so that institutions can stay abreast of the latest developments and best practices and fully engage in the web archiving community.
2. Conduct outreach, training and professional development for existing staff, particularly those working with more traditional collections, such as print, who are being asked to collect web archives.
3. Increase communication and collaboration across types of collectors since they might collect in different areas or for different reasons.
4. Create a funded collaboration program (for example, a bursary award) to support researcher use of web archives by gathering feedback on requirements and impediments to their use.
5. Leverage the membership overlap between RESAW and European IIPC membership to facilitate formal researcher/librarian/archivist collaboration projects.
6. Make institutional web archiving programs transparent about their holdings, indicating what material each has, its terms of use and preservation commitment, and the curatorial decisions made for each capture.
7. Develop a collection development tool (e.g. registry or directory) to expose holdings information to researchers and other collecting institutions even if the content is viewable only in on-site reading rooms.
8. Conduct outreach and education to website developers to provide guidance on creating sites that can be more easily archived and described by web archiving practitioners.
9. Have the IIPC, or a similar large international organization, educate and influence technology companies that host content (e.g. Google/YouTube) on the importance of supporting libraries and archives in their efforts to archive that content (even if the content cannot be made immediately available to researchers).
10. Investigate Memento further, for example conduct user studies, to see if more web archiving institutions should adopt it as part of their discovery infrastructure.
11. Fund a collection development and nomination tool that can enable rapid collection development decisions, possibly building on one or more of the current tools that are targeted for open-source deployment.
12. Gather requirements across institutions and among web researchers for next generation of tools that need to be developed.
13. Develop specifications for a web archiving API that would allow web archiving tools and services to be used interchangeably.
14. Train researchers in the skills they need to analyze the big data found in web archives.
15. Provide tools to make researcher analysis of big data found in web archives easier, leveraging existing tools where possible.
16. Establish a standard for describing the curatorial decisions behind collecting web archives so that there is consistent (and machine-actionable) information for researchers.
17. Establish a feedback loop between researchers and librarians/archivists.
18. Explore how institutions can augment the Archive-It service and provide local support to researchers, possibly using a collaborative model.
19. Increase interaction with users, and develop deep collaborations with computer scientists.
20. Explore what a service might offer, and how, to run computing and software tools and infrastructure on behalf of institutions that lack their own onsite infrastructure.
21. Have service providers develop more offerings around the available tools to lower the barrier to entry, making them accessible to those lacking programming skills and/or IT support.
22. Work with service providers to help reduce any risks of reliance on them (e.g. support for APIs so that service providers could more easily be changed and content exported if needed).

Table 1: The 22 opportunities for further research and development that emerged from the environmental scan
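Several of these opportunities (notably 10 and 13) concern Memento and interoperable web archiving APIs. As a rough illustration of the kind of machine-readable discovery information the Memento protocol (RFC 7089) already provides, the sketch below parses the Link header a Memento TimeGate returns into structured entries. The sample header and archive URL are hypothetical, for illustration only, not taken from any real archive’s response.

```python
import re

# RFC 7089 TimeGate responses carry a Link header listing the original
# resource and one or more mementos. A naive split on "," would break on
# the commas inside quoted datetimes, so we match quoted attributes.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*\w+="[^"]*")*)')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_memento_links(header):
    """Return a list of dicts, one per link, each with a 'uri' key plus
    any rel/datetime attributes found in the header."""
    links = []
    for uri, attrs in LINK_RE.findall(header):
        entry = {"uri": uri}
        entry.update(ATTR_RE.findall(attrs))
        links.append(entry)
    return links

# A hypothetical TimeGate response header:
sample = (
    '<http://example.com/>; rel="original", '
    '<http://archive.example.org/web/20160101/http://example.com/>; '
    'rel="memento"; datetime="Fri, 01 Jan 2016 00:00:00 GMT"'
)

for link in parse_memento_links(sample):
    print(link.get("rel"), link["uri"])
```

A discovery layer built on headers like this is what lets aggregators route a researcher’s URI lookup to whichever archive actually holds a capture, which is the interoperability the opportunities above are asking for.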

One of the biggest takeaways is that the first theme, the need to radically increase communication and collaboration among all individuals and organizations involved in web archiving, was the most prevalent theme found by the scan: thirteen of the 22 opportunities fell under it. Clearly much more communication and collaboration is needed among those collecting web content, and also between collectors and the researchers who would like to use their collections.

This environmental scan has given us a great deal of insight into how other institutions are approaching web archiving, which will inform our own web archiving strategy at Harvard Library in the coming years. We hope that it has also highlighted key areas for research and development that need to be addressed if we are to build efficient and sustainable web archiving programs that result in complementary and rich collections that are truly useful to researchers.



Web archiving roundup: February 14, 2016

Happy Valentine’s Day! Here’s your web archiving roundup for February 14, 2016:

  • GDELT + Internet Archive’s Collaboration To Archive The World’s Online Journalism: GDELT, global news coverage, and the Internet Archive’s “No More 404” program.
  • A new, free tool that’s like x-ray glasses for political ads: The Internet Archive’s Political TV Ad Archive will house all the presidential ads expected to air in eight battleground states during the primaries. Plus, fact-checking!
  • Announcing Archive-It 5.0! What’s new in Archive-It’s version 5.0.
  • State of the WARC–Our Digital Preservation Survey Results: The takeaways from Archive-It’s June 2015 survey of local digital preservation activities involving WARC files.
  • Emulating Digital Art Works: A critique of Oya Rieger and Tim Murray’s recent white paper, Preserving and Emulating Digital Art Objects.
  • Compute Canada Support–“Web Archives for Longitudinal Knowledge”: Breaking down the silos in Canadian web archiving.
  • On the Road–Some Upcoming Lectures and Talks: Ian Milligan’s upcoming slate of lectures on digital humanities/digital history/web archiving.
  • To ZIP or not to ZIP, that is the (web archiving) question: What trade-offs are made when we compress (or don’t compress) web archive files?
  • January 2016 Federal Cloud Computing Summit: An overview.

Web archiving roundup: January 22, 2016

Here’s your web archiving roundup for January 22, 2016!

  • Guest post–Ilya Kreymer on oldweb.today: Ilya Kreymer explains how oldweb.today works.
  • The Internet is for Cats: If the most important content genre on the Internet is cat videos, how did the Internet work back when there was no video?
  • Political TV ad archive preserves lies for the ages: The Internet Archive will help you call out politicians who stretch the truth.
  • BowieNet: How David Bowie’s ISP foresaw the future of the internet.
  • The Top 10 Blog Posts of 2015 on The Signal: In case you missed them, here are the most popular posts from the Library of Congress’s digital preservation blog.
  • Rhizome Awarded $600,000 by The Andrew W. Mellon Foundation to build Webrecorder, a tool to archive the dynamic web.
  • Web Archives, Performance & Capture: Christie Peterson shares her talk from Web Archives 2015.
  • ‘From Clay to the Cloud’ examines human record: Museum exhibit urges us to consider the cultural record we create through the Internet and how that record is preserved.
  • Survey: How Do You Approach Web Archiving? Do you have fifteen minutes to tell the National Digital Stewardship Alliance about your organization’s web archiving activities?


Weekly web archiving roundup: January 10, 2016

Happy new year, Roundtablers! Here is the weekly web archiving roundup for January 10, 2016!

  • Review of WS-DL’s 2015: The Web Science and Digital Libraries Research Group revisit their accomplishments in 2015.
  • CNI Fall 2015 Membership Meeting Trip Report: An overview of the Coalition for Networked Information’s 2015 Fall Meeting.
  • Memento–Help Us Route URI Lookups to the Right Archives: An IIPC-funded Archive Profiling project attempts to create a high-level summary of the holdings of each web archive.
  • IIPC Co-Chair Cathy Hartman Retires: The IIPC bids a fond farewell to Cathy Hartman.
  • Aggregating Web Archives: Even small Web archives can make a contribution.
  • Why Not Store It All? Website bloat and the dangers of digital storage.

Weekly web archiving roundup: December 20, 2015

Here’s your weekly web archiving roundup for December 20, 2015!

  • Web Archiving–An Overview: The Metropolitan New York Library Council announces the first in a series of webinars on web archiving.
  • These Old-School Internet Browsers Are Like Real-Life Time Machines: A new tool lets you experience the glory—and embarrassment—of the internet of yore.
  • Browsing the ancient Web with an ancient browser: Nicholas Taylor shares some findings after browsing oldweb.today.
  • Questions of ethics at Web Archives 2015: Despite diverse perspectives on web archiving, ethics seemed to be a persistent subject.

Weekly web archiving roundup: December 13, 2015

Here’s the weekly web archiving roundup for December 13, 2015:

  • The Internet Archive is hosting a telethon! An actual Telethon, hosted and run by Internet Archive employees, in front of a live audience!
  • From Dataverse to Gephi–Network Analysis on our Data, A Step-by-Step Walkthrough: Releasing data is only useful if we show people how they can use it.
  • Acquiring at Digital Scale–Harvesting the StoryCorps.me Collection: Meeting the challenge of acquiring tens of thousands of interviews at a time thanks to the ability to harvest them via the web.
  • The Internet Is for Humans, Not Robots: A new study finds people outnumber bots online for the first time in four years. But a closer inspection of the data reveals a more complicated picture of what’s happening on the web.
  • Evaluating the Temporal Coherence of Composite Mementos: Only one in five archived web pages existed as presented.

Weekly web archiving roundup: December 5, 2015

Here is the weekly web archiving roundup for December 5, 2015:

  • Data Storage on DNA Can Keep It Safe for Centuries: Recent advances suggest there may be a new way to store the exploding amount of computer data–and for centuries, rather than decades.
  • Building an archive on the Moon (and doing science, too): In theory, an extraterrestrial data archive will pay for some unique science.
  • Recreate the old-school internet with this web browser emulator: Oldweb.today not only shows ancient websites, but lets you visit them with ancient browsers.
  • Why It’s So Important To Understand What’s In Our Web Archives: It is simply impossible to archive the “entire internet” and perfectly preserve every change to every page in existence.
  • IHR workshop on web archiving: An Introduction to Web Archiving for Historians.
  • People, communities and platforms–Digital cultural heritage and the web: Trevor Owens’s opening keynote for the National Digital Forum in New Zealand.