Web Archiving Roundup: December 11, 2017

Happy December, Roundtablers! Here’s your Web Archiving Roundup for December 11, 2017:

  • Sustaining the Software that Preserves Access to Web Archives: on Digital Preservation Day, Andrew Jackson took a look at open source tools that enable access to web archives, and asked us to think about what comes next.
  • Speaking of moving forward — how do you move a web archive? On their blog, the National Archives details what went into moving 120 terabytes of data on seventy drives from Internet Memory Research’s data centre in Paris to the Archives’ site in Kew and, finally, to the cloud. (Archived link.)
  • For the Digital Preservation Coalition, David S. H. Rosenthal writes about how we might be Losing the Battle to Archive the Web.
  • And, at the Atlantic, Alexis C. Madrigal writes that Future Historians Probably Won’t Understand Our Internet, and That’s Okay. He notes that, today, ‘there is more data about more people than ever before; however, the cultural institutions dedicated to preserving the memory of what it was to be alive in our time, including our hours on the internet, may actually be capturing less usable information than in previous eras.’ Still, as Nick Seaver says, ‘Is it terrible that not everything that happens right now will be remembered forever? Yeah, that’s crappy, but it’s historically quite the norm.’ (Archived link.)
  • Web Archiving Histories and Futures: the International Internet Preservation Consortium has announced its Call for Papers for its annual conference, to be held at the National Library of New Zealand in Wellington from November 13-15, 2018. Abstracts should be 300 to 500 words in length, and may touch upon topics related to: building web archives, maintaining web archive content and operations, using and researching web archives, web archive histories and futures, and more. Proposals are due February 28, 2018. 

Web Archiving Roundup: November 27, 2017

Here’s your post-Thanksgiving Web Archiving Roundup for November 27, 2017:

Web Archiving Roundup — Gothamist edition: November 13, 2017

Gothamist shutdown:
On Thursday, November 2, it was announced that the online-only, city-centric news outlets Gothamist and DNAinfo had been abruptly shuttered — archives and all — by owner Joe Ricketts in response to the newsroom’s vote to unionize. Both outlets, Gothamist (along with its sister sites LAist, DCist, Chicagoist, and SFist) and DNAinfo, were updated numerous times each day, with a focus on local news, events, food, and culture.

This special edition of the Web Archiving Roundup takes a look at what others are saying about Gothamist and DNAinfo — and online news — in the wake of their sudden shutdown.

  • Archive, archive, archive: NiemanLab links to several external efforts to archive both Gothamist and DNAinfo, and reminds us of the risks of ‘billionaire-funded media.’ (Archived link.)
  • What We Lose in the Disappearing Digital Archive: on Splinter, David Uberti writes: ‘It’s likely that additional existing [online] publications will close in the face of economic upheaval, leaving their sites vulnerable to technical failure without consistent upkeep.’ Uberti also speaks with Abbie Grotke, web archiving team lead at the Library of Congress, who discusses the difficulties of capturing online news. (Archived link.)
  • When your server crashes, you could lose decades of digital news content — forever: in 2014, the Columbia Missourian suffered a server crash and ‘in less than a second, the newspaper’s digital archive of fifteen years of stories and seven years of photojournalism were gone forever.’ What’s worse, as Edward McCain writes, is that ‘very little is known about the policies and practices of news organizations when it comes to born-digital content.’ (Archived link.)
  • If a Pulitzer-finalist 34-part series of investigative journalism can vanish from the web, anything can: written in 2015, ‘Raiders of the Lost Web’ argues that ‘the web, as it appears at any one moment, is a phantasmagoria. It’s not a place in any reliable sense of the word. It is not a repository. It is not a library. It is a constantly changing patchwork of perpetual nowness. You can’t count on the web, okay? It’s unstable. You have to know this.’ (Archived link.)

Tools and additional links:

Conference alert: on November 15 and 16, follow along with Dodging the Memory Hole, a conference dedicated to the issue of preserving born-digital news content.

Web Archiving Roundup: November 6, 2017

Here are a few quick links on recent web archiving topics:

  • Remembering October 1. Multiple Las Vegas institutions are joining forces to document last month’s horrific mass shooting, its aftermath, and the community’s response using a multi-tech approach to web archiving. The project is actively accepting contributions from the general public. Live link
  • History of Syria’s war at risk as YouTube reins in content. Excerpt: “Syrian activists fear all that history could be erased as YouTube moves to rein in violent content. In the past few months, the online video giant has implemented new policies to remove material considered graphic or supporting terrorism, and hundreds of thousands of videos from the conflict suddenly disappeared without notice. Activists say crucial evidence of human rights violations risks being lost — as well as an outlet to the world that is crucial for them.” Live link
  • Archiving the Belgian web. The Royal Library of Belgium launched Preserving Online Multiple Information: towards a Belgian strategy (PROMISE) on 1 June 2017, and aims to develop a federal strategy for the preservation of the Belgian web. Live link
  • Visualizing the changing web. With support from the National Endowment for the Humanities and the Institute of Museum and Library Services, the Web Science and Digital Libraries Research Group at Old Dominion University aims to visualize webpage changes over time. Live link
  • Web archiving labor. Jessica Ogden explores digital labor in relation to web archiving in “Web Archiving as Maintenance and Repair.” Live link
  • Evaluating a web archiving program. The Dutch National Library asks, “How can we improve our web collecting?” Live link
  • Open call. Rhizome announces its open call for participation in its National Forum on Ethics and Archiving the Web. Proposals are due November 14, 2017. Live link

 

Web Archiving Roundup: October 2017

After a brief hiatus, the Web Archiving Roundup is back this month. Here are a few quick links on recent web archiving topics:

  • How the Victoria and Albert Museum collected WeChat: “How do you collect an app? What is the thing you’re actually collecting? And what for?”
  • Ashley Blewer asks, “How do web archiving frameworks work?” As she puts it, “If you wish to explain how web archiving works from a technical standpoint, you must first understand the ecosystem.” (A minimal capture-to-WARC sketch follows this list.)
  • Collecting social media a bite at a time at the National Library of New Zealand: It “worries us that some of our documentary heritage may be lost if we don’t start collecting content” from social media.
  • Can machine-learning models successfully identify content-rich PDF and Word documents from web archives? With support from the Institute of Museum and Library Services, the University of North Texas aims to find out.
  • Rhizome to Host National Forum on Ethics and Archiving the Web: March 22-24, 2018, in conjunction with Documenting the Now and the New Museum in New York City.
  • Is your organization involved in web archiving, or in the process of planning a web archive? If yes, and your organization is based in the United States, you have until November 17 to take this year’s NDSA Web Archiving Survey!
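To ground Blewer’s ecosystem point, here is a minimal sketch of the step most web archiving frameworks build on: fetching a page over HTTP and recording the exchange in a WARC file. It uses the open source warcio library’s capture_http helper; the output filename and target URL are placeholders, and this is an illustrative sketch rather than any particular framework’s pipeline.

```python
# Minimal sketch: record a live HTTP exchange into a WARC file with warcio.
# WARC is the storage format that crawlers and replay tools share.
from warcio.capture_http import capture_http
import requests  # per warcio's documentation, import requests after capture_http

# Every request made inside this block is written to the gzipped WARC file
# as paired request/response records.
with capture_http("example-capture.warc.gz"):
    requests.get("https://example.com/")
```

A replay tool such as pywb can then index and serve the resulting file, which is roughly where the rest of the ecosystem (crawl scheduling, deduplication, access) picks up.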

Web Archiving Roundup: February 2017

A few quick links on web archiving topics

Webinar: An Introduction to Web Archive APIs

Jefferson Bailey, Director of Web Archiving Programs at the Internet Archive, will be presenting the first webinar of 2017 for the Web Archiving Section of the Society of American Archivists.

Description: This webinar will provide a basic introduction to the many existing, and emergent, APIs specific to web archives and web archiving. Topics covered will include an overview of the role of APIs in the web archiving lifecycle, examples of APIs that exist for querying public web archives, and examples of collection and content specific APIs for use by curators and researchers. The webinar will demonstrate some basic examples for querying APIs and associated tools. Lastly, the webinar will present the work of the IMLS-funded WASAPI project (Web Archiving Systems APIs) which is developing APIs for the exchange of preservation web data and exploring models for API-based systems interoperability in web archiving.

Day: March 8, 2017

Time: 1 pm EST / 12 pm CST / 10 am PST

Where: Online via WebEx

If you are interested in attending the webinar, we ask that you RSVP via this online form so that we can plan accordingly. We will send registered attendees a link to access the webinar in advance of March 8, 2017.
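For a taste of the query APIs the webinar will cover, the sketch below asks the Internet Archive’s public CDX server for recent captures of a site. It assumes only the requests library and the publicly documented web.archive.org/cdx/search/cdx endpoint; the example URL, limit, and filter are illustrative placeholders rather than a recommended workflow.

```python
# Minimal sketch: list captures of a URL via the Internet Archive's public
# CDX server, one of the web archive query APIs the webinar surveys.
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

params = {
    "url": "archives.gov",       # URL (or prefix) to look up
    "output": "json",            # return rows as JSON arrays
    "limit": 5,                  # keep the example small
    "filter": "statuscode:200",  # only successful captures
}

response = requests.get(CDX_ENDPOINT, params=params, timeout=30)
response.raise_for_status()

rows = response.json()
header, captures = rows[0], rows[1:]  # the first row lists the field names

for capture in captures:
    record = dict(zip(header, capture))
    # Each capture can be replayed at /web/<timestamp>/<original URL>.
    print(record["timestamp"], record["original"], record["mimetype"])
```

The same pattern extends to the other CDX fields (status code, digest, length) returned in the header row.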

2016 Web Archiving RT Meeting Agenda!

Web Archiving Roundtable Meeting
Wednesday, August 3
4-5:30 PM, Salon D

Agenda:
4:00-4:15 Welcome and General Business Meeting (Kate Stratton and John Bence)
4:20-4:35 NDSA Survey update (Nicholas Taylor)
4:40-4:55 Internet Archive WASAPI project update (Jefferson Bailey)
5:00-5:30 OCLC Research Web Archiving and Metadata Working Group update and discussion (Jackie Dooley)

We’re looking forward to seeing you there!

Guest Post on Web Archiving: Andrea Goethals

Variations of this post have also been published on the Harvard Library website, the Library of Congress’ Signal blog, and the IIPC’s blog.

In the last couple years, managing born-digital material, including content that originated on the Web, has been one of Harvard Library’s strategic priorities. Although the Library had been engaged in the collection, preservation and delivery of web content for several years, a strategy was needed to make this activity more scalable and sustainable at the university. The Library formed a Web Archive Working Group to gather information and make recommendations for a web archiving strategy for Harvard Library. One of the information-gathering activities the Working Group engaged in over the last year was an environmental scan of the current practices, issues and trends in web archiving nationally and internationally. Two members of the Working Group, Andrea Goethals and Abigail Bordeaux, worked closely with a consultant, Gail Truman of Truman Technologies, to conduct the five-month study and write the report. The study began in August 2015 and was made possible by the generous support of the Arcadia Fund. The final report is now available from Harvard’s open access repository, DASH.

The study included a series of interviews with web archiving practitioners from archives, museums and libraries worldwide; web archiving service providers; and researchers who use web archives. The interviewees were selected from the membership of several organizations, including the International Internet Preservation Consortium (IIPC), the Web Archiving Roundtable at the Society of American Archivists (SAA), the Internet Archive’s Archive-It Partner Community, the Ivy Plus institutions, Working with Internet archives for REsearch (Rutgers/WIRE Group), and the Research infrastructure for the Study of Archived Web materials (RESAW).

The interviews with web archiving practitioners covered a wide range of areas, everything from how each institution maintains its web archiving infrastructure (e.g. outsourcing, staffing, location in the organization) to how it is (or isn’t) integrating its web archives with its other collections. From this data, profiles were created for 23 institutions, and the data was aggregated and analyzed to look for common themes, challenges and opportunities.

In the end, the environmental scan revealed 22 opportunities for future research and development. These opportunities are listed in Table 1 and described in more detail in the report. At a high level, these opportunities fall under four themes: (1) increase communication and collaboration, (2) focus on “smart” technical development, (3) focus on training and skills development, and (4) build local capacity.

 

22 Opportunities to Address Common Challenges

(the order has no significance)

1. Dedicate full-time staff to work in web archiving so that institutions can stay abreast of the latest developments and best practices and fully engage in the web archiving community.
2. Conduct outreach, training and professional development for existing staff, particularly those working with more traditional collections, such as print, who are being asked to collect web archives.
3. Increase communication and collaboration across types of collectors since they might collect in different areas or for different reasons.
4. A funded collaboration program (bursary award, for example) to support researcher use of web archives by gathering feedback on requirements and impediments to the use of web archives.
5. Leverage the membership overlap between RESAW and European IIPC membership to facilitate formal researcher/librarian/archivist collaboration projects.
6. Institutional web archiving programs become transparent about holdings, indicating what material each has, terms of use, preservation commitment, plus curatorial decisions made for each capture.
7. Develop a collection development tool (e.g. registry or directory) to expose holdings information to researchers and other collecting institutions even if the content is viewable only in on-site reading rooms.
8. Conduct outreach and education to website developers to provide guidance on creating sites that can be more easily archived and described by web archiving practitioners.
9. IIPC, or similar large international organization, attempts to educate and influence tech company content hosting sites (e.g. Google/YouTube) on the importance of supporting libraries and archives in their efforts to archive their content (even if the content cannot be made immediately available to researchers).
10. Investigate Memento further, for example conduct user studies, to see if more web archiving institutions should adopt it as part of their discovery infrastructure.
11. Fund a collection development, nomination tool that can enable rapid collection development decisions, possibly building on one or more of the current tools that are targeted for open source deployment.
12. Gather requirements across institutions and among web researchers for next generation of tools that need to be developed.
13. Develop specifications for a web archiving API that would allow web archiving tools and services to be used interchangeably.
14. Train researchers with the skills they need to be able to analyze big data found in web archives.
15. Provide tools to make researcher analysis of big data found in web archives easier, leveraging existing tools where possible.
16. Establish a standard for describing the curatorial decisions behind collecting web archives so that there is consistent (and machine-actionable) information for researchers.
17. Establish a feedback loop between researchers and the librarians/archivists.
18. Explore how institutions can augment the Archive-It service and provide local support to researchers, possibly using a collaborative model.
19. Increase interaction with users, and develop deep collaborations with computer scientists.
20. Explore what, and how, a service might support running computing and software tools and infrastructure for institutions that lack their own onsite infrastructure to do so.
21. Service providers develop more offerings around the available tools to lower the barrier to entry and make them accessible to those lacking programming skills and/or IT support.
22. Work with service providers to help reduce any risks of reliance on them (e.g. support for APIs so that service providers could more easily be changed and content exported if needed).

Table 1: The 22 opportunities for further research and development that emerged from the environmental scan
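Opportunities 10 and 13 point toward shared protocols for discovery and interoperability. As a small illustration of what Memento adoption looks like in practice, the hedged sketch below asks the Internet Archive’s Wayback Machine, which already supports Memento, for the capture of a page closest to a given date via an Accept-Datetime request header; the target URL and date are placeholders.

```python
# Minimal sketch: Memento datetime negotiation (RFC 7089) against a TimeGate.
# The Wayback Machine exposes a TimeGate at https://web.archive.org/web/<URL>.
import requests

target = "http://www.example.com/"
timegate = "https://web.archive.org/web/" + target

response = requests.get(
    timegate,
    headers={"Accept-Datetime": "Tue, 01 Aug 2017 00:00:00 GMT"},
    timeout=30,
    allow_redirects=True,  # the TimeGate redirects to the closest capture
)

# The archived copy (the "memento") reports when it was captured...
print("Memento-Datetime:", response.headers.get("Memento-Datetime"))
# ...and its Link header points to the original resource, the first and last
# mementos, and a TimeMap listing all known captures.
print("Link:", response.headers.get("Link"))
```

Because the protocol is just HTTP headers, the same request works against any Memento-compliant archive, which is what makes it a candidate for the kind of shared discovery infrastructure the scan calls for.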

One of the biggest takeaways is that the first theme, the need to radically increase communication and collaboration among all individuals and organizations involved in web archiving, was the most prevalent one found by the scan: thirteen of the 22 opportunities fell under it. Clearly, much more communication and collaboration is needed not only among those collecting web content, but also between those who are collecting it and the researchers who would like to use it.

This environmental scan has given us a great deal of insight into how other institutions are approaching web archiving, which will inform our own web archiving strategy at Harvard Library in the coming years. We hope that it has also highlighted key areas for research and development that need to be addressed if we are to build efficient and sustainable web archiving programs that result in complementary and rich collections that are truly useful to researchers.