Coffee Chat Recap: U.S. Federal Government Web Archiving

By Susan Paterson, SAA Web Archiving Section Vice-Chair and liaison librarian at Koerner Library at the University of British Columbia Vancouver.

The Web Archiving Section hosted its first coffee chat of the year on March 29, 2022. Melissa Wertheimer, Chair of the Web Archiving Section, led a panel discussion on U.S. Federal Government web archiving activities. We were joined by web archiving experts from the National Library of Medicine, the Government Publishing Office, the Smithsonian, and the Library of Congress. Topics included content curation, collaboration among agencies, staffing, workflow models, successes and challenges, and the impact of COVID-19 on their collecting activities.

The coffee chat recording is available to viewers until summer 2022.

National Library of Medicine – History of Medicine Division, Digital Manuscripts Program

The session began with an informative presentation by Christie Moffatt, Manager of the Digital Manuscripts Program, History of Medicine Division at the National Library of Medicine (NLM). Christie described NLM’s approach to web collecting, an activity which began in 2009 with their Health and Medical Blog Collection. Christie provided an overview of the NLM web archive collections, and the collection development policies that guide their web collecting.

Christie noted that thematic and events-based collections have grown and are a key component of their collection building. The Global Health Events Web Archive collection, one of their largest, began in 2015 with the Ebola outbreak. When the World Health Organization declared a global health emergency in January 2020, NLM started work on a COVID-19 pandemic collection. Interestingly, this specific designation of a global health emergency is written into their collection development policy, and NLM takes responsibility for building a web archive once the designation is made. An aim of the COVID-19 collection is to ensure that a diversity of perspectives on both the impact of and the response to the pandemic is archived. Tools and channels used for outreach during the pandemic (e.g., TikTok, Twitter) as well as personal narratives of the pandemic make up part of the collection, which is the Library’s largest, with 3.5 FTEs working on the project. The National Library of Medicine’s Archive-It website can be explored here.

Government Publishing Office

Dory Bower, Archives Specialist at the Government Publishing Office (GPO), provided an overview of GPO’s web archiving activities and explained how Title 44 of the U.S. Code provides the mandate for public printing and documents; Chapter 19, specifically, covers the Depository Library Program. Federal agencies should distribute and publish their documents through the GPO, and the Superintendent of Documents should be notified if a federal agency is not doing so. Unfortunately, with born-digital publications and publishing directly onto websites, this is not happening and material is being missed. GPO joined the Archive-It community in 2012 and has since used web archiving to help fill this gap.

The web archive is part of the overall GPO Digital Collection Development Plan. For material to be part of the GPO collection, it must be federally funded and within the scope of the Federal Depository Library Program (FDLP). GPO is also focusing on collections of interest geared towards women, tribal libraries, and Native American communities, to name a few. GPO maintains over 213 collections in Archive-It, making up 38.3 TB of data and consisting of over 392 million URLs. You can explore the Federal Depository Library Program Web Archive here.

Smithsonian Institution – Libraries and Archives

Lynda Schmitz Fuhrig, Digital Archivist at the Smithsonian Institution Libraries and Archives, presented next. Lynda provided a fascinating overview of the web archiving work that’s being done at the Smithsonian, which celebrated its 175th anniversary in 2021.

The Libraries and Archives of the Smithsonian is the official repository and record keeper for the Smithsonian. Its responsibilities include sharing, collecting, and preserving Smithsonian activities and output, including documenting the Institution’s unique web presence, which launched in 1995. The Smithsonian now has nearly 400 public websites and blogs and a very active social media presence covering Twitter, Instagram, Flickr, and YouTube, just to name a few.

Like the National Library of Medicine, the Smithsonian has documented the impacts and effects of the COVID-19 pandemic on America. For example, beginning in March 2020, more focused and frequent crawls were necessary to document the altered scheduling of the closing and reopening of the Museums and the Zoo. Additionally, the closure of museums created a need for an increased digital presence, and the Smithsonian launched several new websites and initiatives, including Care Package, Vaccines & US and Our Shared Future.

Audience members were particularly interested in their web and social media workflows and tools. Along with Conifer and Browsertrix, the Smithsonian uses Netlytic, which was developed by the Social Media Lab at Ryerson University, to collect Smithsonian hashtags and accounts.

They are one of the few organizations that download their WARCs from Archive-It. They developed an in-house tool called WARC Retriever, which they hope to release on GitHub later this year. Lynda’s summation was poignant: “The Smithsonian web collections will continue to tell the history and stories of the Smithsonian.” You can explore the Smithsonian Archive-It page here.

Library of Congress – Web Archiving Program

To round out the panel, Meghan Lyon and Lauren Baker, Digital Collection Specialists from the Library of Congress (LOC) Web Archiving Program, provided an overview of the activities at LOC. The Web Archiving Program began in 2000 and is part of the Library’s Digital Content Management Section.

The LOC web archives consist of 3 PB of data organized into 174 collections, 75 of which are active. Like many of the other speakers’ teams, the Web Archiving Team collaborates frequently on web archive collections, relying on the contributions of collaborators around the Library. The Collection Development Office helps guide collection initiatives, and various committees review subject proposals, select content to archive, and determine collection focus. LOC comprehensively collects content from Legislative Branch agencies and U.S. House and Senate offices and committees. They also collect content about U.S. national elections and manage other events-based and thematic collections.

Meghan and Lauren addressed the issue of permissions in web archiving. Their permissions policy, based on copyright law, is determined by the Office of the General Counsel. Permission requests must be sent to site owners for anything selected for web archiving. There are two types of permission requests: permission to crawl and permission to display, determined by the country of publication and the category of the entity. You can explore the LOC Web Archiving Program website here.

The panel closed out the session by discussing how they became interested in web archiving and how their careers in the field started. Their initial experiences ranged from practicums to learning on the job. The remainder of the conversation covered trends and the future of web archiving tools, including what improvements people hope for and imagining better tools for harvesting and user awareness. The session was well attended, with 181 registrants and over 80 attendees. Thank you to everyone who presented and attended for such an engaging hour. Stay tuned for our next coffee chat, which will be in May!

Introducing the DocNow App

This week’s post was written by Zakiya Collier, Community Manager at Documenting the Now.

This week the Documenting the Now project announces the release of DocNow, an application for appraising, collecting, and gathering consent for Twitter content. DocNow reimagines the relationship between content creators and social media analysts (archivists and researchers) by addressing two of the most challenging issues of social media archiving practice—the issues of consent and appraisal.

The Documenting the Now Project is happy to release version 1.0 of our open-source tool freely for anyone to use. Read all about the app, what it does, and how to engage with the DocNow project team for support and feedback.

Over the last seven years, Documenting the Now has helped to foster an environment where a broader range of cultural memory workers can learn about web archiving tools and practices and become involved with web archiving networks. This has largely been achieved by practicing transparency and inviting people who have traditionally been left out of established web content archiving networks to participate in the project, namely students, activists, and archivists who represent marginalized communities and who work in community-centered organizations, HBCUs, public libraries, community-based archives, and tribal libraries and archives.

Documenting the Now was a response to the need among scholars, activists, archivists, and other memory workers for new tools that would provide easily-accessible and user-friendly means to collect, visualize, analyze, and preserve web and social media content to better document public events. In addition, it aimed to respond to questions and concerns related to ethics, safety, intellectual property, and access issues for the collection, preservation, and dissemination of Twitter data in particular.

Documenting the Now has also developed community-centered web and social media archiving tools that both prioritize care for content creators and robust functionality for users:

  • Twarc – a command line tool and Python library for collecting tweet data from Twitter’s official API
  • Hydrator – a desktop application for turning Tweet ID datasets back into tweet data to use in your research
  • Social Humans – a label system to specify the terms of consent for social media content
  • The Catalog – a community-sourced clearinghouse to access and share tweet identifier datasets

In continuing to support and develop tools that embody ethical practices for social media archiving, the DocNow app joins this suite of tools. DocNow is an application for appraising, collecting, and gathering consent for Twitter content and includes several new features including:

  • Trends tab to view trending topics across the globe in real time
  • Explore tab to view content by users, media, URLs, and related hashtags all on one screen
  • Live testing and refining of collecting parameters on recent tweets
  • Tweets per hour calculator to easily identify Twitter bot accounts
  • Search and Collect tweets back in time via the Search API and forward with the Stream API
  • Activate toggle to start collecting tweets and send a notification tweet to encourage transparency and communication in Twitter data collection
  • Collections tab to share information about your collection with the public
  • “Find Me” and Insights Overview features to specify and gather consent using Social Humans labels
  • Download Tweet ID archive for sharing following Twitter’s terms of service

The DocNow app also works in concert with other Documenting the Now tools, creating a four-step social media archiving journey for users:

Step 1: Collect content with the DocNow App by activating a search. Set collection limits and explore insights as your collection grows.
Step 2: Download your archive from the DocNow App, which includes a Tweet Viewer, Tweet IDs, and media files.
Step 3: Hydrate your Tweet IDs from the archive’s tweets.csv file back into full tweets using DocNow’s Hydrator desktop application.
Step 4: Describe your collection and share your Tweet IDs with other researchers by adding them to the DocNow Catalog.
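To make the hand-off between Steps 2 and 3 concrete, here is a minimal sketch of pulling Tweet IDs out of the archive’s tweets.csv before hydration. The `id` column name and the sample rows are assumptions for illustration, not the app’s documented schema:

```python
import csv
import io

# Hypothetical excerpt of a tweets.csv from a DocNow archive download.
# The "id" column name is an assumption, not DocNow's documented layout.
sample = io.StringIO(
    "id,created_at\n"
    "1500000000000000001,2022-03-01\n"
    "1500000000000000002,2022-03-02\n"
)

def extract_tweet_ids(csv_file):
    """Read the Tweet ID column from a CSV so the IDs can be hydrated."""
    return [row["id"] for row in csv.DictReader(csv_file)]

ids = extract_tweet_ids(sample)

# One ID per line is the usual shape for a shareable Tweet ID list.
with open("tweet_ids.txt", "w") as out:
    out.write("\n".join(ids) + "\n")

print(ids)
```

Sharing newline-delimited Tweet IDs rather than full tweet data is what keeps a collection shareable under Twitter’s terms of service, which is the basis for Step 4 and the Catalog.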

Ways to Use DocNow
There are three ways to use DocNow: joining the community instance, running DocNow locally on a computer, or installing an instance of DocNow in the cloud. The community instance is a great way to get familiar with the tool before committing to running an instance, but those with development skills may want to administer their own instance of DocNow locally or in the cloud.

  1. Join Documenting the Now’s hosted community instance
  2. Run DocNow locally on your machine
  3. Install your own instance in the cloud

For help with installation and getting started, the Documenting the Now team will host community conversations. Dates will be announced soon! More information about the DocNow App can be found here.

Documenting the Now is seeking community input on all of our features as we continue to develop DocNow. Please join our Slack channel by visiting our website or email us at info@docnow.io.