Coffee Chat Recap with Dr. Ian Milligan – “Unleashing” Web Archives for Research

This post was written by Susan Paterson, Web Archiving Steering Committee Chair

On December 6, 2022, the Web Archiving Section hosted Dr. Ian Milligan, professor of history and Associate Vice-President, Research Oversight and Analysis at the University of Waterloo, for its first Coffee Chat of the term. Dr. Milligan discussed the research use of web archives, particularly the work being done by the Archives Unleashed Project (AU). His talk was attended by over 140 registrants.

Dr. Milligan addressed why web archives matter, key web archiving terminology, the broader research landscape of web archives, web archives as research objects and how they are indispensable for studying contemporary culture. 

The Internet Archive (IA), along with national libraries and archives around the world, is archiving vast amounts of data: together, they have already archived approximately 200 PB of unique web data, including more than 682 billion URLs archived by the IA alone. Milligan explained that many of the URLs collected by the IA would otherwise have disappeared because the content was not collected by traditional libraries, archives, or government agencies. He argued that it would be virtually impossible to do a thorough study of the 1990s without web archives, and that they are indispensable for today’s political, social, economic, and cultural historians.

Milligan explained that the problem facing researchers at this point isn’t collecting or locating the material; it’s how to use the material effectively in their research. Researchers often start their search for material in the IA’s Wayback Machine but, as Milligan noted, the IA hasn’t changed to support the research techniques of contemporary researchers, including exploratory text mining, detailed queries, computer vision, or analyzing videos and other media at scale.

Instead, researchers are working with the WARC file format (ISO 28500:2009), which is important because researchers and developers can create tools and infrastructure around the WARC data structure. The advantage of WARC files is that they bring together all of the different file types (HTML, JPG, PDF, legacy file formats) that often make up a website. WARC files allow websites to be replayed as if they were still live, and researchers can analyze the metadata associated with both the WARC file and each individual component within it.
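To make the WARC structure a little more concrete, here is a minimal sketch in Python that builds and parses one hand-written record in memory. The sample record, the example URL, and the helper names `build_sample_record` and `parse_warc_record` are illustrative assumptions, not part of any tool mentioned in the talk. The parser handles only a single uncompressed record; real research workflows would normally rely on a dedicated WARC library.

```python
# Minimal sketch: one WARC record is a version line, named header fields,
# a blank line, and then a payload of exactly Content-Length bytes.
# All names and sample values here are hypothetical.


def build_sample_record() -> bytes:
    """Return a small, hand-written WARC 'response' record."""
    payload = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html>Hello</html>"
    )
    header = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"WARC-Date: 2022-12-06T00:00:00Z\r\n"
        b"Content-Type: application/http; msgtype=response\r\n"
        b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    # A record ends with two CRLF pairs after the payload.
    return header + payload + b"\r\n\r\n"


def parse_warc_record(raw: bytes):
    """Split one uncompressed record into (version, headers, payload)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


if __name__ == "__main__":
    version, headers, payload = parse_warc_record(build_sample_record())
    print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
    print(len(payload), "payload bytes")
```

The record-level headers (type, target URI, capture date) are the kind of metadata a researcher can analyze without ever rendering the page, while the payload carries the captured HTTP response itself.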

However, the WARC file format can be difficult to use, so AU’s goal is to make WARC files more readily usable for researchers. The scale is massive: researchers routinely work with hundreds of TB of data and a variety of specialized software. Although the scale of the WARCs can be challenging, it’s also one of the things that make them so exciting! The sheer scope and scale of web archives make them a treasure trove for researchers.

New skill sets, along with new means of delivering and understanding data, are essential for historians of all stripes but are sadly lacking from most academic history programs. Milligan reasoned that researchers need to develop new skill sets to deal with digital data, but that web archives also need to become more usable. It’s not reasonable to expect all historians to become sophisticated computational experts. Milligan’s solution: work with derivative data sets rather than the raw WARC files.

That’s where the Archives Research Compute Hub (ARCH) comes in. ARCH’s goal is to lower the barriers for researchers using web archives through the use of sharable programming notebooks, which will soon be available on Google Colab. ARCH is also integrated into Archive-It, a subscription-based web archiving service from the Internet Archive. Milligan demonstrated ARCH and explained how techniques like network analysis can be done without any coding. The researcher is able to generate and view derivative datasets, generate files, and plug them into Gephi to create network diagrams.
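For readers curious what a network derivative might look like under the hood, here is a minimal sketch in Python. It assumes a hypothetical input of (source URL, target URL) link pairs and aggregates them into a weighted domain-to-domain edge list written as CSV with `Source`, `Target`, and `Weight` columns, a layout commonly used for edge tables in Gephi. The function names and sample links are illustrative assumptions; this is not ARCH’s actual code or output format.

```python
# Minimal sketch: turn page-level links into a weighted domain edge list.
# Input data, function names, and column layout are assumptions.
import csv
import io
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_edges(link_pairs):
    """Count links between distinct domains; links within a domain are skipped."""
    counts = Counter()
    for source_url, target_url in link_pairs:
        src, dst = domain_of(source_url), domain_of(target_url)
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts


def to_edge_csv(counts) -> str:
    """Serialize the edge counts as CSV text, sorted for stable output."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (src, dst), weight in sorted(counts.items()):
        writer.writerow([src, dst, weight])
    return buffer.getvalue()


# Hypothetical sample of links extracted from archived pages.
LINKS = [
    ("http://www.example.org/a", "http://example.com/x"),
    ("http://example.org/b", "http://example.com/y"),
    ("http://example.org/b", "http://example.org/c"),  # same domain, skipped
    ("http://example.com/x", "http://example.net/"),
]

if __name__ == "__main__":
    print(to_edge_csv(domain_edges(LINKS)))
```

Working at the domain level rather than the page level keeps the graph small enough to explore visually, which is the point of using derivatives instead of raw WARC files.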

One of AU’s goals is to encourage researchers to work with web archives. The project aims to build a sustainable web archiving research ecosystem, which includes working with web archives, publishing in the field, and inspiring others to do the same.

In addition to hosting ARCH, AU hosts training datathons across Western Europe and North America. Cohorts receive mentoring as well as funds to complete their projects. Project topics include the COVID-19 pandemic, feminist media tactics, health misinformation, online commenting systems on news websites, Latin American women’s rights movements, querying queer web archives, and the use of cultural practices in postconflict societies during reconciliation processes. Milligan underscored that the goal of this work and the projects is to build community, support a virtuous cycle, and inspire disciplinary colleagues to work together on web archives.

Milligan concluded his talk by reiterating the three key points that motivate AU: historians in the future will need to understand the web; they’re not yet ready to do so; let’s make sure they can. Archives Unleashed is funded by the Andrew W. Mellon Foundation, the University of Waterloo, York University, the Internet Archive, and StartSmart Labs.

Milligan just published an open access book, “Transformation of Historical Research in the Digital Age”, which is intended to be a classroom resource and addresses some of the needs and basic competencies required to be engaged and active historians.

Milligan also addressed the need for historians to think more about method and about algorithmic bias: why some information ends up on the web or in a database and some doesn’t, why some materials have been microfilmed and some haven’t, how search engines prioritize results, and so on. He also encouraged interdisciplinary collaboration.

For folks wanting to delve further into web archives, Milligan recommended the Journal of Internet Histories, The Web as History (2017) by Niels Brugger and Ralph Schroeder, and the Sage Handbook of Web History (2018) by Niels Brugger and Ian Milligan.

Archiving the Web as Public Service

This week’s post was written by Daniel Gomes, Head of Arquivo.pt.

Arquivo.pt: a Searchable Web Archive

Arquivo.pt is a public and free service that enables anyone to search and access historical information preserved from the Web since the 1990s. Arquivo.pt contains billions of files collected from websites in several languages (about half of its users come from outside of Portugal).

Periodically, the Arquivo.pt system automatically collects and stores information published on the web. The Arquivo.pt hardware infrastructure is hosted at its own datacenter, and it is managed by full-time dedicated staff. 

The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search, and application programming interfaces (APIs) that facilitate the development of added-value applications by third parties.

Arquivo.pt is supported by the Ministry of Science and Higher Education of Portugal. 

Showing off the Value of Web Archives

Web archives preserve web documents for future access, but they must also demonstrate their value in everyday life situations.

Thematic exhibitions and collaborative collections have been developed to illustrate the utility of web archives as a source of historical documentation. A list of all the collections preserved by Arquivo.pt is publicly available. The data sets generated to create these exhibitions or derived from the operation of the service are openly available.

Arquivo.pt has been launching complementary services to engage individuals and organizations in web archiving.

SavePageNow: Archive a Web Page Immediately

Web pages change rapidly, and sometimes web archives cannot find them in time to preserve them. Arquivo.pt provides a public form where users can suggest websites to be preserved.

Arquivo.pt also launched SavePageNow, which enables users to immediately archive a set of web pages in high quality. The user enters a web page and starts browsing, and all the visited content is archived. This service enables users to archive a small website autonomously.

The archived content later becomes available in Arquivo.pt.

Complete Page: Crowdsourced Digital Curation

Web archives do the best they can to thoroughly archive web pages. However, sometimes users find missing content in web archived pages (e.g. missing embedded images).

Arquivo.pt provides the “Complete page” option in the replay user interface, which automatically looks for missing content in external web archives and on the live web.

The content obtained is later integrated into Arquivo.pt and becomes available to all users. “Complete page” engages users in the curation of the web-archived collections.

Arquivo404: Fix Broken Links

Link rot has been a prevalent problem since the early days of the web. Arquivo404 is a single line of JavaScript code, installed on a website’s “404 – Page not found” error page, that mitigates broken links.

If a given page was not found, arquivo404 generates a message that suggests an alternative link to a web archived version of the broken URL preserved at Arquivo.pt. 

Notice that the message is displayed only if the page exists in Arquivo.pt. If it was not archived, the default “page not found” error message is presented. The list of web archives to be used is configurable.
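The decision logic described above can be sketched in a few lines. The following is an illustration in Python, not the actual arquivo404 script (which is JavaScript): the `not_found_message` function, the dictionary-backed stand-in archives, and the example URLs are all hypothetical, and no real web archive API is called.

```python
# Minimal sketch of the fallback logic: consult a configurable list of
# archives in order and suggest the first archived copy that is found.
# Every name and URL below is a hypothetical stand-in.
from typing import Callable, Iterable, Optional

DEFAULT_MESSAGE = "Page not found."

# A lookup takes a broken URL and returns an archived URL, or None.
ArchiveLookup = Callable[[str], Optional[str]]


def make_fake_archive(holdings: dict) -> ArchiveLookup:
    """Build a stand-in lookup backed by a dictionary of known captures."""
    def lookup(url: str) -> Optional[str]:
        return holdings.get(url)
    return lookup


def not_found_message(broken_url: str, archives: Iterable[ArchiveLookup]) -> str:
    """Return a suggestion if any archive holds the page, else the default."""
    for lookup in archives:
        archived_url = lookup(broken_url)
        if archived_url is not None:
            return f"Page not found. An archived version is available at {archived_url}"
    return DEFAULT_MESSAGE


# Two hypothetical archives, checked in order.
first_archive = make_fake_archive(
    {"http://example.org/old": "https://archive-a.example/2010/http://example.org/old"}
)
second_archive = make_fake_archive(
    {"http://example.org/older": "https://archive-b.example/2005/http://example.org/older"}
)

if __name__ == "__main__":
    archives = [first_archive, second_archive]
    print(not_found_message("http://example.org/old", archives))
    print(not_found_message("http://example.org/never-archived", archives))
```

Passing the archives in as a list mirrors the configurability mentioned above: changing which archives are consulted, or in what order, does not require touching the fallback logic.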

Memorial: Preserve Your Old Website

There are many historical websites that provide valuable information but are no longer updated and require significant resources to be kept online. Moreover, costs grow as websites become older and dangerous security issues frequently occur. 

The Arquivo.pt Memorial offers high-quality preservation of websites’ content with the possibility of maintaining their original domains. This way, the website content remains searchable through live-web search engines. 

The links to internal pages on the website are also redirected to the corresponding web-archived pages to avoid broken links from external pages.

Training and Education on Web Preservation

Arquivo.pt has been raising awareness about the importance of web preservation. It issued a set of recommendations for developing preservable sites and has been promoting a free training programme composed of four modules:

  • New ways of searching the past: presents the search and access available at Arquivo.pt and targets any Internet user;
  • Well publish to well preserve: discusses recommendations for publishing preservable websites and targets web authors;
  • Automatic processing of information preserved from the Web: presents the Arquivo.pt APIs and targets web developers;
  • Web archiving – Do-it-yourself!: teaches how to adequately acquire, store, and replay web content and targets information professionals.

The Arquivo.pt Award annually distinguishes innovative works based on the historical information preserved by Arquivo.pt. The Arquivo.pt awards began in 2018, and the 15 works awarded so far clearly demonstrate the utility of web archives.

The members of the Arquivo.pt team have been publishing open-access technical and scientific articles related to web archiving since 2008, including the book The Past Web: Exploring Web Archives (Green Open Access). All the developed software is available as free open source projects.

Main Challenge: Spread the Word About Arquivo.pt!

The Arquivo.pt project began in 2007, and it has been running as a public service since 2013. However, most people in Portugal and all over the world have never heard about it. Getting people’s attention is a major challenge, especially in the online world. 

As most online information and services are apparently available for free, web archives must compete with the Internet giants (e.g. Google, TikTok, or Meta) for web users’ attention. If you find Arquivo.pt to be useful and want to support it: Spread the word about Arquivo.pt!

Greetings from the Web Archiving Section: A New Beginning!

The 2022-2023 Steering Committee met for the first time in October to discuss our goals for the upcoming year, and we are currently scheduling Coffee Chats and lining up some great blog and Twitter content for the next year!

We have a few new members of the Steering Committee this year, so we are introducing ourselves by sharing our names, roles, institutional affiliations, goals for the section, and favorite web archives.

Susan Paterson (chair)

Greetings everyone! My name is Susan Paterson and I’m the Chair of the Web Archiving Section. I am a liaison librarian at Koerner Library at the University of British Columbia Vancouver campus, which is located on the traditional, ancestral, and unceded territory of the xʷməθkʷəy̓əm (Musqueam) people. My liaison areas are government information, social work and French studies. My web archiving work stems from my role as a government information librarian. I have been involved in various groups in Canada to document and preserve historical government content through web archiving.

This is my second year as both an SAA and WAS member and I hope to learn much more about web archiving from my colleagues. I hope to build on the tremendous work of past committee members to make this section both welcoming and an invaluable resource for web archivists. The more the merrier!

I admire the work being done by Saving Ukrainian Cultural Heritage Online (SUCHO), a group of over 1300 cultural heritage professionals collaboratively working to preserve at-risk Ukrainian cultural sites. The group was founded in March 2022 by Quinn Dombrowski, Sebastian Majstorovic, and Anna Kijas, and you can learn more about SUCHO’s work and the SAA’s Resolution Honoring SUCHO.

Corinne Chatnik (vice-chair)

I’m Corinne Chatnik. I recently began my position as Digital Collections and Preservation Librarian at Union College in Schenectady, NY, and was previously a professional archivist specializing in digital archiving for five years at the New York State Archives. This is my first time on the Web Archiving Section Steering Committee, and I will be serving as Vice-Chair. This year I’m excited to meet others working on web archiving and hear about their experiences and successes. 

Allison Fischbach (secretary)

Hello, I’m Allison Fischbach, Digital Archivist for the Chesney Medical Archives at Johns Hopkins University & Medicine in Baltimore, MD. I previously served as a Student Member of the SAA Web Archiving Section and I’m glad to serve again as Secretary. This year I’m excited to learn more about the descriptive elements of web archiving, how researchers use web archives, and the tools and techniques used for capture and access.

My favorite web archive is Artbase from Rhizome. I think the way web content is rendered and re-interpreted as art is especially interesting. 

Amanda Greenwood (education coordinator)

Hello everyone! I’m Amanda Greenwood, Bigelow Project Archivist at Union College in Schenectady, NY. This is my second consecutive term on the SAA Web Archiving Section Steering Committee, and I’m looking forward to this year being as enjoyable as the last! I’ve worked on a few web archiving projects, some smaller and some larger-scale, and I also enjoy helping others start web archiving initiatives at their institutions.

In addition, my writing and research interests in web archiving focus on labor and maintenance; collective memory and responsibility; and emotion and trauma in web archives, so I’m very interested in hearing how other people who are involved in web archiving are discussing these ideas. For these next two years, my goal is to continue to support collaboration between other SAA sections and help promote innovative web archiving projects, methods, and resources to the community. As the Education Coordinator, I’m looking forward to facilitating engaging meetings and talks for our section members. Two web archives I highly recommend are Arquivo.pt, the Portuguese web archive, and Oldweb.today, a site that lets users browse the past web using past browsers!

Mara Friedman (student member)

I’m Mara Friedman, a current MLIS student at Rutgers University, where I am splitting my studies between the Data Science and the Archives and Preservation concentrations. Professionally, I am the Gift Processing Associate in Fundraising at Doctors Without Borders USA. I first became fascinated with information science while working in fundraising operations on a database migration project, where I assisted in untangling and rebuilding the underlying metadata records and data ingestion procedures. After a year’s worth of coursework I have become particularly interested in the preservation of digital-born resources, and web archiving sits at the very crux of this. I’m excited for the opportunity to learn from and contribute to the Web Archiving Section.

Call for News!

We would love you to get involved and share ideas, so we invite you to participate by submitting news, announcements, and topics of interest for inclusion in the blog, listserv, or Twitter. We also welcome guest contributors to the blog, so please feel free to contact us with your ideas. Collaboration and submission are open to members of all SAA sections.

Please send items and suggestions to Allison Fischbach (afischbach@jhmi.edu).

Call for Web Archiving Section Steering Committee Members (deadline extended)

The Web Archiving Section is excited to accept nominations for the following Steering Committee positions for the 2022-2023 year!

  • Vice-Chair/Chair-Elect
  • Secretary
  • Communications Manager
  • Education Coordinator
  • Student Member

If you or someone you know would like to run for a position on the Web Archiving Section Steering Committee, please fill out this form by August 2022 with the following:

  • Candidate Name
  • Job Title and Institution, if applicable
  • Bio and Candidate Statement (1-2 paragraphs)
  • Title of Steering Committee position sought

Position descriptions can be found below. Please keep in mind that membership in the Web Archiving Section is required in order to participate in elections through candidacy or casting a ballot. You may only run for one position. To learn more about the Web Archiving Section, check out the Web Archiving Section microsite and the Web Archiving Section blog.

We look forward to hearing from you!

Position Descriptions:

Vice-Chair/Chair-Elect: The Vice-Chair serves for two years, the first year as Chair-Elect and the second year as Chair.

  • Supports duties and responsibilities of the Chair as assigned.
  • Operates as acting Chair in the absence of the Chair.
  • Serves as member of the Steering Committee.
  • Fulfills all responsibilities specified in Section IX: Sections of the SAA Governance Manual.

Secretary (two-year term)

  • In consultation with the Chair and Vice-Chair, establishes all Steering Committee meetings.
  • Calls for and distributes agenda items for Steering Committee meetings. 
  • Records meeting minutes and distributes them to the Steering Committee. 
  • Serves as member of the Steering Committee.

Education Coordinator (two-year term)

  • Serves as the section’s liaison to the SAA Education Committee.
  • Arranges informal online meet-ups for members.
  • Prepares educational experiences, such as guest speakers.
  • Serves as member of the Steering Committee.

Communications Manager (one-year term)

  • Maintains and updates the section’s microsite, blog, and Twitter feed.
  • Keeps section’s email list recipients informed on section news, events, and regular activities.
  • Serves as a member of the Steering Committee. 

Student Member (one-year term)

  • Serves as a liaison to SAA student chapters and groups. 
  • Serves as a member of the Steering Committee. 
  • Must be an actively enrolled student and student member of SAA at the time of election.

Coffee Chat Recap with Library and Archives Canada’s Tom Smyth

By Susan Paterson, SAA Web Archiving Section Vice-Chair and liaison librarian at Koerner Library at the University of British Columbia Vancouver.

The Web Archiving Section hosted another successful Coffee Chat, which featured Tom Smyth, Program Manager of the Web and Social Media Preservation Program within the Digital Preservation Division at Library and Archives Canada (LAC). Tom discussed the evolution of the program, which began in December 2005. The collection exceeds 100 TB of content and includes over 2.64 billion assets from wide-ranging collections such as Government of Canada websites and thematic and rapid-response collections. 

At the start of LAC’s National Web Archive Program, collection activities focused on the Government of Canada domain, provincial and federal elections, Canada’s experience in the Summer and Winter Olympics and Paralympics, and commemorative events such as state funerals and the War of 1812. In 2013, the program expanded to include thematic collections and events-based collections such as COVID-19, the 2022 trucking convoy protest in Ottawa, and current Canadian perspectives on the war in Ukraine. The program conducts rescue and preservation harvesting and ensures that federal government websites are preserved pending decommission or content removal to another website. 

So who does all this work at LAC? The web archiving team consists of digital librarians and archivists who bring their own unique perspectives to web archival curation. On the technical side, the program’s senior crawl engineers are a critical component in tackling complex quality control issues. Tom and his team ensure that data curation aligns with user requirements and includes a digital humanities perspective with the purpose of building datasets for data historians, researchers and scholars. Tom used the following example to explain how the digital librarians and digital archivists approach collection curation: “We ask the question, ‘Twenty years from now, when a digital historian sits down to write about the history of COVID-19 and its impact on Canada, what kind of data and sources do they wish they would have?’ This influences how we select for curation.” COVID-19 has demonstrated that web archiving is a key resource for documenting history, and the pandemic has influenced how they collect materials. 

LAC’s COVID-19 web archive consists of over 2000 resources, 16 TB of data and over 478 million objects, including 34 newspapers in both official languages representing various political and regional perspectives from coast to coast – and it continues to grow. The collection includes social media and over 4 million tweets concentrating on COVID-19 dialogue and its impact on Canadians.

Tom discussed the “black hole of quality control” (QC) and described the importance of using a methodological approach when conducting QC. He explained the importance of both a framework and a balance, as QC can be a never-ending project. For more on LAC’s approach to quality control, you can view one of the 2021 IIPC presentations from Tom Smyth and Patricia Klambauer, “The Black Hole of Quality Control: Toward a Framework for Managing QC Effort to Ensure Value.”

The importance of web archive finding aids was a key thread throughout the talk and later in the discussion. As Tom explained, finding aids enable researchers to see at a glance whether the collection or datasets are helpful in addressing their research questions. Ideal finding aids would include practical information such as a master seedlist, how many seeds for each theme, resource distribution, type of resource, language, QC status, and metadata specifications. 

The Prime Minister of Canada’s website (www.pm.gc.ca) was used as an example to describe why web archiving is so important. Back in 2006, LAC had only five days to ensure that all of outgoing Prime Minister Paul Martin’s website was captured before the site was replaced with the website of newly elected Prime Minister Stephen Harper. Tom underlined the importance of being proactive and the need to have good working relationships with partners and government departments such as the Privy Council Office, which supports the Office of the Prime Minister. 

The Truth and Reconciliation Commission Web Archive (TRC) is another example of collaboration between LAC and other institutions. LAC worked jointly with the University of Winnipeg Library, the University of Manitoba Libraries and the National Centre for Truth and Reconciliation (NCTR) to develop this collection. The relationship with Indigenous Peoples is a concentration for LAC and a constant subject of collection. All of these collections will be publicly findable in the Government of Canada Web Archive, which is planned to be released in the fall of 2022. 

Tom discussed the ethical and legal considerations of web archiving. The Library and Archives of Canada Act (S.C. 2004, c. 11, s.8(2)) empowers LAC to collect pertinent web content. Steps are taken to ensure the copyright owner is informed that their site will be archived and takedown requests are respected. Even with LAC’s extensive web archiving experience and resources, some websites just can’t be crawled due to their structure and the need for human input. 

LAC is a client of the Internet Archive and uses the Archive-It software. All of the WARCs are transferred back to LAC for local digital preservation via LTO tape. 

Tom closed his talk on a hopeful note. Our efforts today will help us to reduce the likelihood of what Vint Cerf has termed a “digital black hole” and “the forgotten century.” The global community of web archivists is working in concert to ensure that nationally relevant materials are captured, preserved and made available for future generations.

Virtual Joint Section Meeting: Web Archiving & Performing Arts

SAA’s Web Archiving Section and Performing Arts Section will hold a virtual joint section meeting on Thursday, August 11, 2022 from 1-2:30 pm CT.

The steering committees invite colleagues to present case studies (up to 10 min) about the intersections between special collections, web archives, and born-digital records in all areas of the performing arts. Possible topics could include:

  • Projects and workflows for archiving web presences and born-digital records of performing arts organizations, artists, and companies, including social media
  • How and why researchers use and interact with performing arts web archives
  • Data computation, analysis, and visualization projects based on performing arts web archives and born-digital records
  • Collection development policy initiatives to incorporate web archives into performing arts special collections
  • Including performing arts web archives in finding aids
  • Reference and engagement work around performing arts web archives and born-digital records
  • Web archiving and post-custodial performing arts archives
  • Performing arts web archives design for mixed audiences/diverse patrons

For consideration, please send a presentation title, brief description (100 words or less), and name(s) of presenter(s) to Performing Arts Section Co-Chair Cecily Marcus (cecily.marcus@mnhs.org) and Web Archiving Section Chair Melissa Wertheimer (melissa.wertheimer@gmail.com) by July 1, 2022.

Library and Archives Canada Coffee Chat: Video and Slides

On Tuesday, May 10, 2022 the SAA Web Archiving Section was honored to host Library and Archives Canada’s Tom Smyth discussing web and social media preservation.

A video of the coffee chat and presenter slides will be available for a limited time.

Video

Library and Archives Canada Coffee Chat

Presenter Slides

Web and Social Media Preservation Program at the National Library and Archives Canada

Coffee Chat Recap: U.S. Federal Government Web Archiving

By Susan Paterson, SAA Web Archiving Section Vice-Chair and liaison librarian at Koerner Library at the University of British Columbia Vancouver.

The Web Archiving Section hosted its first coffee chat of the year on March 29, 2022. Melissa Wertheimer, Chair of the Web Archiving Section, led a panel discussion on US Federal Government web archiving activities. We were joined by web archiving experts from the National Library of Medicine, the Government Publishing Office, the Smithsonian and the Library of Congress. Topics included content curation, collaboration among agencies, staffing, workflow models, successes and challenges, and the impact of COVID-19 on their collecting activities.

The coffee chat recording is available to viewers until summer 2022.

National Library of Medicine – History of Medicine Division, Digital Manuscripts Program

The session began with an informative presentation by Christie Moffatt, Manager of the Digital Manuscripts Program, History of Medicine Division at the National Library of Medicine (NLM). Christie described NLM’s approach to web collecting, an activity which began in 2009 with their Health and Medical Blog Collection. Christie provided an overview of the NLM web archive collections, and the collection development policies that guide their web collecting.

Christie noted that thematic and events-based collections have grown and are a key component of their collection building. The Global Health Events Web Archive collection, one of their largest, began in 2015 with the Ebola outbreak. With the World Health Organization’s announcement of a global health emergency in January 2020, NLM started work on a COVID-19 pandemic collection. Interestingly, this specific designation of a global health emergency is worked into their collection development policy, and the NLM takes responsibility for building a web archive once this designation is made. An aim of the COVID-19 collection is to ensure that a diversity of perspectives on both the impact of and the response to the pandemic is archived. Tools and communication used for outreach during the pandemic (e.g., TikTok, Twitter), as well as personal narratives of the pandemic, make up part of the collection. It is the Library’s largest collection, with 3.5 FTEs working on the project. The National Library of Medicine’s Archive-It website can be explored here.

Government Publishing Office

Dory Bower, Archives Specialist at the Government Publishing Office (GPO), provided an overview of GPO’s web archiving activities and explained how Title 44 of the U.S. Code provides the mandate for Public Printing and Documents. Specifically, Chapter 19 discusses the Depository Library Program. Federal agencies should be distributing and publishing their documents through the GPO. If a federal agency is not doing so, the Superintendent of Documents should be notified. Unfortunately, with born-digital publications and publishing directly onto websites, this is not happening and material is being missed. GPO joined the Archive-It community in 2012 and since then has been using web archiving to help fill this gap.

The web archive is part of the overall GPO Digital Collection Development Plan. For material to be part of the GPO collection, it must be federally funded and be within the scope of the Federal Depository Library Program (FDLP). GPO is also focusing on collections of interest geared towards women, tribal libraries, and Native American communities, just to name a few. GPO maintains over 213 collections in Archive-It, making up 38.3 TB of data and consisting of over 392 million URLs. You can explore the Federal Depository Library Program Web Archive here.

Smithsonian Institution – Libraries and Archives

Lynda Schmitz Fuhrig, Digital Archivist at the Smithsonian Institution Libraries and Archives, presented next. Lynda provided a fascinating overview of the web archiving work that’s being done at the Smithsonian, which celebrated its 175th anniversary in 2021.

The Libraries and Archives of the Smithsonian is the official repository and record keeper for the Smithsonian. Their responsibilities include sharing, collecting, and preserving Smithsonian activities and output, which includes documenting its unique web presence, launched in 1995. They now have nearly 400 public websites and blogs and a very active social media presence covering Twitter, Instagram, Flickr, and YouTube, to name a few.

Like the National Library of Medicine, the Smithsonian has documented the impacts and effects of the COVID-19 pandemic on America. For example, beginning in March 2020, more focused and frequent crawls were necessary to document the altered scheduling of the closing and reopening of the Museums and the Zoo. Additionally, the closure of museums created a need for an increased digital presence, and the Smithsonian launched several new websites and initiatives, including Care Package, Vaccines & US and Our Shared Future.

Audience members were particularly interested in their web and social media workflows and tools. Along with Conifer and Browsertrix, the Smithsonian uses Netlytic, which was developed by the Social Media Lab at Ryerson University, to collect Smithsonian hashtags and accounts.

They are one of the few organizations that download the WARCs from Archive-It. They developed an in-house tool called WARC Retriever, which they hope to release on GitHub later this year. Lynda’s summation was poignant: “The Smithsonian web collections will continue to tell the history and stories of the Smithsonian.” You can explore the Smithsonian Archive-It page here.

Library of Congress – Web Archiving Program

To round out the panel, Meghan Lyon and Lauren Baker, Digital Collection Specialists from the Library of Congress (LOC) Web Archiving Program, provided an overview of the activities at LOC. The Web Archiving Program began in 2000 and is part of the Library’s Digital Content Management Section.

The LOC web archives consist of 3 PB of data organized into 174 collections, 75 of which are active collections. Like many of the other speakers, the Web Archiving Team collaborates frequently on web archive collections, relying on the contributions of collaborators around the Library. The Collection Development Office helps guide collection initiatives, and various committees review subject proposals, select content to archive, and determine collection focus. LOC comprehensively collects content from Legislative Branch Agencies and U.S. House and Senate offices and committees. They collect content about U.S. national elections and manage other event-based and thematic collections.

Meghan and Lauren addressed the issue of permission and web archiving. Their permissions policy is determined by the Office of the General Counsel and is based on copyright law. Permission requests must be sent to site owners for anything selected for web archiving. There are two types of permission request, permission to crawl and permission to display, and which ones apply depends on the country of publication and the category of the entity. You can explore the LOC Web Archiving Program website here.

The panel closed out the session by discussing how they became interested in web archiving and how their careers started in the field. Their initial experiences ranged from practicums to learning on the job. The remainder of the conversation covered trends and the future of web archiving tools, including the improvements people hope for and ideas for better harvesting tools and greater user awareness. The session was well attended with 181 registrants and over 80 attendees. Thank you to everyone who presented and who attended for such an engaging hour. Stay tuned for our next coffee chat, which will be in May!

Introducing the DocNow App

This week’s post was written by Zakiya Collier, Community Manager at Documenting the Now.

This week the Documenting the Now project announces the release of DocNow, an application for appraising, collecting, and gathering consent for Twitter content. DocNow reimagines the relationship between content creators and social media analysts (archivists and researchers) by addressing two of the most challenging issues of social media archiving practice—the issues of consent and appraisal.

The Documenting the Now Project is happy to release version 1.0 of our open-source tool, free for anyone to use. Read all about the app, what it does, and how to engage with the DocNow project team for support and to provide feedback.

Over the last seven years, Documenting the Now has helped to foster an environment where a broader sector of cultural memory workers can learn about web archiving tools and practices and can become involved with web archiving networks. This has largely been achieved by practicing transparency and inviting people who have traditionally been left out of established web content archiving networks into the project to participate, namely students, activists, and archivists who represent marginalized communities and who work in community-centered organizations, HBCUs, public libraries, community-based archives, and tribal libraries and archives.

Documenting the Now was a response to the need among scholars, activists, archivists, and other memory workers for new tools that would provide easily accessible and user-friendly means to collect, visualize, analyze, and preserve web and social media content to better document public events. In addition, it aimed to respond to questions and concerns related to ethics, safety, intellectual property, and access in the collection, preservation, and dissemination of Twitter data in particular.

Documenting the Now has also developed community-centered web and social media archiving tools that prioritize both care for content creators and robust functionality for users:

  • Twarc – a command line tool and Python library for collecting tweet data from Twitter’s official API
  • Hydrator – a desktop application for turning Tweet ID datasets back into tweet data to use in your research
  • Social Humans – a label system to specify the terms of consent for social media content
  • The Catalog – a community-sourced clearinghouse to access and share tweet identifier datasets

In continuing to support and develop tools that embody ethical practices for social media archiving, the DocNow app joins this suite of tools. DocNow is an application for appraising, collecting, and gathering consent for Twitter content, and it offers several new features, including:

  • Trends tab to view trending topics across the globe in real time
  • Explore tab to view content by users, media, URLs, and related hashtags all on one screen
  • Live testing and refining of collecting parameters on recent tweets
  • Tweets per hour calculator to easily identify Twitter bot accounts
  • Search and Collect tweets back in time via Search API and forwards with Stream API
  • Activate toggle to start collecting tweets and send a notification tweet to encourage transparency and communication in Twitter data collection
  • Collections tab to share information about your collection with the public
  • “Find Me” and Insights Overview features to specify and gather consent using Social Humans labels
  • Download Tweet ID archive for sharing following Twitter’s terms of service
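
To make the tweets-per-hour idea concrete, here is a minimal Python sketch of how such a rate could be computed from an account's tweet timestamps. It is an illustration only and not DocNow's actual implementation: the function name `tweets_per_hour` and the choice to floor the observed time span at one hour are assumptions made for this example. A very high rate can be one signal of automation, but it is not proof on its own.

```python
from datetime import datetime, timedelta


def tweets_per_hour(timestamps):
    """Return the average number of tweets per hour for a list of datetimes.

    Hypothetical helper for illustration; not part of the DocNow codebase.
    """
    if not timestamps:
        return 0.0
    span = max(timestamps) - min(timestamps)
    # Floor the span at one hour (an assumption) so that a short burst
    # of tweets does not produce an inflated or undefined rate.
    hours = max(span.total_seconds() / 3600, 1.0)
    return len(timestamps) / hours


if __name__ == "__main__":
    start = datetime(2022, 1, 1, 9, 0)
    # Made-up sample data: one tweet every 30 seconds for two hours.
    burst = [start + timedelta(seconds=30 * i) for i in range(240)]
    print(f"{tweets_per_hour(burst):.1f} tweets per hour")
```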

The DocNow app also works in concert with other Documenting the Now tools, creating a 4-step social media archiving journey for users:

Step 1: Collect content with the DocNow App by activating a search. Set collection limits and explore insights as your collection grows.
Step 2: Download your archive from the DocNow App, which includes a Tweet Viewer, Tweet IDs, and media files.
Step 3: Hydrate your Tweet IDs from the archive’s tweets.csv file back into full tweets using DocNow’s Hydrator desktop application.
Step 4: Describe your collection and share your Tweet IDs with other researchers by adding them to the DocNow Catalog.
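
As a rough illustration of preparing for Step 3, the Python sketch below pulls tweet IDs out of a CSV file and writes them to a plain-text file with one ID per line, a common format for sharing and hydrating tweet ID datasets. The column name `id` and the output filename `ids.txt` are assumptions made for this example; check the header row of your own tweets.csv before relying on them.

```python
import csv


def extract_tweet_ids(csv_path, id_column="id"):
    """Return unique numeric tweet IDs from a CSV file, in file order.

    The default column name "id" is an assumption, not a documented
    DocNow field name.
    """
    ids = []
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            tweet_id = (row.get(id_column) or "").strip()
            # Keep only well-formed, not-yet-seen numeric IDs.
            if tweet_id.isdigit() and tweet_id not in seen:
                seen.add(tweet_id)
                ids.append(tweet_id)
    return ids


def write_id_file(ids, out_path):
    """Write one tweet ID per line to a plain-text file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for tweet_id in ids:
            f.write(tweet_id + "\n")


if __name__ == "__main__":
    # Hypothetical file names, used here only for illustration.
    write_id_file(extract_tweet_ids("tweets.csv"), "ids.txt")
```

IDs are kept as strings rather than converted to numbers so that they pass through unchanged, since spreadsheet tools can silently round very large integers.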

Ways to Use DocNow
There are 3 different ways to use DocNow: joining the community instance, running DocNow locally on a computer, and installing an instance of DocNow in the cloud. The community instance is a great way to get familiar with the tool before committing to running an instance, but those with development skills may want to administer their own instance of DocNow locally or in the cloud.

  1. Join Documenting the Now’s hosted community instance
  2. Run DocNow locally on your machine
  3. Install your own instance in the cloud

For help with installation and getting started, the Documenting the Now team will host community conversations. Dates will be announced soon! More information about the DocNow App can be found here.

Documenting the Now is seeking community input on all of our features as we continue to develop DocNow. Please join our Slack channel by going to our website, or email us at info@docnow.io.