Learning How to Web Archive: A Graduate Student Perspective

This week’s post was written by Amanda Greenwood, Project Archivist at Union College’s Schaffer Library in Schenectady, New York.

While it is not a new aspect of archival work, web archiving is an important practice that has advanced significantly in the past ten years. Developers are creating new web archiving tools and improving the current ones, but the process can be challenging because of dynamic web content, changing web technology, and ethical concerns. However, the need for preserving web-based material has become a priority for improving preservation, creating “greater equity and justice in our preservation practices, and [finding] ways to safeguard the existence of historical records that will allow us in future to bear witness, with fairness and truth and in a spirit of reconciliation, to our society’s response to COVID-19” and other social justice issues. The explosion of web archiving initiatives in various institutions and organizations has created web archivists of us all; however, how difficult is it for someone with zero experience to learn this important skill?

From October 2020 to May 2021, I held the Anna Radkowski-Lee Web Archives Graduate Assistantship at the University at Albany, State University of New York. My responsibility was to manage the web archives collections with Archive-It through web crawling, scheduling, reading host reports, and rescoping. These activities would culminate with a meeting at the end of my assistantship to discuss appraisal of the active collections. While I was excited to learn this archival skill, I did not have any experience with web development, so learning to interpret the host reports took a few months because I did not understand some elements of the URLs they listed. For example, I did not know that some parts of a URL told me the website was a WordPress site, and I had never heard of “robots.txt” before. Thus, my supervisor, University Archivist Greg Wiedeman, spent a lot of time at the beginning of the assistantship teaching me how to read the host reports. I needed to learn HTTP status codes and other parameters that would help me make sense of how to rescope the crawls more efficiently.
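For readers in the same position I was in, the three clues mentioned above (WordPress paths, robots.txt rules, and HTTP status codes) can be sketched in a few lines of Python. This is only an illustration using the standard library and is not part of Archive-It; the function names, the example URLs, and the list of WordPress path hints are my own assumptions.

```python
from http import HTTPStatus
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Path fragments commonly seen on WordPress sites (an illustrative, incomplete list).
WORDPRESS_HINTS = ("/wp-content/", "/wp-includes/", "/wp-admin/", "/wp-json/")


def looks_like_wordpress(url: str) -> bool:
    """Return True if the URL path contains a typical WordPress directory."""
    path = urlparse(url).path
    return any(hint in path for hint in WORDPRESS_HINTS)


def describe_status(code: int) -> str:
    """Turn a numeric HTTP status code into a readable label."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes outside the standard registry are reported as-is.
        return f"{code} (non-standard code)"


def blocked_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check whether the given robots.txt text disallows this URL for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical robots.txt content and URLs, used only for demonstration.
    robots = "User-agent: *\nDisallow: /private/\n"
    print(looks_like_wordpress("https://example.org/wp-content/uploads/photo.jpg"))
    print(describe_status(404))
    print(blocked_by_robots(robots, "https://example.org/private/page.html"))
```

A URL that is blocked by robots.txt or that returns a 4xx or 5xx status is usually a sign that the scope of a crawl needs another look.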

I really appreciated the support from everyone at the Archive-It Help Center because they were instrumental in helping me solve a lot of problems related to the crawls. However, I felt the Help Center website instructions and tutorials were a bit difficult to follow at times. I think they are probably easier for more experienced users to understand, or at least for users who are familiar with web development. The other frustrating element was that managing the collections via web crawling was extremely time-consuming, but I was only allowed to work 20 hours a week per the assistantship. It was quite a substantial role, and I realized the job required someone to work on it full-time, not only 4-5 hours a day.

The unpredictability, frequency, and length of website updates also proved to be a large obstacle. I would work hard to efficiently scope one collection and schedule it, but the host report would come back with an error because the website was being updated. I often had to put those collections aside and return to them at a later date, but then the update would yield a whole new set of problems with the host report, and I would spend more time rescoping and rerunning test crawls for those collections. Test crawls would also require multiple runs and constant rescoping, which meant that I had to run longer test crawls, and some of those would take four weeks. The host reports were difficult to decipher because of crawler traps or long strings of characters. Learning how to read the host reports took the longest amount of time.
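To give a sense of what a crawler trap can look like in a host report, here is a rough Python sketch that flags URLs with a few common warning signs. It is a heuristic of my own for illustration, not how Archive-It detects traps, and the thresholds are arbitrary assumptions that would need tuning for a real collection.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def trap_signals(url, max_length=200, max_repeats=2, max_params=6):
    """Return a list of reasons a URL might belong to a crawler trap.

    The default thresholds are illustrative guesses, not established limits.
    """
    parsed = urlparse(url)
    segments = [part for part in parsed.path.split("/") if part]
    signals = []

    # Long strings of characters often come from session IDs or endless filters.
    if len(url) > max_length:
        signals.append("very long URL")

    # The same directory name repeating suggests a loop, e.g. /next/next/next/.
    counts = Counter(segments)
    if counts and max(counts.values()) > max_repeats:
        signals.append("repeated path segment")

    # Many query parameters can indicate calendars or faceted search pages.
    if len(parse_qs(parsed.query)) > max_params:
        signals.append("many query parameters")

    return signals


if __name__ == "__main__":
    # Hypothetical URL used only for demonstration.
    print(trap_signals("https://example.org/calendar/next/next/next/next"))
```

URLs flagged this way are candidates for a closer look before rescoping, not automatic exclusions.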

Additionally, I would get excited about a potentially strong test crawl, but after QA I would notice that dynamic content, such as interactive web elements, had not been captured. Thus, I would need to rescope and rerun the test crawls and use other web archiving tools like Conifer. I looked into using other open source tools such as WARCreate, WAIL, and Wget, but Conifer ultimately helped me with what I needed at the time.

Moreover, we were often presented with impromptu requests to archive a faculty website quickly, either because the faculty member was retiring or because the website host was taking the site down, so it was stressful prioritizing those new projects. Some of the faculty websites used advanced web development elements, so I needed to use the aforementioned open source tools to capture the site and upload the WARC file into Archive-It.

In sum, having zero web development and coding experience did not impede my ability to learn web archiving, although having it would have helped me significantly. Having a subscription to Archive-It can be helpful because of the technical support and video tutorials, but open access tools and software are plentiful, and the community is helpful and supportive in terms of training and instructive documentation. With my web archiving training, I was able to help other institutions initiate their own web archiving programs, and I hope to continue on this trajectory in the future.


Jones EW, Sweeney S, Milligan I, Bak G, and McCutcheon J-A. 2021. Remembering is a form of honouring: preserving the COVID-19 archival record. FACETS 6: 545–568. doi:10.1139/facets-2020-0115

Community Input for Web Archiving at Small and Rural Libraries

This week’s post was written by Grace McGann (Moran), Teen Librarian, Tipp City Public Library.

Before I begin, I want to acknowledge that Tipp City Public Library is on the unceded lands of the Kaskaskia, Shawnee, Hopewell, and Myaamia.

The Tipp City Public Library has been open for nearly a century now, in a town rich with history. We are located in the downtown historic district of a city whose population falls just below 10,000. Because we are such an ingrained part of the community, it is difficult to work here and not notice how our patrons care about our city’s history. This blog post briefly examines the importance of involving our respective communities in the collection development process.

My previous experience in web archiving at the University of Illinois drove me to apply to the Community Webs program at the Internet Archive. This program was built for small cultural heritage institutions to create a diverse web history while using Archive-It technology at zero cost.

Having never built a web archive from the ground up, one of the first questions I faced was: where to start? I find that with any collection, the problem isn’t necessarily finding materials; it is choosing a specific policy of selection and collection. In a previous blog post, I made an argument for separate collection development policies for web archiving. Having one in place makes evaluating websites simpler, especially when the discovery process relies on outside entities such as community members.

After joining Community Webs just a few months ago, I had a curated list of websites for capture. However, these only came from my cursory searches around the internet. In order to move beyond those first websites and create a more representative archive, I leveraged the local network. First, I reached out to a sociology professor at the University of Dayton, Dr. Leslie Picca. I knew that she was doing research on race and could possibly have connections to colleagues in the digital humanities. She led me to Dr. Todd Uhlman of the history department at UD.

Connecting with Dr. Uhlman changed my thinking about how to build this web archive. In learning about the digital humanities work he is doing in the Greater Dayton Area, I found valuable websites to be captured and preserved.

After speaking to Dr. Uhlman and agreeing to capture his content, I was contacted by community members who wanted websites to be preserved. After evaluating and capturing the websites, I created a community input form. While I am still waiting for this form to gain more traction, my theory is that community input is crucial for web archiving at small cultural heritage institutions.

I am not alone in this assertion. Papers about community archival practices demonstrate an urgent need for this sort of involvement. Zavala et al. (2017), though speaking more about physical archives, shared this:

There is no reason why government or university archives could not engender post-custodial practice, foster community autonomy and promote shared governance, if only they are willing to share power and authority with the communities they have historically left out.

By changing the way we engage in collection development, we challenge the systems of oppression that have been institutionalized within record-keeping institutions (whether we are aware of them or not).

I have a lot left to say about web archiving, but I want to drive this point home. Archives, whatever form they take, provide cultural value. Culture does not exist without community. Therefore, it’s actually pretty simple: communities should help create archives.