Learning How to Web Archive: A Graduate Student Perspective

This week’s post was written by Amanda Greenwood, Project Archivist at Union College’s Schaffer Library in Schenectady, New York.

While it is not a new aspect of archival work, web archiving is an important practice that has advanced significantly in the past ten years. Developers are creating new web archiving tools and improving existing ones, but the process can be challenging because of dynamic web content, changing web technology, and ethical concerns. Preserving web-based material has nonetheless become a priority for creating “greater equity and justice in our preservation practices, and [finding] ways to safeguard the existence of historical records that will allow us in future to bear witness, with fairness and truth and in a spirit of reconciliation, to our society’s response to COVID-19” (Jones et al. 2021) and other social justice issues. The explosion of web archiving initiatives across institutions and organizations has made web archivists of us all; but how difficult is it for someone with zero experience to learn this important skill?

From October 2020 to May 2021, I held the Anna Radkowski-Lee Web Archives Graduate Assistantship at the University at Albany, State University of New York. My responsibility was to manage the web archives collections with Archive-It through web crawling, scheduling, reading host reports, and rescoping. These activities would culminate in a meeting at the end of my assistantship to discuss appraisal of the active collections. While I was excited to learn this archival skill, I had no experience with web development, so interpreting the host reports took a few months to learn because I did not understand some elements of the URLs they contained. For example, I did not know that certain parts of a URL could tell me a website was a WordPress site, and I had never heard of “robots.txt” before. Thus, my supervisor, University Archivist Greg Wiedeman, spent a lot of time at the beginning of the assistantship teaching me how to interpret the host reports. I needed to learn HTTP status codes and other parameters that would help me make sense of how to rescope the crawls more efficiently.
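For anyone equally new to these terms, here is a minimal, illustrative Python sketch of the kinds of clues I had to learn to spot; the URL, domain, and WordPress path markers are hypothetical examples, not taken from the actual collections, and this is not how Archive-It itself works.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# A URL of the kind that might appear in a host report (hypothetical).
url = "https://www.example.com/wp-content/uploads/2020/10/report.pdf"

# Path segments like /wp-content/ or /wp-includes/ are a strong hint
# that the site runs on WordPress.
wordpress_markers = ("/wp-content/", "/wp-includes/", "/wp-json/")
if any(marker in urlparse(url).path for marker in wordpress_markers):
    print("This looks like a WordPress site.")

# robots.txt is the file where a site owner asks crawlers to skip certain
# paths; Python's standard library can read it and answer "may I fetch this?"
robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()
print("Allowed to crawl?", robots.can_fetch("*", url))
```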

I really appreciated the support from everyone at the Archive-It Help Center because they were instrumental in helping me solve a lot of problems related to the crawls. However, I felt the Help Center website’s instructions and tutorials were a bit difficult to follow at times; they are probably easier to understand for more experienced users, or at least for users who are familiar with web development. The other frustrating element was that managing the collections via web crawling was extremely time-consuming, but I was only allowed to work 20 hours a week under the assistantship. It was quite a substantial role, and I realized the job required someone working on it full-time, not only 4-5 hours a day.

The unpredictability, frequency, and length of website updates also proved to be a large obstacle. I would work hard to scope a collection efficiently and schedule it, but the host report would come back with an error because the website was being updated. I often had to set those collections aside and return to them later, but then the update would yield a whole new set of problems in the host report, and I would spend more time rescoping and rerunning test crawls for those collections. Test crawls also required multiple runs and constant rescoping, which meant running longer test crawls, some of which took four weeks. The host reports themselves were difficult to decipher because of crawler traps and long strings of characters, and learning how to read them took the longest amount of time.
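To give a sense of what a crawler trap can look like in a host report, here is a rough, illustrative Python sketch of the symptoms I learned to watch for; the URLs and thresholds are hypothetical and do not reflect Archive-It’s own scoping logic.

```python
from collections import Counter
from urllib.parse import urlparse

def looks_like_crawler_trap(url, max_repeats=3, max_query_length=200):
    """Flag URLs showing common crawler-trap symptoms (illustrative thresholds)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]

    # Symptom 1: the same path segment repeating, e.g. an events calendar
    # that keeps linking deeper into /calendar/calendar/calendar/...
    if segments and Counter(segments).most_common(1)[0][1] >= max_repeats:
        return True

    # Symptom 2: very long query strings, often session IDs or date
    # parameters that generate an endless supply of unique URLs.
    if len(parsed.query) > max_query_length:
        return True

    return False

# Hypothetical host-report URLs.
for u in [
    "https://www.example.edu/events/calendar/calendar/calendar/2021/",
    "https://www.example.edu/news/article?id=42",
]:
    print(looks_like_crawler_trap(u), u)
```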

Additionally, I would get excited about a potentially strong test crawl, but after QA I would notice that dynamic content, such as interactive web elements, had not been captured. I would then need to rescope and rerun the test crawls and turn to other web crawling tools like Conifer. I also looked into other open source tools such as WARCreate, WAIL, and Wget, but Conifer ultimately gave me what I needed at the time.
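As an illustration of the Wget route, the sketch below shows one way to capture a single page and its embedded resources into a WARC file; the flags are standard Wget options, but the target URL is hypothetical, and the same command can just as easily be run directly in a terminal rather than from Python.

```python
import subprocess

# Capture one page plus the images, CSS, and JS it embeds into a WARC file
# using Wget's built-in WARC support; the target URL is hypothetical.
subprocess.run(
    [
        "wget",
        "--page-requisites",         # also fetch the resources the page needs
        "--warc-file=faculty-site",  # writes faculty-site.warc.gz alongside the files
        "https://www.example.edu/~facultymember/",
    ],
    check=True,
)
```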

Moreover, we were often presented with impromptu requests to archive a faculty website quickly because the faculty member was retiring or the host was taking the website down, so prioritizing those new projects was stressful. Some of the faculty websites used advanced web development elements, so I needed to use the aforementioned open source web crawling tools to capture the sites and upload the WARC files into Archive-It.

In sum, having zero web development and coding experience did not impede my ability to learn web archiving, although having it would have helped me significantly. A subscription to Archive-It can be helpful because of the technical support and video tutorials, but open access tools and software are plentiful, and the community is helpful and supportive in terms of training and instructive documentation. With my web archiving training, I was able to help other institutions initiate their own web archiving programs, and I hope to continue on this trajectory in the future.

Source:

Jones EW, Sweeney S, Milligan I, Bak G, and McCutcheon J-A. 2021. Remembering is a form of honouring: preserving the COVID-19 archival record. FACETS 6: 545–568. doi:10.1139/facets-2020-0115
