Reviving Union College’s Web Archiving Program

This post was written by Corinne Chatnik, Web Archiving Steering Committee Chair and Digital Collections and Preservation Librarian at Union College.

In 2023, the web archiving program at Union College was in an unusual situation. A former librarian had started the program with Archive-It several years earlier. After they left, crawls and collections were not maintained, and some crawls continued to run for several years. I have been in my current position for a little over a year, and during that time I analyzed our institution’s previous web archiving efforts and realized we needed to reevaluate our workflow and develop policies and procedures.

Archive-It’s software and service are amazing, but they need to be maintained and reevaluated every so often. Our instance had been capturing anything linked from the main Union College website, student blogs, social media for every organization on campus, YouTube videos, and Flickr accounts, unchecked, for years. All that data amounted to 7 TB, an incredibly large and expensive amount of data.

Essentially, we decided to continue using Archive-It but start everything from scratch. To give me a hand with this project, I hired an intern named Grant to work 20 hours a week over the summer. Grant started by analyzing the types of content already in our Archive-It instance and broke them into categories like college records, social media, student work, and news stories. Then, we took those categories of records to the College Records Manager and Archivist and discussed what fell into our Archive and Special Collections collecting scope.

We went to Special Collections and Archives because we didn’t want web archiving to be considered an intimidating, technical process, but rather another type of record within our collection that happened to have a unique way of being collected and harvested. With the Archivist, we decided that records such as student work uploaded to a class site and nonofficial social media should not be captured, because it wasn’t ethical to archive student work without permission. For image repositories like Flickr and official videos from YouTube, we would try to get the originals from the source and preserve them like born-digital archival material instead of crawling those sites.

In terms of the practical use of the Archive-It software, the documentation is robust, so we ran a lot of test crawls trying out the different methods of crawling. Through trial and error, we performed quality assurance checks to make sure all the content looked right. Outside of the technical aspects of web archiving, we wanted to justify the program by creating a policy. We wanted this policy to address the goals of the web archiving program, the stakeholders, selection and appraisal, access, and takedown.

To help us make these decisions, we first researched other colleges’ and universities’ web archiving policies and identified aspects of each that we thought could apply to us. I also found this great worksheet from the Digital Preservation Outreach and Education Network to help you define the scope and collecting statement of your web archiving program.

The shortened version of our policy is as follows:

  • Our goal is to proactively archive website content as a historical and cultural record for future historians and researchers.
  • Our stakeholders include the faculty, students, alumni, and staff of Union College.
  • The target audience for the use of archived material is those in the community with scholarly and intellectual interests in connection with Union College.
  • Our selection policy is to identify official websites and digital records related to Union College that are of permanent value or no longer actively used.
  • Our access policy is that these materials will be made publicly accessible via Archive-It. Archive-It is a community of partner organizations, and its shared, “end-to-end” web archiving service is administered by the Internet Archive.
  • Our takedown policy is that Schaffer Library will respect the creator’s intellectual property rights by providing “opt-out” recourse after acquisition.

After we were happy with all of that, we permanently crawled the sites within our scope and worked on a finding aid for the Union College Web Site Archive collection. You can also view Union College’s web archive collection.

Finally, Grant shared the following about his internship experience: “My work on web archiving has not only allowed me to familiarize myself with Archive-It software and preserve essential material about Union College on the web, but also the methodology behind what to include or exclude from a web archive due to obstacles unique to the medium of the internet and the potential legal consequences of certain errors or oversights.”
