A Tale of Two Tools: Archiving the University at Buffalo’s Web Periodicals

This post was written by Grace Trimper, Digital Archives Technician at University at Buffalo.

University Archives has been collecting alumni magazines, school and departmental newsletters, student newspapers, and literary magazines for decades – likely since the department was established in 1964. For a long time, this meant getting on snail mail lists or walking to departments to pick up the print issue of whichever publication was just released. These days, collecting periodicals also looks like clicking “download” or copying and pasting a URL into a web archiving tool.

Our work with digital periodicals ramped up in 2020, partially because the pandemic caused more and more publications to be delivered online. Most of UB’s periodicals are now available both in print and as PDFs, which makes preservation relatively straightforward: we download the PDF, create Dublin Core metadata, and ingest the package into our digital preservation system, Preservica.

We also saw schools, departments, and organizations begin publishing strictly web-based content without print or PDF surrogates. One of the first web periodicals we added to our collection was “The Baldy Center Magazine,” the semesterly publication of the Baldy Center for Law and Social Policy. The Digital Archivist set up a web crawl using Preservica’s Web Crawl and Ingest workflow, which uses Internet Archive’s Heritrix web crawler to capture websites and their various subpages.

The main benefit of this approach is convenience. We set the workflow to follow a predetermined number of hops and to capture to a maximum depth, so we can crawl the magazine and its linked pages without capturing UB's entire website. Other than checking the resulting web archive file (WARC), there isn't much else we need to do once the workflow runs. When the web crawl is complete, the WARC is automatically ingested into the collection's working folder, and we can link it to the finding aid without any other intermediary steps.
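
Conceptually, the two settings that keep the crawl contained are a hop budget (how many links away from the seed the crawler may travel) and a maximum path depth. The snippet below is only a rough Python sketch of that idea, not Preservica's workflow or Heritrix's actual configuration; the seed URL and limits are placeholders.

```python
# NOT Preservica's Web Crawl and Ingest workflow or Heritrix's configuration
# format -- just a minimal sketch of the idea behind the settings: follow links
# out from a seed page, but stop after a fixed number of hops and never descend
# past a maximum path depth. The seed URL below is a placeholder.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed, max_hops=2, max_depth=4):
    """Collect URLs reachable from `seed` within the hop and depth limits."""
    seed_host = urlparse(seed).netloc
    seen = {seed}
    frontier = [(seed, 0)]                  # (url, hops from the seed)

    while frontier:
        url, hops = frontier.pop(0)
        try:
            page = requests.get(url, timeout=30)
        except requests.RequestException:
            continue                        # skip pages that fail to load

        for link in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            target = urljoin(url, link["href"])
            parsed = urlparse(target)
            depth = len([seg for seg in parsed.path.split("/") if seg])
            if (
                parsed.netloc == seed_host  # stay on the publication's site
                and hops + 1 <= max_hops    # hop budget not exhausted
                and depth <= max_depth      # don't descend too deep
                and target not in seen
            ):
                seen.add(target)
                frontier.append((target, hops + 1))
    return seen


# print(crawl("https://example.edu/magazine/"))   # placeholder seed URL
```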

However, we did notice that the Heritrix web crawler struggles with some university webpages and does not always capture images and multimedia. The captured magazine looked almost right; it was just missing a few of the pictures. We learned that the renderer the system uses has trouble intercepting the URLs for some of the images throughout the website. This is a known limitation, and UB's website can be built in ways more complex than the system can handle.

We ran into similar obstacles when running weekly crawls on the website for UB’s student newspaper, The Spectrum. As an independent newspaper, its website is not hosted by the university. At first, this had some benefits: the Heritrix crawler didn’t have the image problems we saw with UB-hosted sites, and the rendered WARCs looked basically identical to the live site.

Then, they redid their website. Our captures haven’t looked good since. Certain fonts are wrong, and the navigation menu expands to cover 90% of the page any time you try to click on or read an article. It seems to be another issue with the renderer, so we continue to crawl the site as usual to make sure we don’t miss out on preserving new content.

Even though the Heritrix crawler worked well for some of our collections, it was costing us time and energy that could be spent on other projects. We needed another option. Enter Conifer.

Conifer is a free and open-source web archiving tool with a web interface. We have had better luck capturing images and multimedia with Conifer and Webrecorder. Like The Spectrum, UB's undergraduate literary magazine, NAME, is hosted independently of UB's website. Its construction is relatively straightforward: there's a webpage for each issue with links to contributions, and the site is full of images. There are also a couple of technologically unique works on the site, including a link to a JavaScript multimedia piece.

When I crawled the site for preservation in UB’s Poetry Collection, Conifer had no problem capturing any of these, and the resulting WARCs display perfectly in our public user interface.

This approach doesn’t come without its drawbacks. Where the Web Crawl and Ingest workflow in Preservica is convenient and automatic, using Conifer can be tedious. First, there is no setting and forgetting; if you want to capture a complex website with various links and subpages, you must start the application and then open each link to every page you want to capture. If you have too many tabs open, the application can randomly stop in the middle of a crawl, leaving you to start all over again. On top of that, we have the extra steps of downloading, unzipping, and ingesting the WARC, plus manually copying and pasting the URL into the asset’s metadata before the captured page will display in the digital preservation system.

No approach has been perfect thus far, and I don’t expect it will be for a while. Web archiving technology is constantly growing and improving, and how we attack web archiving depends heavily on the material. But with the tools we have available to us, we can preserve important pieces of UB’s history we wouldn’t have been able to before. And that’s kind of the point, isn’t it?

Reviving Union College’s Web Archiving Program

This post was written by Corinne Chatnik, Web Archiving Steering Committee Chair and Digital Collections and Preservation Librarian at Union College.

In 2023, the web archiving program at Union College was in a unique situation. The program had been started in Archive-It by a former librarian several years earlier. After they left, the crawls and collections were not maintained, and some crawls continued to run unattended for several years. I have been in my current position for only a little over a year, and during that time I analyzed our institution's previous web archiving efforts and realized we needed to reevaluate our workflow and develop policies and procedures.

Archive-It’s software and service are amazing, but an instance needs to be maintained and reevaluated every so often. Ours had been capturing anything linked from the main Union College website, student blogs, social media for every organization on campus, YouTube videos, and Flickr accounts, unchecked, for years. All that data amounted to 7 TB, which is an enormous and expensive amount to store.

Essentially, we decided to continue using Archive-It but to start everything from scratch. To give me a hand with this project, I hired an intern named Grant to work 20 hours a week over the summer. Grant started by analyzing the content already in our Archive-It instance and breaking it into categories such as college records, social media, student work, and news stories. Then we took those categories of records to the College Records Manager and Archivist and discussed what fell within our Special Collections and Archives collecting scope.

We went to Special Collections and Archives because we didn’t want web archiving to be seen as an intimidating, technical process; we wanted web archives treated as just another type of record within our collections, one that happens to be collected and harvested in a unique way. With the Archivist, we decided that records such as student work uploaded to class sites and unofficial social media should not be captured, because it isn’t ethical to archive student work without permission. For image repositories like Flickr and official videos on YouTube, we would try to get the originals from the source and preserve them like other born-digital archival material, rather than crawling those sites.

In terms of the practical use of the Archive-It software, the documentation is robust, so we ran a lot of test crawls to try out the different crawling methods, and through trial and error we performed quality assurance checks to make sure all the content looked right. Outside of the technical aspects of web archiving, we also wanted to justify the program by creating a policy that addresses its goals, the stakeholders, selection and appraisal, access, and takedown.

To help us make these decisions, we first researched other colleges’ and universities’ web archiving policies and identified aspects of each that we thought could apply to us. I also found a great worksheet from the Digital Preservation Outreach and Education Network that helps you define the scope and collecting statement of your web archiving program.

The shortened version of our policy is as follows:

  • Our goal is to proactively archive website content as a historical and cultural record for future historians and researchers.
  • Our stakeholders include the faculty, students, alumni, and staff of Union College.
  • The target audience for the archived material is members of the community with scholarly and intellectual interests connected to Union College.
  • Our selection policy is to identify official websites and digital records related to Union College that are of permanent value or are no longer actively used.
  • Our access policy is that these materials will be made publicly accessible via Archive-It, a shared, “end-to-end” web archiving service administered by the Internet Archive for a community of partner organizations.
  • Our takedown policy is that Schaffer Library will respect creators’ intellectual property rights by providing “opt-out” recourse after acquisition.

After we were happy with all of that, we ran permanent crawls of the sites within our scope and worked on a finding aid for the Union College Web Site Archive collection. You can also view Union College’s web archive collection.

Finally, here is a quote Grant gave me about his internship experience: “My work on web archiving has not only allowed me to familiarize myself with Archive-It software and preserve essential material about Union College on the web, but also the methodology behind what to include or exclude from a web archive due to obstacles unique to the medium of the internet and the potential legal consequences of certain errors or oversights.”

Domain Archiving Experience at the National Library, Singapore

This week’s post was written by Shereen Tay, Librarian at the National Library, Singapore.

The National Library, Singapore (NLS) is a knowledge institution under the National Library Board, which also manages 26 public libraries, the National Archives of Singapore, and the Asian Film Archive. At NLS, we have a mandate to preserve the published heritage of our nation through the legal deposit of works published in Singapore, as well as the web archiving of Singapore websites.

NLS started archiving Singapore websites in 2006 in response to the growing use and popularity of the Internet. However, we discovered it was an administratively cumbersome process, as we were required to first seek the written consent of website owners. Challenges in identifying website owners and a low response rate hampered our ability to build a comprehensive national collection of Singapore websites. To enable us to scale up our collecting efforts, we updated our legislation to empower NLS to archive websites ending in “.sg” without the need for written permission. The new law came into effect on 31 January 2019.

We conducted our very first domain archiving in 2019. Domain archiving is mostly done in-house and includes pre-archiving checks, archiving, indexing, quality assessment, and providing access. Prior to this, we also had to establish new workflows, automate processes, and enhance our Web Archive Singapore (WAS) portal to cope with the enormous volume of websites we were going to crawl; some of these changes are detailed below.

The Web Archive Singapore portal (https://eresources.nlb.gov.sg/webarchives)

Each year, we receive about 180,000 registered .sg domain names via our Memorandum of Understanding with the Singapore Network Information Centre, the national registry of .sg domain names. To handle the large volume of websites, we run the crawls on Amazon Web Services rather than on our own servers, which we had been using for our thematic crawls. This helped reduce the time taken to archive from more than six months to three months.
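
As a rough illustration only (not our production setup), crawling at this scale generally means splitting the registry’s seed list into batches that parallel crawl jobs can work through. In the sketch below, the seed file name and the submit_crawl_job call are hypothetical.

```python
# A sketch of batching a very large seed list for parallel crawl jobs.
# This is illustrative only; the file name and submit_crawl_job() are
# hypothetical placeholders, not part of any real crawl infrastructure.
from itertools import islice


def batched_seeds(seed_file, batch_size=5000):
    """Yield lists of up to `batch_size` domains from a newline-delimited file."""
    with open(seed_file, encoding="utf-8") as fh:
        seeds = (line.strip() for line in fh if line.strip())
        while True:
            batch = list(islice(seeds, batch_size))
            if not batch:
                break
            yield batch


# Hypothetical usage: one crawl job per batch of .sg domains.
# for i, batch in enumerate(batched_seeds("sg-domains.txt")):
#     submit_crawl_job(f"domain-crawl-{i:03d}", batch)
```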

Another process that we instituted was a two-step quality assessment (QA). Before the legislative changes, our team had done manual QA for all our thematic crawls. However, this became a challenge with the increased volume of websites harvested under domain archiving. To address this, the team developed an automated QA script to help sieve out archived websites that likely do not contain substantial content, such as domain-for-sale pages, blank pages, and under-construction notices. Those that do not pass the script are sent for manual checking. Because manual checking is an equally intensive process, we created a simple interface that displays screenshots of the archived websites, so staff can assess the look and feel of each archived site at a glance and speed up the review. With this in place, we were able to complete the entire QA process within three months.
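
The QA script itself is not reproduced here, but a simplified sketch of the kind of heuristic it applies might look like the following: flag captures whose homepage text is essentially empty or matches common placeholder phrases, and send everything else on to manual review. The phrase list and length threshold are illustrative assumptions, not our actual rules.

```python
# A guess at the kind of heuristic an automated QA script might apply: flag
# captures whose homepage looks like a parked domain, an "under construction"
# notice, or an essentially blank page. Thresholds and phrases are assumptions.
import re

PLACEHOLDER_PHRASES = [
    "domain for sale",
    "this domain is for sale",
    "under construction",
    "coming soon",
    "account suspended",
]
MIN_TEXT_LENGTH = 200   # assumed cut-off for "substantial content"


def looks_insubstantial(html):
    """Return True if a captured homepage probably has no substantial content."""
    text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping
    text = re.sub(r"\s+", " ", text).strip().lower()
    if len(text) < MIN_TEXT_LENGTH:
        return True                               # effectively a blank page
    return any(phrase in text for phrase in PLACEHOLDER_PHRASES)


# Captures flagged here would be set aside; the rest go on to manual QA,
# where staff review screenshots of each archived site.
```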

Screenshot of the web archiving manual QA system.

Our efforts would not be meaningful without providing public access. Prior to the domain archiving in 2019, we paved the way for access on the WAS portal by giving it a major makeover. Key enhancements included Solr full-text search, curation, public nomination of websites, and rights management. 

Within our second year of domain archiving, we discovered that the sheer size of the collection had become a strain on the WAS portal’s Solr indexing. We therefore implemented distributed search with index sharding in 2020. This helped us achieve scalability by enabling the portal to query and fetch results in optimal time. The distributed Solr setup also improved the indexing speed of our collection, which we estimate will grow by about 15% annually due to domain archiving.
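
At its core, distributed search means the portal sends a single query and Solr fans it out to every shard before merging the results. The minimal sketch below uses Solr’s standard shards parameter; the host names, core names, and query are placeholders, not the WAS portal’s actual configuration.

```python
# A minimal sketch of a distributed Solr query using the standard `shards`
# parameter. Host names, core names, and the example query are placeholders,
# not the WAS portal's real configuration.
import requests

SOLR_SELECT = "http://solr-host-1:8983/solr/webarchive/select"   # placeholder
SHARDS = ",".join([
    "solr-host-1:8983/solr/webarchive_shard1",
    "solr-host-2:8983/solr/webarchive_shard2",
    "solr-host-3:8983/solr/webarchive_shard3",
])


def search_archive(query, rows=10):
    """Run a full-text query that Solr distributes across shards and merges."""
    params = {
        "q": query,
        "rows": rows,
        "shards": SHARDS,     # Solr queries each shard and merges the results
        "wt": "json",
    }
    response = requests.get(SOLR_SELECT, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["response"]["docs"]


# docs = search_archive("national day parade")   # illustrative query
```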

These are just some of the major implementations we have made as part of our domain archiving journey. As of 2022, our collection (including thematic crawls) contains over 317,000 archived websites, amounting to more than 200 TB. As we continue to carry out our mandate of archiving Singapore websites, our team is looking into migrating our collection to the government cloud infrastructure, possibly using SolrCloud, as well as providing a web archiving dataset for research. We have also written a short blog post on our observations of the .sg web, using the data we have collected over the past four years. We hope that, in time, this collection will grow into a valuable resource for researchers and Singaporeans.

Shereen Tay is a Librarian with the National Library, Singapore. She is part of the team that oversees the statutory functions of the National Library, in particular web archiving.