A Tale of Two Tools: Archiving the University at Buffalo’s Web Periodicals

This post was written by Grace Trimper, Digital Archives Technician at the University at Buffalo.

University Archives has been collecting alumni magazines, school and departmental newsletters, student newspapers, and literary magazines for decades – likely since the department was established in 1964. For a long time, this meant getting on snail mail lists or walking to departments to pick up the print issue of whichever publication was just released. These days, collecting periodicals also looks like clicking “download” or copying and pasting a URL into a web archiving tool.

Our work with digital periodicals ramped up in 2020, partly because the pandemic pushed more and more publications online. Most of UB’s periodicals are now available both in print and as PDFs, which makes preservation relatively straightforward: we download the PDF, create Dublin Core metadata, and ingest the package into our digital preservation system, Preservica.
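To give a sense of what that metadata step involves, here is a minimal sketch of generating a Dublin Core sidecar for a downloaded issue. The titles, dates, and filenames are invented for illustration, and the exact packaging will depend on how your Preservica ingest is configured.

```python
# Minimal sketch: write a Dublin Core sidecar for a downloaded PDF issue.
# The titles, dates, and filenames below are placeholders, not UB's actual records.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)

def dublin_core_record(fields):
    """Wrap a list of (element, value) pairs in an oai_dc:dc container."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for tag, value in fields:
        ET.SubElement(root, f"{{{DC_NS}}}{tag}").text = value
    return ET.ElementTree(root)

if __name__ == "__main__":
    record = dublin_core_record([
        ("title", "Example Alumni Magazine, Fall 2023"),   # placeholder issue
        ("creator", "University at Buffalo"),
        ("date", "2023"),
        ("type", "Text"),
        ("format", "application/pdf"),
        ("identifier", "example-alumni-magazine-fall-2023.pdf"),
    ])
    record.write("example-alumni-magazine-fall-2023.dc.xml",
                 encoding="utf-8", xml_declaration=True)
```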

We also saw schools, departments, and organizations begin publishing strictly web-based content without print or PDF surrogates. One of the first web periodicals we added to our collection was “The Baldy Center Magazine,” the semesterly publication of the Baldy Center for Law and Social Policy. The Digital Archivist set up a web crawl using Preservica’s Web Crawl and Ingest workflow, which uses the Internet Archive’s Heritrix web crawler to capture websites and their various subpages.

The main benefit of this approach is convenience. We set the workflow to follow a predetermined number of hops and capture to a maximum depth, so we can crawl the magazine and its linked pages without capturing UB’s entire website. Other than checking the resulting web archive file (WARC), there’s not much we need to do once the workflow has run. When the web crawl is complete, the WARC is automatically ingested into the collection’s working folder, and we can link it to the finding aid without any intermediary steps.
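To make the hop and depth idea concrete, here is a conceptual sketch, not Heritrix and not the Preservica workflow, of a crawler that captures a seed page and only follows links so many clicks away and so many path segments deep. The seed URL and both limits are placeholders.

```python
# Conceptual sketch of what hop and depth limits do in a crawl scope.
# "Hops" counts link-follows away from the seed; "depth" here counts URL path
# segments, so the crawl stays contained instead of wandering the whole site.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

MAX_HOPS = 2        # follow links at most two clicks from the seed
MAX_PATH_DEPTH = 4  # skip URLs with more than four path segments

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def in_scope(url):
    path_segments = [p for p in urlparse(url).path.split("/") if p]
    return len(path_segments) <= MAX_PATH_DEPTH

def crawl(seed):
    seen, queue = {seed}, deque([(seed, 0)])
    while queue:
        url, hops = queue.popleft()
        print(f"hop {hops}: {url}")          # a real crawler would capture here
        if hops == MAX_HOPS:
            continue                          # keep the page, follow no further links
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen and in_scope(link):
                seen.add(link)
                queue.append((link, hops + 1))

if __name__ == "__main__":
    crawl("https://example.edu/magazine/")    # placeholder seed URL
```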

However, we did notice that the Heritrix web crawler struggles with some university webpages and does not always capture images and multimedia. The captured magazine looked almost right – it was just missing a few of the pictures. We learned that the renderer the system uses has trouble intercepting the URLs for some of the images throughout the website. This is a known limitation, and UB’s website can be built in ways more complex than the system can handle.
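One way to spot-check a capture for that kind of gap is to list what actually made it into the WARC. Below is a rough sketch using the open-source warcio library; the filename is a placeholder, and finding an image response in the file only tells you it was captured, not that the renderer will display it.

```python
# Sketch: list the image responses recorded in a WARC, using warcio.
# "magazine-crawl.warc.gz" is a placeholder filename.
from warcio.archiveiterator import ArchiveIterator

def image_records(warc_path):
    """Yield (URL, content type) for every HTTP response that looks like an image."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if content_type.startswith("image/"):
                yield record.rec_headers.get_header("WARC-Target-URI"), content_type

if __name__ == "__main__":
    for url, ctype in image_records("magazine-crawl.warc.gz"):
        print(ctype, url)
```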

We ran into similar obstacles when running weekly crawls on the website for UB’s student newspaper, The Spectrum. As an independent newspaper, its website is not hosted by the university. At first, this had some benefits: the Heritrix crawler didn’t have the image problems we saw with UB-hosted sites, and the rendered WARCs looked basically identical to the live site.

Then The Spectrum redesigned its website, and our captures haven’t looked good since. Certain fonts are wrong, and the navigation menu expands to cover 90% of the page any time you try to click on or read an article. It seems to be another issue with the renderer rather than with the capture itself, so we continue to crawl the site as usual to make sure we don’t miss out on preserving new content.

Even though the Heritrix crawler worked well for some of our collections, it was costing us time and energy that could be spent on other projects. We needed another option. Enter Conifer.

Conifer is a free and open-source web archiving tool with a web interface. We have had better luck capturing images and multimedia with Conifer and Webrecorder. Like The Spectrum, NAME, UB’s undergraduate literary magazine, is hosted independently of UB’s website. Its construction is relatively straightforward: there’s a webpage for each issue with links to contributions, and the website is full of images. There are also a couple of technologically unique works on the site, including a link to a JavaScript multimedia piece.

When I crawled the site for preservation in UB’s Poetry Collection, Conifer had no problem capturing any of these, and the resulting WARCs display perfectly in our public user interface.

This approach isn’t without its drawbacks. Where the Web Crawl and Ingest workflow in Preservica is convenient and automatic, using Conifer can be tedious. First, there is no setting and forgetting: if you want to capture a complex website with various links and subpages, you must start the application and then manually open every page you want to capture. If you have too many tabs open, the application can randomly stop in the middle of a crawl, leaving you to start all over again. On top of that, we have the extra steps of downloading, unzipping, and ingesting the WARC, plus manually copying and pasting the URL into the asset’s metadata before the captured page will display in the digital preservation system.
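Some of that last mile can probably be scripted. As a sketch only (assuming the Conifer download is a gzipped WARC, with placeholder filenames), the unzipping step and the URL we paste into the asset’s metadata might be handled like this:

```python
# Sketch: decompress a downloaded Conifer capture and pull out a target URL
# to paste into the asset's metadata. Filenames are placeholders, and the
# first response record is only usually, not always, the page you started from.
import gzip
import shutil
from warcio.archiveiterator import ArchiveIterator

def gunzip_warc(src="conifer-download.warc.gz", dest="conifer-download.warc"):
    """Write an uncompressed copy of the WARC for ingest."""
    with gzip.open(src, "rb") as zipped, open(dest, "wb") as unzipped:
        shutil.copyfileobj(zipped, unzipped)
    return dest

def first_response_url(warc_path):
    """Return the target URI of the first HTTP response in the WARC."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                return record.rec_headers.get_header("WARC-Target-URI")
    return None

if __name__ == "__main__":
    warc = gunzip_warc()
    print("Ingest:", warc)
    print("Paste into metadata:", first_response_url(warc))
```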

No approach has been perfect thus far, and I don’t expect one will be for a while. Web archiving technology is constantly growing and improving, and how we attack web archiving depends heavily on the material. But with the tools we have available to us, we can preserve important pieces of UB’s history that we couldn’t have preserved before. And that’s kind of the point, isn’t it?
