A Tale of Two Tools: Archiving the University at Buffalo’s Web Periodicals

This post was written by Grace Trimper, Digital Archives Technician at University at Buffalo.

University Archives has been collecting alumni magazines, school and departmental newsletters, student newspapers, and literary magazines for decades – likely since the department was established in 1964. For a long time, this meant getting on snail mail lists or walking to departments to pick up the print issue of whichever publication was just released. These days, collecting periodicals also looks like clicking “download” or copying and pasting a URL into a web archiving tool.

Our work with digital periodicals ramped up in 2020, partially because the pandemic caused more and more publications to be delivered online. Most of UB’s periodicals are now available both in print and as PDFs, which makes preservation relatively straightforward: we download the PDF, create Dublin Core metadata, and ingest the package into our digital preservation system, Preservica.

We also saw schools, departments, and organizations begin publishing strictly web-based content without print or PDF surrogates. One of the first web periodicals we added to our collection was “The Baldy Center Magazine,” the semesterly publication of the Baldy Center for Law and Social Policy. The Digital Archivist set up a web crawl using Preservica’s Web Crawl and Ingest workflow, which uses Internet Archive’s Heritrix web crawler to capture websites and their various subpages.

The main benefit of this approach is convenience. We set the workflow to follow a predetermined number of hops and capture to a maximum depth, so we can crawl the magazine and its linked pages without capturing UB’s entire website. Other than checking the resulting web archive file (WARC), there’s not much else we need to do once the workflow runs. When the web crawl is complete, the WARC is automatically ingested into the collection’s working folder, and we can link it to the finding aid without any other intermediary steps.

However, we did notice that the Heritrix web crawler struggles with some university webpages and does not always capture images and multimedia. The captured magazine looked almost right – it was just missing a few of the pictures. We learned that the renderer the system uses has trouble intercepting the URLs for some of the images across the site. This is a known limitation: UB’s website is sometimes built in ways more complex than the system can handle.
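When a capture looks almost right, a quick pass over the WARC itself can confirm exactly which images made it into the crawl. Below is a minimal sketch of that kind of spot-check, assuming the open-source warcio Python library; the file name is a placeholder rather than one of our actual crawls.

    from warcio.archiveiterator import ArchiveIterator

    # Placeholder path: point this at a WARC downloaded from the crawl being checked.
    WARC_PATH = "baldy-center-magazine.warc.gz"

    captured_images = set()
    with open(WARC_PATH, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Only HTTP responses matter here, not request or metadata records.
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if content_type.startswith("image/"):
                captured_images.add(record.rec_headers.get_header("WARC-Target-URI"))

    print(f"{len(captured_images)} distinct image responses captured")
    for uri in sorted(captured_images):
        print(uri)

Comparing that list against the images on the live page makes it easier to tell whether a missing picture was never captured or is simply failing to render.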

We ran into similar obstacles when running weekly crawls on the website for UB’s student newspaper, The Spectrum. Because The Spectrum is an independent newspaper, its website is not hosted by the university. At first, this had some benefits: the Heritrix crawler didn’t have the image problems we saw with UB-hosted sites, and the rendered WARCs looked basically identical to the live site.

Then the paper redesigned its website, and our captures haven’t looked good since. Certain fonts are wrong, and the navigation menu expands to cover 90% of the page any time you try to click on or read an article. It seems to be another issue with the renderer, so we continue to crawl the site as usual to make sure we don’t miss out on preserving new content.

Even though the Heritrix crawler worked well for some of our collections, it was costing us time and energy that could be spent on other projects. We needed another option. Enter Conifer.

Conifer is a free and open-source web archiving tool with a web interface, and we have had better luck capturing images and multimedia with Conifer and Webrecorder. Like The Spectrum, UB’s undergraduate literary magazine, NAME, is hosted independently of UB’s website. Its construction is relatively straightforward: there’s a webpage for each issue with links to contributions, and the website is full of images. There are also a couple of technologically unique works on the site, including a link to a JavaScript multimedia piece.

When I crawled the site for preservation in UB’s Poetry Collection, Conifer had no problem capturing any of these, and the resulting WARCs display perfectly in our public user interface.

This approach doesn’t come without its drawbacks. Where the Web Crawl and Ingest workflow in Preservica is convenient and automatic, using Conifer can be tedious. First, there is no setting and forgetting; if you want to capture a complex website with various links and subpages, you must start the application and then open each link to every page you want to capture. If you have too many tabs open, the application can randomly stop in the middle of a crawl, leaving you to start all over again. On top of that, we have the extra steps of downloading, unzipping, and ingesting the WARC, plus manually copying and pasting the URL into the asset’s metadata before the captured page will display in the digital preservation system.
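Some of those post-capture steps lend themselves to a small script. Here is a minimal sketch that relies on the same warcio library as above; the file names, and the assumption that the download arrives as a zip containing one or more WARCs, are illustrative rather than Conifer’s documented export format. It unzips the download and prints the captured page URLs, which are the values we would otherwise copy into the asset’s metadata by hand.

    import zipfile
    from pathlib import Path

    from warcio.archiveiterator import ArchiveIterator

    # Placeholder names; the real download and its layout may differ.
    EXPORT_ZIP = Path("conifer-export.zip")
    EXTRACT_DIR = Path("extracted")

    # Unzip the export and look for WARC files inside it.
    with zipfile.ZipFile(EXPORT_ZIP) as zf:
        zf.extractall(EXTRACT_DIR)

    for warc_path in sorted(EXTRACT_DIR.rglob("*.warc*")):
        print(f"== {warc_path.name} ==")
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                content_type = record.http_headers.get_header("Content-Type") or ""
                if content_type.startswith("text/html"):
                    # These page URLs are what gets pasted into the asset metadata.
                    print(record.rec_headers.get_header("WARC-Target-URI"))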

No approach has been perfect thus far, and I don’t expect it will be for a while. Web archiving technology is constantly growing and improving, and how we attack web archiving depends heavily on the material. But with the tools we have available to us, we can preserve important pieces of UB’s history we wouldn’t have been able to before. And that’s kind of the point, isn’t it?

When the Past Becomes Present: Reparative Description in NYU’s Web Collections

This post was written by Lizzy Zarate, Web Archives Student Assistant for NYU Archival Collections Management and Student Member of the Web Archiving Steering Committee. She is currently completing an MA in Archives & Public History at NYU.

Among the technical elements involved in web archiving, it’s easy to neglect the importance of description. It doesn’t take long to notice that an archived website is not playing videos or that the images are missing. It is harder to discern what is missing in a website’s description. Unlike a physical document, most websites are constantly changing. As such, it can be difficult to write descriptions that persist over time. Many of the websites in NYU’s collections were first captured in the 2010s; naturally, society has changed, and descriptive language should evolve with it. This consideration led me to wonder: how can we engage in reparative description work for NYU’s web collections?

In February of 2022, with the help of Web Archivist Nicole Greenhouse, I began researching best practices for inclusive and anti-oppressive description. While the resources I discovered were extremely helpful, I wasn’t able to find much guidance specifically geared towards web archives. Granted, many of the practices from traditional archival description can and should be applied, but there is still the problem of describing materials that can rapidly and drastically change at any time. With this consideration in mind, I utilized resources such as the Digital Transgender Archive Style Guide and Anti-Racist Description Resources to inform my work.[1] As I began to comb through the web collections, it became clear that much of the reparative description would focus on revising languages of exclusion.

Here’s one example. The Communications Workers of America website has been crawled 324 times since 2007. This is how the website was originally described:

“CWA, America’s largest communications and media union, represents over 700,000 men and women in both private and public sectors.”

The use of “men and women” implicitly erases nonbinary people and other individuals who don’t identify with these categories. To figure out a course of action, I started by visiting the most recent version of the archived website to read their current organizational biography, which referred to members as workers rather than in terms of their gender. Next, I used the history of the website itself to verify that the language I was using was faithful to the organization’s history. In the earliest crawls, the website had also used “men and women” to refer to its members. Using the archive, I was able to determine that this was changed in 2015.
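For a site with hundreds of captures, that kind of check can also be scripted against a Wayback-style CDX index instead of paging through captures by hand. The sketch below is illustrative only: it assumes the Internet Archive’s public CDX API and the Python requests library, the seed URL and phrase are examples, and the relevant wording may sit on a subpage rather than the homepage; a different collection would need its own index and replay endpoints.

    import requests

    # Illustrative values: a CWA page and the phrase whose disappearance we want to date.
    SEED = "cwa-union.org"
    PHRASE = "men and women"

    # Ask the CDX index for roughly one successful capture per year.
    cdx = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": SEED,
            "output": "json",
            "fl": "timestamp,original",
            "filter": "statuscode:200",
            "collapse": "timestamp:4",  # group results by the year portion of the timestamp
        },
        timeout=30,
    )
    rows = cdx.json()[1:]  # the first row of the JSON response is a field-name header

    for timestamp, original in rows:
        capture = requests.get(f"https://web.archive.org/web/{timestamp}/{original}", timeout=30)
        found = PHRASE in capture.text.lower()
        print(f"{timestamp[:4]}: phrase {'present' if found else 'absent'}")

Printed year by year, the presence or absence of the phrase narrows down when the organization’s own language changed.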

Screenshot of CWA's "Profile & History" webpage captured in 2008.

CWA’s webpage in 2008 refers to its members as “men and women”

Screenshot of CWA's "About CWA" webpage captured in 2016.

CWA’s webpage in 2016 refers to its members as “workers”

Because of this change, I felt it was appropriate to broaden the language used in this description. Here is how I revised it:

“CWA, America’s largest communications and media union, represents over 700,000 workers in both private and public sectors.”

This is a small shift in wording, yet it has larger implications for the archive: our descriptive practices should not default to language such as “men and women” when we’re really just talking about a group of people, gender identity irrelevant. The value of archiving CWA’s website is to document the history of labor organizations. In this case, the language that was initially used actually ends up being a distraction from the primary function of the description. Much has been written in the field about archival silences. For web collections, this is present not only in whose websites we choose to collect, but in how we represent them.

Many of the descriptions I flagged belonged to entries in the Student Organizations collection. It appeared that most of the descriptions in this collection were reproduced from the organization’s own pages at the time of their first capture, which raised a few questions about gender-inclusive description. If a club for women referred to itself as an “all-female” group in 2013, what obligation did we have to preserve such language in 2023, if at all? Given that student members had written their own descriptions, what authority did I have to define the stance of their organization? After all, these descriptions were written by students who may have changed their views since then, but I am also a student. What if the work I was doing ended up later being seen as inadequate, the same way I was labeling theirs? I wasn’t entirely sure how to proceed.

In most cases, I tried to look up the club’s current page on the live web. Many had updated their information with trans-inclusive and gender-equitable language, so I could revise the description without qualms. Still, a few websites remained that had either gone inactive or still retained this language. For these instances, I decided to keep the language as it was while adding quotation marks as needed. As written in Archival Collections Management’s Statement on Harmful Language, “While we have control over description of our collections, we cannot alter the content.”[2] Making these changes avoided misrepresenting the position of the organization, but clarified that the language used did not necessarily align with the stance of our department.

Working through this issue forced me to confront the idea that I had less power over the archive as a student worker. The choices that I made would directly impact how users interact with NYU’s web collections. They would also indirectly reflect ACM’s position on these topics. Consequently, I had to take responsibility for the choices I made in reparative description. I did so with the understanding that all description is iterative and no language could ever perfectly represent all the voices of one community.

Reparative description is often discussed in the classroom, but engaging in it practically as a student worker in web archives has helped clarify my own personal ethos as an archivist. The work is ongoing with no clear endpoint, but it is important to make the time and space for it within our daily work. As Dorothy Berry writes in “The House Archives Built”, “Our descriptive systems are often the first interaction patrons have with our institutions, and when the language and systems feel alienating, patrons will take what they need and leave the rest.”[3] By repairing harmful descriptions where we see them, we can remove an unnecessary barrier to access for users of web archives.

References:

  1. “Anti-Racist Description Resources,” Archives for Black Lives in Philadelphia, Oct 2019. Accessed Dec 2023. https://archivesforblacklives.files.wordpress.com/2019/10/ardr_final.pdf; “DTA Style Guide,” Cailin Roles and Eamon Schlotterback, Fall 2020. Accessed 1 Dec 2023. https://docs.google.com/document/d/1qou1h4DLFQEZg4BIvXiEpGy_TI3rDnrJsPXCsRL-Ki8/edit.
  2. “Inclusive and Reparative Work,” Archival Collections Management, NYU Libraries, updated 4 Dec 2023. Accessed 4 Dec 2023. https://guides.nyu.edu/archival-collections-management/inclusive.
  3. “The House Archives Built,” Dorothy Berry, 22 June 2021. Accessed 1 Dec 2023. https://www.uproot.space/features/the-house-archives-built.

Reviving Union College’s Web Archiving Program

This post was written by Corinne Chatnik, Web Archiving Steering Committee Chair and Digital Collections and Preservation Librarian at Union College.

In 2023, the web archiving program at Union College was in a unique situation. The program was started in Archive-It by a former librarian several years ago; after they left, crawls and collections were not maintained, and some crawls continued to run for several years. I’ve only been in my current position for a little over a year, and during that time I analyzed our institution’s previous web archiving efforts and realized we needed to reevaluate our workflow and develop policies and procedures.

Archive-It’s software and service are amazing, but an instance needs to be maintained and reevaluated every so often. Ours had been capturing anything linked from the main Union College website, student blogs, social media for every organization on campus, YouTube videos, and Flickr accounts, unchecked, for years. All that data amounted to 7 TB, which is an incredibly large and expensive amount of data to store.

Essentially, we decided to continue to use Archive-It but start everything from scratch. To give me a hand with this project, I hired an intern named Grant to work 20 hours a week over the summer. Grant started by analyzing the content already in our Archive-It instance and breaking it into categories like college records, social media, student work, and news stories. Then we took those categories of records to the College Records Manager and Archivist and discussed what fell within our Archives and Special Collections collecting scope.

We went to Special Collections and Archives because we didn’t want web archiving to be seen as an intimidating, technical process; archived websites are simply another type of record within our collection, one that happens to be collected and harvested in a unique way. With the Archivist, we decided that records such as student work uploaded to class sites and unofficial social media should not be captured, because it isn’t ethical to archive student work without permission. For image repositories like Flickr and official videos on YouTube, we would try to get the originals from the source instead of crawling those sites, and preserve them like other born-digital archival material.

In terms of the practical use of the Archive-It software, the documentation is robust, so we ran a lot of test crawls trying out the different methods of crawling and, through trial and error, performed quality assurance checks to make sure all the content looked right. Outside of the technical aspects of web archiving, we also wanted to justify the program by creating a policy that addresses the goals of the web archiving program, its stakeholders, selection and appraisal, access, and takedown.

To help us make these decisions, we first researched other colleges’ and universities’ web archiving policies and identified aspects of each that we thought could apply to us. I also found this great worksheet from the Digital Preservation Outreach and Education Network, which helps define the scope and collecting statement of a web archiving program.

The shortened version of our policy is as follows:

  • Our goal is to proactively archive website content as a historical and cultural record for future historians and researchers.
  • Our stakeholders include the faculty, students, alumni, and staff of Union College.
  • The target audience for the archived material is members of the community with scholarly and intellectual interests connected to Union College.
  • Our selection policy is identifying official websites and digital records related to Union College that are of permanent value or no longer actively used.
  • Our access policy is that these materials will be made publicly accessible via Archive-It, a shared, “end-to-end” web archiving service administered by the Internet Archive and used by a community of partner organizations.
  • Our takedown policy is that Schaffer Library will respect the creator’s intellectual property rights by providing “opt-out” recourse after acquisition.

After we were happy with all of that, we crawled the sites within our scope for permanent retention and worked on a finding aid for the Union College Web Site Archive collection. You can also view Union College’s web archive collection.

Finally, a quote Grant gave me about his internship experience is the following: “My work on web archiving has not only allowed me to familiarize myself with Archive-It software and preserve essential material about Union College on the web, but also the methodology behind what to include or exclude from a web archive due to obstacles unique to the medium of the internet and the potential legal consequences of certain errors or oversights.”

Web Archiving Roundup: July 2, 2018

Here’s your Web Archiving Roundup for July 2, 2018:

  • In “Web Archives for the Analog Archivist,” Aleksandr Gelfand writes: ‘By incorporating the use of Internet Archive’s Wayback Machine into their workflows, archivists working primarily with analog records may enhance their ability in such tasks as the construction of a processing plan, the creation of more accurate historical descriptions for finding aids, and potentially may be able to provide better reference services to their patrons.’
  • The British Library looks at its preservation plans for emerging formats, including archived websites.
  • In Wales, MirrorWeb will work with the Welsh Government ‘to digitally archive the Welsh nation’s online presence.’
  • In the next few weeks, Cobweb will begin conducting user testing of its functional prototype to learn from potential users about what they’d consider most useful in a collaborative collection development platform.
  • A look at Memento Tracer: High Fidelity Web Archiving at Scale.
