Personal Digital (Web) Archiving: Guest Post by Nicholas Taylor

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Physical keepsakes like photos, letters, and vital documents thankfully have a long shelf life with little affirmative effort. The digital artifacts that increasingly replace them are far more susceptible to loss or corruption, which has raised awareness of the heightened need to curate, manage, and protect these assets. Interest in “personal digital archiving” has grown significantly within the last few years, as demonstrated by three annual conferences on the topic, practical guidance developed by the National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, and attention from other educational and cultural heritage institutions.

exhibits / 2012 National Book Festival, by wlef70

The recent focus on personal digital archiving as such only belatedly acknowledges the proliferation of personal digital assets worth archiving. The Web adds another layer of complexity: files that were previously stored locally are now scattered across the servers of many different service providers, each with varying degrees of commitment to data longevity, to say nothing of their own long-term viability. Because social media and personal websites are natively web-based content, they don’t decompose as intuitively into discrete files, nor with the same degree of fidelity to the original experience. As more of our data is hosted remotely, we need new approaches to maintain our personal digital archives.

NDIIPP offers some great tool- and platform-agnostic advice for preserving personal websites, blogs, and social media, focusing on identifying potential web content to preserve, deciding what’s actually important to preserve, copying the content, organizing it, and creating and managing distributed backups. I wanted to expand on the “copying” step, with attention to specific tools and platforms. There are a small but growing number of tools that are well-suited to copying simple websites and blogs, and popular platforms are increasingly providing mechanisms for exporting data in self-contained and structured formats.

Tools

The Web Archiving Integration Layer (WAIL) is a user-friendly interface in front of the tools used by many cultural heritage web archiving organizations: the Heritrix archival crawler and the Wayback web archive replay platform. WAIL supports one-click capture of an entire website and replay in a local instance of Wayback. Data is captured to the WARC format, which has the advantage of being the ISO-standard web archiving preservation format of choice and of allowing a more faithful representation of the original website via the bundled Wayback. The downside is that WARC is a relatively opaque format to all but a few specialized applications. Given that WAIL has only one maintainer, in a personal archiving context it might make sense to copy web content into more readily legible formats in addition to WARC.

Wget is a mature, multi-platform command-line utility for which a number of GUI wrappers are available. Wget is highly configurable but can be used to copy a website with only minimal parameters specified. Copied content is stored by default in a local folder hierarchy mirroring the website structure. Wget 1.13+ additionally supports storing copied content in the WARC format; the WARCs are created in parallel with the copying of the website files into the folder hierarchy. This dual-format capture provides both easy, relatively future-safe browsing and a suitable preservation format. The downsides are that Wget generally requires comfort with the command line (there are GUI wrappers, but I’ve yet to find one that supports the WARC parameters) and that there’s no easy way to replay or access the contents of the created WARC files.
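For those comfortable at a shell prompt, a minimal invocation along the following lines mirrors a site into a local folder hierarchy while also writing a WARC file. Treat this as a sketch only: the URL and the “mysite” WARC prefix are placeholders, and exact behavior may vary across Wget versions.

    # Mirror the site into a local folder hierarchy and, in parallel,
    # write a WARC file (plus a CDX index). URL and prefix are placeholders.
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --warc-file=mysite --warc-cdx \
         http://www.example.com/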

stranger 7/100 abdul hoque, by Hasin Hayder

HTTrack is a multi-platform command-line and GUI tool built specifically for mirroring websites. Because of HTTrack’s narrower purpose, the documentation and official forum are likely to be more relevant to a personal digital archivist looking to conduct small-scale web archiving. As with Wget, copied content is stored in a local folder hierarchy mirroring the website structure, making it easy to browse. The command-line version of the tool allows for automation and flexibility, while the GUI version is more user-friendly. The main downside is that HTTrack doesn’t produce WARC files; if desktop tools for handling WARCs later become available, that would likely be the preferable format to have archived web content in.
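As a rough sketch, a single-site mirror from the command-line version might look like the following; the URL, output directory, and domain filter are placeholders to adapt to your own site.

    # Mirror www.example.com into ./example-mirror, following links
    # only within that domain, with verbose output
    httrack "http://www.example.com/" -O "./example-mirror" "+*.example.com/*" -v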

Warrick is a *nix command-line utility for re-web archiving—creating a local copy of a website based on content in a web archive (e.g., the Internet Archive Wayback Machine). Built-in Memento support might eventually allow it to reconstitute a website from content hosted in multiple web archives. Like HTTrack and Wget, copied content is stored in a local folder hierarchy mirroring the website structure. Unlike those tools, it’s only designed to retrieve content from web archives.

Platforms

Facebook provides a mechanism for “Downloading Your Info” which creates a zip file containing static web pages that cover your profile, contact information, Wall, photos (only those that you’ve posted yourself, apparently), friends, messages, pokes, events, settings, and ads selectors. While comprehensive and self-contained, there is no option to retrieve the data in more structured formats, like vCard for contacts, iCal for events, or mbox for messages. Facebook is an especially poor place from which to recover photos, as all embedded metadata (including EXIF, IPTC IIM, and IPTC XMP headers) is stripped out and images larger than 100 KB are compressed on upload.

Google Takeout allows for exporting data from an expanding number of Google services, including Mail, Calendar, Contacts, Drive, and Google Plus, in standard formats. This means that the exported data is good for both long-term preservation and continued access using other applications that support those standards. Google Plus supports full-size photo uploads and, therefore, downloads (limited only by Google Drive storage quota) and doesn’t destroy embedded metadata in uploaded photos.

Twitter provides a mechanism for downloading “Your Twitter Archive” which creates a zip file containing a standalone JavaScript-enabled application that evokes the experience of using the Twitter web service in the browser. At first glance, this resembles the format of Facebook’s data export, but a key differentiator is that the Twitter data export includes each of the individual tweets in JSON format and provides the standalone application as a convenience for browsing them. Since the exported data is separate from the presentation, it’s much easier to re-purpose or manipulate it with other tools.
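As a hedged illustration of that kind of re-purposing: at the time of writing, the export stores each month’s tweets in a file along the lines of data/js/tweets/2014_01.js, which wraps a JSON array in a JavaScript variable assignment (the exact file layout and variable name are assumptions here). Stripping that assignment leaves plain JSON that any JSON-aware tool can work with.

    # Strip the leading "Grailbird.data.tweets_2014_01 = " assignment,
    # leaving a plain JSON array of tweets
    sed '1s/^[^=]*=[[:space:]]*//' data/js/tweets/2014_01.js > 2014_01.json
    # Any JSON-aware tool can then process it, e.g. pretty-print with Python
    python -m json.tool 2014_01.json | head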

Mainstream content management systems may have extensions that support exporting data in structured formats, though I’m not familiar with any specifically. Explore WordPress plugins or Drupal modules. “Backup” is probably the term to search for there, as “archive” typically has the connotation of previously-published content that remains accessible through the live website.

Weekly web archiving roundup: April 16, 2014

Here’s the weekly web archiving roundup for April 16, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

Weekly web archiving roundup: April 9, 2014

Another quick update for the weekly web archiving roundup, April 9, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Vanderbilt-led effort to build new digital environments receives Mellon grant,” Ann Marie Deer Owens: The Andrew W. Mellon Foundation has awarded Vanderbilt University a grant to support the Committee on Coherence at Scale for Higher Education. http://news.vanderbilt.edu/2014/04/mellon-library-grant/

Weekly web archiving roundup: April 2, 2014

Just a short update for the weekly web archiving roundup for April 2, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!


Weekly web archiving roundup: March 26, 2014

Just a short update for the weekly web archiving roundup for March 26, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Preserving Web-based Auction Catalogs at the Frick Art Reference Library,” Gretchen Nadasky: The Frick Art Reference Library is working on a project to preserve their online auction catalogs. This paper describes the second phase, the goal of which is to determine preservation priorities and identify material which has already been captured. http://www.dlib.org/dlib/march14/nadasky/03nadasky.html

Weekly web archiving roundup: March 19, 2014

Here’s the weekly web archiving roundup for March 19, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

Weekly web archiving roundup: March 12, 2014

Here’s the weekly web archiving roundup for March 12, 2014–these news items go up each Wednesday, and will be pertinent to anyone with an interest in web archiving.  Although news items are compiled by the Web Archiving Roundtable steering committee and Best Practices/Toolbox committee members, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Keep Everything 1.0 – Light and Fast Web Archiver for iOS,” by Hyoseo Park: Groosoft has released an iPhone app which allows users to archive any web content from a browser on their phone. http://prmac.com/release-id-65946.htm

Weekly web archiving roundup: March 5, 2014

Just one item for the weekly web archiving roundup for March 5, 2014–these news items go up each Wednesday, and will be pertinent to anyone with an interest in web archiving.

Weekly web archiving roundup: February 26, 2014

Here’s the weekly web archiving roundup for February 26, 2014–these news items go up each Wednesday, and will be pertinent to anyone with an interest in web archiving.  Although news items are compiled by the Web Archiving Roundtable steering committee and Best Practices/Toolbox committee members, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Tessella reports: Preservica simplifies active preservation for DSpace repositories,” Christina Tealdi: Tessella has updated the workflow for its Preservica service to make it easier to ingest digital content from DSpace. http://www.prweb.com/releases/2014/02/prweb11614657.htm
  • “More Podcast, Less Process Episode 7: humans.txt.mp3 — The Web Archivists Are Present,” with Jefferson Bailey and Joshua Ranger: A discussion of the complexities of web archiving with guests Alex Thurman and Lily Pregill. http://keepingcollections.org/more-podcast-less-process/