“‘Distant Reading’ and Web Archiving,” by Andrea Fox. It is possible to perform computational analysis of digitized text and find patterns that individual readers cannot detect on their own. Those techniques could be applied to web archive collections.
The recent focus on personal digital archiving only belatedly acknowledges the proliferation of personal digital assets worth archiving. The Web adds another layer of complexity: files that were previously stored locally are now scattered across the servers of many different service providers, each with varying degrees of commitment to data longevity, to say nothing of their own long-term viability. As natively web-based content, social media and personal websites don’t decompose as intuitively into discrete files, nor with the same degree of fidelity to the original experience. As more of our data is hosted remotely, we need new approaches to maintaining our personal digital archives.
NDIIPP offers some great tool- and platform-agnostic advice for preserving personal websites, blogs, and social media, focusing on identifying potential web content to preserve, deciding what’s actually important to preserve, copying the content, organizing it, and creating and managing distributed backups. I wanted to expand on the “copying” step, with attention to specific tools and platforms. A small but growing number of tools are well-suited to copying simple websites and blogs, and popular platforms are increasingly providing mechanisms for exporting data in self-contained and structured formats.
Wget is a mature, multi-platform command-line utility for which a number of GUI wrappers are available. Wget is highly configurable but can be used to copy a website with only minimal parameters specified. Copied content is stored by default in a local folder hierarchy mirroring the website structure. Wget 1.13 and later additionally supports storing copied content in the WARC format; the WARCs are created in parallel with the copying of the website files into the folder hierarchy. This dual-format capture provides both easy, relatively future-safe browsing and a suitable preservation format. The downsides are that Wget generally requires comfort with the command line (there are GUI wrappers, but I’ve yet to find one that supports the WARC parameters) and that there’s no easy way to replay or access the contents of the created WARC files.
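For those comfortable scripting the process, here is a minimal sketch, driven from Python purely for illustration, of the kind of Wget invocation that produces both the browsable folder copy and a WARC. The URL and WARC name are placeholders; the options shown are standard in Wget 1.13+.

```python
import subprocess

site = "http://example.com/"      # site to copy (placeholder)
warc_name = "example-site"        # produces example-site.warc.gz alongside the folder mirror

subprocess.run([
    "wget",
    "--mirror",                    # recursive copy that preserves timestamps
    "--page-requisites",           # also fetch the images, CSS, and JS needed to render pages
    "--convert-links",             # rewrite links so the folder copy can be browsed locally
    f"--warc-file={warc_name}",    # record the same crawl into a gzipped WARC
    site,
], check=True)
```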
HTTrack is a multi-platform command-line and GUI tool built specifically for mirroring websites. Because of HTTrack’s narrower purpose, the documentation and official forum are likely to be more relevant to a personal digital archivist looking to conduct small-scale web archiving. As with Wget, copied content is stored in a local folder hierarchy mirroring the website structure, making it easy to browse. The command-line version of the tool allows for automation and flexibility, while the GUI version is more user-friendly. The main downside is that HTTrack doesn’t produce WARC files; if desktop tools for handling WARC files later become available, that would likely be the preferable format to have archived web content in.
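A similar sketch for the command-line version of HTTrack, again scripted from Python for illustration. The URL, output folder, and scan filter are placeholders, and the options worth using will vary from site to site.

```python
import subprocess

subprocess.run([
    "httrack",
    "http://example.com/",         # site to mirror (placeholder)
    "-O", "mirrors/example.com",   # output path for the mirrored folder hierarchy
    "+*.example.com/*",            # scan filter: stay within the site's own domain
    "-v",                          # verbose progress output
], check=True)
```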
Warrick is a *nix command-line utility for the reverse of web archiving: creating a local copy of a website based on content already held in a web archive (e.g., the Internet Archive Wayback Machine). Built-in Memento support might eventually allow it to reconstitute a website from content hosted in multiple web archives. As with HTTrack and Wget, copied content is stored in a local folder hierarchy mirroring the website structure. Unlike those tools, it’s only designed to retrieve content from web archives.
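This isn’t Warrick’s own code, but the sketch below illustrates the kind of lookup such a tool automates: asking the Internet Archive Wayback Machine’s availability API whether it holds a capture of a given page. The page URL is a placeholder.

```python
import json
import urllib.parse
import urllib.request

page = "http://example.com/about.html"   # lost page to look for (placeholder)
query = urllib.parse.urlencode({"url": page})

with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest and closest.get("available"):
    print("Archived copy:", closest["url"])   # a web.archive.org URL that could then be fetched
else:
    print("No capture found in the Wayback Machine")
```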
Facebook provides a mechanism for “Downloading Your Info,” which creates a zip file containing static web pages covering your profile, contact information, Wall, photos (apparently only those that you’ve posted yourself), friends, messages, pokes, events, settings, and ad selectors. While the download is comprehensive and self-contained, there is no option to retrieve the data in more structured formats, such as vCard for contacts, iCal for events, or mbox for messages. Facebook is an especially poor place from which to recover photos, as all embedded metadata (including EXIF, IPTC IIM, and IPTC XMP headers) is stripped out and images larger than 100 KB are compressed on upload.
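If you want to verify the metadata stripping against your own photos, a quick check along these lines works. It assumes a recent version of the Pillow imaging library (which provides Image.getexif), and the file paths are placeholders.

```python
from PIL import Image

def exif_tag_count(path):
    """Return the number of EXIF tags embedded in the image at `path`."""
    with Image.open(path) as img:
        return len(img.getexif())

print("original photo:", exif_tag_count("photos/original.jpg"))                 # placeholder path
print("facebook copy:", exif_tag_count("facebook-export/photos/photo.jpg"))     # typically 0
```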
Google Takeout allows for exporting data from an expanding number of Google services, including Mail, Calendar, Contacts, Drive, and Google Plus, in standard formats. This means that the exported data is good for both long-term preservation and continued access using other applications that support those standards. Google Plus supports full-size photo uploads and, therefore, downloads (limited only by Google Drive storage quota) and doesn’t destroy embedded metadata in uploaded photos.
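As a small illustration of why standard export formats matter: a Gmail export from Takeout is an ordinary mbox file that can be read with nothing more than Python’s standard library. The path below is a placeholder for wherever the downloaded archive was unpacked.

```python
import mailbox

# Placeholder path to the mbox file inside the unpacked Takeout archive.
messages = mailbox.mbox("Takeout/Mail/archive.mbox")
for message in messages:
    print(message["Date"], "|", message["From"], "|", message["Subject"])
```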
Mainstream content management systems may have extensions that support exporting data in structured formats, though I’m not familiar with any specifically. Explore WordPress plugins or Drupal Modules. “Backup” is probably the term to search for there, as “archive” typically has the connotation of previously-published content that remains accessible through the live website.
Another quick update for the weekly web archiving roundup, April 9, 2014. And remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!
“Vanderbilt-led effort to build new digital environments receives Mellon grant,” Ann Marie Deer Owens. The Andrew W. Mellon Foundation has awarded Vanderbilt University a grant to support the Committee on Coherence at Scale for Higher Education. http://news.vanderbilt.edu/2014/04/mellon-library-grant/