Personal Digital (Web) Archiving: Guest Post by Nicholas Taylor

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Physical keepsakes like photos, letters, and vital documents thankfully have a long shelf life with little affirmative effort. The digital artifacts that increasingly replace them are far more susceptible to loss or corruption, which has raised awareness of the heightened need to curate, manage, and protect these assets. Interest in “personal digital archiving” has grown significantly within the last few years, as demonstrated by three annual conferences on the topic, practical guidance developed by the National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, and attention from other educational and cultural heritage institutions.

exhibits / 2012 National Book Festival, by wlef70

The recent focus on personal digital archiving as such only belatedly acknowledges the proliferation of personal digital assets worth archiving. The Web adds another layer of complexity: files that were previously stored locally are now scattered across the servers of many different service providers, each with varying degrees of commitment to data longevity, to say nothing of their own long-term viability. Because social media and personal websites are natively web-based content, they don’t decompose as intuitively into discrete files, nor with the same degree of fidelity to the original experience. As more of our data is hosted remotely, we need new approaches to maintain our personal digital archives.

NDIIPP offers some great tool- and platform-agnostic advice for preserving personal websites, blogs, and social media, focusing on identifying potential web content to preserve, deciding what’s actually important to preserve, copying the content, organizing it, and creating and managing distributed backups. I wanted to expand on the “copying” step, with attention to specific tools and platforms. There are a small but growing number of tools that are well-suited to copying simple websites and blogs, and popular platforms are increasingly providing mechanisms for exporting data in self-contained and structured formats.

Tools

The Web Archiving Integration Layer (WAIL) is a user-friendly interface in front of the tools used by many cultural heritage web archiving organizations: the Heritrix archival crawler and the Wayback web archive replay platform. WAIL supports one-click capture of an entire website and replay in a local instance of Wayback. Data is captured to the WARC format, which has the advantage of being the ISO-standard web archiving preservation format of choice and of allowing a more faithful representation of the original website via the bundled Wayback. The downside is that WARC is a relatively opaque format to all but a few specialized applications. Given that WAIL has only one maintainer, in a personal archiving context it might make sense to copy web content into more readily legible formats in addition to WARC.

Wget is a mature, multi-platform command-line utility for which a number of GUI wrappers are available. Wget is highly configurable but can be used to copy a website with only minimal parameters specified. Copied content is stored by default in a local folder hierarchy mirroring the website structure. Wget 1.13+ additionally supports storing copied content in the WARC format; the WARCs are created in parallel with the copying of the website files into the folder hierarchy. This dual-format capture provides both easy, relatively future-safe browsing and a suitable preservation format. The downsides are that Wget generally requires comfort with the command line (there are GUI wrappers, but I’ve yet to find one that supports the WARC parameters) and that there’s no easy way to replay or access the contents of the created WARC files.
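For those comfortable at a shell prompt, a minimal invocation along the following lines mirrors a site into a local folder hierarchy while also writing a WARC file. Treat this as a sketch only: the URL and the “mysite” WARC prefix are placeholders, and exact behavior may vary across Wget versions.

    # Mirror the site into a local folder hierarchy and, in parallel,
    # write a WARC file (plus a CDX index). URL and prefix are placeholders.
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --warc-file=mysite --warc-cdx \
         http://www.example.com/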

stranger 7/100 abdul hoque, by Hasin Hayder

HTTrack is a multi-platform command-line and GUI tool built specifically for mirroring websites. Because of HTTrack’s narrower purpose, the documentation and official forum are likely to be more relevant to a personal digital archivist looking to conduct small-scale web archiving. As with Wget, copied content is stored in a local folder hierarchy mirroring the website structure, making it easy to browse. The command-line version of the tool allows for automation and flexibility, while the GUI version is more user-friendly. The main downside is that HTTrack doesn’t produce WARC files; if desktop tools for handling WARCs later become available, that would likely be the preferable format to have archived web content in.
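As a rough sketch, a single-site mirror from the command-line version might look like the following; the URL, output directory, and domain filter are placeholders to adapt to your own site.

    # Mirror www.example.com into ./example-mirror, following links
    # only within that domain, with verbose output
    httrack "http://www.example.com/" -O "./example-mirror" "+*.example.com/*" -v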

Warrick is a *nix command-line utility for re-web archiving—creating a local copy of a website based on content in a web archive (e.g., the Internet Archive Wayback Machine). Built-in Memento support might eventually allow it to reconstitute a website from content hosted in multiple web archives. Like HTTrack and Wget, copied content is stored in a local folder hierarchy mirroring the website structure. Unlike those tools, it’s only designed to retrieve content from web archives.

Platforms

Facebook provides a mechanism for “Downloading Your Info” which creates a zip file containing static web pages that cover your profile, contact information, Wall, photos (only those that you’ve posted yourself, apparently), friends, messages, pokes, events, settings, and ads selectors. While comprehensive and self-contained, there is no option to retrieve the data in more structured formats, like vCard for contacts, iCal for events, or mbox for messages. Facebook is an especially poor place from which to recover photos, as all embedded metadata (including EXIF, IPTC IIM, and IPTC XMP headers) is stripped out and images larger than 100 KB are compressed on upload.

Google Takeout allows for exporting data from an expanding number of Google services, including Mail, Calendar, Contacts, Drive, and Google Plus, in standard formats. This means that the exported data is good for both long-term preservation and continued access using other applications that support those standards. Google Plus supports full-size photo uploads and, therefore, downloads (limited only by Google Drive storage quota) and doesn’t destroy embedded metadata in uploaded photos.

Twitter provides a mechanism for downloading “Your Twitter Archive” which creates a zip file containing a standalone JavaScript-enabled application that evokes the experience of using the Twitter web service in the browser. At first glance, this resembles the format of Facebook’s data export, but a key differentiator is that the Twitter data export includes each of the individual tweets in JSON format and provides the standalone application as a convenience for browsing them. Since the exported data is separate from the presentation, it’s much easier to re-purpose or manipulate it with other tools.
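As a hedged illustration of that kind of re-purposing: at the time of writing, the export stores each month’s tweets in a file along the lines of data/js/tweets/2014_01.js, which wraps a JSON array in a JavaScript variable assignment (the exact file layout and variable name are assumptions here). Stripping that assignment leaves plain JSON that any JSON-aware tool can work with.

    # Strip the leading "Grailbird.data.tweets_2014_01 = " assignment,
    # leaving a plain JSON array of tweets
    sed '1s/^[^=]*=[[:space:]]*//' data/js/tweets/2014_01.js > 2014_01.json
    # Any JSON-aware tool can then process it, e.g. pretty-print with Python
    python -m json.tool 2014_01.json | head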

Mainstream content management systems may have extensions that support exporting data in structured formats, though I’m not familiar with any specifically. Explore WordPress plugins or Drupal modules. “Backup” is probably the term to search for there, as “archive” typically has the connotation of previously-published content that remains accessible through the live website.

Weekly web archiving roundup: April 16, 2014

Here’s the weekly web archiving roundup for April 16, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

Weekly web archiving roundup: April 9, 2014

Another quick update for the weekly web archiving roundup, April 9, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Vanderbilt-led effort to build new digital environments receives Mellon grant,” Ann Marie Deer Owens: The Andrew W. Mellon Foundation has awarded Vanderbilt University a grant to support the Committee on Coherence at Scale for Higher Education. http://news.vanderbilt.edu/2014/04/mellon-library-grant/

Weekly web archiving roundup: April 2, 2014

Just a short update for the weekly web archiving roundup for April 2, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!


Weekly web archiving roundup: March 26, 2014

Just a short update for the weekly web archiving roundup for March 26, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Preserving Web-based Auction Catalogs at the Frick Art Reference Library,” Gretchen Nadasky: The Frick Art Reference Library is working on a project to preserve their online auction catalogs. This paper describes the second phase, the goal of which is to determine preservation priorities and identify material which has already been captured. http://www.dlib.org/dlib/march14/nadasky/03nadasky.html

Weekly web archiving roundup: March 19, 2014

Here’s the weekly web archiving roundup for March 19, 2014–and remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

Weekly web archiving roundup: March 12, 2014

Here’s the weekly web archiving roundup for March 12, 2014–these news items go up each Wednesday, and will be pertinent to anyone with an interest in web archiving.  Although news items are compiled by the Web Archiving Roundtable steering committee and Best Practices/Toolbox committee members, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Keep Everything 1.0 – Light and Fast Web Archiver for iOS,” by Hyoseo Park: Groosoft has released an iPhone app which allows users to archive any web content from a browser on their phone. http://prmac.com/release-id-65946.htm

Weekly web archiving roundup: March 5, 2014

Just one item for the weekly web archiving roundup for March 5, 2014–these news items go up each Wednesday, and will be pertinent to anyone with an interest in web archiving.

Weekly web archiving roundup: February 26, 2014

Here’s the weekly web archiving roundup for February 26, 2014–these news items go up each Wednesday, and will be pertinent to anyone with an interest in web archiving.  Although news items are compiled by the Web Archiving Roundtable steering committee and Best Practices/Toolbox committee members, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!

  • “Tessella reports: Preservica simplifies active preservation for DSpace repositories,” Christina Tealdi: Tessella has updated the workflow for its Preservica service to make it easier to ingest digital content from DSpace. http://www.prweb.com/releases/2014/02/prweb11614657.htm
  • “More Podcast, Less Process Episode 7: humans.txt.mp3 — The Web Archivists Are Present,” with Jefferson Bailey and Joshua Ranger: A discussion of the complexities of web archiving with guests Alex Thurman and Lily Pregill. http://keepingcollections.org/more-podcast-less-process/