“‘Distant Reading’ and Web Archiving,” by Andrea Fox. It is possible to perform computational analysis of digitized text and find patterns that individual readers cannot detect on their own. Those techniques could be applied to web archive collections.
The recent focus on personal digital archiving only belatedly acknowledges the proliferation of personal digital assets worth archiving. The Web adds another layer of complexity: files that were previously stored locally are now scattered across the servers of many different service providers, each with varying degrees of commitment to data longevity, to say nothing of their own long-term viability. As natively web-based content, social media and personal websites don’t decompose as intuitively into discrete files, nor with the same degree of fidelity to the original experience. As more of our data is hosted remotely, we need new approaches to maintaining our personal digital archives.
NDIIPP offers some great tool- and platform-agnostic advice for preserving personal websites, blogs, and social media, focusing on identifying potential web content to preserve, deciding what’s actually important to preserve, copying the content, organizing it, and creating and managing distributed backups. I wanted to expand on the “copying” step, with attention to specific tools and platforms. A small but growing number of tools are well-suited to copying simple websites and blogs, and popular platforms are increasingly providing mechanisms for exporting data in self-contained and structured formats.
Wget is a mature, multi-platform command-line utility for which a number of GUI wrappers are available. Wget is highly configurable but can be used to copy a website with only minimal parameters specified. Copied content is stored by default in a local folder hierarchy mirroring the website structure. Wget 1.13 and later additionally supports storing copied content in the WARC format; the WARCs are created in parallel with the copying of the website files into the folder hierarchy. This dual-format capture provides both easy, relatively future-safe browsing and a suitable preservation format. The downsides are that Wget generally requires comfort with the command line (there are GUI wrappers, but I’ve yet to find one that supports the WARC parameters) and that there’s no easy way to replay or access the contents of the created WARC files.
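For those comfortable scripting the process, here is a minimal sketch, driven from Python purely for illustration, of the kind of Wget invocation that produces both the browsable folder copy and a WARC. The URL and WARC name are placeholders; the options shown are standard in Wget 1.13+.

```python
import subprocess

site = "http://example.com/"      # site to copy (placeholder)
warc_name = "example-site"        # produces example-site.warc.gz alongside the folder mirror

subprocess.run([
    "wget",
    "--mirror",                    # recursive copy that preserves timestamps
    "--page-requisites",           # also fetch the images, CSS, and JS needed to render pages
    "--convert-links",             # rewrite links so the folder copy can be browsed locally
    f"--warc-file={warc_name}",    # record the same crawl into a gzipped WARC
    site,
], check=True)
```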
HTTrack is a multi-platform command-line and GUI tool built specifically for mirroring websites. Because of HTTrack’s narrower purpose, the documentation and official forum are likely to be more relevant to a personal digital archivist looking to conduct small-scale web archiving. As with Wget, copied content is stored in a local folder hierarchy mirroring the website structure, making it easy to browse. The command-line version of the tool allows for automation and flexibility, while the GUI version is more user-friendly. The main downside is that HTTrack doesn’t produce WARC files; if desktop tools for handling WARC files later become available, that would likely be the preferable format to have archived web content in.
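A similar sketch for the command-line version of HTTrack, again scripted from Python for illustration. The URL, output folder, and scan filter are placeholders, and the options worth using will vary from site to site.

```python
import subprocess

subprocess.run([
    "httrack",
    "http://example.com/",         # site to mirror (placeholder)
    "-O", "mirrors/example.com",   # output path for the mirrored folder hierarchy
    "+*.example.com/*",            # scan filter: stay within the site's own domain
    "-v",                          # verbose progress output
], check=True)
```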
Warrick is a *nix command-line utility for the reverse of web archiving: creating a local copy of a website based on content already held in a web archive (e.g., the Internet Archive Wayback Machine). Built-in Memento support might eventually allow it to reconstitute a website from content hosted in multiple web archives. As with HTTrack and Wget, copied content is stored in a local folder hierarchy mirroring the website structure. Unlike those tools, it’s only designed to retrieve content from web archives.
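This isn’t Warrick’s own code, but the sketch below illustrates the kind of lookup such a tool automates: asking the Internet Archive Wayback Machine’s availability API whether it holds a capture of a given page. The page URL is a placeholder.

```python
import json
import urllib.parse
import urllib.request

page = "http://example.com/about.html"   # lost page to look for (placeholder)
query = urllib.parse.urlencode({"url": page})

with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest and closest.get("available"):
    print("Archived copy:", closest["url"])   # a web.archive.org URL that could then be fetched
else:
    print("No capture found in the Wayback Machine")
```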
Facebook provides a mechanism for “Downloading Your Info,” which creates a zip file containing static web pages covering your profile, contact information, Wall, photos (apparently only those that you’ve posted yourself), friends, messages, pokes, events, settings, and ad selectors. While the download is comprehensive and self-contained, there is no option to retrieve the data in more structured formats, such as vCard for contacts, iCal for events, or mbox for messages. Facebook is an especially poor place from which to recover photos, as all embedded metadata (including EXIF, IPTC IIM, and IPTC XMP headers) is stripped out and images larger than 100 KB are compressed on upload.
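If you want to verify the metadata stripping against your own photos, a quick check along these lines works. It assumes a recent version of the Pillow imaging library (which provides Image.getexif), and the file paths are placeholders.

```python
from PIL import Image

def exif_tag_count(path):
    """Return the number of EXIF tags embedded in the image at `path`."""
    with Image.open(path) as img:
        return len(img.getexif())

print("original photo:", exif_tag_count("photos/original.jpg"))                 # placeholder path
print("facebook copy:", exif_tag_count("facebook-export/photos/photo.jpg"))     # typically 0
```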
Google Takeout allows for exporting data from an expanding number of Google services, including Mail, Calendar, Contacts, Drive, and Google Plus, in standard formats. This means that the exported data is good for both long-term preservation and continued access using other applications that support those standards. Google Plus supports full-size photo uploads and, therefore, downloads (limited only by Google Drive storage quota) and doesn’t destroy embedded metadata in uploaded photos.
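As a small illustration of why standard export formats matter: a Gmail export from Takeout is an ordinary mbox file that can be read with nothing more than Python’s standard library. The path below is a placeholder for wherever the downloaded archive was unpacked.

```python
import mailbox

# Placeholder path to the mbox file inside the unpacked Takeout archive.
messages = mailbox.mbox("Takeout/Mail/archive.mbox")
for message in messages:
    print(message["Date"], "|", message["From"], "|", message["Subject"])
```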
Mainstream content management systems may have extensions that support exporting data in structured formats, though I’m not familiar with any specifically. Explore WordPress plugins or Drupal Modules. “Backup” is probably the term to search for there, as “archive” typically has the connotation of previously-published content that remains accessible through the live website.
Another quick update for the weekly web archiving roundup, April 9, 2014. And remember, don’t be shy about posting additional items of interest to the [webarchiving] discussion list!
“Vanderbilt-led effort to build new digital environments receives Mellon grant,” Ann Marie Deer Owens. The Andrew W. Mellon Foundation has awarded Vanderbilt University a grant to support the Committee on Coherence at Scale for Higher Education. http://news.vanderbilt.edu/2014/04/mellon-library-grant/