Memento for Chrome review: Guest Post by Cliff Hight

The following is a guest post by Cliff Hight, University Archivist, Kansas State University Libraries.

The Memento for Chrome extension allows users of Google’s web browser, Chrome, to see previous versions of web pages. I enjoyed testing the tool and seeing how certain sites have changed over the years. By the way, do any of you remember how Yahoo! looked in 1996? With a few clicks of a mouse, you can now.

[Screenshot: ExampleYahoo]

To give you a flavor of how it works, I’ll walk you through (with pictures!) my experience.

1) After installing the extension, I used my institution’s home page as a test bed.

[Screenshot: Example1]

2) I clicked on the clock to the right of the browser’s address bar and set the web time to which I wanted to travel—arbitrarily selected as April 14, 2010.

[Screenshot: Example2]

3) I opened the context menu by right-clicking (or control-click for Mac users) on the page, selecting “Memento Time Travel,” and clicking the “Get near Wed, 14 Apr 2010 18:54:30 GMT” option.

[Screenshot: Example3]

4) Voila! A view of how the Kansas State University Libraries website looked in the Internet Archive’s Wayback Machine on March 6, 2009.

[Screenshot: Example4]

You might have noticed that the date I was seeking and the date of the archived site were not the same. As it turns out, the developers note on their page that the extension has two limitations: it cannot “obtain a prior version of a page when none have [sic] been archived and time travel into the future.” Because my institution’s site was not captured in the Wayback Machine on April 14, 2010, there was nothing from that date to show. Instead, the tool went with the next oldest date, which happened to be March 6, 2009.
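
This fallback to the nearest available capture is easy to see outside the extension as well: if you request a page from the Internet Archive’s Wayback Machine with a date in the URL, it redirects you to the closest snapshot it actually holds. Below is a minimal Python sketch of that lookup; the date-prefixed URL pattern is a Wayback Machine convention rather than something the extension exposes, and the target URL and date are only illustrative.

    # Minimal sketch: ask the Wayback Machine for the capture of a page
    # nearest to a requested date, mirroring the extension's
    # "Get near <date>" behavior. The target URL and date are illustrative.
    import urllib.request

    def nearest_capture(url, timestamp):
        """Return the archive URL the Wayback Machine redirects to for the
        given YYYYMMDD timestamp, i.e. the closest capture it holds."""
        request_url = "http://web.archive.org/web/{}/{}".format(timestamp, url)
        # urlopen follows redirects; the final URL embeds the actual capture date.
        with urllib.request.urlopen(request_url) as response:
            return response.geturl()

    # Asking for April 14, 2010 may resolve to a different date if no capture
    # exists for that day, which is exactly the behavior described above.
    print(nearest_capture("http://www.lib.k-state.edu/", "20100414"))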

You may also have noticed additional options on the context submenu. Selecting “Get near current time” sends you to the most recently archived version of the page. The “Get at current time” option takes you to the live version of the page, and the “Got” line tells you which page you are currently seeing.

The extension draws on various web archives, such as the Wayback Machine and the British Library Web Archive, to provide easy access to earlier versions of web pages. It also claims to provide archived pages of Wikipedia in all available languages. In my use of the tool, it was more convenient than going to the Wayback Machine directly every time I wanted to see older versions of websites.

Like most technology products, this one has some bugs. In my tests, there were a couple of times on different websites when I set a date, looked at the current version of the page, clicked to see the older version, and waited while nothing happened. To get it working again, I went back to the date box, changed the date by a day, and then succeeded in seeing the older version. I’m not sure why it had those hiccups (and it would not surprise me if there was user error), but be aware as you begin to use the tool that there might be some kinks to work through.

The developers of Memento include the Prototyping Team of the Research Library of the Los Alamos National Laboratory and the Computer Science Department of Old Dominion University. Based on information on the Memento website, the extension began development in 2009 and its most recent update was in November 2013. On the plugin page, the developers state that “Memento for Chrome allows you to seamlessly navigate between the present web and the web of the past. It turns your browser into a web time travel machine that is activated by means of a Memento sub-menu that is available on right-click.” And, to learn more about the technical side of the project, you can see their Memento Guide and Request for Comments pages.

The Memento for Chrome extension is a helpful tool that allows users to easily peruse websites and see how they have changed through the years. I would recommend adding it to your toolbox as you seek to view the history of the web.

Personal Digital (Web) Archiving: Guest Post by Nicholas Taylor

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Physical keepsakes like photos, letters, and vital documents have a thankfully long shelf-life with little affirmative effort. The digital artifacts that increasingly replace them, however, are susceptible to loss or corruption, which has raised awareness of the heightened need to curate, manage, and protect these assets. Interest in “personal digital archiving” has grown significantly within the last few years, as demonstrated by three annual conferences on the topic, practical guidance developed by the National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, and attention from other educational and cultural heritage institutions.

exhibits / 2012 National Book Festival, by wlef70

The recent focus on personal digital archiving as such only belatedly acknowledges the proliferation of personal digital assets worth archiving. The Web adds another layer of complexity, as files that were previously stored locally are scattered across the servers of many different service providers, each with varying degrees of commitment to data longevity, to say nothing of their own long-term viability. As natively web-based content, social media and personal websites don’t decompose as intuitively into discrete files, nor do copies preserve the same degree of fidelity to the original experience. As more of our data is hosted remotely, we need new approaches to continue to maintain our personal digital archives.

NDIIPP offers some great tool- and platform-agnostic advice for preserving personal websites, blogs, and social media, focusing on identifying potential web content to preserve, deciding what’s actually important to preserve, copying the content, organizing it, and creating and managing distributed backups. I wanted to expand on the “copying” step, with attention to specific tools and platforms. There are a small but growing number of tools that are well-suited to copying simple websites and blogs, and popular platforms are increasingly providing mechanisms for exporting data in self-contained and structured formats.

Tools

The Web Archiving Integration Layer (WAIL) is a user-friendly interface in front of the tools used by many cultural heritage web archiving organizations: the Heritrix archival crawler and the Wayback web archive replay platform. WAIL supports one-click capture of an entire website and replay in a local instance of Wayback. Data is captured to the WARC format, which has the advantage of being the ISO standard web archiving preservation format of choice and allowing for a more faithful representation of the original website via the bundled Wayback. The downside is that WARC is a relatively opaque format to all but a few specialized applications. Given that WAIL has only one maintainer, in a personal archiving context it might make sense to copy web content into more readily legible formats in addition to WARC.
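
That opacity is less of an obstacle if you are comfortable with a little scripting. As a rough sketch, assuming the third-party Python package warcio (which is not part of WAIL) and an illustrative file name, the records in a WARC file can be listed like this:

    # Rough sketch: list the records in a WARC file produced by WAIL (or any
    # other crawler). Assumes the third-party warcio package (pip install
    # warcio); the file name is illustrative.
    from warcio.archiveiterator import ArchiveIterator

    with open("example-capture.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                # Each response record notes the URL it was captured from.
                print(record.rec_headers.get_header("WARC-Target-URI"))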

Wget is a mature, multi-platform command-line utility for which a number of GUI wrappers are available. Wget is highly configurable but can be used to copy a website with only minimal parameters specified. Copied content is stored by default in a local folder hierarchy mirroring the website structure. Wget 1.13+ additionally supports storing copied content in the WARC format—the WARCs are created in parallel with the copying of the website files into the folder hierarchy. The dual-format capture provides both easy, relatively future-safe browsing and a suitable preservation format. The downsides are that Wget generally requires comfort with the command line (there are GUI wrappers, but I’ve yet to find one that supports the WARC parameters) and that there’s no easy way to replay or access the contents of the created WARC files.
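
For anyone willing to brave the command line anyway, the dual-format capture described above can be kicked off from a short script. This is only a sketch: it assumes a Wget build with WARC support (1.13 or later) is installed and on your PATH, and the site URL and output name are placeholders.

    # Minimal sketch: mirror a site into a local folder hierarchy while also
    # writing a WARC, as described above. Assumes Wget 1.13+ is installed and
    # on PATH; the URL and file name are placeholders.
    import subprocess

    subprocess.run(
        [
            "wget",
            "--mirror",                  # recursive download with timestamping
            "--page-requisites",         # also fetch images, CSS, and scripts
            "--convert-links",           # rewrite links for local browsing
            "--adjust-extension",        # add .html extensions where needed
            "--warc-file=example-site",  # additionally write example-site.warc.gz
            "http://example.com/",
        ],
        check=True,
    )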

stranger 7/100 abdul hoque, by Hasin Hayder

HTTrack is a multi-platform command-line and GUI tool built specifically for mirroring websites. Because of HTTrack’s narrower purpose, the documentation and official forum are likely to be more relevant to a personal digital archivist looking to conduct small-scale web archiving. Like Wget, copied content is stored in a local folder hierarchy mirroring the website structure, making it easy to browse. The command-line version of the tool allows for automation and flexibility, while the GUI version is more user-friendly. The main downside is that HTTrack does not produce WARC files; if desktop tools for handling WARC later become available, that would likely be the preferable format to have archived web content in.
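
As with Wget, the command-line version lends itself to scripting. A minimal sketch, assuming the httrack binary is installed; the URL and output directory are placeholders:

    # Minimal sketch: mirror a site with the command-line version of HTTrack.
    # Assumes the httrack binary is installed; the URL and output directory
    # are placeholders.
    import subprocess

    subprocess.run(
        ["httrack", "http://example.com/", "-O", "mirrors/example.com"],
        check=True,
    )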

Warrick is a *nix command-line utility for re-web archiving—creating a local copy of a website based on content in a web archive (e.g., the Internet Archive Wayback Machine). Built-in Memento support might eventually allow it to reconstitute a website from content hosted in multiple web archives. Like HTTrack and Wget, copied content is stored in a local folder hierarchy mirroring the website structure. Unlike those tools, it’s only designed to retrieve content from web archives.

Platforms

Facebook provides a mechanism for “Downloading Your Info” which creates a zip file containing static web pages that cover your profile, contact information, Wall, photos (only those that you’ve posted yourself, apparently), friends, messages, pokes, events, settings, and ads selectors. While comprehensive and self-contained, there is no option to retrieve the data in more structured formats, like vCard for contacts, iCal for events, or mbox for messages. Facebook is an especially poor place from which to recover photos, as all embedded metadata (including EXIF, IPTC IIM, and IPTC XMP headers) is stripped out and images larger than 100 KB are compressed on upload.

Google Takeout allows for exporting data from an expanding number of Google services, including Mail, Calendar, Contacts, Drive, and Google Plus, in standard formats. This means that the exported data is good for both long-term preservation and continued access using other applications that support those standards. Google Plus supports full-size photo uploads and, therefore, downloads (limited only by Google Drive storage quota) and doesn’t destroy embedded metadata in uploaded photos.

Twitter provides a mechanism for downloading “Your Twitter Archive” which creates a zip file containing a standalone JavaScript-enabled application that evokes the experience of using the Twitter web service in the browser. At first glance, this resembles the format of Facebook’s data export, but a key differentiator is that the Twitter data export includes each of the individual tweets in JSON format and provides the standalone application as a convenience for browsing them. Since the exported data is separate from the presentation, it’s much easier to re-purpose or manipulate it with other tools.
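
To give a sense of how reusable that separation makes the data, here is a rough sketch of reading the exported tweets with Python. It assumes the export layout current at the time of writing, with monthly files under data/js/tweets/ in which a JavaScript variable assignment precedes a JSON array; other versions of the export may be organized differently.

    # Rough sketch: collect tweets from a downloaded Twitter archive. Assumes
    # the circa-2013 export layout, with monthly files under data/js/tweets/
    # whose JSON array is preceded by a JavaScript variable assignment; adjust
    # the path if your export is laid out differently.
    import glob
    import json

    tweets = []
    for path in sorted(glob.glob("twitter-archive/data/js/tweets/*.js")):
        with open(path, encoding="utf-8") as f:
            raw = f.read()
        # Drop the leading JavaScript assignment (e.g. "Grailbird.data.tweets_YYYY_MM ="),
        # keeping everything from the first "[" (the JSON array) onward.
        tweets.extend(json.loads(raw[raw.index("["):]))

    for tweet in tweets:
        print(tweet.get("created_at"), tweet.get("text"))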

Mainstream content management systems may have extensions that support exporting data in structured formats, though I’m not familiar with any specifically. Explore WordPress plugins or Drupal Modules. “Backup” is probably the term to search for there, as “archive” typically has the connotation of previously-published content that remains accessible through the live website.

It is Time to Embrace the Present: Guest Post by Deborah Kempe

“Water, water, everywhere,
And all the boards did shrink;
Water, water, everywhere,
Nor any drop to drink.”
― Samuel Taylor Coleridge, The Rime of the Ancient Mariner

Colleagues, do you not share this sensation when it comes to navigating the ocean that is the web? When 57,865 results in a Google search do not cut it for research purposes, where does one turn? When an important URL is suddenly no longer findable, what does one do? Unfortunately, the traditionally safe harbor of libraries and archives as trustworthy repositories of reliable information is no longer quite so secure a destination. At the same time, the traditionally held concept of what libraries and archives should be is undergoing radical reinterpretation.

It was amid this shifting landscape that the libraries of the New York Art Resources Consortium (NYARC) undertook a series of programmatic inquiries into the state of the web for research in art history. Those explorations led to a major grant from the Andrew W. Mellon Foundation, awarded to NYARC in October, for a two-year program in support of preserving born-digital resources for art research.

Web archiving and the digital humanities are still relatively new fields. Given that the field of art history and the art business community continue to produce steady streams of relevant print publications, the adoption of a contiguous program to select, capture, describe, and preserve born-digital resources will be a major disruption of traditional library practices. Arriving at this point has been a delicate calculation of structured investigation and righteous determination that, admittedly, can be a bit uncomfortable.

****

“It is time to embrace the present, let alone the future. The digital world is here to stay and constantly changing. We have to not only embrace it but help to shape it.”

These words, expressed just over a year ago by James Cuno, the President and CEO of the J. Paul Getty Trust, in a much re-circulated blog posting entitled “How Art History is Failing at the Internet,” capture the attitude that drove us forward into territory that challenged our comfort levels. But the journey was a series of determined steps. A bit of background…

A presentation in 2010 by Kristine Hanna of the Internet Archive at an ARLIS/NY meeting at the Metropolitan Museum of Art was the first introduction for many of us to a new software service, called Archive-It, which could be used to curate and capture historical instances of websites. Unlike many disciplines, which were already experiencing a digital deluge, the realm of art history was only beginning to produce a noticeable quantity of websites and digital publications with value for research. With the onset of the “Great Recession,” the economic crisis that was all too real at the time but now feels like a chimera, many galleries, art dealers, auctioneers, and small museums made a sudden shift to digital publications. The move from print to digital publishing, once driven by cost savings, led to a preference for digital as the platform of choice for many reasons beyond those of economy.

After the ARLIS/NY meeting, staff from the Frick Art Reference Library approached Archive-It to discuss the possibility of undertaking pilot projects to investigate archiving websites of auction houses and to capture and preserve links to digital information in the Archives Directory for the History of Collecting in America.  Archive-It generously facilitated these landmark projects, which allowed us to learn firsthand the challenges and promises of archiving highly visual collections on the web.

Eager to make further progress, NYARC approached the Andrew W. Mellon Foundation with a proposal to take our study to the next level. By this time, national libraries and large universities were creating discipline- or event-based web archives, but our research uncovered very little web archiving activity by special libraries. Although we continued to receive a steady flow of print publications, the number of digital publications was clearly increasing, and in many cases we were not collecting, describing, and preserving them for the long term as we did for printed documents. The clock was ticking, and we began to understand the threat of a digital black hole in our collections. Large libraries had never collected so-called “ephemeral” resources such as auction, dealer, and small exhibition catalogs, and they were not going to do it for dynamic digital versions, either. NYARC made the case for the special needs of libraries whose chief mission is to serve art specialists.

That the web has become the dominant channel for information-seeking in the 21st century is a given, yet much of its digital content is fragile and ephemeral. The question for NYARC was no longer “Why archive the web,” but “How to archive the web,” “Who should archive the web,” and “How will users navigate web archives?” The Mellon Foundation responded with support for “Reframing Collections for a Digital Age: A Preparatory Study for Collecting and Preserving Web-Based Art Research Materials.” The one-year grant allowed us to bring in experts to assess the digital landscape of art information. The reports that followed allowed NYARC to envision a road map for creating a sustainable program of specialized web archiving.

“Go small, go simple, go now”
― Larry Pardey, Cruising in “Seraffyn”

While nothing about the web is simple, an incremental approach to problem-solving is effective.  Essentially, that is what our consultants advised.

****

With the recent award of our two-year implementation grant from the Andrew W. Mellon Foundation, NYARC is now in the beginning stages of building a program that will integrate web archiving into our core activity of building high-quality collections for use by art researchers and museum staff far beyond our reading rooms.  By calling our proposal Making the Black Hole Gray, we acknowledge the futility of fully closing the digital black hole, and that it will not be possible to capture every born-digital resource that we might wish to.  Instead, we will prioritize the harvest of digital resources that correspond to our traditional collection strengths, with the expectation that others will join a historical pattern of collaborative resource sharing to enable the creation of a lasting digital corpus that will invigorate the work of librarians, archivists, scholars, technologists, and the public in ways we have only begun to imagine.  Let the voyage continue.

–Deborah Kempe, Chief, Collections Management & Access, Frick Art Reference Library of The Frick Collection (a member of the New York Art Resources Consortium), 12/6/2013

Archive-It Updates: Guest Post by Kristine Hanna (Director, Internet Archive)

A big year for web archiving

2013 has already been a big year for web archiving, with over 250 partner organizations worldwide archiving the web using Archive-It. We are excited about the wide range of use cases and experience levels with web archiving represented in the Archive-It community. As we grow, we are focusing on bringing a stronger and more robust service offering to our partner organizations.

Archiving the web has come a long way since Archive-It was first deployed in early 2006, but we also realize how much more work needs to be done. The web remains “a mess,” and we are grateful to the Archive-It community, which continues to help us find solutions. A few months ago we published our web archiving white paper in an effort to build a framework and best practices around many of the use cases we have seen evolve inside the Archive-It community over the last seven years: http://www.archive-it.org/publications

Heritrix upgrade and 4.8

In February we upgraded to Heritrix 3.1, the newest version of the crawler software. For our partners this means faster crawls and more complete captures of websites, particularly for sites that contain JavaScript.

In May we went live with our 4.8 release. Over 16 partner-requested features were included. Here are a few:

  • New Wayback Quality Assurance Tool: Enables partners to find missing embedded URLs while browsing their archived content. A subsequent patch crawl of the missing URLs can dramatically improve the overall look and functionality of the archived site.
  • Increased functionality for archiving PDFs: A new feature enables partners to quickly find, view, and add metadata to new PDFs that were archived since their last crawl.
  • IP Authentication: Option to restrict access to archived websites to a particular IP address, such as a reading room or library.
  • Vanity URL: Customizable URLs for partners’ public home pages on http://www.archive-it.org.
  • Import Metadata: Ability to import a spreadsheet to add to or replace metadata for thousands of records at a time.

Archive-It 5.0: The next generation web archiving application

Later this year, the Archive-It team will be turning our focus to the next generation of the web application, which we are calling 5.0. We expect the development to take 6-8 months, and we will continue to run the existing 4.8 version while we work with partners on reinvigorating the web application. As always, it will be a priority to incorporate partners’ feedback and requests into 5.0 to ensure that the service continues to be relevant and useful to the library and archiving communities.

So far, we are focusing our attention on:

  • adding additional capture and display solutions for social media sites and dynamic content
  • transitioning full text search from NutchWAX to SOLR to accompany our metadata search (which made the transition to SOLR in 2011)
  • enhancing QA and crawl analysis tools
  • designing a more intuitive user interface

If you would like to learn more about 5.0, you can take a look at our wiki page: https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=49414243. We welcome additional feature requests or suggestions.

Get In Touch

Please come find the Archive-It team at the following conferences, where we will be presenting or showing a poster:

Association of Canadian Archivists General Meeting, June 15: http://archivists.ca/content/program

Digital Preservation 2013, Alexandria, VA, July 23-25: http://www.digitalpreservation.gov/meetings/ndiipp13.html

Society of American Archivists, August 15-17: http://www2.archivists.org/conference/2013/new-orleans

And our 2013 partner meeting is scheduled for November 12 in Salt Lake City, Utah: http://archiveitmeeting2013.wordpress.com/about/

We appreciate all the valuable feedback we receive from our partners and the library/archive community on ways to make the Archive-It service better.

Follow us on Twitter: @archiveitorg

Find us on Facebook: https://www.facebook.com/ArchiveIt

Our blog: http://blog.archive-it.org