Web Archiving Service (WAS): Guest Post by Rosalie Lack

Thank you, Teresa, for asking me to write about CDL’s WAS. Below is a brief overview of WAS and a web archiving call to action!

What is WAS?

The Web Archiving Service (WAS) is the California Digital Library’s (CDL) web archiving service. WAS provides curator tools for capturing, managing, and archiving websites, along with a public interface for searching and displaying archived sites.

Since its launch in 2007, WAS has served the University of California libraries and affiliated institutions, as well as educational institutions across North America.

Who is WAS? 

You can’t talk about WAS without mentioning Tracy Seneca, now Digital Services Librarian at the University of Illinois, Chicago. Tracy was the WAS Service Manager from the very beginning until she left the CDL in 2012 for a new job opportunity and to return to her beloved Chicago. When Tracy left, I joined the CDL team as the WAS Service Manager. The technical team (on board since the beginning) includes Erik Hetzner (Technical Lead) and Scott Fisher (UX Developer). And, because it takes a village, we also have help from other members of the UC Curation Center (UC3) team as well as other staff across the CDL.

How is WAS used?

There are four ways to characterize how institutions are using WAS: (1) capturing their institution’s own websites, with the goal of preserving institutional history and recording university news and events; (2) crawling with a geographical focus, for example to capture information on a particular city, county, or region; (3) topical crawls that support research collections; and (4) event-driven crawls. All crawls are done within the context of each library’s collection development policy. Event crawls can differ from the others in that there is usually a clear start and end date, and unplanned events can require sudden action. A list of archives that you can browse and search is available from the WAS home page: was.cdlib.org.

Go forth and crawl!

And now for that call to action … I recently attended the IIPC General Assembly in Ljubljana, Slovenia. The conference brought together dedicated web archivists from all over the world to tackle tough archiving issues having to do with tools, standards, access, and more. This is an exciting time to be involved in web archiving – there are many challenges, but also many great opportunities. I left the conference re-energized about web archiving and reminded of how important it is that we’re not only saving important internet resources, but also building new collections for research and study. Indeed, go forth and crawl!

Rosalie Lack, WAS Service Manager
rosalie.lack@ucop.edu

 


Web Archives: Guest Post by Alexander Duryee

While the web archiving discussion tends to focus on subject- and Internet-scope efforts, the other end of the scale – small item-level archives – has also developed its own set of techniques and tools.  Based out of the New Museum of Contemporary Art, Rhizome is the leading nonprofit dedicated to supporting emerging artistic practices that engage technology.  Part of our broader preservation mission is the continued conservation of, and access to, artworks engaged with technology; as such, with 2,000 works, the Rhizome ArtBase is home to the world’s largest collection of its kind.  As an art collection, we approach digital preservation as a challenge of conservation instead of archiving.  Hence, our tools, philosophy, and methodologies differ from those of most web archival programs, as our goal – preserving a small art collection perfectly – requires techniques and perspectives not found in broader-scale collections.

Form Art, as rendered in Netscape 3.0 and Chrome 26

Compared to more general archival collections, the works in the ArtBase present a number of technical and contextual issues that resist industrial-scale crawling.  Digital artists tend to work on the edge of what is technically possible, finding creative uses for the most powerful tools available.  As such, works are rarely trivially crawlable; an artist may run their site through a variety of closed formats, dynamically- and responsively-generated content, and external services.  Despite this, given the demands of conservation, Rhizome must recreate both the file-level data and structure-level operability of each artwork in the ArtBase.  This friction between archive and archivist has shaped Rhizome’s web archiving program from the beginning.  In addition to issues arising from works-as-artifacts, a variety of issues stem from the nature of works-as-art.  The aesthetic environment and context of a work must be identified and documented, to preserve the artist’s intended experience; viewing Form Art (1997), for example, changes considerably as one moves forward in computer history.

The variety of conservatorial challenges that net art presents has led Rhizome to develop a specialized archival workflow and tool suite.  Large-scale crawling applications, such as Heritrix, are not well-suited to our web archiving needs: while they excel at crawling very large collections of documents, they tend to be ill-suited for very small, tightly targeted tasks.  Our crawls also require a high degree of interactivity, to ensure that nothing more or less than a given work is archived in the ArtBase.  This emphasis on tightly focused and monitored scoping is critical, as Rhizome must capture an entire work without going beyond its artist-stated limits.  As such, Rhizome uses a toolbox of standard open-source tools for our cloning efforts.  The workhorse of our web archiving is wget, for its power and flexibility: it accepts very specific rules for crawl scope and behavior, and it can be passed output from other tools via shell scripts.  This is crucial when dealing with non-HTML objects, such as Flash and complex JavaScript.  The variety of formats and structures that net art uses (and occasionally abuses) is reflected in our analytic tools, which consist of UNIX text utilities, Perl scripting, media-to-text interpreters (such as swfmill), and hex editors.  While such detailed investigation would be impossible in large-scale web archives, Rhizome’s small-but-demanding collection provides us with both the luxury and the necessity of file-level focus.
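As a purely illustrative sketch of that kind of piping (not Rhizome’s actual scripts), the snippet below runs a SWF through swfmill, scrapes anything URL-shaped out of the resulting XML, and hands each hit to a narrowly scoped wget call; the file path, regular expression, and choice of wget flags are assumptions made for the example.

```python
import re
import subprocess

def urls_from_swf(swf_path):
    """Convert a SWF to XML with swfmill and scrape anything URL-shaped from it."""
    # swfmill's swf2xml mode emits an XML representation of the movie;
    # hardcoded URLs generally survive as plain strings in that output.
    xml = subprocess.run(["swfmill", "swf2xml", swf_path],
                         capture_output=True, text=True, check=True).stdout
    return sorted(set(re.findall(r"https?://[^\s\"'<>]+", xml)))

def fetch(url, dest_dir="crawl"):
    """Fetch one URL plus its page requisites with a tightly scoped wget call."""
    subprocess.run(["wget", "--page-requisites", "--convert-links",
                    "--directory-prefix", dest_dir, url], check=True)

if __name__ == "__main__":
    for url in urls_from_swf("artwork/main.swf"):   # hypothetical path
        fetch(url)
```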

Rhizome’s web conservation workflow

Due to Rhizome’s emphasis on small-scale conservation and the nature of our collection, our workflow is highly iterative, with many quality-control steps before the final archival package is produced.  Following the identification of a work and an initial analysis (in case a work is too damaged to continue, or is based on server-side technologies), the work is crawled once.  We then analyze the crawl, looking for gaps (typically non-HTML links, e.g. JavaScript or Flash, and applet dependencies) and documenting properties of interest (such as heavy software/hardware reliance).  These gaps are then analyzed for URLs (either hardcoded or generated on the fly), which are fed back into wget for further crawling.  This process continues until as complete a copy as possible is present on Rhizome’s servers.  The complete list of URLs is then passed one final time to wget, this time generating a WARC file alongside the crawl.  This final step, while seemingly redundant, is crucial to Rhizome’s archival role: by providing provenance metadata along with the raw data of a crawl, the exact parameters, process, and post-crawl changes to a work (e.g. adjusted URLs) are available to future researchers.
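To make the shape of that loop explicit, here is a minimal sketch under assumed helpers, not Rhizome’s production workflow: find_gaps stands in for the whole analysis step (which in practice involves text utilities, swfmill, and manual inspection), and only the final pass asks wget to record a WARC via its --warc-file option.

```python
import subprocess

def crawl_urls(urls, dest_dir, warc_name=None):
    """Crawl an explicit URL list with wget; optionally record a WARC alongside it."""
    cmd = ["wget", "--page-requisites", "--convert-links", "--directory-prefix", dest_dir]
    if warc_name:
        cmd += ["--warc-file", warc_name]   # wget writes <warc_name>.warc.gz next to the crawl
    subprocess.run(cmd + sorted(urls), check=True)

def find_gaps(dest_dir):
    """Stand-in for the analysis step: scan the crawled files for URLs wget missed
    (e.g. paths hardcoded in JavaScript or Flash) and return them as a set."""
    return set()   # the real step uses UNIX text utilities, swfmill, hex editors, etc.

def conserve(seed_url, dest_dir="crawl", warc_name="artwork"):
    seen, pending = set(), {seed_url}
    # Iterate until analysis turns up no new URLs, i.e. the copy is as complete as possible.
    while pending:
        crawl_urls(pending, dest_dir)
        seen |= pending
        pending = find_gaps(dest_dir) - seen
    # One final pass over the full URL list, this time generating a WARC for provenance.
    crawl_urls(seen, dest_dir, warc_name=warc_name)
```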

Despite the individual care that we can afford each work in the ArtBase, not all works can be crawled and stored as static objects.  Many of the works in our collection are driven by server-side processes, ranging from early CGI and Perl scripting to sophisticated live data manipulation.  John Klima’s The Great Game, for example, consisted of a Java applet that pulled daily artist-generated data from a central server.  In cases such as this, no amount of crawling and quality control can create a functional copy of the work on Rhizome’s servers.  Instead, we directly contact the artist and request the necessary material to archive their work, as well as documentation to preserve its functionality.  This way, the ArtBase can ingest a completely functional copy, as well as ensure that the necessary technical documentation is present for potential future restoration.  In addition, by working with artists through the archival process, we raise awareness of the preservation needs of digital art.  By combining two conservatorial approaches – behind-the-scenes preservation and direct work with artists – Rhizome has found success in its archival mission.