Web Archives: Guest Post by Alexander Duryee

While the web archiving discussion tends to focus on subject- and Internet-scope efforts, the other end of the scale – small item-level archives – has also developed its own set of techniques and tools.  Based out of the New Museum of Contemporary Art, Rhizome is the leading nonprofit dedicated to supporting emerging artistic practices that engage technology.  Part of our broader preservation mission is the continued conservation of, and access to, artworks engaged with technology; with 2,000 works, the Rhizome ArtBase is home to the world’s largest collection of its kind.  As an art collection, we approach digital preservation as a challenge of conservation rather than archiving.  Hence, our tools, philosophy, and methodologies differ from those of most web archival programs, as our goal – preserving a small art collection perfectly – requires techniques and perspectives not found in broader-scale collections.

Form Art, Netscape 3.0 and Chrome 26

Compared to more general archival collections, the works in the ArtBase present a number of technical and contextual issues that resist industrial-scale crawling.  Digital artists tend to work on the edge of what is technically possible, finding creative uses for the most powerful tools available.  As such, works are rarely trivially crawlable; an artist may run their site through a variety of closed formats, dynamically- and responsively-generated content, and external services.  Despite this, given the demands of conservation, Rhizome must recreate both the file-level data and structure-level operability of each artwork in the ArtBase.  This friction between archive and archivist has shaped Rhizome’s web archiving program from the beginning.  In addition to issues arising from works-as-artifacts, a variety of issues stem from the nature of works-as-art.  The aesthetic environment and context of a work must be identified and documented, to preserve the artist’s intended experience; viewing Form Art (1997), for example, changes considerably as one moves forward in computer history.

The variety of conservatorial challenges that net art presents has led Rhizome to develop a specialized archival workflow and tool suite.  Large-scale crawling applications, such as Heritrix, are not well suited to our web archiving needs – while they excel at crawling very large collections of documents, they tend to be ill-suited for very small targeted tasks.  Our crawls also require a high degree of interactivity, to ensure that nothing more or less than a given work is archived in the ArtBase.  This emphasis on tightly focused and monitored scoping is critical, as Rhizome must capture an entire work without going beyond its artist-stated limits.  As such, Rhizome uses a toolbox of standard open-source tools for our cloning efforts.  The workhorse of our web archiving is wget, for its power and flexibility: along with being able to set very specific rules for its crawls and behavior, it can be passed output from other tools via shell scripts.  This is crucial when dealing with non-HTML objects, such as Flash and complex JavaScript.  The variety of formats and structures that net art uses (and occasionally abuses) is reflected in our analytic tools, which consist of UNIX text utilities, Perl scripting, media-to-text interpreters (such as swfmill), and hex editors.  While such detailed investigation would be impossible in large-scale web archives, Rhizome’s small-but-demanding collection provides us with both the luxury and necessity of file-level focus.
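As a rough illustration of what "very specific rules" can look like in practice, the sketch below assembles a tightly scoped wget command line. It is not Rhizome's actual configuration: Python stands in for the shell scripts mentioned above, and the function name, the example domain, and the file name `found_urls.txt` are all hypothetical. The wget options themselves are standard ones.

```python
import shlex


def build_scoped_wget(seed_url, allowed_domain, extra_urls_file=None):
    """Return an argv list for a tightly scoped wget crawl (nothing is executed)."""
    cmd = [
        "wget",
        "--recursive",                  # follow links found in HTML
        "--level=inf",                  # no depth limit; scope comes from the rules below
        "--no-parent",                  # never climb above the seed URL's directory
        "--page-requisites",            # also fetch images, CSS, and scripts a page needs
        f"--domains={allowed_domain}",  # stay on the work's own host
        "--wait=1",                     # pause one second between requests
    ]
    if extra_urls_file is not None:
        # URLs found by other tools, e.g. while inspecting Flash or JavaScript
        cmd.append(f"--input-file={extra_urls_file}")
    cmd.append(seed_url)
    return cmd


# Hypothetical work hosted under example.org/work/
command = build_scoped_wget("http://example.org/work/", "example.org", "found_urls.txt")
print(shlex.join(command))
```

Building the command as a list keeps each scoping rule visible and reviewable on its own line, which suits a workflow where every crawl is monitored by hand.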

Rhizome’s web conservation workflow

Due to Rhizome’s emphasis on small-scale conservation and our particular collection, our workflow is highly iterative, with many quality control steps before the final archival package is developed.  Following the identification of a work and an initial analysis (in case a work is too damaged to continue, or is based on server-side technologies), a work is crawled once.  We then analyze the crawl, looking for gaps (typically non-HTML links, e.g. JavaScript or Flash, and applet dependencies), as well as documenting possible properties of interest (such as heavy software/hardware reliance).  These gaps are then analyzed for URLs (either hardcoded or generated on-the-fly), which are fed back into wget for further crawling.  This process continues until as complete a copy as possible is present on Rhizome’s servers.  The complete list of URLs is then passed one final time into wget, this time generating a WARC file alongside the crawl.  This final step, while seemingly redundant, is crucial to Rhizome’s archival role: by providing provenance metadata along with the raw data of a crawl, the exact parameters, process, and post-crawl changes to a work (e.g. adjusting URLs) are available to future researchers.
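The crawl, analyze, re-crawl loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Rhizome's tooling: `fetch` is a hypothetical callable standing in for a wget run, the regular expression only catches hardcoded absolute URLs (URLs generated on-the-fly still need manual analysis), and all function and file names are invented for the example. Only the `--input-file` and `--warc-file` options are real wget flags.

```python
import re

# Matches absolute http(s) URLs embedded in text such as JavaScript source.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>)]+""")


def find_gaps(file_text, already_crawled):
    """Return URLs referenced in file_text that have not been crawled yet."""
    return set(URL_PATTERN.findall(file_text)) - set(already_crawled)


def crawl_until_complete(seed_urls, fetch, max_rounds=10):
    """Repeat crawl -> analyze -> re-crawl until no new URLs turn up.

    `fetch` is a hypothetical callable (url -> text) standing in for wget.
    """
    crawled = {}
    pending = set(seed_urls)
    for _ in range(max_rounds):
        if not pending:
            break
        for url in sorted(pending):
            crawled[url] = fetch(url)
        # Analyze everything fetched so far for references we still lack.
        new_urls = set()
        for text in crawled.values():
            new_urls |= find_gaps(text, crawled)
        pending = new_urls
    return sorted(crawled)


def final_warc_command(url_list_file, warc_name):
    """argv for the last pass: re-fetch every URL and write a WARC alongside."""
    return ["wget", f"--input-file={url_list_file}", f"--warc-file={warc_name}"]


# A tiny stand-in "site" where each file references the next one.
site = {
    "http://example.org/work/": 'var movie = "http://example.org/work/intro.swf";',
    "http://example.org/work/intro.swf": 'loadMovie("http://example.org/work/part2.swf")',
    "http://example.org/work/part2.swf": "",
}
complete_list = crawl_until_complete(["http://example.org/work/"], lambda u: site.get(u, ""))
print(complete_list)
print(final_warc_command("urls.txt", "example-work"))
```

A real pass would also check each discovered URL against the work's artist-stated limits before queuing it, so the loop never wanders outside the piece being conserved.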

Despite the individual care that we can afford each work in the ArtBase, not all works can be crawled and stored as static objects.  Many of the works in our collection are driven by server-side processes, ranging from early CGI and Perl scripting to sophisticated live data manipulation.  John Klima’s The Great Game, for example, consisted of a Java applet that pulled daily artist-generated data from a central server.  In cases such as this, no amount of crawling and quality control can create a functional copy of the work on Rhizome’s servers.  Instead, we directly contact the artist and request the necessary material to archive their work, as well as documentation to preserve its functionality.  This way, the ArtBase can ingest a completely functional copy, as well as ensure that the necessary technical documentation is present for potential future restoration.  In addition, by working with artists through the archival process, we raise awareness of the preservation needs of digital art.  By combining two conservatorial approaches – behind-the-scenes preservation and direct work with artists – Rhizome has found success in its archival mission.
