Weekly web archiving roundup for the week of December 10, 2014:
- “Archives, Access, and the Sounds of New York City: An Interview with Kenneth Goldsmith”, from Brian Fauteux. The interview touches on some interesting web archiving concepts.
- “Exploding ARC Files with Warcbase”, from Ian Milligan. Warcbase is an open-source platform for managing web archives built on Hadoop and HBase.
- “URI agnostic deduplication on content discovered at crawl time”, from Kristinn Sigurðsson. In his last blog post, Kris showed that URI agnostic duplicates accounted for about 5% of all duplicates by volume (bytes) and about 11% by URI count. However, that analysis was limited to looking up content digests that had been discovered in a previous crawl.
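
As a rough illustration of the idea in the last item, the sketch below treats two fetches as duplicates when their payload digests match, regardless of URI, and records each new digest as it is seen so that duplicates within the same crawl are also caught. It is not Kris's implementation: the class name `CrawlDeduplicator`, the `seen_digests` index, the sample URIs, and the choice of SHA-1 are all assumptions made for the example.

```python
import hashlib


class CrawlDeduplicator:
    """Hypothetical URI-agnostic deduplicator keyed on payload digest."""

    def __init__(self, previous_digests=None):
        # Digest -> URI index, optionally seeded from an earlier crawl.
        self.seen_digests = dict(previous_digests or {})

    def check(self, uri, payload):
        """Return the URI that first carried this payload, or None if it is new."""
        # SHA-1 is an assumption here; any stable content digest would do.
        digest = hashlib.sha1(payload).hexdigest()
        original = self.seen_digests.get(digest)
        if original is None:
            # First sighting: remember it so later fetches in this crawl match.
            self.seen_digests[digest] = uri
        return original


dedup = CrawlDeduplicator()
first = dedup.check("http://example.org/a.png", b"same bytes")
second = dedup.check("http://example.com/copy.png", b"same bytes")
third = dedup.check("http://example.org/b.png", b"other bytes")
```

Here `second` resolves to the first URI even though the two URIs differ, while `first` and `third` are treated as new content. Seeding `previous_digests` from an earlier crawl would cover the previous-crawl case as well.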