Web Archiving at the National Széchényi Library

This week’s post was written by Dr. Gyula Kalcsó, web archiving team leader at the National Széchényi Library in Budapest, Hungary.

The National Széchényi Library Web Archive was established in 2017 with the aim of providing a representative overview of online content intended for the Hungarian public or related to Hungary, as part of our cultural heritage. The workflow was developed in several stages, testing a wide range of software and selecting the tools best suited for each task. This work has been facilitated by the fact that the NSZL Web Archive has been a member of the International Internet Preservation Consortium (IIPC) since 2018, which continuously publishes recommendations for each task, both on the software to be used and on the workflows.

A screenshot of the home page of the NSZL Web Archive.

The NSZL Web Archive collects information in three ways: from a selection of the most important Hungarian websites, from the main news sources covering specific events, and from the Hungarian web space in general. A limited range of scientific, cultural, educational and public content is collected selectively. The general collection covers public websites registered under the .hu domain or belonging to other domains but targeting a Hungarian audience. Harvesting only covers servers from which automatic downloading of content is technically possible, and the library respects any restrictions that site owners set for harvesting software (such as those declared in robots.txt).

For archived web content, NSZL aims to establish a long-term preservation model. To respect privacy and copyright rules, only a small part of the collection is publicly available. Websites are made public only if NSZL has the content owner's permission or if the content was produced with public money. The rest of the archive is available only within a dedicated network, primarily for research purposes.

The websites that make up the thematic sub-collections are selected by librarians, and the sites are archived several times per year. They typically contain websites and blogs; they exclude social media, which cannot be harvested by a robot, as well as online periodicals, which are kept in a separate collection. In addition to the websites of institutions, organizations and companies, the pages of professionals and artists working on a given topic can also be included in the sub-collections. Seed lists consist of several hundred to several thousand URLs. We continuously update and expand these lists, and add new topics to the archive every few months. The materials of these selective collections are stored in a closed archive in order to guarantee long-term preservation and future research use. Only a small fraction of these selected websites is available through our open demo collection: those for which we have the copyright owner's permission for public access, or for which no individual contract is required.

A screenshot of the demo collection home page.

We are also setting up sub-collections focused on major national or global events. These collections are based on selected articles and sections of the biggest news portals, the websites of the institutions involved, thematic homepages, blogs, Wikipedia articles, etc. Harvests usually start some weeks before the main event (if we know its exact date) and end when press coverage of the event has largely subsided. These websites are typically harvested weekly. This collection is not publicly available and can only be used for research purposes in the NSZL building.

Beyond the selective (thematic or event-based) harvests, since 2018 we have aimed to make snapshot harvests of a representatively large part of the Hungarian web space once or twice a year. This means harvesting more than a million websites, from the start page to a depth of at least two levels, excluding very large files in order to save storage space. The initial URLs are collected from several sources: public lists of URLs in the Hungarian domain, links to Hungarian domains and sub-domains found in earlier harvests, the .hu "zone file" from the Internet Archive, and the website addresses selected for the thematic collections or recommended for them (these also include addresses outside the .hu domain). The materials of these snapshot collections are stored in a closed archive in order to guarantee long-term preservation and future research use.
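Assembling one snapshot seed list from several heterogeneous sources, as described above, amounts to merging, normalizing, and de-duplicating URL lists. A minimal illustrative sketch (not NSZL's actual tooling; the normalization shown is deliberately simplistic):

```javascript
// Hypothetical sketch: merge snapshot-harvest seed lists from several
// sources (registry lists, links from earlier harvests, a zone file,
// thematic seeds) into one de-duplicated, sorted list.
function mergeSeedLists(...lists) {
  const seen = new Set();
  for (const list of lists) {
    for (const raw of list) {
      try {
        // URL parsing lowercases the host and normalizes the path;
        // we additionally drop fragments and a trailing slash.
        const u = new URL(raw.trim());
        u.hash = "";
        seen.add(u.toString().replace(/\/$/, ""));
      } catch {
        // skip malformed entries (zone files and link dumps contain noise)
      }
    }
  }
  return [...seen].sort();
}

const seeds = mergeSeedLists(
  ["http://example.hu/", "http://example.hu"], // public registry list
  ["http://blog.example.hu/"],                 // links from earlier harvests
  ["not a url", "http://example.hu"]           // zone file with noise
);
console.log(seeds); // duplicates collapse: two unique seeds remain
```

In practice a real pipeline would also canonicalize `www.` prefixes, schemes, and query strings before de-duplicating.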

The public collection is harvested using the Web Curator Tool, which is also supported by the IIPC. The harvesting settings for these sites are much more finely tuned, and we continuously update the settings while quality-checking the collected content, in order to deliver the highest-quality material. The harvesting itself is done with a variety of software, depending mainly on which tool gives better results for the site being saved: Heritrix, Brozzler, Webrecorder, ArchiveWeb.page or HTTrack, usually at a limited depth of the original website. Archived items are replayed using OpenWayback, pywb or SolrWayback, and/or Conifer, the online version of Webrecorder. Items saved by HTTrack in a file-system structure are served directly through the web server. We also provide screenshots of the original homepages, links to archived copies made by the Internet Archive, and links to the original site. SolrWayback provides full-text search across the archived websites, and hit lists can be further refined by domain name, file type and year of archiving.

The material in the web archive is mainly used for social science and digital humanities research, typically by researchers interested in a particular topic, who consequently tend to search the thematic collections. The NSZL web archive has also built a collection on the Russian-Ukrainian conflict with a full-text search engine, and big data research has been carried out on this material (mentioned in an IIPC blog post). The dataset is available through an interactive Power BI interface (unfortunately only in Hungarian). We are working to improve the conditions for using the archive for research purposes.

A screenshot of the Power BI interface of the Ukrainian war news collection.

Author Bio:

The author leads the web archiving team of the National Széchényi Library Digital Humanities Centre, Department of Digital Philology and Web Archiving. He is also a university lecturer in linguistics. His main fields are born-digital archiving, corpus building, and natural language processing. He holds a PhD in linguistics and has been publishing for 20 years on corpus building for linguistics, linguistic corpus analysis, and digital humanities theory and practice.

Archiving the Web as Public Service

This week’s post was written by Daniel Gomes, Head of Arquivo.pt.

Arquivo.pt: a Searchable Web Archive

Arquivo.pt is a public and free service that enables anyone to search and access historical information preserved from the Web since the 1990s. Arquivo.pt contains billions of files collected from websites in several languages (about half of its users come from outside of Portugal).

Periodically, the Arquivo.pt system automatically collects and stores information published on the web. The Arquivo.pt hardware infrastructure is hosted at its own datacenter, and it is managed by full-time dedicated staff. 

The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search, and application programming interfaces (APIs) that facilitate the development of added-value applications by third parties.
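As an illustration of how a third-party application might use the full-text search API, the sketch below builds a request URL for Arquivo.pt's TextSearch endpoint. The endpoint path and parameter names (`q`, `from`, `to`, `maxItems`) follow the publicly documented API, but treat them as assumptions and verify against the current Arquivo.pt API documentation:

```javascript
// Sketch: build a request URL for the Arquivo.pt full-text search API.
// Parameter names are based on the published TextSearch API docs.
function buildTextSearchUrl(query, { from, to, maxItems = 10 } = {}) {
  const params = new URLSearchParams({ q: query, maxItems: String(maxItems) });
  if (from) params.set("from", from); // 14-digit timestamp, e.g. "19960101000000"
  if (to) params.set("to", to);
  return `https://arquivo.pt/textsearch?${params.toString()}`;
}

// Example: search archived pages mentioning "expo98" from the late 1990s.
const url = buildTextSearchUrl("expo98", {
  from: "19960101000000",
  to: "19991231235959",
  maxItems: 5,
});
console.log(url);
```

A client would then fetch this URL and parse the JSON list of archived results.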

Arquivo.pt is supported by the Ministry of Science and Higher Education of Portugal. 

Showing off the Value of Web Archives

Web archives preserve web documents for future access, but they must also demonstrate their value in everyday life situations.

Thematic exhibitions and collaborative collections have been developed to illustrate the utility of web archives as a source of historical documentation. A list of all the collections preserved by Arquivo.pt is publicly available. The data sets generated to create these exhibitions or derived from the operation of the service are openly available.

Arquivo.pt has been launching complementary services to engage individuals and organizations in web archiving.

SavePageNow: Archive a Web Page Immediately

Web pages change rapidly, and sometimes web archives cannot find them in time to preserve them. Arquivo.pt provides a public form where users can suggest websites to be preserved.

Arquivo.pt also launched SavePageNow, which enables users to immediately archive a set of web pages in high quality. The user enters a web page address and starts browsing, and all the visited content is archived. This service enables users to archive a small website autonomously.

The archived content later becomes available in Arquivo.pt.

Complete Page: Crowdsourced Digital Curation

Web archives do the best they can to thoroughly archive web pages. However, sometimes users find missing content in web archived pages (e.g. missing embedded images).

Arquivo.pt provides a “Complete page” option in the replay user interface, which automatically looks for the missing content in external web archives and on the live web.

The retrieved content is later integrated into Arquivo.pt and becomes available to all users. “Complete page” engages users in the curation of the web-archived collections.

Arquivo404: Fix Broken Links

Link rot has been a prevalent problem since the early days of the web. Arquivo404 is a single line of JavaScript code, installed on “404 – Page not found” error pages, that mitigates broken links.

If a given page is not found, Arquivo404 generates a message suggesting an alternative link to a web-archived version of the broken URL preserved at Arquivo.pt.

Note that the message is displayed only if the page exists in Arquivo.pt. If it was not archived, the default “page not found” error message is presented. The list of web archives to be queried is configurable.
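The core logic can be sketched as follows. This is an illustrative reconstruction, not the official Arquivo404 script, and the lookup-result shape used here is hypothetical: given the result of querying a web archive for the broken URL, either build a suggestion message or fall back to the default 404 text.

```javascript
// Illustrative sketch of Arquivo404-style behavior (hypothetical data shape):
// lookupResult is { archivedUrl, timestamp } when a capture exists, or null.
function buildSuggestion(brokenUrl, lookupResult) {
  if (!lookupResult || !lookupResult.archivedUrl) {
    return null; // never archived: the page keeps its default 404 message
  }
  const year = lookupResult.timestamp.slice(0, 4);
  return `This page was not found, but an archived version from ${year} ` +
         `is available at ${lookupResult.archivedUrl}`;
}

// A capture exists: suggest the archived copy.
console.log(buildSuggestion("http://example.pt/old", {
  archivedUrl: "https://arquivo.pt/wayback/20080101000000/http://example.pt/old",
  timestamp: "20080101000000",
}));

// No capture: the caller leaves the default "page not found" message in place.
console.log(buildSuggestion("http://example.pt/never", null));
```

The real script additionally supports a configurable list of web archives to query, as noted above.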

Memorial: Preserve Your Old Website

There are many historical websites that provide valuable information but are no longer updated and require significant resources to keep online. Moreover, costs grow as websites age, and dangerous security issues frequently occur.

The Arquivo.pt Memorial offers high-quality storage of website content, with the possibility of maintaining the original domains. This way, the website content remains searchable through live-web search engines.

Links to the website’s internal pages are also redirected to the corresponding web-archived pages, to avoid broken links from external pages.

Training and Education on Web Preservation

Arquivo.pt has been raising awareness of the importance of web preservation. It issued a set of recommendations for developing preservable sites and has been running a free training programme composed of four modules:

  • New ways of searching the past: presents the search and access available at Arquivo.pt and targets any Internet user;
  • Well publish to well preserve: discusses recommendations for publishing preservable websites and targets web authors;
  • Automatic processing of information preserved from the Web: presents the Arquivo.pt APIs and targets web developers;
  • Web archiving – Do-it-yourself!: teaches how to adequately acquire, store, and replay web content and targets information professionals.

The Arquivo.pt Award annually distinguishes innovative works based on the historical information preserved by Arquivo.pt. The awards began in 2018, and the 15 works distinguished so far clearly demonstrate the utility of web archives.

Members of the Arquivo.pt team have been publishing open-access technical and scientific articles related to web archiving since 2008, including the book The Past Web: Exploring Web Archives (Green Open Access). All the software developed is available as free, open-source projects.

Main Challenge: Spread the Word About Arquivo.pt!

The Arquivo.pt project began in 2007, and it has been a public service in production since 2013. However, most people in Portugal and around the world have never heard of it. Getting people’s attention is a major challenge, especially in the online world.

As most online information and services are apparently free, web archives must compete with the Internet giants (e.g. Google, TikTok, or Meta) for web users’ attention. If you find Arquivo.pt useful and want to support it: spread the word about Arquivo.pt!

Web Archiving Roundup: July, 2019

The Web Archiving Section and the Metadata and Digital Object Section will hold a joint event during the SAA Annual Meeting in Austin, TX. Join us on Saturday, August 3rd for a debate on descriptive metadata and web archiving.

The 2019 Archive-It Partner Meeting coincides with SAA’s Annual Meeting; registration is still open.

Graphic designer Sam Henri Gold has been archiving Apple ads from the 1970s to the present; you can take a look at the archive directly from the article.

ArchiveSpark 3.0 is now available; take a look at the updates on GitHub.

Check out this article about a High School student’s experience working for the Archives Unleashed team.

The latest issue of the Newsletter from the ESRC National Centre for Research Methods includes an article on research challenges using web archives for social research.

Registration is still open for the Specialized Data Curation Workshop hosted by the Data Curation Network at Washington University in St. Louis.

The Digital Preservation Coalition is crowd-sourcing a list of endangered digital materials. Nominations close on Friday August 30th, 2019.

Web Archiving Roundup: May, 2019

UPDATE – Join the ALCTS Metadata Interest Group Meeting during ALA Annual 2019 for a presentation and Q&A on the Library of Congress Web Archiving Program on Sunday, June 23, 2019, 9:00-10:00AM at the Marriott Marquis.

Now accepting nominations for the SAA Web Archiving Section’s 2019-2020 Steering Committee: https://www2.archivists.org/groups/web-archiving-section/now-accepting-2019-2020-steering-committee-nominations.

Registration for the IIPC Web Archiving Conference ends May 24. The conference will be hosted by the National and University Library of Croatia in Zagreb and coincides with the 15th anniversary of the Croatian Web Archive (HAW).

For members of the Digital Preservation Coalition, the DPC Web Archiving & Preservation Task Force is inviting delegates to a meeting on July 18 in London. The meeting is free for DPC members; registration ends July 10.

The IIPC Content Development Group is asking for contributions to its Climate Change Collection and its Artificial Intelligence Collection.

Ben Els, Digital Curator at the National Library of Luxembourg, gives us a glimpse into the effort to capture the Luxembourg elections.

Seth Denbo, Director of Scholarly Communication and Digital Initiatives at the American Historical Association, strikes a chord on the challenges of scale in an article titled Data Overload.

The Atlantic has an article on the implications of AI vacuum cleaners from tech companies.

You can now read the paper presented at the 2018 World Library and Information Congress by the Library of Congress; the paper is titled Institutions as Social Media Collector: Lessons Learned from the Library of Congress.

The National Library of the Netherlands has recently launched a collection of archived websites from the Chinese Community in the Netherlands.

ECAL (École cantonale d’art de Lausanne) has launched a website called Information Mesh celebrating the 30th anniversary of the World Wide Web.

Web Archiving Roundup: March, 2019

Help the SAA Electronic Records Section find out more about the most useful resources for the electronic records community. You can find the survey and a bit more about their project here.

Registration for Archivematica Camp in Vancouver, June 24-26, is still open.

Early bird registration for IIPC Web Archiving Conference is now open. You can also take a look at the program.

The International Journal of Digital Humanities has an article on web archiving initiatives in Europe. The article is titled Web Archives as Data Resource for Digital Scholars.

The Ivy Plus Libraries Confederation has launched the 2018 Brazilian Presidential Transition Web Archive.

A Library of Congress Web Archives blog post from Jesse Johnston, Senior Digital Collections Specialist at LOC, gives a walkthrough of sorting through a set of US Government PDFs.

Celebrate the 30th anniversary of the World Wide Web by exploring internet archives through emulated legacy browsers with Rhizome!

Another fun article to celebrate the web’s 30th anniversary looks at Australia’s ugly ’90s websites.

The National Library of Ireland recently announced their 2018 Web Archiving collection.

Web Archiving Roundup: February, 2019

You can still register for AASLH’s webinar Web Archiving: What, Why, and How; the webinar will take place on February 28 at 3:00 p.m. EST.

Archive-It will host an advanced training session on February 26 at 11:00 AM Pacific Time (US & Canada); the session will focus on Archive-It as a Reference Tool.

The National Videogame Foundation, in collaboration with Bath Spa University and funded by the British Academy and the Leverhulme Trust, released a white paper titled Game Over? Curating, Preserving and Exhibiting Videogames.

Richard Ovenden, Bodley’s Librarian, has an article in The Economist about digital preservation.

Ilya Kreymer, Webrecorder Lead Developer, shares his Code4Lib 2019 presentation slides.

A new release of web crawler project Heritrix 3 is now available.

The Middlebury Facebook group Middlebury Memes for Crunchy Teens is to be archived by Special Collections.

PhD candidate Rhiannon Lewis writes a response to the DPC’s Briefing Day on web archiving for community and individual archives.

A new version of the Web Archiving Integration Layer (WAIL) for macOS is now available.

Shawn M. Jones writes a blog post for the Web Science and Digital Libraries Research Group at Old Dominion University regarding Google+ shutting down.

Stanford Libraries receives a $25 million grant to preserve the Silicon Valley Archives.

PANDORA, Australia’s web archive, initially established by the National Library of Australia, celebrates its 10-year anniversary.

Web Archiving Roundup: January, 2019

Here is your first Web Archiving Roundup of 2019!

Web Archiving Roundup: November 21, 2018

Here’s your Web Archiving Roundup for November, 2018:

Weekly web archiving roundup: September 18, 2015

Weekly web archiving roundup for the week of September 18, 2015:

Weekly web archiving roundup: September 10, 2015

Weekly web archiving roundup for the week of September 10, 2015: