Web Archiving at the National Széchényi Library

On April 6, 2023 (updated April 17, 2023). By Allison Fischbach. In Guest post, International News, Tools & Techniques.

This week’s post was written by Dr. Gyula Kalcsó, web archiving team leader at the National Széchényi Library in Budapest, Hungary.

The National Széchényi Library Web Archive was established in 2017 with the aim of providing a representative overview of online content intended for the Hungarian public or related to Hungary, as a part of our cultural heritage. The workflow was developed in several stages, testing a wide range of software and selecting the tools best suited to the task. This work has been facilitated by the fact that the NSZL Web Archive has been a member of the International Internet Preservation Consortium (IIPC) since 2018, where recommendations are continuously being made for each task, both in terms of the software to be used and the workflows.

A screenshot of the home page of the NSZL Web Archive.

The NSZL Web Archive collects information in three ways: from a selection of the most important Hungarian websites, from main news sources related to specific events, and from the Hungarian web space in general. A limited range of scientific, cultural, educational and public content is collected selectively. The general collection will cover public websites registered under the .hu domain or belonging to other domains but targeting a Hungarian audience. The web harvesting only covers servers from which it is technically possible to ensure automatic downloading of content. When harvesting, the library will take into account the restrictions set for the harvesting software by the owner of the site.
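The restrictions that site owners set for harvesting software are usually published in a robots.txt file. As a minimal sketch, and not a description of NSZL's actual configuration, the check can be expressed with Python's standard `urllib.robotparser`. The robots.txt content, the `example-harvester` user agent and the example.hu URLs below are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that the given robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt published by a site owner.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical candidate URLs for a harvest.
candidates = [
    "https://example.hu/",
    "https://example.hu/private/report.html",
    "https://example.hu/news/2023.html",
]

harvestable = allowed_urls(EXAMPLE_ROBOTS, "example-harvester", candidates)
print(harvestable)
```

In this sketch the page under /private/ is filtered out and the other two URLs remain. Production crawlers such as Heritrix apply their own robots.txt handling.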

For archived web content, NSZL wants to establish a long-term preservation model. In order to respect privacy and copyright rules, only a small part of the collection is publicly available. Websites are publicly available only if NSZL already has the permission of the content owner or if the content was produced with public money. The rest of the archive will be available only within a dedicated network, primarily for research purposes.

The websites that make up the thematic sub-collections are selected by librarians, and the sites are archived several times per year. The sub-collections typically contain websites and blogs. They do not contain social media that cannot be harvested by a robot, nor online periodicals, which are kept separately. In addition to the websites of institutions, organizations and companies, the pages of professionals and artists working on the topic can also be included in the sub-collections. Seed lists consist of several hundred to several thousand URLs. We continually update and expand these lists, and add new topics to the archive every few months. The materials of these selective collections are stored in a closed archive in order to guarantee long-term preservation and future research activities. Only a small fraction of these selected websites is available through our open demo collection: those for which we have the copyright owner's permission for public access or for which no individual contract is required.

A screenshot of the demo collection home page.

We also set up sub-collections focusing on major national or global events. The materials of these collections are based on selected articles or sections of the biggest news portals, websites of the relevant institutions, thematic homepages, blogs, Wikipedia articles etc. Harvests usually start some weeks before the main event (if we know the exact date of an event) and end when the press coverage of the event has largely disappeared. These websites are usually harvested weekly. This collection is not publicly available and can only be used for research purposes in the NSZL building.

Beyond the selective (thematic or event-based) harvests, since 2018 we have tried to make snapshot harvests once or twice a year of a representatively large part of the Hungarian web space. This means harvesting more than a million websites from the starting page to a depth of at least two levels, excluding large files in order to save storage space. The initial URLs are collected from several sources: public lists of URLs from the Hungarian domain, links to Hungarian domains and sub-domains found in earlier harvests, the .hu “zonefile” from the Internet Archive, and website addresses that have been selected for thematic collections or recommended by the corresponding template (these also include addresses beyond the .hu domain). The materials of these archived collections are stored in a closed archive in order to guarantee long-term preservation and future research activities.
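The idea of harvesting from a starting page down to a fixed depth while skipping large files can be illustrated with a small breadth-first traversal. This is only a sketch over a hypothetical in-memory link graph, not the archive's actual crawler configuration. The `links` and `sizes` mappings, the depth of 2 and the 10 MB size limit are all assumptions made for the example.

```python
from collections import deque


def crawl_to_depth(start, links, sizes, max_depth=2, max_bytes=10_000_000):
    """Breadth-first walk from start, at most max_depth link hops deep.

    links maps a URL to the URLs it links to; sizes maps a URL to its size
    in bytes (unknown sizes count as 0). Files larger than max_bytes are
    neither kept nor followed.
    """
    seen = {start}
    queue = deque([(start, 0)])
    kept = []
    while queue:
        url, depth = queue.popleft()
        if sizes.get(url, 0) > max_bytes:
            continue  # skip large files to save storage space
        kept.append(url)
        if depth == max_depth:
            continue  # depth limit reached: keep the page, do not follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return kept


# Hypothetical site structure: each key links to the listed pages.
links = {
    "/": ["/a", "/big.zip"],
    "/a": ["/a/b"],
    "/a/b": ["/a/b/c"],
}
sizes = {"/big.zip": 50_000_000}

snapshot = crawl_to_depth("/", links, sizes)
print(snapshot)
```

Here "/a/b/c" is three hops from the start page and "/big.zip" exceeds the size limit, so neither ends up in the snapshot.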

The public collection is harvested with the use of the Web Curator Tool, also supported by IIPC. The harvesting settings for these sites are much more finely adjusted, and we constantly update the settings and quality-check the content we collect, in order to deliver the highest quality material. The harvesting itself is done using a variety of software, depending mainly on which tool gives better results for the site we are saving. These harvests are made with Heritrix, Brozzler, Webrecorder, ArchiveWeb.Page or HTTrack, usually to a limited depth of the original website. Archived items are displayed with OpenWayback, PyWb, SolrWayback and/or Conifer, the online version of Webrecorder. The items archived by HTTrack in a file system structure can be viewed through the webserver. We also provide screenshots of the original homepages, links to archived copies made by the Internet Archive, and links to the original site. SolrWayback provides full-text search of the archived websites. Lists of hits can be refined further by sorting by domain name, file type and year of archiving.

The material in the web archive is mainly used for social science and digital humanities research, typically by researchers interested in a particular topic, who consequently tend to search thematic collections. At the same time, the NSZL (in Hungarian: OSZK) Web Archive has built a collection on the Russian-Ukrainian conflict with a full-text search engine, and has also carried out big data research on the basis of this material (mentioned in an IIPC blog post). The dataset is available through an interactive Power BI interface (unfortunately only in Hungarian). We are working to improve the conditions for the archive to be used for research purposes.

A screenshot of the Power BI interface of the Ukrainian war news collection.

Author Bio:

The author is the leader of the web archiving team of the National Széchényi Library Digital Humanities Centre, Department of Digital Philology and Web Archiving. He is also a university lecturer in linguistics. His main fields are born-digital archiving, corpus building and natural language processing. He holds a PhD in linguistics. He has been publishing for 20 years on corpus building for linguistics, linguistic corpus analysis, and digital humanities theory and practice.

Archiving the Web as Public Service

On November 18, 2022. By Amanda Greenwood. In Guest post, International News, Tools & Techniques.

This week’s post was written by Daniel Gomes, Head of Arquivo.pt.

Arquivo.pt: a Searchable Web Archive

Arquivo.pt is a public and free service that enables anyone to search and access historical information preserved from the Web since the 1990s. Arquivo.pt contains billions of files collected from websites in several languages (about half of its users come from outside of Portugal).

Periodically, the Arquivo.pt system automatically collects and stores information published on the web. The Arquivo.pt hardware infrastructure is hosted at its own datacenter, and it is managed by full-time dedicated staff.

The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search, and application programming interfaces (API) that facilitate the development of added-value applications by third parties.

Arquivo.pt is supported by the Ministry of Science and Higher Education of Portugal.

Showing off the Value of Web Archives

Web archives preserve web documents for future access, but they must also demonstrate their value in everyday life situations.

Thematic exhibitions and collaborative collections have been developed to illustrate the utility of web archives as a source of historical documentation. A list of all the collections preserved by Arquivo.pt is publicly available. The data sets generated to create these exhibitions or derived from the operation of the service are openly available.

Arquivo.pt has been launching complementary services to engage individuals and organizations in web archiving.

SavePageNow: Archive a Web Page Immediately

Web pages change rapidly, and sometimes web archives cannot find them in time to preserve them. Arquivo.pt provides a public form where users can suggest websites to be preserved.

Arquivo.pt also launched SavePageNow, which enables users to immediately archive a set of web pages in high quality. The user enters a web page and starts browsing, and all the visited content is archived. This service enables users to archive a small website autonomously.

The archived content later becomes available in Arquivo.pt.

Complete Page: Crowdsourced Digital Curation

Web archives do the best they can to thoroughly archive web pages. However, sometimes users find missing content in web archived pages (e.g. missing embedded images).

Arquivo.pt provides the “Complete page” option in the replay user interface, which automatically looks for missing content in external web archives and on the live web.

The obtained content is later integrated in Arquivo.pt and becomes available for all the users. “Complete page” engages users in the curation of the web-archived collections.

Arquivo404: Fix Broken Links

Link rot has been a prevalent problem since the early days of the web. Arquivo404 is a single line of JavaScript code, to be installed on “404 – Page not found” error pages, that mitigates broken links.

If a given page was not found, arquivo404 generates a message that suggests an alternative link to a web archived version of the broken URL preserved at Arquivo.pt.

Note that the message is displayed only if the page exists in Arquivo.pt. If it was not archived, the default “page not found” error message is presented. The list of web archives to be used is configurable.
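The decision logic described above can be modelled in a few lines. The real arquivo404 is a JavaScript snippet, so the Python below is only an illustrative sketch of the fallback behaviour, not its implementation. The `not_found_message` function, the message wording, the dictionary-backed archives and the archive.example URL are all hypothetical.

```python
from typing import Callable, Optional

# A lookup takes a broken URL and returns the URL of an archived copy, or None.
Lookup = Callable[[str], Optional[str]]

DEFAULT_MESSAGE = "404 - Page not found"


def not_found_message(broken_url: str, archives: list[Lookup]) -> str:
    """Suggest an archived copy if any configured archive has one.

    Archives are consulted in list order; if none holds the page,
    the default error message is returned unchanged.
    """
    for lookup in archives:
        archived = lookup(broken_url)
        if archived is not None:
            return f"{DEFAULT_MESSAGE}. An archived version is available: {archived}"
    return DEFAULT_MESSAGE


# Hypothetical archives, backed by plain dictionaries for the example.
first_archive: dict[str, str] = {}
second_archive = {
    "https://example.pt/old-page":
        "https://archive.example/20150101000000/https://example.pt/old-page",
}
configured = [first_archive.get, second_archive.get]

msg_found = not_found_message("https://example.pt/old-page", configured)
msg_missing = not_found_message("https://example.pt/never-saved", configured)
print(msg_found)
print(msg_missing)
```

Passing the archives as an ordered list mirrors the point that the set of web archives to consult is configurable.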

Memorial: Preserve Your Old Website

There are many historical websites that provide valuable information but are no longer updated and require significant resources to be kept online. Moreover, costs grow as websites become older and dangerous security issues frequently occur.

The Arquivo.pt Memorial offers high-quality preservation of websites’ content with the possibility of maintaining their original domains. This way, the website content remains searchable through live-web search engines.

The links to internal pages on the website are also redirected to the corresponding web-archived pages to avoid the occurrence of broken links from external pages.

Training and Education on Web Preservation

Arquivo.pt has been raising awareness about the importance of web preservation. It issued a set of recommendations for developing preservable sites and has been promoting a free training programme, composed of four modules:

New ways of searching the past: presents the search and access available at Arquivo.pt and targets any Internet user;
Well publish to well preserve: discusses recommendations for publishing preservable websites and targets web authors;
Automatic processing of information preserved from the Web: presents the Arquivo.pt APIs and targets web developers;
Web archiving – Do-it-yourself!: teaches how to adequately acquire, store, and replay web content and targets information professionals.

The Arquivo.pt Award annually distinguishes innovative works based on the historical information preserved by Arquivo.pt. The Arquivo.pt awards began in 2018, and the 15 works awarded so far clearly demonstrate the utility of web archives.

The members of the Arquivo.pt team have been publishing technical and scientific articles related to web archiving in open access since 2008, including the book The Past Web: Exploring Web Archives (Green Open Access). All the developed software is available as free open source projects.

Main Challenge: Spread the Word About Arquivo.pt!

The Arquivo.pt project began in 2007, and it has been running as a public service since 2013. However, most people in Portugal and all over the world have never heard about it. Getting people’s attention is a major challenge, especially in the online world.

As most online information and services are apparently available for free, web archives must compete with the Internet giants (e.g. Google, TikTok, or Meta) for web users’ attention. If you find Arquivo.pt to be useful and want to support it: Spread the word about Arquivo.pt!

Web Archiving Roundup: July, 2019

On July 25, 2019. By Elisa. In Conferences, Events, International News, Monthly roundup, News, Publications.

The Web Archiving and Metadata Digital Object Sections will hold a joint event during the SAA Annual Meeting in Austin, TX. Join us on Saturday, August 3rd for a debate on descriptive metadata and web archiving.

The 2019 Archive-It Partner Meeting coincides with SAA’s Annual Meeting; registration is still open.

Graphic designer Sam Henri Gold has been archiving Apple ads from the 1970s to the present. You can take a look at the archive directly from the article.

ArchiveSpark 3.0 is now available. Take a look at the updates on GitHub.

Check out this article about a high school student’s experience working for the Archives Unleashed team.

The latest issue of the Newsletter from the ESRC National Centre for Research Methods includes an article on research challenges using web archives for social research.

Registration is still open for the Specialized Data Curation Workshop hosted by the Data Curation Network at Washington University in St. Louis.

The Digital Preservation Coalition is crowd-sourcing a list of endangered digital materials. Nominations close on Friday August 30th, 2019.

Web Archiving Roundup: May, 2019

On May 20, 2019 (updated May 28, 2019). By Elisa. In Blog news, Conferences, Elections, Events, IIPC, International News, Monthly roundup, News, Publications.

UPDATE – Join the ALCTS Metadata Interest Group Meeting during ALA Annual 2019 for a presentation and Q&A on the Library of Congress Web Archiving Program on Sunday, June 23, 2019, 9:00-10:00AM at the Marriott Marquis.

Now accepting nominations for the SAA Web Archiving Section’s 2019-2020 Steering Committee: https://www2.archivists.org/groups/web-archiving-section/now-accepting-2019-2020-steering-committee-nominations.

Registration for the IIPC Web Archiving Conference ends May 24. The conference will be hosted by the National and University Library of Croatia in Zagreb and coincides with the 15th anniversary of the Croatian Web Archive (HAW).

For members of the Digital Preservation Coalition, the DPC Web Archiving & Preservation Task Force is inviting delegates to a meeting on July 18 in London. The meeting is free for DPC members; registration ends July 10.

IIPC Content Development Group is asking for contributions to their Climate Change Collection, and their Artificial Intelligence Collection.

Ben Els, Digital Curator at the National Library of Luxembourg, gives us a glimpse into the effort to capture the Luxembourg elections.

Seth Denbo, Director of Scholarly Communication and Digital Initiatives at the American Historical Association, strikes a chord on the challenges of scale in an article titled Data Overload.

The Atlantic has an article on the implication of AI vacuum cleaners from tech companies.

You can now read the paper presented by the Library of Congress at the 2018 World Library and Information Congress, titled Institutions as Social Media Collector: Lessons Learned from the Library of Congress.

The National Library of the Netherlands has recently launched a collection of archived websites from the Chinese Community in the Netherlands.

ECAL (École cantonale d’art de Lausanne) has launched a website called Information Mesh celebrating the 30th anniversary of the World Wide Web.

Web Archiving Roundup: March, 2019

On March 18, 2019. By Elisa. In Conferences, Events, IIPC, International News, Monthly roundup, News, Publications.

Help the SAA Electronic Records Section find out more about the most useful resources for the electronic records community. You can find the survey and a bit more about their project here.

Registration for Archivematica Camp in Vancouver, June 24-26, is still open.

Early bird registration for IIPC Web Archiving Conference is now open. You can also take a look at the program.

The International Journal of Digital Humanities has an article on web archiving initiatives in Europe. The article is titled Web Archives as Data Resource for Digital Scholars.

The Ivy Plus Libraries Confederation have launched the 2018 Brazilian Presidential Transition Web Archive.

Library of Congress Web Archives blog post from Jesse Johnston, Senior Digital Collections Specialist at LOC, gives a walkthrough into sorting through a set of US Government PDFs.

Celebrate the 30th anniversary of the World Wide Web by exploring internet archives through emulated legacy browsers with Rhizome!

Another fun article to celebrate the web’s 30th anniversary looking at Australia’s ugly 90’s websites.

The National Library of Ireland recently announced their 2018 Web Archiving collection.

Web Archiving Roundup: February, 2019

On February 25, 2019. By Elisa. In Blog news, Conferences, Events, International News, Monthly roundup.

You can still register for AASLH’s webinar Web Archiving: What, Why, and How. The webinar will take place on February 28 at 3:00 pm EST.

Archive-It will host an advanced training session on February 26 at 11:00 AM Pacific Time (US & Canada). The session will focus on Archive-It as a Reference Tool.

The National Videogame Foundation in collaboration with Bath Spa University and funded by the British Academy and Leverhulme Trust, released a White Paper titled Game Over? Curating, Preserving and Exhibiting Videogames.

Richard Ovenden, Bodley’s Librarian, has an article in The Economist about digital preservation.

Ilya Kreymer, Webrecorder Lead Developer, shares his Code4Lib 2019 presentation slides.

A new release of web crawler project Heritrix 3 is now available.

Middlebury Facebook group Middlebury Memes for Crunchy Teens to be archived by Special Collections.

PhD candidate Rhiannon Lewis writes a response to the DPC’s Briefing Day on web archiving for community and individual archives.

New version of Web Archiving Integration Layer (WAIL) for macOS is now available.

Shawn M. Jones writes a blog post for the Web Science and Digital Libraries Research Group at Old Dominion University regarding Google+ shutting down.

Stanford Libraries receives a $25 million grant to preserve Silicon Valley Archives.

PANDORA, Australia’s Web Archive, initially established by the National Library of Australia, celebrates its 10-year anniversary.

Web Archiving Roundup: January, 2019

On January 28, 2019. By Elisa. In Conferences, Events, International News, Monthly roundup, News, Publications.

Here is your first Web Archiving Roundup of 2019!

Archive-It releases a feature for adding scoping rules in bulk to web archiving seeds.
Archive-It is hiring. See job postings for a Partner Coordinator and a Technical Support Specialist here.
Anthony Vaver, Local History Librarian and Archivist at Westborough Public Library, recounts the web archiving experience at Westborough.
Natalie Baur gives an inside scoop on Piloting the 2018 Government Web Archive.
Wikipedia is 18 years old, and to celebrate we look back at their effort to rescue broken links through the Wayback Machine.
Cobweb will develop web archive collections with their new platform.
Technology policy reporter Cat Zakrzewski addresses the issue of social media and politics.
Valerie Schafer and Benjamin G. Thierry take us back to the 90s as the turning decade for Internet and the Web.
Gethin Rees, Lead Curator of Digital Mapping at the British Library, talks about extracting place names from Web Archives at Archives Unleashed Vancouver.
The Web Science and Digital Libraries Research Group at Old Dominion University have a great recap of their 2018 activities, such as the Joint Conference on Digital Libraries (JCDL ’18) and several web archiving articles.

Web Archiving Roundup: November 21, 2018

On November 21, 2018 (updated December 17, 2018). By Elisa. In Conferences, Events, IIPC, International News, Monthly roundup, News.

Here’s your Web Archiving Roundup for November, 2018:

The IIPC Web Archiving Conference is accepting papers for the 2019 conference. The conference will be hosted by the National and University Library of Croatia in Zagreb and coincides with the 15th anniversary of the Croatian Web Archive (HAW). The deadline for submissions is January 7th, 2019.
Registration is still open for the Archive-It Mid-Winter Partner Meeting on January 29, 2019 in Seattle, WA. The meeting will coincide with ALA Mid-Winter.
The Library of Congress is sending a call to anyone interested in web archive data sets by providing access through their Web Archive Data Sets Experiments page hosted by the Library of Congress Labs.
The National Library Board of Singapore announces project to archive 180,000 Singapore websites.
Japan’s National Diet Library announces English language user interface to their web archiving project (WARP).
Brooklyn Public Library’s Brooklyn Collection is currently asking for Brooklyn local content suggestions as part of the Internet Archive’s Community Webs program (https://communitywebs.archive-it.org/).

Weekly web archiving roundup: September 18, 2015

On September 18, 2015. By webarchrt. In International News, News, Uncategorized, Weekly roundup.

Weekly web archiving roundup for the week of September 18, 2015:

Writing WARCs (Kristinn Sigurðsson): A call for better library support for the WARC format
TVEyes Wide Shut: Ruling on Broadcast Archiving Service Undermines Fair Use (Vera Ranieri): A court decision in Fox News’s case against TVEyes has troubling implications for innovative uses of media
Copyright Fair Use: 1 Win, 1 Maybe and Two Losses for TVEyes (Ira Sacks and Erika Stallings): Another take on the TVEyes court case decision
10 Years of the Web Archive–What have we saved: A talk given by Andy Jackson, Web Archiving Technical Lead, at the IIPC General Assembly 2015
The Internet Is Failing The Website Preservation Test (Ron Miller): A journalist explores the impact of “Internet memory loss”
When using an archive could put it in danger (Peter Webster): A Conservative party episode in the UK illustrates how a wider understanding of web archiving could make archived material more vulnerable
IIT Humanities Professor Discusses What Happens To Our Data Once We Die (Karis Hustad): Professor Mel Hogan will teach a class this fall on death, memories, and decay in the digital world
What does the web remember of its deleted past? (Dr. Anat Ben-David): “On March 30 2010, the country-code top-level domain of the former Yugoslavia, .yu, was deleted from the Internet…”

Weekly web archiving roundup: September 10, 2015

On September 10, 2015. By webarchrt. In Blog news, International News, News, Weekly roundup.

Weekly web archiving roundup for the week of September 10, 2015:

Playing at Web Archiving: Why not use an interactive fiction engine to build a “web archiving simulator” that takes you through the core web archiving life-cycle?
Cooking Up a Solution to Link Rot: Sixty-six to seventy-three percent of web addresses in the footnotes of three Harvard law journals and nearly 50 percent of web addresses in U.S. Supreme Court decisions from 1996 to 2012 suffered from reference rot.
The Story of How the Internet Came to India–An Insider’s Account: On August 15, 1995, Videsh Sanchar Nigam Limited (VSNL) launched public Internet access in India. IBNLive is commemorating 20 years of the Internet in India with a special series.
Beginner’s Guide to Web Archives, Part 3: Coming to the end of his short time working on web archives at the British Library, science-policy intern Peter Spooner reflects on the process of creating a web archive special collection.
So You Want to Get Started in Web Archiving? Fear not, there are a few places to visit to get a quick sense of what’s going on!
LANL’s Time Travel Portal, Part 1 and Part 2: The Time Travel portal, launched in February 2015, provides cross-system discovery of Mementos.
Viral Content in the UK Domain: A look at how the web archive of the UK Domain approaches malware.
Using Mathematica to Plot Locations Mentioned in Web Archives: What could we do with Mathematica 10’s (relatively) new geographic visualization services?
Soft Launch of WebArchives.ca: WebArchives.ca provides access to the University of Toronto’s Archive-It Collection of Canadian Political Parties and Political Interest Groups, which they have been collecting since late 2005.