This week’s post was written by Dr. Gyula Kalcsó, web archiving team leader at the National Széchényi Library in Budapest, Hungary.
The National Széchényi Library (NSZL) Web Archive was established in 2017 with the aim of providing a representative overview of online content intended for the Hungarian public or related to Hungary, as part of our cultural heritage. The workflow was developed in several stages, testing a wide range of software and selecting the tools best suited to the task. This work has been helped by the fact that the NSZL Web Archive has been a member of the International Internet Preservation Consortium (IIPC) since 2018, where recommendations on both software and workflows are continuously shared for each task.
A screenshot of the home page of the NSZL Web Archive.
The NSZL Web Archive collects information in three ways: from a selection of the most important Hungarian websites, from main news sources related to specific events, and from the Hungarian web space in general. A limited range of scientific, cultural, educational and public content is collected selectively. The general collection covers public websites registered under the .hu domain, or belonging to other domains but targeting a Hungarian audience. Web harvesting only covers servers from which content can technically be downloaded automatically. When harvesting, the library takes into account the restrictions that the owner of the site has set for harvesting software.
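As an illustration of what honouring a site owner's restrictions can look like in practice, here is a minimal sketch using the robots.txt parser from Python's standard library. It is not the library's actual configuration: the user agent name "example-harvester" and the sample rules are assumptions made up for this example, and real harvesting tools normally handle this step themselves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site owner might publish it.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt allows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network access is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.hu/",
    "https://example.hu/private/report.html",
]
# "example-harvester" is an invented user agent name.
print(allowed_urls(SAMPLE_ROBOTS, "example-harvester", candidates))
```

In this sketch, only the URLs outside the disallowed path remain in the harvest list.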
For archived web content, NSZL wants to establish a long-term preservation model. In order to respect privacy and copyright rules, only a small part of the collection is publicly available. Websites are publicly available only if NSZL already has the permission of the content owner or if the content was made with public money. The rest of the archive is available only within a dedicated network, primarily for research purposes.
The websites that make up the thematic sub-collections are selected by librarians, and the sites are archived several times per year. The sub-collections typically contain websites and blogs. They do not contain social media that cannot be harvested by a robot, nor online periodicals, which are kept separately. In addition to the websites of institutions, organizations and companies, the pages of professionals and artists working on the given topic can also be included in the sub-collections. Seed lists consist of several hundred to several thousand URLs. We continuously update and expand these lists, and add new topics to the archive every few months. The materials of these selective collections are stored in a closed archive in order to guarantee long-term preservation and future research. Only a small fraction of the selected websites is available through our open demo collection: those for which we have permission for public access from the copyright owner, or for which no individual contract is required.
A screenshot of the demo collection home page.
We are also setting up sub-collections focusing on major national or global events. The materials of these collections are based on selected articles and sections of the biggest news portals, websites of the relevant institutions, thematic homepages, blogs, Wikipedia articles, etc. Harvests usually start some weeks before the main event (if we know its exact date) and end when press coverage of the event has largely died down. These websites are usually harvested weekly. This collection is not publicly available and can only be used for research purposes in the NSZL building.
Beyond the selective (thematic or event-based) harvests, we have tried since 2018 to make snapshot harvests once or twice a year of a representatively large part of the Hungarian web space. This means harvesting more than a million websites from the starting page to a depth of at least two levels, excluding large files in order to save storage space. The initial URLs are collected from several sources: public lists of URLs from the Hungarian domain, links containing Hungarian domains and sub-domains found in earlier harvests, the .hu “zonefile” from the Internet Archive, and website addresses that have been selected for thematic collections or recommended through the corresponding template (these also include addresses beyond the .hu domain). The materials of these archived collections are stored in a closed archive in order to guarantee long-term preservation and future research.
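To make the seed collection step more concrete, the following sketch merges URLs from several sources into one de-duplicated list of hosts and applies simple depth and file size limits. It is only an illustration under stated assumptions: the function names, the depth limit used as a cut-off, and the 10 MB size threshold are all invented for this example and are not the archive's actual settings.

```python
from urllib.parse import urlsplit

MAX_DEPTH = 2                       # assumed cut-off; the post says "at least" two levels
MAX_FILE_BYTES = 10 * 1024 * 1024   # hypothetical 10 MB limit for "large" files


def normalize_seed(raw):
    """Reduce a raw seed entry to a lower-case host name, or None if unusable."""
    raw = raw.strip()
    if not raw:
        return None
    if "://" not in raw:
        # Entries from domain lists often come without a scheme.
        raw = "http://" + raw
    # urlsplit(...).hostname is already lower-cased; it is None for bad input.
    return urlsplit(raw).hostname


def merge_seed_sources(*sources):
    """Merge several iterables of raw URLs into a sorted list of unique hosts."""
    hosts = set()
    for source in sources:
        for raw in source:
            host = normalize_seed(raw)
            if host:
                hosts.add(host)
    return sorted(hosts)


def within_limits(depth, size_bytes=None):
    """Decide whether a resource at this link depth and size should be kept."""
    if depth > MAX_DEPTH:
        return False
    # An unknown size (None) is accepted in this sketch.
    return size_bytes is None or size_bytes <= MAX_FILE_BYTES


domain_list = ["Example.hu", "https://example.hu/page"]
thematic_seeds = ["http://blog.example.org/", ""]
print(merge_seed_sources(domain_list, thematic_seeds))
```

Reducing each entry to its host name means that the same site reported by different sources, with or without a scheme or path, ends up as a single seed.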
The public collection is harvested with the Web Curator Tool, which is also supported by IIPC. The harvesting settings for these sites are much more finely tuned: we update them continuously and quality-check the collected content in order to deliver the highest quality material. The harvesting itself is done with a variety of software, depending mainly on which tool gives better results for the site we are saving. These harvests are made with Heritrix, Brozzler, Webrecorder, ArchiveWeb.Page or HTTrack, usually to a limited depth of the original website. The archived items are displayed with OpenWayback, PyWb and SolrWayback, and/or with Conifer, the online version of Webrecorder. Items archived by HTTrack in a file system structure can be viewed through the web server. We also provide screenshots of the original homepages, links to archived copies made by the Internet Archive, and links to the original site. SolrWayback provides full-text search of the archived websites. Lists of hits can be refined further by domain name, file type and year of archiving.
The material in the web archive is mainly used for social science and digital humanities research, typically by researchers interested in a particular topic, who consequently tend to search the thematic collections. In addition, the NSZL Web Archive has built a collection on the Russian-Ukrainian conflict with a full-text search engine, and has also carried out big data research on the basis of this material (mentioned in an IIPC blog post). The dataset is available through an interactive Power BI interface (unfortunately only in Hungarian). We are working to improve the conditions for using the archive for research purposes.
A screenshot of the Power BI interface of the Ukrainian war news collection.
Author Bio:
The author is the leader of the web archiving team at the National Széchényi Library Digital Humanities Centre, Department of Digital Philology and Web Archiving. He is also a university lecturer in linguistics. His main fields are born-digital archiving, corpus building and natural language processing. He holds a PhD in linguistics. He has been publishing for 20 years on corpus building for linguistics, linguistic corpus analysis, and digital humanities theory and practice.