Archiving the Web as Public Service

This week’s post was written by Daniel Gomes, Head of Arquivo.pt.

Arquivo.pt: a Searchable Web Archive

Arquivo.pt is a public and free service that enables anyone to search and access historical information preserved from the Web since the 1990s. Arquivo.pt contains billions of files collected from websites in several languages (about half of its users come from outside of Portugal).

Periodically, the Arquivo.pt system automatically collects and stores information published on the web. The Arquivo.pt hardware infrastructure is hosted at its own datacenter, and it is managed by full-time dedicated staff. 

The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search, and application programming interfaces (API) that facilitate the development of added-value applications by third parties.
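As a rough illustration of how a third-party application might use such an API, the sketch below builds a full-text search request URL. The endpoint path and the parameter names (`q`, `maxItems`) are assumptions made for this example and should be checked against the official Arquivo.pt API documentation. The helper `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: verify against the official Arquivo.pt API documentation.
ASSUMED_SEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, max_items=10, endpoint=ASSUMED_SEARCH_ENDPOINT):
    """Return a GET URL for a full-text query.

    The parameter names "q" and "maxItems" are assumptions for this sketch.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    params = {"q": query, "maxItems": max_items}
    # urlencode percent-encodes the values and joins them with "&".
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("web archive", max_items=5))
```

The function only assembles the URL. Sending the request and parsing the response are left out because the response format is not described here.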

Arquivo.pt is supported by the Ministry of Science and Higher Education of Portugal. 

Showing off the Value of Web Archives

Web archives preserve web documents for future access, but they must also demonstrate their value in everyday life situations.

Thematic exhibitions and collaborative collections have been developed to illustrate the utility of web archives as a source of historical documentation. A list of all the collections preserved by Arquivo.pt is publicly available. The data sets generated to create these exhibitions or derived from the operation of the service are openly available.

Arquivo.pt has been launching complementary services to engage individuals and organizations in web archiving.

SavePageNow: Archive a Web Page Immediately

Web pages change rapidly, and sometimes web archives do not find them in time to preserve them. Arquivo.pt provides a public form where users can suggest websites to be preserved.

Arquivo.pt also launched SavePageNow, which enables users to immediately archive a set of web pages in high quality. The user enters the address of a web page, starts browsing, and all the visited content is archived. This service enables users to archive a small website autonomously.

The archived content later becomes available in Arquivo.pt.

Complete Page: Crowdsourced Digital Curation

Web archives do the best they can to thoroughly archive web pages. However, sometimes users find missing content in web archived pages (e.g. missing embedded images).

Arquivo.pt provides the “Complete page” option in the replay user interface, which automatically looks for missing content in external web archives and on the live web.

The obtained content is later integrated into Arquivo.pt and becomes available to all users. “Complete page” engages users in the curation of the web-archived collections.

Arquivo404: Fix Broken Links

Link rot has been a prevalent problem since the early days of the web. Arquivo404 is a single line of JavaScript code, installed on “404 – Page not found” error pages, that mitigates broken links.

If a given page was not found, arquivo404 generates a message that suggests an alternative link to a web archived version of the broken URL preserved at Arquivo.pt. 

Note that the message is displayed only if the page exists in Arquivo.pt. If it was not archived, the default “page not found” error message is presented. The list of web archives to be used is configurable.
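The decision logic described above can be sketched as follows. This is a Python illustration of the behaviour only, not the actual arquivo404 script (which is JavaScript). All names here (`ArchiveLookup`, `make_dict_lookup`, `suggest_archived_version`, `render_404`) are hypothetical, and the in-memory lookup stands in for a real query to a web archive.

```python
from typing import Callable, Dict, Optional, Sequence

# A lookup takes the broken URL and returns the archived copy's URL, or None.
ArchiveLookup = Callable[[str], Optional[str]]


def make_dict_lookup(index: Dict[str, str]) -> ArchiveLookup:
    """Hypothetical stand-in for a web archive: answers from an in-memory dict."""
    return lambda url: index.get(url)


def suggest_archived_version(
    broken_url: str, lookups: Sequence[ArchiveLookup]
) -> Optional[str]:
    """Try each configured archive in order; return the first archived copy found."""
    for lookup in lookups:
        archived = lookup(broken_url)
        if archived:
            return archived
    return None


def render_404(
    broken_url: str,
    lookups: Sequence[ArchiveLookup],
    default_message: str = "404 - Page not found",
) -> str:
    """Return the default message, extended with a suggestion only if a copy exists."""
    archived = suggest_archived_version(broken_url, lookups)
    if archived is None:
        return default_message
    return f"{default_message}. An archived version is available at {archived}"


if __name__ == "__main__":
    archive = make_dict_lookup({"http://example.org/old": "http://archive.example/old"})
    print(render_404("http://example.org/old", [archive]))
    print(render_404("http://example.org/never-archived", [archive]))
```

Passing the lookups as a list mirrors the configurable list of web archives: adding a second archive means appending another lookup, and the first one that holds a copy wins.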

Memorial: Preserve Your Old Website

There are many historical websites that provide valuable information but are no longer updated and require significant resources to be kept online. Moreover, costs grow as websites become older and dangerous security issues frequently occur. 

The Arquivo.pt Memorial offers high-quality storage of websites’ content with the possibility of maintaining their original domains. This way, the website content remains searchable through live-web search engines.

The links to internal pages on the website are also redirected to the corresponding web-archived pages to avoid broken links from external pages.

Training and Education on Web Preservation

Arquivo.pt has been raising awareness about the importance of web preservation. It issued a set of recommendations for developing preservable sites and has been promoting a free training programme, composed of four modules:

  • New ways of searching the past: presents the search and access services available at Arquivo.pt and targets any Internet user;
  • Well publish to well preserve: discusses recommendations for publishing preservable websites and targets web authors;
  • Automatic processing of information preserved from the Web: presents the Arquivo.pt APIs and targets web developers;
  • Web archiving – Do-it-yourself!: teaches how to adequately acquire, store, and replay web content and targets information professionals.

The Arquivo.pt Award annually distinguishes innovative works based on the historical information preserved by Arquivo.pt. The Arquivo.pt awards began in 2018, and the 15 works awarded so far clearly demonstrate the utility of web archives.

The members of the Arquivo.pt team have been publishing technical and scientific articles related to web archiving in open access since 2008, including the book The Past Web: Exploring Web Archives (Green Open Access). All the developed software is available as free, open-source projects.

Main Challenge: Spread the Word About Arquivo.pt!

The Arquivo.pt project began in 2007, and it has been running as a public service since 2013. However, most people in Portugal and all over the world have never heard of it. Getting people’s attention is a major challenge, especially in the online world.

As most online information and services are apparently available for free, web archives must compete with the Internet giants (e.g. Google, TikTok, or Meta) for web users’ attention. If you find Arquivo.pt useful and want to support it: Spread the word about Arquivo.pt!
