This week’s post was written by Daniela Major, Early Stage Researcher in Digital Humanities at the School of Advanced Study.
How the Use of Web Sources Fosters Collaboration
My journey with web archives began in 2018 when, as part of the ROSSIO project, a free, open-access infrastructure hub, I started working as a research scholar for Arquivo.pt, the Portuguese web archive. Back then, my knowledge of web archives was superficial at best; I came from a very solid history background. My focus was on the 18th and 19th centuries. I had spent my BA and master’s poring over a multitude of sources, from 17th-century treatises on kingship to 19th-century periodicals. Considering what I had been taught, little connected web sources to historical sources. I had been taught that history ended more or less in the 1970s. Or, perhaps in the 1980s with a little stretch.
The web from the 1990s and the early 2000s finds itself in an uncomfortable place; it is no longer recognizable to many internet users today, but it is still part of the same sense of “modernity” that prevents it from being analyzed as a historical source. For historians and history students, the internet still feels “new”, even though the first website was launched more than 30 years ago, in 1991. Historians do study events from 30 years ago. They study the fall of the Berlin Wall, the end of the Soviet Union, and the Bosnian War. Even so, web sources have gone somewhat under the radar for many historians, especially in many history degrees. I studied “computing applied to history” in my BA, but I learned how to apply its tools to paper or digitized sources because digital tools can greatly enhance the study of those historical sources. The web itself is a historical source which in many ways is similar to traditional sources, while posing very different challenges from those historians are used to.
I would argue that historians are exceptionally well-equipped to deal with a variety of sources. Through courses such as “History of the Book,” they learn about paper quality, watermarks, and folios. They learn paleography to be able to read scripts. They learn ancient languages. The study of web sources simply poses different problems: the retrieval and analysis of the HTML, the issues surrounding metadata, and how to deal with a great quantity of sources. The web provides the possibility to perform large-scale analysis in a way that previous sources rarely allow. Dealing with a greater volume of data will be a common occurrence that historians will have to face from now on. This opens up a world of possibility for historians, but it also forces us to ask difficult questions: What knowledge should be imparted to history students? Should history students learn how to code? If they do, will they enjoy it? As a rule, young adults don’t go into history because they want to learn how to code. They are curious about the past, and they want to study it. Many are fascinated by the physicality of historical sources. Coding, if taught, has to be seen as an auxiliary method to the research, and not the point of research itself.
However, I believe these efforts by historians must be accompanied by other options. Acquiring new skills involves vital conversations about learning curves and time management. Students, whether they are studying for their BA, writing a master’s dissertation or doing a PhD thesis, have limited time they can dedicate to learning new skills. Coding, in particular, can be time-consuming for humanities students because so much of it relies on self-teaching and DIY, which is very different from the reading and discussion model that humanities students are used to. As a result, one of the main challenges for web archives, as the main providers of born-digital sources, is to create an environment that flattens the learning curve by providing more and better access to the material contained in archives.
In many ways, we are on the right path. SHINE, a historical search engine developed by the UK Web Archive, allows users to perform text searches and download a list of website links as a CSV file. The data comes from the Internet Archive. SHINE also allows for trend analysis using keywords. Arquivo.pt is another web archive which allows users to perform text searches and to download a list of URLs. Crucially, Arquivo.pt includes a link to the text file of each URL.
There are still problems with these models. URLs are often repeated because some pages have been crawled at different points in time. Retrieval of specific parts of the HTML, including text, continues to be challenging. However, these are very encouraging signs that web archivists are focusing more on accessibility.
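To make the repeated-URL problem concrete, here is a minimal Python sketch of how a researcher might collapse a downloaded list of links so that each URL appears only once. It is an illustration, not a description of how SHINE or Arquivo.pt format their exports: the column names `url` and `timestamp`, the 14-digit timestamp format, and the sample rows are all assumptions made for this example.

```python
import csv
import io

# Hypothetical sample export: the column names and the 14-digit
# timestamps (YYYYMMDDhhmmss) are assumptions for illustration only.
SAMPLE = """url,timestamp
http://example.org/,20050101000000
http://example.org/,20010101000000
http://example.org/about,20030101000000
"""

def dedupe_urls(csv_text, url_column="url", time_column="timestamp"):
    """Return one row per URL, keeping the earliest capture of each."""
    earliest = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = row[url_column].strip()
        stamp = row[time_column].strip()
        # Fixed-width numeric timestamps sort correctly as plain strings.
        if url not in earliest or stamp < earliest[url][time_column].strip():
            earliest[url] = row
    return list(earliest.values())

unique = dedupe_urls(SAMPLE)
for row in unique:
    print(row["url"], row["timestamp"])
```

Keeping the earliest capture is only one possible choice; a study of how a page changed over time would instead want to keep every capture and group them by URL.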
This is a continued effort. Above all, it means collaboration between researchers and the archives and libraries. Digital archivists and curators, librarians, and archivists cannot be excluded from the conversations on what historians want from born-digital data. Additionally, technical leads, web developers, and data analysts have to be brought to the fore in humanities research. To make the most of web sources and explore them to their full potential, knowledge cannot be gatekept.
Daniela Major is an Early Stage Researcher in Digital Humanities at the School of Advanced Study, where she is working on a PhD project on the media coverage of the European Union. Previously, she studied history and worked as a researcher at Arquivo.pt, the Portuguese Web Archive. She is very interested in the preservation and curation of digital materials.