Introducing the DocNow App

This week’s post was written by Zakiya Collier, Community Manager at Documenting the Now.

This week the Documenting the Now project announces the release of DocNow, an application for appraising, collecting, and gathering consent for Twitter content. DocNow reimagines the relationship between content creators and social media analysts (archivists and researchers) by addressing two of the most challenging issues in social media archiving practice: consent and appraisal.

The Documenting the Now Project is happy to release version 1.0 of our open-source tool, free for anyone to use. Read on to learn about the app, what it does, and how to engage with the DocNow project team for support and feedback.

Over the last seven years, Documenting the Now has helped to foster an environment where a broader range of cultural memory workers can learn about web archiving tools and practices and become involved with web archiving networks. This has largely been achieved by practicing transparency and by inviting people who have traditionally been left out of established web archiving networks to participate in the project: students, activists, and archivists who represent marginalized communities and who work in community-centered organizations, HBCUs, public libraries, community-based archives, and tribal libraries and archives.

Documenting the Now was a response to the need among scholars, activists, archivists, and other memory workers for new tools that would provide easily accessible and user-friendly means to collect, visualize, analyze, and preserve web and social media content to better document public events. It also aimed to respond to questions and concerns related to ethics, safety, intellectual property, and access in the collection, preservation, and dissemination of Twitter data in particular.

Documenting the Now has also developed community-centered web and social media archiving tools that prioritize both care for content creators and robust functionality for users:

  • Twarc – a command line tool and Python library for collecting tweet data from Twitter’s official API (a short Python sketch of collecting with twarc follows this list)
  • Hydrator – a desktop application for turning Tweet ID datasets back into tweet data to use in your research
  • Social Humans – a label system to specify the terms of consent for social media content
  • The Catalog – a community-sourced clearinghouse to access and share tweet identifier datasets
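
For those who want to see what collecting with this suite looks like in code, here is a minimal sketch using twarc. It assumes twarc version 1's Python interface and a set of Twitter API credentials; the placeholder keys, the hashtag, and the output file name are illustrative, not part of any prescribed DocNow workflow.

    # Minimal sketch: collect recent tweets for a hashtag with twarc (v1 API assumed)
    # and save only the tweet IDs, which can be shared under Twitter's terms of service.
    from twarc import Twarc

    t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    with open("tweet_ids.txt", "w") as out:
        for tweet in t.search("#docnow"):
            out.write(tweet["id_str"] + "\n")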

Continuing the project’s work of developing tools that embody ethical practices for social media archiving, the DocNow app joins this suite. It supports appraising, collecting, and gathering consent for Twitter content, and it adds several new features:

  • Trends tab to view trending topics across the globe in real time
  • Explore tab to view content by users, media, URLs, and related hashtags all on one screen
  • Live testing and refining of collecting parameters on recent tweets
  • Tweets per hour calculator to easily identify Twitter bot accounts
  • Search and collect tweets back in time via the Search API and forward in time with the Stream API
  • Activate toggle to start collecting tweets and send a notification tweet to encourage transparency and communication in Twitter data collection
  • Collections tab to share information about your collection with the public
  • “Find Me” and Insights Overview features to specify and gather consent using Social Humans labels
  • Download Tweet ID archive for sharing following Twitter’s terms of service

The DocNow app also works in concert with other Documenting the Now tools, creating a four-step social media archiving journey for users:

Step 1: Collect content with the DocNow App by activating a search. Set collection limits and explore insights as your collection grows.
Step 2: Download your archive from the DocNow App, which includes a Tweet Viewer, Tweet IDs, and media files.
Step 3: Hydrate the Tweet IDs in the archive’s tweets.csv file back into full tweets using DocNow’s Hydrator desktop application (a Python sketch of this step follows the list).
Step 4: Describe your collection and share your Tweet IDs with other researchers by adding them to the DocNow Catalog.
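
The Hydrator is a desktop application, but the same hydration step can be sketched with twarc's Python library for readers who prefer a script. This is illustrative only: it assumes twarc v1 credentials and a plain text file of tweet IDs, one per line (if you start from the archive's tweets.csv, extract the ID column into such a file first).

    # Sketch: turn a list of tweet IDs back into full tweet JSON ("hydration").
    import json
    from twarc import Twarc

    t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    with open("tweet_ids.txt") as ids, open("tweets.jsonl", "w") as out:
        # hydrate() yields the full tweet for every ID that still exists on Twitter;
        # deleted or protected tweets are simply skipped.
        for tweet in t.hydrate(line.strip() for line in ids):
            out.write(json.dumps(tweet) + "\n")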

Ways to Use DocNow
There are three ways to use DocNow: joining the community instance, running DocNow locally on a computer, or installing an instance of DocNow in the cloud. The community instance is a great way to get familiar with the tool before committing to running an instance, but those with development skills may want to administer their own instance of DocNow locally or in the cloud.

  1. Join Documenting the Now’s hosted community instance
  2. Run DocNow locally on your machine
  3. Install your own instance in the cloud

For help with installation and getting started, the Documenting the Now team will host community conversations; dates will be announced soon. More information about the DocNow App is available on the Documenting the Now website.

Documenting the Now is seeking community input on all of our features as we continue to develop DocNow. Please join our Slack channel via our website, or email us at info@docnow.io.

Learning How to Web Archive: A Graduate Student Perspective

This week’s post was written by Amanda Greenwood, Project Archivist at Union College’s Schaffer Library in Schenectady, New York.

While it is not a new aspect of archival work, web archiving is an important practice that has advanced significantly in the past ten years. Developers are creating new web archiving tools and improving existing ones, but the process can be challenging because of dynamic web content, changing web technology, and ethical concerns. The need to preserve web-based material has become a priority for improving preservation, creating “greater equity and justice in our preservation practices, and [finding] ways to safeguard the existence of historical records that will allow us in future to bear witness, with fairness and truth and in a spirit of reconciliation, to our society’s response to COVID-19” and other social justice issues. The explosion of web archiving initiatives at institutions and organizations of all kinds has made web archivists of us all; but how difficult is it for someone with zero experience to learn this important skill?

From October 2020 to May 2021, I held the Anna Radkowski-Lee Web Archives Graduate Assistantship at the University at Albany, State University of New York. My responsibility was to manage the web archives collections in Archive-It through web crawling, scheduling, reading host reports, and rescoping. These activities would culminate in a meeting at the end of my assistantship to discuss appraisal of the active collections. While I was excited to learn this archival skill, I had no experience with web development, so it took me a few months to learn to interpret the host reports because I did not understand some elements of the URLs they contained. For example, I did not know that parts of a URL could tell me a website was a WordPress site, and I had never heard of “robots.txt” before. My supervisor, University Archivist Greg Wiedeman, therefore spent a lot of time at the beginning of the assistantship teaching me how to read the host reports, and I needed to learn HTTP status codes and other parameters that would help me rescope the crawls more efficiently.
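
For readers new to these terms, a small Python sketch shows the kind of clues described above. The URL and site here are made up; the point is only that a "/wp-content/" path segment usually signals a WordPress site, and that a site's robots.txt (and the HTTP status code the request returns) tells a crawler what it may fetch.

    # Illustrative only: spotting a WordPress-style URL and checking robots.txt.
    import requests
    from urllib.parse import urlparse

    url = "https://www.example.edu/wp-content/uploads/2020/05/report.pdf"
    if "/wp-content/" in urlparse(url).path:
        print("This looks like a WordPress site")

    # robots.txt sits at the root of a host and lists rules for crawlers.
    resp = requests.get("https://www.example.edu/robots.txt", timeout=10)
    print(resp.status_code)   # e.g. 200 if present, 404 if the site has none
    print(resp.text[:200])    # the first few "User-agent" / "Disallow" rules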

I really appreciated the support from everyone at the Archive-It Help Center; they were instrumental in helping me solve a lot of problems related to the crawls. However, I felt the Help Center website’s instructions and tutorials were a bit difficult to follow at times. They are probably easier for more experienced users to understand, or at least for users who are familiar with web development. The other frustrating element was that managing the collections via web crawling was extremely time-consuming, but I was only allowed to work 20 hours a week under the assistantship. It was quite a substantial role, and I realized the job required someone working on it full-time, not only 4-5 hours a day.

The unpredictability, frequency, and length of website updates also proved to be a large obstacle. I would work hard to efficiently scope one collection and schedule it, but the host report would come back with an error because the website was being updated. I often had to put those collections aside and return to them at a later date, but then the update would yield a whole new set of problems with the host report, and I would spend more time rescoping and rerunning test crawls for those collections. Test crawls would also require multiple runs and constant rescoping, which meant that I had to run longer test crawls, and some of those would take four weeks. The host reports were difficult to decipher because of crawler traps or long strings of characters. Learning how to read the host reports took the longest amount of time.

Additionally, I would get excited about a promising test crawl, but after QA I would notice that dynamic content like interactive web elements had not been captured. I would then need to rescope and rerun the test crawls and use other web crawling tools like Conifer. I looked into other open-source tools such as WARCreate, WAIL, and Wget, but Conifer ultimately gave me what I needed at the time.

Moreover, we were often presented with impromptu requests to archive a faculty website quickly, because a faculty member was retiring or the host was taking the website down, so prioritizing those new projects was stressful. Some of the faculty websites used advanced web development elements, so I needed to use the aforementioned open-source web crawling tools to capture the site and upload the WARC file into Archive-It.

In sum, having zero web development and coding experience did not impede my ability to learn web archiving, although having it would have helped me significantly. A subscription to Archive-It can be helpful because of the technical support and video tutorials, but open access tools and software are plentiful, and the community is helpful and supportive in terms of training and documentation. With my web archiving training, I was able to help other institutions initiate their own web archiving programs, and I hope to continue on this trajectory in the future.

Source:

Jones EW, Sweeney S, Milligan I, Bak G, and McCutcheon J-A. 2021. Remembering is a form of honouring: preserving the COVID-19 archival record. FACETS 6: 545–568. doi:10.1139/facets-2020-0115

Community Input for Web Archiving at Small and Rural Libraries

This week’s post was written by Grace McGann (Moran), Teen Librarian, Tipp City Public Library.

Before I begin, I want to acknowledge that Tipp City Public Library is on the unceded lands of the Kaskaskia, Shawnee, Hopewell, and Myaamia.

The Tipp City Public Library has been open for nearly a century now, in a town rich with history. We are located in the downtown historic district of a city whose population falls just below 10,000. Because we are such an ingrained part of the community, it is difficult to work here and not notice how our patrons care about our city’s history. This blog post briefly examines the importance of involving our respective communities in the collection development process.

My previous experience with web archiving at the University of Illinois drove me to apply to the Community Webs program at the Internet Archive. The program was built to help small cultural heritage institutions create a more diverse web history using Archive-It technology at zero cost.

Having never built a web archive from the ground up, I faced one question first: where to start? I find that with any collection, the problem isn’t necessarily finding materials; it is choosing a specific policy of selection and collection. In a previous blog post, I made an argument for separate collection development policies for web archiving. Having one in place makes evaluating websites simpler, especially when the discovery process relies on outside entities such as community members.

After joining Community Webs just a few months ago, I had a curated list of websites for capture. However, these only came from my cursory searches around the internet. In order to move beyond those first websites and create a more representative archive, I leveraged the local network. First, I reached out to a sociology professor at the University of Dayton, Dr. Leslie Picca. I knew that she was doing research on race and could possibly have connections to colleagues in the digital humanities. She led me to Dr. Todd Uhlman of the history department at UD.

Connecting with Dr. Uhlman changed my thinking about how to build this web archive. In learning about the digital humanities work he is doing in the Greater Dayton Area, I found valuable websites to be captured and preserved.

After speaking to Dr. Uhlman and agreeing to capture his content, I was contacted by community members who wanted websites to be preserved. After evaluating and capturing the websites, I created a community input form. While I am still waiting for this form to gain more traction, my theory is that community input is crucial for web archiving at small cultural heritage institutions.

I am not alone in this assertion. Papers about community archival practices demonstrate an urgent need for this sort of involvement. Zavala et al. (2017), though speaking more about physical archives, put it this way:

There is no reason why government or university archives could not engender post-custodial practice, foster community autonomy and promote shared governance, if only they are willing to share power and authority with the communities they have historically left out.

By changing the way we engage in collection development, we challenge the systems of oppression that have been institutionalized within record-keeping institutions (whether we are aware of them or not).

I have a lot left to say about web archiving, but I want to drive this point home. Archives, whatever form they take, provide cultural value. Culture does not exist without community. Therefore, it’s actually pretty simple: communities should help create archives.

Web Archiving Many Voices: Documenting COVID-19 and Marginalized Communities at Arizona State University

This week’s post was written by Shannon Walker, University Archivist, Arizona State University.

The COVID-19 pandemic gives us, as memory workers, the unusual opportunity to document a crisis as it is occurring. As with many institutions, University Archives staff at Arizona State University quickly recognized the need to document a rapidly unfolding history and its impact on our institution and community through the use of our web archiving tools.

Several factors highlighted the urgency of the task. First, this was likely a once-in-a-lifetime occurrence; we did not have time to develop a new process or procedure for “how to document a pandemic.” Second, the crisis itself was temporal: it was important to capture sites that were continuously updating, knowing that in six months or a year they might no longer refer to COVID-19 (we hope!).

Deciding What to Capture

While there is no official mission statement for the web archiving program at ASU, we have developed draft general guidelines for prioritizing the websites we capture, preserve, and make available for research. Essentially:

The purpose of the Web Archiving Program at the ASU Library is to develop best practices for capturing born-digital, web-based information produced by the University. The first priority is identifying sites that complement the University Archives’ Collection Development Policy. The Web Archiving Program will also attempt to capture web-based publications for the broader ASU umbrella, especially as mandated by the University Archives Records Management Program, as well as current events and specialized collections.

In addition to being guided by our internal guidelines, we also had in mind ASU’s Charter:

ASU is a comprehensive public research university, measured not by whom it excludes, but by whom it includes and how they succeed; advancing research and discovery of public value; and assuming fundamental responsibility for the economic, social, cultural and overall health of the communities it serves.

So, with both principles guiding our web archiving selection criteria, we sought to address some of the following questions:

  • As we document COVID-19 at ASU, how can we be inclusive?
  • As we select sites to capture, do they document the more prominent voices on campus or marginalized communities or both?
  • Who might we be excluding? Can excluded groups be included through web archiving?
  • When a researcher is using these materials 10 years down the road, what will they see? What will they not see?

Creating the COVID-19 @ ASU Response collection

The initial phase of this effort included identifying, reviewing, and capturing our own institution’s websites. We wanted to document the university’s official response as well as the experiences of employees and students. We sought out and captured seeds from the Office of the President, the Office of the Provost, and ASU Now, one of our campus news resources. We also captured a site from the University’s governing body, the Arizona Board of Regents.

Additionally, we sought to document the impact of the pandemic on the student experience. We were able to identify seeds from the Admissions Office, Financial Aid, University Housing, and Greek Life (fraternities/sororities). In addition, a few individual schools created lists of resources for their students. Many of these lists focused on the mental, emotional, and financial consequences of the pandemic, recognizing the impact on the whole student, far beyond the logistics of classes and technology.

As we were identifying seeds to capture we were fortunate to locate a few that provided the opportunity to document the voice and experiences of underrepresented groups on campus. We say fortunate because we know that marginalized groups do not always have a prominent presence on the website, but they are an important part of what we hoped to document. Some of the sites included in this group were the Alliance of Indigenous Peoples, ASU’s American Indian Policy Institute, the International Student and Scholars Center, and the American Indian Student Support Services. 

Importantly, many of these sites addressed issues of equity and the digital divide, which was a significant issue for students here at Arizona State University as we went to online classes. 

Challenges and Unexpected Discoveries

As with any project, there were challenges that made us profoundly aware of the limits of web archiving. For one, not all campus experiences are captured on published campus websites. We know that many of the raw experiences of students, staff and faculty were being documented on social media sites. However, we chose not to capture these sites because of privacy concerns.  In addition, some website captures did not go well. After reviewing them, we could see that elements of the original site were not captured. We continue to work on improving those crawls. 

As an added bonus to our efforts, we were able to identify and collect materials in other digital formats (namely PDF and JPG) that further document COVID-19 at ASU. An additional unexpected discovery was that a few of the sites we targeted for crawls were already being well captured by the Internet Archive. In those cases, we needed to figure out how to point to them as part of our curated collection.

Conclusion

Our efforts to identify, capture, and collect websites related to COVID-19 at Arizona State University are only a part of the pandemic story for future researchers. We hope that the sites we chose, and were able to capture successfully, will present a varied and diverse perspective within the limits of web archiving technology. It is not a perfect tool, but it is timely and nimble, and it is becoming an increasingly important part of our toolkit as we seek to expand the narrative of our school’s history.

Pitching Web Archiving to your Institution: Where to Begin?

This post was written by guest blogger Andrea Belair, Archives and Special Collections Librarian at Union College in Schenectady, NY.

Does your organization engage in web archiving? Today, many organizations collect websites in some capacity, and over the years it has become easier to make the argument that web content should be collected. Most of us, even the least tech-savvy, understand that web pages are not permanent and get taken down or deleted when not maintained. Many hard lessons have already been learned from large amounts of content lost overnight, a tweet deleted by its author, or a page that is no longer available on a site. Regularly now, the public is pointed to the address of a website to get vital information, much to the detriment of those who might not have access to a computer; the website is the public record, the source of information. We have seen this recently with the COVID-19 pandemic, where most information about the disease, and about vaccination, has been published on the web.

If you intend to make a pitch to start a web archiving program at your institution, you of course need to consider your audience: what do they care about? More importantly, do you belong to a public or private institution? It is likely that your organization is creating web content, and it is just as likely that this web content can be considered institutional records of your organization.

At a former job, during a meeting about web archiving, one attendee told me that I needed to take off my records management hat and put on my archivist hat. The university archivist arrived just after that statement, running a little late. Having missed the rest of the conversation, he sat down and stated, “This is really about records management.” In collecting and archiving web content, there is an argument to be made from the collecting side, and there is an argument from the vantage point of institutional records. It can be hard to show scholarly uses for web collecting, although the case is easier to make now when you look at the remarkable web collections of the Library of Congress, for example.

However, if you’re pitching a web archiving program, think about your stakeholders. If your stakeholders are, say, the executive administration of a private college, you might want to reconsider a pitch that focuses on history and why history is important. Executives know that history is important, but they consider it the library’s job to take care of it, and they are unlikely to think it justifies a larger budget allocation. Instead, I recommend that you look at the requirements of your accrediting organization if you are at a private college, and at your local, state, and federal records regulations and mandates if you are at a public college. Web pages are official documents of the institution, and they often fall within mandated records retention schedules. For private colleges, it is very possible that the accrediting organization to which your institution belongs has records requirements as well, although I am not familiar with all accrediting institutions. If there is a legal mandate to be met, go that route. If there is an accrediting mandate to be met, go that route.

It’s not that the case for history and its importance can never be made. If there is a new mission statement or vision statement at your institution and you can connect it to your pitch, by all means: now is your chance. If there is an upcoming inauguration or similar major event, that is another opportunity, because we all want to make sure that such an important event and person are written into history.

What do you get when you hand a Master’s student a web archiving program?: Part Two

The following is a guest post by Grace Moran, Graduate Web Archiving Assistant – Library Administration, University of Illinois Library

In the previous installment of this series, I explored the complementary issues of metadata and access for web archives. In this second and final installment, the more human aspects of this year’s endeavor come to the fore: policy and personnel. I will also briefly describe how the current pandemic has informed web-archiving efforts at the University of Illinois.

If you did not read the previous blog post, let me revisit my background. I am the Graduate Web Archiving Assistant, working for Dr. Christopher Prom, the Associate Dean for Digital Strategies at the University of Illinois. This year, I have been charged with participating in day-to-day activities related to web archiving such as running crawls and doing quality assurance. I also engage in high-level organizational thinking about the future stewardship of the library’s burgeoning web archiving program. The program began in 2015, and since then the University of Illinois has captured 5 TB worth of data using our Archive-It subscription. Multiple units take part in the curatorial side of this endeavor: the University Archives, the American Library Association Archives, Faculty Papers, and the International & Area Studies Library. We are hoping to start running crawls for the Illinois History & Lincoln Collections in the near future.

As I noted, this post is focused on policy and personnel. These are two areas that currently present a challenge for my institution: we do not have centralized documentation, a web archiving-specific collection development policy, or a position other than my own dedicated to web archiving. What follows is what I envision the program looking like in the future; it will be highlighted in my final report to my supervisor at the end of the academic year.

What does policy mean for web archiving? I believe institutional web archiving policies and procedures should be composed of the following:

  • A collection development policy unique to the institution’s web archiving program (a general organization-wide collection development policy does not suffice given the unique nature of the content being collected)
  • A clear, centralized workflow outlining how crawls are to be run, troubleshooting documentation, and chain-of-command for web archiving
  • A statement on copyright and ethics in web archiving (Niu in “An Overview of Web Archiving,” cited below, touches on copyright)

Though it may be extremely obvious to some readers, it is worth saying: policy should be public. As someone who works for a public university, I am painfully aware of the importance of policy accessibility for our stakeholders.

What about personnel? Who should be running a web-archiving program? How many people should be involved? Of course, this is something that varies from institution to institution; however, my experience has made clear the need for a dedicated point person. This person could be:

  • A graduate web archiving position, like my own, working 20 hours a week to coordinate crawls across units, run Quality Assurance, and populate metadata fields.
  • A civil service or academic professional position with at least a 50% appointment to web archiving. An institution looking to grow its web archiving program should consider making this a 100% appointment for the first couple of years and then having the point person slowly transition toward additional activities related to the library’s digital strategies.

I should note here that a graduate web archiving assistant is a great way to support your web archiving program (yes, I am biased), but there are drawbacks to placing this responsibility on the shoulders of a temporary employee. If you are just beginning your program, you may find that a part-time position does not fulfill its needs. Additionally, there are advantages to long-term employees who have institutional knowledge and memory and who therefore understand the administrative history of digital programs within your organization. Time is lost when someone must be re-trained for the position annually or biannually.

Side Note: I want to make clear that graduate employees are so important. They bring a fresh set of eyes to problems, and the opportunity to learn from a graduate position can be absolutely priceless for someone like me. Please consider funding your graduate students; there are a great number out there pursuing an unfunded MLIS and facing student debt for years to come.

Finally, I want to highlight a unique opportunity given to web archiving programs this past year. COVID-19 has devastated lives the world over; it has also provided inspiration for innovation and creativity. This has been true at the University of Illinois at Urbana-Champaign. From a novel saliva-based PCR test, to rigorous testing protocols, to creating a ventilator in 12 days, the institution has tackled this problem head-on. Bethany Anderson, the Natural & Applied Sciences Archivist, and I have collaborated over the past year to run crawls to document COVID-19 at the university and celebrate what we have accomplished in one of the darkest years we have seen. To check out pages we have documented, you can visit https://archive-it.org/collections/13880. This is a great example of how web archiving allows us to document important moments now and preserve the historical record (which is increasingly electronic) for the benefit of future researchers.

I hope that you have identified with some part of this blog series; my hope is that if we create a dialogue about our triumphs and struggles, we can all learn something.

Sources Mentioned

Niu, Jinfang, “An Overview of Web Archiving” (2012). School of Information Faculty Publications. 308.
https://scholarcommons.usf.edu/si_facpub/308

“COVID-19 Response at the University of Illinois” (2021). The Board Trustees of the University of Illinois Urbana-Champaign. https://archive-it.org/collections/13880

For further questions, I can be reached at gmoran6@illinois.edu

What do you get when you hand a Master’s student a web archiving program?: A Two-Part Series

The following is a guest post by Grace Moran, Graduate Web Archiving Assistant – Library Administration, University of Illinois Library

Websites are fleeting; finding a way to preserve them and ensure accessibility for end users is one of the greatest challenges facing record-keeping institutions today. The window for capturing a website is narrow, with many pages disappearing within 2-4 years of their creation (see Ben-David, “2014 Not Found,” in the further reading list). This two-part guest blog series documents my efforts over this academic year to develop, standardize, and envision a future for the University of Illinois web archives collections through our subscription to Archive-It. This first post explores the deeply related issues of metadata and access for web archives, as well as how they influence my proposal for the future of our web collections.

About me: I am a graduate student completing my MS in Library and Information Science at the University of Illinois, graduating this May. I have been working for my supervisor, Dr. Chris Prom (Associate Dean, Office of Digital Strategies), since the spring semester of my senior year of my bachelor’s degree. Last year, he gave me the option of transitioning from digital special collections and workflows to our young web archiving program. This opportunity has taught me so much, and I want to share that with others.

Now, for a bit of background on the program. The University of Illinois has been capturing web pages through Archive-It since 2015, but the program itself has been inconsistent in its stewardship. This year, my charge is to evaluate the status of the University of Illinois subscription to Archive-It and create a plan for its continued success. The culmination of the role will be a final report outlining my accomplishments for the year, evaluating the programmatic needs of a web archiving program, summarizing conversations I have had with stakeholders, and making recommendations for the future of web archiving at the University of Illinois. The challenges facing a burgeoning web archiving program are numerous and varied, ranging from policy to accessibility. By the end of the year, I hope to have identified opportunities for growth and laid out a plan for the future, ideally making a case for a dedicated web archiving position when the budget allows. With 5 TB of data archived since 2015, the care and keeping of these collections is of utmost importance; the time invested by the University of Illinois Library must not be wasted.

My experience thus far has shown me that managing a web archive is no small feat. (Is this an obvious statement? Yes, but that doesn’t make it any less true.) Two of the most urgent issues I have seen are metadata and access; it is no secret that these two go hand-in-hand. The debate around metadata is not new. The OCLC Research Library Partnership Metadata Working Group identified this same problem in their 2017 report “Developing Web Archiving Metadata Best Practices to Meet User Needs” (in the further reading list below). As it stands now, we don’t have a standard for documenting archived websites. Standards such as DACS and MARC don’t fully account for the unique nature of websites, so some institutions have adopted a hybrid approach, using both archival and bibliographic metadata. The question remains whether or not a hybrid approach is sufficient. Is it necessary to create a new standard? Would this over-complicate things?

On the Archive-It platform, there are three possible levels of metadata: 

  1. Collection-level
  2. Seed-level
  3. Document-level

Our collections at the University of Illinois tend to have collection-level and seed-level metadata. This is due to a couple of factors. First, editing seed metadata takes enough hours without also creating document metadata; the law of diminishing returns applies. Second, what is and isn’t a document is a bit difficult to understand on the Archive-It platform, making it unclear what benefit there is in creating document-level metadata. For now, the goal is to make collection- and seed-level metadata consistent across collections and to make it standard practice to describe websites when they are crawled.

Deeply tied to the issue of metadata is that of access: how do we make sure that content reaches end users? If a tree falls in the forest… no, I’m not going to finish that; too cliché. You get the idea. My point is this: taking the time to preserve digital materials is nearly meaningless if no one gets to enjoy the fruits of the labor.

What does access look like for University of Illinois web collections? Right now, an end-user would have to navigate to the public-facing Archive-It website and either search “University of Illinois” or stumble across one of our collections because of a keyword search. We are not in any way alone in this; institutions all over are struggling with the idea of making collections searchable.

So what’s the solution? How do we implement an access system without relying on end users to know that we participate in web archiving and have collections on a website referenced nowhere in our systems? In my final report to the library, I am going to recommend both short- and long-term solutions. In the short term, Archive-It allows subscribers to add a search bar to their discovery systems that searches their Archive-It collections. This simple, elegant, and quick solution would make a significant difference for researchers. A longer-term, more involved solution would be to index all of our pages locally and work toward full-text search of collections. In the era of Google, our users are used to full-text search; discovery systems should fit that behavior.
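
As a rough illustration of what local indexing might start from, the sketch below uses the open-source warcio library to pull the HTML responses out of a downloaded WARC file. The file name and the simple in-memory dictionary are assumptions for the example; a real implementation would strip the markup and feed the text into a proper search index.

    # Sketch: walk a WARC file and collect the HTML pages it contains.
    from warcio.archiveiterator import ArchiveIterator

    index = {}  # URL -> raw HTML (a real index would strip markup and tokenize)

    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            index[url] = record.content_stream().read().decode("utf-8", errors="replace")

    print(len(index), "pages extracted for indexing")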

If you are working with web archives, I hope that something in this post resonated with you. I hope that institutions will begin to collaborate more and find standard solutions so we can truly harness this technology to our advantage.

In my next blog post a month from now, I’ll talk about the more human side of things: policy and personnel. I also want to share a bit about our efforts to document COVID-19 at the University of Illinois. Stay safe, stay healthy, and stay tuned!

Further Reading:

  1. Belovari, S. (2017). Historians and web archives. Archivaria, 83(1), 59-79.
  2. Ben-David, A. (2019). 2014 not found: a cross-platform approach to retrospective web archiving. Internet Histories, 3(3-4), 316-342.
  3. Bragg, M., & Hanna, K. (2013). The Web Archiving Life Cycle Model. Archive-It. https://archive-it.org/static/files/archiveit_life_cycle_model.pdf (visited 14.2.14).
  4. Brunelle, J. F., Kelly, M., Weigle, M. C., & Nelson, M. L. (2016). The impact of JavaScript on archivability. International Journal on Digital Libraries, 17(2), 95-117.
  5. Condill, K. (2017). The Online Media Environment of the North Caucasus: Issues of Preservation and Accessibility in a Zone of Political and Ideological Conflict. Preservation, Digital Technology & Culture, 45(4), 166.
  6. Costa, M., Gomes, D., & Silva, M. J. (2017). The evolution of web archiving. International Journal on Digital Libraries, 18(3), 191-205.
  7. Dooley, J. M., & Bowers, K. (2018). Descriptive metadata for web archiving: Recommendations of the OCLC research library partnership web archiving metadata working group. OCLC Research.
  8. Dooley, J. M., Farrell, K. S., Kim, T., & Venlet, J. (2018). Descriptive Metadata for Web Archiving: Literature Review of User Needs. OCLC Research.
  9. Dooley, J. M., Farrell, K. S., Kim, T., & Venlet, J. (2017). Developing web archiving metadata best practices to meet user needs. Journal of Western Archives, 8(2), 5.
  10. Dooley, J., & Samouelian, M. (2018). Descriptive Metadata for Web Archiving: Review of Harvesting Tools. OCLC Research.
  11. Farag, M. M., Lee, S., & Fox, E. A. (2018). Focused crawler for events. International Journal on Digital Libraries, 19(1), 3-19.
  12. Graham, P. M. (2017). Guest Editorial: Reflections on the Ethics of Web Archiving. Journal of Archival Organization, 14(3-4): 103-110.
  13. Grotke, A. (2011). Web Archiving at the Library of Congress. Computers in Libraries, 31(10), 15-19.
  14. Jones, Gina M., & Neubert, Michael (2017). Using RSS to Improve Web Harvest Results for News Web Sites. Journal of Western Archives, 8(2), Article 3.
  15. Littman, J., Chudnov, D., Kerchner, D., Peterson, C., Tan, Y., Trent, R., … & Wrubel, L. (2018). API-based social media collecting as a form of web archiving. International Journal on Digital Libraries, 19(1), 21-38.
  16. Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If these crawls could talk: Studying and documenting web archives provenance. Journal of the Association for Information Science and Technology, 69(10), 1223-1233.
  17. Masanès, J. (2005). Web archiving methods and approaches: A comparative study. Library trends, 54(1), 72-90.
  18. Pennock, Maureen (2013, March). Web-Archiving: DPC Technology Watch Report. Digital Preservation Coalition.
  19. Summers, E. (2020). Appraisal Talk in Web Archives. Archivaria, 89(1), 70-102.
  20. Thomson, Sara Day (2016, February). Preserving Social Media: DPC Technology Watch Report. Digital Preservation Coalition. 
  21. Weber, M. (2017). The tumultuous history of news on the web. In Brügger N. & Schroeder R. (Eds.), The Web as History: Using Web Archives to Understand the Past and the Present (pp. 83-100). London: UCL Press. doi:10.2307/j.ctt1mtz55k.10
  22. Webster, Peter (2020). How Researchers use the Archived Web: DPC Technology Watch Guidance Note. Digital Preservation Coalition.
  23. Wiedeman, Gregory (2019) “Describing Web Archives: A Computer-Assisted Approach,” Journal of Contemporary Archival Studies: Vol. 6, Article 31. https://elischolar.library.yale.edu/jcas/vol6/iss1/31.

Memento for Chrome review: Guest Post by Cliff Hight

The following is a guest post by Cliff Hight, University Archivist, Kansas State University Libraries.

The Memento for Chrome extension allows users of Google’s web browser, Chrome, to see previous versions of web pages. I enjoyed testing the tool and seeing how certain sites have changed over the years. By the way, do any of you remember how Yahoo! looked in 1996? With a few clicks of a mouse, you can now.

[Screenshot: the Yahoo! home page as captured in 1996]

To give you a flavor of how it works, I’ll walk you through (with pictures!) my experience.

1) After installing the extension, I used my institution’s home page as a test bed.

[Screenshot: the Kansas State University Libraries home page]

2) I clicked on the clock to the right of the browser’s address bar and set the web time to which I wanted to travel, arbitrarily selected as April 14, 2010.

[Screenshot: setting the Memento date picker to April 14, 2010]

3) I opened the context menu by right-clicking (or control-clicking for Mac users) on the page, selecting “Memento Time Travel,” and clicking the “Get near Wed, 14 Apr 2010 18:54:30 GMT” option.

[Screenshot: the Memento Time Travel context menu]

4) Voila! A view of how the Kansas State University Libraries website looked in the Internet Archive’s Wayback Machine on March 6, 2009.

[Screenshot: the K-State Libraries site as archived on March 6, 2009]

You might have noticed that the date I was seeking and the date of the archived site were not the same. As it turns out, the developers note on their page that the extension has two limitations: it cannot “obtain a prior version of a page when none have [sic] been archived and time travel into the future.” Because my institution’s site was not captured in the Wayback Machine on April 14, 2010, there was nothing from that date to show. Instead, the tool went with the next oldest date, which happened to be March 6, 2009.
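
This "nearest capture" behavior is not specific to the extension; it is how the underlying archives answer date-based requests (the Memento protocol does this by sending an Accept-Datetime header to a TimeGate). As a small illustration, the Internet Archive's public availability API can be asked the same question directly; the library URL below is just an example.

    # Ask the Wayback Machine for the capture closest to a target date.
    import requests

    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": "lib.k-state.edu", "timestamp": "20100414"},  # YYYYMMDD
        timeout=10,
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    print(closest.get("timestamp"))  # nearest capture, which may be months away
    print(closest.get("url"))        # replay URL in the Wayback Machine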

You may also have seen additional options on the context submenu. Selecting “Get near current time” takes you to the most recently archived version of the page. The “Get at current time” option takes you to the live version of the page, and the “Got” line tells you which page you are currently viewing.

This extension uses various web archives, such as the Wayback Machine and the British Library Web Archive, to provide easy access to earlier versions of web pages. It also claims to provide archived pages of Wikipedia in all available languages. In my use of the tool, it was more convenient than going to the Wayback Machine every time I wanted to see an older version of a website.

Like most technology products, there are some bugs. In my tests, there were a couple of times on different websites when I set a date, looked at the current version of the page, clicked to see the older version, and waited while nothing happened. To get it working again, I went back to the date box, changed the date by a day, and had success in seeing the older version. I’m not sure why it had those hiccups (and it would not surprise me if there was user error), but know as you begin to use the tool that there might be some kinks to work through.

The developers of Memento include the Prototyping Team of the Research Library of the Los Alamos National Laboratory and the Computer Science Department of Old Dominion University. Based on information on the Memento website, the extension began development in 2009 and its most recent update was in November 2013. On the plugin page, the developers state that “Memento for Chrome allows you to seamlessly navigate between the present web and the web of the past. It turns your browser into a web time travel machine that is activated by means of a Memento sub-menu that is available on right-click.” And, to learn more about the technical side of the project, you can see their Memento Guide and Request for Comments pages.

The Memento for Chrome extension is a helpful tool that allows users to easily peruse websites and see how they have changed through the years. I would recommend adding it to your toolbox as you seek to view the history of the web.

Personal Digital (Web) Archiving: Guest Post by Nicholas Taylor

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Physical keepsakes like photos, letters, and vital documents have a thankfully long shelf life with little affirmative effort. The susceptibility to loss or corruption of the digital artifacts that increasingly replace them has raised awareness about the heightened need to curate, manage, and protect these assets. Interest in “personal digital archiving” has grown significantly within the last few years, as demonstrated by three annual conferences on the topic, practical guidance developed by the National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, and attention from other educational and cultural heritage institutions.

[Photo: exhibits / 2012 National Book Festival, by wlef70]

The recent focus on personal digital archiving as such only belatedly acknowledges the proliferation of personal digital assets worth archiving. The Web adds another layer of complexity, as files that were previously stored locally are scattered across the servers of many different service providers, each with varying degrees of commitment to data longevity to say nothing of their own long-term viability. As content that is natively web-based, social media and personal websites don’t decompose as intuitively into discrete files or with the same degree of fidelity to the original experience. As more of our data is hosted remotely, we need new approaches to continue to maintain our digital personal archives.

NDIIPP offers some great tool- and platform-agnostic advice for preserving personal websites, blogs, and social media, focusing on identifying potential web content to preserve, deciding what’s actually important to preserve, copying the content, organizing it, and creating and managing distributed backups. I wanted to expand on the “copying” step, with attention to specific tools and platforms. There are a small but growing number of tools that are well-suited to copying simple websites and blogs, and popular platforms are increasingly providing mechanisms for exporting data in self-contained and structured formats.

Tools

The Web Archiving Integration Layer (WAIL) is a user-friendly interface in front of the tools used by many cultural heritage web archiving organizations: the Heritrix archival crawler and the Wayback web archive replay platform. WAIL supports one-click capture of an entire website and replay in a local instance of Wayback. Data is captured to the WARC format, which has the advantage of being the ISO standard web archiving preservation format of choice and allowing for a more faithful representation of the original website via the bundled Wayback. The downside is that WARC is a relatively opaque format to all but a few specialized applications. Given that WAIL has only one maintainer, in a personal archiving context it might make sense to also copy web content into more readily legible formats, in addition to WARC.

Wget is a mature, multi-platform command-line utility for which a number of GUI wrappers are available. Wget is highly configurable but can be used to copy a website with only minimal parameters specified. Copied content is stored by default in a local folder hierarchy mirroring the website structure. Wget 1.13+ additionally supports storing copied content in the WARC format: the WARCs are created in parallel with the copying of the website files into the folder hierarchy. The dual-format data capture facilitates easy and relatively future-safe browsing as well as the creation of a suitable preservation format. The downsides are that Wget generally requires comfort with the command line (there are GUI wrappers, but I’ve yet to find one that supports the WARC parameters) and that there’s no easy way to replay or access the contents of the created WARC files.
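
For concreteness, here is a sketch of the dual-format capture described above, written as a short Python wrapper around the Wget command line. The target URL and output name are placeholders; the flags shown are standard Wget options, and Wget 1.13 or later must be on your PATH for the WARC output to work.

    # Mirror a small site into a local folder hierarchy while also writing a WARC.
    import subprocess

    subprocess.run(
        [
            "wget",
            "--mirror",            # recursive copy of the site
            "--page-requisites",   # also fetch images, CSS, and other embedded files
            "--convert-links",     # rewrite links so the local copy browses cleanly
            "--warc-file=mysite",  # additionally write mysite.warc.gz
            "https://www.example.org/",
        ],
        check=True,
    )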

[Photo: stranger 7/100 abdul hoque, by Hasin Hayder]

HTTrack is a multi-platform command-line and GUI tool built specifically for mirroring websites. Due to HTTrack’s more narrow purpose, the documentation and official forum are likely to be more relevant to a personal digital archivist looking to conduct small-scale web archiving. Like Wget, copied content is stored in a local folder hierarchy mirroring the website structure, making it easy to browse content. The command-line version of the tool allows for automation and flexibility, while the GUI version is more user-friendly. The main downside is that if desktop tools for handling WARC files later become available, this would likely be the preferable format to have archived web content in.

Warrick is a *nix command-line utility for re-web archiving—creating a local copy of a website based on content in a web archive (e.g., the Internet Archive Wayback Machine). Built-in Memento support might eventually allow it to reconstitute a website from content hosted in multiple web archives. Like HTTrack and Wget, copied content is stored in a local folder hierarchy mirroring the website structure. Unlike those tools, it’s only designed to retrieve content from web archives.

Platforms

Facebook provides a mechanism for “Downloading Your Info” which creates a zip file containing static web pages that cover your profile, contact information, Wall, photos (only those that you’ve posted yourself, apparently), friends, messages, pokes, events, settings, and ads selectors. While comprehensive and self-contained, there is no option to retrieve the data in more structured formats, like vCard for contacts, iCal for events, or mbox for messages. Facebook is an especially poor place from which to recover photos, as all embedded metadata (including EXIF, IPTC IIM, and IPTC XMP headers) is stripped out and images larger than 100 KB are compressed on upload.

Google Takeout allows for exporting data from an expanding number of Google services, including Mail, Calendar, Contacts, Drive, and Google Plus, in standard formats. This means that the exported data is good for both long-term preservation and continued access using other applications that support those standards. Google Plus supports full-size photo uploads and, therefore, downloads (limited only by Google Drive storage quota) and doesn’t destroy embedded metadata in uploaded photos.

Twitter provides a mechanism for downloading “Your Twitter Archive” which creates a zip file containing a standalone JavaScript-enabled application that evokes the experience of using the Twitter web service in the browser. At first glance, this resembles the format of Facebook’s data export, but a key differentiator is that the Twitter data export includes each of the individual tweets in JSON format and provides the standalone application as a convenience for browsing them. Since the exported data is separate from the presentation, it’s much easier to re-purpose or manipulate it with other tools.
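
Because the tweets ship as structured data, a few lines of Python are enough to work with them outside the bundled viewer. This sketch assumes the export layout in use around the time of writing, where each monthly file under data/js/tweets/ wraps a JSON array in a JavaScript variable assignment; the file path is a placeholder.

    # Read tweets out of a monthly file from the Twitter archive export.
    import json

    with open("data/js/tweets/2013_11.js", encoding="utf-8") as f:
        raw = f.read()

    # Strip the leading "variable = " assignment and parse the JSON array that follows.
    tweets = json.loads(raw[raw.index("["):])

    for tweet in tweets[:5]:
        print(tweet["created_at"], tweet["text"])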

Mainstream content management systems may have extensions that support exporting data in structured formats, though I’m not familiar with any specifically. Explore WordPress plugins or Drupal Modules. “Backup” is probably the term to search for there, as “archive” typically has the connotation of previously-published content that remains accessible through the live website.

It is Time to Embrace the Present: Guest Post by Deborah Kempe

“Water, water, everywhere,
And all the boards did shrink;
Water, water, everywhere,
Nor any drop to drink.”
― Samuel Taylor Coleridge, The Rime of the Ancient Mariner


Colleagues, do you not share this sensation when it comes to navigating the ocean that is the web? When 57,865 returns in a Google search do not cut it for research purposes, where does one turn? When an important URL is suddenly no longer findable, what does one do? Unfortunately, the traditionally safe harbor of libraries and archives as trustworthy repositories of reliable information is no longer quite so secure a destination. At the same time, the traditionally held concept of what libraries and archives should be is undergoing radical reinterpretation.


It was amid this shifting landscape that the libraries of the New York Art Resources Consortium (NYARC) undertook a series of programmatic inquiries into the state of the web for research in art history. Those explorations led to a major grant from the Andrew W. Mellon Foundation, awarded to NYARC in October, for a two-year program in support of preserving born-digital resources for art research.

The emerging fields of web archiving and digital humanities are relatively new. Given that the field of art history and the art business community continue to produce steady streams of relevant print publications, the adoption of a contiguous program to select, capture, describe, and preserve born-digital resources will be a major disruption of traditional library practices. Arriving at this point has been a delicate calculation of structured investigation and righteous determination that admittedly can be a bit uncomfortable.

****

“It is time to embrace the present, let alone the future. The digital world is here to stay and constantly changing. We have to not only embrace it but help to shape it.”

These words, expressed just over a year ago by James Cuno, President and CEO of the J. Paul Getty Trust, in a much re-circulated blog post entitled How Art History is Failing at the Internet, capture the attitude that drove us forward into territory that challenged our comfort levels. But the journey was a series of determined steps. A bit of background…

A presentation in 2010 by Kristine Hanna of the Internet Archive at an ARLIS/NY meeting at the Metropolitan Museum of Art was, for many of us, the first introduction to a new software service called Archive-It, which could be used to curate and capture historical instances of websites. Unlike the digital deluge that many disciplines were experiencing, the realm of art history was only beginning to produce a noticeable quantity of websites and digital publications with value for research. With the onset of the “Great Recession,” the economic crisis that was all too real at the time but now feels like a chimera, many galleries, art dealers, auctioneers, and small museums made a sudden shift to digital publications. The move from print to digital publishing, once driven by cost savings, led to a preference for digital as the platform of choice for many reasons beyond those of economy.

After the ARLIS/NY meeting, staff from the Frick Art Reference Library approached Archive-It to discuss the possibility of undertaking pilot projects to investigate archiving websites of auction houses and to capture and preserve links to digital information in the Archives Directory for the History of Collecting in America.  Archive-It generously facilitated these landmark projects, which allowed us to learn firsthand the challenges and promises of archiving highly visual collections on the web.

Eager to make further progress, NYARC approached the Andrew W. Mellon Foundation with a proposal to take our study to the next level. By this time, national libraries and large universities were creating discipline- or event-based web archives, but our research uncovered very little web archiving activity by special libraries. Although we continued to receive a steady flow of print publications, the number of digital publications was clearly increasing, and in many cases we were not collecting, describing, and preserving them for the long term as we did printed documents. The clock was ticking, and we began to understand the threat of a digital black hole in our collections. Large libraries had never collected the sort of so-called “ephemeral” resources such as auction, dealer, and small exhibition catalogs, and they were not going to do so for dynamic digital versions, either. NYARC made the case for the special needs of libraries whose chief mission is to serve art specialists.


That the web has become the dominant channel for information-seeking in 21st-century culture is a given, yet much of its digital content is fragile and ephemeral. The question for NYARC was no longer “Why archive the web?” but “How do we archive the web?”, “Who should archive the web?”, and “How will users navigate web archives?” The Mellon Foundation responded with support for “Reframing Collections for a Digital Age: A Preparatory Study for Collecting and Preserving Web-Based Art Research Materials.” The one-year grant allowed us to bring in experts to assess the digital landscape of art information. The reports that followed allowed NYARC to envision a road map for creating a sustainable program of specialized web archiving.

“Go small, go simple, go now”
― Larry Pardey, Cruising in “Seraffyn”

While nothing about the web is simple, an incremental approach to problem-solving is effective.  Essentially, that is what our consultants advised.

****

With the recent award of our two-year implementation grant from the Andrew W. Mellon Foundation, NYARC is now in the beginning stages of building a program that will integrate web archiving into our core activity of building high-quality collections for use by art researchers and museum staff far beyond our reading rooms.  By calling our proposal Making the Black Hole Gray, we acknowledge the futility of fully closing the digital black hole, and that it will not be possible to capture every born-digital resource that we might wish to.  Instead, we will prioritize the harvest of digital resources that correspond to our traditional collection strengths, with the expectation that others will join a historical pattern of collaborative resource sharing to enable the creation of a lasting digital corpus that will invigorate the work of librarians, archivists, scholars, technologists, and the public in ways we have only begun to imagine.  Let the voyage continue.


–Deborah Kempe, Chief, Collections Management & Access, Frick Art Reference Library of The Frick Collection (a member of the New York Art Resources Consortium), 12/6/2013