What Is Web Archiving? Understanding the Format That Preserves the Internet

Jul 15

Web archiving might sound like a technical niche or a futuristic concern, but in reality, it’s already shaping how history is remembered. From preserving deleted political speeches to capturing how institutions responded to the COVID-19 pandemic, web archives have become a crucial piece of our cultural memory. But what exactly is a web archive? How did it come to exist? And what makes it archival?

This post walks through the background and basics of web archiving. It’s the first in a three-part series. If you're interested in creating your own web archives or learning how to help researchers use them, keep an eye out for the next two sessions. If you want to skip ahead, you can also watch the recorded version of this webinar [insert link here].

From Clay Tablets to Web Crawlers

To understand web archiving, it helps to zoom out. Archives as a concept are ancient. Clay tablets were used to record information thousands of years ago. Papyrus, parchment, and paper followed. Physical archives eventually gave way to card catalogs, and then digital systems.

Machine-readable cataloging (MARC) was developed in the 1960s. That’s only about sixty years ago. The modern web came online in the mid-1990s. Almost immediately, people started asking how to preserve it. The shift from physical to digital happened fast. It’s been less than two centuries since photography and telegraphs first appeared, and already we’re trying to figure out how to preserve tweets and TikToks.

The idea of digital preservation is relatively new. One of the earliest references to it came from research at Cornell and Xerox in 1990. Even then, it was still framed as an experiment, something outside the scope of traditional archival practice. Now, it’s essential.

What Makes a Web Archive Different

When people hear the word archive, they often imagine boxes of letters or maybe digitized images of old documents. But a web archive isn’t just a screenshot or a saved webpage. It’s an attempt to preserve the experience of using a website. That means archiving all the pieces that make up a page—images, text, videos, scripts, and styles—and preserving them in a way that users can still interact with.

A JPEG file, for example, is relatively simple. It’s one object, readable by many programs. A web archive, by contrast, is stored in a format called WARC, short for Web ARChive. A WARC file contains many parts. It has to, because websites are dynamic. They’re not a single object. They’re a series of files working together, and often pulling information in real time from other sources. Preserving that is complicated.
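To make "many parts" a little more concrete, here is a rough Python sketch of what a single record inside a WARC file looks like: a version line, a handful of named header fields, a blank line, and then the captured content. The header names follow the WARC standard, but the functions build_record and parse_record are hypothetical names made up for this illustration. Real WARC files hold many such records (requests, responses, metadata), are usually compressed, and are best read and written with a dedicated library rather than code like this.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(target_uri: str, payload: bytes, record_type: str = "resource") -> bytes:
    """Assemble one simplified WARC-style record as bytes (illustrative only)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/html"),  # assumption: the payload is an HTML page
        ("Content-Length", str(len(payload))),
    ]
    head = CRLF.join([b"WARC/1.0"] + [f"{k}: {v}".encode("utf-8") for k, v in headers])
    # Header block, blank line, payload, then two CRLFs that close the record.
    return head + CRLF + CRLF + payload + CRLF + CRLF


def parse_record(raw: bytes) -> tuple[dict[str, str], bytes]:
    """Split one record back into its header fields and its payload."""
    head, _, rest = raw.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    if not lines[0].startswith(b"WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.decode("utf-8").partition(": ")
        headers[name] = value
    # Content-Length tells the reader exactly how many payload bytes to take.
    length = int(headers["Content-Length"])
    return headers, rest[:length]
```

A web page with ten images and three stylesheets would produce at least fourteen records like this one, each tied to its own URL and capture time. That is why a WARC file is closer to a container than to a single document.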

You can’t just double-click a WARC file and expect it to open. You need a specialized system that can recreate the environment of the live web. Replay tools like OpenWayback or pywb act as interpreters. They take the archived pieces and render them in a way that mimics how they looked and behaved when they were live. It’s a multi-step process, and it isn’t perfect. But when it works, it allows researchers to interact with preserved websites almost as if they were still online.

The Problem with Link Rot

In the print world, citations and footnotes were stable. Even if a book went out of print, you could probably find it in a library. On the web, that’s not the case. Pages disappear. Links break. Content gets overwritten. In 2015, Jill Lepore wrote about this in her New Yorker article “The Cobweb,” calling attention to what she described as the destruction of the footnote. The web is fragile. Its impermanence undermines scholarship, accountability, and memory.

This issue is often called link rot. You click a link in an article or academic paper, and instead of seeing the original source, you get a 404 error. The page is gone. Or worse, the page is still there, but the content has changed. This is called content drift. A citation may point to a headline that no longer matches the original story. Or it may direct you to a site that now sells something entirely unrelated.
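The difference between the two problems can be sketched in a few lines of Python. This is a simplified illustration, not how any particular study measures it: the function names are hypothetical, and it assumes you saved a fingerprint (a hash) of the page at the moment you cited it. An exact hash comparison is also stricter than real life, since a rotating ad or a changed timestamp would count as "drift." Actual research tends to use fuzzier measures of similarity.

```python
import hashlib
from typing import Optional


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hash that stands in for 'what the page said when cited'."""
    return hashlib.sha256(content).hexdigest()


def classify_link(status_code: int, current_content: Optional[bytes], cited_fingerprint: str) -> str:
    """Label a cited link as 'link rot', 'content drift', or 'unchanged'."""
    # The page is gone: an error status (such as 404) or nothing came back at all.
    if status_code >= 400 or current_content is None:
        return "link rot"
    # The page still loads, but it no longer matches what was cited.
    if fingerprint(current_content) != cited_fingerprint:
        return "content drift"
    return "unchanged"
```

For example, a citation whose URL now returns a 404 would be labeled link rot, while a URL that loads fine but shows a rewritten story would be labeled content drift.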

Web archiving exists to fight both link rot and content drift. It offers a way to preserve what a page looked like at a specific moment in time. It creates a stable version of a site that researchers, lawyers, journalists, and the public can refer back to.

The Rise of the Wayback Machine

The most well-known web archive is the Wayback Machine, part of the Internet Archive. The Internet Archive was founded in 1996 by Brewster Kahle, and the Wayback Machine, which opened to the public in 2001, has grown into an enormous resource. As of this year, it contains more than 860 billion web pages, representing over 99 petabytes of data. That’s more than 99 million gigabytes.

The Wayback Machine uses web crawlers—automated programs that browse and download web content. These crawlers, sometimes called spiders, follow links across the web and take snapshots of pages. The archive makes those snapshots available to the public. Unlike search engines like Google, which simply point you to live pages, the Wayback Machine actually saves and serves up the content.
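The basic crawl loop is simple enough to sketch. The Python below is a toy, not the Internet Archive's crawler: it starts from one seed URL, saves each page it can fetch, pulls out the links, and adds them to a queue. The names LinkExtractor and crawl are hypothetical, and fetch is a stand-in for whatever actually downloads a page. Passing it in as a parameter lets the example run against a small made-up site instead of the live web. A production crawler would also respect robots.txt, limit its request rate, and capture images, scripts, and stylesheets, none of which is shown here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns {url: html} for every page captured."""
    snapshots = {}
    seen = {seed}
    queue = deque([seed])
    while queue and len(snapshots) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # hypothetical downloader: returns HTML text, or None on failure
        if html is None:
            continue  # a broken link: nothing to save, move on
        snapshots[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links against the current page
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return snapshots
```

Here is a usage example with a three-page pretend site stored in a dictionary:

```python
site = {
    "http://example.org/": '<a href="/about">About</a> <a href="news.html">News</a>',
    "http://example.org/about": '<a href="/">Home</a>',
    "http://example.org/news.html": '<a href="/gone">Missing page</a>',
}
captured = crawl("http://example.org/", site.get)
print(sorted(captured))
```

The crawler only finds what is linked. A page that nothing points to never enters the queue, which is one reason captures can be incomplete.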

But it’s not perfect. The crawlers don’t capture every page, and they don’t always get complete versions. You might find missing images or broken formatting. Sometimes a capture looks fine. Sometimes it’s just a jumble of text. If you’re doing research in a web archive, you have to try multiple dates and dig through versions. The randomness is part of the process.
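Trying multiple dates is easier once you notice that Wayback Machine addresses follow a pattern: the archive's prefix, a timestamp, and then the original URL. The short Python sketch below builds a list of addresses to check, one per year. The function names are hypothetical, and the choice of January 1 is arbitrary. As far as I know, the Wayback Machine redirects a request like this to the capture closest to the timestamp you asked for, but whether any capture exists for a given page is something you only find out by visiting.

```python
from datetime import date

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: date) -> str:
    """Build a Wayback Machine address for a page as of a given date."""
    # Timestamps are written as YYYYMMDDhhmmss; midnight is used here for simplicity.
    timestamp = when.strftime("%Y%m%d") + "000000"
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def yearly_candidates(original_url: str, first_year: int, last_year: int) -> list[str]:
    """One address per year (January 1), to step through a site's history."""
    return [
        wayback_url(original_url, date(year, 1, 1))
        for year in range(first_year, last_year + 1)
    ]
```

Calling yearly_candidates("http://example.com/", 2019, 2021) gives three addresses to open in a browser and compare.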

Web Archiving Beyond the Internet Archive

Although the Wayback Machine is the biggest player, it’s not the only one. Countries around the world have created national web archives. Australia began archiving .au domains in 1996. The UK passed a legal deposit law in 2003 that includes digital content. New Zealand was one of the earliest adopters. Many of these countries archive government websites and significant public-facing content as a matter of national policy.

In the United States, we don’t have a single national web archive. The Library of Congress does some targeted preservation, but there’s no federal requirement to deposit digital content the way there is for books. As a result, most web archiving here happens through independent institutions, universities, and nonprofits.

Projects like Archive-It, operated by the Internet Archive, allow organizations to curate their own web archives. Schools, libraries, museums, and even community groups can subscribe and manage collections that reflect their specific missions.

Web Archiving as a Tool for Justice and Memory

Web archives are useful for research, but they also play a growing role in accountability. When politicians delete statements from websites, web archives can reveal what used to be there. When government agencies quietly change public guidance, archived versions can show the difference. When publishers update articles without disclosing the changes, archives can provide a record.

Projects like Perma.cc, developed by the Harvard Law School Library, allow users to create stable citations for legal and academic use. A Harvard Law School study found that roughly half of the links in Supreme Court opinions no longer lead to the originally cited material. Web archiving is one of the few tools we have to correct for that.

And it’s not just about officials or institutions. Projects like Community Webs and the K–12 Web Archiving Program invite public libraries and school students to participate in preservation. Students have reflected on the experience of being able to choose what gets archived. One said, “Without this project, there would be no record of us.” That’s the power of letting people see themselves in history.

When the Web Is at Risk

In times of crisis, web archiving can move fast. When war broke out in Ukraine, a volunteer-led initiative called SUCHO (Saving Ukrainian Cultural Heritage Online) sprang into action. Academics and digital preservationists began capturing websites from Ukrainian museums, libraries, and archives before they could be taken offline or destroyed. In a matter of weeks, they preserved thousands of sites and tens of terabytes of data. Their work continues today, not just as a mirror of Ukrainian culture, but with the intent to return that data to Ukraine when it is safe and stable to do so.

During the COVID-19 pandemic, similar efforts took place across the United States. Universities and local archives captured health announcements, campus response pages, student initiatives, and policy changes. This was the kind of material that may never have made it into a printed newsletter or official report. It existed only on the web. Web archiving made it possible to preserve that experience in real time.

The Loss of GeoCities and the Fight to Save Digital History

Sometimes, web archives are all we have left. GeoCities, once one of the most popular web hosting platforms on the internet, was shut down by Yahoo in 2009. Millions of user-created websites disappeared overnight. An independent group called Archive Team managed to save much of the content. They acted quickly, capturing as much as they could before the deletion. Today, those pages live on as a kind of time capsule. They show not just the content of the early web, but the creativity, messiness, and joy of its users.

The loss of GeoCities is still cited as one of the largest intentional deletions of digital content. And it changed how people think about preservation. When companies control the platforms and the content, history can disappear quickly. Web archives offer a way to push back.

Web Archives as a Reflection of Values

At the heart of web archiving is a question of values. What do we choose to preserve? Who gets to decide what is saved? And how will those decisions shape what the future knows about the past?

Paul Koerbin, one of the pioneers of web archiving in Australia, put it simply: “What we choose to collect will frame how the future looks at the past.” That’s a responsibility and an opportunity. It’s a chance to preserve voices, stories, and perspectives that might otherwise vanish.

You Can Start Right Now

If this feels like a big topic, that’s because it is. But you don’t have to wait for a perfect policy or expensive software to begin. You can go to archive.org, find the “Save Page Now” tool, and preserve a website immediately. You can create an account and start saving your own pages or projects. You can start small, and start today.
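If you have a list of pages you want to save, the “Save Page Now” step can be prepared in bulk. The Python sketch below only builds the addresses. It assumes the commonly used pattern of putting the page's URL after https://web.archive.org/save/, and the function name save_page_now_url is made up for this example. Opening one of the resulting addresses in a browser should prompt the Wayback Machine to attempt a capture. Please be considerate and submit pages a few at a time rather than all at once.

```python
SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(page_url: str) -> str:
    """Build the address that asks the Wayback Machine to capture page_url."""
    cleaned = page_url.strip()
    if not cleaned.startswith(("http://", "https://")):
        # Assumption for this sketch: bare domains are treated as https pages.
        cleaned = "https://" + cleaned
    return SAVE_PREFIX + cleaned


pages_to_save = ["example.org/about", "https://example.org/news"]
for page in pages_to_save:
    print(save_page_now_url(page))
```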

In the next session of this series, I’ll walk through how small organizations, researchers, or even individuals can start doing web archiving in a sustainable and accessible way. We’ll talk tools, storage, planning, and ethics. And in the final session, we’ll look at how web archives are used by historians, researchers, genealogists, and memory workers of all kinds.

Watch the Full Webinar

If you want to hear more or see examples in action, the recorded webinar is available here:

And if your organization is ready to start thinking about web archiving, Backlog can help. Whether you need a plan, a pilot crawl, or a full preservation strategy, we’re happy to work with you.

Sarah Weeks

Sarah is a big-picture thinker who also relishes attending to the little details. In over 20 years of work in libraries and archives, she has promoted a user-centered philosophy in diverse and unique roles at universities, corporations, and nonprofits. She brings her passion for connecting humans with information to Backlog, where she advises on digital tools, processes, and workflows.

Currently, Sarah is the Web and Email Archives Coordinator at Washington University in St. Louis. In 2020, Sarah was transferred from her role managing public services at WashU’s Art and Architecture Library to Special Collections, where she began assisting with digital archiving. Her focus on setting up sustainable and robust systems from scratch led to her current role as a digital archivist, formalizing the first web and email archiving programs at the university. Her background includes a stint as a corporate librarian at Anheuser-Busch, metadata work at Getty Images, as well as many years spent in public service in academic libraries.

Sarah holds an MLIS from the University of Washington in Seattle, where she volunteered or interned at organizations including the Museum of History and Industry, the Seattle Art Museum, and the Zine Archive at Richard Hugo House. Her dedication to sharing knowledge led her to teach ESL classes at the Seattle Public Library and conduct children’s garden tours at Seattle Tilth.

Back in her hometown of St. Louis, one of Sarah’s longstanding passions is her work with the National Building Arts Center (NBAC). There, she co-created the website, assists with tours and events, and consults on library processes.

https://www.linkedin.com/in/sarah-weeks-0648b82a/