How to Start Web Archiving: A Practical Guide

Web archiving can feel intimidating at first. It’s technical, it’s evolving, and the stakes are high. If your institution isn’t saving its web presence, you could lose key records of events, programs, leadership, and community engagement that only ever lived online. This post walks through the basics of how to actually do web archiving, breaking down the tools, steps, and decision-making involved.

Whether you’re just getting started or trying to standardize your approach, this is meant to give you a solid foundation for building a sustainable web archiving practice.

What Is Web Archiving?

Web archiving is the process of collecting, preserving, and providing access to web-based content. Think of it like taking snapshots of websites and social media posts that your organization controls, or that relate to your community or mission. These snapshots, or captures, are saved and made available long-term, even after the live versions are gone.

Archiving web content helps document your institution’s history and public presence, but it also supports reference work, transparency, and digital preservation goals. If you’ve ever tried to cite something that’s since been taken down, you know why this matters.

Where to Start: Your Own Content

Begin with what you control. That means your main website, blogs, social media, event pages, online newsletters, and anything hosted on platforms like WordPress or Squarespace. Archive these before branching out to content from other organizations.

If your institution is new to web archiving, stick to a narrow scope for your first few collections. For example, you might create a “Leadership Announcements” collection that captures pages about staff transitions and board updates, or a “Public Programs” collection that saves event pages and online exhibitions.

Tools to Capture Web Content

You don’t have to be a developer or buy expensive software to archive websites. Professionals in libraries, archives, and museums (LAM) rely on a few go-to tools:

  • Webrecorder tools (like ArchiveWeb.page): These allow you to capture and replay sites manually in your browser.

  • Conifer (formerly Webrecorder.io): Good for high-fidelity captures of complex, dynamic websites.

  • Archive-It: A subscription-based service from the Internet Archive, designed for institutional-level web archiving with scheduling, metadata, and public access.

  • Browsertrix Crawler: An open-source crawling framework that supports automated captures and large-scale archiving, especially if you have an IT team or developer support.

For most beginners, starting with a tool like ArchiveWeb.page is a great way to learn how capturing works without getting lost in code or settings.

Capture Tips and Best Practices

Websites are complicated and not every part gets captured perfectly. Here are some things to keep in mind:

  • Use a consistent naming convention for captures. Include date, content type, and site name so files are easy to manage.

  • Do test crawls before committing to a schedule. You’ll catch issues like login walls, dynamically loaded content, or media that fails to render.

  • Consider frequency. For static sites, annual captures might be fine. For fast-changing content like news or event pages, monthly or even weekly captures may be needed.

  • Use metadata. Even minimal description helps contextualize captures. Who created the content? What’s the significance? When was it captured?
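
As a rough illustration of the naming-convention tip above, here is a minimal Python sketch. The function names, the date-first ordering, and the underscore separators are all assumptions made for this example, not an established standard, so adapt the pattern to whatever your team agrees on.

```python
import re
from datetime import date


def slugify(text):
    """Lowercase the text and collapse anything non-alphanumeric into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def capture_filename(captured_on, content_type, site_name, extension="warc"):
    """Build a file name from the capture date, content type, and site name.

    Putting the ISO date first means files sort chronologically by name.
    """
    return (
        f"{captured_on.isoformat()}_{slugify(content_type)}"
        f"_{slugify(site_name)}.{extension}"
    )


# Hypothetical example values.
print(capture_filename(date(2024, 5, 1), "Event Page", "Example Museum"))
# prints: 2024-05-01_event-page_example-museum.warc
```

Whatever pattern you choose, the point is that it is applied the same way every time, so generating names from a small helper like this is usually more reliable than typing them by hand.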

Organizing and Preserving Captures

Web archive captures are often saved as WARC (Web ARChive) files. These files bundle together all the data from a capture session, including HTML, images, scripts, and more.
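
To make the “bundle” idea concrete, here is a small Python sketch that reads records from an uncompressed WARC stream using only the standard library. Each record is a block of named headers followed by the captured bytes. The sample record is invented for illustration, and the reader is deliberately simplified: real WARC files are usually gzip-compressed and contain many record types, so in practice you would rely on your capture tool or a dedicated library such as warcio rather than code like this.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of the stream
        if not version.strip():
            continue  # blank lines separate one record from the next
        if not version.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")
        headers = {}
        while True:
            line = stream.readline().rstrip(b"\r\n")
            if not line:
                break  # an empty line ends the header block
            name, _, value = line.partition(b":")
            headers[name.decode("utf-8")] = value.strip().decode("utf-8")
        # Content-Length says how many bytes of captured content follow.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# An invented, minimal record for demonstration purposes only.
body = b"<html><body>Hello</body></html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: https://example.org/\r\n"
    b"WARC-Date: 2024-05-01T12:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Notice that the URL and capture date travel inside the file itself, which is part of why WARC works well as a preservation format.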

You’ll want to store these files in a secure, backed-up environment, just like other digital preservation materials. Ideally, they’re part of your digital preservation system and referenced in your collection management system or finding aids.

Labeling and metadata go a long way in making your captures findable later. At a minimum, you should record:

  • Title of the content

  • URL

  • Date captured

  • Capturing tool

  • Notes about what was and was not included
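
One way to keep those minimum fields consistent is to define them once and write every capture to the same log. The Python sketch below does that with a CSV file. The field names, the `CaptureRecord` class, and the sample values are assumptions made for this example, so rename them to match your own collection management practices.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CaptureRecord:
    """The minimum descriptive fields listed above (names are illustrative)."""
    title: str
    url: str
    date_captured: str  # ISO 8601 date, e.g. "2024-05-01"
    capturing_tool: str
    notes: str = ""


def write_capture_log(records, handle):
    """Write a header row plus one CSV row per capture to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(CaptureRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Hypothetical example record.
log_records = [
    CaptureRecord(
        title="Spring Gala 2024",
        url="https://example.org/events/spring-gala",
        date_captured="2024-05-01",
        capturing_tool="ArchiveWeb.page",
        notes="Embedded video did not replay",
    )
]

# Written to an in-memory buffer here so the example has no side effects.
buffer = io.StringIO()
write_capture_log(log_records, buffer)
print(buffer.getvalue())
```

To save a real file instead, open one with `open("capture_log.csv", "w", newline="", encoding="utf-8")` and pass that handle in place of the buffer. The resulting CSV opens directly in any spreadsheet program.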

Ethical Considerations

Always think about privacy, permissions, and public expectations. Just because content is publicly viewable doesn’t mean it should be archived without care. Be cautious with sensitive content, personal information, and anything involving minors.

When possible, get institutional sign-off on a web archiving policy. This helps clarify what gets archived, how often, and who decides.

Building a Habit

Web archiving is easiest when it becomes part of your workflows. Schedule captures around known events or updates. Set quarterly or monthly check-ins to archive major changes. Keep a spreadsheet or database to track what you’ve captured and when.
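
If your tracking list records how often each site should be captured, a few lines of Python can tell you what is due at each check-in. This is only a sketch: the frequency labels, the day counts behind them, and the example URLs are assumptions for illustration, and a spreadsheet formula could do the same job.

```python
from datetime import date, timedelta

# Approximate intervals; adjust to match your own capture policy.
FREQUENCY_DAYS = {"weekly": 7, "monthly": 30, "quarterly": 91, "annual": 365}


def captures_due(tracked, today):
    """Return the URLs whose last capture is older than their target frequency."""
    due = []
    for entry in tracked:
        interval = timedelta(days=FREQUENCY_DAYS[entry["frequency"]])
        last = entry["last_captured"]
        # A site that has never been captured is always due.
        if last is None or today - last >= interval:
            due.append(entry["url"])
    return due


# Hypothetical tracking list.
tracked = [
    {"url": "https://example.org/", "frequency": "annual", "last_captured": date(2024, 1, 15)},
    {"url": "https://example.org/events", "frequency": "monthly", "last_captured": date(2024, 3, 1)},
    {"url": "https://example.org/news", "frequency": "weekly", "last_captured": None},
]

print(captures_due(tracked, today=date(2024, 5, 1)))
# prints: ['https://example.org/events', 'https://example.org/news']
```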

If you’re on a small team, it’s OK to start small. One collection, one tool, one goal. The important thing is to begin and to keep going.

Sarah Weeks

Sarah is a big-picture thinker who also relishes attending to the little details. In over 20 years of work in libraries and archives, she has promoted a user-centered philosophy in diverse and unique roles at universities, corporations, and nonprofits. She brings her passion for connecting humans with information to Backlog, where she advises on digital tools, processes, and workflows.

Currently, Sarah is the Web and Email Archives Coordinator at Washington University in St. Louis. In 2020, Sarah was transferred from her role managing public services at WashU’s Art and Architecture Library to Special Collections, where she began assisting with digital archiving. Her focus on setting up sustainable and robust systems from scratch led to her current role as a digital archivist, formalizing the first web and email archiving programs at the university. Her background includes a stint as a corporate librarian at Anheuser-Busch, metadata work at Getty Images, as well as many years spent in public service in academic libraries.

Sarah holds an MLIS from the University of Washington in Seattle, where she volunteered or interned at organizations including the Museum of History and Industry, the Seattle Art Museum, and the Zine Archive at Richard Hugo House. Her dedication to sharing knowledge led her to teach ESL classes at the Seattle Public Library and conduct children’s garden tours at Seattle Tilth.

Back in her hometown of St. Louis, one of Sarah’s longstanding passions is her work with the National Building Arts Center (NBAC). There, she co-created the website, assists with tours and events, and consults on library processes.

https://www.linkedin.com/in/sarah-weeks-0648b82a/