How to Start Web Archiving: A Practical Guide
Web archiving can feel intimidating at first. It’s technical, it’s evolving, and the stakes are high. If your institution isn’t saving its web presence, you could lose key records of events, programs, leadership, and community engagement that only ever lived online. This post walks through the basics of how to actually do web archiving, breaking down the tools, steps, and decision-making involved.
Whether you’re just getting started or trying to standardize your approach, this is meant to give you a solid foundation for building a sustainable web archiving practice.
What Is Web Archiving?
Web archiving is the process of collecting, preserving, and providing access to web-based content. Think of it like taking snapshots of websites and social media posts that your organization controls, or that relate to your community or mission. These snapshots, or captures, are saved and made available long-term, even after the live versions are gone.
Archiving web content helps document your institution’s history and public presence, but it also supports reference work, transparency, and digital preservation goals. If you’ve ever tried to cite something that’s since been taken down, you know why this matters.
Where to Start: Your Own Content
Begin with what you control. That means your main website, blogs, social media, event pages, online newsletters, and anything hosted on platforms like WordPress or Squarespace. Archive these first before branching out to content from other organizations.
If your institution is new to web archiving, stick to a narrow scope for your first few collections. For example, a “Leadership Announcements” collection that captures pages about staff transitions and board updates, or a “Public Programs” collection that saves event pages and online exhibitions.
Tools to Capture Web Content
You don’t have to be a developer or buy expensive software to archive websites. There are a few go-to tools used by professionals in libraries, archives, and museums (LAM):
Webrecorder tools (like ArchiveWeb.page): These let you capture pages manually as you browse them, then replay the captures in your browser.
Conifer (formerly Webrecorder.io): Good for high-fidelity captures of complex, dynamic websites.
Archive-It: A subscription-based service from the Internet Archive, designed for institutional-level web archiving with scheduling, metadata, and public access.
Browsertrix Crawler: An open-source crawler from the Webrecorder project that runs from the command line in Docker. It supports automated captures and large-scale archiving, and is best suited to teams with IT or developer support.
For most beginners, starting with a tool like ArchiveWeb.page is a great way to learn how capturing works without getting lost in code or settings.
Capture Tips and Best Practices
Websites are complicated and not every part gets captured perfectly. Here are some things to keep in mind:
Use a consistent naming convention for captures. Include date, content type, and site name so files are easy to manage.
Do test crawls before committing to a schedule. You’ll catch issues like login walls, dynamically loaded content, or media that fails to render.
Consider frequency. For static sites, annual captures might be fine. For fast-changing content like news or event pages, monthly or even weekly captures may be needed.
Use metadata. Even minimal description helps contextualize captures. Who created the content? What’s the significance? When was it captured?
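If you want to keep file names consistent without thinking about it each time, a small script can build them for you. The sketch below is one possible convention (date, then content type, then site name), not a standard. The function name capture_filename and the example values are assumptions made up for illustration.

```python
import re
from datetime import date


def capture_filename(site_name: str, content_type: str,
                     captured_on: date, extension: str = "warc.gz") -> str:
    """Build a name like 2024-03-15_event-page_example-museum.warc.gz.

    The date-first order is an assumed convention; it makes files sort
    chronologically in a folder listing.
    """
    def slug(text: str) -> str:
        # Lowercase, replace runs of non-alphanumerics with one hyphen,
        # and trim hyphens from both ends.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (f"{captured_on.isoformat()}_"
            f"{slug(content_type)}_{slug(site_name)}.{extension}")


# Hypothetical example values.
print(capture_filename("Example Museum", "Event Page", date(2024, 3, 15)))
```

Whatever pattern you choose matters less than applying it the same way to every capture.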
Organizing and Preserving Captures
Web archive captures are often saved as WARC (Web ARChive) files. These files bundle together all the data from a capture session, including HTML, images, scripts, and more.
You’ll want to store these files in a secure, backed-up environment, just like other digital preservation materials. Ideally, they’re part of your digital preservation system and referenced in your collection management system or finding aids.
Labeling and metadata go a long way in making your captures findable later. At a minimum, you should record:
Title of the content
URL
Date captured
Capturing tool
Notes about what was and wasn’t captured
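The minimum fields above map neatly onto a simple record that can be exported to a spreadsheet. Here is a minimal sketch using only Python’s standard library. The class name CaptureRecord, the column names, and the sample values are assumptions for illustration, so adapt them to whatever your collection management system expects.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date


@dataclass
class CaptureRecord:
    """One row of minimal capture metadata (field names are assumptions)."""
    title: str
    url: str
    date_captured: date
    capturing_tool: str
    notes: str = ""


def records_to_csv(records) -> str:
    """Return the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(CaptureRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        row = asdict(record)
        # Store the date in ISO 8601 form (YYYY-MM-DD) so it sorts cleanly.
        row["date_captured"] = record.date_captured.isoformat()
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample record.
sample = CaptureRecord(
    title="Spring Lecture Series",
    url="https://example.org/events/spring-lectures",
    date_captured=date(2024, 3, 15),
    capturing_tool="ArchiveWeb.page",
    notes="Embedded video did not replay",
)
print(records_to_csv([sample]))
```

The resulting text can be saved as a .csv file and opened in any spreadsheet program.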
Ethical Considerations
Always think about privacy, permissions, and public expectations. Just because content is publicly viewable doesn’t mean it should be archived without care. Be cautious with sensitive content, personal information, and anything involving minors.
When possible, get institutional signoff on a web archiving policy. This helps clarify what gets archived, how often, and who decides.
Building a Habit
Web archiving is easiest when it becomes part of your workflows. Schedule captures around known events or updates. Set quarterly or monthly check-ins to archive major changes. Keep a spreadsheet or database to track what you’ve captured and when.
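If your tracking sheet records when each site was last captured and how often you intend to capture it, a short script can tell you what is due at each check-in. This is a rough sketch, not a scheduling system. The row layout, the captures_due function, and the day counts used for each frequency are all assumptions for illustration.

```python
from datetime import date, timedelta

# Assumed approximations of each capture frequency, in days.
INTERVALS = {
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=91),
    "annual": timedelta(days=365),
}


def captures_due(tracking_rows, today: date):
    """Return the names of sites whose next capture is due.

    Each row is a dict with 'site', 'frequency', and 'last_captured'
    (a date, or None if the site has never been captured).
    """
    due = []
    for row in tracking_rows:
        last = row["last_captured"]
        if last is None or today - last >= INTERVALS[row["frequency"]]:
            due.append(row["site"])
    return due


# Hypothetical tracking rows, as they might be read from a spreadsheet.
rows = [
    {"site": "News page", "frequency": "weekly", "last_captured": date(2024, 5, 30)},
    {"site": "Events page", "frequency": "monthly", "last_captured": date(2024, 4, 15)},
    {"site": "About page", "frequency": "annual", "last_captured": None},
]
print(captures_due(rows, today=date(2024, 6, 1)))
```

In this example the events page and the never-captured about page are flagged, while the news page captured two days earlier is not.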
If you’re on a small team, it’s OK to start small. One collection, one tool, one goal. The important thing is to begin and to keep going.