
Python WayBack Machine for web archive replay



pywb is a Python implementation of web archive replay tools, sometimes also known as a ‘Wayback Machine’.

pywb allows high-quality replay (browsing) of archived web data stored in the standardized ARC and WARC formats.

The replay system is designed to accurately replay complex dynamic sites, including video and audio content and sites with complex JavaScript.

The software can run as a traditional web application or an HTTP or HTTPS proxy server, and has been tested on Linux, OS X and Windows platforms.

pywb is also fully compliant with the Memento protocol (RFC-7089).

Getting Started – Run your own Wayback Machine

With release 0.9.0, pywb provides a new, simplified, directory-based init system to create and run your own Wayback Machine directly from archive collections on disk.

A new utility, wayback-manager, performs the most common collection management tasks from the command line.

  1. Ensure that Python 2.6 or 2.7 is installed on your machine (Python 2.7.3+ strongly recommended).

  2. (Optional) For best results, set up a clean environment with virtualenv: virtualenv /tmp/pywb-env; source /tmp/pywb-env/bin/activate

  3. pip install pywb==0.9.0b1

  4. Create a new directory for your archive, eg: mkdir ~/myarchive; cd ~/myarchive

  5. Init a collection: wayback-manager init my_coll

  6. (Optional) If you do not have any archive files (WARCs or ARCs), you may create one by using the free https://webrecorder.io service. For example, you may visit https://webrecorder.io/record/http://example.com, then (after a few seconds) click “Download -> Web Archive (WARC)” to get the WARC file (.warc.gz).

  7. If you have any existing archive files (WARCs or ARCs), add them to your collection with: wayback-manager add /path/to/mywarc.warc.gz

  8. Run wayback

  9. Point your browser to http://localhost:8080/my_coll/<url>/ where <url> is a url in your WARC file.

    (If you just recorded http://example.com/, you should be able to view http://localhost:8080/my_coll/http://example.com/)

  10. If all worked well, you should see replay of <url>.

Congrats, you are now running your own Wayback Machine!
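
As a quick check of steps 9 and 10, the replay URL can be built and fetched from a script. The sketch below is only an illustration: the helper names replay_url and check_replay are hypothetical, the host defaults to the localhost:8080 address used above, and it relies on the Python 3 standard library, so it is meant to run as a separate client rather than inside the pywb environment.

```python
from urllib.request import urlopen


def replay_url(collection, url, host="localhost:8080"):
    """Build http://<host>/<collection>/<url>, the form used in step 9."""
    return "http://{0}/{1}/{2}".format(host, collection, url)


def check_replay(collection, url, host="localhost:8080", timeout=10):
    """Fetch the replay page and return the HTTP status code.

    Requires `wayback` to be running; urlopen raises URLError or
    HTTPError if the server is unreachable or answers with an error.
    """
    response = urlopen(replay_url(collection, url, host), timeout=timeout)
    try:
        return response.getcode()
    finally:
        response.close()


if __name__ == "__main__":
    # Only prints the URL; call check_replay() once `wayback` is running.
    print(replay_url("my_coll", "http://example.com/"))
```

With wayback running in ~/myarchive, check_replay("my_coll", "http://example.com/") should return 200 if the capture is present in the collection.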

A more detailed tutorial is available on the wiki.

Legacy installation instructions contain additional information and testing examples, and use a custom config.yaml file. These instructions are from previous releases but still apply for pywb 0.9.0.

Running Samples and Tests

To run the bundled samples (also used by the test suite), you’ll need to clone pywb locally:

  1. git clone -b 0.9.0b https://github.com/ikreymer/pywb.git

  2. python setup.py install

  3. wayback to run samples

  4. Browse to http://localhost:8080/pywb/*/example.com to see capture of http://example.com

To run tests on your system, you may run python setup.py test

(A few tests require the optional pyopenssl package and are skipped if the package is not installed)

Additional Samples and Other Projects

Additional (older) samples can be found in the pywb-samples repository.

For additional reference on how pywb is being used, you may check some of the public projects using pywb.

Desktop Web Archive Player

There is now also a downloadable point-and-click Web Archive Player which provides a native OS X and Windows application for browsing web archives, built using pywb.

You can use this tool to quickly check the contents of any WARC or ARC file through a standard point-and-click GUI, no command line tools needed.

pywb Tools Overview

In addition to the standard Wayback Machine, the pywb tool suite includes a number of useful command-line and web server tools. The tools should be available to run after running python setup.py install:

  • live-rewrite-server – a demo live rewriting web server which accepts requests using wayback machine url format at /rewrite/ path, eg, /rewrite/http://example.com/ and applies the same url rewriting rules as are used for archived content. This is useful for checking how live content will appear when archived before actually creating any archive files, or for recording data. The webrecorder.io service is built using this tool.

  • cdx-indexer – a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and non-SURT based cdx files and optional sorting. See cdx-indexer -h for all options.

  • cdx-server – a CDX API only server which returns responses about CDX captures in bulk. Includes most of the features of the original cdx server implementation; updated documentation is coming soon.

  • proxy-cert-auth – a utility to support proxy mode. It can be used to create a CA root certificate, or a per-host certificate with an existing root cert.

  • wayback – The Wayback Machine application itself.

  • wayback-manager – A command-line utility for managing collections, adding WARC/ARC files, metadata and UI templates. See wayback-manager --help for an up-to-date listing of commands and options.
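
To confirm that the install step put the tools listed above on your PATH, a small check along these lines may help. It is a sketch using only the Python 3 standard library; the function name find_tools is hypothetical, and the tool names are taken from the list above.

```python
import shutil

# Command names as listed in the tools overview above.
PYWB_TOOLS = (
    "live-rewrite-server",
    "cdx-indexer",
    "cdx-server",
    "proxy-cert-auth",
    "wayback",
    "wayback-manager",
)


def find_tools(names=PYWB_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print("{0:22} {1}".format(name, path or "not found"))
```

If a tool is reported as not found inside a virtualenv, check that the environment is activated before running the script.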

Latest Changes

See CHANGES.rst for an up-to-date changelist.

Running as Rewriting Live Web Proxy

In addition to replaying archived web content, pywb can serve as a rewriting proxy to the live web. This allows pywb to serve live content and inject custom content into web pages on the fly, which allows for a variety of use cases beyond archive replay.

For example, the pywb-webrecorder demonstrates a way to use pywb live web rewriting together with a recording proxy (warcprox) to record content while browsing.

The via.hypothes.is project uses pywb to inject annotations into any live web page.

Running in HTTP/HTTPS Proxy Mode

pywb can also be used as an actual HTTP and/or HTTPS proxy server. See pywb Proxy Mode Usage for more details on configuring proxy mode.

To run as an HTTPS proxy server, pywb provides a facility for generating a custom self-signed root certificate, which can be used to replay HTTPS content from the archive. (The certificate should be used with caution within a controlled setting.) Using these features requires an extra dependency: the pyopenssl library must be installed via pip install pyopenssl
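
Before enabling HTTPS proxy mode, it can be useful to check that the optional dependency is present. The following is a minimal sketch using the Python 3 standard library; the function name has_pyopenssl is hypothetical, and it relies on the fact that the pyopenssl package is imported under the module name OpenSSL.

```python
import importlib.util


def has_pyopenssl():
    """Return True if the pyopenssl package (import name: OpenSSL) is installed."""
    return importlib.util.find_spec("OpenSSL") is not None


if __name__ == "__main__":
    if has_pyopenssl():
        print("pyopenssl is installed")
    else:
        print("pyopenssl is missing; install it with: pip install pyopenssl")
```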

For more info, see Proxy Mode Usage

The pywb-proxy-demo project also contains a working configuration of proxy mode deployment.

WSGI Container

The default wayback application starts pywb in a single-threaded single-process reference WSGI container.

For production use, running in a different container, such as uWSGI, is strongly recommended.

For example, the uwsgi.ini and run-uwsgi.sh files in this repo show how to run pywb with uWSGI.

The module pywb.apps.wayback may be used as the entry point for WSGI.

pywb should run in any standards (PEP-333 and PEP-3333) compatible WSGI container.
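
To illustrate what a PEP-3333 container does with an entry point, the sketch below calls a WSGI application directly, the way a container would for each request. It does not import pywb: demo_app is a hypothetical stand-in, and call_wsgi is an illustrative helper. The name of the callable exported by pywb.apps.wayback is not stated in this document, so check the module before pointing a container at it.

```python
from wsgiref.util import setup_testing_defaults


def demo_app(environ, start_response):
    """Stand-in WSGI application (NOT pywb) that echoes the request path."""
    body = ("requested " + environ["PATH_INFO"]).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_wsgi(app, path):
    """Invoke a WSGI app once, as a container would, and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_wsgi(demo_app, "/my_coll/http://example.com/"))
```

Any container that follows this calling convention should be able to host the real application in place of demo_app.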

Custom UI and User Metadata

pywb makes it easy to customize most aspects of the UI around archived content, including a custom banner insert, query calendar, search and home pages, via HTML Jinja2 templates.

You can see a list of all available UI templates by running: wayback-manager template --list

To copy a default template to the file system (for modification), you can run wayback-manager template <coll> --add <template_name>

pywb now also supports custom user metadata for each collection. The metadata may be specified in the metadata.yaml file in each collection’s directory.

The metadata is accessible to all UI templates and may be displayed to the user as needed.

See the Wayback Manager Tutorial and the UI Customization page for more details.

About Wayback Machine

pywb is compatible with the standard Wayback Machine url format:

Replay: http://<host>/<collection>/<timestamp>/<original url>

Query Listing: http://<host>/<collection>/*/<original url>
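
The two URL forms above can be told apart with a short parser. This is a simplified sketch, not pywb's own URL handling: the name parse_wayback_path is hypothetical, and the pattern assumes the timestamp is 1 to 14 digits (a possibly truncated YYYYMMDDhhmmss value) or a literal * for a query listing.

```python
import re

# /<collection>/<timestamp or *>/<original url>
# Assumption: timestamps are digits only, at most 14 of them.
WAYBACK_PATH = re.compile(
    r"^/(?P<collection>[^/]+)/(?P<timestamp>\d{1,14}|\*)/(?P<url>.+)$"
)


def parse_wayback_path(path):
    """Split a request path into collection, timestamp, url and kind.

    Returns None when the path does not match either form.
    """
    match = WAYBACK_PATH.match(path)
    if match is None:
        return None
    parts = match.groupdict()
    parts["kind"] = "query" if parts["timestamp"] == "*" else "replay"
    return parts


if __name__ == "__main__":
    print(parse_wayback_path("/my_coll/20140127171200/http://example.com/"))
    print(parse_wayback_path("/pywb/*/example.com"))
```

Paths without a timestamp segment, such as /my_coll/http://example.com/, are deliberately not matched by this pattern and yield None.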

Additional Reference

  • The wiki will have additional technical documentation about various aspects of pywb

  • The sample config.yaml file, although not required, will provide a listing of various advanced configuration options: config.yaml

Contributions & Bug Reports

Users are encouraged to fork and contribute to this project to improve any and all aspects of web archival replay and web proxy services.

Please take a look at the list of current issues and feel free to open new ones.
