Python WayBack Machine for web archive replay

pywb is a new Python implementation of the Wayback Machine software and tools.

At its core, it provides a web app which ‘replays’ archived web data stored in ARC and WARC files and provides metadata about the archived captures.

Latest Changes

The basic feature set of web replay is nearly complete in this version.

pywb now features new domain-specific rules which are applied to certain difficult and dynamic content in order to make web replay work.

This rule set will be under constant iteration to deal with new challenges as the web evolves.

Wayback Machine

pywb is compatible with the standard Wayback Machine url format:

http://<host>/<collection>/<timestamp>/<original url>

Some examples of this url from other wayback machines (not implemented via pywb):

http://web.archive.org/web/20140312103519/http://www.example.com
http://www.webarchive.org.uk/wayback/archive/20100513010014/http://www.example.com/

A listing of archived content, often in calendar form, is available when a * is used instead of timestamp.
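The url format above can be assembled mechanically. The following is a minimal sketch, not part of pywb: the helper name wayback_url is hypothetical, and the 14-digit timestamp layout (YYYYMMDDhhmmss) is inferred from the example urls above.

```python
from datetime import datetime


def wayback_url(host, collection, original_url, timestamp=None):
    """Build http://<host>/<collection>/<timestamp>/<original url>.

    timestamp may be a datetime, a string such as "20140312103519",
    "*" for the listing of captures, or None to omit the segment.
    """
    if isinstance(timestamp, datetime):
        # Format as 14 digits: YYYYMMDDhhmmss.
        timestamp = timestamp.strftime("%Y%m%d%H%M%S")
    parts = ["http://" + host, collection]
    if timestamp:
        parts.append(timestamp)
    parts.append(original_url)
    return "/".join(parts)
```

For example, wayback_url("localhost:8080", "pywb", "example.com", "*") yields the capture listing url for example.com in the pywb collection.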

The Wayback Machine often uses an html parser to rewrite relative and absolute links, as well as absolute links found in javascript, css and some xml.

pywb provides these features as a starting point.
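To illustrate the idea of link rewriting, here is a rough standalone sketch (Python 3, standard library only). It is not pywb's rewriter: the PREFIX value, rewrite_link and LinkCollector are all hypothetical names, and only href and src attributes are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical replay prefix: host, collection and capture timestamp.
PREFIX = "http://localhost:8080/pywb/20140312103519/"


def rewrite_link(link, page_url, prefix=PREFIX):
    """Resolve link against the archived page's url, then prepend the prefix."""
    return prefix + urljoin(page_url, link)


class LinkCollector(HTMLParser):
    """Collect (original, rewritten) pairs for href and src attributes."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.rewritten = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.rewritten.append((value, rewrite_link(value, self.page_url)))
```

Relative links are resolved against the original page url first, so that both relative and absolute links end up pointing back into the archive rather than at the live web.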

Requirements

pywb has been tested on Python 2.6, 2.7 and PyPy.

It runs best in python 2.7 currently.

The pywb tool suite provides several WSGI applications, which have been tested under wsgiref and uWSGI.

For best results, the uWSGI container is recommended.
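For readers unfamiliar with WSGI, the sketch below shows how any WSGI application can be served with wsgiref, the reference server from the standard library. The application function here is a placeholder written for this example; it is not a pywb application object, and serve is a hypothetical helper.

```python
from wsgiref.simple_server import make_server


def application(environ, start_response):
    """Placeholder WSGI app standing in for a real application object."""
    body = ("requested: " + environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(port=8080):
    """Serve the app with the reference WSGI server until interrupted."""
    httpd = make_server("", port, application)
    httpd.serve_forever()
```

Calling serve() blocks and handles one request at a time, which is fine for local testing. A container such as uWSGI is the better fit for production use.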

Support for Python 3 is planned.

Sample Data

pywb comes with a set of sample archived content, also used by the test suite.

The data can be found in sample_archive and contains WARC and CDX files.

The sample archive contains recent captures from http://example.com and http://iana.org.

Runnable Apps

The pywb tool suite currently includes two runnable applications, installed as command-line scripts via setuptools:

  • wayback or python -m pywb.apps.wayback – start the full wayback on port 8080

  • cdx-server or python -m pywb.apps.cdx_server – start standalone cdx server on port 8090

Step-By-Step Installation

To start a pywb with sample data:

  1. Clone this repo

  2. Install with python setup.py install

  3. Run wayback (shorthand for python -m pywb.apps.wayback) to start the pywb wayback server with the reference WSGI implementation, or run run-uwsgi.sh to start with uWSGI (see below for more info).

  4. Test pywb in your browser! (pywb is set to run on port 8080 by default.)

If everything worked, the following pages should load (served from the sample_archive directory):

Original Url         Latest Capture                           List of All Captures
http://example.com   http://localhost:8080/pywb/example.com   http://localhost:8080/pywb/*/example.com
http://iana.org      http://localhost:8080/pywb/iana.org      http://localhost:8080/pywb/*/iana.org
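The check in the browser can also be scripted. The following is a small smoke-test sketch (Python 3, standard library only), assuming the server is already running on localhost:8080 with the sample data. The names sample_urls and check are made up for this example.

```python
from urllib.request import urlopen

BASE = "http://localhost:8080"
SAMPLE_SITES = ["example.com", "iana.org"]


def sample_urls(base=BASE, collection="pywb", sites=SAMPLE_SITES):
    """Return the latest-capture and capture-listing url for each sample site."""
    urls = []
    for site in sites:
        urls.append("%s/%s/%s" % (base, collection, site))
        urls.append("%s/%s/*/%s" % (base, collection, site))
    return urls


def check(urls, fetch=urlopen):
    """Fetch each url and return {url: HTTP status}.

    fetch is injectable so the function can be exercised without a server.
    With the default urlopen, 4xx/5xx responses raise HTTPError.
    """
    results = {}
    for url in urls:
        response = fetch(url)
        try:
            results[url] = response.status
        finally:
            response.close()
    return results
```

With the server running, check(sample_urls()) should return a status for each of the four sample urls.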

uWSGI startup script

A sample uWSGI startup script, run-uwsgi.sh, which assumes a default uWSGI installation, is provided as well.

Currently, uWSGI is not installed automatically with this distribution, but it is recommended for production environments.

Please see uWSGI Installation for more details on installing uWSGI.

Vagrant

pywb comes with a Vagrantfile to help you quickly set up a VM for testing and deploying pywb with uWSGI.

If you have Vagrant and VirtualBox installed, then you can start a test instance of pywb like so:

git clone https://github.com/ikreymer/pywb.git
cd pywb
vagrant up

After pywb and all its dependencies are installed, the uWSGI server will start up and print a line such as:

spawned uWSGI worker 1 (and the only) (pid: 123, cores: 1)

At this point, you can open a web browser and navigate to the examples above for testing.

Test Suite

Currently pywb includes a full (and growing) suite of unit, doctest and integration tests.

Top level integration tests can be found in the tests/ directory, and each subpackage also contains doctests and unit tests.

The full set of tests can be run by executing:

python setup.py test

which will run the tests using py.test.

The py.test coverage plugin is used to keep track of test coverage.

Sample Setup

pywb is configurable via yaml.

The simplest config.yaml is roughly as follows:

collections:
   pywb: ./sample_archive/cdx/


archive_paths: ./sample_archive/warcs/

This sets up pywb with a single route for collection /pywb.

(The latest version of config.yaml contains additional documentation and specifies all the optional properties, such as ui filenames for Jinja2/html template files.)

For more advanced use, the pywb init path can be customized further:

  • The PYWB_CONFIG_FILE env can be used to set a different yaml file.

  • A custom init app (with or without yaml) can be created. See wayback.py and pywb_init.py for examples of existing initialization paths.
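As an illustration of the PYWB_CONFIG_FILE option, the sketch below writes a minimal config with the two keys shown earlier and prepares the environment for starting the wayback app. The helper names write_config and launch_spec are hypothetical; only the config keys, the env variable name and the python -m pywb.apps.wayback command come from this document.

```python
import os
import sys


def write_config(path, collection, cdx_dir, warc_dir):
    """Write a minimal yaml config with a single collection."""
    text = (
        "collections:\n"
        "   %s: %s\n"
        "\n"
        "archive_paths: %s\n"
    ) % (collection, cdx_dir, warc_dir)
    with open(path, "w") as handle:
        handle.write(text)
    return text


def launch_spec(config_path):
    """Return (command, environment) for starting wayback with this config."""
    env = dict(os.environ)
    env["PYWB_CONFIG_FILE"] = config_path
    command = [sys.executable, "-m", "pywb.apps.wayback"]
    return command, env
```

The returned pair can be passed to subprocess.call(command, env=env) to start the server with the alternate config file.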

Configuring PyWb With Archived Data

Please see the PyWb Configuration page for the latest instructions on how to set up pywb to run with your existing WARC/ARC collections.

Additional Documentation

  • For additional/up-to-date configuration details, consult the current config.yaml

  • The wiki will have additional technical documentation about various aspects of pywb

Contributions

You are encouraged to fork and contribute to this project to improve web archiving replay.

Please take a look at the list of current issues and feel free to open new ones.

Download files

Source Distribution

pywb-0.2.0.tar.gz (65.0 kB)
