CDX Summary

Summarize web archive capture index (CDX) files.

Installation

$ pip install cdxsummary

Alternatively, install from the source.

$ python3 setup.py install

To run the tool as a one-off Docker container, build the image as follows. The cdxsummary executable is the entrypoint script of the container.

$ docker image build -t cdxsummary .
$ docker container run -it --rm cdxsummary

Usage

$ cdxsummary --help
usage: cdxsummary [-h] [-a [HOST:PORT]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]

Summarize web archive capture index (CDX) files.

positional arguments:
  input                 CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')

optional arguments:
  -h, --help            show this help message and exit
  -a [HOST:PORT], --api [HOST:PORT]
                        Run a CDX summarizer API server on the given host and port (default: 0.0.0.0:5000)
  -i, --item            Treat the input argument as a Petabox item identifier instead of a file path
  -j, --json            Generate summary in JSON format
  -l, --load            Load JSON report instead of CDX
  -o [FILE], --out [FILE]
                        Write output to the given file (default: STDOUT)
  -r, --report          Generate non-summarized JSON report
  -s [N], --samples [N]
                        Number of sample memento URLs in summary (default: 10)
  -t [N], --tophosts [N]
                        Number of hosts with maximum captures in summary (default: 10)
  -v, --version         Show version number
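
The flags listed in the help text can also be driven from a script. The following is a minimal sketch, not part of cdxsummary itself: the helper names build_command and run_summary are hypothetical, and it assumes the cdxsummary executable is installed and on the PATH. It uses only the -j, -s, -t, and -o options shown in the help text.

```python
import shutil
import subprocess


def build_command(source, as_json=True, samples=None, tophosts=None, out=None):
    """Assemble a cdxsummary argument list (hypothetical helper).

    Only flags documented in `cdxsummary --help` are used.
    """
    cmd = ["cdxsummary"]
    if as_json:
        cmd.append("-j")                 # --json
    if samples is not None:
        cmd += ["-s", str(samples)]      # --samples N
    if tophosts is not None:
        cmd += ["-t", str(tophosts)]     # --tophosts N
    if out is not None:
        cmd += ["-o", str(out)]          # --out FILE
    # The positional input goes last; "-" means read from STDIN.
    cmd.append(source)
    return cmd


def run_summary(source, **options):
    """Run cdxsummary and return its standard output as text."""
    if shutil.which("cdxsummary") is None:
        raise FileNotFoundError("cdxsummary is not on PATH")
    result = subprocess.run(
        build_command(source, **options),
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, run_summary("capture.cdx.gz", samples=5) would request a JSON summary with five sample URLs for a hypothetical local file named capture.cdx.gz. The structure of the JSON output is not described in the help text, so parse the result with json.loads and inspect its keys rather than assuming particular field names.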

Download files

Source Distribution

cdxsummary-0.1.1b1.tar.gz (18.7 kB)

Built Distribution

cdxsummary-0.1.1b1-py3-none-any.whl (20.2 kB, Python 3)
