Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives.

These details have been verified by PyPI

Maintainers

heinrichreimer hscells lgienapp mam10eks

Intended Audience
- Science/Research
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Project description

📜 The Archive Query Log

Start now by running your custom analysis/experiment, scraping your own query log, or just looking at our example files.

Integrations

Running Experiments on the AQL

The data in the Archive Query Log is highly sensitive (still, you can re-crawl everything from the Wayback Machine). For that reason, we ensure that custom experiments or analyses cannot leak sensitive data (please get in touch if you have questions) by using TIRA as a platform for custom analyses and experiments. In TIRA, you submit a Docker image that implements your experiment. Your software is then executed in a sandbox (without an internet connection) so that it cannot leak sensitive information. After your software has finished running, administrators will review your submission and unblind it so that you can access the outputs.
Please refer to our dedicated TIRA tutorial as a starting point for your experiments.

Crawling

To run the CLI and crawl a query log on your own machine, please refer to the instructions for single-machine deployments. If you instead want to scale up and run the crawling pipelines on a cluster, please refer to the instructions for cluster deployments.

Single-Machine (PyPI/Docker)

To run the Archive Query Log CLI on your machine, you can use either our PyPI package or the Docker image. (If you absolutely need to, you can also install the Python CLI or the Docker image from source.)

Installation (PyPI)

First, install Python 3.10 and pipx (which lets you install the AQL CLI in an isolated virtual environment). Then install the Archive Query Log CLI by running:

pipx install archive-query-log

Now you can run the Archive Query Log CLI by running:

aql --help

Installation (Python from source)

First install Python 3.10, and clone this repository. From inside the repository directory, create a virtual environment and activate it:

python3.10 -m venv venv/
source venv/bin/activate

Now you can install the Archive Query Log CLI by running:

pip install -e .

Note: The commands below use the syntax of the PyPI installation. To run the same commands with the local Python installation, replace aql with python -m archive_query_log, for example:

python -m archive_query_log --help

Installation (Docker)

You only need to install Docker.

Note: The commands below use the syntax of the PyPI installation. To run the same commands with the Docker installation, replace aql with docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml ghcr.io/webis-de/archive-query-log, for example:

docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml ghcr.io/webis-de/archive-query-log --help
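
The substitution described in the note above can be wrapped in a small helper. This is a minimal sketch and not part of the AQL package: the function name docker_aql_command is hypothetical, it assumes a POSIX-style path for the working directory, and it only assembles the argument list from the image name and mount path shown in the command above.

```python
import os
import shlex

# Image name and container path are copied from the command above.
IMAGE = "ghcr.io/webis-de/archive-query-log"
CONTAINER_CONFIG = "/workspace/config.override.yml"


def docker_aql_command(aql_args, config_dir=None, image=IMAGE):
    """Build the `docker run` argument list that replaces `aql <aql_args>`.

    Hypothetical helper for illustration; it does not start Docker itself.
    """
    config_dir = os.getcwd() if config_dir is None else config_dir
    host_config = os.path.join(config_dir, "config.override.yml")
    return [
        "docker", "run", "-it",
        "-v", f"{host_config}:{CONTAINER_CONFIG}",
        image,
        *aql_args,
    ]


def as_shell(aql_args, **kwargs):
    """Return the same command as a single, shell-quoted string."""
    return shlex.join(docker_aql_command(aql_args, **kwargs))
```

For example, as_shell(["--help"]) yields a command line equivalent to the one above, with the current working directory filled in for "$(pwd)". The list form can be passed to subprocess.run directly.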

Installation (Docker from source)

First install Docker, and clone this repository. From inside the repository directory, build the Docker image like this:

docker build -t aql .

Note: The commands below use the syntax of the PyPI installation. To run the same commands with the locally built Docker image, replace aql with docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml aql, for example:

docker run -it -v "$(pwd)"/config.override.yml:/workspace/config.override.yml aql --help

Configuration

Crawling the Archive Query Log requires access to an Elasticsearch cluster. To configure access to the Elasticsearch cluster, add a config.override.yml file in the current directory, with the following contents. Replace the placeholders with your actual credentials:

es:
  host: "<HOST>"
  port: 9200
  username: "<USERNAME>"
  password: "<PASSWORD>"
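
If you prefer not to type credentials into the file by hand, the file can be generated from environment variables. The following is a minimal sketch under stated assumptions: the variable names AQL_ES_HOST, AQL_ES_PORT, AQL_ES_USERNAME, and AQL_ES_PASSWORD are hypothetical (the AQL CLI is not known to read them), and the helper only reproduces the es section shown above.

```python
import json
import os
from pathlib import Path


def render_config(host, username, password, port=9200):
    """Render the `es` section shown above as YAML text.

    json.dumps does the quoting, since a JSON string is also a valid
    double-quoted YAML scalar.
    """
    return (
        "es:\n"
        f"  host: {json.dumps(host)}\n"
        f"  port: {int(port)}\n"
        f"  username: {json.dumps(username)}\n"
        f"  password: {json.dumps(password)}\n"
    )


def write_config(path="config.override.yml", env=os.environ):
    """Write the config file from the (hypothetical) AQL_ES_* variables."""
    text = render_config(
        host=env["AQL_ES_HOST"],
        username=env["AQL_ES_USERNAME"],
        password=env["AQL_ES_PASSWORD"],
        port=env.get("AQL_ES_PORT", 9200),
    )
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    # Keep the credentials readable by the current user only.
    target.chmod(0o600)
    return target
```

Calling write_config() in the directory from which you run aql creates the file in the location described above. A missing variable raises a KeyError instead of silently writing an incomplete file.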

Add an archive service

aql archives add

Add a search provider

aql providers add

Build source pairs

aql sources build

Fetch captures

aql captures fetch

Cluster (Helm/Kubernetes)

Running the Archive Query Log on a cluster is recommended for large-scale crawls. We provide a Helm chart that automatically starts crawling and parsing jobs for you and stores the results in an Elasticsearch cluster.

Installation

Just install Helm and configure kubectl for your cluster.

Configuration

Crawling the Archive Query Log requires access to an Elasticsearch cluster. Configure the Elasticsearch credentials in a values.override.yaml file like this:

elasticsearch:
  host: "<HOST>"
  port: 9200
  username: "<USERNAME>"
  password: "<PASSWORD>"

Deployment

Let's deploy the Helm chart on the cluster (we're testing first with --dry-run to see if everything works):

helm upgrade --install --values helm/archive-query-log/values.override.yaml --dry-run archive-query-log helm/archive-query-log

If everything worked and the output looks good, you can remove the --dry-run flag to actually deploy the chart.
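
The two-step flow above (dry run first, then the real deployment) can be scripted. This is a sketch only: the function names are hypothetical, and it checks nothing but the exit code of the dry run, so it does not replace reading the dry-run output yourself. The command-line arguments are copied from the helm command above.

```python
import subprocess

# Release name, chart path, and values file are copied from the command above.
RELEASE = "archive-query-log"
CHART = "helm/archive-query-log"
VALUES = "helm/archive-query-log/values.override.yaml"


def helm_command(dry_run):
    """Build the `helm upgrade --install` call, with or without --dry-run."""
    cmd = ["helm", "upgrade", "--install", "--values", VALUES]
    if dry_run:
        cmd.append("--dry-run")
    return cmd + [RELEASE, CHART]


def deploy(run=subprocess.run):
    """Dry-run first; deploy only if the dry run exits successfully.

    `run` is injectable so the flow can be exercised without a cluster.
    """
    if run(helm_command(dry_run=True)).returncode != 0:
        return False
    return run(helm_command(dry_run=False)).returncode == 0
```

deploy() returns True only if both helm invocations exit with status 0.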

Uninstall

If you no longer need the chart, you can uninstall it:

helm uninstall archive-query-log

Citation

If you use the Archive Query Log dataset or the crawling code in your research, please cite the following paper describing the AQL and its use cases:

Heinrich Reimer, Sebastian Schmidt, Maik Fröbe, Lukas Gienapp, Harrisen Scells, Benno Stein, Matthias Hagen, and Martin Potthast. The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives. In Hsin-Hsi Chen et al., editors, 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), pages 2848–2860, July 2023. ACM.

You can use the following BibTeX entry for citation:

@InProceedings{reimer:2023,
    author = {{Jan Heinrich} Reimer and Sebastian Schmidt and Maik Fr{\"o}be and Lukas Gienapp and Harrisen Scells and Benno Stein and Matthias Hagen and Martin Potthast},
    booktitle = {46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023)},
    doi = {10.1145/3539618.3591890},
    editor = {Hsin{-}Hsi Chen and Wei{-}Jou (Edward) Duh and Hen{-}Hsen Huang and Makoto P. Kato and Josiane Mothe and Barbara Poblete},
    ids = {potthast:2023u},
    isbn = {9781450394086},
    month = jul,
    numpages = 13,
    pages = {2848--2860},
    publisher = {ACM},
    site = {Taipei, Taiwan},
    title = {{The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives}},
    url = {https://dl.acm.org/doi/10.1145/3539618.3591890},
    year = 2023
}

Development

Refer to the local Python installation instructions to set up the development environment and install the dependencies.

After implementing a new feature, check the code format, inspect common lint errors, check static typing and security, and run all unit tests with the following commands:

flake8 archive_query_log  # Code format
pylint archive_query_log  # LINT errors
mypy archive_query_log  # Static typing
bandit -c pyproject.toml -r archive_query_log  # Security
pytest archive_query_log  # Unit tests
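
To run all five checks in one go and see every failure rather than stopping at the first, a small wrapper can help. This is a sketch, not a script shipped with the repository: the name run_checks is hypothetical, and the commands are exactly those listed above.

```python
import subprocess

# The five commands listed above, in the same order.
CHECKS = [
    ["flake8", "archive_query_log"],
    ["pylint", "archive_query_log"],
    ["mypy", "archive_query_log"],
    ["bandit", "-c", "pyproject.toml", "-r", "archive_query_log"],
    ["pytest", "archive_query_log"],
]


def run_checks(checks=CHECKS, run=subprocess.call):
    """Run every check, even after a failure, and return the failed tool names.

    `run` must return the exit status of the command; it is injectable so
    the loop can be tested without the tools installed.
    """
    failed = []
    for cmd in checks:
        if run(cmd) != 0:
            failed.append(cmd[0])
    return failed
```

An empty list from run_checks() means all checks passed. A non-empty list names the tools whose output needs attention.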

Add new tests for parsers

At the moment, our workflow for adding new tests for parsers goes like this:

1. Select the number of tests to run per service and the number of services.
2. Auto-generate unit tests and download WARCs with generate_tests.py.
3. Run the tests.
4. Failing tests will open a diff editor with the approval and a web browser tab with the Wayback URL.
5. Use the web browser dev tools to find the query input field and search result CSS paths.
6. Close diffs and tabs and re-run tests.

Contribute

If you've found an important search provider to be missing from this query log, please suggest it by creating an issue. We also very gratefully accept pull requests for adding search providers or new parser configurations!

If you're unsure about anything, post an issue or contact us. We're happy to help!

License

This repository is released under the MIT license. Files in the data/ directory are exempt from this license. If you use the AQL in your research, we'd be glad if you'd cite us.

Abstract

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

Release history

- 1.2.26 (Nov 21, 2023)
- 0.1.33 (Nov 30, 2023)
- 0.1.32 (Nov 27, 2023)
- 0.1.31 (Nov 27, 2023)
- 0.1.30 (Nov 27, 2023)
- 0.1.29 (Nov 24, 2023)
- 0.1.28 (Nov 24, 2023)
- 0.1.26 (Nov 21, 2023)
- 0.1.25 (Nov 21, 2023)
- 0.1.24 (Nov 20, 2023)
- 0.1.23 (Nov 20, 2023)
- 0.1.22 (Nov 20, 2023)
- 0.1.20 (Nov 20, 2023)
- 0.1.19 (Nov 19, 2023)
- 0.1.18 (Nov 19, 2023)
- 0.1.17 (Nov 16, 2023)
- 0.1.16 (Nov 16, 2023)
- 0.1.15 (Nov 16, 2023)
- 0.1.14 (Nov 15, 2023)
- 0.1.12 (Nov 15, 2023)
- 0.1.10 (Nov 14, 2023)
- 0.1.9 (Nov 14, 2023)
- 0.1.8 (Nov 14, 2023)
- 0.1.7 (Nov 13, 2023), this version
- 0.1.6 (Nov 13, 2023)
- 0.1.5 (Nov 3, 2023)
- 0.1.3 (Aug 28, 2023)
- 0.1.2 (Aug 28, 2023)
- 0.1.1 (Aug 17, 2023)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

archive-query-log-0.1.7.tar.gz (37.0 MB)

Uploaded Nov 13, 2023 Source

Built Distribution

archive_query_log-0.1.7-py3-none-any.whl (160.8 kB)

Uploaded Nov 13, 2023 Python 3

Hashes for archive-query-log-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`156fed6d3801b21b5da6f483372cc4dd5fb8c6cc90277b5caf12916edfc0e4d7`
MD5	`cd156962c435ebbadcaf1659c7d3b9e1`
BLAKE2b-256	`4d70757cac2700dd435f25d9dedd03e4f1b8c858078bba570b3e9d60f27438b3`

Hashes for archive_query_log-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`37450b62c228e9a4701b60fb4242af383f62b8e962fe16a7232ca3758f4ca269`
MD5	`7523b56b3a38699a852866e87e334ff8`
BLAKE2b-256	`e8d232af042baee8043a2393f5677e2333c8bd1fcf14914c6fb8ff0d0992b7f6`
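
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the expected digests are copied from the tables above.

```python
import hashlib

# SHA256 digests copied from the tables above.
EXPECTED_SHA256 = {
    "archive-query-log-0.1.7.tar.gz":
        "156fed6d3801b21b5da6f483372cc4dd5fb8c6cc90277b5caf12916edfc0e4d7",
    "archive_query_log-0.1.7-py3-none-any.whl":
        "37450b62c228e9a4701b60fb4242af383f62b8e962fe16a7232ca3758f4ca269",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so the 37 MB archive is not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """Return True if the file at `path` matches the published digest."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, verify("archive-query-log-0.1.7.tar.gz", "archive-query-log-0.1.7.tar.gz") returns True only if the file in the current directory matches the published source distribution.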
