cc2imgcap

Easily convert common crawl to image caption set using pyspark.

Common Crawl has 7.5M WARC files, which contain links from across the web. This simple tool lets you process one WARC file in about 20s and extract image links along with their alt text.
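To make the extraction step concrete, here is a minimal sketch of pulling image links and alt text out of a single WAT metadata record. It assumes the record is a JSON object whose links live under `Envelope` / `Payload-Metadata` / `HTTP-Response-Metadata` / `HTML-Metadata` / `Links`, with image links marked by `"path": "IMG@/src"`. The function name and the sample record are hypothetical, and this is not necessarily how cc2imgcap implements it.

```python
import json


def extract_image_captions(wat_record_json):
    """Return (url, alt) pairs from one WAT record (layout is an assumption)."""
    record = json.loads(wat_record_json)
    try:
        links = record["Envelope"]["Payload-Metadata"][
            "HTTP-Response-Metadata"]["HTML-Metadata"]["Links"]
    except (KeyError, TypeError):
        # Records without HTML metadata carry no image links.
        return []
    pairs = []
    for link in links:
        # Keep only <img src=...> links that have a non-empty alt text.
        if link.get("path") == "IMG@/src" and link.get("url") and link.get("alt"):
            pairs.append((link["url"], link["alt"]))
    return pairs


# Hand-built sample record, for illustration only.
sample = json.dumps({
    "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
        "Links": [
            {"path": "IMG@/src", "url": "https://example.com/cat.jpg", "alt": "a cat on a sofa"},
            {"path": "A@/href", "url": "https://example.com/about"},
            {"path": "IMG@/src", "url": "https://example.com/spacer.gif"},
        ]
    }}}}
})

print(extract_image_captions(sample))
```

Only the first link survives: the second is not an image, and the third has no alt text.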

It also deduplicates on url+text to save output space and speed up the process.
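The idea of deduplicating on url+text can be illustrated in plain Python: two rows count as duplicates only if both the image url and the caption text match. The tool itself runs on Spark, so this is only a sketch of the keying, and the function name is hypothetical.

```python
def deduplicate(pairs):
    """Keep the first occurrence of each (url, text) pair, preserving order."""
    seen = set()
    unique = []
    for url, text in pairs:
        key = (url, text)  # both fields form the deduplication key
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


rows = [
    ("https://example.com/cat.jpg", "a cat"),
    ("https://example.com/cat.jpg", "a cat"),         # exact duplicate, dropped
    ("https://example.com/cat.jpg", "a sleepy cat"),  # same url, new text, kept
]
print(deduplicate(rows))
```

On a Spark DataFrame the equivalent would be something like `df.dropDuplicates(["url", "text"])`, where the column names are an assumption.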

This makes it possible to do the first step of building a dataset like LAION-5B in about 30k CPU core hours (5*10^6 * 20 / 3600 ≈ 28k). That's about $1.2k on AWS EC2 (at $0.04 per core hour).
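The estimate can be reproduced with a few lines of arithmetic. The inputs are the figures quoted above (5*10^6 files, 20s per file on one core, $0.04 per core hour); the variable names are only for illustration.

```python
files = 5 * 10**6            # number of files used in the estimate
seconds_per_file = 20        # processing time per file on one core
price_per_core_hour = 0.04   # USD, EC2 price quoted above

core_hours = files * seconds_per_file / 3600
cost = core_hours * price_per_core_hour

print(f"{core_hours:,.0f} core hours, about ${cost:,.0f}")
# prints "27,778 core hours, about $1,111"
```

Rounding the core hours up to 30k before multiplying gives the $1.2k figure.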

cpu128-dy-c6i-32xlarge instances are advised.

Install

pip install cc2imgcap

Python examples

Check out these examples:

If you have a slurm cluster, refer to https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 to start a spark cluster there.

API

This module exposes a single function cc2imgcap which takes the same arguments as the command line tool:

  • output_path: the output path; should probably start with s3://. (required)
  • wat_index_count: the number of wat index files to read; can be None for all. (default 1)
  • wat_count: the number of wat files to read; can be None for all; will randomly subsample if present. (default 100)
  • master: the spark master url. (default local)
  • num_cores: the number of cores of each spark executor. (default 128)
  • mem_gb: the memory of each spark executor, in GB. (default 256)
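A call using the arguments listed above might look like the following sketch. It assumes the function is importable as `from cc2imgcap import cc2imgcap` and accepts these names as keyword arguments. The helper names and the bucket path are hypothetical. The import is deferred into `run` so the snippet can be loaded without Spark installed.

```python
def default_args(output_path):
    """Collect the documented arguments with their documented defaults."""
    return {
        "output_path": output_path,  # required; should probably start with s3://
        "wat_index_count": 1,        # None reads all wat index files
        "wat_count": 100,            # None reads all wat files
        "master": "local",           # spark master url
        "num_cores": 128,            # cores per spark executor
        "mem_gb": 256,               # memory per spark executor
    }


def run(output_path, **overrides):
    """Call cc2imgcap with the defaults above, overriding selected values."""
    # Assumed import path; requires `pip install cc2imgcap`.
    from cc2imgcap import cc2imgcap

    args = {**default_args(output_path), **overrides}
    cc2imgcap(**args)


# Example (hypothetical bucket): a small local run over 10 wat files.
# run("s3://my-bucket/cc2imgcap-out", wat_count=10, num_cores=16, mem_gb=32)
```

Starting with a small wat_count is a cheap way to check the output before scaling up to a full cluster run.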

For development

Work either locally or in Gitpod (run export PIP_USER=false there).

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests, install the test requirements:

pip install -r requirements-test.txt

Then run:

make lint
make test

You can use make black to reformat the code.

To run a specific test, use python -m pytest -x -s -v tests -k "dummy".
