blip-ci

BLIP library for use with CLIP Interrogator

Project description

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Announcement: BLIP is now officially integrated into LAVIS - a one-stop library for language-and-vision research and applications!

This is the PyTorch code of the BLIP paper [blog]. The code has been tested on PyTorch 1.10. To install the dependencies, run:

pip install -r requirements.txt

Catalog:

Inference demo
Pre-trained and finetuned checkpoints
Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
Pre-training code
Zero-shot video-text retrieval
Download of bootstrapped pre-training datasets

Inference demo:

Run our interactive demo using the Colab notebook (no GPU needed). The demo includes code for:

Image captioning
Open-ended visual question answering
Multimodal / unimodal feature extraction
Image-text matching

Try out the Web demo, integrated into Huggingface Spaces 🤗 using Gradio.

A Replicate web demo and Docker image are also available.

Pre-trained checkpoints:

Num. pre-train images	BLIP w/ ViT-B	BLIP w/ ViT-B and CapFilt-L	BLIP w/ ViT-L
14M	Download	-	-
129M	Download	Download	Download

Finetuned checkpoints:

Task	BLIP w/ ViT-B	BLIP w/ ViT-B and CapFilt-L	BLIP w/ ViT-L
Image-Text Retrieval (COCO)	Download	-	Download
Image-Text Retrieval (Flickr30k)	Download	-	Download
Image Captioning (COCO)	-	Download	Download
VQA	Download	Download	-
NLVR2	Download	-	-

Image-Text Retrieval:

Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
To evaluate the finetuned BLIP model on COCO, run:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco \
--evaluate
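
The launch commands in this README share one shape: `python -m torch.distributed.run`, a per-node process count, a script name, and optional flags. As a minimal sketch, the hypothetical helper `build_launch_command` below (it is not part of the BLIP code) assembles that argument list so it can be handed to `subprocess.run`. It only uses flags that appear in this README.

```python
import sys
from typing import List, Optional


def build_launch_command(
    script: str,
    nproc_per_node: int = 8,
    config: Optional[str] = None,
    output_dir: Optional[str] = None,
    evaluate: bool = False,
) -> List[str]:
    """Assemble a torch.distributed.run command line (hypothetical helper)."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        script,
    ]
    # Optional flags are appended only when requested, mirroring the README commands.
    if config is not None:
        cmd += ["--config", config]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    if evaluate:
        cmd.append("--evaluate")
    return cmd


if __name__ == "__main__":
    cmd = build_launch_command(
        "train_retrieval.py",
        config="./configs/retrieval_coco.yaml",
        output_dir="output/retrieval_coco",
        evaluate=True,
    )
    # Printed rather than executed; use subprocess.run(cmd, check=True) to launch.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.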

To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
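
Setting 'pretrained' is usually done by hand in an editor. For scripted setups, the sketch below rewrites a top-level `key: value` line in the YAML text using only the standard library. The helper name `set_yaml_scalar` is hypothetical, the sample config text is illustrative, and the code assumes the key sits on its own unindented line, which may not hold for every config file. A YAML library is the safer choice for anything more complex.

```python
import re


def set_yaml_scalar(yaml_text: str, key: str, value: str) -> str:
    """Replace the value of a top-level `key:` line, or append the key if absent.

    Hypothetical helper: handles only simple, unindented `key: value` lines,
    and assumes `value` contains no single quotes.
    """
    new_line = f"{key}: '{value}'"
    pattern = re.compile(rf"^{re.escape(key)}\s*:.*$", flags=re.MULTILINE)
    if pattern.search(yaml_text):
        # A function replacement keeps backslashes in the value from being interpreted.
        return pattern.sub(lambda _match: new_line, yaml_text, count=1)
    separator = "" if yaml_text.endswith("\n") or not yaml_text else "\n"
    return f"{yaml_text}{separator}{new_line}\n"


if __name__ == "__main__":
    # Illustrative config text; the real file contains more keys.
    example = "image_root: '/path/to/coco/'\npretrained: ''\nvit: 'base'\n"
    url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
    print(set_yaml_scalar(example, "pretrained", url))
```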

Image-Text Captioning:

Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
To evaluate the finetuned BLIP model on COCO, run:

python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate

To evaluate the finetuned BLIP model on NoCaps, generate results with the command below (the evaluation itself must be performed on the official server):

python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py

To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:

python -m torch.distributed.run --nproc_per_node=8 train_caption.py

VQA:

Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
To evaluate the finetuned BLIP model, generate results with the command below (the evaluation itself must be performed on the official server):

python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate

To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:

python -m torch.distributed.run --nproc_per_node=16 train_vqa.py

NLVR2:

Download the NLVR2 dataset from the original website, and set 'image_root' in configs/nlvr.yaml.
To evaluate the finetuned BLIP model, run:

python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate

To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py

Finetune with ViT-L:

To finetune a model with ViT-L, change the config file to set 'vit' as large. Batch size and learning rate may also need to be adjusted accordingly (please see the paper's appendix for hyper-parameter details). Gradient checkpointing can also be enabled in the config file to reduce GPU memory usage.

Pre-train:

Prepare training json files where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}.
In configs/pretrain.yaml, set 'train_file' as the paths for the json files.
Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/pretrain.yaml --output_dir output/Pretrain
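
The training json layout described in the steps above (a list of dictionaries with 'image' and 'caption' keys) can be sketched as follows. The helper names `build_annotations` and `write_annotation_file`, the file name, and the sample pair are assumptions made for illustration; only the two keys come from this README.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def build_annotations(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (image_path, caption) pairs into the list-of-dictionaries layout."""
    return [{"image": str(path), "caption": caption} for path, caption in pairs]


def write_annotation_file(pairs: Iterable[Tuple[str, str]], out_path: str) -> int:
    """Write one training json file and return the number of items written."""
    annotations = build_annotations(pairs)
    Path(out_path).write_text(json.dumps(annotations), encoding="utf-8")
    return len(annotations)


if __name__ == "__main__":
    # Hypothetical sample data; replace with real image paths and captions.
    sample = [("images/0001.jpg", "a dog on a beach")]
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "train.json")
        count = write_annotation_file(sample, target)
        print(f"wrote {count} item(s) to {target}")
```

The path of each file written this way is what 'train_file' in configs/pretrain.yaml should list.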

Zero-shot video-text retrieval:

Download MSRVTT dataset following the instructions from https://github.com/salesforce/ALPRO, and set 'video_root' accordingly in configs/retrieval_msrvtt.yaml.
Install decord with:
```
pip install decord
```
To perform zero-shot evaluation, run:

python -m torch.distributed.run --nproc_per_node=8 eval_retrieval_video.py
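
Before launching the evaluation, it can help to confirm that decord is actually importable in the active environment. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and it only accepts top-level module names.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["decord"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```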

Pre-training datasets download:

We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}.

Image source	Filtered web caption	Filtered synthetic caption by ViT-B	Filtered synthetic caption by ViT-L
CC3M+CC12M+SBU	Download	Download	Download
LAION115M	Download	Download	Download
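
A downloaded dataset file can be checked against the layout described above before it is used. The following is a minimal sketch: the function names `validate_entries` and `load_dataset_file` are hypothetical, and the check assumes both 'url' and 'caption' hold strings.

```python
import json
from typing import Any, Dict, List


def validate_entries(entries: Any) -> List[Dict[str, str]]:
    """Check that `entries` is a list of {'url': str, 'caption': str} dictionaries."""
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list at the top level")
    for index, item in enumerate(entries):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not a dictionary")
        for key in ("url", "caption"):
            # Assumption: both values are plain strings.
            if not isinstance(item.get(key), str):
                raise ValueError(f"item {index} has no string value for {key!r}")
    return entries


def load_dataset_file(path: str) -> List[Dict[str, str]]:
    """Load one dataset json file and validate its structure."""
    with open(path, encoding="utf-8") as handle:
        return validate_entries(json.load(handle))


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for a downloaded file.
    sample = [{"url": "https://example.com/cat.jpg", "caption": "a cat on a sofa"}]
    print(len(validate_entries(sample)), "valid item(s)")
```

Note that these files store image URLs, while the training files under Pre-train expect local image paths under the 'image' key, so the images still need to be downloaded and the entries remapped.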

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},
}

Acknowledgement

The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.

Release history

Version	Release date
0.0.5 (this version)	Jun 11, 2023
0.0.4	Jun 10, 2023
0.0.3	Feb 23, 2023
0.0.2	Feb 18, 2023
0.0.1	Feb 11, 2023
0.0.0	Feb 11, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

blip-ci-0.0.5.tar.gz (43.1 kB)

Uploaded Jun 11, 2023 Source

Built Distribution

blip_ci-0.0.5-py3-none-any.whl (55.5 kB)

Uploaded Jun 11, 2023 Python 3

Hashes for blip-ci-0.0.5.tar.gz

Algorithm	Hash digest
SHA256	`32d745e0c5364659b277dfe47ffaaa8c5865e121bde737cd237ce3e68075dbd6`
MD5	`bb2801d9d4d0859b1536780d19b03de3`
BLAKE2b-256	`5ba00bf1a7890f186f5ccafb8319ba5cb616ebe4f87255991cd7643a1fa113f1`

Hashes for blip_ci-0.0.5-py3-none-any.whl

Algorithm	Hash digest
SHA256	`04445ffcb71576c90d16a63931ab3d0a62ee5b1d7590dbc6a0343aef48e0662f`
MD5	`43bdf86f02654a65ba566f3e9ff38c33`
BLAKE2b-256	`89b1da5ce2f6b341ae091e84c43ea700a6b94375a29f04b9dd968802135021d3`
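
The digests listed above can be used to check a downloaded file. The sketch below computes a SHA256 digest with the standard library and compares it with an expected value; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the file path in the usage note is a placeholder.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Self-contained demo on a temporary file instead of a real download.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.bin")
        with open(demo_path, "wb") as handle:
            handle.write(b"abc")
        print(sha256_of_file(demo_path))
```

For the source distribution, the call would be `matches_expected("blip-ci-0.0.5.tar.gz", "32d745e0c5364659b277dfe47ffaaa8c5865e121bde737cd237ce3e68075dbd6")`, assuming the archive has been downloaded to the current directory.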