A webscraper to automate retrieving specific data from CAZy and build a local CAZyme SQL database for thoroughly interrogating the data. It also automates retrieving protein sequences, EC numbers and structure files for specific datasets in the CAZyme database.

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- MacOS :: MacOS X
- POSIX :: Linux
Programming Language
- Python :: 3.8
Topic
- Scientific/Engineering :: Bio-Informatics

cazy_webscraper

cazy_webscraper version 1 is deprecated. Please ensure you are using cazy_webscraper version 2 or newer.

Beta Version 2

This release of cazy_webscraper is a beta release. The main component of cazy_webscraper (downloading data from CAZy) is stable. Expanding the dataset to incorporate data from UniProt, GenBank and PDB still requires some work to iron out the remaining bugs.

The documentation is being actively updated to match the new cazy_webscraper version 2.

The bioconda installation method is not currently supported, but we aim to fix this soon. For now, please install via PyPI or from source.

New features in version 2:

Faster scraping: The entirety of CAZy can be scraped in 15 minutes
Retrieval of UniProt data: UniProt accessions, EC numbers, protein sequences, and PDB accessions can be retrieved from UniProt and added to the local CAZyme database
Addition of an API: As well as retrieving data from the local CAZyme database via an SQL interface, cazy_webscraper can retrieve user-specified data (e.g. the GenBank protein accession and EC number annotations) for proteins matching user-specified criteria. The extracted data can be written to a JSON and/or CSV file, to facilitate inclusion in downstream analyses.
Caching: Data downloaded from CAZy is parsed and written to a local CAZyme database, and the raw data files are also written to cache. Data can be scraped directly from a cache (ideal if CAZy updates during the retrieval of multiple datasets from the CAZy database).

Future work for version 2:

Fix any remaining bugs we can find (if you find a bug, please report it!)
Update the unit tests to work with the new cazy_webscraper architecture
Update the documentation
Create video tutorials

cazy_webscraper

cazy_webscraper is an application and Python3 package for the automated retrieval of protein data from the CAZy database. The code is distributed under the MIT license.

cazy_webscraper retrieves protein data from the CAZy database into a local SQLite3 database. This enables users to integrate the dataset into analytical pipelines, and interrogate the data in a manner unachievable through the CAZy website.
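Because the local database is a standard SQLite3 file, it can be opened with any SQLite client. The following is a minimal sketch using only the Python standard library. It makes no assumption about the table names cazy_webscraper creates: it reads them from SQLite's own `sqlite_master` catalogue. The function names `list_tables` and `count_rows` are illustrative and are not part of cazy_webscraper.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def count_rows(db_path, table):
    """Count the rows in one table; the name is checked against the schema first."""
    if table not in list_tables(db_path):
        raise ValueError(f"unknown table: {table}")
    with closing(sqlite3.connect(db_path)) as conn:
        (n_rows,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return n_rows
```

Calling `list_tables` on a database written by cazy_webscraper is a quick way to discover the schema before writing more specific queries.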

Using the expand subcommand, a user can retrieve:

CAZyme protein sequence data from GenBank
Protein structure files from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB)
EC numbers and UniProt protein IDs from the UniProtKB database

cazy_webscraper can retrieve specified CAZy classes and/or CAZy families. These queries can be filtered by taxonomy at kingdom, genus, species or strain level. Successive CAZy queries can be collated into a single local database. A log of each query is recorded in the database for transparency, reproducibility and shareability.

Citation

If you use cazy_webscraper, please cite the following publication:

Hobbs, Emma E. M.; Pritchard, Leighton; Chapman, Sean; Gloster, Tracey M. (2021): cazy_webscraper Microbiology Society Annual Conference 2021 poster. FigShare. Poster. https://doi.org/10.6084/m9.figshare.14370860.v7

cazy_webscraper
Citation
Best practice
Documentation
- Installation
- Quick start
Creating a local CAZyme database
- Combining configuration filters
- Default CAZy class synonyms
Retrieve data from UniProt
- Configuring UniProt data retrieval
Retrieving protein sequences from GenBank
- Configuring GenBank protein sequence data retrieval
Extracting protein sequences from the local CAZyme database and building a BLAST database
- Configuring extracting sequences from a local CAZyme db
Retrieving protein structure files from PDB
- Configuring PDB protein structure file retrieval
The cazy_webscraper API or Interrogating the local CAZyme database
Configuring cazy_webscraper using a YAML file
CAZy coverage of GenBank
- Configure calculating CAZy coverage of GenBank
Contributions
License and copyright

Best practice

When performing a long series of automated calls to a server, it is best to do so when traffic is lowest, such as at weekends or overnight in the server's time zone.

Documentation

Please see the full documentation at ReadTheDocs.

Installation

cazy_webscraper can be installed via conda or pip:

conda install -c bioconda cazy_webscraper

Please see the conda documentation and bioconda documentation for further details.

pip install cazy_webscraper

Please see the pip documentation for further details.

Quickstart

We have produced a "Getting Started With cazy_webscraper" poster.

To download all of CAZy and save the database in the default location (the cwd) with the default name (cazy_webscraper_<date>_<time>.db) use the following command:

cazy_webscraper <user_email>
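After a default run, the database file name contains the date and time of the scrape, so it is not known in advance. The sketch below locates the most recent such file from Python. It assumes only the `cazy_webscraper_*.db` naming pattern given above, and it picks the newest file by modification time rather than by parsing the name, because the exact date and time format is not stated here. `find_latest_database` is a hypothetical helper, not a cazy_webscraper function.

```python
from pathlib import Path


def find_latest_database(directory="."):
    """Return the most recently modified cazy_webscraper_*.db file, or None."""
    candidates = list(Path(directory).glob("cazy_webscraper_*.db"))
    if not candidates:
        return None
    # Choose by file modification time, not by the timestamp in the name.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

The returned path can then be passed to the expand commands described later, or opened directly with an SQLite client.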

Creating a local CAZyme database

Below are the command-line options for cazy_webscraper, which scrapes CAZy and compiles a local SQLite database. Options are listed in alphabetical order.

email - [REQUIRED] User email address. This is required by NCBI Entrez for querying the Entrez server.

--cache_dir - Path to cache dir to be used instead of default cache dir path.

--cazy_data - Path to a txt file downloaded from CAZy containing a CAZy db data dump.

--cazy_synonyms - Path to a JSON file containing accepted CAZy class synonyms, if the defaults are not sufficient.

--classes - List of classes from which all families are to be scraped.

--config, -c - Path to a configuration YAML file. Default: None.

--citation, -C - Print the cazy_webscraper citation. When called, the program terminates after printing the citation and CAZy is not scraped.

--database, -d - Path to an existing local CAZyme database to add newly scraped data to. Default: None.

Do not use --db_output and --database at the same time.

If neither --db_output nor --database is called, cazy_webscraper writes out a local CAZyme database to the cwd with the standardised name cazy_webscraper_<date>_<time>.db

--delete_old_relationships - Delete old CAZy family annotations of GenBank accessions. These are CAZy family annotations that are stored in the local database for a given GenBank accession, although the accession is no longer associated with those CAZy families in CAZy.

--families - List of CAZy (sub)families to scrape.

--force, -f - Force overwriting of an existing output file. Default: False.

--genera - List of genera to restrict the scrape to. Default: None, filter not applied to scrape.

--log, -l - Target path to write out a log file. If not called, no log file is written. Default: None (no log file is written out).

--nodelete_cache - When called, content in the existing cache dir will not be deleted. Default: False (existing content is deleted).

--nodelete_log - When called, content in the existing log dir will not be deleted. Default: False (existing content is deleted).

--retries, -r - Define the number of times to retry making a connection to CAZy if the connection should fail. Default: 10.

--sql_echo - Set SQLite engine echo parameter to True, causing SQLite to print log messages. Default: False.

--subfamilies, -s - Enable retrieval of CAZy subfamilies, otherwise only CAZy family annotations will be retrieved. Default: False.

--species - List of species (written as Genus Species) to restrict the scraping of CAZymes to. CAZymes will be retrieved for all strains of each given species.

--strains - List of specific species strains to restrict the scraping of CAZymes to.

--timeout, -t - Connection timeout limit (seconds). Default: 45.

--validate - Retrieve CAZy family population sizes from the CAZy website and check against the number of family members added to the local CAZyme database, as a method for validating the complete retrieval of CAZy data.

--verbose, -v - Enable verbose logging. This does not set the SQLite engine echo parameter to True. Default: False.

--version, -V - Print cazy_webscraper version number. When called and the version number is printed, cazy_webscraper is immediately terminated.
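The --retries and --timeout options describe a common pattern: give each connection attempt a time limit, and try again a fixed number of times if it fails. The sketch below illustrates that pattern in general terms and is not cazy_webscraper's own code. `call_with_retries` and its `wait` parameter are hypothetical, and it assumes that `retries` counts the attempts made after the first failure.

```python
import time


def call_with_retries(fetch, retries=10, timeout=45, wait=0.0):
    """Call fetch(timeout=...) until it succeeds or the retries are used up.

    fetch is any callable accepting a timeout keyword. Network failures in the
    standard library (URLError, socket timeouts) are subclasses of OSError.
    """
    last_error = None
    for attempt in range(retries + 1):  # one initial attempt plus the retries
        try:
            return fetch(timeout=timeout)
        except OSError as err:
            last_error = err
            if attempt < retries and wait:
                time.sleep(wait)  # optional pause before the next attempt
    raise ConnectionError(f"no response after {retries} retries") from last_error
```

With the standard library, `fetch` could be `lambda timeout: urllib.request.urlopen(url, timeout=timeout).read()` for some `url` of interest. Passing the request in as a callable keeps the retry logic independent of the HTTP client.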

Combining configuration filters

cazy_webscraper applies filters in a successive and layered structure.

CAZy class and family filters are applied first.

Kingdom filters are applied second.

Lastly, taxonomy (genus, species and strain) filters are applied.
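The layered order above can be pictured as each filter narrowing the result of the previous one. The following sketch applies that idea to a small in-memory list. The record layout (dictionaries with `accession`, `family`, `kingdom` and `genus` keys), the function name `apply_filters` and the example records are all hypothetical, and the assumption that a protein must pass every layer is an interpretation of the description, not a statement about cazy_webscraper's internals.

```python
def apply_filters(proteins, families=None, kingdoms=None, genera=None):
    """Narrow a list of protein records layer by layer.

    A filter left as None (or empty) is skipped, so it does not restrict the result.
    """
    selected = list(proteins)
    if families:  # layer 1: CAZy class/family
        selected = [p for p in selected if p["family"] in families]
    if kingdoms:  # layer 2: kingdom
        selected = [p for p in selected if p["kingdom"] in kingdoms]
    if genera:  # layer 3: genus (species and strain would follow the same pattern)
        selected = [p for p in selected if p["genus"] in genera]
    return selected


# Made-up records for illustration only.
RECORDS = [
    {"accession": "P1", "family": "GH1", "kingdom": "Eukaryota", "genus": "Aspergillus"},
    {"accession": "P2", "family": "GH1", "kingdom": "Bacteria", "genus": "Bacillus"},
    {"accession": "P3", "family": "PL1", "kingdom": "Eukaryota", "genus": "Aspergillus"},
]

bacterial_gh1 = apply_filters(RECORDS, families={"GH1"}, kingdoms={"Bacteria"})
```

Here `bacterial_gh1` contains only the record with accession `P2`, because it is the only one that passes both the family layer and the kingdom layer.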

Default CAZy class synonyms

CAZy classes are accepted in the written long form (such as Glycoside Hydrolases) and in their abbreviated form (e.g. GH).

Both the plural and singular abbreviated forms of a CAZy class name are accepted, e.g. GH and GHs.

Spaces, hyphens, underscores, or no separating character at all can be used in the CAZy class names. Therefore, Glycoside Hydrolases, Glycoside-Hydrolases, Glycoside_Hydrolases and GlycosideHydrolases are all accepted.

Class names can be written in all upper case, all lower case, or mixed case, such as GLYCOSIDE-HYDROLASES, glycoside hydrolases and Glycoside Hydrolases. All lower or all upper case CAZy class name abbreviations (such as GH and gh) are accepted.
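One way to accept all of these variants is to reduce every name to a single comparison key before looking it up. The sketch below does this by removing spaces, hyphens and underscores and lower-casing the text. It is an illustration of the idea, not cazy_webscraper's implementation: `normalise_class` and `CAZY_CLASSES` are hypothetical names, and the sketch is slightly more permissive than the rules above (it would also accept a mixed-case abbreviation such as `Gh`).

```python
import re

# The six CAZy classes and their standard abbreviations.
CAZY_CLASSES = {
    "GH": "Glycoside Hydrolases",
    "GT": "GlycosylTransferases",
    "PL": "Polysaccharide Lyases",
    "CE": "Carbohydrate Esterases",
    "AA": "Auxiliary Activities",
    "CBM": "Carbohydrate-Binding Modules",
}


def _key(text):
    """Drop spaces, hyphens and underscores, then lower-case."""
    return re.sub(r"[\s_-]+", "", text).lower()


def _build_lookup():
    lookup = {}
    for abbreviation, long_name in CAZY_CLASSES.items():
        lookup[_key(abbreviation)] = abbreviation  # e.g. "gh"
        lookup[_key(abbreviation) + "s"] = abbreviation  # e.g. "ghs"
        lookup[_key(long_name)] = abbreviation  # e.g. "glycosidehydrolases"
    return lookup


_LOOKUP = _build_lookup()


def normalise_class(text):
    """Map any accepted spelling of a CAZy class to its abbreviation."""
    try:
        return _LOOKUP[_key(text)]
    except KeyError:
        raise ValueError(f"not a recognised CAZy class: {text!r}") from None
```

If the defaults do not cover a spelling you need, the --cazy_synonyms option described above is the supported way to supply additional synonyms.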

Retrieve data from UniProt

UniProtKB is one of the largest protein databases, incorporating data from the PDB structure database and other protein annotation databases.

cazy_webscraper can retrieve protein data from UniProt for proteins catalogued in a local CAZyme database created using cazy_webscraper. Specifically, for each protein, cazy_webscraper can retrieve:

The UniProt accession
PDB accessions of associated structure files from the PDB database
EC number annotations
Protein sequence from UniProt

cazy_webscraper always retrieves the UniProt accession, but the retrieval of PDB accessions, EC numbers and protein sequences is optional.

Data can be retrieved for all proteins in the local CAZyme database, or for a specific subset. CAZy class, CAZy family, genus, species, strain, kingdom and EC number filters can be combined to define the set of proteins for which data is retrieved from UniProt.

To retrieve all UniProt data for all proteins in a local CAZyme database, use the following command:

cw_get_uniprot_data <path_to_local_CAZyme_db> --ec --pdb --sequence

Configuring UniProt data retrieval

Below are listed the command-line flags for configuring the retrieval of UniProt data.

database - [REQUIRED] Path to a local CAZyme database to add UniProt data to.

--bioservices_batch_size - Change the query batch size submitted via bioservices to UniProt to retrieve protein data. Default: 150. bioservices recommends queries no larger than 200 objects.

--cache_dir - Path to cache dir to be used instead of default cache dir path.

--cazy_synonyms - Path to a JSON file containing accepted CAZy class synonyms, if the defaults are not sufficient.

--classes - List of classes to retrieve UniProt data for.

--config, -c - Path to a configuration YAML file. Default: None.

--ec, -e - Enable retrieval of EC number annotations from UniProt

--ec_filter - Limit retrieval of protein data to proteins annotated with a provided list of EC numbers. Separate the EC numbers with single commas, without spaces. It is recommended to wrap the entire string in quotation marks, for example:

cw_get_uniprot_data my_cazyme_db/cazyme_db.db --ec_filter 'EC1.2.3.4,EC2.3.1.-'

--families - List of CAZy (sub)families to scrape.

--genera - List of genera to restrict the scrape to. Default: None, filter not applied to scrape.

--log, -l - Target path to write out a log file. If not called, no log file is written. Default: None (no log file is written out).

--nodelete_cache - When called, content in the existing cache dir will not be deleted. Default: False (existing content is deleted).

--nodelete_log - When called, content in the existing log dir will not be deleted. Default: False (existing content is deleted).

--retries, -r - Define the number of times to retry making a connection to CAZy if the connection should fail. Default: 10.

--sequence, -s - Retrieve protein amino acid sequences from UniProt

--sql_echo - Set SQLite engine echo parameter to True, causing SQLite to print log messages. Default: False.

--species - List of species (written as Genus Species) to restrict the scraping of CAZymes to. CAZymes will be retrieved for all strains of each given species.

--strains - List of specific species strains to restrict the scraping of CAZymes to.

--timeout, -t - Connection timeout limit (seconds). Default: 45.

--uniprot_batch_size - Size of an individual batch query submitted to the UniProt REST API to retrieve the UniProt accessions of proteins identified by their GenBank accessions. Default: 150. The UniProt API documentation recommends batch sizes of less than 20,000, but batch sizes of 1,000 often result in HTTP 400 errors. It is recommended to keep batch sizes below 1,000, and ideally below 200.

--seq_update - If a newer version of the protein sequence is available, overwrite the existing sequence for the protein in the database. Default: False (the existing protein sequence is not overwritten).

--verbose, -v - Enable verbose logging. This does not set the SQLite engine echo parameter to True. Default: False.
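To make the --ec_filter format above concrete, the sketch below splits a filter string such as `'EC1.2.3.4,EC2.3.1.-'` into individual EC numbers and checks a protein's EC number against one of them. It is an illustration only: `parse_ec_filter` and `ec_matches` are hypothetical names, and treating `-` as "any value at this level" follows the usual EC number convention and is an assumption about how cazy_webscraper interprets it.

```python
def parse_ec_filter(text):
    """Split a comma-separated EC filter string and drop the 'EC' prefix."""
    numbers = []
    for item in text.split(","):
        item = item.strip()
        if item.upper().startswith("EC"):
            item = item[2:]
        if item:
            numbers.append(item)
    return numbers


def ec_matches(ec_number, pattern):
    """Return True if ec_number fits pattern; '-' in the pattern matches any level."""
    ec_parts = ec_number.split(".")
    pattern_parts = pattern.split(".")
    if len(ec_parts) != len(pattern_parts):
        return False
    return all(p == "-" or p == e for e, p in zip(ec_parts, pattern_parts))


patterns = parse_ec_filter("EC1.2.3.4,EC2.3.1.-")
```

With this reading, `patterns` is `['1.2.3.4', '2.3.1.-']`, and an EC number such as `2.3.1.5` matches the second pattern but not the first.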

Retrieving protein sequences from GenBank

Protein amino acid sequences can be retrieved for proteins in a local CAZyme database using cazy_webscraper. Protein sequences can be retrieved for a specific subset of proteins, identified through the use of CAZy class, CAZy family, taxonomy (kingdom, genus, species and strain) filters, and EC number filters. The retrieved protein sequences are written to the local CAZyme database.

Extracting protein sequences from the local CAZyme database and writing them to a BLAST database and/or FASTA file(s) is covered in the next section.

To retrieve GenBank protein sequences for all proteins in a local CAZyme database, use the following command:

cw_get_genbank_seqs <path_to_local_CAZyme_db> <user_email>

cazy_webscraper produces two cache files, which are written to the cache dir:

no_seq_retrieved.txt, which lists the GenBank accessions for which no sequence could be retrieved from GenBank
seq_retrieved.txt, which lists the GenBank accessions for which a sequence was retrieved from GenBank
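These two files make it easy to check how complete a retrieval was. The sketch below summarises them with the standard library. It assumes each file holds one GenBank accession per line, which is not stated above, and `read_accessions` and `summarise_cache` are hypothetical helper names.

```python
from pathlib import Path


def read_accessions(path):
    """Return the unique, non-empty lines of an accession list, in file order."""
    path = Path(path)
    if not path.exists():
        return []
    seen = {}
    for line in path.read_text().splitlines():
        accession = line.strip()
        if accession:
            seen.setdefault(accession, None)  # dict keys keep first-seen order
    return list(seen)


def summarise_cache(cache_dir):
    """Count retrieved and missing sequences recorded in a cache directory."""
    cache_dir = Path(cache_dir)
    retrieved = read_accessions(cache_dir / "seq_retrieved.txt")
    missing = read_accessions(cache_dir / "no_seq_retrieved.txt")
    return {
        "retrieved": len(retrieved),
        "missing": len(missing),
        "missing_accessions": missing,
    }
```

The list under `missing_accessions` is a natural starting point for a second retrieval attempt at a quieter time.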

Configuring GenBank protein sequence retrieval

Below are listed the command-line flags for configuring the retrieval of protein sequences from GenBank.

database - [REQUIRED] Path to a local CAZyme database to add protein sequences to.

email - [REQUIRED] User email address, required by NCBI Entrez.

--cache_dir - Path to cache dir to be used instead of default cache dir path.

--cazy_synonyms - Path to a JSON file containing accepted CAZy class synonyms, if the defaults are not sufficient.

--config, -c - Path to a configuration YAML file. Default: None.

--classes - List of classes from which all families are to be scraped.

cw_get_uniprot_data my_cazyme_db/cazyme_db.db --ec_filter 'EC1.2.3.4,EC2.3.1.-'

--entrez_batch_size - Change the query batch size submitted via Entrez to retrieve protein sequences from GenBank. Default: 150. Entrez recommends queries not larger than XXX objects in length.

--families - List of CAZy (sub)families to scrape.

--kingdoms - List of taxonomy kingdoms to retrieve protein sequences for.