Convert chemical molecule data CSV files to structured data formats
Molstruct is a lightweight Python CLI tool that converts CSV (comma-separated values) files with chemical molecule data to structured data formats: JSON-LD, RDFa and Microdata. Molstruct has many customization options that you can, but don't have to, use. Python 3.2 and above is supported, and no dependencies are required. Sounds good so far? What would you say to a really tiny Molstruct Docker container? Just try Molstruct!
What are structured data
Structured data are additional data placed on websites. They are not visible to ordinary internet users but can be easily processed by machines. There are three formats that can be used to store structured data: JSON-LD, RDFa and Microdata. Molstruct supports all of them and uses the MolecularEntity type.
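As a rough illustration of the JSON-LD format, the Python sketch below builds a record for one molecule and wraps it in a script element. The https://schema.org context and the exact layout are assumptions made for this example; the output produced by Molstruct may differ.

```python
import json

# Hypothetical record for water. The property names are taken from the
# list of supported MolecularEntity properties in the Usage section.
molecule = {
    "@context": "https://schema.org",  # assumed context, for illustration only
    "@type": "MolecularEntity",
    "name": "Water",
    "molecularFormula": "H2O",
    "smiles": "O",
    "inChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
}

jsonld = json.dumps(molecule, indent=2)

# On a web page, JSON-LD is usually embedded in a script element.
html_snippet = '<script type="application/ld+json">\n' + jsonld + "\n</script>"
print(html_snippet)
```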
Where to find a CSV file with molecule data
There are many possibilities. The easiest way is to download a CSV file from one of the chemical databases, e.g. DrugBank. You can also create the CSV file yourself.
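If you create the file yourself, a short script is enough. The sketch below writes two well-known molecules using some of the default column names; the file name data.csv and the choice of columns are assumptions for this example, since the text does not say which columns are mandatory.

```python
import csv

# Two well-known molecules, described with default Molstruct column names.
rows = [
    {"name": "Water", "molecularFormula": "H2O", "smiles": "O",
     "inChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
    {"name": "Ethanol", "molecularFormula": "C2H6O", "smiles": "CCO",
     "inChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
]


def write_molecules(path, rows):
    """Write molecule dictionaries to a CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


write_molecules("data.csv", rows)  # "data.csv" is a placeholder file name
```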
Installation
You can install Molstruct from PyPI:
pip install molstruct
Python 3.2 and above is supported. No additional dependencies are required. To use Molstruct, just type the molstruct command in the terminal.
Docker image
If you have Docker installed, you can use the tiny Molstruct image from Docker Hub.
Because the tool runs inside the container, you have to mount the local directory that contains your input file. The default working directory of the image is /app. You need to mount your local directory inside it (e.g. /app/input):
docker run -it --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest
In this case, the local directory /home/user/input has been mounted under /app/input.
You can also simply mount the current working directory using the $(pwd) command substitution:
docker run -it --rm --name molstruct-app --mount type=bind,source="$(pwd)",target=/app/input,readonly lszeremeta/molstruct:latest
Other options
You may want to run Molstruct from sources or build a Docker image yourself. In most cases, one of the methods mentioned in the sections above should be sufficient and convenient for you.
Run Molstruct from sources
- Clone this repository:
git clone https://github.com/lszeremeta/molstruct.git
If you don't want to or can't use git, you can download the zip archive and extract it.
- Go to the project directory and run Molstruct:
cd molstruct
python -m molstruct
Local Docker build
You need Docker installed.
- Clone this repository:
git clone https://github.com/lszeremeta/molstruct.git
If you don't want to or can't use git, you can download the zip archive and extract it.
- Go to the project directory and build the Docker image:
cd molstruct
docker build -t molstruct .
- Run the Docker container:
docker run -it --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly molstruct
In this case, your local directory /home/user/input has been mounted under /app/input.
Usage
usage: molstruct [-h] [--version] -f {jsonld_html,jsonld,rdfa,microdata} [-i IDENTIFIER]
[-n NAME] [-ink INCHIKEY] [-in INCHI] [-s SMILES] [-u URL]
[-iu IUPACNAME] [-mf MOLECULARFORMULA] [-w MOLECULARWEIGHT]
[-mw MONOISOTOPICMOLECULARWEIGHT] [-d DESCRIPTION]
[-dd DISAMBIGUATINGDESCRIPTION] [-img IMAGE] [-at ADDITIONALTYPE]
[-an ALTERNATENAME] [-sa SAMEAS] [-c] [-b BASEURI] [-l LIMIT]
file
Supported MolecularEntity properties, which correspond to the default CSV column names: identifier, name, inChIKey, inChI, smiles, url, iupacName, molecularFormula, molecularWeight, monoisotopicMolecularWeight, description, disambiguatingDescription, image, additionalType, alternateName and sameAs. You can rename the columns if needed (see Column name change arguments below).
Informative arguments
-h, --help: show help message and exit
--version: show program version and exit
Required arguments
-f {jsonld_html,jsonld,rdfa,microdata}, --format {jsonld_html,jsonld,rdfa,microdata}: output format
file: CSV file path with molecule data to convert
Remember to use the appropriate file path when using the Docker image. Suppose you mounted your local directory /home/user/input under /app/input and the path to the CSV file you want to use in Molstruct is /home/user/input/file.csv. In this case, enter the path /app/input/file.csv or input/file.csv as the file argument value.
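The path translation described above can be sketched in Python. The helper name to_container_path is hypothetical and not part of Molstruct; it only shows how a host path maps to the path seen inside the container for a given bind mount.

```python
from pathlib import PurePosixPath


def to_container_path(host_file, host_dir, container_dir="/app/input"):
    """Translate a host file path into the matching path inside the container.

    Raises ValueError if host_file is not located under host_dir.
    """
    relative = PurePosixPath(host_file).relative_to(PurePosixPath(host_dir))
    return str(PurePosixPath(container_dir) / relative)


print(to_container_path("/home/user/input/file.csv", "/home/user/input"))
# prints /app/input/file.csv
```

Because the working directory of the image is /app, the relative form input/file.csv points to the same file.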
Column name change arguments
Arguments for changing the default column names
-i IDENTIFIER, --identifier IDENTIFIER: identifier column name (identifier by default), Text
-n NAME, --name NAME: name column name (name by default), Text
-ink INCHIKEY, --inChIKey INCHIKEY: inChIKey column name (inChIKey by default), Text
-in INCHI, --inChI INCHI: inChI column name (inChI by default), Text
-s SMILES, --smiles SMILES: smiles column name (smiles by default), Text
-u URL, --url URL: url column name (url by default), URL type
-iu IUPACNAME, --iupacName IUPACNAME: iupacName column name (iupacName by default), Text
-mf MOLECULARFORMULA, --molecularFormula MOLECULARFORMULA: molecularFormula column name (molecularFormula by default), Text
-w MOLECULARWEIGHT, --molecularWeight MOLECULARWEIGHT: molecularWeight column name (molecularWeight by default), Mass e.g. 0.01 mg
-mw MONOISOTOPICMOLECULARWEIGHT, --monoisotopicMolecularWeight MONOISOTOPICMOLECULARWEIGHT: monoisotopicMolecularWeight column name (monoisotopicMolecularWeight by default), Mass e.g. 0.01 mg
-d DESCRIPTION, --description DESCRIPTION: description column name (description by default), Text
-dd DISAMBIGUATINGDESCRIPTION, --disambiguatingDescription DISAMBIGUATINGDESCRIPTION: disambiguatingDescription column name (disambiguatingDescription by default), Text
-img IMAGE, --image IMAGE: image column name (image by default), URL
-at ADDITIONALTYPE, --additionalType ADDITIONALTYPE: additionalType column name (additionalType by default), URL
-an ALTERNATENAME, --alternateName ALTERNATENAME: alternateName column name (alternateName by default), Text
-sa SAMEAS, --sameAs SAMEAS: sameAs column name (sameAs by default), URL
Additional settings arguments
-c, --columns: use only columns with renamed names
-b BASEURI, --baseURI BASEURI: base URI of molecule (http://example.com/molecule/ by default)
-l LIMIT, --limit LIMIT: maximum number of results
Available options may vary depending on the version. To display all available options with their descriptions, use molstruct -h.
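To make the interplay of column renaming, --columns and --limit more concrete, here is a minimal Python sketch. It is an interpretation of the option descriptions above, not Molstruct's implementation; the function select_properties and its parameters are hypothetical.

```python
import csv
import io
from itertools import islice

# Default column names, as listed in the Usage section.
PROPERTIES = (
    "identifier", "name", "inChIKey", "inChI", "smiles", "url", "iupacName",
    "molecularFormula", "molecularWeight", "monoisotopicMolecularWeight",
    "description", "disambiguatingDescription", "image", "additionalType",
    "alternateName", "sameAs",
)


def select_properties(csv_text, renamed=None, columns_only=False, limit=None):
    """Map CSV rows to property dictionaries.

    renamed: maps a property to the CSV column holding it,
             e.g. {"name": "Common name"} (like --name "Common name").
    columns_only: consider only the renamed properties (like --columns).
    limit: maximum number of rows to process (like --limit).
    """
    renamed = renamed or {}
    wanted = list(renamed) if columns_only else list(PROPERTIES)
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in islice(reader, limit):
        record = {}
        for prop in wanted:
            column = renamed.get(prop, prop)  # fall back to the default name
            if row.get(column):
                record[prop] = row[column]
        records.append(record)
    return records


sample = "CAS,Common name,smiles\n7732-18-5,Water,O\n64-17-5,Ethanol,CCO\n"
renames = {"identifier": "CAS", "name": "Common name"}
print(select_properties(sample, renames, columns_only=True, limit=1))
```

With columns_only=True the smiles column is ignored because it was not renamed; without it, the column is picked up under its default name.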
Examples
molstruct -f rdfa data.csv
Returns simple HTML with added RDFa. Assumes that the column names in the CSV file are the default ones.
molstruct -f microdata -mf "formula" data.csv
Returns simple HTML with added Microdata. Assumes that the column names in the CSV file are the default ones, except that the default molecularFormula column name is replaced by formula.
molstruct -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv"
Returns simple HTML with added Microdata. Only the columns that were renamed on the command line are taken into account, and the output is limited to 50 molecules.
molstruct -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv" > output.html
Does the same as the example above but saves the results to output.html.
docker run -it --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "input/drugbank vocabulary.csv" > output.html
Does the same as the example above, but runs from the pre-built Docker image. Returns simple HTML with added Microdata and redirects the output to the output.html file.
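The same kind of invocation can also be scripted. The sketch below assembles an argument list from the options documented above and, optionally, runs it with subprocess. The helpers build_command and convert are hypothetical names introduced for this example, and subprocess.run needs Python 3.5 or newer.

```python
import subprocess


def build_command(fmt, csv_path, renames=None, columns_only=False, limit=None):
    """Build an argument list for the molstruct command."""
    command = ["molstruct", "-f", fmt]
    if columns_only:
        command.append("--columns")
    for option, column in (renames or {}).items():
        # Long option names match the property names, e.g. --name.
        command += ["--" + option, column]
    if limit is not None:
        command += ["--limit", str(limit)]
    command.append(csv_path)
    return command


def convert(command, output_path):
    """Run the command and save its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as output:
        subprocess.run(command, stdout=output, check=True)


command = build_command(
    "microdata",
    "drugbank vocabulary.csv",
    renames={"identifier": "CAS", "name": "Common name",
             "inChIKey": "Standard InChI Key"},
    columns_only=True,
    limit=50,
)
print(command)
# convert(command, "output.html")  # needs molstruct on PATH and the CSV file
```

Passing the arguments as a list avoids shell quoting problems with file and column names that contain spaces.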
Contribution
Would you like to improve this project? Great! We are waiting for your help and suggestions. If you are new to open source contributions, read How to Contribute to Open Source.
License
Distributed under the MIT license.
See also
These projects can also be useful:
Hashes for molstruct-2.0.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 62f25aa1af47859a5f58487ea3d3663ab1b4b790d479e23912388e777c9b601d
MD5 | a3462d4116bd6f3e78584234340eb719
BLAKE2b-256 | 349b28b848c1bc1af07b9113152c9c4247c66b794f8c990fb242a547f69bdfad