ScienceBeam Parser, parse scientific documents.

ScienceBeam Parser

ScienceBeam Parser allows you to parse scientific documents. It started as a partial Python variation of GROBID and allows you to re-use some of its models. However, it may deviate more in the future.

Pre-requisites

Docker containers are provided that can be used on multiple operating systems. They can also serve as an example setup for Linux / Ubuntu based systems.

Otherwise, the following paragraphs list some of the pre-requisites when not using Docker:

Out of the box, only Linux is currently supported, due to the binaries used (pdfalto, wapiti). Other platforms may also work without Docker, provided matching binaries are configured.

For Computer Vision, PyTorch is required.

For OCR, tesseract needs to be installed. On Ubuntu the following command can be used:

apt-get install libtesseract4 tesseract-ocr-eng libtesseract-dev libleptonica-dev

The Word* to PDF conversion requires LibreOffice.

Development

Create Virtual Environment and install Dependencies

make dev-venv

Configuration

There is no implicit "grobid-home" directory. The only configuration file is the default config.yml.

Paths may point to local or remote files. Remote files are downloaded and cached locally (URLs are assumed to be versioned).

You may override config values using environment variables. Environment variables should start with SCIENCEBEAM_PARSER__. After that __ is used as a section separator. For example SCIENCEBEAM_PARSER__LOGGING__HANDLERS__LOG_FILE__LEVEL would override logging.handlers.log_file.level.
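The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention as described; the helper name `to_env_var_name` is hypothetical and not part of ScienceBeam Parser.

```python
# Prefix and separator as described in the paragraph above.
ENV_VAR_PREFIX = "SCIENCEBEAM_PARSER"
SECTION_SEPARATOR = "__"


def to_env_var_name(config_key: str) -> str:
    """Map a dotted config key to the matching environment variable name.

    Hypothetical helper for illustration only.
    """
    sections = [section.upper() for section in config_key.split(".")]
    return SECTION_SEPARATOR.join([ENV_VAR_PREFIX] + sections)


print(to_env_var_name("logging.handlers.log_file.level"))
# prints: SCIENCEBEAM_PARSER__LOGGING__HANDLERS__LOG_FILE__LEVEL
```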

Generally, resources and models are loaded on demand. Setting the preload_on_startup configuration option (SCIENCEBEAM_PARSER__PRELOAD_ON_STARTUP environment variable) to true loads the models "eagerly" at startup instead.

Run tests (linting, pytest, etc.)

make dev-test

Start the server

make dev-start

Run the server in debug mode (including auto-reload and debug logging):

make dev-debug

Run the server with auto reload but no debug logging:

make dev-start-no-debug-logging-auto-reload

Submit a sample document to the server

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/pdfalto"

Submit a sample document to the header model

The following output formats are supported:

output_format	description
raw_data	generated data (without using the model)
data	generated data with predicted labels
xml	using simple xml elements for predicted labels
json	json of prediction

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/models/header?first_page=1&last_page=1&output_format=xml"
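For scripted use, the request URL can be assembled from the parameters shown in the curl example. The following is a minimal sketch; the helper `build_model_url` is hypothetical, and whether models other than header accept every listed output format is an assumption.

```python
from urllib.parse import urlencode

# Output formats listed for the header model above.
SUPPORTED_OUTPUT_FORMATS = {"raw_data", "data", "xml", "json"}


def build_model_url(base_url, model_name, output_format="xml",
                    first_page=1, last_page=1):
    """Build the URL for a model endpoint (hypothetical helper)."""
    if output_format not in SUPPORTED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format}")
    query = urlencode({
        "first_page": first_page,
        "last_page": last_page,
        "output_format": output_format,
    })
    return f"{base_url}/api/models/{model_name}?{query}"


print(build_model_url("http://localhost:8080", "header"))
# prints: http://localhost:8080/api/models/header?first_page=1&last_page=1&output_format=xml
```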

Submit a sample document to the name-header api

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/models/name-header?first_page=1&last_page=1&output_format=xml"

GROBID compatible APIs

The following APIs aim to be compatible with selected endpoints of GROBID's REST API, for common use-cases.

Submit a sample document to the header document api

The /processHeaderDocument endpoint is similar to /processFulltextDocument, but its response will only contain front matter. It still uses the same segmentation model, but it does not need to run a number of the other models.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processHeaderDocument?first_page=1&last_page=1"

The default response will be TEI XML (application/tei+xml). The Accept HTTP request header may be used to request JATS, with the mime type application/vnd.jats+xml.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processHeaderDocument?first_page=1&last_page=1"

Regardless, the returned content type will be application/xml.
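The same request can be prepared from Python using only the standard library. This is a sketch under stated assumptions rather than an official client: the helper `build_upload_request` is hypothetical, the form field name `file` is taken from the curl examples, and the part's `application/pdf` content type is an assumption. The request is only built here, not sent.

```python
import uuid
from urllib.request import Request

TEI_XML = "application/tei+xml"
JATS_XML = "application/vnd.jats+xml"


def build_upload_request(url, filename, content, accept=TEI_XML):
    """Build (but do not send) a multipart/form-data POST request.

    Hypothetical helper; the "file" field name mirrors the curl examples.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (
            'Content-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\n'
        ).encode(),
        # Assumption: the uploaded document is a PDF.
        b"Content-Type: application/pdf\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Accept": accept,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return Request(url, data=body, headers=headers, method="POST")


request = build_upload_request(
    "http://localhost:8080/api/processHeaderDocument?first_page=1&last_page=1",
    "minimal-example.pdf",
    b"%PDF-1.4 placeholder bytes",  # stand-in for the real file content
    accept=JATS_XML,
)
print(request.get_header("Accept"))
# prints: application/vnd.jats+xml
```

With a running server, passing the request to `urllib.request.urlopen` would send it; the response body would then be XML, as noted above.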

(BibTeX output is currently not supported)

Submit a sample document to the full text document api

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextDocument?first_page=1&last_page=1"

The default response will be TEI XML (application/tei+xml). The Accept HTTP request header may be used to request JATS, with the mime type application/vnd.jats+xml.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextDocument?first_page=1&last_page=1"

Regardless, the returned content type will be application/xml.

Submit a sample document to the references api

The /processReferences endpoint is similar to /processFulltextDocument, but its response will only contain references. It still uses the same segmentation model, but it does not need to run a number of the other models.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processReferences?first_page=1&last_page=100"

The default response will be TEI XML (application/tei+xml). The Accept HTTP request header may be used to request JATS, with the mime type application/vnd.jats+xml.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processReferences?first_page=1&last_page=100"

Regardless, the returned content type will be application/xml.

Submit a sample document to the full text asset document api

The /processFulltextAssetDocument endpoint is like /processFulltextDocument, but instead of returning the TEI XML directly, it returns a ZIP containing the TEI XML document, along with other assets such as figure images.

curl --fail --show-error \
    --output "example-tei-xml-and-assets.zip" \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextAssetDocument?first_page=1&last_page=1"

The default response will be a ZIP containing TEI XML (application/tei+xml+zip). The Accept HTTP request header may be used to request a ZIP containing JATS, with the mime type application/vnd.jats+xml+zip.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml+zip' \
    --output "example-jats-xml-and-assets.zip" \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextAssetDocument?first_page=1&last_page=1"

Regardless, the returned content type will be application/zip.
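Once the ZIP has been saved, its contents can be inspected with Python's zipfile module. The sketch below uses an in-memory stand-in archive so that it runs without a server; the entry names in it are made up for illustration and are not the names the API actually uses.

```python
import io
import zipfile


def list_zip_entries(zip_bytes: bytes) -> list:
    """Return the sorted entry names of a ZIP given as bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(archive.namelist())


# Stand-in archive with hypothetical entry names (not actual API output).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.tei.xml", "<TEI/>")
    archive.writestr("figure-1.png", b"")

sample_entries = list_zip_entries(buffer.getvalue())
print(sample_entries)
# prints: ['example.tei.xml', 'figure-1.png']
```

For a real response, the bytes of the file written by curl (e.g. example-tei-xml-and-assets.zip) would be passed in instead.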

Submit a sample document to the `/convert` api

The /convert API aims to be a single endpoint for the conversion of PDF documents to a semantic representation. By default it will return JATS XML.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/convert?first_page=1&last_page=1"

The following sections describe parameters that influence the response:

Using the `Accept` HTTP header parameter

The Accept HTTP header may be used to request a different response type, e.g. application/tei+xml for TEI XML.

curl --fail --show-error \
    --header 'Accept: application/tei+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/convert?first_page=1&last_page=1"

Regardless, the returned content type will be application/xml.

The /convert endpoint can also be used for a Word* to PDF conversion by specifying application/pdf as the desired response:

curl --fail --show-error --silent \
    --header 'Accept: application/pdf' \
    --form "file=@test-data/minimal-office-open.docx;filename=test-data/minimal-office-open.docx" \
    --output "example.pdf" \
    "http://localhost:8080/api/convert?first_page=1&last_page=1"

Using the `includes` request parameter

The includes request parameter may be used to specify the requested fields, in order to reduce the processing time, e.g. title,abstract to request the title and the abstract only. In that case fewer models will be used. The output may still contain more fields than requested.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/convert?includes=title,abstract"

The currently supported fields are:

title
abstract
authors
affiliations
references

Passing in any other values (or no values) will behave as if no includes parameter was passed in.
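A client may want to build the includes parameter from a list of fields. The following is a minimal sketch; the helper `build_convert_url` is hypothetical, and dropping unsupported fields on the client side is a design choice of this sketch, not documented server behaviour.

```python
from urllib.parse import urlencode

# Fields listed as currently supported above.
SUPPORTED_INCLUDES = ("title", "abstract", "authors", "affiliations", "references")


def build_convert_url(base_url, includes=()):
    """Build a /convert URL with an optional includes parameter.

    Hypothetical helper: unsupported fields are dropped client-side.
    """
    requested = [field for field in includes if field in SUPPORTED_INCLUDES]
    if not requested:
        # Without supported fields, omit the parameter entirely.
        return f"{base_url}/api/convert"
    # Keep commas unescaped to match the curl example.
    query = urlencode({"includes": ",".join(requested)}, safe=",")
    return f"{base_url}/api/convert?{query}"


print(build_convert_url("http://localhost:8080", ["title", "abstract"]))
# prints: http://localhost:8080/api/convert?includes=title,abstract
```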

Word* support

All of the above APIs will also accept a Word* document instead of a PDF.

Formats that are supported:

.docx (media type: application/vnd.openxmlformats-officedocument.wordprocessingml.document)
.dotx (media type: application/vnd.openxmlformats-officedocument.wordprocessingml.template)
.doc (media type: application/msword)
.rtf (media type: application/rtf)

The support is currently implemented by converting the document to PDF using LibreOffice.

Where no content type is provided, the content type is inferred from the file extension.

For example:

curl --fail --show-error \
    --form "file=@test-data/minimal-office-open.docx;filename=test-data/minimal-office-open.docx" \
    --silent "http://localhost:8080/api/convert?first_page=1&last_page=1"
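The inference of the content type from the file extension can be pictured as a simple lookup. This sketch only mirrors the formats listed above, plus PDF; the helper `guess_media_type` is hypothetical and does not show how the server implements it.

```python
from pathlib import Path

# Extensions and media types from the list of supported formats, plus PDF.
MEDIA_TYPE_BY_EXTENSION = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".dotx": "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
    ".doc": "application/msword",
    ".rtf": "application/rtf",
    ".pdf": "application/pdf",
}


def guess_media_type(filename):
    """Return the media type for a filename, or None if unknown.

    Hypothetical helper for illustration only.
    """
    return MEDIA_TYPE_BY_EXTENSION.get(Path(filename).suffix.lower())


print(guess_media_type("test-data/minimal-office-open.docx"))
# prints: application/vnd.openxmlformats-officedocument.wordprocessingml.document
```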

Docker Usage

docker pull elifesciences/sciencebeam-parser

docker run --rm \
    -p 8070:8070 \
    elifesciences/sciencebeam-parser

Note: Docker images with the tag suffix -cv include the dependencies required for the CV (Computer Vision) models (disabled by default).

docker run --rm \
    -p 8070:8070 \
    --env SCIENCEBEAM_PARSER__PROCESSORS__FULLTEXT__USE_CV_MODEL=true \
    --env SCIENCEBEAM_PARSER__PROCESSORS__FULLTEXT__USE_OCR_MODEL=true \
    elifesciences/sciencebeam-parser:latest-cv

Non-release builds are available with the _unstable image suffix, e.g. elifesciences/sciencebeam-parser_unstable.

Release history

Version	Release date
0.1.8	Mar 3, 2022
0.1.7	Feb 10, 2022
0.1.6	Jan 14, 2022
0.1.5	Nov 23, 2021
0.1.4	Nov 19, 2021
0.1.3	Nov 17, 2021
0.1.2	Nov 16, 2021
0.1.1 (this version)	Nov 15, 2021

Download files

Source Distribution

sciencebeam_parser-0.1.1.tar.gz (82.8 kB), uploaded Nov 15, 2021

Hashes for sciencebeam_parser-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`0d7e369f569c23e9a3d4dc8dc723c6b4f5fdb77727c628fa0afa6216d887513b`
MD5	`8acc3cd729515b13bf6507d8110a06ae`
BLAKE2b-256	`abe4ad088db358400dfe790091b642f80bd4ef1b6bb58ec4286d3db3c32edb2b`
