Extract the main article content (and optionally comments) from a web page

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

ExtractNet

Based on the popular content extraction package Dragnet, ExtractNet extends the machine learning approach to extract other attributes, such as date, author and keywords, from news articles.

ExtractNet pipeline

Example code:

Simply use the following command to install the latest released version:

pip install extractnet

To extract the content and other metadata, pass the fetched HTML to the extract function:

import requests
from extractnet import Extractor

# Download the page and pass its raw HTML to the extractor.
raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = Extractor().extract(raw_html)

Why not just use an existing rule-based extraction method?

We found that some web pages do not provide the real author name and simply populate the author tag with a default value.

For example, ltn.com.tw and udn.com always populate the same author value for every news article, while the real author can only be found within the content.

Our machine-learning-first approach extracts the correct fields the way a human reading the website would.

ExtractNet uses a machine learning approach to extract the relevant data from the visible sections of the web page, just as a human would.
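The failure mode described above can be reproduced with a small rule-based baseline. This is a minimal sketch using only the Python standard library; the sample page, the `MetaAuthorParser` class and the `meta_author` helper are made up for illustration and are not part of ExtractNet.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Rule-based baseline: read the author from <meta name="author">."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "author":
            self.author = attributes.get("content")


def meta_author(html):
    """Return the content of the author meta tag, or None if it is absent."""
    parser = MetaAuthorParser()
    parser.feed(html)
    return parser.author


# Made-up page: the meta tag carries a site-wide default,
# while the real byline only appears in the visible text.
SAMPLE_HTML = """
<html>
  <head><meta name="author" content="Example News Desk"></head>
  <body>
    <h1>Some headline</h1>
    <p>By Jane Doe, staff reporter</p>
    <p>Article text ...</p>
  </body>
</html>
"""

print(meta_author(SAMPLE_HTML))  # prints "Example News Desk", not the byline
```

The rule returns the site default because it never looks at the visible byline, which is the part of the page a learned extractor is meant to read.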

What ExtractNet is and isn't

ExtractNet is a platform for extracting any interesting attribute from any web page, not only from content-based articles.
The core of ExtractNet aims to convert unstructured web pages into structured data without relying on hand-crafted rules.
ExtractNet does not support boilerplate content extraction.

Performance

Results of the body extraction evaluation:

We use the body extraction benchmark from article-extraction-benchmark.

Model	Precision	Recall	F1	Accuracy
AutoExtract	0.984 ± 0.003	0.956 ± 0.010	0.970 ± 0.005	0.470 ± 0.037
Diffbot	0.958 ± 0.009	0.944 ± 0.013	0.951 ± 0.010	0.348 ± 0.035
boilerpipe	0.850 ± 0.016	0.870 ± 0.020	0.860 ± 0.016	0.006 ± 0.006
dragnet	0.925 ± 0.012	0.889 ± 0.018	0.907 ± 0.014	0.221 ± 0.030
ExtractNet	0.922 ± 0.011	0.933 ± 0.013	0.927 ± 0.010	0.160 ± 0.027
html-text	0.500 ± 0.017	0.994 ± 0.001	0.665 ± 0.015	0.000 ± 0.000
newspaper	0.917 ± 0.013	0.906 ± 0.017	0.912 ± 0.014	0.260 ± 0.032
readability	0.913 ± 0.014	0.931 ± 0.015	0.922 ± 0.013	0.315 ± 0.034
trafilatura	0.930 ± 0.010	0.967 ± 0.009	0.948 ± 0.008	0.243 ± 0.031
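As a quick sanity check on the table, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes F1 is the harmonic mean of the reported precision and recall; the benchmark may aggregate per document, so small differences in the last digit are possible. The `f1_score` helper is illustrative and not part of ExtractNet.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "AutoExtract": (0.984, 0.956, 0.970),
    "dragnet": (0.925, 0.889, 0.907),
    "ExtractNet": (0.922, 0.933, 0.927),
    "trafilatura": (0.930, 0.967, 0.948),
}

for model, (p, r, reported) in rows.items():
    print(f"{model}: computed {f1_score(p, r):.3f}, reported {reported:.3f}")
```

For these four rows the recomputed value agrees with the reported one to three decimal places.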

Results of author name extraction:

Model	F1
fasttext embeddings + CRF	0.904 ± 0.10

List of changes from Dragnet

The underlying classifier is now CatBoost instead of a decision tree for all attribute extraction, for consistency and a performance boost.
Updated CSS features and added a text+CSS latent feature.
Includes a CRF model that extracts names from author block text.
Trained on 22000+ updated web pages collected in late 2020. The training data is 20 times the size of the Dragnet data.

GETTING STARTED

pip install extractnet

Code

import requests
from extractnet import Extractor

# Download the page and pass its raw HTML to the extractor.
raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = Extractor().extract(raw_html)
# Print every extracted field and its value.
for key, value in results.items():
    print(key)
    print(value)
    print('------------')
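For anything beyond a quick test, it may help to fail early on HTTP errors and to store the result instead of printing it. The following is a minimal sketch under stated assumptions: `fetch_and_extract` and `results_to_json` are hypothetical helper names, not part of ExtractNet, and the only ExtractNet call used is `Extractor().extract(raw_html)` from the example above. The result is assumed to be a dict, as the `results.items()` loop implies.

```python
import json


def fetch_and_extract(url, timeout=10):
    """Download a page and run ExtractNet on it (hypothetical helper)."""
    # Imported lazily so results_to_json works without these packages installed.
    import requests
    from extractnet import Extractor

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    return Extractor().extract(response.text)


def results_to_json(results):
    """Serialize a result dict; values JSON cannot encode fall back to str()."""
    return json.dumps(results, default=str, ensure_ascii=False, indent=2, sort_keys=True)
```

A call such as `print(results_to_json(fetch_and_extract(url)))` would then write one JSON document per page. The `default=str` fallback is a design choice for the sketch, since the value types returned by the extractor are not documented here.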

Contributing

We love contributions! Open an issue, or fork/create a pull request.

More details about the code structure

Coming soon

Reference

Content extraction using diverse feature sets

[1] Peters, Matthew E. and D. Lecocq, Content extraction using diverse feature sets

@inproceedings{Peters2013ContentEU,
  title={Content extraction using diverse feature sets},
  author={Matthew E. Peters and D. Lecocq},
  booktitle={WWW '13 Companion},
  year={2013}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

Release history

Version	Released
2.0.7	Nov 6, 2022
2.0.6	Oct 30, 2022
2.0.4	Apr 27, 2022
2.0.3	Apr 27, 2022
1.0.4 (this version)	Feb 9, 2021
1.0.3	Jan 1, 2021
1.0.2	Dec 17, 2020
1.0.0	Dec 10, 2020

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

File	Size	Uploaded	Python	Platform
extractnet-1.0.4-cp38-cp38-manylinux2010_x86_64.whl	14.3 MB	Feb 9, 2021	CPython 3.8	manylinux: glibc 2.12+ x86-64
extractnet-1.0.4-cp38-cp38-macosx_10_15_x86_64.whl	12.3 MB	Feb 9, 2021	CPython 3.8	macOS 10.15+ x86-64
extractnet-1.0.4-cp37-cp37m-manylinux2010_x86_64.whl	14.2 MB	Feb 9, 2021	CPython 3.7m	manylinux: glibc 2.12+ x86-64
extractnet-1.0.4-cp37-cp37m-macosx_10_15_x86_64.whl	12.3 MB	Feb 9, 2021	CPython 3.7m	macOS 10.15+ x86-64
extractnet-1.0.4-cp36-cp36m-manylinux2010_x86_64.whl	14.2 MB	Feb 9, 2021	CPython 3.6m	manylinux: glibc 2.12+ x86-64
extractnet-1.0.4-cp36-cp36m-macosx_10_15_x86_64.whl	12.3 MB	Feb 9, 2021	CPython 3.6m	macOS 10.15+ x86-64
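The wheel filenames above follow the standard wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag), so the target interpreter and platform can be read from the name. pip normally picks the matching file on its own; the sketch below only illustrates how the parts map, and `parse_wheel_filename` is an illustrative helper, not part of ExtractNet or pip.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its naming-convention components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("extractnet-1.0.4-cp38-cp38-manylinux2010_x86_64.whl")
print(info["python"], info["abi"], info["platform"])
# prints: cp38 cp38 manylinux2010_x86_64
```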

Hashes for extractnet-1.0.4-cp38-cp38-manylinux2010_x86_64.whl
Algorithm	Hash digest
SHA256	`b5b180803b71f96ae1fc271481c195f55e56478f9515294e4697a719237d6bdf`
MD5	`df13cae27a9c32ae1fb837fed5acd057`
BLAKE2b-256	`771dc0a3b1c434241da316f9f9afc72fdf9bf9cadabd478b495b49edc071142b`

Hashes for extractnet-1.0.4-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm	Hash digest
SHA256	`cd5ec39e72d6caaab8f46ef022a3cdd065a3b06d96e9d9069e381b1f7ef85e42`
MD5	`084d1065c9a5a1c390d87d9d4d7e07f5`
BLAKE2b-256	`93e205ed2842cf1e828dce53574393216cfe8e55289161c59073e1f64f5f2639`

Hashes for extractnet-1.0.4-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm	Hash digest
SHA256	`43a077abe6795420e8a1d34c6c7fca179b0b8b309c545ccf78ba7d584277e704`
MD5	`7a5815d535e4fab67f3e4f852bfd500d`
BLAKE2b-256	`9056e1d63511d82e9910f32c03ed8b8090171203794991d2491f987df7e5cbf4`

Hashes for extractnet-1.0.4-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm	Hash digest
SHA256	`143dfbc7375592a622ed97d0d6a17f03a70fb1cbbea5517c1aee7f275ea97aaf`
MD5	`56a476775bbdfab1a9b53ea8c4dd454b`
BLAKE2b-256	`50423e775756eb6d9067f40981b9d92af03f587b62f4302a0aaf96d5b70273c6`

Hashes for extractnet-1.0.4-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm	Hash digest
SHA256	`603344c319dfb8bf735a9a6d67735fdad3b700ec87454d772c6156d7761a0cbd`
MD5	`c88a4e97934ea6032dadddaab1122e6c`
BLAKE2b-256	`53c861b02dfe82d8cb1037f3dac36d9689146a7511f6902cde0d4669560fc917`

Hashes for extractnet-1.0.4-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm	Hash digest
SHA256	`db8f32a28c901434c866d92e94386dfee1af06d58c2c809b330752c264dafbc9`
MD5	`7c17f1de5a4875961eb8df1f6650c94d`
BLAKE2b-256	`b27b5af5ea7c3645cb83936b8ddd70c9779f59d312c34c2dc51a90a6ad6bc008`
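A downloaded wheel can be checked against the SHA256 digests listed above before installing it. This is a minimal sketch using the standard library `hashlib` module; `sha256_of_file` and `matches_digest` are illustrative helper names, and the file path is assumed to point at a wheel you have already downloaded.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_digest("extractnet-1.0.4-cp38-cp38-manylinux2010_x86_64.whl", "b5b180803b71f96ae1fc271481c195f55e56478f9515294e4697a719237d6bdf")` should return `True` for an intact copy of that file.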