
eatiht
======

A python package for **e**xtracting **a**rticle **t**ext **i**n **ht**ml documents. Check out this [demo](http://web-tier-load-balancer-1502628209.us-west-2.elb.amazonaws.com/filter?url=http://www.nytimes.com/2014/12/18/world/asia/us-links-north-korea-to-sony-hacking.html).


At a Glance
-----------

#### To install:

```bash
pip install eatiht
# or
easy_install eatiht
```

Note: On Windows, you may need to install lxml manually with `pip install lxml`.

#### Using in Python
```python
import eatiht

url = 'http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html'

print(eatiht.extract(url))
```
##### Output
```
NASA's Curiosity rover is continuing to help scientists piece together the mystery of how Mars lost its
surface water over the course of billions of years. The rover drilled into a piece of Martian rock called
Cumberland and found some ancient water hidden within it. Researchers were then able to test a key ratio
in the water with Curiosity's onboard instruments...
```


#### Using as a command line tool:
```bash
eatiht http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html >> out.txt
```

Note: Windows users may have to add the C:\PythonXX\Scripts directory to their ["path"](http://www.computerhope.com/issues/ch000549.htm) so that the command line tool works from any directory, not only the ..\Scripts directory.

Requirements
------------
```
requests
lxml
```

Motivation
----------

After searching through the deepest crevices of the internet for some tool|library|module that could effectively extract the main content from a website (ignoring text from ads, sidebar links, etc.), I was slightly disheartened by the apparent ambiguity surrounding this content-extraction problem.

My survey turned up, among others, the following solutions:

* [boilerpipe](https://code.google.com/p/boilerpipe/) - *Boilerplate Removal and Fulltext Extraction from HTML pages*. Java library written by Christian Kohlschütter
* ["The Easy Way to Extract Useful Text from Arbitrary HTML"](http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/) - a Python tutorial on implementing a neural network for html content extraction. Written by alexjc
* [PyTeaser's cleaners module](https://github.com/xiaoxu193/PyTeaser/blob/master/goose/cleaners.py) - from what I can tell, a purely heuristic-based process
* ["Text Extraction from the Web via Text-to-Tag Ratio"](http://www.cse.nd.edu/~tweninge/pubs/WH_TIR08.pdf) - a thesis on Text-to-Tag-heuristic driven clustering as a solution for the problem at hand. Written by Tim Weninger & William H. Hsu

The number of research papers I found on the subject largely outweighs the number of available open-source projects. This is my attempt at balancing out the disparity.

In the process of coming up with a solution, I made two unoriginal observations:

1. XPath's select-all (`//`) and parent-node (`..`) queries, combined with functions like `string-length()`, are remarkably powerful when used together
2. Unnecessary machine learning is unnecessary

By making a (trivial) assumption about minimum sentence length, one can query for the text nodes satisfying that length, build a frequency distribution (histogram) across their parent nodes, and take the argmax of the resulting distribution: that xpath is the one shared amongst the likely sentences.
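
The sketch below illustrates that idea with lxml and requests. It is a rough, hedged illustration rather than eatiht's actual source; the 20-character cutoff and the example URL are arbitrary choices made for the example.

```python
# Illustrative sketch of the heuristic -- not eatiht's actual implementation.
from collections import Counter

import requests
from lxml import html

url = 'http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html'
tree = html.fromstring(requests.get(url).content)
get_xpath = tree.getroottree().getpath

# 1. Query for text nodes long enough to plausibly be sentences
#    (the 20-character cutoff is an arbitrary choice for this example).
texts = tree.xpath('//body//text()[string-length(normalize-space()) > 20]')

# 2. Build a frequency distribution (histogram) across the parent nodes' xpaths.
histogram = Counter(get_xpath(t.getparent()) for t in texts)

# 3. The argmax of that distribution is the xpath shared by the likely sentences.
best_xpath, _ = histogram.most_common(1)[0]

# 4. Gather all the text under that xpath.
print(' '.join(t.strip() for t in tree.xpath(best_xpath + '//text()') if t.strip()))
```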

The results were surprisingly good. I personally prefer this approach to the others as it seems to lie somewhere in between the purely rule-based and the drowning-in-ML approaches.

Please raise any issues or yell at me at rodrigopala91@gmail.com or [@mi_ogirdor](https://twitter.com/mi_ogirdor)
