tweet-preprocessor

Elegant tweet preprocessing

Preprocessor

Preprocessor is a Python library for preprocessing tweet data. It was written as part of my bachelor thesis on sentiment analysis and later extracted into a library for broader use.

When building machine learning systems based on tweet data, preprocessing is required. This library makes it easy to clean, parse, or tokenize tweets.

Features

Currently supports cleaning, tokenizing and parsing:

URLs
Hashtags
Mentions
Reserved words (RT, FAV)
Emojis
Smileys
JSON and .txt file support

Preprocessor v0.6.0 supports Python 2.7 and 3.5+ on Linux, macOS and Windows. Tests run on the following setups:

Linux Xenial with Python 2.7, 3.5, 3.6, 3.7
macOS 10.14 with Python 3.7.5, 3.8.0
Windows 10.0.17134 with Python 2.7, 3.5.4, 3.6.8

Usage

Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'

Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
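To make the idea of tokenizing concrete, here is a minimal standard-library sketch of placeholder substitution. It is not the library's implementation: the function name `sketch_tokenize` and the three simplified patterns are assumptions chosen only to reproduce the example above.

```python
import re

# Simplified, illustrative patterns; the library's own rules are assumed
# to be more thorough than these.
SKETCH_TOKENS = [
    ("$URL$", re.compile(r"https?://\S+")),
    ("$HASHTAG$", re.compile(r"#\w+")),
    ("$EMOJI$", re.compile("[\U0001F300-\U0001FAFF]")),
]


def sketch_tokenize(text):
    """Replace every match of each pattern with its placeholder token."""
    for token, pattern in SKETCH_TOKENS:
        text = pattern.sub(token, text)
    return text


# "\U0001F44D" is the thumbs-up emoji used in the example above.
tweet = "Preprocessor is #awesome \U0001F44D https://github.com/s/preprocessor"
tokenized = sketch_tokenize(tweet)
print(tokenized)  # Preprocessor is $HASHTAG$ $EMOJI$ $URL$
```

URLs are replaced first so that a `#fragment` inside a link is not mistaken for a hashtag.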

Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
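The indices in the parsing example behave like ordinary Python slice offsets: the start is inclusive and the end is exclusive. The sketch below shows this with the standard `re` module instead of the library; the URL pattern is a simplified assumption, not the one Preprocessor uses.

```python
import re

tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"

# Simplified URL pattern, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")

# Collect (start, end, matched text) for every URL found in the tweet.
url_spans = [(m.start(), m.end(), m.group()) for m in URL_PATTERN.finditer(tweet)]
print(url_spans)  # [(25, 58, 'https://github.com/s/preprocessor')]

start, end, match = url_spans[0]
# The end offset is exclusive, so slicing the tweet recovers the match.
print(tweet[start:end] == match)  # True
```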

Fully customizable:

>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'

By default, Preprocessor applies all of the options. Call set_options to restrict it to the ones you specify.
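The effect of selecting options can be pictured with a small standard-library sketch. Everything here is hypothetical: `sketch_clean`, the string option names and the patterns are stand-ins, and unlike `p.set_options` the options are passed on each call.

```python
import re

# Hypothetical option names and simplified patterns, not the library's own.
SKETCH_PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "HASHTAG": re.compile(r"#\w+"),
    "MENTION": re.compile(r"@\w+"),
}


def sketch_clean(text, *options):
    """Remove matches for the chosen options; with no options, apply all."""
    active = options or tuple(SKETCH_PATTERNS)
    for name in active:
        text = SKETCH_PATTERNS[name].sub("", text)
    # Collapse the whitespace left behind by the removed matches.
    return " ".join(text.split())


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
only_urls = sketch_clean(tweet, "URL")
everything = sketch_clean(tweet)
print(only_urls)   # Preprocessor is #awesome
print(everything)  # Preprocessor is
```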

Processing files:

Preprocessor currently supports processing .json and .txt files. See the examples below for the expected input format.

Example JSON file

[
    "Preprocessor now supports files. https://github.com/s/preprocessor",
    "#preprocessing is a crucial part of @ML projects.",
    "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl"
]

Example Text file

Preprocessor now supports files. https://github.com/s/preprocessor
#preprocessing is a crucial part of @ML projects.
@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl
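The two sample inputs above can be produced with the standard library alone, which may help when trying out file processing. This is a sketch under assumptions: the temporary output directory is arbitrary, and the file names simply follow the ones used in the examples in this section.

```python
import json
import tempfile
from pathlib import Path

tweets = [
    "Preprocessor now supports files. https://github.com/s/preprocessor",
    "#preprocessing is a crucial part of @ML projects.",
    "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl",
]

# A temporary directory is an assumption; any writable location works.
out_dir = Path(tempfile.mkdtemp())
json_path = out_dir / "sample_json.json"
txt_path = out_dir / "sample_txt.txt"

# JSON input: a single top-level array of tweet strings.
json_path.write_text(json.dumps(tweets, indent=4), encoding="utf-8")

# Text input: one tweet per line.
txt_path.write_text("\n".join(tweets) + "\n", encoding="utf-8")

print(json_path)
print(txt_path)
```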

Preprocessing JSON file:

# JSON example
>>> input_file_name = "sample_json.json"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451892752_vkeCMTwBEMmX_clean_file_sample.json

Preprocessing text file:

# Text file example
>>> input_file_name = "sample_txt.txt"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451908865_TE9DWX1BjFws_clean_file_sample.txt

Available Options:

Option Name	Option Short Code
URL	p.OPT.URL
Mention	p.OPT.MENTION
Hashtag	p.OPT.HASHTAG
Reserved Words	p.OPT.RESERVED
Emoji	p.OPT.EMOJI
Smiley	p.OPT.SMILEY
Number	p.OPT.NUMBER

Installation

Using pip:

$ pip install tweet-preprocessor

Manual installation:

$ python setup.py build
$ python setup.py install

Contributing

Are you willing to contribute to preprocessor? That’s great! Please follow the steps below to contribute to this project:

Create a bug report or a feature idea using the templates on the Issues page.
Fork the repository and make your changes.
Open a PR and make sure your PR has tests and all the checks pass.
And that’s all!

