German language support for TextBlob.

German language support for TextBlob by Steven Loria.

This Python package is being developed as a TextBlob Language Extension. See Extension Guidelines for details.

Features

  • All directly accessible textblob_de classes (e.g. Sentence() or Word()) are initialized with default models for German

  • Properties or methods that do not yet work for German raise a NotImplementedError

  • German sentence boundary detection and tokenization (NLTKPunktTokenizer)

  • Consistent use of specified tokenizer for all tools (NLTKPunktTokenizer or PatternTokenizer)

  • Part-of-speech tagging (PatternTagger) with keyword include_punc=True (defaults to False)

  • Parsing (PatternParser) with all pattern keywords, plus pprint=True (defaults to False)

  • Noun Phrase Extraction (PatternParserNPExtractor)

  • Lemmatization (PatternParserLemmatizer)

  • Polarity detection (PatternAnalyzer) - Still EXPERIMENTAL, does not yet have information on subjectivity

  • NEW: Full pattern.text.de API support on Python3

  • Supports Python 2 and 3

  • See working features overview for details
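
Because unsupported properties and methods raise a NotImplementedError instead of returning a placeholder, calling code may want to guard attribute access. The following is a minimal sketch that assumes only that behaviour; `safe_attr` and `FakeBlob` are hypothetical names used for illustration and are not part of textblob_de.

```python
def safe_attr(obj, name, default=None):
    """Return obj.<name>, or `default` if accessing it raises NotImplementedError.

    Hypothetical helper for illustration; not part of textblob_de.
    """
    try:
        return getattr(obj, name)
    except NotImplementedError:
        return default


class FakeBlob(object):
    """Stand-in for a blob with one unsupported property (illustration only)."""

    @property
    def supported(self):
        return "ok"

    @property
    def unsupported(self):
        raise NotImplementedError("not available for German yet")


blob = FakeBlob()
print(safe_attr(blob, "supported"))
print(safe_attr(blob, "unsupported", default="n/a"))
```

With a real blob, the same helper could wrap any property whose German support is uncertain, so that one missing feature does not abort a whole processing run.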

Installing/Upgrading

$ pip install -U textblob-de
$ python -m textblob.download_corpora

Or install the latest development release (this reportedly does not always work on Windows; see issues #1744/5 for details):

$ pip install -U git+https://github.com/markuskiller/textblob-de.git@dev
$ python -m textblob.download_corpora
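
To confirm which release ended up installed after either command, the distribution metadata can be queried from Python. This is a sketch that assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata  # standard library on Python 3.8+


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("textblob-de")
if version is None:
    print("textblob-de is not installed")
else:
    print("textblob-de " + version)
```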

Usage

>>> from textblob_de import TextBlobDE as TextBlob
>>> text = '''Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag.
Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider
habe ich nur noch EUR 18.50 in meiner Brieftasche.'''
>>> blob = TextBlob(text)
>>> blob.sentences
[Sentence("Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag."),
 Sentence("Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen."),
 Sentence("Aber leider habe ich nur noch EUR 18.50 in meiner Brieftasche.")]
>>> blob.tokens
WordList(['Heute', 'ist', 'der', '3.', 'Mai', ...])
>>> blob.tags
[('Heute', 'RB'), ('ist', 'VB'), ('der', 'DT'), ('3.', 'LS'), ('Mai', 'NN'),
('2014', 'CD'), ...]
# Default: Only noun_phrases that consist of two or more meaningful parts are displayed.
# Not perfect, but a start (relies heavily on parser accuracy)
>>> blob.noun_phrases
WordList(['Mai 2014', 'Dr. Meier', 'seinen 43. Geburtstag', 'Kuchen einzukaufen',
'meiner Brieftasche'])
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.parse()
'Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O'
>>> from textblob_de import PatternParser
>>> blob = TextBlob(u"Das ist ein schönes Auto.", parser=PatternParser(pprint=True, lemmata=True))
>>> blob.parse()
#          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA
#
#       Das   DT     -       -      -      -      das
#       ist   VB     VP      -      -      -      sein
#       ein   DT     NP      -      -      -      ein
#   schönes   JJ     NP ^    -      -      -      schö
#      Auto   NN     NP ^    -      -      -      auto
#         .   .      -       -      -      -      .
>>> from textblob_de import PatternTagger
>>> blob = TextBlob("Das Auto ist sehr schön.", pos_tagger=PatternTagger(include_punc=True))
>>> blob.tags
[('Das', 'DT'), ('Auto', 'NN'), ('ist', 'VB'), ('sehr', 'RB'), ('schön', 'JJ'), ('.', '.')]
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.sentiment
(1.0, 0.0)
>>> blob = TextBlob("Das ist ein hässliches Auto.")
>>> blob.sentiment
(-1.0, 0.0)
>>> blob.words.lemmatize()
WordList(['das', 'sein', 'ein', 'hässlich', 'Auto'])
>>> from textblob_de.lemmatizers import PatternParserLemmatizer
>>> _lemmatizer = PatternParserLemmatizer()
>>> _lemmatizer.lemmatize("Das ist ein hässliches Auto.")
[('das', 'DT'), ('sein', 'VB'), ('ein', 'DT'), ('hässlich', 'JJ'), ('Auto', 'NN')]
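
The slash-delimited string returned by parse() can be split into per-token fields for further processing. This sketch needs no installed package, because it works on the output string copied from the example above. It assumes the four fields are word, tag, chunk and PNP, and that no word contains a slash itself; `split_parse` is a hypothetical helper name.

```python
def split_parse(parsed):
    """Split a slash-delimited parse string into one tuple of fields per token."""
    return [tuple(token.split("/")) for token in parsed.split()]


# Output string copied from the parse() example above.
parsed = u"Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O"

tokens = split_parse(parsed)

# Keep only words whose tag (second field) marks a noun.
nouns = [fields[0] for fields in tokens if fields[1].startswith("NN")]
print(nouns)
```

Splitting on whitespace first also handles output in which sentences are separated by newlines.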

Access to pattern API in Python3

>>> from textblob_de.packages import pattern_de as pd
>>> print(pd.attributive("neugierig", gender=pd.FEMALE, role=pd.INDIRECT, article="die"))
neugierigen

Requirements

  • Python >= 2.6 or >= 3.3

TODO

  • TextBlob Extension: textblob-rftagger (wrapper class for RFTagger)

  • TextBlob Extension: textblob-cmd (command-line wrapper for TextBlob, basically TextBlob for files)

  • TextBlob Extension: textblob-stanfordparser (wrapper class for StanfordParser via NLTK)

  • TextBlob Extension: textblob-berkeleyparser (wrapper class for BerkeleyParser)

  • TextBlob Extension: textblob-sent-align (sentence alignment for parallel TextBlobs)

  • TextBlob Extension: textblob-converters (various input and output conversions)

  • Additional PoS tagging options, e.g. NLTK tagging (NLTKTagger)

  • Improve noun phrase extraction (e.g. based on RFTagger output)

  • Improve sentiment analysis (find suitable subjectivity scores)

  • Improve functionality of Sentence() and Word() objects

  • Adapt more tests from textblob main package (esp. for TextBlobDE() in test_blob.py)

License

MIT licensed. See the bundled LICENSE file for more details.

Changelog

0.2.4 (04/08/2014)

  • Major internal refactoring (but no backwards-incompatible API changes) with the aim of restoring complete compatibility with the original pattern>=2.6 library on Python2

  • Separation of textblob and pattern code

  • On Python2, the vendorized version of pattern.text.de is used only if the original is not installed (same as nltk)

  • Made pattern.de.pprint function and all parser keywords accessible to customise parser output

  • Access to the complete pattern.text.de API on Python2 and Python3 via from textblob_de.packages import pattern_de as pd

  • tox passed on all major platforms (Win/Linux/OSX)

0.2.3 (26/07/2014)

  • Lemmatizer: PatternParserLemmatizer() extracts lemmata from Parser output

  • Improved polarity analysis through look-up of lemmatised word forms

0.2.2 (22/07/2014)

  • Option: Include punctuation in tags/pos_tags properties (b = TextBlobDE(text, tagger=PatternTagger(include_punc=True)))

  • Added BlobberDE() class initialized with German models

  • TextBlobDE(), Sentence(), WordList() and Word() classes are now all initialized with German models

  • Restored complete API compatibility with textblob.tokenizers module of textblob main package

0.2.1 (20/07/2014)

  • Noun Phrase Extraction: PatternParserNPExtractor() extracts NPs from Parser output

  • Refactored the way TextBlobDE() passes on arguments and keyword arguments to individual tools

  • Backwards-incompatible: Deprecated the parser_show_lemmata=True keyword in TextBlob(). Use parser=PatternParser(lemmata=True) instead.

0.2.0 (18/07/2014)

  • vastly improved tokenization (NLTKPunktTokenizer and PatternTokenizer with tests)

  • consistent use of specified tokenizer for all tools

  • TextBlobDE with initialized default models for German

  • Parsing (PatternParser) plus test_parsers.py

  • EXPERIMENTAL implementation of Polarity detection (PatternAnalyzer)

  • first attempt at extracting German Polarity clues into de-sentiment.xml

  • tox tests passing for py26, py27, py33 and py34

0.1.3 (09/07/2014)

  • First release on PyPI

0.1.0 - 0.1.2 (09/07/2014)

  • First release on github

  • A number of experimental releases for testing purposes

  • Adapted version badges, tests & travis-ci config

  • Code adapted from sample extension textblob-fr

  • Language specific linguistic resources copied from pattern-de
