textdirectory

TextDirectory allows you to combine multiple text files into one.

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language
- Python :: 3.6

=============
TextDirectory
=============

.. image:: https://img.shields.io/pypi/v/textdirectory.svg
        :target: https://pypi.python.org/pypi/textdirectory

.. image:: https://img.shields.io/travis/IngoKl/textdirectory.svg
        :target: https://travis-ci.org/IngoKl/textdirectory

.. image:: https://readthedocs.org/projects/textdirectory/badge/?version=latest
        :target: https://textdirectory.readthedocs.io/en/latest/?badge=latest
        :alt: Documentation Status

|
|

.. image:: https://user-images.githubusercontent.com/16179317/39367680-cd409a00-4a37-11e8-8d42-0bed5a4e814b.png
        :alt: TextDirectory

*TextDirectory* allows you to combine multiple text files into one aggregated file. TextDirectory also supports matching
files for certain criteria and applying transformations to the aggregated text.

*TextDirectory* can be used as a standalone tool (via the CLI) or as a Python library.

Of course, everything *TextDirectory* does could be achieved in bash or PowerShell. However, there are certain
use-cases (e.g. when used as a library) in which it might be useful.

* Free software: MIT license
* Documentation: https://textdirectory.readthedocs.io.

Features
--------
* Aggregating multiple text files
* Matching files based on length (characters, tokens), content, and random sampling
* Transforming the aggregated text (e.g. transforming the text to lowercase)

.. csv-table::
    :header: "Version", "Filters", "Transformations"
    :widths: 10, 30, 30

    0.1.0, filter_by_max_chars(n); filter_by_min_chars(n); filter_by_max_tokens(n); filter_by_min_tokens(n); filter_by_contains(str); filter_by_not_contains(str); filter_by_random_sampling(n), transformation_lowercase
    0.1.1, filter_by_chars_outliers(n sigmas), transformation_remove_nl
    0.1.2, filter_by_filename_contains(str), transformation_usas_en_semtag; transformation_uppercase; transformation_postag(spaCy model)
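
To illustrate what filtering by "n sigmas" can mean, here is a minimal sketch of an outlier filter over character counts. It is not the code behind ``filter_by_chars_outliers``: the function name ``filter_by_length_outliers``, the use of the population standard deviation, and the sample counts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def filter_by_length_outliers(lengths, sigmas=2.0):
    """Keep entries whose length is within `sigmas` standard deviations of the mean."""
    values = list(lengths.values())
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    return {name: n for name, n in lengths.items() if abs(n - mu) <= sigmas * sd}


# Hypothetical character counts per file.
lengths = {"a.txt": 100, "b.txt": 110, "c.txt": 90, "d.txt": 105, "e.txt": 1000}
kept = filter_by_length_outliers(lengths, sigmas=1.0)
print(sorted(kept))  # ['a.txt', 'b.txt', 'c.txt', 'd.txt']
```

With very few files a single extreme value inflates the standard deviation considerably, so a small ``sigmas`` value may be needed before anything is filtered out.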

Quickstart
----------
Install *TextDirectory* via pip: ``pip install textdirectory``

*TextDirectory*, as exemplified below, works with a two-stage model. After loading in your data (directory) you can iteratively select the files you want to process. In a second step you can perform transformations on the text before finally aggregating it.

.. image:: https://user-images.githubusercontent.com/16179317/39367589-7f774116-4a37-11e8-9a09-5cbdf5f3311b.png
        :alt: TextDirectory

As a Command-Line Tool
~~~~~~~~~~~~~~~~~~~~~~
*TextDirectory* comes equipped with a CLI.

The syntax for *filters* and *transformations* is the same. Steps are chained with slashes (/), and
parameters are passed after commas (,): ``filter_by_min_tokens,5/filter_by_random_sampling,2``.
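
The chaining syntax can be read as a list of steps, each with a name and optional parameters. The following sketch shows one way such a string could be split. The helper ``parse_chain`` is hypothetical and not part of *TextDirectory*; the CLI's actual parsing may differ, for instance in how it converts parameter types.

```python
def parse_chain(chain):
    """Split a chain such as 'a,1/b,2' into [(name, [args]), ...]."""
    steps = []
    for part in chain.split("/"):
        part = part.strip()
        if not part:
            continue  # tolerate empty segments such as a trailing slash
        name, *args = [piece.strip() for piece in part.split(",")]
        steps.append((name, args))  # parameters stay as strings here
    return steps


print(parse_chain("filter_by_min_tokens,5/filter_by_random_sampling,2"))
# [('filter_by_min_tokens', ['5']), ('filter_by_random_sampling', ['2'])]
```

A step without parameters, such as ``transformation_lowercase``, simply yields an empty parameter list.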

**Example 1: A Very Simple Aggregation**

``textdirectory --directory testdata --output_file aggregated.txt``

This takes all ``.txt`` files in *testdata* and aggregates them into a file called *aggregated.txt*.

**Example 2: Applying Filters and Transformations**

In this example we want to filter the files based on their token count, perform a random sampling and finally transform all text to lowercase.

``textdirectory --directory testdata --output_file aggregated.txt --filters filter_by_min_tokens,5/filter_by_random_sampling,2 --transformations transformation_lowercase``

After applying two filters (*filter_by_min_tokens* and *filter_by_random_sampling*), we apply the *transformation_lowercase* transformation.

The resulting file will contain the content of two files that each have at least five tokens.

As a Python Library
~~~~~~~~~~~~~~~~~~~
In order to demonstrate *TextDirectory* as a Python library, we'll recreate the second example from above:

.. code:: python

    import textdirectory

    # Stage 1: load the directory and select files.
    td = textdirectory.TextDirectory(directory='testdata')
    td.load_files(recursive=False, filetype='txt', sort=True)
    td.filter_by_min_tokens(5)
    td.filter_by_random_sampling(2)

    # Stage 2: stage transformations, then aggregate.
    td.stage_transformation(['transformation_lowercase'])
    td.aggregate_to_file('aggregated.txt')

If we wanted to keep working with the actual aggregated text, we could have called ``text = td.aggregate_to_memory()``.

ToDo
--------
* Increasing test coverage
* Writing better documentation
* Adding better error handling (raw exceptions are, well ...)
* Adding logging

Behaviour
---------
We are not holding the actual texts in memory. This leads to much more disk read activity (and time inefficiency), but
saves memory.
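
The trade-off can be sketched with the standard library alone: keep only file paths between steps and read each text at the moment it is needed. This is an illustration of the idea, not *TextDirectory*'s implementation; ``aggregate_lazily``, ``demo``, and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def aggregate_lazily(paths, out_path, transform=str.lower):
    """Write each file's transformed text to out_path, one file at a time."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            # Only one file's text is held in memory at any point.
            text = Path(path).read_text(encoding="utf-8")
            out.write(transform(text))
            out.write("\n")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "a.txt").write_text("First FILE", encoding="utf-8")
        (tmp / "b.txt").write_text("Second FILE", encoding="utf-8")
        selected = sorted(tmp.glob("*.txt"))  # paths only, no contents
        out_path = tmp / "aggregated.out"
        aggregate_lazily(selected, out_path)
        return out_path.read_text(encoding="utf-8")


print(demo())
```

Any filter that inspects content has to read the file again under this approach, which is where the extra disk activity comes from.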

``transformation_usas_en_semtag`` relies on the web version of `Paul Rayson's USAS Tagger
<http://ucrel.lancs.ac.uk/usas/>`_. Don't use this transformation for large amounts of text, give credit, and
consider using their commercial product `Wmatrix <http://ucrel.lancs.ac.uk/wmatrix/>`_.

Credits
-------
This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage

=======
History
=======

0.1.0 (2018-04-26)
------------------

* Initial release
* First release on PyPI.

0.1.1 (2018-04-27)
------------------

* added filter_by_chars_outliers
* added transformation_remove_nl

0.1.2 (2018-04-29)
------------------
* added transformation_postag
* added transformation_usas_en_semtag
* added transformation_uppercase
* added filter_by_filename_contains
* added parameter support for transformations

Release history

* 0.3.3 (Sep 25, 2022)
* 0.3.2 (Jan 10, 2021)
* 0.3.1.1 (Jan 20, 2020)
* 0.3.1 (Jan 20, 2020)
* 0.3.0 (Jan 20, 2020)
* 0.2.2 (Jun 13, 2019)
* 0.2.1 (Jun 13, 2019)
* 0.2.0 (May 13, 2018)
* 0.1.4 (May 2, 2018)
* 0.1.3 (Apr 30, 2018)
* 0.1.2 (Apr 29, 2018), this version
* 0.1.0 (Apr 27, 2018)

Download files

Source Distribution

textdirectory-0.1.2.tar.gz (15.1 kB)

Uploaded Apr 29, 2018 Source

Built Distribution

textdirectory-0.1.2-py2.py3-none-any.whl (10.8 kB)

Uploaded Apr 29, 2018 Python 2 Python 3

Hashes for textdirectory-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`7112d64abb93d23f68fe84cbb837f3b4e946b119bf88d5064604790186836408`
MD5	`eb1c138e3ceba7ad89c695d119c6b73d`
BLAKE2b-256	`ccacb262c793628858a67a2b256803528de57848d919899afd3efda7e35d7d1e`

Hashes for textdirectory-0.1.2-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`fd0eea9a172cbb953b1211a169c501d1b527eebd47281d61ceb61705867318ec`
MD5	`31c5d0d2bc75bde1745dcc711fbf9d05`
BLAKE2b-256	`7eb6034c56cd6c5fa37444b926e3c4764200711b82368c1c0edabc46767c6c00`