Arbitrary transliterations on Microsoft Office documents

convertextract


Extract text and perform find/replace on it based on arbitrary correspondences. This library is a fork of the Textract library by Dean Malmgren (https://github.com/deanmalmgren/textract).

Documentation

Installation

To install, you must have Python 3.4+ and pip installed.

pip install convertextract

Some file formats require additional system libraries, which vary by operating system. See http://textract.readthedocs.org/en/latest/installation.html for installation instructions.


Basic CLI Use

Some basic Textract functions are preserved. Please visit http://textract.readthedocs.org for documentation.

Converting a file based on an .xlsx correspondence sheet

convertextract requires two arguments:

A file containing text to convert (as of Version 1.0.4, this includes .pptx, .docx, .xlsx, and .txt)
An .xlsx file containing the find/replace correspondences. As of Version 2.0.1, you can also use a .csv file, or pass a list of correspondences (as Python dicts) directly to the language keyword argument of either process or process_text

Running the command:

convertextract path/to/foo.docx -l path/to/bar.xlsx

will produce a new file, path/to/foo_converted.docx, which contains the same content as path/to/foo.docx but with find/replace performed for all correspondences listed in path/to/bar.xlsx.

Creating an .xlsx correspondence sheet

Your correspondence sheet must be set up as follows:

in	out
aa	å
oe	ø
ae	æ

This correspondence sheet would replace all instances of aa, oe, or ae in a given file with å, ø, or æ, respectively. Use the column headers in and out as shown; do not use other headers such as "replace with" or "find".
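The effect of such a sheet can be illustrated with a few lines of plain Python. This is only a minimal sketch of the find/replace idea, not convertextract's actual implementation; apply_correspondences is a hypothetical helper, and the choice to try longer "in" strings first is an assumption made for the illustration.

```python
import re

# The correspondence sheet above, written as a list of dicts
rules = [
    {"in": "aa", "out": "å"},
    {"in": "oe", "out": "ø"},
    {"in": "ae", "out": "æ"},
]

def apply_correspondences(text, rules):
    """Replace every 'in' string with its 'out' string in a single pass."""
    mapping = {rule["in"]: rule["out"] for rule in rules}
    # Longer keys first, so a longer match is not pre-empted by a shorter one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass means replaced text is never fed back into another rule
    return pattern.sub(lambda match: mapping[match.group(0)], text)

print(apply_correspondences("blaabaer", rules))  # prints "blåbær"
```

Text that matches no "in" entry is left untouched.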

Supported conversions

As of Version 2.0, the following conversions are supported:

Heiltsuk Doulos Font -> Unicode

convertextract path/to/foo.docx -l hei -t Doulos

Heiltsuk Times Font -> Unicode

convertextract path/to/foo.docx -l hei -t Times

Tsilhqot'in Doulos Font -> Unicode

convertextract path/to/foo.docx -l clc -t Doulos

Navajo Times Font -> Unicode

convertextract path/to/foo.docx -l nav -t Times
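To convert several documents with the same settings, the commands above can be assembled from a script. The following is a rough sketch that uses only the -l and -t options shown above; build_command and convert_folder are hypothetical helper names, and the sketch assumes the convertextract executable is on your PATH when dry_run is turned off.

```python
import subprocess
from pathlib import Path

def build_command(path, language, font=None):
    """Build the argument list for one convertextract call."""
    cmd = ["convertextract", str(path), "-l", language]
    if font is not None:
        cmd += ["-t", font]
    return cmd

def convert_folder(folder, language, font=None, dry_run=True):
    """Build (and optionally run) one command per .docx file in a folder."""
    paths = sorted(Path(folder).glob("*.docx"))
    commands = [build_command(path, language, font) for path in paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a conversion fails
            subprocess.run(cmd, check=True)
    return commands

print(build_command("path/to/foo.docx", "hei", "Doulos"))
```

With dry_run left at True, convert_folder only returns the commands so they can be inspected before anything is run.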

Using Regular Expressions

As of Version 1.5, there is support for Regular Expressions. If you do not need context-sensitive conversions, you can leave out the context columns. If you do need them, set up your correspondence sheet as follows:

in	out	context_before	context_after
aa	å	[k,d]	$
aa	æ	t	$
aa	a:

For more information on how the conversion is actually processed by g2p, please visit https://github.com/roedoejet/g2p.
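One way to picture the context columns is as regular-expression lookarounds placed around the "in" string. The sketch below is a simplified illustration in plain Python and is not how g2p is implemented; compile_rule and convert are hypothetical helpers, and it assumes that "in" is a literal string, that rules are tried in sheet order, and that the first matching rule wins.

```python
import re

# The context-sensitive sheet above, written as a list of dicts
rules = [
    {"in": "aa", "out": "å", "context_before": "[k,d]", "context_after": "$"},
    {"in": "aa", "out": "æ", "context_before": "t", "context_after": "$"},
    {"in": "aa", "out": "a:", "context_before": "", "context_after": ""},
]

def compile_rule(rule):
    """Turn one rule into (pattern, out) using lookbehind and lookahead."""
    before = rule.get("context_before") or ""
    after = rule.get("context_after") or ""
    pattern = ""
    if before:
        # Python lookbehinds must be fixed-width; "[k,d]" and "t" both are
        pattern += "(?<=" + before + ")"
    pattern += re.escape(rule["in"])
    if after:
        pattern += "(?=" + after + ")"
    return re.compile(pattern), rule["out"]

def convert(text, rules):
    """Scan left to right; at each position apply the first rule that matches."""
    compiled = [compile_rule(rule) for rule in rules]
    result = []
    pos = 0
    while pos < len(text):
        for pattern, out in compiled:
            match = pattern.match(text, pos)
            if match:
                result.append(out)
                pos = match.end()
                break
        else:
            # No rule applies here, so copy the character unchanged
            result.append(text[pos])
            pos += 1
    return "".join(result)

print(convert("kaa", rules))   # prints "kå"
print(convert("taa", rules))   # prints "tæ"
print(convert("kaab", rules))  # prints "ka:b"
```

In this sketch the last row, which has no context, acts as a fallback for every aa that the first two rows do not cover.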

Use as Python package

You can also use the package from a Python script. The process function returns the converted text without formatting, and running the script will still create a foo_converted.docx file.

import convertextract
text = convertextract.process('foo.docx', language='bar.xlsx')

You can also use convertextract to just convert text in Python using process_text.

import convertextract
text = convertextract.process_text('test', language=[{'in': 't', 'out': 'p', 'context_before': '^', 'context_after': 'e'}])
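Correspondences kept in a .csv file can be read into the same list-of-dicts form with the standard library. This is a minimal sketch, assuming the CSV uses the same column headers as the .xlsx sheet (in, out, and optionally context_before and context_after); load_correspondences is a hypothetical helper and not part of convertextract.

```python
import csv
import io

def load_correspondences(csv_file):
    """Read correspondence rows into a list of dicts keyed by column header."""
    reader = csv.DictReader(csv_file)
    # Cells missing from a short row come back as None; store "" instead
    return [{key: (value or "") for key, value in row.items()} for row in reader]

# In-memory stand-in for a real file, e.g. open("bar.csv", newline="", encoding="utf-8")
sample = io.StringIO("in,out,context_before,context_after\nt,p,^,e\n")
rules = load_correspondences(sample)
print(rules)

# With convertextract installed, the list could then be passed on:
# import convertextract
# text = convertextract.process_text('test', language=rules)
```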

Release history

Version	Release date
3.2.17	Oct 13, 2022
3.2.16	May 19, 2021
3.2.15	May 14, 2021
3.2.14	Mar 16, 2021
3.2.13	Mar 12, 2021
3.2.12	Nov 17, 2020
3.2.11	Nov 17, 2020
3.2.10	Nov 2, 2020
3.2.9	Sep 9, 2020
3.2.8	Aug 14, 2020
3.2.7	Aug 12, 2020
3.2.2	Aug 7, 2020
3.2.1	Aug 7, 2020
3.2.0	Aug 7, 2020
3.1.2	Aug 7, 2020
3.1.1	Jul 24, 2020
3.1.0	Mar 23, 2020
3.0.0	Feb 16, 2020
2.5.0 (this version)	Aug 7, 2019
2.0.2	Feb 7, 2019
2.0.1	Jan 11, 2019
2.0	Jul 24, 2018
1.6.1	Jun 5, 2018
1.6	May 7, 2018
1.5.1	Apr 30, 2018
1.5	Mar 22, 2018
1.3	Jan 16, 2018
1.2.2	Sep 27, 2017
1.2.1	Jun 18, 2017
1.2	Jun 18, 2017
1.1	Jan 22, 2017
1.0.9	Nov 27, 2016
1.0.8	Nov 27, 2016
1.0.7	Oct 28, 2016
1.0.6	Oct 21, 2016
1.0.5	Oct 21, 2016
1.0.4	Oct 14, 2016

Download files

Source Distribution

convertextract-2.5.0.tar.gz (15.2 kB), uploaded Aug 7, 2019, Source

Built Distribution

convertextract-2.5.0-py3-none-any.whl (46.3 kB), uploaded Aug 7, 2019, Python 3

Hashes for convertextract-2.5.0.tar.gz

Algorithm	Hash digest
SHA256	`5be3ca1060230d99a8d0889df8210c345e05734a73bd8630c8facf5ba065ec83`
MD5	`5f7329f1d9c564b3d18986b594835b77`
BLAKE2b-256	`24896fbe9ad2952c4f43ebd8a1f8baf66c642de060ecda54f07c99ba292f4913`

Hashes for convertextract-2.5.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`bcb727eac8dbf24910e95299ac57650009d45d0d53232fec8073378d8573769a`
MD5	`c5dc6abcca38361ceea79fdd991f7ac6`
BLAKE2b-256	`99acba9cb86f4086c2a682202767358783f9a591ba2f51333565d6f606b84515`