A conversion/manipulation tool for oral linguistics.

Development Status
- 5 - Production/Stable
Intended Audience
- Science/Research
Natural Language
- English
Operating System
- Microsoft :: Windows
Programming Language
- Python :: 3
Topic
- Scientific/Engineering

Project description

Corflow

File conversion and manipulation software for corpus linguistics.

See the GitHub wiki for documentation.

0. Readme updates

François Delafontaine: Neuchâtel (Switzerland), 22.08.2022

1. Context

Corflow, originally the 'multitool', was started around 2015 to anonymize and convert files for the OFROM corpus (at Neuchâtel, Switzerland). Initially written in C++, it was reworked between 2016 and 2019 in the ANR-DFG SegCor project (at Orléans, France) and translated into Python. Since 2019 it has been developed within the ANR-DFG DoReCo project (at Lyon, France).

2. Objectives

The core objectives are the conversion and manipulation of files in the context of corpus linguistics (notably oral linguistics), but some clarifications are needed first. Conversion means changing a file's format. A format is the way information is stored in the file. We will generally use the name of the software or collection associated with a format to designate the format itself: Elan-to-Praat, for example, means converting from the '.eaf' format to the '.TextGrid' format. Manipulation, finally, means operations on the stored information itself: merging, anonymization, inter-rater agreement, etc. In detail, the objectives are:

An "X-to-Y" conversion: conversion should be possible from any supported format to any other supported format (see Pepper's Swiss-army-knife approach).
A lossless conversion: as little information as feasible should be lost during conversion.
Accessibility: the package should be available (a) for automatic integration, (b) through the command prompt and (c) through a dedicated graphical interface.
More accessibility: the package should require as few third-party libraries as possible, and be easy to understand and to extend (by users adding their own scripts). The software's intended audience (in corpus linguistics) is expected to have little to no experience with code. More advanced users are expected to prefer Pepper.

3. Limitations

No versioning has been set up yet.
No user interface is provided.
No customized error messages.
Currently supported formats are: 'Praat (.TextGrid)', 'Elan (.eaf)', 'Pangloss (.xml)'. Testing has been limited and users should expect errors. TEI import is still in development.

4. Package

In its Python version, Corflow is meant to be imported as a package, as is. That package corresponds to the conversion folder, which should contain a 'Transcription.py' file and a set of 'fromX.py' and 'toX.py' files (for import and export respectively).
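
The naming convention above makes it possible to tell from the file names alone which formats can be imported and which can be exported. The following is a minimal sketch of that idea, not part of Corflow: the helper `list_formats` and the example folder listing are assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of Corflow: it only illustrates the
# 'fromX.py' / 'toX.py' naming convention described above.
_SCRIPT = re.compile(r"^(from|to)([A-Z]\w*)\.py$")


def list_formats(filenames):
    """Sort script names into importable and exportable format names."""
    formats = {"from": set(), "to": set()}
    for name in filenames:
        match = _SCRIPT.match(name)
        if match:
            direction, fmt = match.groups()
            formats[direction].add(fmt)
    return sorted(formats["from"]), sorted(formats["to"])


# Example folder listing (the file names are assumptions for illustration).
folder = ["Transcription.py", "fromElan.py", "toElan.py",
          "fromPraat.py", "toPraat.py", "fromPangloss.py"]
importers, exporters = list_formats(folder)
print(importers)  # ['Elan', 'Pangloss', 'Praat']
print(exporters)  # ['Elan', 'Praat']
```

In this made-up listing, Pangloss has an import script but no export script, so it could only serve as a source format.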

5. How does it work?

Corflow is built around a Transcription class used for "universal" information storage: all information from all the supported formats should fit in. Import scripts instantiate a Transcription object and fill it with the file's information; export scripts use a Transcription object to write a file:

X -fromX-> Transcription -toY-> Y

Manipulations are expected to operate on Transcription objects:

X -fromX-> Transcription -manipulation-> Transcription -toY-> Y

In practice this can vary, as manipulations are open and dependent on the user's needs.
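
The import, manipulation and export chain can be sketched with toy code. Everything below is hypothetical and does not use Corflow's API: a plain dict stands in for the Transcription object, two made-up text formats (comma-separated and tab-separated lines) stand in for real formats, and `anonymize` is a stand-in manipulation.

```python
# Toy pipeline: X -fromX-> Transcription -manipulation-> Transcription -toY-> Y
# All names are hypothetical; a plain dict stands in for the Transcription class.

def from_csv(text):
    """Import: 'start,end,content' lines -> pivot object."""
    segments = []
    for line in text.strip().splitlines():
        start, end, content = line.split(",", 2)
        segments.append({"start": float(start), "end": float(end),
                         "content": content})
    return {"segments": segments}


def to_tsv(trans):
    """Export: pivot object -> tab-separated lines."""
    return "\n".join(f"{s['start']}\t{s['end']}\t{s['content']}"
                     for s in trans["segments"])


def anonymize(trans, names, mask="***"):
    """Manipulation: return a new pivot object with the given names masked."""
    segments = []
    for seg in trans["segments"]:
        content = seg["content"]
        for name in names:
            content = content.replace(name, mask)
        segments.append({**seg, "content": content})
    return {"segments": segments}


def convert(text, importer, exporter, manipulations=()):
    """Run import, then each manipulation in order, then export."""
    trans = importer(text)
    for manipulate in manipulations:
        trans = manipulate(trans)
    return exporter(trans)


source = "0.0,1.5,hello Marie\n1.5,2.0,yes"
result = convert(source, from_csv, to_tsv,
                 [lambda t: anonymize(t, ["Marie"])])
print(result)
```

Because every importer produces the same pivot object and every exporter consumes it, adding a format only requires one import script and one export script, and manipulations never need to know about file formats.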

The Transcription class is divided into (a) data and (b) metadata.

(5a) Data is, for oral linguistics, what corresponds to a transcription. A transcription is text aligned to sound. The alignment relies on time points (time boundaries or timestamps). A set composed of a given text and two time boundaries (its start and end points relative to sound) is called a Segment: technically, any arbitrary unit generated that way. Segments might not be linguistic units, and might not be units at all (and conversely, a linguistic unit like the pause might have no corresponding segment). A set of segments is called a Tier, and a set of tiers corresponds to the whole transcription. We don't claim here that all tiers, that is, all sets of segments, are linguistic transcriptions. They can also represent translations, annotations, etc. Tiers, like segments, are type-neutral.

(5b) Metadata is, for corpus linguistics, all information around the transcription: where, when, who, how...

*Start* and *end* contain the time boundaries and *content* the text. This is how data is stored in the `Transcription` class in general, although more variables exist.
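
The hierarchy described in this section (segments grouped into tiers, tiers grouped into a transcription that also carries metadata) can be sketched with dataclasses. This is an illustration under assumptions, not Corflow's actual class definitions: the field names other than start, end and content, the tier `name`, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A text aligned to sound by two time boundaries (in seconds here)."""
    start: float
    end: float
    content: str


@dataclass
class Tier:
    """A type-neutral set of segments; 'name' is an assumed attribute."""
    name: str
    segments: list = field(default_factory=list)


@dataclass
class Transcription:
    """Data (tiers) plus metadata (where, when, who, how...)."""
    tiers: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


# Build a small example: one tier of words and one tier of translations.
words = Tier("words", [Segment(0.0, 0.4, "bonjour"), Segment(0.4, 0.9, "Marie")])
gloss = Tier("translation", [Segment(0.0, 0.9, "hello Marie")])
trans = Transcription(tiers=[words, gloss], metadata={"language": "fr"})

# Segments carry no type: the same structure holds words and translations.
for tier in trans.tiers:
    print(tier.name, [(s.start, s.end, s.content) for s in tier.segments])
```

Keeping tiers and segments type-neutral means the same container can hold a transcription, a translation or an annotation; what a tier represents is left to the user or to the metadata.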

6. Conclusion

The question of [file conversion](https://corflo.hypotheses.org/122) might never be answered in a satisfactory manner. Corflow was originally just one more homemade conversion tool; our hope is that it becomes an easily accessible package that other teams and projects can use either as is, for basic needs, or by quickly adapting it to their requirements.

Release history

This version

3.2.15

Feb 26, 2024

3.2.14

Feb 18, 2024

3.2.13

Feb 17, 2024

3.2.12

Feb 17, 2024

3.2.11

Feb 13, 2024

3.2.10

Feb 11, 2024

3.2.9

Feb 11, 2024

3.2.8

Feb 10, 2024

3.2.7

Feb 8, 2024

3.2.6

Aug 11, 2023

3.2.5

Jul 28, 2023

3.2.4

Mar 13, 2023

3.2.3

Feb 10, 2023

3.2.2

Feb 9, 2023

3.2.1

Jan 25, 2023

3.2.0

Jan 25, 2023

3.1.4

Dec 16, 2022

3.1.3

Dec 15, 2022

3.1.2

Sep 25, 2022

3.1.1

Aug 23, 2022

3.1.0

Aug 23, 2022

Download files

Source Distribution

corflow-3.2.15.tar.gz (50.1 kB)

Uploaded Feb 26, 2024 Source

Built Distribution

corflow-3.2.15-py3-none-any.whl (58.3 kB)

Uploaded Feb 26, 2024 Python 3

Hashes for corflow-3.2.15.tar.gz
Algorithm	Hash digest
SHA256	`acd8f6868bdeeb66336aef44520a6621ad0e1dfb635f2f970ff186eee56f4efa`
MD5	`5fe474ddc250bc29b349cd44f1b0017b`
BLAKE2b-256	`9a1a9de64c69bd764e3c36c7b37d72e52524b6850d738a120adcd29d043ba743`

Hashes for corflow-3.2.15-py3-none-any.whl
Algorithm	Hash digest
SHA256	`09d6079db6a6984845545860b14ab9477f19598914dadb11b97c8468947d2d8e`
MD5	`77e0e6bd76378d51f45ba4955a3ca390`
BLAKE2b-256	`572b9efce2bb885c7bb35a880f17871f2f03736dc21aae34e154a81966836d4e`