pycaption

Closed caption converter

These details have been verified by PyPI

Maintainers

jamesturk jnorton001 pbs pbsi-plops rudemateo

Project description

pycaption is a caption reading/writing module. Use one of the given Readers to read content into an intermediary format known as PCC (PBS Common Captions), and then use one of the Writers to output the PCC into captions of your desired format.

Turn a caption into multiple caption outputs:

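from pycaption import CaptionConverter, SRTReader, SAMIWriter, DFXPWriter
# Assumption: TranscriptWriter is importable from the package root; in some
# releases it may live in a submodule instead.
from pycaption import TranscriptWriter

srt_caps = '''1
00:00:09,209 --> 00:00:12,312
This is an example SRT file,
which, while extremely short,
is still a valid SRT file.
'''

# Read the captions once, then write them out in several formats.
converter = CaptionConverter()
converter.read(srt_caps, SRTReader())
print(converter.write(SAMIWriter()))
print(converter.write(DFXPWriter()))
print(converter.write(TranscriptWriter()))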

Not sure what format the caption is in? Detect it:

from pycaption import SRTReader, DFXPReader, SCCReader, SAMIWriter

caps = '''1
00:00:01,500 --> 00:00:12,345
Small caption'''

# Try each reader in turn and convert with the first one that matches.
if SRTReader().detect(caps):
    print(SAMIWriter().write(SRTReader().read(caps)))
elif DFXPReader().detect(caps):
    print(SAMIWriter().write(DFXPReader().read(caps)))
elif SCCReader().detect(caps):
    print(SAMIWriter().write(SCCReader().read(caps)))
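
The if/elif chain above can also be written as a loop over candidate readers. The sketch below is not part of pycaption: pick_reader is a hypothetical helper, and the two Fake classes are stand-ins that mimic only the detect(content) method shown above, so the block runs without pycaption installed.

```python
def pick_reader(content, readers):
    """Return the first reader whose detect() accepts content, else None."""
    for reader in readers:
        if reader.detect(content):
            return reader
    return None


# Hypothetical stand-ins; real code would pass pycaption reader instances.
class FakeSRTReader:
    def detect(self, content):
        return "-->" in content


class FakeSAMIReader:
    def detect(self, content):
        return "<sami" in content.lower()


caps = "1\n00:00:01,500 --> 00:00:12,345\nSmall caption"
reader = pick_reader(caps, [FakeSAMIReader(), FakeSRTReader()])
print(type(reader).__name__)
```

With pycaption installed, the same helper should work with a list such as [SRTReader(), DFXPReader(), SCCReader()], since it relies only on detect().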

Supported Formats

Read:
- SCC
- SAMI
- SRT
- DFXP

Write:
- DFXP
- SAMI
- SRT
- Transcript

See the examples folder for example captions that currently can be read correctly.

Python Usage

Example: Convert from SAMI to DFXP

from pycaption import SAMIReader, DFXPWriter

sami = '''<SAMI><HEAD><TITLE>NOVA3213</TITLE><STYLE TYPE="text/css">
<!--
P { margin-left:  1pt;
    margin-right: 1pt;
    margin-bottom: 2pt;
    margin-top: 2pt;
    text-align: center;
    font-size: 10pt;
    font-family: Arial;
    font-weight: normal;
    font-style: normal;
    color: #ffffff; }

.ENCC {Name: English; lang: en-US; SAMI_Type: CC;}
.FRCC {Name: French; lang: fr-cc; SAMI_Type: CC;}

--></STYLE></HEAD><BODY>
<SYNC start="9209"><P class="ENCC">
       ( clock ticking )
</P><P class="FRCC">
       FRENCH LINE 1!
</P></SYNC>
<SYNC start="12312"><P class="ENCC">&nbsp;</P></SYNC>
<SYNC start="14848"><P class="ENCC">
              MAN:<br/>
         <span style="text-align:center;font-size:10">When <i>we</i> think</span><br/>
    of E equals m c-squared,
</P><P class="FRCC">
       FRENCH LINE 2?
</P></SYNC>'''

print(DFXPWriter().write(SAMIReader().read(sami)))

Which will output the following:

<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling">
 <head>
  <styling>
   <style id="p" tts:color="#fff" tts:fontfamily="Arial" tts:fontsize="10pt" tts:textAlign="center"/>
  </styling>
 </head>
 <body>
  <div xml:lang="fr-cc">
   <p begin="00:00:09.209" end="00:00:14.848" style="p">
    FRENCH LINE 1!
   </p>
   <p begin="00:00:14.848" end="00:00:18.848" style="p">
    FRENCH LINE 2?
   </p>
  </div>
  <div xml:lang="en-US">
   <p begin="00:00:09.209" end="00:00:12.312" style="p">
    ( clock ticking )
   </p>
   <p begin="00:00:14.848" end="00:00:18.848" style="p">
    MAN:<br/>
    <span tts:fontsize="10" tts:textAlign="center">When</span> <span tts:fontStyle="italic">we</span> think<br/>
    of E equals m c-squared,
   </p>
  </div>
 </body>
</tt>

Extensibility

Different readers and writers are easy to add if you would like to:
- Read/Write a previously unsupported format
- Read/Write a supported format in a different way (more styling?)

Simply follow the format of a current Reader or Writer, and edit to your heart’s desire.
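
As a rough illustration of what a new reader might look like, here is a minimal sketch for a made-up tab-separated format (start_ms, end_ms, text). TSVReader and its format are hypothetical and not part of pycaption; the class only mirrors the detect() and read() methods used in the examples above and returns a dict shaped like the PCC format described below. It assumes PCC timestamps are in microseconds, which the sample values suggest but the text does not state.

```python
class TSVReader:
    """Toy reader for a made-up format: start_ms<TAB>end_ms<TAB>text per line."""

    def detect(self, content):
        # Accept only non-empty content in which every non-blank line parses.
        lines = [ln for ln in content.splitlines() if ln.strip()]
        return bool(lines) and all(self._parse(ln) is not None for ln in lines)

    def read(self, content, lang="en"):
        captions = []
        for line in content.splitlines():
            parsed = self._parse(line)
            if parsed is None:
                continue  # skip blank or malformed lines
            start_ms, end_ms, text = parsed
            nodes = [{"type": "text", "content": text}]
            # Assumption: PCC timestamps are microseconds.
            captions.append([start_ms * 1000, end_ms * 1000, nodes, {}])
        return {"captions": {lang: captions}, "styles": {}}

    @staticmethod
    def _parse(line):
        parts = line.split("\t")
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            return None
        return int(parts[0]), int(parts[1]), parts[2]


content = "9209\t12312\t( clock ticking )\n14848\t18848\tMAN: When we think"
reader = TSVReader()
pcc = reader.read(content) if reader.detect(content) else None
print(pcc["captions"]["en"][0])
```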

PyCaps Format:

The different Readers will return the captions in PBS Common Captions (PCC) format. The Writers will be expecting captions in PCC format as well.

PCC format:

{
    "captions": {
        lang: list of captions
    },
    "styles": {
        style: styling
    }
}

Example PCC, written as a Python dict literal (note True/False rather than JSON true/false):

{
    "captions": {
        "en": [
            [
                9209000,
                12312000,
                [
                    {"type": "text", "content": "Line 1"},
                    {"type": "break"},
                    {"type": "style", "start": True, "content": {"italics": True}},
                    {"type": "text", "content": "Line 2"},
                    {"type": "style", "start": False, "content": {"italics": True}}
                ],
                {
                    "class": "encc",
                    "text-align": "right"
                }
            ],
            [
                14556000,
                18993000,
                [
                    {"type": "text", "content": "Line 3, all by itself"}
                ],
                {
                    "class": "encc",
                    "italics": True
                }
            ]
        ]
    },
    "styles": {
            "encc": {
                "lang": "en-US"
            },
            "p": {
                "color": "#fff",
                "font-size": "10pt",
                "font-family": "Arial",
                "text-align": "center"
            }
    }
}
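
To make the structure concrete, the following sketch walks a PCC dict and prints each caption as plain text. The helper names to_timecode and flatten are hypothetical, not pycaption functions. It assumes each caption is a list of [start, end, nodes, style] and that timestamps are microseconds, both inferred from the example above.

```python
def to_timecode(microseconds):
    """Format a microsecond count as HH:MM:SS.mmm."""
    total_ms = microseconds // 1000
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d.%03d" % (hours, minutes, seconds, ms)


def flatten(nodes):
    """Join text nodes, turn break nodes into newlines, ignore style nodes."""
    parts = []
    for node in nodes:
        if node["type"] == "text":
            parts.append(node["content"])
        elif node["type"] == "break":
            parts.append("\n")
    return "".join(parts)


pcc = {
    "captions": {
        "en": [
            [9209000, 12312000,
             [{"type": "text", "content": "Line 1"},
              {"type": "break"},
              {"type": "style", "start": True, "content": {"italics": True}},
              {"type": "text", "content": "Line 2"},
              {"type": "style", "start": False, "content": {"italics": True}}],
             {"class": "encc", "text-align": "right"}],
        ]
    },
    "styles": {"encc": {"lang": "en-US"}},
}

for start, end, nodes, style in pcc["captions"]["en"]:
    print(to_timecode(start), "-->", to_timecode(end))
    print(flatten(nodes))
```

A real writer would also act on the style nodes and the per-caption style dict; they are skipped here to keep the traversal easy to follow.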

SAMI Reader / Writer :: spec

Microsoft Synchronized Accessible Media Interchange. Supports multiple languages.

Supported Styling:
- text-align
- italics
- font-size
- font-family
- color

If the SAMI file is not valid XML (e.g. it has unclosed tags), the reader will still attempt to read it.

DFXP Reader / Writer :: spec

The W3C standard. Supports multiple languages.

Supported Styling:
- text-align
- italics
- font-size
- font-family
- color

SRT Reader / Writer :: spec

SubRip captions. If given multiple languages to write, the writer outputs all of them, joined together by a ‘MULTI-LANGUAGE SRT’ line.

Supported Styling:
- None

The reader assumes the input language is English. To change it:

pycaps = SRTReader().read(srt_content, lang='fr')

SCC Reader :: spec

Scenarist Closed Caption format. The reader assumes Channel 1 input.

Supported Styling:
- italics

By default, the SCC Reader does not simulate roll-up captions. To enable roll-ups:

pycaps = SCCReader().read(scc_content, simulate_roll_up=True)

The reader also assumes the input language is English. To change it:

pycaps = SCCReader().read(scc_content, lang='fr')

The reader also accepts an offset (measured in seconds) for the timestamps. For example, if the SCC file is 45 seconds ahead of the video:

pycaps = SCCReader().read(scc_content, offset=45)

The SCC Reader handles both dropframe and non-dropframe captions, and will auto-detect which format the captions are in.
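
For readers unfamiliar with the distinction, here is a minimal sketch of one way to tell the two apart and convert a timecode to seconds. It is not pycaption's implementation. It relies on the common SCC convention that a semicolon before the frame field marks drop-frame and a colon marks non-drop-frame, and it assumes 29.97 fps NTSC video. The function names are hypothetical.

```python
import re
from fractions import Fraction

TIMECODE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})([:;])(\d{2})$")
FRAME_DURATION = Fraction(1001, 30000)  # seconds per frame at 29.97 fps


def _match(timecode):
    match = TIMECODE.match(timecode)
    if match is None:
        raise ValueError("not an HH:MM:SS:FF timecode: %r" % timecode)
    return match


def is_drop_frame(timecode):
    """True if the frame separator is ';' (drop-frame), False if ':'."""
    return _match(timecode).group(4) == ";"


def timecode_to_seconds(timecode):
    """Convert an SCC-style timecode to elapsed seconds of video."""
    hours, minutes, seconds, sep, frames = _match(timecode).groups()
    hours, minutes, seconds, frames = (
        int(hours), int(minutes), int(seconds), int(frames))
    frame_number = (hours * 3600 + minutes * 60 + seconds) * 30 + frames
    if sep == ";":
        # Drop-frame skips frame labels 00 and 01 at the start of every
        # minute, except minutes divisible by ten.
        total_minutes = hours * 60 + minutes
        frame_number -= 2 * (total_minutes - total_minutes // 10)
    return float(frame_number * FRAME_DURATION)


print(is_drop_frame("00:01:00;02"), timecode_to_seconds("00:01:00;02"))
print(is_drop_frame("00:01:00:02"), timecode_to_seconds("00:01:00:02"))
```

The two calls show why detection matters: the same digits map to slightly different positions in the video depending on the separator.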

Transcript Writer

Text stripped of styling, arranged in sentences.

Supported Styling:
- None

The transcript writer uses natural sentence boundary detection algorithms to create the transcript.
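
The text does not say which algorithm is used. As a much simpler stand-in, the sketch below joins caption fragments and splits them with a naive regular expression; captions_to_sentences is a hypothetical name, and the rule will mis-split abbreviations such as "Dr." that a proper sentence boundary detector handles.

```python
import re


def captions_to_sentences(caption_texts):
    """Join caption fragments and split them on sentence-ending punctuation."""
    # Collapse line breaks and repeated spaces so fragments read as running text.
    text = " ".join(" ".join(caption_texts).split())
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


fragments = [
    "This is an example SRT file,\nwhich, while extremely short,",
    "is still a valid SRT file. Is it short?",
    "Yes.",
]
for sentence in captions_to_sentences(fragments):
    print(sentence)
```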

License

Release history

2.2.6

Apr 9, 2024

2.2.5 yanked

Feb 20, 2024

Reason this release was yanked:

causes issues with positioning commands

2.2.4

Feb 7, 2024

2.2.3

Feb 2, 2024

2.2.2

Jan 30, 2024

2.2.1

Jan 9, 2024

2.2.0

Nov 1, 2023

2.1.1

Jan 19, 2023

2.1.0

Sep 7, 2022

2.0.9

May 10, 2022

2.0.8

Apr 19, 2022

2.0.7

Apr 4, 2022

2.0.6

Mar 8, 2022

2.0.5

Feb 18, 2022

2.0.4

Jan 12, 2022

2.0.3

Dec 7, 2021

2.0.2

Oct 26, 2021

2.0.1

Sep 14, 2021

2.0.0

Aug 17, 2021

2.0.dev3 pre-release

Nov 16, 2021

2.0.dev2 pre-release

May 9, 2022

2.0.dev0 pre-release

Dec 14, 2021

1.0.7

Jul 23, 2021

1.0.6

Jun 30, 2021

1.0.5

Jun 3, 2021

1.0.4

May 19, 2021

1.0.3

May 12, 2021

1.0.2

Mar 8, 2021

1.0.1

Oct 9, 2017

1.0.0

Jul 27, 2016

0.7.7

May 29, 2019

0.7.6

Apr 5, 2019

0.7.5

Mar 25, 2019

0.7.4

Nov 6, 2018

0.7.3

Nov 2, 2017

0.7.2

Mar 14, 2017

0.7.1

Feb 9, 2017

0.7.0

Sep 22, 2016

0.6.1

Sep 15, 2016

0.5.6

Jun 23, 2016

0.5.5

Feb 5, 2016

0.5.4

Nov 9, 2015

0.5.4c2 pre-release

Jul 3, 2015

0.5.4c1 pre-release

Jun 10, 2015

0.5.4b pre-release

Jun 8, 2015

0.5.3

Jun 3, 2015

0.5.2

Jun 2, 2015

0.5.2c4 pre-release

May 26, 2015

0.5.1

May 14, 2015

0.5.1c3 pre-release

May 22, 2015

0.5.1c1 pre-release

May 14, 2015

0.5.1b3 pre-release

May 19, 2015

0.5.1b2 pre-release

May 19, 2015

0.5.1b1 pre-release

May 12, 2015

0.5.0

Apr 9, 2015

0.4.6

Jul 3, 2015

0.4.5

Nov 12, 2014

0.4.4

Nov 11, 2014

0.4.3

Nov 4, 2014

0.4.2

Nov 4, 2014

0.4.0

Oct 7, 2014

0.3.6

May 20, 2014

0.3.5

May 19, 2014

0.3.4

Mar 21, 2014

0.3.3

Mar 11, 2014

0.3.2

Mar 10, 2014

0.3.1

Jan 30, 2014

0.3

Jan 29, 2014

0.2.14

Jan 3, 2014

0.2.13

Oct 18, 2013

0.2.11

Aug 22, 2013

0.2.10

Jun 11, 2013

This version

0.2.9

Jun 10, 2013

0.2.8

Apr 23, 2013

0.2.7

Feb 21, 2013

0.2.6

Dec 11, 2012

0.2.5

Sep 10, 2012

0.2.4

Sep 6, 2012

0.2.3

Aug 30, 2012

0.2.2

Aug 22, 2012

0.2.1

Aug 9, 2012

0.2

Aug 9, 2012

0.1

Aug 8, 2012

Download files

Source Distribution

pycaption-0.2.9.tar.gz (181.9 kB)

Uploaded Jun 10, 2013 Source

Hashes for pycaption-0.2.9.tar.gz
Algorithm	Hash digest
SHA256	`b171e2e0626ae7f1598cc7d4574e24d8a9fc0c8152e7b842dff510954b58b71c`
MD5	`da42813c889508eba812f8fbe2e332d7`
BLAKE2b-256	`24df59a14dfc0aec00e08667bed4a4ee00dbf92ec142e016ed943a9fcc74e3e0`