PyDocX

docx (OOXML) to html converter

======
pydocx
======
.. image:: https://travis-ci.org/OpenScienceFramework/pydocx.png?branch=master
   :align: left
   :target: https://travis-ci.org/OpenScienceFramework/pydocx

pydocx is a parser that breaks down the elements of a docx file and converts them
into different markup languages. Right now, HTML is supported. Markdown and LaTeX
will be available soon. You can extend any of the available parsers to customize it
to your needs. You can also create your own class that inherits from DocxParser
and implements its methods for a markup language that is not yet supported.

Currently Supported
###################

* tables
* nested tables
* rowspans
* colspans
* lists in tables
* lists
* list styles
* nested lists
* list of tables
* list of paragraphs
* justification
* images
* styles
* bold
* italics
* underline
* hyperlinks
* headings

Usage
#####

DocxParser includes abstract methods that each parser overrides to satisfy its own needs. The abstract methods are as follows:

::

    from abc import abstractmethod


    class DocxParser:

        @property
        def parsed(self):
            return self._parsed

        # escape takes an argument, so it is a plain method, not a property
        def escape(self, text):
            return text

        @abstractmethod
        def linebreak(self):
            return ''

        @abstractmethod
        def paragraph(self, text):
            return text

        @abstractmethod
        def heading(self, text, heading_level):
            return text

        @abstractmethod
        def insertion(self, text, author, date):
            return text

        @abstractmethod
        def hyperlink(self, text, href):
            return text

        @abstractmethod
        def image_handler(self, path):
            return path

        @abstractmethod
        def image(self, path, x, y):
            return self.image_handler(path)

        @abstractmethod
        def deletion(self, text, author, date):
            return text

        @abstractmethod
        def bold(self, text):
            return text

        @abstractmethod
        def italics(self, text):
            return text

        @abstractmethod
        def underline(self, text):
            return text

        @abstractmethod
        def superscript(self, text):
            return text

        @abstractmethod
        def subscript(self, text):
            return text

        @abstractmethod
        def tab(self):
            return True

        @abstractmethod
        def ordered_list(self, text):
            return text

        @abstractmethod
        def unordered_list(self, text):
            return text

        @abstractmethod
        def list_element(self, text):
            return text

        @abstractmethod
        def table(self, text):
            return text

        @abstractmethod
        def table_row(self, text):
            return text

        @abstractmethod
        def table_cell(self, text):
            return text

        @abstractmethod
        def page_break(self):
            return True

        @abstractmethod
        def indent(self, text, left='', right='', firstLine=''):
            return text

Docx2Html inherits from DocxParser and implements basic HTML handling. For example:

::

    import xml.sax.saxutils


    class Docx2Html(DocxParser):

        # Escape '&', '<', and '>' so we render the HTML correctly
        def escape(self, text):
            return xml.sax.saxutils.quoteattr(text)[1:-1]

        # return a line break
        def linebreak(self, pre=None):
            return '<br />'

        # add paragraph tags
        def paragraph(self, text, pre=None):
            return '<p>' + text + '</p>'
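The `escape` method above relies on the standard-library `quoteattr` function. As a small standalone sketch (the free function `escape` and the sample string are illustrative only, not part of pydocx), this shows what the `[1:-1]` slice is doing:

```python
from xml.sax.saxutils import quoteattr


def escape(text):
    # quoteattr() escapes '&', '<' and '>' and wraps the result in a pair
    # of quote characters; the [1:-1] slice drops that surrounding pair.
    return quoteattr(text)[1:-1]


escaped = escape('a < b & c')
print(escaped)
```

Note that `quoteattr` also rewrites newlines, carriage returns, and tabs as character references, so this is slightly more aggressive than `xml.sax.saxutils.escape`.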

However, let's say you want to add a specific style to your HTML document. To do this, you want to give each paragraph the CSS class `my_implementation`. Simply extend Docx2Html and add what you need.

::

    class My_Implementation_of_Docx2Html(Docx2Html):

        def paragraph(self, text, pre=None):
            return '<p class="my_implementation">' + text + '</p>'

Or, let's say FOO is your new favorite markup language. Simply write your own parser, overriding the abstract methods of DocxParser:

::

    class Docx2Foo(DocxParser):

        # because line breaks are denoted by '!!!!!!!!!!!!' in the FOO markup language :)
        def linebreak(self):
            return '!!!!!!!!!!!!'
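To see the override pattern end to end without a docx file, here is a toy sketch. `MiniParser`, `MiniHtml`, and `MiniFoo` are hypothetical stand-ins invented for this illustration; they are not pydocx classes, and the real `DocxParser` does far more work before it calls these hooks.

```python
from abc import ABCMeta, abstractmethod


class MiniParser(metaclass=ABCMeta):
    """Hypothetical stand-in for DocxParser; not part of pydocx."""

    def __init__(self, paragraphs):
        self._paragraphs = paragraphs

    @property
    def parsed(self):
        # Render each paragraph with the subclass hook, then join the
        # results with whatever line break the subclass defines.
        rendered = [self.paragraph(text) for text in self._paragraphs]
        return self.linebreak().join(rendered)

    @abstractmethod
    def linebreak(self):
        return ''

    @abstractmethod
    def paragraph(self, text):
        return text


class MiniHtml(MiniParser):
    def linebreak(self):
        return '<br />'

    def paragraph(self, text):
        return '<p>' + text + '</p>'


class MiniFoo(MiniParser):
    def linebreak(self):
        return '!!!!!!!!!!!!'

    def paragraph(self, text):
        return text


html_out = MiniHtml(['one', 'two']).parsed
foo_out = MiniFoo(['one', 'two']).parsed
print(html_out)
print(foo_out)
```

The base class owns the traversal and each subclass only decides how a given element is rendered, which is why supporting a new markup language comes down to overriding the hooks.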

Custom Pre-Processor
####################

When creating your own parser (as described above), you can also supply your own custom pre-processor. To do so, set the `pre_processor_class` attribute on the custom parser, like so:

::

    class Docx2Foo(DocxParser):
        pre_processor_class = FooPrePorcessor

The `FooPrePorcessor` will need a few things to get you going:

::

    class FooPrePorcessor(PydocxPrePorcessor):
        def perform_pre_processing(self, root, *args, **kwargs):
            super(FooPrePorcessor, self).perform_pre_processing(root, *args, **kwargs)
            self._set_foo(root)

        def _set_foo(self, root):
            pass

If you want `_set_foo` to be called, you must call it from `perform_pre_processing`, which the base pydocx parser invokes.

Everything done during pre-processing is executed prior to `parse` being called for the first time.
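The ordering described here can be illustrated with a toy model. Every class below is a hypothetical stand-in written for this sketch (a plain dict plays the role of the XML root); none of it is pydocx code, and the real parser's constructor may be organised differently.

```python
class BasePreProcessor:
    """Hypothetical stand-in for the pydocx base pre-processor."""

    def perform_pre_processing(self, root, *args, **kwargs):
        root.setdefault('steps', []).append('base')


class FooPreProcessor(BasePreProcessor):
    def perform_pre_processing(self, root, *args, **kwargs):
        # Run the base steps first, then the custom one.
        super().perform_pre_processing(root, *args, **kwargs)
        self._set_foo(root)

    def _set_foo(self, root):
        root['steps'].append('foo')


class MiniParser:
    """Hypothetical stand-in for the base parser."""

    pre_processor_class = BasePreProcessor

    def __init__(self, root):
        self.root = root
        # Pre-processing finishes before parse() is called for the first time.
        self.pre_processor_class().perform_pre_processing(self.root)
        self.parsed = self.parse()

    def parse(self):
        self.root['steps'].append('parse')
        return ' -> '.join(self.root['steps'])


class FooParser(MiniParser):
    pre_processor_class = FooPreProcessor


result = FooParser({}).parsed
print(result)
```

Because the parser looks up the pre-processor through a class attribute, swapping in a custom one needs no change to the parser's own methods.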

Styles
######

The base parser `Docx2Html` emits certain CSS classes and relies on your stylesheet to define them so that the intended behaviour occurs. Currently these include:

* class `pydocx-insert` -> Turns the text green.
* class `pydocx-delete` -> Turns the text red and draws a line through the text.
* class `pydocx-center` -> Aligns the text to the center.
* class `pydocx-right` -> Aligns the text to the right.
* class `pydocx-left` -> Aligns the text to the left.
* class `pydocx-comment` -> Turns the text blue.
* class `pydocx-underline` -> Underlines the text.
* class `pydocx-caps` -> Makes all text uppercase.
* class `pydocx-small-caps` -> Makes all text uppercase, but letters that were lowercase are rendered smaller than true uppercase letters.
* class `pydocx-strike` -> Strikes a line through the text.
* class `pydocx-hidden` -> Hides the text.
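As a rough sketch of a stylesheet covering the classes listed above, the snippet below builds a `<style>` block from a dict. The CSS declarations are assumptions chosen to match the described behaviour, and `PYDOCX_STYLES` and `build_stylesheet` are hypothetical names; pydocx itself may ship different defaults.

```python
# Assumed CSS declarations, one per class named in the list above.
PYDOCX_STYLES = {
    'pydocx-insert': 'color: green;',
    'pydocx-delete': 'color: red; text-decoration: line-through;',
    'pydocx-center': 'text-align: center;',
    'pydocx-right': 'text-align: right;',
    'pydocx-left': 'text-align: left;',
    'pydocx-comment': 'color: blue;',
    'pydocx-underline': 'text-decoration: underline;',
    'pydocx-caps': 'text-transform: uppercase;',
    'pydocx-small-caps': 'font-variant: small-caps;',
    'pydocx-strike': 'text-decoration: line-through;',
    'pydocx-hidden': 'display: none;',
}


def build_stylesheet(styles):
    # Emit one '.class {declarations}' rule per entry, wrapped in <style>.
    rules = ['.%s {%s}' % (name, body) for name, body in styles.items()]
    return '<style>\n' + '\n'.join(rules) + '\n</style>'


stylesheet = build_stylesheet(PYDOCX_STYLES)
print(stylesheet)
```

The resulting string can be placed in the `<head>` of the page that hosts the converted HTML.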

Exceptions
##########

Right now there is only one custom exception (`MalformedDocxException`). It is raised if either the `xml` or `zipfile` libraries raise an exception.
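The wrapping behaviour can be sketched with the standard library alone. The exception class below is a local stand-in that merely shares the pydocx name, and `open_docx` is a hypothetical helper; this shows the general pattern, not the library's actual implementation.

```python
import io
import zipfile


class MalformedDocxException(Exception):
    """Local stand-in sharing the name of the pydocx exception."""


def open_docx(file_obj):
    # Translate the zipfile-level failure into one domain-specific error,
    # so callers only need to catch a single exception type.
    try:
        return zipfile.ZipFile(file_obj)
    except zipfile.BadZipFile as exc:
        raise MalformedDocxException(str(exc))


try:
    open_docx(io.BytesIO(b'this is not a zip archive'))
except MalformedDocxException:
    outcome = 'malformed'
else:
    outcome = 'ok'
print(outcome)
```

Callers converting untrusted uploads can then handle one exception type instead of tracking which underlying library failed.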

Optional Arguments
##################

You can pass in `convert_root_level_upper_roman=True` to the parser and it will convert all root level upper roman lists to headings instead.

Command Line Execution
######################

First, install pydocx by running `pip install pydocx`. You can then convert a file with `pydocx --html path/to/file.docx path/to/output.html`. To convert to Markdown instead, replace `--html` with `--markdown`.

Changelog
=========
* 0.3.12
    * Added command line support to convert from docx to either html or
      markdown.
* 0.3.11
    * The non-breaking hyphen tag was not correctly being imported. This issue
      has been fixed.
* 0.3.10
    * Found and optimized a fairly large performance issue with tables that had
      large amounts of content within a single cell, which includes nested
      tables.
* 0.3.9
    * We are now respecting the `<w:tab/>` element. We are putting a space in
      everywhere they happen.
    * Each styling can have a default defined based on values in `styles.xml`.
      These default styles can be overwritten using the `rPr` on the actual `r`
      tag. These default styles defined in `styles.xml` are actually being
      respected now.
* 0.3.8
    * If zipfile fails to open the passed in file, we are now raising a
      `MalformedDocxException` instead of a `BadZipfile`.
* 0.3.7
    * Some inline tags (most notably the underline tag) could have a `val` of
      `none` and that would signify that the style is disabled. A `val` of
      `none` is now correctly handled.
* 0.3.6
    * It is possible for a docx file to not contain a `numbering.xml` file but
      still try to use lists. Now if this happens all lists get converted to
      paragraphs.
* 0.3.5
    * Not all docx files contain a `styles.xml` file. We are no longer assuming
      they do.
* 0.3.4
    * It is possible for `w:t` tags to have `text` set to `None`. This no
      longer causes an error when escaping that text.
* 0.3.3
    * In the event that `cElementTree` has a problem parsing the document, a
      `MalformedDocxException` is raised instead of a `SyntaxError`.
* 0.3.2
    * We were not taking into account that vertical merges should have a
      continue attribute, but sometimes they do not, and in those cases Word
      assumes the continue attribute. We updated the parser to handle the
      cases in which the continue attribute is not there.
    * We now correctly handle documents with Unicode characters in the
      namespace.
    * In rare cases, some text would be output with a style when it should not
      have been. This issue has been fixed.
* 0.3.1
    * Added support for several more OOXML tags including:
        * caps
        * smallCaps
        * strike
        * dstrike
        * vanish
        * webHidden

      More details in the README.
* 0.3.0
    * We switched from using stock *xml.etree.ElementTree* to using
      *xml.etree.cElementTree*. This has resulted in a fairly significant speed
      increase for Python 2.6.
    * It is now possible to create your own pre-processor to do additional
      pre-processing.
    * Superscripts and subscripts are now extracted correctly.
* 0.2.1
    * Added a changelog
    * Added the version in pydocx.__init__
    * Fixed an issue with duplicating content if there was indentation or
      justification on a p element that had multiple t tags.

Release history

0.9.10

Jul 29, 2016

0.9.9

May 23, 2016

0.9.8

May 20, 2016

0.9.7

May 19, 2016

0.9.6

May 10, 2016

0.9.5

Sep 28, 2015

0.9.4

Sep 24, 2015

0.9.3

Sep 23, 2015

0.9.2

Sep 23, 2015

0.9.1

Sep 23, 2015

0.9.0

Sep 22, 2015

0.8.5

Sep 3, 2015

0.8.4

Sep 2, 2015

0.8.3

Aug 27, 2015

0.8.2

Aug 26, 2015

0.8.1

Aug 20, 2015

0.8.0

Aug 4, 2015

0.7.0

May 17, 2015

0.6.0

Apr 7, 2015

0.5.1

Mar 30, 2015

0.5.0

Mar 23, 2015

0.4.4

Mar 18, 2015

0.4.3

Jan 5, 2015

0.4.2

Oct 14, 2014

0.4.01

Sep 5, 2014

0.4.00

Sep 3, 2014

0.3.23

Jun 6, 2014

0.3.22

Jun 4, 2014

0.3.21

May 22, 2014

0.3.20

May 22, 2014

0.3.19

May 21, 2014

0.3.18

May 21, 2014

0.3.17

May 21, 2014

0.3.16

May 21, 2014

0.3.15

May 21, 2014

0.3.14

May 21, 2014

0.3.13

May 21, 2014

This version

0.3.12

May 21, 2014

0.3.11

May 21, 2014

0.3.10

May 21, 2014

0.3.9

May 21, 2014

0.3.8

May 21, 2014

0.3.7

May 21, 2014

0.3.6

May 21, 2014

0.3.5

May 21, 2014

0.3.4

May 21, 2014

0.3.3

May 21, 2014

0.3.2

May 21, 2014

0.3.1

May 21, 2014

0.3.0

May 21, 2014

0.2.1

May 21, 2014

0.2.0

May 21, 2014

0.1.8

May 21, 2014

0.1.7

May 21, 2014

0.1.6

May 21, 2014

0.1.4

May 21, 2014

0.1.3

May 21, 2014

0.1.2

May 21, 2014

0.1.1

May 21, 2014

Download files

Source Distribution

PyDocX-0.3.12.tar.gz (371.4 kB)

Uploaded May 21, 2014 Source

Hashes for PyDocX-0.3.12.tar.gz

Algorithm	Hash digest
SHA256	`15c9238f9388e65408a3ae78813f038fb76edb5ad248a3730fb0c96f5fef2573`
MD5	`43d49be093b663a644f6d48ad3f56fc8`
BLAKE2b-256	`5c642dd72d3cecdbf8005e34dc0de6ba9b10b9430b55173be61b2f05a888d945`