ftw.tika

Apache Tika integration for Plone using portal transforms.

These details have been verified by PyPI

Maintainers

4teamwork bierik buchi deif elio.schmutz jone lukasg phgross

Project description

This product integrates Apache Tika for full text indexing with Plone by providing portal transforms to text/plain for the various document formats supported by Tika.

Supported Formats

Input Formats

- Microsoft Office formats (Office Open XML)
  - *.docx Word Documents
  - *.dotx Word Templates
  - *.xlsx Excel Sheets
  - *.xltx Excel Templates
  - *.pptx PowerPoint Presentations
  - *.potx PowerPoint Templates
  - *.ppsx PowerPoint Slideshows
- Legacy Microsoft Office (97) formats
- Rich Text Format
- OpenOffice ODF formats
- OpenOffice 1.x formats
- Common Adobe formats (InDesign, Illustrator, Photoshop)
- PDF documents
- WordPerfect documents
- E-Mail messages

See the mimetypes module for details on the MIME types corresponding to these formats.
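
As a rough illustration of how the Office Open XML extensions listed above correspond to MIME types, here is a minimal Python sketch. The names `OOXML_MIME_TYPES` and `mime_type_for` are made up for this example and are not part of ftw.tika; the package's own mimetypes module remains the authoritative list and may cover more or different types.

```python
import os.path

# Common prefix of the registered Office Open XML MIME types.
_OOXML = 'application/vnd.openxmlformats-officedocument.'

# Hypothetical lookup table, for illustration only (not taken from ftw.tika).
OOXML_MIME_TYPES = {
    '.docx': _OOXML + 'wordprocessingml.document',
    '.dotx': _OOXML + 'wordprocessingml.template',
    '.xlsx': _OOXML + 'spreadsheetml.sheet',
    '.xltx': _OOXML + 'spreadsheetml.template',
    '.pptx': _OOXML + 'presentationml.presentation',
    '.potx': _OOXML + 'presentationml.template',
    '.ppsx': _OOXML + 'presentationml.slideshow',
}


def mime_type_for(filename):
    """Return the MIME type for a known OOXML extension, else None."""
    extension = os.path.splitext(filename)[1].lower()
    return OOXML_MIME_TYPES.get(extension)
```

A lookup like this is only useful for guessing the type from a file name; in Plone the content type stored on the file field is normally what gets passed to the transform.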

Formats supported by Tika, but not wired up (yet)

- Electronic Publication Format
- Compression and packaging formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format

See the Supported Document Formats page on the Apache Tika Wiki for details.

Output Formats

text/plain

Installation

The preferred way to run Tika is as a daemon. Tika can also run without a daemon, in which case it is started anew each time a file is converted, but the daemon is much faster.

Both methods require downloading the Tika JAR and adding a ZCML configuration for ftw.tika.

Below are some configuration examples.

Daemon buildout example

[buildout]
parts +=
    tika-download
    tika-server


[instance0]
zcml-additional += ${tika:zcml}
eggs += ftw.tika


[tika]
server-port = 8077
zcml =
    <configure xmlns:tika="http://namespaces.plone.org/tika">
        <tika:config path="${tika-download:destination}/${tika-download:filename}"
                     port="${tika:server-port}" />
    </configure>


[tika-download]
recipe = hexagonit.recipe.download
url = http://mirror.switch.ch/mirror/apache/dist/tika/tika-app-1.4.jar
download-only = true
filename = tika.jar


[tika-server]
recipe = collective.recipe.scriptgen
cmd = java
arguments = -jar ${tika-download:destination}/${tika-download:filename} --server --port ${tika:server-port} --text


[supervisor]
programs +=
    20 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true zope

How it works:

- The tika-download part downloads the Tika JAR and places it at ./parts/tika-download/tika.jar.
- The tika-server part creates a bin/tika-server script that already includes the port configured in the tika part.
- The instance0 part is expected to be the Plone instance part; it is extended with the ZCML configuration for ftw.tika.
- The supervisor part shows an example supervisor configuration.
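
Once the daemon has been started, a quick way to see whether anything is listening on the configured port is a plain TCP connection attempt. The following is a minimal sketch, not part of ftw.tika: the function name `tika_daemon_is_listening` is hypothetical, the host `localhost` is an assumption, and the default port 8077 is the server-port value from the tika part above. It only checks that the port accepts connections, not that the process behind it is actually Tika.

```python
import socket


def tika_daemon_is_listening(host='localhost', port=8077, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except socket.error:
        # Covers refused connections and timeouts alike.
        return False
    connection.close()
    return True
```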

If your deployment buildout is based on the deployment buildouts in the ftw-buildouts repository on GitHub, you can simply extend tika-server.cfg and everything is configured for you:

[buildout]
extends =
    https://raw.github.com/4teamwork/ftw-buildouts/master/production.cfg
    https://raw.github.com/4teamwork/ftw-buildouts/master/zeoclients/4.cfg
    https://raw.github.com/4teamwork/ftw-buildouts/master/tika-server.cfg

deployment-number = 05

filestorage-parts =
    www.mywebsite.com

instance-eggs =
    mywebsite

Non-daemon buildout example

Note that running tika in non-daemon mode is very, very slow!

If you don't want to run Tika as a daemon, configure only the path to tika.jar in the ftw.tika ZCML configuration. ftw.tika will then start tika.jar in a new JVM every time something needs to be converted.

Here is a short example of how to download tika.jar and configure ftw.tika with buildout:

[buildout]
parts +=
    tika

[tika]
recipe = hexagonit.recipe.download
url = http://mirror.switch.ch/mirror/apache/dist/tika/tika-app-1.4.jar
download-only = true
filename = tika.jar

[instance]
eggs += ftw.tika
zcml-additional =
    <configure xmlns:tika="http://namespaces.plone.org/tika">
        <tika:config path="${tika:destination}/${tika:filename}" />
    </configure>
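
To make the cost of the non-daemon mode concrete, the sketch below shows what starting a new JVM per conversion could look like. It is an illustration under stated assumptions, not the code ftw.tika uses: the function names are hypothetical, the `--text` flag is the one from the daemon example's arguments, and the sketch assumes that tika-app reads the document from standard input when no file argument is given.

```python
import subprocess


def build_tika_command(jar_path, java='java'):
    """Build the argument list for a one-shot plain text extraction."""
    return [java, '-jar', jar_path, '--text']


def convert_once(jar_path, data):
    """Start a new JVM, feed `data` on stdin and return Tika's stdout.

    Assumes tika-app reads from stdin when no file is passed.
    """
    process = subprocess.Popen(
        build_tika_command(jar_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    stdout, stderr = process.communicate(data)
    if process.returncode != 0:
        raise RuntimeError('tika failed: %r' % stderr)
    return stdout
```

Every call to `convert_once` pays the full JVM startup time, which is why the daemon setup is preferred.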

Installing ftw.tika in Plone

Install ftw.tika by adding it to the list of eggs in your buildout (the buildout examples above already do this):

[instance]
eggs +=
    ftw.tika

Run buildout and start your instance.
Go to Site Setup of your Plone site and activate the ftw.tika add-on, or depend on the ftw.tika:default profile from your package’s metadata.xml.

Uninstalling ftw.tika

ftw.tika has an uninstall profile. To uninstall ftw.tika, import the ftw.tika:uninstall profile using the portal_setup tool.
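
The import can also be triggered from code. The sketch below assumes that `portal_setup` is the GenericSetup tool of the site (for example obtained with getToolByName) and uses its `runAllImportStepsFromProfile` method, which expects the profile id with a `profile-` prefix. The function name `uninstall_ftw_tika` is hypothetical.

```python
def uninstall_ftw_tika(portal_setup):
    """Import the ftw.tika uninstall profile with the given setup tool."""
    # GenericSetup profile ids are prefixed with "profile-".
    portal_setup.runAllImportStepsFromProfile('profile-ftw.tika:uninstall')
```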

Compatibility

Plone 4.1

Plone 4.2

Plone 4.3

Configuration

ftw.tika expects to be provided with a path to an installed tika-app.jar. This can be done through ZCML, and therefore also through buildout.

Configuration in ZCML

The path to the Tika JAR file must be configured in ZCML.

If you download Tika with a tika buildout part as in the non-daemon example above, you can reference the download location directly from buildout by using ${tika:destination}/${tika:filename}:

[instance]
zcml-additional =
    <configure xmlns:tika="http://namespaces.plone.org/tika">
        <tika:config path="${tika:destination}/${tika:filename}" />
    </configure>

If you installed Tika yourself, set the path attribute to the location of your Tika JAR file instead.

Usage

To use ftw.tika, simply ask the portal_transforms tool for a transformation to text/plain from one of the input formats supported by ftw.tika:

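from Products.CMFCore.utils import getToolByName

# Inside a view or adapter: the file field of the current context
namedfile = self.context.file
transform_tool = getToolByName(self.context, 'portal_transforms')

# Ask portal_transforms for a text/plain rendition of the file data
stream = transform_tool.convertTo(
    'text/plain',
    namedfile.data,
    mimetype=namedfile.contentType)
# Fall back to an empty string if no result was returned
plain_text = stream and stream.getData() or ''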

Caching

If you want the result of the transform to be cached, you’ll need to pass a persistent ZODB object to transform_tool.convertTo() to store the cached result on.

For example, for a NamedBlobFile versioned with CMFEditions you’d use namedfile.data to access the data of the current working copy, and pass namedfile._blob as the object for the cache to be stored on (the namedfile is always the same instance for any version, only the _blob changes):

stream = transform_tool.convertTo(
    'text/plain',
    namedfile.data,
    mimetype=namedfile.contentType,
    object=namedfile._blob)
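
The cached and uncached calls can be wrapped in one small helper. This is a sketch under the assumptions of the examples above: `namedfile` exposes data, contentType and _blob, and `transform_tool` is the portal_transforms tool. The helper name `extract_text` and its `cache` flag are hypothetical.

```python
def extract_text(transform_tool, namedfile, cache=True):
    """Return the text/plain rendition of `namedfile`, or '' if none."""
    kwargs = {'mimetype': namedfile.contentType}
    if cache:
        # Store the cached result on the blob, which changes per version.
        kwargs['object'] = namedfile._blob
    stream = transform_tool.convertTo('text/plain', namedfile.data, **kwargs)
    return stream.getData() if stream else ''
```

Passing cache=False gives the same call as in the Usage section, without an object to store the result on.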

Stand-alone converter

The code calling Tika is encapsulated in its own class, so if for some reason you don’t want to use the portal_transforms tool, you can also use the converter directly by instantiating it:

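from StringIO import StringIO  # Python 2, as used by Plone 4

from ftw.tika.converter import TikaConverter

# Any file-like object works; a plain data string is accepted as well
data = StringIO('foo')
converter = TikaConverter(path="/path/to/tika.jar")
plain_text = converter.convert(data)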

The convert() method accepts either a data string or a file-like stream object. If no path keyword argument is supplied, the converter tries to get the path to the tika-app.jar from the ZCML configuration.
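
Accepting either a string or a stream is commonly done by checking for a read() method. The sketch below shows that general pattern only; it is not taken from the ftw.tika source, and the helper name `read_input` is hypothetical.

```python
from io import BytesIO


def read_input(data):
    """Return the raw content of `data`, a byte string or file-like object."""
    if hasattr(data, 'read'):
        # File-like object: consume it from its current position.
        return data.read()
    return data


# Both call styles yield the same bytes.
from_string = read_input(b'foo')
from_stream = read_input(BytesIO(b'foo'))
```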

Copyright

This package is copyright by 4teamwork.

ftw.tika is licensed under GNU General Public License, version 2.

Changelog

1.1.2 (2014-09-01)

Changed tika source to archive.apache.org. [lknoepfel]
Extend integration tests to test conversion of all common formats we claim to support. [lgraf]
Updated tika to version 1.5. Updated detection of protected office files. [lknoepfel]

1.1.1 (2014-04-01)

Only log a warning on protected PDFs / MS Office documents. [jone]

1.1.0 (2014-03-14)

Add support for running tika as a daemon. The daemon speeds up the conversion from approximately 1.1 seconds per document to 0.06 seconds. [jone]

1.0 (2013-11-29)

First implementation. [lgraf]

Release history

2.10.0 (Sep 4, 2019)
2.9.0 (Dec 19, 2017)
2.8.0 (Dec 18, 2017)
2.7.0 (Mar 15, 2016)
2.6.0 (Oct 27, 2015)
2.5.0 (Oct 27, 2015)
2.4.0 (Oct 27, 2015)
2.3.0 (Oct 27, 2015)
2.2.0 (Oct 25, 2015)
2.1.0 (Oct 25, 2015)
2.0.1 (Dec 8, 2014)
2.0 (Nov 24, 2014)
1.1.2 (Sep 1, 2014), this version
1.1.1 (Apr 1, 2014)
1.1.0 (Mar 14, 2014)
1.0 (Nov 29, 2013)

Download files

Source Distribution

ftw.tika-1.1.2.zip (188.4 kB), uploaded Sep 1, 2014

Hashes for ftw.tika-1.1.2.zip

Algorithm	Hash digest
SHA256	`8050cb904e436341d598fb5798e28fe0496a1250131ada248b3b10e94a332551`
MD5	`18ecf328ee6d9dab493466ade9d3c361`
BLAKE2b-256	`1e55ab2ed5599305acd3019fb36fe440e28f2228edb66f836bf94b552ddc330e`
