Easily import static HTML websites into Plone.

Introduction
============

``Parse2Plone`` is an HTML parser (in the form of a Buildout recipe that
creates a script for you) to easily get content from static HTML websites
into Plone.

Warning
-------

This is a **Buildout recipe** for use with Plone; by itself it does nothing. If you
do not know what Plone is, please see: http://plone.org. If you do not know
what Buildout is, please see: http://www.buildout.org/.

Getting started
---------------

Because it always drives me nuts when you have to dig for a recipe's options,
here they are::

    [import]
    recipe = parse2plone
    path = /Plone
    illegal_chars = _ .
    html_extensions = html
    image_extensions = gif jpg jpeg png
    file_extensions = mp3
    target_tags = a div h1 h2 p
    force = false
    publish = false
    slugify = false
    rename = false

The parameters listed above are configured with their default values. Edit these
values if you would like to change the default behavior; they are (mostly)
self-explanatory. Now you can just cut and paste to get started or keep reading if
you would like to know more.

Explanation
-----------

Why did you create ``Parse2Plone`` when the following packages already exist:

* http://pypi.python.org/pypi/collective.transmogrifier

* http://pypi.python.org/pypi/transmogrify.filesystem
* http://pypi.python.org/pypi/transmogrify.htmlcontentextractor
* http://pypi.python.org/pypi/transmogrify.webcrawler

Here are some reasons:

* Because ``Parse2Plone`` is aimed at lowering the bar for folks who don't already
  know (or want to know) what a "transmogrifier blueprint" is but are able to update
  their ``buildout.cfg`` file, run ``Buildout``, and then run a single command, all
  without having to think too much.

* ``collective.transmogrifier`` provides a framework for creating reusable pipes
  (whose definitions are called blueprints). ``Parse2Plone`` provides
  a single, non-reusable "pipe/blueprint".

* The author had an itch to scratch; it will be nice for him to be able to say
  "just go write a script" and then point to an example.

* Transmogrifier and friends appear to be "developer's tools", while the author wants
  ``Parse2Plone`` to be an "end user's tool".

If you are a developer looking to create repeatable migrations, you probably want to be
using ``collective.transmogrifier``. If you are an end user that just wants to see your
static website in Plone, then you might want to give ``Parse2Plone`` a try.

There is also this user comment, which captures the author's sentiment::

    Parse2Plone's release was very timely as I need either this or something very
    similar - and while I've no doubt I could make transmogrify do the job, it's a
    lot of work for a one-shot loading of legacy pages.

    -Derek Broughton

Installation
------------

You can install ``Parse2Plone`` by editing your ``buildout.cfg`` file like
so. First add an ``import`` section::

    [import]
    recipe = parse2plone

Then add the ``import`` section to the list of parts::

    [buildout]
    ...
    parts =
        ...
        import

Now run ``bin/buildout`` as usual.

Execution
---------

You can run ``Parse2Plone`` like this::

    $ bin/plone run bin/import /path/to/files

.. Note::
    In the example above and examples below, ``bin/plone`` refers to a *Zope 2
    instance* script created by `plone.recipe.zope2instance`_.

    Your ``bin/plone`` script may be called ``bin/instance`` or
    ``bin/client``, etc. instead.

.. _`plone.recipe.zope2instance`: http://pypi.python.org/pypi/plone.recipe.zope2instance

Demonstration
-------------

If you have a site in /var/www/html that contains the following::

    /var/www/html/index.html
    /var/www/html/about/index.html

You should run::

    $ bin/plone run bin/import /var/www/html

And the following will be created:

* http://localhost:8080/Plone/index.html
* http://localhost:8080/Plone/about/index.html
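
The mapping in this demonstration can be sketched in a few lines of Python. This is only an illustration of the idea, not ``parse2plone``'s own code: the function name ``plone_url`` and its ``base`` parameter are hypothetical, and the sketch assumes that each file keeps its path relative to the import directory.

```python
import posixpath


def plone_url(root, file_path, site_path="/Plone", base="http://localhost:8080"):
    """Map a file under ``root`` to the URL where its content would appear.

    Hypothetical helper: assumes the relative path is preserved unchanged.
    """
    # Path of the file relative to the directory given on the command line.
    relative = posixpath.relpath(file_path, root)
    # Re-root that relative path under the target location in Plone.
    return base + posixpath.join(site_path, relative)


if __name__ == "__main__":
    for path in ("/var/www/html/index.html", "/var/www/html/about/index.html"):
        print(plone_url("/var/www/html", path))
```

Passing a different ``site_path`` (for example ``/Plone/foo``) mirrors what the ``path`` option described below is for.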

Modification
------------

Modifying the default behavior of ``parse2plone`` is easy; just use the command
line options or add parameters to your ``buildout.cfg`` file. Both approaches
allow customization of the same set of options, but the command line arguments
will trump any settings found in your ``buildout.cfg`` file.
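
The precedence described above (defaults, then ``buildout.cfg`` parameters, then command line arguments) might be pictured roughly like this. It is a minimal sketch under stated assumptions, not the recipe's actual implementation: ``DEFAULTS`` and ``effective_settings`` are hypothetical names, and the sketch assumes an option that was not given on the command line arrives as ``None``.

```python
# Hypothetical defaults, copied from the values listed under "Getting started".
DEFAULTS = {
    "path": "/Plone",
    "html_extensions": "html",
    "image_extensions": "gif jpg jpeg png",
    "file_extensions": "mp3",
    "force": False,
    "publish": False,
}


def effective_settings(recipe_params, cli_args):
    """Merge settings so that later sources override earlier ones."""
    settings = dict(DEFAULTS)
    # Parameters from the recipe section of buildout.cfg override defaults.
    settings.update(recipe_params)
    # Command line arguments win, but only those that were actually given.
    settings.update({k: v for k, v in cli_args.items() if v is not None})
    return settings


if __name__ == "__main__":
    merged = effective_settings(
        {"path": "/Plone/foo", "html_extensions": "htm"},
        {"path": "/Plone/bar", "html_extensions": None},
    )
    print(merged["path"], merged["html_extensions"])
```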

Buildout options
~~~~~~~~~~~~~~~~

You can configure the following parameters in your ``buildout.cfg`` file in
the ``parse2plone`` recipe section.

Options
'''''''

+----------------------+-------------+-----------------------------------------+
| **Parameter**        | **Default** | **Description**                         |
|                      | **value**   |                                         |
+----------------------+-------------+-----------------------------------------+
| ``path``             | /Plone      | Specify an alternate location in the    |
|                      |             | database for the import to occur.       |
+----------------------+-------------+-----------------------------------------+
| ``illegal_chars``    | _ .         | Specify illegal characters.             |
|                      |             | ``parse2plone`` will ignore files that  |
|                      |             | contain these characters.               |
+----------------------+-------------+-----------------------------------------+
| ``html_extensions``  | html        | Specify HTML file extensions.           |
|                      |             | ``parse2plone`` will import HTML files  |
|                      |             | with these extensions.                  |
+----------------------+-------------+-----------------------------------------+
| ``image_extensions`` | gif jpg     | Specify image file extensions.          |
|                      | jpeg png    | ``parse2plone`` will import image files |
|                      |             | with these extensions.                  |
+----------------------+-------------+-----------------------------------------+
| ``file_extensions``  | mp3         | Specify generic file extensions.        |
|                      |             | ``parse2plone`` will import files with  |
|                      |             | these extensions.                       |
+----------------------+-------------+-----------------------------------------+
| ``target_tags``      | a h1 h2 p   | Specify target tags. ``parse2plone``    |
|                      |             | will parse the contents of HTML tags    |
|                      |             | listed.                                 |
+----------------------+-------------+-----------------------------------------+
| ``force``            | false       | Force create folders that do not exist. |
+----------------------+-------------+-----------------------------------------+
| ``publish``          | false       | Publish newly created content.          |
+----------------------+-------------+-----------------------------------------+
| ``slugify``          | false       | "Slugify" content. (see slugify.py)     |
+----------------------+-------------+-----------------------------------------+
| ``rename``           | false       | Rename content. (see rename.py)         |
+----------------------+-------------+-----------------------------------------+
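
To make the three ``*_extensions`` parameters concrete, here is a small sketch of how a file might be sorted into one of the three import categories by its extension. The ``classify`` function is hypothetical and is not taken from ``parse2plone``. Comparing extensions case-insensitively is an assumption, and the ``illegal_chars`` check is left out because its exact matching rule is not described here.

```python
import os


def classify(filename,
             html_extensions=("html",),
             image_extensions=("gif", "jpg", "jpeg", "png"),
             file_extensions=("mp3",)):
    """Return "html", "image", "file", or None for an unrecognised extension."""
    # Extension without the leading dot; lower-casing is an assumption.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in html_extensions:
        return "html"
    if ext in image_extensions:
        return "image"
    if ext in file_extensions:
        return "file"
    # Anything else would simply not be imported in this sketch.
    return None


if __name__ == "__main__":
    for name in ("index.html", "logo.png", "song.mp3", "notes.txt"):
        print(name, classify(name))
```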

Example
'''''''

Instead of accepting the default ``parse2plone`` behaviour, in your
``buildout.cfg`` file you may specify the following::

    [import]
    recipe = parse2plone
    path = /Plone/foo
    html_extensions = htm
    image_extensions = png
    target_tags = p

This will configure ``parse2plone`` to (only) import content *from*:

* Images ending in ``.png``
* HTML files ending in ``.htm``
* Text within ``p`` tags

*to*:

* A folder named ``/Plone/foo``.

Command line options
~~~~~~~~~~~~~~~~~~~~

The following ``parse2plone`` command line options are supported.

Options
'''''''

``'--path'``, ``'-p'``
**********************

You can specify an alternate import path ('/Plone' by default)
with ``--path`` or ``-p``::

    $ bin/plone run bin/import /path/to/files --path=/Plone/foo

``'--html-extensions'``
***********************

You can specify HTML file extensions with the ``--html-extensions`` option::

    $ bin/plone run bin/import /path/to/files --html-extensions=htm

``'--image-extensions'``
************************

You can specify image file extensions with the ``--image-extensions`` option::

    $ bin/plone run bin/import /path/to/files --image-extensions=png

``'--file-extensions'``
***********************

You can specify generic file extensions with the ``--file-extensions`` option::

    $ bin/plone run bin/import /path/to/files --file-extensions=pdf

``'--target-tags'``
*******************

You can specify the target tags to parse with the ``--target-tags`` option::

    $ bin/plone run bin/import /path/to/files --target-tags=p

``'--force'``
*************

Force create folders that do not exist.

``'--publish'``
***************

Publish newly created content.

``'--slugify'``
***************

"Slugify" content (see slugify.py).

``'--rename'``
***************

Rename content (see rename.py).

``'--help'``
************

You can ask ``parse2plone`` to tell you about its available options with the ``--help``
option::

    $ bin/plone run bin/import -h
    Usage: import [options]

    Options:
      -h, --help            show this help message and exit
      -p PATH, --path=PATH  Path to Plone site object or sub-folder
      --html-extensions=HTML_EXTENSIONS
                            Specify HTML file extensions
      --illegal-chars=ILLEGAL_CHARS
                            Specify characters to ignore
      --image-extensions=IMAGE_EXTENSIONS
                            Specify image file extensions
      --file-extensions=FILE_EXTENSIONS
                            Specify generic file extensions
      --target-tags=TARGET_TAGS
                            Specify HTML tags to parse
      --force               Force creation of folders
      --publish             Optionally publish newly created content
      --slugify             Optionally "slugify" content (see slugify.py)
      --rename=RENAME       Optionally rename content (see rename.py)

Example
'''''''

Instead of accepting the default ``parse2plone`` behaviour, on the command line you
may specify the following::

    $ bin/plone run bin/import /path/to/files -p /Plone/foo --html-extensions=htm \
        --image-extensions=png --target-tags=p

This will configure ``parse2plone`` to (only) import content *from*:

* Images ending in ``.png``
* HTML files ending in ``.htm``
* Text within ``p`` tags

*to*:

* A Plone site folder named ``/Plone/foo``.

Consternation
-------------

Here are some troubleshooting tips.

Compiling lxml
~~~~~~~~~~~~~~

``Parse2Plone`` requires ``lxml``, which in turn requires ``libxml2`` and
``libxslt``. If you do not have ``lxml`` installed "globally" (i.e. in your
system Python's site-packages directory) then Buildout will try to install it
for you. At this point ``lxml`` will look for the ``libxml2``/``libxslt`` development
libraries to build against, and if you don't have them installed on your system
already *your mileage may vary* (i.e. Buildout will fail).

Database access
~~~~~~~~~~~~~~~

Before running ``parse2plone``, you must either stop your Plone site or
use ZEO. Otherwise ``parse2plone`` will not be able to access the
database.

Communication
-------------

Questions, comments, or concerns? Please e-mail: aclark@aclark.net.

History
-------

0.9.0 (11/03/2010)
~~~~~~~~~~~~~~~~~~

- Fix regressions introduced in (or unresolved as of) 0.8.2. Thanks Derek
  Broughton for the bug report(s).
- Many fixes to the convert_parameter_values() method, which converts
  recipe parameters to arguments passed to main()
- Fix slugify feature

0.8.2 (11/02/2010)
~~~~~~~~~~~~~~~~~~

- Add rename feature
- Fix regressions introduced in 0.8.1

0.8.1 (10/29/2010)
~~~~~~~~~~~~~~~~~~

- Refactor options/parameters functionality to universally support _SETTINGS dict
- Add "slugify" feature
- Doc fixes
- Add support to optionally publish content after creation
- Add support for generic file import

0.8 (10/27/2010)
~~~~~~~~~~~~~~~~

- Support the importing of content to folders within the Plone site object

0.7 (10/25/2010)
~~~~~~~~~~~~~~~~

- Documentation fixes

0.6 (10/25/2010)
~~~~~~~~~~~~~~~~

- Support customization via recipe parameters and command line arguments

0.5 (10/22/2010)
~~~~~~~~~~~~~~~~

- Revert 'Add Plone to install_requires'

0.4 (10/22/2010)
~~~~~~~~~~~~~~~~

- Add 'Plone' to install_requires

0.3 (10/22/2010)
~~~~~~~~~~~~~~~~

- Another setuptools fix

0.2 (10/22/2010)
~~~~~~~~~~~~~~~~

- Setuptools fix

0.1 (10/21/2010)
~~~~~~~~~~~~~~~~

- Initial release
