<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns="http://usefulinc.com/ns/doap#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Project><name>rdfadict</name>
<shortdesc>An RDFa parser with a simple dictionary-like interface.</shortdesc>
<description>========
rdfadict
========

:Date: $LastChangedDate: 2008-08-12 13:59:24 -0700 (Tue, 12 Aug 2008) $
:Version: $LastChangedRevision: 10667 $
:Author: Nathan R. Yergler &lt;nathan@creativecommons.org&gt;
:Organization: `Creative Commons &lt;http://creativecommons.org&gt;`_
:Copyright: 
   2006-2008, Nathan R. Yergler, Creative Commons; 
   licensed to the public under the `MIT license 
   &lt;http://opensource.org/licenses/mit-license.php&gt;`_.

.. contents::

Installation
************

rdfadict and its dependencies may be installed using `easy_install 
&lt;http://peak.telecommunity.com/DevCenter/EasyInstall&gt;`_ (recommended) ::

  $ easy_install rdfadict

or by using the standard distutils setup.py::

  $ python setup.py install

If you are installing from source, you will also need the following
packages:

* `rdflib 2.4.x &lt;http://rdflib.net/&gt;`_
* `pyRdfa &lt;http://www.w3.org/2007/08/pyRdfa/&gt;`_
* `html5lib &lt;http://code.google.com/p/html5lib/&gt;`_ (required if you
  want to support non-XHTML documents)

``easy_install`` will satisfy dependencies for you if necessary.


Usage
*****

.. admonition:: Document Purpose

     This document is intended to provide a set of literate tests
     for the ``rdfadict`` package; it is **not** intended to provide thorough
     coverage of RDFa syntax or semantics.  See the `RDFa Primer
     &lt;http://www.w3.org/2006/07/SWD/RDFa/primer/&gt;`_ or the `RDFa Syntax 
     &lt;http://www.w3.org/2006/07/SWD/RDFa/syntax/&gt;`_ for details on RDFa.

**rdfadict** parses RDFa metadata encoded in HTML or XHTML documents.  It can
parse a block of text (as a string), a file-like object, or a URL.  For
example, given the following block of sample text:

  &gt;&gt;&gt; rdfa_sample = """
  ... &lt;div xmlns:dc="http://purl.org/dc/elements/1.1/"
  ...      xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
  ... &lt;h1 property="dc:title"&gt;Vacation in the South of France&lt;/h1&gt;
  ... &lt;h2&gt;created 
  ... by &lt;span property="dc:creator"&gt;Mark Birbeck&lt;/span&gt;
  ... on &lt;span property="dc:date" type="xsd:date"
  ...          content="2006-01-02"&gt;
  ...   January 2nd, 2006
  ...    &lt;/span&gt;
  ... &lt;/h2&gt;
  ... &lt;/div&gt;"""

Triples can be extracted using **rdfadict**:

  &gt;&gt;&gt; import rdfadict
  &gt;&gt;&gt; base_uri = "http://example.com/rdfadict/"
  &gt;&gt;&gt; parser = rdfadict.RdfaParser()
  &gt;&gt;&gt; triples = parser.parse_string(rdfa_sample, base_uri)

The ``base_uri`` argument tells the parser which URI to use as the subject of
assertions that do not specify a subject of their own.

Based on our example text, we expect to get three triples back -- title,
creator and date.  Triples are indexed as a dictionary, first by subject,
then by predicate, finally returning a ``list`` of objects.  For example,
a list of all subjects is retrieved using:

  &gt;&gt;&gt; triples.keys()
  ['http://example.com/rdfadict/']

If assertions were made about resources other than the default, those URIs
would appear in this list.  We can verify how many predicates were found
for this subject by accessing the next level of the dictionary:

  &gt;&gt;&gt; len(triples['http://example.com/rdfadict/'].keys())
  3

Finally, we can retrieve the value for the title by fully dereferencing
the dictionary:

  &gt;&gt;&gt; triples['http://example.com/rdfadict/'][
  ...     'http://purl.org/dc/elements/1.1/title']
  ['Vacation in the South of France']

Note that the objects are stored as a list by the default triple sink.
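
The nested layout can also be walked with ordinary dictionary iteration.  The
following is a minimal sketch and not part of the literate tests: it uses a
hand-written dictionary (``sample_triples``) in place of parser output, and
``iter_triples`` is a hypothetical helper name rather than part of rdfadict::

   def iter_triples(triples):
       """Yield (subject, predicate, object) tuples from a nested dictionary."""
       for subject in sorted(triples):
           for predicate in sorted(triples[subject]):
               # Each predicate maps to a list of objects.
               for obj in triples[subject][predicate]:
                   yield (subject, predicate, obj)

   # Hand-written stand-in for the dictionary returned by the parser.
   sample_triples = {
       'http://example.com/rdfadict/': {
           'http://purl.org/dc/elements/1.1/title':
               ['Vacation in the South of France'],
           'http://purl.org/dc/elements/1.1/creator': ['Mark Birbeck'],
       },
   }

   flat = list(iter_triples(sample_triples))

Sorting the keys is only a convenience to make the order of the flattened
list predictable; the dictionary itself carries no ordering guarantee.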

Multiple Assertions
===================

Because the ``property`` attribute always denotes a triple with a literal
string as its object, and ``rel`` and ``rev`` denote triples with URIs as
their objects, it is possible to make multiple assertions with a single HTML
tag.

For example:

  &gt;&gt;&gt; multi_rdfa = """
  ... &lt;div xmlns:foaf="http://xmlns.com/foaf/0.1/" 
  ...      xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;
  ...   This photo was taken by &lt;a about="photo1.jpg" property="dc:title"
  ...   content="Portrait of Mark" rel="dc:creator"
  ...   rev="foaf:img" 
  ...   href="http://www.blogger.com/profile/1109404"&gt;Mark Birbeck&lt;/a&gt;.
  ... &lt;/div&gt;
  ... """

In this statement we are making three assertions: two involving URI objects
(specified by ``rel`` and ``rev``), and one involving the ``property``.

  &gt;&gt;&gt; import rdfadict
  &gt;&gt;&gt; parser = rdfadict.RdfaParser()
  &gt;&gt;&gt; multi_base_uri = "http://example.com/multiassert/"
  &gt;&gt;&gt; triples = parser.parse_string(multi_rdfa, multi_base_uri)

We expect the triples generated to have two subjects: the photo URI (for the 
``rel`` and ``property`` assertions) and the ``href`` URI (for the ``rev``
assertion).

  &gt;&gt;&gt; len(triples.keys()) == 2
  True
  &gt;&gt;&gt; 'http://example.com/multiassert/photo1.jpg' in triples.keys()
  True
  &gt;&gt;&gt; 'http://www.blogger.com/profile/1109404' in triples.keys()
  True

Finally, we verify that the assertions made about each subject are correct:

  &gt;&gt;&gt; len(triples['http://example.com/multiassert/photo1.jpg'].keys()) == 2
  True
  &gt;&gt;&gt; triples['http://example.com/multiassert/photo1.jpg'] \
  ...          ['http://purl.org/dc/elements/1.1/creator']
  ['http://www.blogger.com/profile/1109404']
  &gt;&gt;&gt; triples['http://example.com/multiassert/photo1.jpg'] \
  ...          ['http://purl.org/dc/elements/1.1/title']
  ['Portrait of Mark']

  &gt;&gt;&gt; triples['http://www.blogger.com/profile/1109404']
  {'http://xmlns.com/foaf/0.1/img': ['http://example.com/multiassert/photo1.jpg']}


Resolving Statements
====================

When resolving statements, the REL, REV, CLASS and PROPERTY attributes expect
a `CURIE &lt;http://www.w3.org/2001/sw/BestPractices/HTML/2005-10-21-curie&gt;`_,
while the HREF attribute expects a URI.  When resolving CURIEs, un-namespaced
values are ignored to prevent "triple bloat", unless they are HTML reserved
words (such as ``license``).

Given an example:

  &gt;&gt;&gt; rdfa_sample2 = """
  ... &lt;div xmlns:dc="http://purl.org/dc/elements/1.1/"
  ...      xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
  ... &lt;link rel="alternate" href="/foo/bar" /&gt;
  ... &lt;h1 property="dc:title"&gt;Vacation in the South of France&lt;/h1&gt;
  ... &lt;h2&gt;created 
  ... by &lt;span property="dc:creator"&gt;Mark Birbeck&lt;/span&gt;
  ... on &lt;span property="dc:date" type="xsd:date"
  ...          content="2006-01-02"&gt;
  ...   January 2nd, 2006
  ...    &lt;/span&gt;
  ... &lt;/h2&gt;
  ... &lt;img src="/myphoto.jpg" class="photo" /&gt;
  ... (&lt;a href="http://creativecommons.org/licenses/by/3.0/" rel="license"
  ...    about="/myphoto.jpg"&gt;CC License&lt;/a&gt;)
  ... &lt;/div&gt;"""

We can extract RDFa triples from it:

  &gt;&gt;&gt; parser = rdfadict.RdfaParser()
  &gt;&gt;&gt; base_uri2 = "http://example.com/rdfadict/sample2"
  &gt;&gt;&gt; triples = parser.parse_string(rdfa_sample2, base_uri2)

This block of RDFa includes a license statement about another document, the
photo:

  &gt;&gt;&gt; len(triples["http://example.com/myphoto.jpg"])
  1

  &gt;&gt;&gt; triples["http://example.com/myphoto.jpg"].keys()
  ['http://www.w3.org/1999/xhtml/vocab#license']
  &gt;&gt;&gt; triples["http://example.com/myphoto.jpg"] \
  ...    ['http://www.w3.org/1999/xhtml/vocab#license']
  ['http://creativecommons.org/licenses/by/3.0/']

There are two things to note with respect to this example.  First, the relative
URI for the photo is resolved with respect to the ``base_uri`` value.  Second,
the ``class`` attribute is not processed, because its value is not in a
declared namespace:

  &gt;&gt;&gt; 'photo' in [ n.lower() for n in
  ...      triples['http://example.com/rdfadict/sample2'].keys() ]
  False

Similar to this case is the ``link`` tag in the example HTML.  Based on the
subject resolution rules for ``link`` and ``meta`` tags, no subject can be 
resolved for this assertion.  However, this does not throw an exception because
the value of the ``rel`` attribute is not namespaced.
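
The relative URI resolution noted above matches standard URI reference
resolution.  As a minimal sketch (not part of the literate tests, and not a
statement about how rdfadict or pyRdfa resolve URIs internally), the standard
library's ``urljoin`` gives the same results for the photo URIs used in this
document::

   try:
       from urllib.parse import urljoin   # Python 3
   except ImportError:
       from urlparse import urljoin       # Python 2

   # "/myphoto.jpg" resolved against the base URI of the sample above.
   photo = urljoin('http://example.com/rdfadict/sample2', '/myphoto.jpg')

   # "photo1.jpg" resolved against the base URI of the multiple
   # assertions sample.
   portrait = urljoin('http://example.com/multiassert/', 'photo1.jpg')

   print(photo)
   print(portrait)

Here ``photo`` is ``http://example.com/myphoto.jpg`` and ``portrait`` is
``http://example.com/multiassert/photo1.jpg``, the same subjects that appear
in the doctests.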

Consider an alternative, contrived example:

  &gt;&gt;&gt; link_sample = """
  ... &lt;div xmlns:dc="http://purl.org/dc/elements/1.1/"
  ...      xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  ...      about="http://example.com/"&gt;
  ... &lt;link rel="dc:creator" href="http://example.com/birbeck" /&gt;
  ... &lt;/div&gt;"""

Based on the subject resolution rules for ``link`` tags, we expect to see
one assertion: that http://example.com/birbeck represents the creator of
http://example.com.  This can be tested; note we supply a different 
``base_uri`` to ensure the subject is being properly resolved.

  &gt;&gt;&gt; parser = rdfadict.RdfaParser()
  &gt;&gt;&gt; link_base_uri = 'http://example.com/foo'
  &gt;&gt;&gt; triples = parser.parse_string(link_sample, link_base_uri)

  &gt;&gt;&gt; triples.keys()
  ['http://example.com/']
  &gt;&gt;&gt; len(triples['http://example.com/'])
  1
  &gt;&gt;&gt; triples['http://example.com/']['http://purl.org/dc/elements/1.1/creator']
  ['http://example.com/birbeck']

Note that this HTML makes **no** assertions about the source document:

  &gt;&gt;&gt; link_base_uri in triples.keys()
  False

If the HTML sample is modified slightly, and the ``about`` attribute
is omitted, rdfadict resolves the subject to the explicit base URI.

  &gt;&gt;&gt; link_sample = """
  ... &lt;div xmlns:dc="http://purl.org/dc/elements/1.1/"
  ...      xmlns:xsd="http://www.w3.org/2001/XMLSchema" &gt;
  ... &lt;link rel="dc:creator" href="http://example.com/birbeck" /&gt;
  ... &lt;/div&gt;"""
  &gt;&gt;&gt; parser = rdfadict.RdfaParser()
  &gt;&gt;&gt; link_base_uri = 'http://example.com/foo'
  &gt;&gt;&gt; triples = parser.parse_string(link_sample, link_base_uri)
  &gt;&gt;&gt; link_base_uri in triples.keys()
  True

If a namespace prefix cannot be resolved, the assertion is ignored.

  &gt;&gt;&gt; ns_sample = """
  ... &lt;a href="http://example.com/foo" rel="foo:bar"&gt;Content&lt;/a&gt;
  ... """
  &gt;&gt;&gt; parser = rdfadict.RdfaParser()
  &gt;&gt;&gt; triples = parser.parse_string(ns_sample, 'http://example.com/bob')
  &gt;&gt;&gt; triples
  {}

See the `RDFa Primer &lt;http://www.w3.org/2006/07/SWD/RDFa/primer/&gt;`_
for more RDFa examples.
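
The CURIE handling described in this section can be sketched as follows.
This is an illustration only and not part of the literate tests:
``expand_curie`` and its ``reserved`` parameter are hypothetical names, the
set of reserved words is reduced to ``license`` (the one example given
above), and the actual resolution is performed by pyRdfa::

   XHTML_VOCAB = 'http://www.w3.org/1999/xhtml/vocab#'

   def expand_curie(curie, namespaces, reserved=('license',)):
       """Return the full URI for a CURIE, or None if it should be ignored."""
       if ':' in curie:
           prefix, reference = curie.split(':', 1)
           if prefix in namespaces:
               return namespaces[prefix] + reference
           # Unresolvable prefix, such as "foo" in "foo:bar".
           return None
       if curie in reserved:
           # Un-namespaced reserved word, such as "license".
           return XHTML_VOCAB + curie
       # Any other un-namespaced value, such as "photo", is dropped.
       return None

   namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
   title = expand_curie('dc:title', namespaces)
   license_uri = expand_curie('license', namespaces)
   ignored = [expand_curie('photo', namespaces),
              expand_curie('foo:bar', namespaces)]

With these inputs ``title`` expands to the Dublin Core title URI,
``license_uri`` to the XHTML vocabulary URI seen in the doctests, and both
entries of ``ignored`` are ``None``.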

Parsing Files
=============

**rdfadict** can parse from three sources: URLs, file-like objects, or
strings.  The examples thus far have parsed strings using the
``parse_string`` method.  A file-like object can also be used:

   &gt;&gt;&gt; from StringIO import StringIO
   &gt;&gt;&gt; file_sample = """
   ... &lt;html&gt;
   ...  &lt;body&gt;
   ...    &lt;a href="http://creativecommons.org/licenses/by/3.0/"
   ...       rel="license"&gt;the license&lt;/a&gt;
   ...  &lt;/body&gt;
   ... &lt;/html&gt;
   ... """
   &gt;&gt;&gt; parser = rdfadict.RdfaParser()
   &gt;&gt;&gt; result = parser.parse_file(StringIO(file_sample),
   ...                            "http://creativecommons.org")
   &gt;&gt;&gt; result.keys()
   ['http://creativecommons.org']
   &gt;&gt;&gt; result['http://creativecommons.org']
   {'http://www.w3.org/1999/xhtml/vocab#license': ['http://creativecommons.org/licenses/by/3.0/']}

Parsing By URL
==============

**rdfadict** can parse a document retrievable by URI.  Behind the
scenes it uses ``urllib2`` to open the document.  

  &gt;&gt;&gt; parser = rdfadict.RdfaParser()
  &gt;&gt;&gt; result = \
  ... parser.parse_url('http://creativecommons.org/licenses/by/2.1/jp/')
  &gt;&gt;&gt; print result['http://creativecommons.org/licenses/by/2.1/jp/']\
  ... ['http://purl.org/dc/elements/1.1/title'][0]
  表示 2.1 日本

Note that ``parse_file`` is not recommended for use with ``urllib2``
handler objects.  In the event that pyRdfa encounters a non-XHTML
source, it re-opens the URL to begin processing with a more tolerant
parser.  When ``parse_file`` is used to initiate parsing, it is unable
to re-open the URL correctly.

Triple Sinks
============

**rdfadict** uses a simple interface (the triple sink) to pass RDF triples
extracted back to some storage mechanism.  A class which acts as a triple
sink only needs to define a single method, ``triple``.  For example::

   class StdOutTripleSink(object):
       """A triple sink which prints out the triples as they are received."""

       def triple(self, subject, predicate, object):
           """Process the given triple."""

           print subject, predicate, object

The default triple sink models the triples as a nested dictionary, 
as described above.  Also included with the package is a list triple sink,
which stores the triples as a list of 3-tuples.  To use a different sink,
pass an instance in as the ``sink`` parameter to either parse method.  For
example:

   &gt;&gt;&gt; parser = rdfadict.RdfaParser()
   &gt;&gt;&gt; list_sink = rdfadict.sink.SimpleTripleSink()
   &gt;&gt;&gt; result = parser.parse_string(rdfa_sample, base_uri, sink=list_sink)

   &gt;&gt;&gt; result is list_sink
   True

   &gt;&gt;&gt; len(list_sink)
   3

Note that the parse method returns the sink used.  Since the sink we're using
is really just a ``list``, it can be inspected like any other list, for
example with ``len``.
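
As a further illustration of the interface, here is a minimal sketch of a
custom sink.  ``PredicateCountSink`` is a hypothetical class, not part of
rdfadict; the sketch feeds it by hand rather than through a parser, and
assumes only the ``triple`` method signature shown above::

   class PredicateCountSink(dict):
       """A triple sink which counts how often each predicate is seen."""

       def triple(self, subject, predicate, object):
           """Record one occurrence of the given predicate."""
           self[predicate] = self.get(predicate, 0) + 1

   # Feed the sink by hand; a parser would call triple() once per triple.
   sink = PredicateCountSink()
   sink.triple('http://example.com/a',
               'http://purl.org/dc/elements/1.1/title', 'First')
   sink.triple('http://example.com/b',
               'http://purl.org/dc/elements/1.1/title', 'Second')
   sink.triple('http://example.com/a',
               'http://purl.org/dc/elements/1.1/creator', 'Someone')

Since a sink only needs to define ``triple``, an instance of such a class
should be usable as the ``sink`` parameter in the same way as ``list_sink``.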

Limitations and Known Issues
****************************

**rdfadict** currently does not implement the following areas properly;
numbers in parentheses refer to section numbers in the `RDFa Syntax
Document &lt;http://www.w3.org/2006/07/SWD/RDFa/syntax/&gt;`_.

* ``xml:base`` is not respected (2.3)
* Typing is not implemented; this includes implicit XMLLiteral typing as well
  as explicit types specified by the ``datatype`` attribute (5.1)
* Blank nodes are not guaranteed to work per the syntax document (5.2); if
  you try to use them, you will probably be disappointed.
* Reification is not implemented (5.3).


Change History
**************

0.7 (2009-06-02)
================

* DictTripleSink uses ``encode`` instead of ``str``, making it
  friendlier to Unicode.
* Eliminated custom pyRdfa wrapper.
* Added a test for handling Unicode triples. 

0.6 (2008-10-14)
================

* Added DictSetTripleSink

0.5.2 (2008-08-14)
==================

* Corrected bug with parse_url; non-XHTML sources will now be parsed
  correctly.

0.5.1 (2008-08-13)
==================

* Added ``parse_file`` method for parsing data from a file-like
  object.
* ``parseurl`` and ``parsestring`` are now aliased to ``parse_url``
  and ``parse_string`` respectively.

0.5 (2008-08-12)
================

* rdfadict now acts as a wrapper for `pyRdfa
  &lt;http://www.w3.org/2007/08/pyRdfa/&gt;`_ for full compliance with the
  candidate recommendation.
* The ``cc`` namespace is no longer special cased with a default
  value.
* Removed tidy extra and uTidylib dependency; parsing is now handled
  by pyRdfa which uses html5lib for handling more broken HTML.
* Doctests are now in README.txt in the rdfadict package.
* The default XHTML namespace is now
  http://www.w3.org/1999/xhtml/vocab instead of
  http://www.w3.org/1999/xhtml

0.4.2 (2007-06-05)
==================

* Corrected dependency link for uTidylib.

0.4.1 (2007-03-21)
==================

* Use `uTidylib &lt;http://utidylib.berlios.de&gt;`_ instead of ``os.system`` for
  wrapping tidy.

0.4.0 (2007-03-20)
==================

* Provide rudimentary fallback to Tidy when we encounter HTML which is not
  well-formed XML.

0.3.3 (2007-03-14)
==================

* Removed special case for ``cc:license``; instead, ``cc`` namespace simply 
  has a default value of ``http://web.resource.org/cc/``.

0.3.2 (2007-03-12)
==================

* Ignore assertions which have unresolvable namespace prefixes.
* Special case handling for ``cc:license``.

0.3.1 (2007-03-09)
==================

* Fixed bug in subject resolution exception handling.

0.3 (2007-03-08)
================

* Fixed resolution of URIs v. CURIEs
* Drop assertions with non-namespaced CURIEs as the predicate (per updated spec)
* Updated test suite to comply with updated RDFa specification
* Corrected subject resolution behavior for &lt;link&gt; and &lt;meta&gt; elements
* Implemented entry point and extractor interface for compatibility with the
  ccrdf.rdfextract library.
* Fixed parsing of ``rev`` assertions, which was formerly completely broken.

0.2 (2006-11-21)
================

* Directly subclass list and dict for our sample triple sinks
* Additional package metadata for PyPI
* Additional documentation of sink interface and tests for the SimpleTripleSink

0.1 (2006-11-20)
================

* Initial public release


Download
********</description>
<homepage rdf:resource="http://wiki.creativecommons.org/RdfaDict" />
<maintainer><foaf:Person><foaf:name>Nathan R. Yergler</foaf:name>
<foaf:mbox_sha1sum>e6fd927e2c047c0aa9740bcc67f8b04fcd2d9ae5</foaf:mbox_sha1sum></foaf:Person></maintainer>
<release><Version><revision>0.7</revision></Version></release>
</Project></rdf:RDF>