
Client and server for the Open Computational Humanities Research Ecosystem (OCHRE)

Project description

The Open Computational Humanities Research Ecosystem (OCHRE) provides the server infrastructure and client libraries to experiment with complex machine learning and rich humanistic scholarship.

Installation

This package can be installed via pip:

$ pip install pyochre

However, it’s advisable to employ Python virtual environments (here and in other situations), in which case you would run something like the following in a new empty directory:

$ python3 -m venv local
$ source local/bin/activate
$ pip install pyochre

and run deactivate to exit the virtual environment, or source local/bin/activate to enter it again.

The simple package doesn’t include certain dependencies that are important for deploying a dedicated server, but it is designed to be fully functional without them. There are three extra options that can be included: ldap, postgres, and torchserve. For instance, to include the full set of options, the command is:

$ pip install pyochre[ldap,postgres,torchserve]

Note that these options may require additional effort, such as non-Python dependencies that need to be installed independently. For most situations, the simple package is the right choice.

Library structure

The package has five submodules:

pyochre.utils

Various functions and classes that are generally useful in many places throughout the package

pyochre.server

The OCHRE server, an orchestrated set of servers and frontends that manages the complexity of interdisciplinary computational research

pyochre.primary_sources

Formal domain descriptions, data, and multimedia materials

pyochre.machine_learning

Training, applying, and fine-tuning models with well-defined signatures

pyochre.scholarly_knowledge

Labeling data, specifying conceptual frameworks, and comparing hypotheses

The latter three submodules correspond to basic concepts in computational humanities research, and constitute the “client library” that will be most relevant for the majority of users.

Additionally, the pyochre.primary_sources, pyochre.machine_learning, pyochre.scholarly_knowledge, and pyochre.server submodules can each be executed as scripts, for instance:

$ python -m pyochre.scholarly_knowledge --help

will print usage information about the pyochre.scholarly_knowledge script. See the Scripts section for detailed information on how to use these tools.

Concepts and background

Primary sources

A primary source consists of the domain, describing types of entities and their potential properties and relationships, and the data, which are the actual instantiations of those entity types with their specific properties and relationships. For practical reasons, when a property is associated with a substantial amount of information (like a long document, image, or video), there is a third aspect of primary sources, materials, allowing such information to be stored and accessed efficiently.

As a simple abstract example, primary sources of campaign contribution information might have a domain capturing that there are entity types Politician, Office, Donation, and Organization; that a Politician has a text property givenName, a relationship runningFor with Office, and another property headShot that should be a unique identifier (one that selects a file in the materials); and so forth. The data might have thousands of entities of each type, e.g. a Politician with givenName of “Dan”, runningFor an Office with its own properties, and a headShot value of “some_long_random_value”. Finally, the materials might contain lots of image files, one of them named “some_long_random_value”.

Both domain and data are represented using the RDF framework, and the representation has several goals:

  • Map closely to human understanding and intuition

  • Avoid introducing debatable scholarly inferences

  • Define and constrain the form of information in the primary sources

  • Provide links from the domain into the broader space of human knowledge

Each of these requires careful consideration by the scholar, and can be sensitive to the field, the specific research, and available resources.

OCHRE uses Wikidata entities and properties for semantic links to broader human knowledge.

The SHACL vocabulary is used in domain representations to constrain how entities and properties are arranged in a given primary source.
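
To make the domain/data distinction from the campaign-contribution example concrete, here is a minimal sketch using the rdflib library. The namespaces, class names, and property URIs are hypothetical illustrations, not pyochre’s actual vocabulary:

from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespaces standing in for a project-specific vocabulary and data.
EX = Namespace("http://example.org/campaign#")
DATA = Namespace("http://example.org/data/")

g = Graph()

# Domain: declare the entity types.
for name in ("Politician", "Office", "Donation", "Organization"):
    g.add((EX[name], RDF.type, RDFS.Class))

# Data: one concrete Politician and the Office they are running for.
dan = DATA["politician_001"]
office = DATA["office_001"]
g.add((dan, RDF.type, EX.Politician))
g.add((dan, EX.givenName, Literal("Dan")))
g.add((dan, EX.runningFor, office))
g.add((office, RDF.type, EX.Office))
# headShot holds an identifier that selects a file stored in the materials.
g.add((dan, EX.headShot, Literal("some_long_random_value")))

print(g.serialize(format="turtle"))

In the actual representation, the domain would additionally carry SHACL shapes constraining how these types can be arranged, as well as links from the domain terms to Wikidata entities and properties.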

Machine learning

Machine learning models, in the most general sense, are functions that take in some sort of information as input and produce another sort of information as output. By describing the structure and semantics (or the “signature”) of these inputs and outputs for a given model, OCHRE can determine how a model can be adapted (“trained” or “fine-tuned”) on new primary sources, or applied to them to infer new information. Focusing on the structure and semantics of model input and output, there are several goals for the representation:

  • Both input and output signatures should allow expressive specification of graph structure

  • A fitted model should record the provenance of its training data, to facilitate parameter re-use and similar tasks

  • Output of a model, in combination with its signatures and the corresponding inputs, should allow creation of annotations of the same form as described in Scholarly knowledge.

OCHRE has provisionally adopted the MLSchema specification to describe models, though real-world experience will determine if it is sufficiently expressive.

Ideally, signatures are generated as models are assembled and trained. In particular, OCHRE will be integrating the Starcoder project to automatically generate, train, and reuse graph neural networks based on primary sources and scholarly knowledge, with signatures capturing the structural and semantic relationships.

Existing techniques like topic models, pretrained object recognition, and so forth, are being translated into simple signatures that provide a starting point for OCHRE.

Scholarly knowledge

Colloquially, “scholarly knowledge” corresponds to information not clearly immanent in primary sources themselves according to the research context. This can be a rather subtle distinction, because it depends on the aims of the scholar and the norms of the field. As a simple example, scholars often work with materials that have been classified in some way: for Cuneiform tablets, this might be according to language, genre, material, kingdom, and so forth. These classifications differ greatly in certainty, tangibility, agreement, and relevance for a given scholarly effort.

Trying to “get behind” all of this sort of scholarly knowledge is generally a lost cause: the closest situation might be something like archaeological fieldwork, but even that is not straightforward. Instead, OCHRE encourages scholars to find stable, canonical materials and explicitly reify them as “primary sources”, in the sense of “this is what a scholar in my position treats as the foundation to build on”. This view of “primary sources” will often include information like the classifications mentioned earlier, but the fact that the “material” was determined by a spectrogram thousands of years after an inscription was made can be represented in the primary source representation itself.

Therefore, in OCHRE, “scholarly knowledge” roughly refers to structured information that is added and interacted with via OCHRE and by a specific, identifiable agent.

Scholarly knowledge can take an infinite variety of forms, much like primary sources themselves, and so OCHRE again uses the RDF framework for its representation. Even more so than with model signatures, the details of this representation will need to evolve with real-world experience.

Additional resources being considered

Several existing standards are being considered for OCHRE’s various representational needs, such as the PROV ontology for describing the provenance of primary sources, models, and annotations.

The OCHRE Scripts

Primary sources

The general pattern for converting a non-RDF document is: as a format is parsed, certain “events” fire, each of which is an opportunity to generate RDF triples based on the event and the current location in the document.

Each event indicates what has just been parsed by sending a tag, content, and a dictionary of attributes (only tag is certain to have a value). The particular tags and attributes will be specific to the format and data. For instance, the event that fires for a cell in a CSV file in the column “day” with value “Monday” would send the tag “cell”, the content “Monday”, and the attribute dictionary:

{"id" : "day"}

Along with the tag, content, and attributes, the event sends its location, which for most formats is the list of “parent” events the current event is happening under. If the above event were happening while processing the 22nd row of a file called “some_file.csv”, the location might be (ignoring for the moment the “uid” entries):

[
  {"tag" : "table", "content" : "", "attributes" : {"id" : "some_file.csv"}, "uid" : "43k2"},
  {"tag" : "row", "content" : "", "attributes" : {"id" : "22"}, "uid" : "213j"}
]

No matter the format (CSV, XML, etc), events have the same structure, and in fact the event just described might be this JSON object:

{
  "tag" : "cell",
  "content" : "Monday",
  "attributes" : {"id" : "day"},
  "location" : [
    {"tag" : "table", "content" : "", "attributes" : {"id" : "some_file.csv"}, "uid" : "43k2"},
    {"tag" : "row", "content" : "", "attributes" : {"id" : "22"}, "uid" : "213j"}
  ],
  "uid" : "t98f"
}

Again, the possible values for tag will depend on the format (HTML won’t ever have a “row” tag, but might have “div”, “body”, etc), as will the attributes dictionary.

The goal is to decide what RDF triples to generate when seeing an event. This involves specifying rules, each of which 1) determines whether it matches the event, and 2) describes the RDF triple(s) to create from it. Here is an example of a match portion:

{
  "tag" : ["cell"],
  "attributes" : {"id" : ["day"]}
}

Note how it constrains the tag and the attribute “id” by giving a list of acceptable values.
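
As a rough illustration of these matching semantics (a sketch, not pyochre’s actual implementation), a match portion could be checked against an event along the following lines:

def rule_matches(match, event):
    # The event's tag must be one of the acceptable values, if any are listed.
    if "tag" in match and event["tag"] not in match["tag"]:
        return False
    # Every constrained attribute must be present with an acceptable value.
    for name, allowed in match.get("attributes", {}).items():
        if event.get("attributes", {}).get(name) not in allowed:
            return False
    return True

# The match portion above accepts the "day" cell event from earlier.
match = {"tag": ["cell"], "attributes": {"id": ["day"]}}
event = {"tag": "cell", "content": "Monday", "attributes": {"id": "day"}}
assert rule_matches(match, event)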

Here is an example of a creation portion with some placeholders for readability, that creates the two triples (S, P1, O1) and (S, P2, O2) when its rule matches:

{
  "subject" : S,
  "predicate_objects" : [
    {
      "predicate" : P1,
      "object" : O1
    },
    {
      "predicate" : P2,
      "object" : O2
    }
  ]
}

The placeholders are a bit more interesting: they tell OCHRE how to create an RDF node based on the event. Here is an example that creates an RDF integer literal node that doesn’t depend at all on the event:

{
  "type" : "literal",
  "datatype" : "integer",
  "value" : "27"
}

Here is an example that also creates an integer literal node, but based on the event:

{
  "type" : "literal",
  "datatype" : "integer",
  "value" : "{content}"
}

In the CSV example, if the rule were matching rows, this would correspond to the row number. This curly-braces interpolation can also be used to refer to attributes and locations in the event, and can be mixed arbitrarily with bare strings, allowing the extraction of fairly sophisticated patterns.
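
One plausible way to picture this interpolation is Python’s str.format applied over the event’s fields; the exact mechanism, and the syntax for reaching into locations, are assumptions here rather than documented behavior:

# The "day" cell event from earlier.
event = {
    "tag": "cell",
    "content": "Monday",
    "attributes": {"id": "day"},
    "uid": "t98f",
}

# Bare strings and event references can be mixed in one template.
template = "ochre:{uid}_{attributes[id]}_{content}"
print(template.format(**event))  # ochre:t98f_day_Monday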

Here is an example that creates a URI node, directly specifying the Wikidata entry for “photograph”:

{
  "type" : "uri",
  "value" : "wd:Q125191"
}

Importantly, most entities in a primary source will not have a clear corresponding entity in Wikidata (e.g. there may be a long list of photos, so the above example is useful for saying “this is an instance of a photo”, but not for referring to this or that specific photo). To handle this, every time an event occurs, OCHRE creates a unique identifier based on the event. This unique identifier is the “uid” seen in the full event example above, and can be interpolated as-needed to derive unique URIs. For instance:

{
  "type" : "uri",
  "value" : "ochre:{uid}"
}

is an entity in the OCHRE namespace corresponding to the particular event being processed.

Finally, OCHRE keeps track of the sequential number of each tag value within one tier of the input, and this number can be interpolated with “index”. For example, if the input involves processing sentences, each of which is a sequence of words, the string “{index}” within a word rule will give the current word’s number within its sentence, starting from 0.
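
A hypothetical sketch of how such per-tier numbering might behave, purely to illustrate the counting:

from collections import defaultdict

# One counter per (enclosing tier, tag) pair.
counters = defaultdict(int)

def next_index(parent_uid, tag):
    index = counters[(parent_uid, tag)]
    counters[(parent_uid, tag)] += 1
    return index

# Words are numbered from 0 within each sentence.
print(next_index("sentence_1", "word"))  # 0
print(next_index("sentence_1", "word"))  # 1
print(next_index("sentence_2", "word"))  # 0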

Materials

The mechanisms described above are for generating RDF. There is also the need to connect parts of the RDF to materials: larger files that don’t belong directly in the RDF graph, such as JPGs, audio recordings, and long documents. To accomplish this, there is special information that can be added to an entry in a “predicate_objects” list:

{
  "predicate" : P,
  "object" : O,
  "file" : "path/some_file_{attributes['name']}.jpg",
  "file_type" : "image/jpg"
}

When the pyochre.primary_sources script encounters a “file” like this, it looks for it on the local filesystem. If found, it creates a unique identifier I based on the file’s contents, and adds an additional RDF triple that links it to the object in the predicate_object rule (roughly, (O, hasMaterialId, I)) indicating “the entity O has an associated file identified with the id I”. Then, after OCHRE creates the RDF graph, it also uploads all such files in the appropriate fashion.
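
The following sketch shows the general idea; the choice of hash function, the predicate name, and the namespaces are assumptions based on the rough description above:

import hashlib
from rdflib import Graph, Literal, Namespace, URIRef

OCHRE = Namespace("http://example.org/ochre/")

def attach_material(graph, obj_node, path):
    # Derive a unique identifier from the file's contents.
    with open(path, "rb") as handle:
        material_id = hashlib.sha256(handle.read()).hexdigest()
    # Roughly (O, hasMaterialId, I): link the entity to its stored material.
    graph.add((obj_node, OCHRE.hasMaterialId, Literal(material_id)))
    return material_id

g = Graph()
# attach_material(g, URIRef("http://example.org/ochre/photo_1"), "path/some_file.jpg")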

Machine learning

While the ultimate aim is for OCHRE to employ and generate complex models, there are already several simple types of models that can be incorporated via the pyochre.machine_learning script. Ultimately, all models are transformed into MAR archives, so other than the case of Existing MAR archives, these situations are essentially different ways of building such an archive for a particular type of model.

Existing MAR archives

The simplest scenario:

$ python -m pyochre.machine_learning create --mar_url https://torchserve.pytorch.org/mar_files/maskrcnn.mar --name "Object detection" --signature_file https://github.com/comp-int-hum/ochre-python/raw/main/examples/object_detection_signature.ttl

Topic models

Huggingface models

Custom models

Scholarly knowledge

Server

The package also contains the server side of OCHRE under the pyochre.server submodule. When invoked as a script, it functions in most ways like a standard Django project’s manage.py script (see https://docs.djangoproject.com/en/4.1/):

$ python -m pyochre.server --help

The database for the server can be initialized and an initial user created by running:

$ python -m pyochre.server migrate
$ python -m pyochre.server createcachetable
$ python -m pyochre.server collectstatic
$ python -m pyochre.server shell_plus
>>> u = User.objects.create(username="joe", email="joe@somewhere.net", is_staff=True, is_superuser=True)
>>> u.set_password("CHANGE_ME")
>>> u.save()

Finally, start the server with:

$ python -m pyochre.server runserver

At this point you should be able to browse to http://localhost:8000 and interact with the site. Note that it will only be accessible on the local computer and this is by design: it is running without encryption, and using infrastructure that won’t scale well and doesn’t implement some important functionality.

Advanced topics

Converting a new primary source format

Running a full “production”-like server

To run a fully functioning (though resource-constrained) OCHRE server on your personal computer, you’ll need to take a few more steps than the simple procedure described in the Server section.

First, install either Docker or Podman, depending on what’s available or easiest for your operating system. The commands below use podman; if you installed Docker, substitute docker for podman.

Second, start containers for the Jena RDF database and the Redis cache:

$ podman run -d --rm --name jena -p 3030:3030 -e ADMIN_PASSWORD=CHANGE_ME docker.io/stain/jena-fuseki
$ podman run -d --rm --name redis -p 6379:6379 docker.io/library/redis

Third, the Celery execution server and the TorchServe model server each need to run alongside the OCHRE server. The simplest way to accomplish this is to open two more terminals, navigate in each to the directory containing the virtual environment where OCHRE is installed, and run:

$ source local/bin/activate

to enter the same virtual environment as the OCHRE package, and then run the following commands, one in each terminal:

$ celery -A pyochre.server.ochre worker -l DEBUG
$ torchserve --model-store ~/ochre/models/ --foreground --no-config-snapshots

At this point, with the two containers running (which can be verified with podman ps) and Celery and TorchServe running in separate terminals, running:

$ USE_JENA=True USE_TORCHSERVE=True python -m pyochre.server runserver

should start the OCHRE server, and the site should work nearly identically to how it does when officially deployed.
