Skip to main content

Library for the Open Computational Humanities Research Ecosystem (OCHRE).

Project description

OCHRE: The Open Computational Humanities Research Ecosystem

This package can be installed via pip:

pip install pyochre

It has three submodules that correspond more or less to basic concepts in computational humanities research:

  • primary_sources ::
  • machine_learning ::
  • scholarly_knowledge ::

In addition to using the package as a library in one's own code, each submodule can be executed as a command-line script that provides quick ways to perform common tasks. For instance:

python -m pyochre.scholarly_knowledge --help

prints information on how to use the scholarly_knowledge script.

Primary sources

A primary source consists of the domain, describing types of entities and their potential properties and relationships, and the data, which are the actual instantiations of those types of entities, their specific properties and relationships. For practical reasons, when a property is associated with a substantial amount of information (like a long document, image, video, etc), there is a third aspect of primary sources, materials, allowing them to be stored and accessed efficiently.

As a simple abstract example, primary sources of campaign contribution information might have a domain capturing that there are entity types Politician, Office, Donation, and Organization, that a Politician has text property givenName, relationship runningFor with Office, another property headShot that should be a unique identifier (that will select a file in the materials) and so forth. The domain might have thousands of entities of each type, e.g. a Politician with givenName of "Dan", runningFor an Office with its own properties, and a headShot value of "some_long_random_value". Finally, the materials might contain lots of image files, one of them named "some_long_random_value".

Simple start

There are several pre-made methods for creating new primary sources for common situations. The fundamental goal is to take some files in particular formats, and extract all the entities, relationships, and properties.

Details

We represent both domain and data using the RDF framework. Only the domain is strictly necessary when creating a new primary source: data and materials can be added later, or incrementally. The simplest way to create a new primary source is using the script:

python -m pyochre.primary_sources create --domain_file domain.ttl --name "My Research Area"

This would create new primary sources named "My Research Area", whose domain structure is described in the RDF Turtle format, with no data or materials yet.

Domain

The domain of the primary sources is expressed using RDF, and has several goals:

  • Map closely to human understanding and intuition
  • Avoid introducing debatable scholarly inferences
  • Define and constrain the form of information in the primary sources
  • Provide links from the domain into the broader space of human knowledge

Each of these requires careful consideration by the scholar, and can be sensitive to the field, the specific research, and available resources.

SHACL vocabulary

Converting various formats

The technical details and advanced methods are described further below, but on the simplest level, a CSV file can be converted by simply running:

python -m pyochre.primary_sources convert --format csv --input spreadsheet.csv --

How formats are converted

The general pattern for converting a non-RDF document is: as the format is read certain "events" fire, each of which is an opportunity to generate RDF triples based on the event and the current location in the document.

Events

Each event is either the start or end of something, a tag, and a dictionary of attributes. For instance, an event indicating the end of a single cell in a CSV file might be:

("end", "cell", {"n" : "day", "value" : "Monday"})

A location is the sequence of "start" events enclosing the current event, such as:

[
  ("start", "table", {"n" : "some_file.csv"}),
  ("start", "row", {"n" : "22"})
]

In plain English, this pair of event and location says "We are on row 22 of a CSV file named 'some_file.csv' and have just finished reading its cell corresponding to the column 'day' with value 'Monday'". No matter the format (CSV, XML, etc), events have the same structure (two strings and a dictionary). The possible values for tag will depend on the format (HTML won't ever have a "row" tag, but might have "div", "body", etc), as will the contents of the attributes dictionary.

Specification

The goal, therefore, is to specify how to take an (event, location) pair, and create zero or more RDF triples. This involves specifying rules that 1) can be determined to match such a pair, and 2) if they match, create one or more RDF triples from the information in the pair. Rules are specified in JSON, and have the form:

{
  "metadata" : (arbitrary dictionary),
  "namespaces" : (dictionary from prefix to URI),
  "predefined" : (list of triples),
  "rules" : (list of rules)
}
{
  "match" : (none|list of match specs),
  "create" : (list of creation specs)
}
{
  "event_type" : (none|list of matching event types, i.e. "start" and/or "end"),
  "tag" : (none|list of matching tags),
  "attributes" : (dict of attribute names to matching values in form (none|list of matching values)),
  "location" : (none|list of match specs)
}
{
  "subject" : (value_spec),
  "predicate" : (value_spec),
  "object" : (value_spec),
  "type" : (one of "domain", "data", "materials")
}
{
  "type" : (uri|literal|bnode),
  "value" : (mustache),
  "xml:lang" : (mustache),
  "datatype" : (mustache)
}

OCHRE Server

Quick start development environment

python -m pyochre.server migrate
python -m pyochre.server createcachetable

These instructions assume you have opened a command-line session, e.g. using Terminal on OSX, xterm on Linux, or PowerShell on Windows. In the latter case, the commands may differ slightly from what's described here, but not in substantial ways. It might be worthwhile to enable the Windows Linux Subsystem so you can work in Windows Terminal, where they should be identical.

The necessary software (beyond what will be installed directly from the command-line) is a recent version of Python 3, Git, and NPM, each of which has a corresponding lower-case command, so if they are installed you should be able to see usage info by running e.g.:

$ npm -h

Note that the initial $ sign indicates it is being run on the command-line (your "prompt" will look different, e.g. with information about your computer, username, etc). If you are missing one of these items, you should install them using the idiomatic approach for your operating system, e.g. on many versions of Linux you might run:

$ apt install npm -y

Getting the framework and installing libraries

The first step is to clone the code repository, and moving into the newly-created directory:

$ git clone https://github.com/comp-int-hum/ochre-python.git
$ cd ochre-python

The next step is to creating and activating a fresh Python virtual environment:

$ python3 -m venv local
$ source local/bin/activate

And then, installing all the Python libraries used by the framework, and the utility that will manage Javascript libraries:

$ pip install -r requirements.txt
$ npm install bower

Initializing the framework

At this point, you can start using the standard Django management script. First, initialize the database with the commands:

$ python -m pyochre.server migrate
$ python -m pyochre.server createcachetable

And collect various static resources:

$ python manage.py bower install
$ python manage.py collectstatic

Finally, the following command will start the framework:

$ python manage.py runserver localhost:8080

At this point you should be able to browse to http://localhost:8080 and interact with the site. Note that it will only be accessible on the local computer and this is by design: it is running without encryption, and using infrastructure that won't scale well and doesn't implement some important functionality. The populate script created three users (user1, user2, user3), with different roles (admin, faculty, student), each with password user. Though not too useful in the development scenario, you should also be able to add new users and change passwords from the login page: note that, rather than sending a link via email, in development mode the framework will print out information on the command-line where the runserver command is running. You can log in/out as the users while developing, so you can see and test your changes/new app/etc.

Editing code, templates, and settings

The best reference for how everything is organize is probably the Django documentation: the CDH framework is a fairly complex example, but generally tries to follow best practices in those documents. When you edit code or templates, in most cases the changes will be immediately reflected in the site. The exceptions to this are when changes are made to the database schema (i.e. when models are added, removed, or modified in the various models.py files). When such changes have been made, the local server should be stopped with <ctrl>-c, and commands like the following will compute the database changes, apply them, and restart the server:

python manage.py makemigrations
python manage.py migrate
python manage.py runserver localhost:8080

The file cdh/settings.py is the canonical configuration source for the entire framework, and is heavily commented with useful information (comments beginning with # CDH NOTE: are specifically relevant).

Adding a new app to the framework

An important common goal is to add a new app (i.e. another item on the top navigation bar of the site for logged in users). It's actually a simple process to get the basics in place:

  1. Choose a simple (lower-case, underscores-instead-of-spaces) name and initialize the app directory. If you choose my_great_thing, you would run: python manage.py startapp my_great_thing
  2. In the new directory, create a new file called urls.py (e.g. my_great_thing/urls.py) and add the following content:
    from django.urls import path
    from django.views.generic import TemplateView
    
     app_name = "my_great_thing"
     urlpatterns = [
     	path("", TemplateView.as_view(template_name="cdh/simple_interface.html", extra_context={"content" : "Nothing here!"}))
     ]
    
  3. Finally, choose a more human-friendly title and add an entry to the APPS dictionary at the top of cdh/settings.py, e.g. "my_great_app" : "My great app",.

Now, when you browse to http://localhost:8080 as a logged-in user, you should see your app under the Interaction menu. It has no content yet, but now you can focus on developing in the my_great_thing/ directory, which is a straightforward minimal Django app where you can build whatever you'd like, looking to existing apps for examples and code to import.

Developing complex apps

At the top of the example settings file there are several options set to False:

USE_LDAP = False
USE_CELERY = False
USE_POSTGRES = False
USE_JENA = False
USE_TORCHSERVE = False

These correspond to servers that the framework does, or will, rely on for key functionality that the development setup "fakes" by default. LDAP provides user and group management/authentication for the entire CDH infrastructure, Celery provides asynchronous execution of long-running computations like training and applying models, PostGRES is a production SQL database, and Jena provides storage and manipulation of RDF datasets. In each case, it is possible to start a suitable server on your computer, change the option to True, and edit the connection information further down in the settings file to connect it to your development site. Most importantly, you will need to install either Docker or Podman.

For local development it makes sense to ignore PostGRES and LDAP: they require additional setup and rarely impact development. So, in cdh/settings.py just change USE_CELERY, USE_JENA, and USE_TORCHSERVE to True, and run the following commands to start needed containers (replace podman run with docker run, depending on which you installed):

podman run -d --rm --name jena -p 3030:3030 -e ADMIN_PASSWORD=CHANGE_ME docker.io/stain/jena-fuseki
podman run -d --rm --name redis -p 6379:6379 docker.io/library/redis

Celery and Torchserve each need to have an additional command running alongside Django. Since Celery will be running functions defined in the Django code itself, it needs to have access to that code as it changes during development. Basically, we want Celery to behave the same way as the development web server, which reloads automatically when we edit the code base. In a different terminal (or tab), change into the cdh-domain directory, and run:

$ source local/bin/activate
$ watchmedo auto-restart --directory ./ --pattern=*.py --recursive -- celery -A cdh worker -l DEBUG

You should generally be able to forget about that terminal, as it will restart the Celery worker whenever the code changes. Finally, TorchServe should also be run in another terminal/tab, with the following command:

$ source local/bin/activate
$ torchserve --model-store cdh_site_data/models/ --foreground --no-config-snapshots --ts-config utils/torchserve.properties

At this point, with the two containers running (can be verified with podman ps), and Celery and TorchServe running in separate terminals, the entire site should work near-identically to production. The only functionality lacking is sending email, which is how we let new users sign up, reset their passwords, etc: this is because there is no way around needing a real email account to send from. Under this configuration, emails that would normally be sent will instead be printed to the console where the Django server is running.

Production deployment (not relevant for development)

A goal of this repository is that the process described above for development only needs a handful of changes to deploy to the actual CDH site (test or production). This section documents how to deploy the site and associated servers, but probably isn't important for developing the site content or functionality. The CDH infrastructure looks something like this:

flowchart LR
  subgraph Wyatt
  end
  subgraph Spenser
  end
  subgraph Larkin
  end
  subgraph Bishop
    gpu(GPU)
    celery(Celery)
    ldap(LDAP)
    torchserve(TorchServe)
  end
  subgraph Ludwig
    celerydev(Celery)
    torchservedev(TorchServe)
    postgres(PostGRES)
    jena(Jena)
  end
  subgraph Marcel
    raid(RAID10)
    nfs(NFS)
  end
  subgraph CDH
    django(Django)
  end
  classDef servers fill:#f96;
  classDef workstations fill:#096;
  class Bishop,Ludwig,Marcel,CDH servers;
  class Larkin,Wyatt,Spenser workstations;

Preamble

From a default installation of Debian 11, the following steps should get you pretty close to a functional web server. It assumes that you are logged in as root, or have been elevated to root via sudo. First, install needed repositories and software, create a cdh user, and make a few directories for static files and certificates:

apt update
apt install software-properties-common -y
apt-add-repository contrib
apt-add-repository non-free
apt update
apt install python3-venv rsync nginx postgresql postgresql-client gcc python3-dev postgresql-server-dev-all git unzip npm -y
useradd -m -s /bin/bash cdh
mkdir /mnt/static /mnt/media /mnt/uploads /mnt/certs
chown cdh:cdh /mnt/static /mnt/media /mnt/uploads /mnt/certs
chmod go-rwx /mnt/static /mnt/media /mnt/uploads /mnt/certs

Web server

Copy encryption certificates into place, e.g. as provided by JHU IT services:

cp SIGNED_CERT_FILE /mnt/certs/site.pem
cp KEY_FILE /mnt/certs/site.key
chmod go-rwx /mnt/certs/site.key

Set up Postgres, Gunicorn, and NginX, noting the password you select for the "cdh" database user:

sudo -u postgres createuser cdh -P -d
sudo -u postgres createdb -O cdh cdh
cp utils/gunicorn.service /etc/systemd/system/
cp utils/gunicorn.socket /etc/systemd/system/
cp utils/cdh /etc/nginx/sites-available/
ln -sf /etc/nginx/sites-available/cdh /etc/nginx/sites-enabled/cdh
rm /etc/nginx/sites-enabled/default
systemctl daemon-reload
systemctl enable gunicorn
systemctl start gunicorn
systemctl restart nginx

Next, switch to the cdh user, clone this repository into the home directory, set up a Python virtual environment therein, and follow the steps described earlier, changing development_requirements.txt to production_requirements.txt.

Finally, as root, restart the server process:

systemctl restart gunicorn

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyochre-0.0.2.tar.gz (9.2 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page