
Convert MARC21 Classification records in MARC/XML to SKOS/RDF

Build status Test coverage Code health Latest version MIT license

Python script for converting MARC 21 Classification and MARC 21 Authority records (serialized as MARCXML) to SKOS concepts.

Initially developed to support the project “Felles terminologi for klassifikasjon med Dewey”, for converting Dewey Decimal Classification (DDC) records. Issues and suggestions for generalizations and improvements are welcome!

See mapping schema for MARC21 Classification and for MARC21 Authority below.

Installation

Releases can be installed from the command line with pip:

$ pip install --upgrade mc2skos             # with virtualenv or as root
$ pip install --upgrade --user mc2skos      # install to ~/.local

  • Works with both Python 2.7 and 3.4+. See Travis for details on tested Python versions.

  • If lxml fails to install on Windows, try the Windows installer from PyPI.

  • If lxml fails to install on Unix, install the system packages python-dev and libxml2-dev.

  • Make sure the Python scripts folder has been added to your PATH.

To use a version directly from the source code repository:

$ git clone https://github.com/scriptotek/mc2skos.git
$ cd mc2skos
$ pip install -e .

Usage

mc2skos infile.xml outfile.ttl      # from file to file
mc2skos infile.xml > outfile.ttl    # from file to standard output

Run mc2skos --help or mc2skos -h for options.

URIs

URIs are generated automatically for known concept schemes, identified from 084 $a for classification records and from 008[11] / 040 $f for authority records. To list known concept schemes:

$ mc2skos -l

To add more vocabularies, you can edit vocabularies.yml. Pull requests for adding more vocabularies are very welcome!
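As a rough illustration of the lookup described above, the following sketch reads 084 $a from a classification record using only the Python standard library. The sample record and the helper name scheme_code are made up for this example, and mc2skos itself may implement the lookup differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC 21 slim") records
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical minimal classification record, for illustration only
SAMPLE = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">example-001</controlfield>
  <datafield tag="084" ind1="0" ind2=" ">
    <subfield code="a">ddc</subfield>
  </datafield>
</record>
"""


def scheme_code(record):
    """Return the value of 084 $a, or None if the field is missing."""
    node = record.find("marc:datafield[@tag='084']/marc:subfield[@code='a']", NS)
    return node.text if node is not None else None


record = ET.fromstring(SAMPLE)
print(scheme_code(record))
# ddc
```

The returned code is what would then be matched against the list of known concept schemes.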

URIs can also be generated on the fly from a URI template specified with option --uri. The following template parameters are recognized:

  • {control_number} is the control number from 001, 010 or 016. The current approach is to use 010 or 016 if defined, otherwise 001. If you find examples where this approach fails, please add them to [#42](https://github.com/scriptotek/mc2skos/issues/42).

  • {collection} is “class”, “table” or “scheme”

  • {object} is a member of the classification scheme and part of a {collection}, such as a specific class or table. Spaces in the URI are replaced by hyphens or another character configured with option --whitespace.

  • {edition} is taken from 084 $c (with language code stripped)
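The template parameters above can be sketched in a few lines of Python. This is only an approximation of the behaviour described here, not the actual mc2skos code: the helper names expand_uri and pick_control_number are hypothetical, the template is an assumption chosen to match the dewey.info URIs used in this document, and checking 010 before 016 is an arbitrary choice.

```python
def pick_control_number(fields):
    """Prefer 010 or 016 when present, otherwise fall back to 001.

    `fields` is a hypothetical dict mapping MARC tags to already-extracted
    values. Checking 010 before 016 is an assumption made for this sketch.
    """
    for tag in ("010", "016", "001"):
        value = fields.get(tag)
        if value:
            return value
    return None


def expand_uri(template, whitespace="-", **params):
    """Fill a URI template, replacing spaces in the parameter values."""
    cleaned = {key: str(value).replace(" ", whitespace) for key, value in params.items()}
    return template.format(**cleaned)


# Assumed template, not taken from the mc2skos configuration
TEMPLATE = "http://dewey.info/{collection}/{object}/{edition}/"

print(expand_uri(TEMPLATE, collection="class", object="6--982", edition="e21"))
# http://dewey.info/class/6--982/e21/
print(pick_control_number({"001": "abc123"}))
# abc123
```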

To add skos:inScheme statements to all records, a URI template can be specified with option --scheme. Otherwise, it will be derived from a default template if the concept scheme is known.

To add an additional skos:inScheme statement to table records, a URI template can be specified with option --table_scheme. Otherwise, it will be derived from a default template if the concept scheme is known.

The following example is generated from a DDC table record:

<http://dewey.info/class/6--982/e21/> a skos:Concept ;
    skos:inScheme <http://dewey.info/scheme/edition/e21/>,
                  <http://dewey.info/table/6/e21/> ;
    skos:notation "T6--982" ;
    skos:prefLabel "Chibchan and Paezan languages"@en .

Mapping schema for MARC21 Classification

Only a small part of the MARC21 Classification data model is converted, and the conversion follows a rather pragmatic approach, exemplified by the mapping of the 7XX fields to skos:altLabel.

| MARC21XML | RDF |
| --- | --- |
| 001 Control Number (see note above on 001, 010 & 016) | dcterms:identifier |
| 005 Date and time of latest transaction | dcterms:modified |
| 008[0:6] Date entered on file | dcterms:created |
| 008[8]="d" or "e" Classification validity | owl:deprecated |
| 010 Control Number (see note above on 001, 010 & 016) | dcterms:identifier |
| 016 Control Number (see note above on 001, 010 & 016) | dcterms:identifier |
| 153 $a, $c, $z Classification number | skos:notation |
| 153 $j Caption | skos:prefLabel |
| 153 $e, $f, $z Classification number hierarchy | skos:broader |
| 253 Complex See Reference | skos:editorialNote |
| 353 Complex See Also Reference | skos:editorialNote |
| 680 Scope Note | skos:scopeNote |
| 683 Application Instruction Note | skos:editorialNote |
| 684 Auxiliary Instruction Note | skos:editorialNote |
| 685 History Note | skos:historyNote |
| 700 Index Term-Personal Name | skos:altLabel |
| 710 Index Term-Corporate Name | skos:altLabel |
| 711 Index Term-Meeting Name | skos:altLabel |
| 730 Index Term-Uniform Title | skos:altLabel |
| 748 Index Term-Chronological | skos:altLabel |
| 750 Index Term-Topical | skos:altLabel |
| 751 Index Term-Geographic Name | skos:altLabel |
| 753 Index Term-Uncontrolled | skos:altLabel |
| 765 Synthesized Number Components | mads:componentList (see below) |
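The two 008 rows in the mapping above rely on fixed character positions. A minimal sketch of how they could be read, assuming a made-up 008 value and a hypothetical helper parse_008 (the real converter may handle edge cases differently):

```python
from datetime import datetime


def parse_008(value):
    """Extract the creation date and the deprecation flag from an 008 field."""
    # Positions 0-5 hold the date entered on file as yymmdd.
    # Python's %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068.
    created = datetime.strptime(value[0:6], "%y%m%d").date()
    # Position 8 holds the validity code; "d" and "e" mark a deprecated record.
    deprecated = value[8] in ("d", "e")
    return created, deprecated


# Made-up 008 value, for illustration only
created, deprecated = parse_008("091203aaaaaaaa")
print(created.isoformat(), deprecated)
# 2009-12-03 False
```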

Synthesized number components

Components of synthesized numbers explicitly described in 765 fields are expressed using the mads:componentList property, and to preserve the order of the components, we use RDF lists. Example:

@prefix mads: <http://www.loc.gov/mads/rdf/v1#> .

<http://dewey.info/class/001.30973/e23/> a skos:Concept ;
    mads:componentList (
        <http://dewey.info/class/001.3/e23/>
        <http://dewey.info/class/1--09/e23/>
        <http://dewey.info/class/2--73/e23/>
    ) ;
    skos:notation "001.30973" .

Retrieving list members in order is surprisingly hard with SPARQL. Retrieving ordered pairs is the best solution I’ve come up with so far:

PREFIX mads: <http://www.loc.gov/mads/rdf/v1#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?c1_notation ?c1_label ?c2_notation ?c2_label
WHERE { GRAPH <http://localhost/ddc23no> {

    <http://dewey.info/class/001.30973/e23/> mads:componentList ?l .
    ?l rdf:rest* ?sl .
    ?sl rdf:first ?e1 .
    ?sl rdf:rest ?sln .
    ?sln rdf:first ?e2 .

    ?e1 skos:notation ?c1_notation .
    ?e2 skos:notation ?c2_notation .

    OPTIONAL {
        ?e1 skos:prefLabel ?c1_label .
    }
    OPTIONAL {
        ?e2 skos:prefLabel ?c2_label .
    }
}}

| c1_notation | c1_label | c2_notation | c2_label |
| --- | --- | --- | --- |
| "001.3" | "Humaniora"@nb | "T1--09" | "Historie, geografisk behandling, biografier"@nb |
| "T1--09" | "Historie, geografisk behandling, biografier"@nb | "T2--73" | "USA"@nb |
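Once the ordered pairs have been fetched, the full component sequence can be rebuilt on the client side. The following sketch shows one way to do this; the function chain_pairs is hypothetical and assumes that each component occurs only once in the list, so that pairs can be linked by value no matter in which order the query returns them.

```python
def chain_pairs(pairs):
    """Rebuild the full sequence from consecutive (first, second) pairs."""
    successor = dict(pairs)
    if not successor:
        return []
    seconds = set(successor.values())
    # The head of the list is the only item that never appears as a second element.
    heads = [first for first in successor if first not in seconds]
    if len(heads) != 1:
        raise ValueError("pairs do not form a single chain")
    ordered = [heads[0]]
    # Follow the successor links; the loop is bounded by the number of pairs.
    for _ in range(len(successor)):
        following = successor.get(ordered[-1])
        if following is None:
            break
        ordered.append(following)
    return ordered


# Notation pairs from the result above, deliberately given out of order
rows = [("T1--09", "T2--73"), ("001.3", "T1--09")]
print(chain_pairs(rows))
# ['001.3', 'T1--09', 'T2--73']
```

Note that a list with a single member produces no pairs at all with the query above, so that case would need separate handling.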

Additional conversion rules for WebDewey data

The script comes with a few extra rules for distinguishing between different types of notes in WebDewey records and for extracting entities from them. The entity extraction rules (marked with [*] below) use a non-standard namespace and are not enabled by default. Specify the --webdewey flag to use them.

| MARC21XML | RDF |
| --- | --- |
| 680 having $9 ess=ndf Definition note | skos:definition |
| 680 having $9 ess=nvn Variant name note | wd:variantName [*] for each subfield $t |
| 680 having $9 ess=nch Class here note | wd:classHere [*] for each subfield $t |
| 680 having $9 ess=nin Including note | wd:including [*] for each subfield $t |
| 680 having $9 ess=nph Former heading | wd:formerHeading [*] for each subfield $t |
| 694 having $9 ess=nml ??? | skos:editorialNote |
| 7XX having $9 ess=isCaption Relative index term to use as caption | skos:prefLabel |
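To make the "for each subfield $t" rules more concrete, here is a small sketch that collects the $t values from 680 fields carrying a given $9 code. The sample field and the helper name note_entities are invented for this example, and the sketch assumes that $9 holds exactly the string shown in the table (for instance ess=nch).

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC 21 slim") records
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical 680 field, for illustration only
SAMPLE = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="680" ind1="1" ind2=" ">
    <subfield code="i">Class here</subfield>
    <subfield code="t">first topic</subfield>
    <subfield code="i">and</subfield>
    <subfield code="t">second topic</subfield>
    <subfield code="9">ess=nch</subfield>
  </datafield>
</record>
"""


def note_entities(record, ess_code):
    """Collect $t values from 680 fields whose $9 equals ess=<ess_code>."""
    entities = []
    for field in record.findall("marc:datafield[@tag='680']", NS):
        markers = [sf.text for sf in field.findall("marc:subfield[@code='9']", NS)]
        if "ess=" + ess_code in markers:
            entities.extend(sf.text for sf in field.findall("marc:subfield[@code='t']", NS))
    return entities


record = ET.fromstring(SAMPLE)
print(note_entities(record, "nch"))
# ['first topic', 'second topic']
```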

Notes that are currently not treated in any special way:

  • 253 having $9 ess=nsx Do-not-use

  • 253 having $9 ess=nce Class-elsewhere

  • 253 having $9 ess=ncw Class-elsewhere-manual

  • 253 having $9 ess=nse See

  • 253 having $9 ess=nsw See-manual

  • 353 having $9 ess=nsa See-also

  • 683 having $9 ess=nbu Preference note

  • 683 having $9 ess=nop Options note

  • 683 having $9 ess=non Options note

  • 684 having $9 ess=nsm Manual note

  • 685 having $9 ess=ndp Discontinued partial

  • 685 having $9 ess=nrp Relocation

  • 689 having $9 ess=nru Sist brukt i… ("Last used in…")

Mapping schema for MARC21 Authority

Only a small part of the MARC21 Authority data model is converted.

| MARC21XML | RDF |
| --- | --- |
| 001 Control Number | dcterms:identifier |
| 005 Date and time of latest transaction | dcterms:modified |
| 008[0:6] Date entered on file | dcterms:created |
| 065 Other Classification Number | skos:exactMatch (see below) |
| 080 Universal Decimal Classification Number | skos:exactMatch (see below) |
| 083 Dewey Decimal Classification Number | skos:exactMatch (see below) |
| 1XX Headings | skos:prefLabel |
| 4XX See From Tracings | skos:altLabel |
| 5XX See Also From Tracings | skos:related, skos:broader or skos:narrower (see below) |
| 667 Nonpublic General Note | skos:editorialNote |
| 670 Source Data Found | skos:note |
| 677 Definition | skos:definition |
| 678 Biographical or Historical Data | skos:note |
| 680 Public General Note | skos:note |
| 681 Subject Example Tracing Note | skos:example |
| 682 Deleted Heading Information | skos:changeNote |
| 688 Application History Note | skos:historyNote |
| 7XX Heading Linking Entries | skos:xxxMatch (see below) |

Notes:

  • Mappings are generated for 065, 080 and 083 only if a URI pattern for the classification scheme has been defined in the config.

  • SKOS relations are generated from 5XX fields if the fields contain a $0 subfield containing either a control number or a URI for the related record. The relationship type is skos:broader if $w=g, skos:narrower if $w=h, and skos:related otherwise. If $w=r and $4 contains a URI, that URI is used as the relationship type. Note that $4 must precede $0 (since both subfields can be repeated).

  • Mappings/relationships are generated for 7XX headings if the fields contain a $0 subfield containing either the control number or the URI of the related record. If $0 contains a control number, a URI pattern for the vocabulary (found in indicator 2 or $2) must be defined in mc2skos.record.CONFIG. If $4 contains a URI, that URI is used as the relationship type. Otherwise, if $4 contains one of the ISO 25964 relations, the corresponding SKOS relation is used. Otherwise, the default value skos:closeMatch is used. Note that $4 must precede $0 (since both subfields can be repeated).
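The 5XX rule in the notes above can be read as a single pass over the subfields in record order, which also shows why $4 has to come before $0. The sketch below is one possible reading, not the actual mc2skos implementation: the function relations_5xx is hypothetical, $w is compared as a whole value as in the text above, and a $4 URI is assumed to apply to every $0 that follows it.

```python
def relations_5xx(subfields):
    """Return (relation, target) pairs for one 5XX field.

    `subfields` is a list of (code, value) tuples in record order.
    """
    w = next((value for code, value in subfields if code == "w"), None)
    relation_uri = None
    relations = []
    for code, value in subfields:
        if code == "4" and value.startswith("http"):
            # Remember the relationship URI for any $0 that follows.
            relation_uri = value
        elif code == "0":
            if w == "g":
                relation = "skos:broader"
            elif w == "h":
                relation = "skos:narrower"
            elif w == "r" and relation_uri:
                relation = relation_uri
            else:
                relation = "skos:related"
            relations.append((relation, value))
    return relations


print(relations_5xx([("a", "Cats"), ("w", "g"), ("0", "123")]))
# [('skos:broader', '123')]
print(relations_5xx([("w", "r"), ("4", "http://example.org/rel"), ("0", "456")]))
# [('http://example.org/rel', '456')]
print(relations_5xx([("w", "r"), ("0", "456"), ("4", "http://example.org/rel")]))
# [('skos:related', '456')]
```

In the last call the $4 arrives after the $0, so the custom relationship is not picked up and the default skos:related is used instead.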
