IOISIS - I/O tools for converting ISIS data in Python

This is a Python library with a command line interface (CLI) intended to access data from ISIS database files and convert among distinct file formats.

The converters available in the CLI are:

Command            Description
bruma-mst2csv      MST+XRF to CSV based on Bruma
bruma-mst2jsonl    MST+XRF to JSON Lines based on Bruma
csv2iso            CSV to ISO2709
csv2jsonl          CSV to JSON Lines
csv2mst            CSV to ISIS/FFI Master File Format
iso2csv            ISO2709 to CSV
iso2jsonl          ISO2709 to JSON Lines
jsonl2csv          JSON Lines to CSV
jsonl2iso          JSON Lines to ISO2709
jsonl2mst          JSON Lines to ISIS/FFI Master File Format
mst2csv            ISIS/FFI Master File Format to CSV
mst2jsonl          ISIS/FFI Master File Format to JSON Lines

Note: The bruma-* commands and the bruma module use a specific pre-compiled version of Bruma through JPype, which requires the JVM (Java Virtual Machine). The iso and mst modules, as well as the other modules and CLI commands, don't require Bruma. Bruma only gets downloaded on its first use.

The Python-based alternative to Bruma was created from scratch, and it's based on Construct, a Python library that allows a declarative implementation of binary file structures for both parsing and building. Currently, the ISO (ISO2709-based file format), MST (ISIS/FFI Master file format) and XRF (ISIS/FFI Cross-reference file format) file formats can be parsed/built with the library, but XRF files aren't used or built by the Bruma-independent library/CLI.

Most details of the parse/build process can be configured in both the library and the CLI, including the several variations of the MST file that are specific to CISIS. CISIS has a serialization behavior dependent on the architecture and on its compilation flags, but ioisis can deal with most (perhaps all) of the distinct MST "file formats" that can be generated/read by some specific CISIS version.

Everything in ioisis is platform-independent, and most of its defaults are based on the lindG4 version of CISIS, and on the isis2json MongoDB type 1 (-mt1) output. The --xylose option of several CLI commands switches the JSONL defaults to use the dictionary structure expected by Xylose.

Installation and testing

It requires Python 3.6+, and it's prepared to be tested in every Python version with tox and pytest.

# Installation
pip install ioisis

# Testing (one can install tox with "pip install tox")
tox                      # Test on all Python versions
tox -e py38 -- -k scanf  # Run "scanf" tests on Python 3.8

Command Line Interface (CLI)

To use the CLI, run ioisis or python -m ioisis. Examples:

# Convert file.mst to a JSONL in the standard output stream
ioisis mst2jsonl file.mst

# Convert file.iso in UTF-8 to an ASCII file.jsonl
ioisis iso2jsonl --ienc utf-8 --jenc ascii file.iso file.jsonl

# Convert file.jsonl to file.iso where the JSON lines are like
# {"tag": ["field", ...], ...}
ioisis jsonl2iso file.jsonl file.iso

# Convert big-endian lindG4 MST data to CSV (one line for each field)
# ignoring noise in the MST file that might appear between records
# (it can access data from corrupt MST files)
ioisis mst2csv --ibp ignore --be file.mst file.csv

# Convert active and logically deleted records from file.mst
# to filtered.mst, selecting records and filtering out fields with jq,
# using a "v" prefix to the field tags,
# resetting the MFN to 1, 2, etc. while keeping its order
# instead of using the in-file order, besides enforcing a new encoding,
# with a file that might already have some records partially in UTF-8
ioisis bruma-mst2jsonl --all --ftf v%z --menc latin1 --utf8 file.mst \
| jq -c 'select(.v35 == ["PRINT"]) | del(.v901) | del(.v540)' \
| ioisis jsonl2mst --ftf v%z --menc latin1 - filtered.mst

By default, the input and output are the standard streams, but some commands require a file name, not a pipe/stream. Bruma requires the MST input to be a file name since the XRF will be found based on it (only the bruma-* commands require the XRF). The *2mst commands require a file name for the MST output because its first record (the control record) has some information that's only available after generating the entire file (i.e., it's written at the end), which makes random access a requirement.

All commands have an alias: their name with each extension reduced to its first character (or b for bruma-). Try ioisis --help for more information about all commands and ioisis csv2mst --help for the specific csv2mst help (every command has its own help).

The encoding of each file is explicit through a --_enc option, where the _ should be replaced by the first letter of the file extension; hence --menc is the MST encoding, --cenc the CSV encoding, and so on. For the bruma-* commands, the --menc is handled in Java; all other encoding options are handled in Python. The --utf8 option forces the input to be handled as UTF-8, and only the parts of it that aren't in that encoding are handled by the specific file format encoding, that is, the --_enc option becomes a fallback for UTF-8. This helps when loading data from databases with mixed encodings.
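
For instance, a hypothetical invocation combining these options could be:

# Hedged example: read an MST whose contents are mostly UTF-8,
# falling back to CP1252 for the parts that aren't valid UTF-8,
# and write ASCII-escaped JSON Lines to the standard output
ioisis mst2jsonl --utf8 --menc cp1252 --jenc ascii file.mst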

JSON/CSV mode, field and subfield processing

There are several other options to the CLI commands intended to customize the process; perhaps the most important of these is -m/--mode, which controls the field and record formats in JSONL files (and -M/--cmode, which does the same for CSV files). The valid values for it are:

  • field (default): Use the raw field value string (ignore the subfield parsing options)
  • pairs: Split the field string as an array of [key, value] subfield pairs
  • nest: Split the field string as a {key: value} object, keeping the last subfield value of a key when the key appears more than once
  • inest: CISIS-like subfield nesting, similar to nest, but keeps the first entry of a repeated key instead of the last one (only makes a difference when --no-number)
  • tidy: Tabular format where the records are split so that each field is rendered as a single JSON line like {"mfn": mfn, "index": index, "tag": field_key, "data": value}
  • stidy: Subfield tidy format, similar to the tidy format, but the fields are themselves split so that each subfield is rendered as a single JSON line in the result, including the subfield key in the "sub" key of the result

When used together with --no-number, the field, pairs and nest modes are respectively similar to the -mt1, -mt2 and -mt3 options of isis2json. The inest mode isn't available in isis2json, it follows the CISIS behavior on subfield querying instead. For CSV, only the tidy and stidy formats are available, given that the remaining formats aren't tabular.
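
As a rough illustration (hypothetical field value, assuming the ISIS-style "^" subfield prefix), a record whose tag 100 holds the single field "^aNikola^bTesla" might produce JSON lines roughly like:

# --mode=field (raw field value, subfield options ignored)
{"100": ["^aNikola^bTesla"]}
# --mode=pairs (each field as an array of [key, value] subfield pairs)
{"100": [[["a", "Nikola"], ["b", "Tesla"]]]}
# --mode=nest (each field as a {key: value} object)
{"100": [{"a": "Nikola", "b": "Tesla"}]}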

The --ftf option expects a field tag formatter template for processing the field tag, and it's the same for both JSON/CSV output (rendering/building) and input (parsing). These are the interpreted sequences:

  • %d: Tag number
  • %r: Tag as a string in its raw format
  • %z: Same as %r, but removes the leading zeros from ISO tags
  • %i: Field index number in the record, starting from zero
  • %%: Escape for the % character

Note: the %d and %i options might have a numeric parameter in the middle like printf's %d (e.g. recall "%03d" % 15 in Python).
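
For example, for an ISO tag stored as "001" (hypothetical values):

  • %d renders 1, and %03d renders 001
  • %r renders 001, while %z renders 1
  • v%z renders v1 (the form used by --xylose)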

For the subfield processing, there are several options available:

  • --prefix: Character/string that starts a new subfield in the field text
  • --length: Size of the subfield key/tag (number of characters)
  • --lower/--no-lower: Toggle the normalization of the subfield key/tag, which is performed by simply lowercasing it
  • --first: The subfield key/tag to be used by the leading field data before the first prefix appears
  • --empty/--no-empty: Toggle to show/hide the subfields with no characters at all (apart from the subfield key/tag)
  • --number/--no-number: Repeated subfield keys are handled by adding a number suffix to them, starting from 1 in the first repeat; this option toggles that behavior (whether to add the suffix or not)
  • --zero/--no-zero: Choose whether the first occurrence of each subfield key in a field should have a 0 suffix, following the numbering described in the previous option (it has no effect with --no-number)
  • --sfcheck/--no-sfcheck (for JSONL/CSV input only): Check if the specification of the subfield parsing/unparsing rules given in the previous parameters would resynthesize all input fields exactly in the way they appear
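
Illustrating the numbering options above with a hypothetical field "^aX^aY" (repeated "a" key, ISIS-style "^" prefix) in nest mode:

# --number            -> {"a": "X", "a1": "Y"}
# --number --zero     -> {"a0": "X", "a1": "Y"}
# --no-number (nest)  -> {"a": "Y"}  (last occurrence wins)
# --no-number (inest) -> {"a": "X"}  (first occurrence wins)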

The --xylose option is just an alternative way of using "--mode=inest --ftf=v%z". To be more similar to the isis2json output while still using the format expected by Xylose, use "--mode=nest --no-number --ftf=v%z" instead.

Common MST/ISO input options

Both MST and ISO records have a STATUS flag, which answers this question: is this record logically deleted? A STATUS equal to 1 means True (deleted), 0 means False (active).

Every record in the MST file structure has an MFN, a serial number/ID of the record in the database. A major difference between the bruma-mst2* commands and the mst2* ones is the way they handle the MFN: Bruma always accesses the MST file through the XRF file, jumping between addresses to iterate through the records sorted by MFN, whereas the Python implementation gets the records in their block/offset order (i.e., the order they appear in the input file). For ISO files, there's no MFN stored, but ioisis can generate it (starting from 1, like common MST records) if it's required (e.g. for creating CSV files).

These options are common to several commands when reading from MST or ISO files:

  • --only-active/--all: Flag to select whether the STATUS=1 records (logically deleted records) should be in the output or not
  • --prepend-mfn/--no-mfn: Add an artificial field mfn at the beginning of each record with the record MFN as a string (though it's always a number)
  • --prepend-status/--no-status: Add an artificial field status at the beginning of each record with the record STATUS as a string (though it's usually just zero or one)
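
For example, keeping the logically deleted records and exposing both artificial fields (a hypothetical invocation combining the options above):

ioisis mst2jsonl --all --prepend-mfn --prepend-status file.mst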

ISO-specific options

The ISO file can be seen as just a sequence of records glued together. Each record has 3 parts: a leader, a directory and field values. The leader has some metadata, most of which is only accessible through the library, not the CLI (only the STATUS is used by the CLI). The directory is a sequence of constant-sized structures (directory items), each of them representing a single field (its tag, its value length and its relative offset), which is matched with its respective value in the last part of the record.

Inside the ISO file, after the directory and between the field values, there's a field terminator. At the end of the record there's both a field terminator and, finally, a record terminator. By default, CISIS uses "#" as the terminator, the same one for field and record, and that's also the ioisis default, but it's not always the case for input/output files. For example, in the MARC21 specification the field terminator is the "\x1e" character and the record terminator is the "\x1d" character.

These are the options for ISO I/O commands:

  • --ft: ISO Field terminator
  • --rt: ISO Record terminator
  • --line: Line length for splitting a record (not counting the EOL)
  • --eol: End of line (EOL) character or string, ignored if --line=0

The default values for them are the CISIS ones, which are intended to make it possible to see the ISO file as a common text file. By default, every ISO record (raw bytes) is split into lines of 80 bytes, and an EOL gets printed after the record terminator, so two records won't share the same line. The line splitting is a CISIS-specific behavior; it's required in order for CISIS to open the ISO files it exports, and it might make debugging easier. Using "--line=0" disables this behavior, joining everything as a single huge line. The terminators might have more than one character, as well as the EOL, and these 3 parameters (like other inputs shown as BYTES in the help) are parsed by the CLI, so "\t" is recognized as the TAB character and "\n" as a LF (Line Feed).
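
For instance, reading an ISO file exported without line splitting, and a hypothetical MARC-like file (assuming the BYTES parser also accepts \x-style escapes, which isn't stated above):

# ISO file exported as one single long line (no line splitting)
ioisis iso2jsonl --line 0 file.iso file.jsonl

# Hypothetical MARC-like terminators
ioisis iso2jsonl --ft '\x1e' --rt '\x1d' --line 0 marc.iso marc.jsonl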

MST-specific options (Python/construct)

The options shown here regard the Python implementation of the MST file format builder/parser; they are not available for the bruma-* commands.

The ISIS/FFI Master File Format (MST file) structure is a binary file composed of joined records. Its overall structure is documented in Appendix G of the Mini-micro CDS/ISIS: reference manual (version 2.3); however, that documentation is incomplete: several enhancements have been made to the file structure in order to make it possible to fit more data in these databases. Nevertheless, the MST file is still a file with joined records, where each record has 3 blocks: leader, directory and field values. It's similar to an ISO file with empty field and record terminators, but the leader and directory items are binary, the metadata isn't the same, and the padding, alignment and sizes are quite hard to properly grasp.

This is the internal structure of the leader and a directory item in a single record of a MST file (it doesn't apply to the control record):

                   -------------------------------------------------
                  |    Format | ISIS     ISIS     FFI      FFI      |
                  | Alignment | 2        4        2        4        |
 -----------------------------+-------------------------------------|
|         Leader size (bytes) | 18       20       22       24       |
| Directory item size (bytes) | 6        6        10       12       |
|-----------------------------+-------------------------------------|
|           |      00-01      | MFN.1    MFN.1    MFN.1    MFN.1    |
|           |      02-03      | MFN.2    MFN.2    MFN.2    MFN.2    |
|           |      04-05      | MFRL     MFRL     MFRL.1   MFRL.1   |
|           |      06-07      | MFBWB.1  (filler) MFRL.2   MFRL.2   |
|           |      08-09      | MFBWB.2  MFBWB.1  MFBWB.1  MFBWB.1  |
|  Leader   |      10-11      | MFBWP    MFBWB.2  MFBWB.2  MFBWB.2  |
|           |      12-13      | BASE     MFBWP    MFBWP    MFBWP    |
|           |      14-15      | NVF      BASE     BASE.1   (filler) |
|           |      16-17      | STATUS   NVF      BASE.2   BASE.1   |
|           |      18-19      |          STATUS   NVF      BASE.2   |
|           |      20-21      |                   STATUS   NVF      |
|           |      22-23      |                            STATUS   |
|-----------+-----------------+-------------------------------------|
|           |      00-01      | TAG      TAG      TAG      TAG      |
|           |      02-03      | POS      POS      POS.1    (filler) |
| Directory |      04-05      | LEN      LEN      POS.2    POS.1    |
|   item    |      06-07      |                   LEN.1    POS.2    |
|           |      08-09      |                   LEN.2    LEN.1    |
|           |      10-11      |                            LEN.2    |
 -----------+-----------------+-------------------------------------|
            |  Offset (bytes) |              Structure              |
             -------------------------------------------------------

These structure names follow the Mini-micro CDS/ISIS reference manual, where the ".1" and ".2" suffixes are there to expose where the field has 4 bytes; otherwise the field has just 2 bytes. The starting offset of every field must be an integer multiple of the alignment number, hence the fillers. The endianness doesn't change the position of any of these fields, it just changes the order of the 2 or 4 bytes of the field itself (where little endian, known as "swapped" in CISIS, means that the last byte of the data is at the lowest address/offset). Most of the structure shown so far can be controlled through three parameters: the Format, the Intra-record alignment and the Endianness. These are the two possible formats:

  • ISIS file format: The original standard documented in the reference manual
  • FFI file format: An alternative created to overcome the record size limit imposed by the 16-bit MFRL, doubling it and all the other fields that have something to do with the internal offsets of a record

These are the MST-specific options that control the main structure of its records:

  • --end: Tells whether the bytes of each field are big or little endian; --be and --le are shorthands for these, respectively
  • --format: Choose the isis or ffi file format; --isis and --ffi are shorthands for these
  • --packed/--unpacked: These control the leader/directory alignment; packed means that their alignment is 2, whereas unpacked means that their alignment is 4
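
For example, reading a little-endian ("swapped") FFI file with 4-byte intra-record alignment (a hypothetical combination of the shorthands above):

ioisis mst2jsonl --ffi --unpacked --le file.mst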

The MST file has a leading record called the control record, whose MFN (Master file number, where "file" stands for a record) is zero. It has this 32-byte structure (apart from a trailing filler of 32 bytes in CISIS):

 -----------------------------
|  Offset (bytes) | Structure |
|-----------------+-----------|
|      00-01      | CTLMFN.1  |
|      02-03      | CTLMFN.2  |
|      04-05      | NXTMFN.1  |
|      06-07      | NXTMFN.2  |
|      08-09      | NXTMFB.1  |
|      10-11      | NXTMFB.2  |
|      12-13      | NXTMFP    |
|      14-15      | TYPE      |
|      16-17      | RECCNT.1  |
|      18-19      | RECCNT.2  |
|      20-21      | MFCXX1.1  |
|      22-23      | MFCXX1.2  |
|      24-25      | MFCXX2.1  |
|      26-27      | MFCXX2.2  |
|      28-29      | MFCXX3.1  |
|      30-31      | MFCXX3.2  |
 -----------------------------

The most important field in there is the TYPE shown above, which is written as MFTYPE in the CDS/ISIS reference manual, but TYPE actually holds two single-byte fields, and the order of these two is the only multi-field scenario that depends on the endianness:

  • MSTXL (most significant byte): The offset shift in all XRF entries (to be discussed)
  • MFTYPE (least significant byte): The master file type (should always be zero for user database files)

We've already seen the intra-record differences among distinct MST file formats, but the overall structure itself has differences. A really important parameter for the overall MST file structure is the Inter-record alignment. Some details about the overall file structure and alignment are:

  • The file is divided into 512-byte blocks, and the last block should be filled up to its end
  • The first record must be the control record
  • The records are simply stacked one after another, but with alignment constraints:
    • The BASE and MFN fields of a record must be in the same block
    • The record itself should have an alignment of 2 bytes (word alignment, the ISIS default for inter-record alignment)

The shift name comes from the XRF file structure, which has just 32 bits to store the block, the offset and some flags of each record. The XRF should be capable of pointing to the address of every record in the MST file, hence some "bit twiddling" must be done to enable larger MST files. That's done through the MSTXL field, which represents the shift: the number of times we must bit-shift the offset to the right. Doing so we lose the least significant bits, hence our offsets should always be aligned to "2^shift" (two raised to the power of shift). That's the main inter-record alignment constraint we have.
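
As a minimal sketch of that arithmetic (illustrative Python, not the ioisis API):

# MSTXL = 2 means record offsets must be multiples of 2**2 = 4 bytes
shift = 2
record_offset = 1024
assert record_offset % (1 << shift) == 0  # alignment constraint
stored_offset = record_offset >> shift    # what an XRF entry would keep
assert (stored_offset << shift) == record_offset  # recoverable, no bits lost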

These are the MST-specific options regarding the inter-record alignment:

  • --control-len: Length of the control record, in bytes, to control the first filler size
  • --shift (MST file output only): The MSTXL value, telling that the inter-record alignment should be at least 2 raised to the power of MSTXL bytes
  • --shift4is3/--shift4isnt3: Toggle whether an MSTXL equal to 3 (in a file or in --shift) should be regarded as 4; it's a historical behavior of CISIS
  • --min-modulus: The minimum inter-record alignment, in bytes (2 by default). This option makes it possible to bypass the standard word alignment: "--min-modulus=1 --shift=0" would make MST files with byte alignment (i.e., with no inter-record padding/filler)
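
For example, the byte-aligned output mentioned above could be produced with something like:

ioisis jsonl2mst --min-modulus 1 --shift 0 file.jsonl file.mst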

There are three locking mechanisms in ISIS that might be stored in an MST file:

  • EWLOCK (Exclusive Write Lock): It's a flag, stored in MFCXX3 (control record)
  • DELOCK (Data Entry Lock): It's a counter, stored in MFCXX2 (control record), of how many records are locked at once
  • RLOCK (Record Lock): It's the sign of the MFRL (record length) of every record (the record size is actually the absolute value of MFRL)

Usually these make no difference when the ISIS database is just a static file that no process is modifying, and the ioisis CLI ignores the EWLOCK and DELOCK (they can be accessed with ioisis as a library, though). There's one option in ioisis to enable/disable the interpretation of all these locks, and it's exposed to the CLI since it affects the RLOCK:

  • --lockable/--no-locks: Control if the MFRL should be signed (lockable) or unsigned (no RLOCK, doubling the record length limit)

Several fillers (padding characters) might appear in the MST file due to the several alignment constraints. Another issue with the MST file is that it doesn't have one single filler for all these cases, and perhaps some tool in some specific architecture might behave differently. As the parser is strict (i.e., it checks the alignment and fillers), some of these might need to be tuned before loading the MST file, and these are the options that make that possible:

  • --filler: Default filler for unset filler options, except the record filler
  • --record-filler: For the trailing record data, after the last field value (the default is a whitespace)
  • --control-filler: For the trailing bytes of the control record
  • --slack-filler: For the leader/directory when --unpacked
  • --block-filler: For the last bytes in a 512-bytes block that don't belong to any record (end of file or due to the "MFN+BASE in the same block" constraint)

The filler options above have a single parameter, which should always be a 2-character string with the filler byte code in hexadecimal.
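
For example, a hypothetical file whose block filler bytes are spaces (byte 0x20) might be loaded with:

ioisis mst2jsonl --block-filler 20 file.mst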

Finally, sometimes the input MST file is corrupt and can't be loaded, e.g. because the block filler isn't clean, or because an MFRL is smaller than the actual record data. Since the overall record structure has some internal constraints (sizes and offsets/addresses), ioisis can go ahead ignoring the next few bytes that make no sense as a new record. To do so, one should call it with the invalid block padding option (--ibp), whose value can be:

  • check (default): The strict behavior, ioisis crashes when some invalid data appears in some offset that should have a record
  • ignore: Silently skips the invalid data
  • store: Store the skipped data in an artificial ibp field of the output, in hexadecimal
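
For example, to load a corrupt file while keeping the skipped bytes visible in the output:

ioisis mst2jsonl --ibp store corrupt.mst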

Library

A common data structure in the library for representing a single record is the tidy list of tag-value pairs, or tl. It doesn't have anything to do with the tidy/stidy JSONL/CSV modes; it's just a way to store the data avoiding the scattered structure of the raw record container. To load data with the library:

from ioisis import bruma, iso, mst, fieldutils

# In the mst module, you must create a StructCreator instance
mst_sc = mst.StructCreator(ibp="store")
with open("file.mst", "rb") as raw_mst_file:
    for raw_tl in mst_sc.iter_raw_tl(raw_mst_file):
        tl = fieldutils.nest_decode(raw_tl, encoding="cp1252")
        ...

# For bruma.iter_tl the input must be a file name
for tl in bruma.iter_tl("file.mst", encoding="cp1252"):
    raw_tl = fieldutils.nest_encode(tl, encoding="utf-8")
    ...

# The idea is similar for an ISO file, but ...
for raw_tl in iso.iter_raw_tl("file.iso"):
    tl = fieldutils.utf8_fix_nest_decode(raw_tl, encoding="latin1")
    ...

# ... for ISO files, you can always use either a file name
# or any file-like object open in "rb" mode
with open("file.iso", "rb") as raw_iso_file:
    for tl in iso.iter_tl(raw_iso_file, encoding="latin1"):
        ...

The following generator functions/methods are the ones that appeared in the example above:

  • mst.StructCreator.iter_raw_tl: Read MST keeping data in bytestrings
  • iso.iter_raw_tl: Read ISO keeping data in bytestrings
  • bruma.iter_tl: Read MST already decoding its contents
  • iso.iter_tl: Read ISO already decoding its contents

It's worth noting that the following functions from the fieldutils module allow encoding/decoding all record fields/subfields at once:

  • nest_encode
  • nest_decode
  • utf8_fix_nest_decode

The latter is the same as nest_decode, but uses the given encoding as a fallback, trying first to decode all the contents as UTF-8.
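
A minimal illustration (hypothetical values, following the same call style as the example above):

from ioisis import fieldutils

# The first value is UTF-8 encoded, the second is Latin-1 encoded
raw_tl = [[b"1", b"caf\xc3\xa9"], [b"2", b"caf\xe9"]]
tl = fieldutils.utf8_fix_nest_decode(raw_tl, encoding="latin1")
# Both values should decode to "café"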

What's the content of a single decoded tl? It's a list of [tag, value] pairs (as lists or tuples), like:

[["5", "S"],
 ["6", "c"],
 ["10", "br1.1"],
 ["62", "Example Institute"]]

One can generate a single ISO record from a tl:

>>> from ioisis import iso, fieldutils
>>> tl = [["1", "test"], ["8", "it"]]
>>> raw_tl = fieldutils.nest_encode(tl, encoding="utf-8")
>>> raw_tl
[[b'1', b'test'], [b'8', b'it']]
>>> con = fieldutils.tl2con(raw_tl, ftf=iso.DEFAULT_ISO_FTF)
>>> con
{'dir': [{'tag': b'001'}, {'tag': b'008'}], 'fields': [b'test', b'it']}
>>> iso.DEFAULT_RECORD_STRUCT.build(con)
b'000580000000000490004500001000500000008000300005#test#it##\n'

The process to create records is to convert them to the internal construct container format (or simply con), which is done by fieldutils.tl2con. To create an MST file, you can use the build_stream method of mst.StructCreator, whose first parameter should be a generator of con instances, and the second a seekable file object.
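
A rough sketch of that workflow, under default settings (whether tl2con needs an explicit ftf for MST tags is an assumption left open here):

from ioisis import mst, fieldutils

mst_sc = mst.StructCreator()
tls = [
    [["1", "test"], ["8", "it"]],
]

def generate_cons():
    for tl in tls:
        raw_tl = fieldutils.nest_encode(tl, encoding="cp1252")
        yield fieldutils.tl2con(raw_tl)  # ftf kept as its default (assumption)

# build_stream expects a generator of con instances and a seekable file object
with open("file.mst", "wb") as raw_mst_file:
    mst_sc.build_stream(generate_cons(), raw_mst_file)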

There's still a third format, called the record dict format, which is based on the JSONL "--mode=field" output format. It has fewer resources available internally in the library compared with the abovementioned alternatives, but it might be simpler to use in some cases:

>>> iso.dict2bytes({"1": ["testing"], "8": ["it"]})
b'000610000000000490004500001000800000008000300008#testing#it##\n'

# The same, but from the tl
>>> tl = [["1", "testing"], ["8", "it"]]
>>> record = fieldutils.tl2record(tl)
>>> iso.dict2bytes(record)
b'000610000000000490004500001000800000008000300008#testing#it##\n'

To load ISIS data from bruma or iso, you can also use the iter_records function of the respective module, but it's more customizable if you use the fieldutils converter functions:

  • record2tl
  • tl2record
  • tl2con
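
For instance, record2tl goes in the opposite direction of the tl2record call shown above (a hypothetical round trip; the exact pair container types may vary):

from ioisis import fieldutils

record = {"1": ["testing"], "8": ["it"]}
tl = fieldutils.record2tl(record)      # back to [tag, value] pairs
new_record = fieldutils.tl2record(tl)  # should rebuild an equal mapping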

Perhaps the simplest way to understand the behavior of the library is to use the CLI and to check the code of the called command.

Modules

The modules available in the ioisis package are:

Module       Content
bruma        Everything about MST file processing based on Bruma
ccons        Custom construct classes
fieldutils   Field/subfield processing functions and classes
iso          ISO parsing/building tools based on construct
java         Java interfacing resources based on JPype1
mst          MST/XRF parsing/building tools based on construct
streamutils  Classes for precise file/pipe processing
__main__     CLI (Command Line Interface)

Usually, the only modules one would need from ioisis to use it as a library are iso, mst, bruma and fieldutils, the remaining modules can be seen as internal stuff.

By default, the mst module doesn't use/create XRF files. One can create/load XRF data using the struct created by the mst.StructCreator.create_xrf_struct method.

ISO construct containers (lower level data access Python API)

The iso module uses the Construct library, which makes it possible to create a declarative "structure" object that can perform bidirectional building/parsing of bytestrings (instances of bytes) or streams (files open in the "rb" mode) from/to construct containers (dictionaries).

Building and parsing a single record

This low level data access doesn't perform any string encoding/decoding, so every value in the input dictionary used for building some ISO data should be a raw bytestring. Likewise, the parser doesn't decode the encoded strings (tags, fields and metadata), keeping bytestrings in the result.

Here's an example with a record in the "minimal" format expected by the ISO builder. The values are bytestrings, and each directory entry matches its field value based on their index.

>>> lowlevel_dict = {
...     "dir": [{"tag": b"001"}, {"tag": b"555"}],
...     "fields": [b"a", b"test"],
... }

# Build a single ISO record bytestring from a construct.Container/dict
>>> iso_data = iso.DEFAULT_RECORD_STRUCT.build(lowlevel_dict)
>>> iso_data
b'000570000000000490004500001000200000555000500002#a#test##\n'

# Parse a single ISO record bytestring to a construct.Container
>>> con = iso.DEFAULT_RECORD_STRUCT.parse(iso_data)

# The construct.Container instance inherits from dict.
# The directory and fields are instances of construct.ListContainer,
# a class that inherits from list.
>>> [directory["tag"] for directory in con["dir"]]
[b'001', b'555']
>>> con.fields  # Its items can be accessed as attributes
ListContainer([b'a', b'test'])
>>> len(con.fields) == con.num_fields == 2  # A computed attribute
True

# This function directly converts that construct.Container object
# to a dictionary of already decoded strings in the more common
# {tag: [field, ...], ...} format (default ISO encoding is cp1252):
>>> iso.con2dict(con).items()  # It's a defaultdict(list)
dict_items([('1', ['a']), ('555', ['test'])])

Other record fields

Each ISO record is divided into 3 parts:

  • Leader (24 bytes header with metadata)
  • Directory (metadata for each field value, mainly its 3-bytes tag)
  • Fields (the field values themselves as bytestrings)

The leader has:

  • Single character metadata (status, type, coding)
  • Two numeric metadata (indicator_count and identifier_len), which should range only from 0 to 9
  • Free room for "vendor-specific" stuff as bytestrings: custom_2 and custom_3, where the numbers are their size in bytes
  • An entry map, i.e., the size of each field of the directory: len_len, pos_len and custom_len, which should range only from 0 to 9
  • A single byte, reserved, literally reserved for future use

>>> con.len_len, con.pos_len, con.custom_len
(4, 5, 0)

Actually, the reserved field is part of the entry map, but it has no specific meaning there, and it doesn't need to be a number. Apart from the entry map and the not included length/address fields, none of these metadata fields has any meaning when reading the ISO content, and they're all filled with zeros by default (the ASCII zero when they're strings).

>>> con.status, con.type, con.coding, con.indicator_count
(b'0', b'0', b'0', 0)

Length and position fields that are stored in the record (total_len, base_addr, dir.len, dir.pos) are computed at build time and checked on parsing. We don't need to worry about these fields, but we can read them if needed. For example, one directory entry (a dictionary) has this:

>>> con.dir[1]
Container(tag=b'555', len=5, pos=2, custom=b'')

As the default dir.custom field has zero length, it's not really useful for most use cases. Given that, we've already seen all the fields there are in the low level ISO representation of a single record.

Tweaking the field lengths

The ISO2709 specification tells us that a directory entry should have exactly 12 bytes, which means that len_len + pos_len + custom_len should be 9. However, that's not an actual restriction for this library, so we don't need to worry about that, as long as the entry map has the correct information.

Let's customize the lengths to get a smaller ISO with some data in the custom field of the directory, using 8-byte directory entries:

>>> dir8_dict = {
...     "len_len": 1,
...     "pos_len": 3,
...     "custom_len": 1,
...     "dir": [{"tag": b"001", "custom": b"X"}, {"tag": b"555"}],
...     "fields": [b"a", b"test"],
... }
>>> dir8_iso = iso.DEFAULT_RECORD_STRUCT.build(dir8_dict)
>>> dir8_iso
b'0004900000000004100013100012000X55550020#a#test##\n'
>>> dir8_con = iso.DEFAULT_RECORD_STRUCT.parse(dir8_iso)
>>> dir8_con.dir[0]
Container(tag=b'001', len=2, pos=0, custom=b'X')
>>> dir8_con.dir[1]  # The default is always zero!
Container(tag=b'555', len=5, pos=2, custom=b'0')
>>> dir8_con.len_len, dir8_con.pos_len, dir8_con.custom_len
(1, 3, 1)

What happens if we try to build from a dictionary that doesn't fit with the given sizes?

>>> invalid_dict = {
...     "len_len": 1,
...     "pos_len": 9,
...     "dir": [{"tag": b"555"}],
...     "fields": [b"a string with more than 9 characters"],
... }
>>> iso.DEFAULT_RECORD_STRUCT.build(invalid_dict)
Traceback (most recent call last):
  ...
construct.core.StreamError: Error in path (building) -> dir -> len
bytes object of wrong length, expected 1, found 2

ISO files, line breaking and delimiters

ISO files usually have more than a single record; however, these files are created by simply concatenating ISO records. It's that simple: concatenating two ISO files should result in another valid ISO file with all the records from both.

Although that's not part of the ISO2709 specification, the iso.DEFAULT_RECORD_STRUCT parser/builder object assumes that:

  • All lines of a given record but the last one must have exactly 80 bytes, and a line feed (\x0a) must be included after that;
  • Every line must belong to a single record;
  • The last line of a single record must finish with a \x0a.

That's the behavior of iso.LineSplitRestreamed, which "wraps" internally the record structure to give this "line splitting" behavior, but that can be avoided by setting the line_len to None or zero when creating a custom record struct.

Parsing/building data with meaningful line breaking characters

Suppose we want to store these values:

>>> newline_info_dict = {
...     "dir": [{"tag": b"SIZ"}, {"tag": b"SIZ"}, {"tag": b"SIZ"}],
...     "fields": [b"linux^c\n^s1", b"win^c\r\n^s2", b"mac^c\r^s1"],
... }

That makes sense as an example of an ISO record with three SIZ fields, each with three subfields, where the second subfield is the default newline character of some environment, and the third subfield is its size. Although we can build that using the DEFAULT_RECORD_STRUCT (the end of line never gets mixed with the content), we know beforehand that our values have newline characters, and we might want an alternative struct without that "wrapped" line breaking behavior:

>>> breakless_struct = iso.create_record_struct(line_len=0)
>>> newline_info_iso = breakless_struct.build(newline_info_dict)
>>> newline_info_iso
b'000950000000000610004500SIZ001200000SIZ001100012SIZ001000023#linux^c\n^s1#win^c\r\n^s2#mac^c\r^s1##'
>>> newline_info_con = breakless_struct.parse(newline_info_iso)
>>> newline_info_simple_dict = dict(iso.con2dict(newline_info_con))
>>> newline_info_simple_dict
{'SIZ': ['linux^c\n^s1', 'win^c\r\n^s2', 'mac^c\r^s1']}
>>> newline_info_iso == iso.dict2bytes(
...     newline_info_simple_dict,
...     record_struct=breakless_struct,
... )
True

Parsing/building with a custom line breaking and delimiters

The default builder/parser for a single record was created with:

DEFAULT_RECORD_STRUCT = iso.create_record_struct(
    field_terminator=iso.DEFAULT_FIELD_TERMINATOR,
    record_terminator=iso.DEFAULT_RECORD_TERMINATOR,
    line_len=iso.DEFAULT_LINE_LEN,
    newline=iso.DEFAULT_NEWLINE,
)

We can create a custom object using other values. To use it, we'll pass that object as the record_struct keyword argument when calling the functions.

>>> simple_data = {
...     "OBJ": ["mouse", "keyboard"],
...     "INF": ["old"],
...     "SIZ": ["34"],
... }
>>> custom_struct = iso.create_record_struct(
...     field_terminator=b";",
...     record_terminator=b"@",
...     line_len=20,
...     newline=b"\n",
... )
>>> simple_data_iso = iso.dict2bytes(
...     simple_data,
...     record_struct=custom_struct,
... )
>>> from pprint import pprint
>>> pprint(simple_data_iso.decode("ascii"))
('00096000000000073000\n'
 '4500OBJ000600000OBJ0\n'
 '00900006INF000400015\n'
 'SIZ000300019;mouse;k\n'
 'eyboard;old;34;@\n')
>>> simple_data_con = custom_struct.parse(simple_data_iso)
>>> simple_data == iso.con2dict(simple_data_con)
True

The calculated sizes don't count the extra line breaking characters:

>>> simple_data_con.total_len, simple_data_con.base_addr
(96, 73)
