A package to process datasets provided by CargoMetrics Technologies Inc.

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering

Project description

cmdata

A Python package to get started with CargoMetrics data products

The main goal of this Python package is to get subscribers of the CargoMetrics data products up and running with our data as fast as possible. With that in mind, we provide functions to quickly access the various datasets and perform the first few common transformations. After that, the universe is yours...

Getting started

The cmdata Python package from CargoMetrics provides tools to get started with the Advanced datasets, which are datasets that contain point-in-time data.

Installation

The cmdata package is available on PyPI and can be installed into your Python environment with pip:

> pip install cmdata

cmdata requires Python > 3.9 and pandas > 1.0.

Generating a view

The point-in-time Advanced datasets contain multiple time dimensions for each datapoint within the dataset (see “Point-in-time deep dive”). To get started inspecting and assessing the data, the cmdata package provides a couple of views that reduce the dual time dimensionality to a single time-series. For a more in-depth example see “Point-in-time deep dive”.

To generate a view:

from cmdata.commodities import point_in_time as pit

PATH = '...'

pit_df = pit.read(PATH)
view = pit.standard_view(pit_df, asof='2023-01-01')

This view can be explored or transformed as any tabular one-dimensional dataset.

For example, to generate a plot:

# plot Australian exports
ax = view[('export', 'AUS')].plot(figsize=(10, 3))
ax.set_xlabel('Date')
ax.set_ylabel('AUS exports of Iron Ore in mt / day')

Australia Iron Ore exports - standard view

Point-in-time deep dive

Terminology

Throughout this document the following terms are used:

dataset: a collection of (daily) increments

increment: a collection of observations, published on day T, that contains information about activity dates T-90 through T-3 (i.e., a single csv file)

observation: for the Advanced commodity products, an observation is the amount (in metric tons) of a particular cargo that is imported or exported by a country on a given day

activity_date: date associated with the observation, i.e., the date of import/export of cargo into/out of a country

publication_timestamp: the time an increment was published, i.e., the time an increment is available to the customer

lag: the number of days difference between the publication timestamp and the activity date
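
The lag definition can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not part of cmdata: the helper name `lag_in_days` is hypothetical, and it assumes the lag is counted in whole calendar days, ignoring the time of day of the publication timestamp.

```python
from datetime import date, datetime


def lag_in_days(publication_timestamp: datetime, activity_date: date) -> int:
    """Whole calendar days between the publication date and the activity date.

    Assumption: the time of day of the publication is ignored.
    """
    return (publication_timestamp.date() - activity_date).days


# Hypothetical publication time on 2024-01-01
published = datetime(2024, 1, 1, 6, 0)

print(lag_in_days(published, date(2023, 12, 29)))  # 3
print(lag_in_days(published, date(2023, 10, 3)))   # 90
```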

CargoMetrics’ Advanced products are point-in-time datasets. This means that:

- Each day, the CargoMetrics system uses the input datasets available at that time, such as AIS, port, and vessel information, to produce an estimate of global maritime trade covering the last 90 days (this is referred to as the increment).
- Each observation in an increment has two associated times:
  - The publication timestamp (see Terminology above)
  - The activity date (see Terminology above)
- The collection of increments forms the point-in-time dataset, which provides the full history back to 2013 and enables customers to train models without look-ahead bias and perform honest backtests.

The following section provides a step-by-step overview of how the Advanced products are constructed.

One increment

Each day (T), a single increment is added to the point-in-time dataset. This increment contains estimates of global maritime trade for activity dates T-3 through T-90. For example: the increment published on 2024-01-01 contains activity dates ranging from 2023-10-03 (i.e., lag 90) through 2023-12-29 (i.e., lag 3).
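
The date range of a single increment can be checked with standard-library date arithmetic. The sketch below is illustrative only; `increment_activity_dates` is a hypothetical helper, not a cmdata function, and the lag bounds are the ones stated in the text.

```python
from datetime import date, timedelta

MIN_LAG, MAX_LAG = 3, 90  # lag bounds stated in the text


def increment_activity_dates(publication_date: date) -> list[date]:
    """Activity dates covered by one increment, oldest first."""
    return [
        publication_date - timedelta(days=lag)
        for lag in range(MAX_LAG, MIN_LAG - 1, -1)
    ]


dates = increment_activity_dates(date(2024, 1, 1))
print(dates[0], dates[-1], len(dates))  # 2023-10-03 2023-12-29 88
```

Note that lags 3 through 90 inclusive give 88 activity dates per increment.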

The plot below shows a graphical representation of this increment in a two-dimensional time plot where:

- Each square represents an observation
- Publication timestamp is along the vertical axis
- Activity date is along the horizontal axis

point-in-time visualization: one increment

Multiple increments

The increment published 2024-01-02, i.e., the day after the increment depicted above, contains activity dates ranging from 2023-10-04 through 2023-12-30. This means that 87 activity dates are present in both increments. The graphical representation looks like

point-in-time visualization: two increments

And three increments look like

point-in-time visualization: three increments

Each increment, compared to the previous one, adds one new day at the frontier of time (along the activity date axis) and removes one day at the trailing end.
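
The overlap between two consecutive increments can be verified with set arithmetic. This is a small sketch using only the standard library; the helper `activity_dates` is hypothetical and simply encodes the lag 3 through 90 rule.

```python
from datetime import date, timedelta


def activity_dates(publication_date: date, min_lag: int = 3, max_lag: int = 90) -> set[date]:
    """Set of activity dates covered by the increment published on publication_date."""
    return {
        publication_date - timedelta(days=lag)
        for lag in range(min_lag, max_lag + 1)
    }


first = activity_dates(date(2024, 1, 1))
second = activity_dates(date(2024, 1, 2))

shared = first & second    # activity dates present in both increments
added = second - first     # new day at the frontier of time
dropped = first - second   # day removed at the trailing end

print(len(shared))                             # 87
print(sorted(d.isoformat() for d in added))    # ['2023-12-30']
print(sorted(d.isoformat() for d in dropped))  # ['2023-10-03']
```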

A couple of things to note about this organization:

1: In the full dataset the same activity date shows up in multiple increments. In other words, there are multiple observations for each activity date.

point-in-time visualization: multiple activity dates

2: Each observation can be uniquely identified by its activity date and publication timestamp or activity date and lag.

point-in-time visualization: uniqueness
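
Both notes can be checked on a synthetic set of keys. The sketch below builds the keys for three consecutive increments (dates chosen arbitrarily for illustration) and counts how many distinct identifiers each choice of key produces.

```python
from datetime import date, timedelta

# Keys (publication date, activity date, lag) for three consecutive increments.
publication_dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(3)]
keys = [
    (pub, pub - timedelta(days=lag), lag)
    for pub in publication_dates
    for lag in range(3, 91)
]

by_publication = {(activity, pub) for pub, activity, _ in keys}
by_lag = {(activity, lag) for _, activity, lag in keys}
by_activity_only = {activity for _, activity, _ in keys}

# Either pair identifies every observation uniquely ...
print(len(keys), len(by_publication), len(by_lag))  # 264 264 264
# ... while the activity date alone does not.
print(len(by_activity_only))  # 90
```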

Lags 3 through 90

A note on why the Advanced products contain only lags between three and 90 days for each increment:

- The upper limit of 90 days is set by the longest processes that occur in maritime shipping; 90 days covers the longest voyages. For example, 90 days at 12 knots (a typical speed for tankers and dry bulk vessels) covers more distance than the circumference of the Earth.
- The lower limit of 3 days is set by the update characteristics of the input data feeding the system. Delays of up to 2 days between when characteristics change - such as the draft of a vessel - and when that change is available in the input data are common.
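
The distance claim for the 90-day upper limit can be worked through as simple arithmetic. The Earth's circumference used here (about 40,075 km at the equator) is an approximate reference value added for this sketch, not a figure from the text.

```python
SPEED_KNOTS = 12                 # nautical miles per hour, from the text
DAYS = 90                        # upper lag limit, from the text
KM_PER_NAUTICAL_MILE = 1.852     # definition of the nautical mile
EARTH_CIRCUMFERENCE_KM = 40_075  # approximate equatorial circumference

distance_nm = SPEED_KNOTS * 24 * DAYS
distance_km = distance_nm * KM_PER_NAUTICAL_MILE

print(distance_nm)                           # 25920
print(round(distance_km))                    # 48004
print(distance_km > EARTH_CIRCUMFERENCE_KM)  # True
```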

Building views

The two-dimensional point-in-time data provides crucial features that are important for training models on past data and for running honest backtests on those models. The main characteristic that facilitates this is the ability to select only the observations that were available at a particular time in the past.
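
Selecting only what was available at a given time amounts to a filter on the publication timestamp. The following is a minimal pandas sketch on toy data; the column names `publication_timestamp`, `activity_date`, and `value` are assumptions made for illustration and may differ from the columns in the actual datasets.

```python
import pandas as pd

# Toy point-in-time table: four successive estimates for one activity date.
# Column names are assumptions for this sketch.
pit_df = pd.DataFrame(
    {
        "publication_timestamp": pd.to_datetime(
            ["2022-12-30 06:00", "2022-12-31 06:00", "2023-01-01 06:00", "2023-01-02 06:00"]
        ),
        "activity_date": pd.to_datetime(["2022-12-27"] * 4),
        "value": [10.0, 11.0, 11.5, 12.0],
    }
)

# Keep observations published on or before the as-of date.
asof = pd.Timestamp("2023-01-01")
published_on = pit_df["publication_timestamp"].dt.normalize()
available = pit_df[published_on <= asof]

print(len(available))  # 3
```

The increment published on 2023-01-02 is excluded, so a model trained on `available` cannot see information from the future.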

To work with the point-in-time data, either to visualize it or to use it as an input to training a model, the dataset needs to be reduced to one time-dimension, hereafter referred to as a view. Typically, a time-series in terms of activity dates is what is required.

A view is defined as a set of rules that results in exactly one observation (i.e., one publication timestamp) for each activity date. The rules that define a view depend on the user’s needs. A few examples of views are depicted below. The cmdata package implements some of these views, which can be used as templates for other use cases.

Note: New information is added each increment and the system modeling maritime trade becomes more accurate the more information it has available. Long story short: the data matures over time, from increment to increment.

A fixed lag view

The fixed lag view is defined by a single lag and selects, for each activity date, only the observation with that lag. This view suits users interested in capturing the same level of maturation of the data for every activity date. A graphical representation of this view, selecting only the observations marked in black, looks like

point-in-time visualization: fixed lag view

To create this view from a point-in-time dataset use:

from cmdata.commodities import point_in_time as pit

PATH = '...'

pit_df = pit.read(PATH)

# generate a fixed lag view at a 7-day lag, including data
#   available on or before 2023-01-01
view = pit.fixed_lag_view(pit_df, lag=7, asof='2023-01-01')
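
To make the selection rule concrete, here is a rough pandas sketch of a fixed lag selection on toy data. It is not the cmdata implementation: the function `fixed_lag_sketch`, the column names, and the single-series layout (no country or direction columns) are all assumptions for illustration.

```python
import pandas as pd


def fixed_lag_sketch(df: pd.DataFrame, lag: int, asof: str) -> pd.Series:
    """Keep, for each activity date, the observation with the requested lag."""
    pub_day = df["publication_timestamp"].dt.normalize()
    lags = (pub_day - df["activity_date"]).dt.days
    keep = (lags == lag) & (pub_day <= pd.Timestamp(asof))
    return df[keep].set_index("activity_date")["value"].sort_index()


# Toy data: five daily increments, each covering lags 3 through 90.
rows = [
    {
        "publication_timestamp": pub,
        "activity_date": pub - pd.Timedelta(days=lag),
        "value": float(lag),  # placeholder values
    }
    for pub in pd.date_range("2022-12-28", periods=5)
    for lag in range(3, 91)
]
toy = pd.DataFrame(rows)

sketch_view = fixed_lag_sketch(toy, lag=7, asof="2023-01-01")
print(sketch_view.index.min().date(), sketch_view.index.max().date(), len(sketch_view))
# 2022-12-21 2022-12-25 5
```

Each of the five increments contributes exactly one observation, so the result is a plain time series indexed by activity date.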

A maturing view

The maturing view, which is the view provided in the CargoMetrics Commodity Standard products, selects for each activity date the observation with the most up-to-date information from the available increments. This translates to selecting the activity dates with lags 3 through 90 from the most recent increment and, from each of the remaining increments, the activity date with lag 90 only. The graphical representation is as follows, selecting the dark observations only:

point-in-time visualization: standard view

To create this view from a point-in-time dataset use:

from cmdata.commodities import point_in_time as pit

PATH = '...'

pit_df = pit.read(PATH)

# generate a standard, aka maturing, view, including data
#   available on or before 2023-01-01
view = pit.standard_view(pit_df, asof='2023-01-01')
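
The maturing rule can be sketched in pandas as "keep the latest available publication for each activity date". As before, this is an illustration on toy data under assumed column names, not the cmdata implementation; `maturing_sketch` is a hypothetical helper.

```python
import pandas as pd


def maturing_sketch(df: pd.DataFrame, asof: str) -> pd.DataFrame:
    """Keep, for each activity date, the most recently published observation."""
    pub_day = df["publication_timestamp"].dt.normalize()
    available = df[pub_day <= pd.Timestamp(asof)]
    latest = available.sort_values("publication_timestamp").drop_duplicates(
        "activity_date", keep="last"
    )
    return latest.set_index("activity_date").sort_index()


# Toy data: five daily increments, each covering lags 3 through 90.
rows = [
    {
        "publication_timestamp": pub,
        "activity_date": pub - pd.Timedelta(days=lag),
        "lag": lag,
    }
    for pub in pd.date_range("2022-12-28", periods=5)
    for lag in range(3, 91)
]
toy = pd.DataFrame(rows)

sketch_view = maturing_sketch(toy, asof="2023-01-01")
print(len(sketch_view))                  # 92
print((sketch_view["lag"] == 90).sum())  # 5
```

The most recent increment contributes all 88 of its activity dates (lags 3 through 90), and each of the four older increments contributes only its lag-90 observation, giving 92 rows.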

Release history

This version

0.0.3

Apr 24, 2024

Download files

Source Distribution

cmdata-0.0.3.tar.gz (12.1 kB)

Uploaded Apr 24, 2024 Source

Built Distribution

cmdata-0.0.3-py3-none-any.whl (10.3 kB)

Uploaded Apr 24, 2024 Python 3

Hashes for cmdata-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`094c33512c09798dfaa5f4fc0508e6c0876c821887033e9cbf84cf8ead33c7db`
MD5	`a1df3454faebfacba4036ee5855cb9d0`
BLAKE2b-256	`32a181ccf59b902477f344aeb0a29c823b36a2c31913f8ace9955b3263d47968`

Hashes for cmdata-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8c4f05a10874f77ee4be70edb16ddd06d9efdb439221bb713159851af836d53a`
MD5	`4a4e879b70edda7fd1b368b930283b71`
BLAKE2b-256	`22c5cd97cd920387b0afbe1414065f52f9ea30229bf108ef014dc7c5cac5ec0e`