The XML-to-OCDS parser for the TEDective project based on lxml

etl

The code in this repo is part of the TEDective project. It defines an ETL pipeline that transforms European public procurement data from Tenders Electronic Daily (TED) into a format that is easier to handle and analyse. Primarily, the TED XMLs (and eForms, WIP) are transformed into Open Contracting Data Standard (OCDS) JSON and Parquet files to ease importing the data into:

  • A graph database (KuzuDB in our case, but the processed data should be generic enough to support any graph database)
  • A search engine (Meilisearch in our case)

Organizations are deduplicated using Splink and linked to their GLEIF identifiers (WIP) before they are imported into the graph database.

Background

The TEDective project aims to make European public procurement data explorable for non-experts. The transformation performed by this pipeline is loosely based on the Open Contracting Data Standard (OCDS) EU Profile.

As such, this pipeline can be used standalone or as part of your project that does something interesting with TED data. We use it ourselves for the TEDective API that powers the TEDective UI.
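
To give a rough idea of the kind of mapping involved, here is a minimal sketch that turns a small XML notice into an OCDS-style release. It is not the project's actual parser: the sample XML layout, the `notice_to_release` function, and the `ocds-example` prefix are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs without extra dependencies (`lxml.etree` offers a largely compatible API). Real TED notices are namespaced and far richer than this.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified notice. Real TED XML looks different.
SAMPLE_NOTICE = """\
<notice id="2024-000001">
  <published>2024-01-15T00:00:00Z</published>
  <buyer id="ORG-1">
    <name>Example City Council</name>
  </buyer>
  <title>Road maintenance services</title>
</notice>
"""


def notice_to_release(xml_text: str, ocid_prefix: str = "ocds-example") -> dict:
    """Map the simplified notice above to a minimal OCDS-style release dict."""
    root = ET.fromstring(xml_text)
    notice_id = root.attrib["id"]
    buyer = root.find("buyer")
    # Short organization reference, reused for both `buyer` and `parties`.
    buyer_ref = {"id": buyer.attrib["id"], "name": buyer.findtext("name").strip()}
    return {
        "ocid": f"{ocid_prefix}-{notice_id}",
        "id": notice_id,
        "date": root.findtext("published").strip(),
        "tag": ["tender"],
        "initiationType": "tender",
        "parties": [{**buyer_ref, "roles": ["buyer"]}],
        "buyer": buyer_ref,
        "tender": {"id": notice_id, "title": root.findtext("title").strip()},
    }


if __name__ == "__main__":
    print(json.dumps(notice_to_release(SAMPLE_NOTICE), indent=2))
```

Listing the buyer once under `parties` and referencing it from `buyer` follows the usual OCDS pattern of keeping organization details in one place, which also makes later deduplication easier to reason about.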

Install

This will be available on PyPI soon. Until then, you can install it via Nix:

# Install the flake into your profile
nix profile install git+https://git.fsfe.org/TEDective/etl
run-pipeline --help

Alternatively, you can clone this repository and build it via Nix yourself:

git clone https://git.fsfe.org/TEDective/etl
cd etl
nix-build
result/bin/run-pipeline --help

Another way is to use poetry directly:

poetry install
poetry run run-pipeline --help

Running the pipeline requires a running Luigi daemon. It is included in the project, and you can start it with the following command:

# If using Nix
result/bin/run-server
# If using poetry
poetry run run-server
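
Before launching the pipeline, it can help to confirm that the daemon is actually listening. The sketch below is only a plain TCP check, under the assumption that `run-server` starts Luigi's central scheduler on its default port 8082 on localhost; this README does not confirm the port, so adjust it if your setup differs. The `scheduler_reachable` helper is hypothetical and not part of this project.

```python
import socket


def scheduler_reachable(host: str = "localhost", port: int = 8082, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Port 8082 is an assumption (Luigi's default scheduler port).
    if scheduler_reachable():
        print("Scheduler port is open; the pipeline should be able to connect.")
    else:
        print("Nothing is listening on localhost:8082; start run-server first.")
```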

Usage

:construction: This is still under heavy development.

Maintainers

@linozen

Contributing

The easiest way to start developing is to use devenv via the provided flake.nix. So, clone this repository and run:

# If you have Nix installed
nix develop --impure
# This will drop you into a shell with all the dependencies installed
# If you want to bring up a meilisearch instance, simply run:
devenv up

Small note: If editing the README, please conform to the standard-readme specification. Also, please ensure that documentation is kept in sync with the code. Please note that the main documentation repository is added to this repository via git-subrepo. To update the documentation, please use the following commands:

git-subrepo pull docs
cd ./docs

# Make your changes
git commit -am "docs: update documentation for new feature"

# Preview your changes
pnpm install
pnpm run dev

# If you're happy with your changes, push them
git-subrepo push docs

License

EUPL-1.2 © 2024 Free Software Foundation Europe e.V.
