Skip to main content

Apache Arrow PostgreSQL connector

Project description

Pgeon 🐦

Build License: MIT

Apache Arrow PostgreSQL connector

pgeon provides a C++ library and (very) simple python bindings. Almost all PostgreSQL native types are supported (see below).

This project is similar to pg2arrow and is heavily inspired by it. The main differences are the use of COPY instead of FETCH and that our implementation uses the Arrow C++ API.

The goal of pgeon is to provide fast bulk data download from a PostgreSQL database into Apache Arrow tables. If you're looking to upload data, you might want to have a look at Arrow ADBC.

Usage

from pgeon import copy_query
db = "postgresql://postgres@localhost:5432/postgres"
query = "SELECT TIMESTAMP '2001-01-01 14:00:00'"
tbl = copy_query(db, query)

The actual query performed is COPY ({query}) TO STDOUT (FORMAT binary), see this page for more information.

Installation

Building and running pgeon requires libpq to be available on your system.

Python

Install from source using pip with

git clone https://github.com/0x0L/pgeon.git
cd pgeon
pip install .

On linux, if pyarrow is already installed as a conda package, you may want to use

CONDA_BUILD=1 pip install .

[optional] C++ library and tools

This requires cmake and ninja. In addition you'll need to install libpq and the Arrow C++ libraries (e.g. arrow-cpp in conda)

mkdir build
cd build
cmake -GNinja ..
ninja

Performance

Elapsed time distributions of a query fetching 7 columns (1 timestamp, 2 ints, 4 reals) and around 4.5 million rows. The result is returned as a pandas.DataFrame in all cases.

Notes

  • Queries using ROW (e.g. SELECT ROW('a', 1)) do not work (anonymous structs)

  • SQL arrays are mapped to pyarrow.list_(...). Only 1D arrays are fully supported. Higher dimensional arrays will be flattened.

  • BitString types output format is not really helpful

  • tsvector types with letter weights are not supported

  • PostgreSQL range and domain types are not supported.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

pgeon-0.2.0a0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.4 MB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

pgeon-0.2.0a0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.4 MB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

pgeon-0.2.0a0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.4 MB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page