A resource normalizer for dataflows

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.6
- Python :: 3.7
Topic
- Software Development :: Libraries :: Python Modules

Project description

Dataflows Resource write to db normalized

This library provides some dataflows processing for normalizing a resource.

It has special support for storing normalized data into DB tables.

What is normalization?

In short, it is the process of reducing duplication in a dataset.

More can be read about this concept here.

Example

Let's take, as an example, this world cities dataset (we shall call it the fact resource):

from dataflows import Flow, load, printer

Flow(
    load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),
    printer(num_rows=1)
).process()

cities:

#	name	country	subcountry	geonameid
1	les Escaldes	Andorra	Escaldes-Engordany	3040051
2	Andorra la Vella	Andorra	Andorra la Vella	3041563
...
23018	Chitungwiza	Zimbabwe	Harare	1106542

It seems that the country and subcountry columns are quite repetitive - let's extract them into a separate, deduplicated resource (we will call that a dimension resource).

To do that we use the normalize processor.

This processor receives a single resource name, and a list of NormGroup instances. Each of these groups specifies one new dimension resource to be extracted and deduplicated.

Let's see it in action:

from dataflows_normalize import normalize, NormGroup

Flow(
    load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),
    normalize([
       NormGroup(['country', 'subcountry'], 'country_id', 'id') 
    ], resource='cities'),
    printer()
).process()

cities:

#	name	geonameid	country_id
1	les Escaldes	3040051	0
2	Andorra la Vella	3041563	1
3	Umm al Qaywayn	290594	2
4	Ras al-Khaimah	291074	3
5	Khawr Fakkān	291696	4
...
23014	Bulawayo	894701	2677
23015	Bindura	895061	2678
23016	Beitbridge	895269	2679
23017	Epworth	1085510	2676
23018	Chitungwiza	1106542	2676

cities_country_id:

#	id	country	subcountry
1	30	Afghanistan	Badakhshan
2	27	Afghanistan	Badghis
3	21	Afghanistan	Balkh
4	33	Afghanistan	Bāmīān
5	31	Afghanistan	Farah
6	19	Afghanistan	Faryab
7	28	Afghanistan	Ghaznī
8	13	Afghanistan	Ghowr
9	22	Afghanistan	Helmand
10	11	Afghanistan	Herat
...
2671	2677	Zimbabwe	Bulawayo
2672	2676	Zimbabwe	Harare
2673	2673	Zimbabwe	Manicaland
2674	2678	Zimbabwe	Mashonaland Central
2675	2675	Zimbabwe	Mashonaland East
2676	2674	Zimbabwe	Mashonaland West
2677	2670	Zimbabwe	Masvingo
2678	2671	Zimbabwe	Matabeleland North
2679	2679	Zimbabwe	Matabeleland South
2680	2672	Zimbabwe	Midlands

If we follow the last line in the dataset (Chitungwiza), we can see that an entry for its region (Zimbabwe/Harare) was created with id 2676, and that id was added to the original row instead of the original values.

How much did we gain?

The original CSV file has a size of 895,586 bytes.

If we save the two new resources as CSVs, we would get

542,299 bytes for the fact resource and 68,023 for the regions dimension resource - a total of 610,322 bytes (or a reduction of 31% in size).

Not only this helps with size, it also improves greatly DB performance to store data in normalized form.

DB Normalization

Running similar code to above, only using normalize_to_db will do the following:

Load existing values from database dimension tables (in case these tables exist)
Normalize the input data, and split into fact and dimension resources
Update the DB tables with new values, while reusing existing references

The main difference in usage from normalize is that the names of DB tables are provided.

from dataflows_normalize import normalize_to_db, NormGroup

Flow(
    load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),
    normalize_to_db(
        [
            NormGroup(['country', 'subcountry'], 'country_id', 'id', db_table='countries_db_table') 
        ], 
        'cities_db_table', 'cities',
        db_connection_str='...'
    ),
).process()

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.6
- Python :: 3.7
Topic
- Software Development :: Libraries :: Python Modules

Release history Release notifications | RSS feed

This version

0.0.11

Jul 26, 2020

0.0.10

Jul 26, 2020

0.0.9

Jul 26, 2020

0.0.8

Jul 29, 2019

0.0.7

Jul 28, 2019

0.0.6

Jul 28, 2019

0.0.5

Jul 28, 2019

0.0.4

Jul 26, 2019

0.0.3

Jul 25, 2019

0.0.2

Jul 18, 2019

0.0.1

May 2, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataflows-normalize-0.0.11.tar.gz (6.5 kB view hashes)

Uploaded Jul 26, 2020 Source

Built Distribution

dataflows_normalize-0.0.11-py2.py3-none-any.whl (6.8 kB view hashes)

Uploaded Jul 26, 2020 Python 2 Python 3

Hashes for dataflows-normalize-0.0.11.tar.gz

Hashes for dataflows-normalize-0.0.11.tar.gz
Algorithm	Hash digest
SHA256	`b15b1e8dca95c82ec8918b1a662e8f8691b50d89faaac53ecb5dd29b3aa16283`
MD5	`96ca9ea357948c067300be329a99c6c5`
BLAKE2b-256	`37d3f77b3a2e7811d54bf1c55cd80a95a5157503386286bbc15e5c9ec7935ec4`

Hashes for dataflows_normalize-0.0.11-py2.py3-none-any.whl

Hashes for dataflows_normalize-0.0.11-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`bfc5100792865dbbf069a386d0e6d4a40a83671d94c94bf0974e755bfc5e36ec`
MD5	`fd14dce8352d6e1274b0f7616ef3de61`
BLAKE2b-256	`397e9b3aaf983803bed4985cf7a718d753e3915bcd4903e87125afe97a0d2341`