csv-dataset

CsvDataset helps to read a csv file and create descriptive and efficient input pipelines for deep learning.

CsvDataset iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.
Install
$ pip install csv-dataset
Usage
Suppose we have a csv file whose absolute path is `filepath`:
open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Abandon the first column and only pick the following
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)
The following shows the output of the first print:
[[[7145.99, 7150.0, 7141.01, 7142.33, 21.094283]
[7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
[7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]]
[[7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
[7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]
[7123.74, 7128.06, 7117.12, 7126.57, 39.885367]]]
...
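The window and batch semantics above can be sketched in pure Python (a simplified illustration, not the library's actual implementation; `shift` is assumed to default to the window size when omitted):

```python
def window(rows: list, size: int, shift: int = None, stride: int = 1) -> list:
    # Each window takes `size` rows, sampled every `stride` rows;
    # consecutive windows start `shift` rows apart.
    if shift is None:
        shift = size  # assumption: shift defaults to the window size
    windows = []
    start = 0
    while start + (size - 1) * stride < len(rows):
        windows.append([rows[start + i * stride] for i in range(size)])
        start += shift
    return windows


def batch(windows: list, n: int) -> list:
    # Group consecutive windows into batches of `n`, dropping the remainder.
    return [windows[i:i + n] for i in range(0, len(windows) - n + 1, n)]
```

With this sketch, `batch(window(rows, 3, 1), 2)` produces batches of two overlapping 3-row windows, matching the shape of the output above.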
Dataset(reader: AbstractReader)

dataset.window(size, shift=None, stride=1) -> self
Defines the window size, shift and stride.

dataset.batch(batch) -> self
Defines the batch size.

dataset.get() -> Optional[np.ndarray]
Gets the data of the next batch, or `None` if there is no more data.

dataset.reset() -> None
Resets the reader position.
CsvReader(filepath, dtype, indexes, **kwargs)

- filepath `str`: absolute path of the csv file
- dtype `Callable`: data type of the columns. Only `float` or `int` should be used for this argument.
- indexes `List[int]`: column indexes to pick from each line of the csv file
- kwargs:
  - header `bool = False`: whether the header line should be skipped
  - splitter `str = ','`: the column splitter of the csv file
  - normalizer `List[NormalizerProtocol]`: list of normalizers, one for each column of data. A `NormalizerProtocol` should contain two methods: `normalize(float) -> float` to normalize the given datum, and `restore(float) -> float` to restore the normalized datum.
  - max_lines `int = -1`: max number of lines of the csv file to read. Defaults to `-1`, which means no limit.
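A normalizer that satisfies this protocol might look like the following (a hypothetical example; the min/max bounds are supplied by the caller, and `restore` is the exact inverse of `normalize`):

```python
class MinMaxNormalizer:
    """An example NormalizerProtocol implementation (hypothetical):
    scales values into [0, 1] given known min/max bounds."""

    def __init__(self, lo: float, hi: float):
        self._lo = lo
        self._hi = hi

    def normalize(self, datum: float) -> float:
        # Map datum from [lo, hi] into [0, 1]
        return (datum - self._lo) / (self._hi - self._lo)

    def restore(self, datum: float) -> float:
        # Inverse of normalize: map [0, 1] back to [lo, hi]
        return datum * (self._hi - self._lo) + self._lo
```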
csvReader.seek(pos: int)
csvReader.reset()
csvReader.max_lines()
csvReader.readline() -> list
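To make the reader semantics concrete, here is a minimal in-memory sketch of a streaming reader with the same method names (a simplified stand-in, not the library's implementation; it reads from a list of lines instead of a file):

```python
class SimpleCsvReader:
    """Hypothetical sketch of a streaming csv reader: parses one line
    per readline() call, so only one record is in memory at a time."""

    def __init__(self, lines, dtype=float, indexes=None,
                 header=False, splitter=','):
        self._lines = lines          # in-memory stand-in for a file
        self._dtype = dtype
        self._indexes = indexes
        self._splitter = splitter
        self._start = 1 if header else 0   # skip the header line if any
        self._pos = self._start

    def seek(self, pos: int):
        # Position relative to the first data line
        self._pos = self._start + pos

    def reset(self):
        self._pos = self._start

    def readline(self):
        # Return the next parsed record, or None when exhausted
        if self._pos >= len(self._lines):
            return None
        cells = self._lines[self._pos].split(self._splitter)
        self._pos += 1
        if self._indexes is not None:
            cells = [cells[i] for i in self._indexes]
        return [self._dtype(cell) for cell in cells]
```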