# csv-dataset

`CsvDataset` helps to read a csv file and create descriptive and efficient input pipelines for deep learning. `CsvDataset` iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.
## Install

```sh
$ pip install csv-dataset
```
## Usage

Suppose we have a csv file whose absolute path is `filepath`:

```csv
open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
```
```python
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Skip the first column and pick only the following ones
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)
```
The first iteration prints the following:
```
[[[7145.99, 7150.0, 7141.01, 7142.33, 21.094283]
  [7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
  [7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]]

 [[7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
  [7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]
  [7123.74, 7128.06, 7117.12, 7126.57, 39.885367]]]
...
```
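To illustrate what `.window(3, 1).batch(2)` produces above, here is a minimal pure-Python sketch of the windowing and batching semantics. This is an illustration of the concept only, not the library's implementation, and the helper names `window` and `batch` are this sketch's own:

```python
from typing import List


def window(rows: List[list], size: int, shift: int = 1) -> List[list]:
    """Group consecutive rows into overlapping windows of `size`,
    moving the window start forward by `shift` rows each time."""
    return [
        rows[i:i + size]
        for i in range(0, len(rows) - size + 1, shift)
    ]


def batch(windows: List[list], batch_size: int) -> List[list]:
    """Group consecutive windows into batches of `batch_size`
    (trailing incomplete batches are dropped in this sketch)."""
    return [
        windows[i:i + batch_size]
        for i in range(0, len(windows) - batch_size + 1, batch_size)
    ]


rows = [
    [7145.99, 7150.0, 7141.01, 7142.33, 21.094283],
    [7142.89, 7142.99, 7120.7, 7125.73, 118.279931],
    [7125.76, 7134.46, 7123.12, 7123.12, 41.03628],
    [7123.74, 7128.06, 7117.12, 7126.57, 39.885367],
]

batches = batch(window(rows, 3, 1), 2)
# 1 batch, containing 2 windows of 3 rows each,
# matching the shape of the printed output above
print(len(batches), len(batches[0]), len(batches[0][0]))  # 1 2 3
```

The first batch holds the windows starting at rows 0 and 1, which is why the second window in the output repeats rows 1 and 2 of the first.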
### Dataset(reader: AbstractReader)

### dataset.window(size: int, shift: int = None, stride: int = 1) -> self

Defines the window size, shift and stride. The default window size is `1`, which means the dataset has no window.
### dataset.batch(batch: int) -> self

Defines the batch size. The default batch size of the dataset is `1`, which means it is single-batch.
### dataset.get() -> Optional[np.ndarray]

Gets the data of the next batch.

### dataset.reset() -> None

Resets the dataset.
### dataset.read(amount: int, reset_buffer: bool = False)

Reads multiple batches at a time.

- **amount** the maximum length of data the dataset will read
- **reset_buffer** if `True`, the dataset will reset the data of the previous window in the buffer
### CsvReader(filepath, dtype, indexes, **kwargs)

- **filepath** `str` absolute path of the csv file
- **dtype** `Callable` data type. We should only use `float` or `int` for this argument.
- **indexes** `List[int]` column indexes to pick from the lines of the csv file
- **kwargs**
  - **header** `bool = False` whether we should skip reading the header line
  - **splitter** `str = ','` the column splitter of the csv file
  - **normalizer** `List[NormalizerProtocol]` list of normalizers to normalize each column of data. A `NormalizerProtocol` should contain two methods, `normalize(float) -> float` to normalize the given datum and `restore(float) -> float` to restore the normalized datum.
  - **max_lines** `int = -1` max lines of the csv file to be read. Defaults to `-1`, which means no limit.
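As a sketch of what a `NormalizerProtocol` implementation might look like, here is a min-max scaler with the two required methods. The class name `MinMaxNormalizer` and its scaling strategy are illustrative assumptions, not part of the library:

```python
class MinMaxNormalizer:
    """A hypothetical NormalizerProtocol implementation that scales
    values from [lower, upper] into [0, 1] and back."""

    def __init__(self, lower: float, upper: float) -> None:
        self._lower = lower
        self._span = upper - lower

    def normalize(self, datum: float) -> float:
        # Map the raw datum into the [0, 1] range
        return (datum - self._lower) / self._span

    def restore(self, datum: float) -> float:
        # Invert normalize(): map [0, 1] back to the raw range
        return datum * self._span + self._lower


normalizer = MinMaxNormalizer(7000.0, 7200.0)
print(normalizer.normalize(7100.0))  # 0.5
print(normalizer.restore(normalizer.normalize(7142.33)))
```

Per the description above, one such normalizer would be supplied for each picked column via the `normalizer` keyword argument.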
### csvReader.reset()

Resets the reader position.

### csvReader.max_lines(lines: int)

Changes `max_lines`.

### csvReader.readline() -> list

Returns the converted values of the next line.

### property csvReader.lines

Returns the number of lines that have been read.
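Conceptually, converting one line works as the parameters above describe: split the raw line on the splitter, pick the configured indexes, and apply the dtype to each picked cell. A minimal pure-Python sketch of that step (an illustration with a made-up helper name, not the library's code):

```python
from typing import Callable, List


def convert_line(
    line: str,
    dtype: Callable[[str], float],
    indexes: List[int],
    splitter: str = ','
) -> list:
    """Split a raw csv line, pick the wanted columns and cast them."""
    cells = line.rstrip('\n').split(splitter)
    return [dtype(cells[i]) for i in indexes]


line = '1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283'
print(convert_line(line, float, [1, 2, 3, 4, 5]))
# [7145.99, 7150.0, 7141.01, 7142.33, 21.094283]
```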
## License