# csv-dataset

`CsvDataset` helps to read a csv file and create descriptive and efficient input pipelines for deep learning. `CsvDataset` iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.
## Install

```sh
$ pip install csv-dataset
```
## Usage

Suppose we have a csv file whose absolute path is `filepath`:

```csv
open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
```
```python
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Skip the first column and pick only the following ones
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)
```
The first iteration prints the following:
```
[[[7145.99, 7150.0, 7141.01, 7142.33, 21.094283]
  [7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
  [7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]]

 [[7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
  [7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]
  [7123.74, 7128.06, 7117.12, 7126.57, 39.885367]]]
...
```
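To illustrate what `.window(3, 1).batch(2)` produces above, here is a minimal pure-Python sketch of the windowing and batching semantics. This is an illustration of the concept only, not the library's implementation, and the helper names `window` and `batch` are this sketch's own:

```python
from typing import List


def window(rows: List[list], size: int, shift: int = 1) -> List[list]:
    """Group consecutive rows into overlapping windows of `size`,
    moving the window start forward by `shift` rows each time."""
    return [
        rows[i:i + size]
        for i in range(0, len(rows) - size + 1, shift)
    ]


def batch(windows: List[list], batch_size: int) -> List[list]:
    """Group consecutive windows into batches of `batch_size`
    (trailing incomplete batches are dropped in this sketch)."""
    return [
        windows[i:i + batch_size]
        for i in range(0, len(windows) - batch_size + 1, batch_size)
    ]


rows = [
    [7145.99, 7150.0, 7141.01, 7142.33, 21.094283],
    [7142.89, 7142.99, 7120.7, 7125.73, 118.279931],
    [7125.76, 7134.46, 7123.12, 7123.12, 41.03628],
    [7123.74, 7128.06, 7117.12, 7126.57, 39.885367],
]

batches = batch(window(rows, 3, 1), 2)
# 1 batch, containing 2 windows of 3 rows each,
# matching the shape of the printed output above
print(len(batches), len(batches[0]), len(batches[0][0]))  # 1 2 3
```

The first batch holds the windows starting at rows 0 and 1, which is why the second window in the output repeats rows 1 and 2 of the first.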
### Dataset(reader: AbstractReader)

### dataset.window(size: int, shift: int = None, stride: int = 1) -> self

Defines the window size, shift and stride. The default window size is `1`, which means the dataset has no window.
### dataset.batch(batch: int) -> self

Defines the batch size. The default batch size of the dataset is `1`, which means it is single-batch.
### dataset.get() -> Optional[np.ndarray]

Gets the data of the next batch.

### dataset.reset() -> None

Resets the dataset.
### dataset.read(amount: int, reset_buffer: bool = False)

Reads multiple batches at a time.

- **amount** the maximum length of data the dataset will read
- **reset_buffer** if `True`, the dataset will reset the data of the previous window in the buffer
### CsvReader(filepath, dtype, indexes, **kwargs)

- **filepath** `str` absolute path of the csv file
- **dtype** `Callable` data type. We should only use `float` or `int` for this argument.
- **indexes** `List[int]` column indexes to pick from the lines of the csv file
- **kwargs**
  - **header** `bool = False` whether we should skip reading the header line
  - **splitter** `str = ','` the column splitter of the csv file
  - **normalizer** `List[NormalizerProtocol]` list of normalizers to normalize each column of data. A `NormalizerProtocol` should contain two methods, `normalize(float) -> float` to normalize the given datum and `restore(float) -> float` to restore the normalized datum.
  - **max_lines** `int = -1` max lines of the csv file to be read. Defaults to `-1`, which means no limit.
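As a sketch of what a `NormalizerProtocol` implementation might look like, here is a min-max scaler with the two required methods. The class name `MinMaxNormalizer` and its scaling strategy are illustrative assumptions, not part of the library:

```python
class MinMaxNormalizer:
    """A hypothetical NormalizerProtocol implementation that scales
    values from [lower, upper] into [0, 1] and back."""

    def __init__(self, lower: float, upper: float) -> None:
        self._lower = lower
        self._span = upper - lower

    def normalize(self, datum: float) -> float:
        # Map the raw datum into the [0, 1] range
        return (datum - self._lower) / self._span

    def restore(self, datum: float) -> float:
        # Invert normalize(): map [0, 1] back to the raw range
        return datum * self._span + self._lower


normalizer = MinMaxNormalizer(7000.0, 7200.0)
print(normalizer.normalize(7100.0))  # 0.5
print(normalizer.restore(normalizer.normalize(7142.33)))
```

Per the description above, one such normalizer would be supplied for each picked column via the `normalizer` keyword argument.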
### csvReader.reset()

Resets the reader position.

### csvReader.max_lines(lines: int)

Changes `max_lines`.

### csvReader.readline() -> list

Returns the converted values of the next line.

### property csvReader.lines

Returns the number of lines that have been read.
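Conceptually, converting one line works as the parameters above describe: split the raw line on the splitter, pick the configured indexes, and apply the dtype to each picked cell. A minimal pure-Python sketch of that step (an illustration with a made-up helper name, not the library's code):

```python
from typing import Callable, List


def convert_line(
    line: str,
    dtype: Callable[[str], float],
    indexes: List[int],
    splitter: str = ','
) -> list:
    """Split a raw csv line, pick the wanted columns and cast them."""
    cells = line.rstrip('\n').split(splitter)
    return [dtype(cells[i]) for i in indexes]


line = '1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283'
print(convert_line(line, float, [1, 2, 3, 4, 5]))
# [7145.99, 7150.0, 7141.01, 7142.33, 21.094283]
```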
## License