stream-read-ods

Python function to extract data from an ODS spreadsheet on the fly - without having to store the entire file in memory or disk

To construct ODS spreadsheets on the fly, try stream-write-ods.

Installation

pip install stream-read-ods

Usage

To extract the rows, call the stream_read_ods function, passing it an iterable of bytes instances. It returns an iterable of (sheet_name, sheet_rows) pairs.

from stream_read_ods import stream_read_ods
import httpx

def ods_chunks():
    # Iterable that yields the bytes of an ODS file
    with httpx.stream('GET', 'https://www.example.com/my.ods') as r:
        yield from r.iter_bytes(chunk_size=65536)

for sheet_name, sheet_rows in stream_read_ods(ods_chunks()):
    for sheet_row in sheet_rows:
        print(sheet_row)  # Tuple of cells
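
The input does not have to come from HTTP, since the function only needs an iterable of bytes instances. A minimal sketch of reading a local file in chunks follows; the helper name file_chunks and the path my.ods are illustrative assumptions, not part of the library.

```python
def file_chunks(path, chunk_size=65536):
    # Yield the contents of a local file as a sequence of bytes chunks,
    # so the whole file is never held in memory at once
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

The result can then be used in place of ods_chunks(), for example stream_read_ods(file_chunks('my.ods')).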

If the spreadsheet is of a fairly simple structure, then the sheet_rows from above can be passed to the simple_table function to extract the names of the columns and the rows of the table.

from stream_read_ods import stream_read_ods, simple_table

for sheet_name, sheet_rows in stream_read_ods(ods_chunks()):
    columns, rows = simple_table(sheet_rows, skip_rows=2)
    for row in rows:
        print(row)  # Tuple of cells

This can then be used to construct a Pandas dataframe from the ODS file (although this would store the entire sheet in memory).

import pandas as pd
from stream_read_ods import stream_read_ods, simple_table

for sheet_name, sheet_rows in stream_read_ods(ods_chunks()):
    columns, rows = simple_table(sheet_rows, skip_rows=2)
    df = pd.DataFrame(rows, columns=columns)
    print(df)
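
Where holding a whole sheet in memory is not acceptable, the rows iterable can instead be consumed in fixed-size batches. A minimal sketch using only the standard library follows; the helper name batches and the choice of batch size are illustrative assumptions.

```python
from itertools import islice

def batches(iterable, size):
    # Yield lists of up to `size` items, pulling from the iterable lazily
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each batch of rows from simple_table could then be turned into its own small dataframe, for example with pd.DataFrame(batch, columns=columns), and processed before the next batch is read.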

Types

There are 8 possible data types in an Open Document Spreadsheet: boolean, currency, date, float, percentage, string, time, and void. These are converted to Python types according to the following table.

ODS type	Python type
boolean	bool
currency	stream_read_ods.Currency
date	date or datetime
float	Decimal
percentage	stream_read_ods.Percentage
string	str
time	stream_read_ods.Time
void	NoneType

Note that a string in an ODS file can be structured and styled; under the hood this is done with an HTML-like syntax. However, these structures and styles are not preserved by the conversion process. The one exception is paragraphs: each paragraph (p tag) after the first is converted into a newline.
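
Code that consumes cells often needs to branch on these Python types. The following is a minimal sketch of such a dispatch using only standard library types; the function name describe_cell and its labels are hypothetical.

```python
from datetime import date, datetime
from decimal import Decimal

def describe_cell(value):
    # None, bool and str correspond to the void, boolean and string types
    if value is None:
        return 'empty'
    if isinstance(value, bool):
        return 'boolean'
    if isinstance(value, str):
        return 'text'
    # datetime is a subclass of date, so it must be checked first
    if isinstance(value, datetime):
        return 'date and time'
    if isinstance(value, date):
        return 'date'
    # Currency and Percentage are subclasses of Decimal, so they are
    # matched here along with plain float cells
    if isinstance(value, Decimal):
        return 'number'
    return 'other'  # for example a stream_read_ods.Time
```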

stream_read_ods.Currency

A subclass of Decimal with an additional attribute code that contains the currency code, for example the string GBP. This can be None if the ODS file does not specify a code.
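
To illustrate how such values might be used, the sketch below sums amounts per currency code. The Currency class here is a hypothetical stand-in written for this example, not the library's implementation, and totals_by_code is likewise an assumed helper name.

```python
from collections import defaultdict
from decimal import Decimal

class Currency(Decimal):
    # Hypothetical stand-in for stream_read_ods.Currency: a Decimal that
    # also carries a currency code, which may be None
    def __new__(cls, value, code=None):
        obj = super().__new__(cls, value)
        obj.code = code
        return obj

def totals_by_code(values):
    # Sum currency amounts separately for each code
    totals = defaultdict(Decimal)
    for value in values:
        totals[value.code] += value
    return dict(totals)
```

In this sketch the code is tracked as the dictionary key, because arithmetic on the stand-in class returns plain Decimal values that no longer carry the code attribute.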

stream_read_ods.Percentage

A subclass of Decimal.

stream_read_ods.Time

The Python built-in timedelta type is not used, since timedelta offers no way to store intervals of years or months other than converting them to days, which would lose information.

Instead, a namedtuple is defined, stream_read_ods.Time, with members:

Member	Type
sign	str
years	int
months	int
days	int
hours	int
minutes	int
seconds	Decimal
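
When a value is known to contain no years or months, it can still be converted to a timedelta. The sketch below uses a stand-in namedtuple with the same members as above; the function name to_timedelta is hypothetical, and it assumes that a sign of '-' marks a negative duration, which the description above does not confirm.

```python
from collections import namedtuple
from datetime import timedelta
from decimal import Decimal

# Stand-in with the same members as stream_read_ods.Time, for illustration
Time = namedtuple('Time', ['sign', 'years', 'months', 'days', 'hours', 'minutes', 'seconds'])

def to_timedelta(t):
    # Years and months have no fixed length, so refuse rather than guess
    if t.years or t.months:
        raise ValueError('Cannot convert years or months to a timedelta')
    # timedelta does not accept Decimal, so split the seconds into whole
    # seconds and microseconds
    whole_seconds = int(t.seconds)
    microseconds = int((t.seconds - whole_seconds) * 1_000_000)
    delta = timedelta(
        days=t.days, hours=t.hours, minutes=t.minutes,
        seconds=whole_seconds, microseconds=microseconds,
    )
    # Assumption: a sign of '-' marks a negative duration
    return -delta if t.sign == '-' else delta
```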

Merged cells

Merged cells in the spreadsheet are split, with the same value copied into all of the resulting cells. This is probably The Right Thing when converting a spreadsheet into a dataframe-like structure since such cells are usually header-like.
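
The effect can be pictured on a small in-memory grid. This is only an illustration of the resulting shape, not how the library processes streamed rows; the helper split_merged and its parameters are hypothetical.

```python
def split_merged(grid, top, left, row_span, col_span):
    # Copy the value of the merged cell's top-left corner into every
    # position that the merged cell covers
    value = grid[top][left]
    for r in range(top, top + row_span):
        for c in range(left, left + col_span):
            grid[r][c] = value
    return grid
```

For example, a header cell merged across two rows and two columns ends up as four cells that all hold the header text.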

Running tests

pip install -e ".[dev]"
pytest

Exceptions

Exceptions raised by the source iterable are passed through stream_read_ods unchanged. Other exceptions are in the stream_read_ods module, and derive from its StreamReadODSError.

Exception hierarchy

StreamReadODSError

Base class for all explicitly raised exceptions.
- InvalidOperationError
  - UnfinishedIterationError
    
    The rows iterator of a sheet has not been iterated to completion.
- InvalidODSFileError (also inherits from the ValueError built-in)
  
  Base class for errors relating to the bytes of the ODS file not being parsable. Several errors relate to the fact that ODS files are ZIP archives that require specific members and contents.
  - UnzipError
    
    The ODS file does not appear to be a valid ZIP file. More detail is in the __cause__ member of the raised exception, which is an exception that derives from UnzipValueError in stream-unzip.
  - MissingMIMETypeError
    
    The MIME type of the file was not present. In ZIP terms, this means that the first file of the ZIP archive is not named mimetype.
  - IncorrectMIMETypeError
    
    The MIME type was present, but does not match application/vnd.oasis.opendocument.spreadsheet. This can happen if a file such as an Open Document Text (ODT) file is passed rather than an ODS file.
  - MissingContentXMLError
    
    The file claims to be an ODS file according to its MIME type, but does not contain the required content.xml file that contains the sheet data.
  - InvalidContentXMLError
    
    The file claims to be an ODS file according to its MIME type and contains a content.xml file, but that file doesn't appear to contain valid XML. More detail is in the __cause__ member of the raised exception, which is an exception that derives from lxml.etree.LxmlError.
    
    This exception may be raised in cases where the underlying XML requires a large amount of memory to be parsed.
  - InvalidODSXMLError
    
    The file has valid content as XML, but there is some aspect of the XML that makes it not parseable as a spreadsheet.
    - InvalidTypeError
      
      The data type of a cell is not one of the 8 ODS data types.
    - InvalidValueError
      
      The value of a cell cannot be parsed as its declared type. More detail may be in the __cause__ member of the raised exception.
      - InvalidBooleanValueError
      - InvalidCurrencyValueError
      - InvalidDateValueError
      - InvalidFloatValueError
      - InvalidPercentageValueError
      - InvalidTimeValueError
- SizeError
  
  The file appears valid as an ODS file so far, but processing has hit a size-related limit. These limits are in place to avoid unexpectedly high memory use.
  - TooManyColumnsError
    
    More columns than the max_columns argument to stream_read_ods have been encountered. The default limit is 65536.
  - TooManySplitCells
    
    When splitting merged cells, more split cells need to be created than the max_split_cells argument to stream_read_ods allows. The default limit is 65536.
  - StringTooLongError
    
    A cell with a string value that's longer than the max_string_length argument to stream_read_ods has been encountered. The default limit is 65536.
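
One consequence of the hierarchy above is that handlers should be ordered from most to least specific, and that InvalidODSFileError can also be caught as a ValueError. The sketch below demonstrates this with stand-in classes that mirror part of the hierarchy; they are defined here only for illustration, and the classify function is hypothetical.

```python
# Stand-in classes mirroring part of the hierarchy above, for illustration
class StreamReadODSError(Exception):
    pass

class InvalidODSFileError(StreamReadODSError, ValueError):
    pass

class UnzipError(InvalidODSFileError):
    pass

class SizeError(StreamReadODSError):
    pass

def classify(exc):
    # Order matters: the most specific handlers come first
    try:
        raise exc
    except InvalidODSFileError:
        return 'invalid file'
    except StreamReadODSError:
        return 'other library error'
    except Exception:
        return 'error from the source iterable or elsewhere'
```

With the real library, the same except clauses would wrap the loop over stream_read_ods, with the exception classes imported from the stream_read_ods module instead of being defined locally.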
