
data-patterns


Package for generating and evaluating data-patterns in quantitative reports

Features

Here is what the package does:

  • Generating and evaluating patterns in structured datasets and exporting to Excel and JSON

  • Transforming generated patterns into Pandas code

Quick overview

To install the package

pip install data_patterns

To introduce the features of this package, define the following Pandas DataFrame:

import pandas as pd
import data_patterns

df = pd.DataFrame(columns = ['Name',       'Type',             'Assets', 'TV-life', 'TV-nonlife' , 'Own funds', 'Excess'],
                  data   = [['Insurer  1', 'life insurer',     1000,     800,       0,             200,         200],
                            ['Insurer  2', 'non-life insurer', 4000,     0,         3200,          800,         800],
                            ['Insurer  3', 'non-life insurer', 800,      0,         700,           100,         100],
                            ['Insurer  4', 'life insurer',     2500,     1800,      0,             700,         700],
                            ['Insurer  5', 'non-life insurer', 2100,     0,         2200,          200,         200],
                            ['Insurer  6', 'life insurer',     9000,     8800,      0,             200,         200],
                            ['Insurer  7', 'life insurer',     9000,     0,         8800,          200,         200],
                            ['Insurer  8', 'life insurer',     9000,     8800,      0,             200,         200],
                            ['Insurer  9', 'non-life insurer', 9000,     0,         8800,          200,         200],
                            ['Insurer 10', 'non-life insurer', 9000,     0,         8800,          200,         199.99]])
df.set_index('Name', inplace = True)

Start by defining a PatternMiner:

miner = data_patterns.PatternMiner(df)

To generate patterns use the find-function of this object:

df_patterns = miner.find({'name'      : 'equal values',
                          'pattern'   : '=',
                          'parameters': {"min_confidence": 0.5,
                                         "min_support"   : 2,
                                         "decimal" : 8}})

The result is a DataFrame with the patterns that were found. The first part of the DataFrame now contains

id    pattern_id      pattern_def               support    exceptions    confidence
0     equal values    {Own funds} = {Excess}    9          1             0.9
The miner finds one pattern; it states that the ‘Own funds’ column is identical to the ‘Excess’ column in 9 of the 10 cases (a confidence of 90%; there is one case where the equal pattern does not hold).

To analyze data with the generated set of data-patterns use the analyze function with the dataframe with the data as input:

df_results = miner.analyze(df)

The result is a DataFrame with the results. If we select the rows with result_type = False, the first part of the output contains

index         result_type    pattern_id      pattern_def               support    exceptions    confidence    P values    Q values
Insurer 10    False          equal values    {Own funds} = {Excess}    9          1             0.9           200         199.99
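To inspect only the exceptions, the results can be filtered with ordinary Pandas selection. This is a minimal sketch, assuming result_type ends up as a regular column after resetting the index (the exact layout of the results DataFrame may differ):

df_exceptions = df_results.reset_index()
# keep only the rows where the pattern does not hold
df_exceptions = df_exceptions[df_exceptions['result_type'] == False]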

Other patterns you can use are ‘>’, ‘<’, ‘<=’, ‘>=’, ‘!=’, ‘sum’, and ‘-->’.
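For example, a strict greater-than pattern can be mined with the same call structure as the ‘=’ example above (the pattern name and thresholds below are only illustrative):

df_patterns = miner.find({'name'      : 'assets exceed own funds',
                          'pattern'   : '>',
                          'parameters': {"min_confidence": 0.5,
                                         "min_support"   : 2}})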

Read the documentation for more features.

Upload to PyPI (for developers)

  1. Change the version in setup.py and setup.cfg

  2. Go to github.com and navigate to the repository. Next, click on the “releases” tab and then on “Create a new release”. Now define a tag version (it is best to use the same number as in your setup.py version field, for example v0.1.15). Then click on “Publish release”.

  3. Make a PyPI account here: https://pypi.org/manage/projects/

  4. Install twine by typing in your command prompt:

    pip install twine
  5. Get admin rights to the data_patterns package from the owner.

  6. Delete the old files in the dist folder

  7. Open your command prompt and go to the folder of data_patterns. Then type

    python setup.py sdist

    twine upload dist/*

A good reference is here: https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56

History

0.1.0 (2019-10-27)

  • Development release.

0.1.11 (2019-11-6)

  • First release on PyPI.

< 0.1.17 (2020-10-6)

Expression

You can now use expressions to find patterns. An expression is a string such as ‘{.*}={.*}’ (this one finds columns that are equal to each other). See the usage examples for how to do this, also with unknown values.
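A sketch of how this might look, assuming the expression is passed through an 'expression' key in the find definition (the key name is an assumption; check the usage documentation for the exact form):

df_patterns = miner.find({'name'      : 'equal columns',
                          'expression': '{.*} = {.*}',
                          'parameters': {"min_confidence": 0.5,
                                         "min_support"   : 2}})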

Patterns of the form IF ... THEN ... are evaluated through a Pandas expression, while quantitative patterns are found using NumPy (which is quicker). An expression is split up into parts if it is quantitative.

Function

Added the function correct_data. This corrects data based on the most common value when grouped by another column, e.g. it changes the names in a column if there are multiple names per LEI code.

Other

  1. Added P and Q values to analyze

  2. Added a highest_conf option to find the pattern with the highest confidence based on the P value.

  3. Possible to use with EVA2 rules

0.1.17 (2020-10-6)

Parameters

  1. ‘window’ (boolean): Only compares columns in a window of n, so [column-n, column+n].

  2. ‘disable’ (boolean): If you set this to True, it will disable all tqdm progress bars for finding and analyzing patterns.

  3. ‘expres’ (boolean): When you use an expression, it is only used directly if it is an IF THEN statement. Otherwise it is treated as a quantitative pattern, split up into parts, and evaluated with NumPy (which is quicker). Sometimes, however, you want to work with the expression directly, for example when the pattern is that the difference between two columns is lower than 5%. If you set expres to True, the expression is used directly (see the sketch after this list).
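A sketch of how these parameters might be passed, following the parameters dictionary used in the quick overview (the specific values below are only illustrative):

df_patterns = miner.find({'name'      : 'equal values',
                          'pattern'   : '=',
                          'parameters': {"min_confidence": 0.5,
                                         "min_support"   : 2,
                                         "window"        : True,
                                         "disable"       : True,
                                         "expres"        : False}})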

    Expression

  1. You can use ABS in expressions. This calculates the absolute value, for example ‘ABS({'X'} - {'Y'}) = {'Z'}’.

    cluster

  1. You can now add the column name on which you want to cluster

    Function

  1. Convert_to_time: merges periods together by adding the suffixes (t-1) and (t) to columns.

  2. convert_columns_to_time: makes the periods into columns, so that you have years as columns.

    Other

  1. Add tqdm progress bars

0.1.18 (16-11-2020)

variables to miner

You can now pass a boolean to the miner. If you pass True, it removes all the " and ' characters from the string data. This is needed for data where names contain those characters, which would otherwise cause errors later on.
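A minimal sketch, assuming the boolean is passed directly to the PatternMiner constructor (the exact argument name and position are an assumption; check the documentation):

# assumption: the quote-stripping flag is passed as an extra constructor argument
miner = data_patterns.PatternMiner(df, True)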

Function to read overzicht

Changed the IF THEN expression so that decimals can be used when the values are numeric.

Parameters

  1. ‘notNaN’ (boolean): Only takes columns that are not NaN.

    Function changes

  1. Convert_to_time: added a boolean set_year. If True, only the years are used (for yearly data); otherwise the whole date is kept. Defaults to True.

  2. update_statistics: removes patterns that contain columns which are not in the data. This is necessary so that insurers whose data lacks those columns do not get errors.

0.1.19 (10-2-2020)

Bug fixes with expressions including regex

0.1.20 (29-4-2021)

Suppress Pandas slice error in some cases

Deleted logging.basicConfig (to avoid overwriting the initial logging configuration)
