No project description provided

These details have not been verified by PyPI

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Project description

Welcome to the DSDS

This package is in pre-alpha stage. There is currently no documentation webpage. Only docstrings and github. Please read CONTRIBUTING.md if you are a developer interested in contributing to this package.

A general purpose traditional data science package for large and scalable data operations. It aims to be an improvement over a subset of functionalities in other packages like Scikit-learn, category_encoders, or feature engine, etc. The primary focus right now is:

Treating dataframes as first-class citizen, using Rust, NumPy to support. The only other core dependency is SciPy and NumPy. Will rely more on Rust as time goes. Python side dependency should be minimized.
Providing practical feature prescreen (immediate detection and removal of useless featuers, data profiling, over time report etc.)
Fast and furious feature selection and engineering using simple methods. It has significantly faster F-score, mutual_info_score, better feature extraction and engineering, etc. Currently, DSDS provides MRMR with more than 90% speed up than the most popular MRMR package on github. For model details, see here
Cleaner pipeline construction and management, much easier to create custom "transformers" (All Polars Expressions are acceptable as "transformers".)
Consistent API with intuitive argument names and extensive docstring and examples. No function side effect.
Functional interface and fully typed functions for a better developer experience. No mixins, no multiple inheritance. No classes. Just Polars.
Better performance and faster iteration

At this moment, it will not include traditional data science models, like SVM, random forest, etc. This may change once in the future when the Rust side of data science models catch up.

DSDS is built upon your favorite: Polars Dataframe

Here is why in my opinion data scientists, analysts and engineers should switch to Polars!

Usage

Practical Feature Prescreen and Data

from dsds.prescreen import (
    infer_highly_correlated,
    infer_str_cols,
    infer_by_var,
    infer_invalid_numeric,
    infer_conti,
    infer_emails,
    infer_nulls,
    infer_by_pattern,
    over_time_report_num,
) # and a lot more!

Functional Pipeline Interface which supports both Eager and LazyFrames! And it can be pickled and load back and reapplied! See more in the examples on github.

from dsds.prescreen import *
from dsds.transform import *

input_df = df.lazy()
output = input_df.pipe(var_removal, threshold = 0.5, target = "Clicked on Ad")\
    .pipe(binary_encode)\
    .pipe(ordinal_auto_encode, cols = ["City", "Country"])\
    .pipe(impute, cols=["Daily Internet Usage", "Daily Internet Usage Band", "Area Income Band"], strategy="median")\
    .pipe(impute, cols=["Area Income"], strategy = "mean")\
    .pipe(scale, cols=["Area Income", "Daily Internet Usage"])\
    .pipe(one_hot_encode, cols= ["One_Hot_Test"])\
    .pipe(remove_if_exists, cols = ["Ad Topic Line", "Timestamp"])\
    .pipe(mutual_info_selector, target = "Clicked on Ad", top_k = 12)

Common practical Transformations for feature extraction or engineering:

import dsds.transform as t
df = pl.DataFrame({
    "survey":["0% of my time", "1% to 25% of my time", "75% to 99% of my time", 
            "50% to 74% of my time", "75% to 99% of my time", 
          "50% to 74% of my time"]
})
print(t.extract_patterns(df, "(\d*\.?\d+)%", ["survey"], join_by=" to "))

shape: (6, 1)
┌────────────┐
│ survey     │
│ ---        │
│ str        │
╞════════════╡
│ 0%         │
│ 1% to 25%  │
│ 75% to 99% │
│ 50% to 74% │
│ 75% to 99% │
│ 50% to 74% │
└────────────┘
#-----------------------------------------------------------------------------
import dsds.transform as t
df = pl.DataFrame({
    "a":[1, 2, 3],
    "b":[1, 2, 3],
    "c":[1, 2, 3]
})
t.extract_horizontally(df, cols=["a", "b", "c"], extract=["min", "max", "sum"])

shape: (3, 3)
┌────────────┬────────────┬────────────┐
│ min(a,b,c) ┆ max(a,b,c) ┆ sum(a,b,c) │
│ ---        ┆ ---        ┆ ---        │
│ i64        ┆ i64        ┆ i64        │
╞════════════╪════════════╪════════════╡
│ 1          ┆ 1          ┆ 3          │
│ 2          ┆ 2          ┆ 6          │
│ 3          ┆ 3          ┆ 9          │
└────────────┴────────────┴────────────┘
#-----------------------------------------------------------------------------
import dsds.transform as t
df = pl.DataFrame({
    "date1":["2021-01-01", "2022-02-03", "2023-11-23"]
    , "date2":["2021-01-01", "2022-02-03", "2023-11-23"]
}).with_columns(
    pl.col(c).str.to_date() for c in ["date1", "date2"]
)
print(df)

shape: (3, 2)
┌────────────┬────────────┐
│ date1      ┆ date2      │
│ ---        ┆ ---        │
│ date       ┆ date       │
╞════════════╪════════════╡
│ 2021-01-01 ┆ 2021-01-01 │
│ 2022-02-03 ┆ 2022-02-03 │
│ 2023-11-23 ┆ 2023-11-23 │
└────────────┴────────────┘

cols = ["date1", "date2"]
print(t.extract_dt_features(df, cols=cols))

shape: (3, 8)
┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ date1      ┆ date2      ┆ date1_year ┆ date2_yea ┆ date1_qua ┆ date2_qua ┆ date1_mon ┆ date2_mon │
│ ---        ┆ ---        ┆ ---        ┆ r         ┆ rter      ┆ rter      ┆ th        ┆ th        │
│ date       ┆ date       ┆ u32        ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│            ┆            ┆            ┆ u32       ┆ u32       ┆ u32       ┆ u32       ┆ u32       │
╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 2021-01-01 ┆ 2021-01-01 ┆ 2021       ┆ 2021      ┆ 1         ┆ 1         ┆ 1         ┆ 1         │
│ 2022-02-03 ┆ 2022-02-03 ┆ 2022       ┆ 2022      ┆ 1         ┆ 1         ┆ 2         ┆ 2         │
│ 2023-11-23 ┆ 2023-11-23 ┆ 2023       ┆ 2023      ┆ 4         ┆ 4         ┆ 11        ┆ 11        │
└────────────┴────────────┴────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

Benchmarks:

Screenshot

DSDS currently provides Polars plugins but the API is subject to change.

Plugin

Dependencies

Python 3.9, 3.10, 3.11+ is recommended.

It should run on all versions >= 3.9.

Why the name DSDS?

Originally I choose the name DSDS because it stands for Dark Side of Data Science. I thought it is kind of cool and since I am doing things differently, it justifies "dark side". I think data science/modelling infrastructure is the most foundational work that is also the most under-appreciated. Pipelines are buried "in darkness" most of the time. Feature selection is often considered a dark art, too. So the name DarkSide/dsds really makes sense to me. But right now, DSDS is just a name. I don't know think calling it Dark Side of Data Science helps with recoginition and I personally wouldn't call the way I am doing things the "dark side" any more. It is simply a new way of doing things not explored by people before.

Why is this package dependent on Sklearn?

As of 0.0.34, the package does not explicitly rely on Scikit-learn any more. The only model-dependent "feature_importance" the package will use is LightGBM, which is an optional dependency.

Why not write more in Rust?

Yes. I am. I recently resumed working on traditional NLP, and this is an area where Rust really shines. It is well-known that Stemmers in NLTK has terrible performance, but stemming is an important operations for many traditional NLP algorithms. To combat this, I have decided to move stemming and other heavy string manipulations to Rust and leave only a very thin wrapper in Python. That said, using Rust can greatly improve performance in many other modules, not just dsds.text. But because development in Rust is relatively slow, I do not want to blindly 'rewrite' in Rust.

Contribution

See CONTRIBUTING.md for my contact info.

Disclaimer

Snowball Stemming implementation for Rust is taken from Tsoding's Seroost project.

Project details

These details have not been verified by PyPI

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Release history Release notifications | RSS feed

This version

0.0.34

Oct 21, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dsds-0.0.34.tar.gz (6.3 MB view hashes)

Uploaded Oct 21, 2023 Source

Built Distribution

dsds-0.0.34-cp311-none-win_amd64.whl (8.4 MB view hashes)

Uploaded Oct 21, 2023 CPython 3.11 Windows x86-64

Hashes for dsds-0.0.34.tar.gz

Hashes for dsds-0.0.34.tar.gz
Algorithm	Hash digest
SHA256	`11682dd3b0a4bd1821e3f7e6bb96e6feb793c60050480299dd3d36eb027a3de0`
MD5	`8718eb016348f96cc4a882a39551e920`
BLAKE2b-256	`49829ab3d97e2ebf04ec480c20bb6c153f8b1027e949bf1aa5073e6e5f8b8e75`

Hashes for dsds-0.0.34-cp311-none-win_amd64.whl

Hashes for dsds-0.0.34-cp311-none-win_amd64.whl
Algorithm	Hash digest
SHA256	`83aae668c8b4560bcfb84d6fe8908f5d8b0b5cf97472b9bc37efeb52365a39f4`
MD5	`7cc01e5a5fbd100bd604116ff75c6086`
BLAKE2b-256	`d56fd0341f6bdf7b74e4276c73e3ca1367b1e80fca000118606b0f70cfff72b9`