CausalNLP

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Programming Language

Project description

CausalNLP

CausalNLP is a practical toolkit for causal inference with text.

Install

pip install -U pip
pip install causalnlp

Usage

What is the causal impact of a positive review on a product click?

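import pandas as pd
# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')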

The file music_seed50.tsv is a semi-simulated dataset. Columns of relevance include:

Y_sim: simulated outcome, where 1 means the product was clicked and 0 means it was not.
C_true: confounding categorical variable (1 = audio CD, 0 = other).
T_true: true review sentiment, where 1 means a rating of 5 and 0 means a rating less than 3; T_true affects the outcome Y_sim.
T_ac: an approximation of the true review sentiment (T_true) created with Autocoder.
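
Because the dataset is semi-simulated, both the proxy treatment T_ac and the true treatment T_true are available, so it is possible to check how closely the proxy tracks the truth before relying on it. The sketch below counts the four (T_true, T_ac) combinations on a handful of made-up rows (not values from music_seed50.tsv). The helper agreement_table is hypothetical and uses only the Python standard library.

```python
from collections import Counter

# Made-up rows for illustration only; these are NOT taken from music_seed50.tsv.
rows = [
    {"T_true": 1, "T_ac": 1},
    {"T_true": 1, "T_ac": 0},
    {"T_true": 0, "T_ac": 0},
    {"T_true": 0, "T_ac": 0},
    {"T_true": 1, "T_ac": 1},
    {"T_true": 0, "T_ac": 1},
]

def agreement_table(rows, true_col="T_true", proxy_col="T_ac"):
    """Hypothetical helper: count each (true, proxy) pair and the share that match."""
    counts = Counter((r[true_col], r[proxy_col]) for r in rows)
    matches = sum(n for (t, p), n in counts.items() if t == p)
    return counts, matches / len(rows)

counts, agreement = agreement_table(rows)
print(dict(counts))
print(f"agreement rate: {agreement:.2f}")
```

With the real data loaded into a DataFrame, pd.crosstab(df['T_true'], df['T_ac']) should give the same kind of table.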

We'll pretend the rating and T_true are unobserved and use only T_ac as the treatment variable. The text_col parameter adds the raw review text as covariates, so the model can adjust for it when estimating causal effects.

from causalnlp.causalinference import CausalInferenceModel
from lightgbm import LGBMClassifier
cm = CausalInferenceModel(df, 
                         metalearner_type='t-learner', learner=LGBMClassifier(num_leaves=500),
                         treatment_col='T_ac', outcome_col='Y_sim', text_col='text',
                         include_cols=['C_true'])
cm.fit()

outcome column (categorical): Y_sim
treatment column: T_ac
numerical/categorical covariates: ['C_true']
text covariate: text
preprocess time:  1.1216762065887451  sec
start fitting causal inference model
time to fit causal inference model:  9.701336860656738  sec
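
For readers unfamiliar with the metalearner_type='t-learner' setting, the general T-learner idea is to fit one outcome model on treated rows and a second on control rows, then take the difference of their predictions as the per-row effect. The sketch below illustrates that idea on made-up data with a deliberately trivial "learner" (the mean outcome for each value of one binary covariate). It is not CausalNLP's implementation, which hands the covariates and text features to the supplied learner. The names fit_mean_model and t_learner_effects are hypothetical.

```python
from collections import defaultdict

# Made-up (covariate c, treatment t, outcome y) rows; for illustration only.
data = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 0),
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1),
]

def fit_mean_model(rows):
    """Hypothetical stand-in for a learner: predicts the mean outcome per covariate value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, _, y in rows:
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return lambda c: means[c]

def t_learner_effects(rows):
    """Fit separate outcome models on treated and control rows, then difference them."""
    mu1 = fit_mean_model([r for r in rows if r[1] == 1])  # treated-only model
    mu0 = fit_mean_model([r for r in rows if r[1] == 0])  # control-only model
    return [mu1(c) - mu0(c) for c, _, _ in rows]

effects = t_learner_effects(data)
ate = sum(effects) / len(effects)  # average of the per-row effect estimates
print(f"ATE estimate: {ate:.3f}")
```

Swapping the per-group mean for a real classifier or regressor changes only fit_mean_model; the two-model structure stays the same.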

The average treatment effect (ATE):

print( cm.estimate_ate() )

{'ate': 0.1309311542209525}

The conditional average treatment effect (CATE) for those reviews that mention the word "toddler":

print( cm.estimate_ate(df['text'].str.contains('toddler')) )

{'ate': 0.15559234254638685}
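
Conceptually, a CATE for a subgroup is the average of per-row effect estimates restricted to the rows in that subgroup, which is why a boolean mask is passed to estimate_ate above. The sketch below shows that restriction on made-up texts and made-up effect values (not output from CausalNLP). It illustrates the idea only, not the library's internals, and subgroup_mean is a hypothetical helper.

```python
# Made-up review texts and per-row effect estimates; for illustration only.
texts = [
    "my toddler loves this album",
    "great guitar solos",
    "bought it for our toddler",
    None,  # a missing review text
]
effects = [0.20, 0.05, 0.10, 0.30]

def subgroup_mean(effects, mask):
    """Hypothetical helper: average the effects of rows where mask is True."""
    selected = [e for e, keep in zip(effects, mask) if keep]
    if not selected:
        raise ValueError("mask selects no rows")
    return sum(selected) / len(selected)

# Treat missing text as "does not mention the word".
mask = [isinstance(t, str) and "toddler" in t for t in texts]
print(subgroup_mean(effects, mask))
```

In pandas, the analogous guard against missing text is the na argument of Series.str.contains, for example df['text'].str.contains('toddler', na=False).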

Features most predictive of the treatment effects (e.g., increase in probability of clicking product):

print( cm.interpret(plot=False)[1][:10] )

v_music    0.079042
v_cd       0.066838
v_album    0.055168
v_like     0.040784
v_love     0.040635
C_true     0.039949
v_just     0.035671
v_song     0.035362
v_great    0.029918
v_heard    0.028373
dtype: float64

Features with the v_ prefix are word features. C_true is the categorical variable indicating whether or not the product is a CD.

Release history

- 0.7.0: Aug 2, 2022
- 0.6.0: Oct 20, 2021
- 0.5.0: Sep 3, 2021
- 0.4.0: Jul 20, 2021
- 0.3.1: Jul 19, 2021
- 0.3.0: Jul 15, 2021
- 0.2.0: Jun 21, 2021
- 0.1.3: Jun 17, 2021
- 0.1.2: Jun 17, 2021
- 0.1.1: Jun 17, 2021
- 0.1.0: Jun 16, 2021
- 0.1.0b1 (pre-release): Jun 15, 2021
- 0.1.0b0 (pre-release, this version): Jun 15, 2021
- 0.0.1b0 (pre-release): Jun 14, 2021
- 0.0.1a0 (pre-release): May 30, 2021

Download files

Source Distribution

causalnlp-0.1.0b0.tar.gz (21.6 kB)

Uploaded Jun 15, 2021 Source

Hashes for causalnlp-0.1.0b0.tar.gz
Algorithm	Hash digest
SHA256	`2aa32714c2b529d8b773c426659ecce1533083e2d3156f4b01ec5c4f689091fc`
MD5	`62cc030aae52c96b9f54fc47bfec4c65`
BLAKE2b-256	`758396681d3213484806530f6b46fd594c111cbf1ef71cf1294c353e0e446582`