Open Bandit Dataset & Pipeline

Documentation | Overview | Installation | Usage | References | Quickstart | Open Bandit Dataset | 日本語

Overview

Open Bandit Dataset (OBD)

Open Bandit Dataset is a public real-world logged bandit feedback dataset provided by ZOZO, Inc., the largest Japanese fashion e-commerce company, with a market capitalization of over 5 billion USD (as of May 2020). The company uses multi-armed bandit algorithms to recommend fashion items to users on its large-scale fashion e-commerce platform, ZOZOTOWN. The following figure presents examples of displayed fashion items as actions.

Recommended fashion items as actions in ZOZOTOWN

We collected the data in a 7-day experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively. Each campaign randomly used either the Random algorithm or the Bernoulli Thompson Sampling (Bernoulli TS) algorithm for each user impression.

Please see ./obd/README.md for the description of the dataset.

Open Bandit Pipeline (OBP)

Open Bandit Pipeline is a series of implementations of dataset preprocessing, offline bandit simulation, and evaluation of OPE estimators. This pipeline allows researchers to focus on building their OPE estimator and easily compare it with others’ methods in realistic and reproducible ways. Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.

Structure of Open Bandit Pipeline

Open Bandit Pipeline consists of the following main modules (a short sketch of how they fit together follows the list).

  • dataset module: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit feedback.
  • policy module: This module provides interfaces for bandit algorithms and several standard algorithms.
  • simulator module: This module provides functions for conducting offline bandit simulation.
  • ope module: This module provides interfaces for OPE estimators and several standard estimators.
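
As a quick orientation, the following sketch collects the imports used in the Usage section below and notes which module each name comes from. It is only an orientation aid, not an exhaustive map of the package.

# which module each class/function used in this README comes from
from obp.dataset import OpenBanditDataset               # dataset module
from obp.policy import BernoulliTS                       # policy module
from obp.simulator import run_bandit_simulation          # simulator module
from obp.ope import OffPolicyEvaluation, ReplayMethod    # ope module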

Supported Algorithms and OPE Estimators

In addition to the supported algorithms and estimators, the pipeline provides flexible interfaces, so researchers can easily implement their own algorithms or estimators and evaluate them with our data and pipeline. Moreover, the pipeline provides an interface for logged bandit feedback datasets, so practitioners can combine their own datasets with the pipeline and easily evaluate the performance of bandit algorithms in their own settings.

Topics and Tasks

Currently, Open Bandit Dataset & Pipeline facilitate evaluation and comparison related to the following research topics.

  • Bandit Algorithms: Our data include large-scale logged bandit feedback collected by the uniform random policy. This enables the evaluation of new online bandit algorithms, including contextual and combinatorial algorithms, in a large real-world setting.

  • Off-Policy Evaluation: Our pipeline includes implementations of the behavior policies used to collect the datasets. Because the open data contain logged bandit feedback generated by multiple behavior policies, it enables the evaluation of off-policy evaluation estimators with access to the ground-truth performance of counterfactual policies.

Installation

You can install OBP using Python's package manager pip.

pip install obp

You can also install OBP from source as follows.

git clone https://github.com/st-tech/zr-obp
cd zr-obp
python setup.py install

Requirements

  • python>=3.7.0
  • matplotlib>=3.2.2
  • numpy>=1.18.1
  • pandas>=0.25.1
  • pyyaml>=5.1
  • seaborn>=0.10.1
  • scikit-learn>=0.23.1
  • scipy>=1.4.1
  • tqdm>=4.41.1

Usage

We show an example of conducting an offline evaluation of the performance of Bernoulli Thompson Sampling (BernoulliTS) as a counterfactual policy, using the Replay Method and logged bandit feedback generated by the Random policy (the behavior policy). As the example shows, about ten lines of code are sufficient to complete OPE from scratch.

# a case for implementing OPE of the BernoulliTS policy using log data generated by the Random policy
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.simulator import run_bandit_simulation
from obp.ope import OffPolicyEvaluation, ReplayMethod

# (1) Data loading and preprocessing
dataset = OpenBanditDataset(behavior_policy='random', campaign='women')
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) Offline Bandit Simulation
counterfactual_policy = BernoulliTS(n_actions=dataset.n_actions, len_list=dataset.len_list)
selected_actions = run_bandit_simulation(bandit_feedback=bandit_feedback, policy=counterfactual_policy)

# (3) Off-Policy Evaluation
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[ReplayMethod()])
estimated_policy_value = ope.estimate_policy_values(selected_actions=selected_actions)

# estimated performance of BernoulliTS relative to the ground-truth performance of Random
relative_policy_value_of_bernoulli_ts = estimated_policy_value['rm'] / bandit_feedback['reward'].mean()
print(relative_policy_value_of_bernoulli_ts) # 1.120574...

A formal introduction with the same example can be found at quickstart. Below, we explain some important features in the example flow.

(1) Data loading and preprocessing

We prepare an easy-to-use data loader for Open Bandit Dataset.

# load and preprocess raw data in "Women" campaign collected by the Random policy
dataset = OpenBanditDataset(behavior_policy='random', campaign='women')
# obtain logged bandit feedback generated by behavior policy
bandit_feedback = dataset.obtain_batch_bandit_feedback()

print(bandit_feedback.keys())
# dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])

Users can implement their own feature engineering in the pre_process method of the obp.dataset.OpenBanditDataset class. We show an example of implementing some new feature engineering processes in ./examples/obd/dataset.py. Moreover, by following the interface of the obp.dataset.BaseBanditDataset class, one can handle future open datasets for bandit algorithms other than OBD.
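
To give a feel for the required change, here is a minimal sketch of a subclass that replaces the default context features. It is not taken from ./examples/obd/dataset.py; the class name, the self.data attribute holding the raw log as a pandas DataFrame, and the "user_feature" column prefix are assumptions made for illustration.

import pandas as pd

from obp.dataset import OpenBanditDataset

class OBDWithCustomFeatures(OpenBanditDataset):
    """Hypothetical subclass that replaces the default context features."""

    def pre_process(self) -> None:
        # run the default preprocessing first
        super().pre_process()
        # then rebuild the context from the raw log
        # (`self.data` holding the raw log as a pandas DataFrame is an assumption)
        user_cols = self.data.columns.str.contains("user_feature")
        self.context = pd.get_dummies(self.data.loc[:, user_cols], drop_first=True).values

# the subclass is used exactly like OpenBanditDataset
dataset = OBDWithCustomFeatures(behavior_policy='random', campaign='women')
bandit_feedback = dataset.obtain_batch_bandit_feedback()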

(2) Offline Bandit Simulation

After preparing a dataset, we now run offline bandit simulation on the logged bandit feedback as follows.

# define a counterfactual policy (the Bernoulli TS policy here)
counterfactual_policy = BernoulliTS(n_actions=dataset.n_actions, len_list=dataset.len_list)
# `selected_actions` is an array containing the actions selected by the counterfactual policy in the simulation
selected_actions = run_bandit_simulation(bandit_feedback=bandit_feedback, policy=counterfactual_policy)

The obp.simulator.run_bandit_simulation function takes an obp.policy.BanditPolicy instance and bandit_feedback (a dictionary storing logged bandit feedback) as inputs and runs an offline bandit simulation of the given counterfactual policy. selected_actions is an array of the actions selected by the counterfactual policy during the offline bandit simulation. Users can implement their own bandit algorithms by following the interface of obp.policy.BanditPolicy.
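
As a rough illustration of what such a policy can look like, the following is a minimal context-free epsilon-greedy sketch. The select_action and update_params method names mirror those of the packaged context-free policies such as BernoulliTS, but the exact base-class signature of obp.policy.BanditPolicy may differ, so treat this as a sketch rather than a drop-in implementation.

import numpy as np

class EpsilonGreedy:
    """Hypothetical context-free epsilon-greedy policy sketch."""

    def __init__(self, n_actions: int, len_list: int = 1, epsilon: float = 0.05):
        self.n_actions = n_actions
        self.len_list = len_list
        self.epsilon = epsilon
        self.action_counts = np.zeros(n_actions, dtype=int)
        self.reward_sums = np.zeros(n_actions)

    def select_action(self) -> np.ndarray:
        # explore uniformly at random with probability epsilon
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.n_actions, size=self.len_list, replace=False)
        # otherwise rank actions by their empirical mean reward
        means = self.reward_sums / np.maximum(self.action_counts, 1)
        return np.argsort(-means)[:self.len_list]

    def update_params(self, action: int, reward: float) -> None:
        # update the empirical reward statistics of the chosen action
        self.action_counts[action] += 1
        self.reward_sums[action] += reward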

(3) Off-Policy Evaluation

Our final step is off-policy evaluation (OPE), which attempts to estimate the performance of a bandit algorithm using the logged bandit feedback and the actions selected in the offline bandit simulation. Our pipeline also provides an easy procedure for doing OPE as follows.

# estimate the policy value of BernoulliTS based on actions selected by that policy in offline bandit simulation
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[ReplayMethod()])
estimated_policy_value = ope.estimate_policy_values(selected_actions=selected_actions)
print(estimated_policy_value) # {'rm': 0.005155..}, a dictionary containing the policy value estimated by each OPE estimator

# compare the estimated performance of BernoulliTS (counterfactual policy)
# with the ground-truth performance of Random (behavior policy)
relative_policy_value_of_bernoulli_ts = estimated_policy_value['rm'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 12.05%
print(relative_policy_value_of_bernoulli_ts) # 1.120574...

Users can implement their own OPE estimator by following the interface of the obp.ope.BaseOffPolicyEstimator class. The obp.ope.OffPolicyEvaluation class summarizes and compares the policy values estimated by several OPE estimators. A detailed usage of this class can be found at quickstart. bandit_feedback['reward'].mean() is the empirical mean of the factual rewards in the log (the on-policy estimate of the policy value) and thus is the ground-truth performance of the behavior policy (the Random policy in this example).
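
For reference, a custom estimator can be as small as the inverse probability weighting sketch below. The argument names (reward, action_match, pscore) and the estimator_name attribute are assumptions made for illustration; check obp.ope.BaseOffPolicyEstimator for the exact abstract methods and signatures expected by obp.ope.OffPolicyEvaluation.

from dataclasses import dataclass

import numpy as np

@dataclass
class InverseProbabilityWeightingSketch:
    """Hypothetical IPW estimator sketch; not the packaged implementation."""

    estimator_name: str = 'ipw_sketch'

    def estimate_policy_value(
        self,
        reward: np.ndarray,        # observed rewards in the log
        action_match: np.ndarray,  # 1 if the counterfactual policy chose the logged action
        pscore: np.ndarray,        # propensity scores of the behavior policy
    ) -> float:
        # reweight factual rewards by the inverse propensity score,
        # keeping only rounds where the counterfactual action matches the logged one
        return float(np.mean(action_match * reward / pscore))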

Citation

If you use this project in your work, please cite our paper below.

# TODO: add bibtex
@article{
}

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Main Contributor

References

Papers

  1. Alina Beygelzimer and John Langford. The Offset Tree for Learning with Partial Labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138, 2009.

  2. Olivier Chapelle and Lihong Li. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

  3. Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306, 2011.

  4. Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, pages 2217–2225, 2010.

  5. Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.

  6. Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29:485–511, 2014.

  7. Adith Swaminathan and Thorsten Joachims. The Self-normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

  8. Dhruv Kumar Mahajan, Rajeev Rastogi, Charu Tiwari, and Adway Mitra. LogUCB: An Explore-Exploit Algorithm for Comments Recommendation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 6–15, 2012.

  9. Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, volume 26, pages 19–36, 2012.

  10. Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 3589–3597, 2017.

  11. Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More Robust Doubly Robust Off-policy Evaluation. In Proceedings of the 35th International Conference on Machine Learning, pages 1447–1456, 2018.

  12. Nathan Kallus and Masatoshi Uehara. Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning. In Advances in Neural Information Processing Systems, 2019.

  13. Yusuke Narita, Shota Yasui, and Kohei Yata. Off-policy Bandit and Reinforcement Learning. arXiv preprint arXiv:2002.08536, 2020.

  14. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv preprint arXiv:2005.00687, 2020.

Projects

This project is strongly inspired by Open Graph Benchmark, a collection of benchmark datasets, data loaders, and evaluators for graph machine learning: [github] [project page] [paper].
