Extremely fast factor expression & computation library for Python.
Factor Expr
Factor Expression | + | Historical Data | = | Factor Values |
---|---|---|---|---|
(TSLogReturn 30 :close) | + | 2019-12-27~2020-01-14.pq | = | [0.01, 0.035, ...] |
On a server with an E7-4830 CPU (16 cores, 2000 MHz), replaying 48 factors over a dataset of 24,513,435 rows x 683 columns (12 GB) takes 150 s.
Join [Discussions] for Q&A and feature proposals!
Usage
There are three steps to use this library.
- Prepare your dataset into a file. Currently, only the Parquet format is supported.
- Define your factors using S-Expression.
- Run `replay` to compute the factors on the dataset.
1. Prepare the dataset
The dataset must be in a tabular format with at least a `time` column.
Apart from the `time` column, the dataset can contain other columns with any names you want.
For example, here is an OHLC candle dataset with 2 rows:
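    from datetime import datetime
    import pandas as pd

    # Two daily OHLC candles; "time" is the only column name that is required.
    df = pd.DataFrame({
        "time": [datetime(2021, 4, 23), datetime(2021, 4, 24)],
        "open": [3.1, 5.8],
        "high": [8.8, 7.7],
        "low": [1.1, 2.1],
        "close": [4.4, 3.4],
    })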
You can use the following code to store the DataFrame into a Parquet file:
    import pyarrow as pa
    import pyarrow.parquet as pq

    tb = pa.Table.from_pandas(df)
    # Cast to the schema the library expects: millisecond timestamps and float64 values.
    tb = tb.cast(
        pa.schema(
            [
                ("time", pa.timestamp("ms")),
                ("open", pa.float64()),
                ("high", pa.float64()),
                ("low", pa.float64()),
                ("close", pa.float64()),
            ]
        )
    )
    pq.write_table(tb, "data.pq", version="2.0")
Several things need to be noted:
- The `time` column is required and its data type must be `pa.timestamp("ms")`.
- Other columns must have the `pa.float64()` data type.
- The version for the Parquet file must be "2.0".
In the future, the first and third requirements might be relaxed.
2. Define your factors
Factor Expr uses S-Expressions to describe a factor.
For example, on a daily OHLC dataset, the 30-day log return on the column `close` is expressed as:

    from factor_expr import Factor

    Factor("(TSLogReturn 30 :close)")

Note that in Factor Expr, column names are referred to using `:column-name`.
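To make the S-Expression syntax concrete, here is a minimal sketch of how such a string breaks down into a function name, a constant, and a column reference. This is an illustration only and not the parser Factor Expr uses; the `tokenize` and `parse` helpers and the `("column", name)` tuple are hypothetical names chosen for this example, and malformed input is not handled.

```python
def tokenize(source):
    """Split an S-Expression string into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively turn a token list into nested Python lists."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return node
    if token.startswith(":"):
        # ":close" refers to the column named "close"
        return ("column", token[1:])
    try:
        return float(token)  # a constant such as 30
    except ValueError:
        return token  # a function name such as TSLogReturn or +


simple = parse(tokenize("(TSLogReturn 30 :close)"))
nested = parse(tokenize("(TSMean 5 (+ :close 3))"))
print(simple)
print(nested)
```

The nested case shows why an `<expr>` argument can itself be another S-Expression: each pair of parentheses simply becomes one more level of nesting.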
3. Compute the factors on the prepared dataset
Following steps 1 and 2, you can now compute the factors using the `replay` function:

    from factor_expr import Factor, replay

    result = replay(
        ["data.pq"],
        [Factor("(TSLogReturn 30 :close)")]
    )
The first parameter of `replay` is a list of dataset files and the second parameter is a list of Factors. This lets you compute multiple factors on multiple datasets. Don't worry about performance: Factor Expr automatically parallelizes over the Factors as well as the datasets.
The returned result is a pandas DataFrame with the factors as the column names and `time` as the index.
If multiple datasets are passed in, the results are concatenated in the exact order of the datasets. This is useful if you have a scattered dataset, e.g. one file for each year.
For example, the code above will give you a DataFrame that looks similar to this:
index | (TSLogReturn 30 :close) |
---|---|
2021-04-24 | 0.23 |
... | ... |
Check out the docstring of `replay` for more information!
Installation
pip install factor-expr
Supported Functions
Notations:
- `<const>` means a constant, e.g. `3`.
- `<expr>` means either a constant, an S-Expression, or a column name, e.g. `3`, `(+ :close 3)`, or `:open`.
Here's the full list of supported functions. If you can't find the one you need, consider asking on Discussions or creating a PR!
Arithmetics
- Addition: `(+ <expr> <expr>)`
- Subtraction: `(- <expr> <expr>)`
- Multiplication: `(* <expr> <expr>)`
- Division: `(/ <expr> <expr>)`
- Power: `(^ <const> <expr>)`, computes `<expr> ^ <const>`
- Negation: `(Neg <expr>)`
- Signed Power: `(SPow <const> <expr>)`, computes `sign(<expr>) * abs(<expr>) ^ <const>`
- Natural Logarithm after Absolute: `(LogAbs <expr>)`
- Sign: `(Sign <expr>)`
- Abs: `(Abs <expr>)`
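The two less common operators above, `SPow` and `LogAbs`, can be sketched in plain Python on a single value. This is only a reference for the formulas given in the list, not the library's implementation; the function names `sign`, `spow`, and `log_abs` are hypothetical, and how the library treats edge cases such as zero or `NaN` is not covered here.

```python
import math


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def spow(exponent, x):
    """(SPow <const> <expr>): sign(x) * abs(x) ^ const."""
    return sign(x) * abs(x) ** exponent


def log_abs(x):
    """(LogAbs <expr>): natural logarithm of the absolute value."""
    # math.log raises ValueError for 0; the library may behave differently.
    return math.log(abs(x))


print(spow(2, -3.0))     # squaring keeps the negative sign
print(spow(0.5, -4.0))   # a "square root" that works on negative input
print(log_abs(-math.e))
```

Keeping the sign is what makes `SPow` usable on values such as returns, where a plain fractional power of a negative number has no real result.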
Logics
- If: `(If <expr> <expr> <expr>)`, if the first `<expr>` is larger than 0, return the second `<expr>`, otherwise return the third `<expr>`
- And: `(And <expr> <expr>)`
- Or: `(Or <expr> <expr>)`
- Less Than: `(< <expr> <expr>)`
- Less Than or Equal: `(<= <expr> <expr>)`
- Greater Than: `(> <expr> <expr>)`
- Greater Than or Equal: `(>= <expr> <expr>)`
- Equal: `(== <expr> <expr>)`
- Not: `(! <expr>)`
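The rule for `If` is applied tick by tick. As a rough sketch of that rule in plain Python (not the library's code; `if_else` is a hypothetical helper, and the treatment of `NaN` conditions is an assumption), the expression `(If (- :close :open) :high :low)` on the two-row example dataset from step 1 would behave like this:

```python
def if_else(cond, then, otherwise):
    """Element-wise If: pick `then` where cond > 0, else `otherwise`."""
    return [t if c > 0 else o for c, t, o in zip(cond, then, otherwise)]


# Columns of the two-row OHLC example dataset.
open_ = [3.1, 5.8]
high = [8.8, 7.7]
low = [1.1, 2.1]
close = [4.4, 3.4]

# (- :close :open) is positive on the first tick and negative on the second.
diff = [c - o for c, o in zip(close, open_)]

# (If (- :close :open) :high :low)
picked = if_else(diff, high, low)
print(picked)
```

Note that the condition is tested against 0 rather than treated as a boolean, so any numeric expression can serve as the first argument.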
Window Functions
All the window functions take a window size as the first argument. The computation is done on the look-back window with the size given in `<const>`.
- Sum of the window elements: `(TSSum <const> <expr>)`
- Mean of the window elements: `(TSMean <const> <expr>)`
- Min of the window elements: `(TSMin <const> <expr>)`
- Max of the window elements: `(TSMax <const> <expr>)`
- The index of the min of the window elements: `(TSArgMin <const> <expr>)`
- The index of the max of the window elements: `(TSArgMax <const> <expr>)`
- Stdev of the window elements: `(TSStd <const> <expr>)`
- Skew of the window elements: `(TSSkew <const> <expr>)`
- The rank (ascending) of the current element in the window: `(TSRank <const> <expr>)`
- The value `<const>` ticks back: `(Delay <const> <expr>)`
- The log return of the value `<const>` ticks back to the current value: `(TSLogReturn <const> <expr>)`
- Rolling correlation between two series: `(TSCorrelation <const> <expr> <expr>)`
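As a reading aid for `Delay` and `TSLogReturn`, here is a small pure-Python sketch of the descriptions above. It assumes that "`<const>` ticks back" means the element at index `i - const`, that the log return is `log(current / past)`, and that ticks without enough history yield `NaN`; the library's exact offsets and warm-up length may differ. The helper names `delay` and `ts_log_return` are hypothetical.

```python
import math

NAN = float("nan")


def delay(n, xs):
    """Value n ticks back; NaN while there is not enough history."""
    return [xs[i - n] if i >= n else NAN for i in range(len(xs))]


def ts_log_return(n, xs):
    """Log return from the value n ticks back to the current value."""
    return [
        NAN if math.isnan(past) else math.log(current / past)
        for current, past in zip(xs, delay(n, xs))
    ]


closes = [100.0, 110.0, 121.0]
lagged = delay(1, closes)
returns = ts_log_return(1, closes)
print(lagged)
print(returns)
```

Both 10% moves in `closes` produce the same log return, which is the usual reason for preferring log returns when summing over time.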
Warm-up Period for Window Functions
Factors containing window functions require a warm-up period. For example, `(TSSum 10 :close)` will not generate data until the 10th tick is replayed.
In this case, `replay` writes `NaN` into the result by default, so that the length of the output is the same as the input dataset. You can use the `trim` parameter to control this behaviour.
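The effect of the warm-up period can be sketched with a rolling sum in plain Python. This only mirrors the behaviour described above (no value until the window is full, `NaN` padding so the output length matches the input); it is not the library's code. The `trim_warmup` helper is a hypothetical stand-in for what dropping the warm-up rows could look like and makes no claim about the actual type or semantics of the `trim` parameter.

```python
import math


def ts_sum(window, xs):
    """Rolling sum that emits NaN until `window` ticks have been seen."""
    out = []
    for i in range(len(xs)):
        if i + 1 < window:
            out.append(float("nan"))  # still warming up
        else:
            out.append(sum(xs[i + 1 - window : i + 1]))
    return out


def trim_warmup(values):
    """Drop the leading NaN values produced during warm-up."""
    start = 0
    while start < len(values) and math.isnan(values[start]):
        start += 1
    return values[start:]


padded = ts_sum(3, [1.0, 2.0, 3.0, 4.0, 5.0])
trimmed = trim_warmup(padded)
print(padded)
print(trimmed)
```

With a window of 3, the first two outputs are `NaN` and the first real value appears on the third tick, just as `(TSSum 10 :close)` produces its first value on the 10th tick.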