Speed up regex matching with non-regex substring "prematchers", similar to Bloom filters.
multiregex
Quickly match many regexes against a string. Provides 2-10x speedups over naïve regex matching.
Installation
You can install the package in development mode using:
git clone git@github.com:quantco/multiregex.git
cd multiregex
# create and activate a fresh environment named multiregex
# see environment.yml for details
mamba env create
conda activate multiregex
pre-commit install
pip install --no-build-isolation -e .
Usage
import multiregex
# Create matcher from multiple regexes.
my_patterns = [r"\w+@\w+\.com", r"\w+\.com"]
matcher = multiregex.RegexMatcher(my_patterns)
# Run `re.search` for all regexes.
# Returns a list of matches as (re.Pattern, re.Match) tuples.
matcher.search("john.doe@example.com")
# => [(re.compile('\\w+@\\w+\\.com'), <re.Match ... 'doe@example.com'>),
# (re.compile('\\w+\\.com'), <re.Match ... 'example.com'>)]
# Same as above, but with `re.match`.
matcher.match(...)
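For comparison, the "naïve regex matching" that the speedup figure refers to can be pictured as a plain loop over compiled patterns. This is only an illustrative baseline written with the standard re module; naive_search is a hypothetical name and not part of multiregex.

```python
import re


def naive_search(patterns, haystack):
    """Run every compiled regex against the haystack, one after another."""
    results = []
    for pattern in patterns:
        match = pattern.search(haystack)
        if match is not None:
            results.append((pattern, match))
    return results


patterns = [re.compile(r"\w+@\w+\.com"), re.compile(r"\w+\.com")]
hits = naive_search(patterns, "john.doe@example.com")
print([m.group(0) for _, m in hits])
# => ['doe@example.com', 'example.com']
```

Every regex runs on every input here, even when the input obviously cannot match. That is the cost prematchers are meant to avoid.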
Custom prematchers
To be able to quickly match many regexes against a string, multiregex uses "prematchers" under the hood. Prematchers are lists of non-regex strings of which at least one can be assumed to be present in the haystack if the corresponding regex matches. As an example, a valid prematcher of r"\w+\.com" could be [".com"] and a valid prematcher of r"(B|b)anana" could be ["B", "b"] or ["anana"].
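The idea can be sketched in a few lines of plain Python. This is a simplified illustration of the prematching concept, not the library's actual implementation, which may scan for prematchers differently. The function name prematched_search is an assumption made for this example.

```python
import re


def prematched_search(entries, haystack):
    """Search with (compiled_pattern, prematchers) entries.

    A regex only runs if at least one of its prematchers occurs in the
    haystack. An empty prematcher list means "always run the regex".
    """
    results = []
    for pattern, prematchers in entries:
        # Cheap substring test first; skip the regex if no prematcher is found.
        if prematchers and not any(p in haystack for p in prematchers):
            continue
        match = pattern.search(haystack)
        if match is not None:
            results.append((pattern, match))
    return results


entries = [
    (re.compile(r"\w+\.com"), [".com"]),
    (re.compile(r"(B|b)anana"), ["anana"]),
]
found = prematched_search(entries, "visit example.com today")
print([m.group(0) for _, m in found])
# => ['example.com']
```

The banana regex is never executed for this input because "anana" does not occur in the haystack. Substring checks are much cheaper than regex searches, so skipping regexes this way is where a speedup can come from.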
You will likely have to provide your own prematchers for all but the simplest regex patterns:
multiregex.RegexMatcher([r"\d+"])
# => ValueError: Could not generate prematcher : '\\d+'
To provide custom prematchers, pass (pattern, prematchers) tuples:
multiregex.RegexMatcher([(r"\d+", map(str, range(10)))])
To use a mixture of automatic and custom prematchers, pass None as the prematchers of every pattern that should get automatic ones:
matcher = multiregex.RegexMatcher([(r"\d+", map(str, range(10))), (r"\w+\.com", None)])
matcher.patterns
# => [(re.compile('\\d+'), ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']),
# (re.compile('\\w+\\.com'), ['com'])]
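A custom prematcher list is only valid if every string the regex matches also contains at least one of the prematchers. One way to sanity-check a hand-written list is to run it against sample inputs. The helper below is a hypothetical sketch that uses only the standard re module; it is not part of multiregex, and it can only find counterexamples among the samples you give it.

```python
import re


def missed_by_prematchers(pattern, prematchers, samples):
    """Return the samples that the regex matches but no prematcher occurs in.

    A non-empty result means the prematchers would wrongly filter out
    real matches.
    """
    regex = re.compile(pattern)
    prematchers = list(prematchers)  # allow iterators such as map(...)
    return [
        s for s in samples
        if regex.search(s) and not any(p in s for p in prematchers)
    ]


samples = ["order 42", "no digits here", "room 7b"]
print(missed_by_prematchers(r"\d+", map(str, range(10)), samples))
# => []
print(missed_by_prematchers(r"(B|b)anana", ["B"], ["banana split"]))
# => ['banana split']
```

The second call shows an invalid prematcher list: ["B"] alone misses the lowercase spelling, so the regex would be skipped for an input it actually matches.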
Disabling prematchers
To disable prematching for a certain pattern entirely (i.e., always run the regex without first running any prematchers), pass an empty list of prematchers:
multiregex.RegexMatcher([(r"super complicated regex", [])])