fuzzysearch is useful for finding approximate subsequence matches
fuzzysearch is a Python library for fuzzy substring searches. It implements efficient ad-hoc searching for approximate sub-sequences. Matching is done using a generalized Levenshtein Distance metric, with configurable parameters.
Free software: MIT license
Documentation: http://fuzzysearch.rtfd.org.
Installation
Just install using pip:
$ pip install fuzzysearch
Features
Fuzzy sub-sequence search: Find parts of a sequence which match a given sub-sequence.
Easy to use: A single function to call which returns a list of matches.
Set a maximum Levenshtein Distance for matches, including individual limits for the number of substitutions, insertions and/or deletions allowed for near-matches.
Includes optimized implementations for specific use-cases, e.g. allowing only substitutions.
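To make the idea of a distance-bounded sub-sequence search concrete, here is a minimal pure-Python sketch of the classic dynamic-programming approach. It is not the library's implementation, which is optimized and may work quite differently; min_substring_distance is a hypothetical name used only for illustration.

```python
def min_substring_distance(pattern, text):
    """Return the smallest Levenshtein distance between `pattern` and any
    substring of `text`.

    Hypothetical helper for illustration; not part of fuzzysearch.
    """
    # prev[i] holds the best distance between pattern[:i] and some
    # substring of the text ending at the current position.
    # Before any text is consumed, pattern[:i] costs i deletions.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        # A match may start anywhere, so the empty prefix always costs 0.
        curr = [0]
        for i, p in enumerate(pattern, start=1):
            curr.append(min(
                prev[i - 1] + (p != ch),  # exact match or substitution
                prev[i] + 1,              # extra element in the text (insertion)
                curr[i - 1] + 1,          # pattern element skipped (deletion)
            ))
        best = min(best, curr[-1])
        prev = curr
    return best


print(min_substring_distance('PATTERN', '---PATERN---'))  # prints: 1
```

A search with a maximum distance then amounts to reporting the positions where this value does not exceed the limit.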
Simple Examples
Just call find_near_matches() with the sub-sequence you’re looking for, the sequence to search, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1)]
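The start and end values in the results above can be used to slice the matched text out of the searched sequence. The sketch below assumes end is exclusive, which is consistent with the first example (start=3, end=9 for the six-character 'PATERN'). The Match namedtuple here is a stand-in so the snippet runs without the library installed, and matched_text is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the library's result type so this sketch runs on its own;
# the field names follow the output shown in the examples above.
Match = namedtuple('Match', ['start', 'end', 'dist'])


def matched_text(sequence, match):
    """Return the part of `sequence` covered by `match` (hypothetical helper)."""
    return sequence[match.start:match.end]


# The result shown above for searching 'PATTERN' in '---PATERN---'.
matches = [Match(start=3, end=9, dist=1)]
for m in matches:
    print(matched_text('---PATERN---', m), m.dist)  # prints: PATERN 1
```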
Advanced Search Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
maximum Levenshtein distance
maximum number of substitutions
maximum number of deletions (elements appearing in the pattern searched for, which are skipped in the matching sub-sequence)
maximum number of insertions (elements added in the matching sub-sequence which don’t appear in the pattern searched for)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all other criteria.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
# this will not match since max_deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1)] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2)]
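The rule that max_l_dist and/or all three other criteria must be supplied can be read as a requirement that the total number of edits be bounded. The following sketch shows one way to express that check; effective_max_distance is a hypothetical helper written for illustration and is not a function of the library.

```python
def effective_max_distance(max_l_dist=None, max_substitutions=None,
                           max_insertions=None, max_deletions=None):
    """Return the largest total number of edits the criteria allow.

    Hypothetical helper for illustration; raises ValueError when the
    criteria would leave the search unbounded.
    """
    per_type = (max_substitutions, max_insertions, max_deletions)
    bounds = []
    if max_l_dist is not None:
        bounds.append(max_l_dist)
    # The per-type limits only bound the total when all three are given.
    if all(limit is not None for limit in per_type):
        bounds.append(sum(per_type))
    if not bounds:
        raise ValueError('supply max_l_dist and/or all three per-type limits')
    return min(bounds)


print(effective_max_distance(max_l_dist=1, max_deletions=0))  # prints: 1
print(effective_max_distance(max_deletions=1, max_insertions=1,
                             max_substitutions=0))            # prints: 2
```

Under this reading, the last two examples above are bounded at two edits in total, even though max_l_dist is not given.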
History
0.6.1 (2018-12-08)
Fixed some C compiler warnings for the C and Cython modules
0.6.0 (2018-12-07)
Dropped support for Python versions 2.6, 3.2 and 3.3
Added support and testing for Python 3.7
Optimized the n-grams Levenshtein search for long sub-sequences
Further optimized the n-grams Levenshtein search
Cython versions of the optimized parts of the n-grams Levenshtein search
0.5.0 (2017-09-05)
Fixed search_exact_byteslike() to support supplying start and end indexes
Added support for lists, tuples and other Sequence types to search_exact()
Fixed a bug where find_near_matches() could return a wrong Match.end with max_l_dist=0
Added more tests and improved some existing ones.
0.4.0 (2017-07-06)
Added support and testing for Python 3.5 and 3.6
Many small improvements to README, setup.py and CI testing
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2014-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.