Fuzzy matching utilities for scholarly metadata

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

fuzzycat (wip)

Fuzzy matching publications for fatcat.

Example Run

Run any clustering algorithm.

$ time python -m fuzzycat cluster -t tsandcrawler < data/sample10m.json | \
    zstd -c9 > sample_cluster.json.zst
2020-11-18 00:19:48.194 DEBUG __main__ - run_cluster:
    {"key_fail": 0, "key_ok": 9999938, "key_empty": 62, "key_denylist": 0, "num_clusters": 9040789}

real    75m23.045s
user    95m14.455s
sys     3m39.121s

Run verification.

$ time zstdcat -T0 sample_cluster.json.zst | python -m fuzzycat verify > sample_verify.txt

real    7m56.713s
user    8m50.703s
sys     0m29.262s

Example results over 10M docs:

{
  "miss.appendix": 176,
  "miss.blacklisted": 12124,
  "miss.blacklisted_fragment": 9,
  "miss.book_chapter": 46733,
  "miss.component": 2173,
  "miss.contrib_intersection_empty": 73592,
  "miss.dataset_doi": 30806,
  "miss.num_diff": 1,
  "miss.release_type": 19767,
  "miss.short_title": 16737,
  "miss.subtitle": 11975,
  "miss.title_filename": 87,
  "miss.year": 123288,
  "ok.arxiv_version": 90726,
  "ok.dummy": 106196,
  "ok.preprint_published": 10495,
  "ok.slug_title_author_match": 47285,
  "ok.title_author_match": 65685,
  "ok.tokenized_authors": 7592,
  "skip.container_name_blacklist": 20,
  "skip.publisher_blacklist": 456,
  "skip.too_large": 7430,
  "skip.unique": 8808462,
  "total": 9481815
}

A full run

Single threaded, 42h.

$ time zstdcat -T0 release_export_expanded.json.zst | \
    TMPDIR=/bigger/tmp python -m fuzzycat cluster --tmpdir /bigger/tmp -t tsandcrawler | \
    zstd -c9 > cluster_tsandcrawler.json.zst
{
  "key_fail": 0,
  "key_ok": 154202433,
  "key_empty": 942,
  "key_denylist": 0,
  "num_clusters": 124321361
}

real    2559m7.880s
user    2605m41.347s
sys     118m38.141s

So, 29881072 (about 20%) docs in the potentially duplicated set.

Verification (about 15h):

$ time zstdcat -T0 cluster_tsandcrawler.json.zst | python -m fuzzycat verify | \
    zstd -c9 > cluster_tsandcrawler_verified_3c7378.tsv.zst

...

real    927m28.631s
user    939m32.761s
sys     36m47.602s

Use cases

take a release entity database dump as JSON lines and cluster releases (according to various algorithms)
take cluster information and run a verification step (misc algorithms)
create a dataset that contains grouping of releases under works
command line tools to generate cache keys, e.g. to match reference strings to release titles (this needs some transparent setup, e.g. filling of a cache before ops)

Usage

Release clusters start with release entities json lines.

$ cat data/sample.json | python -m fuzzycat cluster -t title > out.json

Clustering 1M records (single core) takes about 64s (15K docs/s).

$ head -1 out.json
{
  "k": "裏表紙",
  "v": [
    ...
  ]
}

Using GNU parallel to make it faster.

$ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title

Interestingly, the parallel variants detects fewer clusters (because data is split and clusters are searched within each batch). TODO(miku): sort out sharding bug.

QA

10M release dataset

Notes on cadd28a version clustering (nysiis) and verification.

10M docs
9040789 groups
665447 verification pairs

3578378 OK.TITLE_AUTHOR_MATCH
2989618 Miss.CONTRIB_INTERSECTION_EMPTY
2731528 OK.SLUG_TITLE_AUTHOR_MATCH
2654787 Miss.YEAR
2434532 OK.WORK_ID
2050468 OK.DUMMY
1619330 Miss.SHARED_DOI_PREFIX
1145571 Miss.BOOK_CHAPTER
1023925 Miss.DATASET_DOI
 934075 OK.DATACITE_RELATED_ID
 868951 OK.DATACITE_VERSION
 704154 OK.FIGSHARE_VERSION
 682784 Miss.RELEASE_TYPE
 607117 OK.TOKENIZED_AUTHORS
 298928 OK.PREPRINT_PUBLISHED
 270658 Miss.SUBTITLE
 227537 Miss.SHORT_TITLE
 196402 Miss.COMPONENT
 163158 Miss.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
 122614 Miss.CUSTOM_PREFIX_10_7916
  79687 OK.CUSTOM_IEEE_ARXIV
  69648 OK.PMID_DOI_PAIR
  46649 Miss.CUSTOM_PREFIX_10_14288
  38598 OK.CUSTOM_BSI_UNDATED
  15465 OK.DOI
  13393 Miss.CUSTOM_IOP_MA_PATTERN
  10378 Miss.CONTAINER
   3045 Miss.BLACKLISTED
   2504 Miss.BLACKLISTED_FRAGMENT
   1574 Miss.TITLE_FILENAME
   1273 Miss.APPENDIX
    104 Miss.NUM_DIFF
      4 OK.ARXIV_VERSION

Cases

common title, "Books by Our Readers", https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq, https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq
common title, "The Future of Imprisonment"
common title, "In This Issue/Research Watch/News-in-Brief/News from the IASLC Tobacco Control Committee"
common title, "IEEE Transactions on Wireless Communications", same publisher, different year
common title, "ASMS News" (also different year)
common title, "AMERICAN INSTITUTE OF INSTRUCTION"
common title, "Contents lists"
common title, "Submissions"
same, except DOI, but maybe the same item, after all? https://fatcat.wiki/release/kxgsbh66v5bwhobcaiuh4i7dwy, https://fatcat.wiki/release/thl7o44z3jgk3njdypixwrdbve

Authors may be messy:

IR and published, be we currently yield Miss.CONTRIB_INTERSECTION_EMPTY - https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm, https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy - may need to tokenize authors

A DOI prefix (10.1210, The Endocrine Society) may choose to include the same document in different publications:

Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g.:

Possible fixes

when title and authors match, check the year, and maybe the doi prefix; doi with the same prefix may not be duplicates
detect arxiv versions directly
if multiple authors, may require more than one overlap, e.g. "by Yuting Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial College London" - will overlap with any other author including "Imperial College London" -- we label OK.SLUG_TITLE_AUTHOR_MATCH, https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a, https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
"article-journal" and "article" release_type should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
if title and publisher matches, but DOI and year is different, assume different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty, https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
zenodo has no explicit versions, but ids might be closeby, e.g. https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga, https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga

100 examples

accuracy at around 0.8
while the results look ok, the reasons are not always the ones that stand out the most (while checking manually)

78 [x]
11 [o]
11 [ ]

Ok cases are now in verify.csv.

https://fatcat.wiki/release/i2ziaqjrovh3rfrojcaf2xqidy https://fatcat.wiki/release/4rbsv4kplnf4tny22px5z35vty Status.DIFFERENT Miss.CONTRIB_INTERSECTION_EMPTY
[o] https://fatcat.wiki/release/65qk35lrxfbqxnpjfpra3ankxe https://fatcat.wiki/release/tovzgangzbfm5bc2qriyh2k6da Status.AMBIGUOUS OK.DUMMY
https://fatcat.wiki/release/qvlzvflp6vhojdm3uyvj2d6keq https://fatcat.wiki/release/vynqlyi2xjdexmf54a5yfidx6m Status.DIFFERENT Miss.RELEASE_TYPE
[o] https://fatcat.wiki/release/hfewgpty4ne3zn7rg32z5npdxy https://fatcat.wiki/release/3djtma4xrjh2pcxy4gu6pafqji Status.AMBIGUOUS OK.DUMMY
https://fatcat.wiki/release/ybxygpeypbaq5pfrztu3z2itw4 https://fatcat.wiki/release/2c2ztrtlkzdhfmzpf7fbindpjq Status.DIFFERENT Miss.DATASET_DOI
[o] https://fatcat.wiki/release/eyol2bjf6jawhjnote73ej5v24 https://fatcat.wiki/release/jowohxiuuncqbdidvqjrrb5324 Status.AMBIGUOUS OK.DUMMY
https://fatcat.wiki/release/d5bqydkylzelpmdfcks2v5th7q https://fatcat.wiki/release/lzcgl52npjaf3etfhhnb3d46da Status.DIFFERENT Miss.DATASET_DOI
[o] https://fatcat.wiki/release/5ysvoxjj4jcxbji42nnzapr6n4 https://fatcat.wiki/release/dx6wevs345cjfejokze2te6sia Status.AMBIGUOUS OK.DUMMY
[o] https://fatcat.wiki/release/xdclbyjgjnbehchrl7l2vi3274 https://fatcat.wiki/release/t3kqh6lfprfaff5zovh6qlodxy Status.AMBIGUOUS OK.DUMMY
[o] https://fatcat.wiki/release/aogvyiw67vdsnf26bufauy2rqa https://fatcat.wiki/release/aofedljjhbhajmx5doxfcv43fa Status.AMBIGUOUS OK.DUMMY
[o] https://fatcat.wiki/release/cjal2f6k5zesxcnrnyhc6ftg5e https://fatcat.wiki/release/oi5kzjlku5gpxjc247v6zjzosa Status.AMBIGUOUS OK.DUMMY
[o] https://fatcat.wiki/release/o6e6yf37y5bttbrpo4piska4gq https://fatcat.wiki/release/pchjd5fwqjdqfevphjff7ydeae Status.AMBIGUOUS OK.DUMMY
https://fatcat.wiki/release/l4fyyvsckneuxkq7d3y2zvkvbe https://fatcat.wiki/release/gf5hriyvuvarhcvttnooaffksi Status.DIFFERENT Miss.RELEASE_TYPE
https://fatcat.wiki/release/7nbcgsohrrak5cuyk6dnit6ega https://fatcat.wiki/release/q66xv7drk5fnph7enwwlkyuwqm Status.DIFFERENT Miss.CONTRIB_INTERSECTION_EMPTY
https://fatcat.wiki/release/2tzvdvx4t5hfxnqlnyt4rqenly https://fatcat.wiki/release/houszjo2ejbjhljxvxz23whgua Status.DIFFERENT Miss.DATASET_DOI
https://fatcat.wiki/release/qsxbwvreu5ehrbz65ngh2ghcra https://fatcat.wiki/release/xjvo37ynxvc3zm55bxoa545gvq Status.EXACT OK.TITLE_AUTHOR_MATCH
https://fatcat.wiki/release/ggzzwt6deneyrna5h65mvv7sfe https://fatcat.wiki/release/h4rnaxua75dndmq4x4snnw3qxe Status.AMBIGUOUS Miss.SHORT_TITLE
https://fatcat.wiki/release/skxiyp7qmraqhe2o4zvo7iq6sq https://fatcat.wiki/release/qyqre3mzgbha7hhfarn5absqnq Status.EXACT OK.TITLE_AUTHOR_MATCH
[o] https://fatcat.wiki/release/am53f7iyyvcjnjsgjbz7pu7dii https://fatcat.wiki/release/kdubht33hfb4dmghm2g27ck24i Status.AMBIGUOUS OK.DUMMY
https://fatcat.wiki/release/ofmeeajuovbqbhkgh4rujkd3xu https://fatcat.wiki/release/r6bvy6cglfe5xgafvdcokawkue Status.DIFFERENT Miss.RELEASE_TYPE
[o] https://fatcat.wiki/release/lezvxt2oong6xm3e3cgp47wsla https://fatcat.wiki/release/aad6r5am6vfxpbfwycmyudp2qe Status.AMBIGUOUS OK.DUMMY
[o] https://fatcat.wiki/release/5mzzswgebze2tk4apmbwjahp34 https://fatcat.wiki/release/vl7r3uewvvbo5i2gntocy3y2ey Status.AMBIGUOUS OK.DUMMY

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.1.23

Dec 6, 2021

0.1.22

Sep 13, 2021

0.1.21

Jun 1, 2021

0.1.20

Apr 15, 2021

0.1.19

Apr 12, 2021

0.1.18

Mar 16, 2021

0.1.17

Feb 19, 2021

0.1.16

Feb 18, 2021

0.1.15

Feb 18, 2021

0.1.14

Feb 18, 2021

0.1.13

Feb 2, 2021

0.1.12

Jan 12, 2021

0.1.11

Jan 12, 2021

0.1.10

Dec 23, 2020

0.1.9

Dec 17, 2020

This version

0.1.8

Dec 16, 2020

0.1.7

Dec 16, 2020

0.1.5

Dec 16, 2020

0.1.4

Nov 2, 2020

0.1.3

Oct 21, 2020

0.1.2

Oct 21, 2020

0.1.1

Aug 12, 2020

0.1.0

Aug 12, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fuzzycat-0.1.8.tar.gz (72.6 kB view hashes)

Uploaded Dec 16, 2020 Source

Built Distribution

fuzzycat-0.1.8-py3-none-any.whl (70.6 kB view hashes)

Uploaded Dec 16, 2020 Python 3

Hashes for fuzzycat-0.1.8.tar.gz

Hashes for fuzzycat-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`4bc290f5cd1dc78bc3241a386d5d1ecf96366ee8330618c78a7d026b9b9dc390`
MD5	`4d7a639b2088384c3ed96d93de48dce2`
BLAKE2b-256	`48613ce1587573507eea8b4f56c2cad8f2d958689dcc03e8670d5a1cd4c6c849`

Hashes for fuzzycat-0.1.8-py3-none-any.whl

Hashes for fuzzycat-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`10db412121d86532eba2136f7e79d9fa90ddada54cdf13f2db757175ac1a02c0`
MD5	`b909d3b8b9305d740a64aee54b6bb5ad`
BLAKE2b-256	`95d4ceeb33ad6b9321912cc4c3cca38542d61e0a89496042a69143fd81fe8f7a`