commasearch

Search for data tables.

Project description

When we search for ordinary written documents, we send words into a search engine and get pages of words back.

What if we could search for spreadsheets by sending spreadsheets into a search engine and getting spreadsheets back? The order of the results would be determined by various specialized statistics; just as we use PageRank to find relevant hypertext documents, we can develop other statistics that help us find relevant spreadsheets.

Indexing

Comma Search indexes only spreadsheets that are stored locally. To index a new spreadsheet, run:

, --index [csv file]

Whatever path you give for the CSV file, Comma Search expands it to an absolute path and uses that absolute path as the key under which the cached indexing results are stored. These caches all live in the ~/., directory.
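The path expansion described above can be sketched with the standard library. This is only an illustration: the helper name cache_key is hypothetical and is not taken from the Comma Search source.

```python
import os

def cache_key(path):
    """Expand ~ and return an absolute, normalized path (hypothetical helper)."""
    return os.path.abspath(os.path.expanduser(path))

# Two spellings of the same file produce the same key.
print(cache_key('data/../scores.csv') == cache_key('scores.csv'))
```

Keying the cache on the absolute path means the same file is recognized no matter which directory the command is run from.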

By default, CSV files that have already been indexed are skipped; to index the same CSV file again, run with the --force or -f option:

, --index --force [csv file]
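The skip-unless-forced behavior can be pictured as follows. This is a sketch under assumptions: a plain dict stands in for the on-disk cache, and index_once is a hypothetical name, not a function from Comma Search.

```python
def index_once(cache, key, compute, force=False):
    """Store compute() under key unless key is already cached.

    Returns True if indexing ran and False if it was skipped.
    """
    if key in cache and not force:
        return False
    cache[key] = compute()
    return True

cache = {}  # stand-in for the cache under ~/.,
first = index_once(cache, '/tmp/a.csv', lambda: {'id'})
second = index_once(cache, '/tmp/a.csv', lambda: {'id'})
forced = index_once(cache, '/tmp/a.csv', lambda: {'id'}, force=True)
print(first, second, forced)
```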

Once you have indexed a bunch of CSV files, you can search.

, [csv file]

You’ll see a bunch of file paths as results:

$ , 'Math Scores 2009.csv'
/home/tlevine/math-scores-2010-gender.csv
/home/tlevine/Math Scores 2009.csv
/home/tlevine/Math Scores 2009 Copy (1).csv
/home/tlevine/math-scores-2009-ethnicity.csv

Implementation details

When we index a table, we first figure out the unique indices

import special_snowflake
indices = special_snowflake.fromcsv(open(filepath))

and save them.

import os
from pickle_warehouse import Warehouse
Warehouse(os.path.expanduser('~/.,/indices'))[filepath] = indices

Then we look at all of the values of all of the unique indices and save them.

Warehouse(os.path.expanduser('~/.,/values/%d' % hash(index)))[filepath] = set_of_indexed_values
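One caveat when reproducing this layout: in Python 3, the built-in hash() of a string is salted per process unless PYTHONHASHSEED is set, so a directory name derived from it may not be the same from one run to the next. The sketch below shows one way to get a stable integer from a set of column names. It assumes an index is a collection of column-name strings, and stable_hash is a hypothetical helper, not what Comma Search does.

```python
import hashlib

def stable_hash(columns):
    """Return an integer digest of column names that is the same on every run."""
    # Sort so the order of the columns does not affect the result.
    joined = '\x1f'.join(sorted(columns)).encode('utf-8')
    return int.from_bytes(hashlib.sha256(joined).digest()[:8], 'big')

print(stable_hash({'school', 'year'}) == stable_hash(['year', 'school']))
```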

When we search for a table, it actually gets indexed first. Once it has been indexed, we know the unique keys of the table. We look up the indices,

indices = Warehouse(os.path.expanduser('~/.,/indices'))[path]

then we look up all of the tables that contain this index,

tables = Warehouse(os.path.expanduser('~/.,/values/%d' % hash(index)))

and the values of this tables object are sets of hashes of the different values. I can then count how many items are in the intersection between the set for the query table and the set for each other table.
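The intersection count above can be turned into a ranking along these lines. It is a sketch with an ordinary dict standing in for the Warehouse, and rank_by_overlap is a hypothetical name.

```python
def rank_by_overlap(query_values, tables):
    """Rank table paths by how many values they share with the query.

    tables maps a file path to the set of values stored for one index.
    """
    scores = {path: len(query_values & values) for path, values in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)

tables = {
    '/data/a.csv': {1, 2, 3, 4},
    '/data/b.csv': {3, 4},
    '/data/c.csv': {9},
}
ranking = rank_by_overlap({2, 3, 4}, tables)
print(ranking)  # ['/data/a.csv', '/data/b.csv', '/data/c.csv']
```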

If I want to go crazy, I might do this for combinations of columns that aren’t unique indices, and I’d use collections.Counter objects to represent the distributions of the values.
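For non-unique column combinations, the Counter idea might look like the sketch below. It compares the distributions of value combinations in two tables and counts the shared mass, multiplicity included. The function name and the rows-as-dicts representation are assumptions made for illustration.

```python
from collections import Counter

def distribution_overlap(query_rows, other_rows, columns):
    """Count value combinations shared by two tables, with multiplicity."""
    def project(rows):
        return Counter(tuple(row[c] for c in columns) for row in rows)
    # Counter & Counter keeps the minimum count of each shared key.
    shared = project(query_rows) & project(other_rows)
    return sum(shared.values())

a = [
    {'school': 'X', 'year': 2009},
    {'school': 'X', 'year': 2009},
    {'school': 'Y', 'year': 2009},
]
b = [
    {'school': 'X', 'year': 2009},
    {'school': 'Z', 'year': 2010},
]
overlap = distribution_overlap(a, b, ['school', 'year'])
print(overlap)  # 1
```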

Release history

0.0.3 (Jun 5, 2014)

0.0.2 (May 18, 2014), this version

0.0.1 (May 5, 2014)

Download files

Source Distribution

commasearch-0.0.2.tar.gz (3.3 kB)

Uploaded May 18, 2014 Source

Hashes for commasearch-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`2112987dbf014358cb96356a28c1807a507ceaf5acc614b0cf221f865b256529`
MD5	`81a9715d6473189b125bd0bee6781167`
BLAKE2b-256	`b19acf60dbb12a676d3a26f23f92e66199b6857f79fb7ebd424c467685a965ca`