
Search for data tables.

Project description

When we search for ordinary written documents, we send words into a search engine and get pages of words back.

What if we could search for spreadsheets by sending spreadsheets into a search engine and getting spreadsheets back? The order of the results would be determined by various specialized statistics; just as we use PageRank to find relevant hypertext documents, we can develop other statistics that help us find relevant spreadsheets.

Indexing

Comma Search indexes only spreadsheets that are stored locally. To index a new spreadsheet, run

, --index [csv file]

Whatever path you give for the CSV file, Comma Search expands it to an absolute path and uses that absolute path as the key under which the cached results of the indexing are stored. These caches all live in the ~/., directory.
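That normalization might look something like the following sketch; cache_key is a hypothetical name, not part of Comma Search.

import os

def cache_key(csv_path):
    # Expand '~' and relative segments so that every spelling of the
    # same file's path maps to one cache entry.
    return os.path.abspath(os.path.expanduser(csv_path))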

By default, CSV files that have already been indexed will be skipped; to index the same CSV file again, run with the --force or -f option.

, --index --force [csv file]
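Internally, the skip could be as simple as checking for an existing cache entry. This is just a sketch that assumes the cache behaves like a dict; already_indexed is a hypothetical name.

import os
from pickle_warehouse import Warehouse

def already_indexed(filepath):
    # A file is skipped when its absolute path already has a cached index.
    return filepath in Warehouse(os.path.expanduser('~/.,/indices'))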

Once you have indexed a bunch of CSV files, you can search.

, [csv file]

You’ll see a bunch of file paths as results:

$ , 'Math Scores 2009.csv'
/home/tlevine/math-scores-2010-gender.csv
/home/tlevine/Math Scores 2009.csv
/home/tlevine/Math Scores 2009 Copy (1).csv
/home/tlevine/math-scores-2009-ethnicity.csv

Implementation details

When we index a table, we first figure out the unique indices

import special_snowflake

# Find the candidate unique keys (combinations of columns) in the CSV.
indices = special_snowflake.fromcsv(open(filepath))

and save them.

import os
from pickle_warehouse import Warehouse

# Cache the unique indices, keyed on the file's absolute path.
Warehouse(os.path.expanduser('~/.,/indices'))[filepath] = indices

Then we look at all of the values of all of the unique indices and save them.

Warehouse(os.path.expanduser('~/.,/values/%d' % hash(index)))[filepath] = set_of_indexed_values
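The snippet above assumes a set_of_indexed_values for each index; since these sets are later described as hashes of the values, it might plausibly be built like this (indexed_values is a hypothetical helper):

import csv

def indexed_values(filepath, index):
    # Hash the tuple of values that each row takes in the index's columns.
    with open(filepath) as fp:
        return {hash(tuple(row[col] for col in index))
                for row in csv.DictReader(fp)}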

When we search for a table, it actually gets indexed first. Once it has been indexed, we know the unique keys of the table. We look up the indices,

indices = Warehouse(os.path.expanduser('~/.,/indices'))[path]

then we look up all of the tables that contain this index,

tables = Warehouse(os.path.expanduser('~/.,/values/%d' % hash(index)))

The values of this tables object are sets of hashes of the indexed values. I can then count how many items are in the intersection between the set for the query table and the set for each other table.
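As a sketch, that count might look like the following, assuming query_values is the hash set for the query table and that tables behaves like a dict; rank_matches is a hypothetical name.

def rank_matches(query_values, tables):
    # Score each table by how many hashed values it shares with the
    # query, then order paths from most to least overlap.
    overlaps = {path: len(query_values & values)
                for path, values in tables.items()}
    return sorted(overlaps, key=overlaps.get, reverse=True)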

If I want to go crazy, I might do this for combinations of columns that aren’t unique indices, and I’d use collections.Counter objects to represent the distributions of the values.
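That fancier comparison might use Counter's multiset intersection, sketched here with a hypothetical column_overlap helper:

from collections import Counter

def column_overlap(column_a, column_b):
    # '&' on Counters takes the elementwise minimum count, so this
    # totals how much the two value distributions overlap.
    return sum((Counter(column_a) & Counter(column_b)).values())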


Download files


Source Distribution

commasearch-0.0.2.tar.gz (3.3 kB)

