Get company registration data for the State of Delaware.

Some people think that data about company registrations should be freely distributed and free of charge. The State of Delaware apparently doesn’t think so; Delaware blocks computers from accessing the General Information Name Search site if they make more than a few hundred requests in a short time. (I’m not sure of the exact number.)

In order to download all of the data, we are thus using a swarm of computers from different IP addresses, each making very few requests to the site. You can help!

You just need to install a program and let it keep running. It will periodically contact a central server for directions, and it will query Delaware’s General Information Name Search accordingly. It is very careful to avoid being blocked, but if it detects that you are on a new IP address, it will take advantage of that.

Installing

The installation process involves running things in a terminal. Remind me to put some directions here about how to do that in Mac and Windows.

If you already have Python and Pip installed, you can just do this.

sudo pip install delaware

If you have Python but not Pip, you can download a standalone package (I have to make this.) and run the setup like so.

tar xzf delaware.tar.gz
cd delaware
sudo python setup.py install

If you don’t have Python installed, follow these directions.

If you are on any operating system other than Windows, you probably already have Python installed.

Add a note about Enthought, Continuum, &c.

Running

Once you’ve installed the program, type this into a terminal.

deleworker

It’ll ask you a few questions the first time you run it, but you can totally ignore it after that.

If errors come up

If the program stops running, please send the error message to _@thomaslevine.com. Also, please save the ~/.delaware directory, as it contains files that can be helpful for figuring out what went wrong.

How it works

I went with worker-manager architecture, but maybe I should have gone with something less classist? Peer-to-peer connections are annoying because of port blocking of various sorts, but that would be nice because then I don’t need to be responsible. Well anyway, here’s how it works.

Asking for directions

The worker contacts the manager asking for a job. It provides the following information.

Username

Chosen by the user

Password-like thing

Hash of a salted installation ID, which is created when the program is first run

IP address (implicitly)

The manager is able to determine the IP address from which the request came.

The username is there so that the person can be recognized for her efforts.

The password-like thing is there to trace provenance of the data. This is mainly here in case someone fakes the data, so that I can figure out which data not to trust. It could also be helpful for debugging issues specific to certain systems.
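Here is a minimal sketch of how such a password-like thing could be produced. The function names, the use of SHA-256, and the way the salt is generated are all assumptions for illustration; the actual program may do this differently.

```python
import hashlib
import secrets
import uuid


def new_installation_id():
    """Create a random installation ID (done once, when the program is first run)."""
    return uuid.uuid4().hex


def password_like_thing(installation_id, salt):
    """Hash the salted installation ID; only the hash would be sent to the manager."""
    salted = (salt + installation_id).encode("utf-8")
    return hashlib.sha256(salted).hexdigest()


# Hypothetical first-run setup: both values would be saved locally and reused.
installation_id = new_installation_id()
salt = secrets.token_hex(16)
token = password_like_thing(installation_id, salt)
```

The same installation ID and salt always give the same hash, so the manager can group responses by installation without ever seeing the ID itself.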

The IP address is used for determining whether the rate limit is close to being reached. The manager directs workers not to query the Delaware site if they are approaching the rate limit. The IP address is wholly separate from username and installation ID, as the same IP address can be accessed by multiple devices associated with the same user and by devices associated with multiple users.

Receiving work orders

In response to the above directions request, the worker will receive either a status code of 429 (too many requests) or a status code of 200. The manager decides which one based on how many requests have come from this IP address recently.
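The manager's decision could look roughly like the sketch below, which keeps recent request times per IP address in memory. The window length and threshold are made-up numbers (the real rate limit is unknown), and the function names are hypothetical.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # assumed length of "recently"
MAX_REQUESTS = 100     # assumed threshold, kept well under "a few hundred"

# Maps an IP address to the times of its recent requests to Delaware.
recent_requests = defaultdict(deque)


def record_request(ip_address, now=None):
    """Remember that this IP address just made a request."""
    now = time.time() if now is None else now
    recent_requests[ip_address].append(now)


def directions_status(ip_address, now=None):
    """Return 429 if this IP address is near the limit, otherwise 200."""
    now = time.time() if now is None else now
    timestamps = recent_requests[ip_address]
    # Forget requests that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return 429 if len(timestamps) >= MAX_REQUESTS else 200
```

Passing `now` explicitly is only there to make the function easy to try out; a real manager would more likely keep these counts in a database.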

If the manager provides a status code of 200, it also provides the following information.

File number

The company to look up

An IP address

This will be passed back to the manager for rate limiting purposes.

The IP address is the worker’s own IP address, but the worker needs to contact the manager to find out what it is.

The file number is chosen randomly (with uniform weights) from the file numbers with the fewest responses so far.

For example, all file numbers (0 to 8 million) are possible when we start because there have been zero responses so far. Soon, some file numbers will be selected, so there will be some file numbers with zero responses and some with one response. Once all file numbers have been chosen at least once, the manager will begin repeating file numbers. By repeating file numbers, we check for consistency between different responses (in case someone is trying to fake data), and we continue to update the data (in case companies change).

I chose this approach so that we can be intelligent about which file numbers we query without assigning jobs to particular workers.
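As a rough illustration of that selection rule, here is a sketch that picks uniformly among the file numbers with the fewest responses. The function name and the in-memory counter are assumptions; with millions of file numbers the manager would presumably do this in a database rather than by scanning a list.

```python
import random
from collections import Counter


def choose_file_number(response_counts, file_numbers, rng=random):
    """Pick uniformly among the file numbers with the fewest responses so far."""
    fewest = min(response_counts[n] for n in file_numbers)
    candidates = [n for n in file_numbers if response_counts[n] == fewest]
    return rng.choice(candidates)


# Small example: file numbers 0 and 1 already have one response, 2 has none.
response_counts = Counter({0: 1, 1: 1})
chosen = choose_file_number(response_counts, range(3))
```

In this small example the only file number with zero responses is 2, so it is the one chosen. Once every count is equal, all file numbers become candidates again and repeats begin.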

Querying the website

Once the bot has been directed to look up a particular file number, it queries the Delaware corporations site accordingly. It goes to the starting page for the General Information Name Search (called home in the code). It enters the file number and receives a list of up to one company. (This page is called a search in the code.) It then goes to this maybe-company page (called result in the code).

At every step, the bot

  1. minimally parses the web page so that it may advance to the next step,

  2. sends information about the HTTP response to the manager, and

  3. pauses randomly for a time on the order of a second to avoid looking so obviously like a bot.

When it sends the response information to the manager, it includes the following.

“Before” IP address

The previous IP address that the manager told the worker

Current IP address (implicitly)

The IP address that the manager currently detects from the worker

Simplified HTTP response from Delaware

This is the main information that we are looking for.

Whether the request appeared successful

Based on a rough parse, the worker says whether the request was successful. The manager uses this for selecting file numbers for job assignments (in the first step of the process).
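The querying and reporting steps could be sketched like this. Only the step names (home, search, result) come from the description above; `fetch`, `report`, `looks_successful`, the payload keys, and the pause range are hypothetical stand-ins, and the minimal parsing needed to get from one page to the next is assumed to live inside `fetch`.

```python
import random
import time

STEPS = ("home", "search", "result")


def pause(rng=random, sleep=time.sleep):
    """Wait a random time on the order of a second (the exact range is an assumption)."""
    sleep(rng.uniform(0.5, 1.5))


def run_job(file_number, before_ip, fetch, report, looks_successful, pause=pause):
    """Walk through the three pages, reporting to the manager after each one."""
    for step in STEPS:
        response = fetch(step, file_number)  # query Delaware for this step
        report({
            "before_ip": before_ip,          # the IP address the manager told us
            "step": step,
            "response": response,            # simplified HTTP response
            "success": looks_successful(step, response),
        })
        pause()
```

The current IP address is not in the payload because the manager detects it from the connection itself.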

Saving information on the manager

XXX FIX THIS SECTION XXX

When the manager receives a response, it first needs to determine an additional piece of information. The worker has provided the “before” IP address; the manager now determines the “after” IP address.

Having determined this, it writes the following stuff to a simple log file.

  • username

  • installation id

  • before ip address

  • after ip address

  • serialized request

It also saves the IP address(es) in an IP address table. We maintain this table so we can avoid exceeding thresholds for IP blocking. If the before and after IP addresses are different, we conservatively count the request as having come from both addresses.
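The conservative counting rule could look something like the sketch below. The in-memory counter and the function name are assumptions for illustration; the real table presumably lives in a database.

```python
from collections import Counter

# Maps an IP address to the number of requests attributed to it.
ip_request_counts = Counter()


def count_request(before_ip, after_ip):
    """Attribute one request to every distinct address involved."""
    # Using a set means an unchanged address is counted only once.
    for ip_address in {before_ip, after_ip}:
        ip_request_counts[ip_address] += 1


count_request("192.0.2.1", "192.0.2.1")     # address did not change
count_request("192.0.2.1", "198.51.100.7")  # address changed mid-job
```

After these two calls, 192.0.2.1 has been counted twice and 198.51.100.7 once.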

Finally, it parses the file number from the response and updates the sampling weights for the file number selection.

A separate process comes along later, reads the log files, and reads more information from the response. The involved parsing is moved to a separate task for two main reasons. First, this reduces the load of the manager. Second, we can reuse the separate task for loading backups; we don’t need to write a separate thing for that.

Waiting

The worker waits a random time on the order of seconds before repeating the above process. This way, the bots may look a bit less like bots and thus be harder to block.

Questions you might have

Why not just in-browser JavaScript?

We can’t make cross-domain requests, so we’d have to inject something into the Delaware page, and that’s annoying, especially for this site.

Doesn’t OpenCorporates already have it?

OpenCorporates doesn’t have it.

Have people done similar things in terms of this distributed API?

Probably

Why Python rather than something that people with Windows can run?

Because it’s easier

Has anyone tried talking to Delaware?

Dunno

How many companies?

Dunno, but fewer than 600,000

Other references

To do

In order to avoid faking of data, ensure that a worker only completes work that it has been ordered to do. This could happen through some form of encryption or just by looking for strange patterns in the server logs.
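One possible approach, which is not implemented, would be for the manager to sign each work order with a secret key and to accept only responses that carry a valid signature. The sketch below uses an HMAC rather than encryption; the function names, the nonce, and the message layout are all hypothetical.

```python
import hashlib
import hmac


def sign_work_order(secret_key, file_number, nonce):
    """Manager side: sign the work order so that it can be recognized later."""
    message = "{}|{}".format(file_number, nonce).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def is_valid_work_order(secret_key, file_number, nonce, signature):
    """Manager side: check that a returned work order was really issued by us."""
    expected = sign_work_order(secret_key, file_number, nonce)
    return hmac.compare_digest(expected, signature)


secret_key = b"known only to the manager"  # placeholder value
signature = sign_work_order(secret_key, 1234, "nonce-1")
```

This would only show that the manager issued the order; it says nothing about whether the response attached to it is genuine, so repeating file numbers and comparing responses would still be needed.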

The rate-limit query on the database isn’t working. Fix it.

Figure out what the actual rate limit is.


Source Distribution

delaware-0.0.1.tar.gz (10.7 kB)
