Discover the geography of open-source software. Explore the geographic locations of software developers associated with a GitHub repository or a Python (PyPI) package.

GitGeo

See, for instance, the geography of the contributors to the Python package requests.

[Image: map of the geographic locations of contributors to requests]

Why use GitGeo?

  • Curiosity

  • Open source software community management

  • Research on open source software ecosystems

  • IT security compliance

Installation

pip install gitgeo

Usage

(requires internet connection)

  • First, create one or more GitHub personal access tokens.

  • Second, run these commands in the command line to set environment variables:

    export GITHUB_USERNAME='[github_username]'
    export GITHUB_TOKEN='[github_token]'
  • Alternatively, to use multiple tokens, create a file called tokens.txt in the code’s directory and enter a GitHub personal access token on each line.

  • Third, run these commands in the command line:

gitgeo --package [package_name]

gitgeo --repo [github_repo_url]

For example:

>>> gitgeo --package requests

-----------------
PACKAGE: requests
-----------------
CONTRIBUTOR, LOCATION
* indicates PyPI maintainer
---------------------
kennethreitz42 | Virginia, USA
Lukasa * | London, England
sigmavirus24 | Madison, WI
nateprewitt * | None
slingamn | None
BraulioVM | Malaga & Granada, Spain
dpursehouse | Kawasaki
jgorset | Oslo, Norway
...

Or:

>>> gitgeo --repo www.github.com/psf/requests

-----------------
GITHUB REPO: psf/requests
-----------------
CONTRIBUTOR, LOCATION
---------------------
kennethreitz42 | Virginia, USA | United States
Lukasa | London, England | United Kingdom
sigmavirus24 | Madison, WI | United States
nateprewitt | None | None
...
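The credential setup described at the start of this section (environment variables, or a tokens.txt file with one token per line) can be sanity-checked from Python before running the commands above. The following is a minimal sketch, not GitGeo's own code: the helper name load_github_tokens is hypothetical, and the choice to prefer tokens.txt over GITHUB_TOKEN is an assumption made only for illustration.

```python
import os
from pathlib import Path


def load_github_tokens(env=None, tokens_file="tokens.txt"):
    """Return the GitHub tokens found in tokens.txt or in the environment.

    Hypothetical helper for illustration; not part of GitGeo.
    """
    env = os.environ if env is None else env
    path = Path(tokens_file)
    if path.is_file():
        # tokens.txt holds one personal access token per line; skip blanks.
        tokens = [line.strip() for line in path.read_text().splitlines()]
        tokens = [token for token in tokens if token]
        if tokens:
            # Assumption: a non-empty tokens.txt takes precedence.
            return tokens
    # Fall back to the single token exported as GITHUB_TOKEN.
    token = env.get("GITHUB_TOKEN", "").strip()
    return [token] if token else []
```

An empty list from load_github_tokens() suggests that neither the environment variable nor the tokens file has been set up yet.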

There are other command line options too:

Add --summary to summarize the results by country. For example:

>>> gitgeo --package requests --summary

-----------------
PACKAGE: requests
GITHUB REPO: psf/requests
-----------------
COUNTRY | # OF CONTRIBUTORS
---------------------------
United States 37
None 23
United Kingdom 4
Canada 4
Germany 4
Switzerland 4
Spain 2
Russia 2
...
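To make the relationship between the per-contributor listing and the country summary concrete, here is a rough sketch that parses lines in the "contributor | location | country" layout shown in the --repo example and tallies them by country. It is not how GitGeo computes its summary; the function names are hypothetical, and the sample lines are copied from the example output above.

```python
from collections import Counter


def parse_contributor_line(line):
    """Split 'name | location | country' into a dict; None for other lines."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        # Headers, separators, and blank lines do not have three fields.
        return None
    name, location, country = parts
    return {"contributor": name, "location": location, "country": country}


def count_by_country(lines):
    """Count parsed contributor records per country string."""
    records = (parse_contributor_line(line) for line in lines)
    return Counter(r["country"] for r in records if r is not None)


sample = """\
CONTRIBUTOR, LOCATION
---------------------
kennethreitz42 | Virginia, USA | United States
Lukasa | London, England | United Kingdom
sigmavirus24 | Madison, WI | United States
nateprewitt | None | None
""".splitlines()

for country, total in count_by_country(sample).most_common():
    print(country, total)
# United States 2
# United Kingdom 1
# None 1
```

Contributors without a usable location end up under the string "None", which matches the "None" row in the summary output above.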

Add --map when using the --repo option to create an html map saved in the results folder. See the image above for a static example. The real map supports zooming and tooltips.

Add --output_csv to write a csv of the results to the results folder.

To create a csv of contributors from many repositories, enter repositories on separate lines in the repos.txt file. Then use the --multirepo flag.

Add --multirepo_map followed by a filename to create a map from csv output. The csv output must be located in the results folder.
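The repos.txt file used by the --multirepo flag can also be generated from a script when the list of repositories is long. This is a small sketch with a hypothetical helper, write_repos_file; it assumes each entry uses the same form as the --repo argument shown earlier, which the text above does not state explicitly.

```python
from pathlib import Path


def write_repos_file(repos, path="repos.txt"):
    """Write one repository per line, dropping blanks and duplicates.

    Hypothetical helper for illustration; not part of GitGeo.
    """
    seen = set()
    cleaned = []
    for repo in repos:
        repo = repo.strip()
        if repo and repo not in seen:
            seen.add(repo)
            cleaned.append(repo)
    # One entry per line, in the order first seen.
    Path(path).write_text("".join(repo + "\n" for repo in cleaned))
    return cleaned
```

For instance, write_repos_file(["www.github.com/psf/requests"]) would produce a one-line repos.txt in the current directory, after which the --multirepo flag can be used as described.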

Add --num with a multiple of 100, from 100 (default) to 500, to set the number of contributors analyzed per repo.
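When gitgeo is driven from a script, it can help to validate the options before launching the command. The sketch below assembles an argument list using only flags described in this section and checks the --num constraint. The function build_gitgeo_command and its parameters are hypothetical, and whether every flag combination is accepted by the tool is not stated here, so treat this as an illustration only.

```python
import shlex


def build_gitgeo_command(target, is_repo=False, summary=False,
                         make_map=False, num=100):
    """Build an argv list for gitgeo from flags described in this README.

    Hypothetical helper for illustration; not part of GitGeo.
    """
    # --num must be a multiple of 100 from 100 (default) to 500.
    if num % 100 != 0 or not 100 <= num <= 500:
        raise ValueError("num must be a multiple of 100 between 100 and 500")
    # --map is described only in combination with --repo.
    if make_map and not is_repo:
        raise ValueError("--map is documented for use with --repo")
    cmd = ["gitgeo", "--repo" if is_repo else "--package", target]
    if summary:
        cmd.append("--summary")
    if make_map:
        cmd.append("--map")
    if num != 100:
        cmd += ["--num", str(num)]
    return cmd


cmd = build_gitgeo_command(
    "www.github.com/psf/requests", is_repo=True, make_map=True, num=300
)
print(shlex.join(cmd))
# gitgeo --repo www.github.com/psf/requests --map --num 300
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the GitHub credentials are configured.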

Run tests:

pytest

Roadmap

  • Investigate capability of predicting location via a model given only timestamp from commit and commit-related data. (Kinga)

  • Investigate GitHub API for examining merges and who has merge rights.

  • Add capability of reading through commits, specifically to determine whether GitHub commit rights can be inferred.

  • Investigate capability of extracting all users associated with a GitHub group.

  • Investigate capability to determine authenticity of location information.

  • Investigate possibility of geographic diversity score for a repo or package.

  • Investigate possibility of linking emails in commits to email breach lists.

  • Investigate possibility of determining whether a project is a “hobby” project (outside of working hours) or a “work” project (within working hours).

  • Investigate possibility of using NLP to determine codebase specialties of each contributor. e.g. This person is the “auth” person.

  • Investigate visualization of commit analysis over time.

  • Add capability to dump multirepo results (or a similar aggregate scan) to S3.

  • Investigate diff to tweet capability. Reveal major contributor changes in critical projects to an open feed.

  • Investigate switching ownership data. Would be interesting to alert users to this.

  • Investigate by-user capability. Determine all repos a user has contributed to. Do a quick git blame for a user.

Rainy Day Options

  • Access commercial APIs to enrich data on GitHub usernames or, if included in GitHub profile, email handles, etc. Perhaps People Data Labs or Explorium. (MK)

Potential Research Questions

  • Are there places in the world with unrecognized pockets of software developers?

  • Where are maintainers associated with the most critical Python packages?

    • Who are the maintainers that are associated with multiple critical Python packages?

    • What about contribution-related weighting?

  • Where are the maintainers associated with the top GitHub packages by stars? Top data science packages? Quantum computing packages? Blockchain packages? Etc? (RP)

    • Then do sub-analysis that asks on what repos or types of repos developers of a given country are most active.

  • What predicts the number of software developers of top Python packages by country?

    • Total number of coders per country?

    • Total number of Python coders per country?

    • GDP per capita per country?

  • Is it possible to “verify” user information?

Known bugs

Want to contribute?

  • Open a PR. We are glad to accept pull requests. We use black, pylint, and pydocstyle, though we are glad to help if you haven’t used those tools before.

  • Open an issue. Tell us your problem or a functionality you want.

  • Want to help build a community related to GitGeo and similar open source software ecosystem exploration tools? Please send an email to jmeyers@iqt.org.
