Package defining the Lancaster Observational Astronomy group

Lancstro: an example of creating a Python package

This repository and the following text are intended as a basic tutorial on creating and publishing a Python package. It was created for a seminar given to the Lancaster University Observational Astrophysics group, but may be more widely applicable.

What is a Python package

In general, a Python package is a set of Python modules and/or scripts and/or data that are installable under a common namespace (the package's name). A package might also be referred to as a library. This is different from a collection of individual Python files that you have in a folder, which will not be under a common namespace and are only accessible if their path is in your PYTHONPATH or you use them from the directory in which they live.

A couple of examples of common Python packages used in research in the physical sciences are:

  1. NumPy
  2. SciPy

Note: "namespace" basically refers to the name of the package as you would import it, e.g., if you import numpy with import numpy, then you will access all NumPy's functions/classes/modules via the numpy namespace:

numpy.sin(2.3)

A package can contain everything within a single namespace, or contain various submodules, e.g., parts that contain common functionality that naturally fits together in its own namespace. For example, in NumPy, the random submodule contains functions and classes for generating random numbers:

import numpy
numpy.random.randn()  # generate a normally distributed random number

Why package my code?

So, why should you package (and publish) your Python code rather than just having local scripts? Well, there are several reasons:

  • It creates an installable package that can be imported without having to have the Python script/file in your path.
  • It creates a “versioned” package that can have specified features/dependencies. This is very important for reproducibility of results, where a specific code version used for an analysis can be pointed to.
  • You can share your package with others (you can make it pip installable via PyPI, or conda installable via conda-forge), which can be important when working with collaborators.
  • You will gain developer kudos! Software development is a major skill you learn during your research, so show off what you’ve done and add it to your CV.

Project structure

To create a Python package you should structure the directory containing your code in the following way (the directory name containing this information does not have to match the package name, but often they will):

repo/
├── LICENSE
├── pyproject.toml
├── README.md
├── setup.cfg
├── setup.py
├── pkgname/
│   ├── __init__.py
│   └── example.py
└── bin/
    └── executable_script.py

There are other slight variations on this (for example, using a src directory in which your package directories live, as described in the official guidelines).

In this project the structure is:

lancstro/
├── LICENSE
├── pyproject.toml
├── README.md
├── setup.cfg
├── setup.py
├── lancstro/
│   ├── __init__.py
│   ├── base.py
│   ├── members/
│   │   ├── __init__.py
│   │   └── staff.py
│   └── data/
│       └── office_numbers.txt
└── bin/
    └── favourite_object.py

Here, there is a "submodule" called members within the main lancstro package.
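
As a rough illustration of the generic layout above, the following sketch creates the empty skeleton using pathlib. The function name scaffold_package and the choice to leave every file empty are my own assumptions for illustration; this is not a tool used by this project.

```python
import tempfile
from pathlib import Path

# files in the generic layout shown above, relative to the repository root
LAYOUT = [
    "LICENSE",
    "pyproject.toml",
    "README.md",
    "setup.cfg",
    "setup.py",
    "{pkg}/__init__.py",
    "{pkg}/example.py",
    "bin/executable_script.py",
]


def scaffold_package(root, pkgname):
    """Create empty placeholder files for a package called pkgname under root."""
    created = []
    for template in LAYOUT:
        path = Path(root) / template.format(pkg=pkgname)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()  # leave the file empty; fill it in afterwards
        created.append(path)
    return created


# try it in a throwaway directory rather than a real repository
repo = Path(tempfile.mkdtemp()) / "repo"
for path in scaffold_package(repo, "pkgname"):
    print(path.relative_to(repo))
```

The files still need real contents (a license text, the setup.cfg metadata, and so on), which the following sections describe.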

Using Github

Your package should be in a version control system and ideally hosted somewhere that provides a backup. It is now very common to use git for version control and it is sensible to host the project on Github/Gitlab/bitbucket or similar. On Github you can have public or private repositories.

If using Github, it is best to start the project by creating a new repository there first, then cloning that repository to your machine before adding in your code. When creating a Github repository (I might use "repo" for short later) you can initialise it with a license file and a README file.

Note: this is not a tutorial on using git, so you'll have to find that elsewhere.

The LICENSE file

You should give your code a license describing the terms of use and copyright. Often you'll want your code to be open source, so a good choice is the MIT license, which is very permissive in terms of reuse of the code. A variety of other open source licenses are available, although these often differ slightly in their permissiveness, i.e., whether others can use your code in commercial and non-open source projects or not.

The LICENSE file will contain a plain ascii text copy of your license.

The pyproject.toml file

This file tells the pip tool used for installing packages how it should build the package. In this repo we have used the file contents suggested here, which means that the setuptools package is used for the build.

The README.md file

This is the file that you are currently reading! It should provide a basic description of your package, maybe including information about how to install it. Ideally it should be brief and not be seen as a replacement for having proper documentation for your code available elsewhere.

In this case the suggested format for the file is Markdown (the .md extension), but it could be a plain ascii text file or reStructuredText. Markdown and reStructuredText will be automatically rendered if you host your package on, e.g., Github.

The setup.cfg and setup.py files

In many packages you might just see a setup.py file, which is the build script used by setuptools. However, it is now good practice to put "static" metadata about your package in the setup.cfg configuration file. By "static" I mean any package information that does not have to be dynamically defined during the build process (such as defining and building Cython extensions). In many cases, like this repository, this can mean the setup.py file can be very simple and just contain:

from setuptools import setup

setup()

The layout of the configuration file is described here. I'll reproduce the one from this project below with additional inline comments:

[metadata]
# the name of the package
name = lancstro

# the package author information (multiple authors can just be separated by commas)
author = Matthew Pitkin
author_email = m.pitkin@lancaster.ac.uk

# a brief description of the package
description = Package defining the Lancaster Observational Astronomy group

# the license type and license file
license = MIT
license_files = LICENSE

# a more in-depth description of the project that will appear on its PyPI page,
# in this case read in from the README.md file
long_description = file: README.md
long_description_content_type = text/markdown

# the project's URL (often the Github repo URL)
url = https://github.com/mattpitkin/lancstro

# standard classifiers giving some information about the project
classifiers =
    Intended Audience :: Science/Research
    License :: OSI Approved :: MIT License
    Natural Language :: English
    Programming Language :: Python
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
    Topic :: Scientific/Engineering
    Topic :: Scientific/Engineering :: Astronomy
    Topic :: Scientific/Engineering :: Physics

# the package's current version (this isn't actually in the file in this repo, see later!)
version = 0.0.1

[options]
# state the Python versions that the package requires/supports
python_requires = >=3.6

# state packages and versions (if necessary) required for running the setup
setup_requires =
    setuptools >= 43
    wheel

# state packages and versions (if necessary) required for installing and using the package
install_requires =
    astropy
    astroquery >= 0.4.3

# automatically find all modules within this package
packages = find:

# include data in the package defined below
include_package_data = True

# any executable scripts to include in the package
scripts =
    bin/favourite_object.py

[options.package_data]
# any data files to include in the package (lancstro shows they are in the
# lancstro package and then the paths are given)
lancstro = 
    data/office_numbers.txt

For a list of the standard "classifiers" that you can add see here.

In this project, we have added a "data" file that comes bundled with the package. It is not required to include data in your package.
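
Because setup.cfg uses the INI-style syntax that Python's configparser module reads, you can check what you have written programmatically. Below is a minimal sketch that assumes a cut-down fragment of the file above. The helper name read_list_option is hypothetical.

```python
import configparser

# a shortened copy of the configuration shown above
SETUP_CFG_FRAGMENT = """
[metadata]
name = lancstro
license = MIT

[options]
python_requires = >=3.6
install_requires =
    astropy
    astroquery >= 0.4.3
"""


def read_list_option(cfg_text, section, option):
    """Return a multi-line option (one entry per line) as a list of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    raw = parser.get(section, option)
    # indented continuation lines are joined with newlines by configparser
    return [line.strip() for line in raw.splitlines() if line.strip()]


print(read_list_option(SETUP_CFG_FRAGMENT, "options", "install_requires"))
```

For a real project you would pass the contents of the setup.cfg file itself rather than a fragment held in a string.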

Adding a package version

In the above case the package version is set manually in the setup.cfg file. It is up to you how you define the version string, but it is often good to use Semantic Versioning. In this format the version consists of three full-stop separated numbers: MAJOR.MINOR.PATCH.

The Semantic Versioning site gives the following definitions of when to change the numbers:

  1. MAJOR version when you make incompatible API changes,
  2. MINOR version when you add functionality in a backwards compatible manner, and
  3. PATCH version when you make backwards compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
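
To make the three rules concrete, here is a small sketch of parsing, bumping and comparing plain MAJOR.MINOR.PATCH strings. The Version class is my own illustration (it ignores the pre-release and build labels mentioned above) and is not part of this package.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text):
        """Turn a string such as '0.0.1' into a Version."""
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, part):
        """Return the next version; lower-order numbers reset to zero."""
        if part == "major":
            return Version(self.major + 1, 0, 0)
        if part == "minor":
            return Version(self.major, self.minor + 1, 0)
        if part == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown version part: {part!r}")

    def __str__(self):
        return f"{self.major}.{self.minor}.{self.patch}"


current = Version.parse("0.0.1")
print(current.bump("patch"))  # a backwards compatible bug fix
print(current.bump("minor"))  # new backwards compatible functionality
print(current.bump("major"))  # an incompatible API change
```

Because the numbers are compared as integers rather than as text, 1.10.0 correctly sorts after 1.2.3.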

To update the version you can just edit the value in the setup.cfg file. When you install this will be the package's version.

This allows the package manager (e.g., pip) to know what version of the package is installed. However, it is often useful to provide the version number as a variable within the package itself, so that the user can check it if necessary. Most often you will find this as a variable called __version__, e.g.,:

import numpy
print(numpy.__version__)
1.21.2

There are several ways to set this, but it is best to make sure that there's only one place that you have to edit the version number rather than multiple places. One method (used in this package) is to include the version number in your package's main __init__.py file by adding the line:

__version__ = "0.0.1"

Then, within setup.cfg, the version line can be:

version = attr: lancstro.__version__

Among the other options, a good one is to set the version with a tool such as setuptools-scm, which gathers the version information from git tags in your repo.
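
A further way to look up the version at run time, which is not what this package does, is to ask the installed package metadata using importlib.metadata from the standard library. This needs Python 3.8 or later, so it is an assumption beyond the 3.6 minimum declared in the setup.cfg above. The helper name installed_version and the fallback string are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution, default="unknown"):
    """Return the installed version of a distribution, or default if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # the package has not been installed in this environment
        return default


print(installed_version("lancstro"))
```

This reports whatever version pip recorded at install time, so it only works for installed packages and not for code run straight from a source directory.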

The MANIFEST.in file

You can specify additional files that you want to be bundled with the package's source distribution using a MANIFEST.in file. With modern versions of setuptools (e.g., greater than 43) most of the standard files such as the README file and setup files, and any license file given in setup.cfg, are automatically included in the source distribution by default. Hence, I do not include a MANIFEST.in file in this repository.

However, you may want to include other files. If you had, say, a test directory with multiple Python test scripts that you wanted in the package, you could add a MANIFEST.in file containing:

recursive-include test/ *.py

which will include all .py files within test.

The package source directory

In this project the directory containing the package source code, i.e., the Python files, is called lancstro/. In this case it has two files in it (although it can contain any number of Python files, each of which will be a module that is available in the package):

  1. __init__.py
  2. base.py

The base.py file contains some Python code, in this case a class called GroupMember, which is part of our package.

The __init__.py file is very important. It is what tells Python that this directory is a package. The __init__.py file can be completely empty, but it does need to be present. It can contain any Python code (you could define your whole package in the __init__.py file if you wanted), but often it is used to import things from submodules/subpackages into the package's namespace. In this case the __init__.py file contains the following code:

from .base import GroupMember
from . import members

__version__ = "0.0.2"  # the version number of the code

The first line imports the GroupMember class from the base.py file, so that the GroupMember class can be used from the lancstro namespace rather than the lancstro.base namespace. E.g., this means that when using the package we could do:

from lancstro import GroupMember

rather than

from lancstro.base import GroupMember

although both will work. You may want to do this for commonly used functions or classes, but it is not necessary.

The lancstro/ directory also contains the directory members/, which is a subpackage of the package (any subpackage must also contain its own __init__.py file). The second line of the __init__.py file imports the members subpackage into the lancstro namespace. E.g., if I just do:

import lancstro

then I can access things from the members subpackage using

lancstro.members.staff

rather than doing:

from lancstro.members import staff

although (again) both will work.

The final line in the __init__.py file sets the version number of the package.

The data directory

You might want to include some data files in your package, e.g., a look-up table for a calculation, a catalogue, etc. In this case I've added a JSON file, office_numbers.txt, in a directory called data/ (any name can be used, but data seems quite sensible!). This directory does not need an __init__.py as it is not a package. To include this file in the package you need to have the line:

include_package_data = True

in your setup.cfg file and also list it in the [options.package_data] section, e.g.,:

[options.package_data]
lancstro = 
    data/office_numbers.txt
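
The steps above bundle the file, but do not show how code might read it back once the package is installed. One possible approach, assuming Python 3.9 or later, is importlib.resources.files. The sketch below builds a throwaway package called demo_data_pkg (a hypothetical name, as are the file contents and the helper load_package_json) so that it runs without lancstro being installed.

```python
import importlib
import json
import sys
import tempfile
from importlib.resources import files  # available from Python 3.9
from pathlib import Path


def load_package_json(package, relative_path):
    """Read and decode a JSON file bundled inside an importable package."""
    resource = files(package).joinpath(relative_path)
    return json.loads(resource.read_text(encoding="utf-8"))


# build a throwaway package with a data/ directory (made-up contents)
root = Path(tempfile.mkdtemp())
data_dir = root / "demo_data_pkg" / "data"
data_dir.mkdir(parents=True)
(root / "demo_data_pkg" / "__init__.py").write_text("")
(data_dir / "office_numbers.txt").write_text(
    '{"A. Person": "B42"}', encoding="utf-8"
)

# make the throwaway package importable for this session only
sys.path.insert(0, str(root))
importlib.invalidate_caches()

offices = load_package_json("demo_data_pkg", "data/office_numbers.txt")
print(offices["A. Person"])
```

For this project the equivalent call would presumably be load_package_json("lancstro", "data/office_numbers.txt"). Looking the file up through the package, rather than through a hard-coded path, means it is found wherever pip has installed it.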

Intra-package references

In your package you can import things from the various submodules/subpackages using the . notation.

For example, to import things between Python files in the same part of the package (e.g., at the lancstro/ level), you can do:

from .base import GroupMember

which imports from the base.py file.

If a file in a subpackage wants to import from the level above, e.g., a Python file in lancstro/members wants to import from a file in lancstro/, then you could use:

from ..base import GroupMember

I.e., use two dots .. to specify going up one package level.
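
To see the single-dot and double-dot imports working together, the following sketch writes a tiny package with the same shape as this one into a temporary directory and imports it. The package name demo_group, the Staff class and the file contents are all hypothetical stand-ins; the real lancstro source is not reproduced here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PKG = "demo_group"  # made-up name, chosen to avoid clashing with a real install

FILES = {
    # one dot: import from modules alongside this __init__.py
    f"{PKG}/__init__.py": "from .base import GroupMember\nfrom . import members\n",
    f"{PKG}/base.py": (
        "class GroupMember:\n"
        "    def __init__(self, name):\n"
        "        self.name = name\n"
    ),
    f"{PKG}/members/__init__.py": "from . import staff\n",
    # two dots: import from the parent package of members/
    f"{PKG}/members/staff.py": (
        "from ..base import GroupMember\n\n"
        "class Staff(GroupMember):\n"
        "    pass\n"
    ),
}


def build_demo_package(root):
    """Write the files above under root, creating directories as needed."""
    for relative, source in FILES.items():
        path = Path(root) / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(source)


tmp = tempfile.mkdtemp()
build_demo_package(tmp)
sys.path.insert(0, tmp)
importlib.invalidate_caches()
demo = importlib.import_module(PKG)

member = demo.members.staff.Staff("A. Person")
print(isinstance(member, demo.GroupMember))
```

Relative imports like these only work inside a package, which is one reason the __init__.py files matter.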

The bin directory

You may want to include executable scripts in your package. It is good to place them in a directory called, for example, bin/ in the root directory of your repository. To make these part of the package you need to list them in the setup.cfg file under the scripts option of the [options] section, e.g.,

scripts =
    bin/favourite_object.py

Once the package is installed these scripts should be in your path and usable with, e.g.,:

$ favourite_object.py -h
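
The contents of favourite_object.py are not reproduced in this README, so the following is only a guess at the general shape of such a script, using argparse with wording modelled on this script's help text. The FAVOURITES lookup table and its single entry are made up for illustration.

```python
import argparse

# hypothetical lookup table; the real script's data source is not shown here
FAVOURITES = {"Jane Doe": "M31"}


def build_parser():
    """Create the command line parser for the script."""
    parser = argparse.ArgumentParser(
        prog="favourite_object.py",
        description="Get a staff member's favourite object",
    )
    # nargs=2 collects the first name and surname as two positional values
    parser.add_argument("name", nargs=2, help="The staff member's full name")
    return parser


def main(argv=None):
    """Parse the arguments (sys.argv when argv is None) and print the result."""
    args = build_parser().parse_args(argv)
    full_name = " ".join(args.name)
    print(FAVOURITES.get(full_name, "unknown"))
    return 0


# equivalent to running: favourite_object.py Jane Doe
main(["Jane", "Doe"])
```

In a real script in bin/ you would start the file with a line such as #!/usr/bin/env python and call main() with no arguments so that the command line is used.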

Installing the package

It is best practice to install Python packages using pip (the "package installer for Python"), so you should have that installed. Once you have the above structure you can install the package (from its root directory) using:

pip install .

where the . just refers to the current directory. The standard install locations are described here, but I would recommend using virtual environments, such as provided via conda, in which case the package will be installed only in the environment.

That's it! Open up a Python terminal (from any location except in the package directory, otherwise it'll get confused!) and you should be able to do:

import lancstro
print(lancstro.__version__)
0.0.1

or run the favourite_object.py script from the command line:

$ favourite_object.py -h
usage: favourite_object.py [-h] name name

Get a staff member's favourite object

positional arguments:
  name        The staff member's full name

optional arguments:
  -h, --help  show this help message and exit

You can then tell other people to clone your Github repo and install things in the same way, or even pip install directly from the repo with, e.g.:

$ pip install git+https://github.com/mattpitkin/lancstro.git#egg=lancstro

These methods will install the very latest code from the repo, so not necessarily a specific version (although that can be done if you've tagged a version or work from a particular git hash).

Publishing the package on PyPI

Rather than getting people to install code directly from your Github repo, it is often better to publish versioned releases of your code. You can publish Python packages on the PyPI (Python Package Index) repository from which they will then be pip installable by anyone!

Firstly, you'll need to register an account on PyPI. Anyone is able to do this. Secondly, you'll need to install the twine package, which is used for uploading packages to PyPI.

Within your repo's root directory (containing setup.py) you can now build a Python wheel (a zipped binary format of the package designed for speedier installation) containing your package with:

python setup.py bdist_wheel sdist

Note: if your code is pure Python, creating a wheel should work straightforwardly, but if not the wheel generation may not work. In these cases you can just build a tarball containing the package using:

python setup.py sdist

This should create a dist/ directory containing a file with the extension .whl (built by including the bdist_wheel argument). This is the Python wheel. It should also contain a tarball of the package (built by including the sdist argument).
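
Since a wheel is a zip archive, you can look inside one with Python's zipfile module before uploading it. Below is a minimal sketch. It builds a stand-in .whl file with made-up contents so that it runs without a real build, and the helper name list_wheel_contents is hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def list_wheel_contents(wheel_path):
    """Return the sorted names of the files stored in a wheel."""
    with zipfile.ZipFile(wheel_path) as archive:
        return sorted(archive.namelist())


# create a stand-in wheel; a real one would come from the dist/ directory
tmp = Path(tempfile.mkdtemp())
fake_wheel = tmp / "demo-0.0.1-py3-none-any.whl"
with zipfile.ZipFile(fake_wheel, "w") as archive:
    archive.writestr("demo/__init__.py", '__version__ = "0.0.1"\n')
    archive.writestr("demo-0.0.1.dist-info/METADATA", "Name: demo\nVersion: 0.0.1\n")

for name in list_wheel_contents(fake_wheel):
    print(name)
```

Running this on your own wheel is a quick way to confirm that data files and subpackages were actually included.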

It is often best to first upload these products to PyPI's testing repository (you'll need to register a separate account for this), which can be done using twine with:

twine upload -r testpypi dist/*

Note: make sure the dist/ directory is empty before generating the new package version with python setup.py bdist_wheel sdist otherwise you might end up uploading multiple versions.

You should be prompted for your username and password, although there are ways to set these as environment variables or using keyring, so that you don't have to enter them each time. If the upload is successful you should be able to see the project on the Test PyPI site, e.g., at https://test.pypi.org/project/lancstro/0.0.2/.

You can test that the package installs correctly from the Test PyPI repository by running (potentially in a new virtual environment):

pip install -i https://test.pypi.org/simple/ lancstro

If you're happy with the package you can proceed to upload it to the main PyPI repository using:

twine upload dist/*

Et voilà! Now you just need to tell people to run:

pip install lancstro

to install your package. If they want to install a particular version they can use, e.g.,:

pip install lancstro==0.0.2

Or, if there's a lower or upper version that must be used the inequality operators can be used instead, e.g.,:

pip install "lancstro<=0.0.2"

Publishing the package on conda-forge

You may (and should!) install Python packages in a virtual environment that is relevant for the particular project that you are working on. A popular virtual environment/package manager tool is conda, which is installed as part of Anaconda. Conda is a package manager for a variety of software, not just Python packages, so if creating a conda package for your Python project you can make it dependent on specific versions of non-Python libraries (maybe you want to use a specific version of GSL!).

You can build a conda package and host it in your own account on Anaconda.org. However, a popular repository for hosting projects is conda-forge. An advantage of hosting your package on conda-forge is that it will have been automatically verified by a test suite and reviewed by an actual person, so hopefully will be more robust for other users.

Getting a package on conda-forge is quite a bit more involved than uploading to PyPI, although if you already have your package on PyPI that is an advantage (and is what I'll assume in the example below). The basic steps are given here, but you will need a Github account. I'll detail these a bit more below.

Note: you will need to have uploaded the package source tarball to PyPI for these instructions to work.

  1. Go to https://github.com/conda-forge/staged-recipes and fork the repository to your own account.

  2. In your fork of the repository create a new branch. If you've cloned your fork of the repository you might do:

    git checkout -b add_lancstro_to_conda_forge
    
  3. In the recipes/ directory create a new directory with the name of your package and copy the meta.yaml file from the example/ directory into it:

    cd recipes
    mkdir lancstro
    cp example/meta.yaml lancstro
    
  4. Open up the copied meta.yaml file in a text editor and change it to look something like below (I've removed a lot of the comments):

{% set name = "lancstro" %}
{% set version = "0.0.1" %}

package:
  name: {{ name|lower }}
  version: {{ version }}

source:
  url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/{{ name }}-{{ version }}.tar.gz
  # get the SHA256 check sum of the file (on the PyPI page for the package
  # click on "Download files" and then "View" under the "Hashes" heading)
  sha256: 2873bb17f5e8cc84ac19e22307cc8567273fcdc57e5dd1f57fe52b2b1a6b1da3

build:
  noarch: python
  number: 0
  script: "{{ PYTHON }} -m pip install . -vv"

requirements:
  host:
    # packages required to build and install the package
    - python
    - pip
    - setuptools
  run:
    # packages required to run the package
    - astropy
    - astroquery >= 0.4.3
    - python

test:
  # make sure the package can at least be imported (other tests can be added)
  imports:
    - lancstro

about:
  home: https://github.com/mattpitkin/lancstro
  license: MIT
  license_family: MIT
  summary: 'My great package'
  description: |
    An example package for showing how to package a package.
  doc_url: https://lancstro.readthedocs.io/
  dev_url: https://github.com/mattpitkin/lancstro

extra:
  recipe-maintainers:
    # github ids for maintainers
    - mattpitkin

  5. Commit the changes and push them to your fork of the staged-recipes repository.
  6. Open up a pull request (PR) between your branch and conda-forge's staged-recipes repo. Call the PR something like "Add lancstro". Create the pull request.
  7. After a while check that the test builds in the PR have completed successfully. If not try and fix the issue by editing the (forked) meta.yaml file.
  8. Answer and respond to any questions/comments from the assigned reviewer (you shouldn't have to assign a reviewer, but sometimes you need to prod the appropriate channel).
  9. Wait for a reviewer to sign off and merge the PR.

At this point your package should be installable from conda-forge using, e.g.,:

conda install -c conda-forge lancstro
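
For the sha256 field in the recipe, an alternative to copying the hash from the PyPI page is to compute it yourself from the source tarball. Below is a minimal sketch using hashlib. The helper name sha256_of_file is hypothetical, and the demonstration hashes a small temporary file rather than a real tarball.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # read fixed-size chunks until read() returns empty bytes at end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# demonstrate on a small temporary file
payload = b"example tarball contents"
sample = Path(tempfile.mkdtemp()) / "sample.tar.gz"
sample.write_bytes(payload)
print(sha256_of_file(sample))
```

The value printed for your real tarball (e.g., the .tar.gz file in dist/) should match the hash listed on PyPI for the same file.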

Documentation

You should try not to just write code for yourself. Academic results should be transparent and reproducible, so the code you write and use should be usable by others, therefore Write The Docs!

Creating documentation for your code doesn't just mean that your code should contain comments (which it definitely should!), but there should also be documentation (on, e.g., a website) on how to install and use your code. This should include information on the code's API (just a fancy way of saying show how to use the functions and classes in your package). It is also important to have examples of use cases as it's often good to "show not tell". You can store the documentation source files in the same repository as your package (e.g., a docs/ folder).

I'm not going to describe in detail how to add documentation to a package (I haven't added it into this package yet, but I may add this in the future!), but will just point towards some resources. Two packages that you may want to look into for building documentation are:

  1. Sphinx
  2. mkdocs

Both of these allow you to write documentation in Markdown or reStructuredText and automatically include (via various extensions/plugins) code docstrings. They can also include Jupyter notebooks.

For repositories hosted on Github, you can easily and freely set up building and hosting of the documentation on Read the Docs. You can also publish your documentation directly on Github using Github Pages.

There is an example of using Sphinx for documenting a package here.

Contributions

Your code may be the product of many developers' work. If it's open source you may also be open to having other developers contributing to it. You should therefore have instructions on how people should contribute and guidelines on the expected behaviour of contributors.

Often you will see a CONTRIBUTING.md Markdown file in package repositories that describes how to contribute. If a contributor wants to add/request a new feature, or fix a bug, then they may want to open a Github issue (or post on an appropriate forum) to see if the feature is useful/bug is known. If they have coded up a bug fix/feature then adding that into the repository often involves a "fork-and-pull request" workflow process (this is the process for many projects, e.g., NumPy, astropy):

  1. fork the repository to your own Github account
  2. create a new branch on your fork for development
  3. add and commit your changes making sure that they work and don't break the package
  4. push your commits to your fork
  5. create a pull request with the upstream (i.e., original) repository
  6. respond to any comments on the change
  7. merge the request into the original repository

Code of conduct

You should also consider adding a code of conduct to your project outlining expected behaviours during interactions between developers/contributors. There are many examples of codes of conduct that you can often use verbatim (many are licensed using Creative Commons licenses) or adapt to your needs.

Code style

You may want to enforce a particular style for your code. Many projects follow the PEP8 style guide. There are packages that you can run on your code to help it conform to this style, e.g., black (which reformats the code automatically) or flake8 (which reports violations for you to fix), so you should tell contributors to run these on any code they submit (and make sure you run them yourself!). You can also add the pep8speaks app on Github that will check that any pull request conforms to PEP8 and inform the committer of any violations of the style.

You can force checks to happen automatically by using the pre-commit package to add "pre-commit" hooks to git, so that it automatically runs, e.g., black, on any committed code.

Making code citable

Your code is a very large part of your academic output, so it's good to make your package citable. This way you can receive appropriate acknowledgement when people use it and show evidence of your output. There are a variety of ways of doing this (skewed towards Astro/Physics):

  • For packages on Github, link your repository to Zenodo which will provide a citable DOI for your project.
  • Get it linked onto the Astrophysics Source Code Library (ASCL). This is indexed on NASA ADS, but does not give a DOI.
  • Write a paper for the Journal of Open Source Software (JOSS). This is a very light touch, but peer reviewed publication that also provides a DOI and is indexed on NASA ADS. It does require you to have proper documentation for your package as an acceptable level of documentation is part of the review.
  • Write a paper for a standard journal. Many journals (MNRAS, ApJ, PASP, etc) do now accept papers on software, although it's likely that they should also include a description of a practical use case for the software.

Not covered here!

There are many additional useful things that I've not covered here. These include:

  • using entry point console scripts rather than, or as well as, including executable scripts
  • including C/C++/FORTRAN code, or Cython-ized code, in your package
  • creating a test suite for your package (and checking its coverage)
  • setting up continuous integration for building and testing (and automatically publishing) your code (e.g., with Github Actions, TravisCI, ...)

I may add these at a later date.

Other resources

For other descriptions of creating your Python code see:
