lda: Topic modeling with latent Dirichlet allocation
====================================================

|pypi| |travis| |crate|

Topic modeling with latent Dirichlet allocation. ``lda`` aims for simplicity.

``lda`` implements latent Dirichlet allocation (LDA) using collapsed Gibbs
sampling. LDA is described in `Blei et al. (2003)`_ and `Pritchard et al. (2000)`_.
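
For readers unfamiliar with collapsed Gibbs sampling, the following is a
minimal, unoptimized sketch of the idea in NumPy. It is not the code used by
``lda``; the function name ``gibbs_lda``, its signature, and the default values
of ``alpha`` and ``eta`` are assumptions made for illustration only.

.. code-block:: python

    import numpy as np


    def gibbs_lda(docs, n_topics, n_vocab, n_iter=50, alpha=0.1, eta=0.01, seed=0):
        """Toy collapsed Gibbs sampler for LDA (illustrative, not lda's code).

        docs is a list of lists of word ids in range(n_vocab).
        Returns (doc_topic, topic_word) as row-normalized arrays.
        """
        rng = np.random.RandomState(seed)
        ndz = np.zeros((len(docs), n_topics))  # topic counts per document
        nzw = np.zeros((n_topics, n_vocab))    # word counts per topic
        nz = np.zeros(n_topics)                # total token count per topic
        # Randomly initialize the topic assignment of every token.
        assignments = []
        for d, doc in enumerate(docs):
            zs = rng.randint(n_topics, size=len(doc))
            assignments.append(zs)
            for w, z in zip(doc, zs):
                ndz[d, z] += 1
                nzw[z, w] += 1
                nz[z] += 1
        for _ in range(n_iter):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    z = assignments[d][i]
                    # Remove the token's current assignment from the counts.
                    ndz[d, z] -= 1
                    nzw[z, w] -= 1
                    nz[z] -= 1
                    # Conditional distribution over topics for this token,
                    # with the topic and document parameters integrated out.
                    p = (ndz[d] + alpha) * (nzw[:, w] + eta) / (nz + n_vocab * eta)
                    z = rng.choice(n_topics, p=p / p.sum())
                    # Record the newly sampled assignment.
                    assignments[d][i] = z
                    ndz[d, z] += 1
                    nzw[z, w] += 1
                    nz[z] += 1
        # Smoothed point estimates from the final state of the sampler.
        doc_topic = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
        topic_word = (nzw + eta) / (nzw + eta).sum(axis=1, keepdims=True)
        return doc_topic, topic_word


    # Three tiny documents over a four-word vocabulary (made-up data).
    docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 0, 1, 3]]
    doc_topic, topic_word = gibbs_lda(docs, n_topics=2, n_vocab=4)
    print(doc_topic.shape, topic_word.shape)

Each sweep resamples the topic of one token at a time, conditioned on all other
assignments, so only the three count arrays need to be kept up to date.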

Installation
------------

``pip install lda``

Getting started
---------------

``lda.LDA`` implements latent Dirichlet allocation (LDA). The interface follows
conventions found in scikit-learn_.
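
As with scikit-learn estimators, the model is fit on a two-dimensional array
with one row per document. In the Reuters example below, the columns of ``X``
correspond to the entries of ``vocab``. A minimal sketch of building such a
document-term count matrix with NumPy follows; the helper name
``build_doc_term_matrix`` and the whitespace tokenization are assumptions for
illustration and are not part of ``lda``.

.. code-block:: python

    import numpy as np


    def build_doc_term_matrix(texts):
        """Return (X, vocab): integer word counts and the sorted vocabulary."""
        # Naive tokenization: lowercase, split on whitespace.
        tokenized = [text.lower().split() for text in texts]
        vocab = tuple(sorted({tok for doc in tokenized for tok in doc}))
        index = {word: j for j, word in enumerate(vocab)}
        X = np.zeros((len(texts), len(vocab)), dtype=np.int64)
        for i, doc in enumerate(tokenized):
            for tok in doc:
                X[i, index[tok]] += 1  # count occurrences of each word
        return X, vocab


    X, vocab = build_doc_term_matrix(["pope visits poland", "pope mass in rome"])
    print(vocab)
    print(X)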

The following demonstrates how to inspect a model of a subset of the Reuters
news dataset.

.. code-block:: python

    >>> import numpy as np
    >>> import lda
    >>> import lda.datasets
    >>> X = lda.datasets.load_reuters()
    >>> vocab = lda.datasets.load_reuters_vocab()
    >>> titles = lda.datasets.load_reuters_titles()
    >>> X.shape
    (395, 4258)
    >>> model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
    >>> model.fit(X)
    >>> topic_word = model.topic_word_  # model.components_ also works
    >>> n_top_words = 8
    >>> for i, topic_dist in enumerate(topic_word):
    ...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    ...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
    Topic 0: church people told years last year time
    Topic 1: elvis music fans york show concert king
    Topic 2: pope trip mass vatican poland health john
    Topic 3: film french against france festival magazine quebec
    Topic 4: king michael romania president first service romanian
    Topic 5: police family versace miami cunanan west home
    Topic 6: germany german war political government minister nazi
    Topic 7: harriman u.s clinton churchill ambassador paris british
    Topic 8: yeltsin russian russia president kremlin moscow operation
    Topic 9: prince queen bowles church king royal public
    Topic 10: simpson million years south irish churches says
    Topic 11: charles diana parker camilla marriage family royal
    Topic 12: east peace prize president award catholic timor
    Topic 13: order nuns india successor election roman sister
    Topic 14: pope vatican hospital surgery rome roman doctors
    Topic 15: mother teresa heart calcutta missionaries hospital charity
    Topic 16: bernardin cardinal cancer church life catholic chicago
    Topic 17: died funeral church city death buddhist israel
    Topic 18: museum kennedy cultural city culture greek byzantine
    Topic 19: art exhibition century city tour works madonna
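
Note that the slice ``[:-n_top_words:-1]`` selects ``n_top_words - 1`` entries,
which is why each topic above lists seven words. If exactly ``n`` words per
topic are wanted, one option is sketched below with a small made-up topic-word
array; the helper ``top_words`` is hypothetical and not part of ``lda``.

.. code-block:: python

    import numpy as np


    def top_words(topic_dist, vocab, n):
        """Return the n highest-probability words, most probable first."""
        # argsort is ascending, so reverse it before taking the first n.
        order = np.argsort(topic_dist)[::-1][:n]
        return [vocab[j] for j in order]


    # Made-up vocabulary and topic-word distributions (each row sums to 1).
    vocab = ("church", "music", "pope", "film")
    topic_word = np.array([[0.1, 0.6, 0.1, 0.2],
                           [0.5, 0.05, 0.4, 0.05]])
    for i, topic_dist in enumerate(topic_word):
        print('Topic {}: {}'.format(i, ' '.join(top_words(topic_dist, vocab, 2))))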

The document-topic distributions are available in ``model.doc_topic_``.

.. code-block:: python

    >>> doc_topic = model.doc_topic_
    >>> for i in range(10):
    ...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
    0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 11)
    1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 0)
    2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 15)
    3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 11)
    4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 15)
    5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 15)
    6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 15)
    7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 15)
    8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 15)
    9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 0)
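
Each row of ``doc_topic`` is a distribution over topics, so ``argmax`` reports
only the single most probable topic. A short sketch of listing the ``k`` most
probable topics per document follows, using a made-up array in place of
``model.doc_topic_``; the helper ``top_topics`` is hypothetical.

.. code-block:: python

    import numpy as np


    def top_topics(doc_topic, k):
        """Return, per document, the indices of the k most probable topics."""
        # Sort each row ascending, reverse the columns, keep the first k.
        return np.argsort(doc_topic, axis=1)[:, ::-1][:, :k]


    # Two made-up documents over three topics (each row sums to 1).
    doc_topic = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6]])
    print(top_topics(doc_topic, 2))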

Requirements
------------

Python 2.7 or Python 3.3+ is required. The following packages are also required:

- numpy_
- scipy_
- pbr_

Caveat
------

``lda`` aims for simplicity. (It happens to be fast, as essential parts are
written in C via Cython_.) If you are working with a very large corpus, you may
wish to use more sophisticated topic models such as those implemented in hca_
and MALLET_. hca_ is written entirely in C and MALLET_ is written in Java.
Unlike ``lda``, hca_ can use more than one processor at a time. Both MALLET_ and
hca_ implement topic models known to be more robust than standard latent
Dirichlet allocation.

Important links
---------------

- Documentation: http://pythonhosted.org/lda
- Source code: https://github.com/ariddell/lda/
- Issue tracker: https://github.com/ariddell/lda/issues

License
-------

lda is licensed under Version 2.0 of the Mozilla Public License.

.. _Python: http://www.python.org/
.. _Cython: http://cython.org/
.. _scikit-learn: http://scikit-learn.org
.. _hca: http://www.mloss.org/software/view/527/
.. _MALLET: http://mallet.cs.umass.edu/
.. _numpy: http://www.numpy.org/
.. _scipy: http://docs.scipy.org/doc/
.. _pbr: https://pypi.python.org/pypi/pbr
.. _Blei et al. (2003): http://jmlr.org/papers/v3/blei03a.html
.. _Pritchard et al. (2000): http://www.genetics.org/content/164/4/1567.full


.. |pypi| image:: https://badge.fury.io/py/lda.png
    :target: https://badge.fury.io/py/lda
    :alt: pypi version

.. |travis| image:: https://travis-ci.org/ariddell/lda.png?branch=master
    :target: https://travis-ci.org/ariddell/lda
    :alt: travis-ci build status

.. |crate| image:: https://pypip.in/d/lda/badge.png
    :target: https://pypi.python.org/pypi/lda
    :alt: pypi download statistics
