fancyimpute
Matrix completion and feature imputation algorithms
|Build Status| |Coverage Status| |DOI|
fancyimpute
===========
A variety of matrix completion and imputation algorithms implemented in
Python.
Usage
-----
.. code:: python
from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replace with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).complete(X_incomplete)
# matrix completion using convex optimization to find low-rank solution
# that still matches observed values. Slow!
X_filled_nnm = NuclearNormMinimization().complete(X_incomplete)
# Instead of solving the nuclear norm objective directly, instead
# induce sparsity using singular value thresholding
X_filled_softimpute = SoftImpute().complete(X_incomplete_normalized)
# print mean squared error for the three imputation methods above
nnm_mse = ((X_filled_nnm[missing_mask] - X[missing_mask]) ** 2).mean()
print("Nuclear norm minimization MSE: %f" % nnm_mse)
softImpute_mse = ((X_filled_softimpute[missing_mask] - X[missing_mask]) ** 2).mean()
print("SoftImpute MSE: %f" % softImpute_mse)
knn_mse = ((X_filled_knn[missing_mask] - X[missing_mask]) ** 2).mean()
print("knnImpute MSE: %f" % knn_mse)
Algorithms
----------
- ``SimpleFill``: Replaces missing entries with the mean or median of
each column.
- ``KNN``: Nearest neighbor imputations which weights samples using the
mean squared difference on features for which two rows both have
observed data.
- ``SoftImpute``: Matrix completion by iterative soft thresholding of
SVD decompositions. Inspired by the
`softImpute <https://web.stanford.edu/~hastie/swData/softImpute/vignette.html>`__
package for R, which is based on `Spectral Regularization Algorithms
for Learning Large Incomplete
Matrices <http://web.stanford.edu/~hastie/Papers/mazumder10a.pdf>`__
by Mazumder et. al.
- ``IterativeSVD``: Matrix completion by iterative low-rank SVD
decomposition. Should be similar to SVDimpute from `Missing value
estimation methods for DNA
microarrays <http://www.ncbi.nlm.nih.gov/pubmed/11395428>`__ by
Troyanskaya et. al.
- ``MICE``: Reimplementation of `Multiple Imputation by Chained
Equations <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/>`__.
- ``MatrixFactorization``: Direct factorization of the incomplete
matrix into low-rank ``U`` and ``V``, with an L1 sparsity penalty on
the elements of ``U`` and an L2 penalty on the elements of ``V``.
Solved by gradient descent.
- ``NuclearNormMinimization``: Simple implementation of `Exact Matrix
Completion via Convex
Optimization <http://statweb.stanford.edu/~candes/papers/MatrixCompletion.pdf>`__
by Emmanuel Candes and Benjamin Recht using
`cvxpy <http://www.cvxpy.org/en/latest/>`__. Too slow for large
matrices.
- ``BiScaler``: Iterative estimation of row/column means and standard
deviations to get doubly normalized matrix. Not guaranteed to
converge but works well in practice. Taken from `Matrix Completion
and Low-Rank SVD via Fast Alternating Least
Squares <http://arxiv.org/abs/1410.2596>`__.
.. |Build Status| image:: https://travis-ci.org/hammerlab/fancyimpute.svg?branch=master
:target: https://travis-ci.org/hammerlab/fancyimpute
.. |Coverage Status| image:: https://coveralls.io/repos/github/hammerlab/fancyimpute/badge.svg?branch=master
:target: https://coveralls.io/github/hammerlab/fancyimpute?branch=master
.. |DOI| image:: https://zenodo.org/badge/doi/10.5281/zenodo.51773.svg
:target: http://dx.doi.org/10.5281/zenodo.51773
Alex Rubinsteyn, Sergey Feldman
de8d73761443fcd1ab4ab95e6ac45079e485b15b
0.2.0