
Genomics

Modular Distributed TensorFlow Framework

Extract, Transform, Load (ETL) Pipeline

  1. Extract: Read data from a local file system (HDD or SSD) or a remote file system (GCS or HDFS)
  2. Transform: Effectively utilize CPU cores to parse, preprocess, and batch the data
  3. Load: Do the heavy lifting of computation on many GPUs or TPUs, locally or across a cluster (a minimal tf.data sketch of these stages follows this list)
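As a rough illustration, here is a minimal tf.data sketch of the three stages. The file names, feature spec, and sizes are hypothetical; adapt them to the actual record format.

```python
import tensorflow as tf

# Hypothetical TFRecord files and feature spec, for illustration only.
filenames = ["train-00.tfrecord", "train-01.tfrecord"]

def parse_example(serialized):
    features = tf.parse_single_example(serialized, {
        "sequence": tf.FixedLenFeature([100], tf.int64),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    return features["sequence"], features["label"]

# Extract: read records from local disk or remote storage (GCS paths work too).
dataset = tf.data.TFRecordDataset(filenames)
# Transform: parse and preprocess on multiple CPU cores, shuffle, and batch.
dataset = dataset.map(parse_example, num_parallel_calls=4)
dataset = dataset.shuffle(buffer_size=10000).batch(128)
# Prefetch so preprocessing overlaps with accelerator computation.
dataset = dataset.prefetch(1)

# Load: the iterator feeds the heavy computation on GPUs/TPUs.
sequences, labels = dataset.make_one_shot_iterator().get_next()
```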

Feeding Data to the Graph

  1. Initialize Tensors with the input data inside the Graph: bloats the Graph size; usable only for trivial problems on a single GPU; very inefficient, since the Graph must be duplicated on multiple devices
  2. Feed data into the Graph using a feed dictionary: huge memory utilization, and also a huge disk requirement for large preprocessed data
  3. Input pipeline using Queues: queue runners are implemented in Python; efficient, but cannot saturate current-generation multi-GPU setups (this is the approach used in this code; a minimal sketch follows this list)
  4. Input pipeline using the tf.data API: implemented in C++; parallelizes the I/O, transform, and load steps using background threads; recommended
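A minimal sketch of the queue-based pipeline (option 3), the approach this code base uses; the file name and column layout here are assumptions, not the repository's actual reader.

```python
import tensorflow as tf

# Queue of file names feeding a line reader; the file name is hypothetical.
filename_queue = tf.train.string_input_producer(["samples.tsv"])
reader = tf.TextLineReader()
_, line = reader.read(filename_queue)

# Assume two tab-separated columns: a sequence string and an integer label.
sequence, label = tf.decode_csv(line, record_defaults=[[""], [0]],
                                field_delim="\t")

# Queue runners (Python threads) fill the batching queue in the background.
seq_batch, label_batch = tf.train.shuffle_batch(
    [sequence, label], batch_size=64,
    capacity=10000, min_after_dequeue=1000, num_threads=4)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run([seq_batch, label_batch]))
    coord.request_stop()
    coord.join(threads)
```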

Variable Distribution in Multi-GPU and Distributed Models

  1. Parameter Servers: parameters are pinned to the parameter servers and implicitly copied to the workers; gradients are computed at the workers and aggregated at the parameter servers (see the placement sketch below)
  2. Replicated Variables: each GPU or worker has its own copy of the variables; a single device (CPU or GPU) is then used to aggregate the gradients
  3. Replicated Variables in a Distributed Environment: each worker has a local copy of the variables; the local copy is then updated via a parameter server, which aggregates the gradients

Keeping local copies of variables allows for faster computation.
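A minimal sketch of parameter-server placement using tf.train.replica_device_setter; the cluster addresses, task index, and model shapes are hypothetical.

```python
import tensorflow as tf

# Hypothetical cluster: one parameter server, two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

task_index = 0  # this worker's index; normally supplied via a flag
with tf.device(tf.train.replica_device_setter(
        cluster=cluster,
        worker_device="/job:worker/task:%d" % task_index)):
    # Variables are automatically pinned to the ps job and implicitly
    # copied to the workers; ops run on the chosen worker device.
    w = tf.get_variable("w", shape=[256, 10])
    x = tf.placeholder(tf.float32, [None, 256])
    logits = tf.matmul(x, w)
```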

Code Description

  1. Multi-GPU model; uses an input pipeline based on Queues; variable distribution is done using the Parameter Server approach
  2. Parameters are pinned to the CPU, and any available GPUs serve as workers (see the tower sketch after this list)
  3. Very modular, object-oriented design: the core module abstracts away basic routine functionality and also provides layers for implementing new models
  4. For example, for the sample dataset, I only implemented the TFBSAAFileReader and TFBS_AA_CNN_MODEL classes, apart from the (initial, data-specific) pre-processing
  5. Proper name scoping for visualization of the Graph in TensorBoard, along with tf.summary ops
  6. Data is passed using a data_dict rather than command-line parameters, because this dict can be stored and retrieved in an automated manner
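A minimal sketch of the pattern items 1, 2, and 5 describe: parameters on the CPU, per-GPU towers under name scopes, and gradients averaged on the CPU. The model, inputs, and GPU count are placeholders, not the repository's actual classes.

```python
import tensorflow as tf

NUM_GPUS = 2  # assumed GPU count

with tf.device("/cpu:0"):
    w = tf.get_variable("w", shape=[128, 1])       # parameter pinned to CPU
    opt = tf.train.GradientDescentOptimizer(0.01)

tower_grads = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i), tf.name_scope("tower_%d" % i):
        x = tf.random_normal([32, 128])            # stand-in for a real batch
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))  # stand-in loss
        tf.summary.scalar("loss", loss)            # visible in TensorBoard
        tower_grads.append(opt.compute_gradients(loss, [w]))

# Aggregate: average the per-tower gradients on the CPU and apply them once.
with tf.device("/cpu:0"):
    avg_grad = tf.add_n([gv[0][0] for gv in tower_grads]) / NUM_GPUS
    train_op = opt.apply_gradients([(avg_grad, w)])
```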

How to RUN: Sample Dataset

  1. Change line 222 in https://github.com/rohit06nitbpl/genomics/blob/master/tfbs/source_code/tensorflow/models/dreamc/pre_processing.py#L222 according to the location of data_dir on your disk
  2. Run the Python file https://github.com/rohit06nitbpl/genomics/blob/master/tfbs/source_code/tensorflow/experiments/dreamc/experiments.py without arguments in an up-to-date TensorFlow environment.
  3. Add the available GPUs on line 19 in https://github.com/rohit06nitbpl/genomics/blob/master/tfbs/source_code/tensorflow/models/dreamc/experiments.py#L19

Device placement and training logs are printed during the run.

How to RUN TensorBoard: Sample Dataset

tensorboard --logdir=genomics/tfbs/source_code/tensorflow/dataset/dreamc/sample_dataset/attempt0/train/logs

The Graph and scalars can be visualized in TensorBoard.

Sample Data Description

It is a small, self-made dataset in a format similar to the DREAM-ENCODE TF in-vivo binding challenge data.

I used this model in this code as a very initial experiment. I also used the amino acid sequences of TFs as an additional feature, just to see their usefulness in predicting TF binding even for unknown TFs (i.e., those for which experiments have not been done). Our earlier focus was to predict TF binding sites even for unknown TFs, but later on I focused only on improving the DREAM Challenge results.

I am reading TSV-format files in this code rather than the zipped versions, although lately I have used the zipped versions along with tools like bedtools, as mentioned on the DREAM-ENCODE website.

I am creating the input Tensors on the fly. This way, I am not storing a huge one-hot matrix in memory or on disk (datasets in genomics are huge).
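A minimal sketch of what on-the-fly encoding can look like: a base sequence becomes a one-hot tensor only inside the graph, so the full matrix never touches the disk. The vocabulary and sequence here are illustrative, not the repository's actual code.

```python
import tensorflow as tf

# Hypothetical base sequence; in the real pipeline this comes from a reader.
sequence = tf.constant(["ACGTAC"])
# Split the string into single characters.
chars = tf.string_split(sequence, delimiter="").values
# Map each base to an integer index; unknown characters map to -1,
# which tf.one_hot turns into an all-zero row.
table = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(["A", "C", "G", "T"]), default_value=-1)
indices = table.lookup(chars)
# The one-hot matrix exists only inside the graph, never on disk.
one_hot = tf.one_hot(indices, depth=4)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(one_hot))  # shape [6, 4]
```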

Earlier, I was reading files in many threads and filling three queues with Ambiguous (A), Bound (B), and Unbound (U) samples. I then used a RandomShuffleQueue to draw an equal number of samples from each and form the final batch (balanced classes in each batch) for training.

But due to the unbalanced counts of the three classes (A, B, U), training had to wait for, say, B samples to be found deep in the files by the queue pipeline.

Therefore, I pre-processed the files to segregate the A, B, and U samples into three separate files. Now, as in this code, I am creating one thread per class (A, B, U). Each thread quickly finds samples for its class and enqueues them into the queue pipeline.
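A minimal sketch of that per-class layout, with equal-sized slices from each class concatenated into a balanced batch; the file names and per-class batch size are assumptions.

```python
import tensorflow as tf

BATCH_PER_CLASS = 16  # assumed; one third of the final batch

def class_pipeline(filename, label):
    # One reader pipeline per class file, each with its own queue-runner thread.
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TextLineReader()
    _, line = reader.read(filename_queue)
    samples = tf.train.shuffle_batch(
        [line], batch_size=BATCH_PER_CLASS,
        capacity=2000, min_after_dequeue=500, num_threads=1)
    labels = tf.fill([BATCH_PER_CLASS], label)
    return samples, labels

a_x, a_y = class_pipeline("ambiguous.tsv", 0)
b_x, b_y = class_pipeline("bound.tsv", 1)
u_x, u_y = class_pipeline("unbound.tsv", 2)

# Concatenate equal-sized per-class batches into one balanced training batch.
batch_x = tf.concat([a_x, b_x, u_x], axis=0)
batch_y = tf.concat([a_y, b_y, u_y], axis=0)
```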

It has been suggested in the TensorFlow documentation to use large files instead of many small files, because many small files bottleneck the system's file-handling resources.

In fact, in the sample dataset used in this code, the class A file is very small, which causes the file to be closed and reopened very quickly across epochs. This makes it slow to fill the queue for class A. If you run my code, you will see that training waits intermittently on the class A queue.

It could be made much faster on a cluster using big-data technologies like Apache Spark and Hadoop, which can gather data and push it into the queue much more quickly than a single system.

TODO

  1. Upload the 1D implementation of Capsules that I used for the TF binding problem
  2. Add a pipeline using the tf.data API and compare performance on real-world data like MNIST
  3. Complete the distributed implementation using https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
