Project author: carlomazzaferro

Project description: MHC-antigen affinity prediction using TensorFlow.
Primary language: Jupyter Notebook
Repository: git://github.com/carlomazzaferro/mhcPreds.git
Created: 2017-03-22T11:09:59Z
Project page: https://github.com/carlomazzaferro/mhcPreds


Predicting Protein Binding Affinity With Word Embeddings and Recurrent Neural Networks

bioRxiv link to the paper: http://biorxiv.org/content/early/2017/04/18/128223.article-metrics

To recreate the reported results, download this repo, navigate to the main directory, and run bash project_results_embedding.sh and bash project_results_rnn.sh. The data is already contained in the /data folder, and the results will appear in the /results directory. Feel free to delete its current contents if you’d like to re-create them yourself.
The bash commands will run a variety of models/model parameters and will store each run in the results folder. For more info on the experiments run, please refer to the paper submission. Then, run python analyze_results to create the visualizations and CSV summaries.

NOTE: running the above commands will take a LONG time (~36 hours). I’ll post a script soon to reproduce just the best-performing models.
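If you prefer to drive the whole reproduction from Python, the sketch below simply automates the commands described above via subprocess. It assumes it is run from the repository root; the analyze_results invocation is taken verbatim from the text, so adjust it if the script carries a .py extension.

  # Automation sketch of the reproduction steps above (assumes the repo root as cwd).
  import subprocess

  # Run the full experiment sweeps (expect roughly 36 hours in total).
  subprocess.run(['bash', 'project_results_embedding.sh'], check=True)
  subprocess.run(['bash', 'project_results_rnn.sh'], check=True)

  # Build the visualizations and CSV summaries from the /results directory.
  subprocess.run(['python', 'analyze_results'], check=True)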

Creating models and predictions

The main module responsible for the computations is mhcPreds_tflearn_cmd_line.py. It can be run as a standalone command-line Python program and accepts a variety of options:

  mhcPreds_tflearn_cmd_line.py [-h] [-cmd CMD] [-b BATCH_SIZE]
                               [-bn BATCH_NORM] [-ls LAYER_SIZE]
                               [-nl NUM_LAYERS] [-d EMBEDDING_SIZE]
                               [-a ALLELE] [-m MODEL] [-c DATA_ENCODING]
                               [-r LEARNING_RATE] [-e EPOCHS] [-n NAME]
                               [-l LEN] [-s SAVE] [--data-dir DATA_DIR]
                               [--cell-size CELL_SIZE]
                               [--tensorboard-verbose TENSORBOARD_VERBOSE]
                               [--from-file FROM_FILE] [--run-id RUN_ID]

For example: mhcPreds_tflearn_cmd_line.py -cmd 'train_test_eval' -e 15 -bn 1 -nl 3 -c 'kmer_embedding' -a 'A0101' -m 'embedding_rnn' -r 0.001

This will run the train, test, and evaluation protocol for 15 epochs, with one round of batch normalization and a learning rate of 0.001. It will run on the subset of the training data comprised of peptides binding to the HLA-A0101 allele and will transform each k-mer in the data set into a 9-mer. The remaining default parameters can be seen by running the script with -h/--help, which prints the optional arguments listed further below.
Results will be stored in the /mhcPreds/results/run_id folder, where run_id is either specified by the user or a randomly selected number between 0 and 10000.
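
To make the flag semantics concrete, here is a minimal TFLearn sketch of an embedding-based regressor wired up with the same kinds of hyperparameters the flags control. It is an illustration only, not the project's exact architecture: the vocabulary size, the output activation, and the [0, 1] affinity rescaling are assumptions.

  # Illustrative only: an embedding + dense-layer regressor in TFLearn,
  # parameterized the same way as the CLI flags (-l, -d, -ls, -nl, -bn, -r, -e, -b).
  import numpy as np
  import tflearn

  KMER_LEN = 9           # -l / --len
  VOCAB_SIZE = 21        # 20 amino acids + a padding token (assumption)
  EMBEDDING_SIZE = 32    # -d / --embedding-size
  LAYER_SIZE = 64        # -ls / --layer-size
  NUM_LAYERS = 3         # -nl / --num-layers
  LEARNING_RATE = 0.001  # -r / --learning-rate

  net = tflearn.input_data(shape=[None, KMER_LEN])
  net = tflearn.embedding(net, input_dim=VOCAB_SIZE, output_dim=EMBEDDING_SIZE)
  for _ in range(NUM_LAYERS):                     # hidden dense layers
      net = tflearn.fully_connected(net, LAYER_SIZE, activation='relu')
  net = tflearn.batch_normalization(net)          # one round of batch norm (-bn 1)
  net = tflearn.fully_connected(net, 1, activation='sigmoid')
  net = tflearn.regression(net, optimizer='adam', loss='mean_square',
                           learning_rate=LEARNING_RATE)

  model = tflearn.DNN(net, tensorboard_verbose=0)
  # X: integer-encoded 9-mers; y: binding affinities rescaled to [0, 1] (assumption).
  X = np.random.randint(0, VOCAB_SIZE, size=(256, KMER_LEN))
  y = np.random.rand(256, 1)
  model.fit(X, y, n_epoch=15, batch_size=128, run_id='demo_run')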

optional arguments:

  -h, --help            show this help message and exit
  -cmd CMD              command to run (e.g. train_test_eval)
  -b BATCH_SIZE, --batch-size BATCH_SIZE
  -bn BATCH_NORM, --batch-norm BATCH_NORM
                        perform batch normalization either only after the LSTM (1), or both before and after it (2)
  -ls LAYER_SIZE, --layer-size LAYER_SIZE
                        size of the inner layers of the RNN
  -nl NUM_LAYERS, --num-layers NUM_LAYERS
                        number of LSTM layers
  -d EMBEDDING_SIZE, --embedding-size EMBEDDING_SIZE
                        embedding layer output dimension
  -a ALLELE, --allele ALLELE
                        allele to use for prediction; None predicts for all alleles
  -m MODEL, --model MODEL
                        RNN model: basic LSTM, bidirectional LSTM, or simple RNN
  -c DATA_ENCODING, --data-encoding DATA_ENCODING
                        data encoding scheme (one_hot or kmer_embedding)
  -r LEARNING_RATE, --learning-rate LEARNING_RATE
                        learning rate (default 0.001)
  -e EPOCHS, --epochs EPOCHS
                        number of training epochs
  -n NAME, --name NAME  name of model, used when generating default weights filenames
  -l LEN, --len LEN     size of k-mer to predict on
  -s SAVE, --save SAVE  save model to --data-dir
  --data-dir DATA_DIR   directory to use for saving models
  --cell-size CELL_SIZE
                        size of RNN cell to use (default 32)
  --tensorboard-verbose TENSORBOARD_VERBOSE
                        tensorboard verbosity level (default 0)
  --run-id RUN_ID       name of run to be displayed in tensorboard and the results folder
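
If you want to adapt the interface, the following is a small argparse sketch (not the project's actual parser) showing how a few of the flags above map to typed Python values. Only the defaults stated in the help text (learning rate 0.001, cell size 32) are taken from the listing; everything else is hypothetical.

  # Hypothetical argparse sketch covering a subset of the flags above.
  import argparse

  parser = argparse.ArgumentParser(description='mhcPreds CLI (sketch)')
  parser.add_argument('-cmd', help='command to run, e.g. train_test_eval')
  parser.add_argument('-a', '--allele',
                      help='Allele to use for prediction. None predicts for all alleles.')
  parser.add_argument('-m', '--model', help='deep_rnn, embedding_rnn or bi_rnn')
  parser.add_argument('-c', '--data-encoding', help='one_hot or kmer_embedding')
  parser.add_argument('-r', '--learning-rate', type=float, default=0.001)
  parser.add_argument('-e', '--epochs', type=int)
  parser.add_argument('--cell-size', type=int, default=32)
  parser.add_argument('--run-id')

  # Parse (a subset of) the example invocation shown earlier.
  args = parser.parse_args(['-cmd', 'train_test_eval', '-e', '15',
                            '-c', 'kmer_embedding', '-a', 'A0101',
                            '-m', 'embedding_rnn', '-r', '0.001'])
  print(args.allele, args.epochs, args.learning_rate)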

NOTES-1:

Here’s a list of possible options for some of the parameters.

  POSSIBLE_ALLELES = ['A3101', 'B1509', 'B2703', 'B1517', 'B1801', 'B1501', 'B4002', 'B3901', 'B5701', 'A6801',
                      'B5301', 'A2301', 'A2902', 'B0802', 'A3001', 'A0301', 'A0202', 'A0101', 'B4001', 'B5101',
                      'A1101', 'B4402', 'B0803', 'B5801', 'A2601', 'A0203', 'A3002', 'B4601', 'A3301', 'A6802',
                      'B3801', 'A3201', 'B3501', 'A2603', 'B0702', 'A6901', 'B0801', 'B4501', 'A0206', 'A0201',
                      'B1503', 'A2602', 'A8001', 'A2402', 'B2705', 'B4403', 'A2501', 'B5401']
  TRAIN_DEFAULTS = ['A0201', 'A0301', 'A0203', 'A1101', 'A0206', 'A3101']
  AVAILABLE_MODELS = ['deep_rnn', 'embedding_rnn', 'bi_rnn']
  DATA_ENCODINGS = ['one_hot', 'kmer_embedding']

NOTES-2:

  1. embedding_rnn does not require any parameters referring to an RNN, since an embedding layer plus hidden layers has been found to be sufficient to obtain good accuracy; adding recurrent layers for the most part hurts performance.
  2. Similarly, embedding_rnn requires kmer_embedding as its data encoding and cannot be used with one_hot encoding.
  3. one_hot encoding allows the user to specify a variety of different architectures, including:
    • a bi-directional RNN with a user-defined layer size
    • a deep LSTM with a user-defined number of LSTM layers
    • a simple RNN with a user-defined layer size
  4. One-hot encoding usually leads to slower training due to the increased feature dimensionality (illustrated in the sketch below).
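
The difference in input size between the two encodings can be seen with a quick sketch; the 20-letter amino-acid alphabet and the example 9-mer are assumptions made purely for illustration.

  # Illustration of the two data encodings for a single 9-mer.
  import numpy as np

  AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'           # 20-letter alphabet (assumption)
  INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

  def kmer_to_indices(kmer):
      """Integer indices, as a kmer_embedding pipeline would feed to an embedding layer."""
      return np.array([INDEX[aa] for aa in kmer])

  def kmer_to_one_hot(kmer):
      """Flat one-hot vector: len(kmer) * 20 binary features."""
      one_hot = np.zeros((len(kmer), len(AMINO_ACIDS)))
      one_hot[np.arange(len(kmer)), kmer_to_indices(kmer)] = 1.0
      return one_hot.ravel()

  peptide = 'SIINFEKLM'                          # hypothetical 9-mer
  print(kmer_to_indices(peptide).shape)          # (9,)   -> 9 integer features
  print(kmer_to_one_hot(peptide).shape)          # (180,) -> 9 x 20 binary features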