Generate drug-like molecules for drug discovery.
Molecule-RNN is a recurrent neural network built with PyTorch to generate molecules for drug discovery. It learns the distribution of the training dataset and samples from that distribution, so the generated molecules follow distributions similar to those of the training data.
There are different ways to tokenize SMILES. Three of them are implemented in this project: character-level tokenization, regex-based tokenization, and SELFIES tokenization.
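To illustrate how character-level and regex-based tokenization differ, here is a minimal sketch using only the standard library. The pattern and the function names are assumptions made for this example, not the project's actual vocabulary code, and the pattern covers only a common subset of SMILES. SELFIES tokenization relies on a separate string representation and is not shown.

```python
import re

# Assumed pattern: bracket atoms, two-letter halogens, ring-closure labels
# such as %12, one-letter atoms, bond/branch symbols, and single digits.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[()=#\-+\\/.:~@?>*$]|\d)"
)


def char_tokenize(smiles):
    """Character-level: every character becomes its own token."""
    return list(smiles)


def regex_tokenize(smiles):
    """Regex-based: multi-character units such as Br or [N+] stay together."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Refuse input that the pattern could not cover completely.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


smiles = "BrC1=CC=C(Cl)C=C1[N+](=O)[O-]"
print(char_tokenize(smiles)[:2])   # ['B', 'r']
print(regex_tokenize(smiles)[:2])  # ['Br', 'C']
```

The character-level scheme splits Br into two tokens, so the model has to learn that they belong together. The regex-based scheme encodes that knowledge in the vocabulary.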
The chembl28 dataset is used. It is under ./dataset.
Set out_dir in train.yaml to the directory where you want to store output results.
Set which_vocab and vocab_path in train.yaml to specify which tokenization scheme to use. The pre-computed vocabularies are at ./vocab.
Change other settings in train.yaml if you like (the default settings work), then run:
python train.py
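Before starting a long training run, it can help to confirm that the three settings named above are filled in. The following is a minimal sketch of such a check, assuming train.yaml has already been parsed into a dictionary. The helper name and the example values are hypothetical and are not part of the project.

```python
# The three settings named in the steps above.
REQUIRED_KEYS = ("out_dir", "which_vocab", "vocab_path")


def missing_settings(config):
    """Return the required keys that are absent or empty in the parsed config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Hypothetical values for illustration only; use the ones from your train.yaml.
config = {
    "out_dir": "./results/run_1",
    "which_vocab": "name-of-scheme",
    "vocab_path": "./vocab/name-of-vocab-file",
}
print(missing_settings(config))  # []
```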
The trained model will be saved in the out_dir directory. We can generate molecules by sampling the trained model according to the output distribution. If -result_dir is not specified, the out_dir in train.yaml will be used.
python sample.py -result_dir your_output_dir
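Sampling according to the output distribution means that, at each step, the next token is drawn at random with the probabilities the model assigns, rather than always taking the most likely token. Below is a minimal sketch of that loop using only the standard library. The toy vocabulary and the toy_logits function are hypothetical stand-ins for the trained RNN, and this is not the code in sample.py.

```python
import math
import random

# Hypothetical toy vocabulary; "<eos>" marks the end of a sequence.
VOCAB = ["C", "O", "N", "<eos>"]


def toy_logits(prefix):
    """Stand-in for the trained model: one unnormalized score per token."""
    # The end-of-sequence score grows with length so sampling terminates.
    return [1.0, 0.5, 0.2, 0.3 * len(prefix)]


def softmax(logits):
    """Convert scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logits_fn, vocab, max_len=20, rng=random):
    """Draw tokens one at a time until <eos> or max_len is reached."""
    tokens = []
    for _ in range(max_len):
        probs = softmax(logits_fn(tokens))
        token = rng.choices(vocab, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return "".join(tokens)


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
print([sample_sequence(toy_logits, VOCAB, rng=rng) for _ in range(3)])
```

Because each token is drawn at random, repeated sampling produces different sequences, which is what lets one trained model generate many distinct molecules.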
With the default settings, over 80% of sampled molecules are valid for character-level and regex-based tokenization, and 99.9% are valid for SELFIES tokenization. Here are examples of some sampled molecules: