A framework for generating a subword vocabulary from a TensorFlow dataset and building custom BERT tokenizer models.
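
To illustrate what a subword vocabulary is used for, here is a minimal pure-Python sketch of the greedy longest-match-first splitting that BERT-style (WordPiece) tokenizers apply to each word. It is not this framework's code: the function name `wordpiece_tokenize`, the `toy_vocab` set, and the parameter names are hypothetical, and a real vocabulary would be generated from a dataset rather than written by hand.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", suffix_prefix="##"):
    """Split one word into subword pieces by greedy longest-match-first lookup.

    Hypothetical helper for illustration only; `vocab` is any set of strings,
    where continuation pieces carry the "##" prefix as in BERT vocabularies.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word are marked as continuations.
                candidate = suffix_prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example, not generated from data).
toy_vocab = {"[UNK]", "token", "##izer", "##s", "un", "##known"}

print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

A larger generated vocabulary reduces how often words fall back to `[UNK]`, which is why the vocabulary is built from a representative dataset.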