Project author: heshenghuan

Project description:
A (CNN+)RNN(LSTM/BiLSTM)+CRF model for sequence labelling. :smirk:
Language: Python
Repository: git://github.com/heshenghuan/LSTM-CRF.git
Created: 2017-02-20T10:04:02Z
Project community: https://github.com/heshenghuan/LSTM-CRF

License:



LSTM-CRF

Introduction

An implementation of an LSTM+CRF model for sequence labeling tasks. It is based on TensorFlow (>= r1.1) and supports multiple architectures: LSTM+CRF, BiLSTM+CRF, and a combination of a character-level CNN with BiLSTM+CRF.

Other RNN+CRF architectures, such as ones that incorporate traditional features, will be added later.

Dependencies

Because this project uses the TensorFlow API, it requires TensorFlow and some other Python modules to be installed:

  • TensorFlow (>= r1.1)

It can be easily installed with pip.
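
For example, a 1.x release can be installed like this (the exact version pin below is only a suggestion, not a project requirement; the project only asks for r1.1 or newer):

  pip install 'tensorflow>=1.1,<2.0'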

Data Format

The data format is basically consistent with the CRF++ toolkit. Generally speaking, training and test files must consist of multiple tokens, and a token consists of multiple (but a fixed number of) columns. Each token must be represented on one line, with the columns separated by white space (spaces or tab characters). A sequence of tokens becomes a sentence.

To identify the boundary between sentences, an empty line is inserted; that is, there should be a '\n\n' between two different sentences. So, if your OS is Windows, please check what the boundary characters really are.

Here's an example of such a file (data for Chinese NER):

  ...
  O
  O
  O
  B-PER.NAM
  I-PER.NAM
  I-PER.NAM
  O
  O
  O
  O
  O
  O
  ...
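
As a minimal sketch (hypothetical helper code, not part of this repo), a file in this format can be read by treating blank lines as sentence boundaries and splitting each non-blank line into its whitespace-separated columns:

  def read_sentences(path):
      """Read CRF++-style data: one token per line, columns separated
      by whitespace, sentences separated by an empty line."""
      sentences, current = [], []
      with open(path, encoding='utf-8') as f:
          for line in f:
              if not line.strip():              # blank line = sentence boundary
                  if current:
                      sentences.append(current)
                      current = []
              else:
                  current.append(line.split())  # e.g. ['李', 'B-PER.NAM']
      if current:                               # last sentence may lack a
          sentences.append(current)             # trailing blank line
      return sentences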

Feature template

For this part, you can read the README file under the lib directory, which is a submodule named NeuralTextProcess.

The file template specifies the feature templates used in context-based feature extraction. The fields line (the second line below) gives the field name for each column of a token, and the templates describe how to extract features.

For example, the basic template is:

  # Fields(column), w, y, x & F are reserved names
  w y
  # templates.
  w:-2
  w:-1
  w: 0
  w: 1
  w: 2

This means each token has only two columns of data, 'w' and 'y'. The field y should always be in the last column.

Note that the w, y & F fields are reserved, because the program uses them to represent the word, the label, and the word's features.

Each token will become a dict like {'w': '李', 'y': 'B-PER.NAM', 'F': ['w[-2]=动', 'w[-1]=了', ...]}.

The above template describes a classical context feature template:

  • C(n) n=-2,-1,0,1,2

'C(n)' is the value of token['w'] at relative position n.

If your tokens have more than 2 columns, you may need to change the fields and templates, depending on how you want to do the extraction.

In this project, I disabled the feature prefix when extracting words in a context window.
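
The following is a minimal sketch (hypothetical code, not the NeuralTextProcess implementation) of what context-window extraction does; for readability it keeps the 'w[n]=' prefix from the example above, even though the project disables that prefix:

  def extract_context_features(words, i, window=2):
      """For token position i, emit the value of 'w' at each relative
      position -window..window. '</s>' is a hypothetical padding symbol
      for positions that fall outside the sentence."""
      feats = []
      for n in range(-window, window + 1):
          j = i + n
          w = words[j] if 0 <= j < len(words) else '</s>'
          feats.append('w[%d]=%s' % (n, w))
      return feats

  # extract_context_features(['动', '了', '李'], 2)
  # -> ['w[-2]=动', 'w[-1]=了', 'w[0]=李', 'w[1]=</s>', 'w[2]=</s>']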

Embeddings

This program supports pretrained embeddings as input. When running the program, you should provide an embedding text file (in the word2vec tool's standard output format) via the corresponding argument, --emb_file.
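
As a rough sketch (hypothetical loader, not the project's own code, assuming the usual word2vec text layout: an optional '<vocab_size> <dim>' header line, then one word per line followed by its vector), such a file can be read like this:

  import numpy as np

  def load_word2vec_text(path):
      """Load a word2vec-style text embedding file."""
      vocab, vectors = {}, []
      with open(path, encoding='utf-8') as f:
          for line in f:
              parts = line.rstrip().split()
              if len(parts) == 2 and not vectors:
                  continue                     # skip the header line
              vocab[parts[0]] = len(vectors)   # word -> row index
              vectors.append([float(x) for x in parts[1:]])
      return vocab, np.array(vectors, dtype='float32')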

Usage

Environment settings

In the env_settings.py file there are some environment settings, such as the output dir:

  # Those are some IO files' dirs
  # you need change the BASE_DIR on your own PC
  BASE_DIR = r'project dir/'
  MODEL_DIR = BASE_DIR + r'models/'
  DATA_DIR = BASE_DIR + r'data/'
  EMB_DIR = BASE_DIR + r'embeddings/'
  OUTPUT_DIR = BASE_DIR + r'export/'
  LOG_DIR = BASE_DIR + r'Summary/'

If you don't have those dirs in your project dir, just run python env_settings.py, and they will be created automatically.
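
That step is, in effect, equivalent to the following (a hypothetical sketch, reusing the constants shown above):

  import os

  # Create each configured directory if it does not exist yet.
  for d in (MODEL_DIR, DATA_DIR, EMB_DIR, OUTPUT_DIR, LOG_DIR):
      if not os.path.exists(d):
          os.makedirs(d)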

Training

1. Using embeddings as features

Just run the ./main.py file, or specify some arguments if you need to, like this:

  python main.py --lr 0.005 --fine_tuning False --l2_reg 0.0002

Then the model will run with lr=0.005, no fine-tuning, l2_reg=0.0002, and all other arguments at their defaults. Using -h will print all the help information. Some arguments are not usable yet, but I will fix that as soon as possible.

  python main.py -h
  usage: main.py [-h] [--train_data TRAIN_DATA] [--test_data TEST_DATA]
                 [--valid_data VALID_DATA] [--log_dir LOG_DIR]
                 [--model_dir MODEL_DIR] [--model MODEL]
                 [--restore_model RESTORE_MODEL] [--emb_file EMB_FILE]
                 [--emb_dim EMB_DIM] [--output_dir OUTPUT_DIR]
                 [--only_test [ONLY_TEST]] [--noonly_test] [--lr LR]
                 [--dropout DROPOUT] [--fine_tuning [FINE_TUNING]]
                 [--nofine_tuning] [--eval_test [EVAL_TEST]] [--noeval_test]
                 [--max_len MAX_LEN] [--nb_classes NB_CLASSES]
                 [--hidden_dim HIDDEN_DIM] [--batch_size BATCH_SIZE]
                 [--train_steps TRAIN_STEPS] [--display_step DISPLAY_STEP]
                 [--l2_reg L2_REG] [--log [LOG]] [--nolog] [--template TEMPLATE]

  optional arguments:
    -h, --help            show this help message and exit
    --train_data TRAIN_DATA
                          Training data file
    --test_data TEST_DATA
                          Test data file
    --valid_data VALID_DATA
                          Validation data file
    --log_dir LOG_DIR     The log dir
    --model_dir MODEL_DIR
                          Models dir
    --model MODEL         Model type: LSTM/BLSTM/CNNBLSTM
    --restore_model RESTORE_MODEL
                          Path of the model to restored
    --emb_file EMB_FILE   Embeddings file
    --emb_dim EMB_DIM     embedding size
    --output_dir OUTPUT_DIR
                          Output dir
    --only_test [ONLY_TEST]
                          Only do the test
    --noonly_test
    --lr LR               learning rate
    --dropout DROPOUT     Dropout rate of input layer
    --fine_tuning [FINE_TUNING]
                          Whether fine-tuning the embeddings
    --nofine_tuning
    --eval_test [EVAL_TEST]
                          Whether evaluate the test data.
    --noeval_test
    --max_len MAX_LEN     max num of tokens per query
    --nb_classes NB_CLASSES
                          Tagset size
    --hidden_dim HIDDEN_DIM
                          hidden unit number
    --batch_size BATCH_SIZE
                          num example per mini batch
    --train_steps TRAIN_STEPS
                          trainning steps
    --display_step DISPLAY_STEP
                          number of test display step
    --l2_reg L2_REG       L2 regularization weight
    --log [LOG]           Whether to record the TensorBoard log.
    --nolog
    --template TEMPLATE   Feature templates

Three types of model can be chosen with the argument '--model':

  1. LSTM + CRF
  2. BiLSTM + CRF
  3. CNN + BiLSTM + CRF
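
For example, to train the CNN + BiLSTM + CRF variant (model names as listed in the help output above):

  python main.py --model CNNBLSTM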

2. Using both embeddings and context

We proposed a hybrid model that can use both embeddings and contextual features as input for sequence labeling tasks. The embeddings are used as input to the RNN, and the contextual features are used like traditional feature functions in CRFs.

Just run the ./hybrid_tagger.py file, or specify some arguments if you need to, like this:

  python hybrid_tagger.py -h
  usage: hybrid_tagger.py [-h] [--train_data TRAIN_DATA] [--test_data TEST_DATA]
                          [--valid_data VALID_DATA] [--log_dir LOG_DIR]
                          [--model_dir MODEL_DIR]
                          [--restore_model RESTORE_MODEL] [--emb_file EMB_FILE]
                          [--emb_dim EMB_DIM] [--output_dir OUTPUT_DIR]
                          [--only_test [ONLY_TEST]] [--noonly_test] [--lr LR]
                          [--dropout DROPOUT] [--fine_tuning [FINE_TUNING]]
                          [--nofine_tuning] [--eval_test [EVAL_TEST]]
                          [--noeval_test] [--max_len MAX_LEN]
                          [--nb_classes NB_CLASSES] [--hidden_dim HIDDEN_DIM]
                          [--batch_size BATCH_SIZE] [--train_steps TRAIN_STEPS]
                          [--display_step DISPLAY_STEP] [--l2_reg L2_REG]
                          [--log [LOG]] [--nolog] [--template TEMPLATE]
                          [--window WINDOW] [--feat_thresh FEAT_THRESH]

  optional arguments:
    -h, --help            show this help message and exit
    --train_data TRAIN_DATA
                          Training data file
    --test_data TEST_DATA
                          Test data file
    --valid_data VALID_DATA
                          Validation data file
    --log_dir LOG_DIR     The log dir
    --model_dir MODEL_DIR
                          Models dir
    --restore_model RESTORE_MODEL
                          Path of the model to restored
    --emb_file EMB_FILE   Embeddings file
    --emb_dim EMB_DIM     embedding size
    --output_dir OUTPUT_DIR
                          Output dir
    --only_test [ONLY_TEST]
                          Only do the test
    --noonly_test
    --lr LR               learning rate
    --dropout DROPOUT     Dropout rate of input layer
    --fine_tuning [FINE_TUNING]
                          Whether fine-tuning the embeddings
    --nofine_tuning
    --eval_test [EVAL_TEST]
                          Whether evaluate the test data.
    --noeval_test
    --max_len MAX_LEN     max num of tokens per query
    --nb_classes NB_CLASSES
                          Tagset size
    --hidden_dim HIDDEN_DIM
                          hidden unit number
    --batch_size BATCH_SIZE
                          num example per mini batch
    --train_steps TRAIN_STEPS
                          trainning steps
    --display_step DISPLAY_STEP
                          number of test display step
    --l2_reg L2_REG       L2 regularization weight
    --log [LOG]           Whether to record the TensorBoard log.
    --nolog
    --template TEMPLATE   Feature templates
    --window WINDOW       Window size of context
    --feat_thresh FEAT_THRESH
                          Only keep feats which occurs more than 'thresh'
                          times.
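
For example (the window size and threshold below are only illustrative values, not recommendations):

  python hybrid_tagger.py --window 2 --feat_thresh 5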

Test

If you set 'only_test' to True or 'train_steps' to 0, the program will only run the test process.

In that case, you must give a specific path via 'restore_model'.
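
For example (the model path below is a placeholder for wherever your trained model was saved):

  python main.py --only_test True --restore_model models/your-model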

History

  • 2018-01-09 ver 0.2.4
    • Update Neural Text Process lib 0.2.1
    • Compatibility modifications in the main file.
  • 2017-11-04 ver 0.2.3
    • Hybrid feature architecture for LSTM and the corresponding tagger's Python script.
  • 2017-10-31 ver 0.2.2
    • Update Neural Text Process lib 0.2.0
    • Compatibility modifications in the main file.
  • 2017-10-20 ver 0.2.1
    • Fix: Non-suffix for template in ‘only test’ process.
    • Fix: Now using correct dicts for embedding lookup table.
    • Fix: A bug of batch generator ‘batch_index’.
  • 2017-09-12 ver 0.2.0
    • Update: process lib 0.1.2
    • Removed ‘keras_src’, completed the refactoring of the code hierarchy.
    • Added env_settings.py to make sure all default dirs exist.
    • Support restore model from file.
    • Support model selection.
  • 2017-07-06 ver 0.1.3
    • Added a new method, 'accuracy', which is used to count correct labels.
    • Arguments 'emb_type' & 'emb_dir' are now deprecated.
    • New argument ‘emb_file’
  • 2017-04-11 ver 0.1.2
    • Rewrote the neural_tagger class method loss.
    • Added a new tagger based on Bi-LSTM + CNNs, where a CNN is used to extract bigram features.
  • 2017-04-08 ver 0.1.1
    • Rewrote the lstm-ner & bi-lstm-ner classes.
  • 2017-03-03 ver 0.1.0
    • Used TensorFlow to implement the LSTM-NER model.
    • Basic functionality finished.
  • 2017-02-26 ver 0.0.3
    • lstm_ner basically completed.
    • Viterbi decoding algorithm and sequence labeling.
    • Pretreatment completed.
  • 2017-02-21 ver 0.0.2
    • Basic structure of the project.
    • Added 3 module files: features, pretreatment and constant.
    • Pretreatment's dictionary-creation function completed.
  • 2017-02-20 ver 0.0.1
    • Initialization of this project.
    • README file
    • Some util functions and the basic structure of the project.