Project author: acapitanelli

Project description: An implementation of word2vec skip-gram algorithm
Language: Python
Repository: git://github.com/acapitanelli/word-embedding.git
Created: 2019-08-10T14:23:27Z
Project community: https://github.com/acapitanelli/word-embedding

License: MIT License

Word-embedding

An implementation of the word2vec skip-gram algorithm for word embedding, with sub-sampling and negative sampling as in the original implementation from Mikolov's paper.
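For reference, the sub-sampling step in Mikolov et al. (2013) discards each occurrence of a word w with probability 1 - sqrt(sample / f(w)), where f(w) is the word's relative frequency. A minimal sketch of that rule follows; the function name and bookkeeping here are illustrative, not taken from this repo, and `sample` mirrors the `--sample` default of 0.001:

  import math
  import random

  def keep_occurrence(word_count, total_count, sample=0.001):
      """Return True if this occurrence of a word should be kept.

      Frequent words are randomly discarded with probability
      P(discard) = 1 - sqrt(sample / f(w)), where f(w) is the word's
      relative frequency in the corpus (Mikolov et al., 2013).
      """
      freq = word_count / total_count           # relative frequency f(w)
      p_discard = 1.0 - math.sqrt(sample / freq)
      return random.random() > p_discard        # rare words are always kept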

The training function is organized as follows:

  • the corpus is generated by parsing all txt files found in the specified folder
  • training samples are generated and saved to disk
  • if negative sampling is enabled, the unigram table is generated as well (see the sketch after this list)
  • the training process is started
  • embedding data are saved to disk (with pickle)
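For negative sampling, the original word2vec builds a large lookup table in which each vocabulary word occupies a number of slots proportional to count(w)^0.75; drawing uniformly from the table then samples from the smoothed unigram distribution. Below is a minimal sketch under those assumptions (function names are hypothetical; the repo's `--unigram-table` default is 100000000, while a much smaller size is used here for illustration):

  import random

  def build_unigram_table(word_counts, table_size=1_000_000, power=0.75):
      """Fill a table where each word's share of slots is proportional
      to count(w)**0.75, normalized over the whole vocabulary."""
      norm = sum(count ** power for count in word_counts.values())
      table = []
      for word, count in word_counts.items():
          slots = int(round(count ** power / norm * table_size))
          table.extend([word] * slots)
      return table

  def draw_negatives(table, k=5):
      """Draw k negative samples uniformly from the table; a fuller
      version would re-draw samples that collide with the target word."""
      return [random.choice(table) for _ in range(k)]

The 0.75 exponent flattens the raw unigram distribution, so very frequent words are sampled somewhat less often than their raw counts would suggest.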

Please note that the code is written for convenience over performance and no specific optimization is in place, so embedding a large corpus can take a very long time!

Quick Start

Clone and install:

  git clone https://github.com/acapitanelli/word-embedding.git
  cd word-embedding
  pip install .

From the console:

  foo@bar:~$ pyembed -h
  usage: pyembed [-h] [--win-size] [--dry-run] [--min-count] [--sample]
                 [--embedding-size] [--learning-rate] [--epochs] [--negative]
                 [--unigram-table]
                 data_dir

  An implementation of word2vec algorithm for word embedding.

  positional arguments:
    data_dir          Folder with documents of corpus

  optional arguments:
    -h, --help        show this help message and exit
    --win-size        Size of moving window for context words. (default: 5)
    --dry-run         If true, loads corpus, generates and saves training
                      samples without performing NN training (default: False)
    --min-count       Words appearing less than min-count are excluded from
                      corpus (default: 5)
    --sample          Scale factor for subsampling probability (default: 0.001)
    --embedding-size  Embedding size (default: 300)
    --learning-rate   NN learning rate for gradient descent (default: 0.025)
    --epochs          Num. epochs to train (default: 10)
    --negative        Num. of negative samples. Negative sampling is applied
                      only if greater than 0 (default: 5)
    --unigram-table   Size of table for unigram distribution (default:
                      100000000)
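
A typical invocation might then look like this (the `./corpus` folder is illustrative; any directory containing txt files works):

  foo@bar:~$ pyembed ./corpus --embedding-size 100 --epochs 5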