Project author: yipenglai

Project description: Learn Chinese word representations using subword and subcharacter information
Primary language: Python
Repository: git://github.com/yipenglai/Chinese-Word-Representation.git
Created: 2020-04-19T16:59:37Z
Project community: https://github.com/yipenglai/Chinese-Word-Representation

Chinese Word Representation

Most popular methods for learning word representations consider only the external context of a word and ignore its internal structure. This limits their effectiveness for Chinese, where a word's internal structure can itself be semantically important. This project generates and compares Chinese word representations that exploit different subword and subcharacter components, including characters, graphical components, and Wubi codes.

Data

  • Training corpus: Chinese Wikipedia dump
  • Evaluation data: Chinese word similarity tasks wordsim-240 and wordsim-296
  • Dictionaries: subchar_dict/graphical_dict.p and subchar_dict/wubi_dict.p contain dictionaries that map Chinese characters to graphical components and Wubi codes. These two files can be used to convert characters into subcharacter components or to train joint word embeddings (JWE); a loading sketch follows this list.
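
The dictionary files are Python pickles. A minimal loading sketch, assuming each file deserializes to a plain dict keyed by single characters (the exact pickled structure is an assumption):

```python
import pickle

# Load the character -> Wubi code mapping (assumed to be a plain dict;
# the structure inside the pickle may differ in the actual repository).
with open("subchar_dict/wubi_dict.p", "rb") as f:
    wubi_dict = pickle.load(f)

# Look up the Wubi code for a character, falling back to the
# character itself when no entry exists.
char = "好"
print(wubi_dict.get(char, char))
```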

Quick Start

Install Packages

Run pip install -r requirements.txt

Preprocess Wiki Dump

After downloading the latest Chinese Wikipedia dump from the link above, use preprocess_wiki.py to do the following (a sketch of these steps appears after the list):

  • Convert traditional Chinese to simplified Chinese
  • Remove non-Chinese characters, including punctuation and spaces
  • Split sentences into words separated by spaces
  • Convert the XML dump into a plain txt file
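
A minimal sketch of these steps for a single line of extracted text, assuming opencc for the traditional-to-simplified conversion and jieba for word segmentation (the actual script may use different libraries):

```python
import re

import jieba               # word segmentation
from opencc import OpenCC  # traditional -> simplified conversion

cc = OpenCC("t2s")

def preprocess_line(line: str) -> str:
    # Convert traditional Chinese to simplified Chinese
    line = cc.convert(line)
    # Remove non-Chinese characters, including punctuation and spaces
    line = re.sub(r"[^\u4e00-\u9fff]", "", line)
    # Split the sentence into words separated by spaces
    return " ".join(jieba.cut(line))

print(preprocess_line("自然語言處理很有趣!"))  # segmentation output may vary
```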

Parameters:

  1. input # Wiki dump XML file path
  2. output # Preprocessed txt file path
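
A hypothetical invocation (the flag names and file names are assumptions based on the parameter list above):

```
python preprocess_wiki.py \
    --input zhwiki-latest-pages-articles.xml.bz2 \
    --output wiki_tokenized.txt
```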

Convert Tokenized Wiki Text to Subcharacter Components

Use convert_subchar.py to convert tokenized text into subcharacter components (graphical components or Wubi codes) while keeping the delimiters between words.
Parameters:

  1. input # Tokenized txt file path
  2. output # Subcharacter output txt file path
  3. subchar # Subcharacter types {radical, wubi}
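
A hypothetical invocation converting the tokenized corpus to Wubi codes (flag names and file names are assumptions):

```
python convert_subchar.py \
    --input wiki_tokenized.txt \
    --output wiki_wubi.txt \
    --subchar wubi
```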

Learn Word Representations

CBOW or skipgram

Use train.py to train continuous bag-of-words (CBOW) or skipgram models using fastText.
Parameters:

  1. input # Training txt file path
  2. model_path # Trained model path
  3. model # Model type {cbow, skipgram}
  4. dim # Size of the word vectors
  5. ws # Size of the context window
  6. epoch # Number of epochs
  7. minn # Minimum length of subword ngram
  8. maxn # Maximum length of subword ngram

Note: Depending on whether the training corpus contains characters, graphical components, or Wubi codes, word length may vary significantly, so minn and maxn should be chosen accordingly. When maxn = 0, no subword information is used.
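
A minimal training sketch using the fasttext Python package, mirroring the parameters above (train.py's exact interface and default values are assumptions):

```python
import fasttext

# Train a skipgram model with subword n-grams of length 3-6;
# corpora of graphical components or Wubi codes may need different
# minn/maxn, since the length of a "word" differs in those settings.
model = fasttext.train_unsupervised(
    "wiki_tokenized.txt",  # input: training txt file path
    model="skipgram",      # model type {cbow, skipgram}
    dim=300,               # size of the word vectors
    ws=5,                  # size of the context window
    epoch=5,               # number of epochs
    minn=3,                # minimum length of subword ngram
    maxn=6,                # maximum length of subword ngram (0 disables subwords)
)
model.save_model("skipgram_char.bin")
```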

JWE

For jointly learning word vectors, character vectors, and subcharacter vectors, see JWE.

Evaluate

CBOW or skipgram

Use eval.py to evaluate the trained CBOW and skipgram models on the word similarity tasks. The script first computes the cosine similarity between each pair of words, then computes the Spearman correlation coefficient between the cosine similarities and the human-labeled scores as the final evaluation metric; a sketch of this computation follows the parameter list.
Parameters:

  1. input # Evaluation data file path
  2. model_path # Trained model path
  3. output # Predicted score output txt file path
  4. subword # Subword type {character, graphical, wubi}
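
A minimal sketch of the evaluation logic, assuming a trained fastText model and an evaluation file where each line contains two words and a human-labeled score (eval.py's actual interface and the file format are assumptions):

```python
import numpy as np
import fasttext
from scipy.stats import spearmanr

model = fasttext.load_model("skipgram_char.bin")

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two word vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

predicted, human = [], []
with open("wordsim-240.txt", encoding="utf-8") as f:
    for line in f:
        w1, w2, score = line.split()
        v1 = model.get_word_vector(w1)
        v2 = model.get_word_vector(w2)
        predicted.append(cosine(v1, v2))
        human.append(float(score))

# Spearman correlation between predicted similarities and human scores
rho, _ = spearmanr(predicted, human)
print(f"Spearman correlation: {rho:.4f}")
```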