Project author: zushicat

Project description:
Identify the topics of a text corpus and classify documents into topics with different methods.
Language: Python
Repository: git://github.com/zushicat/text-topics.git
Created: 2020-05-20T11:37:01Z
Project page: https://github.com/zushicat/text-topics


text-topics

NMF implementations for 2 cases:

  • if you know the number of topics: nmf_fixed_k.py
  • if you don’t know the number of topics: nmf_unknown_k.py
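For readers unfamiliar with NMF, here is a minimal, dependency-free sketch of the fixed-k case. It is not the code from nmf_fixed_k.py (the repository scripts most likely rely on a library). It uses the classic multiplicative update rules on a tiny, made-up document-term count matrix. The function names (`nmf`, `top_terms`), the vocabulary and the counts are assumptions chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH (the reconstruction error)."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorise V (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(k)]
    for _ in range(iterations):
        # Multiplicative update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [
            [h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_terms)]
            for a in range(k)
        ]
        # Multiplicative update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [
            [w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
            for i in range(n_docs)
        ]
    return w, h


def top_terms(h, vocabulary, n=3):
    """For each topic (row of H), return the n terms with the highest weight."""
    return [
        [vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:n]]
        for row in h
    ]


# Made-up toy data: 6 documents x 6 terms, NOT taken from the repository.
vocabulary = ["king", "throne", "architect", "building", "film", "actor"]
v = [
    [3, 2, 0, 0, 0, 0],
    [2, 3, 0, 0, 0, 0],
    [0, 0, 3, 2, 0, 0],
    [0, 0, 2, 4, 0, 0],
    [0, 0, 0, 0, 4, 2],
    [0, 0, 0, 0, 2, 3],
]

w, h = nmf(v, k=3)
for topic_id, terms in enumerate(top_terms(h, vocabulary)):
    print(f"topic {topic_id}: {terms}")
print("reconstruction error:", round(frobenius_error(v, w, h), 3))
```

Each row of `W` gives the topic weights of one document, so the index of its largest value can serve as that document's topic. For real corpora, a TF-IDF matrix and a library implementation are the more practical choice.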

LDA implementation incl. grid search for an unknown number of topics (k): lda.py
It is included for the sake of completeness, since

  • most parts of the code are no different from NMF usage (hence a little redundant)
  • the LDA results are not as good as those of NMF

A simple Keras implementation of a text multiclass classifier (with known classes): keras_simple_classifier.py

Human readable topics

A topic can be represented, and interpreted, by the most important tokens / phrases of its documents. Sometimes, the result is not as clear as one would like.
These scripts:

  • identify_topic.py
  • _request_wikipedia.py

try to solve this problem by querying Wikipedia with the top tokens of each document and processing the returned categories for each topic.

The results are quite satisfying, as shown in the following example:

Top phrases from each topic:

    [
      [
        "henry", "england", "elizabeth", "king", "anne", "marriage", "death", "son", "throne", "college"
      ],
      [
        "design", "architect", "architecture", "niemeyer", "building", "office", "movement", "designer", "furniture", "site"
      ],
      [
        "film", "swanson", "keaton", "bow", "hollywood", "actress", "cinema", "star", "pickford", "actor"
      ]
    ]

Top 3 phrases for the same topics from Wikipedia category phrase processing:

    topic 0: 16th century | english | monarchs
    topic 1: american | architects | 20th century
    topic 2: american | actresses | 20th century

Data

The directories in /data:

  • source_texts:
    Excerpts of Wikipedia biographies falling into 3 broad topics:
    • Tudor dynasty (marked with “a”)
    • Mid-century architects / designers (marked with “b”)
    • Stars of the silent movie era (marked with “c”)
  • target_texts:
    Very short texts based on the source texts with varying similarity, marked according to the source texts. There is also one text about a movie star not included in the source texts and one text about “Charlie Brown” without any topic affiliation (marked with “d”).

This data corresponds to: https://github.com/zushicat/text-similarity-extractive

Further Reading

General

LDA / NMF