Identify the topics of a text corpus and classify documents into those topics using different methods.
NMF implementation of 2 cases:
LDA implementation, including a grid search for an unknown number of topics (k): lda.py
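A grid search over k can be sketched as follows. This is a minimal illustration and not the code in lda.py: it assumes scikit-learn is used, the toy corpus and the helper name `search_k` are hypothetical, and the candidate values for k are arbitrary.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy corpus (hypothetical); a real run would load the documents from /data.
docs = [
    "king henry throne england marriage",
    "queen elizabeth england throne death",
    "architect building design furniture office",
    "designer architecture building movement site",
    "film actress hollywood cinema star",
    "actor film hollywood star cinema",
    "king england son throne college",
    "architect design office building site",
    "actress film star hollywood actor",
]


def search_k(texts, candidates=(2, 3, 4), seed=0):
    """Pick the number of topics by cross-validated LDA score (assumed approach)."""
    # LDA works on raw term counts, so build a bag-of-words matrix first.
    counts = CountVectorizer().fit_transform(texts)
    search = GridSearchCV(
        LatentDirichletAllocation(random_state=seed),
        param_grid={"n_components": list(candidates)},
        cv=3,  # scored with the estimator's own score(): approximate log-likelihood
    )
    search.fit(counts)
    return search.best_params_["n_components"], search


best_k, search = search_k(docs)
print("best k:", best_k)
```

The log-likelihood score is only one possible selection criterion; on a corpus this small the chosen k is not meaningful, so the value should be checked against the top phrases of each topic.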
This is for the sake of completeness, since
A simple Keras implementation of a multiclass text classifier (with known classes): keras_simple_classifier.py
A topic can be represented, and thus interpreted, by the most important tokens or phrases of its documents. Sometimes the result is not as clear as one would like.
These scripts:
try to solve this problem by querying Wikipedia with the top tokens at document level and processing the returned categories for each topic.
The results are quite satisfying, as the following example shows:
Top phrases from each topic:
[
[
"henry", "england", "elizabeth", "king", "anne", "marriage", "death", "son", "throne", "college"
],
[
"design", "architect", "architecture", "niemeyer", "building", "office", "movement", "designer", "furniture", "site"
],
[
"film", "swanson", "keaton", "bow", "hollywood", "actress", "cinema", "star", "pickford", "actor"
]
]
Top 3 phrases for the same topics from Wikipedia category phrase processing:
topic 0: 16th century | english | monarchs
topic 1: american | architects | 20th century
topic 2: american | actresses | 20th century
The directories in /data:
This data corresponds to: https://github.com/zushicat/text-similarity-extractive