Project author: mynlp

Project description:
Wikipedia Entities Lexicon Extractor
Language: Python
Project URL: git://github.com/mynlp/wikilex.git
Created: 2017-07-28T05:07:06Z
Project community: https://github.com/mynlp/wikilex

License: GNU General Public License v3.0


wikilex

Wikipedia Entities Lexicon Extractor

Scans a Wikipedia XML dump and, for each article, extracts the title (generating the corresponding URI), the categories, the entities (every (mention, URI, sentence) triple), and the links (all the entities mentioned in the article).
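
The scan described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code. It assumes the standard MediaWiki export layout (`<page>` elements containing `<title>` and `<text>`). The helper names `iter_pages`, `title_to_uri` and `extract_links`, the `BASE_URI` prefix, and the sample dump are hypothetical. Sentence extraction is left out.

```python
import io
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Assumption: URIs are built from the English Wikipedia base URL.
BASE_URI = "https://en.wikipedia.org/wiki/"

# Matches [[Target]] or [[Target|mention text]]; a '#section' anchor is dropped.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def local_name(tag):
    """Strip the XML namespace that MediaWiki dumps put on every tag."""
    return tag.rsplit("}", 1)[-1]


def title_to_uri(title):
    """Hypothetical URI scheme: spaces become underscores, the rest is percent-encoded."""
    return BASE_URI + quote(title.strip().replace(" ", "_"))


def extract_links(wikitext):
    """Return (categories, mentions) found in one article's wikitext."""
    categories, mentions = [], []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        label = match.group(2)
        if target.startswith("Category:"):
            categories.append(target[len("Category:"):])
        elif ":" not in target:
            # Crude filter for File:, Image: and interwiki links; it also
            # skips real articles whose title contains a colon.
            mention = label.strip() if label else target
            mentions.append((mention, title_to_uri(target)))
    return categories, mentions


def iter_pages(source):
    """Stream (title, wikitext) pairs without loading the whole dump."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # free the memory held by the finished page


# Tiny made-up dump used only to exercise the functions above.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Paris</title>
    <revision>
      <text>Paris is the capital of [[France]] and home of the [[Eiffel Tower|tower]].
[[Category:Capitals in Europe]]</text>
    </revision>
  </page>
</mediawiki>"""

pages = []
for title, text in iter_pages(io.BytesIO(SAMPLE)):
    categories, mentions = extract_links(text)
    pages.append((title_to_uri(title), categories, mentions))
```

Streaming with `iterparse` and clearing each finished `<page>` keeps memory use roughly constant, which matters because full Wikipedia dumps are far too large to load as a single tree.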

All this information is saved into a SQLite database using the following structure:

  Categories {
    id,
    uri,
    category
  }
  Entities {
    id,
    source_uri,
    link_uri,
    sentence
  }
  Mentions {
    id,
    mention,
    target_uri, -- uri the mention is linking to
    source_uri, -- uri page where the mention was found
    sentence
  }

This makes it easy to query a range of potential features for an Entity Linking system.
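
As one possible example of such a feature, the sketch below computes how often a given mention string links to each target URI (sometimes called "commonness"). The `Mentions` table and its column names follow the structure listed above. The column types, the in-memory database, the sample rows and the function name `candidate_targets` are assumptions made for illustration.

```python
import sqlite3

# In-memory database for illustration; a real run would instead open the
# SQLite file produced by the extractor.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Mentions (
    id INTEGER PRIMARY KEY,
    mention TEXT,
    target_uri TEXT,  -- uri the mention is linking to
    source_uri TEXT,  -- uri of the page where the mention was found
    sentence TEXT
);
""")

# Made-up sample rows, not real extractor output.
wiki = "https://en.wikipedia.org/wiki/"
conn.executemany(
    "INSERT INTO Mentions (mention, target_uri, source_uri, sentence) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Paris", wiki + "Paris", wiki + "France", "The capital is Paris."),
        ("Paris", wiki + "Paris", wiki + "Seine", "The Seine flows through Paris."),
        ("Paris", wiki + "Paris,_Texas", wiki + "Texas", "Paris is a city in Texas."),
    ],
)


def candidate_targets(conn, mention):
    """Return (target_uri, relative frequency) pairs for one mention string."""
    rows = conn.execute(
        "SELECT target_uri, COUNT(*) AS n FROM Mentions "
        "WHERE mention = ? GROUP BY target_uri ORDER BY n DESC, target_uri",
        (mention,),
    ).fetchall()
    total = sum(n for _uri, n in rows)
    return [(uri, n / total) for uri, n in rows]


candidates = candidate_targets(conn, "Paris")
```

Here `candidates` ranks the candidate target URIs for the mention "Paris" by how frequently it links to each one, which an Entity Linking system could use as a prior.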