Project author: uschindler

Project description:
Data files of German Decompounder for Apache Lucene / Apache Solr / Elasticsearch
Project address: git://github.com/uschindler/german-decompounder.git
Created: 2017-09-16T10:11:15Z
Project community: https://github.com/uschindler/german-decompounder



Data files of German Decompounder for Apache Lucene / Apache Solr / Elasticsearch

This project was started to offer German decompounding out of the box for users
of Apache Lucene, Apache Solr, or Elasticsearch. The problem with the data files is
their license, so be careful when packaging them. Apache Lucene is an Apache v2.0
licensed project, so the data files cannot be shipped together with the distribution.

For decompounding German words, the recommended approach is the following:

  • First use a hyphenator to create syllables of the input tokens. Of course, this alone
    splits way too much: if we indexed the syllables, the user would match a lot of wrong
    results. The hyphenation rules are the ones used by many word processors (e.g.,
    OpenOffice or LaTeX). They are provided here as an XML file in the format of Apache FOP
    (Formatting Objects Processor), taken from https://sourceforge.net/projects/offo/.
    Those files can be read by Lucene’s HyphenationCompoundWordTokenFilter to do the
    hyphenation. Be sure to use the files of offo-hyphenation v1.2, not the newer (2.x)
    ones (Lucene can’t read them)!
  • The second step is therefore to take the syllables and form words out of them again.
    The Lucene HyphenationCompoundWordTokenFilter can do this based on a dictionary.
    This project mainly provides that dictionary (see below). As the dictionary solely
    contains parts of compound words (not the compounds themselves), it is important to
    use the onlyLongestMatch setting of the token filter; otherwise you might get wrong
    decompounding results (especially as the dictionary also contains very short words).
  • The third step is stemming the full word (the token filter keeps the original token
    by default) and also its parts. You should use the light German stemmer (not the
    minimal one), because the decompounded parts contain lots of filler characters that
    the stemmer should remove; the minimal stemmer is not able to do this.
    As decompounding is no longer a minimal approach, you may consider using a separate
    Lucene field that only applies the minimal stemmer and does no decompounding, so that
    exact matches are preferred in your search.

The dictionary file dictionary-de.txt is developed here and was
created based on the fabulous data by Björn Jacke: https://www.j3e.de/ispell/igerman98/

I used his large and high-quality dictionary to build a dictionary file that only contains
the parts of German compounds. The dictionary is therefore not large; it contains
about 14,500 tokens that are commonly used to form compounds. It does not contain the
compounds themselves, only the parts that are used to create them.
The dictionary was lowercased and the umlauts restored to their UTF-8 representation.
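
The dictionary file is a plain word list with one lowercased entry (one compound part) per
line. Purely for illustration (these example entries are made up here and are not quoted
from the actual file), the format looks roughly like this:

  bahn
  straße
  wirtschaft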

Keep in mind: The files provided here are for the new German orthography (in effect since 1998)!

Apache Solr example

Here is a config example for Apache Solr. To use it, put the two data files
into the lang subfolder of the core’s config directory. After that you can add the
following definition to your Solr schema:

  <!-- German -->
  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="lang/de_DR.xml"
              dictionary="lang/dictionary-de.txt" onlyLongestMatch="true" minSubwordSize="4"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
    </analyzer>
  </fieldType>

Important: Use the analyzer for both indexing and searching!
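
To actually index with this analyzer, the field type has to be referenced from a field in
the schema. The following sketch shows one possible way to do that, together with a
companion field for the exact-match idea mentioned above; the field names and the
text_de_exact type are hypothetical and not part of this project:

  <!-- hypothetical field using the decompounding field type from above -->
  <field name="title_de" type="text_de" indexed="true" stored="true"/>

  <!-- optional exact-match companion: minimal stemmer, no decompounding -->
  <fieldType name="text_de_exact" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.GermanMinimalStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="title_de_exact" type="text_de_exact" indexed="true" stored="true"/>
  <copyField source="title_de" dest="title_de_exact"/>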

Elasticsearch example

Here is a config example for Elasticsearch. To use it, put the two data files
into the ${ES_HOME}/config/analysis directory of your ES node and add
the following settings to your index. After that you can use the
german_decompound analyzer in your mapping.

  1. "settings": {
  2. "analysis": {
  3. "filter": {
  4. "german_decompounder": {
  5. "type": "hyphenation_decompounder",
  6. "word_list_path": "analysis/dictionary-de.txt",
  7. "hyphenation_patterns_path": "analysis/de_DR.xml",
  8. "only_longest_match": true,
  9. "min_subword_size": 4
  10. },
  11. "german_stemmer": {
  12. "type": "stemmer",
  13. "language": "light_german"
  14. }
  15. },
  16. "analyzer": {
  17. "german_decompound": {
  18. "type": "custom",
  19. "tokenizer": "standard",
  20. "filter": [
  21. "lowercase",
  22. "german_decompounder",
  23. "german_normalization",
  24. "german_stemmer"
  25. ]
  26. }
  27. }
  28. }
  29. }

Important: Use the analyzer for both indexing and searching!
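
To actually use the analyzer, reference it from a field in your index mapping. A minimal
sketch (the field name title is just an example; the exact mapping structure may differ
slightly depending on your Elasticsearch version):

  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "german_decompound"
      }
    }
  }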

Lucene API example

A custom Analyzer for use with the Apache Lucene API:

  Analyzer analyzer = CustomAnalyzer.builder(Paths.get("/path/to/german-decompounder"))
      .withTokenizer(StandardTokenizerFactory.NAME)
      .addTokenFilter(LowerCaseFilterFactory.NAME)
      .addTokenFilter(HyphenationCompoundWordTokenFilterFactory.NAME,
          "hyphenator", "de_DR.xml",
          "dictionary", "dictionary-de.txt",
          "onlyLongestMatch", "true",
          "minSubwordSize", "4")
      .addTokenFilter(GermanNormalizationFilterFactory.NAME)
      .addTokenFilter(GermanLightStemFilterFactory.NAME)
      .build();

Important: Use the analyzer for both indexing and searching!
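
To check what the chain actually produces, you can dump the tokens for a sample input.
This is a minimal sketch using the analyzer built above; the field name and the sample
word are arbitrary, and the exact output depends on the hyphenation patterns, the
dictionary, and the stemmer:

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  try (TokenStream ts = analyzer.tokenStream("text_de", "Fußballweltmeisterschaft")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // prints the original token and then any decompounded, stemmed parts
      System.out.println(term.toString());
    }
    ts.end();
  }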

Help Out!

If you have suggestions for improving the German dictionary, please send
a pull request, thanks! Be sure to only send “plain words”, no compounds!

License

See NOTICE.txt for more information!