Data files of German Decompounder for Apache Lucene / Apache Solr / Elasticsearch
This project was started to offer German decompounding out of the box for users
of Apache Lucene, Apache Solr, or Elasticsearch. The problem with the data files is
their license, so be careful when packaging them: Apache Lucene is an Apache 2.0
licensed project, so the data files cannot be shipped together with the distribution.
For decompounding German words, the recommended approach is to use the
HyphenationCompoundWordTokenFilter. The filter first hyphenates each token using a
hyphenation grammar (the file de_DR.xml provided here) and then checks, based on a
dictionary, which of the hyphenated parts are actual words. With the onlyLongestMatch
setting of the token filter, only the longest dictionary match at each hyphenation
point is kept as a subword (so that, for example, a longer match like fußball would
win over its prefix fuß).
The dictionary file dictionary-de.txt is developed here and was
created based on the fabulous data by Björn Jacke: https://www.j3e.de/ispell/igerman98/
I used his large, high-quality dictionary to create a dictionary file containing
only the parts of German compounds. The dictionary is therefore not large: it contains
about 14,500 tokens that are commonly used to form compounds. It does not contain the
compounds themselves, only the parts that are used to create them.
The dictionary was lowercased and the umlauts restored to their UTF-8 representation.
Keep in mind: the files provided here are for the new German orthography (in use since 1998)!
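For illustration, the dictionary is a plain UTF-8 text file with one lowercased
compound part per line. The following entries only show the format and are not
quoted from the actual file:

fußball
wetter
meister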
Here is a config example for Apache Solr. To use it, put the two data files into the
lang subfolder of your core's config directory. After that you can add the following
definition to your Solr schema:
<!-- German -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="lang/de_DR.xml"
            dictionary="lang/dictionary-de.txt" onlyLongestMatch="true" minSubwordSize="4"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>
Important: Use the analyzer for both indexing and searching!
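Once the field type is defined, you can use it for any text field in your schema.
The field name below is just an illustration:

<field name="description" type="text_de" indexed="true" stored="true"/>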
Here is a config example for Elasticsearch. To use it, put the two data files into the
${ES_HOME}/config/analysis directory of your ES node and add the following settings to
your index. After that you can use the german_decompound analyzer in your mapping.
"settings": {
"analysis": {
"filter": {
"german_decompounder": {
"type": "hyphenation_decompounder",
"word_list_path": "analysis/dictionary-de.txt",
"hyphenation_patterns_path": "analysis/de_DR.xml",
"only_longest_match": true,
"min_subword_size": 4
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
},
"analyzer": {
"german_decompound": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"german_decompounder",
"german_normalization",
"german_stemmer"
]
}
}
}
}
Important: Use the analyzer for both indexing and searching!
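To check what the analyzer produces, you can send a sample compound through the
_analyze API of your index (the index name my_index is just a placeholder):

GET /my_index/_analyze
{
  "analyzer": "german_decompound",
  "text": "Fußballweltmeisterschaft"
}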
Here is an example of a custom analyzer for use with the Apache Lucene API. The path
given to the builder is the directory containing the two data files:
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilterFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.de.GermanLightStemFilterFactory;
import org.apache.lucene.analysis.de.GermanNormalizationFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;

Analyzer analyzer = CustomAnalyzer.builder(Paths.get("/path/to/german-decompounder"))
    .withTokenizer(StandardTokenizerFactory.NAME)
    .addTokenFilter(LowerCaseFilterFactory.NAME)
    .addTokenFilter(HyphenationCompoundWordTokenFilterFactory.NAME,
        "hyphenator", "de_DR.xml",
        "dictionary", "dictionary-de.txt",
        "onlyLongestMatch", "true",
        "minSubwordSize", "4")
    .addTokenFilter(GermanNormalizationFilterFactory.NAME)
    .addTokenFilter(GermanLightStemFilterFactory.NAME)
    .build();
Important: Use the analyzer for both indexing and searching!
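To inspect the analyzer's output, you can consume a token stream directly. This is a
minimal sketch; the field name and sample text are arbitrary:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

try (TokenStream ts = analyzer.tokenStream("text", "Fußballweltmeisterschaft")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(term);  // prints each token emitted by the analysis chain
  }
  ts.end();
}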
If you have suggestions for improving the German dictionary, please send
a pull request. Thanks! Be sure to send only “plain words”, no compounds!
See NOTICE.txt for more information!