Project author: NLPIR-team

Project description:
Lucene/Solr analyzer plugin. Supports macOS, Linux x86/64, and Windows x86/64. It is a Maven project, so you can change the Lucene/Solr version in the build to match the version you run.
Language: Java
Repository: git://github.com/NLPIR-team/nlpir-analysis-cn-ictclas.git
Created: 2017-08-13T14:07:48Z
Project community: https://github.com/NLPIR-team/nlpir-analysis-cn-ictclas

License: Apache License 2.0



Current version: NLPIR/ICTCLAS for Lucene/Solr plugin V2.2

Lucene-analyzers-nlpir-ictclas-6.6.0

NLPIR/ICTCLAS analyzer plugin for Lucene/Solr 6.6.0. Supports macOS, Linux x86/64, and Windows x86/64.

The project's resources folder is configured as a source folder. It contains the dynamic libraries for all platforms and places them on the classpath, so that JNA can load them on any supported platform.
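To illustrate why the libraries for every platform need to be on the classpath, the sketch below maps the JVM's os.name and os.arch values to a per-platform resource folder name, in the style JNA uses when it searches the classpath for a native library. The class NativePlatform, the method resourcePrefix, and the folder names are assumptions made for illustration. The actual resource layout of this project is not described here, and newer JNA releases name the macOS folder differently.

```java
import java.util.Locale;

final class NativePlatform {
    // Maps os.name / os.arch to a JNA-style resource folder name.
    // ASSUMPTION: the folder names follow older JNA conventions and are not
    // taken from this project's resources folder.
    static String resourcePrefix(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        // Normalise the common 64-bit and 32-bit x86 architecture names.
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            arch = "x86-64";
        } else if (arch.equals("i386") || arch.equals("i686") || arch.equals("x86")) {
            arch = "x86";
        }
        if (os.startsWith("windows")) {
            return "win32-" + arch;
        }
        if (os.startsWith("linux")) {
            return "linux-" + arch;
        }
        if (os.startsWith("mac")) {
            return "darwin";
        }
        throw new IllegalArgumentException("Unsupported platform: " + osName + "/" + osArch);
    }

    public static void main(String[] args) {
        String prefix = resourcePrefix(System.getProperty("os.name"),
                                       System.getProperty("os.arch"));
        System.out.println("Native libraries expected under: " + prefix + "/");
    }
}
```

Because the folder is chosen at run time from the current platform, a single jar can carry the libraries for macOS, Linux, and Windows side by side.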

Building Lucene-analyzers-nlpir-ictclas

Lucene-analyzers-nlpir-ictclas is built with Maven. To build it, run:

    mvn clean package -DskipTests

If you use an IDE such as Eclipse, you can run the same Maven build from there.

How to use in your projects

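You can use NLPIRTokenizerAnalyzer for Chinese word segmentation:

  • NLPIRTokenizerAnalyzer DEMO

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    // Package assumed from the tokenizer class names used elsewhere in this README.
    import org.nlpir.lucene.cn.ictclas.NLPIRTokenizerAnalyzer;

    String text = "我是中国人";
    // The arguments appear to mirror nlpir.properties:
    // data path, encoding (1 = UTF-8), license code, user dictionary, overwrite flag.
    NLPIRTokenizerAnalyzer nta = new NLPIRTokenizerAnalyzer("", 1, "", "", false);
    TokenStream ts = nta.tokenStream("word", text);
    ts.reset();
    CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
    // Print one segmented word per line.
    while (ts.incrementToken()) {
        System.out.println(term.toString());
    }
    ts.end();
    ts.close();
    nta.close();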

You can also use it with Lucene:

  • Lucene DEMO

The sample shows how to index your text and search by using NLPIRTokenizerAnalyzer.

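    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    // Indexing
    NLPIRTokenizerAnalyzer nta = new NLPIRTokenizerAnalyzer("", 1, "", "", false);
    IndexWriterConfig inconf = new IndexWriterConfig(nta);
    inconf.setOpenMode(OpenMode.CREATE_OR_APPEND);
    IndexWriter index = new IndexWriter(FSDirectory.open(Paths.get("index/")), inconf);
    Document doc = new Document();
    doc.add(new TextField("contents", "特朗普表示,很高兴汉堡会晤后再次同习近平主席通话。我同习主席就重大问题保持沟通和协调、两国加强各层级和各领域交往十分重要。当前,美中关系发展态势良好,我相信可以发展得更好。我期待着对中国进行国事访问。", Field.Store.YES));
    index.addDocument(doc);
    index.flush();
    index.close();

    // Searching
    String field = "contents";
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index/")));
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(field, nta);
    Query query = parser.parse("特朗普习近平");
    TopDocs top = searcher.search(query, 100);
    ScoreDoc[] hits = top.scoreDocs;
    for (int i = 0; i < hits.length; i++) {
        System.out.println("doc=" + hits[i].doc + " score=" + hits[i].score);
        Document d = searcher.doc(hits[i].doc);
        System.out.println(d.get("contents"));
    }
    reader.close();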

How to install in Solr

To use the plugin in Solr, you need these files:

  1. The plugin jar that you built. Put it in your core's lib directory.
  2. nlpir.properties, which contains:

        data=""            # parent path of the Data directory
        encoding=1         # 0 = GBK; 1 = UTF-8
        sLicenseCode=""    # license code
        userDict=""        # user dictionary, a text file
        bOverwrite=false   # whether to overwrite an existing user dictionary

  3. The Data directory, which you can find in the NLPIR SDK: https://github.com/NLPIR-team/NLPIR/tree/master/NLPIR%20SDK/NLPIR-ICTCLAS

Warning: Make sure the plugin jar can find the nlpir.properties file. You can put the file in solr_home/server/. The data property must be set to the parent path of the NLPIR/ICTCLAS Data directory.
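The following is a minimal sketch of reading these five settings with java.util.Properties. It does not show how the plugin itself parses the file, which is not described here. The class NlpirConfig, the helper clean, and the sample path /opt/nlpir are hypothetical. Note that java.util.Properties does not treat a trailing # as a comment, so the sketch strips inline comments and surrounding quotes by hand. If the plugin reads the raw values, keep the annotations shown above out of the real file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the five nlpir.properties settings.
final class NlpirConfig {
    final String data;
    final int encoding;
    final String licenseCode;
    final String userDict;
    final boolean overwrite;

    NlpirConfig(Properties p) {
        data = clean(p.getProperty("data", ""));
        encoding = Integer.parseInt(clean(p.getProperty("encoding", "1")));
        licenseCode = clean(p.getProperty("sLicenseCode", ""));
        userDict = clean(p.getProperty("userDict", ""));
        overwrite = Boolean.parseBoolean(clean(p.getProperty("bOverwrite", "false")));
    }

    // ASSUMPTION: values may be wrapped in double quotes and may carry a
    // trailing "# comment"; both are removed here.
    static String clean(String raw) {
        String v = raw.trim();
        if (v.startsWith("\"")) {
            int end = v.indexOf('"', 1);
            return end > 0 ? v.substring(1, end) : v.substring(1);
        }
        int hash = v.indexOf('#');
        return (hash >= 0 ? v.substring(0, hash) : v).trim();
    }

    // Parses the text of a properties file into a config object.
    static NlpirConfig parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new NlpirConfig(p);
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "data=\"/opt/nlpir\" # parent path of the Data directory",
            "encoding=1 # 0 = GBK; 1 = UTF-8",
            "sLicenseCode=\"\"",
            "userDict=\"\"",
            "bOverwrite=false");
        NlpirConfig c = parse(sample);
        System.out.println(c.data + " " + c.encoding + " " + c.overwrite);
    }
}
```

The field order matches the five arguments passed to NLPIRTokenizerAnalyzer in the demos above, so a config object like this could feed the constructor directly.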

  • Solr Managed-schema

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="org.nlpir.lucene.cn.ictclas.NLPIRTokenizerFactory"></tokenizer>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.nlpir.lucene.cn.ictclas.NLPIRTokenizerFactory"></tokenizer>
      </analyzer>
    </fieldType>

  4. jna.jar, the dependency needed to load the native libraries. Add it to your Solr lib directory.

Tokenizer

  • v2.*

    // Standard tokenizer
    class="org.nlpir.lucene.cn.ictclas.NLPIRTokenizer"
    // Finer segmentation
    class="org.nlpir.lucene.cn.ictclas.finersegmet.FinerTokenizer"

  • v1.*

    // Standard tokenizer
    class="org.nlpir.lucene.cn.ictclas.NLPIRTokenizer"

Solr Show

[Figure: Solr screenshot]