Project author: ArnoldGaius

Project description: Text classifier based on sklearn
Programming language: Python
Repository: git://github.com/ArnoldGaius/Text_Classifier.git
Created: 2017-06-05T10:50:02Z
Project page: https://github.com/ArnoldGaius/Text_Classifier

License: MIT License


Text Classifier

A text classifier built on NumPy, scikit-learn, pandas, and Matplotlib.

Train Data Format

type      Text
game      The LoL champions pro players would ban forever
society   In Beijing you should keep the rules
etc.      etc.
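
As a rough illustration of the format above, the following Python 3 sketch parses a label/text file using only the standard library. It is not part of TextClassifier: the header names, the comma separator, the helper name read_training_rows, and the quoted comma in the second sample row are all assumptions made for the example, and whether TextClassifier itself honours CSV quoting is not stated in this document.

```python
import csv
import io

# Hypothetical sample in the layout shown above: a header row, then label,text.
# The header names and the comma separator are assumptions for illustration.
SAMPLE = (
    "type,Text\n"
    "game,The LoL champions pro players would ban forever\n"
    'society,"In Beijing, you should keep the rules"\n'
)


def read_training_rows(handle, sep=","):
    """Return (labels, texts) parsed from a delimited file with a header row."""
    reader = csv.reader(handle, delimiter=sep)
    next(reader)  # skip the header row
    labels, texts = [], []
    for row in reader:
        if len(row) != 2:
            raise ValueError("expected 2 fields, got %d: %r" % (len(row), row))
        labels.append(row[0])
        texts.append(row[1])
    return labels, texts


labels, texts = read_training_rows(io.StringIO(SAMPLE))
print(labels)
print(texts)
```

The second row shows why the separator matters: a headline that contains the separator character has to be quoted, or the file should use a separator such as '\t' that does not occur in the text.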

Sample Usage

  >>> import TextClassifier
  # create the classifier container
  >>> tc = TextClassifier.classifier_container()
  # load the data
  # '../data/Train_data.txt' is the data path
  # sep defaults to ','; you can change it to '\t', etc.
  >>> tc.load_Data('../data/Train_data.txt', sep=',')
  # train the model
  >>> tc.train()
  # predict; the input can be a single string or a list of strings
  >>> print tc.predict('Faker is the first League of Legends player to earn over $1 million in prize money')
  [u'game']
  >>> print tc.predict(['Faker is the first League of Legends player to earn over $1 million in prize money',
  ...                   '18-year-old youth killed 88-year-old veteran',
  ...                   'Take you into the real North Korea'])
  [u'game',u'society',u'world']
  # get X_train, X_test, y_train, y_test
  # original_data is not defined in this snippet; it is assumed to hold the loaded
  # training data with 'Text' and 'Categorization' columns
  # note: sklearn.cross_validation was removed in scikit-learn 0.20; newer versions
  # provide train_test_split in sklearn.model_selection
  >>> from sklearn import cross_validation
  >>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(original_data['Text'], original_data['Categorization'], test_size=0.3, random_state=0)
  # get the accuracy on the training data
  >>> tc.Accuracy(X_train, y_train)
  Accuracy:
  0.917504310503
  # get the confusion matrix
  >>> Y_predict = tc.predict(X_test)
  >>> tc.confusion_matrix(y_test, Y_predict)
  Confusion Matrix :
                 military  baby   car  game  food  sports  finance  discovery  regimen  travel  fashion  history  society  story  tech  world  entertainment  essay
  military           2831     5     3    16     9       4        8         10        0      15        8       24        9      3     6     42              6      1
  baby                  0  2932     3     3    26       0        1          0       10       7       10        3       16      4     3      7             20      4
  car                   6    10  2813     3     6       8       13          3        1      13       10        3       39      1    11      5             24      4
  game                 10    11     6  2843     5       9        2          4        1      11       13        3        8      4    25      3             31      3
  food                  0    38     0     3  2799       1        5          1       67      34       16        7        9      3     4      8             14     10
  sports                2     7     6    13     6    2803        9          0        1      13       24        5       10      1     5     19             42      4
  finance              12    10    13     4    15       6     2692          1        2      21        5        3       18      2    79     47             12      8
  discovery             8     2     0     3     3       2        5       1155        1       5        1        1        1      0    13      9              0      1
  regimen               0    59     0     0    63       0        2          0     1093       0        3        3        4      2     0      1              5      0
  travel                9    19     8     8    23       4        9          8        0    2741       19       20       19      7    13     55             14     12
  fashion               2    21     5     9    14       9        1          5       13      18     2772        5        7      1     6     11             77      7
  history              49     9     2     3     6       3        3          6        4      28        3     2813       12     20     2     35             21      6
  society              27    77    50     7    43       7       42          5       16      78       27       13     2414     29    36     36             58     15
  story                 3    17     1     3     7       2        2          2        2       7        5       12       19   1120     4      6             14     11
  tech                 16     8    19    21     6       3       52         13        3       6        5        4       14      0  2787      9             17      7
  world                52    33    12     8     9      16       33         24        2      35       27       37       50      8    20   2583             30      4
  entertainment         5    14     3    28     6      13        4          3        1       9      120       29       17      3    12     10           2708      8
  essay                 7    23     5     3    12       1        8          6        4      15       22       11        7      2     5      2             11   1010
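
The matrix above appears to use rows for actual labels and columns for predicted labels (each row sum matches the corresponding test count in the plot_display output, for example 3049 for baby). The Python 3 sketch below shows, on a tiny made-up sample, how such counts can be tallied and how overall accuracy follows from the diagonal. The helper names confusion_counts and accuracy_from_counts are hypothetical and this is not how TextClassifier implements tc.confusion_matrix.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) label pair occurs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def accuracy_from_counts(counts):
    """Accuracy is the diagonal (actual == predicted) divided by the total."""
    total = sum(counts.values())
    correct = sum(n for (actual, predicted), n in counts.items()
                  if actual == predicted)
    return correct / total if total else 0.0


# Made-up labels, for illustration only.
y_true = ["game", "game", "society", "world", "world"]
y_pred = ["game", "society", "society", "world", "game"]

counts = confusion_counts(y_true, y_pred)
labels = sorted(set(y_true) | set(y_pred))

# One row per actual label, one column per predicted label.
for actual in labels:
    print(actual, [counts[(actual, predicted)] for predicted in labels])

accuracy = accuracy_from_counts(counts)
print(accuracy)
```

Off-diagonal cells show which labels are confused with each other; in the matrix above, for instance, entertainment is most often mistaken for fashion (120 cases).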
  # get the per-label count differences and the figure
  >>> tc.plot_display(y_test, Y_predict)
  Plot display...
                 Test count:  Predict count:  Sub Result:  Sub_Abs Result:
  baby                  3049            3295          246              246
  car                   2973            2949          -24               24
  discovery             1210            1246           36               36
  entertainment         2993            3104          111              111
  essay                 1154            1115          -39               39
  fashion               2983            3090          107              107
  finance               2950            2891          -59               59
  food                  3019            3058           39               39
  game                  2992            2978          -14               14
  history               3025            2996          -29               29
  military              3000            3039           39               39
  regimen               1235            1221          -14               14
  society               2980            2673         -307              307
  sports                2970            2891          -79               79
  story                 1237            1210          -27               27
  tech                  2990            3031           41               41
  travel                2988            3056           68               68
  world                 2983            2888          -95               95
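
In the table above, Sub Result equals the predicted count minus the test count for each label (for baby, 3295 - 3049 = 246), and Sub_Abs Result is its absolute value. A minimal Python 3 sketch of that arithmetic on made-up labels follows; the helper name count_differences is hypothetical and the code is not taken from TextClassifier.

```python
from collections import Counter


def count_differences(y_true, y_pred):
    """Return (label, test count, predict count, difference, |difference|) rows."""
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    rows = []
    for label in sorted(set(true_counts) | set(pred_counts)):
        diff = pred_counts[label] - true_counts[label]
        rows.append((label, true_counts[label], pred_counts[label],
                     diff, abs(diff)))
    return rows


# Made-up labels, for illustration only.
y_true = ["game", "game", "society", "world", "world"]
y_pred = ["game", "society", "society", "world", "game"]

rows = count_differences(y_true, y_pred)
for row in rows:
    print(row)
```

These differences only compare how often each label occurs, so a difference of zero does not mean every item with that label was classified correctly: errors in both directions can cancel out, as they do for game in this sample.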


Performance

  • Train set: 156k news headlines with 18 labels
  • Test set: 36k news headlines with 18 labels
  • Compared with the SVM, naive Bayes, and SGD (loss='perceptron') classifiers from scikit-learn:

Classifier            Accuracy   Time cost (s)
scikit-learn (svm)    71.6%      241
scikit-learn (nb)     72.7%      12
scikit-learn (SGD)    72.4%      197
TextClassifier        76.8%      8
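
A comparison like the one above needs accuracy and elapsed time measured the same way for every classifier. The Python 3 sketch below shows one way to do that with the standard library. The benchmark helper and the majority-label stand-in classifier are hypothetical, and the choice to time training and prediction together is an assumption: this document does not say what its "Time cost" column covers.

```python
import time
from collections import Counter


def benchmark(name, fit, predict, train, test):
    """Time fit plus predict and return (name, accuracy, elapsed seconds)."""
    X_train, y_train = train
    X_test, y_test = test
    start = time.perf_counter()
    model = fit(X_train, y_train)
    predicted = predict(model, X_test)
    elapsed = time.perf_counter() - start
    correct = sum(1 for p, t in zip(predicted, y_test) if p == t)
    return name, correct / len(y_test), elapsed


# Hypothetical stand-in classifier: always predicts the most common training label.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, X):
    return [model for _ in X]


# Made-up data, for illustration only.
train = (["headline a", "headline b", "headline c"], ["game", "game", "world"])
test = (["headline d", "headline e"], ["game", "world"])

result = benchmark("majority", fit_majority, predict_majority, train, test)
print("%s accuracy=%.1f%% time=%.4fs" % (result[0], result[1] * 100, result[2]))
```

Any classifier can be plugged in by passing a fit function that returns a model and a predict function that takes that model and a list of texts.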

Installation

  $ pip install TextClassifier