文本分类器 Text classifier

Text Classifier based on Numpy,Scikit-learn,Pandas,Matplotlib

Train Data Format

type	Text
game	The LoL champions pro players would ban forever
society	In Beijing you should keep the rules
etc.	etc.

Sample Usage

>>> import TextClassifier
    # cerat classifier container
>>> tc = TextClassifier.classifier_container()
    # load data
    # '../data/Train_data.txt' is data path 
    # sep Default = ',' you can change it to '\t',etc.  
>>> tc.load_Data('../data/Train_data.txt',sep=',')
    # train the model
>>> tc.train()
    # prediction. Input list or text-String
>>> print tc.predict('Faker is the first League of Legends player to earn over $1 million in prize money')
    [u'game']
>>> print tc.predict(['Faker is the first League of Legends player to earn over $1 million in prize money',
                    '18-year-old youth killed 88-year-old veteran',
                    'Take you into the real North Korea'])
    [u'game',u'society',u'world']
    #get X_train, X_test, y_train, y_test
>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(original_data['Text'], original_data['Categorization'], test_size=0.3, random_state=0)
    #get TrainData Accuracy
>>> tc.Accuracy(X_train, y_train)
    Accuracy:
    0.917504310503

    #get Confusion Matrix
>>> Y_predict = tc.predict(X_test)
>>> tc.confusion_matrix(y_test, Y_predict)
    Confusion Matrix :
               military  baby   car  game  food  sports  finance  discovery  regimen  travel  fashion  history  society  story  tech  world  entertainment  essay
military           2831     5     3    16     9       4        8         10        0      15        8       24        9      3     6     42              6      1
baby                  0  2932     3     3    26       0        1          0       10       7       10        3       16      4     3      7             20      4
car                   6    10  2813     3     6       8       13          3        1      13       10        3       39      1    11      5             24      4
game                 10    11     6  2843     5       9        2          4        1      11       13        3        8      4    25      3             31      3
food                  0    38     0     3  2799       1        5          1       67      34       16        7        9      3     4      8             14     10
sports                2     7     6    13     6    2803        9          0        1      13       24        5       10      1     5     19             42      4
finance              12    10    13     4    15       6     2692          1        2      21        5        3       18      2    79     47             12      8
discovery             8     2     0     3     3       2        5       1155        1       5        1        1        1      0    13      9              0      1
regimen               0    59     0     0    63       0        2          0     1093       0        3        3        4      2     0      1              5      0
travel                9    19     8     8    23       4        9          8        0    2741       19       20       19      7    13     55             14     12
fashion               2    21     5     9    14       9        1          5       13      18     2772        5        7      1     6     11             77      7
history              49     9     2     3     6       3        3          6        4      28        3     2813       12     20     2     35             21      6
society              27    77    50     7    43       7       42          5       16      78       27       13     2414     29    36     36             58     15
story                 3    17     1     3     7       2        2          2        2       7        5       12       19   1120     4      6             14     11
tech                 16     8    19    21     6       3       52         13        3       6        5        4       14      0  2787      9             17      7
world                52    33    12     8     9      16       33         24        2      35       27       37       50      8    20   2583             30      4
entertainment         5    14     3    28     6      13        4          3        1       9      120       29       17      3    12     10           2708      8
essay                 7    23     5     3    12       1        8          6        4      15       22       11        7      2     5      2             11   1010

    #get sub_result and Figure
>>> tc.plot_display(y_test, Y_predict)
    Plot display...
               Test count:  Predict count:  Sub Result:  Sub_Abs Result:
baby                  3049            3295          246              246
car                   2973            2949          -24               24
discovery             1210            1246           36               36
entertainment         2993            3104          111              111
essay                 1154            1115          -39               39
fashion               2983            3090          107              107
finance               2950            2891          -59               59
food                  3019            3058           39               39
game                  2992            2978          -14               14
history               3025            2996          -29               29
military              3000            3039           39               39
regimen               1235            1221          -14               14
society               2980            2673         -307              307
sports                2970            2891          -79               79
story                 1237            1210          -27               27
tech                  2990            3031           41               41
travel                2988            3056           68               68
world                 2983            2888          -95               95

Performance

Train set: 156k news headline with 18 labels
Test set: 36k news headline with 18 labels
Compare with svm , naive-bayes , SGD(loss = ‘perceptron’) of Scikit-learn

Classifier	Accuracy	Time cost(s)
scikit-learn(svm)	71.6%	241
scikit-learn(nb)	72.7%	12
scikit-learn(SGD)	72.4%	197
TextClassifier	76.8%	8

Installation

$ pip install TextClassifier