Project author: priyankayawalkar

Project description:
Natural Language Processing
Primary language: Jupyter Notebook
Project URL: git://github.com/priyankayawalkar/Natural-Language-Processing.git


Natural-Language-Processing

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that deals with the interaction between humans and computers through natural language. With NLP, a computer can understand, analyze, and derive meaning from human language and from the way humans communicate in text. Natural language has a hierarchical structure: characters, words, sentences, texts, and so on. NLP powers real-world applications such as Text Summarization, Named Entity Recognition, Text Translation, Relationships Between Words, Sentiment Analysis, Topic Segmentation, and Speech Recognition.

Platform

Jupyter Notebook is an open-source, web-based environment for creating documents. You can run your Python scripts in this environment. You can install Jupyter Notebook with the following command: pip install notebook

Tools used in Natural Language Processing

You can use open source NLP libraries listed below:

NLTK : The Natural Language Toolkit (NLTK) is a collection of programs and libraries for statistical NLP for the English language.

spaCy : spaCy is an open-source NLP library with advanced features. spaCy supports 16 different languages.

iNLTK : iNLTK is a library that specifically supports Indian languages. iNLTK aims to provide out-of-the-box support for various NLP tasks.

Contents : Natural Language Processing Hands-On

Part 1 : Introduction to NLTK

  1. Download NLTK
  2. Import Brown Corpus and Access Data
  3. Import Inaugural Corpus and Access Data
  4. Import Webtext Corpus and Access Data
  5. Frequency Distribution of Words in a Text
  6. Conditional Frequency Distribution of Words in a Text
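
As a quick sketch of items 5 and 6 above, NLTK's `FreqDist` and `ConditionalFreqDist` work on any iterable of tokens; a toy word list stands in here for a real corpus such as Brown, so no corpus download is needed:

```python
from nltk import FreqDist, ConditionalFreqDist

# Toy token list standing in for a corpus such as Brown
words = "the cat sat on the mat and the cat ran".split()

# Frequency distribution: how often each word occurs
fd = FreqDist(words)
print(fd.most_common(2))          # [('the', 3), ('cat', 2)]

# Conditional frequency distribution, conditioned on word length
cfd = ConditionalFreqDist((len(w), w) for w in words)
print(cfd[3]["the"])              # 3
```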

Practice

  1. Import Inaugural Corpus and Access Data
  2. Read Content of the Text File
  3. Read Words of the Text File
  4. Frequency Distribution of Words
  5. Conditional Frequency Distribution
  6. Conditional Frequency Distribution for 4-, 5- and 6-Letter Words
  7. Words in Ascending Order
  8. Words in Descending Order
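
The length-restricted conditional distribution and the sorting exercises above can be sketched as follows; the word list is an illustrative stand-in for words read from the inaugural corpus:

```python
from nltk import ConditionalFreqDist

# Hypothetical sample in place of inaugural.words(...)
words = ["vote", "nation", "people", "law", "state", "right", "union"]

# Conditional frequency distribution restricted to 4-, 5- and 6-letter words
cfd = ConditionalFreqDist((len(w), w) for w in words if len(w) in (4, 5, 6))
print(sorted(cfd.conditions()))       # [4, 5, 6]

# Words in ascending and descending order
print(sorted(words))                  # ascending
print(sorted(words, reverse=True))    # descending
```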

Additional

  1. Various Predefined Text Access
  2. Import Reuters Corpus and Access Data
  3. Read the Content of Text
  4. Words That Occur Together
  5. Find a Specific Word
  6. Words in a Specific Fileid
  7. Total Sentences and Words in Text
  8. Frequency of Words Matching a List
  9. Sentence startswith() Specified Word
  10. Sort the Words
  11. Reverse Sentence
  12. Frequency of Each Word
  13. Length of Longest Sentence
  14. Length of Shortest Sentence
  15. Part of Speech
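
Several of the sentence-level exercises above (items 9, 13, and 14) need only plain Python once the corpus has been read; a toy sentence list stands in here for something like `reuters.sents()`:

```python
# Toy sentence list; with NLTK this would come from e.g. reuters.sents()
sents = [
    ["The", "market", "rose", "sharply", "today"],
    ["Oil", "prices", "fell"],
    ["The", "bank", "cut", "rates"],
]

# Sentences that start with a specified word (item 9)
starts_the = [s for s in sents if s[0] == "The"]
print(len(starts_the))                     # 2

# Length of the longest and shortest sentences (items 13 and 14)
print(max(len(s) for s in sents))          # 5
print(min(len(s) for s in sents))          # 3
```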

Part 2 : Stemming of Words

  1. PorterStemmer
  2. SnowballStemmer
  3. Lemmatizer
  4. RegexpStemmer
  5. LancasterStemmer
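
The four stemmers in Part 2 can be compared side by side; none of them needs a corpus download (the lemmatizer, by contrast, requires the WordNet corpus and is omitted here):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

word = "running"
print(PorterStemmer().stem(word))                       # run
print(SnowballStemmer("english").stem(word))            # run
print(LancasterStemmer().stem(word))                    # run
# RegexpStemmer just strips the given suffixes, so it leaves "runn"
print(RegexpStemmer("ing$|s$|ed$", min=4).stem(word))   # runn
```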

Part 3 : WordNet, CMU Pronouncing Dictionary and Stopwords

  1. WordNet
  2. CMU Pronouncing Dictionary
  3. Stopwords

Part 4 : Text Classification using Naive Bayes Classifier

  1. Import and Access the names Corpus
  2. Import the random Library
  3. Create the Feature Set
  4. Split the Data into Training and Test Sets
  5. Apply the Naive Bayes Classifier
  6. Classify Names Using the Classifier
  7. Accuracy on the Test Set

Part 5 : Vectorisers & Cosine Similarity

  1. Import CountVectorizer
  2. Define the Corpus
  3. Create the Vocabulary
  4. Transform into Vectors
  5. Cosine Similarity
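
The five steps above map directly onto scikit-learn calls (assuming that is the `CountVectorizer` in use); a sketch with a made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Define a small corpus
corpus = [
    "natural language processing with python",
    "language processing is fun",
    "cats sit on mats",
]

# Build the vocabulary and transform documents into count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_)[:3])

# Pairwise cosine similarity between the document vectors
sim = cosine_similarity(X)
print(round(sim[0, 1], 2))   # positive: docs 0 and 1 share two words
print(sim[0, 2])             # 0.0: docs 0 and 2 share no words
```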

Part 6 : Tasks for Marathi Language

  1. Import and Access Marathi Language Data
  2. Words from a Specified File
  3. Print Content of the File
  4. Sentence startswith() Specified Word
  5. Tokenization
  6. Read Words
  7. Total Tokens
  8. Frequency of All Words
  9. Frequency of Most Common Words
  10. Part-of-Speech Tagging
  11. Stemmer
  12. RegexpStemmer
  13. Word Embedding
  14. Whether a Character Is a Vowel or a Consonant
  15. Cosine Similarity
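
Item 14 above can be sketched with a simple set-membership check; the vowel set below is an illustrative, simplified list of independent Devanagari vowel letters, not an exhaustive one:

```python
# Independent Devanagari vowel letters (a simplified, illustrative set)
DEVANAGARI_VOWELS = set("अआइईउऊऋएऐओऔ")

def is_vowel(ch):
    """Return True if ch is one of the listed independent Devanagari vowels."""
    return ch in DEVANAGARI_VOWELS

print(is_vowel("अ"))   # True
print(is_vowel("क"))   # False ('ka' is a consonant)
```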

Part 7 : Text Pipeline Processing

  1. Import and Access Corpus
  2. Corpus - product_reviews_2, Access fileids
  3. Print Part of the Text
  4. Tokenization
     1. Sentence Tokenizer
     2. Word Tokenizer
     3. Total Tokens in Selected Text
     4. All Tokens in Sorted Order
     5. Frequency of Distinct Tokens in the Text
  5. Stemmer
     1. Porter Stemmer
     2. Snowball Stemmer
     3. Lemmatizer
  6. Part-of-Speech Tagging
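
The pipeline steps above can be sketched end to end; a regex tokenizer stands in here for NLTK's punkt-based tokenizers so that no corpus download is needed, and the review text is invented:

```python
import re
from nltk.stem import PorterStemmer

text = "The battery lasts long. The screen looks great."

# Tokenization: sentences, then words (regex stand-in for sent_tokenize/word_tokenize)
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
tokens = [w.lower() for s in sentences for w in re.findall(r"[a-zA-Z]+", s)]
print(len(sentences), len(tokens))      # 2 8

# All tokens in sorted order, then stems via PorterStemmer
print(sorted(set(tokens)))
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
```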

Part 8 : Functionality Using an NLP Tool: spaCy

  1. Import and Load the Model for the English Language
  2. Preprocessing Step: Tokenization and Stopwords
  3. Part-of-Speech Tags
  4. Dependency Parsing
  5. Named Entity Recognition
  6. Conclusion
  7. References
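
A sketch of the first two steps; a blank English pipeline gives tokenization without downloading a model, while POS tags, dependency parsing, and NER require a trained model such as `en_core_web_sm`:

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Blank English pipeline: tokenizer only, no model download needed.
# For POS/parse/NER, load a trained model instead, e.g.:
#   python -m spacy download en_core_web_sm
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")

tokens = [t.text for t in doc]
print(tokens)

# Preprocessing: filter out stopwords
content = [t.text for t in doc if t.text.lower() not in STOP_WORDS]
print(content)
```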

Part 9 : Web Scraping for a News Article

  1. Extracting Text from a URL for a News Article
  2. Preprocessing and Cleaning the Text
  3. Tokenization
  4. POS Tagging
  5. Named Entity Recognition
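
A sketch of the extraction and cleaning steps, assuming the article HTML has already been fetched (e.g. with urllib or requests); a stdlib `HTMLParser` stands in for heavier scraping libraries, and the HTML snippet is invented:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

html = ("<html><body><h1>Rates cut</h1><script>x=1</script>"
        "<p>The bank cut rates today.</p></body></html>")
parser = TextExtractor()
parser.feed(html)

# Cleaning: join fragments and collapse whitespace
text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
print(text)   # Rates cut The bank cut rates today.
```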

Part 10 : Word Embedding and Chunking

  1. One-Hot Encoding (CountVectorizer)
  2. TF-IDF Transformation
  3. Chunking
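
A sketch of TF-IDF transformation (via scikit-learn, assumed here) and chunking with NLTK's `RegexpParser`; the POS tags are supplied by hand so that no tagger model needs to be downloaded:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

# TF-IDF transformation over a toy document set
docs = ["the dog barks", "the cat sleeps", "the dog sleeps"]
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)               # (3, 5): 3 documents, 5 distinct terms

# Chunking: a regex grammar for simple noun phrases over (word, tag) pairs
grammar = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = grammar.parse([("the", "DT"), ("little", "JJ"), ("dog", "NN")])
print(tree)                  # (S (NP the/DT little/JJ dog/NN))
```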

Part 11 : Sentiment Analysis using Logistic Regression

  1. Loading the Dataset
  2. Transforming Documents into Feature Vectors
  3. Term Frequency & Inverse Document Frequency
  4. Document Classification Using Logistic Regression
  5. Model Evaluation
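
The Part 11 pipeline can be sketched with scikit-learn on a tiny invented dataset (a stand-in for a real review corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset: 1 = positive, 0 = negative
docs = ["great movie loved it", "wonderful acting great plot",
        "terrible movie hated it", "awful plot bad acting"] * 5
labels = [1, 1, 0, 0] * 5

# Transform documents into TF-IDF feature vectors
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Document classification using logistic regression
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["loved the wonderful acting"])))

# Model evaluation (accuracy on the training data for this toy example)
print(clf.score(X, labels))
```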

Part 12 : Bigram Model

  1. Tokenization
  2. Remove Stopwords
  3. Bigram Collocations
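
The three steps above can be sketched with NLTK's collocation finder; a regex tokenizer and a tiny manual stopword list are used here so that no corpus download is needed, and the text is invented:

```python
import re
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

text = ("New York is big. New York never sleeps. "
        "People love New York and people visit New York.")

# Tokenization (simple regex) and stopword removal (tiny manual list)
stop = {"is", "and", "the", "a"}
tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop]

# Score bigrams and keep the strongest collocation
finder = BigramCollocationFinder.from_words(tokens)
print(finder.nbest(BigramAssocMeasures.likelihood_ratio, 1))   # [('new', 'york')]
```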