Project author: jaaack-wang

Project description:
A corpus-linguistic tool to extract and search for linguistic features
Primary language: Python
Repository: git://github.com/jaaack-wang/ling_feature_extractor.git
Created: 2020-11-10T01:44:38Z
Project community: https://github.com/jaaack-wang/ling_feature_extractor

License: MIT License


My MA thesis, titled “A Macroscopic Re-examination of Language and Gender: A Corpus-Based Case Study in University Instructor Discourses”, uses this program.

ling_feature_extractor

Description

  • A corpus-linguistic tool to extract and search for linguistic features in a text or a corpus.
  • There are 95 built-in linguistic features in the main version, versus 98 in the Thesis_Project version. The three features removed from the main version are words per utterance, number of utterances, and number of overlaps, which are not generally available in an ordinary corpus.
  • Over two thirds of these features come from Biber et al. (2006), and 42 of them also appear in Biber (1988). These features are generally known as part of the Multi-Dimensional (MD) analysis framework.
  • The program was mainly tested on two publicly accessible corpora, the British Academic Spoken Corpus and the Michigan Corpus of Academic English, but due to copyright concerns it is demonstrated here on the test_sample.

Prerequisites

  • Computer languages:
    • Python 3.6+: check with cmd: python --version or python3 --version (Download Page);
    • Java 1.8+: check with cmd: java -version (Download Page).
  • Python packages
Package          | Description                                    | Pip download
stanfordcorenlp  | A Python wrapper for StanfordCoreNLP           | pip/pip3 install stanfordcorenlp
pandas           | Used for storing extracted feature frequencies | pip/pip3 install pandas

In addition, the program relies heavily on Python's built-in packages, especially the re package for regular expressions.
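Since every feature is ultimately a regular expression run over POS-tagged text, a built-in pattern such as Contraction (its regex is shown later in this README) can be tried with nothing but the re package; the tagged sample string below is made up for illustration:

```python
import re

# The built-in Contraction pattern (see get_feature_regex_by_name below);
# it matches n't or a short '-clitic attached to a non-punctuation tag.
contraction = re.compile(r"(n't| '\S\S?)_[^P]\S+")

# A made-up POS-tagged snippet in the word_TAG format the extractor uses
tagged = "I_PRP 've_VBP n't_NEG got_VBN it_PRP"

hits = [m.group() for m in contraction.finditer(tagged)]
print(hits)  # [" 've_VBP", "n't_NEG"]
```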

Installation

  • Directly download from this page and cd to the project folder.
  • By pip: pip/pip3 install LFExtractor

Usage

Path to StanfordCoreNLP

Please specify the directory of StanfordCoreNLP in text_processor.py under the LFE folder when first using the program.

  • nlp = StanfordCoreNLP("/path/to/StanfordCoreNLP/")

Example: nlp = StanfordCoreNLP("/Users/wzx/p_package/stanford-corenlp-4.1.0")
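For orientation, the extractor works on text in the word_TAG format produced by the tagger. A minimal sketch of that format, using a hand-written list of (word, tag) pairs in place of real StanfordCoreNLP output:

```python
# Hand-written (word, tag) pairs standing in for real tagger output,
# which would require a running StanfordCoreNLP backend.
pairs = [('I', 'PRP'), ("'ve", 'VBP'), ('got', 'VBN'), ('it', 'PRP')]

# Join each pair into the word_TAG tokens the feature regexes expect
tagged = ' '.join(f'{w}_{t}' for w, t in pairs)
print(tagged)  # I_PRP 've_VBP got_VBN it_PRP
```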

Dealing with a corpus of files

  from LFE.extractor import CorpusLFE
  lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')

  # get frequency data, the tagged corpus, and the extracted features by default
  lfe.corpus_feature_fre_extraction()  # equals lfe.corpus_feature_fre_extraction(normalized_rate=100, save_tagged_corpus=True, save_extracted_features=True, left=0, right=0)

  # change normalized_rate, turn off saving the tagged corpus, and save the extracted features with the specified context
  lfe.corpus_feature_fre_extraction(1000, False, True, 2, 3)  # frequencies normalized per 1000 words; extracted features saved with 2 words of left and 3 words of right context

  # get frequency data only
  lfe.corpus_feature_fre_extraction(save_tagged_corpus=False, save_extracted_features=False)

  # get the tagged corpus only
  lfe.save_tagged_corpus()

  # get the extracted features only
  lfe.save_corpus_extracted_features()  # equals lfe.save_corpus_extracted_features(left=0, right=0)

  # set how many words to display on each side of the target pattern
  lfe.save_corpus_extracted_features(2, 3)

  # extract and save a specific linguistic feature by feature name
  # to see the built-in features' names, use show_feature_names()
  from LFE.extractor import *
  print(show_feature_names())  # Six letter words and longer, Contraction, Agentless passive, By passive...

  # specify which feature to extract and save
  lfe.save_corpus_one_extracted_feature_by_name('Six letter words and longer')

  # extract and save a specific linguistic feature by regex, for example, 'you know'
  lfe.save_corpus_one_extracted_feature_by_regex(r'you_\S+ know_\S+', 2, 2, feature_name='You Know')  # extracts the phrase 'you know' with 2 words of context on each side; note the '_\S+' after each word, since the corpus is automatically POS tagged

  # for more complex structures, features_set.py can be utilized, for example, to extract the "article + adj + noun" structure
  from LFE import features_set as fs
  ART = fs.ART
  ADJ = fs.ADJ
  NOUN = fs.NOUN
  lfe.save_corpus_one_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')
  # result example (using test_sample): away_RB by_IN 【 the_DT whole_JJ thing_NN 】 In_IN fact_NN
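The ART, ADJ, and NOUN constants above come from LFE/features_set.py; the sketch below uses simplified stand-ins for them (the real definitions may differ) to show how such a combined pattern finds the bracketed noun phrase in the result example:

```python
import re

# Simplified stand-ins for fs.ART, fs.ADJ, fs.NOUN; the real regexes
# in LFE/features_set.py may be more elaborate.
ART = r"\S+_DT"
ADJ = r"\S+_JJ[RS]?"
NOUN = r"\S+_NNS?"

# The tagged context from the result example above
tagged = "away_RB by_IN the_DT whole_JJ thing_NN In_IN fact_NN"
m = re.search(rf"{ART} {ADJ} {NOUN}", tagged)
print(m.group())  # the_DT whole_JJ thing_NN
```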

Dealing with a text

  from LFE import extractor as ex
  # check the functionalities contained in ex with dir(ex)

  # show built-in feature names
  print(ex.show_feature_names())  # Six letter words and longer, Contraction, Agentless passive, By passive...

  # get a built-in feature's regex by its name
  print(ex.get_feature_regex_by_name('Contraction'))  # (n't| '\S\S?)_[^P]\S+

  # get a built-in feature's name by its regex
  print(ex.get_feature_name_by_regex(r"(n't| '\S\S?)_[^P]\S+"))  # Contraction

  # text processing
  # tagged file
  ex.save_single_tagged_text('/path/to/the/file')
  # cleaned file
  ex.save_single_cleaned_text('/path/to/the/file')

  # display an extracted feature by name
  res = ex.display_extracted_feature_by_name('/path/to/the/file', 'Contraction', left=0, right=0)
  print(res)  # 's_VBZ, n't_NEG, 've_VBP...
  # save the result
  ex.save_extracted_feature_by_name('/path/to/the/file', 'Contraction', left=0, right=0)

  # display an extracted feature by regex, for example, a noun phrase
  from LFE import features_set as fs
  ART = fs.ART
  ADJ = fs.ADJ
  NOUN = fs.NOUN
  res = ex.display_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')
  print(res)  # One_CD is_VBZ 【 the_DT extraordinary_JJ evidence_NN 】 of_IN human_JJ
  # save the result
  ex.save_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')

  # get the frequency data of all the linguistic features for a file
  res = ex.get_single_file_feature_fre('/path/to/the/file', normalized_rate=100, save_tagged_file=True, save_extracted_features=True, left=0, right=0)
  print(res)
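The normalized_rate parameter above scales raw counts to a per-N-words rate so that files of different lengths are comparable; the arithmetic amounts to the sketch below (a hypothetical helper for illustration, not part of the LFE API):

```python
# Hypothetical helper illustrating frequency normalization:
# occurrences per `rate` words (the package defaults to per 100 words).
def normalize(count, total_words, rate=100):
    return count * rate / total_words

# e.g. 7 contractions in a 3,500-word file
print(normalize(7, 3500))  # 0.2
```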

Dealing with a part of a corpus

  from LFE.extractor import *
  lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')
  # get the file path list, select the files you want to examine, and construct a list
  fp_list = lfe.get_filepath_list()
  # loop through the list and use the functionalities mentioned above to get the results you want
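A minimal sketch of the selection step with plain Python (the file paths here are made up for illustration; with a real corpus you would then call the save/display functions above on each selected path):

```python
# Made-up file paths standing in for lfe.get_filepath_list() output
fp_list = ['/corpus/lecture_01.txt', '/corpus/seminar_02.txt', '/corpus/lecture_03.txt']

# keep only the files of interest, e.g. the lectures
subset = [fp for fp in fp_list if 'lecture' in fp]
print(subset)  # ['/corpus/lecture_01.txt', '/corpus/lecture_03.txt']

# then loop and apply any of the functionalities above, e.g.:
# for fp in subset:
#     ex.save_single_tagged_text(fp)
```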