Data Pre-processing for Natural Language Processing
Women’s E-Commerce Clothing Reviews on Kaggle
Dataset link:
https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
Context
Welcome. This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.
Content
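This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review and includes the variables Clothing ID, Age, Title, Review Text, Rating, Recommended IND, Positive Feedback Count, Division Name, Department Name, and Class Name.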
Acknowledgements
Anonymous but real source
Inspiration
Nicapotato, the owner of the Women’s E-Commerce Clothing Reviews dataset, looks forward to seeing quality NLP work on it. There are also some great opportunities for feature engineering and multivariate analysis.
Publication
Usage Information
Maintainers
Updates
Expected update frequency: Not specified
Last updated: 2018-02-04
Date created: 2018-02-04
Current version: Version 1
Problem Framing
Ideal Outcome
Heuristics
Formulation of the problem
Text Cleaning
Convert text to lowercase
Tokenize each row
Remove stopwords
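The three cleaning steps above can be sketched as follows. This is a minimal illustration that uses only the Python standard library, not necessarily the original implementation: the regular-expression tokenizer, the small STOPWORDS set, and the sample reviews are assumptions made for the example, and a fuller stopword list (for instance NLTK's English list) would normally be used. The names tokenized and tokenized_unlist mirror the column names in the final dataset; treating tokenized_unlist as the tokens joined back into one string is also an assumption.

```python
import re

# Illustrative stopword subset (an assumption, not the list used in the source).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "i", "in", "so", "was"}

def tokenize(text):
    """Lowercase the text, then split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

def clean_review(text):
    """Apply all three steps to one review: lowercase, tokenize, remove stopwords."""
    return remove_stopwords(tokenize(text))

# Hypothetical sample reviews standing in for the 'Review Text' column.
reviews = [
    "This dress is SO pretty and it fits perfectly!",
    "Love the fabric, runs small.",
]

tokenized = [clean_review(r) for r in reviews]
tokenized_unlist = [" ".join(tokens) for tokens in tokenized]
```

With a pandas DataFrame, the same function could be applied row by row, for example df["tokenized"] = df["Review Text"].astype(str).apply(clean_review).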
Preprocessing for Sentiment Analysis
Applying Model, Variable Creation
Converting the 0-to-1 Decimal Score to a Categorical Variable
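One way to turn a decimal score into a categorical label is to compare it against two cutoffs. The sketch below is illustrative only: the source does not state which cutoffs or category names were used, so the function name score_to_category, the default cutoffs 0.4 and 0.6, and the three labels are all hypothetical.

```python
def score_to_category(score, lower=0.4, upper=0.6):
    """Map a decimal score to a sentiment label.

    The cutoffs are hypothetical placeholders; choose values that
    suit the scoring model actually in use.
    """
    if score >= upper:
        return "Positive"
    if score <= lower:
        return "Negative"
    return "Neutral"

# Hypothetical scores standing in for a score column.
scores = [0.9, 0.5, 0.1]
sentiment = [score_to_category(s) for s in scores]
```

Keeping the cutoffs as parameters makes it easy to adjust the category boundaries without changing the function.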
The columns in the processed dataset:
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
'Recommended IND', 'Positive Feedback Count', 'Division Name',
'Department Name', 'Class Name', 'tokenized', 'Polarity Score',
'Neutral Score', 'Negative Score', 'Positive Score', 'Sentiment',
'tokenized_unlist', 'label'],
dtype='object')
In total, the processed dataset has 23486 rows and 19 feature variables.