Sentiment Classification of Hotel Reviews Posted on Yelp
The dataset contains 1,000 reviews of hotels downloaded from Yelp. The sentiment labels used for this project are:
1: The review indicates a positive opinion toward the hotel
0: The review indicates a mixed or neutral opinion toward the hotel
-1: The review indicates a negative opinion toward the hotel
Setting up the Classifier
After reading in the data and partitioning it into training and test sets (90% train, 10% test), I convert the raw text into feature vectors. Throughout this project I am using scikit-learn. The code to the right defines the feature extraction function, features. The function lowercases the text and collapses any run of repeated characters (e.g., "woooow" and "wooooow" both map to "woow"). Non-alphanumeric characters are replaced with whitespace, and the strings separated by whitespace are treated as tokens.
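Since the side-panel code is not reproduced here, a minimal sketch of such a tokenizer follows; the specific regular expressions are my assumption, not necessarily the original implementation:

```python
import re

def features(text):
    """Lowercase, collapse repeated characters, and tokenize a review."""
    text = text.lower()
    # Collapse any run of 3+ identical characters down to 2,
    # so "woooow" and "wooooow" both become "woow"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Replace non-alphanumeric characters with whitespace, then split
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.split()
```

For example, `features("Woooow, great hotel!!")` returns `["woow", "great", "hotel"]`.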

Classifier and Hyperparameter Tuning
I extracted features from all training instances and converted them into feature vectors to be used by the library. I then used sklearn's LogisticRegression class to run a multinomial logistic regression. I set the C values to compare (0.01, 0.1, 1.0) and performed 3-fold cross-validation with the classifier and those parameter settings.
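A sketch of that tuning step, using a tiny made-up corpus in place of the real 1,000 reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny made-up corpus standing in for the hotel reviews (3 per class)
texts = [
    "great hotel, loved it", "wonderful staff and clean rooms",
    "amazing view, would return", "terrible room, never again",
    "dirty and noisy all night", "rude staff, awful stay",
    "it was okay, nothing special", "average experience overall",
    "fine but a bit pricey",
]
labels = [1, 1, 1, -1, -1, -1, 0, 0, 0]

# Bag-of-words feature vectors
X = CountVectorizer().fit_transform(texts)

# 3-fold cross-validation over the candidate C values
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1.0]}, cv=3)
grid.fit(X, labels)
print(grid.best_params_, grid.best_score_)
```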

Experimenting with N-gram size
I decided to experiment with the n-gram sizes to find the best validation accuracy. I used six different ranges of n-gram sizes: (1, 1), (2, 2), (3, 3), (1, 2), (1, 3), and (2, 3). For each range, I calculated the cross-validation accuracy, using GridSearchCV to find the best C value.
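A sketch of the n-gram sweep, again with a toy corpus in place of the real reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy corpus standing in for the hotel reviews (3 per class)
texts = [
    "great hotel, loved it", "wonderful staff and clean rooms",
    "amazing view, would return", "terrible room, never again",
    "dirty and noisy all night", "rude staff, awful stay",
    "it was okay, nothing special", "average experience overall",
    "fine but a bit pricey",
]
labels = [1, 1, 1, -1, -1, -1, 0, 0, 0]

# For each n-gram range, record the best cross-validation accuracy over C
scores = {}
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3), (2, 3)]:
    X = CountVectorizer(ngram_range=ngram_range).fit_transform(texts)
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [0.01, 0.1, 1.0]}, cv=3)
    grid.fit(X, labels)
    scores[ngram_range] = grid.best_score_

best_range = max(scores, key=scores.get)
print(best_range, scores[best_range])
```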
Output:


Feature Selection
I want to improve the classifier's efficiency and prevent overfitting, so now let's experiment with different levels of feature selection to find the best one. I create a SelectPercentile object, which performs a chi-squared test to measure feature significance. If percentile = 1, only the top 1% of features are selected; if percentile = 100, all features are selected. Using the best n-gram range from above, I then use the fit and transform functions to modify the vectors so that only the selected features are kept. I tested the percentile values [1, 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100].
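A sketch of one selection level (percentile = 10), using a toy corpus; in the actual experiment this would be wrapped in a loop over all the percentile values:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# Toy corpus standing in for the hotel reviews
texts = [
    "great hotel, loved it", "wonderful staff and clean rooms",
    "amazing view, would return", "terrible room, never again",
    "dirty and noisy all night", "rude staff, awful stay",
    "it was okay, nothing special", "average experience overall",
    "fine but a bit pricey",
]
labels = [1, 1, 1, -1, -1, -1, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

# Keep only the top 10% of features by chi-squared score
selector = SelectPercentile(chi2, percentile=10)
X_sel = selector.fit_transform(X, labels)
print(X.shape[1], "->", X_sel.shape[1])
```

Chi-squared selection requires non-negative feature values, which raw token counts satisfy.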
Output:


Feature Engineering
To add new features, I created a new feature extraction function. It builds on the features function from above, with additional code that appends new features to the array. There are three feature types:
- Skip-Grams: word tokens where only the first and last words are specified, and any word in between is replaced with a placeholder (*).
- Word Pairs: encoding combinations of words, indicating whether two words are both present in a text. For example, "The water is cold" contains 6 word pairs: (the, water), (the, is), (the, cold), etc.
- Sentiment Dictionary: using a sentiment lexicon created by Bing Liu, which contains thousands of words labeled with POS and NEG.
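The three feature types can be sketched as follows; the lexicon here is a tiny hand-picked stand-in for Bing Liu's full word lists, and the feature string formats are my own illustration:

```python
from itertools import combinations

def extra_features(tokens, pos_words, neg_words):
    """Append skip-grams, word pairs, and lexicon counts to the base tokens."""
    feats = list(tokens)
    # Skip-grams: first and last word of each 3-token window, middle masked by *
    for i in range(len(tokens) - 2):
        feats.append(f"{tokens[i]}_*_{tokens[i + 2]}")
    # Word pairs: every unordered pair of distinct words in the text
    for a, b in combinations(sorted(set(tokens)), 2):
        feats.append(f"pair({a},{b})")
    # Sentiment lexicon: counts of POS- and NEG-labeled words
    feats.append(f"POS={sum(t in pos_words for t in tokens)}")
    feats.append(f"NEG={sum(t in neg_words for t in tokens)}")
    return feats

# Tiny stand-in for Bing Liu's lexicon
pos_words = {"great", "clean", "wonderful"}
neg_words = {"cold", "dirty", "terrible"}
print(extra_features(["the", "water", "is", "cold"], pos_words, neg_words))
```

On "the water is cold" this yields the 6 word pairs mentioned above, plus the skip-grams the_*_is and water_*_cold and the lexicon counts.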
To the right is the set-up code for my experiments with actual text strings.
Output:


Application
Let's test our classifier with some example strings! I then iterate over the strings to print the extracted features and predictions.
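A sketch of that application step as a small pipeline; the training corpus and example strings below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training corpus standing in for the hotel reviews
train_texts = [
    "great hotel, loved it", "wonderful staff and clean rooms",
    "amazing view, would return", "terrible room, never again",
    "dirty and noisy all night", "rude staff, awful stay",
    "it was okay, nothing special", "average experience overall",
    "fine but a bit pricey",
]
train_labels = [1, 1, 1, -1, -1, -1, 0, 0, 0]

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# Iterate over example strings and print the predicted label
examples = [
    "The staff was wonderful and the rooms were clean",
    "Dirty, noisy, and the staff was rude",
]
for text in examples:
    print(text, "->", clf.predict([text])[0])
```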
Positive Sentiment Output:

Negative Sentiment Output:

