# Sentiment Analysis for Movie Reviews

Authors: Anthony Rizzo, Ben Shealy

In this notebook, we try to predict the sentiment of online movie reviews using __natural language processing (NLP)__. We use a dataset of Rotten Tomatoes reviews that a Reddit user created by web scraping. The dataset can be downloaded manually here:

https://www.reddit.com/r/MachineLearning/comments/b5idqk/p_dataset_480000_rotten_tomatoes_reviews_for_nlp/

You will need to install these additional packages in your conda environment:
```
conda install -y nltk tqdm
```

This project is a work in progress, so anyone is welcome to pick it up and try to improve the results!

In [None]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import re
import sklearn
import sklearn.feature_extraction.text
import sklearn.linear_model
import sklearn.model_selection
import sklearn.svm
import tqdm

nltk.download("punkt")
nltk.download("punkt_tab")  # tokenizer data required by newer NLTK releases
nltk.download("stopwords")
nltk.download("wordnet")

## Load the Data

In [None]:
# loading the Rotten Tomatoes dataset as a pandas DataFrame
movies_df = pd.read_csv("rotten_tomatoes_reviews.csv")

In [None]:
# make sure there are no nulls in the data
movies_df = movies_df[~movies_df.Freshness.isnull() & ~movies_df.Review.isnull()]

In [None]:
# show a preview of the DataFrame
movies_df.head()

In [None]:
# get a full count of words (and distinct words) in the dataset
n_words = 0
unique_words = set()
for review in movies_df.Review:
    words = review.split()
    n_words += len(words)
    unique_words.update(words)

print("Number of words: %d" % n_words)
print("Number of unique words: %d" % len(unique_words))
print("Number of reviews: %d" % len(movies_df))
print("Average number of words per review: %d" % (n_words // len(movies_df)))

In [None]:
# use X and y notation for data and labels
X = movies_df.Review
y = movies_df.Freshness

## Clean the Data

The way you prepare your data can have a huge effect on the performance of your machine learning models. We've already removed missing values from the dataset, which is a basic requirement for most machine learning tasks. Since we are working with text data, another basic step is to remove punctuation and convert all text to lower-case.

In [None]:
# use regular expressions to remove punctuation and convert to lower-case
# (raw strings keep the regex escapes from being read as string escapes)
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    return reviews

X_cleaned = clean_reviews(X)

In [None]:
# before cleaning
X.iloc[1]

In [None]:
# after cleaning
X_cleaned[1]
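
To make the two substitution passes concrete, here is a minimal sketch that applies the same patterns to a single made-up sentence. The helper `clean_text` and the sample string are illustrative assumptions and do not come from the dataset.

```python
import re

# same character classes as the notebook's patterns, written as raw strings
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean_text(text):
    # hypothetical single-string helper that mirrors clean_reviews
    text = REPLACE_NO_SPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(" ", text)

sample = "A well-made film; don't miss it!"  # made-up example sentence
cleaned = clean_text(sample)
print(cleaned)  # a well made film dont miss it
```

Note that apostrophes are deleted rather than replaced, so "don't" becomes the single token "dont", while hyphenated words are split into separate tokens.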

In [None]:
unique_cleaned = {word for review in X_cleaned for word in review.split()}
print("Number of unique words after cleaning: %d" % len(unique_cleaned))

## Initial Evaluation

Now that we've cleaned the reviews, we can evaluate a basic model just to see what accuracy we can achieve without any further processing.

In [None]:
# convert reviews to a matrix of token counts
count_vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=True)
X_cv = count_vectorizer.fit_transform(X_cleaned)

In [None]:
# split dataset into train and test sets
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_cv, y, test_size=0.25)

# test initial feasibility of classification with logistic regression
# also vary the regularization strength (C)
best_clf = None
best_score = 0

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    # train model
    clf = sklearn.linear_model.LogisticRegression(C=c, solver="lbfgs", n_jobs=-1)
    clf.fit(X_train, y_train)

    # evaluate model
    score = clf.score(X_test, y_test)

    # save best model
    if best_score < score:
        best_clf = clf
        best_score = score
    
    # print model score
    print("Accuracy for C=%0.2f: %0.3f" % (c, score))
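
The loop above selects the best `C` on the same test set that reports the accuracy, and each experiment draws a new random split, so the scores can be slightly optimistic and are not strictly comparable between sections. One possible alternative, sketched here on a tiny made-up dataset, is to fix the split with `random_state` and tune `C` by cross-validation on the training portion only. The toy reviews, labels, and step names (`vec`, `clf`) are assumptions, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# toy stand-in data (hypothetical); in the notebook this would be X_cleaned and y
toy_reviews = [
    "great fun", "a wonderful film", "loved every minute",
    "brilliant and moving", "truly great acting", "wonderful and fun",
    "dull and boring", "a terrible mess", "hated every minute",
    "boring and slow", "truly terrible acting", "dull and slow",
]
toy_labels = [1] * 6 + [0] * 6

# fixed, stratified split so repeated runs see the same train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# the vectorizer lives inside the pipeline, so it is fit on training folds only
pipeline = Pipeline([
    ("vec", CountVectorizer(binary=True)),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# same grid of C values as the loop above, chosen by 3-fold cross-validation
grid = GridSearchCV(pipeline, {"clf__C": [0.01, 0.05, 0.25, 0.5, 1]}, cv=3)
grid.fit(X_tr, y_tr)

test_score = grid.score(X_te, y_te)  # the test set is used once, at the end
print("Best C:", grid.best_params_["clf__C"])
print("Test accuracy: %0.3f" % test_score)
```

On the full dataset this costs several fits per value of `C`, so it is slower than the single split used in this notebook.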

In [None]:
# identify the most informative features in the best model
def show_most_informative_features(vectorizer, clf, n=10):
    # get_feature_names() was removed in scikit-learn 1.2; use the _out variant
    feature_names = vectorizer.get_feature_names_out()
    features = sorted(zip(clf.coef_[0], feature_names))
    top = zip(features[:n], features[:-(n + 1):-1])

    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("  %.4f %-15s %.4f %-15s" % (coef_1, fn_1, coef_2, fn_2))

print("Most Informative Features (initial):\n")
show_most_informative_features(count_vectorizer, best_clf)

In [None]:
# tokenize the cleaned reviews
freq = nltk.FreqDist()

for review in tqdm.tqdm(X_cleaned):
    for word in nltk.tokenize.word_tokenize(review):
        freq[word] += 1

In [None]:
# plot frequency distribution of top 20 words
freq.plot(20, cumulative=False)
freq.pprint(20)

## Remove Stop Words

In [None]:
# remove stop words from the dataset
stop_words = set(["the", "a", "and", "of", "to", "is", "in", "its", "it", "that", "but", "as", "with", "this", "for", "an", "on", "be"])

def remove_stopwords(reviews):
    return [" ".join([w for w in review.split() if w not in stop_words]) for review in tqdm.tqdm(reviews)]

X_sw = remove_stopwords(X_cleaned)

print("Before removing stop words:", X_cleaned[1])
print("After removing stop words:", X_sw[1])
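
The hand-picked list above leaves negation words in place, which matters for sentiment ("not good" versus "good"). As a hedged comparison, the sketch below checks a few negation words against the notebook's list and against scikit-learn's built-in English stop word list. The `negations` set is an illustrative assumption.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# copy of the notebook's hand-picked stop word list
custom_stop_words = {
    "the", "a", "and", "of", "to", "is", "in", "its", "it", "that",
    "but", "as", "with", "this", "for", "an", "on", "be",
}

# a few negation words to check (illustrative choice)
negations = {"not", "no", "never", "nor"}

removed_by_sklearn = sorted(negations & set(ENGLISH_STOP_WORDS))
removed_by_custom = sorted(negations & custom_stop_words)

print("Negations removed by scikit-learn's list:", removed_by_sklearn)
print("Negations removed by the custom list:", removed_by_custom)
```

A generic stop word list may therefore discard words that carry sentiment, so swapping it in for the short custom list is not guaranteed to help.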

In [None]:
# tokenize the reviews with stop words removed
freq = nltk.FreqDist()

for review in tqdm.tqdm(X_sw):
    for word in nltk.tokenize.word_tokenize(review):
        freq[word] += 1

freq.plot(20, cumulative=False)

In [None]:
# convert dataset to token counts again
count_vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=True)
X_sw_cv = count_vectorizer.fit_transform(X_sw)

In [None]:
# evaluate a logistic regression model again
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_sw_cv, y, test_size=0.25)

best_clf = None
best_score = 0

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    clf = sklearn.linear_model.LogisticRegression(C=c, solver="lbfgs", n_jobs=-1)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    if best_score < score:
        best_clf = clf
        best_score = score
    
    print("Accuracy for C=%0.2f: %0.3f" % (c, score))

## Lemmatization

In [None]:
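# normalize different word forms into one using lemmatization with nltk
# note: lemmatize() treats each word as a noun by default (pos="n"),
# so most verb and adjective forms are left unchanged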
def lemmatize_text(reviews):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [" ".join([lemmatizer.lemmatize(word) for word in review.split()]) for review in tqdm.tqdm(reviews)]

X_sw_lm = lemmatize_text(X_sw)

In [None]:
X_sw_lm[1]

In [None]:
count_vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=True)
X_sw_lm_cv = count_vectorizer.fit_transform(X_sw_lm)

In [None]:
# evaluate a logistic regression model again
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_sw_lm_cv, y, test_size=0.25)

best_clf = None
best_score = 0

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    clf = sklearn.linear_model.LogisticRegression(C=c, solver="lbfgs", n_jobs=-1)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    if best_score < score:
        best_clf = clf
        best_score = score
    
    print("Accuracy for C=%0.2f: %0.3f" % (c, score))

In [None]:
print("Most Informative Features (stop words, lemmatization):\n")

show_most_informative_features(count_vectorizer, best_clf)

## N-grams

In [None]:
# use n-grams to also count 2-word sequences
ngram_vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=True, ngram_range=(1,2))
X_sw_lm_cv2 = ngram_vectorizer.fit_transform(X_sw_lm)

In [None]:
# evaluate a logistic regression model again
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_sw_lm_cv2, y, test_size=0.25)

best_clf = None
best_score = 0

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    clf = sklearn.linear_model.LogisticRegression(C=c, solver="lbfgs", n_jobs=-1)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    if best_score < score:
        best_clf = clf
        best_score = score
    
    print("Accuracy for C=%0.2f: %0.3f" % (c, score))

In [None]:
print("Most Informative Features (stop words, lemmatization, 2-grams):\n")
show_most_informative_features(ngram_vectorizer, best_clf)

In [None]:
# now include 2-word and 3-word sequences
ngram_vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=True, ngram_range=(1,3))
X_sw_lm_cv3 = ngram_vectorizer.fit_transform(X_sw_lm)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_sw_lm_cv3, y, test_size=0.25)

best_clf = None
best_score = 0

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    clf = sklearn.linear_model.LogisticRegression(C=c, solver="lbfgs", n_jobs=-1)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    if best_score < score:
        best_clf = clf
        best_score = score
    
    print("Accuracy for C=%0.2f: %0.3f" % (c, score))

In [None]:
print("Most Informative Features (stop words, lemmatization, 3-grams):\n")
show_most_informative_features(ngram_vectorizer, best_clf)

## Final Model: stop words, lemmatization, n-grams, SVM

In [None]:
# now use SVM instead of logistic regression
ngram_vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=True, ngram_range=(1,3))
X_sw_lm_cv3 = ngram_vectorizer.fit_transform(X_sw_lm)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_sw_lm_cv3, y, test_size=0.25)

best_clf = None
best_score = 0

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    clf = sklearn.svm.LinearSVC(C=c)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    if best_score < score:
        best_clf = clf
        best_score = score
    
    print("Accuracy for C=%0.2f: %0.3f" % (c, score))

In [None]:
print("Most Informative Features (stop words, lemmatization, 3-grams, SVM):\n")
show_most_informative_features(ngram_vectorizer, best_clf)