# Effect of Normalization on Classification Potential

Authors: Ben Shealy, Cole Younginer

In this notebook, we'll explore how the __normalization__ of your dataset affects __classification potential__ (the ability of a classifier to distinguish between the different classes in the data). For background information on the data we'll be working with, refer to the Tumor Classification notebook.

## Getting Started

In this notebook we're going to work with RNA expression data derived from kidney tumor samples. This data is taken from __The Cancer Genome Atlas (TCGA)__ project, which contains RNA sequences for a wide array of cancers. Our dataset contains samples from five types of cancer and is available on our [Box folder](https://clemson.app.box.com/folder/11145145746).

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.metrics
import sklearn.model_selection
import sklearn.neural_network
import sklearn.pipeline
import sklearn.preprocessing

## Loading the Data

The dataset consists of two files:

- `tcga-5.fpkm.txt`: expression matrix with genes as rows and samples as columns
- `tcga-5.labels.txt`: label file containing tumor type label for each sample

We can load each of these files easily as pandas dataframes:

In [None]:
# load dataframe
X = pd.read_csv("tcga-5.fpkm.txt", index_col=0, sep="\t")
y = pd.read_csv("tcga-5.labels.txt", sep="\t", header=None)

# transpose data, fill missing values
X = X.T
X = X.fillna(X.min().min())

# select a subset of genes
n_genes = 1000

genes = np.random.choice(len(X.columns), n_genes, replace=False)
X = X.iloc[:, genes]

# convert labels to numerical encoding
# (flatten the single-column dataframe to a 1-D array, which LabelEncoder expects)
le = sklearn.preprocessing.LabelEncoder()
y = le.fit_transform(y.values.ravel())

classes = le.classes_

print(X.shape, y.shape)

In [None]:
X

In [None]:
y
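
To make the label encoding step concrete, here is a minimal sketch of what `LabelEncoder` does. The label strings are made up for illustration and are not the tumor types in `tcga-5.labels.txt`.

```python
import numpy as np
import sklearn.preprocessing

# hypothetical class labels, used only for illustration
toy_labels = np.array(["type_b", "type_a", "type_b", "type_c", "type_a"])

toy_le = sklearn.preprocessing.LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the sorted unique labels; each integer code is an index into it
print(toy_le.classes_)
print(toy_encoded)

# inverse_transform maps the integer codes back to the original strings
toy_decoded = toy_le.inverse_transform(toy_encoded)
print(toy_decoded)
```

Keeping `le.classes_` around, as the loading cell does, is what lets us label the axes of a confusion matrix with the original class names later on.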

## Selecting a Classifier

An important aspect of this experiment is the classifier that we use. We could experiment with a variety of classifiers, but since our focus here is the effect of normalization, we'll stick to one classifier for now. Let's use a basic three-layer neural network.

In [None]:
def evaluate(clf, X, y, classes):
    # perform train/test split
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.3)

    # train classifier
    clf.fit(X_train, y_train)

    # compute predicted labels for test set
    y_pred = clf.predict(X_test)

    # compute accuracy score
    score = sklearn.metrics.accuracy_score(y_test, y_pred)

    # create a confusion matrix from the class predictions
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_pred)
    
    sns.heatmap(cnf_matrix, annot=True, fmt="d", cbar=False, square=True, xticklabels=classes, yticklabels=classes)
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.title("Confusion Matrix")
    plt.show()
    
    return score
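
Because `evaluate` relies on a single random train/test split, the score it returns can vary from run to run. One possible way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` and a deliberately small network so that it runs quickly; the variable names and settings are assumptions for illustration, not part of the original experiment.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neural_network

# synthetic stand-in for the expression matrix (not the TCGA data)
X_cv, y_cv = sklearn.datasets.make_classification(
    n_samples=150, n_features=10, n_informative=5, n_classes=3, random_state=0)

# small network so the example finishes quickly
clf_cv = sklearn.neural_network.MLPClassifier(
    hidden_layer_sizes=(16,), max_iter=500, random_state=0)

# stratified folds keep the class proportions similar in every split
cv = sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = sklearn.model_selection.cross_val_score(
    clf_cv, X_cv, y_cv, cv=cv, scoring="accuracy")

# report the mean accuracy and its spread across folds
print("accuracy: %0.3f +/- %0.3f" % (cv_scores.mean(), cv_scores.std()))
```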

## Normalization Methods
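
The cell below compares several scikit-learn scalers. As a rough illustration of how some of them differ, this sketch applies a few to a single made-up column containing one large outlier. The values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import sklearn.preprocessing

# toy "expression" column with one large outlier (values are made up)
x_toy = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])

# min-max: rescale to [0, 1] using the column minimum and maximum
x_minmax = sklearn.preprocessing.MinMaxScaler().fit_transform(x_toy)

# standard: subtract the mean, divide by the standard deviation
x_standard = sklearn.preprocessing.StandardScaler().fit_transform(x_toy)

# robust: subtract the median, divide by the interquartile range
x_robust = sklearn.preprocessing.RobustScaler().fit_transform(x_toy)

# log2 with a +1 offset so that zeros stay finite
x_log2 = np.log2(x_toy + 1)

for name, values in [("minmax", x_minmax), ("standard", x_standard),
                     ("robust", x_robust), ("log2", x_log2)]:
    print(name, np.round(values.ravel(), 3))
```

Note how the outlier squeezes the other min-max values toward zero, while the robust and log2 versions keep them spread out.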

In [None]:
scalers = [
    # "passthrough" tells the pipeline to skip this step (no normalization)
    ("passthrough", "passthrough"),
    # log2(x + 1) avoids -inf for zero values (assumes non-negative expression values)
    ("log2", sklearn.preprocessing.FunctionTransformer(func=lambda x: np.log2(x + 1))),
    ("maxabs", sklearn.preprocessing.MaxAbsScaler()),
    ("minmax", sklearn.preprocessing.MinMaxScaler()),
    ("quantile", sklearn.preprocessing.QuantileTransformer(output_distribution="normal")),
    ("robust", sklearn.preprocessing.RobustScaler()),
    ("standard", sklearn.preprocessing.StandardScaler())
]

classifiers = [
    ("mlp", sklearn.neural_network.MLPClassifier(solver="adam", alpha=1e-4, hidden_layer_sizes=(256,)))
]

for scaler_name, scaler in scalers:
    for clf_name, clf in classifiers:
        # chain the scaler and classifier so the scaler is fit on the training split only
        pipeline = sklearn.pipeline.Pipeline([
            (scaler_name, scaler),
            (clf_name, clf)
        ])

        # evaluate the full pipeline, not the bare classifier
        score = evaluate(pipeline, X, y, classes)
        print("%s, %s: %0.3f" % (scaler_name, clf_name, score))
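
As a minimal sketch of how a scaler and a classifier combine inside a `Pipeline`, the example below uses synthetic data and a small network. Calling `fit` on the pipeline fits the scaler on the training split only, and `score` reuses that fitted scaling on the test split. All names and settings here are assumptions for illustration.

```python
import numpy as np
import sklearn.datasets
import sklearn.model_selection
import sklearn.neural_network
import sklearn.pipeline
import sklearn.preprocessing

# synthetic stand-in for the expression matrix (not the TCGA data)
X_demo, y_demo = sklearn.datasets.make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0)

X_tr, X_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

demo_pipeline = sklearn.pipeline.Pipeline([
    ("standard", sklearn.preprocessing.StandardScaler()),
    ("mlp", sklearn.neural_network.MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
])

# fit() learns the scaling statistics from the training split only
demo_pipeline.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test split before predicting
demo_score = demo_pipeline.score(X_te, y_te)
print("standard, mlp: %0.3f" % demo_score)

# the fitted scaler's per-feature means come from the training split, not the full data
fitted_scaler = demo_pipeline.named_steps["standard"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
```

Passing the pipeline itself, rather than the bare classifier, to a helper such as `evaluate` is what ensures the normalization step is part of each experiment.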