# Respiratory Disease Classification

Authors: Zice Wei, Ben Shealy

In this notebook we demonstrate how to classify respiratory diseases using the [Respiratory Sound Database](https://www.kaggle.com/vbookshelf/respiratory-sound-database) from Kaggle. This dataset contains audio recordings of patients' respiratory (breathing) sounds, and each recording is annotated with the patient's respiratory disease and other metadata such as age, gender, height, and weight.

To perform the classification, we will create a hybrid neural network model which has a CNN branch to process the audio data and an MLP branch to process the metadata. The features from the two branches are concatenated and passed to a dense output layer to form the full network.

## Getting Started

In [None]:
import IPython
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow import keras
from tensorflow.keras.layers import concatenate, Conv2D, Dense, Dropout, Flatten, GlobalAveragePooling2D, Input, MaxPooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

## Load Metadata

The disease labels and metadata are in two separate files, so we will load them and combine them into one dataframe. The metadata records BMI for adults but only height and weight for children, and many samples are missing some of these fields. Since BMI can be computed from height and weight, we will compute a single BMI value for each sample where possible and then discard height and weight.
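
As a minimal sketch of the BMI step on made-up values (the three rows below are illustrative and not taken from the dataset), and assuming weight is recorded in kilograms and height in centimetres, the fallback from a recorded BMI to a computed one looks like this:

```python
import numpy as np
import pandas as pd

# toy rows (made-up values): an adult with a recorded BMI, a child with only
# weight and height, and a sample with nothing recorded
toy = pd.DataFrame({
    'BMI for Adults': [24.0, np.nan, np.nan],
    'Weight (Children)': [np.nan, 20.0, np.nan],   # assumed to be in kg
    'Height (Children)': [np.nan, 100.0, np.nan],  # assumed to be in cm
})

# BMI = weight [kg] / (height [m])^2; with height in cm this is the same as
# weight / height^2 * 10000
bmi_children = toy['Weight (Children)'] / toy['Height (Children)'] ** 2 * 10000

# prefer the recorded adult BMI and fall back to the computed child BMI
toy['BMI'] = toy['BMI for Adults'].combine_first(bmi_children)
```

The third toy row has neither a recorded BMI nor height and weight, so its BMI stays missing; such samples have to be handled separately.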

In [None]:
# read labels and metadata
df_labels = pd.read_csv(
    'Respiratory_Sound_Database/Respiratory_Sound_Database/patient_diagnosis.csv',
    names=['Patient No', 'Disease']
)
df_metadata = pd.read_csv(
    'demographic_info.txt',
    sep=' ',
    names=['Patient No', 'Age', 'Gender', 'BMI for Adults', 'Weight (Children)', 'Height (Children)']
)

# append labels to metadata
df_metadata['Disease'] = df_labels['Disease']

# compute BMI for all samples
df_metadata['BMI for Children'] = (df_metadata['Weight (Children)'] / (df_metadata['Height (Children)'] ** 2)) * 10000
df_metadata['BMI'] = df_metadata['BMI for Adults'].combine_first(df_metadata['BMI for Children'])

# remove unused columns
df_metadata.drop(['Weight (Children)', 'Height (Children)', 'BMI for Adults', 'BMI for Children'], axis=1, inplace=True)

In [None]:
df_metadata

## Load Audio Metadata

For now we will simply get all of the audio filenames and append them to the metadata.
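
The following sketch illustrates the idea on hypothetical filenames and ages. Its only assumption about the naming scheme is that the patient number is the text before the first underscore. Because a patient can have several recordings, joining the filenames onto the per-patient metadata yields one row per audio file rather than one row per patient.

```python
import pandas as pd

# hypothetical filenames; the patient number is assumed to be the text
# before the first underscore
toy_files = ['101_a.wav', '101_b.wav', '102_a.wav']
toy_audio = pd.DataFrame({
    'Patient No': [int(f.split('_')[0]) for f in toy_files],
    'filename': toy_files,
}).set_index('Patient No')

# hypothetical per-patient metadata, including a patient with no audio file
toy_meta = pd.DataFrame(
    {'Age': [70.0, 3.0, 45.0]},
    index=pd.Index([101, 102, 103], name='Patient No'),
)

# left join on the index: one output row per (patient, audio file) pair
toy_joined = toy_meta.join(toy_audio)
```

Patient 101 appears twice in `toy_joined`, and patient 103 is kept with a missing filename because the join keeps every metadata row.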

In [None]:
# get parent directory of audio files
audio_path = 'Respiratory_Sound_Database/Respiratory_Sound_Database/audio_and_txt_files/'

# get filenames of audio (wav) files
filenames = [f for f in os.listdir(audio_path) if (os.path.isfile(os.path.join(audio_path, f)) and f.endswith('.wav'))]

# extract patient number from each filename
patient_ids = [int(f.split('_')[0]) for f in filenames]

# create dataframe of audio metadata
df_audio = pd.DataFrame({
    'Patient No': patient_ids,
    'filename': filenames
})

In [None]:
df_audio

In [None]:
df_metadata.set_index('Patient No', inplace=True)
df_audio.set_index('Patient No', inplace=True)

df_metadata = df_metadata.join(df_audio, on='Patient No')

In [None]:
df_metadata

## Remove Missing Samples

Even after filling in the missing BMI values and appending the audio filenames, some samples are still missing several metadata fields.

In [None]:
print(df_metadata.isnull().sum())

Some of these samples have an audio file but almost no metadata, so we discard every row that has fewer than three non-null fields.
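
To make the effect of the `thresh` argument concrete, here is a small sketch on made-up rows. `dropna(thresh=3)` keeps a row only if it has at least three non-null values, so a row that is missing just its BMI survives while a row with only a label and a filename is removed.

```python
import numpy as np
import pandas as pd

# made-up rows with the same columns as the combined metadata
toy = pd.DataFrame({
    'Age': [70.0, np.nan, 45.0],
    'Gender': ['F', np.nan, 'M'],
    'Disease': ['COPD', 'COPD', 'Healthy'],
    'BMI': [24.0, np.nan, np.nan],
    'filename': ['a.wav', 'b.wav', 'c.wav'],
})

# row 0 has 5 non-null values, row 1 has 2, row 2 has 4
kept = toy.dropna(thresh=3)
```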

In [None]:
df_metadata.dropna(thresh=3, inplace=True)

In [None]:
print(df_metadata.isnull().sum())

Now the only remaining problem is samples without a BMI value. To handle these cases we will attempt to estimate the BMI value as the mean BMI of similar samples (same gender and disease, age within 5 years). If we cannot find at least three similar samples then we will simply discard the incomplete sample.

In [None]:
# the index (Patient No) is not unique because a patient can have several
# audio files, so visit each affected patient only once
indices = df_metadata[df_metadata['BMI'].isnull()].index.unique()

for index in indices:
    # look up by label rather than position; all rows of a patient share the same metadata
    row = df_metadata.loc[[index]].iloc[0]

    # attempt to find similar samples by gender, disease, and age
    similar_samples = df_metadata[
        (df_metadata['Gender'] == row['Gender'])
        & (df_metadata['Disease'] == row['Disease'])
        & (row['Age'] - 5 <= df_metadata['Age'])
        & (df_metadata['Age'] <= row['Age'] + 5)
        & ~df_metadata['BMI'].isnull()
    ]

    # estimate missing BMI value if at least 3 similar samples are found
    if len(similar_samples.index) >= 3:
        df_metadata.loc[index, 'BMI'] = similar_samples['BMI'].mean()

    # otherwise discard the sample
    else:
        df_metadata.drop(index, inplace=True)

In [None]:
df_metadata

## Visualize Metadata

We have loaded the metadata and filtered out samples with missing values, which means we now have the samples that we will use in our classification models. Now let's take a moment to visualize some properties of our dataset.

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(df_metadata['Age'], kde=True)
plt.show()

In [None]:
sns.countplot(x='Gender', data=df_metadata)
plt.show()

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(df_metadata['BMI'], kde=True)
plt.show()

In [None]:
sns.countplot(y='Disease', data=df_metadata)
plt.show()

## Load Audio Data

Now let's take a break from the metadata for a moment and load the audio data. An audio sample is a time series, but instead of the raw audio we will use the mel-frequency cepstral coefficients (MFCCs) of each sample. The MFCCs are a compact summary of the short-term frequency content of the signal and can be viewed as an image: each column holds the coefficients for a single time frame and each row corresponds to one coefficient. Since the MFCC matrix of an audio sample is like an image, we will ultimately use a CNN to learn from it. Each clip is truncated to 20 seconds and its MFCC matrix is zero-padded to a common width so that all inputs have the same shape.
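
The padding width can be reasoned about without loading any audio. The sketch below assumes librosa's default sample rate of 22050 Hz and default hop length of 512 samples with centered frames. The helper `n_frames` is introduced here only for illustration. Under those assumptions a 20-second clip produces 862 frames, and shorter clips are zero-padded along the time axis to that width.

```python
import numpy as np

# assumptions: 22050 Hz sample rate, hop length of 512 samples, centered frames
sample_rate = 22050
hop_length = 512
n_mfcc = 40

def n_frames(duration_seconds):
    """Number of frames for a clip of the given length (centered framing)."""
    n_samples = int(duration_seconds * sample_rate)
    return 1 + n_samples // hop_length

# the longest clip we load is 20 seconds
max_frames = n_frames(20)

# stand-in for the MFCC matrix of a shorter, 5-second clip
short = np.ones((n_mfcc, n_frames(5)))

# zero-pad along the time axis so that every sample has the same width
padded = np.pad(short, ((0, 0), (0, max_frames - short.shape[1])), mode='constant')
```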

In [None]:
max_pad_width = 862

X_mfcc = []
filenames = [os.path.join(audio_path, f) for f in df_metadata['filename']]

for filename in filenames:
    try:
        audio, sample_rate = librosa.load(filename, res_type='kaiser_fast', duration=20) 
        mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        pad_width = max_pad_width - mfcc.shape[1]
        mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')

    except Exception as e:
        print("Error encountered while parsing file: ", filename)
        mfcc = np.nan

    X_mfcc.append(mfcc)

# stack the samples into one array and add a trailing channel axis for the CNN
X_mfcc = np.array(X_mfcc)
X_mfcc = X_mfcc.reshape(X_mfcc.shape[0], X_mfcc.shape[1], X_mfcc.shape[2], 1)

## Visualize Audio Data

We'll write a function to plot the audio signal and corresponding MFCC spectrum of a given sample so that we can see what the MFCC looks like.

In [None]:
def plot_wav_mfcc(index):
    row = df_metadata.iloc[index]
    filename = filenames[index]
    # drop the trailing channel axis so that specshow receives a 2D array
    mfcc = X_mfcc[index, :, :, 0]
    
    plt.figure(figsize=(12, 6))
    plt.subplot(2, 1, 1)
    y, sr = librosa.load(filename, duration=20)
    # waveshow replaces waveplot, which was removed in librosa 0.10
    librosa.display.waveshow(y, sr=sr)
    plt.title('Patient No = %s, Disease = %s' % (row.name, row['Disease']))
    
    plt.subplot(2, 1, 2)
    librosa.display.specshow(mfcc, x_axis='time')
    plt.colorbar()
    plt.title('MFCC')

    plt.tight_layout()
    plt.show()

In [None]:
plot_wav_mfcc(100)

## Prepare Data for Training

Now there are a few more preprocessing steps that need to be done before we can start training our models with our dataset. The categorical features (gender, disease) need to be converted into numerical codes. The numerical features (age, BMI) need to be normalized to have roughly the same scale, but we need to split the dataset into train/test sets first so that we normalize the data based on the training set alone.
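
The reason for splitting before normalizing can be shown with a plain NumPy sketch of min-max scaling on made-up ages. This is an illustration of the idea, not the scaler used in the notebook. The range is learned from the training values only and then applied unchanged to the test values.

```python
import numpy as np

# made-up ages, already split into a training part and a test part
age_train = np.array([10.0, 30.0, 50.0])
age_test = np.array([20.0, 70.0])

# "fit": learn the range from the training set only
lo, hi = age_train.min(), age_train.max()

# "transform": apply the same range to both sets
age_train_scaled = (age_train - lo) / (hi - lo)
age_test_scaled = (age_test - lo) / (hi - lo)
```

The training values land in [0, 1], while a test value outside the training range (here 70) maps outside that interval. That is expected, since no information from the test set was used to choose the scale.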

In [None]:
# extract input features from metadata
# (copy so that later column assignments do not write into a view of df_metadata)
X_meta = df_metadata[["Age", "Gender", "BMI"]].copy()

In [None]:
# encode gender as a binary numerical feature
X_meta['Gender'] = X_meta['Gender'].map({'F': 0, 'M': 1})

In [None]:
# convert disease label to a one-hot encoding
y, class_names = pd.factorize(df_metadata['Disease'])
y = keras.utils.to_categorical(y) 

In [None]:
# create train/test sets for both metadata and mfcc data
X_meta_train, X_meta_test, X_mfcc_train, X_mfcc_test, y_train, y_test = train_test_split(X_meta, X_mfcc, y, test_size=0.25)

In [None]:
# normalize the numerical features to have the same scale
columns = ['Age', 'BMI']

scaler = MinMaxScaler()
scaler.fit(X_meta_train[columns])
X_meta_train.loc[:, columns] = scaler.transform(X_meta_train[columns])
X_meta_test.loc[:, columns] = scaler.transform(X_meta_test[columns])

In [None]:
X_meta_train

In [None]:
y_train

## Create Hybrid Neural Network
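
Before building the model it can help to trace the tensor sizes by hand. The sketch below is plain Python arithmetic, not Keras code. It assumes an input of 40 MFCC coefficients by 862 frames, two stages of a 2x2 convolution without padding followed by 2x2 max pooling with 16 and then 64 filters, and a 64-unit MLP branch. The helpers `conv_valid` and `pool` are introduced here only for illustration.

```python
def conv_valid(size, kernel):
    """Output length of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool(size, window):
    """Output length of non-overlapping pooling without padding."""
    return size // window

# assumed MFCC input: 40 coefficients x 862 frames
rows, cols = 40, 862

# two conv/pool stages with 16 and then 64 filters
for n_filters in (16, 64):
    rows, cols = conv_valid(rows, 2), conv_valid(cols, 2)
    rows, cols = pool(rows, 2), pool(cols, 2)

# length of the CNN branch after flattening
cnn_features = rows * cols * n_filters

# width of the last MLP layer
mlp_features = 64

# length of the concatenated feature vector
hybrid_features = cnn_features + mlp_features
```

Under these assumptions the flattened CNN branch is far wider than the MLP branch, so most of the weights in the output layer are connected to audio features.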

In [None]:
# create cnn branch (for mfcc data)
n_rows = X_mfcc.shape[1]
n_cols = X_mfcc.shape[2]
n_channels = X_mfcc.shape[3]
kernel_size = (2, 2)

cnn_inputs = Input(shape=(n_rows, n_cols, n_channels))

cnn_branch = Conv2D(filters=16, kernel_size=kernel_size, activation='relu')(cnn_inputs)
cnn_branch = MaxPooling2D(pool_size=2)(cnn_branch)
cnn_branch = Dropout(0.2)(cnn_branch)

cnn_branch = Conv2D(filters=64, kernel_size=kernel_size, activation='relu')(cnn_branch)
cnn_branch = MaxPooling2D(pool_size=2)(cnn_branch)
cnn_branch = Dropout(0.2)(cnn_branch)

cnn_branch = Flatten()(cnn_branch)

# enable this code to create a stand-alone cnn model
# (cnn_branch is already flattened, so it can feed the output layer directly)
# n_classes = len(class_names)
# cnn_outputs = Dense(n_classes, activation='softmax')(cnn_branch)
# cnn = Model(inputs=cnn_inputs, outputs=cnn_outputs)

In [None]:
# create mlp branch (for metadata)
n_classes = len(class_names)

mlp_inputs = Input(shape=(3,))
mlp_branch = Dense(units=64, activation="relu")(mlp_inputs)
mlp_branch = Dense(units=64, activation="relu")(mlp_branch)
mlp_branch = Dense(units=64, activation="relu")(mlp_branch)

# enable this code to create a stand-alone mlp model
# n_classes = len(class_names)
# mlp_outputs = Dense(units=n_classes, activation="softmax")(mlp_branch)
# mlp = Model(inputs=mlp_inputs, outputs=mlp_outputs)

In [None]:
# create the hybrid neural network model
n_classes = len(class_names)

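# both branches are already flat, so they can be concatenated directly
hnn = concatenate([cnn_branch, mlp_branch])
# softmax pairs with the categorical cross-entropy loss for one-hot labels
hnn_outputs = Dense(units=n_classes, activation='softmax')(hnn)
hnn = Model(inputs=[cnn_inputs, mlp_inputs], outputs=hnn_outputs)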

In [None]:
hnn.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [None]:
hnn.summary()

In [None]:
# create an image of the hybrid model
keras.utils.plot_model(hnn, to_file='hybrid_model.png')

IPython.display.Image('hybrid_model.png')

In [None]:
# train the model
history = hnn.fit(
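    # pass the metadata as a NumPy array so both inputs have the same container type
    [X_mfcc_train, X_meta_train.to_numpy(dtype='float32')],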
    y_train,
    batch_size=8,
    epochs=100,
    validation_split=0.1,
    verbose=1
)

In [None]:
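# plot the training accuracy
# (the history key is "accuracy" in TensorFlow 2.x and "acc" in older versions)
acc_key = "accuracy" if "accuracy" in history.history else "acc"
plt.plot(history.history[acc_key])
plt.plot(history.history["val_" + acc_key])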
plt.title("Training Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(["Training", "Validation"], loc="upper left")
plt.show()
    
# plot the training loss
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("Training Loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.legend(["Training", "Validation"], loc="upper left")
plt.show()

In [None]:
# evaluate the model
hnn.evaluate([X_mfcc_test, X_meta_test.to_numpy(dtype='float32')], y_test)