In this coursework we are going to be working with the Wine dataset. This is a 178-sample dataset that categorises 3 different types of Italian wine using 13 different features. The code below loads the Wine dataset and selects a subset of features for you to work with.
# set matplotlib backend to inline
%matplotlib inline
# import modules
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# load data
wine=datasets.load_wine()
#print(wine.DESCR)
# this dataset has 13 features, we will only choose a subset of these
df_wine = pd.DataFrame(wine.data, columns = wine.feature_names )
selected_features = ['alcohol','flavanoids','color_intensity','ash']
# extract the data as numpy arrays of features, X, and target, y
X = df_wine[selected_features].values
y = wine.target
The first part of tackling any ML problem is visualising the data in order to understand some of the properties of the problem at hand. When there are only a small number of classes and features, it is possible to use scatter plots to visualise interactions between different pairings of features.
The following image shows what such a visualisation might look like on the Iris dataset that you worked on during the Topic exercises.
Your first task is to recreate a similar grid for the Wine dataset, with each off-diagonal subplot showing the interaction between two features, and each of the classes represented as a different colour. The on-diagonal subplots (representing a single feature) should show a distribution (or histogram) for that feature.
You should create a function that, given data X and labels y, plots this grid. The function should be invoked something like this: myplotGrid(X,y,...)
where X is your training data and y are the labels (you may also supply additional optional arguments). You can use an appropriate library to help you create the visualisation. You might want to code it yourself using the matplotlib functions scatter and hist - however, this is not strictly necessary here, so try not to spend too much time on this.
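As one possible starting point, here is a minimal sketch of such a function built only on matplotlib's scatter and hist (the optional feature_names argument is an assumption, used just for axis labels):
# a minimal myplotGrid sketch using only matplotlib
def myplotGrid(X, y, feature_names=None):
    n = X.shape[1]
    fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n))
    for i in range(n):
        for j in range(n):
            ax = axes[i, j]
            for c in np.unique(y):
                if i == j:
                    # on-diagonal: histogram of feature i for each class
                    ax.hist(X[y == c, i], alpha=0.5, label=str(c))
                else:
                    # off-diagonal: scatter of feature j against feature i
                    ax.scatter(X[y == c, j], X[y == c, i], s=10, label=str(c))
            if feature_names is not None:
                if i == n - 1:
                    ax.set_xlabel(feature_names[j])
                if j == 0:
                    ax.set_ylabel(feature_names[i])
    axes[0, 0].legend(title='class')
    plt.show()
It could then be invoked as myplotGrid(X, y, feature_names=selected_features). The submission below uses seaborn's pairplot instead, which produces an equivalent grid with less code.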
# Adding the wine type to the dataset
# This is needed because Seaborn's pairplot takes a single dataframe
# and uses one of its columns as the hue to colour each class,
# so the target is concatenated with the existing feature data
target = pd.DataFrame(y, columns=['wine'])
frames = [target, df_wine]
combined = pd.concat(frames, axis=1, join='inner')
combined
| | wine | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 0 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
| 2 | 0 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 0 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 |
| 4 | 0 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 2 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 |
| 174 | 2 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 |
| 175 | 2 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 |
| 176 | 2 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 |
| 177 | 2 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 |

178 rows × 14 columns
# define plotting function
# Using the seaborn library's pairplot function
import seaborn as sns
sns.pairplot(data=combined, vars=selected_features, hue='wine', palette='tab10')
# run the plotting function
plt.show()
When data are collected under real-world settings they usually contain some amount of noise that makes classification more challenging. In the cell below, invoke your exploratory data analysis function above on a noisy version of your data X.
Try to perturb your data with some Gaussian noise,
# initialize random seed to replicate results over different runs
mySeed = 12345
np.random.seed(mySeed)
XN=X+np.random.normal(0,0.6,X.shape)
and then invoke
myplotGrid(XN,y)
# noise code
mySeed = 123456
np.random.seed(mySeed)
XN = X + np.random.normal(0, 0.6, X.shape)
noise_Data = pd.DataFrame(XN, columns=selected_features)
noise_arr = [target, noise_Data]
Noise_df = pd.concat(noise_arr, axis=1, join='inner')
Noise_df
| | wine | alcohol | flavanoids | color_intensity | ash |
|---|---|---|---|---|---|
| 0 | 0 | 14.511467 | 2.890282 | 4.734565 | 1.748621 |
| 1 | 0 | 13.927267 | 2.656071 | 4.451525 | 1.513458 |
| 2 | 0 | 12.642891 | 1.977258 | 5.383042 | 3.313082 |
| 3 | 0 | 14.802933 | 3.065937 | 7.176255 | 2.663116 |
| 4 | 0 | 12.985017 | 3.030212 | 4.485739 | 2.217560 |
| ... | ... | ... | ... | ... | ... |
| 173 | 2 | 14.536851 | 0.815131 | 7.288758 | 2.436147 |
| 174 | 2 | 12.269144 | 1.184646 | 7.029788 | 2.513411 |
| 175 | 2 | 12.507302 | 0.683971 | 10.051960 | 1.402465 |
| 176 | 2 | 12.410797 | 1.106591 | 8.809080 | 2.696125 |
| 177 | 2 | 15.076628 | 0.570022 | 8.736084 | 1.770933 |

178 rows × 5 columns
# SHOWING THE NOISY DATA
sns.pairplot(data=Noise_df, vars=selected_features, hue='wine', palette='tab10')
plt.show()
Based on your exploratory analysis, if you were to build a classifier using only two of the available features, which ones would you choose and why? Answer as fully as you can.
answer: I would use alcohol and color_intensity. Starting with colour intensity, it is the feature whose distribution differs most clearly between the three wine types, so it separates the classes well on its own. I would pair it with alcohol because alcohol is a defining ingredient of any wine and its level also varies between the classes, so plotting alcohol against colour intensity shows how the two features jointly distinguish the classes. This pair shows less class overlap in the pair plot than flavanoids or ash would, which is why I consider it the better choice of two features.
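To illustrate, a quick scatter of just these two features (using their positions in selected_features: alcohol is column 0 and color_intensity is column 2 of X) might look like this:
# scatter of the two chosen features, coloured by class
for c in np.unique(y):
    plt.scatter(X[y == c, 0], X[y == c, 2], label='class %d' % c)
plt.xlabel('alcohol')
plt.ylabel('color_intensity')
plt.legend()
plt.show()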
What do you observe by plotting the data without noise compared to plotting with added Gaussian noise?
answer: Before the Gaussian noise was added, the classes were fairly well separated in each feature; the noisy data shows more overlap between classes across all features, and the spread of the values has increased as well. For example, the ash values become more similar across the three classes, and the spread of the flavanoids values grows noticeably. Colour intensity is fairly unchanged, while the alcohol values for class 2 overlap more with the other two classes than before, where they were somewhat distinct.
In the cell below, develop your own code for performing k-Nearest Neighbour classification. You may use the scikit-learn k-NN implementation from the labs as a guide - and as a way of verifying your results - but it is important that your implementation does not use any libraries other than the basic numpy and matplotlib functions.
Define a function that performs k-NN given a set of data. Your function should be invoked similarly to:
y_ = mykNN(X,y,X_,options)
where X is your training data, y is your training outputs, X_ are your testing data and y_ are your predicted outputs for X_. The options argument (can be a list or a set of separate arguments depending on how you choose to implement the function) should at least contain the number of neighbours to consider as well as the distance function employed.
Hint: it helps to break the problem into various sub-problems, implemented as helper functions. For example, you might want to implement separate function(s) for calculating the distance between two vectors, and another function that finds the nearest neighbour(s) to a given vector.
# Splitting the test and training data using Sklearn library
# THIS IS THE ONLY PLACE WHERE THE SKLEARN LIBRARY IS USED IN MY OWN PIPELINE (the rest is for comparison only)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Testing code to compare with SKLEARN
# Sklearn values will be later compared to my KNN function
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
y_pred_sklearn = knn.predict(X_test)
acc_sklearn = np.sum(y_pred_sklearn == y_test) / len(y_test)
print("Testing: %s " % y_test)
print("Predicted Sklearn: %s " % y_pred_sklearn)
print("Accuracy Sklearn: %s " % round(acc_sklearn,2))
Testing: [2 2 0 1 0 0 1 0 2 2 0 0 1 0 1 2 1 1 2 1 1 1 0 0 1 0 0 1 0 0 0 1 2 1 0 0]
Predicted Sklearn: [2 2 0 1 0 0 1 0 2 2 0 0 0 0 1 2 1 1 2 1 1 1 0 0 1 0 0 1 0 0 0 1 2 1 0 0]
Accuracy Sklearn: 0.97
# My KNN Code
# CODED BY USING DIFFERENT EXTERNAL SOURCES
# REFERENCES ARE GIVEN IN THE END OF THE NOTEBOOK
import numpy as np
from collections import Counter
# Euclidean and Manhattan distance functions
def euclidean_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))
def manhattan_dist(x1, x2):
    return sum(abs(val1 - val2) for val1, val2 in zip(x1, x2))
# My function-based KNN
def myKNN(X_train, y_train, X_test, k, distance):
    output = []
    # Iterating over the test samples
    for i in range(len(X_test)):
        # Lists for (distance, index) pairs and neighbour labels
        distances = []
        labels = []
        for j in range(len(X_train)):
            # Checking which distance metric was requested
            if distance == 'euclidean':
                dist = euclidean_dist(X_train[j], X_test[i])
            elif distance == 'manhattan':
                dist = manhattan_dist(X_train[j], X_test[i])
            else:
                raise ValueError('unknown distance: %s' % distance)
            # Storing the distance together with the training index
            distances.append([dist, j])
        # Sorting by distance and keeping the k nearest neighbours
        distances.sort()
        nearest = distances[0:k]
        for dist, j in nearest:
            labels.append(y_train[j])
        # Majority vote among the k neighbour labels
        ans = Counter(labels).most_common(1)[0][0]
        output.append(ans)
    return output
# Getting predicted values using my KNN function
predict = myKNN(X_train, y_train, X_test, 5, 'euclidean')
# Getting accuracy from the prediction values
acc_knn = np.sum(predict == y_test) / len(y_test)
# Printing out the values
# AFTER COMPARING THE VALUES, THEY ARE SIMILAR TO SKLEARN
print("Testing: %s " % y_test)
print("Predicted My KNN: %s " % predict)
print("Accuracy My KNN: %s " % round(acc_knn, 2))
Testing: [2 2 0 1 0 0 1 0 2 2 0 0 1 0 1 2 1 1 2 1 1 1 0 0 1 0 0 1 0 0 0 1 2 1 0 0]
Predicted My KNN: [2, 2, 0, 1, 0, 0, 1, 0, 2, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 2, 1, 0, 0]
Accuracy My KNN: 0.97
In the cell below, implement your own classifier evaluation code. This should include some way of calculating confusion matrices, as well as common metrics like accuracy.
Write some additional code that lets you display the output of your confusion matrices in a useful and easy-to-read manner.
You might want to test your functions on some test data, and compare the results to the sklearn library versions.
# Testing first with SKLEARN
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
# Printing out values using Sklearn to compare later
print("Confusion Matrix (Sklearn):")
print(confusion_matrix(y_test, y_pred_sklearn))
print("Accuracy Score (Sklearn): %s " % round(accuracy_score(y_test, y_pred_sklearn), 2))
print("Precision Score (Sklearn): %s " % precision_score(y_test, y_pred_sklearn, average=None))
print("Recall Score (Sklearn): %s " % recall_score(y_test, y_pred_sklearn, average=None))
Confusion Matrix (Sklearn):
[[16  0  0]
 [ 1 12  0]
 [ 0  0  7]]
Accuracy Score (Sklearn): 0.97
Precision Score (Sklearn): [0.94117647 1.         1.        ]
Recall Score (Sklearn): [1.         0.92307692 1.        ]
# confusion matrix, accuracy, precision, recall, etc.
# CODED BY USING EXTERNAL AND COURSERA LAB SOURCES
# REFERENCES ARE GIVEN IN THE END OF THE NOTEBOOK
# Creating a confusion matrix function
def conf_matrix(y_actual, y_pred):
    # Casting to arrays so elementwise comparison also works on plain lists
    y_actual = np.asarray(y_actual)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_actual)
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    # Rows are actual classes, columns are predicted classes
    for i in range(len(classes)):
        for j in range(len(classes)):
            matrix[i, j] = np.sum((y_actual == classes[i]) & (y_pred == classes[j]))
    return matrix
# Creating an accuracy function
def accur(y_actual, y_pred):
    return np.sum(np.asarray(y_pred) == np.asarray(y_actual)) / len(y_actual)
# Creating a precision function
def precision(y_actual, y_pred):
    matrix = conf_matrix(y_actual, y_pred)
    classes = np.unique(y_actual)
    prec = np.zeros(classes.shape)
    # Precision per class: true positives over the column sum
    for i in range(len(classes)):
        prec[i] = matrix[i, i] / sum(matrix[:, i])
    return prec
# Creating a recall function
def recall(y_actual, y_pred):
    # Classes are taken from the actual labels so no class is missed
    classes = np.unique(y_actual)
    rec = np.zeros(classes.shape)
    matrix = conf_matrix(y_actual, y_pred)
    # Recall per class: true positives over the row sum
    for i in range(len(classes)):
        rec[i] = matrix[i, i] / sum(matrix[i, :])
    return rec
# Printing out values for the data
# In comparison to Sklearn, our evaluation values are similar as well.
print('My Confusion Matrix: ')
print(conf_matrix(y_test, predict))
print('My Accuracy Score: %s' % round(accur(y_test, predict), 2))
print('My Precision Score: %s' % precision(y_test, predict))
print('My Recall Score: %s' % recall(y_test, predict))
My Confusion Matrix:
[[16  0  0]
 [ 1 12  0]
 [ 0  0  7]]
My Accuracy Score: 0.97
My Precision Score: [0.94117647 1.         1.        ]
My Recall Score: [1.         0.92307692 1.        ]
In the cell below, develop your own code for performing 5-fold nested cross-validation along with your implementation of k-NN above. You must write your own code -- the scikit-learn module may only be used for verification purposes.
Your code for nested cross-validation should invoke your kNN function (see above). Your cross-validation function should be invoked similarly to:
accuracies_fold = myNestedCrossVal(X,y,5,list(range(1,11)),['euclidean','manhattan'],mySeed)
where X is your data matrix (containing all samples and features for each sample), 5 is the number of folds, y are your known output labels, list(range(1,11)) evaluates the neighbour parameter from 1 to 10, ['euclidean','manhattan',...] evaluates the distances on the validation sets, and mySeed is simply a random seed to enable us to replicate your results.
# My Nested Cross Validation Function
# CODED BY USING EXTERNAL AND COURSERA LAB SOURCES
# REFERENCES ARE GIVEN IN THE END OF THE NOTEBOOK
def NestedCrossValidation(X, y, k_fold, neighbour, distance, mySeed):
    # Lists to store the accuracy, best parameters and confusion matrix per fold
    acc_fold = []
    param_fold = []
    confusion_matrix = []
    np.random.seed(mySeed)
    # Generating a shuffled list of indices from 0 to the length of the data
    indices = np.random.permutation(len(X))
    # Splitting the shuffled indices into k_fold bins
    bins = np.array_split(indices, k_fold)
    # Iterating over the number of folds, in this case 5
    for i in range(k_fold):
        # Lists to save the indices for training, testing and validation data
        foldTrain = []
        valid_fold = []
        # Initial values for the best parameters before the two parameter loops
        accuracy_best = 0
        neighbour_best = neighbour[0]
        distance_best = distance[0]
        # Taking bin i for testing
        foldTest = bins[i]
        # Taking the next bin for validation, wrapping back to 0 at the end
        valid_bin = i + 1
        if valid_bin >= k_fold:
            valid_bin = 0
        # Dividing the remaining bins into training and validation
        for j in range(len(bins)):
            if j == valid_bin:
                valid_fold = bins[valid_bin]
            elif j != i:
                # Skipping bin i here so test data never leaks into training
                foldTrain.extend(bins[j])
        # Nested loop for nested cross-validation
        # The first loop is over distances
        for x in distance:
            # The second loop is over numbers of neighbours
            for z in neighbour:
                # Calling our KNN function on the validation bin
                y_prediction = myKNN(X[foldTrain], y[foldTrain], X[valid_fold], z, x)
                # Calculating the validation accuracy
                acc_score = accur(y[valid_fold], y_prediction)
                # Keeping the parameters whenever the validation accuracy improves
                if acc_score > accuracy_best:
                    neighbour_best = z
                    distance_best = x
                    accuracy_best = acc_score
        # Extending the training indices with the validation data
        foldTrain.extend(valid_fold)
        # Running KNN on the test bin with the best parameters from the nested loop
        y_final = myKNN(X[foldTrain], y[foldTrain], X[foldTest], neighbour_best, distance_best)
        # Calculating the final accuracy for the fold
        acc_final = accur(y[foldTest], y_final)
        # Calculating the confusion matrix for the fold
        matrix = conf_matrix(y[foldTest], y_final)
        # Printing all the values
        print("==============================")
        fold_no = i + 1
        print("Fold Number: %s" % fold_no)
        print("This Best Accuracy: %s" % round(acc_final, 2))
        print("This Best Distance (Parameter): %s" % distance_best)
        print("This Best Neighbour (Parameter): %s" % neighbour_best)
        print("This Confusion Matrix Per Fold:")
        print(matrix)
        # Appending the values so they can be returned and used for the summary
        param_fold.append((distance_best, neighbour_best))
        confusion_matrix.append(matrix)
        acc_fold.append(acc_final)
    # Returning the per-fold accuracies, parameters and confusion matrices
    matrices = np.array(confusion_matrix)
    return acc_fold, param_fold, matrices
# evaluate clean data code
dists=["euclidean", "manhattan"]
mySeed=123456
folds=5
# Calling the NCV for the clean data and passing in the X value
# Prints out fold, best accuracy, distance, neighbour and confusion matrix per fold
a_fold_clean, p_fold_clean, conf_matrices_clean = NestedCrossValidation(X,y,folds,list(range(1,11)),dists,mySeed)
==============================
Fold Number: 1
This Best Accuracy: 1.0
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 3
This Confusion Matrix Per Fold:
[[11  0  0]
 [ 0 17  0]
 [ 0  0  8]]
==============================
Fold Number: 2
This Best Accuracy: 1.0
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 1
This Confusion Matrix Per Fold:
[[13  0  0]
 [ 0 11  0]
 [ 0  0 12]]
==============================
Fold Number: 3
This Best Accuracy: 0.92
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 3
This Confusion Matrix Per Fold:
[[11  0  0]
 [ 2 12  1]
 [ 0  0 10]]
==============================
Fold Number: 4
This Best Accuracy: 0.91
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 7
This Confusion Matrix Per Fold:
[[12  0  0]
 [ 3 11  0]
 [ 0  0  9]]
==============================
Fold Number: 5
This Best Accuracy: 1.0
This Best Distance (Parameter): manhattan
This Best Neighbour (Parameter): 3
This Confusion Matrix Per Fold:
[[12  0  0]
 [ 0 14  0]
 [ 0  0  9]]
# evaluate noisy data code
# Calling the NCV for noisy data and passing in XN value
# Prints out fold, best accuracy, distance, neighbour and confusion matrix per fold
a_fold_noisy, p_fold_noisy, conf_matrices_noisy = NestedCrossValidation(XN,y,folds,list(range(1,11)),dists,mySeed)
==============================
Fold Number: 1
This Best Accuracy: 0.83
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 3
This Confusion Matrix Per Fold:
[[ 9  2  0]
 [ 4 13  0]
 [ 0  0  8]]
==============================
Fold Number: 2
This Best Accuracy: 0.89
This Best Distance (Parameter): manhattan
This Best Neighbour (Parameter): 4
This Confusion Matrix Per Fold:
[[11  1  1]
 [ 0 11  0]
 [ 0  2 10]]
==============================
Fold Number: 3
This Best Accuracy: 0.97
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 3
This Confusion Matrix Per Fold:
[[11  0  0]
 [ 1 14  0]
 [ 0  0 10]]
==============================
Fold Number: 4
This Best Accuracy: 0.83
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 8
This Confusion Matrix Per Fold:
[[11  1  0]
 [ 4  9  1]
 [ 0  0  9]]
==============================
Fold Number: 5
This Best Accuracy: 0.94
This Best Distance (Parameter): euclidean
This Best Neighbour (Parameter): 5
This Confusion Matrix Per Fold:
[[12  0  0]
 [ 2 12  0]
 [ 0  0  9]]
Using your results from above, fill out the following table using the clean data:
Fold | accuracy | k | distance |
---|---|---|---|
1 | 1.00 | 3 | euclidean |
2 | 1.00 | 1 | euclidean |
3 | 0.92 | 3 | euclidean |
4 | 0.91 | 7 | euclidean |
5 | 1.00 | 3 | manhattan |
total | 0.97 ± 0.04 | | |
Where total is given as an average over all the folds, and the standard deviation.
Now fill out the following table using the noisy data:
Fold | accuracy | k | distance |
---|---|---|---|
1 | 0.83 | 3 | euclidean |
2 | 0.89 | 4 | manhattan |
3 | 0.97 | 3 | euclidean |
4 | 0.83 | 8 | euclidean |
5 | 0.94 | 5 | euclidean |
total | 0.89 ± 0.06 | | |
# Summary Code
# A result summary function that takes in parameter and accuracy array
# Extracts the values from the array and saves it in a dataframe
# Written by Myself
import pandas as pd
def result_summary(a_fold, p_fold):
    # Rounding the fold accuracies
    round_acc = [round(acc, 2) for acc in a_fold]
    # Unzipping the parameter pairs into distances and neighbours
    dist_fold, neighbour = zip(*p_fold)
    # Combining the data in a numpy array
    data = np.array([round_acc, neighbour, dist_fold])
    # Calculating the fold indices 1..folds for the dataframe index
    indices = list(range(1, folds + 1))
    # Saving in a dataframe using pandas
    df = pd.DataFrame(data.T, index=indices, columns=["accuracy", "k", "distance"])
    # Returning the dataframe
    return df
# Clean Data Summary
# Calling summary function
clean_summary = result_summary(a_fold_clean, p_fold_clean)
# Calculating average accuracy and standard deviation
avg_accuracy_clean = np.average(a_fold_clean)
sd_clean = np.std(a_fold_clean)
# Printing average accuracy, standard deviation and showing the dataframe
print("Average Clean Accuracy: %2f ± %2f" % (avg_accuracy_clean, sd_clean))
clean_summary
Average Clean Accuracy: 0.966190 ± 0.041415
| | accuracy | k | distance |
|---|---|---|---|
| 1 | 1.0 | 3 | euclidean |
| 2 | 1.0 | 1 | euclidean |
| 3 | 0.92 | 3 | euclidean |
| 4 | 0.91 | 7 | euclidean |
| 5 | 1.0 | 3 | manhattan |
# Noisy Data Summary
# Calling summary function
noisy_summary = result_summary(a_fold_noisy, p_fold_noisy)
# Calculating average accuracy and standard deviation
avg_accuracy_noisy = np.average(a_fold_noisy)
sd_noisy = np.std(a_fold_noisy)
# Printing average accuracy, standard deviation and showing the dataframe
print("Average Noisy Accuracy: %2f ± %2f" % (avg_accuracy_noisy, sd_noisy))
noisy_summary
Average Noisy Accuracy: 0.893175 ± 0.057428
| | accuracy | k | distance |
|---|---|---|---|
| 1 | 0.83 | 3 | euclidean |
| 2 | 0.89 | 4 | manhattan |
| 3 | 0.97 | 3 | euclidean |
| 4 | 0.83 | 8 | euclidean |
| 5 | 0.94 | 5 | euclidean |
Summarise the overall results of your nested cross-validation evaluation of your k-NN algorithm using two summary confusion matrices (one for the noisy data, one for the clean data). You might want to adapt your myNestedCrossVal code above to also return a list of confusion matrices.
Use or adapt your evaluation code above to print the two confusion matrices below. Make sure you label the matrix rows and columns. You might also want to show class-relative precision and recall.
# New custom functions for class precision and recall
# The older ones couldn't be used because these take a summed confusion matrix
# Written by Myself
def class_precision(matrix):
    result = []
    # Goes over the classes (one row/column of the matrix per class)
    for i in range(matrix.shape[0]):
        # For each class, calculates TP and FP from the matrix
        true_pos = matrix[i, i]
        false_pos = np.sum(matrix[:, i]) - true_pos
        # Uses the precision formula and appends to the result
        result.append(true_pos / (true_pos + false_pos))
    # Returning the per-class precision
    return result
def class_recall(matrix):
    result = []
    # Goes over the classes
    for i in range(matrix.shape[0]):
        # For each class, calculates TP and FN from the matrix
        true_pos = matrix[i, i]
        false_neg = np.sum(matrix[i, :]) - true_pos
        # Uses the recall formula and appends to the result
        result.append(true_pos / (true_pos + false_neg))
    # Returning the per-class recall
    return result
# Written by Myself
print('CLEAN')
# clean data summary results
# Summing up the clean matrices
sum_clean_matrix = np.sum(conf_matrices_clean, axis=0)
# Printing recall and precision values for clean data
print("Confusion Matrix:")
print(sum_clean_matrix)
print("Precision: %s" % class_precision(sum_clean_matrix))
print("Recall: %s" % class_recall(sum_clean_matrix))
print("===============")
print('NOISY')
# noisy data summary results
# Summing up the noisy matrices
sum_noisy_matrix = np.sum(conf_matrices_noisy, axis=0)
# Printing recall and precision values for noisy data
print("Confusion Matrix:")
print(sum_noisy_matrix)
print("Precision: %s" % class_precision(sum_noisy_matrix))
print("Recall: %s" % class_recall(sum_noisy_matrix))
CLEAN
Confusion Matrix:
[[59  0  0]
 [ 5 65  1]
 [ 0  0 48]]
Precision: [0.921875, 1.0, 0.9795918367346939]
Recall: [1.0, 0.9154929577464789, 1.0]
===============
NOISY
Confusion Matrix:
[[54  4  1]
 [11 59  1]
 [ 0  2 46]]
Precision: [0.8307692307692308, 0.9076923076923077, 0.9583333333333334]
Recall: [0.9152542372881356, 0.8309859154929577, 0.9583333333333334]
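As a small addition for readability (and to label the matrix rows and columns, as asked above), a helper built on pandas, which is already imported, could wrap these summed matrices; it assumes wine.target_names from the dataset loaded at the top of the notebook is used for the class names:
# labelled confusion-matrix display using a pandas dataframe
def show_matrix(matrix, class_names):
    df = pd.DataFrame(matrix,
                      index=['actual %s' % c for c in class_names],
                      columns=['predicted %s' % c for c in class_names])
    print(df)
For example, show_matrix(sum_clean_matrix, wine.target_names) prints the clean summary matrix with one labelled row and column per wine class.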
Now answer the following questions as fully as you can. The answers should be based on your implementation above. Write your answers in the Markdown cells below each question.
Do the best parameters change when noise is added to the data? Can you say that one parameter choice is better regardless of the data used?
Answer: Yes, the parameters changed quite a lot. With noise, the highest fold accuracy dropped from 1.0 to 0.97 and the lowest fell to 0.83. The euclidean distance was the one parameter that changed little: it was selected in most folds, with manhattan chosen only once in each setting. It is also worth noticing that the value of k is much more varied on the noisy data than on the clean data, and includes the largest k overall. If one parameter choice can be called better regardless of the data, it is the euclidean distance, which gave the highest accuracies on both the clean and the noisy data.
Assume that you have selected the number of neighbours to be an even number, e.g., 2. For one of the neighbours, the suggested class is 1, and for the other neighbour the suggested class is 2. How would you break the tie? Write example pseudocode that does this.
Answer: There are a few ways to break the tie. One is to pick a class at random when a tie occurs, which would resolve the issue but would not give a very principled result. A better approach is to see which of the tied classes has the closest neighbour and choose that one. The code below sketches this: it assumes the neighbours array holds the class labels of the k nearest neighbours ordered by ascending distance, checks whether the vote is tied, and if so returns the class of the single closest neighbour.
import numpy as np
# Tie-breaker function called when the vote is tied; takes the neighbour
# class labels ordered by ascending distance (nearest first)
def tie_breaker(neighbours):
    # Counting the votes for each class among the k neighbours
    labels, counts = np.unique(neighbours, return_counts=True)
    tied = labels[counts == counts.max()]
    if len(tied) > 1:
        # Tie: walk the neighbours from nearest to farthest and return
        # the first label that belongs to one of the tied classes
        for label in neighbours:
            if label in tied:
                return label
    # No tie: return the majority class
    return tied[0]
If you were to run your k-nn algorithm on a new dataset (e.g., the breast cancer dataset, or Iris), what considerations would you need to take into consideration? Outline any changes that might be needed to your code.
Answer: Those datasets have different numbers of classes and features, and can be considerably larger than the Wine dataset, so the k-NN algorithm would need more flexibility in terms of distance metrics. For that we could add the Minkowski distance as well, to make the k-NN more versatile.
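As a sketch of that change, a generalised Minkowski distance helper could sit alongside euclidean_dist and manhattan_dist (p=1 recovers the Manhattan distance and p=2 the Euclidean distance):
import numpy as np
# Minkowski distance of order p between two feature vectors
def minkowski_dist(x1, x2, p=2):
    return np.sum(np.abs(np.asarray(x1) - np.asarray(x2)) ** p) ** (1.0 / p)
myKNN would then take p as an extra option and dispatch to this helper when the distance argument is 'minkowski'.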
The code I have written in this notebook has been mostly adapted from the Coursera labs, and external sources such as Stack Overflow, YouTube and various websites were used to build the functions for this notebook.
There are functions that I implemented completely myself: the distance functions (using the formulae we already knew for k-NN), the summary code, and the Seaborn code for visualisation.
The prediction algorithm in my k-NN was adapted from Medium; my implementation is function-based as per the requirement and supports different distances, so the code was adjusted and moulded to my requirements. The reference is below:
To keep the Manhattan distance code short, the Python formula was adapted from datagy. I initially wrote my own Manhattan code but wanted to make it as compact as the Euclidean function, hence this source was used.
Other YouTube videos were also used initially to understand the k-NN code in practice; references are below:
The evaluation code is adapted from different sources, mainly two: the Coursera labs, and, for precision and recall, the code we learned to write there, with only a few changes to the inputs each function takes.
The confusion matrix code was adapted from Stack Overflow:
Note that the sklearn code is something we had already implemented in the Coursera labs, so it is used the same way here to compare against the values from my custom code.
The nested cross-validation code is a mixture of different sources and algorithms, the main one being the Coursera labs in Week 7, since we already had an implementation of a cross-validation function. The rest was implemented by reviewing different sources that explained how nested cross-validation works, and the code was then written using that idea of inner loops.
All of the links above did not necessarily contain exact code but rather an explanation of the algorithm and the process that should be used, so those parts of the code were implemented by me after carefully reviewing the explanations provided by the sources, while the cross-validation code was extended, as said before, into a nested cross-validation function.