chapter 2. data and analysis
TRANSCRIPT
Chapter 2. Data and analysis
Section 2: Machine Learning
Aim
In this section, you will learn the concept of machine learning (ML) and be introduced
to the concept of artificial intelligence (AI). AI uses large amounts of data, together
with processing and intelligent algorithms, to help computers learn information on
their own. Machine learning is one method of achieving artificial intelligence.
You will therefore be introduced to machine learning and its various
algorithms and applications.
Learning outcomes
Assess a range of different machine learning algorithms.
Prior knowledge
• Computer science
• Data representation
• Data organisation and storage
My STREAM focus
• Science
• Technology
• Engineering
Key vocabulary
Word | Meaning
artificial intelligence | ability of a machine to perform tasks that normally require human intelligence
machine learning | study of the algorithms that enable machines to learn from data to make decisions without the help of humans
supervised learning | type of learning that uses examples from the training dataset to learn
unsupervised learning | type of learning which groups similar things based on features from the dataset
reinforcement learning | type of learning which uses the rewards and punishments system
classification algorithm | algorithm that requires the machine to predict the label or class of an object
regression | algorithm that requires the machine to predict a numeric value
clustering | algorithm that requires the machine to group similar things in the same group
association | algorithm that requires the machine to find the relationships between features in the input dataset
SB Activity 2.1.1
Can you identify the purpose of the following technologies?
Activity 2.2.1
Conduct research on how the UAE used artificial intelligence and smart solutions as
part of its strategy to control the spread of COVID-19. Record your answers in the
space given below.
Activity 2.2.2
Categorise the following statements into true/false statements:
Statement | True / False
1. AI is a subset of machine learning. | False
2. A computer that uses machine learning needs to take orders from the user. | True
3. Machine learning uses datasets to learn from data. | True
4. You can find the features and labels in the dataset by default. | True
5. Features can be defined as the unique characteristics of something. | True
6. The accuracy of a machine learning model determines the best relationships and patterns present in a dataset. | False
Activity 2.2.3
1. Name and define the type of data used in supervised learning.
Labelled data is data that comes with a tag, like a name, a type, or a number.
2. What is the main goal of supervised learning?
to predict the correct label for newly presented input data
3. List any five types of supervised learning algorithms.
Decision trees
Linear Regression
Random forest
Logistic regression
Support Vector Machines
4. List one difference between classification and regression algorithms.
Classification | Regression
Classification is the problem of predicting a discrete class label output for an example. | Regression is the problem of predicting a continuous quantity output for an example.
5. Classify each of the following supervised learning applications as "Classification"
or "Regression":
Application | Classification/Regression
Customer behaviour prediction | Classification
Predict the price of a car based on the latest technologies available | Regression
Predict the price of a house based on data such as quality of schools in the area, number of bedrooms in the house and house location | Regression
Choosing a cat's breed based on its physical features such as height, width and skin colour | Classification
Prediction of the temperature of any day based on wind speed, humidity, atmospheric pressure | Regression
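To make the difference between the two problem types concrete, here is a minimal sketch in plain Python (no scikit-learn). The function names classify_nearest and fit_line, the labels "breed A" and "breed B", and every number are made up for illustration and do not come from the student book. The classifier returns a discrete label, while the regressor returns a continuous number.

```python
# Illustrative sketch only: tiny hand-made data, not from the student book.

def classify_nearest(samples, labels, new_value):
    """Classification: return the label of the closest training sample."""
    distances = [abs(s - new_value) for s in samples]
    return labels[distances.index(min(distances))]


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Classification: one hypothetical body measurement per cat, with a breed label.
heights = [23.0, 25.0, 40.0, 42.0]
breeds = ["breed A", "breed A", "breed B", "breed B"]
predicted_breed = classify_nearest(heights, breeds, 39.0)

# Regression: number of bedrooms against a hypothetical house price.
bedrooms = [1, 2, 3, 4]
prices = [100.0, 150.0, 200.0, 250.0]
slope, intercept = fit_line(bedrooms, prices)
predicted_price = slope * 5 + intercept

print(predicted_breed)   # a class label (text)
print(predicted_price)   # a numeric value
```

The first prediction can only ever be one of the labels seen during training, whereas the second can be any number on the fitted line.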
Activity 2.2.4
1. Name and define the type of data used in unsupervised learning.
Unlabelled data is data that comes with no tag.
2. What is the main goal of unsupervised learning?
to discover hidden and interesting patterns in unlabeled data
3. List any three types of unsupervised learning algorithms.
K-means clustering
Principal component analysis
Neural networks – deep belief networks
4. List one difference between clustering and association algorithms with one
example each.
Clustering | Association
It means grouping a set of objects in such a manner that objects in the same group are more similar to each other than to objects belonging to other groups. | It is about finding associations amongst items within large commercial databases.
Activity 2.2.5
Differentiate between the three types of machine learning algorithm.
Supervised Learning | Unsupervised Learning | Reinforcement Learning
works with the labelled data and here the output data patterns are known to the system | deals with unlabelled data where the output is based on the collection of perceptions | based on interaction with environments using the trial-and-error process
Activity 2.2.6
Conduct research on the real-life application of reinforcement learning. Record your
answers in the space given below.
Activity 2.2.7
Choose the correct option for the following questions:
1. Which of the following is an example of negative reinforcement?
a. Every time you eat chocolate, you get a stomachache.
b. You do not study for a test, and you receive a bad grade.
c. Whenever you drink coffee in the morning, you are able to get your work done
more quickly.
d. Using an umbrella results in you not getting rained on. Therefore, you start to bring
an umbrella with you whenever rain is in the forecast.
2. Reinforcement Learning is -
a. Supervised learning
b. Unsupervised learning
c. Award based learning
d. Semi-supervised learning
Try to research the answer on your own.
Student reflection
List three things you have learned and two things you have enjoyed.
Three things I have learned:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
3. ___________________________________________________________________________________
Two things I have enjoyed:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
Learning outcomes
Key Skills (Please tick the box to show your understanding of the skills below.) | I do not understand. | I understand. | I'm an expert.
Assess a range of different machine learning algorithms. | | I can explain a range of different machine learning algorithms. | I can assess a range of different machine learning algorithms.
Teacher’s comment:
Section 3: ML Algorithms
Aim
In this section, you will learn how to build a machine learning model. You will first be
introduced to the steps of creating an ML model: collecting the data, then building,
training, and evaluating the model. You will also be introduced to the scikit-learn
and pandas libraries, which you will use to write the Python code for the ML model
in the PyCharm IDE. You will learn how to collect data in three different ways. Further,
you will learn how to build your first machine learning model in Python. You will evaluate
your ML model by using new datasets and determine its accuracy percentage.
Learning outcomes
Select appropriate data in order to implement a basic classification system using
an AI algorithm.
Prior knowledge
• Python programming
• Machine learning models
My STREAM focus
• Science
• Technology
• Engineering
Key vocabulary
Word | Meaning
dataset | collection of data in which data is arranged in some order
feature | individual, independent variables which act as the input in a model
class label | final output or target output of a model
evaluation | measure of the machine learning model's performance
accuracy | percentage of the correct predictions from the total predictions
SB Activity 2.3.1
Identify the features for the iris dataset.
• petal length
• petal width
• sepal length
• sepal width
SB Activity 2.3.2
Identify the class labels for the iris dataset.
• Iris Versicolor
• Iris Setosa
• Iris Virginica
SB Activity 2.3.3
Can you identify what the output is in the context of the iris dataset?
SB Activity 2.3.5
1. Modify the code to print the values stored in each of the following variables:
(a) x_train
(b) x_test
(c) y_train
(d) y_test
Observe the results and paste a picture of the output below.
print(x_train)
print(x_test)
print(y_test)
print(y_train)
2. How many samples are present in the training and test set, respectively?
Training set samples – 105
Test set samples - 45
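The 105 / 45 split can be checked with a short sketch. It uses only the Python standard library as a stand-in for train_test_split, and it assumes the test share is rounded up to a whole number of samples (with 150 samples and test_size=0.3 no rounding is needed). The variable names are illustrative.

```python
import math
import random

n_samples = 150      # the iris dataset has 150 samples
test_size = 0.3      # same fraction as train_test_split(..., test_size=0.3)

# Shuffle the row indices, then cut off the test share.
indices = list(range(n_samples))
random.Random(0).shuffle(indices)   # fixed seed so the split is repeatable

n_test = math.ceil(n_samples * test_size)
test_indices = indices[:n_test]
train_indices = indices[n_test:]

print("Training set samples:", len(train_indices))
print("Test set samples:", len(test_indices))
```

Every sample lands in exactly one of the two sets, so the two counts always add up to the size of the full dataset.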
Palmer Penguin Dataset
This dataset is used as a replacement for the iris dataset for analysis purposes, as it consists
of 344 samples of data. This dataset was released in August 2020.
This dataset contains data for 344 penguins. There are three different species of penguins
in this dataset, as shown in the figure below. The data is collected from three islands in
the Palmer Archipelago, Antarctica.
Figure 2.3.1
The dataset represents the following features:
• species - a factor denoting penguin species- Adélie, Chinstrap and Gentoo
• island - a factor denoting island in Palmer Archipelago, Antarctica - Biscoe,
Dream or Torgersen
• bill_length_mm - a number denoting bill length – in millimetres
• bill_depth_mm - a number denoting bill depth – in millimetres
• flipper_length_mm - an integer denoting flipper length – in millimetres
• body_mass_g - an integer denoting body mass – in grams
• Gender - a factor denoting penguin gender – female, male
The culmen is the upper ridge of a bird’s bill. In the given data, culmen length and
depth are renamed as variables bill_length_mm and bill_depth_mm. For this penguin
data, the bill length and depth are measured as shown below.
Figure 2.3.2
Complete all the activities below using the information provided above to build an ML
model using the following available algorithms in Python:
(a) Decision Tree
(b) Random Forest
(c) Logistic Regression
(d) KNN
(e) Naïve Bayes
(f) SVM
While evaluating your models, you will compare the accuracy obtained from each of
the algorithms mentioned above and conclude with the best algorithm technique.
Activity 2.3.1
1. Can you identify the features and class labels of the penguin dataset in
the table below?
Features | Class label
species | Gender
island |
bill_length_mm |
bill_depth_mm |
flipper_length_mm |
body_mass_g |
2. What type of classification will be done using the penguin dataset?
Binary classification, which is a supervised learning task: each penguin is classified
into one of two genders, male or female.
3. Categorise the following statement as True or False.
Statement True / False
Features and class labels can be differentiated as
independent and dependent variables, respectively. True
Note:
• Install the package “palmerpenguins” on your IDE to download the dataset directly.
• You can use Example 2.3.2 from your student book.
• Save the complete program for each of the algorithms in different files.
Activity 2.3.2
Collect the dataset
Write the Python code to load the data from “palmerpenguins” package.
Copy and paste your code in the space below and a screenshot of the
output.
Code:
import pandas as pd
# loading the dataset (assumes the penguin data is saved locally as penguins.csv)
data_df = pd.read_csv('penguins.csv')
print(data_df.head())
Output:
Activity 2.3.3
Building and training the model
Modify the code from Activity 2.3.2 to build and train the model using the
following classification algorithms – (a) Decision Tree
(b) Random Forest
(c) Logistic Regression
(d) KNN
(e) Naïve Bayes
(f) SVM
Copy and paste the code below for each of the algorithms.
Code:
#importing libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# removing rows with missing values, which these classifiers cannot handle
data_df = data_df.dropna()
#separating the features and class labels
x = data_df[['species', 'island', 'bill_length_mm', 'bill_depth_mm',
             'flipper_length_mm', 'body_mass_g']]
# converting the text columns (species, island) into numeric columns
x = pd.get_dummies(x)
y = data_df['Gender']
# building the model
dtc = DecisionTreeClassifier()
rcm = RandomForestClassifier()
logreg_clf = LogisticRegression()
knn_model = KNeighborsClassifier(n_neighbors=5)
na_model = GaussianNB()
SVC_model = SVC()  # SVC is imported directly, so no "svm." prefix is needed
# training the model
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size=0.3)
dtc = dtc.fit(x_train, y_train)
rcm = rcm.fit(x_train, y_train)
logreg_clf = logreg_clf.fit(x_train, y_train)
knn_model = knn_model.fit(x_train, y_train)
na_model = na_model.fit(x_train, y_train)
SVC_model = SVC_model.fit(x_train, y_train)
Note:
The model is trained by calling the fit() command on the training data. Fitting is
an automatic process that adjusts the model's individual parameters so that its
predictions match the training labels as closely as possible. How well the fitted
model generalizes to similar data that it was not trained on is measured
separately, in the evaluation step.
Activity 2.3.4
Evaluate the model
1. Modify the code from the previous activity to evaluate the model for each of
the following classification algorithms – (a) Decision Tree
(b) Random Forest
(c) Logistic Regression
(d) KNN
(e) Naïve Bayes
(f) SVM
#importing libraries
from sklearn.metrics import accuracy_score
# evaluating the model
# decision tree classifier
predictions1 = dtc.predict(x_test)
print("Predicted output for decision tree classifier is:",
predictions1)
print("The accuracy is", accuracy_score(y_test, predictions1))
# random forest classifier
predictions2 = rcm.predict(x_test)
print("Predicted output for random forest classifier is:",
predictions2)
print("The accuracy is", accuracy_score(y_test, predictions2))
# logistic regression
predictions3 = logreg_clf.predict(x_test)
print("Predicted output with logistic regression classifier is:",
predictions3)
print("The accuracy is", accuracy_score(y_test, predictions3))
# KNN model
predictions4 = knn_model.predict(x_test)
print("Predicted output for KNN classifier is:", predictions4)
print("The accuracy is", accuracy_score(y_test, predictions4))
# Naive Bayes
predictions5 = na_model.predict(x_test)
print("Predicted output for Naive Bayes classifier is:",
predictions5)
print("The accuracy is", accuracy_score(y_test, predictions5))
# SVM
predictions6 = SVC_model.predict(x_test)
print("Predicted output for SVM classifier is:", predictions6)
print("The accuracy is", accuracy_score(y_test, predictions6))
2. Complete the table below to record the predictions and the accuracy value for
each of the algorithms:
Classification algorithm | Prediction output | Accuracy
Decision Tree | |
Random Forest | |
Logistic Regression | |
KNN | |
Naïve Bayes | |
SVM | |
3. (a) Which algorithm has the best accuracy value?
(b) What is the meaning of having the highest accuracy value?
(a) Random Forest Classifier
(b) This classifier is able to make the highest number of correct predictions for the
penguin dataset.
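To see what an accuracy value means, the calculation can be written out by hand. This is a minimal sketch in plain Python of the ratio of correct predictions to total predictions. The function name accuracy and the five example labels are hypothetical and are not real output from the penguin models.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for five penguins, for illustration only.
y_true = ["male", "female", "female", "male", "male"]
y_pred = ["male", "female", "male", "male", "male"]

print(accuracy(y_true, y_pred))  # 4 of the 5 predictions are correct
```

Because train_test_split shuffles the data randomly, the accuracy of each algorithm can change slightly from one run to the next.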
Activity 2.3.5 (T,E)
Categorise the following statements into true/false statements.
Statement True / False
1. Training the model is the second step in the ML process. False
2. The testing dataset should be larger than the training dataset. False
3. Usually, 50% of the dataset is used for training and 50% for testing. False
4. Accuracy is used as an evaluation of the model’s prediction. True
5. The evaluation process is used to measure a model’s performance. True
Student reflection
List three things you have learned and two things you have enjoyed.
Three things I have learned:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
3. ___________________________________________________________________________________
Two things I have enjoyed:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
Learning outcomes
Key Skills (Please tick the box to show your understanding of the skills below.) | I do not understand. | I understand. | I'm an expert.
Select appropriate data in order to implement a basic classification system using an AI algorithm. | | I can collect data for training a classification model. | I can select appropriate data in order to implement a basic classification system using an AI algorithm.
Teacher’s comment:
Section 4: Data visualisation
Aim
In this section, you will learn to visualise the data collected in the form of different plots.
You will learn about the different libraries required to generate a particular plot style. This
includes line plot, bar plot, pie chart, and scatter plot. You will then understand what type
of information can be inferred from each plot type. You will also learn how to choose the
right chart or graph for your data.
Learning outcomes
• Explain how interactive data visualisations help others better understand real-
world phenomena.
Prior knowledge
• Basics of Python programming
• Machine learning models
My STREAM focus
• Science
• Technology
• Engineering
• Mathematics
Key vocabulary
Word Meaning Picture
data analysis process of studying data to
find useful information
correlation relationship between
different data elements
Word Meaning Picture
data
visualisation
graphical representation of
information and data
line plots
graphical representation of
data by connecting all data
points with a line
bar plot chart that plots data using
rectangular bars or columns
pie chart type of graph that displays
data in a circular graph
scatter plot
graph in which the values of
two variables are plotted
along two axes
SB Activity 2.4.1 – Loading of data
The first step of any AI/ML process is to load the data. You have learned how to do
this in the previous section. Open your Python IDE and load the iris dataset in
preparation for visualising the data.
Copy and paste your code below.
import sklearn
from sklearn.datasets import load_iris
iris = load_iris()
OR
import pandas as pd
iris_df = pd.read_csv('iris.csv')
SB Activity 2.4.2
Try the following for the iris dataset:
Modify the code in Figure 2.4.4 to generate a line plot with different colours between –
• class vs sepal_wid
• class vs petal_wid
• class vs petal_len
• class vs sepal_wid:
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',
'class']
a = iris_df['class'].value_counts()
iris_df.plot(kind='line',x='class',y='sepal_wid', color='blue')
plt.xlabel('class')
plt.ylabel('Sepal Width')
plt.title('Line Plot')
plt.show()
• class vs petal_wid:
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',
'class']
a = iris_df['class'].value_counts()
iris_df.plot(kind='line',x='class',y='petal_wid', color='green')
plt.xlabel('class')
plt.ylabel('Petal Width')
plt.title('Line Plot')
plt.show()
• class vs petal_len:
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',
'class']
a = iris_df['class'].value_counts()
iris_df.plot(kind='line',x='class',y='petal_len', color='black')
plt.xlabel('class')
plt.ylabel('Petal Length')
plt.title('Line Plot')
plt.show()
SB Activity 2.4.3
You must have noticed that the code for scatter plot is mostly similar to the code for
line plot. Answer the following questions to check if you understand the code layout
for scatter plot in Figure 2.4.13:
(a) How do you identify that the code is written to plot a scatter plot?
iris_df.plot( kind='scatter',x='petal_len',y='petal_wid', color='red')
(b) Which features of the iris dataset are compared in the scatter plot?
Petal width and Petal length
(c) What functions are used to label the x-axis and y-axis in the scatter plot?
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
(d) What will happen if you remove the plt.show() function from the code?
The scatter plot will not be displayed.
(e) Can you identify which points in the output belongs to each of the iris species?
No
SB Activity 2.4.4
(a) Modify the code in Figure 2.4.14 to compare between petal length and width of
the iris species in the dataset.
Copy and paste your code and output below.
Code:
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',
'class']
plt.figure(figsize=(15,7))
plt.title("Comparison between various species based on petal length and width")
# x and y are passed by keyword, as recent seaborn versions require
sns.scatterplot(x=iris_df['petal_len'], y=iris_df['petal_wid'],
                hue=iris_df['class'], style=iris_df['class'], s=50)
plt.show()
Output:
(b) What information can you understand from the scatter plot?
• Iris Setosa (Class 0) has the smallest petal length and petal width
• Versicolor (Class 1) lies in the middle for both petal length and petal width
• Virginica (Class 2) has the largest petal length and petal width
Activity 2.4.1
Research and identify the purpose of each of the following libraries used in
Python.
Library Purpose
Pandas
NumPy
Scikit Learn
Matplotlib
Seaborn
Activity 2.4.2
Use the link below to download the “fruits_with_colors” dataset -
https://www.kaggle.com/anandraos/fruit-data-with-colours
1. Identify the features and the class labels for the dataset.
Features – fruit_name, fruit_subtype, mass, width, height, color_score
Class labels – fruit_label
2. Perform the following actions -
• Load the dataset
• Explore the dataset to gain more information about the data.
Copy and paste your code and output below.
Code:
import pandas as pd
data_df = pd.read_csv('fruit_data_with_colours.csv')
print('The fruit dataset insights are:')
print(data_df.info())
print('The statistical descriptions of the dataset are:')
print(data_df.describe())
print('The number of values for each class labels are:')
print(data_df['fruit_label'].value_counts())
print('The following list displays if there are any null values inside the dataset:')
print(pd.isnull(data_df).any())
print('The shape of the given dataset is:')
print(data_df.shape)
print('The unique values in the dataset are:')
print(data_df.nunique())
Output:
3. What information can you get from the above output?
Try to research the answer on your own.
Activity 2.4.3
You are planning to visit Jebel Al Ali with your parents, for which you have to
rent a car for around 5 hours. The rental cost includes a deposit of AED 50
plus an hourly rate of AED 20. The cost to rent the car for up to 5
hours is therefore as shown in the table below.
No.of hours 0 1 2 3 4 5
Cost in AED 50 70 90 110 130 150
Write a code for plotting a line plot using the table shown above by following
the steps below –
(a) Store the data in the table in a dataframe.
(b) Create a line plot of the data using the created dataframe.
Copy and paste your code and output below.
Code:
# importing libraries
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import numpy as np
# storing the given data in a dataframe
data = pd.DataFrame([[0, 50],[1, 70],[2, 90], [3, 110], [4, 130], [5,
150]])
data.columns = ['No. of hours' , 'Cost in AED']
print(data)
# creating the line plot in the data
data.plot(kind='line',x='No. of hours',y='Cost in AED', color='red')
plt.xlabel('No. of hours')
plt.ylabel('Cost in AED')
plt.title('Line Plot')
plt.show()
Try to answer on your own using
Table 2.4.2 in your student book.
Output:
(c) What kind of information can you understand from the generated line plot?
The line plot generated has an upward trend, indicating a positive correlation
between the number of hours and the cost.
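The straight line in the plot follows directly from the pricing rule in the activity: cost = deposit + hourly rate × hours. The short sketch below rebuilds the table from that rule. The constant names are illustrative.

```python
DEPOSIT_AED = 50       # one-off deposit from the activity
HOURLY_RATE_AED = 20   # hourly rate from the activity

hours = list(range(6))  # 0 to 5 hours
costs = [DEPOSIT_AED + HOURLY_RATE_AED * h for h in hours]

for h, cost in zip(hours, costs):
    print(h, cost)
```

Each extra hour adds the same AED 20, which is why the points lie on a straight line that starts at the deposit of AED 50.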
Activity 2.4.4
Use the “fruits_with_colors” dataset for plotting a bar plot representing the fruit
categories and the count of their occurrences.
Copy and paste your code and output below.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data_df = pd.read_csv('fruit_data_with_colours.csv')
a = data_df['fruit_label'].value_counts()
fruit_type = a.index
fruit_count = a.values
plt.bar(fruit_type, fruit_count, color='red', width=0.5)
plt.xlabel('fruit_type')
plt.ylabel('fruit_count')
plt.title('Bar Plot')
plt.show()
Output:
What information can you understand from the bar plot?
The data in the dataset is not balanced as each of the fruits type has a different
count value –
Apple ( label 1) – 19
Mandarin (label 2) – 5
Orange (label 3) - 19
Lemon (label 4) - 16
Activity 2.4.5 (T, E, A, M)
Use the “fruits_with_colors” dataset for plotting a pie chart representing the
fruit categories and the count of their occurrences.
Copy and paste your code and output below.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data_df = pd.read_csv('fruit_data_with_colours.csv')
a = data_df['fruit_label'].value_counts()
fruit_type = a.index
fruit_count = a.values
colors = ['lightblue', 'lightgreen', 'gold', 'lightcoral']  # one colour per fruit label
plt.pie(fruit_count, labels=fruit_type, shadow=True, colors=colors,
autopct='%1.1f%%')
plt.xlabel('fruit_type')
plt.axis('equal')
plt.title('Pie Chart')
plt.show()
Output:
What information can you understand from the pie chart?
The data in the dataset is not balanced as each of the fruits type has a different
count value –
Apple ( label 1) – 19
Mandarin (label 2) – 5
Orange (label 3) - 19
Lemon (label 4) - 16
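The percentages that autopct='%1.1f%%' prints on each slice are simply each count divided by the total. A minimal sketch in plain Python, using the four counts listed above (the dictionary name fruit_counts is illustrative):

```python
# Counts per fruit label, as listed in the answer above.
fruit_counts = {"Apple": 19, "Mandarin": 5, "Orange": 19, "Lemon": 16}

total = sum(fruit_counts.values())
shares = {name: 100 * count / total for name, count in fruit_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

Mandarin makes up less than a tenth of the samples, which is another way of seeing that the dataset is not balanced.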
Activity 2.4.6
Use the “fruits_with_colors” dataset for plotting a scatter plot to compare the
fruit categories based on the width and height of the fruit.
Copy and paste your code and output below.
Code:
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
data_df = pd.read_csv('fruit_data_with_colours.csv')
plt.figure(figsize=(15,7))
plt.title("Comparison between various fruit based on their width and height")
# x and y are passed by keyword, as recent seaborn versions require
sns.scatterplot(x=data_df['width'], y=data_df['height'],
                hue=data_df['fruit_label'], style=data_df['fruit_label'], s=50)
plt.show()
Output:
What information can you understand from the scatter plot?
Apple (Class 1) lies almost in the middle in terms of its width and height.
Mandarin (Class 2) has the smallest width and height.
Orange (Class 3) has a larger width but a medium height.
Lemon (Class 4) has a smaller width but a greater height.
Activity 2.4.7 (T, E, M)
Can you identify a common feature in all the plots you have generated in activities
2.4.3 – 2.4.6?
The same set of features in the given dataset can be plotted in various formats to get
different insights in a visual manner.
The different plots helped us to see whether the dataset is balanced or not and helped
us to compare the sizes of each fruit type.
Activity 2.4.8 (T, E, M)
Research real-life application of each of the plots you have learned in this section.
Give a short description of how plots were useful.
Try to research the answer on your own.
Student reflection
List three things you have learned and two things you have enjoyed.
Three things I have learned:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
3. ___________________________________________________________________________________
Two things I have enjoyed:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
Learning outcomes
Key Skills (Please tick the box to show your understanding of the skills below.) | I do not understand. | I understand. | I'm an expert.
Explain how interactive data visualisations help others better understand real-world phenomena. | | I can identify data visualisations that can help others better understand real-world phenomena. | I can explain how interactive data visualisations help others better understand real-world phenomena.
Teacher’s comment:
Section 5: Computational models
Aim
In this section, you will learn why it is important to evaluate a learning model. You will also
learn the various performance metrics you can use to evaluate its performance,
including the confusion matrix, precision, recall, F1 score and accuracy. You will learn to
calculate these metrics manually and then using Python. Finally, you will learn to
interpret the metric scores produced by the Python program.
Learning outcomes
• Evaluate a computational model that represents the relationships between
different data elements collected from a phenomenon or process.
Prior knowledge
• Basics of Python programming
• Machine learning models
My STREAM focus
• Science
• Technology
• Engineering
• Mathematics
Key vocabulary
Word | Meaning
model evaluation | process to measure and assess the quality of a system's predictions
accuracy | ratio of correct predictions by total number of input samples
confusion matrix | describes the complete performance of a model in the form of a matrix
precision | measure of how good your model is when the prediction is positive
recall | measure of how good your model is at correctly predicting positive classes
F1 score | weighted average of precision and recall
SB Activity 2.5.1
For the above example, 2.5.5, suppose we have the below confusion matrix for our
classifier. We can use the metrics defined above to evaluate its performance.
Figure 2.5.13
1. Using the confusion matrix in Figure 2.5.13, complete the table below with the
following values:
True Positive (TP) 4252
True Negative (TN) 4706
False Positive (FP) 421
False Negative (FN) 875
2. Based on the values identified above, calculate the following values:
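Accuracy:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{4252 + 4706}{4252 + 4706 + 421 + 875} = \frac{8958}{10254} \approx 0.87$$
Precision:
$$\text{Precision} = \frac{TP}{TP + FP} = \frac{4252}{4252 + 421} \approx 0.91$$
Recall:
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{4252}{4252 + 875} \approx 0.83$$
F1 Score:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.91 \cdot 0.83}{0.91 + 0.83} \approx 0.87$$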
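The manual calculation can be checked with a few lines of plain Python. The sketch below uses the four counts from the table in question 1. The variable names are illustrative.

```python
# Counts taken from the table in question 1.
tp, tn, fp, fn = 4252, 4706, 421, 875

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1 score", f1)]:
    print(f"{name}: {value:.2f}")
```

The program keeps full precision for the intermediate values, so its F1 score can differ in the last digit from a hand calculation that uses already-rounded precision and recall.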
Activity 2.5.1
Categorise the statements below as true or false:
Statements | True / False
1. F1 score is the weighted average of accuracy and precision. | False
2. Precision and recall are inversely related to each other. | False
3. Accuracy, precision, recall and F1 score values can never be greater than 100%. | True
4. The confusion matrix can help you to calculate accuracy, precision and recall values. | True
5. Accuracy is the ratio of false predictions over correct predictions. | False
Activity 2.5.2
Match the terms in Column A with their meaning in Column B.
Column A | Column B
False positive | An outcome where the model incorrectly predicts the positive class
True positive | An outcome where the model correctly predicts the positive class
True negative | An outcome where the model correctly predicts the negative class
False negative | An outcome where the model incorrectly predicts the negative class
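The four terms can be worked out from two questions: was the prediction positive or negative, and was it correct? The sketch below expresses that rule in plain Python. The function name outcome and the use of 1 for the positive class are assumptions made for illustration.

```python
def outcome(actual, predicted, positive=1):
    """Name the confusion-matrix cell for a single prediction."""
    if predicted == positive:
        # The model predicted the positive class.
        return "true positive" if actual == positive else "false positive"
    # The model predicted the negative class.
    return "false negative" if actual == positive else "true negative"


for actual, predicted in [(1, 1), (0, 1), (0, 0), (1, 0)]:
    print(f"actual={actual} predicted={predicted}: {outcome(actual, predicted)}")
```

The second word of each term names what the model predicted, and the first word says whether that prediction was right.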
Activity 2.5.3 (T, E, M)
Download the ‘heart disease’ dataset from the link given below –
https://www.kaggle.com/ronitf/heart-disease-uci
1. Use the dataset to build and train a KNN model to predict if the patient is suffering
from a heart disease or not.
Copy and paste your code below.
Code:
# Importing libraries
import sklearn
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Loading the dataset
data = pd.read_csv("heart.csv")
print(data.head())
# building the decision tree model
knn_model = KNeighborsClassifier(n_neighbors=5)
# Training the decision tree model
x = data[['age', 'gender', 'cp', 'trestbps', 'chol', 'fbs',
'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']]
y = data['target']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size =
0.3)
knn_model.fit(x_train, y_train)
# Model Evaluation
# Predicting the output on the test data
prediction = knn_model.predict(x_test)
print('---------------------')
print("The prediction of the designed model on the test data is -")
print(prediction)
print('---------------------')
Output:
2. Modify your code to print the confusion matrix for the predictions made.
Copy and paste your code and output below.
Code:
# Confusion Matrix
from sklearn.metrics import confusion_matrix
print('---------------------')
print("The confusion matrix for the designed model is - ")
print(confusion_matrix(y_test, prediction))
print('---------------------')
Output:
3. Use the confusion matrix attained in question 2 to calculate the following
manually:
(a) Accuracy
(b) Precision
(c) Recall
(d) F1 score
Note: Please show your calculation steps as well.
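Accuracy: $\frac{TP + TN}{TP + TN + FP + FN} = \frac{22 + 37}{22 + 37 + 27 + 5} \approx 0.648$
Precision (label 1): $\frac{TP}{TP + FP} = \frac{22}{22 + 5} \approx 0.814$
Precision (label 0): $\frac{TN}{TN + FN} = \frac{37}{37 + 27} \approx 0.578$
Recall (label 1): $\frac{TP}{TP + FN} = \frac{22}{22 + 27} \approx 0.448$
Recall (label 0): $\frac{TN}{TN + FP} = \frac{37}{37 + 5} \approx 0.88$
F1 score (label 1): $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.814 \times 0.448}{0.814 + 0.448} \approx 0.578$
F1 score (label 0): $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.578 \times 0.88}{0.578 + 0.88} \approx 0.698$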
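The per-label working above can be cross-checked without scikit-learn. The sketch below assumes the confusion matrix was [[37, 5], [27, 22]]; this arrangement is inferred from the TP, TN, FP and FN values used in the calculations (rows are actual labels, columns are predicted labels, with label 0 first), since the printed output is not reproduced here. `per_label_metrics` is a hypothetical helper name.

```python
# Assumed confusion matrix: rows = actual label, columns = predicted label
matrix = [[37, 5],    # actual 0: TN, FP
          [27, 22]]   # actual 1: FN, TP


def per_label_metrics(matrix, label):
    """Return precision, recall and F1 score for one label."""
    correct = matrix[label][label]
    predicted_as_label = sum(row[label] for row in matrix)  # column total
    actually_label = sum(matrix[label])                      # row total
    precision = correct / predicted_as_label
    recall = correct / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total
print("accuracy:", round(accuracy, 3))

for label in (0, 1):
    p, r, f = per_label_metrics(matrix, label)
    print("label", label, "-", round(p, 3), round(r, 3), round(f, 3))
# prints:
# accuracy: 0.648
# label 0 - 0.578 0.881 0.698
# label 1 - 0.815 0.449 0.579
```

Because the code rounds the unrounded results, a few values differ in the third decimal place from the manual working, which cuts the decimals off and reuses already shortened precision and recall values in the F1 score.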
4. Again, modify your code to print the accuracy, precision, recall and F1 score
values for the predictions made.
Copy and paste your code and output below.
Code:
# accuracy score, recall, precision and f1 score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
print('---------------------')
print("The accuracy of the designed model is - ",
      accuracy_score(y_test, prediction))
# average = None returns one value per label (label 0 first, then label 1)
print("The precision value for the designed model is - ",
      precision_score(y_test, prediction, average = None))
print("The recall value for the designed model is - ",
      recall_score(y_test, prediction, average = None))
print("The F1 score value for the designed model is - ",
      f1_score(y_test, prediction, average = None))
print('---------------------')
Output:
5. When comparing the manually calculated and Python calculated values for
accuracy, precision, recall, and F1 score, what do you observe?
All the values were the same.
6. What conclusions can you make about the KNN model after finding the accuracy,
precision, recall and F1 score values?
1. The KNN model has an accuracy of 0.648, so it correctly classified 64.8% of the
patients in the test data as having heart disease (label 1) or not (label 0).
2. The precision, recall and F1 score are each printed as an array of two values,
one for label 0 and one for label 1, because average = None is passed to the functions.
(a) Precision:
• Label 1 – 81.4% of the patients predicted to have heart disease actually had it.
• Label 0 – 57.8% of the patients predicted not to have heart disease actually did
not have it.
(b) Recall:
• Label 1 – 44.8% of the patients who actually had heart disease were identified
correctly.
• Label 0 – 88.1% of the patients who did not have heart disease were identified
correctly.
3. Our model’s aim was to classify whether a patient has heart disease (label 1) or
not (label 0). The recall for label 1 is low, which means the model missed more than
half of the patients with heart disease. We can conclude that the model did not
classify the patients well and needs to be improved.
Student reflection
List three things you have learned and two things you have enjoyed.
Three things I have learned:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
3. ___________________________________________________________________________________
Two things I have enjoyed:
1. ___________________________________________________________________________________
2. ___________________________________________________________________________________
Learning outcomes
Key Skills (Please tick the box to show your understanding of the skills below.)
Key skill: Evaluate a computational model that represents the relationships between
different data elements collected from a phenomenon or process.
I do not understand.
I understand. I can examine a computational model that represents the relationships
between different data elements collected from a phenomenon or process.
I’m an expert. I can evaluate a computational model that represents the relationships
between different data elements collected from a phenomenon or process.
Teacher’s comment: