
Chapter 2. Data and analysis


Section 2: Machine Learning

Aim

In this section, you will be introduced to the concepts of Machine Learning (ML) and Artificial Intelligence (AI). AI uses large amounts of data, processed by intelligent algorithms, to help computers learn information on their own. Machine learning is one method of achieving artificial intelligence. You will therefore explore the concept of machine learning together with its various algorithms and applications.

Learning outcomes

Assess a range of different machine learning algorithms.

Prior knowledge

• Computer science

• Data representation

• Data organisation and storage

My STREAM focus

• Science

• Technology

• Engineering

Key vocabulary

Word – Meaning

artificial intelligence – ability of a machine to perform tasks that normally require human intelligence

machine learning – study of the algorithms that enable machines to learn from data to make decisions without the help of humans


supervised learning – type of learning that uses examples from the training dataset to learn

unsupervised learning – type of learning which groups similar things based on features from the dataset

reinforcement learning – type of learning which uses a system of rewards and punishments

classification – algorithm that requires the machine to predict the label or class of an object

regression – algorithm that requires the machine to predict a numeric value

clustering – algorithm that requires the machine to group similar things in the same group

association – algorithm that requires the machine to find the relationships between features in the input dataset


SB Activity 2.1.1

Can you identify the purpose of the following technologies?

Activity 2.2.1

Conduct research on how the UAE used artificial intelligence and smart solutions as

part of its strategy to control the spread of COVID-19. Record your answers in the

space given below.


Activity 2.2.2

Categorise the following statements into true/false statements:

Statement – True / False

1. AI is a subset of machine learning. – False
2. A computer that uses machine learning needs to take orders from the user. – True
3. Machine learning uses datasets to learn from data. – True
4. You can find the features and labels in the dataset by default. – True
5. Features can be defined as the unique characteristics of something. – True
6. The accuracy of a machine learning model determines the best relationships and patterns present in a dataset. – False

Activity 2.2.3

1. Name and define the type of data used in supervised learning.

Labelled data is data that comes with a tag, like a name, a type, or a number.

2. What is the main goal of supervised learning?

to predict the correct label for newly presented input data

3. List any five types of supervised learning algorithms.

Decision trees

Linear Regression

Random forest

Logistic regression

Support Vector Machines

4. List one difference between classification and regression algorithms.

Classification: the problem of predicting a discrete class label output for an example.

Regression: the problem of predicting a continuous quantity output for an example.
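To make the contrast concrete, the following is a minimal sketch (not part of the workbook answers) using scikit-learn with small made-up datasets; all feature values and targets here are purely illustrative.

# Classification: predict a discrete class label (e.g. 0 = cat, 1 = dog)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

x_clf = [[3], [4], [25], [30]]          # feature: weight in kg (illustrative)
y_clf = [0, 0, 1, 1]                    # class labels
clf = DecisionTreeClassifier().fit(x_clf, y_clf)
print(clf.predict([[28]]))              # output is a class label, e.g. [1]

# Regression: predict a continuous numeric value (e.g. a price)
x_reg = [[1], [2], [3], [4]]            # feature: number of bedrooms (illustrative)
y_reg = [100, 150, 200, 250]            # target: price in thousands
reg = LinearRegression().fit(x_reg, y_reg)
print(reg.predict([[5]]))               # output is a number, e.g. [300.]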


5. Classify each of the following supervised learning applications as "Classification" or "Regression":

Application – Classification/Regression

Customer behaviour prediction – Classification
Predict the price of a car based on the latest technologies available – Regression
Predict the price of a house based on data such as quality of schools in the area, number of bedrooms in the house and house location – Regression
Choosing a cat's breed based on its physical features such as height, width and skin colour – Classification
Prediction of the temperature of any day based on wind speed, humidity, atmospheric pressure – Regression

Activity 2.2.4

1. Name and define the type of data used in unsupervised learning.

Unlabelled data is data that comes with no tag.

2. What is the main goal of unsupervised learning?

to discover hidden and interesting patterns in unlabeled data

3. List any three types of unsupervised learning algorithms.

K-means clustering

Principal component analysis

Neural networks – deep belief networks

4. List one difference between clustering and association algorithms with one

example each.

Clustering: grouping a set of objects in such a manner that objects in the same group are more similar to each other than to objects belonging to other groups.

Association: finding associations amongst items within large commercial databases.
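As a small illustration of clustering (association rule mining usually needs a separate library such as mlxtend, so it is not shown here), the sketch below groups unlabelled points with K-means; the point values are made up.

# Minimal clustering sketch: group similar points without any labels
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]]   # [height, width] of some objects
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(points))   # e.g. [0 0 1 1] - similar points fall in the same group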


Activity 2.2.5

Differentiate between the three types of machine learning algorithm.

Supervised Learning: works with labelled data; the output data patterns are known to the system.

Unsupervised Learning: deals with unlabelled data, where the output is based on the collection of perceptions.

Reinforcement Learning: based on interaction with environments using a trial-and-error process.

Activity 2.2.6

Conduct research on the real-life application of reinforcement learning. Record your

answers in the space given below.

Activity 2.2.7

Choose the correct option for the following questions:

1. Which of the following is an example of negative reinforcement?

a. Every time you eat chocolate, you get a stomachache.

b. You do not study for a test, and you receive a bad grade.

c. Whenever you drink coffee in the morning, you are able to get your work done

more quickly.

d. Using an umbrella results in you not getting rained on. Therefore, you start to bring an umbrella with you whenever rain is in the forecast.

2. Reinforcement Learning is -

a. Supervised learning

b. Unsupervised learning

c. Reward-based learning

d. Semi-supervised learning

Try to research the answer on your own.


Student reflection

List three things you have learned and two things you have enjoyed.

Three things I have learned:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

3. ___________________________________________________________________________________

Two things I have enjoyed:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

Learning outcomes

Key Skills (Please tick the box to show your understanding of the skills below.): I do not understand. / I understand. / I'm an expert.

Assess a range of different machine learning algorithms.
• I can explain a range of different machine learning algorithms.
• I can assess a range of different machine learning algorithms.

Teacher's comment:


Section 3: ML Algorithms

Aim

In this section, you will learn how to build a machine learning model. You will first be introduced to the steps of creating an ML model: collecting the data, then building, training, and evaluating the model. You will also be introduced to the Scikit-learn and pandas libraries, which you will use to write the Python code for the ML model in the PyCharm IDE. You will learn how to collect data in three different ways. Further, you will learn how to build your first machine learning model in Python. Finally, you will evaluate your ML model using new data and determine its accuracy percentage.
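As a preview of these steps, here is a minimal sketch (assuming scikit-learn and its built-in iris data) showing the whole pipeline in a few lines; the activities later in this section walk through each step in detail.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()                                          # 1. collect the data
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3)
model = DecisionTreeClassifier()                            # 2. build the model
model.fit(x_train, y_train)                                 # 3. train the model
predictions = model.predict(x_test)
print("Accuracy:", accuracy_score(y_test, predictions))     # 4. evaluate the model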

Learning outcomes

Select appropriate data in order to implement a basic classification system using

an AI algorithm.

Prior knowledge

• Python programming

• Machine learning models

My STREAM focus

• Science

• Technology

• Engineering

Key vocabulary

Word – Meaning

dataset – collection of data in which data is arranged in some order

feature – individual, independent variable which acts as an input in a model

class label – final output or target output of a model


evaluation – measure of the machine learning model's performance

accuracy – percentage of correct predictions out of the total predictions


SB Activity 2.3.1

Identify the features for the iris dataset.

• petal length

• petal width

• sepal length

• sepal width

SB Activity 2.3.2

Identify the class labels for the iris dataset.

• Iris Versicolor

• Iris Setosa

• Iris Virginica

SB Activity 2.3.3

Can you identify what the output is in the context of the iris dataset?


SB Activity 2.3.5

1. Modify the code to print the values stored in each of the following variables:

(a) x_train

(b) x_test

(c) y_train

(d) y_test

Observe the results and paste a picture of the output below.

print(x_train)
print(x_test)
print(y_train)
print(y_test)


2. How many samples are present in the training and test set, respectively?

Training set samples – 105

Test set samples - 45
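These counts come from the 70/30 split of the 150 iris samples. Here is a minimal sketch of how the four variables are produced, assuming the iris data is loaded as in the student book examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3)   # 30% of the data is held back for testing

print(len(x_train), len(x_test))   # 105 45
print(x_train)                     # training features
print(y_train)                     # training labels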


Palmer Penguin Dataset

This dataset is used as a replacement for the iris dataset for analysis purposes. It was released in August 2020 and contains data for 344 penguins of three different species, as shown in the figure below. The data was collected from three islands in the Palmer Archipelago, Antarctica.

Figure 2.3.1

The dataset represents the following features:

• species - a factor denoting penguin species- Adélie, Chinstrap and Gentoo

• island - a factor denoting island in Palmer Archipelago, Antarctica - Biscoe,

Dream or Torgersen

• bill_length_mm - a number denoting bill length – in millimetres

• bill_depth_mm - a number denoting bill depth – in millimetres

• flipper_length_mm - an integer denoting flipper length – in millimetres

• body_mass_g - an integer denoting body mass – in grams

• Gender - a factor denoting penguin gender – female, male

The culmen is the upper ridge of a bird’s bill. In the given data, culmen length and

depth are renamed as variables bill_length_mm and bill_depth_mm. For this penguin

data, the bill length and depth are measured as shown below.

Figure 2.3.2


Complete all the activities below using the information provided above to build an ML

model using the following available algorithms in Python:

(a) Decision Tree

(b) Random Forest

(c) Logistic Regression

(d) KNN

(e) Naïve Bayes

(f) SVM

While evaluating your models, you will compare the accuracy obtained from each of the algorithms mentioned above and conclude which algorithm performs best.

Activity 2.3.1

1. Can you identify the features and class labels of the penguin dataset in

the table below?

Features: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g

Class label: Gender

2. What type of classification will be done using the penguin dataset?

The supervised learning concept will be used to classify the data into the two genders of penguins – male and female.

3. Categorise the following statement as True or False.

Statement – True / False

Features and class labels can be differentiated as independent and dependent variables, respectively. – True

Note:

• Install the package “palmerpenguins” on your IDE to download the dataset directly (a minimal loading sketch follows after this note).

• You can use Example 2.3.2 from your student book.

• Save the complete program for each of the algorithms in different files.
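A minimal loading sketch, assuming the palmerpenguins package is installed (pip install palmerpenguins). Note that the column names in the package may differ slightly from the workbook CSV, for example 'sex' rather than 'Gender'.

from palmerpenguins import load_penguins

data_df = load_penguins()      # returns a pandas DataFrame with 344 rows
print(data_df.head())
print(data_df.info())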


Activity 2.3.2

Collect the dataset

Write the Python code to load the data from “palmerpenguins” package.

Copy and paste your code in the space below and a screenshot of the

output.

Code:

import pandas as pd

#loading the dataset

data_df = pd.read_csv('penguins.csv')

print(data_df.head())

Output:

Activity 2.3.3

Building and training the model

Modify the code from Activity 2.3.2 to build and train the model using the

following classification algorithms – (a) Decision Tree

(b) Random Forest

(c) Logistic Regression

(d) KNN

(e) Naïve Bayes

(f) SVM

Copy and paste the code below for each of the algorithms.

Code:

#importing libraries

import sklearn

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier


from sklearn.naive_bayes import GaussianNB

from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

#separating the features and class labels
x = data_df[['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = data_df['Gender']

# building the model

dtc = DecisionTreeClassifier()

rcm = RandomForestClassifier()

logreg_clf = LogisticRegression()

knn_model = KNeighborsClassifier(n_neighbors=5)

na_model = GaussianNB()

SVC_model = SVC()

# training the model

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

dtc = dtc.fit(x_train, y_train)

rcm = rcm.fit(x_train, y_train)

logreg_clf = logreg_clf.fit(x_train, y_train)

knn_model = knn_model.fit(x_train, y_train)

na_model = na_model.fit(x_train, y_train)

SVC_model = SVC_model.fit(x_train, y_train)

Note:

The model is trained by fitting it to the training data using the fit() method. Model fitting describes how well a machine learning model generalises to data similar to that on which it was trained. Fitting adjusts the model's parameters so that it can solve your specific real-world problem with a high level of accuracy.
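One practical point, added here as an assumption rather than as part of the workbook code: the real penguin data contains missing values, and the 'species' and 'island' columns are text, which most scikit-learn classifiers cannot use directly. A minimal preprocessing sketch:

import pandas as pd

data_df = pd.read_csv('penguins.csv')
data_df = data_df.dropna()                       # drop rows with missing values
x = pd.get_dummies(data_df[['species', 'island', 'bill_length_mm', 'bill_depth_mm',
                            'flipper_length_mm', 'body_mass_g']])   # one-hot encode the text columns
y = data_df['Gender']                            # class label column as named in the workbook CSV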


Activity 2.3.4

Evaluate the model

1. Modify the code from the previous activity to evaluate the model for each of the following classification algorithms – (a) Decision Tree

(b) Random Forest

(c) Logistic Regression

(d) KNN

(e) Naïve Bayes

(f) SVM

#importing libraries

from sklearn.metrics import accuracy_score

# evaluating the model

# decision tree classifier

predictions1 = dtc.predict(x_test)

print("Predicted output for decision tree classifier is:",

predictions1)

print("The accuracy is", accuracy_score(y_test, predictions1))

# random forest classifier

predictions2 = rcm.predict(x_test)

print("Predicted output for random forest classifier is:",

predictions2)

print("The accuracy is", accuracy_score(y_test, predictions2))

# logistic regression

predictions3 = logreg_clf.predict(x_test)

print("Predicted output with logistic regression classifier is:",

predictions3)

print("The accuracy is", accuracy_score(y_test, predictions3))

# KNN model

predictions4 = knn_model.predict(x_test)

print("Predicted output for KNN classifier is:", predictions4)

print("The accuracy is", accuracy_score(y_test, predictions4))

# Naive Bayes

predictions5 = na_model.predict(x_test)


print("Predicted output for Naive Bayes classifier is:",

predictions5)

print("The accuracy is", accuracy_score(y_test, predictions5))

# SVM

predictions6 = SVC_model.predict(x_test)

print("Predicted output for SVM classifier is:", predictions6)

print("The accuracy is", accuracy_score(y_test, predictions6))

2. Complete the table below to record the predictions and the accuracy value for

each of the algorithms:

Classification algorithm – Prediction output – Accuracy

Decision Tree
Random Forest
Logistic Regression
KNN
Naïve Bayes
SVM

3. (a) Which algorithm has the best accuracy value?

(b) What is the meaning of having the highest accuracy value?

(a) Random Forest Classifier

(b) This classifier is able to make the highest number of correct predictions for the

penguin dataset.


Activity 2.3.5 (T,E)

Categorise the following statements into true/false statements.

Statement – True / False

1. Training the model is the second step in the ML process. – False
2. The testing dataset should be larger than the training dataset. – False
3. Usually, 50% of the dataset is used for training and 50% for testing. – False
4. Accuracy is used as an evaluation of the model’s prediction. – True
5. The evaluation process is used to measure a model’s performance. – True

Student reflection

List three things you have learned and two things you have enjoyed.

Three things I have learned:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

3. ___________________________________________________________________________________

Two things I have enjoyed:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

Learning outcomes

Key Skills (Please tick the box to show your understanding of the skills below.): I do not understand. / I understand. / I'm an expert.

Select appropriate data in order to implement a basic classification system using an AI algorithm.
• I can collect data for training a classification model.
• I can select appropriate data in order to implement a basic classification system using an AI algorithm.

Teacher's comment:


Section 4: Data visualisation

Aim

In this section, you will learn to visualise the data collected in the form of different plots.

You will learn about the different libraries required to generate a particular plot style, including line plots, bar plots, pie charts, and scatter plots. You will then understand what type

of information can be inferred from each plot type. You will also learn how to choose the

right chart or graph for your data.

Learning outcomes

• Explain how interactive data visualisations help others better understand real-

world phenomena.

Prior knowledge

• Basics of Python programming

• Machine learning models

My STREAM focus

• Science

• Technology

• Engineering

• Mathematics

Key vocabulary

Word – Meaning

data analysis – process of studying data to find useful information

correlation – relationship between different data elements


data visualisation – graphical representation of information and data

line plot – graphical representation of data by connecting all data points with a line

bar plot – chart that plots data using rectangular bars or columns

pie chart – type of graph that displays data in a circular graph

scatter plot – graph in which the values of two variables are plotted along two axes


SB Activity 2.4.1 – Loading of data

The first step of any AI/ML process is to load the data. You have learned how to do

this in the previous section. Open your Python IDE and load the iris dataset in

preparation for visualising the data.

Copy and paste your code below.

import sklearn

from sklearn.datasets import load_iris

iris = load_iris()

OR

import pandas as pd

iris_df = pd.read_csv('iris.csv')

SB Activity 2.4.2

Try the following for the iris dataset:

Modify the code in Figure 2.4.4 to generate a line plot with different colours between –

• class vs sepal_wid

• class vs petal_wid

• class vs petal_len

• class vs sepal_wid:

import sklearn

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(iris.data)

iris_df['class'] = iris.target

iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',

'class']

a = iris_df['class'].value_counts()

iris_df.plot(kind='line',x='class',y='sepal_wid', color='blue')

plt.xlabel('class')

plt.ylabel('Sepal Width')

plt.title('Line Plot')

plt.show()


• class vs petal_wid:

import sklearn

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(iris.data)

iris_df['class'] = iris.target

iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',

'class']

a = iris_df['class'].value_counts()

iris_df.plot(kind='line',x='class',y='petal_wid', color='green')

plt.xlabel('class')

plt.ylabel('Petal Width')

plt.title('Line Plot')

plt.show()


• class vs petal_len:

import sklearn

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(iris.data)

iris_df['class'] = iris.target

iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',

'class']

a = iris_df['class'].value_counts()

iris_df.plot(kind='line',x='class',y='petal_len', color='black')

plt.xlabel('class')

plt.ylabel('Petal Length')

plt.title('Line Plot')

plt.show()


SB Activity 2.4.3

You must have noticed that the code for scatter plot is mostly similar to the code for

line plot. Answer the following questions to check if you understand the code layout

for scatter plot in Figure 2.4.13:

(a) How do you identify that the code is written to plot a scatter plot?

iris_df.plot( kind='scatter',x='petal_len',y='petal_wid', color='red')

(b) Which features of the iris dataset are compared in the scatter plot?

Petal width and Petal length

(c) What functions are used to label the x-axis and y-axis in the scatter plot?

plt.xlabel('Petal Length')

plt.ylabel('Petal Width')

(d) What will happen if you remove the plt.show() function from the code?

The scatter plot will not be displayed.

(e) Can you identify which points in the output belong to each of the iris species?

No


SB Activity 2.4.4

(a) Modify the code in Figure 2.4.14 to compare between petal length and width of

the iris species in the dataset.

Copy and paste your code and output below.

Code:

import sklearn

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

import seaborn as sns

from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(iris.data)

iris_df['class'] = iris.target

iris_df.columns = ['sepal_len', 'sepal_wid','petal_len', 'petal_wid',

'class']

plt.figure(figsize=(15,7))

plt.title("Comparison between various species based on petal length

and width")

sns.scatterplot(x=iris_df['petal_len'], y=iris_df['petal_wid'], hue=iris_df['class'], style=iris_df['class'], s=50)

plt.show()

Output:

(b) What information can you understand from the scatter plot?

• Iris Setosa (Class 0) has the smallest petal length and petal width
• Versicolor (Class 1) lies roughly in the middle for both petal length and petal width
• Virginica (Class 2) has the largest petal lengths and petal widths


Activity 2.4.1

Research and identify the purpose of each of the following libraries used in

Python.

Library – Purpose

Pandas –
NumPy –
Scikit Learn –
Matplotlib –
Seaborn –

Try to research the answer on your own.

Activity 2.4.2

Use the link below to download the “fruits_with_colors” dataset -

https://www.kaggle.com/anandraos/fruit-data-with-colours

1. Identify the features and the class labels for the dataset.

Features – fruit_name, fruit_subtype, mass, width, height, color_score

Class labels – fruit_label

2. Perform the following actions -

• Load the dataset

• Explore the dataset to gain more information about the data.

Copy and paste your code and output below.

Code:

import pandas as pd

data_df = pd.read_csv('fruit_data_with_colours.csv')

print('The fruit dataset insights are:')

print(data_df.info())

print('The statistical descriptions of the dataset are:')

print(data_df.describe())

print('The number of values for each class labels are:')

print(data_df['fruit_label'].value_counts())

print('The following list displays if there are any null values inside the dataset:')

print(pd.isnull(data_df).any())

print('The shape of the given dataset is:')

print(data_df.shape)

print('The unique values in the dataset are:')
print(data_df.nunique())

Output:


3. What information can you get from the above output?

Activity 2.4.3

You are planning to visit Jebel Al Ali with your parents, for which you have to

rent a car for around 5 hours. The car rental cost includes a deposit of AED 50

plus an hourly rate of AED 20. Therefore, the cost to rent the car for up to 5

hours is shown in the table below.

No. of hours 0 1 2 3 4 5

Cost in AED 50 70 90 110 130 150

Write a code for plotting a line plot using the table shown above by following

the steps below –

(a) Store the data in the table in a dataframe.

(b) Create a line plot of the data using the created dataframe.

Copy and paste your code and output below.

Code:

# importing libraries

import pandas as pd

import sklearn

import matplotlib.pyplot as plt

import numpy as np

# storing the given data in a dataframe

data = pd.DataFrame([[0, 50],[1, 70],[2, 90], [3, 110], [4, 130], [5,

150]])

data.columns = ['No. of hours' , 'Cost in AED']

print(data)

# creating the line plot in the data

data.plot(kind='line',x='No. of hours',y='Cost in AED', color='red')

plt.xlabel('No. of hours')

plt.ylabel('Cost in AED')

plt.title('Line Plot')

plt.show()

Try to answer on your own using Table 2.4.2 in your student book.


Output:

(c) What kind of information can you understand from the generated line plot?

The line plot generated has an upward trend, indicating a positive correlation between the number of hours and the cost.
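The straight line appears because the cost follows a simple linear rule: Cost in AED = 50 + 20 × (number of hours). For example, 3 hours costs 50 + 20 × 3 = AED 110, and 5 hours costs 50 + 20 × 5 = AED 150, matching the table.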


Activity 2.4.4

Use the “fruits_with_colors” dataset for plotting a bar plot representing the fruit

categories and the count of their occurrences.

Copy and paste your code and output below.

Code:

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

data_df = pd.read_csv('fruit_data_with_colours.csv')

a = data_df['fruit_label'].value_counts()

fruit_type = a.index

fruit_count = a.values

plt.bar(fruit_type, fruit_count, color='red', width=0.5)

plt.xlabel('fruit_type')

plt.ylabel('fruit_count')

plt.title('Bar Plot')

plt.show()

Output:

What information can you understand from the bar plot?

The data in the dataset is not balanced, as each fruit type has a different count:

Apple (label 1) – 19
Mandarin (label 2) – 5
Orange (label 3) – 19
Lemon (label 4) – 16


Activity 2.4.5 (T, E, A, M)

Use the “fruits_with_colors” dataset for plotting a pie chart representing the

fruit categories and the count of their occurrences.

Copy and paste your code and output below.

Code:

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

data_df = pd.read_csv('fruit_data_with_colours.csv')

a = data_df['fruit_label'].value_counts()

fruit_type = a.index

fruit_count = a.values

colors = ['lightblue', 'lightgreen', 'gold']

plt.pie(fruit_count, labels=fruit_type, shadow=True, colors=colors,

autopct='%1.1f%%')

plt.xlabel('fruit_type')

plt.axis('equal')

plt.title('Pie Chart')

plt.show()

Output:


What information can you understand from the pie chart?

The data in the dataset is not balanced, as each fruit type has a different count:

Apple (label 1) – 19
Mandarin (label 2) – 5
Orange (label 3) – 19
Lemon (label 4) – 16

Activity 2.4.6

Use the “fruits_with_colors” dataset for plotting a scatter plot to compare the

fruit categories based on the width and height of the fruit.

Copy and paste your code and output below.

Code:

import sklearn

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

import seaborn as sns

data_df = pd.read_csv('fruit_data_with_colours.csv')

plt.figure(figsize=(15,7))

plt.title("Comparison between various fruit based on their width and

height")

sns.scatterplot(x=data_df['width'], y=data_df['height'], hue=data_df['fruit_label'], style=data_df['fruit_label'], s=50)

plt.show()

Output:


What information can you understand from the scatter plot?

Apple (Class 1) lies roughly in the middle in terms of its width and height.
Mandarin (Class 2) has the smallest width and height.
Orange (Class 3) has a larger width but a medium height.
Lemon (Class 4) has a smaller width but a greater height.

Activity 2.4.7 (T, E, M)

Can you identify a common feature in all the plots you have generated in activities

2.4.3 – 2.4.6?

The same set of features in a given dataset can be plotted in various formats to gain different insights in a visual manner. The different plots helped us to see whether the dataset is balanced or not and to compare the sizes of each fruit type.

Activity 2.4.8 (T, E, M)

Research real-life application of each of the plots you have learned in this section.

Give a short description of how plots were useful.

Try to research the answer on your own.


Student reflection

List three things you have learned and two things you have enjoyed.

Three things I have learned:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

3. ___________________________________________________________________________________

Two things I have enjoyed:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

Learning outcomes

Key Skills (Please tick the box to show your understanding of the skills below.): I do not understand. / I understand. / I'm an expert.

Explain how interactive data visualisations help others better understand real-world phenomena.
• I can identify data visualisations that can help others better understand real-world phenomena.
• I can explain how interactive data visualisations help others better understand real-world phenomena.

Teacher's comment:


Section 5: Computational models

Aim

In this section, you will learn why it is important to evaluate a learning model. You will also

learn the various performance metrics you can use to evaluate its performance,

including the confusion matrix, precision, recall, F1 score and accuracy. You will learn to

calculate these metrics manually and then using Python. Finally, you will learn to

understand the meaning of the metrics score achieved from the Python program.

Learning outcomes

• Evaluate a computational model that represents the relationships between

different data elements collected from a phenomenon or process.

Prior knowledge

• Basics of Python programming

• Machine learning models

My STREAM focus

• Science

• Technology

• Engineering

• Mathematics

Key vocabulary

Word – Meaning

model evaluation – process to measure and assess the quality of a system's predictions

accuracy – ratio of correct predictions to the total number of input samples


confusion matrix – describes the complete performance of a model in the form of a matrix

precision – measure of how good your model is when the prediction is positive

recall – measure of how good your model is at correctly predicting positive classes

F1 score – weighted average of precision and recall


SB Activity 2.5.1

For Example 2.5.5 above, suppose we have the confusion matrix below for our classifier. We can use the metrics defined above to evaluate its performance.

Figure 2.5.13

1. Using the confusion matrix in Figure 2.5.13, complete the table below with the

following values:

True Positive (TP) 4252

True Negative (TN) 4706

False Positive (FP) 421

False Negative (FN) 875

2. Based on the values identified above, calculate the following values:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (4252 + 4706) / (4252 + 4706 + 421 + 875) = 8958 / 10254 = 0.87

Precision = TP / (TP + FP) = 4252 / (4252 + 421) = 0.91

Recall = TP / (TP + FN) = 4252 / (4252 + 875) = 0.83

F1 Score = (2 × Precision × Recall) / (Precision + Recall) = (2 × 0.91 × 0.83) / (0.91 + 0.83) = 0.87
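The hand calculations above can be checked with a short Python sketch (not from the student book) that plugs the confusion matrix values into the same formulas:

TP, TN, FP, FN = 4252, 4706, 421, 875

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))   # 0.87 0.91 0.83 0.87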


Activity 2.5.1

Categorise the statements below as true or false:

Statements – True / False

1. F1 score is the weighted average of accuracy and precision. – False
2. Precision and recall are inversely related to each other. – False
3. Accuracy, precision, recall and F1 score values can never be greater than 100%. – True
4. The confusion matrix can help you to calculate accuracy, precision and recall values. – True
5. Accuracy is the ratio of false predictions over correct predictions. – False

Activity 2.5.2

Match the terms in Column A with their meaning in Column B.

Column A: False positive, True positive, True negative, False negative

Column B (match each term with its meaning):
• An outcome where the model correctly predicts the negative class
• An outcome where the model incorrectly predicts the negative class
• An outcome where the model incorrectly predicts the positive class
• An outcome where the model correctly predicts the positive class

Activity 2.5.3 (T, E, M)

Download the ‘heart disease’ dataset from the link given below –

https://www.kaggle.com/ronitf/heart-disease-uci

1. Use the dataset to build and train a KNN model to predict whether the patient is suffering from heart disease or not.

Copy and paste your code below.


Code:

# Importing the dataset

import sklearn

import pandas as pd

import numpy as np

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

# Loading the dataset

data = pd.read_csv("heart.csv")

print(data.head())

# building the KNN model

knn_model = KNeighborsClassifier(n_neighbors=5)

# Training the KNN model

x = data[['age', 'gender', 'cp', 'trestbps', 'chol', 'fbs',

'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']]

y = data['target']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

knn_model.fit(x_train, y_train)

# Model Evaluation

# Predicting the output on the test data

prediction = knn_model.predict(x_test)

print('---------------------')

print("The prediction of the designed model on the test data is -")

print(prediction)

print('---------------------')

Output:


2. Modify your code to print the confusion matrix for the predictions made.

Copy and paste your code and output below.

Code:

# Confusion Matrix

from sklearn.metrics import confusion_matrix

print('---------------------')

print("The confusion matrix for the the designed model is - ")

print(confusion_matrix(y_test, prediction))

print('---------------------')

Output:

3. Use the confusion matrix attained in question 2 to calculate the following

manually:

(a) Accuracy

(b) Precision

(c) Recall

(d) F1 score

Note: Please show your calculation steps as well.

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (22 + 37) / (22 + 37 + 27 + 5) = 0.648

Precision (label 1) = TP / (TP + FP) = 22 / (22 + 5) = 0.814

Precision (label 0) = TN / (TN + FN) = 37 / (37 + 27) = 0.578

Recall (label 1) = TP / (TP + FN) = 22 / (22 + 27) = 0.448

Recall (label 0) = TN / (TN + FP) = 37 / (37 + 5) = 0.88

F1 score (label 1) = (2 × Precision × Recall) / (Precision + Recall) = (2 × 0.814 × 0.448) / (0.814 + 0.448) = 0.578

F1 score (label 0) = (2 × Precision × Recall) / (Precision + Recall) = (2 × 0.578 × 0.88) / (0.578 + 0.88) = 0.698


4. Again, modify your code to print the accuracy, precision, recall and F1 score

values for the predictions made.

Copy and paste your code and output below.

Code:

# accuracy score, recall, precision and f1 score

from sklearn.metrics import accuracy_score

from sklearn.metrics import recall_score

from sklearn.metrics import precision_score

from sklearn.metrics import f1_score

print('---------------------')

print("The accuracy of the designed model - ", accuracy_score(y_test,

prediction))

print("The precision value for the the designed model is - ",

precision_score(y_test, prediction, average = None))

print("The recall value for the the designed model is - ",

recall_score(y_test, prediction, average = None))

print("The F1 score value for the the designed model is - ",

f1_score(y_test, prediction, average = None))

print('---------------------')
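As an optional aside (not part of the workbook code), scikit-learn can also report all of the per-class metrics in a single call:

from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))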

Output:

5. When comparing the manually calculated and Python calculated values for

accuracy, precision, recall, and F1 score, what do you observe?

All the values were the same.

6. What conclusions can you make about the KNN model after finding the accuracy,

precision, recall and F1 score values?


1. The KNN model has an accuracy of 0.648, i.e. 64.8%, at classifying whether a patient has heart disease (label 1) or not (label 0).

2. It is interesting to note that the precision, recall and F1 score are each returned as a pair of values, one per class label.

(a) Precision:
• Label 1 – 81.4% of the patients predicted to have heart disease actually have it.
• Label 0 – 57.8% of the patients predicted to be free of heart disease actually do not have it.

(b) Recall:
• Label 1 – 44.8% of the patients who actually have heart disease were diagnosed correctly.
• Label 0 – 88.1% of the patients who do not have heart disease were identified correctly.

3. Our model's aim was to classify whether a patient has heart disease (label 1) or not (label 0). Observing the precision and recall values, we can conclude that the model was not able to classify the data properly; hence, the accuracy needs to be improved.


Student reflection

List three things you have learned and two things you have enjoyed.

Three things I have learned:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

3. ___________________________________________________________________________________

Two things I have enjoyed:

1. ___________________________________________________________________________________

2. ___________________________________________________________________________________

Learning outcomes

Key Skills (Please tick the box to show your understanding of the skills below.): I do not understand. / I understand. / I'm an expert.

Evaluate a computational model that represents the relationships between different data elements collected from a phenomenon or process.
• I can examine a computational model that represents the relationships between different data elements collected from a phenomenon or process.
• I can evaluate a computational model that represents the relationships between different data elements collected from a phenomenon or process.

Teacher's comment: