introduction motivation for data visualization · 2014-10-24 · data visualization in r l. torgo...

26
Data Visualization in R L. Torgo [email protected] Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto Oct, 2014 Introduction Motivation for Data Visualization Humans are outstanding at detecting patterns and structures with their eyes Data visualization methods try to explore these capabilities In spite of all advantages visualization methods also have several problems, particularly with very large data sets © L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 2 / 51

Upload: others

Post on 28-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Data Visualization in R

L. Torgo

[email protected]

Faculdade de Ciências / LIAAD-INESC TEC, LAUniversidade do Porto

Oct, 2014

Introduction

Motivation for Data Visualization

Humans are outstanding at detecting patterns and structures withtheir eyesData visualization methods try to explore these capabilitiesIn spite of all advantages visualization methods also have severalproblems, particularly with very large data sets

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 2 / 51

Page 2: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Introduction

Outline of what we will learn

Tools for univariate dataTools for bivariate dataTools for multivariate data

Multidimensional scaling methods

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 3 / 51

Introduction

R Graphics

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 4 / 51

Page 3: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Introduction to Standard Graphics

Standard Graphics (the graphics package)

R standard graphics, available through package graphics,includes several functions that provide standard statistical plots,like:

ScatterplotsBoxplotsPiechartsBarplotsetc.

These graphs can be obtained tyipically by a single function callExample of a scatterplot

plot(1:10,sin(1:10))

These graphs can be easily augmented by adding severalelements to these graphs (lines, text, etc.)

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 5 / 51

Introduction to Standard Graphics

Graphics Devices

R graphics functions produce output that depends on the activegraphics deviceThe default and more frequently used device is the screenThere are many more graphical devices in R, like the pdf device,the jpeg device, etc.The user just needs to open (and in the end close) the graphicsoutput device she/he wants. R takes care of producing the type ofoutput required by the deviceThis means that to produce a certain plot on the screen or as aGIF graphics file the R code is exactly the same. You only need toopen the target output device before!Several devices may be open at the same time, but only one is theactive device

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 6 / 51

Page 4: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Introduction to Standard Graphics

A few examples

A scatterplot

plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))

The same but stored on a jpeg file

jpeg('exp.jpg')plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))dev.off()

And now as a pdf file

pdf('exp.pdf',width=6,height=6)plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))dev.off()

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 7 / 51

Introduction to ggplot

Package ggplot2

Package ggplot2 implements the ideas created by Wilkinson(2005) on a grammar of graphicsThis grammar is the result of a theoretical study on what is astatistical graphicggplot2 builds upon this theory by implementing the concept ofa layered grammar of graphics (Wickham, 2009)The grammar defines a statistical graphic as:

a mapping from data into aesthetic attributes (color, shape, size,etc.) of geometric objects (points, lines, bars, etc.)

L. Wilkinson (2005). The Grammar of Graphics. Springer.

H. Wickham (2009). A layered grammar of graphics. Journal of Computational and Graphical Statistics.

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 8 / 51

Page 5: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Introduction to ggplot

The Basics of the Grammar of Graphics

Key elements of a statistical graphic:dataaesthetic mappingsgeometric objectsstatistical transformationsscalescoordinate systemfaceting

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 9 / 51

Introduction to ggplot

Aesthetic Mappings

Controls the relation between data variables and graphic variablesmap the Temperature variable of a data set into the x variable in ascatter plotmap the Species of a plant into the colour of dots in a graphicetc.

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 10 / 51

Page 6: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Introduction to ggplot

Geometric Objects

Controls what is shown in the graphicsshow each observation by a point using the aesthetic mappings thatmap two variables in the data set into the x, y variables of the plotetc.

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 11 / 51

Introduction to ggplot

Statistical Transformations

Allows us to calculate and do statistical analysis over the data inthe plot

Use the data and approximate it by a regression line on the x, ycoordinatesCount occurrences of certain valuesetc.

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 12 / 51

Page 7: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Introduction to ggplot

Scales

Maps the data values into values in the coordinate system of thegraphics device

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 13 / 51

Introduction to ggplot

Coordinate System

The coordinate system used to plot the dataCartesianPolaretc.

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 14 / 51

Page 8: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Introduction to ggplot

Faceting

Split the data into sub-groups and draw sub-graphs for each group

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 15 / 51

Introduction to ggplot

A Simple Example

data(iris)library(ggplot2)ggplot(iris,

aes(x=Petal.Length,y=Sepal.Length,colour=Species)

) + geom_point()

●●

●●

●●

●●

● ●

●●

● ●

●●

5

6

7

8

2 4 6Petal.Length

Sep

al.L

engt

h Species

setosa

versicolor

virginica

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 16 / 51

Page 9: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Distribution of Values of Nominal Variables

The Distribution of Values of Nominal VariablesThe Barplot

The Barplot is a graph whose main purpose is to display a set ofvalues as heights of barsIt can be used to display the frequency of occurrence of differentvalues of a nominal variable as follows:

First obtain the number of occurrences of each valueThen use the Barplot to display these counts

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 17 / 51

Univariate Plots Distribution of Values of Nominal Variables

Barplots in base graphics

barplot(table(algae$season),main='Distribution of the Water Samples across Seasons')

autumn spring summer winter

Distribution of the Water Samples across Seasons

010

2030

4050

60

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 18 / 51

Page 10: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Distribution of Values of Nominal Variables

Barplots in ggplot2

ggplot(algae,aes(x=season)) + geom_bar() +ggtitle('Distribution of the Water Samples across Seasons')

0

20

40

60

autumn spring summer winterseason

coun

t

Distribution of the Water Samples across Seasons

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 19 / 51

Univariate Plots Distribution of Values of Nominal Variables

Pie Charts

Pie charts serve the same purpose as bar plots but present theinformation in the form of a pie.pie(table(algae$season),

main='Distribution of the Water Samples across Seasons')

autumn

spring

summer

winter

Distribution of the Water Samples across Seasons

Note: Avoid pie charts!

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 20 / 51

Page 11: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Distribution of Values of Nominal Variables

Pie Charts in ggplot

ggplot(algae,aes(x=factor(1),fill=season)) + geom_bar(width=1) +ggtitle('Distribution of the Water Samples across Seasons') +

coord_polar(theta="y")

50

100

150

0/200

1

count

fact

or(1

)

season

autumn

spring

summer

winter

Distribution of the Water Samples across Seasons

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 21 / 51

Univariate Plots Distribution of the Values of a Continuous Variable

The Distribution of Values of a Continuous VariableThe Histogram

The Histogram is a graph whose main purpose is to display howthe values of a continuous variable are distributedIt is obtained as follows:

First the range of the variable is divided into a set of bins (intervalsof values)Then the number of occurrences of values on each bin is countedThen this number is displayed as a bar

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 22 / 51

Page 12: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Distribution of the Values of a Continuous Variable

Two examples of Histograms in R

hist(algae$a1,main='Distribution of Algae a1',xlab='a1',ylab='Nr. of Occurrencies')

hist(algae$a1,main='Distribution of Algae a1',xlab='a1',ylab='Percentage',prob=TRUE)

Distribution of Algae a1

a1

Nr.

of O

ccur

renc

ies

0 20 40 60 80

020

4060

8010

0

Distribution of Algae a1

a1

Per

cent

age

0 20 40 60 80

0.00

0.01

0.02

0.03

0.04

0.05

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 23 / 51

Univariate Plots Distribution of the Values of a Continuous Variable

Two examples of Histograms in ggplot2

ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10) +ggtitle("Distribution of Algae a1") + ylab("Nr. of Occurrencies")

ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10,aes(y=..density..)) +ggtitle("Distribution of Algae a1") + ylab("Percentage")

0

30

60

90

0 25 50 75 100a1

Nr.

of O

ccur

renc

ies

Distribution of Algae a1

0.00

0.02

0.04

0 25 50 75 100a1

Per

cent

age

Distribution of Algae a1

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 24 / 51

Page 13: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Distribution of the Values of a Continuous Variable

Problems with Histograms

Histograms may be misleading in small data setsAnother key issued is how the limits of the bins are chosen

There are several algorithms for thisCheck the “Details” section of the help page of function hist() ifyou want to know more about this and to obtain references onalternativesWithin ggplot2 you may control this through the binwidthparameter

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 25 / 51

Univariate Plots Kernel Density Estimation

Showing the Distribution of ValuesKernel Density Estimates

Some of the problems of histograms can be tackled by smoothingthe estimates of the distribution of the values. That is the purposeof kernel density estimatesKernel estimates calculate the estimate of the distribution at acertain point by smoothly averaging over the neighboring pointsNamely, the density is estimated by

f̂ (x) =1n

n∑i=1

K(

x − xi

h

)where K () is a kernel function and h a bandwidth parameter

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 26 / 51

Page 14: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Kernel Density Estimation

An Example and how to obtain it in R basic graphics

hist(algae$mxPH,main='The Histogram of mxPH (maximum pH)',xlab='mxPH Values',prob=T)

lines(density(algae$mxPH,na.rm=T))rug(algae$mxPH)

The Histogram of mxPH (maximum pH)

mxPH Values

Den

sity

6 7 8 9 10

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 27 / 51

Univariate Plots Kernel Density Estimation

An Example and how to obtain it in ggplot2

ggplot(algae,aes(x=mxPH)) + geom_histogram(binwidth=.5, aes(y=..density..)) +geom_density(color="red") + geom_rug() + ggtitle("The Histogram of mxPH (maximum pH)")

0.0

0.2

0.4

0.6

5 6 7 8 9 10mxPH

dens

ity

The Histogram of mxPH (maximum pH)

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 28 / 51

Page 15: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Kernel Density Estimation

Showing the Distribution of ValuesGraphing Quantiles

The x quantile of a continuous variable is the value below whichan x% proportion of the data isExamples of this concept are the 1st (25%) and 3rd (75%)quartiles and the median (50%)We can calculate these quantiles at different values of x and thenplot them to provide an idea of how the values in a sample of dataare distributed

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 29 / 51

Univariate Plots Kernel Density Estimation

The Cumulative Distribution Function in Base Graphics

plot(ecdf(algae$NO3),main='Cumulative Distribution Function of NO3',xlab='NO3 values')

0 10 20 30 40 50

0.0

0.2

0.4

0.6

0.8

1.0

Cumulative Distribution Function of NO3

NO3 values

Fn(

x)

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●● ●

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 30 / 51

Page 16: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Kernel Density Estimation

The Cumulative Distribution Function in ggplot

ggplot(algae,aes(x=NO3)) + stat_ecdf() + xlab('NO3 values') +ggtitle('Cumulative Distribution Function of NO3')

0.00

0.25

0.50

0.75

1.00

0 10 20 30 40NO3 values

y

Cumulative Distribution Function of NO3

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 31 / 51

Univariate Plots Kernel Density Estimation

QQ Plots

Graphs that can be used to compare the observed distributionagainst the Normal distributionCan be used to visually check the hypothesis that the variableunder study follows a normal distributionObviously, more formal tests also exist

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 32 / 51

Page 17: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Kernel Density Estimation

QQ Plots from package car using base graphics

library(car)qqPlot(algae$mxPH,main='QQ Plot of Maximum PH Values')

−3 −2 −1 0 1 2 3

67

89

QQ Plot of Maximum PH Values

norm quantiles

alga

e$m

xPH

●●

●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●

●●●●

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 33 / 51

Univariate Plots Kernel Density Estimation

QQ Plots using ggplot2

ggplot(algae,aes(sample=mxPH)) + stat_qq() +ggtitle('QQ Plot of Maximum PH Values')

●●

●●

●●●●

●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●

●●●● ●

6

7

8

9

−3 −2 −1 0 1 2 3theoretical

sam

ple

QQ Plot of Maximum PH Values

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 34 / 51

Page 18: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Kernel Density Estimation

QQ Plots using ggplot2 (2)

q.x <- quantile(algae$mxPH,c(0.25,0.75),na.rm=TRUE)q.z <- qnorm(c(0.25,0.75))b <- (q.x[2] - q.x[1])/(q.z[2] - q.z[1])a <- q.x[1] - b * q.z[1]ggplot(algae,aes(sample=mxPH)) + stat_qq() +

ggtitle('QQ Plot of Maximum PH Values') +geom_abline(intercept=a,slope=b,color="red")

●●

●●

●●●●

●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●

●●●● ●

6

7

8

9

−3 −2 −1 0 1 2 3theoretical

sam

ple

QQ Plot of Maximum PH Values

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 35 / 51

Univariate Plots Kernel Density Estimation

Showing the Distribution of ValuesBox Plots

Box plots provide interesting summaries of a variable distributionFor instance, they inform us of the interquartile range and of theoutliers (if any)

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 36 / 51

Page 19: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Kernel Density Estimation

An Example with base graphics

boxplot(algae$mnO2,ylab='Minimum values of O2')

●●2

46

810

12

Min

imum

val

ues

of O

2

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 37 / 51

Univariate Plots Kernel Density Estimation

An Example with base ggplot2

ggplot(algae,aes(x=factor(0),y=mnO2)) + geom_boxplot() +ylab("Minimum values of O2") + xlab("") + scale_x_discrete(breaks=NULL)

5

10

Min

imum

val

ues

of O

2

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 38 / 51

Page 20: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Univariate Plots Kernel Density Estimation

Hands on Data Visualization - the Algae data set

Using the Algae data set from package DMwR answer to the followingquestions:

1 Create a graph that you find adequate to show the distribution ofthe values of algae a6

2 Show the distribution of the values of size3 Check visually if it is plausible to consider that oPO4 follows a

normal distribution

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 39 / 51

Conditioned Graphs

Conditioned Graphs

Data sets frequently have nominal variables that can be used tocreate sub-groups of the data according to these variables values

e.g. the sub-group of male clients of a company

Some of the visual summaries described before can be obtainedon each of these sub-groupsConditioned plots allow the simultaneous presentation of thesesub-group graphs to better allow finding eventual differencesbetween the sub-groupsBase graphics do not have conditioning but ggplot2 has itthrough the concept of facets

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 40 / 51

Page 21: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Conditioned Graphs Conditioned histograms

Conditioned Histograms

Goal: Constrast the distribution of data sub-groups

algae$speed <- factor(algae$speed,levels=c("low","medium","high"))ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) + facet_wrap(~ speed) +

ggtitle("Distribution of A3 by River Speed")

low medium high

0

20

40

60

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50a3

coun

t

Distribution of A3 by River Speed

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 41 / 51

Conditioned Graphs Conditioned histograms

Conditioned Histograms (2)

algae$season <- factor(algae$season,levels=c("spring","summer","autumn","winter"))ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) + facet_grid(speed~season) +

ggtitle('Distribution of a3 as a function of River Speed and Season')

spring summer autumn winter

0

5

10

15

0

5

10

15

0

5

10

15

lowm

ediumhigh

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50a3

coun

t

Distribution of a3 as a function of River Speed and Season

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 42 / 51

Page 22: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Conditioned Graphs Conditioned Box Plots

Conditioned Box Plots on base graphics

boxplot(a6 ~ season,algae,main='Distribution of a6 as a function of Season',xlab='Season',ylab='Distribution of a6')

●●

●●●

●●

●●

●●

spring summer autumn winter

020

4060

80

Distribution of a6 as a function of Season

Season

Dis

trib

utio

n of

a6

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 43 / 51

Conditioned Graphs Conditioned Box Plots

Conditioned Box Plots on ggplot2

ggplot(algae,aes(x=season,y=a6))+geom_boxplot() +ggtitle('Distribution of a6 as a function of Season and River Size')

●●

●●●

●●

●●

●●

●●

0

20

40

60

80

spring summer autumn winterseason

a6

Distribution of a6 as a function of Season and River Size

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 44 / 51

Page 23: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Conditioned Graphs Conditioned Box Plots

Hands on Data Visualization - Algae data set

Using the Algae data set from package DMwR answer to the followingquestions:

1 Produce a graph that allows you to understand how the values ofNO3 are distributed across the sizes of river

2 Try to understand (using a graph) if the distribution of algae a1varies with the speed of the river

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 45 / 51

Other Plots Scatterplots

Scatterplots in base graphics

The Scatterplot is the natural graph for showing the relationshipbetween two numerical variables

plot(algae$a1,algae$a6,main='Relationship between the frequency of a1 and a6',xlab='a1',ylab='a6')

●●●

●●

●●● ● ●● ● ●●●

●● ●

●●

●●

● ●

●● ● ● ●●●●

●● ●

●●

●●● ●● ●●● ●

● ●●

●● ●●● ●

●●●

●●●

●●

● ●

●● ●

●● ●

●●●

● ● ●●●

●●●

●●

●●

●●

● ●

●●

●●●●●

● ●●●●● ●●●

●●

● ●●●

●● ●●●

●●●●

●●

●●

●●●

0 20 40 60 80

020

4060

80

Relationship between the frequency of a1 and a6

a1

a6

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 46 / 51

Page 24: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Other Plots Scatterplots

Scatterplots in ggplot2

ggplot(algae,aes(x=a1,y=a6)) + geom_point() +ggtitle('Relationship between the frequency of a1 and a6')

● ●●

●●

●●● ● ●● ● ●●●

●● ●

●●

●●

● ●

●● ● ● ●●●●

●● ●

●●

●●● ●● ●●●

● ●

●● ●● ● ●

●●●

●●●

● ●

●● ●

●● ●

●●

● ● ●●●

●●●

●●

●●

● ●

●●

●●●●

● ●●●●● ●●●

●●

● ●●●

● ● ●●●

●●

●●

●● ●0

20

40

60

80

0 25 50 75a1

a6

Relationship between the frequency of a1 and a6

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 47 / 51

Other Plots Scatterplots

Conditioned Scatterplots using the ggplot2 package

ggplot(algae,aes(x=a4,y=Chla)) + geom_point() + facet_wrap(~season)

●●

●●●●

●●● ●

●●●●

●●

●●● ●

●●●●●

●●

●●●● ●●

●●●

● ●

●●

● ●●

●●

●●

●●

●●● ●●●●

●●

●●

● ●●●●

●●●

●●

● ●

●●●●

●●

●●●●●●

●●

● ●

●●

●●

●●

autumn spring

summer winter

0

30

60

90

0

30

60

90

0 10 20 30 40 0 10 20 30 40a4

Chl

a

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 48 / 51

Page 25: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Other Plots Scatterplot matrices

Scatterplot matrices through package GGally

These graphs try to present all pairwise scatterplots in a data set.They are unfeasible for data sets with many variables.

library(GGally)ggpairs(algae,columns=12:16,

diag=list(continuous="density", discrete="bar"),axisLabels="show")

a1a2

a3a4

a5

a1 a2 a3 a4 a5

0

25

50

75Corr:

−0.294

Corr:

−0.147

Corr:

−0.038

Corr:

−0.292

●●

● ●●●● ●●● ●

●●● ●●● ●● ●●

● ●●● ●

●●● ●

●●

●●●

● ●● ●● ●●

● ● ●●● ●●

●●● ●●●●● ●● ●●

● ●●●●●●

●●●●●●●●

●●●●●

●●●●●●●

●●●● ●●

●●

●●●●

●● ●

●●●

●●●

●●●●●●

● ●●

● ●●●●

●●

●●●

●●●

●●●●

●●

●●●●

●●

●●●●●

●● ●

●●●●

●●●●●●●●●●

●●

●●●●

0204060 Corr:

0.0321

Corr:

−0.172

Corr:

−0.16

●●●

●●●

●●● ●●●

●●●

●● ●● ●●● ●●

●●

● ●● ● ●●●●

● ●● ●● ●●● ● ●●● ●

●●●● ●●●

●●

●●● ●●●

●●

●●●

●●●●●●

●●●●●●●●●

● ●

●●●●

●●●● ● ●● ●

●●

●●

●●

●●

●●●●

●● ●●●

●●●

● ●●●

●●●

● ●●●●●●●

●●●●●●

●●●●●●

●●

●●●

●●●

●●●●●●

●●

●● ●●

●●●●

010203040

●●

●●●●

●●●●●●

●●●

●●●●● ●●●●

●●

●●●●●●●

●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●

●●

●●●

●●●●●●

●●●●

●●●

●●

●●

● ●●●

●●●● ●●●●

●●

●●

●●

●●

●●●●

●●●● ●

●●●

● ● ●●

●●●

● ●● ●● ●●●

●●●●● ●

●● ●●

●●

●●

● ●●

●●●

●●●●●●

●●

●●● ●

● ●●●

Corr:

0.0123

Corr:

−0.108

●●●●●●●

●●●● ●●● ● ●●

● ●

●●●

●●●

●●● ●●●● ●● ● ●●●●●

●●● ●

●●● ●

●● ● ●●● ●

● ●●●●● ●● ●

●●

●●●●● ●●

●●●●●●●●●●●●●●●●●●●

●●●● ●●

●●● ●

● ●●●● ● ●● ●●●●●●●●●●●●●●●●

●●●●●

●●●

●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●● ●●●●●

010203040

●● ●●● ●●●●●●●●●●●●●●

●●●● ●●●●●● ●●●● ●●●● ●●●●

●●●●●●●●●●●●●●

●●●●●●●●●

●●

●●●●●●●

●●●●●●●●●● ●●●●●●●

●●

●●●

●●●●● ●●

●●●●● ●●●● ●● ●●●●●● ●●●●●●●

● ● ●●●● ●

●●●

● ●● ●●●● ●●●●

●● ●●●●●●●● ●●● ●●●●● ●● ●●●●●● ●●●●●●●●● ● ●● ●● ●●

● ●● ●●●●

●●●●●●●●●●

● ●

●●●●●●●● ●●● ●●●●●●●●● ●●

●●●●

●●●●●●●●●●

●●●● ●●● ●●

●●

●●●●

●● ●

●●● ●●

●●●●● ●●● ●●●●●●

●● ●●●●●●●●

● ●●●●●●●● ●●● ●●● ●●● ●●● ●●●

●●●●●●●

●●●

●●● ●●●● ●●●●●●●●●●●●●●●●●●

● ● ●●● ●●●●●● ●●●●●●● ●●● ●●●●●●●

Corr:

−0.109

●●●●

● ●●●● ●●● ●

●●

●●

●●

●●

●●● ●

●● ● ●●●●●

●●

● ●

●●

●●● ●● ●●● ●●●●● ●● ●●● ●

●●●●●●●●●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●● ●●●●

●● ●●

●●

●●●

●●● ●

● ● ●●●

●●●●

●●●●

●●●●●●●●●●●●●

●●●

●●●

● ●● ●●●●●●●●●●●●

●●●

●●●

●●●010203040

0 255075

●●●

●●●●●●●●●

●●

●●●●●

● ●

●●●●

● ●●●● ●●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●

●●

●●

●●

●●

● ●

● ●●

● ●●●●●

●●● ●

●●

●● ●

● ●●●

●●●●

● ●● ●

●●● ●

●● ●● ●●●●●●●

● ●●● ●

●●●

●● ●●●●●● ●●●●●●●

●●

●●

●●

●●●

0 20 40 60

●● ●

●●●●●●●●●●●

●●

●●●

●●

●● ●●

●●●●●●● ●

●●

●●

●●●●●●●●●●●●● ●●● ●●●●●

●● ●●●

● ●●●●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●● ●●●●

●●● ●

●●

● ●●

●●●●

● ●●●

●●● ●

●●● ●

●●●●●●●●●●●●●

●●●

●●

● ●●●●●● ●●●●●●● ●

●●

●●●

●●● ●

0 10203040

●●●●

●●●●●●●●●●●

●●

●●

●●

●●●●

●●●●●●●●

●●

●●

●●

●●●● ● ●●●●●●●●●●● ●● ●

●●●●●●● ●●●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●●●●

●●●●

●●

●●●

●●●●

● ●●●

●●●●

●●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●

●●●●●●●● ●

0 10203040 0 10203040

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 49 / 51

Other Plots Scatterplot matrices

Scatterplot matrices involving nominal variables

ggpairs(algae,columns=c("season","speed","a1","a2"),axisLabels="show")

seas

onsp

eed

a1a2

season speed a1 a2

0

20

40

60

autumn

springsum

mer

winter

●● ●

● ●● ●●●

●● ●●

● ●

●●●●

●●

●●●●●

autumn

springsum

mer

winter

01020

01020

01020

● ●●

●●● ●● ●●●●

●●● ●● ●● ●● ●●

●●●●●● ●

highlow

medium

0

25

50

75Corr:

−0.294

0

20

40

60

0102030010203001020300102030020400204002040

●●

● ●●

●● ● ●●●

●●● ●●● ● ● ●●

● ●●● ●

●●

● ●

● ●

●●●

● ●● ●● ●●

● ● ●●● ●●

●●● ●●●●● ●● ●●

● ●●●●●●

●●●●●●●●

●●●●●

●●●●●●

●● ●● ●●

●●

●●●●

●● ●

●●

●●●

●●●●●●

● ●●

● ●●

●●

●●●

●● ●

●● ●●

●●●●

●●

●●●●●

● ●●●●●

●●●●●● ●●●

●●

●●●

0 25 50 75 0 20 40 60

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 50 / 51

Page 26: Introduction Motivation for Data Visualization · 2014-10-24 · Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

Other Plots Parallel plots

Parallel Plots

Parallel plots are also interesting for visualizing a data set

ggparcoord(algae,columns=12:16,groupColumn="speed")

0.0

2.5

5.0

7.5

10.0

a1 a2 a3 a4 a5variable

valu

e

speed

high

low

medium

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 51 / 51