How to make a picture worth a thousand words: Effectively communicating your research results using statistical graphics. Yates Coley, PhD, Kaiser Permanente Washington Health Research Institute, Seattle, WA. Joint work with Mike Jackson, PhD, KPWHRI. April 4, 2018. This talk is based on materials I developed with Mike Jackson for a conference workshop.


Page 1: How to make a picture worth a thousand words: Effectively ... · Data Visualization is… • a scientific discipline. • both a principled and subjective art. • work! • important!

How to make a picture worth a thousand words:

Effectively communicating your research results using statistical graphics

Yates Coley, PhD Kaiser Permanente Washington Health Research Institute

Seattle, WA

Joint work with Mike Jackson, PhD, KPWHRI

April 4, 2018

This talk is based on materials I developed with Mike Jackson for a conference workshop

Page 2

Seminar Outline

• Introduction

• Fundamentals of Statistical Graphics

• Data Visualization Best Practices

• Resources

Page 3

Data Visualization is…

• a scientific discipline.

• both a principled and subjective art.

• work!

• important!

• an organizing framework.

DV is an active area of research, a discipline with technical vocabulary and evidence-based practices.

While tenets of design do inform data viz best practices, there is rarely a single correct way to present information and results. The best graphic for a situation will depend on its purpose and audience, as well as personal preferences.

With this in mind, constructing effective data visualizations takes time, patience, and effort. This may be the most important lesson to take from this seminar! Good graphs don’t just happen in a matter of minutes but require a commitment to communicating your data and results as well as possible.

DV is time-consuming, but it is also important because it is one of the most effective tools you have to communicate your research findings in presentations and publications. Your research has a greater potential for impact if the study and results are easy to understand.

Finally, the discipline of DV can provide you with an organizing framework for thinking about what works (and what doesn’t) when you are creating a graphic and planning how to approach DV for your publications and presentations. Many of us have an inherent understanding of many of the components of this framework through years of experience making graphs to present our research, but we can all gain insight and refine our skills by considering the underlying theory.

Page 4

Objectives

• Present organizing framework for data visualization

• Describe conceptual best practices for creating statistical graphics and give concrete examples

• Provide sources and references for future consultation


Page 5

Seminar Outline

• Introduction

• Fundamentals of Statistical Graphics

• Data Visualization Best Practices

• Resources

Page 6

Components of a data visualization

• Visual Cues

• Coordinate System

• Scale

• Context

Page 7

Yau (2013) Data Points. FIGURE 33 Visual cues

Yau, Nathan. Data Points, edited by Nathan Yau, Wiley, 2013. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/jhu/detail.action?docID=1158630. Created from jhu on 2017-05-28 16:03:12.

Copyright © 2013 Wiley. All rights reserved.

Visual Cues

Visual cues use “visual encoding” to communicate information by mapping data items to plotting symbols and other graphical elements. Appropriate visual cues for data attributes depend on whether the attributes are identity (categorical) or magnitude (quantitative) variables.

Page 8

Yau (2013) Data Points. FIGURE 33 Visual cues

Visual Cues

Quantitative Variables

Visual cues for quantitative variables are able to indicate the relative differences or distance between two values.

Page 9

Yau (2013) Data Points. FIGURE 33 Visual cues

Visual Cues

Categorical Variables

Shapes and color hue are used to encode categorical variables because there is no inherent distance between, for example, a square and a circle.

Page 10

[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot of PSA (ng/mL) against age (years); color indicates low or high volume on diagnostic biopsy.]

So that is all a bit abstract. To give a concrete example, this plot shows the diagnostic characteristics for about 300 active surveillance prostate cancer patients. Each circle represents one patient. The position of the circle along the x-axis is a visual cue indicating the patient’s age at prostate cancer diagnosis. The vertical position indicates the patient’s PSA at diagnosis. (PSA is a biomarker of inflammation in the prostate.) Age and PSA are both continuous variables.

This circle represents a patient who is X years old and had a PSA of Y at diagnosis. This patient is older than those to the left and younger than those to the right. This patient had a higher PSA than those with circles below him and a lower PSA than those above.

We use color as a visual cue to indicate whether a patient’s diagnostic biopsy showed a low or high volume of cancer (categorical variable).
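The encoding just described can be written out as a small mapping from data to visual cues. This is only a sketch and not the code behind the slide: the `Patient` record, the sample values, and the color names are all hypothetical, chosen to show two quantitative variables mapped to position and one categorical variable mapped to hue.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: float    # years at diagnosis (quantitative)
    psa: float    # ng/mL at diagnosis (quantitative)
    volume: str   # "low" or "high" cancer volume on biopsy (categorical)


# Hypothetical color choices; any two clearly distinct hues would do.
VOLUME_COLORS = {"low": "blue", "high": "orange"}


def encode(patient):
    """Map one patient to the visual cues of one plotted circle."""
    return {
        "x": patient.age,                        # horizontal position encodes age
        "y": patient.psa,                        # vertical position encodes PSA
        "color": VOLUME_COLORS[patient.volume],  # hue encodes biopsy volume
    }


# Made-up example records, not data from the study.
patients = [Patient(54, 11.5, "high"), Patient(62, 4.2, "low")]
marks = [encode(p) for p in patients]
```

Position carries the two quantitative variables because distances along an axis can be compared; hue carries the categorical variable because two hues have no implied order.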

Page 11

Components of a data visualization

• Visual Cues

• Coordinate System

• Scale

• Context

The other components of a DV are coordinate system, scale, and context.

Coordinate system is the space you are using to map data onto, like a Cartesian plane for a scatterplot, polar coordinates for a pie chart, and geographic systems for maps. In this talk, most of my examples will use a Cartesian plane.

Scale: using intervals that make sense can increase the readability of a graph and shift its focus.

Context: we also must explain to the audience what the data mean and why they matter. Frequently, we use the figure title or caption to directly provide context. More generally, the presentation or paper provides greater context.

Page 12

Data Visualization Process

• What data do you have?

• What do you want to know about your data?

• What visualization method should you use?

• What do you see and does it make sense?

Yau (2013) Data Points

One helpful way to approach the data visualization process is to work through these four questions.

Page 13

Data Visualization Process

• What data do you have?

• Continuous, ordinal, or categorical?

• Time series?

• What do you want to know about your data?

• What visualization method should you use?

• What do you see and does it make sense?

Yau (2013) Data Points

What are the variable types—continuous, ordinal or categorical? Does it reflect values at a single moment in time, or the change in values over time?

Page 14

Data Visualization Process

• What data do you have?

• What do you want to know about your data?

• Distributions of single variables?

• Relationships between variables?

• Summaries or unit-level detail?

• What visualization method should you use?

• What do you see and does it make sense?

Yau (2013) Data Points

Do you want to describe the distribution of each variable separately, or are you looking at how two or more variables are associated? Do you want to illustrate some summary of the data—such as mean and confidence interval—or do you want individual data points to be represented in your graph?

Page 15

Data Visualization Process

• What data do you have?

• What do you want to know about your data?

• What visualization method should you use?

• What do you see and does it make sense?

Yau (2013) Data Points

How you answered the first two questions will determine the best options for visualizing your data. There are many different types of graphs for each problem, so I will just go through some of the most common.
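As a rough summary of the chart types covered on the next several slides, the answers to the first two questions can be treated as a lookup. The sketch below is only an illustration: the function name `suggest_plots` and the type labels are hypothetical, and the table lists the common options mentioned in this talk rather than a complete or definitive rule.

```python
# Common options by variable type, as discussed in this talk.
# Keys are alphabetically sorted tuples of variable types (labels are hypothetical).
COMMON_PLOTS = {
    ("continuous",): ["histogram", "density plot", "box plot"],
    ("categorical",): ["bar plot", "pie chart"],
    ("continuous", "continuous"): ["scatterplot", "heat map"],
    ("categorical", "continuous"): ["aligned histograms", "side-by-side box plots"],
    ("categorical", "categorical"): ["mosaic plot"],
}


def suggest_plots(*variable_types):
    """Return common plot types for the given variable types, in any order."""
    key = tuple(sorted(variable_types))
    # An empty list means this sketch has no suggestion for that combination.
    return COMMON_PLOTS.get(key, [])


options = suggest_plots("continuous", "categorical")
```

The lookup only narrows the choice; purpose, audience, and preference still decide among the options it returns.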

Page 16

[Figure: Two views of the distribution of PSA (ng/mL). Left panel, “Histogram: Unit-level”: number of patients by PSA (ng/mL). Right panel, “Boxplot: Summary”: PSA (ng/mL).]

For a single, continuous variable, there are many options for visualizing the variable’s distribution. A histogram shows the distribution on the level of the unit of observation. Here, the visual cue is the height of the bar (so length and area) at each PSA grouping along the x-axis. Density plots are another way to show the distribution of a single variable.

Box plots give a summary of the variable (range, median, quartiles). Box plots use position as a visual cue, and they rely on the audience’s familiarity with the construct to decode median, IQR, etc.
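To make the contrast concrete, the numbers behind each plot can be computed directly. This is a minimal sketch using only Python's standard library; the PSA values are made up, the helper names are hypothetical, and the quartile method (`method="inclusive"`) is one of several conventions, so plotting software may draw slightly different boxes.

```python
import statistics


def histogram_counts(values, bin_width, start=0.0):
    """Count observations in bins [left, left + bin_width), keyed by left edge."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)   # index of the bin containing v
        left = start + k * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot summarizes."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up PSA values (ng/mL), not data from the study.
psa = [1.0, 2.5, 3.0, 4.0, 4.5, 5.0, 6.5, 8.0, 12.0]

bar_heights = histogram_counts(psa, bin_width=5)   # one bar height per PSA grouping
box = five_number_summary(psa)                     # positions drawn by the box plot
```

The histogram keeps a count for every bin, while the box plot reduces the same data to five positions, which is why it relies on the audience knowing how to decode them.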

Page 17

[Figure: Two bar plots of number of patients by category (Low, High): PSA density at diagnosis and cancer volume at diagnosis.]

For a single categorical variable, bar plots can be used to show the number (or proportion) of observations in each category. Pie charts are an option for showing proportions, but they are less common in academic work (and typically ridiculed for reasons good and bad).
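The bar heights for such a plot are just category counts, or counts divided by the total when proportions are shown. A minimal sketch, with made-up labels and a hypothetical helper name:

```python
from collections import Counter


def category_summary(labels):
    """Return {category: (count, proportion)} for a categorical variable."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


# Made-up cancer volume labels, not data from the study.
volume = ["low"] * 6 + ["high"] * 2
summary = category_summary(volume)
```

Either element of each pair can be used as the bar height; the shape of the plot is the same and only the axis label changes.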

Page 18

[Figure: Diagnostic PSA of patients in active surveillance. Scatterplot of PSA (ng/mL) against age (years).]

Scatterplots are commonly used to show the joint distribution of two continuous variables. Scatterplots use the visual cue of position or distance along the x- and y-axes for each variable. Scatterplots also use the visual cue of direction or slope to suggest an association between variables.
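The direction a reader perceives in a scatterplot has numeric counterparts in the least-squares slope and the correlation coefficient. The sketch below computes both with the standard library; the data are made up and deliberately lie on a straight line, and the function names are hypothetical.

```python
import math


def slope_and_correlation(xs, ys):
    """Return (least-squares slope of y on x, Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Made-up values: PSA rises by 1 ng/mL for every 5 years of age.
ages = [50, 55, 60, 65, 70]
psa = [3.0, 4.0, 5.0, 6.0, 7.0]
slope, r = slope_and_correlation(ages, psa)
```

A positive slope corresponds to a point cloud that rises from left to right; the closer the correlation is to 1 or -1, the tighter the cloud sits around that line.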

Page 19

[Figure: PSA observations for patients throughout active surveillance. Scatterplot of PSA (ng/mL) against age (years).]

There are other options for visualizing the relationship between two continuous variables. For example, if we wanted to show ALL the PSA observations while these 300 patients were under surveillance (rather than just those observed at diagnosis), the scatterplot is way too crowded, and it is hard to tell how many points are overlapping and what the distribution actually is in this cloud.

Page 20

[Figure: PSA observations for patients throughout active surveillance. Heat map of PSA (ng/mL) against age (years), with a color legend.]

In this case, we can use a heat map to see the joint distribution of PSA and age. Here, color saturation across three different hues is the visual cue, and darker color indicates a larger number of observations in that region. Heat maps can also be particularly helpful for scatterplots of discrete data, where many points may overlap.
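Behind a heat map like this is a two-dimensional count: the plane is divided into cells and each cell is colored by how many observations fall inside it. A minimal sketch of that binning step, with made-up observations and hypothetical names and cell sizes:

```python
from collections import Counter


def bin_2d(xs, ys, x_width, y_width):
    """Count points per rectangular cell, keyed by (left edge, bottom edge)."""
    counts = Counter()
    for x, y in zip(xs, ys):
        cell = (x // x_width * x_width, y // y_width * y_width)
        counts[cell] += 1
    return counts


# Made-up (age, PSA) observations, not data from the study.
ages = [51, 52, 53, 58, 61, 62]
psas = [4.1, 4.9, 3.0, 7.5, 4.4, 4.0]
cells = bin_2d(ages, psas, x_width=5, y_width=5)
```

Each count is then mapped to a color, so overlapping points add to a cell's darkness instead of hiding one another.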

Page 21

[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot of PSA (ng/mL) against age (years); color indicates low or high volume on diagnostic biopsy.]

Scatterplots also give us an option to examine 2 continuous and 1 or more categorical variables simultaneously. Here, we’ve added color to indicate high or low volume on diagnostic biopsy.

Page 22

[Figure: Age at prostate cancer diagnosis. Vertically aligned histograms of number of patients by age (years), one panel for low volume and one for high volume on diagnostic biopsy.]

If we wanted to compare the distribution of a continuous variable across levels of a categorical variable, we could use vertically aligned histograms. Here, we are comparing the age at diagnosis for patients with low and high cancer volume on their diagnostic biopsy. We see a wider range for patients with low volume, but this particular graphic doesn’t help us make any determinations about how the middle of the distributions compare.

Page 23

[Figure: Age at prostate cancer diagnosis. Side-by-side box plots of age (years) for low volume and high volume on diagnostic biopsy.]

Another option is using side-by-side box plots to compare summaries of the distribution of a continuous variable across levels of a categorical variable. Here, we see that patients with high volume at diagnosis are, on average, older. Of course, we can’t tell from the boxplot whether this difference is statistically significant (we’d have to perform some statistical test to determine that) or whether this difference is clinically meaningful.
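The comparison that side-by-side box plots make visible can also be tabulated by group. This sketch uses made-up ages and a hypothetical helper name; it reports only the range and the median, which are the features discussed for the aligned histograms and the box plots.

```python
import statistics
from collections import defaultdict


def summarize_by_group(values, groups):
    """Return min, median, and max of `values` within each group label."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    return {
        group: {
            "min": min(vals),
            "median": statistics.median(vals),
            "max": max(vals),
        }
        for group, vals in by_group.items()
    }


# Made-up ages at diagnosis, not data from the study.
ages = [48, 55, 60, 66, 72, 58, 64, 68, 70, 74]
volume = ["low"] * 5 + ["high"] * 5
summary = summarize_by_group(ages, volume)
```

In these made-up numbers the low-volume group has the wider range and the high-volume group the higher median. As with the plot, the table alone says nothing about statistical significance.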

Page 24

[Figure: Mosaic plot of PSA density (Low, High) on the x-axis against volume on diagnostic biopsy (Low, High) on the y-axis.]

Mosaic plots are an option for examining the relationship between two categorical variables. Mosaic plots use both distance and area to help us understand univariate and bivariate trends. The widths of the two boxes here on the x-axis tell us that there are considerably more patients with low PSAD than high, because this line is much longer. Next, we can compare the heights of these boxes. We see that patients with low PSAD are more likely to also have a low volume of cancer at diagnosis because the vertical distance here, the height of the box, is greater than that of the box for high PSAD. Patients with high PSAD are more likely to have high cancer volume at diagnosis as well.
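The widths and heights being read off a mosaic plot are proportions: each column's width is the share of patients in that x-axis category, and each tile's height is the share of that column falling in a y-axis category. A minimal sketch with made-up labels and a hypothetical function name:

```python
from collections import Counter


def mosaic_geometry(x_labels, y_labels):
    """Return (column widths, tile heights) as proportions.

    widths[x]       = share of all observations with x-category x
    heights[(x, y)] = share of column x that falls in y-category y
    """
    n = len(x_labels)
    column_counts = Counter(x_labels)
    tile_counts = Counter(zip(x_labels, y_labels))
    widths = {x: c / n for x, c in column_counts.items()}
    heights = {(x, y): c / column_counts[x] for (x, y), c in tile_counts.items()}
    return widths, heights


# Made-up labels: 16 patients with low PSAD and 4 with high PSAD.
psad = ["low"] * 16 + ["high"] * 4
volume = ["low"] * 12 + ["high"] * 4 + ["low"] * 1 + ["high"] * 3
widths, heights = mosaic_geometry(psad, volume)
```

Tile area is width times height, so it is proportional to the number of patients in each combination; this is how the plot conveys the univariate and bivariate information at once.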

Page 25

Data Visualization Process

• What data do you have?

• What do you want to know about your data?

• What visualization method should you use?

• What do you see and does it make sense?

Yau (2013) Data Points

To prepare you to evaluate whether your DV makes sense, let’s move on to consider best practices for making effective statistical graphics.

Page 26

Seminar Outline

• Introduction

• Fundamentals of Statistical Graphics

• Data Visualization Best Practices

• Resources

Page 27

How do we define an “effective” statistical graphic?

• An effective statistical graphic enables the reader to

• extract information accurately

• with reasonable effort and

• high confidence.

Enrico Bertini Lecture #3

It is helpful to first consider how we might define an “effective” statistical graphic. I like this definition from Enrico Bertini, who says that… Note “reasonable effort” (not “easy”): a graphic doesn’t have to be so simple that it is understood at a glance. For some of the plots I just showed, it took me several sentences to explain what we were looking at, and that is ok, as long as you don’t frustrate or discourage your audience.

Page 28

Expressiveness Principle

Statistical graphic “should express all and only the information in the data” (and statistical results).

Enrico Bertini Lecture #4

Bertini outlines two principles that I think are helpful in guiding data visualization. First, your graphic should express all the information in your data but only the information in your data. And we can expand that to include the information in your analysis results.

Page 29

Enrico Bertini Lecture #3

[Figure: Bar chart of number of observations (0 to 1200) for categories A through Q, in alphabetical order.]

Take, for example, a dataset with a number of observations in each category A-Q. This barplot shows the number of observations in each category, but it is difficult to distinguish the ordering. For example, I can see that K has the most observations, but it is not clear at a glance how D compares to P.

Page 30

Enrico Bertini Lecture #3

[Figure: Sorted bar chart of number of observations (0 to 1200) by category, ordered K, B, Q, I, E, L, O, A, D, P, J, H, G, C, M, N, F.]

If we instead order the categories with respect to the number of records, the graphic provides *more* information than the alphabetical ordering. In this way, the revised graph is better representing *all* the information in the data.
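Reordering the bars is a single sort on the counts. In this sketch the category counts are made up (only a few of the letters are included) and the function name is hypothetical:

```python
def sort_for_bar_chart(counts):
    """Return (category, count) pairs ordered from most to fewest observations."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Made-up counts for a handful of categories, not the values on the slide.
counts = {"A": 310, "B": 900, "C": 150, "D": 300, "K": 1100, "P": 290}
ordered = sort_for_bar_chart(counts)
bar_order = [category for category, _ in ordered]
```

After sorting, close comparisons such as D against P sit next to each other, so the ranking can be read directly from the bar positions.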

Page 31

Enrico Bertini Lecture #3

[Figure: Line chart with categorical data (wrong!): number of observations (0 to 1200) for categories A through Q.]

On the other hand, if the original alphabetized bar plot was transformed into a line plot, then we are displaying information that is not in the data; there is no relevant information in the change between alphabetized categories. This plot emphasizes the difference between categories A and B and between B and C but not A and C. Line plots like this should be reserved for time series data or other quantitative variables on the x-axis where the change across x is meaningful.

This construction implies some distance between categories that is meaningful, when the ordering of categories is arbitrary.

Page 32

Effectiveness Principle

• “The importance of the information should match the salience of the mode of visual encoding”.

• “Salience” is characterized by:

• Accuracy

• Discriminability

• Separability

• “Pop-out”

• Grouping

Enrico Bertini Lecture #4

In addition to the expressiveness principle (your graphic should show all and only the information in your data), the other principle that Bertini outlines is the Effectiveness Principle…

accuracy: how accurately values can be estimated

discriminability: how many different values can be perceived

separability: degree of interaction between multiple encodings

pop-out: how easy it is to spot some values from the rest

grouping: how good a mode is at conveying groups, that is, how easy it is to perceive natural groupings of observations with similar attributes


Page 33

Enrico Bertini Lecture #8

Quantitative Variables Categorical Variables

Here, the effectiveness or salience of visual cues for quantitative and categorical variables is ranked from most to least effective.

Page 34

Enrico Bertini Lecture #8

Quantitative Variables Categorical Variables

Position on a common scale is the most salient visual cue. This corresponds to the Cartesian plane being the standard/default method for most statistical graphics.

Page 35

[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot of PSA (ng/mL) against age (years); color indicates low or high volume on diagnostic biopsy. Checklist shown alongside: Accuracy, Discriminability, Separability, Pop-out, Grouping.]

If we evaluate the Cartesian plane against our checklist of characteristics, we see that we can accurately identify the age and PSA value associated with each point… This patient is about 54 years old and has a diagnostic PSA of about 11 or 12. We could be even more accurate here if I added grid lines at each year. We are able to discriminate well: it’s clear that this patient is younger than this patient and has a higher diagnostic PSA. We are able to separate the visual cues well, that is, to distinguish between horizontal and vertical distance, and the distance and color cues do not interfere with one another. Outliers naturally pop out because they sit a long distance from the cloud of points. And we are able to identify any natural groupings that occur. These patients here are pretty similar, as are these…


Enrico Bertini Lecture #8

Quantitative Variables Categorical Variables

And we can go down the list of effectiveness. Next we have… Tilt, or angle, is what we use for pie charts…


Source: New York Times

Area is commonly used in infographics in journalism. Large values pop out really well when using area, but we see here that it can be difficult to discriminate between circles of similar sizes. The goal here, though, is to highlight the US’s largest trading partners, emphasizing that most of the country’s trade is with Canada and Mexico. This graphic actually adds the ranks to the graph so you don’t have to struggle to compare circles of similar sizes. Overlap is another problem with using area in many contexts.


Source: The Economist

Here, the goal is to show that there is overwhelming opposition, both in the number of organizations and in their size.


Enrico Bertini Lecture #8

Quantitative Variables Categorical Variables

Moving to categorical variables, spatial placement is the most salient cue, like we use for maps. Next, we have color hue.


Enrico Bertini Lecture #8

Quantitative Variables Categorical Variables

Note shapes: it is usually easy to accurately differentiate between two shapes, so shape scores well on accuracy and discriminability, but the eye cannot pick out patterns or group symbols as easily as it can with color.

[Figure: scatterplot, “Diagnostic characteristics of patients in active surveillance” (panel: Diagnostic biopsy). x-axis: Age (years), 45 to 75; y-axis: PSA (ng/mL), 0 to 15; point color: Low volume vs. High volume.]

Two continuous variables and one categorical variable.

[Figure: scatterplot, “Diagnostic characteristics of patients in active surveillance” (panel: Diagnostic biopsy). x-axis: Age (years), 45 to 75; y-axis: PSA (ng/mL), 0 to 15; legend: Low volume vs. High volume.]

Add PSA plot with shapes for dx vol group instead of color


Enrico Bertini Lecture #8

Quantitative Variables Categorical Variables

Perception vs. Cognition

This brings us to an important idea in data visualization: you want readers to readily perceive the visual encoding of as many important attributes of the data as possible, rather than cognitively and explicitly decode values by, for example, reading a legend or key.

In this talk, I’ve been explicitly interpreting all the visual cues, but this is a task our minds do automatically with the most effective data visualizations. To return to an earlier example:

Enrico Bertini Lecture #3

[Figure: bar chart of Number of observations (axis from 0 to 1200) by Category, with categories A through Q in alphabetical order.]

It is possible for a reader to painstakingly read through this list and order the categories based on number of observations. We are able to discriminate these values, but

Enrico Bertini Lecture #3

Sorted Bar Chart

[Figure: bar chart of Number of observations (axis from 0 to 1200) by Category, with categories in sorted order: K, B, Q, I, E, L, O, A, D, P, J, H, G, C, M, N, F.]

The ordering is naturally perceived and the audience can move on to drawing conclusions about where categories rank.
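The sorting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Bertini's figure: the category counts are hypothetical (the slide's actual values are not recoverable here), and `sorted_bars` is a name invented for this sketch. It prints a text bar chart with the largest category first, so the ranking can be read off without comparing numbers.

```python
# Hypothetical counts, for illustration only; not the values from the slide.
counts = {"A": 310, "B": 980, "C": 120, "D": 260, "E": 640}


def sorted_bars(counts, width=40):
    """Return one text bar per category, largest count first."""
    top = max(counts.values())
    rows = []
    # Sort by count (descending) so position in the list encodes rank.
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * n / top)  # bar length is proportional to count
        rows.append(f"{name} {bar} {n}")
    return rows


for row in sorted_bars(counts):
    print(row)
```

The same idea carries over to any plotting tool: order the categories by value before plotting, rather than leaving them in alphabetical order.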

[Figure: “PSA observations for patients throughout active surveillance”. x-axis: Age (years); y-axis: PSA (ng/mL).]

For another example of where we rely on perception

Guiding Principles

• Make the data stand out. Maximize the data-to-ink ratio.

• Avoid superfluity. Remove “chartjunk”. Reduce non-data ink and redundant data-ink.

• Strive for clarity.

• Clear vision.

• Clear understanding.

Cleveland (1985) The Elements of Graphing Data
Tufte (1983) The Visual Display of Quantitative Information

These are the guiding principles that you will see in any DV book or presentation. They are taken from seminal DV texts and are helpful to keep in mind when constructing a graphic. But they are a little abstract, and they tend to overlap. What do they actually mean?

Visual Cues

“Make graphical elements encoding data visually prominent.”

[Figure: line plot of PERCENT CHANGE IN DEATH RATE FROM 1950 (y-axis, 0 to −40) by YEAR (x-axis, 1950 to 1980), with series CARDIOVASCULAR and OTHER and a marker for the FIRST CARDIOVASCULAR CARE UNIT. Source: Cleveland (1985) The Elements of Graphing Data, Ch. 2]

One obvious thing we can do to make the data stand out (that we don’t always do!) is make the graphical elements encoding the data visually prominent. Plotting symbols should be large and dark enough to be easily seen; they shouldn’t be “obscured by line connecting” them; they should not sit on the scale line; “overlapping symbols should be distinguishable”; “superposed data must be readily visually discriminated”; and “don’t allow other graphical elements to interfere” (Cleveland, Ch. 2).

Visual Prominence

• Plotting symbols are large and dark enough to be easily seen.

• Plotting symbols aren’t obscured by connecting lines.

• Overlapping plotting symbols are easily distinguishable.

• Superposed data are readily visually discriminated.

• Graphical elements do not interfere with the data.

[Figure, repeated on each slide: percent change in death rate from 1950, by year (1950 to 1980), for cardiovascular and other deaths. Source: Cleveland (1985) The Elements of Graphing Data, Ch. 2]


Visual hierarchy

Yau (2013) Data Points

Visual Hierarchy | 203

point of interest. This creates a visual hierarchy that helps readers immediately focus on the vital parts of a data graphic and use the surroundings as context, as opposed to a flat graphic that a reader must visually rummage through.

For example, Figure 5-1 is the scatterplot from the previous chapter that shows NBA players’ usage percentage versus points per game. The dots, fitted line, grid, border, and labels are of the same color and thickness, so there is no clear visual focus. It’s a flat image, where all the elements are on the same level.

FIGURE 5-1 All visual elements on the same level

This is easily remedied with a few small changes. In Figure 5-2, the line width of the grid lines is reduced so that they are no longer as thick as the fitted line. In this example, you want the data to stand out. The grid lines also alternate in width so that it is easier to see where each data point lies in the coordinate system, and there’s no imaginary blur that you get in the original chart.

FIGURE 5-2 Width of grid lines reduced to fit in background

Yau, Nathan. Data Points, edited by Nathan Yau, Wiley, 2013. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/jhu/detail.action?docID=1158630.

Copyright © 2013. Wiley. All rights reserved.

Place visual elements on different “levels” to shift focus, draw attention to most important aspect of data or results.

Another consideration for making the data stand out is to construct a visual hierarchy that draws attention to the data.

Here, the image is “flat”: all graphical elements are on the same level. It is hard to distinguish the data and trend line from the grid, and the image “vibrates”.

“Highlight data with bolder colors than the other visual elements, and lighten or soften other elements so that they sit in the background. Use arrows and lines to direct eyes to the point of interest. This creates a visual hierarchy that helps readers immediately focus on the vital parts of a data graphic and use the surroundings as context, as opposed to a flat graphic that a reader must visually rummage through.” (p. 203)

Visual hierarchy

Place visual elements on different “levels” to shift focus, draw attention to most important aspect of data or results.

Yau (2013) Data Points

204 | CHAPTER 5: Visualizing with Clarity

Still though, the fitted line is obscured by all the dots, because (1) it’s thin compared to the radius of each dot and (2) it still blends in with the grid behind it. Figure 5-3 changes the color to blue to make the data stand out more, and the width of the fitted line is increased so that it clearly rests on top of the dots.

FIGURE 5-3 Focus of chart shifted to fitted line with color and width

The chart is a lot more readable now, but if you imagine people viewing the graphic like they would a body of text—from top to bottom and left to right—more descriptive axis labels and less prominent value labels can help, as shown in Figure 5-4. The text within the chart works similar to how it does in an essay or a book. Headers are often printed bigger and in a bold font to provide both structure and a sense of flow. In this case, the bolder labels provide immediate context for what the chart is about. Also, notice fewer and less prominent gridlines, which directs focus further to the upward trend.

FIGURE 5-4 Grid and value labels adjusted and fewer, less prominent gridlines

Yau, Nathan. Data Points, edited by Nathan Yau, Wiley, 2013. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/jhu/detail.action?docID=1158630.

Copyright © 2013. Wiley. All rights reserved.

As a solution, edit the image to place visual elements on different levels. In particular, edit the width, color, and frequency of the grid lines; make the trend line wider and color the data points lighter to emphasize the trend (you could alternatively emphasize the data points and mute the trend line). Remove the top and bottom frames of the image.
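To make the idea of levels concrete, here is a minimal sketch that draws a small scatterplot as an SVG string using only the Python standard library. It is not how Yau's figures were produced. The `LEVELS` table, the `scatter_svg` function, the colors, and the data points are all assumptions made for illustration. The point is simply that the grid gets thin, light strokes, the data points a lighter fill, and the trend line a wide, dark stroke drawn last so it sits on top.

```python
# Three visual "levels": background (grid), middle (points), foreground (trend).
# All style values are illustrative assumptions, not taken from the source.
LEVELS = {
    "grid": {"stroke": "#dddddd", "stroke-width": 0.5},
    "point": {"fill": "#9ecae1", "r": 3},
    "trend": {"stroke": "#08519c", "stroke-width": 3},
}


def attrs(style):
    """Format a style dict as SVG attributes."""
    return " ".join(f'{key}="{value}"' for key, value in style.items())


def scatter_svg(points, trend, size=200, n_grid=4):
    """Draw points (coordinates in [0, 1]) and a trend line as an SVG string.

    Elements are emitted back to front, so later elements are drawn on top.
    """
    def px(x, y):
        # Flip y so that larger values are drawn higher on the page.
        return round(x * size, 1), round((1 - y) * size, 1)

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    # Level 1: thin, light grid lines stay in the background.
    for i in range(1, n_grid):
        pos = round(i * size / n_grid, 1)
        parts.append(f'<line x1="{pos}" y1="0" x2="{pos}" y2="{size}" {attrs(LEVELS["grid"])}/>')
        parts.append(f'<line x1="0" y1="{pos}" x2="{size}" y2="{pos}" {attrs(LEVELS["grid"])}/>')
    # Level 2: data points in a lighter color.
    for x, y in points:
        cx, cy = px(x, y)
        parts.append(f'<circle cx="{cx}" cy="{cy}" {attrs(LEVELS["point"])}/>')
    # Level 3: the wide, dark trend line is drawn last, on top of everything.
    (x1, y1), (x2, y2) = px(*trend[0]), px(*trend[1])
    parts.append(f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" {attrs(LEVELS["trend"])}/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical data, for illustration only.
points = [(0.1, 0.2), (0.3, 0.25), (0.5, 0.55), (0.7, 0.6), (0.9, 0.85)]
svg = scatter_svg(points, trend=((0.0, 0.1), (1.0, 0.9)))
```

Writing `svg` to a file with an `.svg` extension and opening it in a browser shows the result. To emphasize the data points instead, swap the point and trend styles so the points are dark and the line is muted.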

Guiding Principles

• Make the data stand out. Maximize the data-to-ink ratio.

• Avoid superfluity. Remove “chartjunk”. Reduce non-data ink and redundant data-ink.

• Strive for clarity.

• Clear vision.

• Clear understanding.

Cleveland (1985) The Elements of Graphing Data
Tufte (1983) The Visual Display of Quantitative Information

One principle we see repeated across the DV literature is… Of course, there are disagreements about what is redundant or superfluous.

Reduce non-data ink?

[Figure: percent change in death rate from 1950 (y-axis, 0 to −40) by year (x-axis, 1950 to 1980), for cardiovascular and other deaths, with the first cardiovascular care unit marked.]

“The four scale lines also provide a clearly defined region where our eyes can search for data. With just two, data can be camouflaged by virtue of where they lie.” (WC p. 35)

ET removes grid lines, p. 100-105.

Cleveland is particularly worried about the audience missing data points if the region isn’t clearly defined. Here, that is not a concern. Plus, we are connecting the points with a line, so the audience will see them all.

Reduce redundant data ink?

[Figure: percent change in death rate from 1950 (y-axis, 0 to −40) by year (x-axis, 1950 to 1980), for cardiovascular and other deaths, with the first cardiovascular care unit marked.]

Here, the lines help show the trend and make it easier to see the data for other deaths. Lines are particularly useful for time trend data.

Reduce non-data ink?

[Figure: histograms of Number of patients (0 to 80) by PSA (ng/mL) (0 to 15), and box plots of PSA (ng/mL) (0 to 15).]

In fact, our standard graphing tools are filled with redundant or non-data ink. For the histogram, we could just use vertical lines instead of bars since the meaningful visual cue here is distance, not area. Box plots are primarily constructed of non-data ink. Why draw a box when horizontal lines will do? In both of these cases, the graph types are familiar to a scientific audience and you don’t necessarily need to follow all guiding principles to the extreme.

Reduce redundant data ink!

[Figure: percent change in death rate from 1950 (y-axis, 0 to −40) by year (x-axis, 1950 to 1980), for cardiovascular and other deaths, with the first cardiovascular care unit marked.]

Here, size and % change are redundant. Don’t do this. (Also, plotting symbols overlap. Don’t do this either.)

Guiding Principles

• Make the data stand out. Maximize the data-to-ink ratio.

• Avoid superfluity. Remove “chartjunk”. Reduce non-data ink and redundant data-ink.

• Strive for clarity.

• Clear vision.

• Clear understanding.

Cleveland (1985) The Elements of Graphing Data
Tufte (1983) The Visual Display of Quantitative Information

Data labels?

[Figure: percent change in death rate from 1950, by year (1950 to 1980), with labels placed directly in the data region: CARDIOVASCULAR DEATHS, OTHER DEATHS, and FIRST CARDIOVASCULAR CARE UNIT.]

“Avoid putting notes, keys, and markers in the data region. Put keys and markers just outside the data region and put notes in the legend or text.” Cleveland (p. 47)

Counterpoint: eyes don’t have to search to decode.

ET recommends markers/pointers.

Verdict: depends on your particular graphic. Try both and see which is easiest to read.

Grid lines?

[Figure: percent change in death rate from 1950 (y-axis, 0 to −40) by year (x-axis, 1950 to 1980), for cardiovascular and other deaths, with the first cardiovascular care unit marked.]

“Forgo chartjunk, including moire vibration, the grid, and the duck” ET p. 121

“Grid should usually be muted or completely suppressed so that its presence is only implicit, lest it compete with the data.” WC

ggplot2 has grid lines by default.

Verdict: sometimes helpful! Sometimes just clutter. Depends on your message.

Guidelines for Text

[Figure: the death-rate plot with text in upper and lower case, running left to right. Labels: Cardiovascular deaths, Other deaths, First cardiovascular care unit; x-axis: Year; y-axis: Change in death rate (%).]

• Words are spelled out; mysterious and elaborate encoding is avoided.

• Words run from left to right. (This goes against convention for y-axis labels. You will have to decide which works best for your purpose.)

• Type is upper and lower case, which is much easier to read!

Tufte (1983) The Visual Display of Quantitative Information

Scales

• “Choose the scales so that data fill up as much of the data region as possible.”

• “Choose the range of the tick marks to include or nearly include the range of the data.”

Cleveland (1985) The Elements of Graphing Data

Visualization Components | 109

FIGURE 3-15 Scales

Numeric

The visual spacing on a linear scale is the same regardless of where you are on the axis. So if you were to measure the distance between two points on the lower end of the scale, it’d be the same if they were at the high end of the scale.

On the other hand, a logarithmic scale condenses as you increase values. This scale is used less than the linear scale and is not as well understood or straightforward for those who don’t regularly work with data, but it’s useful if you’re interested in percent differences more than you are raw counts or your data has a wide range.

For example, when you compare state populations in the United States, you deal with numbers from the hundreds of thousands up to the tens of millions. As of this writing, California has a population of approximately 38 million people, whereas Wyoming has a population of approximately 600,000. As shown…

Yau, Nathan. Data Points, edited by Nathan Yau, Wiley, 2013. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/jhu/detail.action?docID=1158630.

Copyright © 2013. Wiley. All rights reserved.

Yau (2013) Data Points

Rule of thumb: 3-5 labels on each axis. Can add additional tick marks.

“Increments that make sense can increase readability as well as shift focus” (Yau, Data Points, p. 94)

Other notes on scales (pulled from WC EGD):

• It is sometimes helpful to use the pair of scale lines for a variable to show two different scales.

• Choose appropriate scales when graphs are compared.

• Do not insist that zero always be included on a scale showing magnitude.

• Use a scale break only when necessary. If a break cannot be avoided, use a full scale break. Do not connect numerical values on two sides of a break.

Logarithmic scale:

• Use a logarithmic scale when it is important to understand percent change or multiplicative factors.

• Showing data on a logarithmic scale can improve resolution.

• When logarithms of a variable are graphed, the scale label should correspond to the tick mark labels.
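A short numerical sketch may help show why a logarithmic scale suits multiplicative comparisons and improves resolution. It uses the approximate state populations quoted in the Yau excerpt. The axis limits and the function names `linear_position` and `log_position` are assumptions chosen for illustration. Each function returns the fraction of the axis length at which a value would be drawn.

```python
import math


def linear_position(value, lo, hi):
    """Fraction of the axis length at which `value` sits on a linear scale."""
    return (value - lo) / (hi - lo)


def log_position(value, lo, hi):
    """Same, on a base-10 logarithmic scale (all values must be positive)."""
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


# Hypothetical axis limits: 100 thousand to 100 million.
lo, hi = 1e5, 1e8
# Approximate populations as quoted in the Yau excerpt.
wyoming, california = 600_000, 38_000_000

for name, pop in [("Wyoming", wyoming), ("California", california)]:
    print(f"{name}: linear {linear_position(pop, lo, hi):.3f}, "
          f"log {log_position(pop, lo, hi):.3f}")

# Equal ratios occupy equal distances on a log scale: doubling from
# 100 thousand takes the same axis length as doubling from 10 million.
step_small = log_position(2e5, lo, hi) - log_position(1e5, lo, hi)
step_large = log_position(2e7, lo, hi) - log_position(1e7, lo, hi)
```

On the linear axis, Wyoming lands within the first one percent of the axis length, so small states are squeezed together. On the log axis it sits about a quarter of the way along, which is the improved resolution mentioned above.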


Dual y-axes: NOT clear

This graph was made famous (or infamous) by Jason Chaffetz, a congressman from Utah, in 2016 during hearings to defund Planned Parenthood. The graph gives the impression that the number of abortions performed at Planned Parenthood has surpassed the number of cancer screening and prevention procedures performed. Of course, this is only the perception you get from the graph; you have to read the labels to see that the numbers of procedures and the rates of change are at completely different magnitudes.

Dual y-axes: NOT clear

[Figure: the chart with two y-axis scales labeled. # Abortions: 250,000, 300,000, 350,000. # Cancer screening, Prevention: 0.5 million, 1 million, 1.5 million, 2 million.]

The most generous reading of this chart is that the author was using two separate scales for the y-axis, one for the number of abortions and one for the number of cancer screening and prevention procedures. Even with this addition, the audience still has a skewed perception of these data. This was widely acknowledged to be a misleading figure, but we see dual y-axis graphs all the time when authors want to show the relationship between two time trends. This graph is an example of how easy it is to change the story your data are telling based on how you choose to plot them.

Of course, many media outlets quickly published their own corrected versions. This plot shows the trend for both procedures on a single scale (number of procedures) and, at this scale, the change in the number of abortions seems nearly flat. Another way to plot the same data on a single scale would be to graph the percent change in the number of procedures. There we would see about a 10% increase in abortions alongside a more than 50% decline in cancer screening and prevention services. The context surrounding this graph should also explain why cancer screening has declined so dramatically, and one reason is pretty straightforward: Pap smears to screen for cervical cancer used to be performed once a year, but guidelines have changed to recommend Pap smears every 3-5 years.
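The percent-change view mentioned above is a one-line calculation. The sketch below uses rounded, illustrative counts that are assumptions chosen to be broadly consistent with the changes described, not figures quoted in the talk. `percent_change` is a hypothetical helper.

```python
def percent_change(start, end):
    """Percent change from start to end, relative to start."""
    return (end - start) / start * 100

# Illustrative, rounded counts (assumed; not quoted in the talk).
abortions = (290_000, 327_000)
screenings = (2_000_000, 935_000)

print(f"abortions:  {percent_change(*abortions):+.1f}%")
print(f"screenings: {percent_change(*screenings):+.1f}%")
```

With these assumed inputs the results are roughly +13% and -53%. Both series then share one axis in the same units, so neither trend is exaggerated by the axis choice.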

Source Evergreen Data

Source Evergreen Data

Source Evergreen Data

Clear Understanding

• Provide clear explanations for error bars, confidence bands, etc.

• Make legends comprehensive and informative.

1. Describe everything that is graphed.

2. Draw attention to the important features of the data.

3. Describe the conclusions that are drawn from the data on the graph.

Cleveland (1985) The Elements of Graphing Data

Keep it simple. Or not.

• “A large amount of quantitative information can be packed into a small region.” (p. 90)

• “Many useful graphs require careful, detailed study.” (p. 94)

Cleveland (1985) The Elements of Graphing Data

Proofread. Edit. Revise. Repeat.

• Creating statistical graphics is an iterative process.

• Consider alternative graphical approaches.

• Share graphics with collaborators and colleagues to gauge understanding.

• For presentations: evaluate figures (size, color) when projected on a big screen.

Seminar Outline

• Introduction

• Fundamentals of Statistical Graphics

• Data Visualization Best Practices

• Resources

Books on Data Visualization

• William Cleveland: The Elements of Graphing Data (1985)

• Edward Tufte: The Visual Display of Quantitative Information (1983, 2001); Envisioning Information (1990, 2001); Visual Explanations (1997); Beautiful Evidence (2006)

• Leland Wilkinson: Grammar of Graphics (1999)

• Nathan Yau: Visualize This (2011); Data Points (2013)

Online Resources

• Flowing Data (Nathan Yau)

• Information Visualization course from Enrico Bertini

• Data Remixed (Ben Jones)

• Dear Data (Giorgia Lupi and Stefanie Posavec)

• WTF Visualizations

Short course by Mike Jackson, October 22