TRANSCRIPT
How to make a picture worth a thousand words:
Effectively communicating your research results using statistical graphics
Yates Coley, PhD Kaiser Permanente Washington Health Research Institute
Seattle, WA
Joint work with Mike Jackson, PhD, KPWHRI
April 4, 2018
This talk is based on materials I developed with Mike Jackson for a conference workshop
Seminar Outline
• Introduction
• Fundamentals of Statistical Graphics
• Data Visualization Best Practices
• Resources
Data Visualization is…
• a scientific discipline.
• both a principled and subjective art.
• work!
• important!
• an organizing framework.
DV is an active area of research, a discipline with technical vocabulary and evidence-based practices.
While tenets of design do inform data viz best practices, there is rarely a single correct way to present information and results. The best graphic for a situation will depend on its purpose, the audience, and personal preferences.
With this in mind, constructing effective data visualizations takes time, patience, and effort. This may be the most important lesson to take from this seminar! Good graphs don’t just happen in a matter of minutes but require a commitment to communicating your data and results as well as possible.
DV is time-consuming, but it is also important because it is one of the most effective tools you have to communicate your research findings in presentations and publications. Your research has a greater potential for impact if the study and results are easy to understand.
Finally, the discipline of DV can provide you with an organizing framework for thinking about what works (and what doesn’t) when you are creating a graphic and planning how to approach DV for your publications and presentations. Many of us have an inherent understanding of many of the components of this framework through years of experience making graphs to present our research, but we can all gain insight and refine our skills by considering the underlying theory.
Objectives
• Present organizing framework for data visualization
• Describe conceptual best practices for creating statistical graphics and give concrete examples
• Provide sources and references for future consultation
May need to edit this
Seminar Outline
• Introduction
• Fundamentals of Statistical Graphics
• Data Visualization Best Practices
• Resources
Components of a data visualization
• Visual Cues
• Coordinate System
• Scale
• Context
[Figure: Visual cues (Figure 33 in Yau, 2013)]
Yau, Nathan. Data Points. Wiley, 2013. Copyright © 2013 Wiley. All rights reserved.
Visual Cues
Visual cues use “visual encoding” to communicate information by mapping data items to plotting symbols and other graphical elements. Appropriate visual cues for data attributes depend on whether they are identity (categorical) or magnitude (quantitative) variables.
[Figure: Visual cues. Source: Yau (2013) Data Points]
Visual Cues
Quantitative Variables
Visual cues for quantitative variables are able to indicate the relative differences or distance between two values.
[Figure: Visual cues. Source: Yau (2013) Data Points]
Visual Cues
Categorical Variables
Shapes and color hue are used to encode categorical variables because there is no inherent distance between, for example, a square and a circle.
[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot with Age (years) on the x-axis and PSA (ng/mL) on the y-axis; point color indicates low or high volume on diagnostic biopsy.]
So that is all a bit abstract. To give a concrete example, this plot shows the diagnostic characteristics for about 300 active surveillance prostate cancer patients. Each circle represents one patient. The position of the circle along the x-axis is a visual cue indicating the patient’s age at prostate cancer diagnosis. The vertical position indicates the patient’s PSA at diagnosis. (PSA is a biomarker of inflammation in the prostate.) Age and PSA are both continuous variables.
This circle represents a patient who is X years old and had a PSA of Y at diagnosis. This patient is older than those to the left and younger than those to the right. This patient had a higher PSA than those with circles below him and a lower PSA than those above.
We use color as a visual cue to indicate whether a patient’s diagnostic biopsy showed a low or high volume of cancer (categorical variable).
Components of a data visualization
• Visual Cues
• Coordinate System
• Scale
• Context
The other components of a DV are coordinate system, scale, and context.
Coordinate system: the space you are using to map data onto, such as a Cartesian plane for a scatterplot, polar coordinates for a pie chart, or a geographic system for a map. In this talk, most of my examples will use a Cartesian plane.
Scale: using intervals that make sense can increase the readability of a graph and shift its focus.
Context: we must also explain to the audience what the data mean and why they matter. Frequently, we use the figure title or caption to provide context directly. More generally, the presentation or paper provides greater context.
Data Visualization Process
• What data do you have?
• What do you want to know about your data?
• What visualization method should you use?
• What do you see and does it make sense?
Yau (2013) Data Points
One helpful way to approach the data visualization process is to work through these four questions.
Data Visualization Process
• What data do you have?
• Continuous, ordinal, or categorical?
• Time series?
• What do you want to know about your data?
• What visualization method should you use?
• What do you see and does it make sense?
Yau (2013) Data Points
What are the variable types—continuous, ordinal or categorical? Does it reflect values at a single moment in time, or the change in values over time?
Data Visualization Process
• What data do you have?
• What do you want to know about your data?
• Distributions of single variables?
• Relationships between variables?
• Summaries or unit-level detail?
• What visualization method should you use?
• What do you see and does it make sense?
Yau (2013) Data Points
Do you want to describe the distribution of each variable separately, or are you looking at how two or more variables are associated? Do you want to illustrate some summary of the data—such as mean and confidence interval—or do you want individual data points to be represented in your graph?
Data Visualization Process
• What data do you have?
• What do you want to know about your data?
• What visualization method should you use?
• What do you see and does it make sense?
Yau (2013) Data Points
How you answered the first two questions will determine the best options for visualizing your data. There are many different types of graphs for each problem, so I will just go through some of the most common.
[Figure: Two panels showing the distribution of PSA (ng/mL). Left: "Histogram: Unit-level", with number of patients on the y-axis. Right: "Boxplot: Summary".]
For a single, continuous variable, there are many options for visualizing the variable’s distribution. A histogram shows the distribution on the level of the unit of observation. Here, the visual cue is the height of the bar (so length and area) at each PSA grouping along the x-axis. Density plots are another way to show the distribution of a single variable.
Box plots give a summary of the variable (range, median, quartiles). Box plots use position as a visual cue, and they rely on the audience’s familiarity with the construct to decode median, IQR, etc.
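To make the distinction concrete, here is a minimal sketch of the numbers behind each plot: the per-bin counts that set histogram bar heights, and the five-number summary that a box plot draws. The function names and the PSA values are made up for illustration, and the quartile convention used (the "inclusive" method) is only one of several that plotting software may apply.

```python
from statistics import quantiles

def histogram_counts(values, bin_width, start=0.0):
    """Count values in bins [start + k*bin_width, start + (k+1)*bin_width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)  # index of the bin containing v
        left_edge = start + k * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

def five_number_summary(values):
    """Minimum, quartiles, and maximum: the quantities a box plot encodes by position."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values)}

# Hypothetical PSA values (ng/mL), not taken from the study.
psa = [1.2, 2.5, 3.1, 4.0, 4.4, 5.0, 5.6, 6.3, 7.9, 12.4]

bars = histogram_counts(psa, bin_width=5)  # bar height for each 5 ng/mL bin
box = five_number_summary(psa)             # positions drawn by the box plot
```

The histogram keeps one count per bin, so every observation contributes to a bar, while the box plot reduces the same data to five positions.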
[Figure: Two bar plots of number of patients. Left: PSA density at diagnosis (Low, High). Right: Cancer volume at diagnosis (Low, High).]
Add proportion-of-whole plots? Could do stacked bar plot and pie chart.
For a single categorical variable, bar plots can be used to show the number (or proportion) of observations in each category. Pie charts are an option for showing proportions, but they are less common in academic work (and typically ridiculed for reasons good and bad).
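As a rough sketch of what a bar plot of a categorical variable encodes, the snippet below tallies the count and proportion for each category. The function name and the labels are hypothetical and stand in for real patient data.

```python
from collections import Counter

def category_summary(labels):
    """Return {category: (count, proportion)} for a single categorical variable."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}

# Hypothetical cancer-volume labels, not taken from the study.
volume = ["Low"] * 6 + ["High"] * 2

summary = category_summary(volume)
# The count sets the bar height in a count bar plot; the proportion sets it
# in a proportion bar plot (or the slice size in a pie chart).
```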
[Figure: Diagnostic PSA of patients in active surveillance. Scatterplot with Age (years) on the x-axis and PSA (ng/mL) on the y-axis.]
Scatterplots are commonly used to show the joint distribution of two continuous variables. Scatterplots use the visual cue of position or distance along the x- and y-axes for each variable. Scatterplots also use the visual cue of direction or slope to suggest an association between variables.
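The direction a reader perceives in a cloud of points can be summarized numerically by a least-squares slope. The sketch below is only an illustration of that idea, with a hypothetical function name and made-up values. It is not an analysis from the talk.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical ages (years) and PSA values (ng/mL).
ages = [50, 55, 60, 65, 70]
psas = [4, 5, 5, 7, 9]

slope = least_squares_slope(ages, psas)  # positive slope: points trend upward
```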
[Figure: PSA observations for patients throughout active surveillance. Scatterplot with Age (years) on the x-axis and PSA (ng/mL) on the y-axis; the points overlap heavily.]
There are other options for visualizing the relationship between two continuous variables. For example, if we wanted to show ALL the PSA observations while these 300 patients were under surveillance (rather than just those observed at diagnosis), the scatterplot is way too crowded, and it is hard to tell how many points are overlapping and what the distribution actually is in this cloud.
[Figure: PSA observations for patients throughout active surveillance. Heat map with Age (years) on the x-axis and PSA (ng/mL) on the y-axis, with a color scale.]
In this case, we can use a heat map to see the joint distribution of PSA and age. Here, color saturation across three different hues is the visual cue, and a darker color indicates a larger number of observations in that region. Heat maps can also be particularly helpful for scatterplots of discrete data, where many points may overlap.
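The counts behind a heat map like this can be sketched as a two-dimensional binning step: each observation is assigned to a rectangular cell, and the number of observations in each cell is what the color encodes. The function name, bin widths, and values below are assumptions chosen for illustration.

```python
from collections import Counter

def bin_counts_2d(xs, ys, x_width, y_width):
    """Count observations per cell; each cell is keyed by its lower-left corner."""
    cells = Counter()
    for x, y in zip(xs, ys):
        corner = (int(x // x_width) * x_width, int(y // y_width) * y_width)
        cells[corner] += 1
    return cells

# Hypothetical (age, PSA) observations, not taken from the study.
ages = [51.2, 52.8, 53.1, 58.0, 58.9, 66.4]
psas = [3.9, 4.1, 4.8, 6.2, 6.9, 11.0]

cells = bin_counts_2d(ages, psas, x_width=5, y_width=5)
# A cell with a larger count would be drawn in a darker color.
```

Overlapping points that are indistinguishable in a scatterplot simply add to a cell’s count here, which is why the heat map remains readable when the scatterplot is crowded.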
[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot with Age (years) on the x-axis and PSA (ng/mL) on the y-axis; point color indicates low or high volume on diagnostic biopsy.]
Scatterplots also give us an option to examine 2 continuous and 1 or more categorical variables simultaneously. Here, we’ve added color to indicate high or low volume on diagnostic biopsy.
[Figure: Age at prostate cancer diagnosis. Two vertically aligned histograms of number of patients by Age (years). Top: low volume on diagnostic biopsy. Bottom: high volume on diagnostic biopsy.]
If we wanted to compare the distribution of a continuous variable across levels of a categorical variable, we could use vertically aligned histograms. Here, we are comparing the age at diagnosis for patients with low and high cancer volume on their diagnostic biopsy. We see a wider range for patients with low volume, but this particular graphic doesn’t help us make any determinations about how the middles of the distributions compare.
[Figure: Age at prostate cancer diagnosis. Side-by-side box plots of Age (years) for low volume on diagnostic biopsy and high volume on diagnostic biopsy.]
Another option is using side-by-side box plots to compare summaries of the distribution of a continuous variable across levels of a categorical variable. Here, we see that patients with high volume at diagnosis are, on average, older. Of course, we can’t tell from the box plot whether this difference is statistically significant (we’d have to perform a statistical test to determine that) or whether this difference is clinically meaningful.
[Figure: Mosaic plot. x-axis: PSA density (Low, High); y-axis: volume on diagnostic biopsy (Low, High).]
Mosaic plots are an option for examining the relationship between two categorical variables. Mosaic plots use both distance and area to help us understand univariate and bivariate trends. The widths of the two boxes here on the x-axis tell us that there are considerably more patients with low PSAD than high, because this line is much longer. Next, we can compare the heights of these boxes. We see that patients with low PSAD are more likely to also have a low volume of cancer at diagnosis because the vertical distance here, the height of the box, is greater than that of the box for high PSAD. Patients with high PSAD are more likely to have high cancer volume at diagnosis as well.
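The geometry described above can be sketched directly: column widths are the marginal proportions of the x-axis variable, and box heights within a column are the conditional proportions of the y-axis variable. The function name and the patient counts below are hypothetical and chosen only so that the pattern matches the description.

```python
from collections import Counter

def mosaic_geometry(pairs):
    """Return (widths, heights) for a mosaic plot of (x_category, y_category) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    column_totals = Counter(x for x, _ in pairs)
    cell_counts = Counter(pairs)
    # Width of each column: share of all observations in that x category.
    widths = {x: total / n for x, total in column_totals.items()}
    # Height of each box: share of the column falling in that y category.
    heights = {(x, y): c / column_totals[x] for (x, y), c in cell_counts.items()}
    return widths, heights

# Hypothetical (PSAD, cancer volume) pairs, not taken from the study.
patients = (
    [("Low", "Low")] * 9 + [("Low", "High")] * 3
    + [("High", "Low")] * 1 + [("High", "High")] * 3
)

widths, heights = mosaic_geometry(patients)
```

Because each box’s area is width times height, the area is proportional to the number of observations in that cell.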
Data Visualization Process
• What data do you have?
• What do you want to know about your data?
• What visualization method should you use?
• What do you see and does it make sense?
Yau (2013) Data Points
To prepare you to evaluate whether your DV makes sense, let’s move on to consider best practices for making effective statistical graphics.
Seminar Outline
• Introduction
• Fundamentals of Statistical Graphics
• Data Visualization Best Practices
• Resources
How do we define an “effective” statistical graphic?
• An effective statistical graphic enables the reader to
• extract information accurately
• with reasonable effort and
• high confidence.
Enrico Bertini Lecture #3
It is helpful to first consider how we might define an “effective” statistical graphic. I like this definition from Enrico Bertini, who says that…
Note “reasonable effort” (not “easy”): a graphic doesn’t have to be so simple that it is understood at a glance. For some of the plots I just showed, it took me several sentences to explain what we were looking at, and that is OK, as long as you don’t frustrate or discourage your audience.
Expressiveness Principle
Statistical graphic “should express all and only the information in the data” (and statistical results).
Enrico Bertini Lecture #4
Bertini outlines two principles that I think are helpful in guiding data visualization. First, your graphic should express all the information in your data but only the information in your data. And we can expand that to include the information in your analysis results.
Enrico Bertini Lecture #3
[Figure: Bar chart of number of observations (0 to 1200) for categories A through Q, in alphabetical order.]
Take, for example, a dataset with a number of observations in each category A-Q. This bar plot shows the number of observations in each category, but it is difficult to distinguish the ordering. For example, I can see that K has the most observations, but it is not clear at a glance how D compares to P.
Enrico Bertini Lecture #3
Sorted Bar Chart
[Figure: Bar chart of number of observations (0 to 1200) by category, with categories ordered by number of observations: K, B, Q, I, E, L, O, A, D, P, J, H, G, C, M, N, F.]
If we instead order the categories with respect to the number of records, the graphic provides *more* information than the alphabetical ordering. In this way, the revised graph is better representing *all* the information in the data.
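In code, the change from the alphabetical chart to the sorted chart is a single sorting step applied before plotting. The sketch below uses a hypothetical function name and made-up counts. The actual values from Bertini’s example are not given in the slides.

```python
def sort_for_bar_chart(counts):
    """Order (category, count) pairs from the largest count to the smallest."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Hypothetical counts per category.
counts = {"A": 420, "B": 980, "C": 150, "D": 400}

ordered = sort_for_bar_chart(counts)
categories = [category for category, _ in ordered]  # bar order along the axis
```

With this ordering, the rank of each category can be read off from position alone, so close values such as A and D no longer require a careful comparison of bar lengths.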
line chart with categorical data (wrong!)
Enrico Bertini Lecture #3
[Figure: Line chart of number of observations (0 to 1200) for categories A through Q, in alphabetical order.]
On the other hand, if the original alphabetized bar plot was transformed into a line plot, then we are displaying information that is not in the data; there is no relevant information in the change between alphabetized categories. This plot emphasizes the difference between categories A and B and between B and C but not A and C. Line plots like this should be reserved for time series data or other quantitative variables on the x-axis where the change across x is meaningful.
This construction implies some distance between categories that is meaningful, when the ordering of categories is arbitrary.
Effectiveness Principle
• “The importance of the information should match the salience of the mode of visual encoding”.
• “Salience” is characterized by:
• Accuracy
• Discriminability
• Separability
• “Pop-out”
• Grouping
Enrico Bertini Lecture #4
In addition to the expressiveness principle (your graphic should show all and only the information in your data), the other principle that Bertini outlines is the Effectiveness Principle…
• Accuracy: how accurately values can be estimated
• Discriminability: how many different values can be perceived
• Separability: degree of interaction between multiple encodings
• Pop-out: how easy it is to spot some values from the rest
• Grouping: how good a mode is at conveying groups, that is, how easy it is to perceive natural groupings of observations with similar attributes
DO we want to illustrate these properties on separate slides?
Enrico Bertini Lecture #8
Quantitative Variables Categorical Variables
Here, the effectiveness or salience of visual cues for quantitative and categorical variables is ranked from most to least effective.
Enrico Bertini Lecture #8
Quantitative Variables Categorical Variables
Position on a common scale is the most salient visual cue. This corresponds to the cartesian plane being the standard/default method for most statistical graphics.
[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot with Age (years) on the x-axis and PSA (ng/mL) on the y-axis; point color indicates low or high volume on diagnostic biopsy.]
Checklist: Accuracy, Discriminability, Separability, Pop-out, Grouping
If we evaluate the cartesian plane on our characteristics checklist, we see that we are able to accurately identify the age and PSA value associated with each point… This patient is about 54 years old and has a diagnostic PSA about 11 or 12. In this case, we could be even more accurate if I added grid lines at each year. We are able to discriminate well. It’s clear that this patient is younger than this patient and has a higher diagnostic PSA. We are able to separate the visual cues well, that is distinguish between horizontal and vertical distance, and distance and color cues do not interfere with one another. Outliers naturally pop-out because it’s a long distance from the cloud of points. And, we are able to identify any natural groupings that occur. These patients here are pretty similar, as are these…
Enrico Bertini Lecture #8
Quantitative Variables Categorical Variables
And we can go down the list of effectiveness. Next we have… tilt or angle is what we use for pie charts…
Source: New York Times
Area is commonly used in infographics in journalism. Large values pop out really well when using area, but we see here that it can be difficult to discriminate between circles of similar sizes. The goal here, though, is to highlight the US’s largest trading partners, emphasizing that most of the country’s trading is with Canada and Mexico. This graphic actually adds the ranks to the graph so you don’t have to struggle to compare circles of similar sizes. Overlap is another problem with using area in many contexts.
Source: The Economist
Here, the goal is to show that there is overwhelming opposition, both in the number of organizations and in their size.
Enrico Bertini Lecture #8
Quantitative Variables Categorical Variables
Moving to categorical variables, spatial placement is the most salient cue, like we use for maps. Next, we have color hue.
Enrico Bertini Lecture #8
Quantitative Variables Categorical Variables
Note shapes: it is usually easy to accurately differentiate between two shapes, so shape scores well on accuracy and discriminability, but the eye cannot pick out patterns, group symbols, etc. as easily as with color.
[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot with Age (years) on the x-axis and PSA (ng/mL) on the y-axis; point color indicates low or high volume on diagnostic biopsy.]
2 continuous and a categorical variable
[Figure: Diagnostic characteristics of patients in active surveillance. Scatterplot with Age (years) on the x-axis and PSA (ng/mL) on the y-axis; the legend "Diagnostic biopsy" distinguishes low volume from high volume.]
Add PSA plot with shapes for dx vol group instead of color
Enrico Bertini Lecture #8
Quantitative Variables Categorical Variables
Perception vs. Cognition
This brings us to an important idea in data visualization: you want readers to readily perceive the visual encoding of as many important attributes of the data as possible, rather than cognitively and explicitly decode values by, for example, reading a legend or key.
In this talk, I’ve been explicitly interpreting all the visual cues, but this is a task our minds do automatically with the most effective data visualizations. To return to an earlier example:
Enrico Bertini Lecture #3
[Figure: Bar chart of number of observations (0 to 1200) for categories A through Q, in alphabetical order.]
It is possible for a reader to painstakingly read through this list and order the categories based on number of observations. We are able to discriminate these values, but
Enrico Bertini Lecture #3
Sorted Bar Chart
[Figure: Bar chart of number of observations (0 to 1200) by category, with categories ordered by number of observations: K, B, Q, I, E, L, O, A, D, P, J, H, G, C, M, N, F.]
The ordering is naturally perceived and the audience can move on to drawing conclusions about where categories rank.
[Figure: PSA observations for patients throughout active surveillance. Heat map with Age (years) on the x-axis and PSA (ng/mL) on the y-axis, with a color scale.]
For another example of where we rely on perception
Guiding Principles
• Make the data stand out. Maximize the data-to-ink ratio.
• Avoid superfluity. Remove “chartjunk”. Reduce non-data ink and redundant data-ink.
• Strive for clarity.
• Clear vision.
• Clear understanding.
Cleveland (1985) Elements of Graphing Data; Edward Tufte (1983) Visual Display of Quantitative Information
These are the guiding principles that you will see in any DV book or presentation. They are taken from seminal DV texts and are helpful to keep in mind when constructing a graphic. But they are a little abstract, and they tend to overlap. What do they actually mean?
Visual Cues
“Make graphical elements encoding data visually prominent.”
[Figure: Percent change in death rate from 1950 (y-axis, -40 to 0) by year (x-axis, 1950 to 1980), with separate series for cardiovascular and other causes and an annotation marking the first cardiovascular care unit.]
Cleveland (1983) VDQI, Ch. 2
One obvious thing we can do to make the data stand out (that we don’t always do!) is make the graphical elements encoding the data visually prominent. Plotting symbols should be large/dark enough to be easily seen; shouldn’t be “obscured by line connecting” them; should not be on the scale line; “overlapping symbols should be distinguishable”; “superposed data must be readily visually discriminated”; “don’t allow other graphical elements to interfere” (Cleveland Ch 2)
Visual Prominence
Plotting symbols are large, dark enough to be easily seen
[Figure: Percent change in death rate from 1950 by year, cardiovascular vs. other causes.]
Cleveland (1983) VDQI, Ch. 2
Visual Prominence
Plotting symbols aren’t obscured by connecting lines
[Figure: Percent change in death rate from 1950 by year, cardiovascular vs. other causes.]
Cleveland (1983) VDQI, Ch. 2
Visual Prominence
Overlapping plotting symbols are easily distinguishable
[Figure: percent change in death rate from 1950, plotted by year (1950 to 1980) for cardiovascular and other deaths, with the first cardiovascular care unit marked.]
Cleveland (1983) VDQI, Ch. 2
Visual Prominence
Superposed data readily visually discriminated
[Figure: percent change in death rate from 1950, plotted by year (1950 to 1980) for cardiovascular and other deaths, with the first cardiovascular care unit marked.]
Cleveland (1983) VDQI, Ch. 2
Visual Prominence
Graphical elements do not interfere with data
[Figure: percent change in death rate from 1950, plotted by year (1950 to 1980) for cardiovascular and other deaths, with the first cardiovascular care unit marked.]
Cleveland (1983) VDQI, Ch. 2
Visual hierarchy
Yau (2013) Data Points
Excerpt from Yau (2013), p. 203:
… point of interest. This creates a visual hierarchy that helps readers immediately focus on the vital parts of a data graphic and use the surroundings as context, as opposed to a flat graphic that a reader must visually rummage through.
For example, Figure 5-1 is the scatterplot from the previous chapter that shows NBA players’ usage percentage versus points per game. The dots, fitted line, grid, border, and labels are of the same color and thickness, so there is no clear visual focus. It’s a flat image, where all the elements are on the same level.
FIGURE 5-1 All visual elements on the same level
This is easily remedied with a few small changes. In Figure 5-2, the line width of the grid lines is reduced so that they are no longer as thick as the fitted line. In this example, you want the data to stand out. The grid lines also alternate in width so that it is easier to see where each data point lies in the coordinate system, and there’s no imaginary blur that you get in the original chart.
FIGURE 5-2 Width of grid lines reduced to fit in background
Place visual elements on different “levels” to shift focus, draw attention to most important aspect of data or results.
Another consideration for making the data stand out is to construct a visual hierarchy that draws attention to the data. Here, the image is “flat”: all graphical elements are on the same level. It is hard to distinguish the data and trend line from the grid, and the image “vibrates”.
“Highlight data with bolder colors than the other visual elements, and lighten or soften other elements so that they sit in the background. Use arrows and lines to direct eyes to the point of interest. This creates a visual hierarchy that helps readers immediately focus on the vital parts of a data graphic and use the surroundings as context, as opposed to a flat graphic that a reader must visually rummage through.” (p. 203)
Visual hierarchy
Place visual elements on different “levels” to shift focus, draw attention to most important aspect of data or results.
Yau (2013) Data Points
Excerpt from Yau (2013), p. 204 (Chapter 5: Visualizing with Clarity):
Still though, the fitted line is obscured by all the dots, because (1) it’s thin compared to the radius of each dot and (2) it still blends in with the grid behind it. Figure 5-3 changes the color to blue to make the data stand out more, and the width of the fitted line is increased so that it clearly rests on top of the dots.
FIGURE 5-3 Focus of chart shifted to fitted line with color and width
The chart is a lot more readable now, but if you imagine people viewing the graphic like they would a body of text (from top to bottom and left to right), more descriptive axis labels and less prominent value labels can help, as shown in Figure 5-4. The text within the chart works similar to how it does in an essay or a book. Headers are often printed bigger and in a bold font to provide both structure and a sense of flow. In this case, the bolder labels provide immediate context for what the chart is about. Also, notice fewer and less prominent gridlines, which directs focus further to the upward trend.
FIGURE 5-4 Grid and value labels adjusted and fewer, less prominent gridlines
As a solution, edit the image to place visual elements on different levels. In particular, edit the width, color, and frequency of the grid lines; make the trend line wider and color the data points lighter to emphasize the trend (you could alternatively emphasize the data points and mute the trend line). Remove the top and bottom frames of the image.
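The layering described above can be sketched in code. This is only a minimal illustration, not how Yau's figures were made: it uses plain Python to write an SVG scatterplot with three levels (light, thin grid lines in the background; light-colored data points; a wide, dark trend line on top). The function names, colors, sizes, and example points are all assumptions chosen for the sketch.

```python
def trend_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def layered_scatter_svg(points, width=400, height=300, margin=40):
    """Return an SVG scatterplot drawn on three visual levels."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)

    def px(x):  # data x to pixel x
        return margin + (x - x_lo) / (x_hi - x_lo) * (width - 2 * margin)

    def py(y):  # data y to pixel y (SVG y grows downward)
        return height - margin - (y - y_lo) / (y_hi - y_lo) * (height - 2 * margin)

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']

    # Background level: a few thin, light grid lines.
    for i in range(5):
        gy = margin + i * (height - 2 * margin) / 4
        parts.append(
            f'<line x1="{margin}" y1="{gy:.1f}" x2="{width - margin}" y2="{gy:.1f}" '
            'stroke="#dddddd" stroke-width="0.5"/>'
        )

    # Middle level: data points in a light color.
    for x, y in points:
        parts.append(f'<circle cx="{px(x):.1f}" cy="{py(y):.1f}" r="4" fill="#9ecae1"/>')

    # Top level: a wide, dark trend line, drawn last so it sits on top.
    slope, intercept = trend_line(points)
    parts.append(
        f'<line x1="{px(x_lo):.1f}" y1="{py(slope * x_lo + intercept):.1f}" '
        f'x2="{px(x_hi):.1f}" y2="{py(slope * x_hi + intercept):.1f}" '
        'stroke="#08519c" stroke-width="3"/>'
    )
    parts.append("</svg>")
    return "\n".join(parts)


# Made-up points, purely for illustration.
example_points = [(10, 4), (15, 9), (20, 11), (25, 18), (30, 21)]
svg_text = layered_scatter_svg(example_points)
```

Writing `svg_text` to a `.svg` file and opening it in a browser shows the result. To emphasize the data points instead of the trend, swap the two colors and reduce the trend line's `stroke-width`.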
Guiding Principles
• Make the data stand out. Maximize the data-to-ink ratio.
• Avoid superfluity. Remove “chartjunk”. Reduce non-data ink and redundant data-ink.
• Strive for clarity.
  • Clear vision.
  • Clear understanding.
Cleveland (1985) Elements of Graphing Data; Edward Tufte (1983) Visual Display of Quantitative Information
One principle we see repeated across the DV literature is… Of course, there are disagreements about what is redundant or superfluous.
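For reference, the quantity being maximized can be written out. This is Tufte's definition as I understand it from The Visual Display of Quantitative Information, where it is called the "data-ink ratio"; I am assuming the slide's "data-to-ink ratio" means the same thing.

```latex
\text{data-ink ratio}
  = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
  = 1 - \bigl(\text{proportion of the graphic that can be erased without loss of data-information}\bigr)
```

The disagreements mentioned above are mostly about the second form: what counts as erasable without loss of information.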
Reduce non-data ink?
[Figure: percent change in death rate from 1950, plotted by year (1950 to 1980) for cardiovascular and other deaths, with the first cardiovascular care unit marked.]
“The four scale lines also provide a clearly defined region where our eyes can search for data. With just two, data can be camouflaged by virtue of where they lie.” (WC p. 35)
ET removes grid lines (pp. 100-105).
Cleveland is particularly worried about the audience missing data points if the region isn’t clearly defined. Here, that is not a concern. Plus, we are connecting the points with a line, so the audience will see them all.
Reduce redundant data ink?
[Figure: percent change in death rate from 1950, plotted by year (1950 to 1980) for cardiovascular and other deaths, with the first cardiovascular care unit marked.]
Here, lines are helpful to see trend, and make it easier to see the other deaths data. Lines are particularly useful for time trend data.
Reduce non-data ink?
[Figures: two histograms of number of patients by PSA (ng/mL), x-axis 0 to 15 and y-axis 0 to 80, and two box plots of PSA (ng/mL), axis 0 to 15.]
In fact, our standard graphing tools are filled with redundant or non-data ink. For the histogram, we could just use vertical lines instead of bars since the meaningful visual cue here is distance, not area. Box plots are primarily constructed of non-data ink. Why draw a box when horizontal lines will do? In both of these cases, the graph types are familiar to a scientific audience and you don’t necessarily need to follow all guiding principles to the extreme.
Reduce redundant data ink!
[Figure: percent change in death rate from 1950, plotted by year (1950 to 1980) for cardiovascular and other deaths, with the first cardiovascular care unit marked.]
Here, size and % change are redundant. Don’t do this. (Also, plotting symbols overlap. Don’t do this either.)
Guiding Principles
• Make the data stand out. Maximize the data-to-ink ratio.
• Avoid superfluity. Remove “chartjunk”. Reduce non-data ink and redundant data-ink.
• Strive for clarity.
  • Clear vision.
  • Clear understanding.
Cleveland (1985) Elements of Graphing Data; Edward Tufte (1983) Visual Display of Quantitative Information
These are the guiding principles that you will see in any DV book or presentation. They are taken from seminal DV texts and are helpful to keep in mind when constructing a graphic. But, they are a little abstract. What do they actually mean?
Data labels?
[Figure: the death-rate chart (percent change in death rate from 1950 by year) with labels placed directly in the data region: “CARDIOVASCULAR DEATHS”, “OTHER DEATHS”, and “FIRST CARDIOVASCULAR CARE UNIT”.]
“Avoid putting notes, keys, and markers in the data region. Put keys and markers just outside the data region and put notes in the legend or text.” Cleveland (p. 47)
Counterpoint: eyes don’t have to search to decode.
ET recommends markers/pointers.
Verdict: depends on your particular graphic. Try both and see which is easiest to read.
Grid lines?
[Figure: percent change in death rate from 1950, plotted by year (1950 to 1980) for cardiovascular and other deaths, with the first cardiovascular care unit marked.]
“Forgo chartjunk, including moire vibration, the grid, and the duck” ET p. 121
“Grid should usually be muted or completely suppressed so that its presence is only implicit, lest it compete with the data.” WC
ggplot2 has grid lines by default.
Verdict: sometimes helpful! Sometimes just clutter. Depends on your message.
Guidelines for Text
[Figure: the death-rate chart relabeled in upper and lower case with horizontal text: “Cardiovascular deaths”, “Other deaths”, “First cardiovascular care unit”; x-axis “Year”; y-axis “Change in death rate (%)”.]
• Words are spelled out, mysterious and elaborate encoding avoided.
• Words run from left to right. (This goes against convention for y-axis labels. You will have to decide which works best for your purpose.)
• Type is upper and lower case: much easier to read!
Tufte (1983) Visual Display of Quantitative Information
Scales
• “Choose the scales so that data fill up as much of the data region as possible.”
• “Choose the range of the tick marks to include or nearly include the range of the data.”
Cleveland (1985) Elements of Graphing Data
Excerpt from Yau (2013), p. 109:
FIGURE 3-15 Scales
Numeric
The visual spacing on a linear scale is the same regardless of where you are on the axis. So if you were to measure the distance between two points on the lower end of the scale, it’d be the same if they were at the high end of the scale.
On the other hand, a logarithmic scale condenses as you increase values. This scale is used less than the linear scale and is not as well understood or straightforward for those who don’t regularly work with data, but it’s useful if you’re interested in percent differences more than you are raw counts or your data has a wide range.
For example, when you compare state populations in the United States, you deal with numbers from the hundreds of thousands up to the tens of millions. As of this writing, California has a population of approximately 38 million people, whereas Wyoming has a population of approximately 600,000. As shown …
Yau (2013) Data Points
Rule of thumb: 3-5 labels on each axis. Can add additional tick marks.
“Increments that make sense can increase readability as well as shift focus” (Yau, Data Points, p. 94)
Other notes on scales (pulled from WC EGD):
• It is sometimes helpful to use the pair of scale lines for a variable to show two different scales.
• Choose appropriate scales when graphs are compared.
• Do not insist that zero always be included on a scale showing magnitude.
• Use a scale break only when necessary. If a break cannot be avoided, use a full scale break. Do not connect numerical values on two sides of a break.
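One way to turn the tick-mark advice above into something mechanical is sketched below. It picks a step of 1, 2, or 5 times a power of ten, then extends the first and last tick outward so that the ticks include the data range. This is a common heuristic rather than anything taken from Cleveland or Yau; the function names and the `target_labels` parameter are my own. The result is usually, but not always, between 3 and 5 labels.

```python
import math


def nice_step(raw_step):
    """Round a raw step up to 1, 2, 5, or 10 times a power of ten."""
    exponent = math.floor(math.log10(raw_step))
    fraction = raw_step / 10 ** exponent
    for nice in (1, 2, 5, 10):
        if fraction <= nice:
            return nice * 10 ** exponent
    return 10 ** (exponent + 1)  # guard against floating-point edge cases


def axis_ticks(data_min, data_max, target_labels=5):
    """Tick values with 'sensible' increments that include the data range."""
    if data_max <= data_min:
        raise ValueError("data_max must be greater than data_min")
    step = nice_step((data_max - data_min) / (target_labels - 1))
    first = math.floor(data_min / step) * step  # at or below the smallest value
    last = math.ceil(data_max / step) * step    # at or above the largest value
    count = round((last - first) / step) + 1
    return [round(first + i * step, 10) for i in range(count)]


print(axis_ticks(0, 15))   # [0, 5, 10, 15]
print(axis_ticks(0, 80))   # [0, 20, 40, 60, 80]
```

The two printed results happen to match the axes of the PSA histogram shown earlier. Because the ticks are pushed outward to round values, this favors readable labels over filling the data region completely; if the padding wastes too much space, extend the scale line only to the data range and keep the round tick values inside it.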
Logarithmic scale:
• Use a logarithmic scale when it is important to understand percent change or multiplicative factors.
• Showing data on a logarithmic scale can improve resolution.
• When logarithms of a variable are graphed, the scale label should correspond to the tick mark labels.
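A small sketch may help show why a log scale suits multiplicative comparisons. It computes where a value falls along an axis, as a fraction of the axis length, on a linear scale and on a base-10 log scale. The populations are the approximate figures quoted in the Yau excerpt; the axis limits (100,000 to 100 million) and the function names are assumptions made for the example.

```python
import math


def linear_position(value, low, high):
    """Fraction of the axis length at which `value` sits on a linear scale."""
    return (value - low) / (high - low)


def log_position(value, low, high):
    """Fraction of the axis length at which `value` sits on a log10 scale."""
    return math.log10(value / low) / math.log10(high / low)


LOW, HIGH = 100_000, 100_000_000  # assumed axis limits

for name, population in [("Wyoming", 600_000), ("California", 38_000_000)]:
    print(
        name,
        round(linear_position(population, LOW, HIGH), 3),
        round(log_position(population, LOW, HIGH), 3),
    )

# A doubling covers the same distance anywhere on the log axis.
small_doubling = log_position(200_000, LOW, HIGH) - log_position(100_000, LOW, HIGH)
large_doubling = log_position(20_000_000, LOW, HIGH) - log_position(10_000_000, LOW, HIGH)
print(math.isclose(small_doubling, large_doubling))  # True
```

On the linear axis Wyoming is squeezed against the bottom of the scale, which is the loss of resolution the notes refer to; on the log axis it sits about a quarter of the way up.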
Dual y-axes: NOT clear
This graph was made famous (or infamous) by Jason Chaffetz, a congressman from Utah, in 2015 during hearings to defund Planned Parenthood. It gives the impression that the number of abortions performed at Planned Parenthood has surpassed the number of cancer screening and prevention procedures performed. Of course, this is only the perception you get from the graph; you have to read the labels to see that the number of procedures and the rate of change are at completely different magnitudes.
Dual y-axes: NOT clear
[Figure: the same chart with two y-axes added: “# Abortions” (ticks at 250,000, 300,000, 350,000) and “# Cancer screening, Prevention” (ticks at 0.5, 1, 1.5, and 2 million).]
The most generous reading of this chart is that the author was using two separate scales for the y-axes, one for the number of abortions and one for the number of cancer screening and prevention procedures. Even with this addition, the audience still has a skewed perception of these data. This was widely acknowledged to be a misleading figure, but we see dual y-axis graphs all the time when authors want to show the relationship between two time trends. This graph is an example of how easy it is to change the story your data are telling based on how you choose to plot them.
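To see how two y-axes can reverse the visual ordering of two series, here is a minimal sketch that computes how high a value is drawn as a fraction of the axis height. The axis limits are read from the tick labels on the slide, and I am assuming each axis runs exactly from its lowest to its highest tick. The two counts are hypothetical round numbers chosen to fall within those ranges, not Planned Parenthood's actual figures.

```python
def axis_fraction(value, axis_min, axis_max):
    """Vertical position of `value` as a fraction of the axis height."""
    return (value - axis_min) / (axis_max - axis_min)


# Hypothetical end-of-period counts (NOT the real figures).
abortions = 320_000
screenings = 950_000

# Dual axes: each series gets its own limits (assumed from the slide's ticks).
dual_abortions = axis_fraction(abortions, 250_000, 350_000)
dual_screenings = axis_fraction(screenings, 500_000, 2_000_000)

# Single shared axis from 0 to 2 million procedures.
shared_abortions = axis_fraction(abortions, 0, 2_000_000)
shared_screenings = axis_fraction(screenings, 0, 2_000_000)

print(dual_abortions > dual_screenings)      # True: abortions are drawn higher
print(shared_abortions > shared_screenings)  # False: screenings are drawn higher
```

The smaller count is drawn above the larger one only because the two axes have different limits. Nothing about the data changed between the two comparisons.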
Of course, many media outlets quickly published their own corrected versions. This plot shows the trend for both procedures on a single scale (number of procedures) and, at this scale, the change in the number of abortions seems nearly flat. Another way to plot the same data on a single scale would be to graph the percent change in the number of procedures. There we would see about a 10% increase in abortions alongside a more than 50% decline in cancer screening and prevention services. The context surrounding this graph should also explain why cancer screening has declined so dramatically, and in fact one reason is pretty straightforward: Pap smears to screen for cervical cancer used to be performed once a year, but guidelines have changed to recommend Pap smears every 3-5 years.
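The percent-change alternative mentioned above is a one-line calculation, sketched here so the arithmetic is explicit. The start and end counts are hypothetical round numbers chosen to roughly reproduce the changes described (about a 10% increase and a more than 50% decline); they are not the actual reported figures.

```python
def percent_change(start, end):
    """Percent change from `start` to `end`, relative to `start`."""
    if start == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return 100.0 * (end - start) / start


# Hypothetical (start, end) counts, NOT the real figures.
series = {
    "Abortions": (290_000, 320_000),
    "Cancer screening and prevention": (2_000_000, 950_000),
}

for name, (first, last) in series.items():
    print(f"{name}: {percent_change(first, last):+.1f}%")
# Abortions: +10.3%
# Cancer screening and prevention: -52.5%
```

Plotting these values puts both series on one common scale, in the same way the death-rate chart used earlier plots percent change from 1950 rather than raw rates.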
Source Evergreen Data
Clear Understanding
• Provide clear explanations for error bars, confidence bands, etc.
• Make legends comprehensive and informative.
  1. Describe everything that is graphed.
  2. Draw attention to the important features of the data.
  3. Describe the conclusions that are drawn from the data on the graph.
Cleveland (1985) Elements of Graphing Data
Keep it simple. Or not.
• “A large amount of quantitative information can be packed into a small region.” (p. 90)
• “Many useful graphs require careful, detailed study.” (p. 94)
Cleveland (1985) Elements of Graphing Data
Proofread. Edit. Revise. Repeat.
• Creating statistical graphics is an iterative process.
• Consider alternative graphical approaches.
• Share graphics with collaborators, colleagues to gauge understanding.
• For presentation: evaluate figures (size, color) when projected on big screen
Seminar Outline
• Introduction
• Fundamentals of Statistical Graphics
• Data Visualization Best Practices
• Resources
Books on Data Visualization
• William Cleveland: The Elements of Graphing Data (1985)
• Edward Tufte:
  • The Visual Display of Quantitative Information (1983, 2001)
  • Envisioning Information (1990, 2001)
  • Visual Explanations (1997)
  • Beautiful Evidence (2006)
• Leland Wilkinson: Grammar of Graphics (1999)
• Nathan Yau:
  • Visualize This (2011)
  • Data Points (2013)
Online Resources
• Flowing Data (Nathan Yau)
• Information Visualization course from Enrico Bertini
• Data Remixed (Ben Jones)
• Dear Data (Giorgia Lupi and Stefanie Posavec)
• WTF Visualizations
Short course by Mike Jackson, October 22