
2018 Predictive Analytics Symposium

Session 42: Visualization: A Picture Speaks a Thousand Words

SOA Antitrust Compliance Guidelines SOA Presentation Disclaimer

Telling Your Data Story

MARY PAT CAMPBELL, FSA, MAAA, PRM

VP, Insurance Research, Conning

21 September 2018

https://en.wikipedia.org/wiki/Charles_Joseph_Minard


The Why of Data Visualization. https://www.soa.org/News-and-Publications/Newsletters/Compact/2016/march/The-Why-of-Data-Visualization.aspx

Evaluate Your Visualization

Completeness: from "No relevant data" to "All relevant data"

Perceptibility: from "Unclear and difficult" to "Clear and easy"

Intuitiveness: from "Unfamiliar; hard to understand" to "Familiar; easy to understand"

Source: http://www.perceptualedge.com/articles/visual_business_intelligence/data_visualization_effectiveness_profile.pdf

What is Your Story?

Distribution

Change over time

Correlation or Relationship

Comparison between items (ranking)

Comparison over space (maps)

Parts of a whole

Things to Try to Improve Readability

REMOVE
• Gridlines
• Legend – replace with data labels

Instead:
• Add explanatory text
• Highlight key elements
• Use multiples of same graph

Some Data Stories


Data Set 1: Modeled Income Percentiles

Data source: http://go.epi.org/unequalstates2018data

Report: Sommeiller, Estelle and Price, Mark. “The New Gilded Age”. Economic Policy Institute. 19 July 2018. https://www.epi.org/publication/the-new-gilded-age-income-inequality-in-the-u-s-by-state-metropolitan-area-and-county/


Source: “See How Much the Top 1% Earn in Every State”, 30 Aug 2018 https://howmuch.net/articles/average-annual-income-of-the-top-1-percent

[Figure: The Long Tail of High Income. By state, the 99th percentile income and the average income of the top 1%, on a scale of $0 to $3,000,000. Labeled: Connecticut (#1), New York (#2), Massachusetts (#3), Wyoming (#4), District of Columbia (#5).]

Average Income of Top 1% Taxpayers

[Figure: Higher Percentile, Longer Tail. Scatterplot with one labeled circle per state and the District of Columbia; x-axis: 99th percentile income ($0.2 to $0.8 million); y-axis: average income of top 1% taxpayers less the 99th percentile income ($0.0 to $3.0 million); circle size scales by number of taxpayers; R² = 0.7231.]

[Figure: Low Correlation Between Population and Income Tail Length. Scatterplot with one labeled point per state and the District of Columbia; x-axis: number of taxpayers in the 1% (0 to 200,000); y-axis: percent difference between average income of top 1% and 99th percentile (100% to 400%); R² = 0.1053.]

[Figure: Geographic Outliers of Top Income. Scatterplot; x-axis: number of taxpayers in the 1% (0 to 200,000); y-axis: average income of top 1% taxpayers, $ in millions ($0.0 to $2.5); R² = 0.1476. Labeled: California, Connecticut, District of Columbia, Florida, Illinois, Massachusetts, New Jersey, New York, Texas, Wyoming.]

Data Set 2: Mortality by Cause

Source: National Center for Health Statistics

Data Visualization Gallery

https://www.cdc.gov/nchs/data-visualization/index.htm

Source: https://www.cdc.gov/nchs/data-visualization/mortality-trends/

[Figure: Age-adjusted death rates (0 to 700) by year, 1900 to 2010, with one line each for Accidents, Cancer, Heart Disease, Influenza and Pneumonia, and Stroke.]

Age-Adjusted Death Rates, per 100,000

Cause                     1965   2015
Accidents                   66     43
Cancer                     196    159
Heart Disease              543    169
Influenza and Pneumonia     47     15
Stroke                     166     38
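The 1965 and 2015 death rates on this slide suit a slope chart, one of the "change over time" options. The sketch below is one way to draw it in R with ggplot2. It is an illustration only: the object and column names (`death_rates`, `cause`, `year`, `rate`) are assumptions, and the original chart may well have been built in a different tool.

```r
library(ggplot2)

# Age-adjusted death rates per 100,000, as read from the slide
death_rates <- data.frame(
  cause = rep(c("Accidents", "Cancer", "Heart Disease",
                "Influenza and Pneumonia", "Stroke"), times = 2),
  year  = rep(c(1965, 2015), each = 5),
  rate  = c(66, 196, 543, 47, 166,   # 1965
            43, 159, 169, 15, 38)    # 2015
)

# One line per cause, with direct data labels instead of a legend
slope_plot <- ggplot(death_rates, aes(x = year, y = rate, group = cause)) +
  geom_line() +
  geom_point() +
  geom_text(data = subset(death_rates, year == 2015),
            aes(label = paste0(cause, ", ", rate)),
            hjust = 0, nudge_x = 1) +
  scale_x_continuous(breaks = c(1965, 2015), limits = c(1965, 2040)) +
  labs(x = NULL, y = "Age-adjusted death rate, per 100,000") +
  theme_minimal()

print(slope_plot)
```

Labeling the line ends directly follows the earlier readability advice to replace the legend with data labels.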

Data Set 3: Public Plans Data

http://publicplansdata.org/

The most frequently used return assumption is 7.5%

Return Assumptions Are Concentrated, And Shifting Down

[Figure: cumulative percentage of public plans (0% to 100%) against investment return assumption (6.0% to 9.5%).]

In FY 2016, 82% of plans in the Public Plans Database used return assumptions of 7.75% or less. In FY 2001, 19% did.

Public Plan Funded Ratios

[Figure: panels for 2001, 2006, 2011 and 2016.]

Choosing a Visualization Type

What Kind of Data Do You Have?

• Dimensionality:
  • One: histogram, box-and-whisker, pie chart, table with summary stats
  • Two: line, bar/column, scatterplot
  • Many: multiples
• Numerical or categorical
  • Categorical: bar/column (may want to sort categories), histogram
• Grouped
  • Clustered columns, multiple graphs
• Large set – or just a few numbers
  • Large: will generally need to simplify/summarize/group along some dimension
  • Few: consider table or just a number
• Geographic
  • Does location actually count?
  • Tile grid when entities equally weighted

What is Your Story?

• Distribution
  • Density plot, histogram, box-and-whisker
• Change over time
  • Line, slope
• Correlation or Relationship
  • Scatterplot, bubble plot
• Comparison between items (ranking)
  • Slope, list/table, conditional formatting on table
• Comparison over space (maps)
  • Choropleths, tile maps
• Parts of a whole
  • Pie, stacked bar/column

Additional Resources


Storytelling with Data

Looks at how to design graphs and other displays for maximum effect

Most can be done in Excel

Websites

The Chartmaker Directory

http://chartmaker.visualisingdata.com/

Visualization Universe

Chart types: http://visualizationuniverse.com/charts/

Charting books: http://visualizationuniverse.com/books/

PolicyViz

https://policyviz.com/

Can You See It?

CLIMBING THE ZEN MOUNTAIN

WHAT WE’LL TALK ABOUT

• Seeing numbers
• Seeing hypotheses
• Seeing models

SEEING NUMBERS

THE TREACHERY OF IMAGES

Image taken from a University of Alabama site, “Approaches to Modernism”: [1], Fair use, https://en.wikipedia.org/w/index.php?curid=555365

THE NUMBER 7

WE CANNOT SEE NUMBERS

Arabic or Sanskrit numerals are no more legitimate than any other representation of numbers.

We can no more see numbers than we can hear, smell or taste them.

SCALING THE ZEN MOUNTAIN

“Before I studied Zen, I saw mountains as mountains and rivers as rivers. When I had studied Zen for thirty years I no longer saw mountains as mountains and rivers as rivers. But now that I have finally mastered Zen, I once again see mountains as mountains and rivers as rivers.”

Ch’an master Ch’ing Yuan

MANY NUMBERS - STATISTICS

Statistics maps a set of many numbers into a set of fewer numbers.

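library(tibble)    # tibble()
library(magrittr)  # %>%

set.seed(1234)
meanlog_actual <- log(10e3)
sdlog_actual <- 0.5

# Simulate 5,000 lognormal observations
tbl_obs <- tibble(
  x = rlnorm(5e3, meanlog = meanlog_actual, sdlog = sdlog_actual)
)

# Six numbers standing in for 5,000
tbl_obs$x %>%
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1830    7132    9970   11266   13947   49429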

MANY NUMBERS VISUALLY

Summary statistics always carry reduced information.

A visualization represents all of the data, but forces our eyes to compute the statistics.

Increased efficiency vs. decreased accuracy
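To make that trade-off concrete, the sketch below draws a simulated lognormal sample as a histogram with the mean and median marked. It is an illustration only: the object names and the plotting choices (50 bins, dashed and dotted lines) are assumptions, not the presenter's code.

```r
library(ggplot2)

set.seed(1234)
obs <- data.frame(x = rlnorm(5e3, meanlog = log(10e3), sdlog = 0.5))

# All 5,000 observations in one picture; the eye estimates
# centre, spread and skew instead of reading them from a table
hist_plot <- ggplot(obs, aes(x = x)) +
  geom_histogram(bins = 50, fill = "lightblue", colour = "white") +
  geom_vline(xintercept = mean(obs$x), linetype = "dashed") +    # mean
  geom_vline(xintercept = median(obs$x), linetype = "dotted") +  # median
  labs(x = "x", y = "Count")

print(hist_plot)
```

The long right tail is obvious at a glance, while the exact quartiles are not; that is the efficiency gained and the accuracy given up.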

SEEING HYPOTHESES

STATISTICAL HYPOTHESES

Many different sorts:

• Were the data generated by this form of distribution?
• Were these two samples generated by different processes?
• Is there a relationship between these two variables?

[list influenced by http://had.co.nz/stat645/graphical-inference.pdf]

SAMPLE DATA

SAMPLE AND HYPOTHESIS

HYPOTHESIS TESTING

• Kolmogorov-Smirnov
• Parameter significance
• \(\chi^2\) test

Also:

Test against other candidates, visually
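As a hedged illustration of the first item in the list above, a Kolmogorov-Smirnov test can be run with base R's `ks.test()`. The sample is simulated for this example (it is not the presenter's sample), and the two candidate distributions are assumptions chosen to show one plausible and one implausible null. When parameters are estimated from the same sample, as in the normal candidate, the reported p-value is only approximate.

```r
set.seed(1234)
x <- rlnorm(2000, meanlog = log(10e3), sdlog = 0.5)

# Candidate 1: lognormal with the parameters used to simulate the sample
ks_lognormal <- ks.test(x, "plnorm", meanlog = log(10e3), sdlog = 0.5)

# Candidate 2: normal with mean and standard deviation matched to the sample
ks_normal <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# A larger D statistic means the empirical CDF sits further from the candidate
c(lognormal = unname(ks_lognormal$statistic),
  normal    = unname(ks_normal$statistic))
```

The test returns a single number per candidate; plotting the candidates against the data shows where they disagree.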

COULD THE DATA HAVE COME FROM SOMEWHERE ELSE?

EXERCISE FOR THE STUDENT

The same, but with:

• p-p or q-q plot
• Cumulative distribution function
• Isolate important areas of the distribution
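For a first pass at this exercise, base R is enough. The sketch below compares a simulated sample (an assumption for illustration) against a fitted lognormal using a q-q plot and the empirical cumulative distribution function. The fit simply takes the mean and standard deviation of the logged data; other fitting methods would serve equally well.

```r
set.seed(1234)
x <- rlnorm(500, meanlog = log(10e3), sdlog = 0.5)

# Fit a lognormal from the mean and standard deviation on the log scale
fit_meanlog <- mean(log(x))
fit_sdlog   <- sd(log(x))

# q-q plot: sample quantiles against fitted lognormal quantiles
probs <- ppoints(length(x))
theoretical <- qlnorm(probs, meanlog = fit_meanlog, sdlog = fit_sdlog)
plot(theoretical, sort(x),
     xlab = "Fitted lognormal quantiles", ylab = "Sample quantiles")
abline(0, 1, lty = 2)  # points near this line support the candidate

# Empirical CDF with the fitted CDF overlaid
plot(ecdf(x), main = "Empirical vs. fitted CDF")
curve(plnorm(x, meanlog = fit_meanlog, sdlog = fit_sdlog),
      add = TRUE, col = "red")
```

To isolate an important area such as the right tail, the same plots can be restricted with `xlim`, or drawn for the largest observations only.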

BUT NOW …

Test the null itself!!

GRAPHICAL INFERENCE

Hadley Wickham, Dianne Cook, Heike Hofmann, and Andreas Buja

H/T -> Xan Gregg @xangregg

Graphical inference helps us answer the question “Is what we see really there?”

http://had.co.nz/stat645/graphical-inference.pdf

HOW IT WORKS

Visual test

1. Generate many (or 19) samples of the NULL
2. Add your actual data
3. Shuffle
4. Observe
5. Power may be increased by using more than one observer
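The steps above can be sketched without any special package (the nullabor package in the references automates this; the code below does not use it). Everything here is an assumption made for illustration: the "actual" data are simulated from a gamma distribution, the null is a lognormal fitted to them, and the names `make_lineup` and `true_position` are hypothetical.

```r
library(ggplot2)

set.seed(1234)
actual <- rgamma(200, shape = 2, rate = 1 / 5e3)  # stand-in for real data

make_lineup <- function(actual, n_null = 19) {
  n <- length(actual)
  # 1. Generate samples from the null: a lognormal fitted to the data
  nulls <- lapply(seq_len(n_null), function(i) {
    rlnorm(n, meanlog = mean(log(actual)), sdlog = sd(log(actual)))
  })
  # 2. Add the actual data, then 3. shuffle the panel positions
  panels <- c(nulls, list(actual))
  position <- sample(length(panels))
  tbl <- data.frame(
    panel = rep(position, each = n),
    x     = unlist(panels)
  )
  list(data = tbl, true_position = position[length(panels)])
}

lineup <- make_lineup(actual)

# 4. Observe: show the panels to someone who has not seen the data
lineup_plot <- ggplot(lineup$data, aes(x = x)) +
  geom_density() +
  facet_wrap(~ panel, ncol = 5)
print(lineup_plot)

# Reveal only after the observer has chosen a panel
lineup$true_position
```

With 19 null panels, an observer who picks the real data purely by chance does so with probability 1 in 20, which is where the 19 comes from.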

CAN YOU SPOT THE SAMPLE DATA?

HOW ABOUT NOW?

NOW?

A BIT EASIER

A BIT HARDER

THE STATISTICAL LINEUP

If I can pick my data out of a lineup, I may reject the null hypothesis.

SEEING MODELS

A “good” model is one which displays noise. We are most interested in seeing something which isn’t there.

MOVE ALONG, NOTHING TO SEE HERE

segment  adj.r.squared  sigma
1        0.6294916      1.236603
2        0.6291578      1.237214
3        0.6292489      1.236311
4        0.6296747      1.235696

NOTHING TO SEE?

RESIDUALS

MISSING VARIABLES

Let’s look at the ozone data from the mlbench package.

At first, we will only fit to las_wind_speed.

A simple model may tell us more than we think!

BASIC EDA

OUR APPROACH

• A very messy Poisson
• Fit a GLM with a subset of predictors
• Plot residuals against all predictors
• Look for pattern
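The approach can be sketched on simulated data, which keeps the example self-contained. It does not use the mlbench ozone data, and the variable names (`wind`, `temp`, `y`) and coefficients are invented for illustration. The response depends on two predictors, the first model uses only one, and the residuals are then plotted against both.

```r
set.seed(1234)
n <- 1000
wind <- runif(n, 0, 10)
temp <- runif(n, 40, 90)

# The true process uses both predictors
y <- rpois(n, lambda = exp(1 - 0.1 * wind + 0.03 * temp))
sim <- data.frame(y, wind, temp)

# Fit a Poisson GLM with a subset of predictors
fit_small <- glm(y ~ wind, family = poisson(), data = sim)
sim$resid_small <- residuals(fit_small, type = "deviance")

# Plot residuals against all predictors and look for pattern
op <- par(mfrow = c(1, 2))
plot(sim$wind, sim$resid_small,
     xlab = "wind (in model)", ylab = "Deviance residual")
plot(sim$temp, sim$resid_small,
     xlab = "temp (omitted)", ylab = "Deviance residual")
par(op)

# Adding the omitted predictor should remove the pattern
fit_full <- glm(y ~ wind + temp, family = poisson(), data = sim)
AIC(fit_small, fit_full)
```

The residuals look like noise against the predictor already in the model and trend clearly against the omitted one, which is the pattern to look for.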

MISSING VARIABLES

AUGMENT OUR MODEL

Let’s add LAX inversion temperature!

MISSING VARIABLES REDUX

TREES

• Simple trees are easy to visualize
• They’re also not too useful
• Ensemble models are tough to see

VARIABLE IMPORTANCE

PARTIAL PLOTS

CONCLUSION

THE ZEN MOUNTAIN

Numbers are not numbers, models are not models …

-Me

THANK YOU!

REFERENCES

http://dicook.github.io/nullabor/index.html

WHERE TO FIND THIS

This presentation may be found at: http://pirategrunt.com/soa_symposium_2018/#/

Code to produce the examples and slides: https://github.com/PirateGrunt/soa_symposium_2018

Understanding the Layers of Your Data

Session 42 – Visualization: A Picture Speaks a Thousand Words

September 2018 – Predictive Analytics Symposium

Good Graphics Get to the Point

Bad Graphics Do More Harm than Good

Identify Possible Solutions

More Bad Graphics

LinkedIn Body Language for Leaders

…. what?

Using a Layered Approach to Displaying Data

Guide: ggplot2 R package

ggplot2 is an implementation of the concept of the grammar of graphics

Basics of the grammar:
• Data
• Geometric objects (e.g. points, lines, bars)
• Aesthetic attributes (e.g. color, size, shape)

Additional components:
• Statistical transformations of data (e.g. count, mean)
• Coordinate system (generally assumed to be Cartesian)

The combination and layering of these components defines the grammar
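As a rough illustration of how those components map onto code, the sketch below spells out each one for a single plot of the `mpg` data that ships with ggplot2. In everyday use most of these are left at their defaults; writing them all out is a choice made here for exposition, and the choice of `class` and `drv` as the plotted variables is arbitrary.

```r
library(ggplot2)

layered_plot <- ggplot(
    data = mpg,                            # data
    mapping = aes(x = class, fill = drv)   # aesthetic attributes
  ) +
  geom_bar(stat = "count") +               # geometric object + statistical transformation
  coord_cartesian()                        # coordinate system (the default)

print(layered_plot)

# Swapping one component changes the picture without touching the others
layered_plot + coord_flip()
```

Because each component is a separate term in the sum, a plot can be refined one layer at a time.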


Sample Dataset ‘mpg’

Fuel economy data from 1999 and 2008 for 38 popular models of car

Variable      Description                     Examples
manufacturer  manufacturer name               Audi, Chevrolet, Nissan
model         model name                      A4, Corvette, Altima
displ         engine displacement, in liters  2.0, 4.2, 6.0
year          year of manufacture             1999 or 2008
cyl           number of cylinders             4, 6, 8
trans         type of transmission            auto, manual
drv           front-wheel, rear-wheel, 4wd    f, r, 4
cty           city miles per gallon           14, 16, 20
hwy           highway miles per gallon        15, 20, 27
fl            fuel type                       e: E85, d: diesel, r: regular, p: premium, c: CNG
class         type of car                     compact, midsize, SUV

Basic Comparisons – Density


Basic Comparisons – The Structure of Data Matters


library(ggplot2)  # provides ggplot() and the mpg dataset

## City mpg density (basic)
ggplot(data = mpg, aes(x = cty)) +
  geom_density()

## City mpg density (full prettied)
ggplot(data = mpg, aes(x = cty)) +
  geom_density(col = 'lightblue', fill = 'lightblue') +
  scale_y_continuous(labels = scales::percent) +
  ylab('% of data') +
  xlab('City MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

## Highway mpg density (basic)
ggplot(data = mpg, aes(x = hwy)) +
  geom_density()

## Highway mpg density (full prettied)
ggplot(data = mpg, aes(x = hwy)) +
  geom_density(col = 'lightblue', fill = 'lightblue') +
  scale_y_continuous(labels = scales::percent) +
  ylab('% of data') +
  xlab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))


library(dplyr)  # %>%
library(tidyr)  # gather()

## Create a new format for our data: one row per car and mpg type
plot_data <- mpg %>%
  gather(key = 'mpg_type', value = 'mpg', cty, hwy)

## Plot city and highway mpg under same plot controls (basic)
ggplot(plot_data, aes(x = mpg)) +
  geom_density() +
  facet_wrap(~ mpg_type, nrow = 2)

## Plot city and highway mpg under same plot controls (prettied)
ggplot(plot_data, aes(x = mpg)) +
  geom_density(col = 'lightblue', fill = 'lightblue') +
  facet_wrap(~ mpg_type, nrow = 2,
             labeller = as_labeller(c('cty' = 'City', 'hwy' = 'Highway'))) +
  scale_y_continuous(labels = scales::percent) +
  ylab('% of data') +
  xlab('MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

Scatterplots – More Than Just Dots


## Highway mpg as a function of city mpg (basic)
ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point()

## Highway mpg as a function of city mpg (prettied)
ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

## Highway mpg as a function of city mpg (basic)
## Add color based on class
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_point()

## Highway mpg as a function of city mpg (prettied)
## Add color based on class
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_point() +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

## Highway mpg as a function of city mpg (basic)
## Add a trend line
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE,
              linetype = 'dashed')

## Highway mpg as a function of city mpg (prettied)
## Add a trend line
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE,
              linetype = 'dashed') +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

## Highway mpg as a function of city mpg (basic)
## Add multiple trend lines
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(method = 'lm', se = FALSE)

## Highway mpg as a function of city mpg (prettied)
## Add multiple trend lines
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(method = 'lm', se = FALSE) +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

Bar Charts – Not So Boring After All


## Plot count of cars by manufacturer (basic)
ggplot(data = mpg, aes(x = manufacturer)) +
  geom_bar(stat = 'count')

## Plot count of cars by manufacturer (prettied)
ggplot(data = mpg, aes(x = manufacturer)) +
  geom_bar(stat = 'count') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 16))

## Plot count of cars by manufacturer (basic)
## Add transmission type as a "fill"
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge')

## Plot count of cars by manufacturer (prettied)
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 16))

## Plot count of cars by manufacturer (basic)
## Facet on no. of cylinders
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge') +
  facet_grid(cyl ~ .)

## Plot count of cars by manufacturer (prettied)
## Facet on no. of cylinders
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge') +
  facet_grid(cyl ~ .) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 16))

Conclusion: Layers Help Tell the Story


Coordinate system

Data
• Coordinates of where shot was taken
• Make or miss

Geometrics
• Bins of court coordinates
• Percentages within bins

Aesthetics
• Size of hexagons
• Color based on relative percentage

Statistical Transformations
• Count of shots, makes within bin

Thank you

Mike Hoyer, Actuary and Product Manager

Milliman IntelliScript

