Exploratory Data Analysis: Visualizing your data
OHSU Informatics
Shannon McWeeney, PhD
14th January 2016
Exploratory Data Analysis (EDA)
1st step in a 2‐step process
Main Objectives
• ASSESS Assumptions
• SUPPORT Selection
• PROVIDE Basis
EDA Features
• Examines distributions + relationships
• Utilizes visualization + numerical summaries
Context is Key
Everything is done in the framework of the analysis plan!
Starting Point
Data File Assessment:
• # variables
• # subjects
• Range
• % Missing
Assessing the File
R Commands:
summary()
dim()
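A minimal sketch of this first pass over a data file, assuming a small made-up data frame in place of a real study file (the column names and values are hypothetical). It covers the starting-point checks: number of subjects and variables, ranges, and percent missing.

```r
# Hypothetical toy data standing in for a real study file
dat <- data.frame(
  age    = c(34, 51, NA, 47, 62),
  sex    = factor(c("F", "M", "F", NA, "M")),
  weight = c(61.2, 80.5, 72.0, 68.3, NA)
)

dim(dat)      # number of subjects (rows) and variables (columns)
summary(dat)  # per-variable quartiles, factor counts, and NA counts

# Percent missing per variable
pct_missing <- colMeans(is.na(dat)) * 100
pct_missing

# Range of each numeric variable, ignoring missing values
num_cols <- sapply(dat, is.numeric)
ranges <- sapply(dat[num_cols], range, na.rm = TRUE)
ranges
```

Note that R is case-sensitive, so the functions are called as summary() and dim().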
OHSU Resources
• R Boot‐camp: Created by Dr. Ted Laderas, Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology
https://www.coursesites.com/s/_Rbootcamp (Coursesites registration required)
Benefits of a Data Dictionary*
• Improved data quality
• Improved trust in data integrity
• Improved documentation and control
• Reduced data redundancy
• Reuse of data
• Consistency in data use
• Easier data analysis
• Improved decision making based on better data
• Simpler programming
• Enforcement of standards
*From AHIMA.org
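One way a data dictionary can support data quality and consistency is by checking a data file against it. The sketch below is only an illustration under stated assumptions: the dictionary layout, the variable names, and the allowed ranges are all hypothetical and do not come from AHIMA or any real dictionary.

```r
# A hypothetical data dictionary: one row per variable in the data file
dictionary <- data.frame(
  variable = c("age", "sex", "weight_kg"),
  type     = c("numeric", "factor", "numeric"),
  min      = c(0, NA, 20),
  max      = c(120, NA, 300),
  description = c("Age at enrollment (years)", "Sex (F/M)", "Body weight (kg)"),
  stringsAsFactors = FALSE
)

# Toy data to check against the dictionary (values are made up)
dat <- data.frame(
  age       = c(34, 51, 147),          # 147 is outside the documented range
  sex       = factor(c("F", "M", "F")),
  weight_kg = c(61.2, 80.5, 72.0)
)

# Variables present in one place but not the other
setdiff(names(dat), dictionary$variable)
setdiff(dictionary$variable, names(dat))

# Count values outside the documented range for each numeric variable
num <- dictionary[dictionary$type == "numeric", ]
out_of_range <- sapply(seq_len(nrow(num)), function(i) {
  x <- dat[[num$variable[i]]]
  sum(x < num$min[i] | x > num$max[i], na.rm = TRUE)
})
names(out_of_range) <- num$variable
out_of_range
```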
First Example
Source: AHIMA.ORG
2nd Example
https://tcga‐data.nci.nih.gov/docs/dictionary/
Displaying Data: GRAPHICS
• Provide visual information
• Examine relationships; distribution
Assessing: Relationships
[Figure: scatter plot of dat$Germinal.center.B.cell.signature (x-axis) versus dat$Lymph.node.signature (y-axis)]
R Commands:
plot()
cor()
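A short sketch of how these two commands can be used together, assuming simulated values in place of the signature variables from the slides (the column names signature_a and signature_b are hypothetical).

```r
# Simulated values standing in for two gene-expression signatures (an assumption)
set.seed(1)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100, sd = 0.8)
dat <- data.frame(signature_a = x, signature_b = y)

# Scatter plot of one variable against the other
plot(dat$signature_a, dat$signature_b,
     xlab = "signature_a", ylab = "signature_b")

# Pearson correlation (the default) and rank-based Spearman correlation
r_pearson  <- cor(dat$signature_a, dat$signature_b)
r_spearman <- cor(dat$signature_a, dat$signature_b, method = "spearman")

# Scatter plot matrix and correlation matrix for all numeric columns
pairs(dat)
cor(dat)
```

Looking at the plot alongside the coefficient helps, since a single number can hide outliers or a nonlinear pattern.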
[Figure: scatter plot matrix of the variables in dat, with panels including Subgroup, IPI.Group, and BMP6]
Assessing: Distributions
R Commands:
hist() (also histogram() in the lattice library)
boxplot()
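A minimal sketch of these commands on simulated data; the variable names, group labels, and values are hypothetical and only stand in for a real follow-up variable.

```r
# Simulated follow-up times and group labels (hypothetical, for illustration only)
set.seed(42)
dat <- data.frame(
  follow_up_years = rexp(150, rate = 0.25),
  subgroup = factor(sample(c("A", "B", "C"), 150, replace = TRUE))
)

# Histogram of a single numeric variable
h <- hist(dat$follow_up_years, breaks = 20,
          main = "Follow-up time", xlab = "Years")

# Side-by-side box plots of the same variable split by group
b <- boxplot(follow_up_years ~ subgroup, data = dat,
             xlab = "Subgroup", ylab = "Follow-up (years)")

# Numerical summaries to read alongside the plots
tapply(dat$follow_up_years, dat$subgroup, summary)
```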
[Figure: box plots of follow-up years (0 to 20) for the ABC, GCB, and Type III subgroups]
[Figure: histograms of dat$Follow.up..years (0 to 20) as Percent of Total, in separate Alive and Dead panels]
Displaying Data: TABLES
• Layout
• “Stand alone”
• Comparison of interest/focus
Sample Inspection
[Figure: heatmap of samples annotated by RIN, Mating (15156x3252, 16441x8005), Inf_Status (M, W), Timepoint (D2, D4, D7, D12, D21, D28), Lab (G, L), and Tissue (Br, Sp); color scale from −2 to 3]
R Commands:
heatmap()
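A rough sketch of sample inspection with heatmap(), assuming a simulated matrix with features in rows and samples in columns. The matrix, the group shift, and the annotation colors are all made up for illustration; only the tissue labels Br and Sp echo the figure legend.

```r
# Simulated expression-like matrix: 20 features x 8 samples (hypothetical data)
set.seed(7)
m <- matrix(rnorm(160), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("s", 1:8)))
# Shift half of the features in the last four samples so two sample groups exist
m[1:10, 5:8] <- m[1:10, 5:8] + 2

# Hypothetical sample annotation, shown as a colored bar above the columns
tissue <- rep(c("Br", "Sp"), each = 4)
col_bar <- ifelse(tissue == "Br", "steelblue", "orange")

# heatmap() clusters rows and columns; scale = "row" standardizes each feature
hm <- heatmap(m, ColSideColors = col_bar, scale = "row")

# Sample-to-sample correlation is another quick way to spot outlying samples
sample_cor <- cor(m)
```

Annotation bars such as ColSideColors make it easier to see whether samples cluster by a variable of interest or by a technical factor such as lab.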
Sample Inspection
[Figure: heatmap of samples annotated by RIN, Mating (15156x3252, 16441x8005), Inf_Status (M, W), Timepoint (D2, D4, D7, D12, D21, D28), Lab (G, L), and Tissue (Sp); color scale from −6 to 6]
Display data accurately and clearly
Good and bad data visualization
Bad Data Viz
• Not informative
• Data is obscured (Tufte’s “Chart junk”*)
• Pie charts (3d!!)
• Issues of scale
*Tufte, E. R. The Visual Display of Quantitative Information
Graphical Proficiency
• WHAT IS THE STORY?
• WHAT DO YOU NEED TO KNOW TO INTERPRET IT?
Interactive Visualization
Examples & Tools you can use
INTERACTIVE GRAPHICS
GAPMINDER.ORG
Interactive Data: Path Models
Google Charts
https://developers.google.com/chart/?csw=1
Shiny (RStudio)
http://shiny.rstudio.com/
Data-Driven Documents (D3.js)
http://d3js.org
“If you don’t think you have a quality problem with your data, you haven’t looked at it”
Every data set has quirks.
5 Stages of Data Grief
• Denial
• Anger
• Bargaining
• Depression
• Acceptance (+ Hope!)
Visual Points to Remember
• Software shouldn’t dictate the visual
• Tell a story
• Follow best practices (be mindful)