getting into visualization of large biological data...

2 0 I M P E R A T I V E S O F I N F O R M A T I O N D E S I G NGetting Into Visualization of Large Biological Data Sets

Martin Krzywinski, Inanc Birol, Steven Jones, Marco Marra

BC Cancer AgencyCARE & R E SEARCHAn agency of the Provincial Health Services Authority

Never use hue to encode magnitude.Hue does not communicate relative change in values because we perceive hue categorically (blue, green, yellow, etc). Changes within one category have less per-ceptual impact than transitions between categories. For example, variations across the green/yellow boundary are perceived to be larger than variations across the same sized hue interval in other parts of the spectrum.

Use perceptual palettes.Selecting perceptually favorable colors is dif�-cult because most software does not support the required color spaces. Brewer palettes [BRE11] exist for the full range of colors to help us make useful choices. Qualitative palettes have no per-ceived order of importance. Sequential palettes are suitable for heat maps because they have a natural order and the perceived difference be-tween adjacent colors is constant. Twin hue di-verging palettes, are useful for two-sided quan-titative encodings, such as immuno�uorescence and copy number.

Crop scale to reveal fine structure in data.Biological data sets are typically high-resolution (changes at base pair level can meaningful), sparse (distances between changes are orders of magnitude greater than the affected areas) and connect distant regions by adjacency relation-ships (gene fusions and other rearrangements). It is dif�cult to take these properties into ac-count on a �xed linear scale, the kind used by traditional genome browsers. To mitigate this, crop and order axis segments arbitrarily and apply a scale adjustment to a segment or por-tion thereof.

Use encodings that are robust and comparable.In addition to being transparent and predictable, visu-alizations must be robust with respect to the data. Changes in the data set should be re�ected by propor-tionate changes in the visualization. Be wary of force-directed network layouts, which have low spatial auto-correlation. In general, these are neither sensitive nor speci�c to patterns of interest.

Do not exceed resolution of visual acuity.Our acuity is ~50 cycles/degree or about 1/200 (0.3 pt) at 10 inches. Ensure the reader can comfortably see detail by limiting resolution to no more than 50% of acuity. Where possible, elements that require visual separation should be at least 1 pt part.

Use no more than ~500 scale intervals.Ensure data elements are at least 1 pt on a two-column Nature �gure (6.22 in), 4 pixels on a 1920 horizontal resolution display, or 2 pixels on a typical LCD projector. These restrictions become challenges for large genomes.

Show variation with statistics.Data on large genomes must be downsampled. Depict variation with min/max plots and consider hiding it when it is within noise levels. Help the reader notice signi�cant outliers.

Do not draw small elements to scale.Map size of elements onto clearly legible symbols. Legibility and clarity are more important than precise positioning and sizing. Discretize sizes and positions to facilitate making meaningful comparisons.

Use the simplest encoding.Choose concise encodings over elaborate ones.

Help the reader judge accurately.Accuracy and speed in detecting differences in visual forms depends on how information is presented. We judge relative lengths more accurately than areas, particularly when elements are aligned and adjacent. Our judgment of area is poor because we use length as a proxy, which causes us to systematically underestimate.

Distinguish between exploration and communication.Use exploratory tools (e.g. genome browsers) to discover patterns and validate hypotheses. Avoid using screenshots from these applications for communication – they are typically too complex and cluttered with navigational elements to be an effective static �gure.

Encapsulate details.Including details not relevant to the core message of the �gure can create confusion. Encapsulation should be done to the same level of detail and to the simplest visual form. Duplication in labels should be avoided unless required to preserve semantic forms.

Be aware of the luminance effect.Color is a useful encoding – the eye can distinguish about 450 levels of gray, 150 hues, and 10-60 levels of saturation, depending on the color – but our ability to perceive differences varies with context. Adjacent tones with different luminance values can interfere with discrimination, in interaction known as the luminance effect.

Be aware of color blindness.In an audience of 8 men and 8 women, chances are 50% that at least one has some degree of color blindness. Use a palette that is color-blind safe. In the palette below the 15 colors appear as 5-color tone progressions to those with color blindness. Additional encodings can be achieved with symbols or line thickness.

Respect natural hierarchies.When the data set embodies a natural hierarchy, use an encoding that emphasizes it clearly and memorably. The use hierarchy in layout (e.g. tabular form) and encoding can signi�cantly improve a muddled �gure.

Use consistent alignment. Center on theme.Establish equivalence using consistent alignment. Awkward callouts can be avoided if elements are logically placed.

Well-designed �gures illustrate complex concepts and patterns that may be dif�cult to express concisely in words. Figures that are clear, concise and attractive are effective – they form a strong connection with the reader and communicate with immediacy. These qualities can be achieved with methods of graphic design, which are based on theories of how we perceive, interpret and organize visual information.

Create legible visualizations with a strong message. Make elements large enough to be resolved comfortably. Bin dense data to avoid sacri�cing clarity.

ENSURE LEGIBILITY AND FOCUS ON THE MESSAGEMatch the visual encoding to the hypothesis. Use encodings speci�c and sensitive to important patterns. Dense annotations should be independent of the core data in distinct visual layers.

USE EFFECTIVE VISUAL ENCODINGS TO ORGANIZE INFORMATION. USE EFFECTIVE DESIGN PRINCIPLES TO EMPHASIZE AND COMMUNICATE PATTERNS.

When drawing the position and size of densely packed genes, encode the gene’s size using a non-linear mapping. When the number of data values is large, such as in the OMIM gene track, hollow glyphs are effective. For even greater number of points, a density map is preferred.

Circos image of patterns in densely distributed breakpoint positions between chromosomes 1, 2, 9 and 17 are emphasized when scale is zoomed. (A) full genome (B) chrs 1, 2, 9, 17 (C) 31 regions on chrs 1, 2, 9, 17 totaling 223 kb [KRZ09].

Hive Plots are a robust and quantitative visual form of networks based on meaningful properties chosen by the user. Unlike in a conventional layout, the removal of a node from a network can be easily spotted in a hive plot [KRZ11].

Some Brewer palettes, such as the pink-yellow-green diverging palette, are color-blind safe. Use them for images where red/green discrimination is critical for comprehension.

Adapted from [NEL03].[WON10][GRE10]

[HEE10]

[STE07]

download posterv2 16 oct 2012

Complex information can be organized by consistent alignment. Notice how placing the gene fusion product in the center emphasizes the subject of the figure. Explaining the concept of gene fusion does not require the internal structure of the two genes, whose size and exon count/distribution should be removed to improve clarity. Original figure from [SES12].

You can count the objects on the left almost instantly, without conscious effort. This is called pre-attentive cognition. This example uses the concept of proximity to partition the dots into pre-attentive groupings. Elements can be effectively organized by reducing spatial variation, here achieved by symmetrical layout and two levels of spacing,

[BRE11] C Brewer (2011) Color Brewer. http://www.colorbrewer.org[GRE10] HE Grecco et al. (2010) In situ analysis of tyrosine phosphorylation networks by FLIM on cell arrays. Nat Methods 7: 467-472.[HEE10] J Heer et al. (2010) Crowdsourcing graphical perception: using mechanical turk to assess visualization design. Proceedings of the 28th international conference on Human factors in computing systems. Atlanta, Georgia, USA: ACM. pp. 203-212.[KRZ09] M Krzywinski et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res, vol. 19, pp. 1639-45[KRZ11] M Krzywinski et al. (2011) Hive plots—rational approach to visualizing networks, Brief Bioinform, Dec 9 2011.[NEL03] S Nelander et al. (2003) Prediction of cell type-specific gene modules: identification and initial characterization of a core set of smooth muscle-specific genes. Genome Res 13: 1838-1854.[SES12] S Seshagiri et al. (2012) Recurrent R-spondin fusions in colon cancer. Nature 488: 660-664.[STE07] SE Von Stetina et al. (2007) Cell-specific microarray profiling experiments reveal a comprehensive picture of gene expression in the C. elegans nervous system. Genome Biol 8: R135.[WON10] B Wong (2010) Points of view: Color coding. Nat Methods 7: 573.

Content of this poster is drawn from my book chapter Visualization Principles for Scientific Communication (Martin Krzywinski & Jonathan Corum) in the upcoming open access Cambridge Press book Visualizing biological data - a practical guide (Seán I. O'Donoghue, James B. Procter, Kate Patterson, eds.), a survey of best practices and unsolved problems in biological visualization. This book project was conceptualized and initiated at the Vizbi 2011 conference.

If you are interested in guidelines for data encoding and visualization in biology, see the Points of View column by Bang Wong in Nature Methods.

uniform HSV hue

luminance

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

150%200%

QUALITATIVE SEQUENTIAL DIVERGINGset1

set2

pastel2

dark2

blues

greens

reds

ylorbr

spectral

rdylbu

rdylgn

piyg

29

0

0 00

2911

X

Z Y

69

40

11

X

Z

Y

XYZ

694011

A B

C D

HOW MANY DOTS?

E F

Z Y X

Z Y X

11 40 69

29 2940

1169

X

Y

Z

69

40

11

11 29 29

694011

11 29 29XYZ

XYZ

11

40

11

29

29

69

1.5

2

2.5

log2

(erro

r)

NO MESSAGE MESSAGE IN ISOLATIONMESSAGE IN CONTEXT

GENE DISEASE GENE COMPLEX

CONNECTING GENE

SYSTEM GENE COMPLEX

GENES INCOMPLEX

CONNECTINGGENE

S1

S1

S2

S3

S4

A

AG1

G1

SY

ST

EM

G2 G3

G5 G6 G7

G8

G4

G2G3

G3

G8

G4

G5

G6G7

CC

D

D

B

B

B

F

F

E

ES4

S3

S3

S2

Consider whether showing the full data set is useful.The reader’s attention can be focused by increasing the salience of interesting patterns. Other complex data sets, such as networks, are shown more effectively when context is carefully edited or even removed.

12 54 82 29 25 22 67 61 23 79Aggregate data for focused theme.A strong visual message has no uncertainty in its interpretation. Focus on a single theme by aggregating unnecessary detail.

Show density maps and outliers.Establishing context is helpful when emergent patterns in the data provide a useful perspective on the message. When data sets are large, it is dif�cult to maintain detail in the context layer because the density of points can visually overwhelm the area of interest. In this case, consider showing only the outliers in the data set.

What is communicated? (A) The raw data imparts no clear message.(B). Binning indicates ranges, not individual values, are important. (C). Frequency distribution suggests that there is a shortage of medium-sized values. (D) Individual data points can be removed to emphasize trend and significance.

0-3031-6061-100 30 60

*

A

B

C D

30 60

2925232212 54

82796761

manuallayout

springembedded

hive plot

N1

ENTIRE NETWORK NODE N1 REMOVED

N1

N1

DEUTERANOPIAcommon (6%)

NORMAL VISIONrare (2%) very rare (<1%)

PROTANOPIA TRITANOPIA

15-COLOR PALETTE FOR COLOR BLINDNESS

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

000

255255

073146109182

073146182119

730

182109182

0109109182219

146219255255255

14614621936255

073209255255

00036109

R G B

12345

6789

10

1112131415

DEUTERANOPIA PROTANOPIANORMAL VISIONred-green palette

magenta-green palette

all genes

specific in situ result

selective in situ result

non-selective in situ result

specific in literature

non-specific in literature

NO HIERARCHY

specific

selective

non-selective

non-specific

gene

in litin situ

LAYOUT

HIERARCHICAL

specific

selective

non-selective

non-specific

gene

in litin situ

LAYOUT AND ENCODING

FULL ENCAPSULATION

7

N

DAF N

3 516

16

12

9

148

11

28

2 1 4

INS 1

AGE 1

AAP 1

AKT 1AKT 2

DAF 11

DAF 7

DAF 1

DAF 14

DAF 9

AGE 1

AAP 1

AKT 1AKT 2

DAF 16

DAF 2

DAF 28INS 1

DAF 16

DAF 8

DAF 12 DAF 3 DAF 5

INCONSISTENT AND INCOMPLETE ENCAPSULATION

DAF 4

6q22.3

REFERENCE

SAMPLE

TGCATCCTAACGTTAGTCAAGGCTGCCAAGGAGGCTGTGCAACATGCTCATCTCCTGGGATCGGCCCAAGGCCAGTTCTCCGCAG

RSPO3 PTPRK802kb

2 31

1 2 3 3 2 1

Reduce unnecessary variation.The reader does not know what is important in a �gure and will assume that any spatial or color variation is meaningful. The �gure’s variation should come solely from data or act to organize information.

0.250.500.75

11.5

2

line

sepa

ratio

n (p

t)

line thickness (pt)

resolvable? no barely comfortable generous

RESOLVING DATA TRACKS

RESOLVING COLOR DIFFERENCES

RESOLVING DETAIL

0.25 0.50 0.75 1 1.5 2

S. cerevisiae chrIV 1,531 kb · scale length 6.22 in, 158 mm, 448 pt · scale resolution 246 kb/in, 9.6 kb/mm, 3.4 kb/pt

widthpt kbsamples

2

1.5

1

0.75

0.5

0.25

225

300

450

600

900

1,800

6.8

5.1

3.4

2.6

1.7

0.85

Approaches to encoding min/avg/max values of downsampled data. In the top hi-low trace, the vertical bars are perceived as a separate layer and effectively show variance without obscuring trends in the average.

Visual acuity limits the perception of color and spatial differences. For standard reading distance, we can comfortably resolve objects separated by 1 pt. For larger viewing distances (e.g. this poster), this limit is proportionately higher. The standard unit of distance in print is 1 pt = 1/72 inch.

Rendering human chromosome 1 (249 Mb) with 450 divisions provides a resolution of 550 kb, which is larger than 98% of all human genes. To depict 50% of genes without magnification we would need to use 15 kb bins, which would have to be 1/16,600 of the figure (0.0004 in, 10 microns)!

chr 1

<10 10-30 30-50 50-100 100-200 >200size (kb)

RAD54LG>A rs121908690

RNASELC>Trs74315365

EPHB

2

SFPQ

TPM

3

PBX1

PAX7

RBM15

BCL9

PRCC

PRRX

1AB

L2LH

X4

CDC7

3

LCK

MYC

L1M

UTYH

TAL1

BCL1

0

CSF1

CSDE1

ARNT

RIT1

NTRK

1

TPR

PRG4CANCER

CENSUS

SNP

OMIM

50 100 150 200 Mb

ALL DATA SHOWN ONLY OUTLIERS SHOWN

getting into visualization of large biological data...

Documents