EMBL-EBI Visualization & Data mining

Upload: audra-octavia-pitts

Post on 26-Dec-2015


Page 1: EMBL-EBI Visualization & Data mining. EMBL-EBI Visualisation  The process of representing abstract data to aid in understanding the meaning of the data

EMBL-EBI

Visualization & Data mining

Visualisation

The process of representing abstract data to aid in understanding the meaning of the data.

Not to be confused with rendering data (drawing pictures)

Typically, though, we render data in such a way as to visualise the information within that data.

Introduction

Biological data comes from & is of interest to :
– Chemists : reaction mechanism, drug design
– Biologists : sequence, expression, homology, function
– Structural biologists : atomic structure, fold, classification, function
– Medicine : clinical effect
– Education
– Media

Presentation of diverse information to a diverse audience. Each has their own point of view (context).

Expert = scientist working within their own field of expertise
Non-expert = scientist using data/information outside their field
Novice = non-scientist

Web pages

These are notoriously badly designed, often resulting in the information on that site being unusable.
– The front page should load quickly
– The main point should appear on the first full screen
– Clutter : not logically laid out
– Too busy : cannot find the salient point
– 8% of men & 0.5% of women are colour blind
– Bad text/fonts

Too often it doesn’t work
– User will go somewhere else
– The latest whizz-bang stuff only works on the latest browsers
– Only works in one browser : they only tested on one
– Does not conform to standard HTML

Not just presentation of results

Google is a good design

Asking questions

Biological data is very complex
– Chemistry, biology, physics, statistics, medicine…
– Most users will be from a different field

Asking the right question is difficult.
– The user cannot use the correct terminology
– Too many things to query (2000 attributes in MSD)
– SQL : not suitable for most users

Interface too complex
– Too many check boxes, widgets etc.
– Trying to be too clever
– The “Go” button is buried somewhere

Result presentation

Results

Biological data is complex
– Chemistry, physics, biology, statistics, medicine…

Expert users want all the detail
– I.e. they want to use a specific method
– They want all the details
– They want (I hope) the statistical validity of the results

The non-expert wants the best practice answer returned within their own context.
– They want comparative analysis with other fields
– They want to know the results are valid

Query design

Suitable for text queries

Only one logic
– AND or OR

Predefined
– Easy to use
– Limited scope
– 2000 attributes -> 2000 check-boxes !

The simple text box design is very common

Query design

Graphical interface

Multiple logic
– AND/OR/NOT

Under user's control

Slower

Steep learning curve
– Some users just cannot get it

Intuitive once mastered

Pretty

Query design

HIS|SER:S/H>C2.0 HIS.ne2:S/S>C2.0 HIS.[n]/T>C2.0

Figurative 2D sketch for 3D query (active sites)
– Informative : presents meaning for the question
– Slower
– Less error prone

YAMGP (yet another molecular graphics program)

Many different programs are available

AstexViewer@MSD-EBI

Quanta

Rasmol

MolMol

Chime

O

Spock

Swiss-PDBviewer

Molscript

iMol

Pymol

Chimera

XtalView

Frodo

Bobscript

InsightII

Raster3D

WebLab-viewer

POVRay

Yasara

LigPlot

WebMol

Pymol

Grasp

Mage

Whatif

VMD

Frodo

Result visualisation

Multiple types of biological data
– Textual data
– 3D structure
– 2D chemical sketches
– 1D sequence
– Node linked
– General/derived data
– Web pages
– Time
– Errors/variance

Patented !

Visualisation : AstexViewer@MSD-EBI

Visualisation
– Lensing
– Linked views
– Brushing
– Picking
– Flying views
– Hyperbolic distortion
– Animation
– Solid rendering
– Depth cues
– Colour, lighting
– Highlighting
– Etc…

Structure/sequence/data

Visualisation : comparative analysis

Similarity/Difference
– Data superposition
– Attribute display : colour, size…

Correlation
– Attribute mapping : sequence colour by structure alignment

Analysis

Example

Animation

Time dependent display
– Reaction chemistry
– Visual clues
– Expression data

Shown as…
– Rotation
– Flash
– On/off
– Object synchronization
– Size, colour….

Sound
– NO : incredibly annoying

Animation Example

Multidimensional analysis

Comparative analysis on multiple data
– E.g. phi, psi, B-value, omega

1D & 2D easy
– 3D graphs are difficult to see.
– 4D requires 3D + iso-surfaces
– Higher – too busy

Use 2D + multiple properties
– SPOTFIRE is the most well known
– Use : X/Y/colour/size/shape…
– Interactive bracketing
– Example
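Interactive bracketing can be read as keeping only those points whose properties fall inside user-chosen ranges. The following is a minimal sketch of that idea, not how Spotfire or any EBI tool implements it; the residue records, their values and the `bracket` helper are all hypothetical.

```python
# Hypothetical residue records; the values are invented for illustration only.
residues = [
    {"name": "ALA", "phi": -60.0, "psi": -45.0, "bvalue": 12.0},
    {"name": "GLY", "phi": 80.0, "psi": 10.0, "bvalue": 35.0},
    {"name": "SER", "phi": -120.0, "psi": 130.0, "bvalue": 18.0},
    {"name": "HIS", "phi": -65.0, "psi": -40.0, "bvalue": 48.0},
]


def bracket(records, **ranges):
    """Keep records whose value for every named property lies in [low, high]."""
    kept = []
    for rec in records:
        if all(low <= rec[prop] <= high for prop, (low, high) in ranges.items()):
            kept.append(rec)
    return kept


# Bracket a helical region of phi/psi, then narrow further on B-value.
helical = bracket(residues, phi=(-90.0, -30.0), psi=(-70.0, -10.0))
helical_ordered = bracket(helical, bvalue=(0.0, 20.0))

print([r["name"] for r in helical])
print([r["name"] for r in helical_ordered])
```

Each additional bracket narrows the selection on one more property while the display itself stays two-dimensional.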

Visualization - Summary

Rendering data is not visualization

Not just the display of results

Huge array of non-specific techniques – an entire scientific field !

Data mining

“Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” (Hyperdictionary)

“True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.” (IBM)

Data mining & Data analysis

Traditional analysis is via “verification-driven analysis”
– Requires hypothesis of the desired information (target)
– Requires correct interpretation of proposed query

Discovery-driven data mining
– Finds data with common characteristics
– Results are ideal solutions to discovery
– Finds results without previous hypothesis
– Results have unbiased mean and variance

So what is Hypothesis driven data analysis ?

Define a target = hypothesis

Search for target

There are/are-not “hits”

Verify/negate hypothesis

Distribution is centred on target

“catalytic triad” : text string matching
Atomic coordinates : coordinate superposition
Mathematical graph : graph matching
HIS,ASP,SER : data hierarchy knowledge
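The define-target, search, count-hits loop can be sketched for two of the target types listed above. This is an illustrative sketch only: the site records, identifiers and both helper functions are hypothetical and do not correspond to any MSD interface.

```python
# Hypothetical annotated sites: (entry id, free-text description, residue names).
sites = [
    ("site1", "catalytic triad of a serine protease", {"HIS", "ASP", "SER"}),
    ("site2", "zinc binding site", {"HIS", "CYS"}),
    ("site3", "unannotated pocket", {"SER", "HIS", "ASP", "GLY"}),
]


def search_by_text(records, phrase):
    """Text string matching: hits are entries whose description contains the phrase."""
    return [sid for sid, text, _ in records if phrase.lower() in text.lower()]


def search_by_residues(records, target):
    """Residue-name matching: hits are entries that contain every target residue."""
    return [sid for sid, _, residues in records if target <= residues]


# The target is the hypothesis; the hits verify or negate it.
text_hits = search_by_text(sites, "catalytic triad")
residue_hits = search_by_residues(sites, {"HIS", "ASP", "SER"})

print(text_hits)
print(residue_hits)
```

Both searches can only return what the target describes, which is the sense in which the resulting distribution is centred on the target.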

Four types of data mining

Creation of predictive models : future data expectation

Link analysis : connections between data objects

Database segmentation : classification

Deviation detection : finding outliers.

IBM : white papers
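Of the four types, deviation detection is the simplest to illustrate. The sketch below flags values that lie far from the mean in units of standard deviation (a z-score test); this is one common technique chosen for illustration, not a method named in the slides, and the data and threshold are assumptions.

```python
from statistics import mean, stdev

# Hypothetical measurements (e.g. bond lengths); invented for illustration.
values = [1.52, 1.53, 1.51, 1.54, 1.52, 1.53, 1.52, 1.90]

Z_THRESHOLD = 2.0  # assumed cut-off; small samples cannot reach large z-scores


def find_outliers(data, threshold):
    """Return values whose distance from the mean exceeds threshold * stdev."""
    centre = mean(data)
    spread = stdev(data)
    if spread == 0:
        return []
    return [x for x in data if abs(x - centre) / spread > threshold]


outliers = find_outliers(values, Z_THRESHOLD)
print(outliers)
```

Note that the outlier itself inflates the mean and standard deviation, which is one reason the choice of data set matters so much for this kind of test.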

So what is this data mining ?

Given multiple sets of primary data (dependent variables)
– Characters, numbers, Function(numbers),….

Find anomalies
– Too many : numerical occurrence
– Data variation : derivatives
– Singularities…..

Correlations and clusters
– Within primary data
– With other data (independent variables)

Finds new things ! But not what it means !

Eg

Retail and financial industries are heavily into DM.

A well known US food supermarket chain found a correlation :
– Babies' nappies
– Beer
– 5pm on Friday

Wife rings husband, “get some nappies for the weekend”
Husband takes opportunity to buy some beer !

You won’t grant funding to test this hypothesis !

Self/Cross data mining

Most mining software looks for correlations between dependent variables.
– Rainfall, temperature, cloud-cover
– It rains when it is cloudy
– Free : http://www.cs.waikato.ac.nz/~ml/

Bioinformatics usually involves anomalies within data objects
– Sequence clusters (sequence fingerprints)
– Local coordinate clusters (active sites)
– Global coordinate clusters (folds)
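The rainfall/temperature/cloud-cover case can be sketched as a pairwise correlation scan over every pair of variables. The sketch uses the Pearson coefficient as an assumed measure, with invented observations; it does not reproduce the software at the URL above.

```python
from itertools import combinations
from math import sqrt

# Hypothetical daily observations; the numbers are invented for illustration.
observations = {
    "cloud_cover": [0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
    "rainfall": [0.0, 0.0, 1.0, 3.0, 6.0, 8.0],
    "temperature": [24.0, 22.0, 20.0, 18.0, 17.0, 15.0],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Scan every pair of variables without stating a hypothesis first.
correlations = {
    (a, b): pearson(observations[a], observations[b])
    for a, b in combinations(observations, 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 2))
```

The scan reports that cloud cover and rainfall move together, but says nothing about why; interpreting the result is left to the user.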

Data mining – not idiot proof

Date of birth and age will give 100 % correlation

Authors for structure submission will be correlated to authors on primary citation.

“Lysozyme” is the most common fold pattern

36 spellings of E. coli will mask results.

Requires representative sets
– Statistically valid ones too !

Signal/Noise ratio is a problem
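The spelling problem can be illustrated by counting organism names before and after a simple normalisation. This is a minimal sketch: the name variants, the abbreviation table and the `normalise` helper are hypothetical, and real curation needs far more than this.

```python
import re
from collections import Counter

# Hypothetical source-organism strings as they might appear across entries.
raw_names = [
    "E.Coli",
    "E. coli",
    "Escherichia coli",
    "ESCHERICHIA COLI",
    "e.coli",
    "Homo sapiens",
]

# Assumed abbreviation table; it would need extending for a real data set.
GENUS_ABBREVIATIONS = {"e.": "escherichia"}


def normalise(name):
    """Lower-case, space out 'e.coli' style names and expand the genus."""
    text = name.strip().lower()
    text = re.sub(r"\.(?=\S)", ". ", text)  # "e.coli" -> "e. coli"
    words = text.split()
    if not words:
        return ""
    words[0] = GENUS_ABBREVIATIONS.get(words[0], words[0])
    return " ".join(words)


before = Counter(raw_names)
after = Counter(normalise(n) for n in raw_names)

print(len(before), "distinct names before,", len(after), "after")
print(after.most_common())
```

Before normalisation every variant looks like a separate, rare organism, so any count-based signal for the real organism is split across them and masked.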

Discovery driven data mining of the PDB

Analysis of 3-dimensional coordinates

Defined common patterns of atomic interactions locally
– DB segmentation - active sites & common packing features
– Link analysis - similarity between different functional groups

Defined globally
– DB segmentation - common patterns of super-secondary structure
– Link analysis - common folds in diverse protein families
– Outlier detection - unique folds

Issues

Systematic “error” propagates as solution
– 300 lysozyme structures return as a strong solution

Results cannot be found below the noise level
– Need to characterise the noise level
– Need to improve signal/noise ratio (S/N) to see information

Target is not biologically defined
– It does not give you the biological answer
– Results should reproduce known biology
– Can give you new results not previously observed

Data selection

Cannot leave in 300 lysozyme structures !
– Select by sequence similarity at 70% exact alignment
– Different “phase space” to select data

Remove structures with resolution < 2.5 Å
Remove NMR (different statistics)
Remove pre-1982 etc.
Geometrical analysis criteria to check for outliers
– Using properties NOT target parameters of structure solution
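The removal rules can be sketched as a simple filter over entry records. This is a sketch under stated assumptions: the entries are invented, the field names are hypothetical, and "resolution < 2.5 Å" is read here as "poorer than 2.5 Å", i.e. a numerical value above 2.5. Removal of redundant sequences at 70% and the geometrical outlier check are not shown.

```python
# Hypothetical entry records; identifiers and values are invented for illustration.
entries = [
    {"id": "A", "method": "X-ray", "resolution": 1.8, "year": 1995},
    {"id": "B", "method": "NMR", "resolution": None, "year": 1998},
    {"id": "C", "method": "X-ray", "resolution": 3.1, "year": 2001},
    {"id": "D", "method": "X-ray", "resolution": 2.0, "year": 1979},
    {"id": "E", "method": "X-ray", "resolution": 2.2, "year": 2003},
]

MAX_RESOLUTION = 2.5  # Angstrom; assumes larger numbers mean poorer data
MIN_YEAR = 1982


def keep(entry):
    """Apply the three removal rules; True means the entry stays in the set."""
    if entry["method"] == "NMR":
        return False  # different statistics
    if entry["resolution"] is None or entry["resolution"] > MAX_RESOLUTION:
        return False  # resolution too poor (assumed reading of the rule)
    if entry["year"] < MIN_YEAR:
        return False  # pre-1982
    return True


selected = [e["id"] for e in entries if keep(e)]
print(selected)
```

Keeping the thresholds as named constants makes it easy to rerun the mining with a different selection and see how sensitive the results are to it.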

Local atomic interactions

Data
– Function(3D coordinates) = distance
– Atom names (independent variable)
– Residue names (independent variable)

Create 3D hash table of triplets of distances(*) between “points”
– This is the dependent variable
– Order = 3
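One way to picture the 3D hash table is to bin the three inter-point distances of every triplet and use the binned triple as the key. The sketch below is an assumption about the general idea, not the author's implementation: the bin width, the point labels and the coordinates are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

BIN_WIDTH = 0.5  # Angstrom; assumed bin size for discretising distances

# Hypothetical labelled points: (label, (x, y, z)); coordinates are invented.
points = [
    ("HIS.NE2", (0.0, 0.0, 0.0)),
    ("SER.OG", (3.1, 0.0, 0.0)),
    ("ASP.OD1", (0.0, 4.2, 0.0)),
    ("GLY.N", (10.0, 10.0, 10.0)),
]


def triplet_key(coords):
    """Binned, sorted side lengths of the triangle formed by three points."""
    sides = (dist(a, b) for a, b in combinations(coords, 2))
    return tuple(sorted(int(side // BIN_WIDTH) for side in sides))


def build_table(pts):
    """Map each binned distance triple to the labelled triplets that produce it."""
    table = defaultdict(list)
    for trio in combinations(pts, 3):
        labels = tuple(label for label, _ in trio)
        key = triplet_key([xyz for _, xyz in trio])
        table[key].append(labels)
    return table


table = build_table(points)
for key, members in sorted(table.items()):
    print(key, members)
```

When triplets from many structures are added to the same table, keys that collect many members are candidate recurring patterns; the atom and residue names stored alongside remain available as independent variables.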

Local atomic interactions

Merge triplets
– Any pair of N-fold interactions is an (N+1) interaction if they have (N-1) equivalence.
– Order = N
– Just keep going until no more (N+1) interactions are found.

Time = 8 seconds to find ~ 2000 interactions (Digital Alpha ES40)
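The merge rule can be sketched with sets of point labels: two N-point interactions that share N-1 points are joined into one (N+1)-point interaction, and the step repeats until nothing larger appears. This sketch takes the rule literally and does not check that the two unshared points are themselves close; the labels and helper names are hypothetical.

```python
from itertools import combinations


def merge_once(interactions):
    """Join every pair of N-point sets that share N-1 points into an (N+1)-point set."""
    merged = set()
    for a, b in combinations(interactions, 2):
        if len(a) == len(b) and len(a & b) == len(a) - 1:
            merged.add(a | b)
    return merged


def grow(triplets):
    """Repeat the merge step until no larger interactions are found."""
    levels = [set(triplets)]
    while True:
        larger = merge_once(levels[-1])
        if not larger:
            return levels
        levels.append(larger)


# Hypothetical triplets of point labels that survived the distance-hash step.
triplets = [
    frozenset(t)
    for t in (
        ("a", "b", "c"),
        ("a", "b", "d"),
        ("a", "c", "d"),
        ("b", "c", "d"),
        ("x", "y", "z"),
    )
]

levels = grow(triplets)
for order, level in enumerate(levels, start=3):
    print(order, sorted(sorted(s) for s in level))
```

The loop always terminates because each level contains strictly larger sets and no set can grow beyond the total number of points.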

Catalytic quartet

Electrostatic interaction

Ligands are found close by rather than associated with the residues

Iron binding site

Double disulphide

N-linked glycosylation binding site +

Spot the non-sugar

This glycosylation site is the same as the active site found in “1a53” – indole-3-glycerol phosphate synthase

Summary

Nearly all bioinformatics is based on hypothesis driven data analysis
– Data mining has lost its meaning within bioinformatics.

Discovery driven data analysis (true data mining) :
– Can find unknown dependencies, clusters, outliers
– Is based on statistical probability
– Returns distributions unbiased by previous ideas

Information technology may be better for genomes (1D)
– “A numerical measure of the uncertainty of an outcome”
– Information content of gene sequences can be defined by the normalized probability of finding “words” within that sequence
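The last point can be sketched by counting overlapping k-letter words in a sequence, normalising the counts to probabilities and summing them into a single uncertainty measure. The sketch assumes the quoted definition refers to Shannon's entropy; the word length, the example sequences and the function names are illustrative choices.

```python
from collections import Counter
from math import log2


def word_probabilities(sequence, k):
    """Normalised probability of each overlapping k-letter word in the sequence."""
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def information_content(sequence, k):
    """Shannon entropy, in bits, of the k-letter word distribution."""
    probs = word_probabilities(sequence, k)
    return -sum(p * log2(p) for p in probs.values())


# Invented sequences: one highly repetitive, one more varied.
repetitive = "ACACACACACAC"
varied = "ACGTTGCAAGCT"

print(round(information_content(repetitive, 2), 3))
print(round(information_content(varied, 2), 3))
```

A repetitive sequence uses few distinct words and scores low, while a varied one scores higher; nothing about the biology has to be assumed to compute either value.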