Post on 26-Dec-2015


EMBL-EBI

Visualization & Data mining

Visualisation

The process of representing abstract data to aid in understanding the meaning of the data.

Not to be confused with rendering data (drawing pictures).

Typically though, we render data in such a way as to visualize the information within that data.

Introduction

Biological data comes from, and is of interest to:
- Chemists: reaction mechanism, drug design
- Biologists: sequence, expression, homology, function
- Structural biologists: atomic structure, fold, classification, function
- Medicine: clinical effect
- Education
- Media

Presentation of diverse information to a diverse audience. Each has their own point of view (context).
- Expert = scientist working within their own field of expertise
- Non-expert = scientist using data/information outside their field
- Novice = non-scientist

Web pages

These are notoriously badly designed, often resulting in the information on that site being unusable.
- The front page should load quickly
- The main point should appear on the first full screen
- Clutter: not logically laid out
- Too busy: cannot find the salient point
- 8% of men and 0.5% of women are colour blind
- Bad text/fonts

Too often it doesn’t work
- The user will go somewhere else
- The latest whizz-bang stuff only works on the latest browsers
- Only works in one browser, because it was only tested on one
- Does not conform to standard HTML

Not just presentation of results

Google is a good design

Asking questions

Biological data is very complex
- Chemistry, biology, physics, statistics, medicine...
- Most users will be from a different field

Asking the right question is difficult
- The user cannot use the correct terminology
- Too many things to query (2000 attributes in MSD)
- SQL: not suitable for most users

Interface too complex
- Too many check boxes, widgets, etc.
- Trying to be too clever
- The “Go” button is buried somewhere

Result presentation

Results

Biological data is complex
- Chemistry, physics, biology, statistics, medicine…

Expert users want all the detail
- I.e. they want to use a specific method
- They want all the details
- They want (I hope) the statistical validity of the results

The non-expert wants the best practice answer returned within their own context
- They want comparative analysis with other fields
- They want to know the results are valid

Query design

- Suitable for text queries
- Only one logic: AND or OR
- Predefined
- Easy to use
- Limited scope: 2000 attributes -> 2000 check-boxes!

The simple text box design is very common

Query design

Graphical interface
- Multiple logic: AND/OR/NOT
- Under the user's control
- Slower
- Steep learning curve: some users just cannot get it
- Intuitive once mastered
- Pretty
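As a rough illustration of what "multiple logic" means behind such an interface, the sketch below composes AND/OR/NOT conditions over a few records. The records, attribute names and helper functions (`attr_equals`, `all_of`, `any_of`, `negate`) are hypothetical and are not taken from any EBI service.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]
Predicate = Callable[[Record], bool]


def attr_equals(name: str, value: Optional[str]) -> Predicate:
    """Condition: the record's attribute `name` equals `value`."""
    return lambda rec: rec.get(name) == value


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND of several conditions."""
    return lambda rec: all(p(rec) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR of several conditions."""
    return lambda rec: any(p(rec) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of one condition."""
    return lambda rec: not pred(rec)


# Hypothetical toy records, not real database entries.
records: List[Record] = [
    {"id": "A", "method": "X-ray", "ligand": "HEM"},
    {"id": "B", "method": "NMR", "ligand": "HEM"},
    {"id": "C", "method": "X-ray", "ligand": None},
]

# "ligand is HEM AND NOT method is NMR"
query = all_of(attr_equals("ligand", "HEM"), negate(attr_equals("method", "NMR")))
hits = [rec["id"] for rec in records if query(rec)]
print(hits)
```

A graphical builder would let the user assemble the same tree of conditions by drawing it rather than typing it.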

Query design

HIS|SER:S/H>C2.0 HIS.ne2:S/S>C2.0 HIS.

[n]/T>C2.0

Figurative 2D sketch for 3D query (active sites)
- Informative: presents meaning for the question
- Slower
- Less error prone

YAMGP (yet another molecular graphics program)

Many different programs are available

AstexViewer@MSD-EBI

Quanta

Rasmol

MolMol

Chime

O

Spock

Swiss-PDBviewer

Molscript

iMol

Pymol

Chimera

XtalView

Bobscript

InsightII

Raster3D

WebLab-viewer

POVRay

Yasara

LigPlot

WebMol

Grasp

Mage

Whatif

VMD

Frodo

Result visualisation

Multiple types of biological data
- Textual data
- 3D structure
- 2D chemical sketches
- 1D sequence
- Node linked
- General/derived data
- Web pages
- Time
- Errors/Variance

Patented!

Visualisation : AstexViewer@MSD-EBI

Visualisation
- Lensing
- Linked views
- Brushing
- Picking
- Flying views
- Hyperbolic distortion
- Animation
- Solid rendering
- Depth cues
- Colour, lighting
- Highlighting
- Etc.

Structure/sequence/data

Visualisation : comparative analysis

Similarity/Difference
- Data superposition
- Attribute display: colour, size…

Correlation
- Attribute mapping: sequence colour by structure alignment

Analysis Example

Animation

Time dependent display
- Reaction chemistry
- Visual clues
- Expression data

Shown as…
- Rotation
- Flash
- On/off
- Object
- Synchronization
- Size, colour…

Sound: NO, incredibly annoying

Animation Example

Multidimensional analysis

Comparative analysis on multiple data
- E.g. phi, psi, B-value, omega

1D & 2D easy
- 3D graphs are difficult to see
- 4D requires 3D + iso-surfaces
- Higher: too busy

Use 2D + multiple properties
- SPOTFIRE is the most well known
- Use: X/Y/colour/size/shape…
- Interactive bracketing

Example

Visualization - Summary

Rendering data is not visualization

Not just the display of results

Huge array of non-specific techniques, and an entire scientific field!

Data mining

“Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” (Hyperdictionary)

“True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.” (IBM)

Data mining & Data analysis

Traditional analysis is via “verification-driven analysis”
- Requires hypothesis of the desired information (target)
- Requires correct interpretation of proposed query

Discovery-driven data mining
- Finds data with common characteristics
- Results are ideal solutions to discovery
- Finds results without previous hypothesis
- Results have unbiased mean and variance

So what is hypothesis-driven data analysis?

- Define a target = hypothesis
- Search for target
- There are/are-not “hits”
- Verify/negate hypothesis

Distribution is centred on target

- “catalytic triad”: text string matching
- Atomic coordinates: coordinate superposition
- Mathematical graph: graph matching
- HIS, ASP, SER: data hierarchy knowledge

Four types of data mining

Creation of predictive models : future data expectation

Link analysis : connections between data objects

Database segmentation : classification

Deviation detection : finding outliers

IBM : white papers
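Of the four types, deviation detection is the easiest to sketch in a few lines. The example below flags outliers with a modified z-score built on the median absolute deviation. That choice of method, the 3.5 cut-off and the sample values are assumptions made for illustration; the slides do not say which technique is meant.

```python
from statistics import median
from typing import List, Sequence


def mad_outliers(values: Sequence[float], threshold: float = 3.5) -> List[float]:
    """Return the values whose modified z-score exceeds `threshold`.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: this score cannot rank anything.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical measurements with one obvious deviation.
sample = [1.0, 1.1, 0.9, 1.2, 1.0, 5.0]
print(mad_outliers(sample))
```

The median-based score is used here because a single extreme value inflates the ordinary mean and standard deviation, which can hide the very outlier being sought.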

So what is this data mining ?

Given multiple sets of primary data (dependent variables)
- Characters, numbers, Function(numbers), …

Find anomalies
- Too many: numerical occurrence
- Data variation: derivatives
- Singularities…

Correlations and clusters
- Within primary data
- With other data (independent variables)

Finds new things! But not what it means!

Eg

Retail and financial industries are heavily into DM.

A well known US food supermarket chain found a correlation:
- Babies' nappies
- Beer
- 5pm on Friday

Wife rings husband: “get some nappies for the weekend”. Husband takes the opportunity to buy some beer!

You won’t grant funding to test this hypothesis!
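A correlation of this kind is usually reported as an association rule ("nappies -> beer") with support, confidence and lift. The sketch below computes those three numbers for a handful of invented baskets; the data and the function name `rule_stats` are hypothetical and only show the arithmetic, not the supermarket's actual analysis.

```python
from typing import List, Set, Tuple


def rule_stats(baskets: List[Set[str]], antecedent: str, consequent: str) -> Tuple[float, float, float]:
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)

    support = n_both / n                             # P(A and C)
    confidence = n_both / n_a if n_a else 0.0        # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0    # P(C | A) / P(C)
    return support, confidence, lift


# Invented Friday-evening baskets, for illustration only.
baskets = [
    {"nappies", "beer"},
    {"nappies", "beer", "bread"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"milk"},
]

support, confidence, lift = rule_stats(baskets, "nappies", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 says the two items co-occur more often than independent purchases would predict. It says nothing about why, which is the slide's point.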

Self/Cross data mining

Most mining software looks for correlations between dependent variables
- Rainfall, temperature, cloud-cover
- It rains when it is cloudy
- Free: http://www.cs.waikato.ac.nz/~ml/

Bioinformatics usually involves anomalies within data objects
- Sequence clusters (sequence fingerprints)
- Local coordinate clusters (active sites)
- Global coordinate clusters (folds)

Data mining – not idiot proof

- Date of birth and age will give 100% correlation
- Authors for structure submission will be correlated to authors on primary citation
- “Lysozyme” is the most common fold pattern
- 36 spellings of E. coli will mask results
- Requires representative sets
  - Statistically valid ones too!

Signal/noise ratio is a problem
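The spelling problem can be reduced before mining by mapping free-text names onto one canonical form. The sketch below is a minimal version of that idea; the alias table and the sample spellings are invented for illustration, and a real curation step would need a far larger, reviewed vocabulary.

```python
import re
from collections import Counter

# Hypothetical alias table: cleaned spelling -> canonical name.
ALIASES = {
    "e coli": "escherichia coli",
    "ecoli": "escherichia coli",
    "escherichia coli": "escherichia coli",
}


def normalise_organism(name: str) -> str:
    """Lowercase, replace punctuation and spaces with single spaces, apply aliases."""
    cleaned = re.sub(r"[^a-z]+", " ", name.lower()).strip()
    return ALIASES.get(cleaned, cleaned)


# Invented examples of the same organism as typed by different depositors.
raw = ["E.Coli", "E. coli", "e coli", "Escherichia coli",
       "ESCHERICHIA COLI", "E.coli ", "Ecoli"]

before = Counter(raw)
after = Counter(normalise_organism(name) for name in raw)
print(len(before), "distinct spellings before,", len(after), "after")
```

Without this step each spelling counts as a separate value, so any signal attached to the organism is split across many small groups.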

Discovery driven data mining of the PDB

Analysis of 3-dimensional coordinates

Defined common patterns of atomic interactions locally
- DB segmentation: active sites & common packing features
- Link analysis: similarity between different functional groups

Defined globally
- DB segmentation: common patterns of super-secondary structure
- Link analysis: common folds in diverse protein families
- Outlier detection: unique folds

Issues

Systematic “error” propagates as solution
- 300 lysozyme structures return as a strong solution

Results cannot be found below the noise level
- Need to characterise the noise level
- Need to improve signal/noise ratio (S/N) to see information

Target is not biologically defined
- It does not give you the biological answer
- Results should reproduce known biology
- Can give you new results not previously observed

Data selection

Cannot leave in 300 lysozyme structures!
- Select by sequence similarity at 70% exact alignment

Different “phase space” to select data
- Remove structures with resolution worse than 2.5 Å
- Remove NMR (different statistics)
- Remove pre-1982, etc.
- Geometrical analysis criteria to check for outliers
  - Using properties, NOT target parameters, of structure solution
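The selection rules above could be applied roughly as follows. This is a sketch under stated assumptions: the `Entry` fields, the sample entries and the precomputed `cluster` id (standing in for grouping at 70% sequence identity) are all hypothetical, and the resolution rule is read as "keep structures at 2.5 Å or better".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    code: str                    # hypothetical identifier, not a real PDB code
    method: str                  # "X-ray" or "NMR"
    resolution: Optional[float]  # in angstroms; None when not applicable
    year: int                    # deposition year
    cluster: int                 # assumed precomputed 70% sequence cluster id


def select_representatives(entries: List[Entry],
                           max_resolution: float = 2.5,
                           min_year: int = 1982) -> List[Entry]:
    """Filter entries, then keep the best-resolution entry of each cluster."""
    best: Dict[int, Entry] = {}
    for e in entries:
        if e.method != "X-ray":
            continue  # remove NMR (different statistics)
        if e.resolution is None or e.resolution > max_resolution:
            continue  # remove structures worse than the resolution cut-off
        if e.year < min_year:
            continue  # remove pre-1982 structures
        current = best.get(e.cluster)
        if current is None or e.resolution < current.resolution:
            best[e.cluster] = e
    return sorted(best.values(), key=lambda e: e.code)


entries = [
    Entry("lyso_a", "X-ray", 1.8, 1995, 1),
    Entry("lyso_b", "X-ray", 1.5, 1999, 1),
    Entry("lyso_c", "X-ray", 3.0, 1990, 1),
    Entry("nmr_a", "NMR", None, 1998, 2),
    Entry("old_a", "X-ray", 2.0, 1979, 3),
    Entry("prot_a", "X-ray", 2.2, 2001, 4),
]

selected = [e.code for e in select_representatives(entries)]
print(selected)
```

Keeping one entry per cluster is what stops the many copies of a popular protein from dominating the result.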

Local atomic interactions

Data
- Function(3D coordinates) = distance
- Atom names (independent variable)
- Residue names (independent variable)

Create 3D hash table of triplets of distances(*) between “points”
- This is the dependent variable
- Order = 3
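One way to read the hash table step is: for every triplet of points, take the three pairwise distances, sort them, bin them, and use the three bin numbers as a key. The sketch below follows that reading. The bin width, the distance cut-off and the function name `triplet_hash` are assumptions, since the slides do not give the actual parameters or data structure.

```python
from collections import defaultdict
from itertools import combinations
from math import dist
from typing import DefaultDict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Key = Tuple[int, int, int]


def triplet_hash(points: Sequence[Point],
                 bin_width: float = 0.5,
                 max_distance: float = 6.0) -> DefaultDict[Key, List[Tuple[int, int, int]]]:
    """Index every local triplet of points by its three binned distances."""
    table: DefaultDict[Key, List[Tuple[int, int, int]]] = defaultdict(list)
    for (i, p), (j, q), (k, r) in combinations(enumerate(points), 3):
        d = sorted((dist(p, q), dist(p, r), dist(q, r)))
        if d[-1] > max_distance:
            continue  # not a local interaction
        key = (int(d[0] // bin_width), int(d[1] // bin_width), int(d[2] // bin_width))
        table[key].append((i, j, k))
    return table


# Two congruent 3-4-5 triangles placed far apart (hypothetical coordinates).
points = [
    (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0),
    (10.0, 10.0, 10.0), (13.0, 10.0, 10.0), (10.0, 14.0, 10.0),
]

table = triplet_hash(points)
for key, members in table.items():
    print(key, members)
```

Triplets with the same geometry land under the same key wherever they sit in space, so frequently occupied keys are candidate common interactions. Distances close to a bin edge can fall into neighbouring bins, which a fuller implementation would need to handle.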

Local atomic interactions

Merge triplets
- Any pair of N-fold interactions is an (N+1)-fold interaction if they have (N-1) equivalences
- Order = N

Just keep going until no more (N+1)-fold interactions are found.

Time = 8 seconds to find ~2000 interactions (Digital Alpha ES40)
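The merging rule can be sketched with sets: two N-member interactions that share N-1 members are combined into one (N+1)-member interaction, and the process repeats until nothing new appears. This is a literal reading of the slide; the function name `grow_interactions` and the integer point labels are hypothetical, and the original program presumably also checked the geometry of each merged set.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set, Tuple


def grow_interactions(triplets: Iterable[Tuple[int, int, int]]) -> Set[FrozenSet[int]]:
    """Repeatedly merge N-fold interactions sharing N-1 members into (N+1)-fold ones."""
    current: Set[FrozenSet[int]] = {frozenset(t) for t in triplets}
    found: Set[FrozenSet[int]] = set(current)
    while current:
        n = len(next(iter(current)))  # every set at this level has size n
        merged: Set[FrozenSet[int]] = set()
        for a, b in combinations(current, 2):
            if len(a & b) == n - 1:
                merged.add(a | b)     # union has n + 1 members
        found |= merged
        current = merged              # stop when no (N+1)-fold sets appear
    return found


# Four triplets that all overlap pairwise, plus one isolated triplet.
triplets = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (7, 8, 9)]
found = grow_interactions(triplets)
largest = max(found, key=len)
print(sorted(largest))
```

The loop must terminate because each round produces strictly larger sets and no set can be larger than the number of points.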

Catalytic quartet

Electrostatic interaction

Ligands are found close by rather than associated with the residues

Iron binding site

Double disulphide

N-linked glycosylation binding site +

Spot the non-sugar

This glycosylation site is the same as the active site found in “1a53”, indole-3-glycerol phosphate synthase

Summary

Nearly all bioinformatics is based on hypothesis-driven data analysis
- Data mining has lost its meaning within bioinformatics

Discovery-driven data analysis (true data mining):
- Can find unknown dependencies, clusters, outliers
- Is based on statistical probability
- Returns distributions unbiased by previous ideas

Information theory may be better for genomes (1D)
- “A numerical measure of the uncertainty of an outcome”
- Information content of gene sequences can be defined by the normalized probability of finding “words” within that sequence
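A common way to turn "normalized probability of words" into a number is the Shannon entropy of the word frequencies. The sketch below uses overlapping words of fixed length k; that choice, the function name `word_entropy` and the toy sequences are assumptions, as the slides do not define the measure further.

```python
from collections import Counter
from math import log2


def word_entropy(sequence: str, k: int = 3) -> float:
    """Shannon entropy (bits) of the overlapping k-letter words in a sequence."""
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    # Each word's normalized probability is count / total.
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Toy sequences, for illustration only.
print(word_entropy("AAAAAAAA", k=3))    # a single repeated word
print(word_entropy("ACGT", k=1))        # four equally likely letters
print(word_entropy("ACGTTGCAAC", k=2))
```

Low entropy means the sequence is dominated by a few words; high entropy means the words are spread evenly, so the outcome of reading the next word is more uncertain.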