
ANALYSIS AND VISUALIZATION OF BIOLOGICAL DATA

Balaji Rajashekar, Ph.D., Researcher in BIIT group, Room 314

Bioinformatics, Algorithmics and Data Mining Group

Wednesday, September 18, 13


TOPICS

1. Image analysis

2. Clustering (numeric and sequence) data

3. Visualizing whole genome


1. IMAGE ANALYSIS

• Problem: score each spot/image.

• Result: develop software that scores the spots automatically.

• Verification: compare the software's scores with manual scoring of the images by an expert (pathologist).

• Example data: tissue microarrays, http://fimm.webmicroscope.net/MainCollection/cyclin
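The verification step could be quantified with an agreement statistic. The slides do not say which measure was used; the sketch below assumes Cohen's kappa, and the function name and the example scores are hypothetical.

```python
from collections import Counter


def cohen_kappa(scores_a, scores_b):
    """Cohen's kappa for two equally long lists of categorical scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(scores_a)
    # Observed agreement: fraction of spots where both raters gave the same score.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over labels.
    freq_a = Counter(scores_a)
    freq_b = Counter(scores_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores (0-3) for eight spots; not data from the slides.
software = [0, 1, 2, 3, 1, 2, 0, 3]
pathologist = [0, 1, 2, 2, 1, 2, 0, 3]
kappa = cohen_kappa(software, pathologist)
```

A kappa close to 1 would indicate that the software agrees with the expert well beyond chance; a value near 0 would indicate agreement no better than chance.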


1. IMAGE ANALYSIS - DATA AND RESULTS


2. CLUSTERING (NUMERIC AND SEQUENCE DATA)

• a. Numerical vectors

• Task: generate significant clusters.

• b. Sequence data (short words of 12 characters)

• Task: generate clusters at low similarity (~40%).

• Challenges: the method must work on large data (millions of items), so it has to be fast.
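As an illustration of the sequence task, here is a minimal sketch that assumes a greedy, representative-based clustering with positional identity as the similarity measure. The slides do not name a method, so the algorithm, the function names, and the example words are all assumptions.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length words match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(words, threshold=0.4):
    """Put each word into the first cluster whose representative it matches
    at or above `threshold`; otherwise open a new cluster for it."""
    clusters = []  # list of (representative, members) pairs
    for word in words:
        for representative, members in clusters:
            if identity(word, representative) >= threshold:
                members.append(word)
                break
        else:
            # No existing representative was similar enough.
            clusters.append((word, [word]))
    return clusters


# Hypothetical 12-character words; not data from the slides.
words = ["ACGTACGTACGT", "ACGTACGTTTTT", "TTTTTTTTTTTT", "GGGGGGGGGGGG"]
clusters = greedy_cluster(words)
```

Each word is compared only with cluster representatives, so the cost grows with the number of clusters, not with the number of word pairs. For millions of words this plain loop would probably still be too slow without an index to shortlist candidate representatives.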


HUMAN GENOME

http://www.eyeondna.com/wp-content/uploads/2007/08/humangenometshirt.jpg


3. VISUALIZATION OF GENOMES

• Problem: cluster individuals and identify mutations in a genome alignment.

• Result: user-friendly software that supports data browsing, uses public gene annotations, hides identical sites, recalculates the tree based on the selection, saves results, etc.
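The "hide identical sites" feature can be pictured as filtering alignment columns. The sketch below is an assumption about how such a filter might work, not the software's actual code; the function name and the toy alignment are hypothetical.

```python
def variable_sites(alignment):
    """Return the indices of columns where the aligned sequences are not all identical."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = zip(*alignment.values())
    return [i for i, column in enumerate(columns) if len(set(column)) > 1]


# Hypothetical alignment of three individuals; not data from the slides.
alignment = {"ind1": "ACGTAC", "ind2": "ACGTTC", "ind3": "ACCTAC"}
sites = variable_sites(alignment)
# Keep only the variable columns for display.
reduced = {name: "".join(seq[i] for i in sites) for name, seq in alignment.items()}
```

Keeping the original column indices in `sites` lets a viewer map each displayed column back to its genome position and to any gene annotation at that position.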


Tutorial: Environment for Tree Exploration, Release 2.2

print 'current model log likelihood:', current_model.lnL
# keep the model with the highest log likelihood seen so far
if current_model.lnL > best_lnl:
    best_lnl = current_model.lnL
    best_model = current_model

Finally, to take a quick look at the selective pressures over our phylogeny:

tree.show()

By default, this is the picture obtained:

Node size and color depend on the ω value. Other displays are also available:

from ete2.treeview.layouts import evol_clean_layout

tree.show(layout=evol_clean_layout)

Here the ω ratios are shown in red, and the dN and dS values in gray.

Site model

Another way to look at selective pressures is to compute the value of ω directly along the alignment, for a whole column (putting all leaves together). For this we can use, for example, the model M2 of CodeML, or use SLR directly. As before, we just have to:

tree.run_model('M2')
tree.run_model('SLR.lele')

and to display the results:

tree.show(histfaces=['M2'])

When a site model is computed, a histface is generated automatically. With this call we therefore draw the default histface corresponding to the model named M2.lala. This is the result:

However customizing this face is feasible:

model2 = tree.get_evol_model('M2')

col2 = {'NS' : 'black', 'RX' : 'black', 'RX+': 'black', 'CN' : 'black', 'CN+': 'black', 'PS' : 'black', 'PS+': 'black'}


model2.set_histface(up=False, kind='curve', colors=col2, ylim=[0,4], hlines=[2.5, 1.0, 4.0, 0.5], header='Many lines, error boxes, background black', hlines_col=['orange', 'yellow', 'red', 'cyan'], errors=True)

tree.show(histfaces=['M2'])

or:

col = {'NS' : 'grey', 'RX' : 'black', 'RX+': 'grey', 'CN' : 'black', 'CN+': 'grey', 'PS' : 'black', 'PS+': 'black'}

model2.set_histface(up=False, kind='stick', hlines=[1.0, 0.3], hlines_col=['black', 'grey'])

tree.show(histfaces=['M2'])

The col dictionary contains the colors for sites detected to be under positive selection (PS), relaxation (RX), or conserved (CN). However, it is not a good idea to use them yet, as we do not know whether there is indeed positive selection.

To be able to accept M2 results we will have to test this model against a null model.

3.8.4 Hypothesis Testing

In order to know whether the parameters estimated under a given model are reliable, we have to compare its likelihood to that of a null model.

Usually, the alternative model is one that estimates the proportion of sites with ω > 1, and we compare its likelihood with that of a null model, usually one that does not (constraining ω <= 1). This comparison is done through a likelihood ratio test. If the alternative model fits significantly better, we can accept the possibility of ω > 1.

For a non-exhaustive list of well-known comparisons, see the documentation of the function EvolNode.get_most_likely()

Test on sites

In order to know whether some sites are significantly under positive selection, relaxed or conserved, we usually have to compare two models. However, using the model "SLR" we can directly infer positive selection or relaxation through the SLR program [massingham2005].

The most usual comparison, and perhaps the most robust, is the comparison of models M2 and M1.
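The arithmetic behind such a comparison can be sketched in plain Python, independently of ETE. This is an illustrative sketch under stated assumptions: the function name and the log-likelihood values are hypothetical, and it assumes 2 degrees of freedom, the value usually used for M1 versus M2 because M2 adds two free parameters.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt, df=2):
    """Return (statistic, p_value) for two nested models; only df=2 is handled."""
    if df != 2:
        raise NotImplementedError("the closed form below is valid for df=2 only")
    # LRT statistic: twice the log-likelihood gain of the alternative model.
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For a chi-square distribution with 2 degrees of freedom,
    # the survival function is exactly exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log likelihoods; not results from the tutorial.
lnl_m1 = -3420.5  # null model (no sites with omega > 1)
lnl_m2 = -3415.0  # alternative model (allows omega > 1)
stat, p = likelihood_ratio_test(lnl_m1, lnl_m2)
```

With these made-up values the statistic is 11.0 and the p-value is below 0.05, so in this hypothetical case M2 would be preferred over M1.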



3. VISUALIZATION OF GENOMES


FOR MORE DETAILS

• Contact: [email protected], Room 314

• Group: http://biit.cs.ut.ee
