Bioinformatics: A Communication/Signal Processing Perspective. Khalid Sayood, Occult Information Lab, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln


Post on 09-Apr-2020


Page 1: Bioinformatics A Communication/Signal Processing Perspective (byoyo.cmpe.boun.edu.tr/sunumlar/khalidsayood-byoyo18.pdf)

Bioinformatics A Communication/Signal Processing Perspective

Khalid Sayood

Occult Information Lab

Department of Electrical and Computer Engineering

University of Nebraska-Lincoln

Page 2:

Outline

• Beginning

• Middle

• End

Page 3:

My roots

• Signal Processing

• Communication

• Information Theory

• Data Compression

Page 4:

My roots

• Data Compression: the art or science of finding compact representations of information in data

• Speech, Images, Video, DNA

Page 5:

[Figure: from DNA to mRNA in prokaryotes and eukaryotes. Labels: DNA (exons and introns); Transcription; Addition of cap and tail; RNA transcript with cap and tail; Introns removed; Exons spliced together; mRNA (Cap, Coding sequence, Tail); NUCLEUS; CYTOPLASM]

Analogy to Communications: Start, Message, Stop
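
One way to read the Start, Message, Stop analogy is as an open reading frame: a start codon, a run of codons, then a stop codon. The sketch below assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, none of which the slide spells out. `find_orfs` is a hypothetical helper name, and it scans only the forward strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed, not from the slide)
START_CODON = "ATG"                  # standard start codon (assumed)


def find_orfs(dna):
    """Return (start, end, message) for each forward-strand open reading frame.

    start is the index of the start codon, end is one past the stop codon,
    and message is the codon run between them (start and stop excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START_CODON:
                # Walk codon by codon until a stop codon or the end of the sequence.
                j = i + 3
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(dna):  # a stop codon was found in frame
                    orfs.append((i, j + 3, dna[i + 3:j]))
                    i = j + 3
                    continue
            i += 3
    return orfs


print(find_orfs("CCATGGCTTAAGG"))
```

In communication terms the start codon plays the role of a frame synchronization marker and the stop codon an end-of-message marker; real gene finding in eukaryotes also has to handle introns, which this sketch ignores.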

Page 6:

Definitions of Bioinformatics

• Original definition: Study of informatic processes in biotic systems – Pauline Hogeweg and Ben Hesper

Page 7:

Definitions of Bioinformatics

• Original definition: Study of informatic processes in biotic systems – Pauline Hogeweg and Ben Hesper

• Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying informatics techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale. (Mark Gerstein, 1999)

Page 8:

Data Storage and Management, Data Analysis, Interpretation of Results

Molecular Sequence Analysis

• Homology Search

• Phylogeny Construction

• Whole Genome Sequencing

• Gene Finding

Protein Structure Prediction

• Protein/RNA tertiary structure

• Docking

• Drug Design

Functional Genomics and Proteomics

• Microarrays

• Biomarker Discovery

Systems Biology

• Pathways

• Network-based holistic approach

Bioinformatics is a management and analysis information system for life sciences.

Page 9:

[Figure: From RNA-seq reads to differential expression results. Alicia Oshlack, Mark D Robinson and Matthew D Young, Genome Biology 2010, 11:220]

Page 10:

Preprocessing: FastQC, Trimmomatic

Indexing the Reads: Bowtie 2

Aligning to hg19/KSHV: TopHat 2, local alignment

Extracting Features: CuffDiff, P < 0.01, FPKM values

Differentially Expressed Genes

Page 11:

Apache Taverna

Page 12:

Wageningen University and Research

Page 13:

So what is the problem?

• What is actually going on.

Page 14:

Is there hope (for me)?

Digital Camera (1975) Genome Sequencer (2018)

Hope springs eternal

Page 15:
Page 16:

What do we mean by a Communication Theory Perspective?

Page 17:

What do I mean by a Communication Theory Perspective?

Information exists in the form of a stochastic process

Page 18:

How do we deal with stochastic processes?

• Look at the signal using different basis sets – frequency domain processing.

• Look at correlation structures.

• Look at models.

All these involve averaging of some sort

All these result in the discovery of underlying structure

They can also result in dimensionality reduction.
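
As a minimal sketch of the "correlation structures" bullet, and of how averaging exposes structure, the function below estimates the sample autocorrelation of a numeric sequence. The name `autocorrelation` and the choice of the biased estimator are assumptions made for illustration; the slide does not specify an estimator.

```python
def autocorrelation(x, max_lag):
    """Biased sample autocorrelation of x for lags 0..max_lag (lag 0 equals 1)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / n
    if var == 0:
        raise ValueError("constant sequence has no defined autocorrelation")
    # Each lag is an average of products of samples k steps apart.
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / (n * var)
        for k in range(max_lag + 1)
    ]


signal = [1, -1] * 8          # a period-2 signal
print(autocorrelation(signal, 4))
```

For the period-2 signal the estimate alternates in sign from lag to lag, so sixteen samples are summarized by a handful of numbers that reveal the period. That is the sense in which averaging can discover structure and reduce dimensionality.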

Page 19:

Realizations of a stochastic process

Page 20:

Frequency profile

Page 21:

Statistical Profile

Page 22:

Models

Page 23:

But we have sequences – we don’t have numbers

GAGACATTCAGTG

G

C

T

A

Page 24:

But we have sequences – we don’t have numbers

GAGACATTCAGT

A

C

G

T

Page 25:

But we have sequences – we don’t have numbers

G A G A C A T T C A G T G

A 0 1 0 1 0 1 0 0 0 1 0 0 0

C 0 0 0 0 1 0 0 0 1 0 0 0 0

G 1 0 1 0 0 0 0 0 0 0 1 0 1

T 0 0 0 0 0 0 1 1 0 0 0 1 0

Identification of Protein Coding Regions Using the Modified Gabor-Wavelet Transform, Mena-Chalco et al., IEEE/ACM Trans. Comp. Bio
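
The table above maps the sequence to four binary indicator sequences, one per base. A minimal sketch of that mapping follows, together with a plain DFT power measurement on the indicator sequences. The DFT step is an illustrative stand-in, not the modified Gabor-wavelet transform of the cited paper, and `indicator_sequences` and `power_at` are hypothetical helper names.

```python
import cmath


def indicator_sequences(seq, alphabet="ACGT"):
    """Map a DNA string to one binary indicator sequence per base."""
    return {base: [1 if s == base else 0 for s in seq] for base in alphabet}


def power_at(seq, k, alphabet="ACGT"):
    """Sum over bases of |X_b[k]|^2, where X_b is the DFT of the indicator for base b."""
    n = len(seq)
    total = 0.0
    for u in indicator_sequences(seq, alphabet).values():
        coeff = sum(u[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        total += abs(coeff) ** 2
    return total


for base, bits in indicator_sequences("GAGACATTCAGTG").items():
    print(base, *bits)
```

Once the sequence is numeric, the usual signal processing tools apply. For example, `power_at(seq, len(seq) // 3)` measures the period-3 component, which is often used as a cue for protein-coding regions.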

Page 26:

But we have sequences – we don’t have numbers

AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT

0 1 2 1 2 0 0 0 2 0 0 1 0 1 1 1

G A G A C A T T C A G T G
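
The count vector above can be reproduced by sliding a window of length 2 over the sequence. The sketch below does this for any k; `kmer_counts` is a hypothetical helper name, and skipping windows with symbols outside the alphabet is an assumption of the sketch.

```python
from itertools import product


def kmer_counts(seq, k=2, alphabet="ACGT"):
    """Count overlapping k-mers; keys are in lexicographic order of the alphabet."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing other symbols, such as N
            counts[kmer] += 1
    return counts


counts = kmer_counts("GAGACATTCAGTG")
print(" ".join(counts))
print(" ".join(str(c) for c in counts.values()))
```

This turns a sequence of any length into a fixed-length vector of 4^k numbers, which is one form of the dimensionality reduction mentioned earlier.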

Page 27:

But we have sequences – we don’t have numbers

Page 28:

AMI Profile for Human Chromosome 1
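
The slides do not define AMI. The sketch below assumes it stands for average mutual information: for each lag k, the mutual information between the bases at positions n and n + k, with the profile being these values over a range of lags. `ami_profile` is a hypothetical name, and estimating probabilities by simple counting is an assumption of the sketch.

```python
import math
from collections import Counter


def ami_profile(seq, max_lag):
    """Average mutual information in bits between symbols k apart, for k = 1..max_lag.

    Assumes max_lag < len(seq), so every lag has at least one pair.
    """
    profile = []
    for k in range(1, max_lag + 1):
        pairs = Counter(zip(seq, seq[k:]))
        total = sum(pairs.values())
        # Marginals are taken from the same pairs so the estimate is never negative.
        left, right = Counter(), Counter()
        for (a, b), c in pairs.items():
            left[a] += c
            right[b] += c
        mi = 0.0
        for (a, b), c in pairs.items():
            # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts
            mi += (c / total) * math.log2(c * total / (left[a] * right[b]))
        profile.append(mi)
    return profile


print(ami_profile("ACGT" * 10, 4))
```

A strictly periodic sequence gives values close to 2 bits at every lag, while a sequence of independent bases would give values near 0, so the profile summarizes dependence at each distance in a short vector.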

Page 29:

Human chromosomes

Page 30:

Mouse Chromosome

Page 31:

C. Elegans Chromosome

Page 32:
Page 33:
Page 34:
Page 35:

• Phylogeny for Candida and Saccharomyces clades based on multiple sequence alignment of 706 orthologous genes

• Posterior probabilities shown

• WGD: Whole Genome Duplication

• CTG: Translation of CTG codons as serine rather than leucine

Page 36:

• Ribosomal DNA (rDNA) is commonly used to evaluate species relatedness

• The rDNA gene complex contains 3 genes, each of which is a ribosomal component once transcribed

• Internal transcribed spacer (ITS) 1 and ITS2 separate these genes

• ITS regions have 2 benefits:

1. Easy to design primers (ribosome genes highly conserved, many copies)

2. Spacers diverge more quickly than ribosome genes

Page 37:

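• Distance matrix $D$ generated by calculating pairwise distance $d_{ij}$ between AMI profiles $\mathbf{x}_i$ and $\mathbf{x}_j$

• Distance defined in two ways:

1. Correlation distance (angle between profiles): $d_{ij} = 1 - \cos\theta = 1 - \dfrac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_j \rVert}$

2. Euclidean distance: $d_{ij} = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert$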

• Phylogenetic trees generated using PHYLIP (neighbor joining)
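
The two distances and the matrix D can be sketched as follows. The function names are hypothetical, profiles are assumed to be equal-length lists of numbers, and the tree-building step in PHYLIP is not reproduced here.

```python
import math


def correlation_distance(x, y):
    """1 - cos(theta), where theta is the angle between profiles x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Length of the difference vector x - y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(profiles, dist):
    """Symmetric matrix D with D[i][j] = dist(profiles[i], profiles[j])."""
    n = len(profiles)
    return [[dist(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]


profiles = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # toy profiles, not real AMI data
print(distance_matrix(profiles, correlation_distance))
```

The correlation distance ignores the overall scale of a profile (the first and third toy profiles are at distance 0), whereas the Euclidean distance does not, which is one reason to compare trees built from both.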

Page 38:
Page 39:
Page 40:
Page 41:

GO Prediction

BP: Biological Processes, CC: Cellular Component, MF: Molecular Function

“High Abundance” GO terms, “Low Abundance” GO terms

Page 42:
Page 43:
Page 44:

M E W S I N A N Q M Q L D Y Q R T R L K

M D C F I R S K H L K L G H Q H I H L N

M D G S I N A N Q M N F G H Q R I H L K

M D C S E N A K Y I K L A Q H R I H L N

M E C S I N A N Q R N L G H Q R I Y L K

M D W S I N A N H M K L D R P Q I H L T

M D C S I N A N Q I K F A Q Q R M H L N

M E C S E N A K R M K S G H Q H I H F K

M D G S I N A N Q K I L G H H R I Q L R

M Q C S I N A N H K K F G Q Q R T H L K

M D G E N A K H I K L D Q H R I Q L N

M R C S T M D N Q M N L G R Q H I H F K

S

k m

M D C S I N A N Q M K L G H Q R I H L K

Page 45:
Page 46:

Slow progressor populations Rapid progressor populations

Page 47:

Metagenomics

Page 48:
Page 49:
Page 50:

𝑝 , =𝐺 ∩𝐺𝐺

Page 51:

RRMSE for the simulated metagenomes corresponding to a mixture of 10, 20, 50, and 100 randomly selected organisms for 0.01X and 0.1X average genome coverages.

Page 52:
Page 53:
Page 54:

Occult Information Lab Bioinformatics

Mark Bauer, Hasan Otu, David Russell, Ufuk Nalbantoglu, Sam Way, Amirsalar Mansouri, Garin Newcomb, Dicle Yalcin, Keith Murray, Jacob Bohac

Page 55: