1 HKU CS Bioinformatics Research Siu Ming Yiu Department of Computer Science The University of Hong Kong Other faculty members: Prof. Francis Chin Prof. TW Lam Dr HF Ting

Upload: david-king

Post on 05-Jan-2016



1

HKU CS Bioinformatics Research

Siu Ming Yiu
Department of Computer Science
The University of Hong Kong

Other faculty members: Prof. Francis Chin, Prof. TW Lam, Dr HF Ting


2

Impact of bioinformatics

Medical research
e.g. finding a cancer-causing gene?

Biological research
e.g. can we make rice grow faster?

Environmental study
e.g. how to remove harmful bacteria?

Biofuel
e.g. how do bacteria digest food to produce energy?

Huge volume of data
e.g. human genome: 3G long; medical study: 100 persons
e.g. human gut contains 1000+ bacteria (data: 500G); obesity


3

The de novo assembly problem (single genome)

Given an unknown genome, Genome X

No existing technology can read out its whole DNA sequence (ACCG…..) in one piece, because the sequence is too long (e.g. human: 3 billion long; even bacteria are about 10K to several million long). What can we do?

High-throughput sequencing technology (next-generation sequencing, NGS):

Multiple copies of Genome X -> DNA sequencing machine -> reads

[Inside the machine, the genomes are randomly cut into short fragments (reads); the machine can read out the DNA sequence of the reads.]

Example reads: ACCGGTCG, CTTG, AACG, CTCGGTCG, CTAGCAAG, GGAGGTTG


4

Multiple copies of Genome X

Bad news

(1) The reads are really short: 100-150 bp (cf. the genome of a bacterium: 10K to several million).

(2) They are mixed together: we have no idea where in the genome each read comes from!

(3) There are errors in the reads. [AACCGTTC => AACGGTCC]

The (de novo) assembly problem: can we reconstruct the original genome from the reads?
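To make the three pieces of bad news concrete, here is a minimal Python sketch that imitates a sequencing machine on a toy genome. The function name `simulate_reads`, its parameters, and the uniform substitution-error model are assumptions made for illustration only; they are not taken from the slides or from any real sequencer.

```python
import random

def simulate_reads(genome, read_len, depth, error_rate, seed=0):
    """Cut random short reads out of `genome`, add errors, and shuffle them.

    Hypothetical toy model: every start position is equally likely and each
    base is replaced by a different base with probability `error_rate`.
    """
    rng = random.Random(seed)
    n_reads = len(genome) * depth // read_len
    reads = []
    for _ in range(n_reads):
        # (1) reads are short: take read_len bases from a random position
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        # (3) reads contain errors: substitute some bases
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    # (2) reads are mixed together: the positions are thrown away
    rng.shuffle(reads)
    return reads

reads = simulate_reads("AACGACGTGTACCTCAGT", read_len=9, depth=5, error_rate=0.05)
print(len(reads), "reads of length", len(reads[0]))
```

The assembler only ever sees the returned list, never the start positions.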


5

Data volume: HUGE!!

Take the human genome as an example. The genome is $3 \times 10^9$ (3 billion) long.

The average number of reads covering each position of a genome is referred to as the depth of the sequencing.

Recall: multiple copies are cut (fragmented). At any position of the genome, multiple copies of reads may be obtained.

Note that they are mixed together, with no ordering information.

For depth = 30 and reads of length 100, # of reads: $(3 \times 10^9 \times 30)/100 \approx 10^9$
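The arithmetic above can be checked with a few lines of Python. The helper name `expected_read_count` is hypothetical, and the read length of 100 is the value implied by the division on the slide.

```python
def expected_read_count(genome_length, depth, read_length):
    """Number of reads needed so that each position is covered `depth` times on average."""
    return genome_length * depth // read_length

# Human genome, depth 30, reads of length 100
print(expected_read_count(3 * 10**9, 30, 100))  # 900000000, i.e. roughly 10^9
```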


6

Good news

There are some clues inside the reads: the reads are overlapping!

Unknown genome:

AACCGGTTGTCACGTTCCACTTGGCC………

AACCGGTTG
 ACCGGTTGT
  CCGGTTGTC
   CGGTTGTCA
    GGTTGTCAC
     GTTGTCACG
      TTGTCACGT
       TGTCACGTT

Ideal case: every position has at least one read and there are no errors in the reads; then each read overlaps the next one and the reads can be chained together.

[But the reality…. is a lot worse]
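For the ideal case only, the overlap idea can be sketched in Python as follows. This is an illustrative toy, not the method used by any real assembler: `chain_reads` is a hypothetical name, and the sketch assumes error-free reads of equal length, one read starting at every position, and no repeated prefixes.

```python
def chain_reads(reads):
    """Chain reads that overlap by all but one base (ideal case only)."""
    overlap = len(reads[0]) - 1
    by_prefix = {r[:overlap]: r for r in reads}
    suffixes = {r[1:] for r in reads}
    # The first read is the one whose prefix is not the suffix of any read.
    current = next(r for r in reads if r[:overlap] not in suffixes)
    genome = current
    for _ in range(len(reads) - 1):
        # The next read starts with the last `overlap` bases of the current one.
        current = by_prefix[current[1:]]
        genome += current[-1]
    return genome

# The eight reads from the slide, deliberately given out of order
reads = ["GGTTGTCAC", "AACCGGTTG", "TGTCACGTT", "CCGGTTGTC",
         "ACCGGTTGT", "TTGTCACGT", "CGGTTGTCA", "GTTGTCACG"]
print(chain_reads(reads))  # AACCGGTTGTCACGTT
```

A single erroneous or missing read makes the dictionary lookup fail, which is exactly why the real problem is harder.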


7

Unknown genome:

AACCGGTTGTCACGTTCCACTTGGCC………

AACCGCTTG
 ACCGGTTTT
  CCGGTTGTC
   CGGTTGTCA
    GGTTGTCAC
     GTTGTCACG
      TTGTCACGT
       TGTCACGTT

The reality:

(a) There are errors in the reads, so it is not easy to locate the next read!

(b) At some positions, we may have no reads.
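Point (a) can be seen by counting mismatches in the region where two neighbouring reads should agree. The sketch below is only an illustration; `overlap_mismatches` is a hypothetical helper and it assumes the second read starts exactly one position after the first.

```python
def overlap_mismatches(read_a, read_b):
    """Count mismatches between read_a without its first base and read_b without its last base."""
    return sum(x != y for x, y in zip(read_a[1:], read_b[:-1]))

# First two reads on the slide: both contain an error, so the overlap is not exact
print(overlap_mismatches("AACCGCTTG", "ACCGGTTTT"))  # 2
# Two error-free neighbouring reads overlap perfectly
print(overlap_mismatches("CCGGTTGTC", "CGGTTGTCA"))  # 0
```

With errors present, an exact-match search for "the next read" finds nothing, so some tolerance for mismatches is needed.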


8

Publications
Bioinformatics (impact factor: 5.323)
BMC Genomics (impact factor: 4.4)
PLoS One (impact factor: 3.73)
BMC Bioinformatics (impact factor: 3.02)
Journal of Computational Biology (impact factor: 1.56)
IEEE/ACM TCBB (impact factor: 1.54)
……
Top conferences: RECOMB, ISMB, ECCB
Nature papers with our collaborators

HKU-BGI research center:
BGI (Shenzhen) is the largest genomic center in the world

Other international collaborators:
JGI, Department of Energy, US (biofuel); SickKids hospital, Canada (diabetes); CAS-MPG PICB, Shanghai (C4 Rice project); UC San Francisco (optical mapping data analysis); NUS, Singapore (RNA study); ….


9

How to solve the problem?

A few general approaches: string graph, de Bruijn graph, …

Idea: we still make use of the overlapping parts of reads to connect them together. We do not need a read at every position.

--------------------------

Graph:
Vertex: k-mer (k consecutive nucleotides in a read)
Edge: two k-mers appear consecutively in a read

Genome: …. A C G T G T A C C T C …….

Read: G T G T A C C T C (k = 4)

GTGT -> TGTA -> GTAC -> TACC -> ACCT -> CCTC
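The vertex and edge definitions above translate directly into code. The following Python sketch uses hypothetical helper names (`kmers`, `kmer_edges`) and reproduces the k = 4 example for the read GTGTACCTC.

```python
def kmers(read, k):
    """All k-mers of a read, in order of appearance (the vertices)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def kmer_edges(read, k):
    """Pairs of k-mers that appear consecutively in the read (the edges)."""
    ks = kmers(read, k)
    return list(zip(ks, ks[1:]))

print(kmers("GTGTACCTC", 4))
# ['GTGT', 'TGTA', 'GTAC', 'TACC', 'ACCT', 'CCTC']
print(kmer_edges("GTGTACCTC", 4)[0])
# ('GTGT', 'TGTA')
```

A read of length 9 therefore contributes 6 vertices and 5 edges when k = 4.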


10

Genome: A A C G A C G T G T A C C T C A G T

Reads (len = 9):

A A C G A C G T G
  A C G A C G T G T
    C G A C G T G T A
      G A C G T G T A C
        A C G T G T A C C
          C G T G T A C C T
            G T G T A C C T C
              T G T A C C T C A
                G T A C C T C A G
                  T A C C T C A G T

Ideal case
- No errors
- Reads at every position
- The graph can be read out as one single path, and that path is the genome!

AACG -> ACGA -> CGAC -> GACG -> ACGT -> CGTG -> GTGT -> TGTA -> GTAC -> TACC -> ACCT -> CCTC -> CTCA -> TCAG -> CAGT
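As a rough illustration of the ideal case, the Python sketch below builds the graph from the ten reads with k = 4 and spells the genome by following the single path. The function names are hypothetical, and the walk assumes exactly what the slide assumes: no errors, no branches, and one start vertex.

```python
def de_bruijn_edges(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    edges = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            edges.setdefault(read[i:i + k], set())
            if i + k < len(read):
                edges[read[i:i + k]].add(read[i + 1:i + k + 1])
    return edges

def spell_single_path(edges):
    """Walk the graph, assuming it is one unbranched path (ideal case only)."""
    targets = {v for vs in edges.values() for v in vs}
    starts = [u for u in edges if u not in targets]
    assert len(starts) == 1, "expected exactly one start vertex"
    node = starts[0]
    genome = node
    while edges[node]:
        assert len(edges[node]) == 1, "branch found; not the ideal case"
        node = next(iter(edges[node]))
        genome += node[-1]  # each step adds one new base
    return genome

reads = ["AACGACGTG", "ACGACGTGT", "CGACGTGTA", "GACGTGTAC", "ACGTGTACC",
         "CGTGTACCT", "GTGTACCTC", "TGTACCTCA", "GTACCTCAG", "TACCTCAGT"]
print(spell_single_path(de_bruijn_edges(reads, 4)))  # AACGACGTGTACCTCAGT
```

The order of the reads in the list does not matter, because the graph only records which k-mers follow which.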


11

Genome: A A C G A C G T G T A C C T C A G T

Reads (len = 9):

A A C G A C G T G
  A C G A C G T G T
    C G A C G T G T A
      G A C G T G T A C
        A C G T G T A C C
          C G T G T A C C T
            G T G T A C C T C
              T G T A C C T C A
                G T A C C T C A G
                  T A C C T C A G T

Note: even if a few reads are missing, we are still OK!

AACG -> ACGA -> CGAC -> GACG -> ACGT -> CGTG -> GTGT -> TGTA -> GTAC -> TACC -> ACCT -> CCTC -> CTCA -> TCAG -> CAGT

Can you see that the number of reads that can be missing depends on the value of k used when constructing the graph?

Q: to allow more missing reads, is a larger or a smaller k better?
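One way to explore this question is to drop some reads and check whether every edge of the genome's path is still present. The Python sketch below is an illustration under stated assumptions: `edges_covered` is a hypothetical helper, and the choice of which four reads to keep is arbitrary.

```python
def edges_covered(genome, reads, k):
    """True if every pair of consecutive k-mers in the genome occurs in some read.

    Two consecutive k-mers together span k + 1 bases, so it is enough to
    compare (k + 1)-mers.
    """
    needed = {genome[i:i + k + 1] for i in range(len(genome) - k)}
    seen = {r[i:i + k + 1] for r in reads for i in range(len(r) - k)}
    return needed <= seen

genome = "AACGACGTGTACCTCAGT"
# Keep only 4 of the 10 reads (those starting at positions 0, 4, 8 and 9)
kept = ["AACGACGTG", "ACGTGTACC", "GTACCTCAG", "TACCTCAGT"]

print(edges_covered(genome, kept, 4))  # True: the path is still complete
print(edges_covered(genome, kept, 8))  # False: missing reads break the path
```

A read of length 9 contributes 9 - k edges, so in this toy setting a smaller k lets each read bridge more positions and tolerates more missing reads.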


12

Genome: A A C G A C G T G T A C C T C A G T

Reads (len = 9):

A A C G A C G T G
  A C G A C G T G T
    C G A C G T G T A
      G A C G T G T A C
        A C G T G T A C C
          C G T G T A C C T
            G T G T A C C T C
              T G T A C C T C A
                G T A C C T C A G
                  T A C C T C A G T

[Figure: a branch in the graph, where ACGT is followed by both CGTG and CGTC]

Contigs: maximal paths without branches

CGAC -> GACG -> ACGT gives the contig CGACGT

The real case is more complicated: even with no errors, some patterns may repeat in a genome!

In reality, we can seldom reconstruct the whole genome in one piece; we stop at junctions, resulting in a set of contigs.
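A minimal Python sketch of contig extraction as maximal non-branching paths follows. It is an illustration rather than a production assembler: the function names are hypothetical, isolated cycles are ignored, and the extra read CGACGTCTA is an invented erroneous read added only to create the ACGT -> CGTC branch.

```python
from collections import defaultdict

def build_graph(reads, k):
    """k-mer -> set of k-mers that directly follow it in some read."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            succ[read[i:i + k]].add(read[i + 1:i + k + 1])
    return succ

def contigs(reads, k):
    """Spell every maximal path that contains no branching vertex in its interior."""
    succ = build_graph(reads, k)
    pred = defaultdict(set)
    for u, vs in list(succ.items()):
        for v in vs:
            pred[v].add(u)
    nodes = set(succ) | set(pred)

    def simple(v):
        # Exactly one way in and one way out: the path just passes through.
        return len(pred[v]) == 1 and len(succ[v]) == 1

    result = []
    for v in sorted(nodes):
        if simple(v):
            continue  # contigs start only at ends and junctions
        for w in sorted(succ[v]):
            path = v
            while True:
                path += w[-1]
                if not simple(w):
                    break  # stop at the next end or junction
                w = next(iter(succ[w]))
            result.append(path)
    return result

reads = ["AACGACGTG", "ACGACGTGT", "CGACGTGTA", "GACGTGTAC", "ACGTGTACC",
         "CGTGTACCT", "GTGTACCTC", "TGTACCTCA", "GTACCTCAG", "TACCTCAGT",
         "CGACGTCTA"]  # last read is a hypothetical read with one error
print(contigs(reads, 4))
# ['AACGACGT', 'ACGTCTA', 'ACGTGTACCTCAGT']
```

Instead of one genome, the junction at ACGT yields three contigs, which is the situation described above.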


13

A part of the de Bruijn graph for E. coli (~4M long); you can imagine how complicated it is for the human genome (3G long)


14

Conclusions

Our team:

Core faculty members: Prof. Francis Chin, Prof. TW Lam, me

1 Research Assistant Professor (Henry Leung)
1 Postdoc (Jianyu Shi)
About 8 PhD/master's students + a team in the HKU-BGI Lab

Some collaborators:

Beijing Genomics Institute at Shenzhen (BGI) - HKU-BGI Laboratory
HKU medical schools; life science departments
SickKids hospital, Canada
JGI, DoE, US
CAS-MPG PICB, Shanghai (C4 Rice project)
UC San Francisco (Pui’s group)
GIS (Genome Institute of Singapore)
Universities: NUS, CUHK, U of Liverpool etc.

<Thank you>