
Post on 03-Aug-2020


Page 1: Introduction to human genomics and genome informatics · Head, Bioinformatics & Integrative Genomics Adult Cancer Program, Lowy Cancer Research Centre Prince of Wales Clinical School,

Introduction to human genomics and genome informatics

Prince of Wales Clinical School

Session 1

Dr Jason Wong

ARC Future Fellow

Head, Bioinformatics & Integrative Genomics

Adult Cancer Program, Lowy Cancer Research Centre

Prince of Wales Clinical School, Faculty of Medicine, UNIVERSITY OF NEW SOUTH

WALES, SYDNEY NSW 2052

Page 2

What we will cover

• Structure of the human genome

• Layers of genomic information

– DNA (Sequence variation)

– RNA (Genes & gene expression)

– Epigenetics (DNA methylation)

– Epigenetics (Histone code/Transcription factors)

• Genomic data acquisition technologies

– Microarray

– Next-generation sequencing

Page 3

Structure of human genome

• Consists of 23 pairs of chromosomes.

• Chromosomes come in pairs, so the genome is diploid.

• Each individual chromosome is made up of double-stranded DNA.

• Approximately 3 billion bases in total.

Page 4

Reference human genome

• Human genomes differ between individuals at roughly 0.1% of positions.

• Computationally, a reference genome is used.

• Important things to note about the reference genome:

– Is haploid (i.e. only 1 sequence)

– Is a composite sequence (i.e. does not correspond to any one individual’s genome)

Page 5

Representation of genomic data

• Genomic data is most commonly represented in two ways:

1. Sequence data – fasta format (.fa or .fasta)

2. Location data – bed format (.bed)

>chr1

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

ACAGTACTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTC

TCAgtagactcctaaatatgggattcctgggtttaaaagtaaaaaataaa

tatgtttaatttgtgaactgattaccatcagaattgtactgttctgtatc

ccaccagcaatgtctaggaatgcctgtttctccacaaagtgtttactttt

....

BED columns: chromosome start end name score strand

chr1 934343 935552 HES4 0 -

chr1 948846 949919 ISG15 0 +

...

All about genomic formats here - http://genome.ucsc.edu/FAQ/FAQformat.html
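As a rough illustration of the two formats on this slide, the sketch below reads FASTA and BED text into simple Python structures. The function and class names (`parse_fasta`, `parse_bed`, `BedRecord`) are made up for this example, and the FASTA sequence is shortened from the one shown above. It assumes whitespace-separated, six-column BED lines.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class BedRecord(NamedTuple):
    """One six-column BED line (hypothetical helper type)."""
    chrom: str
    start: int  # BED start coordinates are 0-based
    end: int    # BED end coordinates are exclusive
    name: str
    score: int
    strand: str


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence name: sequence} from FASTA-formatted lines."""
    parts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            current = line[1:].split()[0]
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def parse_bed(lines: Iterable[str]) -> List[BedRecord]:
    """Parse six-column BED lines, skipping anything shorter."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        chrom, start, end, name, score, strand = fields[:6]
        records.append(
            BedRecord(chrom, int(start), int(end), name, int(score), strand)
        )
    return records


fasta = parse_fasta([">chr1", "NNNNNNNNNN", "ACAGTACTGG", "TCAgtagact"])
genes = parse_bed([
    "chr1 934343 935552 HES4 0 -",
    "chr1 948846 949919 ISG15 0 +",
])
print(len(fasta["chr1"]))                             # 30
print(genes[0].name, genes[0].end - genes[0].start)   # HES4 1209
```

Because BED intervals are 0-based and end-exclusive, the length of a feature is simply `end - start`.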

Page 6

What do chromosomes contain?

Genes: ~1.2% coding ~2% non-coding

Regulatory regions: ~2%

Repetitive elements comprise another ~50% of the human genome

Page 7

Layers of genetic information

• DNA sequence variation

• Gene expression

– Coding

– Non-coding

• Epigenetic regulation

– DNA methylation

– Histone/transcription factor binding

Page 8

Sequence variation

Page 9

Variations in DNA sequence

• Cytological level:

– Chromosome numbers

– Segmental duplications, rearrangements, and deletions

• Sub-chromosomal level:

– Transposable elements

– Short Deletions/Insertions, Tandem repeats

• Sequence level:

– Single Nucleotide Polymorphisms (SNPs)

– Small Nucleotide Insertions and Deletions (Indels)

Page 10

Sequence variation

• Single nucleotide polymorphisms (SNPs)

– DNA sequence variations that exist among members of a species.

– They are inherited and therefore present in all cells.

• Somatic mutations

– Are acquired rather than inherited, i.e. only present in some cells.

– Often observed in cancer cells.

Page 11

Types of SNPs/Mutations

• Most SNPs and mutations fall in intergenic regions.

• Within genes, they can either fall in the non-coding or coding regions.

• Within coding regions, they can either leave the encoded amino acid unchanged (synonymous) or change it (non-synonymous).

[Figure: gene diagram with labels for the intergenic region, TSS, non-coding and coding regions, and synonymous and non-synonymous variants]

Page 12

Effects of sequence variation

• Non-synonymous variants:

– Missense (changes an amino acid, which may alter protein structure)

– Nonsense (introduces a stop codon, which truncates the protein)

• Synonymous or non-coding variants:

– Alter transcriptional/translational efficiency

– Alter mRNA stability

– Alter gene regulation (e.g. alter TF binding)

– Alter RNA-regulation (e.g. affect miRNA binding)

The majority of sequence variants are neutral.
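The coding variant classes above can be sketched with the standard genetic code. This is only an illustration: `classify_codon_change` is a hypothetical helper, it looks at a single codon in isolation, and it ignores splicing, reading-frame shifts and strand. The extra "stop-lost" label is an addition for completeness and is not on the slide.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon substitution by its effect on the amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # new stop codon, truncated protein
    if ref_aa == "*":
        return "stop-lost"  # original stop codon removed
    return "missense"       # different amino acid


print(classify_codon_change("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(classify_codon_change("GAG", "GTG"))  # missense   (Glu -> Val)
print(classify_codon_change("TGG", "TGA"))  # nonsense   (Trp -> stop)
```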

Page 13

Genes and gene expression

Page 14

Types of genes

• A gene is a functional unit of DNA that is transcribed into RNA.

• Total genes in the human genome – 57,445

Source: GENCODE (version 18)

Page 15

Coding genes

Source: http://www.news-medical.net

• Traditionally considered to be the most important functional unit of genomes.

• ~ 20,000 in the human genome.

• Due to alternative splicing, one gene can produce many different proteins.

Page 16

Non-coding genes

Page 17

microRNA

• Plays a role in post-transcriptional regulation.

• Only discovered in 1993.

• Acts by either causing RNA degradation or inhibition of translation.

• Implicated in many aspects of health and disease including:

– Development

– Cancer

– Heart disease

Page 18

Long non-coding RNA (lncRNA)

• Arbitrarily defined as non-coding transcripts > 200 nt in length.

• Implicated in many functions including:

– Altering protein/DNA interaction.

– Binding mRNA.

– Acting as a sink for miRNAs.

– Etc…

• Unlike coding genes and miRNAs, lncRNAs are less conserved and the functions of many are unknown.

Prensner and Chinnaiyan (2011) Cancer Discov. 1:391

Page 19

Gene expression

• Measuring the level of RNA (typically mRNA) in the sample.

• Generally microarray- or sequencing-based.

• Commonly used for measuring differential expression

– between samples, or

– between genes

• Computational analysis and normalisation of expression data can be complicated.

Source: OPENbeta
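To make the normalisation point concrete, here is a minimal sketch of one simple scheme for sequencing counts: scale each sample to counts per million (CPM), then compare samples with a log2 fold change. The counts and gene names are invented, and the pseudocount of 1 is an arbitrary assumption. Real analyses generally use more careful normalisation and statistical testing than this.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * count / total for gene, count in counts.items()}


def log2_fold_change(a: float, b: float, pseudocount: float = 1.0) -> float:
    """log2(b / a), with a pseudocount to avoid dividing by zero."""
    return math.log2((b + pseudocount) / (a + pseudocount))


# Hypothetical raw counts; sample B was sequenced twice as deeply.
sample_a = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_b = {"geneA": 400, "geneB": 600, "geneC": 1000}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

for gene in sample_a:
    change = log2_fold_change(cpm_a[gene], cpm_b[gene])
    print(gene, round(change, 2))
```

In this toy data the raw count of geneB doubles between samples, but after scaling for sequencing depth its fold change is zero, which is why normalisation has to come before any comparison.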

Page 20

Gene-set/Pathway analysis

• Differential expression of individual genes is not necessarily informative.

• Genes are often grouped in gene-sets based on ontology or biological pathways.

• Gene-set and pathway analyses are therefore a common downstream step after differential gene expression analysis.
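One common way to test a gene-set is over-representation analysis: ask whether the differentially expressed genes overlap the set more than expected by chance, using a hypergeometric test. The sketch below is one possible formulation and not necessarily the method used in this course. The function name and the example numbers are hypothetical, and it needs Python 3.8+ for `math.comb`.

```python
from math import comb


def enrichment_p_value(
    n_background: int, n_in_set: int, n_selected: int, n_overlap: int
) -> float:
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_background: all genes measured
    n_in_set:     background genes that belong to the gene-set
    n_selected:   differentially expressed genes
    n_overlap:    differentially expressed genes that are in the gene-set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes measured, 100 in the pathway,
# 500 differentially expressed, 10 of which are in the pathway.
p = enrichment_p_value(20000, 100, 500, 10)
print(f"p = {p:.2e}")
```

With these numbers about 2.5 pathway genes would be expected in the differentially expressed list by chance, so an overlap of 10 gives a small p-value. In practice many gene-sets are tested at once, so the p-values need multiple-testing correction.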

Page 21

Gene regulation

Page 22

Gene regulation/epigenetics

• Epigenetics is the study of mechanisms that alter cellular function independent of any changes in DNA sequence.

• Mechanisms include:

– DNA methylation

– Nucleosome positioning/Histone modification

– Transcription factors

– Non-coding RNA

Page 23

DNA methylation

• DNA is methylated on cytosines in CpG dinucleotides
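As a small illustration of what a CpG dinucleotide is at the sequence level, the sketch below lists the positions where a C is immediately followed by a G on the same strand. `cpg_positions` is a hypothetical helper. It only finds candidate sites: whether a site is actually methylated has to be measured experimentally.

```python
from typing import List


def cpg_positions(sequence: str) -> List[int]:
    """Return 0-based positions of the C in every CpG dinucleotide."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


print(cpg_positions("ACGTCGA"))  # [1, 4]
```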

Page 24

Nucleosomes & Histones

• Histones are proteins that package DNA into nucleosomes.

Page 25

Histone modifications

• Acetylation

• Methylation

• Phosphorylation

• Ubiquitination

• Can enhance or repress gene expression

Page 26

Transcription factors

• Proteins that bind DNA to regulate gene expression.

• Typically bind at gene promoters or enhancers.

Page 27

Studying gene regulation

• Has traditionally been more difficult than studying gene expression because:

– The locations of many regulatory regions are poorly defined.

– Regulatory regions differ greatly between cell types.

– Many modes of gene regulation.

• Next-generation sequencing technologies have enabled great progress to be made.

Page 28

Genomic technologies

Page 29

Genomic technologies

• Microarray-based data

– SNP profiling

– Copy number profiling

– DNA methylation profiling

– Gene expression profiling

• Next-generation sequencing

– “Swiss-army knife” of genomics

Page 30

Data acquisition

• Relies on fluorescence-based detection of DNA hybridised to complementary probes on the array.

• Can be used to study DNA or any molecule that can be converted to cDNA.

– SNP array (probes for two alleles)

– Methylation array (probes for bisulfite-converted DNA)

– Expression array (probes for exonic DNA regions)

• Limited by probes present on the array.

Page 31

Microarray gene expression analysis

• Gene signatures

• Sample classification

[Figure: microarray workflow, with labels: microarray chips, images scanned by laser, datasets, data mining and analysis]

Example probe values from one sample:

Gene Value
D26528_at 193
D26561_cds1_at -70
D26561_cds2_at 144
D26561_cds3_at 33
D26579_at 318
D26598_at 1764
D26599_at 1537
D26600_at 1204
D28114_at 707

[Table: example dataset with one row per sample, giving the class (ALL or AML), the sample number, and expression values for probes D26528, D63874, D63880, …]

Page 32

Next-generation sequencing

Page 33

What is NGS?

There are a number of different NGS technologies. Here we use Illumina sequencing as an example.

Figures provided by Illumina Inc.

Page 34

Sequences are inferred from fluorescence signals during synthesis

Figures provided by Illumina Inc.

Short sequencing reads

Page 35

Alignment

[Figure: aligned reads shown against a gene]

Page 36

NGS file formats

• Fastq – Stores sequencing reads from NGS. Contains read sequences and quality scores.

• BAM/SAM – A BAM file (.bam) is a binary file containing reads and the coordinates of where each read has mapped in a genome. SAM is the same information in text format.

• BedGraph/Wig – For storing continuous profile information for visualisation.

• VCF – For storing information about variants.

https://powcs.med.unsw.edu.au/sites/default/files/powcs/page/example_file_formats.zip
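To show what "read sequence and quality scores" looks like in practice, the sketch below parses FASTQ text, where each read takes four lines: an `@` header, the sequence, a `+` separator, and one quality character per base. It assumes the Phred+33 quality encoding and unwrapped four-line records. The read shown is invented and `parse_fastq` is a hypothetical helper, not part of any library.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read id, sequence, Phred scores) from four-line FASTQ records."""
    cleaned = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


example = ["@read1", "ACGT", "+", "II5!"]
for read_id, sequence, scores in parse_fastq(example):
    print(read_id, sequence, scores)  # read1 ACGT [40, 40, 20, 0]
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 40 means roughly a 1 in 10,000 chance that the base call is wrong.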

Page 37

Pros/cons of each technology

• NGS

– Greater dynamic range (only limited by depth of sequencing).

– Coverage of genome does not need to be limited.

– Many more applications from sequencing data.

– Data analysis and management can be challenging.

• Microarrays

– Microarrays are still significantly cheaper.

– Largest public datasets are likely to be microarray based.

– Data analysis pipelines are well standardised.

Page 38

Example of using public resources to tell us more about our data

http://www.powcs.med.edu.au/OncoCis

Page 39

OncoCis uses public data from various sources to assign potential function to non-coding mutations

Given a non-coding mutation, what do we want to know?

1. Does the mutation fall within a cis-regulatory region (ENCODE/Human Epigenome Atlas)?

2. Is the mutation site highly conserved (UCSC)?

3. What gene might the mutation affect (FANTOM5)?

4. What transcription factor binding site might be altered (JASPAR)?

5. Does the mutation affect a gene which is druggable (DGIdb)?
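The first question in the list amounts to an interval lookup: given annotated regions in BED-style coordinates, which of them contain the mutated position? The sketch below shows the idea with a plain linear scan. It is not how OncoCis is implemented, and the region coordinates and labels are invented for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Region = Tuple[str, int, int, str]  # (chromosome, start, end, label)


def build_index(regions: Iterable[Region]) -> Dict[str, List[Tuple[int, int, str]]]:
    """Group regions by chromosome, sorted by start coordinate."""
    index: Dict[str, List[Tuple[int, int, str]]] = {}
    for chrom, start, end, label in regions:
        index.setdefault(chrom, []).append((start, end, label))
    for intervals in index.values():
        intervals.sort()
    return index


def overlapping_regions(
    index: Dict[str, List[Tuple[int, int, str]]], chrom: str, position: int
) -> List[str]:
    """Labels of regions containing a 0-based position (end is exclusive)."""
    return [
        label
        for start, end, label in index.get(chrom, [])
        if start <= position < end
    ]


# Hypothetical regulatory annotations in BED-style coordinates.
regions = [
    ("chr1", 1000, 1500, "enhancer_1"),
    ("chr1", 1400, 2000, "promoter_1"),
    ("chr2", 500, 900, "enhancer_2"),
]
index = build_index(regions)

print(overlapping_regions(index, "chr1", 1450))  # ['enhancer_1', 'promoter_1']
print(overlapping_regions(index, "chr1", 2000))  # []
```

For genome-scale annotation sets a linear scan per mutation is slow, so dedicated interval tools or interval trees are normally used instead.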

Page 40

[Figure: annotated OncoCis output, with callouts for:]

• Gene mapping from FANTOM5 or GREAT

• Link out to UCSC genome browser

• Epigenetic data from ENCODE/Epigenome project

• Conservation data from UCSC

• Motif data from JASPAR

• FANTOM5 regulatory data

• Link out to Drug-Gene interaction database (DGIdb)