introduction to genomics cm226: machine learning for...

74
Introduction to genomics CM226: Machine Learning for Bioinformatics Fall 2017 Sriram Sankararaman

Upload: nguyendieu

Post on 27-Feb-2018

223 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Introduction to genomics

CM226: Machine Learning for Bioinformatics

Fall 2017 Sriram Sankararaman

Page 2: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What is this course about? Bioinformatics: Answering biological questions using tools from computer science, statistics and mathematics.

Machine Learning: Learning from data

Page 3: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Personalized medicine

A biological question

Page 4: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Cystic fibrosis

Genetic disease

Caused by mutations in the CFTR gene

Page 5: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Cystic fibrosis

Ivacaftor is a drug approved for patients with a mutation that changes G to D at position 551 in the protein encoded by CFTR

Ramsey et al. NEJM 2011

Page 6: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Relapse in ALL

Acute Lymphoblastic Leukemia in children

Five-year survival rates ~80%

Poorer survival rates among African-Americans and Hispanics

Higher risk of relapse among children with native American ancestry

Yang et al. Nature Genetics 2011

Risk

of r

elap

se

Page 7: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Path to personalized medicine

Basic biology

Understand the architecture of a disease

Is the disease genetic or environmental ?

What are the genetic and environmental factors?

Can we develop drug to target defective genes ?

Disease prediction

Predict disease risk for an individual from genetic data

Page 8: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Path to personalized medicine

Basic biology

Understand the architecture of a disease

Is the disease genetic or environmental ?

What are the genetic and environmental factors?

Disease prediction

Predict disease risk for an individual from genetic data

Problems of statistical learning

Page 9: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What is this course about? Bioinformatics: Answering biological questions using tools from computer science, statistics and mathematics.

Machine Learning: Learning from data

Page 10: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What is this course about?Genomic revolution in biology

2001 Human genome project

2010 1000 genomes project

Page 11: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What is this course about?Genomic revolution in biology

Stephens et al. PLoS Biology 2015

Num

ber o

f hum

an g

enom

es

Num

ber o

f bas

es

Page 12: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What is this course about?Personal genomics

Page 13: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Statistics and computing important in obvious and subtle ways

What does this mean for us?

Page 14: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Administrivia

Page 15: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Course goalsIdentify biological questions.

Translate these questions into a statistical model.

Perform inference within the model.

Apply the model to data.

Interpret the results

Statistically sound ? Biologically meaningful ?

Page 16: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Course goals

For CS/Stats/EE students, identify important quantitative problems in Bioinformatics.

For Bioinformatics/Human Genetics students, learn a new set of tools and understand their strengths and limitations.

Page 17: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Enrolment/PTEsCourse over-subscribed.

Students on the waitlist will be enrolled in case some one drops.

I am not giving out PTEs until after problem set 0 submission date.

Expect several students will drop the course.

Last year, several students dropped late in the quarter.

Does require mathematical maturity (more later).

Page 18: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Enrolment/PTEsHope that all qualified students will be able to enroll.

Problem set 0 intended to mitigate late drops.

This is a course requirement.

Graded to assess your background but not part of final grade.

Be honest/realistic with yourself about your background.

Page 19: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Prerequisites

Some programming experience (R strongly encouraged)

Page 20: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

PrerequisitesMachine learning builds upon

Probability and Statistics

Linear algebra

Optimization

Page 21: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Math background reviewProblem set 0 intended to evaluate if you have the necessary background.

1. Minimum background section

2. Moderate background section

If you pass (2), in good shape.

If you pass (1) but not (2), you should expect to fill in the background needed ( we will also cover this material in class).

If you find (1) difficult, you should consider other classes to obtain the necessary background.

Page 22: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Course formatIntroduce important problems in Bioinformatics.

Use these problems to motivate the study of learning algorithms

Will study different aspects of learning algorithms

Statistical

Computational

Issues that arise in practice

Lectures will switch between studying the methodology and the applications

Page 23: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Course formatProblem sets (worth 50% each).

Will include programming and data analyses.

Homeworks are due at 11:59pm on due date.

Late submissions not accepted.

Will use gradescope to manage submissions and grading. Code for the class: M6PPWB.

All solutions must be clearly written or typed. Unreadable answers will not be graded. We encourage using LaTeX to typset answers.

For some assignments, it might be the case that only a subset of the problems will be graded in detail. Students may or may not be told which problems will be graded in advance.

Solutions will be graded on both correctness and clarity.

You are free to discuss homework problems. However, you must write up your own solutions. You must also acknowledge all collaborators.

Page 24: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Course formatExams

In-class midterm (20%) Date: November 7

In-class final (30%) Date: December 7

Exams are in class, closed-book and closed-notes and will cover material from the lectures and the problem sets.

No alternate or make-up exams will be administered, except for disability/medical reasons documented and communicated to the instructor prior to the exam date. In particular, exam dates and times cannot be changed to accommodate scheduling conflicts with other classes.

Problem set 0 will be graded but not count towards final grade. We will not grade any other homeworks/exams unless you submit problem set 0.

Page 25: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

ForumsPiazza

Must have already got an email.

Otherwise sign up at: https://piazza.com/ucla/fall2017/cm226/home

Strongly encourage students to post here rather than email course staff directly (you will get a faster response this way)

If you do need to contact the staff privately, Piazza allows you to do this.

Page 26: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

ForumsGradescope

Manage problem sets and exam submissions.

For problem sets, you will need to upload pdfs of your submission to gradescope.

Regrade requests must be made through gradescope within one week after the graded homeworks have been handed out, regardless of your attendance on that day and regardless of any intervening holidays such as Memorial Day.

We reserve the right to regrade all problems for a given regrade request.

Page 27: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Tentative class syllabusSee class website:

http://web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/html/index.html

Page 28: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Office hoursMy Office hours: Wednesday 11am-noon or by appointment@Boelter 4531D

TA: Gleb Kichaev

Office hours: TBD

Page 29: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Questions?

Page 30: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Goals for this class and the next

Introduction to genomics

Introduction to statistics

Page 31: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Crash-course in genomicsMolecular biology : How does the genome code for function?

Genetics: How is the genome passed on from parent to child ?

Genetic variation: How does the genome change when it is passed on ?

Population and evolutionary genetics: How does the genome vary across populations and species?

Genome sequencing: How do we read the genome ?

Page 32: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

OutlineMolecular biology : How does the genome code for function?

Genetics: How is the genome passed on from parent to child ?

Genetic variation: How does the genome change when it is passed on ?

What can we learn from genetic variation ?

Genome sequencing: How do we read the genome ?

Page 33: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Traits/Phenotype

Trait/phenotype: Any observable that is inherited

Height, eye color, disease status, cellular measurements, IQ

Instructions that modulate traits found in the genome

Galton et al. 1877

Page 34: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

How the genome codes for function?

Page 35: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

OutlineMolecular biology : How does the genome code for function?

Genetics: How is the genome passed on from parent to child ?

Genetic variation: How does the genome change when it is passed on ?

What can we learn from genetic variation ?

Genome sequencing: How do we read the genome ?

Page 36: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Correlation of traits across relatives

Galton et al. 1877

Page 37: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Typical human cell has 46 chromosomes

22 pairs of homologous chromosomes (autosomes)

1 pair of sex chromosomes

Genetics and inheritance

The chromosome painting collective

Page 38: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Genetics and inheritanceOne member of each pair of homologous chromosomes comes from the father (paternal) and the other from the mother (maternal)

In males, Y from father and X from mother

The chromosome painting collective

Page 39: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

OutlineMolecular biology : How does the genome code for function?

Genetics: How is the genome passed on from parent to child ?

Genetic variation: How does the genome change when it is passed on ?

What can we learn from genetic variation ?

Genome sequencing: How do we read the genome ?

Page 40: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Causes of genetic variationDNA not always inherited accurately

Mutations: changes in DNA

Changes at a single base (single nucleotide)

Can have more complex changes

Page 41: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

More definitions

Locus: position along the chromosome (could be a single base or longer).

Allele: set of variants at a locus

Genotype: sequence of alleles along the loci of an individual

Individual 1: (1,CT),(2,GG)

Individual 2: (1,TT), (2,GA)

A T C C T T A G G A

A T C T T T C A G A

Locus 1 Locus 2

A T C T T T C A G A

A T C T T T C A A A

Individual 1

Individual 2

Maternal

Paternal

Page 42: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Single Nucleotide Polymorphism (SNP)

Form the basis of most genetic analyses

Easy to study in high-throughput (million at a time)

Common (80 million SNPs discovered in 2500 individuals)

Two human chromosomes have a SNP every ~1000 bases

A T C C T T A G G A

A T C T T T C A G A

Locus 1 Locus 2

A T C T T T C A G A

A T C T T T C A A A

Individual 1

Individual 2

Maternal

Paternal

Page 43: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Genetic inheritanceSegregation (Mendel’s first law)

AA aa

A

A a

a

Aa

p(A) = 1 p(A) = 0

p(A) = 0.5

Generation 0

Gametes

Generation 1

Gametes

Page 44: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Segregation (Mendel’s first law)

AA X aa

Aa

AA X Aa

AaAA

Aa X Aa

AaAA aa

0.5 0.5

0.25 0.250.50

1.0

Generation 0

Generation 1

Generation 0

Generation 1

Genetic inheritance

Page 45: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Assortment (Mendel’s second law)

AaBb

AB aB abAb

0.25 0.25 0.25 0.25

Generation 0

Gametes

Locus 1 Locus 2

Genetic inheritance

Page 46: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Assortment (Mendel’s second law)

Not quite

AaBb

AB aB abAb

0.45 0.05 0.05 0.45

Generation 0

Gametes

Genetic inheritance

Page 47: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Assortment (Mendel’s second law)

Not quite. Recombination

AB aBab AbParental Recombinant

Genetic inheritance

Page 48: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Back to genetic inheritanceAssortment (Mendel’s second law)

Linkage: Positions nearby inherited together.

The chromosome painting collective

Page 49: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Back to genetic inheritanceMutation and recombination (among other forces that we will learn about later) produce genetic variation

Mutation produces differences

Recombination shuffles these differences

Page 50: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

OutlineMolecular biology : How does the genome code for function?

Genetics: How is the genome passed on from parent to child ?

Genetic variation: How does the genome change when it is passed on ?

What can we learn from genetic variation ?

Genome sequencing: How do we read the genome ?

Page 51: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What can we learn from genetic variation ?

Evolution and history

Biological function and disease

Page 52: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What can we learn from genetic variation ?

History of human populations learned from genetics

https://blog.23andme.com/23andme-and-you/genetics-101/a-beautiful-ancestry-painting/

Page 53: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

What can we learn from genetic variation ?

Evolution and history

Biological function and disease

Page 54: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Genome-wide Association Studies (GWAS)

Page 55: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

OutlineMolecular biology : How does the genome code for function?

Genetics: How is the genome passed on from parent to child ?

Genetic variation: How does the genome change when it is passed on ?

What can we learn from genetic variation ?

Genome sequencing: How do we read the genome ?

Page 56: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Reading our genomesGoal: Determine the sequence of bases along each chromosome

Details depend on the technology

Computationally hard

Page 57: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

The human genome projectGoals

Sequence an accurate reference human genome

Draft published in 2001

High-quality version completed in 2003

Cost: ~$3 billion.

Time: ~13 years.

Two competing groups (public and private)

Page 58: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

The human genome projectMajor findings

Fewer genes than previously thought (~20K)

Pertea and Salzberg, Genome Biology 2010

Page 59: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

The human genome project

Other outcomes

International collaborations

Power of computing

Page 60: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Reading the genomeHuman genome provides a reference

Humans share most of their genome (~99.9% )

Can focus on reading the differences

Simpler problem

Page 61: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Reading the genome

High-throughput genotyping

Hybridization of DNA molecules

Nucleotides bind to their complementary bases

A=T, C=G

Can be used to get the genotype at a chosen set of SNPs

A T C C T T A G G A

A T C T T T C A G A

Page 62: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Maps of genetic variation

Goals: Describe common patterns of genetic variation in human populations

Phase 1: Genotyped ~1 million SNPs from 270 individuals in 4 populations. Captured most SNPs with a frequency of >5%.

Phase 3: 7 additional populations included

All data publicly available.

International HapMap Project

Page 63: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Reading the genomeLimitations of genotyping

Can only read SNPs that are on the chip

Biased by how these SNPs are chosen (e.g. common SNPs)

Page 64: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Reading the genomeTechnologies: Illumina, IonTorrent, PacBio

Can read only small pieces of the genome (~100bp)

But can read hundreds of thousands of fragments in parallel

Use the reference human genome to find the locations of the reads (and to infer mutations)

High-throughput (or next-gen) sequencing

Page 65: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Reading the genomeCost of genome sequencing

Page 66: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Maps of genetic variation

2500 individuals from 26 populations

Discover ~90 million SNPs

Includes >99% of SNPs with frequency >1%

All data publicly available

1000 Genomes Project

Page 67: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Maps of genetic variation

Many more such efforts underway

Example: Simons Genome Diversity Project : 260 genomes from 127 populations

Also publicly available

Page 68: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Other interesting data

EXAC data: Exomes from ~60,000 individuals

Also publicly available

UK Biobank: 500,000 individuals with 200 phenotypes

Not publicly available

Page 69: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Computational and statistical problems

Recurring theme:

How to better model a biological problem?

Page 70: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Computational and statistical problems

Recurring theme:

How to scale up our favorite algorithms to massive datasets?

Hundreds of thousands of individuals at hundreds of millions of SNPs measured for thousands of phenotypes

Page 71: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Computational and statistical problems

Recurring theme:

Tradeoffs

Computational efficiency

Statistical accuracy

Constraints from technology

Constraints from the data generation. e.g. privacy

Page 72: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Computational and statistical problems

Learning the genotype-phenotype map

Genome-wide association studies

Inferring genetic architecture

Correcting for confounding

Multiple hypothesis testing

Risk prediction

Causality

Population genomics

Inference of human history

Admixture inference

Page 73: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

This classOverview of the biological questions

Basic concepts in genetics

Genomes are inherited according to well-known rules (Mendel’s laws)

Genomes change

Genetic variation forms the starting point for inference.

Possible inferences: history, disease risk and many more

Advances in technology are allowing us to read many more genomes

Page 74: Introduction to genomics CM226: Machine Learning for ...web.cs.ucla.edu/~sriram/courses/cm226.fall-2017/slides/intro.pdf · Is the disease genetic or environmental ? ... next Introduction

Next classBasic concepts in statistics

Given data, how do we perform inference ?

What are the set of inferential tasks?

How do we derive procedures for these tasks ?

How do we evaluate these procedures ?