introduction to analysis of genomic data using r lecture 1...

27
- - : Introduction to Analysis of Genomic Data Using R Lecture 1: Review Basic Terminology of Genetics Dr. Yen-Yi Ho ([email protected]) Jan 17, 2018 1/27

Upload: others

Post on 05-Jun-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

- - :

Introduction to Analysis of Genomic Data Using RLecture 1: Review Basic Terminology of Genetics

Dr. Yen-Yi Ho ([email protected])

Jan 17, 2018

1/27

- - :

Logistics

Lectures W F& Labs: 3:55p to 5:10pOffice Hours : Yen-Yi WF 5:10p-6:10p, Thursday 4-5p or by appointmentTextbook: Gondro (2015): Primer to Analysis of Genomic Data

Using RKrijnen (2009) Applied Statistics for BioinformaticsUsing RCoghlan (2017) A Little Book of R for BioinformaticsBioconductor Case StudiesJohn Verzani’s SimpleR notes

Website: http://people.stat.sc.edu/hoyen/Biol599/Biol599.html

2/27

- - :

Goals for the Course

• Basic knowledge of R

• Basics of statistics for genomic data analysis

• Basics of genetic data analyses using R/Bioconductor

• Interpreting results and simple diagnoses

3/27

- - :

I Grade: For graduate student, 70% for homework and quizzes,30% for final project (no exam).

I Grade: For undergraduate honor student, 80% for homeworkand quizzes, 20% for final project (no exam).

I Homework: there will be weekly homework graded over thecourse. (6-9 homework assignments.)

I Please hand in a hard copy of your Homework and R code inclass and also email your R code to [email protected].

I Late homework will lose 30% of total points per day, unlessarrangements have been made with the instructor for anextension. Homework will not be accepted after the time atwhich graded homework are returned.

4/27

- - :

Objectives of Lecture 1

I Biology in a nutshellI Central dogma of molecular biologyI Chromosomes, genes, DNA, RNA, and proteinsI Gene expressionI Genetic variationI Mutations

I Technologies for Genome Analysis

5/27

- - :

Mendelian Genetics (1866)

Segregation of alleles in the production of sex cells1. the principle of segregation2. the principle of independent assortment

6/27

- - :

Mendelian Genetics Translates to Modern Genetics

I A parent contributes only a single chromosome within a pairto the offspring.

I A fixed location on a chromosome pair is called a locus, andonly those loci coding (for proteins or functional RNA) aretypically called genes.

I An allele is the state or type of genetic info at a locus on asingle chromosome. Thus there are two alleles at each locusin an individual (for autosomes, and for sex chromosomes infemales).

7/27

- - :

I Example: A particular disease locus has two possible alleletypes in the population: d (the disease allele) and D (normal).

I Genotype: the joint (unordered) state of the two alleles.Could be dd, DD (called homozygous genotypes), or Dd (heterozygous genotype).

I Alleles that are common in the population are often calledwild type while disease alleles are called mutant.

I Phenotype: an observed trait we care about, such as diseasestatus, etc.

8/27

- - :

Central Dogma of Biology: Classic ViewI Francis Crick (1970) Nature: The central dogma of molecular

biology deals with the detailed residue-by-residue transfer ofsequential information. It states that such information cannot betransferred from protein to either protein or nucleic acid.

9/27

- - :

DNA (DeoxyriboNucleic Acid)

I A molecule contains the genetic instruction for all knownliving organisms and some viruses.

I Resides in the cell nucleus, where DNA is organized into longstructures called chromosomes.

I Most DNA molecule consists of two long polymers (strands),where two stands entwine in the shape of a double helix.

I Each stand is a chain of simple units (bases), callednucleotides: A, C, G, T.

I The bases from two stands are complementary by basepairing: A-T, C-G.

10/27

- - :

DNA sequence

I The order of occurrence of the bases in a DNA molecule iscalled the sequence of the DNA. The DNA sequence isusually stored in a big text file:

I Some interesting facts:I Total length of the human DNA is 3 billion bases.I Difference in DNA sequencce between two individuals is less

than 1%.I Human and chimpanzee have 96% of the sequences identical.

Human and mouse: 70%.

11/27

- - :

Gene

I A locatable region of genomic sequence, corresponding to aunit of inheritance, which is associated with regulatory regions,transcribed regions, and or other functional sequence regions.

I Or simply, a piece of “useful” DNA sequence.

12/27

- - :

In a nutshell (for statistician)

I enhancer: a region for enhancing gene expression. Notnecessarily closes to the gene.

I promoter: at the beginning of the gene, helps transcription.

I exons: the “useful” part of the gene, will appear in themRNA product.

I introns: the “spacer” between exons, will NOT be in themRNA product.

I splicing: the process to remove introns and join exons.

I alternative splicing: different splicing patterns for the samepre-mRNA. For example, mRNA could be from exons 1 and 2or exons 1 and 3. Those are different transcripts of the samegene.

13/27

- - :

Gene structure and splicing

14/27

- - :

RNA (RiboNucleic Acid)I Similar to DNA but,

I RNA is usually single-stranded.I The base U is used in place of T.I The backbone is different.

I Many different types: mRNA, tRNA, rRNA, miRNA, snRNA,etc.

15/27

- - :

Protein

I The final project of geneexpression process,workhorses in the cells.

I A chain of amino acid.

I Every 3 nucleotide istranslated into one aminoacid during translation.

I There are 20 types ofamino acids, so a proteincan be thought as a stringfrom a 20 characteralphabet.

I 3D protein structure isoften important for itsfunction.

16/27

- - :

Gene Expression

Gene expression is a term that is used to describe the entireprocess of translation and transcription of a gene. Gene expressionis a highly specific process. Only a small fraction of the genes areexpressed, or turned ”on,” in any particular type of cell.

gene expression in different tissues gene expression in the same tissue,

but different points in time

17/27

- - :

Epigenetics

Non-DNA sequence related, heritable mechanisms to control geneexpressions. Example: DNA methylation, histone modifications.

18/27

- - :

Putting it all together

source:

http://www.nobelprize.org/educational/medicine/dna/index.html

I DNA sequence:Info on chromosome isstatic, and essentially thesame across cells withinthe individual

I mRNA:Not as relevant as protein,but easier to quantify

I Protein:Difficult to quantifyglobally, though veryrelevant

19/27

- - :

Source of Variation

20/27

- - :

Environment Vs. Gene

Any two individuals are 99.9% identical in their DNA

21/27

- - :

Genetic Variations (Polymorphisms)

That 0.1 % is very important in defining our differences

• single nucleotide polymorphisms(SNPs, every 300 nucleotide onaverage)

• small-scale mutation, insertions,deletions

• copy number variations(AAGAAGAAGAAG)

source: http://ghr.nlm.nih.gov/handbook/genomicresearch/snp

22/27

- - :

Mutations

23/27

- - :

Genome Analysis Technologies

1. DNA

• Microarrays:SNP, Copy numbervariation (CNV),Methylation, Chip-chip

• DNA sequencing:SNP, Insertion,Deletion, Mutation,CNV, Methylation,Chip-seq

2. mRNA

• Microarrays• RNA sequencing

3. Protein

• 2-D electrophoresis• Maldi-Tof mass spec

24/27

- - :

General Steps in Obtaining Gene Expression Data

25/27

- - :

General Steps in Next-Generation Sequencing

26/27

- - :

Next Lecture

I Review basic terminology of population geneticsI Crossing OverI DNA RecombinationI Genetic MarkersI Genetic Association Analysis

27/27