graduate school bioinformatics sequence analysis...

Post on 05-Jul-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Graduate SchoolBioinformatics Sequence Analysis

Introduction

Barbera van Schaik

Bioinformatics Laboratory, KEBBAcademic Medical Center

b.d.vanschaik@amsterdamumc.nl

March 9, 2020

1 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Related Graduate School courses

• DNA technology

• Unix

• Computing in R

• Practical biostatistics

• Advanced biostatistics

• Bioinformatics

• Bioinformatics Sequence Analysis

• Research Data Managementhttps://www.amc.nl/web/leren/graduate-school.htm

2 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

In this course

Bioinformatics Sequence Analysis

You will learn what is behind commonly used methods forsequence analysis, how to analyze datasets with(reasonably) user-friendly interfaces, and get introduced tocommand-line tools for next generation sequencing (NGS)

3 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Not in this course

1 Sequence assembly

2 Bisulphite sequencing

3 Protein sequence analysis

4 Metagenomics

4 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Bioinformatics Sequence Analysis

1 Introduction to sequence analysis

2 Sequencing techniques

3 Brief introduction Linux and R (self study)

4 NGS pre-processing

5 (Multiple) sequence alignment

6 Case: Neuroblastoma

7 Introduction to R2

8 Exome sequence analysis

9 RNAseq

The focus is on human data, but many techniques are alsoapplicable to other organisms

5 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Practical things

Certificate

• Attend all sessions (one day can be skipped, ask forpossibility for self-study)

• Active participation

Other things

• Lunch is not included

• Coffee is available at the machines with your AMC card

• Slides and exercises are published onhttps://bioinformatics.amc.nl/

6 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

In this hour

IntroductionYou will get an indication about the scale of sequence data,how to handle the data, where to find publicly availabledata and tools, and what can be done with NGS

7 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Overview

1 Welcome

2 Scale of sequence dataDNA sequencingGenome projects

3 Bioinformatics databases and toolsDatabasesSequence analysis

4 Handling sequence dataComputingApplication areas

8 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Sanger

9 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Automated sequencing

10 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Sequencing centers

11 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Next generation sequencing

12 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Genome projects

• HGP

• 1000g

• UK10K >100K genomes

• Personal genomes

13 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Human Genome Project

http://web.ornl.gov/sci/techresources/Human_Genome/index.shtml

14 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Human Genome Project

http://web.ornl.gov/sci/techresources/Human_Genome/index.shtml

15 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

1000 genomes project

http://www.1000genomes.org/

16 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

UK10K

4000 genomes6000 exomeshttp://www.uk10k.org/

17 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

The 100K genomes project

The project will focus onpatients with a rare disease andtheir families and patients withcancer. The first samples forsequencing are being takenfrom patients living in Englandwith discussions taking placewith Scotland, Wales andNorthern Ireland aboutpotential future involvement.http://www.genomicsengland.co.uk/

18 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Personal genomes

100,000 genomes plus medical recordshttp://www.personalgenomes.org/

19 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Sequencers around the world

http://omicsmaps.com/

20 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Sequencers around the world 2015

http://omicsmaps.com/

21 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Big data

22 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

DNA sequencing rate

Stephens et al. (2015) PLoS One

23 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

GenBank, EMBL and DDBJ

International Nucleotide Sequence Database CollaborationDaily exchange of sequence data

https://www.ncbi.nlm.nih.gov/

https://www.ebi.ac.uk/

http://www.ddbj.nig.ac.jp/

24 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Nucleotide sequence databases

From: http://www.davelunt.net/

25 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

GenBank

Release 236 (Feb 2020)has 399,376,854,872 base pairs from 216,214,215sequences. In addition, there are 1,206,720,688 WGSrecords containing 6,968,991,265,752 base pairs ofsequence data.

https://www.ncbi.nlm.nih.gov/genbank/statistics/

GenBank has doubled approximately every 18 months

26 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Core databases and derivatives

Nucleotide sequence databases

Core: RNA, DNA

Genbank

EMBL

DDBJ

RNA grouped per gene UniGene

Genome assemblies

Human

Model organisms

Bacteria

Plants

Etc

Genome comparisons Conserved regions

DNA motifs

Protein binding sites

Conserved regions

DNA structure

Restriction sites

Gene expressionExpressed Sequence Tags (ESTs)

RNAseq

Variants

SNPs, insertions and deletions

Structural variants

Allele databases

Specialized databases

Gene specific

Disease specific

Genome projects

MetagenomicsMicrobiome

Environment samples

Protein translations

27 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Where to start?

https://www.oxfordjournals.org/nar/database/c/

28 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Sequence analysis

Sequence alignment

• Needleman-Wunsch

• Smith-Waterman

• BLAST

• BLAT

• ClustalW

• BWA, BFAST, Bowtie, Tophat, etc, etc

Sequence suites/packages

• Emboss package

• CLCbio workbench

• Galaxy

• R Bioconductor

Hundreds of tools to analyse sequence data...

29 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Tools

https://academic.oup.com/nar/article/47/W1/W1/5524725

30 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Tools

Most tools are only available via the command-line (on linuxsystems)

31 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Open source

Free as in freedomYou can use, change, integrate, and review the codeOpen source allows sharing and promotes collaborationNo vendor lock-in

32 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Open source

• Software

• Databases

• Journals

• Standards

• Hardware

• Art

• Money

• Drinks

• Medicine

• Fashion

• Educationhttps://en.wikipedia.org/wiki/Open_source

33 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Handling sequence data

34 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Buy a bigger cluster (centralizedmodel)

35 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Dutch life science grid

http://surfsara.nl/

36 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Cloud computing

37 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

HPC cloud at SurfSara

You will use a linux environment that runs on the HPC cloudto get acquainted with command-line tools

38 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

NGS application areas

39 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Whole genomes

• De novo sequencing

• Re-sequencing

• Copy number variations

• Rearrangements

• New insertions/deletions/mutations

40 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Structural variation

The Human Genome Structural Variation Working Group, Nature 2007

41 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

SNP / haplotype analysis

Linkage studiesForensic research

42 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Gene expression

https://en.wikipedia.org/wiki/Regulation_

of_gene_expression

• Full-length transcripts

• EST sequencing

• 5’ transcript ends(5’-RATE, CAGE)

• SAGE ditag sequencing

• SAGE-like 3’ endsequencing

• Nebulized fragments

• ncRNA sequencing

43 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Epigenetics

Treatment with sodium bisulfiteUnmethylated cytosines change into uracilMethylated cytosines are unchangedCompare sequences with reference sequence

44 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Metagenomics and microbialdiversity

Study genomic content in acomplex mixture ofmicroorganisms(bacteria or viruses in someenvironment)Identify new species

45 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Paleogenomics

Sequencing ofancient DNAMummiesSabretoothMammothNeanderthal

46 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Gene regulation

47 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Sequence analysis

Usually starts with sequence alignment or sequence assemblyDepending on the application other tools/methods are used ordeveloped

48 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

With a click of a button...

.. or perhaps not. You will find out during this course.Computer exercises sequence analysis:

1 Via web tools

2 Creating pipelines online

3 With command-line tools in a Linux environment

49 / 50

Introduction

Barbera vanSchaik

Welcome

Scale ofsequence data

DNA sequencing

Genome projects

Bioinformaticsdatabases andtools

Databases

Sequenceanalysis

Handlingsequence data

Computing

Applicationareas

Bioinformatics Sequence Analysis

1 Introduction to sequence analysis

2 Sequencing techniques

3 Brief introduction Linux and R (self study)

4 NGS pre-processing

5 (Multiple) sequence alignment

6 Case: Neuroblastoma

7 Introduction to R2

8 Exome sequence analysis

9 RNAseq

50 / 50

top related