Big Data in Biology & Healthcare

Ewan Birney

Director, EMBL-EBI

www.ebi.ac.uk

Post on 17-Jul-2020

What is EMBL-EBI?

• Europe’s home for biological data services, research and training

• A trusted data provider for the life sciences

• 200 Petabytes of storage (0.2 exabytes)

• >40,000 CPU Cores

• Part of EMBL, an intergovernmental research organisation

• International: 600 members of staff from 60 nations

• Home of the ELIXIR Technical hub.

See the live map at www.ebi.ac.uk/about/our-impact

Global reference data

We have been living through a revolution.

One genome 2003 to 2017

The cost of sequencing a genome in 2017

The cost of sequencing a genome in 2003

$100 Genome within the next 5 years (likely 3 years)

Real-time genomics in the field

Measuring DNA, RNA, protein…

(Note: I am a long-term, paid consultant to Oxford Nanopore)

Medical Genomics

Sequencing is now “cheap enough”

• Between $200 and $300 per exome, and $800 to $1,000 per whole genome

• Line of sight to $100 genome

• Quoted by Illumina, contenders emerging, steady progress.

• More costs now in consent, DNA sample acquisition (storage and standard analysis low-ish, but not 0!)

• All-in costs at or below “routine” medical diagnostics, e.g. MRI scans

Clinical Utility is present: Rare disease

• Consistent 20-30% yield of diagnosis for suspected rare diseases

• Diagnosis ends the “diagnostic odyssey” for patients, which is painful, emotionally draining and costly to the healthcare service

• Opens up reproductive choices for the parents

• Like for like study in Australia

• 5-fold more diagnoses at 1/3 of the cost of the previous standard of care!

• Roll out in Denmark, Finland, France, UK

Clinical Utility is present: Cancer

• (Cancer logistics harder: sample acquisition and DNA extraction harder to standardise; timelines far shorter)

• In umbrella + basket trials, 1 in 10 patients have treatment-changing information from cancer genomic information

• Often being deployed in aggressive, “any option” metastasis scenarios

• Broader molecular phenotyping via genomics showing promise

• Signatures of homologous recombination repair (BRCA1/BRCA2) defects far broader than suspected from germline associations

• Age of key mutations becoming more obvious

Cohorts and Medical genomics

Medical Genomes

[Map legend]

• Countries with active national medical genome projects

• Countries with some activity of medical genomics

• Countries planning medical genome projects

[Countries labelled: USA, Brazil, Canada, Iceland, South Korea, Japan, China, Finland, Australia, India, Spain, UK, Ireland, Estonia, Saudi Arabia, Turkey, France, Mexico, Sweden, Norway, Taiwan]

Cohorts

[Map legend]

• National cohorts > 100k genotyped or sequenced at least 25k

• National cohorts > 100k people active collection now

• Planning national cohorts > 100k

[Countries and initiatives labelled: H3Africa, South Africa, Malaysia, Singapore, Iran, Israel, Austria, Switzerland, Germany, Netherlands, Denmark, Jordan, Kuwait, Qatar, U.A.E, Scotland]

Big numbers!

Genomics: from research to healthcare

Research

• English language

• Light-weight legal

• Similar systems

• Open data

• Publications

• Grant funding

Practicing Medicine

• National language

• Heavy legal framework

• Different systems

• Closed data

• Not published

• Contract funding

Bridges need at least two anchors

Global standards: the GA4GH

• GA4GH is THE standards-setting body for genomics and healthcare

• Embraces federated approach

• Setting community standards early

• Cloud: Analysis carried out where the data ‘lives’

• “You’re already using it!”: SAM/BAM/CRAM/VCF formats

• Tools: htsget – the first step away from file-based access

• Rare disease diagnoses: Matchmaker Exchange

• Federated discovery: GA4GH Beacons
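
The federated discovery idea behind Beacons can be sketched as a toy, in-memory model: each site answers only yes or no to the question "have you observed this allele?", without exposing individual records. This is a minimal sketch, not the GA4GH Beacon API. The names `ToyBeacon` and `query_network` and all variant data below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# A variant is identified here by (chromosome, position, reference, alternate).
Variant = Tuple[str, int, str, str]


@dataclass
class ToyBeacon:
    """Hypothetical stand-in for one site; real Beacons are web services."""

    name: str
    variants: Set[Variant] = field(default_factory=set)

    def has_allele(self, variant: Variant) -> bool:
        # Only a yes/no answer leaves the site, never the underlying records.
        return variant in self.variants


def query_network(beacons: List[ToyBeacon], variant: Variant) -> Dict[str, bool]:
    """Ask every site the same question and collect the yes/no answers."""
    return {beacon.name: beacon.has_allele(variant) for beacon in beacons}


# Made-up example data, for illustration only.
site_a = ToyBeacon("site_a", {("1", 1000, "A", "G"), ("2", 5000, "C", "T")})
site_b = ToyBeacon("site_b", {("2", 5000, "C", "T")})

answers = query_network([site_a, site_b], ("1", 1000, "A", "G"))
print(answers)
```

The design point is that the query travels to the data: a researcher learns which sites are worth approaching for access without any site releasing genotype-level data.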

Federation

Open research data

• Aggregate data globally

• Download, analyse locally

Healthcare data with research use

• Analyse data locally (via VMs)

• Collate analyses
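
The "analyse locally, then collate" pattern can be illustrated with a small sketch, assuming each site is willing to share only summary counts. The function names (`local_allele_counts`, `collate`) and the genotype data are hypothetical; a real deployment would run the local step inside each site's own environment.

```python
from typing import Dict, List, Tuple


def local_allele_counts(genotypes: List[int]) -> Tuple[int, int]:
    """Runs where the data lives.

    Each genotype is the number of alternate alleles a person carries (0, 1 or 2).
    Returns (alternate alleles observed, total alleles), a summary with no
    per-person information.
    """
    return sum(genotypes), 2 * len(genotypes)


def collate(summaries: List[Tuple[int, int]]) -> float:
    """Runs centrally: pools the per-site summaries into one allele frequency."""
    alt_total = sum(alt for alt, _ in summaries)
    allele_total = sum(total for _, total in summaries)
    if allele_total == 0:
        raise ValueError("no alleles reported by any site")
    return alt_total / allele_total


# Made-up genotypes held at two sites; these never leave the site.
site_data: Dict[str, List[int]] = {
    "site_a": [0, 1, 2, 0],
    "site_b": [1, 1, 0, 0, 0, 0],
}

# Only the summaries travel to the coordinator.
summaries = [local_allele_counts(genotypes) for genotypes in site_data.values()]
pooled_frequency = collate(summaries)
print(pooled_frequency)
```

Pooling counts rather than averaging per-site frequencies weights each site by its sample size, which is what downloading all the data and analysing it in one place would have produced.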

• Clinical & Phenotypic (C & P)

• Cloud

• Discovery (Discov)

• Data Security (Secur)

• Data Use & Researcher IDs (DURI)

• Genomic Knowledge Standards (GKS)

• Large-Scale Genomics (LSG)

• Regulatory & Ethics (R & E)

1. Pheno ontology recommendations

2. Info models for clin data exchange

3. Implementing pheno standards

4. Test bed & interoperability demo

5. TES

6. TRS

7. WES

8. DOS

9. Beacon

10. Search

11. Service registry

12. Variant submission

13. IoG

14. Breach response

15. AAI

16. Researcher ID & Bona Fide status

17. DUO

18. Variant Annotation

19. Variant Representation

20. htsget streaming API

21. Reference sequence retrieval API

22. Read file formats

23. Genetic variation file formats

24. RNASeq expression matrix

25. Return of results policy

26. Participant values survey

27. Code of conduct for data sharing

28. Cloud access policy

[Slide table: each numbered item is tagged with one or more of the groups listed above; the tag assignments are not recoverable from the transcript.]

Europe’s opportunity

• Strengths/Opportunities

• Public Healthcare systems

• Strong genomics

• Strong public health delivery

• Strong infrastructure

• Transnational requirement

• Weaknesses/Threats

• Less IT depth in some healthcare systems

• Fragmentation of skills

• AI / Big Data capacity (skills + capital)

• Transnational complexity

EMBL-EBI, ELIXIR and GA4GH

• EMBL-EBI is the world’s leading bioinformatics infrastructure provider

• Human Reference Genome, Annotation, Transcription, Proteomics, Structure, Pathways and Literature

• ELIXIR is Europe’s transnational coordination of bioinformatics infrastructure

• 23 European countries + EMBL-EBI

• Human data community

• GA4GH is the global standards setting organisation in human genomics

• ELIXIR and GA4GH have a strategic partnership

Humans: a new model organism

Humans are…

• Similar to most other life forms on Earth

• Outbred organisms with pretty good genetics

• Huge cohorts – millions of people

• Big (lots and lots of cells)

• Willing participants – they take themselves to hospitals to be phenotyped

• Popular organisms – research into them attracts a lot of funding

• …A great model organism for understanding biology, including human disease!

Trabeculation

UK Biobank: 500,000 healthy UK citizens, consistently phenotyped and genotyped (will be full genome sequence)

100,000 will be MRI imaged (head including fMRI, chest including cardiac MRI)

Fractal dimension trabeculation

Co-registration

Meta-analysis

[Figure labels: loci GOSR2, TTN, TNNT2, SLC35F1; traits systolic BP, heart phenotypes, DCM, pulse rate]

Many loci also show changes in QRS

Some loci have “other heart conditions” ICD-10 codes

Meta analysis

Replication in 1,200 other healthy Brits


Thanks

Hannah Meyer, EBI

Declan O’Regan, LMS, MRC

Thank you!

Follow me on twitter: @ewanbirney

I blog regularly (Google Ewan Birney)

2/14/2019

Imaging: new technologies change the game

• EM tomography

• Atomic-scale models from EM

• Super-resolution light microscopy

• High-resolution MRI and CT

• Light sheet microscopy

Huge impact on biological research

Tools for the wet lab

Tools for the dry lab

‘White-collar’ and ‘blue-collar’ problems

• Ground-breaking ideas: innovative, interesting, blue-skies thinking

• Making them work: tools and data management; necessary, less glamorous

Life science: many data types

Genes, genomes & variation

Gene, protein & metabolite expression

Protein sequences, families & motifs

Macromolecular structures

Interactions, reactions & pathways

Chemogenomics & metabolomics

Phenotypes

Data resources at EMBL-EBI

Literature & ontologies

• Experimental Factor Ontology

• Gene Ontology

• BioStudies

• Europe PMC

Chemical biology

• ChEBI

• ChEMBL

• SureChEMBL

Molecular structures

• Protein Data Bank in Europe

• Electron Microscopy Data Bank

Gene, protein & metabolite expression

• Expression Atlas

• MetaboLights

• PRIDE

• RNAcentral

Protein sequences, families & motifs

• InterPro

• Pfam

• UniProt

Genes, genomes & variation

• Ensembl

• Ensembl Genomes

• GWAS Catalog

• Metagenomics portal

Systems

• BioModels

• BioSamples

• Enzyme Portal

• IntAct

• Reactome

Molecular Archives

• European Nucleotide Archive

• European Variation Archive

• European Genome-phenome Archive

• ArrayExpress

~410 people

Worldwide collaborations

Data Growth

Doubling time ~16 months

Doubling time ~6 months
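
To get a feel for what these doubling times imply, a quantity with doubling time T grows by a factor of 2^(t/T) over a period t. The sketch below applies that formula to the two figures on the slide; the 48-month horizon is an arbitrary choice for illustration, and the transcript does not say which data types each doubling time refers to.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Factor by which a quantity grows over `months` under steady exponential
    growth with the given doubling time: 2 ** (months / doubling_time)."""
    if doubling_time_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (months / doubling_time_months)


HORIZON_MONTHS = 48  # illustrative horizon, not taken from the slides

for doubling_time in (16, 6):
    factor = growth_factor(HORIZON_MONTHS, doubling_time)
    print(f"doubling every {doubling_time} months: x{factor:g} in {HORIZON_MONTHS} months")
```

Over four years, a 16-month doubling time gives an 8-fold increase, while a 6-month doubling time gives a 256-fold increase, which is why the faster-growing data types dominate storage planning.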