biocuration activities for the international cancer genome consortium (icgc)

Post on 12-Jul-2015

Biocuration activities for the International Cancer Genome Consortium (ICGC).

December 4th 2014

B.F. Francis Ouellette francis@oicr.on.ca

• Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON

• Associate Professor, Department of Cell and Systems Biology,

University of Toronto, Toronto, ON.

@bffo on Twitter

2

You are free to:

Copy, share, adapt, or re-mix;

Photograph, film, or broadcast;

Blog, live-blog, or post video of;

This presentation. Provided that:

You attribute the work to its author and respect the rights

and licenses associated with its components.

Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.

Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at;

http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites

3

• Cancer

• Data sharing

• Biocuration

• Access

• Relevance

• Making it better

4

Cancer: A Disease of the Genome

Challenge in Treating Cancer:

Every tumor is different

Every cancer patient is different

5

Johns Hopkins

> 18,000 genes analyzed for mutations

11 breast and 11 colon tumors

L.D. Wood et al, Science, Oct. 2007

Wellcome Trust Sanger Institute

518 genes analyzed for mutations

210 tumors of various types

C. Greenman et al, Nature, Mar. 2007

TCGA (NIH)

Multiple technologies

brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma).

F.S. Collins & A.D. Barker, Sci. Am, Mar. 2007

Large-Scale Studies of Cancer Genomes

6

Heterogeneity within and across tumor types

High rate of abnormalities (driver vs

passenger)

Sample quality matters

Consent and controlled data access is

complicated

Lessons learned

7

International Cancer Genome Consortium

• Collect ~500 tumour/normal pairs from each of 50 different major

cancer types;

• Comprehensive genome analysis of each T/N pair:

– Genome

– Transcriptome

– Methylome

– Clinical data

• Make the data available to the research community & public.

Identify genome changes

…GATTATTCCAGGTAT… …GATTATTGCAGGTAT… …GATTATTGCAGGTAT…
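The idea on this slide, spotting where a tumour sequence departs from the matched normal, can be sketched in a few lines of Python. This is an illustration only: the slide does not say which fragment is tumour and which is normal, so the labels below are assumptions, and real pipelines call variants from aligned reads rather than from two short strings.

```python
def find_substitutions(normal: str, tumour: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, normal base, tumour base) for each mismatch."""
    if len(normal) != len(tumour):
        raise ValueError("sequences must be the same length")
    return [(i, n, t) for i, (n, t) in enumerate(zip(normal, tumour)) if n != t]


# Fragments taken from the slide; the normal/tumour labels are assumed.
normal_seq = "GATTATTGCAGGTAT"
tumour_seq = "GATTATTCCAGGTAT"

changes = find_substitutions(normal_seq, tumour_seq)
print(changes)  # [(7, 'G', 'C')]
```

A single substitution is the simplest case; insertions, deletions and rearrangements need an alignment step first.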

8

Rationale for the ICGC

• The scope is huge, such that no country can do it all.

• Coordinated cancer genome initiatives will reduce

duplication of effort for common and easy to acquire

tumor samples and ensure complete studies for many

less frequent forms of cancer.

• Standardization and uniform quality measures across

studies will enable the merging of datasets, increasing

power to detect additional targets.

• The spectrum of many cancers varies across the

world for many tumor types, because of environmental,

genetic and other causes.

• The ICGC will accelerate the dissemination of genomic

and analytical methods across participating sites, and

the user community

9

International Cancer Genome Consortium (ICGC) Goals

• Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe

• Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors

• Make the data available to research community rapidly with minimal restrictions to accelerate research into the causes and control of cancer

50 tumor types and/or subtypes

500 tumors + 500 controls per subtype

50,000 Human Genome Projects!

Nature (2010) 464:993

10

ICGC

Goals, Structure,

Policies & Guidelines

http://goo.gl/sPGLQN

11

Primary Goal: coordinate efforts to

reach goals (50 tumours)

12

http://docs.icgc.org/dcc-data-element-specifications

13

Primary Goal: be comprehensive

http://goo.gl/BE7KH1

14

Analysis Data Types

• Germline variants (SNPs)

• Simple Somatic Mutations (SSM)

• Copy Number Alterations (CNA)

• Structural Variants (SV)

• Gene Expression (micro-arrays and RNASeq)

• miRNA Expression (RNASeq)

• Epigenomics (Arrays and Methylation)

• Splicing Variation (RNASeq)

• Protein Expression (Arrays)

15

Primary Goal: generate highest quality

http://goo.gl/FXCvi9

16

17

Primary Goal: available to all

18

Primary Goal: available to all

19

ICGC Controlled Access Datasets

• Detailed Phenotype and Outcome data

Region of residence

Risk factors

Examination

Surgery

Radiation

Sample

Slide

Specific histological features

Analyte

Aliquot

Donor notes

• Gene Expression (probe-level data)

• Raw genotype calls

• Gene-sample identifier links

• Genome sequence files

ICGC OA Datasets

• Cancer Pathology

Histologic type or subtype

Histologic nuclear grade

• Patient/Person

Gender, Age range,

Vital status, Survival time

Relapse type, Status at follow-up

• Gene Expression (normalized)

• DNA methylation

• Computed Copy Number and Loss of Heterozygosity

• Newly discovered somatic variants

http://goo.gl/w4mrV

20

Secondary Goal: coordinate

work to benefit productivity

http://goo.gl/K5mHC3

21

https://icgc.org/icgc/committees-and-working-groups

22

Secondary Goal: disseminate knowledge

http://goo.gl/ObcZXy

23

ICGC

Goals, Structure,

Policies & Guidelines

http://goo.gl/sPGLQN

24

Policy

ICGC membership implies compliance with Core

Bioethical Elements for samples used in ICGC

Cancer Projects:

http://goo.gl/TFrCmK

http://goo.gl/nYx6YG

25

POLICY:

The members of the International Cancer Genomics

Consortium (ICGC) are committed to the principle of

rapid data release to the scientific community.

http://goo.gl/TFrCmK

26

Publication Policy

• The individual research groups in

the ICGC are free to publish the

results of their own efforts in

independent publications at any

time (subject, of course, to any

policies of any collaborations in

which they may be participating).

27

Moratorium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy

28

Publication Policy

29

Where do you find that information?

• We actually make it hard to find, but we are

working on that! (this is an example of where ICGC

would like to do what TCGA does!)

• http://cancergenome.nih.gov/publications/publicationguidelines

30

Where do you find that information?

For ICGC data:

• Need to find the policy!

• http://icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy

• Find text:

• Find date: in README on FTP file

• This is bad, we know it, and we are fixing it!

• In doubt, contact us: info@icgc.org

31

Policy on Intellectual Property

• All ICGC members agree not to make claims to

possible IP derived from primary data (including

somatic mutations) and to not pursue IP

protections that would prevent or block access to

or use of any element of ICGC data or conclusions

drawn directly from those data.

http://goo.gl/TCMXCl

32

ICGC Map – May 2014: 72 projects launched

33

OICR and the ICGC

34

DCC Activities

DCC activities are split between two groups:

• Software Development

– DCC portal

– Submission tool

• Biocuration (which also includes Content

Management)

– Data level management

– Submitter “handling”

– Coordination with secretariat

– User support

http://dcc.icgc.org/team

35

[Diagram: Data → Validation (dictionary) → Validation (across fields) → Indexing → Happy Users]

http://goo.gl/1EcyR
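The two validation stages named on this slide, checking values against a dictionary and then checking consistency across fields, might look roughly like the sketch below. The field names, allowed values and the cross-field rule are hypothetical stand-ins chosen for illustration; they are not the DCC's actual dictionary or validator.

```python
# Hypothetical mini-dictionary: field name -> allowed values (None = any value).
DICTIONARY = {
    "donor_id": None,
    "donor_sex": {"male", "female"},
    "donor_age_at_diagnosis": None,
    "donor_age_at_last_followup": None,
}


def validate_dictionary(record: dict) -> list[str]:
    """Stage 1: every field must be known and hold an allowed value."""
    errors = []
    for field, value in record.items():
        if field not in DICTIONARY:
            errors.append(f"unknown field: {field}")
        elif DICTIONARY[field] is not None and value not in DICTIONARY[field]:
            errors.append(f"invalid value for {field}: {value!r}")
    return errors


def validate_across_fields(record: dict) -> list[str]:
    """Stage 2: fields of the same record must be consistent with each other."""
    errors = []
    diagnosis = record.get("donor_age_at_diagnosis")
    followup = record.get("donor_age_at_last_followup")
    # Assumed rule: age at last follow-up cannot precede age at diagnosis.
    if diagnosis is not None and followup is not None and followup < diagnosis:
        errors.append("donor_age_at_last_followup is earlier than donor_age_at_diagnosis")
    return errors


def validate(record: dict) -> list[str]:
    """Run both stages and return every problem found."""
    return validate_dictionary(record) + validate_across_fields(record)


good = {"donor_id": "D1", "donor_sex": "female",
        "donor_age_at_diagnosis": 61, "donor_age_at_last_followup": 64}
bad = {"donor_id": "D2", "donor_sex": "F",
       "donor_age_at_diagnosis": 61, "donor_age_at_last_followup": 58}

print(validate(good))
print(validate(bad))
```

Running the dictionary stage first keeps the cross-field rules simple, since they can assume each individual value is already well formed.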

36

http://docs.icgc.org/methods

37

http://docs.icgc.org/dcc-data-element-specifications

38

ICGC Biocuration

• Helping submitters get their data to ICGC

• Progress reporting (data audit)

• Quality checks (coverage, correctness, etc.)

• Helping users get to the data

• Validate and check (and recheck) metadata on public

repositories

• Test and integrate with other public repositories via

standard data formats, ontologies.

• Documentation, documentation, and more documentation

• Training

39

ICGC datasets to date

[Figure: ICGC Data Portal Cumulative Donor Count for Member Projects. Number of donors (axis 0 to 14,000) plotted for Release 7 through Release 17.]

• Cancer types: 50

• Body sites: 18

• Donors: 12,232

• Specimens: 24,661

• Simple somatic mutations: 9,871,477

• Mutated genes: 57,526

ICGC dataset version 17

Sept 11th 2014

41

Clinical Data Completeness

Donor interval of last followup

Donor Tumour stage at diagnosis

Donor Tumour staging system at diagnosis

Donor diagnosis ICD10

Donor Fields

Donor survival time

Donor Tumour stage at diagnosis supplemental

Donor relapse interval

Donor age at last followup

Donor relapse type

Donor age at diagnosis

Disease status last followup

Donor region of residence

Donor sex

Donor ID

Donor vital status

Average Percentage Completeness

Overall Donor Clinical Data Completeness
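A completeness figure of this kind can be computed as the share of records with a non-empty value for each field. The sketch below assumes records are plain dictionaries and that a missing key, `None`, or an empty string all count as "not provided"; the donor records themselves are invented for the example.

```python
MISSING = (None, "")  # values treated as "not provided" (an assumption)


def completeness(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Percentage of records carrying a non-empty value for each field."""
    total = len(records)
    result = {}
    for field in fields:
        filled = sum(1 for record in records if record.get(field) not in MISSING)
        result[field] = 100.0 * filled / total if total else 0.0
    return result


# Invented donor records for illustration.
donors = [
    {"donor_id": "D1", "donor_sex": "female", "donor_survival_time": 420},
    {"donor_id": "D2", "donor_sex": "male", "donor_survival_time": None},
    {"donor_id": "D3", "donor_sex": "", "donor_survival_time": None},
    {"donor_id": "D4", "donor_sex": "male"},
]

report = completeness(donors, ["donor_id", "donor_sex", "donor_survival_time"])
for field, percent in report.items():
    print(f"{field}: {percent:.0f}%")
```

Averaging these per-field percentages across projects would give an "average percentage completeness" like the one charted here.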

42

43

Clinical Data Completeness

Overall Specimen Clinical Data Completeness

Level of cellularity

Percentage cellularity

Digital Image of Stained Section

Tumour Stage Supplemental

Tumour Stage

Tumour Stage System

Tumour Grade Supplemental

Tumour Grade

Tumour Grading System

Tumour Histological Type

Specimen available

Specimen Biobank ID

Specimen Biobank

Tumour confirmed

Specimen storage other

Specimen storage

Specimen processing other

Specimen processing

Specimen donor treatment type

Specimen Interval

Specimen type

Specimen type other

Specimen ID

Donor ID

Specimen donor treatment type other

Specimen Fields

Average Percentage Completeness (axis 0 to 100)

44

45

ICGC DCC Pipeline

[Diagram labels: Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; Sequencing centres; DCC; Level II/III mutation data; Data Portal; Harmonized Data; Annotation; Researchers; DACO; Data Access Agreement; European Genome-phenome Archive; Raw sequencing data; Metadata XML files]

Hardeep Nahal

46

EGA: Controlled Access and DACO

47

ICGC DCC Pipeline

[Diagram labels: Sequencing centres; Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; DCC; Level II/III mutation data; Data Portal; Harmonized Data; Annotation; Researchers; European Genome-phenome Archive; Raw sequencing data; Metadata XML files]

Researcher: "I want raw data for Donor DO46688"

Hardeep Nahal

48

ICGC DCC Pipeline

[Diagram labels: Sequencing centres; Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; DCC; Level II/III mutation data; Data Portal; Harmonized Data; Annotation; Researchers; DACO; Data Access Agreement; European Genome-phenome Archive; Raw sequencing data; Metadata XML files]

Researcher, to the DCC: "I can’t find my BAM file?!"

Hardeep Nahal

49

Metadata: ICGC & EGA

[Diagram: relationships between ICGC and EGA metadata entities, drawn as one-to-many (1 to n) links]

50

ICGC-EGA Audit

Project/Study Names

Sample identifier

Donor Identifier

EGA Study/Dataset Accession

Raw data file names

Tumour/Normal designation

EGA Metadata XML Files

ICGC-EGA Audit Reports

ICGC Submitted Data

ICGC Cancer Projects
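An audit along these lines compares what a project submitted to ICGC with what the EGA metadata says about the same samples. The sketch below assumes both sides have already been reduced to dictionaries keyed by sample identifier; the record layout, key names and example values are hypothetical and are not the DCC's audit tooling.

```python
def audit(icgc_samples: dict, ega_samples: dict) -> list[str]:
    """Report discrepancies between ICGC submissions and EGA metadata."""
    report = []
    for sample_id, icgc in sorted(icgc_samples.items()):
        ega = ega_samples.get(sample_id)
        if ega is None:
            report.append(f"{sample_id}: missing from EGA")
            continue
        if icgc["donor_id"] != ega.get("donor_id"):
            report.append(
                f"{sample_id}: donor ID differs "
                f"({icgc['donor_id']} vs {ega.get('donor_id')})"
            )
        if not ega.get("tumour_normal"):
            report.append(f"{sample_id}: tumour/normal designation missing in EGA")
    # Samples known to EGA that the project never submitted to ICGC.
    for sample_id in sorted(set(ega_samples) - set(icgc_samples)):
        report.append(f"{sample_id}: in EGA but not submitted to ICGC")
    return report


# Invented records for illustration.
icgc = {"S1": {"donor_id": "DO1"}, "S2": {"donor_id": "DO2"}, "S3": {"donor_id": "DO3"}}
ega = {
    "S1": {"donor_id": "DO1", "tumour_normal": "tumour"},
    "S2": {"donor_id": "165304", "tumour_normal": None},
    "S4": {"donor_id": "DO4", "tumour_normal": "normal"},
}

audit_report = audit(icgc, ega)
for line in audit_report:
    print(line)
```

In practice the sample identifiers themselves may be formatted differently on the two sides, so a normalization step would likely be needed before this comparison.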

51

Major Issues & Challenges

Differences in the formats used for clinical identifiers submitted to

ICGC and EGA

Tumour/Normal designation missing

Project/Study names differ

Missing EGA datasets

[Diagram: metadata fields shared between ICGC and EGA]

• Donor ID

• Sample ID

• Tumour/Normal designation

• Sequencing strategy

• Raw data filename

• EGA Study/Dataset accession

Hardeep Nahal

52

Example metadata issues

<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
  <SAMPLE alias="LFS_MB1" center_name="DKFZ-IBIOS" accession="ERS040283">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS040283</PRIMARY_ID>
      <SUBMITTER_ID namespace="DKFZ-IBIOS">LFS_MB1</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME>
      <TAXON_ID>9606</TAXON_ID>
      <SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME>
      <COMMON_NAME>human</COMMON_NAME>
    </SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>Sample ID</TAG>
        <VALUE>LFS_MB1</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>Donor ID</TAG>
        <VALUE>165304</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>

Control/tumour information?

Different Donor identifiers!!

Hardeep Nahal
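Metadata issues like the ones flagged on this slide can be detected mechanically once the sample XML is parsed. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the record (XML declaration and non-attribute elements omitted). The list of required tags, and in particular the tag name "Tumour/Normal", is a hypothetical choice for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample record shown on the slide.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE alias="LFS_MB1" center_name="DKFZ-IBIOS" accession="ERS040283">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>Sample ID</TAG><VALUE>LFS_MB1</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>Donor ID</TAG><VALUE>165304</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""

# Hypothetical list of tags an audit might require on every sample.
REQUIRED_TAGS = ("Sample ID", "Donor ID", "Tumour/Normal")


def sample_attributes(xml_text: str) -> list[dict]:
    """Collect the accession and TAG/VALUE pairs of every SAMPLE element."""
    root = ET.fromstring(xml_text)
    samples = []
    for sample in root.iter("SAMPLE"):
        attrs = {"accession": sample.get("accession")}
        for attribute in sample.iter("SAMPLE_ATTRIBUTE"):
            attrs[attribute.findtext("TAG")] = attribute.findtext("VALUE")
        samples.append(attrs)
    return samples


def missing_tags(attrs: dict) -> list[str]:
    """Names of required tags that are absent or empty."""
    return [tag for tag in REQUIRED_TAGS if not attrs.get(tag)]


parsed = sample_attributes(SAMPLE_XML)
for attrs in parsed:
    print(attrs["accession"], "missing:", missing_tags(attrs))
```

Detecting that a donor identifier differs from the one submitted to ICGC needs the ICGC side as well, so this check only covers absent fields.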

53

<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
  <SAMPLE center_name="QCMG" alias="ICGC-ABMJ-20101130-29-ND" accession="ERS206872" xmlns:xsi…….
    <IDENTIFIERS>
      <PRIMARY_ID>ERS206872</PRIMARY_ID>
      <SUBMITTER_ID namespace="QCMG">ICGC-ABMJ-20101130-29-ND</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME>
      <TAXON_ID>9606</TAXON_ID>
      <SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME>
      <COMMON_NAME>human</COMMON_NAME>
    </SAMPLE_NAME>
    <DESCRIPTION>1:DNA|4:Normal control (other site)|Unknown</DESCRIPTION>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>Sample ID</TAG>
        <VALUE>8029782</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>Donor ID</TAG>
        <VALUE>ICGC_0108</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>

Example metadata issues

Free Text

Hardeep Nahal

54

http://icgc.org

55

56

57

Select “Bladder Cancer – China”

58

Select “Pancreatic cancer – Canada”

59

… But where is the data?

60

61

http://dcc.icgc.org/

62

63

64

Highlights of the new portal: dcc.icgc.org

• Faceted search capabilities for variants, genes and donors

– Interactive data exploration that is fast and easy

• Mutation aggregation & counts across donors and cancers

– e.g. number of pancreatic cancer donors with the KRAS G12D mutation

• Standardized gene consequence across all projects

• Genome browser

• Data download

• Protein domains

• Links to repositories
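The "mutation aggregation & counts" highlight amounts to counting distinct donors per mutation within a cancer type. A minimal sketch, assuming mutation observations arrive as flat records; the record layout and every value below are invented and do not reflect portal data or the portal's implementation:

```python
from collections import defaultdict


def donors_per_mutation(observations: list[dict]) -> dict[tuple, int]:
    """Count distinct donors for each (primary site, gene, mutation)."""
    donors = defaultdict(set)
    for obs in observations:
        key = (obs["primary_site"], obs["gene"], obs["mutation"])
        donors[key].add(obs["donor_id"])  # a set ignores repeat observations
    return {key: len(ids) for key, ids in donors.items()}


# Invented observations for illustration.
observations = [
    {"donor_id": "D1", "primary_site": "Pancreas", "gene": "KRAS", "mutation": "G12D"},
    {"donor_id": "D1", "primary_site": "Pancreas", "gene": "KRAS", "mutation": "G12D"},
    {"donor_id": "D2", "primary_site": "Pancreas", "gene": "KRAS", "mutation": "G12D"},
    {"donor_id": "D3", "primary_site": "Pancreas", "gene": "KRAS", "mutation": "G12V"},
    {"donor_id": "D4", "primary_site": "Colon", "gene": "KRAS", "mutation": "G12D"},
]

counts = donors_per_mutation(observations)
print(counts[("Pancreas", "KRAS", "G12D")])  # 2
```

Counting donors rather than observations matters because one donor can contribute several specimens carrying the same mutation.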

65

KRAS search

66

• Summary

• Cancer type distribution

• Other links (Cosmic, Entrez, etc)

• Mutation profile in protein

• Domains

• Genomic Context

• Mutation profile

• Most common mutations

67

http://dcc.icgc.org/genes/ENSG00000133703

68

69

70

71

Donor

• Donor ID

• Primary site

• Cancer Project

• Gender

• Tumor Stage

• Vital Status

• Disease Status

• Release type

• Age at diagnosis

• Available data types

• Analysis types

72

Genes

73

Mutations

• Consequences

• Type

• Platform

• Verification status

74

Exporting data

75

Exporting data

76

77

Exporting data

78

Can do bulk download of the data …

79

BIG DATA

[Diagram labels: Raw data; Metadata; Validation; Interpreted data]

80

[Diagram labels: DACO; ICGC; dbGaP; EGA; TCGA; ERA; BAM; Open; Germline; + EGA id]

81

ICGC Data Categories

ICGC Open Access Datasets

• Cancer Pathology: Histologic type or subtype; Histologic nuclear grade

• Donor: Gender; Age range

• RNA expression (normalized)

• DNA methylation

• Genotype frequencies

• Somatic mutations (SNV, CNV and Structural Rearrangement)

ICGC Controlled Access Datasets

• Detailed Phenotype and Outcome Data: Patient demography; Risk factors; Examination; Surgery/Drugs/Radiation; Sample/Slide; Specific histological features; Protocol; Analyte/Aliquot

• Gene Expression (probe-level data)

• Raw genotype calls (germline)

• Gene-sample identifier links

• Genome sequence files

Most of the data in the portal is publicly available without restriction. However, access to some data, such as germline mutations, requires authorization by the Data Access Compliance Office (DACO).

http://icgc.org/daco
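The open versus controlled split can be thought of as a per-field filter applied before data reach a user. The sketch below is a toy illustration under that assumption; the field names are hypothetical, and the real portal enforces access through DACO approval and its own infrastructure, not through code like this.

```python
# Hypothetical field names; the open/controlled split follows the slide.
CONTROLLED_FIELDS = {"risk_factors", "raw_genotype_calls", "genome_sequence_files"}


def visible_record(record: dict, daco_approved: bool) -> dict:
    """Return the fields a user may see; controlled fields need DACO approval."""
    if daco_approved:
        return dict(record)
    return {k: v for k, v in record.items() if k not in CONTROLLED_FIELDS}


# Invented record for illustration.
record = {
    "gender": "female",
    "age_range": "60-69",
    "histologic_type": "ductal adenocarcinoma",
    "risk_factors": "smoker",
    "raw_genotype_calls": "calls.vcf",
}

public_view = visible_record(record, daco_approved=False)
print(sorted(public_view))
```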

84

Identify yourself

Fill out a detailed form, which includes:

• Contact and Project Information

• Information Technology details and procedures for keeping data secure

• Data Access Agreement

All of these documents are put into a PDF file that you print and get your institution to sign off on your behalf

87

88

89

90

91

92

DACO approved projects (Dec 2014):

163 groups – 867 people

http://goo.gl/E8gHGx

93

Making sense of it all

1 project == 1 pipeline

94

Making sense of it all

70 projects == 70 pipelines

95

Making sense of it all

70 projects == 1 pipeline

96

PanCancer Analysis of Whole Genomes (PCAWG)

• 2,200 T/N pairs with clinical data, analyzed over 6 academic clouds

• 16 working groups, > 1000 scientists

• 1 alignment pipeline (8 months)

• Data freeze last month

• 3 somatic mutation pipelines (2 months?)

• 2 RNA-Seq pipelines (1 month?)

• Not scheduled yet:

– miRNA

– CNV & SV

– Pathway analysis

• Start writing papers in July 2015

97

Conclusions: What we are doing at DCC

• Working with EGA to audit missing information

and minimize disconnect between submitters’

ICGC files and raw/metadata submitted to

EGA.

• Working in close collaboration with ICGC

projects and EGA to correct missing

information and data so we can harmonize

data in the ICGC/EGA submission process.

• Adding validation step in submission process

to better coordinate efforts with EGA

98

Conclusions: What we are doing at DCC

• Encourage & work with submitters to supply all

clinical metadata.

• Improved data & metadata curation at EGA; better

linking of data held at DCC to ICGC data in other

repositories

• Improved data quality/integrity checking through

new submission/validation system; review of

submission file specifications

• Integration of new data submission system and

portal infrastructure with project and user

information managed at ICGC.org

• Integrating PCAWG results with ICGC data portal

99

Some thoughts:

• Curation activities are obviously crucial to the

development of a great database

• Continuous feedback between users,

developers and submitters is also critical

• Biocurators are at the important interface and

are essential team players for the

development and maintenance of any modern

database.

100

http://www.biocurator.org/

101

Nature 409:452

Bioinformatics Citizenship: What it means,

and what does it cost?

102

Important messages:

• The ICGC portal is evolving and getting better all

the time

• Lots of data provided by the ICGC

• Important to be good citizens of the scientific world

• The idea behind all of this is to provide tools to

help cure cancer

• Need to respect policies and guidelines

• There is help out there, and user feedback is

*always* welcome.

103

DCC Software

Developer

Vincent Ferretti

Daniel Chang

Anthony Cros

Jerry Lam

Brian O'Connor

Bob Tiernay

Stuart Watt

Shane Wilson

Junjun Zhang

Acknowledgments

ICGC/OICR Project leaders:

Tom Hudson

John McPherson

Lincoln Stein

Jared Simpson

Paul Boutros

Vincent Ferretti

Francis Ouellette

Jennifer Jennings

Ouellette Lab

Michelle Brazas

Emilie Chautard

Nina Palikuca

Zhibin Lu

Web Dev

Joseph Yamada

Angela Chao

Daniel Gross

Kamen Wu

Kim Cullion

Miyuki Fukuma

Wen Xu

Pipeline Development

& Evaluation

Morgan Taschuk

Michael Laszloffy

Peter Ruzanov

ICGC DCC Biocuration

Hardeep Nahal

Marc Perry

http://oicr.on.ca http://icgc.org

… and all the patients and their families who are putting their hopes into our work!

Research IT/Systems

David Sutton

Bob Gibson

Sam Maclennan

David Magda

Rob Naccarato

Brian Ott

Gino Yearwood

EGA

Justin Paschall

Jeff Almeida-King

Ilkka Lappalainen

Jordi Rambla De Argila

Marc Sitges Puy

SeqProdBio Team

Tim Beck

Tony DeBat

Larry Heissler

Xuemei (Mei) Luo

Michael Moorhouse

Furqan Qureshi

Yogi Sundaravadanam

104

Informatics and Biocomputing at the OICR

105

http://oicr.on.ca/careers

106

107

http://icgc.org

http://dcc.icgc.org

http://docs.icgc.org

info@icgc.org

@bffo
