![Page 1: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/1.jpg)
Dipartimento di Elettronica, Informazione e Bioingegneria
Genomic Computing, Politecnico di Milano, 13-17 March 2017
Genomic Data Model and GenoMetric Query Language as research enabler to discover genome properties
Marco Masseroli and Stefano Ceri (joint work with several PhD students)
Politecnico di Milano, BioInformatics Group
![Page 2: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/2.jpg)
2
Genomic Computing Background: Next Generation Sequencing
• Next Generation Sequencing technology is about to provide affordable (in time and cost) and precise genome-wide determinations of:
– DNA sequence / variations (DNA-seq)
– gene subregions’ activity (RNA-seq) [all gene test]
– protein-DNA interaction regions (ChIP-seq)
– open chromatin (DNase-seq)
The goal of $1,000 full genome sequencing in under an hour has just been met
• Very many DNA-interacting proteins / subjects / conditions will soon be evaluated
– Personalized medicine (diagnosis and treatment)
– Each NGS test can generate 0.4 TB -> Big Data scenario
![Page 3: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/3.jpg)
Source: http://blog.goldenhelix.com/grudy/a-hitchhiker%E2%80%99s-guide-to-next-generation-sequencing-part-2/
Genomic Computing Big data analysis with Next Generation Sequencing
My talk
3
![Page 4: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/4.jpg)
4
[Figure: heterogeneous genomic data sources; heterogeneous clinical data sources (personal patient data); genomic data integration and clinical data integration (publishing / crawling / searching); genome browser; data analytics; ontological knowledge; biologist; clinician; medical literature; clinical protocols]
Genomic Computing The big picture: Distributed heterogeneous data
![Page 5: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/5.jpg)
5
[Figure: UCSC Genome Browser view of one macro genomic region with a number of genomic features (data tracks)]
Genomic Computing Current practice – UCSC Genome Browser
![Page 6: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/6.jpg)
6
Genomic Computing The challenge: Understanding biologists’ needs
• Working together with biologists for giving answers to the problems behind the «courtesy» slide
Courtesy of Prof. Pelicci, IEO
![Page 7: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/7.jpg)
7
Genomic Computing Challenge: Genotype–phenotype discovery
• (Epi)genotype-phenotype relationship discovery: understanding genomic regions, genome variations and their associations with different phenotypes
– highly heterogeneous scenario
• It requires evaluating, in several different conditions and types of individuals:
– genome (DNA) sequence variations
– gene activity & its regulation
– occurring interactions
![Page 8: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/8.jpg)
8
Genomic Computing Main questions
Scientist’s typical questions (from our interaction with IEO - European Oncology Institute and IIT - Italian Institute of Technology)
• Can interesting DNA regions and their relationships be discovered using genome-wide queries?
• Can genomic data of patients be grouped according to clinical phenotype and compared?
• Can the genomic features of all the genes involved in the same biological process be extracted and then analyzed?
• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?
![Page 9: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/9.jpg)
9
Genomic Computing Research agenda
• Can interesting DNA regions and their relationships be discovered using genome-wide queries?
Genometric query language
• Can genomic data of patients be grouped according to clinical phenotype and compared?
Genometric query language + clustering
• Can all the features of the genes involved in the same biological process be extracted and then analyzed?
Genometric query language + data analysis
• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?
Genometric query language + indexing & search
![Page 10: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/10.jpg)
10
Genomic Computing Research agenda by topics
• Data model: design a simple and format-independent data model for describing datasets with both genomic regions and general provenance information (including phenotype)
• Query language: design a query language where both genometric aspects (about the placement of regions on the genome) and provenance can be queried at a high level of data independence and transparency
• Integrative data analysis: translating query results into a genome space which is the ideal start point for correlation and network analysis
• Data search: design protocols for data crawling and indexing based on the data model
![Page 11: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/11.jpg)
Genomic Data Model
11
![Page 12: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/12.jpg)
12
Genomic Computing Genomic Data Model
Within the same sample, two kinds of data:
• Region values aligned w.r.t. a given reference, with specific left-right ends within a chromosome, and with several associated attributes (e.g. p-value of region significance)
• Metadata, with free-format attribute-value pairs, storing all the knowledge about the sample
[Figure: a region r on the DNA sequence A C G T T A A C G G A T A C C A A C, identified by chr (chromosome), left (position), right (position) and strand (direction)]
![Page 13: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/13.jpg)
Genomic Computing Model rationale
• Regions of the model are data format independent and provide an interoperability framework for comparing data on mutations, expression or regulation using regions as common ground
• Metadata attribute-value pairs of the model are info-system independent and provide an interoperability framework for comparing samples based upon their biological aspects
13
![Page 14: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/14.jpg)
14
Genomic Computing Genomic Data Model – Example
Region values: 0.1, 0.6. Metadata: Tumor_type = brca; Patient_age = 75
Region values: 0.5, 0. Metadata: Tumor_type = brca; Patient_age = 63; Sex = Female
Region values: 0.1, 0, 0.8, 0.1. Metadata: Tumor_type = brca; Patient_age = 58
Sample 1
Sample 3
Sample 2
![Page 15: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/15.jpg)
15
Genomic Computing Genomic Data Model – Example
• Region values: {expID, region:(chr, left, right, strand), p-value}
• Metadata: {expID, attribute, value}
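The two schemas above can be sketched as plain Python data classes. This is only an illustration of the model, not the GMQL implementation: the class names, the `meta` helper and the coordinates are hypothetical, while the region values and metadata pairs come from the example slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    chrom: str          # chromosome, e.g. "chr1"
    left: int           # left end position
    right: int          # right end position
    strand: str         # "+", "-" or "*" (unknown)
    values: tuple = ()  # schema-dependent attributes, e.g. (p_value,)


@dataclass
class Sample:
    exp_id: int
    regions: list = field(default_factory=list)
    # Free-format (attribute, value) pairs; an attribute may repeat.
    metadata: list = field(default_factory=list)

    def meta(self, attribute):
        """Return every value stored for the given metadata attribute."""
        return [v for a, v in self.metadata if a == attribute]


# Hypothetical coordinates; values and metadata follow the example slide.
s1 = Sample(
    exp_id=1,
    regions=[
        Region("chr1", 100, 250, "*", (0.1,)),
        Region("chr1", 400, 520, "+", (0.6,)),
    ],
    metadata=[("Tumor_type", "brca"), ("Patient_age", "75")],
)
```

Keeping metadata as a list of pairs rather than a dictionary lets one attribute carry several values, which the MERGE and COVER slides rely on.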
![Page 16: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/16.jpg)
Genomic Computing Genomic Data Model - Samples and datasets
16
Samples and datasets
• Every sample corresponds to an «experiment», with an ID
• Every dataset is a named collection of samples with the same region data schema
Data format independent; interoperability framework for comparing data samples based upon their biological aspects
![Page 17: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/17.jpg)
Genomic Computing Genomic Data Model - Mapping examples
17
![Page 18: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/18.jpg)
Genomic Computing Genomic Data Model - Mapping examples
18
![Page 19: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/19.jpg)
Genomic Computing Genomic Data Model - Mapping examples
19
![Page 20: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/20.jpg)
DNA-seq (mutations)
(id, (chr,start,stop,strand), (A,G,C,T,del,ins,inserted,ambig,Max,Error,A2T,A2C,A2G,C2A,C2G,C2T))
(1, (chr1, 917179, 917180,*), (0,0,0,0,1,0,'.','.',0,0,0,0,0,0,0,0))
(1, (chr1, 917179, 917179,*), (0,0,0,0,0,1,G,'.',0,0,0,0,0,0,0,0))
RNA-seq (gene expression)
(id, (chr,start,stop,strand), (source,type,score,frame,geneID,transcriptID,RPKM1,RPKM2,iIDR))
(1, (chr8, 101960824, 101964847,-), ('GencodeV10', 'transcript', 0.026615, NULL, 'ENSG00000164924.11', 'ENST00000418997.1', 0.209968, 0.193078, 0.058))
Annotations
(id, (chr,start,stop,strand), (proteinID,alignID,type))
(1, (chr1, 11873, 11873, +), ('uc001aaa.3', 'uc001aaa.3', 'cds'))
(1, (chr1, 11873, 12227, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))
(1, (chr1, 12612, 12721, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))
(1, (chr1, 13220, 14409, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))
ChIA-PET (denoting 3D genomic loops; the head is given by the region coordinates, the tail is in the schema)
(id, (chr,headstart,headstop,strand), (loopType, tailChr, tailStart, tailStop, PETcount, pValue, qValue))
(1, (chr1,7385626,7389841,*), ('Inter-Chromosome', chr17, 3081653, 3084755, 50, 0.0, 0.0))
20
Genomic Computing Genomic Data Model - Other mapping examples
![Page 21: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/21.jpg)
Query Language
21
(Motivational example and detailed description)
![Page 22: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/22.jpg)
22
Genomic Computing GMQL motivational example
The language allows for queries on the genome involving large datasets describing:
• Genomic signals (i.e. experiment dataset regions)
• Reference regions (e.g. TSS, genes, promoters, enhancers)
• Distance rules (e.g. the nearest enhancer that stands at least 100 kb from the nearest gene)
[Figure: reference DNA with enhancer, promoter and gene reference regions; experimental datasets 1, 2 and 3; a distance pattern]
![Page 23: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/23.jpg)
23
Identification of distal bindings in transcription regulatory regions
Find all CTCF transcription factor (TF) binding regions in ChIP-seq data of the human cancer cell line HeLa-S3 that are farther than x kb (e.g. 1000 kb) from the TSS (transcription start site) of the nearest gene. Then, find all H3K4me1 histone modification (HM) regions that are also farther than x kb from the TSS of the nearest gene.
Finally, consider known enhancer (EN) regions and return a list of EN-HM-overlapping TF regions.
Genomic Computing GMQL motivational example – Distal bindings
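[Figure: a DNA region with the nearest gene and its TSS, the tracks HM, TF1, TF2, REF and EN, the threshold distance x, and the GMQL result; TF regions not respecting the distance threshold, or not overlapping EN or HM, are marked]
Legend: HM: histone mark experiment; TF: transcription factor experiment; REF: reference DNA regions; EN: enhancer; x: threshold distance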
![Page 24: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/24.jpg)
24
Genomic Computing GMQL motivational example – Distal bindings
HM = SELECT(dataType == 'ChipSeq' AND cell == 'HeLa-S3' AND antibody == 'H3K4me1') PEAK;
TF = SELECT(dataType == 'ChipSeq' AND cell == 'HeLa-S3' AND antibody == 'CTCF') PEAK;
TSS = SELECT(type == 'TSS') ANNOTATION;
EN = SELECT(type == 'enhancer') ANNOTATION;
HMa = JOIN(minDistance(5) AND distance > 1000000, right) TSS HM;
TFa = JOIN(minDistance(5) AND distance > 1000000, right) TSS TF;
TFb = JOIN(distance < 0, left) TFa EN;
HMb = JOIN(distance < 0, left) HMa EN;
TF_res = JOIN(distance < 0, left) TFb HMb;
![Page 25: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/25.jpg)
25
Genomic Computing GenoMetric Query Language
GenoMetric Query Language (GMQL) is defined as a sequence of algebraic operations following the structure:
< variable > = < operator > (< parameters >) < variable >
– Every variable is a dataset including many samples
– Offers high-level, declarative operations which operate both on regions and meta-data -> thus, each operation progressively builds the regions and meta-data of its result
– Inspired by SQL and Pig Latin
– Targeted towards cloud computing
![Page 26: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/26.jpg)
Genomic Computing Overall view on GMQL operations
Classic relational operations – with genomic extensions• SELECT, PROJECT, EXTEND, ORDER, GROUP, MERGE, UNION, DIFFERENCE
Domain-specific genomic operations:
• COVER, (GENOMETRIC) JOIN, MAP
Utilities:
• MATERIALIZE
27
![Page 27: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/27.jpg)
28
Genomic Computing Sample selection – Example SELECT
Selection of the samples where a selection predicate p is true (e.g. select patients younger than 70 years)
Region values: 0.1, 0.6. Metadata: Tumor_type = brca; Patient_age = 75
Region values: 0.5, 0. Metadata: Tumor_type = brca; Patient_age = 63; Gender = Female
Region values: 0.1, 0, 0.8, 0.1. Metadata: Tumor_type = brca; Patient_age = 58
S2 = SELECT(p) S1;
Example: S2 = SELECT(Patient_age < 70) S1;
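A minimal Python sketch of the same selection, assuming each sample is a dictionary with a `meta` mapping and a list of region `scores` (both names are hypothetical, and single-valued metadata is a simplification). Samples are kept or dropped as a whole; their regions are not touched.

```python
def select(samples, predicate):
    """Keep the samples whose metadata satisfy the predicate."""
    return [s for s in samples if predicate(s["meta"])]


# The three samples of the slide.
S1 = [
    {"meta": {"Tumor_type": "brca", "Patient_age": 75},
     "scores": [0.1, 0.6]},
    {"meta": {"Tumor_type": "brca", "Patient_age": 63, "Gender": "Female"},
     "scores": [0.5, 0]},
    {"meta": {"Tumor_type": "brca", "Patient_age": 58},
     "scores": [0.1, 0, 0.8, 0.1]},
]

# Counterpart of: S2 = SELECT(Patient_age < 70) S1;
S2 = select(S1, lambda meta: "Patient_age" in meta and meta["Patient_age"] < 70)
print([s["meta"]["Patient_age"] for s in S2])  # prints [63, 58]
```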
![Page 28: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/28.jpg)
30
Genomic Computing Region projection – Example PROJECT
Region values: 0.1, 0.6. Metadata: Tumor_type = brca; Patient_age = 75
Region values: 0.5, 0. Metadata: Tumor_type = brca; Patient_age = 63; Gender = Female
Region values: 0.1, 0, 0.8, 0.1. Metadata: Tumor_type = brca; Patient_age = 58
Selection of the regions where a selection predicate p is true (e.g. select those regions which have a score greater than 0.5)
S2 = PROJECT(p) S1;
Example: S2 = PROJECT(score > 0.5) S1;
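The region-level filter can be sketched in Python as follows. The dictionary layout and the region coordinates are hypothetical; keeping a sample whose regions are all filtered out is an assumption of this sketch, not a statement about GMQL.

```python
def project_regions(samples, predicate):
    """Keep, in every sample, only the regions satisfying the predicate.

    Metadata are copied unchanged; samples left without regions are kept.
    """
    return [
        {"meta": dict(s["meta"]),
         "regions": [r for r in s["regions"] if predicate(r)]}
        for s in samples
    ]


def region(left, right, score):
    # Hypothetical coordinates on a single chromosome.
    return {"chr": "chr1", "left": left, "right": right, "score": score}


S1 = [
    {"meta": {"Patient_age": 75}, "regions": [region(0, 10, 0.1), region(20, 30, 0.6)]},
    {"meta": {"Patient_age": 63}, "regions": [region(5, 15, 0.5), region(40, 50, 0)]},
    {"meta": {"Patient_age": 58}, "regions": [region(0, 5, 0.1), region(10, 12, 0),
                                               region(20, 25, 0.8), region(30, 35, 0.1)]},
]

# Counterpart of: S2 = PROJECT(score > 0.5) S1;
S2 = project_regions(S1, lambda r: r["score"] > 0.5)
print([[r["score"] for r in s["regions"]] for s in S2])  # prints [[0.6], [], [0.8]]
```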
![Page 29: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/29.jpg)
31
Genomic Computing Region projection – Example PROJECT
Tumor_type = brcaPatient_age = 75
Tumor_type = brcaPatient_age = 75
Projection of the regions: for each gene in a set, take its promoter (e.g. from -2kbp, to +1kbp from the TSS)
S2 = PROJECT(p) S1;
Example: S2 = PROJECT(start = start - 2000, stop = start + 1000) S1;
S1
S2
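The coordinate rewrite of the promoter example can be sketched in Python. Both new ends are computed from the original start, as in the slide formula; clamping at position 0 and ignoring the strand are simplifying assumptions of this sketch, and the function name is hypothetical.

```python
def promoter(region, upstream=2000, downstream=1000):
    """Return the promoter window of a (chr, start, stop, strand) region.

    Both ends are derived from the ORIGINAL start coordinate.
    """
    chrom, start, stop, strand = region
    new_start = max(0, start - upstream)  # assumption: never go below 0
    new_stop = start + downstream
    return (chrom, new_start, new_stop, strand)


gene = ("chr1", 11873, 14409, "+")
print(promoter(gene))  # prints ('chr1', 9873, 12873, '+')
```

A strand-aware variant would take the window around the stop coordinate for genes on the minus strand.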
![Page 30: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/30.jpg)
33
Genomic Computing Metadata aggregation – Example AGGREGATE
5 3102 2
5 01
Tumor_type = brca; Patient_age = 75; Region_count = 3
Tumor_type = esca; Patient_age = 78; Region_count = 5
5 3
Tumor_type = chol; Patient_age = 85; Region_count = 2
Count the regions in each sample and store it in metadata
S2 = AGGREGATE(Ai AS gi) S1;
Example: S2 = AGGREGATE(Region_count AS COUNT) S1;
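A Python sketch of the region count stored as a new metadata attribute. The sample layout and the score lists are hypothetical; only the counts (3, 5, 2) and the tumor types come from the slide.

```python
def aggregate_count(samples, new_attribute="Region_count"):
    """Add to each sample's metadata the number of its regions."""
    out = []
    for s in samples:
        meta = dict(s["meta"])
        meta[new_attribute] = len(s["regions"])
        out.append({"meta": meta, "regions": list(s["regions"])})
    return out


S1 = [
    {"meta": {"Tumor_type": "brca", "Patient_age": 75}, "regions": [5, 0, 1]},
    {"meta": {"Tumor_type": "esca", "Patient_age": 78}, "regions": [5, 3, 10, 2, 2]},
    {"meta": {"Tumor_type": "chol", "Patient_age": 85}, "regions": [5, 3]},
]

# Counterpart of: S2 = AGGREGATE(Region_count AS COUNT) S1;
S2 = aggregate_count(S1)
print([s["meta"]["Region_count"] for s in S2])  # prints [3, 5, 2]
```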
![Page 31: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/31.jpg)
34
Genomic Computing Order and select top k – Example ORDER
Tumor_type = brca; Patient_age = 75; Region_count = 3; Order = 2
Tumor_type = esca; Patient_age = 78; Region_count = 5; Order = 1
5 01
5 3102 2
5 3
Tumor_type = chol; Patient_age = 85; Region_count = 2; Order = 3
Order by Region_count metadata and take the top 2 samples
S2 = ORDER(Ai; [TOP: k]) S1;
Example: S2 = ORDER(Region_count; TOP: 2) S1;
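A Python sketch of ordering with an optional top-k cut. Descending order is assumed here because the slide assigns Order = 1 to the sample with the largest Region_count; the function and parameter names are hypothetical.

```python
def order_top(samples, attribute, top=None, descending=True):
    """Rank samples by a metadata attribute, store the rank as "Order",
    and optionally keep only the first `top` samples."""
    ranked = sorted(samples, key=lambda s: s["meta"][attribute], reverse=descending)
    out = []
    for position, s in enumerate(ranked, start=1):
        meta = dict(s["meta"])
        meta["Order"] = position
        out.append({**s, "meta": meta})
    return out if top is None else out[:top]


S1 = [
    {"meta": {"Tumor_type": "brca", "Region_count": 3}},
    {"meta": {"Tumor_type": "esca", "Region_count": 5}},
    {"meta": {"Tumor_type": "chol", "Region_count": 2}},
]

# Counterpart of: S2 = ORDER(Region_count; TOP: 2) S1;
S2 = order_top(S1, "Region_count", top=2)
print([(s["meta"]["Tumor_type"], s["meta"]["Order"]) for s in S2])
# prints [('esca', 1), ('brca', 2)]
```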
![Page 32: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/32.jpg)
35
Genomic Computing Group by metadata – Example GROUP
5 01
Tumor_type = brca; Patient_age = 75; Group = 1; Min = 0
5 3102 1
Tumor_type = esca; Patient_age = 78; Group = 2; Min = 1
5 3
Tumor_type = chol; Patient_age = 87; Group = 3; Min = 3
534 6
Tumor_type = esca; Patient_age = 78; Group = 2; Min = 1
Group samples according to the value of Tumor_type and compute the minimum region score of each group
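The grouping can be sketched in Python as below. Group identifiers are assigned in order of first appearance, which matches the slide; the score lists are one plausible reading of the figure and should be treated as hypothetical.

```python
def group_min(samples, key):
    """Group samples by a metadata attribute and store, in each sample,
    its group id and the minimum region score over the whole group."""
    group_ids = {}
    minima = {}
    for s in samples:
        k = s["meta"][key]
        group_ids.setdefault(k, len(group_ids) + 1)
        sample_min = min(s["scores"])
        minima[k] = min(minima.get(k, sample_min), sample_min)
    out = []
    for s in samples:
        k = s["meta"][key]
        meta = dict(s["meta"], Group=group_ids[k], Min=minima[k])
        out.append({"meta": meta, "scores": list(s["scores"])})
    return out


S1 = [
    {"meta": {"Tumor_type": "brca"}, "scores": [5, 0, 1]},
    {"meta": {"Tumor_type": "esca"}, "scores": [5, 3, 10, 2, 1]},
    {"meta": {"Tumor_type": "chol"}, "scores": [5, 3]},
    {"meta": {"Tumor_type": "esca"}, "scores": [5, 3, 4, 6]},
]

S2 = group_min(S1, "Tumor_type")
print([(s["meta"]["Group"], s["meta"]["Min"]) for s in S2])
# prints [(1, 0), (2, 1), (3, 3), (2, 1)]
```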
![Page 33: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/33.jpg)
36
Genomic Computing Region merge – Example MERGE
Type = ChipSeq; Antibody = CTCF; Replicate = 1
Type = ChipSeq; Antibody = CTCF; Replicate = 2
Type = ChipSeq; Antibody = CTCF; Replicate = 3
Type = ChipSeq; Antibody = CTCF; Replicate = 1; Replicate = 2; Replicate = 3
Collapse a set of samples (both regions and metadata) into a single one
S2 = MERGE() S1;
S1.s1
S1.s2
S1.s3
S2
![Page 34: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/34.jpg)
37
Genomic Computing Region union – Example UNION
Tumor_type = brca; Experiment = mirna
Tumor_type = brca; Experiment = rnaseq
Return a single dataset with all the samples in two input datasets, merging their region attributes if different
Tumor_type = brca; Experiment = mirna
Tumor_type = brca; Experiment = rnaseq
S3 = UNION() S1 S2;
S1
S2
S3
![Page 35: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/35.jpg)
38
Genomic Computing Region difference – Example DIFFERENCE
Return all the regions in the first dataset that do not overlap any region in the second one
Tumor_type = brca; Experiment = mirna
Tumor_type = brca; Experiment = rnaseq
S1.Tumor_type = brca; S1.Experiment = rnaseq; S2.Tumor_type = brca; S2.Experiment = mirna
S1
S2
S3
S3 = DIFFERENCE() S1 S2;
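The region-level rule of DIFFERENCE can be sketched in Python on plain lists of regions. Regions are `(chr, left, right)` tuples treated as half-open intervals, so two regions that only touch do not overlap; that convention, the function names and the coordinates are assumptions of this sketch.

```python
def overlaps(a, b):
    """True if two (chr, left, right) regions share at least one position."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def difference(first, second):
    """Regions of `first` that overlap no region of `second`."""
    return [a for a in first if not any(overlaps(a, b) for b in second)]


S1_regions = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
S2_regions = [("chr1", 150, 180), ("chr1", 400, 500)]

print(difference(S1_regions, S2_regions))
# prints [('chr1', 300, 400), ('chr2', 100, 200)]
```

The nested scan is quadratic; sorting both inputs by coordinate would allow a linear sweep on real data.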
![Page 36: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/36.jpg)
39
Genomic Computing Dataset operations: COVER
S2 = COVER(min, max) S1;
• Produces new regions where there are between MIN and MAX regions of the operand dataset
• Jaccard indexes can be used instead of min-max
• An aggregate function f can be computed for regions forming the cover
[Figure labels: COVER(ALL, ALL) (AND); COVER(1, ANY) (OR); COVER(2, ANY)]
![Page 37: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/37.jpg)
40
Tumor_type = brca; Tumor_grade = g3
Tumor_type = brca; Tumor_grade = g2
Tumor_type = brca; Tumor_grade = g2
Tumor_type = brca; Tumor_grade = g2; Tumor_grade = g3
2 2 3 1 2 1 1
COVER(2, ANY): find portions of the genome that are covered by at least two regions
Genomic Computing Region Cover – Example COVER
S2 = COVER(2, ANY) S1;
S1.s1
S1.s2
S1.s3
S2 23 22
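A COVER(2, ANY)-style computation on a single chromosome can be sketched in Python with a sweep over interval ends. This is an illustration under stated assumptions, not GMQL's algorithm: intervals are `(left, right)` pairs pooled from all samples, only the lower bound is handled (ANY as upper bound), and the function name and coordinates are hypothetical.

```python
def cover_at_least(intervals, min_acc):
    """Return the maximal stretches covered by at least `min_acc` intervals."""
    events = []
    for left, right in intervals:
        events.append((left, 1))    # an interval opens
        events.append((right, -1))  # an interval closes
    # At equal positions process openings first, so that stretches which
    # merely touch are joined and zero-length stretches can be skipped.
    events.sort(key=lambda e: (e[0], -e[1]))

    result = []
    depth = 0
    start = None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_acc <= depth:
            start = pos                      # coverage reaches the threshold
        elif depth < min_acc <= previous:
            if pos > start:                  # skip zero-length stretches
                result.append((start, pos))
            start = None
    return result


# Regions of three samples pooled together (hypothetical coordinates).
pooled = [(0, 10), (30, 40), (5, 20), (35, 50), (8, 12)]
print(cover_at_least(pooled, 2))  # prints [(5, 12), (35, 40)]
```

With `min_acc=1` the same function returns the union of all regions, the OR-like behaviour of COVER(1, ANY).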
![Page 38: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/38.jpg)
• Given two sets of samples, JOIN builds the pairs of regions and metadata where a join predicate p is true.
• Regions of the result are composed from regions of the operands
S3 = JOIN(p, comp-op) S1 S2;
• Functions minDistance and distance can be used in the predicate
Genomic Computing Dataset operations: JOIN
41
![Page 39: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/39.jpg)
42
Genomic Computing Metadata join – Example JOIN
Metadata join: select pairs of matching samples (e.g. with the same “Type”)
Type = uvm; Patient = 123; Gender = M
Type = brca; Patient = 10; Age = 88
Type = brca; Patient = 211
Type = sarc; Patient = 12
Type = sarc; Patient = 444; Age = 88
Type = brca; Patient = 333; Grade = g3
Type = sarc; Patient = 12
![Page 40: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/40.jpg)
43
Join at min-distance: associate each region in the former dataset with the closest in the latter
Genomic Computing Region Join – Example JOIN
A B feature = transcripts
2 feature = TFBSs
S1.feature = transcripts; S2.feature = TFBSs
S3 = JOIN(mindistance, RIGHT) S1 S2;
1-A 3-B
1 3
S1
S2
S3
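The min-distance join can be sketched in Python for regions on one chromosome. The distance function below (positive gap, 0 for adjacent regions, negative for overlapping ones) is an assumption chosen to agree with the use of `distance < 0` for overlap in these slides; names and coordinates are hypothetical. As with the RIGHT option on the slide, the output carries the coordinates of the second operand.

```python
def distance(a, b):
    """Genometric distance between two (left, right) regions:
    gap size if disjoint, 0 if adjacent, negative if overlapping."""
    return max(a[0], b[0]) - min(a[1], b[1])


def join_nearest(anchors, experiments):
    """Pair every anchor region with its closest experiment region and
    return (experiment name, anchor name, experiment coordinates)."""
    pairs = []
    for a_name, a in anchors.items():
        e_name = min(experiments, key=lambda name: distance(a, experiments[name]))
        pairs.append((e_name, a_name, experiments[e_name]))
    return pairs


transcripts = {"A": (100, 200), "B": (1000, 1200)}                 # S1
tfbs = {"1": (50, 80), "2": (500, 520), "3": (1250, 1300)}        # S2

print(join_nearest(transcripts, tfbs))
# prints [('1', 'A', (50, 80)), ('3', 'B', (1250, 1300))]
```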
![Page 41: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/41.jpg)
S2 = JOIN(distance < 1000, CON) R, S1;
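[Figure: reference region R and experiment regions e1 and e2, with pair distances d = 400, d = 2500, d = 1100 and d = 0; first all pairs are shown, then the matching pairs and the region composition]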
44
Genomic Computing Dataset operations: JOIN example
![Page 42: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/42.jpg)
S2 = MAP(newAttr AS MIN(attr)) R S1;
[Figure: regions of S1 with their attribute values, reference regions R, and result S2 with one value per reference region]
45
Genomic Computing Dataset operations: MAP
• Computes aggregate functions over the regions of the S1 samples that intersect each region of R
0
![Page 43: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/43.jpg)
46
Compute an aggregate function (e.g. COUNT) on all the regions intersecting the reference
Genomic Computing Region Map – Example MAP
annotation = genes; provider = RefSeq
A B
feature = SNP
2 4
R.annotation = genes; R.provider = RefSeq; S1.features = SNP
S2 = MAP(count_R_S1 AS COUNT) R S1;
R
S1
S2
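The counting MAP can be sketched in Python for one reference sample and one experiment sample on a single chromosome. Intervals are half-open `(left, right)` pairs, SNPs are modelled as one-base intervals, and all coordinates and names are hypothetical; only the resulting counts (2 for A, 4 for B) mirror the slide.

```python
def map_count(reference, experiment):
    """For each named reference region, count the experiment regions
    that intersect it."""
    counts = {}
    for name, (left, right) in reference.items():
        counts[name] = sum(
            1 for e_left, e_right in experiment
            if e_left < right and left < e_right
        )
    return counts


genes = {"A": (100, 200), "B": (300, 500)}                       # R
snp_positions = [110, 150, 250, 310, 350, 400, 450]
snps = [(p, p + 1) for p in snp_positions]                       # S1

# Counterpart of: S2 = MAP(count_R_S1 AS COUNT) R S1;
print(map_count(genes, snps))  # prints {'A': 2, 'B': 4}
```

Every reference region appears in the result, with a count of 0 when nothing intersects it, which is what makes the output a fixed-shape row of a genome space.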
![Page 44: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/44.jpg)
47
Genomic Computing MAP opens to Genome Space abstraction
• MAP operations, through reference regions R, extract and standardize genomic features expressed in distinct datasets
• Genome Space: simplified structured outcome, ideal format for data analysis
R1 R2 R3 DHS
RNAPII H3K4me1
… Gene A TSS Enhancer E
GMQL MAP
![Page 45: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/45.jpg)
48
Genomic Computing Next: MAP – Genometric space abstraction
• Genometric spaces represent adjacency matrices, i.e. networks
– Network analysis methods (e.g. page rank, hub/authority, community detection, …)
R1 R2 R3
DHS RNAPII H3K4me1
… Gene A TSS
Enhancer E
GMQL MAP
![Page 46: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/46.jpg)
49
Genomic Computing Pattern-based querying & clustering
On the Genome Space (rows = regions, columns = datasets):
• Convert genome space values (e.g. to Boolean format)
• Extract patterns (regions with given values in different datasets, i.e. with same genomic traits)
• Filter rows (find regions with given values in given datasets)
• Cluster rows (based on identity or similarity)
• Cluster columns (based on identity or similarity)
• Extract relevant row attributes (e.g. common aspects of clustered regions)
• Extract relevant column attributes (e.g. common aspects of clustered datasets, e.g. prevalent phenotype)
[Table: example genome space G1 with rows r1-r6 (regions) and columns e1-e11 (datasets); cells marked with X]
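Two of the steps listed above, Boolean conversion and clustering of rows by identity, can be sketched in Python on a tiny genome space. The matrix values, the threshold and the function names are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Rows = regions, columns = datasets (e.g. MAP counts per experiment).
genome_space = {
    "r1": {"e1": 3, "e2": 0, "e3": 7},
    "r2": {"e1": 1, "e2": 0, "e3": 2},
    "r3": {"e1": 0, "e2": 4, "e3": 0},
}
columns = ["e1", "e2", "e3"]


def to_boolean(space, columns, threshold=0):
    """Turn every row into a tuple of booleans (value above threshold)."""
    return {
        row: tuple(values[c] > threshold for c in columns)
        for row, values in space.items()
    }


def cluster_identical(boolean_rows):
    """Group the rows that share exactly the same Boolean pattern."""
    clusters = defaultdict(list)
    for row, pattern in boolean_rows.items():
        clusters[pattern].append(row)
    return dict(clusters)


patterns = to_boolean(genome_space, columns)
print(cluster_identical(patterns))
# prints {(True, False, True): ['r1', 'r2'], (False, True, False): ['r3']}
```

Clustering by similarity rather than identity would replace the exact pattern key with a distance between rows.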
![Page 47: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/47.jpg)
51
Genomic Computing MAP result visualization: Genome Browser
Res = MAP(mutCount AS count) Genes Dataset;
Dataset
Genes
Res
![Page 48: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/48.jpg)
52
Genomic Computing MAP result visualization: Heatmap viewer
It requires:
• Partitioning by experiment classes
• Adding names to regions and to experiments (from metadata)
• Adding colors
![Page 49: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/49.jpg)
53
Genomic Computing Data Viewer: Region clustering
Cluster3 (low density pattern-2)
Cluster2 (low density pattern-1)
Cluster1 (high density)
Cluster4 (basal)
![Page 50: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/50.jpg)
54
Genomic Computing Data Viewer: MAP result visualization
![Page 51: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/51.jpg)
55
Genomic Computing Data Viewer: Dendrogram
Cut-1: 2 clusters
Cut-2: 4 clusters
Distance
![Page 52: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/52.jpg)
56
Genomic Computing Data Viewer: Pattern extraction on samples
![Page 53: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/53.jpg)
57
Genomic Computing Data Viewer: Metadata aggregation
For biological/clinical interpretation of genomic data processing, and for data stratification based on biological/clinical metadata values and/or patterns of different genomic feature regions
![Page 54: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/54.jpg)
GMQL at work (Examples)
59
![Page 55: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/55.jpg)
• Annotating unknown data to known annotations (e.g. annotating transcription factors to genes)
• Defining (epi)genomic features on the basis of experimental data (e.g. putative enhancers)
• Searching for patterns of (epi)genomic features in experimental data and relating them to specific biological phenomena (e.g. short-range interactions)
• Looking for patterns of (epi)genomic features related to the three-dimensional structure of the genome and relating them to specific biological phenomena (e.g. long-range interactions)
Genomic Computing applications Typical GMQL applications
60
![Page 56: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/56.jpg)
61
Genomic Computing GMQL examples – Genetic mutations
Finding genetic mutations in CpG Islands
Consider DNA-seq data of distinct human cancer cell lines; for each of them, quantify the mutations in each CpG island. Then, select the CpG islands with at least one mutation. Return the list of cancer cell lines ordered by the number of such CpG islands.
MUT = SELECT(dataType == 'DnaSeq' AND cell_karyotype == 'cancer') EXP;
CpG = SELECT(type == 'CpG islands') ANNOTATION;
CpG1 = MAP(MutCount AS COUNT) CpG MUT;
CpG2 = PROJECT(MutCount > 0) CpG1;
CpG3 = AGGREGATE(CpGCount AS COUNT) CpG2;
CpG_res = ORDER(DESC CpGCount) CpG3;
MATERIALIZE CpG_res;
![Page 57: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/57.jpg)
62
Genomic Computing GMQL examples – Heterogeneous datasets
Combining ChIP-seq and DNase-seq data in different formats and sources
Extract broad peaks of ChIP-seq transcription factor binding sites and histone modifications from ENCODE samples that intersect DNase-seq open chromatin regions from Roadmap Epigenomics in normal H1 embryonic stem cells.
CHIPSEQ = SELECT(dataType == 'ChipSeq' AND view == 'Peaks' AND setType == 'exp' AND cell == 'H1-hESC') HG19_ENCODE_BROAD;
DNASESEQ = SELECT(assay == 'DNase.hotspot.broad' AND Standardized_Epigenome_name == 'H1 Cells') HG19_ROADMAP_EPIGENOMICS_BED;
DNASESEQ1 = COVER(1, ANY) DNASESEQ;
CHIPSEQ_IN_DNASESEQ = JOIN(distance < 0, project_right_distinct) DNASESEQ1 CHIPSEQ;
MATERIALIZE CHIPSEQ_IN_DNASESEQ;
![Page 58: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/58.jpg)
63
Genomic Computing GMQL examples – Heterogeneous datasets
Associating transcriptomics and epigenomics
In RNA-seq experiments of human cancer cell line HeLa-S3, find the average expression of each gene. Then, in ChIP-seq experiments of the same cell line, in each gene find the average signal of H3K27me3 and H3K4me3 histone modifications. Map these experiments to known genes.
Based on all such average values, evaluate the pairwise similarity of each gene with the BRCA1 gene, and return the ordered list of the 20 genes most similar to BRCA1.
Also, through bi-clustering, identify and return groups of genes and their histone and transcription signals with similar patterns.
GENES = SELECT(type == 'gene' AND provider == 'UCSC') ANNOTATION;
RNA = SELECT(dataType == 'RnaChip' AND cell == 'HeLa-S3') PEAK;
HM = SELECT(dataType == 'ChipSeq' AND cell == 'HeLa-S3' AND (antibody == 'H3K27me3' OR antibody == 'H3K4me3')) PEAK;
EXP = UNION() RNA HM;
GenomeSpace = MAP(EXPavg AS avg(signal)) GENES EXP;
……
![Page 59: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/59.jpg)
64
Genomic Computing GMQL examples – Patient datasets
Counting distinct DNA mutations in patient groups
Group patients by tumor type and ethnicity, and count the distinct DNA somatic mutations in each group.
MUTATION = SELECT(dataType == 'dnaseq') TCGA_dnaseq;
MUTATION_BY_RACE = COVER(1, ANY; GROUP BY tumor_tag, race; overlap_count AS COUNT, barcodes AS BAG(tumor_sample_barcode)) MUTATION;
MUTATION_COUNT = AGGREGATE(mutation_count AS COUNT) MUTATION_BY_RACE;
MATERIALIZE MUTATION_COUNT;
![Page 60: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/60.jpg)
65
Genomic Computing GMQL examples – Different datasets of patients
Combining different data types of multiple patients
Match DNA copy number variation (CNV) and microRNA (miRNA) data samples regarding the same biospecimen and extract the CNVs occurring within expressed miRNA genes in the paired samples.
CNV = SELECT(dataType == 'cnv') TCGA_cnv;
MIRNA_GENE = SELECT(dataType == 'mirnaseq') TCGA_mirnaseq_mirna;
CNV_GENE_0 = MAP(left->bcr_sample_barcode == right->bcr_sample_barcode, gene_count AS COUNT, mirna_genes AS BAG(mirna_id)) CNV MIRNA_GENE;
CNV_GENE = PROJECT(gene_count > 0) CNV_GENE_0;
MATERIALIZE CNV_GENE;
![Page 61: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/61.jpg)
Genomic Computing GMQL examples – Different datasets of patients
Combining and processing heterogeneous omics data of patients
In TCGA data of breast cancer patients, find the DNA somatic mutations within the first 2000 bp outside of the genes that are expressed and methylated in at least one of these patients, and extract the top five patients with the highest number of such mutations, together with their somatic mutations.
EXPRESSED_GENE = SELECT(dataType == 'rnaseqv2' AND tumor_tag == 'brca') HG19_TCGA_RnaSeqV2_Gene;
METHYLATION = SELECT(dataType == 'dnamethylation' AND tumor_tag == 'brca') HG19_TCGA_Dnamethylation;
MUTATION = SELECT(dataType == 'dnaseq' AND tumor_tag == 'brca') HG19_TCGA_DnaSeq;
GENE_METHYL = JOIN(left->bcr_sample_barcode == right->bcr_sample_barcode, distance < 0, project_left_distinct) EXPRESSED_GENE METHYLATION;
GENE_METHYL1 = COVER(1, ANY) GENE_METHYL;
MUTATION_GENE = JOIN(distance < 2000 AND distance > 0, left) MUTATION GENE_METHYL1;
MUTATION_GENE_count = AGGREGATE(mutation_count AS COUNT) MUTATION_GENE;
MUTATION_GENE_top = ORDER(DESC mutation_count; TOP 5) MUTATION_GENE_count;
MATERIALIZE MUTATION_GENE_top;
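The distance predicate `distance < 2000 AND distance > 0` used in the query above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: regions are half-open `(start, stop)` pairs on one chromosome, strand is ignored, and the sign convention (negative distance for overlapping regions, 0 for adjacent ones) is inferred from the slides' use of `distance < 0` to express overlap. The function names are hypothetical.

```python
def genometric_distance(a, b):
    """Distance in bases between two half-open regions (start, stop).

    Negative when the regions overlap, 0 when they are adjacent.
    """
    return max(a[0], b[0]) - min(a[1], b[1])


def mutations_near_genes(mutations, genes, max_dist=2000):
    """Keep mutations lying outside a gene but closer than max_dist bases to it."""
    return [
        m for m in mutations
        if any(0 < genometric_distance(m, g) < max_dist for g in genes)
    ]


genes = [(10_000, 12_000)]
mutations = [(9_500, 9_501), (11_000, 11_001), (15_000, 15_001)]
nearby = mutations_near_genes(mutations, genes)
```

In this toy input only the first mutation qualifies: the second lies inside the gene (negative distance) and the third is 3000 bases away.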
![Page 62: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/62.jpg)
System Architecture
(GenData 2020) http://www.bioinformatics.deib.polimi.it/genData/
![Page 63: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/63.jpg)
Genomic Computing Overall architecture of system (GenData 2020)
![Page 64: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/64.jpg)
Genomic Computing Repository
Stores experimental datasets and annotations collected from external databases:
• ENCODE (more than 4000 processed datasets for human and mouse, relevant to epigenomic research)
• Epigenomics Roadmap (about 1000 human epigenomic datasets for stem cells and ex-vivo tissues)
• TCGA (The Cancer Genome Atlas, providing more than 50,000 processed datasets for more than 30 cancer types, including mutations, copy number variations, gene and miRNA expressions, methylations)
![Page 65: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/65.jpg)
Genomic Computing Repository
![Page 66: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/66.jpg)
Genomic Computing Repository - Annotations
Annotation data are also extracted from external references, based on the needs of given research projects:
• Genes (UCSC, RefSeq, Ensembl)
• Transcription Start Sites (SwitchGear)
• Transcription Factor Binding Sites (UCSC, ENCODE)
• CpG islands (UCSC)
• miRNA target sites (UCSC)
• Enhancers (Vista)
![Page 67: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/67.jpg)
Implementation
![Page 68: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/68.jpg)
Genomic Computing GMQL implementation, V1
• GMQL is similar to Pig Latin (by Yahoo! Research)
– Algebraic language for data-intensive applications on Apache Hadoop, a framework for parallel computing which executes Google MapReduce programs
• Implementation strategy: develop a translator to Pig Latin
– Easier development and maintenance
– Big company involvement ensures development
– Use cloud computing power to obtain efficiency and scalability
![Page 69: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/69.jpg)
Genomic Computing System architecture
![Page 70: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/70.jpg)
Genomic Computing System architecture
![Page 71: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/71.jpg)
Genomic Computing System architecture - Repository
![Page 72: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/72.jpg)
Genomic Computing GMQL query translation to PIG over Hadoop
Figure: GMQL query → Translator → Pig over Hadoop
Motivation:
• Clear & compact user code
• User-transparent optimization
![Page 73: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/73.jpg)
Genomic Computing Translation example
• 1 statement => 25 Pig Latin lines of code + auxiliary Java function
• The translator also takes care of updating the variable schema
• Error handling
![Page 74: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/74.jpg)
Genomic Computing GMQL to Pig Latin translation
![Page 75: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/75.jpg)
Genomic Computing Optimization in translation
1. Parallelism by splitting computations:
• By chromosome
• By experiment
2. Join and Map have a translation which avoids cross products, based on a sequential scan of regions
Pig Latin shows its ability to scale on hundreds or thousands of experiments and multi-node systems
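The two ideas on this slide (splitting by chromosome, and finding overlaps with an ordered scan instead of a cross product) can be sketched in plain Python. This is an illustrative sketch only, not the Pig Latin translation used by the system; the function names and the `(chrom, start, stop)` layout with half-open coordinates are assumptions made for the example.

```python
from collections import defaultdict


def by_chromosome(regions):
    """Split (chrom, start, stop) regions into per-chromosome lists sorted by start."""
    groups = defaultdict(list)
    for chrom, start, stop in regions:
        groups[chrom].append((start, stop))
    return {chrom: sorted(rs) for chrom, rs in groups.items()}


def scan_overlaps(refs, exps):
    """Overlapping (ref, exp) pairs on one chromosome; both inputs sorted by start."""
    pairs = []
    first = 0  # experiment regions before this index end before every remaining ref starts
    for r_start, r_stop in refs:
        # Permanently skip leading experiment regions that can no longer overlap.
        while first < len(exps) and exps[first][1] <= r_start:
            first += 1
        j = first
        # Stop as soon as experiment regions start at or after the end of the ref.
        while j < len(exps) and exps[j][0] < r_stop:
            if exps[j][1] > r_start:
                pairs.append(((r_start, r_stop), exps[j]))
            j += 1
    return pairs


def overlaps_by_chromosome(refs, exps):
    """Run the scan independently on each chromosome (each call could run in parallel)."""
    ref_groups, exp_groups = by_chromosome(refs), by_chromosome(exps)
    return {
        chrom: scan_overlaps(ref_groups[chrom], exp_groups.get(chrom, []))
        for chrom in ref_groups
    }


refs = [("chr1", 0, 10), ("chr1", 20, 30), ("chr2", 0, 5)]
exps = [("chr1", 5, 25), ("chr1", 8, 9), ("chr1", 30, 40), ("chr2", 5, 6)]
result = overlaps_by_chromosome(refs, exps)
```

Because each chromosome is handled on its own, the per-chromosome calls are independent units of work, which is what makes the splitting strategy attractive on a multi-node system.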
![Page 76: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/76.jpg)
Genomic Computing MAP and JOIN vs. competitors
![Page 77: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/77.jpg)
Genomic Computing System architecture, V2
• Holistic data management system for genomics
• Uses cloud-based computing for querying thousands of heterogeneous datasets
![Page 78: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/78.jpg)
Genomic Computing GMQL implementation, V2
• A different approach, with a language-independent intermediate representation (IR)
• Also targeting usability from within R and Galaxy
Figure: layers of the V2 implementation. Syntax: GMQL, Embedded GMQL. Semantics: IR (Logical GMQL), reached through an API to IR. Implementation: Spark, Flink, ???
![Page 79: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/79.jpg)
Genomic Computing GMQL implementation, V2 - Scala API
![Page 80: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/80.jpg)
Genomic Computing GMQL implementation, V2 - IR Example
![Page 81: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/81.jpg)
Genomic Computing GMQL implementation, V2
• New optimization options
Figure: GMQL → GMQL IR → Optimized IR → Implementation
Query Optimizer:
1) Node reordering / deletion
2) Select condition refinement
Low Level Optimizer:
1) Alternative algorithms
2) Parallelism tuning
3) Data partitioning
4) Caching
![Page 82: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/82.jpg)
Genomic Computing Optimizations at compile-time & execution-time
Idea:
• Let the Flink/Spark/… engines implement common and well-known optimizations
• Exploit the intermediate representation in order to implement optimizations which are driven by the semantics of GMQL
– Meta-first optimization
– Operator swapping optimization
• Other optimizations based on algorithms for parallel execution on the cloud
![Page 83: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/83.jpg)
Genomic Computing Meta-first optimization
Under certain conditions (meta-separability), it is possible to compute the metadata side of the query strictly before the region data side.
![Page 84: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/84.jpg)
Genomic Computing Meta-separability
GMQL queries are always meta-separable, except for the ones which use the EXTEND (AGGREGATE) operator.
(The EXTEND operator computes aggregates on the region data and stores the result in the metadata.)
![Page 85: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/85.jpg)
Genomic Computing Meta-first optimization - Workflow
Figure: workflow with the nodes ReadMD, StoreMD, ReadRD and StoreRD; sample IDs flow from the metadata side to the region data side
• Compute metadata side of the query
• Retrieve the IDs from the metadata result
• Use the IDs to selectively load only the files that will appear in the output
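The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the in-memory `metadata` dictionary, the `load_regions` callable (standing in for reading a region file) and the function name `meta_first` are all assumptions made for the example.

```python
def meta_first(metadata, load_regions, predicate):
    """Evaluate the metadata side first, then load only the selected samples.

    metadata: dict sample_id -> dict of attributes (hypothetical layout)
    load_regions: callable sample_id -> list of regions (stands in for a file read)
    predicate: callable applied to one attribute dict
    """
    # Steps 1 and 2: compute the metadata result and retrieve the surviving IDs.
    selected_ids = [sid for sid, attrs in metadata.items() if predicate(attrs)]
    # Step 3: read region data only for samples that will appear in the output.
    return {sid: load_regions(sid) for sid in selected_ids}


metadata = {
    "s1": {"cell": "k562", "dataType": "ChipSeq"},
    "s2": {"cell": "HeLa-S3", "dataType": "ChipSeq"},
    "s3": {"cell": "k562", "dataType": "DnaseSeq"},
}
loaded = []  # records which region files were actually read


def load_regions(sample_id):
    loaded.append(sample_id)
    return [("chr1", 0, 100)]


result = meta_first(metadata, load_regions, lambda attrs: attrs["cell"] == "k562")
```

In this toy run the region data of sample `s2` is never read, which is the saving the optimization aims for.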
![Page 86: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/86.jpg)
Genomic Computing Where does it help?
The affected queries are the ones which contain one or more metadata selections (far from the read operations), metadata joins or metadata group-bys; those operations cut the size of the output
![Page 87: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/87.jpg)
Genomic Computing Operator swapping optimization
Some reorderings of the execution plan cannot be inferred by the lower-level optimizer, since they are motivated by GMQL semantics
Figure: two execution plans combining a left Join and a Diff, the first over DS1, DS2, DS3 and the second over DS1, DS3, DS2
![Page 88: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/88.jpg)
Genomic Computing Binning strategy optimization
Figure: genome partitioned into bins (Bin1 … Bin7)
Strategy for intersection:
1. Partition the genome into bins
2. Assign each region to all the bins it overlaps
3. Search for intersections within each bin
In the case of more complex operations, we change the way in which the regions are assigned to the bins
![Page 89: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/89.jpg)
Genomic Computing Binning strategy
Avoiding output duplicates:
To avoid producing duplicates, when two regions overlap, an output is emitted if, and only if, at least one of them begins in the considered bin
• Bin 2: overlap => red region begins => Output emitted
• Bin 3: overlap => no region begins => Output not emitted!
Figure: regions laid over bin 1, bin 2 and bin 3
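The binning strategy and the duplicate-avoidance rule can be sketched together in Python. This is an illustration under assumptions, not the system's code: regions are half-open `(start, stop)` pairs on a single chromosome, the bin size is fixed, and the names `bins_of` and `binned_overlaps` are hypothetical.

```python
from collections import defaultdict


def bins_of(start, stop, bin_size):
    """Indices of all bins overlapped by the half-open region [start, stop)."""
    return range(start // bin_size, (stop - 1) // bin_size + 1)


def binned_overlaps(left, right, bin_size):
    """Overlapping (left, right) pairs, each reported exactly once."""
    left_bins, right_bins = defaultdict(list), defaultdict(list)
    # Assign each region to all the bins it overlaps.
    for region in left:
        for b in bins_of(region[0], region[1], bin_size):
            left_bins[b].append(region)
    for region in right:
        for b in bins_of(region[0], region[1], bin_size):
            right_bins[b].append(region)

    out = []
    # Search for intersections within each bin.
    for b, left_regions in left_bins.items():
        for l in left_regions:
            for r in right_bins.get(b, ()):
                if l[0] < r[1] and r[0] < l[1]:
                    # Emit only in a bin where at least one of the two regions begins.
                    if l[0] // bin_size == b or r[0] // bin_size == b:
                        out.append((l, r))
    return sorted(out)


left = [(50, 250), (300, 320)]
right = [(120, 280), (310, 400), (500, 600)]
pairs = binned_overlaps(left, right, bin_size=100)
```

With a bin size of 100, the regions `(50, 250)` and `(120, 280)` meet in two bins, but only the bin containing position 120 has a region beginning in it, so the pair is reported once.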
![Page 90: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/90.jpg)
Genomic Computing Binning tradeoff
• Smaller bins: smaller search space, but higher number of replicates
• Optimal binning size depends on:
– Number of regions and local density
– Region length distribution
– GMQL operation and parameters
– System settings (e.g., number of nodes, amount of memory, …)
Figure: execution time (axis from 0 to 300) versus bin length (1K, 5K, 10K, 50K, 100K, 200K, 500K, 1M, 10M)
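One side of the tradeoff, the number of replicates, is easy to quantify. The sketch below is a hypothetical Python illustration (function names, sample regions and bin sizes are invented for the example): it counts how many copies of each half-open region `(start, stop)` are created when the region is assigned to every bin it overlaps.

```python
def replicas(start, stop, bin_size):
    """Number of bins that the half-open region [start, stop) is copied into."""
    return (stop - 1) // bin_size - start // bin_size + 1


def total_replicas(regions, bin_size):
    """Total number of region copies produced by binning at the given size."""
    return sum(replicas(start, stop, bin_size) for start, stop in regions)


regions = [(0, 1_000), (4_500, 5_500), (9_990, 30_010)]
counts = {size: total_replicas(regions, size) for size in (1_000, 10_000, 100_000)}
```

The total shrinks as bins grow, while each bin then holds more regions to compare; the balance between the two effects is what the measured execution time reflects.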
![Page 91: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/91.jpg)
User Interface
![Page 92: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/92.jpg)
Genomic Computing GMQL Web interface
http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/
![Page 93: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/93.jpg)
Genomic Computing GMQL results on Integrated Genome Browser
![Page 94: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/94.jpg)
Genomic Computing GMQL examples – Distal bindings: Visualization of results
Results are provided to the user in GTF or tab-delimited format
![Page 95: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/95.jpg)
Genomic Computing GMQL examples – Distal bindings: Visualization of results
![Page 96: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/96.jpg)
Applications
![Page 97: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/97.jpg)
Genomic Computing Transcription Factor query & Genome Space
Source: ENCODE ChIP-seq datasets for transcription factors (TF)
Goal: Generation of a transcriptional network from ChIP-seq data
Method: Single out the TFs which are also genes, and derive TF-Gene and TF-TF links. Build a TF x Genes matrix M such that Mij = 1 if there is at least one binding site of TF i in the region of gene j, and Mij = 0 otherwise.
# Extract gene information, some of them tagged with TF-encoding
GENE_ANN = SELECT(type == 'gene' AND provider == 'RefSeq') ANNOTATION;
GENES = PROJECT(feature == 'gene') GENE_ANN;
TF_GENES = PROJECT(encode_TF == 'yes') GENES; # red in next slide
# Collect TF samples (122)
TF = SELECT(dataType == 'ChipSeq' AND subType == 'TF' AND cell == 'k562' AND treatment == 'None') ENCODE_PEAK;
# Build TF Genomic Space
GS_TF = MAP(signal AS EXISTS) GENES TF;
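The binary TF x Genes matrix described in the method can be sketched in Python as follows. This is only an illustration of the definition of Mij, not the MAP operator itself: the dictionary layouts, the toy coordinates and the function names are assumptions made for the example, and regions are half-open `(chrom, start, stop)` triples.

```python
def overlaps(a, b):
    """True when two (chrom, start, stop) regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tf_gene_matrix(tf_peaks, genes):
    """Binary matrix M with M[i][j] = 1 if TF i has a peak overlapping gene j.

    tf_peaks: dict TF name -> list of peak regions (hypothetical layout)
    genes: dict gene name -> gene region
    """
    tf_names, gene_names = sorted(tf_peaks), sorted(genes)
    matrix = [
        [int(any(overlaps(peak, genes[g]) for peak in tf_peaks[tf])) for g in gene_names]
        for tf in tf_names
    ]
    return tf_names, gene_names, matrix


genes = {"G1": ("chr1", 100, 200), "G2": ("chr1", 500, 600)}
tf_peaks = {
    "TF1": [("chr1", 150, 160)],
    "TF2": [("chr1", 550, 560), ("chr2", 150, 160)],
    "TF3": [],
}
tf_names, gene_names, matrix = tf_gene_matrix(tf_peaks, genes)
```

Each row of `matrix` corresponds to one TF sample and each column to one gene, so the nonzero entries are the TF-Gene links of the network.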
![Page 98: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/98.jpg)
Genomic Computing Genome Space building
Cell line: K562 (CML)
# TFs: 122
# Nodes: 6240
# Edges: 30587
Figure: binding sites of TF1, TF2 and TF4 mapped on the annotated genes (Ann) G1, G2 and G4, with G3 = TF3, and the resulting TF-gene network
![Page 99: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/99.jpg)
Genomic Computing Restriction of TF to open chromatin regions
# Collect open chromatin samples
DHS_2 = SELECT(dataType == 'DnaseSeq' AND cell == 'k562') ENCODE_PEAK; # 2 samples
FAIRE = SELECT(dataType == 'FaireSeq' AND cell == 'k562') ENCODE_PEAK; # 1 sample
# Merge DHS replicates in one sample
DHS = COVER_FLAT(2, ANY; pValue AS MIN(pValue)) DHS_2;
# Merge open chromatin regions from DHS and FAIRE assays
DHS_FAIRE = UNION DHS FAIRE;
OPEN = COVER(1, ANY; pValue AS MIN(pValue)) DHS_FAIRE;
# Extract TFs in open chromatin regions only (active DNA binding)
TF_OPEN = JOIN(distance < 0, left) TF OPEN;
![Page 100: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/100.jpg)
Genomic Computing Transcription Factor Genome Space
# TF Genomic Space
GS_TF = MAP(signal AS EXISTS) GENES TF_OPEN;
      G1   G2   G3=TF3   G4   G5   …   Gn
TF1   1    0    1        1    0    …   1
TF2   1    0    1        0    0    …   1
TF3   0    0    0        0    1    …   0
…     0    0    0        1    0    …   1
TFn   0    0    0        1    1    …   1
![Page 101: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/101.jpg)
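Genomic Computing Genome Space building
Cell line: K562 (CML)
# TFs: 95
# Nodes: 1717
# Edges: 2367
Figure: binding sites of TF1, TF2 and TF4, restricted by the DHS and FAIRE open chromatin tracks, mapped on the annotated genes (Ann) G1, G2 and G4, with G3 = TF3, and the resulting TF-gene network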
![Page 102: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/102.jpg)
From Genometric Space to networks: K562 transcription network
![Page 103: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/103.jpg)
Genomic Computing Differential binding research
Source: ENCODE JUN ChIP-seq datasets for cell line 'k562' (chronic myelogenous leukemia), with one control and 4 cases treated using interferon alpha or gamma, at 30 minutes and 6 hours respectively, each with two replicates
Questions:
• Q1: Find the JUN binding regions in treated cases where there are no JUN binding peaks in the control
• Q2: Find the JUN binding regions which are differentially present between treated cases (e.g., in alpha and not in gamma interferon treated samples)
• Q3: Find the genes that overlap with JUN binding regions either in control or treated samples
• Q4: Find the average JUN ChIP-seq signal (in bedGraph files) of the treated samples in promoters (areas surrounding the TSS) of genes intersecting JUN binding regions either in control or treated samples
![Page 104: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/104.jpg)
Select IFNa30 non-overlapping with CONTROLS (Question 1, same for IFNa6h, IFNg30, IFNg6h)
# select controls
JUN_2 = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'None') ENCODE_PEAK;
# extract the peak regions in at least two replicas and their minimum p-value
JUN = COVER(2, ANY; pValue AS MIN(pValue)) JUN_2;
# select treated samples
JUN_IFNa30_2 = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'IFNa30') ENCODE_PEAK;
# extract the peak regions in at least two replicas and their minimum p-value
JUN_IFNa30_3 = COVER(2, ANY; pValue AS MIN(pValue)) JUN_IFNa30_2;
# extract peaks which do not overlap with controls (question 1)
JUN_IFNa30 = JOIN(minDistance AND distance >= 0, left) JUN_IFNa30_3 JUN;
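The idea behind `COVER(2, ANY)`, keeping only the stretches of genome covered by at least two input regions, can be illustrated with a sweep over region boundaries. The Python sketch below is a simplified reading of that idea, not the GMQL operator: it handles one chromosome, ignores strand and aggregate attributes such as the minimum p-value, uses half-open `(start, stop)` coordinates, and the name `cover_at_least` is hypothetical.

```python
def cover_at_least(regions, min_acc):
    """Maximal intervals covered by at least min_acc of the input regions."""
    events = []
    for start, stop in regions:
        events.append((start, 1))   # a region opens
        events.append((stop, -1))   # a region closes
    # At equal positions closings sort first, matching half-open coordinates.
    events.sort()

    covered = []
    depth = 0
    current_start = None
    for position, delta in events:
        previous = depth
        depth += delta
        if previous < min_acc <= depth:
            current_start = position                    # coverage reaches the threshold
        elif depth < min_acc <= previous:
            if position > current_start:
                covered.append((current_start, position))  # coverage drops below it
            current_start = None
    return covered


replica_peaks = [(0, 10), (5, 15), (12, 20)]
shared = cover_at_least(replica_peaks, min_acc=2)
```

For the toy input, positions 5 to 10 and 12 to 15 are each covered by two peaks, so these are the two intervals returned.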
![Page 105: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/105.jpg)
Regions expressed differentially by treatment alpha/gamma (Question 2)
# all treated samples: union of all peak regions non-overlapping with controls
JUN_IFNa = UNION JUN_IFNa30 JUN_IFNa6h;
JUN_IFNg = UNION JUN_IFNg30 JUN_IFNg6h;
JUN_IFNag = UNION JUN_IFNa JUN_IFNg;
# regions that are present in all treatments alpha and not in any treatment gamma (question 2)
JUN_IFNag_ONLYa = DIFFERENCE() JUN_IFNa JUN_IFNg;
# regions that are present in all treatments gamma and not in any treatment alpha (question 2)
JUN_IFNag_ONLYg = DIFFERENCE() JUN_IFNg JUN_IFNa;
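The effect of the two DIFFERENCE statements can be illustrated with a short Python sketch, read here as "keep the regions of the first operand that intersect no region of the second". This reading, the single-chromosome half-open `(start, stop)` layout, the toy coordinates and the function names are assumptions made for the example.

```python
def intersects(a, b):
    """True when two half-open (start, stop) regions share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def difference(left, right):
    """Regions of left that do not intersect any region of right."""
    return [l for l in left if not any(intersects(l, r) for r in right)]


alpha_peaks = [(100, 200), (400, 450), (900, 950)]
gamma_peaks = [(150, 180), (600, 700)]
only_alpha = difference(alpha_peaks, gamma_peaks)
only_gamma = difference(gamma_peaks, alpha_peaks)
```

The operation is not symmetric, which is why the slide computes it twice with the operands exchanged.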
![Page 106: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/106.jpg)
JUN overlapping gene lists (Question 3)
# extract all genes
GENE_2 = SELECT(type == 'gene' AND provider == 'RefSeq') ANNOTATION;
GENE = PROJECT(feature == 'gene') GENE_2;
# all samples: union of peak regions in controls or in treated samples
JUN_ALL = UNION JUN JUN_IFNag;
# extract genes intersecting with peak regions in control or treated samples (question 3)
GENE_JUN_ALL = JOIN(distance < 0, left) GENE JUN_ALL;
![Page 107: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/107.jpg)
Prepare relevant promoter regions
# extract transcription start sites (TSS) of all genes
TSS = SELECT(type == 'TSS') ANNOTATION;
# extend TSS to obtain promoter regions
PROM = PROJECT(start = start - 1000, stop = stop + 500) TSS;
# extract the peak regions present in at least one sample
JUN_ALL_2 = COVER(1, ANY) JUN_ALL;
# extract all promoter regions intersecting with at least one peak region in control or treated samples
PROM_JUN_ALL = JOIN(distance < 0, right) JUN_ALL_2 PROM;
![Page 108: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/108.jpg)
Compute average signal on extended promoter regions (Question 4)
# Load bedGraph datasets
JUN_ALL_BG = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND cell == 'k562' AND laboratory == 'Stanford' AND (treatment == 'IFNa30' OR treatment == 'IFNa6h' OR treatment == 'IFNg30' OR treatment == 'IFNg6h' OR treatment == 'None')) ENCODE_BEDGRAPH;
# Map peak binding regions on promoters intersecting with at least one peak region in control or treated samples, producing average promoter signal (question 4)
JUN_ALL_SIGNAL = MAP(promAvg AS AVG(signal)) PROM_JUN_ALL JUN_ALL_BG;
       PR1    PR2    PR3   PR4   PR5   …   PRn
JUN1   6.7    7.5    5.2   3.2   5.2   …   15.4
JUN2   15.4   0.4    0.8   9.4   6.2   …   21.6
JUN3   5.5    14.5   7.2   5.4   5.2   …   4.5
…      …      …      …     …     …     …   …
JUNn   2.2    2.1    2.5   6.5   6.2   …   6.7
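How one row of such a table could be obtained is sketched below in Python: for each reference region (promoter), average the signal of the experiment regions that overlap it. This is an illustration of the idea behind `MAP(promAvg AS AVG(signal))`, not the operator's implementation; the single-chromosome half-open layout, the toy values and the name `map_average` are assumptions, and the signal is averaged per overlapping region without weighting by overlap length.

```python
def map_average(references, experiment):
    """Average signal of the experiment regions overlapping each reference region.

    references: list of (start, stop)
    experiment: list of (start, stop, signal)
    Returns one value per reference region, None when nothing overlaps it.
    """
    averages = []
    for r_start, r_stop in references:
        signals = [
            signal
            for e_start, e_stop, signal in experiment
            if e_start < r_stop and r_start < e_stop
        ]
        averages.append(sum(signals) / len(signals) if signals else None)
    return averages


promoters = [(0, 1500), (5000, 6500)]
bedgraph = [(100, 200, 4.0), (1400, 1600, 8.0), (3000, 3100, 2.0)]
prom_avg = map_average(promoters, bedgraph)
```

Repeating the computation for every experiment sample yields one row per sample and one column per promoter, which is the layout of the table above.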
![Page 109: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/109.jpg)
Summary & Outlook
![Page 110: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/110.jpg)
Genomic Computing Conclusions
• GDM: a data format-independent genomic data model
– For genomic region data and related metadata
– Easing integration and processing of heterogeneous genomic data
• GMQL: a high-level declarative language
– Easing the expression of even complex queries on numerous data of multiple different types
– Running also on cloud computing environments
– Supporting a first processing also of big data, to extract the relevant (usually smaller) ones for further processing
• Several GDM & GMQL application examples
– Characterizing interplay and function of genomic regions
![Page 111: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/111.jpg)
Genomic Computing Discovering biologically useful results
Working together with biologists at IEO & IIT to answer biological problems. Ongoing projects:
1. 3D chromatin structure (Pelicci)
2. DNA replication and gene expression (Pelicci)
3. Analysis of unknown P53 binding loci (Amati)
4. Chromatin state change in time-course Myc bindings (Amati)
5. Myc bindings saturation patterns (Amati)
6. Transcription factors co-occurrence with TEAD binding sites (Campaner)
7. Correlating hotspots of RAD21 bindings with high density TF regions (Campaner)
8. Copy number variants (CNVs) in Chromosome 7 (Testa)
![Page 112: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/112.jpg)
Genomic Computing Future Vision: Pattern-based queries from genome browser
Figure label: A pattern of genomic features
![Page 113: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/113.jpg)
Genomic Computing Future Vision: Cycle Query–Analysis–Visualization
• Query
– By example
– Using public DBs & ontologies
– Search remote data
– Query / extract remote data
• Analysis
– (un)supervised learning
– Region finding
– Motif / pattern finding
• Visualization
– Clustering
– Long range interactions
Figure: cycle linking Query, Analysis and Visualization
Genomic Computing Future Long-term vision: Internet of Genomes
• The platform (client & servers) and language should support queries/computations involving different servers
– Minimizing the information to be transferred among servers and between them and the client
• Each server should expose its own data for access by exploratory search & crawlers
![Page 115: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/115.jpg)
Genomic Computing Resources & Web sites & acknowledgements
Overview: http://www.bioinformatics.deib.polimi.it/genomic_computing/
GMQL web site: http://www.bioinformatics.deib.polimi.it/GMQL/
Includes:
• Local mode or MapReduce mode (over Hadoop, or Hadoop YARN) for GNU/Linux systems - Download (122 MB)
• Web services (over Hadoop YARN) - Download (60 MB)
• Quick start - Install GMQL and get started
• GMQL tutorial & Complete documentation
• Functional comparison with BEDTools & BEDOPS
• Pointer to publications (Bioinformatics, IEEE/ACM TCBB, Methods)
http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/
Includes:
• User-friendly interface for creating/managing GMQL queries
• Repository of ENCODE / Roadmap Epigenomics / TCGA datasets
![Page 116: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/116.jpg)
Genomic Computing Resources, Web sites & acknowledgements
European Research Council “Data-Driven Genomic Computing” (GeCo) project: http://www.bioinformatics.deib.polimi.it/GeCo/
Our group, with Prof. Stefano Ceri and several PhD & master students and postdocs:
• Abdulrahman Kaitoua
• Pietro Pinoli
• Arif Canakoglu
![Page 117: Genomic Data Model and GenoMetric Query Language as](https://reader034.vdocuments.us/reader034/viewer/2022052501/628b3d7fb0db6703265cfc54/html5/thumbnails/117.jpg)
Thank you for your attention!
Any questions?
Genomic Computing
http://www.bioinformatics.deib.polimi.it/genomic_computing/