![Page 1: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/1.jpg)
Knowledge Graph: Connecting Big Data Semantics
Ying Ding, Indiana University
![Page 2: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/2.jpg)
Entity in Big Data
• Entity: things, not strings
• Relationship matters: connecting entities
• Changes in searching:
  – string → entity → relation → subgraph
![Page 3: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/3.jpg)
entity relations
![Page 4: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/4.jpg)
Entities
• Entities in the social web: person, location, organization, book, music (freebase.com: Metformin)
• Entities in translational medicine: gene, drug, disease, protein, side effect (conceptwiki: Disease Lafora)
• Data: scientific papers (PubMed, PubMed Central) and experimental data (Swiss-Prot, KEGG, DrugBank)
![Page 5: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/5.jpg)
Challenges
• Knowledge Graph
  – Entity graph
  – Schema graph (small size) vs. instance graph (large size)
  – Graph mining (e.g. shortest path, depth-first/breadth-first search, PageRank)
    • Neo4j, NoSQL graph database
  – Graph pattern search (SPARQL)
    • Triple store, Virtuoso (OpenLink Software)
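As a minimal sketch of the graph-mining side of this slide, the snippet below runs a breadth-first search for a shortest path over a toy entity graph. The entity names and edges are invented for illustration and do not come from the talk; a graph database such as Neo4j would offer this kind of traversal natively.

```python
from collections import deque

# Toy undirected entity graph (hypothetical nodes and edges, illustration only).
EDGES = [
    ("Metformin", "Type 2 diabetes"),
    ("Type 2 diabetes", "Nephropathy"),
    ("Metformin", "AMPK"),
    ("AMPK", "Glucose metabolism"),
    ("Glucose metabolism", "Nephropathy"),
]

def build_adjacency(edges):
    """Turn an edge list into an adjacency dict for an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def shortest_path(adj, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    if start not in adj or goal not in adj:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(adj[node]):  # sorted for deterministic output
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

adj = build_adjacency(EDGES)
path = shortest_path(adj, "Metformin", "Nephropathy")
print(path)
```

Breadth-first search is enough here because all edges have equal weight; weighted edges would call for something like Dijkstra's algorithm instead.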
![Page 6: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/6.jpg)
Use Case: Individualized Cohort in EHR
• An EHR-based individualized cohort can provide a better solution than standard guidelines because the cohort is drawn from a patient population with the same geolocation, demographics, and socio-economic group as the given patient.
• EHRs are organized around the patient, not by concepts (diseases, lab results, medications, etc.)
![Page 7: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/7.jpg)
Use Case: Individualized Cohort in EHR
• EHR data contain controlled vocabularies (e.g., demographics, diagnostic codes, medications, procedures) and continuous values (e.g., lab tests, medication doses).
  – Category hierarchy (parent, siblings, subtrees): search for patients like a given diagnosis, “ICD10:E11.21” (diabetes with nephropathy) → ICD10:E11.22 (with chronic kidney disease) → ICD10:E11 (diabetes in general)
  – Continuous values: serum glucose = 120 mg/dL (many continuous values may not have a natural aggregate binning)
  – Queries for searching patients are rarely exact (fasting serum glucose = 126 → serum glucose between 120 and 130, or serum glucose in the 80th percentile at this time)
• A patient can have 100-100,000 property values, which contain 100 controlled vocabulary values and 1000 continuous values. Most values are time-based.
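The widening of a patient search along the category hierarchy and over a value range can be sketched as follows. This is an illustration under stated assumptions: `PARENT` is a small hand-written slice of the ICD-10 hierarchy, and the patient records and the function names (`ancestors`, `is_under`, `similar_patients`) are hypothetical.

```python
# Hand-written slice of the ICD-10 hierarchy: child code -> parent code.
PARENT = {
    "E11.21": "E11.2",
    "E11.22": "E11.2",
    "E11.2": "E11",
    "E11.9": "E11",
}

# Hypothetical patients: one diagnosis code and one serum glucose value (mg/dL).
PATIENTS = {
    "A": {"dx": "E11.21", "glucose": 126},
    "B": {"dx": "E11.22", "glucose": 122},
    "C": {"dx": "E11.9", "glucose": 128},
    "D": {"dx": "I10", "glucose": 95},
}

def ancestors(code):
    """Return the chain of parent codes, nearest first."""
    chain = []
    while code in PARENT:
        code = PARENT[code]
        chain.append(code)
    return chain

def is_under(code, root):
    """True if `code` equals `root` or lies in the subtree below it."""
    return code == root or root in ancestors(code)

def similar_patients(dx_root, low, high):
    """Patients whose diagnosis falls under dx_root and whose glucose is in [low, high]."""
    return sorted(
        pid for pid, rec in PATIENTS.items()
        if is_under(rec["dx"], dx_root) and low <= rec["glucose"] <= high
    )

exact = similar_patients("E11.21", 126, 126)      # exact code, exact value
sibling_level = similar_patients("E11.2", 120, 130)  # parent code, value range
general = similar_patients("E11", 120, 130)       # diabetes in general
print(exact, sibling_level, general)
```

Each relaxation step (exact code, then parent, then the general category) returns a superset of the previous result, which is the behavior the hierarchy-based search on this slide calls for.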
![Page 8: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/8.jpg)
Challenges
• Searching challenges
  – Category hierarchy (parent, siblings, subtrees): search for patients like a given diagnosis, “ICD10:E11.21” (diabetes with nephropathy) → ICD10:E11.22 (with chronic kidney disease) → ICD10:E11 (diabetes in general)
  – Continuous values: serum glucose = 120 mg/dL (many continuous values may not have a natural aggregate binning)
  – Queries for searching patients are rarely exact (fasting serum glucose = 126 → serum glucose between 120 and 130, or serum glucose in the 80th percentile at this time)
  – Map changes in value to changes in time: search for patients with a transition from the 60th to the 90th percentile between two serum glucose values over a 6-month time frame. If we have N glucose values, then for any two patients we have to make N*(N-1)/2 time-based glucose-value comparisons. How can this scale up?
  – Find common patterns in a set of individualized cohort patients. This means comparing combinations of subsets of millions of differentials for each patient in the cohort.
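To make the N*(N-1)/2 figure concrete, here is a minimal sketch that enumerates every time-ordered pair of glucose readings for one patient and keeps the pairs that move from at or below the 60th percentile to at or above the 90th within roughly six months. The reference population, the readings, and the 183-day window are all made-up assumptions.

```python
from bisect import bisect_right
from itertools import combinations

def percentile_rank(value, sorted_reference):
    """Percent of reference values less than or equal to `value`."""
    return 100.0 * bisect_right(sorted_reference, value) / len(sorted_reference)

def transitions(readings, sorted_reference, max_days):
    """Yield (day_from, day_to, pct_from, pct_to) for every time-ordered pair
    of readings at most `max_days` apart. `readings` is a list of (day, value)."""
    ordered = sorted(readings)  # sort by day
    for (d1, v1), (d2, v2) in combinations(ordered, 2):
        if d2 - d1 <= max_days:
            yield (d1, d2,
                   percentile_rank(v1, sorted_reference),
                   percentile_rank(v2, sorted_reference))

# Hypothetical reference population and one patient's readings (day, mg/dL).
reference = sorted([80, 85, 90, 95, 100, 105, 110, 120, 130, 150])
readings = [(0, 100), (90, 105), (150, 130), (400, 150)]

# All pairs that must be considered: N*(N-1)/2.
n = len(readings)
pair_count = len(list(combinations(readings, 2)))

# Pairs that go from <= 60th to >= 90th percentile within about six months.
hits = [t for t in transitions(readings, reference, max_days=183)
        if t[2] <= 60 and t[3] >= 90]
print(pair_count, hits)
```

The brute-force enumeration grows quadratically with the number of readings per patient, which is exactly the scaling problem the slide raises.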
![Page 9: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/9.jpg)
Relational Database → Semantic Graph
• Paradigm shift from relational row-column lookup to semantic graph traversal
  – Relational databases are less efficient at joins
  – Big indexing overhead (need to index every column)
![Page 10: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/10.jpg)
EHR RDF Graph
Patient EHR data in semantic graph representation. EHR timelines for Patients A and B are shown as RDF graphs. Property values of each patient (demographics, labs, diagnoses, etc.) are connected to their respective ontologies, which enables searching for patterns across different patients.
![Page 11: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/11.jpg)
EHR RDF Graph
Application of continuous value classes will enrich the patients retrieved from the database.
2A. Property values as literal nodes will not link “like” patients together without a “relational” query.
2B. By using controlled vocabulary (CV)-ontology edges, we will be able to link patients through CV-value nodes.
2C. By adding “nearby” classes to continuous value nodes, we will link additional patients. Different strategies will create different “nearby” links.
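A small sketch of the difference between literal nodes and "nearby" class nodes is given below. It represents triples as plain Python tuples rather than using an RDF library; the predicate names (`hasGlucose`, `hasGlucoseClass`), the patient values, and the fixed-width binning strategy are all assumptions chosen for illustration, and other strategies would produce different links.

```python
from collections import defaultdict

# Hypothetical serum glucose values (mg/dL) for three patients.
GLUCOSE = {"PatientA": 126, "PatientB": 122, "PatientC": 180}

def fixed_width_bin(value, width=10):
    """One possible 'nearby' strategy: a fixed-width bin label, e.g. 'glucose:120-129'."""
    low = (value // width) * width
    return f"glucose:{low}-{low + width - 1}"

def build_triples(values, to_class=None):
    """Emit (subject, predicate, object) triples; with to_class, also add a class edge."""
    triples = []
    for patient, value in values.items():
        triples.append((patient, "hasGlucose", value))
        if to_class is not None:
            triples.append((patient, "hasGlucoseClass", to_class(value)))
    return triples

def linked_patients(triples, predicate):
    """Group patients that share the same object node for the given predicate."""
    by_object = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            by_object[o].add(s)
    return {o: sorted(ps) for o, ps in by_object.items() if len(ps) > 1}

# Literal nodes only: no two patients share the exact same value.
literal_links = linked_patients(build_triples(GLUCOSE), "hasGlucose")

# With 'nearby' class nodes: patients in the same bin become connected.
class_links = linked_patients(
    build_triples(GLUCOSE, fixed_width_bin), "hasGlucoseClass"
)
print(literal_links, class_links)
```

Fixed-width bins are the simplest choice; a value of 129 and a value of 130 still land in different bins, which is one reason the choice of strategy matters.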
![Page 12: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/12.jpg)
Challenges: Semantic Graph Mining
• Graph indexing
  – gIndex: indexing frequent subgraphs, using subgraphs as features
• Graph classification, clustering
  – Path-based clustering and top-k similarity problems in heterogeneous information networks
• Path-based graph mining
  – Complex dependencies within heterogeneous networks
  – Conventional supervised classification methods assume that the objects are independent
  – Sequential matching vs. snapshot matching, as EHR records have a time dimension
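The slide does not name a specific similarity measure. As one illustration of a path-based top-k similarity search, the sketch below uses PathSim over the symmetric meta-path Patient–Diagnosis–Patient; choosing PathSim and this meta-path is an assumption of the example, and the patient data are invented.

```python
# Hypothetical patient -> set of diagnosis codes.
DX = {
    "A": {"E11.21", "I10", "E78.5"},
    "B": {"E11.21", "I10"},
    "C": {"I10"},
    "D": {"J45"},
}

def path_count(x, y):
    """Number of Patient-Diagnosis-Patient path instances between x and y,
    i.e. the number of diagnoses they share."""
    return len(DX[x] & DX[y])

def pathsim(x, y):
    """PathSim: 2 * paths(x, y) / (paths(x, x) + paths(y, y))."""
    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0

def top_k(query, k):
    """Return the k patients most similar to `query` as (patient, score) pairs."""
    scores = [(pathsim(query, other), other) for other in DX if other != query]
    scores.sort(key=lambda t: (-t[0], t[1]))  # best score first, ties by name
    return [(other, round(score, 3)) for score, other in scores[:k]]

result = top_k("A", 2)
print(result)
```

The normalization by self-paths keeps a patient with a very long diagnosis list from looking similar to everyone, which a raw shared-diagnosis count would not do.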
![Page 13: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/13.jpg)
Linked Open Data
![Page 14: Knowledge Graph: Connecting Big Data Semantics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568163ed550346895dd5609b/html5/thumbnails/14.jpg)
Challenges for Semantic Web
• How to handle ontology graph + instance graph
• How to handle inferred triples and existing triples (reasoning)
• Graph pattern search vs. graph mining
• Datatype properties vs. object properties
• Different levels of semantics: ontology (schema), categorized values (terminology), continuous values (binning?), literal
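The distinction between existing and inferred triples can be illustrated with a tiny forward-chaining sketch. It applies only two RDFS entailment rules (subclass transitivity and type inheritance) to triples held as Python tuples; the `ex:` identifiers are hypothetical, and a real triple store would use its own reasoner rather than this loop.

```python
# Asserted (existing) triples; the ex: names are made up for illustration.
ASSERTED = {
    ("ex:E11.21", "rdfs:subClassOf", "ex:E11.2"),
    ("ex:E11.2", "rdfs:subClassOf", "ex:E11"),
    ("ex:dx1", "rdf:type", "ex:E11.21"),
}

def infer(asserted):
    """Forward-chain two RDFS rules to a fixed point; return only the new triples."""
    closed = set(asserted)
    while True:
        new = set()
        for (a, p1, b) in closed:
            for (c, p2, d) in closed:
                if b != c or p2 != "rdfs:subClassOf":
                    continue
                if p1 == "rdfs:subClassOf":
                    # rdfs11: subClassOf is transitive.
                    new.add((a, "rdfs:subClassOf", d))
                elif p1 == "rdf:type":
                    # rdfs9: an instance of a class is an instance of its superclasses.
                    new.add((a, "rdf:type", d))
        if new <= closed:
            return closed - set(asserted)
        closed |= new

inferred = infer(ASSERTED)
for triple in sorted(inferred):
    print(triple)
```

Keeping `ASSERTED` and `inferred` as separate sets mirrors the design question on the slide: whether to materialize inferred triples alongside existing ones or to derive them at query time.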