Link Analysis of Life Science Linked Data
Wei Hu (1), Honglei Qiu (1), and Michel Dumontier (2)
(1) State Key Laboratory for Novel Software Technology, Nanjing University, China
(2) Center for Biomedical Informatics Research, Stanford University
@micheldumontier::ISWC 2015
Linked Data offers links between datasets, but these links are often incomplete and may contain errors.
Network Analysis
• Network analysis has long been used to study link structures
– The structure of the Web
– Network medicine: cellular networks and implications
Power law is scale free
A graph demonstrates the small-world phenomenon if its clustering coefficient is significantly higher than that of a random graph on the same node set, and if the graph has a shorter average distance.
BTC2010
The clustering coefficient of a node quantifies how close its neighbors are to being a clique. The average distance is the average shortest path length between all pairs of nodes in the graph.
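The two measures defined above can be sketched in a few lines of Python. This is only an illustration on a made-up toy graph: the adjacency-dict representation, the function names, and the choice to average distances over reachable ordered pairs are assumptions, not the procedure used in the talk.

```python
from collections import deque

def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair can form a triangle
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the per-node clustering coefficients."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)

def average_distance(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Hypothetical undirected toy graph: a triangle a-b-c with a pendant node d on c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(average_clustering(toy), average_distance(toy))
```

To test for the small-world phenomenon, the same two numbers would be computed for a random graph on the same node set and compared.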
Dataset link analysis (using RDF data model)
Entity link analysis (using cross-references)
Term link analysis (using ontology matching)
Linked Data for the Life Sciences
Bio2RDF is an open-source project to unify the representation and interlinking of biological data using RDF.
chemicals/drugs/formulations
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• Release 3 (June 2014)
• 35 datasets
• 11B RDF triples
• 1B entities
• 2K classes
• 4K properties
Dataset Links
Network Properties
1. Well linked
2. Hubs and authorities
3. Small-world phenomenon
   Average distance = 2.77 vs 6
   Clustering coefficient = 0.22 vs 0.13
4. Robust on systematic removal of nodes
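Hub and authority scores are commonly computed with the HITS algorithm. The slides do not say how they were obtained, so the sketch below is an assumption: a plain power iteration over a directed dataset link graph, with invented dataset names (`ds_a` to `ds_d`) and invented links.

```python
import math

def _normalize(scores):
    """Scale scores to unit Euclidean length (leave all-zero scores alone)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of directed (source, target) edges."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A node's authority is the sum of the hub scores of nodes linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = _normalize(auth)
        # A node's hub score is the sum of the authority scores of nodes it links to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = _normalize(hub)
    return hub, auth

# Hypothetical dataset-level links, for illustration only.
edges = [("ds_a", "ds_d"), ("ds_b", "ds_d"), ("ds_c", "ds_d"), ("ds_a", "ds_c")]
hub, auth = hits(edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the dataset that links out to the most-referenced targets ranks as the top hub, and the dataset that receives the most links ranks as the top authority.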
Entity Link Analysis
How well do entities link to each other?
• 76% of entity links involve a special kind of RDF triple
– e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
– x-relations have under-specified semantics
  • May be truly identical, may refer to another related entity …
• Degree distribution
– Some do not follow a power law
  • Exponent is too large (close to 5)
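One way to check a power-law exponent is a maximum-likelihood fit over the degree tail. The sketch below uses the continuous-data estimator alpha = 1 + n / sum(ln(k / k_min)), which is only an approximation for integer degrees; the function name and the `k_min` cutoff are assumptions, and the slides do not state which fitting method was used.

```python
import math

def power_law_exponent(degrees, k_min=1):
    """Estimate alpha for p(k) ~ k^(-alpha) using only degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("all degrees equal k_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum

# Hypothetical degree sequence, for illustration only.
sample_degrees = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40]
print(power_law_exponent(sample_degrees, k_min=1))
```

A fitted exponent far above the usual 2 to 3 range, such as the value close to 5 noted above, suggests the distribution decays too quickly to be treated as scale-free.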
BTC2010
Symmetry of entity links varies between different pairs of datasets
• Over 99% of links are reciprocated in DrugBank-PharmGKB and OMIM-HGNC
– Suggests link sharing and synchronization
• Only 58% of DrugBank-KEGG links and 51% of OMIM-Orphanet links are reciprocal
– Suggests incomplete mapping
• 28% of OMIM-Orphanet links are malposed
– Suggests variation in model (omim:Phenotype to orphanet:Disorder)
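The reciprocity figures above can be reproduced in spirit with a small counting routine. This sketch assumes links are (source, target) pairs of prefixed identifiers, and that a link counts as reciprocal when the reverse pair also exists; the identifiers are invented and the exact definition used in the talk may differ.

```python
def reciprocity(links, dataset_a, dataset_b):
    """Share of links between two datasets (either direction) that have a reverse link."""
    def prefix(curie):
        return curie.split(":", 1)[0]

    link_set = set(links)
    between = [
        (s, t)
        for s, t in link_set
        if {prefix(s), prefix(t)} == {dataset_a, dataset_b}
    ]
    if not between:
        return 0.0
    reciprocated = sum(1 for s, t in between if (t, s) in link_set)
    return reciprocated / len(between)

# Hypothetical links with made-up identifiers, for illustration only.
links = [
    ("drugbank:X1", "kegg:Y1"),
    ("kegg:Y1", "drugbank:X1"),   # reverse of the first link
    ("drugbank:X2", "kegg:Y2"),   # no reverse link
    ("kegg:Y3", "drugbank:X3"),   # no reverse link
    ("omim:Z1", "hgnc:W1"),       # different dataset pair, ignored here
]
print(reciprocity(links, "drugbank", "kegg"))
```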
Transitivity Analysis: Find mismatches and discover new links
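A rough sketch of how a transitivity analysis might work: treat entity links as undirected, group entities into connected components with union-find, then inspect each component. The two rules used here are assumptions rather than the method from the talk: two distinct entities of the same dataset in one component are flagged as a possible mismatch, and cross-dataset pairs without a direct link are proposed as candidate new links. All identifiers are invented.

```python
from itertools import combinations

def transitive_analysis(links):
    """Return (candidate_new_links, possible_mismatches) from undirected entity links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)

    known = {frozenset(link) for link in links}
    new_links, mismatches = set(), set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            if a.split(":")[0] == b.split(":")[0]:
                mismatches.add((a, b))      # same dataset, transitively "equal"
            elif frozenset((a, b)) not in known:
                new_links.add((a, b))       # implied by transitivity, not yet stated
    return new_links, mismatches

# Hypothetical links between three made-up datasets a, b and c.
links = [("a:1", "b:1"), ("b:1", "c:1"), ("a:2", "b:2"), ("b:2", "a:3")]
new_links, mismatches = transitive_analysis(links)
print(new_links, mismatches)
```

Because the meaning of a link may shift along a chain, candidates produced this way would need review before being asserted.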
Evaluation of Entity Matching
How accurate are current entity matching approaches?
• Built a benchmark from the reciprocal links between similarly-typed entities
• Evaluated several entity matching approaches
– Label similarity: Levenshtein, Jaro-Winkler, N-gram, Jaccard
– Machine learning: linear regression, logistic regression with 5 properties
• Many-to-one links are difficult to discover
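Two of the label similarity measures listed above can be sketched as follows. The normalization of Levenshtein distance by the longer label and the use of character trigrams for Jaccard are assumptions; the talk does not specify these details, and the example labels are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def levenshtein_similarity(a, b):
    """Map edit distance to [0, 1], where 1 means identical labels."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def ngram_jaccard(a, b, n=3):
    """Jaccard overlap of the character n-gram sets of two labels."""
    def grams(s):
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

# Illustrative labels only.
print(levenshtein_similarity("lepirudin", "lepirudine"))
print(ngram_jaccard("lepirudin", "lepirudine"))
```

Scores like these could serve as input features for the regression models mentioned above, alongside other properties.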
Term Link Analysis
How similar are the topics in the data network?
• Use ontology matching to generate the term link graph
– Falcon-AO (linguistic matchers + structural matcher + synonyms)
• Created 83K class mappings, 1.5K object property mappings, and 858 data property mappings
– Similarity threshold = 0.9
– Top-5 popular labels for classes and properties
• Significant overlap in topics; does not follow a power law as in the broader Semantic Web
Correlation of Link Graphs
To what degree are the three link graphs correlated?
• Spearman’s rank correlation coefficient:
– Entity link graph dataset pairs: entity links / entities
– Term link graph dataset pairs: term mappings / terms
– Dataset link graph dataset pairs: shortest path length
• All positively correlated
– Closer datasets in distance have more linked entities and terms
– Number of linked entities contributes little to overlap of topics
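Spearman's rank correlation can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below is a minimal standard-library version; the function names are assumptions, and the per-dataset-pair input values shown are invented.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Hypothetical per-dataset-pair measurements, for illustration only.
entity_link_ratio = [0.10, 0.40, 0.35, 0.80, 0.05]
term_mapping_ratio = [0.20, 0.30, 0.50, 0.60, 0.10]
print(spearman(entity_link_ratio, term_mapping_ratio))
```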
Summary of Findings
• Dataset, entity and term link graphs do not necessarily share the same characteristics with the Hypertext / Semantic Web
– Degree distribution of entity links does not follow power law
– Data hubs
• A significant number of entities have been linked using x-relations, but their intended semantics differs
– Classes are identical or equivalent; entity links represent logical equivalence
• Symmetric and transitive entity links do exist, but their utility is weakened due to their small number
– Meanings of entity links may shift during transitive closure
• Matching only the labels of entities may fail, while combining different properties and using simple learning algorithms achieve good accuracy
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier