TRANSCRIPT
Optimizing Similarity Computations for Ontology
Matching - Experiences from GOMMA
Michael Hartung, Lars Kolb, Anika Groß, Erhard Rahm
Database Research Group, University of Leipzig
9th Intl. Conf. on Data Integration in the Life Sciences (DILS), Montreal, July 2013
getSim(str1,str2)
2
Ontologies
Multiple interrelated ontologies exist in a domain; example: anatomy
Identify overlapping information between ontologies for information exchange, data integration, reuse, …
Need to create mappings between ontologies
Examples: MeSH, GALEN, UMLS, SNOMED, NCI Thesaurus, Mouse Anatomy, FMA
3
Matching Example
Two ‘small’ anatomy ontologies O and O’; concepts with attributes (name, synonym)
Possible match strategy in GOMMA*: compare the name/synonym values of concepts with a string similarity function, e.g., n-gram or edit distance; two concepts match if one value pair is highly similar
O                              O’
Concept   Attr. values         Concept   Attr. values
c0        head                 c0’       head
c1        trunk, torso         c1’       torso, truncus
c2        limb, limbs          c2’       extremity

5 x 4 = 20 similarity computations (5 attribute values in O, 4 in O’)
MO,O’ = {(c0,c0’), (c1,c1’), (c2,c2’)}
* Kirsten, Groß, Hartung, Rahm: GOMMA: A Component-based Infrastructure for managing and analyzing Life Science Ontologies and their Evolution. Journal Biomedical Semantics, 2011
4
Problems
Evaluation of the Cartesian product O×O’: especially costly for large life science ontologies; different strategies exist: pruning, blocking, mapping reuse, …
Excessive usage of similarity functions: applied O(|O|·|O’|) times during matching; how efficiently (runtime, space) does a similarity function work?
Experiences from GOMMA:
1. Optimized implementation of the n-gram similarity function
2. Application on massively parallel hardware: Graphics Processing Units (GPUs) and multi-core CPUs
getSim(str1,str2)
5
Trigram (n=3) Similarity with Dice
Trigram similarity
Input: two strings A and B to be compared
Output: similarity sim(A,B) ∈ [0,1] between A and B
Approach:
1. Split A and B into tokens of length 3
2. Compute the intersect (overlap) between both token sets
3. Calculate the Dice metric based on the sizes of the intersect and the token sets
(Optional) Add prefixes/suffixes of length 2 (e.g., ##, $$) to A and B before tokenization
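The three steps above can be sketched in a few lines of Python (an illustrative sketch, not GOMMA's implementation; the function names are ours):

```python
def trigrams(s, pad=False):
    """Split a string into overlapping tokens of length 3.

    With pad=True, prefix '##' and suffix '$$' are added first (the
    optional step from the slide), so that boundary characters also
    appear in three tokens each.
    """
    if pad:
        s = "##" + s + "$$"
    return [s[i:i + 3] for i in range(len(s) - 2)]

def dice_sim(a, b, pad=False):
    """Trigram similarity of two strings via the Dice metric:
    2 * |intersect| / (|tokens(a)| + |tokens(b)|)."""
    ta, tb = trigrams(a, pad), trigrams(b, pad)
    overlap = len(set(ta) & set(tb))
    return 2.0 * overlap / (len(ta) + len(tb))
```

For the slide's example, `dice_sim('TRUNK', 'TRUNCUS')` yields 0.5.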
6
Trigram Similarity - Example
sim(‘TRUNK’, ‘TRUNCUS’)
1. Token sets: {TRU, RUN, UNK} and {TRU, RUN, UNC, NCU, CUS}
2. Intersect: {TRU, RUN}
3. Dice metric: 2·2 / (3+5) = 4/8 = 0.5
7
Naïve Solution
Method
1. Tokenization (trivial); result: two token arrays aTokens and bTokens
2. Intersect computation with a nested loop (main part): for each token in aTokens, look for the same token in bTokens; if found, increase the overlap counter (and go on with the next token in aTokens)
3. Final similarity (trivial): 2·overlap / (|aTokens| + |bTokens|)

aTokens: {TRU, RUN, UNK}
bTokens: {TRU, RUN, UNC, NCU, CUS}
overlap counter: 0 → 1 → 2
#string-comparisons: 8
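A minimal Python sketch of the naïve nested-loop intersect (illustrative only; a comparison counter is included to reproduce the count from the example above):

```python
def overlap_nested_loop(a_tokens, b_tokens):
    """Naive intersect computation: for each token in a_tokens, scan
    b_tokens left to right until the same token is found.

    Returns (overlap, number of string comparisons performed).
    """
    overlap = 0
    comparisons = 0
    for t in a_tokens:
        for u in b_tokens:
            comparisons += 1
            if t == u:
                overlap += 1
                break  # go on with the next token of a_tokens
    return overlap, comparisons
```

On the example, the function performs 8 string comparisons for an overlap of 2, matching the slide.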
8
“Sort-Merge”-like Solution
Optimization ideas
1. Avoid string comparisons: string comparisons are expensive, especially for equal-length strings (e.g., equals of Java’s String class); a dictionary-based transformation of tokens into unique integer values turns ‘UNC’ = ‘UNK’? into 3 = 8?
2. Avoid nested-loop complexity: O(m·n) comparisons to determine the intersect of the token sets; a-priori sorting of the token arrays allows exploiting the token order during comparison (O(m+n), see sort-merge join); the sorting cost amortizes since token sets are used multiple times for comparison
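Both ideas can be sketched in Python (an illustration, not GOMMA's code): a shared dictionary assigns each distinct token a unique integer, and the intersect of two sorted integer arrays is then computed with interleaved linear scans, as in a sort-merge join:

```python
def tokens_to_ints(token_lists):
    """Map every distinct token to a unique integer via a shared
    dictionary and return sorted integer arrays. This is done once per
    ontology, so the sorting cost amortizes over many comparisons."""
    dictionary = {}
    result = []
    for tokens in token_lists:
        ids = [dictionary.setdefault(t, len(dictionary) + 1) for t in tokens]
        result.append(sorted(ids))
    return result

def overlap_merge(a_ids, b_ids):
    """Intersect two sorted integer arrays with interleaved linear
    scans: O(m+n) instead of the nested loop's O(m*n)."""
    i = j = overlap = 0
    while i < len(a_ids) and j < len(b_ids):
        if a_ids[i] == b_ids[j]:
            overlap += 1
            i += 1
            j += 1
        elif a_ids[i] < b_ids[j]:
            i += 1
        else:
            j += 1
    return overlap
```

For TRUNK and TRUNCUS this reproduces the integer arrays {1,2,3} and {1,2,4,5,6} and an overlap of 2.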
9
“Sort-Merge”-like Solution - Example
sim(TRUNK, TRUNCUS)
Tokenization → integer conversion → sorting:
TRUNK: {TRU, RUN, UNK} → {1, 2, 3}
TRUNCUS: {TRU, RUN, UNC, NCU, CUS} → {1, 2, 4, 5, 6}
Intersect with interleaved linear scans:
aTokens: {1, 2, 3}
bTokens: {1, 2, 4, 5, 6}
overlap counter: 0 → 1 → 2
#integer-comparisons: 3
10
GPU as Execution Environment
Design goals: scalability with 100s of cores and 1000s of threads; focus on parallel algorithms. Example: the CUDA programming model
CUDA kernels and threads: a kernel is a function that runs on a device (GPU, CPU); many CUDA threads can execute each kernel
CUDA vs. CPU threads: CUDA threads are extremely lightweight (little creation overhead, instant switching; 1000s of threads are used to achieve efficiency), whereas multi-core CPUs can only run a few threads at a time
Drawbacks: a-priori memory allocation, only basic data structures
11
Bringing n-gram to GPU
Problems to be solved
1. Which data structures are possible for input / output?
2. How to cope with fixed / limited memory?
3. How can n-gram be parallelized on GPU?
12
Input-/Output Data Structure
Three-layered index structure for each ontology:
ci: concept index representing all concepts
ai: attribute index representing all attributes
gi: gram (token) index containing all tokens
Two arrays to represent the top-k matches per concept: a-priori memory allocation is possible (array size k·|O|); using short (2 bytes) instead of float (4 bytes) as the data type for similarities reduces memory consumption
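A hypothetical sketch of how such a three-layered index can be built as flat offset arrays (the names ci/ai/gi follow the talk; the exact GOMMA layout may differ):

```python
def build_index(concepts):
    """Build the three-layered index as flat arrays.

    concepts: one entry per concept, each a list of attribute values,
    each attribute value already tokenized into sorted integer ids.

    ci[k] = offset of concept k's first attribute in ai
    ai[k] = offset of attribute value k's first token in gi
    gi    = all token ids, concatenated
    """
    ci, ai, gi = [], [], []
    for attrs in concepts:
        ci.append(len(ai))
        for tokens in attrs:
            ai.append(len(gi))
            gi.extend(tokens)
    # sentinel offsets so the last entries also have an end bound
    ci.append(len(ai))
    ai.append(len(gi))
    return ci, ai, gi
```

Because everything is stored in plain integer arrays of known size, the structure can be allocated a-priori and copied to the GPU in one piece.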
13
Input-/Output Data Structure - Example
(Figure: the example ontologies O and O’ encoded in the three-layered index structure, with ci/ai/gi offset arrays pointing to the integer token ids of each attribute value, e.g., the token id lists [1,2], [3,4,5], [6,7,8], [9,10], [11,12,13,14,15,16,17] for O; the output mapping MO,O’ for top-k = 2 and sim > 0.7 consists of two arrays, corrs = {c0-c0’, c1-c1’, c2-c2’} and sims = {1000, 1000, 800}, with similarities stored as scaled short values.)
14
Limited Memory / Parallelization
Ontologies and the mapping are too large for GPU memory
Size-based ontology partitioning*; ideal case: one ontology fits completely into GPU memory
Each kernel computes the n-gram similarities between one concept of partition Pi and all concepts of partition Qj
Re-use of a partition already stored on the GPU; hybrid execution on GPU and CPU is possible
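The partition schedule can be sketched as follows (a simplified illustration with hypothetical helper names, ignoring actual device memory transfers): one P partition stays resident while the Q partitions are streamed through, so each P partition is copied only once:

```python
def schedule_match_tasks(p_parts, q_parts, match_fn):
    """Run the match task P_i x Q_j for every pair of partitions.

    The outer loop keeps P_i resident (transferred once), while the
    inner loop replaces Q_j with Q_{j+1}, mirroring the re-use of the
    already stored partition described on the slide.

    match_fn(P, Q) stands for the per-partition similarity kernel,
    e.g., trigram matching of all concepts of P against all of Q.
    """
    results = {}
    p_transfers = 0
    for pi, P in enumerate(p_parts):
        p_transfers += 1                  # copy P_i to the GPU once
        for qi, Q in enumerate(q_parts):  # stream Q_0, Q_1, ... through
            results[(pi, qi)] = match_fn(P, Q)
    return results, p_transfers
```

A hybrid setup would hand some of the (pi, qi) tasks to CPU threads instead of the GPU; the scheduling idea stays the same.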
(Figure: the match tasks Pi × Qj for partitions P0..P3 and Q0..Q4, distributed over the GPU thread and the CPU thread(s); the GPU’s global memory holds the ci/ai/gi indexes of the current partitions, e.g., P0 and Q3, plus the corrs/sims output arrays of MP0,Q3; one kernel instance runs per concept of P0 (Kernel0 … Kernel|P0|-1); the result mapping is then read back and Q3 is replaced with Q4 while P0 stays resident.)
* Groß, Hartung, Kirsten, Rahm: On Matching Large Life Science Ontologies in Parallel. Proc. DILS, 2010
15
Evaluation
FMA-NCI match problem from the OAEI 2012 Large Bio Task: three sub-tasks (small, large, whole), with a blocking step to be comparable with the OAEI 2012 results*
Hardware
CPU: Intel i5-2500 (4x 3.30 GHz, 8 GB)
GPU: Asus GTX660 (960 CUDA cores, 2 GB)
Match sub-task | FMA #concepts | FMA #attValues | NCI #concepts | NCI #attValues | #comparisons
small          |         3,720 |          8,502 |         6,551 |         13,308 |     0.12·10⁹
large          |        28,885 |         50,728 |         6,895 |         13,125 |     0.66·10⁹
whole          |        79,042 |        125,304 |        11,675 |         21,070 |     2.64·10⁹
* Groß, Hartung, Kirsten, Rahm: GOMMA Results for OAEI 2012. Proc. 7th Intl. Ontology Matching Workshop, 2012
16
Results for one CPU or GPU
Different implementations of the trigram similarity: naïve nested loop, hash-set lookup, sort-merge
The sort-merge solution performs significantly better; the GPU further reduces execution times (to ~20% of the CPU time)
17
Results for hybrid CPU/GPU usage
NoGPU vs. GPU with an increasing number of CPU threads
Good scalability for multiple CPU threads (speed-up of 3.6); slightly better execution times with hybrid CPU/GPU usage (one thread is required to communicate with the GPU)
18
Summary and Future Work
Experiences from optimizing GOMMA’s ontology matching workflows:
Tuning of the n-gram similarity function: preprocessing (integer conversion, sorting) for more efficient similarity computation
Execution on GPU: overcoming GPU drawbacks (fixed memory, a-priori allocation)
Significant reduction of execution times: from 104 min to 99 sec for the FMA-NCI (whole) match task
Future Work
Execution of other similarity functions on GPU
Flexible ontology matching / entity resolution workflows: choice between CPU, GPU, cluster, cloud infrastructure, …