Web Science
Introduction to Information Integration
Julien Gaugaz, October 26, 2010
Topics
•1. Information Integration
•2. Web Information Retrieval
•3. Entity Search
•4. Web Usage
•5. Collaborative Web
•6. Web Archiving
•7. Medical Social Web
Scenarios
Why Integrate Information?
Company Mergers
Travelling Agent
Booking Flights
Leveraging Wikipedia Infoboxes
Query
Data Contribution
Evolution
[Figure: Number of Sources (vertical axis) across three stages: Beginning of Databases; Rise of Internet & Wrapping Websites; Wikipedia & Social Web]
Kinds of discrepancies
What is the Problem?
Wikipedia Infoboxes
http://en.wikipedia.org/wiki/Berlin
| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
http://de.wikipedia.org/wiki/Berlin
| [[(...)|Reg. Bürgermeister]]: || [[Klaus Wowereit]]
| [[Höhe]] : || 34–115 m ü. NN
| [[Einwohner]] : || {{Metadaten Einwohnerzahl DE-BE|Berlin}}[...] (rendered as: 3.443.735 (31. Mai 2010))
Wikipedia Infoboxes
http://en.wikipedia.org/wiki/San_Francisco
| leader_title = [[Mayor of San Francisco|Mayor]]
| leader_name = [[Gavin Newsom]] ([[Democratic [...]|D]])
| elevation_ft = 52
| elevation_max_ft = 925
| elevation_min_ft = 0
| population_as_of = 2008
| population_total = 815358
| population_metro = 4203898
| population_urban = 3228605
http://de.wikipedia.org/wiki/Berlin
| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
Causes of Discrepancies
•Information sources are diverse
•Different cultural background
•Different domain of activity
•Different model of information
•Typos and other kinds of errors
•Evolution over time
•Use, usage and users of one source may change over time
Places of Discrepancies
Information level where discrepancies appear:
•Semantic: meaning, sense
•Representational
• Lexical: word / term representing the meaning
• Structural: how the terms are arranged to represent the meaning
•Syntactic: how the lexical and structural representation is encoded into characters (and bits)
Discrepancies may concern:
•Schema elements (properties and structure) and values
Schema Discrepancies
Semantic: Einstein’s full name is “Albert Einstein”
Representational:
•Einstein, name: first “Albert”, last “Einstein”
•Einstein, full_name: “Albert Einstein”
Syntactic:
•<Einstein> <full_name> “Albert Einstein”.
•<Einstein> <full_name>Albert Einstein</full_name></Einstein>
Schema Ambiguity
Semantic: Person title vs. Article title
Representational:
•xyz, title: “Prof. Dr. techn.” (person title)
•xyz, title: “The Theory of Relativity” (article title)
Value Discrepancies
Semantic: Einstein’s full name is “Albert Einstein”
Representational: Einstein, full_name:
•“Albert Einstein”
•“Albert Einstin”
•“A. Einstein”
•“Einstein, Albert”
Where discrepancies are addressed with standards
Syntactic Level
Encoding Bytes
•Basic unit
•Universal standard: Bit (binary digit)
•Ternary digit (base 3, USSR 50’s, out of use)
•Bits into bytes
•Big or little endian
•System-wide convention, easily convertible, defined in communication protocols
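The byte-order point can be made concrete with a minimal Python sketch that uses only the standard library. The value `0x0A0B0C0D` is an arbitrary illustration, not taken from the slides.

```python
import struct

value = 0x0A0B0C0D  # arbitrary 32-bit example value

# The same integer serialized with both byte orders.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

# Converting between the two conventions is a plain byte reversal.
reversed_little = little[::-1]

# Communication protocols fix the order: "!" in struct means network
# (big-endian) byte order, "I" an unsigned 32-bit integer.
network = struct.pack("!I", value)
decoded = struct.unpack("!I", network)[0]
```

Because the conversion is mechanical, byte order rarely causes integration problems as long as the protocol states which convention it uses.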
Encoding Characters
•De facto standards:
•UTF-8/16
•Many others exist: ASCII, ISO-8859’s, KOI-8, ...
•Trivial dictionary-based translation
•When the corresponding code exists in the target character map...
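A small sketch of the translation between character maps, and of the case where the target map has no code for a character. The helper name `try_encode` and the sample strings are assumptions made for illustration.

```python
text = "Bürgermeister"

utf8_bytes = text.encode("utf-8")          # "ü" takes two bytes in UTF-8
latin1_bytes = text.encode("iso-8859-1")   # "ü" takes one byte in ISO-8859-1

# Translating between character maps: decode with the source map,
# then encode with the target map.
converted = utf8_bytes.decode("utf-8").encode("iso-8859-1")


def try_encode(s, encoding):
    """Return the encoded bytes, or None when a character has no code in the target map."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_result = try_encode(text, "ascii")        # ASCII has no code for "ü"
koi8_result = try_encode("Берлин", "koi8-r")    # KOI8-R covers Cyrillic
latin1_cyrillic = try_encode("Берлин", "iso-8859-1")  # ISO-8859-1 does not
```

The translation itself is a lookup; the only real decision is what to do when the lookup fails (reject, replace, or transliterate).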
Encoding Lexico-Structural
•XML, XML Schema
•Structured document serialization format
•Base for:
•(X)HTML
•SVG: Scalable Vector Graphics
•DOCX: Microsoft Office Word 2007
Resource Description Framework
Encoding information
RDF
•<subject> <property> <object>
•<subject>
•URI or blank node
•<property>
•URI
•<object>
•URI or blank node or (typed) literal
source: http://www.xml.com/2003/02/05/graphics/graph1.gif
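The position rules for a triple can be sketched in plain Python. This is an illustrative data model, not the API of any RDF library; the class names, `make_triple`, and the `http://example.org/` URIs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str
    datatype: str = ""  # optional datatype URI for typed literals


def make_triple(subject, prop, obj):
    """Build a (subject, property, object) tuple, enforcing the position rules."""
    if not isinstance(subject, (URI, BlankNode)):
        raise TypeError("subject must be a URI or a blank node")
    if not isinstance(prop, URI):
        raise TypeError("property must be a URI")
    if not isinstance(obj, (URI, BlankNode, Literal)):
        raise TypeError("object must be a URI, a blank node or a literal")
    return (subject, prop, obj)


# Hypothetical example URIs.
einstein = URI("http://example.org/Einstein")
full_name = URI("http://example.org/full_name")
triple = make_triple(einstein, full_name, Literal("Albert Einstein"))
```

A literal can only appear in the object position, so `make_triple(Literal("x"), full_name, einstein)` raises `TypeError`.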
URI
•URI: Uniform Resource Identifier
•URLs are URIs
• scheme:scheme-specific-part
•RDF encourages using URLs
•URL
• scheme://usr:passwd@domain:port/path?query_string#anchor
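The URL layout can be checked with `urllib.parse.urlsplit` from the Python standard library. The URL and the URN below are made-up examples.

```python
from urllib.parse import urlsplit

# Made-up URL following the layout
# scheme://usr:passwd@domain:port/path?query_string#anchor
url = "https://usr:passwd@example.org:8080/path?query_string=1#anchor"
parts = urlsplit(url)

components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "domain": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query_string": parts.query,
    "anchor": parts.fragment,
}

# A URI that is not a URL still follows scheme:scheme-specific-part.
urn = urlsplit("urn:isbn:0451450523")
```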
RDF
•Resource Description Framework
•Data model specialized in conceptual information modeling
•Supported by various serialization formats:
•XML
•Notation3 (N3)
•Turtle
•...
RDF Schema (RDF/S)
•Expressed in RDF
•Types subjects and objects with classes
•Class hierarchy (with multiple inheritance)
•Type of properties of a class
•Types properties
•Domain: type of property’s subject
•Range: type of property’s object
•OWL2 is more expressive: cardinality, etc...
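A simplified sketch of how domain, range and the class hierarchy let a reasoner type subjects and objects. It is plain Python rather than an RDF toolkit, and the class names, the property `trained_by` and the instance names are hypothetical.

```python
# Hypothetical class hierarchy with multiple inheritance.
SUBCLASS_OF = {
    "Boxer": {"Person", "Athlete"},
    "Athlete": {"Person"},
    "Person": set(),
}

# Hypothetical property declaration: domain types the subject, range the object.
PROPERTIES = {"trained_by": {"domain": "Boxer", "range": "Trainer"}}


def superclasses(cls):
    """All classes reachable through subclass links, including cls itself."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def infer_types(triples):
    """Derive types from domain and range, then close them under the hierarchy."""
    types = {}
    for subject, prop, obj in triples:
        declaration = PROPERTIES.get(prop)
        if declaration is None:
            continue
        types.setdefault(subject, set()).update(superclasses(declaration["domain"]))
        types.setdefault(obj, set()).update(superclasses(declaration["range"]))
    return types


inferred = infer_types([("some_boxer", "trained_by", "some_trainer")])
```

Note that domain and range act as inference rules here, not as constraints that reject data.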
When to use RDF?
•RDF is good at
•Modeling information
•Especially when schema is unknown or changing
•When there are multiple schemas
•RDF is not for
•Representing documents (XHTML, CSS)
• Internal data management when schema is known and fixed (Relational Databases)
Discrepancies between the representational and semantic levels in the schema
Schema Matching
Boxer
• name
• boxer id
• weight
• birthdate
• total fights
• residence
Taxpayer
• first name
• last name
• age
• address: street, city
• tax id
Company
• ...
Trainer
• ...
Tax Office
• ...
•Input: Schemas to match
•Possibly data instantiating those schemas
•Output: Mappings between schema elements
•Possibly with confidence values and alternatives
•Possibly with value conversion rules (matchings)
Mappings or Matching?
•Schema mapping identifies correspondences between schema elements
•Schema matching actually transforms an instance of one schema into an instance of another schema
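The distinction can be illustrated with a hedged Python sketch that follows the terminology of this slide: a list of correspondences on one side, and conversion rules that actually transform an instance on the other. The confidence values, rule names and the sample record are hypothetical.

```python
# Correspondences between schema elements (source, target, confidence).
# The confidence values are made up for illustration.
MAPPING = [
    ("name", "first name", 0.6),
    ("name", "last name", 0.6),
    ("residence", "address", 0.8),
]


# Value conversion rules: how target values are computed from a source record.
def split_name(record):
    first, _, last = record["name"].partition(" ")
    return {"first name": first, "last name": last}


def copy_residence(record):
    return {"address": record["residence"]}


CONVERSIONS = [split_name, copy_residence]


def transform(record):
    """Turn a Boxer instance into a (partial) Taxpayer instance."""
    target = {}
    for rule in CONVERSIONS:
        target.update(rule(record))
    return target


boxer = {"name": "Muhammad Ali", "residence": "17, Nicestreet Louisville, KY"}
taxpayer = transform(boxer)
```

The correspondences alone say which elements relate; only the conversion rules say how to compute one value from another (here, splitting a full name).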
General architectures
How to Use Mappings?
Mediated Schemas
[Figure: Query, Mediated Schema, Schema1, Schema2, Schema3, Schema x]
Peer Data Management
[Figure: Local Source, Local Schema, Local Mapping, Peer Schema, Peer Mapping]
Why not by hand?
•Size and complexity of source schemas
•Number of schemas and sources
•Leveraging data instance values
•Schemas not known in advance
source: http://www.geneontology.org/images/diag-godb-er.jpg
source: http://www.atutor.ca/development/documentation/database.gif
Schema Matching Features
•Schema-only vs schema & instances
•Representational
•Lexical vs structural
•Internal vs external
More in:
•Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The VLDB Journal. 2001;10(4):334-350.
•Shvaiko P, Euzenat J. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV. 2005;3730:146-171.
Schema Matching Techniques
•String-based
•Language-based
•Linguistic resources
•Constraint-based
•Alignment reuse
•Upper-level formal ontologies
•Graph-based
•Taxonomy-based
•Repository of structures
•Model-based
Leveraging lexical features
A String-Based Technique
Edit Distance
•String distance: measures distance between two strings
•Edit distance: number of operations needed to transform one string into the other
•Common basic operations:
•Insert, delete or substitute one character
•Possibly with different weights depending on the operation and characters involved
•Java libraries:
•SecondString, SimMetrics
Levenshtein Distance
•Edit operations: insert, delete, substitute
•Each has a weight of 1
Distance matrix (rows: Sundays, columns: Saturday):
| | | S | a | t | u | r | d | a | y |
|---|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| u | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
| n | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
| d | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
| a | 5 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 4 |
| y | 6 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 3 |
| s | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 4 |
•Horizontal step: insert to Sundays
•Vertical step: delete from Sundays
•Diagonal step: substitute in Sundays
•Example: Sundays, Satundays, Saturdays
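A minimal Python sketch of the Levenshtein distance with unit weights for insert, delete and substitute. It keeps only two rows of the distance matrix; the function name is an assumption, and the comparison is case-sensitive.

```python
def levenshtein(source, target):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    # previous[j] is the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char from source
                current[j - 1] + 1,                    # insert t_char into source
                previous[j - 1] + (s_char != t_char),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("Sundays", "Saturday")
```

The two-row version needs memory proportional to the length of `target` only, at the price of not being able to reconstruct the sequence of edit operations.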
WordNet
A Linguistic Resource
WordNet
•Fundamental components: Synonym Sets (Synsets)
•{car, auto, automobile, machine, motorcar}
•a motor vehicle with four wheels; usually propelled by an internal combustion engine
•{car, railcar, railway car, railroad car}
•a wheeled vehicle adapted to the rails of railroad
Hypernyms / Hyponyms
•Hypernyms: superordinates, isA relationships. A synset may have more than one hypernym.
•Hyponyms: subordinates
hypernym: {motor vehicle, automotive vehicle}
{car, auto, automobile, machine, motorcar}
hyponyms: {cab, hack, taxi, taxicab}, {ambulance}
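A toy sketch of how hypernym links can be navigated. It hard-codes the few synsets of this slide, each abbreviated to one representative word, instead of querying WordNet itself; the dictionary and function names are hypothetical.

```python
# Each synset (abbreviated to one word) lists its direct hypernyms.
# A synset may have more than one hypernym, hence the lists.
HYPERNYMS = {
    "cab": ["car"],
    "ambulance": ["car"],
    "car": ["motor vehicle"],
    "motor vehicle": [],
}


def hypernym_closure(synset):
    """All synsets reachable by following hypernym (isA) links upward."""
    found, stack = set(), list(HYPERNYMS.get(synset, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(HYPERNYMS.get(current, []))
    return found


def hyponyms(synset):
    """Direct subordinates: synsets that list `synset` as a hypernym."""
    return sorted(s for s, parents in HYPERNYMS.items() if synset in parents)
```

For schema matching, such a lookup lets a matcher relate two labels that share no characters, for example when one label is a hypernym of the other.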
Holonym / Meronym
•Meronym: name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.
•Holonym: name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.
holonym: {car, auto, automobile, machine, motorcar}
meronym: {accelerator, accelerator pedal, gas pedal, gas, throttle, gun}
Other relationships in WN
•Antonym
•Entailment (for verbs)
•A verb X entails Y if X cannot be done unless Y is, or has been, done.
•Attribute (for adjectives)
•A noun for which adjectives express values. The noun weight is an attribute, for which the adjectives light and heavy express values.
Leveraging structure
A Graph-Matching Technique
Similarity Flooding
•Uses structure of the data to help matching schemas
• Similarity Flooding in Melnik et al. (2002)
• First maps schema elements with lexical similarity
• Then improves matching assuming that:
• If two elements are similar, then the elements adjacent to them are more likely to be similar
Selected paper 1: Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
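The two steps (lexical seeding, then propagation to adjacent elements) can be sketched as below. This is a deliberately simplified illustration and not the exact algorithm of Melnik et al.: it uses no propagation coefficients, seeds with `difflib.SequenceMatcher`, and normalizes by the maximum score, all of which are assumptions. Schemas are given as lists of labeled edges `(source, label, target)`.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity_flooding(edges_a, edges_b, iterations=10):
    """Return a similarity score in [0, 1] for every node pair of two graphs."""
    nodes_a = {n for s, _, t in edges_a for n in (s, t)}
    nodes_b = {n for s, _, t in edges_b for n in (s, t)}

    # Step 1: initial lexical similarity of every node pair.
    initial = {
        (a, b): SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in product(nodes_a, nodes_b)
    }

    # Two pairs are neighbors when their nodes are linked by equally labeled edges.
    neighbors = {pair: set() for pair in initial}
    for (s1, l1, t1), (s2, l2, t2) in product(edges_a, edges_b):
        if l1 == l2:
            neighbors[(s1, s2)].add((t1, t2))
            neighbors[(t1, t2)].add((s1, s2))

    # Step 2: each pair repeatedly absorbs the similarity of its neighbors.
    sigma = dict(initial)
    for _ in range(iterations):
        updated = {
            p: initial[p] + sigma[p] + sum(sigma[q] for q in neighbors[p])
            for p in sigma
        }
        top = max(updated.values()) or 1.0  # avoid dividing by zero
        sigma = {p: v / top for p, v in updated.items()}
    return sigma


# Hypothetical schema graphs.
boxer = [("Boxer", "has", "name"), ("Boxer", "has", "residence")]
taxpayer = [("Taxpayer", "has", "last name"), ("Taxpayer", "has", "address")]
scores = similarity_flooding(boxer, taxpayer)
```

The effect to look for is that a pair with no lexical similarity at all can still receive a positive score when its neighbors are similar.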
Detecting duplicate entries
Deduplication
Why are there Duplicates?
Sport Authorities
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY
Tax Authorities
• first name: Mohamed
• last name: Ali
• age: 68
• address: street: Nicestreet 17, city: Wondercity
• tax id: #7234561
Administration-wide database
•Input: 2 entities with matched attributes
•Output: M for matched or U for unmatched.
•Possibly R for reject, between M and U, for cases where a supervised decision is necessary.
Entity
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY
Entity
• first name: Mohamed
• last name: Ali
• age: 68
• address: street: Nicestreet 17, city: Wondercity
• tax id: #7234561
Entity
• name: Muhammad Ali
• address: city: Cairo, country: Egypt
• tax id: #8244361
[Figure labels on entity pairs: M, U, R]
Deduplication Features
Field Distance Metrics
•Value metrics
•Character-based: string-based metrics seen for schema matching
•Token-based: similar to Information Retrieval techniques (Topic 2 next week)
•Phonetic
•Numeric: not many techniques other than considering the values as strings or taking the direct difference
Phonex
1.First letter as prefix
2.Encode non-prefix consonants
3.Remove duplicate adjacent codes not separated by a vowel
4.Drop vowels and truncate to prefix and max 3 codes, or pad with zeros if necessary
| consonant | code |
|---|---|
| b, f, p, v | 1 |
| c, g, j, k, q, s, x, z | 2 |
| d, t | 3 |
| l | 4 |
| m, n | 5 |
| r | 6 |
| h, w | dropped |
Rupert: 1. Rupert 2. Ru1e63 3. Ru1e63 4. R163
Robert: 1. Robert 2. Ro1e63 3. Ro1e63 4. R163
Ashcraftson: 1. Ashcraftson 2. A2 26a132o5 3. A26a132o5 4. A261
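The four steps can be sketched in Python as follows. The code implements the steps and code table exactly as listed on this slide, which coincide with what is commonly described as Soundex coding; the function name is an assumption, and letters outside the table (including y) are treated as vowels.

```python
# Code table from the slide: consonant groups and their digit.
CODES = {}
for letters, digit in [
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
]:
    for letter in letters:
        CODES[letter] = digit


def phonetic_code(name):
    """Prefix letter plus up to three digits, padded with zeros."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    prefix = letters[0].upper()       # step 1: first letter as prefix
    codes = []
    previous = None                   # last code seen since the last vowel
    for letter in letters[1:]:
        if letter in "hw":
            continue                  # dropped, does not separate duplicates
        digit = CODES.get(letter)     # step 2: encode consonants
        if digit is None:             # a vowel resets the duplicate check
            previous = None
            continue
        if digit != previous:         # step 3: skip adjacent duplicates
            codes.append(digit)
        previous = digit
    # step 4: vowels were never stored; truncate to three codes and pad
    return (prefix + "".join(codes[:3])).ljust(4, "0")
```

Names that sound alike map to the same code, so the codes can be compared for equality instead of comparing the raw strings.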
Other Phonetic Codes
•NYSIIS
•Developed and still in use at the New York State Division of Criminal Justice Services
•Encodes vowels (mostly to A)
•Codes are letters instead of digits
•Longer codes (6 instead of 4)
Other Phonetic Codes
•Metaphone
•Codes are letters instead of digits
•No maximum code length
•More elaborate coding rules
•Double Metaphone
•Returns a secondary code to help disambiguate
Detecting Duplicates
Bayes Decision Rule
•M: match, U: unmatch
•Using Bayes rule
•Decision rule: likelihood ratio
•Using independence assumption
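The formulas of this slide are not part of the transcript. A standard formulation of the four bullets is given below as an assumption, not as a recovery of the slide content; $x = (x_1, \dots, x_n)$ denotes the vector of field comparison results for a pair of entities.

```latex
\begin{align*}
&\text{decide } M \text{ if } P(M \mid x) \ge P(U \mid x), \text{ otherwise } U \\
&P(M \mid x) = \frac{P(x \mid M)\,P(M)}{P(x)}, \qquad
 P(U \mid x) = \frac{P(x \mid U)\,P(U)}{P(x)} \\
&\ell(x) = \frac{P(x \mid M)}{P(x \mid U)} \ge \frac{P(U)}{P(M)}
 \;\Longrightarrow\; M \\
&P(x \mid M) = \prod_{i=1}^{n} P(x_i \mid M), \qquad
 P(x \mid U) = \prod_{i=1}^{n} P(x_i \mid U)
\end{align*}
```

The third line follows from the first two by cancelling $P(x)$; the last line is the independence assumption that makes the likelihoods estimable field by field.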
Bayes Decision Rule
•Priors (P(M) and P(U)) can be learned on a training set
•Other methods based on Expectation-Maximisation (EM) algorithm can estimate priors without training set
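A hedged Python sketch of a likelihood-ratio decision under the independence assumption. All probabilities and priors below are invented for illustration; in practice they would be learned from a training set or estimated with EM, as the bullets state.

```python
from math import prod

# Hypothetical probabilities that a field agrees, given match (M) or unmatch (U).
P_AGREE_GIVEN_M = {"name": 0.9, "address": 0.8, "tax id": 0.95}
P_AGREE_GIVEN_U = {"name": 0.05, "address": 0.1, "tax id": 0.01}
P_M, P_U = 0.1, 0.9  # hypothetical priors


def likelihood_ratio(agreement):
    """P(x|M) / P(x|U) for a dict field -> bool, assuming independent fields."""
    p_m = prod(
        P_AGREE_GIVEN_M[f] if agrees else 1 - P_AGREE_GIVEN_M[f]
        for f, agrees in agreement.items()
    )
    p_u = prod(
        P_AGREE_GIVEN_U[f] if agrees else 1 - P_AGREE_GIVEN_U[f]
        for f, agrees in agreement.items()
    )
    return p_m / p_u


def decide(agreement):
    """Return "M" when the likelihood ratio reaches P(U) / P(M), else "U"."""
    return "M" if likelihood_ratio(agreement) >= P_U / P_M else "U"


decision = decide({"name": True, "address": False, "tax id": True})
```

A reject class R could be added by using two thresholds instead of one and returning "R" for ratios that fall between them.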
Clustering-Based Decision
•Using clustering techniques with appropriate parameters
• X-Means
• Variant of K-Means without a fixed K
• Chaudhuri et al. observed that duplicates tend
1.to have small distances from each other (compact set property), and
2.to have only a small number of other neighbors within a small distance (sparse neighborhood property).
Selected paper 2: Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.
Dealing with O(n²)
[Figure: Number of comparisons (vertical axis) against Number of entities in repository (horizontal axis)]
Canopies
•Create canopies using a cheap similarity metric
•Overlapping clusters
•Compare entities pairwise using a more expensive similarity metric
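A minimal sketch of canopy construction. The two thresholds (`loose` and `tight`) follow the usual description of canopy clustering and are an assumption here, since the slide only names the cheap metric; the cheap metric itself (size of the symmetric difference of letter sets) and the sample names are invented for illustration, and centers are picked in list order rather than at random to keep the result reproducible.

```python
from itertools import combinations


def cheap_distance(a, b):
    """Cheap metric: number of letters that occur in only one of the two strings."""
    return len(set(a) ^ set(b))


def canopies(items, distance, loose, tight):
    """Group items into possibly overlapping canopies; expects 0 <= tight <= loose."""
    remaining = list(items)
    result = []
    while remaining:
        center = remaining[0]
        # Everything within the loose threshold joins this canopy.
        result.append([x for x in remaining if distance(center, x) <= loose])
        # Items within the tight threshold are settled and leave the pool;
        # items in between stay and may also join later canopies.
        remaining = [x for x in remaining if distance(center, x) > tight]
    return result


def candidate_pairs(canopy_list):
    """Pairs that share a canopy; only these get the expensive comparison."""
    pairs = set()
    for canopy in canopy_list:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs


names = ["ali", "aly", "alli", "bob", "rob", "zed"]
groups = canopies(names, cheap_distance, loose=2, tight=1)
pairs = candidate_pairs(groups)
```

With six names there are 15 possible pairs; the expensive metric is only needed for the pairs in `pairs`, which is how canopies reduce the quadratic number of comparisons.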
Pay-as-you-go Information Integration
Dataspaces
Dataspaces
•Not a data integration approach per se
•Data co-existence approach
•Pay-as-you-go data integration
•Leveraging human contributions for data integration in a non-invasive manner
Selected paper 3: Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.
Relationship between Schema Matching and Deduplication
Entity
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY
Entity
• first name: Mohamed
• last name: Ali
• age: 68
• address: street: Nicestreet 17, city: Wondercity
• tax id: #7234561
•Are they duplicates?
•To compare field values we need schema matches
•To find schema matches we need duplicates
•etc...
Selected paper 4: Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.
Selected Topic Papers
1.Schema Matching
• Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
2.Deduplication
• Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.
3.Dataspaces
• Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.
4.Interdependence between schema matching and deduplication
• Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.