![Page 1: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/1.jpg)
Scalable Hybrid Keyword Search on Distributed Database
Jungkee Kim
Florida State University
Community Grids Laboratory, Indiana University
Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM 2005)
![Page 2: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/2.jpg)
Motivation
Internet
Where is the Information?
![Page 3: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/3.jpg)
Outline
Two Typical Search Paradigms
Problems of Current Search Approaches
Local Hybrid Keyword Search
Hybrid Search on Distributed Databases
![Page 4: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/4.jpg)
Two Typical Search Paradigms
Searching over structured data
Relational Databases
Searching over unstructured data
Information Retrieval
Internet Environment
Semistructured Data – XML
Keyword Search in DB
Web Search Engines – Technologies from Information Retrieval
Hybrid Keyword Search?
![Page 5: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/5.jpg)
Current Approaches – Keyword-only Search
Web Search Engines
Web crawlers visit Web pages and collect keyword-based text indexes.
Fast information retrieval
Keyword Search in Databases
Web integration on legacy DBMS
Dynamic Web publication through embedded DB
Easy to use without knowledge of the DB schema
![Page 6: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/6.jpg)
Problems of Current Approaches – Keyword-based
Web Search Engines
Cannot collect every connected resource
Query results are often unrelated
Keyword Search in Databases
Losing the inherent meaning of the schema
Query results are not based on semantic schema
![Page 7: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/7.jpg)
Current Approaches – Semantic
Semantic Web
Multiple relation links with directed labeled graphs, and machines can understand the relationship between different resources
Describes metadata about resources
To represent the relations of the objects on the Web; the object terms are defined under a specific description – an Ontology
![Page 8: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/8.jpg)
Problems of Current Approaches – Semantic Web
Ontology design is sophisticated
Lack of unified definition
Limited adoption
![Page 9: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/9.jpg)
Our Approach
Hybrid search mechanisms – Semantic metadata + Keyword search
Semantic Solution
Semantic Web might be better than Hybrid search
Hybrid search must be better than Web search engines
Simplicity
Hybrid search is simpler than Semantic Web
![Page 10: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/10.jpg)
Hybrid Keyword Search Service
A search service fetches target information data against a search query.
Unstructured data: A file containing data – MS Word, PDF, PS documents
Metadata: Structured or semistructured data – XML
We utilized an XML-enabled relational DBMS and a native XML DB along with a text search library (Apache Xindice + Jakarta Lucene) to address the search against metadata and text.
![Page 11: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/11.jpg)
How to Combine? (1)
Two entity sets and a relationship in relational DBMS
We can obtain the hybrid search result using a nested subquery
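The nested-subquery combination can be sketched as follows, using Python's built-in `sqlite3` module. This is only an illustration under assumptions: the table names (`metadata`, `keyword_index`), the `doc_id` column linking the two entity sets, and the sample rows are hypothetical and do not come from the slides; the XML-enabled RDBMS used in the talk would have its own schema and XML/text query functions.

```python
import sqlite3

# Hypothetical schema: one table for metadata, one for keyword hits,
# related through a shared document identifier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (doc_id INTEGER PRIMARY KEY, author TEXT, year INTEGER);
    CREATE TABLE keyword_index (doc_id INTEGER, keyword TEXT);
    INSERT INTO metadata VALUES (1, 'Kim', 2005), (2, 'Fox', 2005), (3, 'Kim', 2003);
    INSERT INTO keyword_index VALUES (1, 'hybrid'), (2, 'broker'), (3, 'hybrid');
""")

def hybrid_search(conn, year, keyword):
    """Return IDs of documents whose metadata matches `year`
    and whose text contains `keyword`."""
    rows = conn.execute(
        """
        SELECT doc_id FROM metadata
        WHERE year = ?
          AND doc_id IN (SELECT doc_id FROM keyword_index WHERE keyword = ?)
        ORDER BY doc_id
        """,
        (year, keyword),
    ).fetchall()
    return [row[0] for row in rows]

result = hybrid_search(conn, 2005, "hybrid")
print(result)
```

The inner query selects the keyword matches and the outer query filters them by the metadata condition, so the hybrid result is the intersection of the two searches.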
![Page 12: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/12.jpg)
How to Combine? (2)
A hash table is used for joining search results in non-DBMS based system (Apache Xindice + Lucene)
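A minimal sketch of such a hash-table join, written in Python rather than against the Xindice or Lucene APIs. The function name, the tuple shapes, and the sample IDs are assumptions for illustration only: the metadata hits are loaded into a hash table keyed by document ID, and each keyword hit is then probed against it.

```python
def hash_join(xml_hits, text_hits):
    """Join metadata hits and keyword hits on document ID.

    xml_hits:  list of (doc_id, metadata) pairs from the XML query
    text_hits: list of (doc_id, score) pairs from the keyword search
    """
    # Build phase: hash table keyed by document ID.
    table = {doc_id: meta for doc_id, meta in xml_hits}
    # Probe phase: keep only keyword hits that also matched the metadata query.
    return [
        (doc_id, table[doc_id], score)
        for doc_id, score in text_hits
        if doc_id in table
    ]

xml_hits = [("d1", {"year": 2005}), ("d2", {"year": 2005})]
text_hits = [("d2", 0.9), ("d3", 0.4)]
joined = hash_join(xml_hits, text_hits)
print(joined)
```

Building the table takes one pass over the metadata hits and probing takes one pass over the keyword hits, so the cost grows with the sum of the two result sizes rather than their product.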
![Page 13: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/13.jpg)
Local Query Processing – XML (1)
XML-enabled RDB
DBLP XML records (1,000 – 10,000)
Non-indexed matches except year match bound by the number of matches.
Combined query time depends on the number of year query results
Average XML Query Time
![Page 14: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/14.jpg)
Local Query Processing – XML (2)
Apache Xindice
DBLP XML records (1,000 – 10,000)
Indexed approximate matches for text elements in XML instances as bad as non-indexed queries
Exact matches bound by the number of matches.
Average XML Query Time
![Page 15: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/15.jpg)
Local Query Processing – Hybrid (1)
Hybrid search query performance measurement
XML-enabled RDB
For 100,000 XML instances and 100,000 text documents
Small result set: 4 XML and a keyword matches
Large result set: 7,752 XML and 41,889 documents (3,227)
| Metadata | Author | Year (Nested subquery) | Year (Hash table) |
|---|---|---|---|
| Few Keywords | 0.04 Sec. | 82.9 Sec. | 5.70 Sec. |
| Many Keywords | 0.48 Sec. | Half hour | 6.96 Sec. |
![Page 16: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/16.jpg)
Local Query Processing – Hybrid (2)
Hybrid search query performance measurement
Apache Xindice + Jakarta Lucene
For 10,000 XML instances and 10,000 text documents
Small result set: 2 XML and a keyword matches
Large result set: 192 XML and 4,562 documents (41)
![Page 17: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/17.jpg)
Discussion – Local Hybrid Search
XML-enabled RDB provides proper responses except under some extreme query loads.
Inefficient query plan and query optimization in an old version – better performance in a newer version
A native XML DB (Apache Xindice) had very limited scalability. (No accurate query result over 16,000 XML instances)
We will generalize hybrid search to a distributed environment.
![Page 18: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/18.jpg)
Hybrid Search on Distributed Databases
Data Independence: logically and physically independent; the same schema – no change, data encapsulation in each machine
Network Transparency: depends on MOM or P2P framework
No replication – restricted to a computer cluster
Fragment: full partition; horizontal fragmentation
The query result for the distributed databases is the collection of query results from individual database queries.
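The fragmentation and result-collection idea can be sketched as below. This is an assumption-laden illustration, not the system from the talk: the round-robin placement, the function names, and the use of a thread pool in place of separate machines are all hypothetical. Each record lives in exactly one fragment (no replication), every fragment answers the same query locally, and the distributed result is the concatenation of the local results.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, n):
    """Horizontal fragmentation: record i goes to fragment i % n (no replication)."""
    fragments = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        fragments[i % n].append(rec)
    return fragments

def local_query(fragment, predicate):
    """Query one fragment in isolation; every fragment shares the same schema."""
    return [rec for rec in fragment if predicate(rec)]

def distributed_query(fragments, predicate):
    """Send the same query to every fragment and collect the partial results."""
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = pool.map(lambda frag: local_query(frag, predicate), fragments)
        return [rec for part in partials for rec in part]

records = list(range(100))
fragments = partition(records, 8)
hits = distributed_query(fragments, lambda rec: rec % 10 == 0)
print(sorted(hits))
```

Because the fragments are disjoint and together cover all records, the collected result contains the same records a single-database query would return, in a possibly different order.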
![Page 19: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/19.jpg)
Scalable Hybrid Search Architecture on DDBS
[Figure: architecture diagram. Components: Client (×3), Message Broker, Search Service (×3). Roles shown: publisher for a query topic, subscriber for a query topic, publisher for a temporary topic, subscriber for a temporary topic. Messages: QueryMessage, ResultMessage.]
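One way to read the publish/subscribe roles in this architecture is sketched below with a synchronous in-memory broker. Everything here is an assumption for illustration: the `Broker` class is a toy, not the NaradaBrokering or JMS API; the topic names are hypothetical; and the assignment of roles (client publishes to the query topic and subscribes to a temporary topic, search services do the reverse) is an interpretation of the diagram labels.

```python
import itertools
from collections import defaultdict

class Broker:
    """Toy in-memory topic broker with synchronous delivery."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def unsubscribe(self, topic):
        self._subscribers.pop(topic, None)

    def publish(self, topic, message):
        for callback in list(self._subscribers.get(topic, [])):
            callback(message)

class SearchService:
    """Subscriber for the query topic, publisher for the temporary topic."""
    def __init__(self, broker, documents):
        self.broker = broker
        self.documents = documents
        broker.subscribe("query", self.on_query)

    def on_query(self, query_message):
        hits = [d for d in self.documents if query_message["keyword"] in d]
        # Reply on the temporary topic named in the query message.
        self.broker.publish(query_message["reply_topic"], {"hits": hits})

_topic_ids = itertools.count()

def client_search(broker, keyword):
    """Publisher for the query topic, subscriber for a temporary topic."""
    reply_topic = f"tmp-{next(_topic_ids)}"
    results = []
    broker.subscribe(reply_topic, lambda result_message: results.extend(result_message["hits"]))
    broker.publish("query", {"keyword": keyword, "reply_topic": reply_topic})
    broker.unsubscribe(reply_topic)
    return results

broker = Broker()
services = [
    SearchService(broker, ["hybrid search", "xml db"]),
    SearchService(broker, ["keyword search"]),
]
print(sorted(client_search(broker, "search")))
```

A real broker delivers messages asynchronously, so a client would wait for replies with a timeout instead of returning immediately after publishing.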
![Page 20: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/20.jpg)
Cooperating Broker Network
Distributed Databases based on NaradaBrokering Network
![Page 21: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/21.jpg)
Query Processing – DDBS (1)
100,000 XML and 100,000 Documents in 8 machines – 12,500 each
Few keyword match (1-3) on 1 machine only
RDB – 0.04 Sec. for few keyword match
Avg. response time for an author exact match query over 8 search services
![Page 22: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/22.jpg)
Query Processing – DDBS (2)
100,000 XML and 100,000 Documents in 8 machines – 12,500 each
RDB – half hour or 6.96 Sec. (Hash table)
Avg. response time for a year match query over 8 search services
![Page 23: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/23.jpg)
Coupling vs. Scalability
From ICDE 2002 Tutorial
![Page 24: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/24.jpg)
Query Propagation and Result Return on a P2P Network
![Page 25: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/25.jpg)
Peer group architecture of the P2P Search
![Page 26: Scalable Hybrid Keyword Search on Distributed Database](https://reader035.vdocuments.us/reader035/viewer/2022062422/56813815550346895d9fc969/html5/thumbnails/26.jpg)
Conclusion
We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web
Our architecture contributed a performance improvement for some queries
We extended the scalability of Xindice XML queries, which were limited to a small data size on a single machine