oracle text...oracle8i yintermedia: a single product to manage – text – video, image, audio –...
TRANSCRIPT
-
Oracle Text
Roger Ford
Senior Product Manager
-
AgendaWhat is Oracle Text?
Introducing Oracle Text
Text in the database – Why Integration is KeyPerformance and scalabilityEase of UseGlobal SolutionsSearch QualitySpecialized IndexesXMLDocument ServicesClassification and clusteringSolutionsDemos
-
What is Oracle Text?
Oracle Text adds powerful text search and intelligent text management capabilities to the Oracle database.Oracle Text:
– Fully integrated with the database– Offers premier text search quality– Provides advanced features for text management,
document services, XML, etc.– Has the best internationalization features for
multilingual text search applications.
-
Oracle8i
interMedia: a single product to manage– Text– Video, Image, Audio– Location
in the database, using SQLinterMedia Text
– formerly known as ConText, re-architected on the Extensibility Framework
-
Oracle9i - Introducing Oracle Text
Manage and find text stored in the database, on the file system, or on the WebQuery text using standard SQLDocument Services such as themes, gist, term highlighting and markup.XML supportClassification and clusteringText can be ASCII, HTML, XML, or formatted (150+ formats supported)
-
Introducing Oracle Text – Features overview
All classical full-text search features...Exact word matching; Booleans; Wild-cards; ‘Fuzzy’ matching; Proximity searching ; Stemming in multiple languages ; ISO Thesaurus ; support for Japanese, Chinese, Korean, Western languages ; mix languages, character sets
Plus Advanced Capabilities...
Theme identification, indexing, and searching using 1M word knowledge base and linguistic rules
Advanced ABOUT search
Advanced XML search
-
Introducing Oracle Text – Features overview
Database integration– Tight integration with SQL, PL/SQL, Enterprise
Manager, SQL*Loader
Intelligent Text Management– Extensive Knowledge Base and linguistic rules
enables intelligent document services
Extensibility– Customize the Knowledge Base, filters, data store
-
Introducing Oracle Text – An example
create index description_idx on PRODUCT_INFORMATION(PRODUCT_DESCRIPTION) indextype is ctxsys.context ;
select score(1), product_id, product_name from product_information where contains (product_description, 'monitor NEAR "high resolution"', 1)>0order by score(1) desc ;
SCORE(1) PRODUCT_ID PRODUCT_NAME-------- ---------- ------------------------------29 3331 Monitor 21/HR27 3060 Monitor 17/HR14 1726 LCD Monitor 11/PM14 3054 Plasma Monitor 10/XGA14 2252 Monitor 21/HR/M14 2243 Monitor 17/HR/F
-
Oracle Uses Oracle Text
Oracle internet File SystemOracle PortalOracle CRMOracle E-Business SuiteOracle eXchangeUltra SearchOracle.comOTN
-
Oracle Internet File System
-
Oracle E-Business Suite
-
Oracle Ultra Search
-
Integration with the database
The attempt to separate text and normal business (structured) data fails:
– Cost– Complexity– High latency of development and deployment– Performance
-
No Integration - Separate Everything
Application
Repository Index Search Engine(API)
Oracle Database
File System
B-Tree
Inverted
SQL
C API
-
Full Integration – text, index, API, optimizer
Application
Repository Index Search Engine(API)
Oracle Database B-Tree and Text
SQL
-
2. Index Stored and Managed in Database
select table_name from user_tables;
TABLE_NAME------------------------------DR$DESCRIPTION_IDX$IDR$DESCRIPTION_IDX$KDR$DESCRIPTION_IDX$NDR$DESCRIPTION_IDX$RPRODUCT_INFORMATION
select index_name, table_name, column_name from user_ind_columns where table_name='PRODUCT_INFORMATION'
INDEX_NAME TABLE_NAME COLUMN_NAME---------- ---------- --------------------DESCRIPTION_IDX PRODUCT_INFORMATION PRODUCT_DESCRIPTION
-
3. Seamless SQL API
create index description_idx on PRODUCT_INFORMATION(PRODUCT_DESCRIPTION) indextype is ctxsys.context ;
select score(1), product_id, product_name from product_information where contains (product_description, 'monitor NEAR "high
resolution"', 1)>0order by score(1) desc ;
SCORE(1) PRODUCT_ID PRODUCT_NAME-------- ---------- --------------
29 3331 Monitor 21/HR27 3060 Monitor 17/HR14 1726 LCD Monitor 11/PM14 3054 Plasma Monitor 10/XGA14 2252 Monitor 21/HR/M14 2243 Monitor 17/HR/F
-
4. Optimizer Integration
Cost Model associated with the Text Index– Cost Based Optimizer chooses best plan
Optimizer Hints to Force a Planselect /*+ index product_information description_idx */
score(1), product_id from product_information where contains (product_description, 'monitor NEAR "high resolution"', 1)>0 AND list_price < 500 ;
-
Integration Summary
1. Single Repository2. Index Stored and
Managed in Database3. Seamless SQL API4. Optimizer Integration
••Data SecurityData Security
••Data IntegrityData Integrity
••Index Integrity/SecurityIndex Integrity/Security
••Ease of DevelopmentEase of Development
••Ease of Combining ResultsEase of Combining Results
••Performance Combining Performance Combining Results Results
••ScalabilityScalability
••FlexibilityFlexibility
-
Integration Benefits
Low costLow complexityHigh performanceHigh integrityManageabilityMakes use of existing skills
-
Performance – illustration
Large doc set – 100Gig (20million web pages)Hardware : Enterprise SparcTask : web query
– Web-style query syntax– 2-3 words– Return first 100 hits
40 queries/second90% of requests take < 0.5 second7 hours to create index
-
Performance – Query throughput
Throughput Comparison
0.00
10.00
20.00
30.00
40.00
50.00
60.00
70.00
80.00
0 10 20 30
Number of Users
Que
ries
per S
econ
d
CONTEXT Queries persecondTCOMP Queries persecond
Oracle Text vs one of the best-known specialist Text search engines
-
Performance – Speed
Query Time Comparison
0
500
1000
1500
2000
2500
3000
3500
4000
0 10 20 30
Number of Users
Que
ry T
ime
(100
ths
of a
sec
ond)
CONTEXT Longest(s/100)CONTEXT Average(s/100)TCOMP Average
Oracle Text vs one of the best-known specialist Text search engines
-
Performance – Oracle vs Windows Database
Multi-User Query Throughput
0
0.2
0.4
0.6
0.81
1.2
1.4
1.6
1.8
0 2 4 6 8 10
Simultaneous Users
Qu
eri
es
/Se
co
nd
-
Ease of Use, Ease of Development
Simple SQL and PL/SQL interface– Can be used by any developer that knows SQL– Can be called by any tool that knows SQL– Using any language: Java, JSP, PL/SQL, C, etc.
Choice of datastores– Stored in the database– Stored in the file system– Stored on the web (URL)– User-defined datastore
-
Global Solutions
Basic indexing/search works in any NLS language– 48 NLS languages supported
Special support for Japanese, Chinese, KoreanTheme search and services available in any single-byte, white space-delimited languageCan mix languages, character sets in a single columnCan query across languages
-
Chinese, Japanese, Korean Text• Character sets:
• Japanese: JA16SJIS, JA16EUC
• Simplified Chinese: GBK, GB2312-80• Traditional Chinese: BIG5, EUC, TRIS• Korean: KO16KSC5601• Unicode: UTF8
• Lexing:• Lexical segmentation for Japanese, Chinese• Morphological segmentation for Korean
-
Multilingual Search
-
TREC - Quality of search
Text-retrieval is unlike relational queries– Get what you are looking for
Some search engines (e.g. web searches) return too many hits, not enough relevant hitsOracle competes (TREC-8, TREC-10)
– Text-REtrieval Conference– Sponsored by the National Institute of Standards
Achieved impressive precision/recall scoresOracle’s papers available from NIST:
– http://trec.nist.gov/
-
Search Quality
Exact word Boolean expressionPhrase ProximityFuzzy StemmingWildcards
– Prefix, substring index
Thesaurus, multiple ThesauriABOUT searchTheme (concept-based) searchAccumulate scoresTerm weightingAdvanced XML searchXPath supportQuery Feedback
-
Catalog Index
Optimized for response time on small text fields, mixed queries, ORDER BY ….True transactional DMLMutable index structure
– Can duplicate some base table columns in $I table– supports structured query, including range query– supports foreign order by criteria
Subset of CONTEXT query language– No fuzzy, stemming, about– User-friendly web-like query syntax
-
Classification
CTXRULE is an index type designed for classification/routing applicationsEfficiently take a document and find matching queries
Classification Application
Perform Action
Incoming documents
Matched Documents
9i
Compares against rules
-
Classification – Cont.
An inversion of the CONTEXT index– Index a query set– Incoming document becomes the “query”– Use MATCHES to classify a document
Complementary to the CONTEXT index
-
Prefix, substring index
Prefix and Substring are flavors of the CONTEXT indexPrefix will add more tokens to the CONTEXT index to efficiently process prefix searches (e.g. 'ora%')Substring will add an index on substrings of each token, to efficiently process substring searches (e.g. '%oxy%')
-
Storing XML in Oracle
xmltype– store XML as xmltype, use xmltype methods
Decomposition– decompose documents into atomic elements– store elements in columns/rows– compose XML documents using SQL
Store as LOB or varchar– Store XML as-is, in a LOB or VARCHAR– Search using Oracle Text section searching or XPath
-
XML and Oracle Text
Structured data20hardback
Unstructured data– Harry Potter and the Chamber of Secrets, by J. K. Rowling. "The
amazing popularity of Harry Potter and the Sorcerer's Stone means that now even Muggles know about the Leaky Cauldron, Diagon Alley, ….."
Structured/Semi-structured data20hardback
Harry Potter and the Chamber of Secrets J. K. RowlingThe amazing popularity …
-
Content search and XML
Create indexcreate index BOOKINDEX on BOOKS(text) indextype is
ctxsys.context
Query by contentselect PRICE from BOOKSwhere contains(text, ‘Harry Potter')>0 order by price
desc;
Create index to include section infocreate index BOOKINDEX on BOOKS(text) indextype is ctxsys.contextparameters ('section group my_auto_section_group' ) ;
Limit content search to a section of textselect price from bookswhere contains(text, ‘Harry Potter within title’)>0 order by price desc;
-
Advanced XML searches
Nested section searchThe Matrix
Introduction to Matrix Algebra
select price from media
where contains(desc, ‘matrix within title within movie’)>0
Search inside attribute valuesBridge of Birds
select title from bookswhere contains(text, ‘Hughart within book@author’)>0
-
More advanced XML searches
map multiple tags to same nameThe Diamond Ageor, A Young Lady’s Illustrated Primer
(map H1 and H2 to section name of “headline”)
select author from articleswhere contains(text, ‘Diamond within headline’)>0
doctype limiters to handle tag collisions … [email protected] … … 123 Meheula Pkwy …map (foo)address to “email”, (bar)address to “address”
-
XPath support
Path searching ( Xpath )– dog inpath ( A/B/C )
– dog inpath ( A//B )– dog inpath ( /A/B )– dog inpath ( A/@B )
– dog inpath ( A[B] )– dog inpath ( A[@B = “foo”] )– dog inpath ( A[@B = “foo” AND @C = “bar”] )
Section existence– Haspath ( A//B[@C = “foo”]/D )
-
Searching for content and structure in XML
Let’s assume the following XML contentinsert into proceedings(tk,papers) values (1, sys.XMLType.createXML(' Accelerating Dynamic Web Sites using Edge Side Includes Soo Yun and Scott Davies Oracle Corporation The main focus of this presentation is ... Fast Track to Oracle9i
'));
-
Searching for content and structure in XML
We can combine the extract function to manipulate the XML for extracting fragments of XML.
select p.papers.extract('/paper/authors/text()').getStringVal()
from proceedings pwhere doc_id = 1;
We can also apply all the power of the Text query language for the content search.
select p.papers.extract('/paper/authors/text()').getStringVal()from proceedings p
where contains(papers,'Dynamic or Edge INPATH(paper/title)')>0
-
XML and Oracle Text
Oracle Text gives you advanced search of – Unstructured data (free text)– Semi-structured data (marked-up text)– Structured data (numbers, dates, strings as XML)– Mixed data (all of the above)
… using standard SQL
-
Document Services
Extract Themes (major concepts)– Extract hierarchical structure
Extract Gist– Generic or Point-of-View– Sentence- or Paragraph- level
View a document as HTML– Highlight search terms, highlight navigation
Return results in a table or a PL/SQL tableBasis for Clustering, More Like This, …
-
Classification and clustering
Classification– ctxrule index type and matches operator– ctx_cls.train: provide a training set for automatic
classification.
Clustering– Implementation of K-means– ctx_cluster package– Available from OTN, standard in 10i
-
Case study: Der Spiegel
-
Case study: Der Spiegel
-
JDeveloper Wizards
-
Oracle Ultra Search
New product from Oracle. Reduces the time spent searching for documents:
– Uniform search no matter where the information is -- in a database, on the Internet, Intranet, in an application.
– Crawls, indexes and makes searchable your corporate Intranet.
– Integrates with Oracle9i Application Server Portal.Built on Oracle’s proven Text technology.
-
Summary
Fully integrated with the databasePremier text search qualityAdvanced features for text management, document services, and XML.Best multilingual features in the market.
-
Home page
http://otn.oracle.com/products/text– Documentation– Sample code– Technical information– Discussion forums