oracle text...oracle8i yintermedia: a single product to manage – text – video, image, audio –...

52
Oracle Text Roger Ford Senior Product Manager

Upload: others

Post on 19-Feb-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

  • Oracle Text

    Roger Ford

    Senior Product Manager

  • AgendaWhat is Oracle Text?

    Introducing Oracle Text

    Text in the database – Why Integration is KeyPerformance and scalabilityEase of UseGlobal SolutionsSearch QualitySpecialized IndexesXMLDocument ServicesClassification and clusteringSolutionsDemos

  • What is Oracle Text?

    Oracle Text adds powerful text search and intelligent text management capabilities to the Oracle database.Oracle Text:

    – Fully integrated with the database– Offers premier text search quality– Provides advanced features for text management,

    document services, XML, etc.– Has the best internationalization features for

    multilingual text search applications.

  • Oracle8i

    interMedia: a single product to manage– Text– Video, Image, Audio– Location

    in the database, using SQLinterMedia Text

    – formerly known as ConText, re-architected on the Extensibility Framework

  • Oracle9i - Introducing Oracle Text

    Manage and find text stored in the database, on the file system, or on the WebQuery text using standard SQLDocument Services such as themes, gist, term highlighting and markup.XML supportClassification and clusteringText can be ASCII, HTML, XML, or formatted (150+ formats supported)

  • Introducing Oracle Text – Features overview

    All classical full-text search features...Exact word matching; Booleans; Wild-cards; ‘Fuzzy’ matching; Proximity searching ; Stemming in multiple languages ; ISO Thesaurus ; support for Japanese, Chinese, Korean, Western languages ; mix languages, character sets

    Plus Advanced Capabilities...

    Theme identification, indexing, and searching using 1M word knowledge base and linguistic rules

    Advanced ABOUT search

    Advanced XML search

  • Introducing Oracle Text – Features overview

    Database integration– Tight integration with SQL, PL/SQL, Enterprise

    Manager, SQL*Loader

    Intelligent Text Management– Extensive Knowledge Base and linguistic rules

    enables intelligent document services

    Extensibility– Customize the Knowledge Base, filters, data store

  • Introducing Oracle Text – An example

    create index description_idx on PRODUCT_INFORMATION(PRODUCT_DESCRIPTION) indextype is ctxsys.context ;

    select score(1), product_id, product_name from product_information where contains (product_description, 'monitor NEAR "high resolution"', 1)>0order by score(1) desc ;

    SCORE(1) PRODUCT_ID PRODUCT_NAME-------- ---------- ------------------------------29 3331 Monitor 21/HR27 3060 Monitor 17/HR14 1726 LCD Monitor 11/PM14 3054 Plasma Monitor 10/XGA14 2252 Monitor 21/HR/M14 2243 Monitor 17/HR/F

  • Oracle Uses Oracle Text

    Oracle internet File SystemOracle PortalOracle CRMOracle E-Business SuiteOracle eXchangeUltra SearchOracle.comOTN

  • Oracle Internet File System

  • Oracle E-Business Suite

  • Oracle Ultra Search

  • Integration with the database

    The attempt to separate text and normal business (structured) data fails:

    – Cost– Complexity– High latency of development and deployment– Performance

  • No Integration - Separate Everything

    Application

    Repository Index Search Engine(API)

    Oracle Database

    File System

    B-Tree

    Inverted

    SQL

    C API

  • Full Integration – text, index, API, optimizer

    Application

    Repository Index Search Engine(API)

    Oracle Database B-Tree and Text

    SQL

  • 2. Index Stored and Managed in Database

    select table_name from user_tables;

    TABLE_NAME------------------------------DR$DESCRIPTION_IDX$IDR$DESCRIPTION_IDX$KDR$DESCRIPTION_IDX$NDR$DESCRIPTION_IDX$RPRODUCT_INFORMATION

    select index_name, table_name, column_name from user_ind_columns where table_name='PRODUCT_INFORMATION'

    INDEX_NAME TABLE_NAME COLUMN_NAME---------- ---------- --------------------DESCRIPTION_IDX PRODUCT_INFORMATION PRODUCT_DESCRIPTION

  • 3. Seamless SQL API

    create index description_idx on PRODUCT_INFORMATION(PRODUCT_DESCRIPTION) indextype is ctxsys.context ;

    select score(1), product_id, product_name from product_information where contains (product_description, 'monitor NEAR "high

    resolution"', 1)>0order by score(1) desc ;

    SCORE(1) PRODUCT_ID PRODUCT_NAME-------- ---------- --------------

    29 3331 Monitor 21/HR27 3060 Monitor 17/HR14 1726 LCD Monitor 11/PM14 3054 Plasma Monitor 10/XGA14 2252 Monitor 21/HR/M14 2243 Monitor 17/HR/F

  • 4. Optimizer Integration

    Cost Model associated with the Text Index– Cost Based Optimizer chooses best plan

    Optimizer Hints to Force a Planselect /*+ index product_information description_idx */

    score(1), product_id from product_information where contains (product_description, 'monitor NEAR "high resolution"', 1)>0 AND list_price < 500 ;

  • Integration Summary

    1. Single Repository2. Index Stored and

    Managed in Database3. Seamless SQL API4. Optimizer Integration

    ••Data SecurityData Security

    ••Data IntegrityData Integrity

    ••Index Integrity/SecurityIndex Integrity/Security

    ••Ease of DevelopmentEase of Development

    ••Ease of Combining ResultsEase of Combining Results

    ••Performance Combining Performance Combining Results Results

    ••ScalabilityScalability

    ••FlexibilityFlexibility

  • Integration Benefits

    Low costLow complexityHigh performanceHigh integrityManageabilityMakes use of existing skills

  • Performance – illustration

    Large doc set – 100Gig (20million web pages)Hardware : Enterprise SparcTask : web query

    – Web-style query syntax– 2-3 words– Return first 100 hits

    40 queries/second90% of requests take < 0.5 second7 hours to create index

  • Performance – Query throughput

    Throughput Comparison

    0.00

    10.00

    20.00

    30.00

    40.00

    50.00

    60.00

    70.00

    80.00

    0 10 20 30

    Number of Users

    Que

    ries

    per S

    econ

    d

    CONTEXT Queries persecondTCOMP Queries persecond

    Oracle Text vs one of the best-known specialist Text search engines

  • Performance – Speed

    Query Time Comparison

    0

    500

    1000

    1500

    2000

    2500

    3000

    3500

    4000

    0 10 20 30

    Number of Users

    Que

    ry T

    ime

    (100

    ths

    of a

    sec

    ond)

    CONTEXT Longest(s/100)CONTEXT Average(s/100)TCOMP Average

    Oracle Text vs one of the best-known specialist Text search engines

  • Performance – Oracle vs Windows Database

    Multi-User Query Throughput

    0

    0.2

    0.4

    0.6

    0.81

    1.2

    1.4

    1.6

    1.8

    0 2 4 6 8 10

    Simultaneous Users

    Qu

    eri

    es

    /Se

    co

    nd

  • Ease of Use, Ease of Development

    Simple SQL and PL/SQL interface– Can be used by any developer that knows SQL– Can be called by any tool that knows SQL– Using any language: Java, JSP, PL/SQL, C, etc.

    Choice of datastores– Stored in the database– Stored in the file system– Stored on the web (URL)– User-defined datastore

  • Global Solutions

    Basic indexing/search works in any NLS language– 48 NLS languages supported

    Special support for Japanese, Chinese, KoreanTheme search and services available in any single-byte, white space-delimited languageCan mix languages, character sets in a single columnCan query across languages

  • Chinese, Japanese, Korean Text• Character sets:

    • Japanese: JA16SJIS, JA16EUC

    • Simplified Chinese: GBK, GB2312-80• Traditional Chinese: BIG5, EUC, TRIS• Korean: KO16KSC5601• Unicode: UTF8

    • Lexing:• Lexical segmentation for Japanese, Chinese• Morphological segmentation for Korean

  • Multilingual Search

  • TREC - Quality of search

    Text-retrieval is unlike relational queries– Get what you are looking for

    Some search engines (e.g. web searches) return too many hits, not enough relevant hitsOracle competes (TREC-8, TREC-10)

    – Text-REtrieval Conference– Sponsored by the National Institute of Standards

    Achieved impressive precision/recall scoresOracle’s papers available from NIST:

    – http://trec.nist.gov/

  • Search Quality

    Exact word Boolean expressionPhrase ProximityFuzzy StemmingWildcards

    – Prefix, substring index

    Thesaurus, multiple ThesauriABOUT searchTheme (concept-based) searchAccumulate scoresTerm weightingAdvanced XML searchXPath supportQuery Feedback

  • Catalog Index

    Optimized for response time on small text fields, mixed queries, ORDER BY ….True transactional DMLMutable index structure

    – Can duplicate some base table columns in $I table– supports structured query, including range query– supports foreign order by criteria

    Subset of CONTEXT query language– No fuzzy, stemming, about– User-friendly web-like query syntax

  • Classification

    CTXRULE is an index type designed for classification/routing applicationsEfficiently take a document and find matching queries

    Classification Application

    Perform Action

    Incoming documents

    Matched Documents

    9i

    Compares against rules

  • Classification – Cont.

    An inversion of the CONTEXT index– Index a query set– Incoming document becomes the “query”– Use MATCHES to classify a document

    Complementary to the CONTEXT index

  • Prefix, substring index

    Prefix and Substring are flavors of the CONTEXT indexPrefix will add more tokens to the CONTEXT index to efficiently process prefix searches (e.g. 'ora%')Substring will add an index on substrings of each token, to efficiently process substring searches (e.g. '%oxy%')

  • Storing XML in Oracle

    xmltype– store XML as xmltype, use xmltype methods

    Decomposition– decompose documents into atomic elements– store elements in columns/rows– compose XML documents using SQL

    Store as LOB or varchar– Store XML as-is, in a LOB or VARCHAR– Search using Oracle Text section searching or XPath

  • XML and Oracle Text

    Structured data20hardback

    Unstructured data– Harry Potter and the Chamber of Secrets, by J. K. Rowling. "The

    amazing popularity of Harry Potter and the Sorcerer's Stone means that now even Muggles know about the Leaky Cauldron, Diagon Alley, ….."

    Structured/Semi-structured data20hardback

    Harry Potter and the Chamber of Secrets J. K. RowlingThe amazing popularity …

  • Content search and XML

    Create indexcreate index BOOKINDEX on BOOKS(text) indextype is

    ctxsys.context

    Query by contentselect PRICE from BOOKSwhere contains(text, ‘Harry Potter')>0 order by price

    desc;

    Create index to include section infocreate index BOOKINDEX on BOOKS(text) indextype is ctxsys.contextparameters ('section group my_auto_section_group' ) ;

    Limit content search to a section of textselect price from bookswhere contains(text, ‘Harry Potter within title’)>0 order by price desc;

  • Advanced XML searches

    Nested section searchThe Matrix

    Introduction to Matrix Algebra

    select price from media

    where contains(desc, ‘matrix within title within movie’)>0

    Search inside attribute valuesBridge of Birds

    select title from bookswhere contains(text, ‘Hughart within book@author’)>0

  • More advanced XML searches

    map multiple tags to same nameThe Diamond Ageor, A Young Lady’s Illustrated Primer

    (map H1 and H2 to section name of “headline”)

    select author from articleswhere contains(text, ‘Diamond within headline’)>0

    doctype limiters to handle tag collisions … [email protected] … … 123 Meheula Pkwy …map (foo)address to “email”, (bar)address to “address”

  • XPath support

    Path searching ( Xpath )– dog inpath ( A/B/C )

    – dog inpath ( A//B )– dog inpath ( /A/B )– dog inpath ( A/@B )

    – dog inpath ( A[B] )– dog inpath ( A[@B = “foo”] )– dog inpath ( A[@B = “foo” AND @C = “bar”] )

    Section existence– Haspath ( A//B[@C = “foo”]/D )

  • Searching for content and structure in XML

    Let’s assume the following XML contentinsert into proceedings(tk,papers) values (1, sys.XMLType.createXML(' Accelerating Dynamic Web Sites using Edge Side Includes Soo Yun and Scott Davies Oracle Corporation The main focus of this presentation is ... Fast Track to Oracle9i

    '));

  • Searching for content and structure in XML

    We can combine the extract function to manipulate the XML for extracting fragments of XML.

    select p.papers.extract('/paper/authors/text()').getStringVal()

    from proceedings pwhere doc_id = 1;

    We can also apply all the power of the Text query language for the content search.

    select p.papers.extract('/paper/authors/text()').getStringVal()from proceedings p

    where contains(papers,'Dynamic or Edge INPATH(paper/title)')>0

  • XML and Oracle Text

    Oracle Text gives you advanced search of – Unstructured data (free text)– Semi-structured data (marked-up text)– Structured data (numbers, dates, strings as XML)– Mixed data (all of the above)

    … using standard SQL

  • Document Services

    Extract Themes (major concepts)– Extract hierarchical structure

    Extract Gist– Generic or Point-of-View– Sentence- or Paragraph- level

    View a document as HTML– Highlight search terms, highlight navigation

    Return results in a table or a PL/SQL tableBasis for Clustering, More Like This, …

  • Classification and clustering

    Classification– ctxrule index type and matches operator– ctx_cls.train: provide a training set for automatic

    classification.

    Clustering– Implementation of K-means– ctx_cluster package– Available from OTN, standard in 10i

  • Case study: Der Spiegel

  • Case study: Der Spiegel

  • JDeveloper Wizards

  • Oracle Ultra Search

    New product from Oracle. Reduces the time spent searching for documents:

    – Uniform search no matter where the information is -- in a database, on the Internet, Intranet, in an application.

    – Crawls, indexes and makes searchable your corporate Intranet.

    – Integrates with Oracle9i Application Server Portal.Built on Oracle’s proven Text technology.

  • Summary

    Fully integrated with the databasePremier text search qualityAdvanced features for text management, document services, and XML.Best multilingual features in the market.

  • Home page

    http://otn.oracle.com/products/text– Documentation– Sample code– Technical information– Discussion forums