database technologies for e-commerce rakesh agrawal ibm almaden research center
TRANSCRIPT
![Page 1: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/1.jpg)
Database Technologies for E-Commerce
Rakesh Agrawal
IBM Almaden Research Center
![Page 2: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/2.jpg)
OutlineOverview of the Pangea B2B
marketplaceTechnology dives: Catalog integration Searching specification documents Interactive parametric search Storage & retrieval of e-commerce dataResearch opportunities
![Page 3: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/3.jpg)
Pangea Architecture
Product Selection
Web Infrastructure
Websphere/DB2
Payment and Delivery
Brand and Loyalty Creation
Preference System
User provided/System Mined
DataAcquisition
WrappersCleaning
Data Transfer
UserProfiling
andMarket
Research
RegistrationAccounts
![Page 4: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/4.jpg)
Product Selection
-Parametric-Description-Category-Substitute products
-Category tree-Neighborhood
Search
Browse
Simple Products
Product lists with aggregate constraints (e.g. BOM)
Configuration products with compatibility constraints
Composite Products
Bookmarking(shopping basket)
Attribute value comparison
Detailed description (e.g.data sheets)
Reviews
Experiences
Comparisons
Preference System
Specification/Specializaion/Customization
Simple
Multicriteria
Auctions
Price
![Page 5: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/5.jpg)
Brand and Loyalty Creation
Preference System
Specification/Specializaion/Customization
Tools for- extracting- organizing- maintaininginformation
Personal Database
Forums
Chats
Events
AskExpert
Microcommunities
Resource Center
Resume BankJob Bank
Matchmaker- Query- Browse
HeadhuntersSalary Surveys
Job Trends
Career Center
TechnologyProducts
Businesses
CurrentArchive
News Center
![Page 6: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/6.jpg)
Features (1) Eureka: Interactive parametric search Searches based on example products Similarity search for product substitution Dynamic Content generation/presentation (on
the fly from database): Catalog Hierarchy pages with product counts Side-by-Side Product Comparison Product Listings/Details Shopper groups through preference layer
![Page 7: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/7.jpg)
Features (2) Server side Searching: Functional description search using classification Key word & Part Number Search Restriction of the search to sub-trees of the category
hierarchy (pushed down to the database) Real time Price and Availability Engine: Ability to crawl, interrogate, extract price & availability
data from various suppliers/distributors in real time and present them in side-by-side comparison format
Easily add new distributors/suppliers Completely client side implementation to prevent blocking
![Page 8: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/8.jpg)
Features (3) Design Worksheet (Generalized shopping
cart): List of items: an item could be a specific part,
alternative set of parts, specifications, other design worksheets (nesting)
Completely integrated with all relevant components (search, content, price, etc.)
Aggregate constraints (e.g. total price) Multiple projects saved on server Share projects with other (authorized) users Mining for suggestions
![Page 9: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/9.jpg)
Features (4) Design data warehouse creation and
maintenance Crawling technology for extracting part information
from websites of various manufacturer/distributor/suppliers
Data extraction from Manufacturer Spec Sheets (pdf files)
Classification technology to build, merge, manage and populate category hierarchies.
![Page 10: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/10.jpg)
Features (5) Soft content creation and maintenance Crawling to acquire articles/news/postings from various
web news sources, usenet newsgroups, etc. Agents to cluster, classify, extract concepts, identify
associations. Personalized news/data presentation (based on user
defined profile, channels, etc.) Complete interleaving and integration with part content. Automatic generation of summary pages for
manufacturers (e.g. related news articles, URLs , product categories).
![Page 11: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/11.jpg)
PrototypeData for nearly 2 million parts, in 2000
categoriesFocused crawling of more than 175
manufacturers for datasheets/ application notes/manuals (160,000)
1/4 Terabyte database
![Page 12: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/12.jpg)
Catalog Integration
R. Agrawal, R. Srikant: WWW-10
![Page 13: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/13.jpg)
Catalog Integration Problem
Integrate products from new catalog into master catalog.
a
ICs
LogicMem.DSP
fec db
ICs
Cat 2Cat 1
yx z
New CatalogMaster Catalog
![Page 14: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/14.jpg)
The Problem (cont.)
After integration:
ICs
LogicMem.DSP
a fec db yx z
![Page 15: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/15.jpg)
Desired Solution
Automatically integrate products: little or no effort on part of user. domain independent. Problem size: Million products Thousands of categories
![Page 16: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/16.jpg)
Model
Product descriptions consist of words Products live in the leaf-level categories
![Page 17: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/17.jpg)
Basic Algorithm
Build classification model using product descriptions in master catalog.
Use classification model to predict categories for products in the new catalog.
Logic
DSPx
5%
95%
![Page 18: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/18.jpg)
National Semiconductor Files
Part: DS14185 EIA/TIA-232 3 Driver x 5 ReceiverPart_Id: DS14185 Manufacturer: nationalTitle: DS14185 EIA/TIA-232 3 Driver x 5 ReceiverDescription: The DS14185 is a three driver, five receiver
device which conforms to the EIA/TIA-232-E standard.The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.Part: LM3940 1A Low Dropout Regulator Part: Wide Adjustable Range PNP Voltage RegulatorPart: LM2940/LM2940C 1A Low Dropout Regulator
...
...
...
![Page 19: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/19.jpg)
National Semiconductor Files with CategoriesPart: DS14185 EIA/TIA-232 3 Driver x 5 Receiver Pangea Category:
Choice 1: Transceiver Choice 2: Line Receiver Choice 3: Line Driver Choice 4: General-Purpose Silicon Rectifier Choice 5: Tapped Delay Line
Part: LM3940 1A Low Dropout RegulatorPangea Category:
Choice 1: Positive Fixed Voltage RegulatorChoice 2: Voltage-Feedback Operational AmplifierChoice 3: Voltage ReferenceChoice 4: Voltage-Mode SMPS ControllerChoice 5: Positive Adjustable Voltage Regulator
...
...
![Page 20: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/20.jpg)
Accuracy on Pangea Data
B2B Portal for electronic components: 1200 categories, 40K training
documents. 500 categories with < 5 documents. Accuracy: 72% for top choice. 99.7% for top 5 choices.
![Page 21: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/21.jpg)
Enhanced Algorithm: Intuition
Use affinity information in the catalog to be integrated (new catalog):
Products in same category are similar. Bias the classifier to incorporate this
information. Accuracy boost depends on quality of new
catalog: Use tuning set to determine amount of bias.
![Page 22: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/22.jpg)
Algorithm
Extension of the Naive-Bayes classification to incorporate affinity information
![Page 23: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/23.jpg)
Naive Bayes Classifier
Pr(Ci|d) = Pr(Ci)Pr(d|Ci)/Pr(d) //Baye’s Rule
Pr(d): same for all categories (ignore) Pr(Ci) = #docs Ci / #total docs
Pr(d|Ci) = wd Pr(w|Ci)– Words occur independently (unigram model)
Pr(w|Ci) = (n(Ci ,w)+) / (n(Ci)+ |V|)– Maximum likelihood estimate smoothed with the
Lidstone’s law of succession
![Page 24: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/24.jpg)
Enhanced Algorithm
Pr(Ci|d,S) //d existed in category S= Pr(Ci,d,S) / Pr(d,S)
– Pr(Ci,d,S) = Pr(d,S) Pr(Ci|d,S)
= Pr(Ci)Pr(S,d|Ci) / Pr(d,S)= Pr(Ci)Pr(S|Ci)Pr(d| Ci) / Pr(S,d)
– Assuming d, S independent given Ci
= Pr(S)Pr(Ci|S)Pr(d| Ci) / Pr(S,d)– Pr(S|Ci) Pr(Ci) = Pr(Ci|S) Pr(S)
= Pr(Ci|S)Pr(d|Ci) / Pr(d|S)– Pr(S,d) = Pr(S)Pr(d|S)
Same as NB except Pr(Ci|S) instead of Pr(Ci)– Ignore Pr(d|S) as it is same for all classes
![Page 25: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/25.jpg)
Computing Pr(Ci|S)
Pr(Ci|S) =
|Ci|(#docs in S predicted to be in Ci)w /
j[1,n] |Cj|(#docs in S predicted to be in Cj)w
|Ci| = #docs in Ci in the master catalog w determines weight of the new catalog
– Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known
– Choose one weight for the entire new catalog or different weights for different sections
![Page 26: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/26.jpg)
Superiority of the Enhanced Algorithm Theorem: The highest possible accuracy
achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm.
Catch: The optimum value of the weight for which enhanced achieves highest accuracy is data dependent.
The tune set method attempts to select a good value for weight, but there is no guarantee of success.
![Page 27: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/27.jpg)
Empirical Evaluation
Start with a real catalog M Remove n products from M to form the new
catalog N In the new catalog N
– Assign f*n products to the same category as M– Assign the rest to other categories as per some
distribution (but remember their true category) Accuracy: Fraction of products in N assigned
to their true categories
![Page 28: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/28.jpg)
Improvement in Accuracy (Pangea)
1 2 5 10 25 50 100 200
Weight
65
70
75
80
85
90
95
100
Ac
cu
rac
y
Perfect
90-10
80-20
GaussianA
GaussianB
Base
![Page 29: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/29.jpg)
Improvement in Accuracy (Reuters)
1 2 5 10 25 50 100 200
Weight
82
84
86
88
90
92
94
96
98
100
Ac
cu
rac
y
Perfect
90-10
80-20
GaussianA
GaussianB
Base
![Page 30: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/30.jpg)
Improvement in Accuracy (Google.Outdoors)
1 5 25 100 400 1000
Weight
50
60
70
80
90
100
Ac
cu
rac
y
Perfect
90-10
80-20
GaussianA
GaussianB
Base
![Page 31: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/31.jpg)
Tune Set Size (Pangea)
0 5 10 20 35 50
Tune Set Size
70
75
80
85
90
95A
ccu
racy
Perfect
90-10
80-20
GaussianA
GaussianB
Base
Similar results for Reuters and Google.
![Page 32: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/32.jpg)
Empirical Results
71-22-6 79-21 100
Purity (No. of classes & their distribution)
0
5
10
15
20
% E
rro
rs Standard
Enhanced
![Page 33: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/33.jpg)
Summary
Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.
How to apply these ideas to other types of classifiers?
![Page 34: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/34.jpg)
Searching Specification Searching Specification DocumentsDocuments
R. Agrawal, R. Srikant. WWW-2002
![Page 35: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/35.jpg)
Specfication DocumentsSpecfication Documents
Consist of <attribute name, value> pairs, embedded in text
Examples:– Data sheets for electronic parts– Classifieds
![Page 36: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/36.jpg)
Sources of ProblemsSources of Problems Synonyms for attribute names and units.
– "lb" and "pounds", but no "lbs" or "pound". Attribute names are often missing.
– No "Speed", just "MHz Pentium III" – No "Memory", just "MB SDRAM"
Accurate data extraction is hard, e.g. partial datasheet for Cypress CY7C225A PROM:
High Speed18 ns address set-up12 ns clock to outputLow Power495 mW (commercial)660 mW (military)
![Page 37: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/37.jpg)
An end run!An end run! Use a simple regular expression extractor to get
numbers Do simple data extraction to get hints, e.g.
– Hint for unit: the word following the number.– Hint for attribute name: k words following the number.
Use only numbers in the queries– Treat any attribute name in the query also as hint
256 MB SDRAM memory
Unit Hint:MB
Attribute Hint:SDRAM, Memory
![Page 38: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/38.jpg)
Documents and QueriesDocuments and Queries
Document D = {<ni , Hi > | ni N, Hi A, 1 i m}
Query Q = {<qi , Ai > | ni N, Ai A, 1 i k}Hi and Ai are hints
D = {<256, {MB, SDRAM, Memory}>, <700, {<MHz, CPU>}}
Q = <200 MB, 750 MHz>
Document D = {ni | ni N, 1 i m}
Query Q = {qi | ni N, 1 i k}No hints with Documents and Queries
D = {256, 700} Q = {200, 750}
![Page 39: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/39.jpg)
Can it work?Can it work?
Yes!!!!!– Provided data is non-reflective– Reflectivity can be computed a priori for a
given data set and provides estimate of expected accuracy.
Demo
![Page 40: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/40.jpg)
ReflectivityReflectivity
0 10 20 30 40 500
10
20
30
40
50
0 10 20 30 40 500
10
20
30
40
50
0 10 20 30 40 500
10
20
30
40
50
Non-reflective
Low Reflectivity
0 10 20 30 40 500
10
20
30
40
50
Low Reflectivity High Reflectivity
<x=i, y=j> => <x=j, y=i>
![Page 41: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/41.jpg)
Non-reflectivity in real lifeNon-reflectivity in real life
Non-overlapping attributes: – Memory: 64 - 512 Mb, Disk: 10 - 40 Gb
Correlations:– Memory: 64 - 512 Mb, Disk: 10 - 100 Gb still
fine.
Clusters
![Page 42: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/42.jpg)
Basic IdeaBasic Idea
Consider co-ordinates of a point If there is no data point at the permutations of its
co-ordinates, this point is non-reflective– Only a few data points at the permutations of its co-
ordinates => point is largely non-reflective
Compute reflectivity as this ratio summed over all the points– Consider neighborhood of a point in the above
calculation
![Page 43: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/43.jpg)
ReflectivityReflectivity
D: set of m-dimensional points n: coordinates of point x
(n): number of points within distance r of n Reflections(x): permutations of n (n):
number of points in D that have at least one reflection within distance r of n
Reflectivity(m,r) = 1 - 1/|D| xD (n)/(n) Non-reflectivity = 1- Reflectivity See the paper for reflectivity of D over k-dimensional
subspaces
![Page 44: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/44.jpg)
AlgorithmsAlgorithms
How to compute match score (rank) of a document for a given query?
How to limit the number of documents for which the match score is computed?
![Page 45: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/45.jpg)
Match Score of a DocumentMatch Score of a Document
Select k numbers from D yielding minimum distance between Q and D:– F(Q,D) = ( i
k=1 w(qi ,nji)p )1/p
Map problem to Bipartite Matching in graph G:– k source nodes: corresponding to query numbers– m target nodes: corresponding to document numbers– An edge from each source to k nearest targets.
Assign weight w(qi ,nj)p to the edge (qi,nj).
![Page 46: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/46.jpg)
Bipartite Matching Bipartite Matching The optimum solution to the minimum weight bipartite
graph matching problem matches each number in Q with a distinct number in D such that the distance score F(Q,D) is minimized.
The minimum score gives the rank of the document D for the Query Q.
Assuming F to be L1 and w(qi,nj) := |qi - nj| / |qi +|:
10 7525
20 60
.5 .25.58.25
Doc:
Query:
![Page 47: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/47.jpg)
Limiting the Set of DocumentsLimiting the Set of Documents
Similar to the score aggregation problem [Fagin, PODS 96]
Proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]
![Page 48: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/48.jpg)
Limiting the Set of DocumentsLimiting the Set of Documents
Make k conceptual sorted lists, one for each query term [use: documents = index(number)]
Do a round robin access to the lists. For each document found, compute its distance F(D,Q)
Let ni := number last looked at for query term qi Let := (i
k=1 w(qi, ni)p)1/p
Halt when t documents found whose distance is lower bound on distance of unseen documents
20
D4 D6 D8
D2 D3
25/.25 D9D1 D5 D7
60
D6 D8 D9
D5D1 D3 D4
D2 D7
10/.5
35/.75 25/.58
75/.25
66/.1
![Page 49: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/49.jpg)
Evaluation MetricEvaluation Metric
Benchmark: Set of answers when attribute names are precisely known in the document and query
What fraction of the top 10 "true" answers are present in the top 10 answers when attribute names are unknown in both document and query?
![Page 50: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/50.jpg)
Accuracy ResultsAccuracy Results
1 2 3 4 5
Query Size
0
20
40
60
80
100
Acc
urac
y
DRAMLCDProc
TransAutoCredit
GlassHousingWine
Records Fields
DRAM 3,800 10
LCD 1,700 12
Proc 1,100 12
Trans 22,273 24
Auto 205 16
Credit 666 6
Glass 214 10
Housing 506 14
Wine 179 14
![Page 51: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/51.jpg)
Reflectivity estimates accuracyReflectivity estimates accuracy
Non-reflectivity closely tracked accuracy for all nine data sets
Non-reflectivity arises due to clustering and correlations in real data (Randomized non-reflectivity: value obtained after permuting values in the columns)
1 2 3 4 5 7
Query Size
0
20
40
60
80
100
AccuracyNon-Reflectivity
Randomized Non-Reflectivity
![Page 52: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/52.jpg)
Incorporating HintsIncorporating Hints
L1 = w(qi,ni) + B v(Ai,Hi)
– v(Ai,Hi) : distance between attribute name (or unit) for qi and set of hints for ni
– B: relative importance of number match vs. name/unit match
![Page 53: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/53.jpg)
Balance between Number Balance between Number Match & Hint MatchMatch & Hint Match
Weight to hints should depend on the accuracy of hints
Use tune set to determine B on per dataset basis
0.010.03
0.10.3
13
10
Weight of Hint Match
0
20
40
60
80
100
Acc
urac
y
0.990.9
0.8 0.6 No HInts
Wine
![Page 54: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/54.jpg)
Effectiveness of HintsEffectiveness of Hints
Improvement in accuracy depends on how good are hints
1 2 3 4 6 8 10
Query Size
0
20
40
60
80
100
Acc
urac
y
0.990.9
0.8 0.6 No Hints
Wine
![Page 55: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/55.jpg)
Effectiveness of IndexingEffectiveness of Indexing
1 million docs:– 1 sec for qsize = 5– .03 sec for qsize=1
1 2 3 4 5 7
Query Size
0.01
0.1
1
10
100
Tim
e (m
s)
Scan Indexed
DRAM
![Page 56: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/56.jpg)
SummarySummary
Allows querying using only numbers or numbers + hints.
End run around data extraction.Use simple extractor to generate hints.Can ascertain apriori when the technique
will be effective
![Page 57: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/57.jpg)
Future WorkFuture Work
Integration with classic IR (key word search)– PROM speed 20 ns power 500 mW
Extension to non-numeric values
![Page 58: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/58.jpg)
Parametric Search of E-Commerce Data
J. Shafer, R. Agrawal : WWW-9
![Page 59: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/59.jpg)
Conventional Wisdom
• User enters search criteria into HTML form
• User’s query is received by web-server and submitted to a server-side database (usually as SQL)
• Result set is returned to user as an HTML page (or a series of pages)
![Page 60: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/60.jpg)
Search Problem in E-Commerce
• Users often don’t know exactly what they are looking for– Search is a process, not a single operation– An individual query is simply a means to an
end
• Too many/few results: try different query
• No universal ranking function
![Page 61: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/61.jpg)
Essential Ideas of Eureka
• Incrementally fetch superset of tuples of interest in the client (e.g. all products in the product category of interest)
• User-interface and fast indexing structures for interactive exploration of the cached tuples
![Page 62: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/62.jpg)
•Restriction by example
•Ranking
•Fuzzy restrictions
![Page 63: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/63.jpg)
Architecture Overview
DataGroup
DataColumn#1
Eureka
DataPump
ListRenderer
DataColumn#N
ServletDatabase
client
server
. . .HTTP
JDBC
![Page 64: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/64.jpg)
Comments
• Used Eureka in several situations with very positive feedback
• Used Eureka on datasets with 100K records with no visible deterioration in performance
• Performance is excellent, even in Java
![Page 65: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/65.jpg)
Storage and Retrieval ofStorage and Retrieval ofE-Commerce DataE-Commerce Data
R. Agrawal, A. Somani, Y. Xu: VLDB-2001
![Page 66: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/66.jpg)
Typical E-Commerce Data Typical E-Commerce Data CharacteristicsCharacteristics
Nearly 2 Million components More than 2000 leaf-level
categories Large number of Attributes (5000)
An Experimental E-marketplace for An Experimental E-marketplace for Computer componentsComputer components
Constantly evolving schema Sparsely populated data (about 50-100 attributes/component)
![Page 67: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/67.jpg)
Conventional horizontal representation Conventional horizontal representation (n-ary relation)(n-ary relation)
Name Monitor Height Recharge Output playback Smooth scan Progressive Scan
PAN DVD-L75 7 inch - Built-in Digital - - -
KLH DVD221 - 3.75 - S-Video - - No
SONY S-7000 - - - - - - -
SONY S-560D - - - - Cinema Sound Yes -
… … … … … … … …
DB Catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns)
Storage overhead of NULL values Nulls increase the index size and they sort high in DB2 B+ tree index Hard to load/update Schema evolution is expensive
Querying is straightforward
![Page 68: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/68.jpg)
Binary RepresentationBinary Representation(N 2-ary relations)(N 2-ary relations)
Dense representation Manageability is hard
because of large number of tables
Schema evolution expensive
Decomposition Storage Model [Copeland et al SIGMOD 85], [Khoshafian et al ICDE 87]
Monet: Binary Attribute Tables [Boncz et al VLDB Journal 99]
Attribute Approach for storing XML Data [Florescu et al INRIA Tech Report 99]
Val
7 inch
Name
PAN DVD-L75
Monitor
ValName
KLH DVD221
Height
3.75
ValName
PAN DVD-L75
Output
Digital
S-VideoKLH DVD221
![Page 69: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/69.jpg)
Vertical representationVertical representation(One 3-ary relation)(One 3-ary relation)
Oid (object identifier) Key (attribute name) Val (attribute value)
Objects can have large number of attributes
Handles sparseness well Schema evolution is easy
Oid Key Val
0 ‘Name’ ‘PAN DVD-L75’
0 ‘Monitor’ ‘7 inch’
0 ‘Recharge’ ‘Built-in’
0 ‘Output’ ‘Digital’
1 ‘Name’ ‘KLH DVD221’
1 ‘Height’ ‘3.75’
1 ‘Output’ ‘S-Video’
1 ‘Progressive Scan’
‘No’
2 ‘Name’ ‘SONY S-7000’
… … …
Implementation of SchemaSQL [LSS 99] Edge Approach for storing XML Data [FK
99]
![Page 70: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/70.jpg)
Querying over Vertical Querying over Vertical RepresentationRepresentation
Simple query on a Horizontal scheme SELECT MONITOR FROM H WHERE OUTPUT=‘Digital’
Becomes quite complex:
SELECT v1.Val
FROM vtable v1, vtable v2 WHERE v1.Key = ‘Monitor’ AND v2.Key = ‘Output’ AND v2.Val = ‘Digital’ AND v1.Oid = v2.Oid
Writing applications becomes much harder. What can we do ?
![Page 71: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/71.jpg)
Solution : Query MappingSolution : Query MappingTranslation layer maps relational algebraic
operations on H to operations on V
…Attrk…Attr2Attr1
Query Mapping Layer
ValKeyOid
Horizontal view (H)
Vertical table (V)
![Page 72: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/72.jpg)
Can we define a translation algebra ?Standard database view mechanism does not
work attribute values become column names (need higher
order views)
Can the vertical representation support fast querying ?
What are the new requirements from the database engines?
KeyKey IssuesIssues
![Page 73: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/73.jpg)
Transformation AlgebraTransformation Algebra
Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation.
Two key operators:– v2h ( Operation – Convert from vertical to horizontal
k(V) = [Oid(V)] [i=1,k Oid,Val(Key=‘Ai’(V))]
– h2V (Operation – Convert from horizontal to vertical
k(H) = i=1,k Oid,’Ai’Ai(Ai ‘’(V))]
i=1,k Oid,’Ai’Ai(i=1,kAi=‘’(V))
– Similar to Unfold/Fold [LSS 99] ,Gather/Scatter [STA 98]
![Page 74: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/74.jpg)
Implementation StrategiesImplementation Strategies
VerticalSQL – Uses only SQL-92 level capabilities
VerticalUDF – Exploits User Defined Functions and Table Functions
– Binary– 2-ary representation with one relation per attribute
(using only SQL-92 transforms) SchemaSQL
– Addresses a more general problem– Performed 2-3X worse than Vertical representation
because of temporary tables
![Page 75: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/75.jpg)
Performance: Experimental setupPerformance: Experimental setup
600 MHz dual-processor Intel Pentium machine 512 MB RAM Windows NT 4.0 Database IBM DB2 UDB 7.1 Two 30GB IDE Drives Buffer Pool Size – 50 MB All numbers reported are cold numbers Synthetic data
– 200X100K & =10% 200 columns, 100K rows and non-null density = 10%
![Page 76: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/76.jpg)
Clustering by Clustering by KeyKey outperformed outperformed clustering by clustering by OidOid
density = 10%, 1000 cols x 20K rows
0
5
10
15
20
25
0.1% 1% 5%
Join selectivity
Ex
ec
uti
on
tim
e (
se
co
nd
s)
VerticalSQL_oid
VerticalSQL_key
Join
![Page 77: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/77.jpg)
Projection of 10 columns
VerticalSQL comparable to Binary VerticalSQL comparable to Binary and outperforms Horizontaland outperforms Horizontal
0
10
20
30
40
50
60
200x100K 400x50K 800x25K 1000x20K
Table (#cols x #rows)
Ex
ec
uti
on
tim
e (
se
co
nd
s)
density = 10%
HorizontalSQL
VerticalSQL
Binary
![Page 78: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/78.jpg)
VerticalUDF is the best approach VerticalUDF is the best approach
0
10
20
30
200x100K 400x50K 800x25K 1000x20K
Table (#cols x #rows)
Ex
ec
uti
on
tim
e (
se
co
nd
s)
density = 10%
VerticalUDF
VerticalSQL
Binary
Projection of 10 columns
![Page 79: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/79.jpg)
Wish List from Database EnginesWish List from Database Engines
– VerticalUDF approach showed need for
– Enhanced table functions
– First class treatment of table functions
– Native support for v2h and h2v operations
– Partial indices
![Page 80: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/80.jpg)
SummarySummary
+-
+-Flexibility
++Manageability
Vertical (w/ Mapping)Horizontal
-
-
Binary (w/ Mapping)
+Performance
Querying + + +
![Page 81: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/81.jpg)
OpportunitiesOpportunities
![Page 82: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/82.jpg)
Semiautomatic Catalog Creation
Supplier Web Sites
Semiautomatic Catalog
Extraction
WCS Catalog
Supplier 1 Supplier 2 Supplier n
Proxy
Fetch Rules Extract Rules
RuleRepository
Web BrowserTarget Pages
Logs
Supplier[i]
Data Mining
Technology for Quickly Assemble catalog from semi-structured data
...
![Page 83: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/83.jpg)
Architecture
Data StoreWeb
Crawler
OtherSources
OCR ExtractorCatego-
rizer
Indexer(IM forText)
PDFto
Text
![Page 84: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/84.jpg)
Improving the Catalog Hierarchy
Identify sets of nodes (or single "kitchen sink" node) for re-organization.
Cluster the set of nodes, and compare the resulting hierarchy with the original.
Identify nodes that fit better elsewhere in the hierarchy.
Metrics: classification accuracy, tightness of cluster.
![Page 85: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/85.jpg)
Integration of Real Time Operational Data
Client
Server
Remote Site 1
Remote Site 2
Remote Site n
Fetch target page(s)4,Query, Extract Response5
in real time
Req ResDynamic
12
3
4
4
6
4
5
![Page 86: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/86.jpg)
![Page 87: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/87.jpg)
Privacy Preserving Data Mining
Insight: Preserve privacy at the individual level, while still building accurate data mining models at the aggregate level.
Add random noise to individual values to protect privacy. Can dramatically change distribution
of values. EM algorithm to estimate original
distribution of values given randomized values + randomization function. Estimate only accurate over thousands of
values => preserves privacy. Algorithms for building classification
models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.
50 | 40K | ... 30 | 70K | ...
Randomizer Randomizer
Reconstructdistribution
of Age
Reconstructdistributionof Salary
Data Mining Algorithms
Data Mining Model
65 | 20K | ... 25 | 60K | ...
Alice’s age
Alice’s salary
Bob’s age
30 becomes
65 (30+35)
![Page 88: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/88.jpg)
Seems to work well!
20
60
Age
0
200
400
600
800
1000
1200
Nu
mb
er
of
Peop
le
Original
Randomized
Reconstructed
![Page 89: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/89.jpg)
Accuracy vs. Privacy
10 20 40 60 80 100 150 200
Randomization Level
40
50
60
70
80
90
100
Acc
ura
cy
Original
Randomized
Reconstructed
Fn 3
![Page 90: Database Technologies for E-Commerce Rakesh Agrawal IBM Almaden Research Center](https://reader030.vdocuments.us/reader030/viewer/2022032600/56649db15503460f94a9f7ca/html5/thumbnails/90.jpg)
Summary
E-Commerce is a rich application domain for database technology
Research opportunities abound