database technologies for e-commerce rakesh agrawal ibm almaden research center

Database Technologies for E-Commerce

Rakesh Agrawal

IBM Almaden Research Center

OutlineOverview of the Pangea B2B

marketplaceTechnology dives: Catalog integration Searching specification documents Interactive parametric search Storage & retrieval of e-commerce dataResearch opportunities

Pangea Architecture

Product Selection

Web Infrastructure

Websphere/DB2

Payment and Delivery

Brand and Loyalty Creation

Preference System

User provided/System Mined

DataAcquisition

WrappersCleaning

Data Transfer

UserProfiling

andMarket

Research

RegistrationAccounts

Product Selection

-Parametric-Description-Category-Substitute products

-Category tree-Neighborhood

Search

Browse

Simple Products

Product lists with aggregate constraints (e.g. BOM)

Configuration products with compatibility constraints

Composite Products

Bookmarking(shopping basket)

Attribute value comparison

Detailed description (e.g.data sheets)

Reviews

Experiences

Comparisons

Preference System

Specification/Specializaion/Customization

Simple

Multicriteria

Auctions

Price

Brand and Loyalty Creation

Preference System

Specification/Specializaion/Customization

Tools for- extracting- organizing- maintaininginformation

Personal Database

Forums

Chats

Events

AskExpert

Microcommunities

Resource Center

Resume BankJob Bank

Matchmaker- Query- Browse

HeadhuntersSalary Surveys

Job Trends

Career Center

TechnologyProducts

Businesses

CurrentArchive

News Center

Features (1) Eureka: Interactive parametric search Searches based on example products Similarity search for product substitution Dynamic Content generation/presentation (on

the fly from database): Catalog Hierarchy pages with product counts Side-by-Side Product Comparison Product Listings/Details Shopper groups through preference layer

Features (2) Server side Searching: Functional description search using classification Key word & Part Number Search Restriction of the search to sub-trees of the category

hierarchy (pushed down to the database) Real time Price and Availability Engine: Ability to crawl, interrogate, extract price & availability

data from various suppliers/distributors in real time and present them in side-by-side comparison format

Easily add new distributors/suppliers Completely client side implementation to prevent blocking

Features (3) Design Worksheet (Generalized shopping

cart): List of items: an item could be a specific part,

alternative set of parts, specifications, other design worksheets (nesting)

Completely integrated with all relevant components (search, content, price, etc.)

Aggregate constraints (e.g. total price) Multiple projects saved on server Share projects with other (authorized) users Mining for suggestions

Features (4) Design data warehouse creation and

maintenance Crawling technology for extracting part information

from websites of various manufacturer/distributor/suppliers

Data extraction from Manufacturer Spec Sheets (pdf files)

Classification technology to build, merge, manage and populate category hierarchies.

Features (5) Soft content creation and maintenance Crawling to acquire articles/news/postings from various

web news sources, usenet newsgroups, etc. Agents to cluster, classify, extract concepts, identify

associations. Personalized news/data presentation (based on user

defined profile, channels, etc.) Complete interleaving and integration with part content. Automatic generation of summary pages for

manufacturers (e.g. related news articles, URLs , product categories).

PrototypeData for nearly 2 million parts, in 2000

categoriesFocused crawling of more than 175

manufacturers for datasheets/ application notes/manuals (160,000)

1/4 Terabyte database

Catalog Integration

R. Agrawal, R. Srikant: WWW-10

Catalog Integration Problem

Integrate products from new catalog into master catalog.

a

ICs

LogicMem.DSP

fec db

ICs

Cat 2Cat 1

yx z

New CatalogMaster Catalog

The Problem (cont.)

After integration:

ICs

LogicMem.DSP

a fec db yx z

Desired Solution

Automatically integrate products: little or no effort on part of user. domain independent. Problem size: Million products Thousands of categories

Model

Product descriptions consist of words Products live in the leaf-level categories

Basic Algorithm

Build classification model using product descriptions in master catalog.

Use classification model to predict categories for products in the new catalog.

Logic

DSPx

5%

95%

National Semiconductor Files

Part: DS14185 EIA/TIA-232 3 Driver x 5 ReceiverPart_Id: DS14185 Manufacturer: nationalTitle: DS14185 EIA/TIA-232 3 Driver x 5 ReceiverDescription: The DS14185 is a three driver, five receiver

device which conforms to the EIA/TIA-232-E standard.The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.Part: LM3940 1A Low Dropout Regulator Part: Wide Adjustable Range PNP Voltage RegulatorPart: LM2940/LM2940C 1A Low Dropout Regulator

...

...

...

National Semiconductor Files with CategoriesPart: DS14185 EIA/TIA-232 3 Driver x 5 Receiver Pangea Category:

Choice 1: Transceiver Choice 2: Line Receiver Choice 3: Line Driver Choice 4: General-Purpose Silicon Rectifier Choice 5: Tapped Delay Line

Part: LM3940 1A Low Dropout RegulatorPangea Category:

Choice 1: Positive Fixed Voltage RegulatorChoice 2: Voltage-Feedback Operational AmplifierChoice 3: Voltage ReferenceChoice 4: Voltage-Mode SMPS ControllerChoice 5: Positive Adjustable Voltage Regulator

...

...

Accuracy on Pangea Data

B2B Portal for electronic components: 1200 categories, 40K training

documents. 500 categories with < 5 documents. Accuracy: 72% for top choice. 99.7% for top 5 choices.

Enhanced Algorithm: Intuition

Use affinity information in the catalog to be integrated (new catalog):

Products in same category are similar. Bias the classifier to incorporate this

information. Accuracy boost depends on quality of new

catalog: Use tuning set to determine amount of bias.

Algorithm

Extension of the Naive-Bayes classification to incorporate affinity information

Computing Pr(Ci|S)

Pr(Ci|S) =

|Ci|(#docs in S predicted to be in Ci)w /

j[1,n] |Cj|(#docs in S predicted to be in Cj)w

|Ci| = #docs in Ci in the master catalog w determines weight of the new catalog

– Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known

– Choose one weight for the entire new catalog or different weights for different sections

Superiority of the Enhanced Algorithm Theorem: The highest possible accuracy

achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm.

Catch: The optimum value of the weight for which enhanced achieves highest accuracy is data dependent.

The tune set method attempts to select a good value for weight, but there is no guarantee of success.

Empirical Evaluation

Start with a real catalog M Remove n products from M to form the new

catalog N In the new catalog N

– Assign f*n products to the same category as M– Assign the rest to other categories as per some

distribution (but remember their true category) Accuracy: Fraction of products in N assigned

to their true categories

Improvement in Accuracy (Pangea)

1 2 5 10 25 50 100 200

Weight

65

70

75

80

85

90

95

100

Ac

cu

rac

y

Perfect

90-10

80-20

GaussianA

GaussianB

Base

Improvement in Accuracy (Reuters)

1 2 5 10 25 50 100 200

Weight

82

84

86

88

90

92

94

96

98

100

Ac

cu

rac

y

Perfect

90-10

80-20

GaussianA

GaussianB

Base

Improvement in Accuracy (Google.Outdoors)

1 5 25 100 400 1000

Weight

50

60

70

80

90

100

Ac

cu

rac

y

Perfect

90-10

80-20

GaussianA

GaussianB

Base

Tune Set Size (Pangea)

0 5 10 20 35 50

Tune Set Size

70

75

80

85

90

95A

ccu

racy

Perfect

90-10

80-20

GaussianA

GaussianB

Base

Similar results for Reuters and Google.

Empirical Results

71-22-6 79-21 100

Purity (No. of classes & their distribution)

0

5

10

15

20

% E

rro

rs Standard

Enhanced

Summary

Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.

How to apply these ideas to other types of classifiers?

Searching Specification Searching Specification DocumentsDocuments

R. Agrawal, R. Srikant. WWW-2002

Specfication DocumentsSpecfication Documents

Consist of <attribute name, value> pairs, embedded in text

Examples:– Data sheets for electronic parts– Classifieds

Sources of ProblemsSources of Problems Synonyms for attribute names and units.

– "lb" and "pounds", but no "lbs" or "pound". Attribute names are often missing.

– No "Speed", just "MHz Pentium III" – No "Memory", just "MB SDRAM"

Accurate data extraction is hard, e.g. partial datasheet for Cypress CY7C225A PROM:

High Speed18 ns address set-up12 ns clock to outputLow Power495 mW (commercial)660 mW (military)

An end run!An end run! Use a simple regular expression extractor to get

numbers Do simple data extraction to get hints, e.g.

– Hint for unit: the word following the number.– Hint for attribute name: k words following the number.

Use only numbers in the queries– Treat any attribute name in the query also as hint

256 MB SDRAM memory

Unit Hint:MB

Attribute Hint:SDRAM, Memory

Documents and QueriesDocuments and Queries

Document D = {<ni , Hi > | ni N, Hi A, 1 i m}

Query Q = {<qi , Ai > | ni N, Ai A, 1 i k}Hi and Ai are hints

D = {<256, {MB, SDRAM, Memory}>, <700, {<MHz, CPU>}}

Q = <200 MB, 750 MHz>

Document D = {ni | ni N, 1 i m}

Query Q = {qi | ni N, 1 i k}No hints with Documents and Queries

D = {256, 700} Q = {200, 750}

Can it work?Can it work?

Yes!!!!!– Provided data is non-reflective– Reflectivity can be computed a priori for a

given data set and provides estimate of expected accuracy.

Demo

ReflectivityReflectivity

0 10 20 30 40 500

10

20

30

40

50

0 10 20 30 40 500

10

20

30

40

50

0 10 20 30 40 500

10

20

30

40

50

Non-reflective

Low Reflectivity

0 10 20 30 40 500

10

20

30

40

50

Low Reflectivity High Reflectivity

<x=i, y=j> => <x=j, y=i>

Non-reflectivity in real lifeNon-reflectivity in real life

Non-overlapping attributes: – Memory: 64 - 512 Mb, Disk: 10 - 40 Gb

Correlations:– Memory: 64 - 512 Mb, Disk: 10 - 100 Gb still

fine.

Clusters

Basic IdeaBasic Idea

Consider co-ordinates of a point If there is no data point at the permutations of its

co-ordinates, this point is non-reflective– Only a few data points at the permutations of its co-

ordinates => point is largely non-reflective

Compute reflectivity as this ratio summed over all the points– Consider neighborhood of a point in the above

calculation

ReflectivityReflectivity

D: set of m-dimensional points n: coordinates of point x

(n): number of points within distance r of n Reflections(x): permutations of n (n):

number of points in D that have at least one reflection within distance r of n

Reflectivity(m,r) = 1 - 1/|D| xD (n)/(n) Non-reflectivity = 1- Reflectivity See the paper for reflectivity of D over k-dimensional

subspaces

AlgorithmsAlgorithms

How to compute match score (rank) of a document for a given query?

How to limit the number of documents for which the match score is computed?

Match Score of a DocumentMatch Score of a Document

Select k numbers from D yielding minimum distance between Q and D:– F(Q,D) = ( i

k=1 w(qi ,nji)p )1/p

Map problem to Bipartite Matching in graph G:– k source nodes: corresponding to query numbers– m target nodes: corresponding to document numbers– An edge from each source to k nearest targets.

Assign weight w(qi ,nj)p to the edge (qi,nj).

Bipartite Matching Bipartite Matching The optimum solution to the minimum weight bipartite

graph matching problem matches each number in Q with a distinct number in D such that the distance score F(Q,D) is minimized.

The minimum score gives the rank of the document D for the Query Q.

Assuming F to be L1 and w(qi,nj) := |qi - nj| / |qi +|:

10 7525

20 60

.5 .25.58.25

Doc:

Query:

Limiting the Set of DocumentsLimiting the Set of Documents

Similar to the score aggregation problem [Fagin, PODS 96]

Proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]

Limiting the Set of DocumentsLimiting the Set of Documents

Make k conceptual sorted lists, one for each query term [use: documents = index(number)]

Do a round robin access to the lists. For each document found, compute its distance F(D,Q)

Let ni := number last looked at for query term qi Let := (i

k=1 w(qi, ni)p)1/p

Halt when t documents found whose distance is lower bound on distance of unseen documents

20

D4 D6 D8

D2 D3

25/.25 D9D1 D5 D7

60

D6 D8 D9

D5D1 D3 D4

D2 D7

10/.5

35/.75 25/.58

75/.25

66/.1

Evaluation MetricEvaluation Metric

Benchmark: Set of answers when attribute names are precisely known in the document and query

What fraction of the top 10 "true" answers are present in the top 10 answers when attribute names are unknown in both document and query?

Accuracy ResultsAccuracy Results

1 2 3 4 5

Query Size

0

20

40

60

80

100

Acc

urac

y

DRAMLCDProc

TransAutoCredit

GlassHousingWine

Records Fields

DRAM 3,800 10

LCD 1,700 12

Proc 1,100 12

Trans 22,273 24

Auto 205 16

Credit 666 6

Glass 214 10

Housing 506 14

Wine 179 14

Reflectivity estimates accuracyReflectivity estimates accuracy

Non-reflectivity closely tracked accuracy for all nine data sets

Non-reflectivity arises due to clustering and correlations in real data (Randomized non-reflectivity: value obtained after permuting values in the columns)

1 2 3 4 5 7

Query Size

0

20

40

60

80

100

AccuracyNon-Reflectivity

Randomized Non-Reflectivity

Incorporating HintsIncorporating Hints

L1 = w(qi,ni) + B v(Ai,Hi)

– v(Ai,Hi) : distance between attribute name (or unit) for qi and set of hints for ni

– B: relative importance of number match vs. name/unit match

Balance between Number Balance between Number Match & Hint MatchMatch & Hint Match

Weight to hints should depend on the accuracy of hints

Use tune set to determine B on per dataset basis

0.010.03

0.10.3

13

10

Weight of Hint Match

0

20

40

60

80

100

Acc

urac

y

0.990.9

0.8 0.6 No HInts

Wine

Effectiveness of HintsEffectiveness of Hints

Improvement in accuracy depends on how good are hints

1 2 3 4 6 8 10

Query Size

0

20

40

60

80

100

Acc

urac

y

0.990.9

0.8 0.6 No Hints

Wine

Effectiveness of IndexingEffectiveness of Indexing

1 million docs:– 1 sec for qsize = 5– .03 sec for qsize=1

1 2 3 4 5 7

Query Size

0.01

0.1

1

10

100

Tim

e (m

s)

Scan Indexed

DRAM

SummarySummary

Allows querying using only numbers or numbers + hints.

End run around data extraction.Use simple extractor to generate hints.Can ascertain apriori when the technique

will be effective

Future WorkFuture Work

Integration with classic IR (key word search)– PROM speed 20 ns power 500 mW

Extension to non-numeric values

Parametric Search of E-Commerce Data

J. Shafer, R. Agrawal : WWW-9

Conventional Wisdom

• User enters search criteria into HTML form

• User’s query is received by web-server and submitted to a server-side database (usually as SQL)

• Result set is returned to user as an HTML page (or a series of pages)

Search Problem in E-Commerce

• Users often don’t know exactly what they are looking for– Search is a process, not a single operation– An individual query is simply a means to an

end

• Too many/few results: try different query

• No universal ranking function

Essential Ideas of Eureka

• Incrementally fetch superset of tuples of interest in the client (e.g. all products in the product category of interest)

• User-interface and fast indexing structures for interactive exploration of the cached tuples

•Restriction by example

•Ranking

•Fuzzy restrictions

Architecture Overview

DataGroup

DataColumn#1

Eureka

DataPump

ListRenderer

DataColumn#N

ServletDatabase

client

server

. . .HTTP

JDBC

Comments

• Used Eureka in several situations with very positive feedback

• Used Eureka on datasets with 100K records with no visible deterioration in performance

• Performance is excellent, even in Java

Storage and Retrieval ofStorage and Retrieval ofE-Commerce DataE-Commerce Data

R. Agrawal, A. Somani, Y. Xu: VLDB-2001

Typical E-Commerce Data Typical E-Commerce Data CharacteristicsCharacteristics

Nearly 2 Million components More than 2000 leaf-level

categories Large number of Attributes (5000)

An Experimental E-marketplace for An Experimental E-marketplace for Computer componentsComputer components

Constantly evolving schema Sparsely populated data (about 50-100 attributes/component)

Conventional horizontal representation Conventional horizontal representation (n-ary relation)(n-ary relation)

Name Monitor Height Recharge Output playback Smooth scan Progressive Scan

PAN DVD-L75 7 inch - Built-in Digital - - -

KLH DVD221 - 3.75 - S-Video - - No

SONY S-7000 - - - - - - -

SONY S-560D - - - - Cinema Sound Yes -

… … … … … … … …

DB Catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns)

Storage overhead of NULL values Nulls increase the index size and they sort high in DB2 B+ tree index Hard to load/update Schema evolution is expensive

Querying is straightforward

Binary RepresentationBinary Representation(N 2-ary relations)(N 2-ary relations)

Dense representation Manageability is hard

because of large number of tables

Schema evolution expensive

Decomposition Storage Model [Copeland et al SIGMOD 85], [Khoshafian et al ICDE 87]

Monet: Binary Attribute Tables [Boncz et al VLDB Journal 99]

Attribute Approach for storing XML Data [Florescu et al INRIA Tech Report 99]

Val

7 inch

Name

PAN DVD-L75

Monitor

ValName

KLH DVD221

Height

3.75

ValName

PAN DVD-L75

Output

Digital

S-VideoKLH DVD221

Vertical representationVertical representation(One 3-ary relation)(One 3-ary relation)

Oid (object identifier) Key (attribute name) Val (attribute value)

Objects can have large number of attributes

Handles sparseness well Schema evolution is easy

Oid Key Val

0 ‘Name’ ‘PAN DVD-L75’

0 ‘Monitor’ ‘7 inch’

0 ‘Recharge’ ‘Built-in’

0 ‘Output’ ‘Digital’

1 ‘Name’ ‘KLH DVD221’

1 ‘Height’ ‘3.75’

1 ‘Output’ ‘S-Video’

1 ‘Progressive Scan’

‘No’

2 ‘Name’ ‘SONY S-7000’

… … …

Implementation of SchemaSQL [LSS 99] Edge Approach for storing XML Data [FK

99]

Querying over Vertical Querying over Vertical RepresentationRepresentation

Simple query on a Horizontal scheme SELECT MONITOR FROM H WHERE OUTPUT=‘Digital’

Becomes quite complex:

SELECT v1.Val

FROM vtable v1, vtable v2 WHERE v1.Key = ‘Monitor’ AND v2.Key = ‘Output’ AND v2.Val = ‘Digital’ AND v1.Oid = v2.Oid

Writing applications becomes much harder. What can we do ?

Solution : Query MappingSolution : Query MappingTranslation layer maps relational algebraic

operations on H to operations on V

…Attrk…Attr2Attr1

Query Mapping Layer

ValKeyOid

Horizontal view (H)

Vertical table (V)

Can we define a translation algebra ?Standard database view mechanism does not

work attribute values become column names (need higher

order views)

Can the vertical representation support fast querying ?

What are the new requirements from the database engines?

KeyKey IssuesIssues

Transformation AlgebraTransformation Algebra

Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation.

Two key operators:– v2h ( Operation – Convert from vertical to horizontal

k(V) = [Oid(V)] [i=1,k Oid,Val(Key=‘Ai’(V))]

– h2V (Operation – Convert from horizontal to vertical

k(H) = i=1,k Oid,’Ai’Ai(Ai ‘’(V))]

i=1,k Oid,’Ai’Ai(i=1,kAi=‘’(V))

– Similar to Unfold/Fold [LSS 99] ,Gather/Scatter [STA 98]

Implementation StrategiesImplementation Strategies

VerticalSQL – Uses only SQL-92 level capabilities

VerticalUDF – Exploits User Defined Functions and Table Functions

– Binary– 2-ary representation with one relation per attribute

(using only SQL-92 transforms) SchemaSQL

– Addresses a more general problem– Performed 2-3X worse than Vertical representation

because of temporary tables

Performance: Experimental setupPerformance: Experimental setup

600 MHz dual-processor Intel Pentium machine 512 MB RAM Windows NT 4.0 Database IBM DB2 UDB 7.1 Two 30GB IDE Drives Buffer Pool Size – 50 MB All numbers reported are cold numbers Synthetic data

– 200X100K & =10% 200 columns, 100K rows and non-null density = 10%

Clustering by Clustering by KeyKey outperformed outperformed clustering by clustering by OidOid

density = 10%, 1000 cols x 20K rows

0

5

10

15

20

25

0.1% 1% 5%

Join selectivity

Ex

ec

uti

on

tim

e (

se

co

nd

s)

VerticalSQL_oid

VerticalSQL_key

Join

Projection of 10 columns

VerticalSQL comparable to Binary VerticalSQL comparable to Binary and outperforms Horizontaland outperforms Horizontal

0

10

20

30

40

50

60

200x100K 400x50K 800x25K 1000x20K

Table (#cols x #rows)

Ex

ec

uti

on

tim

e (

se

co

nd

s)

density = 10%

HorizontalSQL

VerticalSQL

Binary

VerticalUDF is the best approach VerticalUDF is the best approach

0

10

20

30

200x100K 400x50K 800x25K 1000x20K

Table (#cols x #rows)

Ex

ec

uti

on

tim

e (

se

co

nd

s)

density = 10%

VerticalUDF

VerticalSQL

Binary

Projection of 10 columns

Wish List from Database EnginesWish List from Database Engines

– VerticalUDF approach showed need for

– Enhanced table functions

– First class treatment of table functions

– Native support for v2h and h2v operations

– Partial indices

SummarySummary

+-

+-Flexibility

++Manageability

Vertical (w/ Mapping)Horizontal

-

-

Binary (w/ Mapping)

+Performance

Querying + + +

OpportunitiesOpportunities

Semiautomatic Catalog Creation

Supplier Web Sites

Semiautomatic Catalog

Extraction

WCS Catalog

Supplier 1 Supplier 2 Supplier n

Proxy

Fetch Rules Extract Rules

RuleRepository

Web BrowserTarget Pages

Logs

Supplier[i]

Data Mining

Technology for Quickly Assemble catalog from semi-structured data

...

Architecture

Data StoreWeb

Crawler

OtherSources

OCR ExtractorCatego-

rizer

Indexer(IM forText)

PDFto

Text

Improving the Catalog Hierarchy

Identify sets of nodes (or single "kitchen sink" node) for re-organization.

Cluster the set of nodes, and compare the resulting hierarchy with the original.

Identify nodes that fit better elsewhere in the hierarchy.

Metrics: classification accuracy, tightness of cluster.

Integration of Real Time Operational Data

Client

Server

Remote Site 1

Remote Site 2

Remote Site n

Fetch target page(s)4,Query, Extract Response5

in real time

Req ResDynamic

12

3

4

4

6

4

5

Privacy Preserving Data Mining

Insight: Preserve privacy at the individual level, while still building accurate data mining models at the aggregate level.

Add random noise to individual values to protect privacy. Can dramatically change distribution

of values. EM algorithm to estimate original

distribution of values given randomized values + randomization function. Estimate only accurate over thousands of

values => preserves privacy. Algorithms for building classification

models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.

50 | 40K | ... 30 | 70K | ...

Randomizer Randomizer

Reconstructdistribution

of Age

Reconstructdistributionof Salary

Data Mining Algorithms

Data Mining Model

65 | 20K | ... 25 | 60K | ...

Alice’s age

Alice’s salary

Bob’s age

30 becomes

65 (30+35)

Seems to work well!

20

60

Age

0

200

400

600

800

1000

1200

Nu

mb

er

of

Peop

le

Original

Randomized

Reconstructed

Accuracy vs. Privacy

10 20 40 60 80 100 150 200

Randomization Level

40

50

60

70

80

90

100

Acc

ura

cy

Original

Randomized

Reconstructed

Fn 3

Summary

E-Commerce is a rich application domain for database technology

Research opportunities abound

database technologies for e-commerce rakesh agrawal ibm almaden research center

Documents

blocking slide

product selection slide

terabyte database slide

pangea architecture

preference layer slide

product categories

catalog integration

product counts