Post on 22-Dec-2015
![Page 1: Data Frame Augmentation of Free Form Queries for Constraint Based Document Filtering Andrew Zitzelberger](https://reader030.vdocuments.us/reader030/viewer/2022012913/56649d7d5503460f94a6053d/html5/thumbnails/1.jpg)
Data Frame Augmentation of Free Form Queries for Constraint Based Document Filtering
Andrew Zitzelberger
Problem
Constraint Based Queries
Queries
Test Queries
1) Find me a Wii game.
2) Find me a Honda for under 15 thousand dollars.
3) Roller Coaster more than 150 feet high
4) mountains at least 15K feet
5) games under $25
6) mountains less than 4 km
7) ps games < $40
8) coasters longer than 1000 feet
9) car for under 5 grand newer than 1990 with less than 115K miles
10) more than 15K miles under 5 grand newer than 2004
Keywords + Semantics
• Semantic queries are computationally expensive
• Keyword queries are fast and simple
  o People are used to keyword queries
• Synergistic solution:
  o extract numerical constraints from the query
  o use keywords to quickly narrow the search space
  o use constraints as a filter
Data Frames
Price
  internal representation: Double
  external representation: \$[1-9]\d{0,2}(,\d{3})*|...
  ...
  right units: (K)?\s*(cents|dollars|[Gg]rand|...)
  canonicalization method: toUSDollars
  comparison methods:
    LessThan(p1: Price, p2: Price) returns (Boolean)
    external representation: (less than|<|under|...)\s*{p2}|...
  ...
end
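The Price data frame above can be approximated in Python. This is a minimal sketch under stated assumptions, not the author's implementation: the names `PRICE`, `to_us_dollars`, and `less_than`, the simplified regular expression, and the unit multipliers are all hypothetical stand-ins for the elided parts of the slide.

```python
import re

# Simplified external representation (assumption): optional "$", a number with
# optional thousands separators, optional "K" multiplier, optional unit word.
NUMBER = r"\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?"
PRICE = re.compile(
    rf"(?P<sign>\$)?(?P<num>{NUMBER})\s*(?P<k>K)?\s*(?P<unit>cents|dollars|[Gg]rand)?"
)

def to_us_dollars(text):
    """Canonicalize a price string to a float in US dollars, or None if not a price."""
    m = PRICE.fullmatch(text.strip())
    # Require a "$" sign or a unit word so that bare numbers (e.g. years) are rejected.
    if m is None or not (m.group("sign") or m.group("unit")):
        return None
    value = float(m.group("num").replace(",", ""))
    if m.group("k"):
        value *= 1000
    unit = (m.group("unit") or "dollars").lower()
    if unit == "grand":
        value *= 1000
    elif unit == "cents":
        value /= 100
    return value

def less_than(p1, p2):
    """Comparison method over canonical (internal) price values."""
    return p1 < p2

print(to_us_dollars("6 grand"))      # 6000.0
print(to_us_dollars("15K dollars"))  # 15000.0
print(to_us_dollars("1990"))         # None
```

Keeping recognition, canonicalization, and comparison together is what lets a single frame both find prices in text and evaluate a constraint such as "under 6 grand" against them.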
Data Frame Library
Free Form Query
• Car under 6 grand newer than 1990 with less than 115K miles
Step 1: Condition Extraction
• Car under 6 grand newer than 1990 with less than 115K miles
• Extracted Conditions
  o (Price < 6000)
  o (Year > 1990)
  o (Distance < 115000)
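Condition extraction for the example query can be sketched as follows. The rule table, the `extract_conditions` name, and the regular expressions are assumptions for illustration; they cover only the phrasings in this one query, whereas a real data frame library would carry many more recognizers.

```python
import re

# A number with an optional "K" multiplier (hypothetical, simplified).
NUM = r"(?P<num>\d[\d,]*)\s*(?P<k>K\b)?"

# (attribute, operator, pattern) rules; illustrative only.
RULES = [
    ("Price", "<", rf"(?:less than|under|<)\s*{NUM}\s*(?P<unit>grand|dollars)"),
    ("Year", ">", rf"newer than\s*{NUM}"),
    ("Distance", "<", rf"(?:less than|under|<)\s*{NUM}\s*(?P<unit>miles)"),
]

# Factor that converts each unit to the canonical unit of its attribute.
MULTIPLIER = {"grand": 1000, "dollars": 1, "miles": 1}

def extract_conditions(query):
    """Return (attribute, operator, canonical value) triples in query order."""
    found = []
    for attribute, op, pattern in RULES:
        for m in re.finditer(pattern, query, flags=re.IGNORECASE):
            value = int(m.group("num").replace(",", ""))
            if m.group("k"):
                value *= 1000
            unit = m.groupdict().get("unit")  # the Year rule has no unit group
            if unit:
                value *= MULTIPLIER[unit.lower()]
            found.append((m.start(), attribute, op, value))
    found.sort()  # order by position in the query
    return [(attribute, op, value) for _, attribute, op, value in found]

query = "Car under 6 grand newer than 1990 with less than 115K miles"
print(extract_conditions(query))
# [('Price', '<', 6000), ('Year', '>', 1990), ('Distance', '<', 115000)]
```

Here the unit word decides which attribute an ambiguous phrase such as "less than" belongs to, which is one plausible way the right-units part of a data frame could be used.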
Step 2: Remove Condition Values
• Car under newer than with less than
Step 3: Remove Stopwords
• Car
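Steps 2 and 3 reduce the query to plain keywords. A minimal sketch, assuming a simplified value pattern and a small hand-picked stopword list (the slides do not say which stopword list was used, so `STOPWORDS` here is hypothetical):

```python
import re

# A number, optional "K" multiplier, optional unit word (hypothetical, simplified).
VALUE = re.compile(
    r"\$?\d[\d,]*\s*(?:K\b)?\s*(?:grand|dollars|miles|feet|ft|km)?",
    re.IGNORECASE,
)

# Illustrative stopword list; it includes the comparison words left behind by Step 2.
STOPWORDS = {
    "a", "at", "find", "for", "least", "less", "me", "more",
    "newer", "than", "under", "with",
}

def remove_condition_values(query):
    """Step 2: strip recognized values, keeping the surrounding words."""
    return " ".join(VALUE.sub(" ", query).split())

def remove_stopwords(query):
    """Step 3: keep only the words that are not stopwords."""
    words = re.findall(r"[A-Za-z0-9]+", query)
    return " ".join(w for w in words if w.lower() not in STOPWORDS)

query = "Car under 6 grand newer than 1990 with less than 115K miles"
reduced = remove_condition_values(query)
print(reduced)                    # Car under newer than with less than
print(remove_stopwords(reduced))  # Car
```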
Step 4: Perform Keyword Search
Step 5: Filter Document on Constraints
• Keep page if every constraint is satisfied by at least one extracted value
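The filtering rule can be written as an all-of-any check. This sketch assumes each document has already been run through the data frame recognizers and is represented as a mapping from attribute name to the list of values found in it; the function names and the sample documents are hypothetical.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def satisfies_all(constraints, extracted):
    """True if every constraint is met by at least one value extracted for its attribute."""
    return all(
        any(OPS[op](value, bound) for value in extracted.get(attribute, []))
        for attribute, op, bound in constraints
    )

def filter_documents(ranked_docs, constraints):
    """Keep the keyword-search ranking, dropping documents that fail any constraint."""
    return [doc_id for doc_id, extracted in ranked_docs
            if satisfies_all(constraints, extracted)]

constraints = [("Price", "<", 6000), ("Year", ">", 1990), ("Distance", "<", 115000)]
ranked_docs = [
    ("ad-1", {"Price": [5500.0], "Year": [1998], "Distance": [98000]}),
    ("ad-2", {"Price": [7200.0], "Year": [2003], "Distance": [60000]}),  # price too high
    ("ad-3", {"Price": [4000.0], "Year": [1995]}),                       # no distance found
]
print(filter_documents(ranked_docs, constraints))  # ['ad-1']
```

With an empty constraint list every document passes, so under this reading a query without usable conditions simply returns the keyword ranking unchanged.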
Experimental Setup
• 300 web documents
  o 100 car+trucks pages from http://provo.craigslist.org
  o 100 video gaming pages from http://provo.craigslist.org
  o 50 mountain pages from http://en.wikipedia.org
  o 50 roller coaster pages from http://en.wikipedia.org
• 10 queries
  o 8 with usable conditions
• 2 data sets
  o test-development
  o blind test
Results Summary
• Precision increase for 56% of queries
  o 75% for test-dev, 50% for blind-test
• Precision never worse than keyword query
• Most effective for short, focused documents
| Precision@3 / Query Type | Keyword Queries | Reduced Queries | Data Frame Augmented Queries |
|---|---|---|---|
| Dev-Test Queries | 33% | 40% | 60% |
| Blind-Test Queries | 50% | 46% | 63% |
| Overall | 42% | 43% | 62% |
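For reference, Precision@3 is the fraction of the top three returned results that are relevant. A minimal sketch, assuming relevance judgments are given as a ranked list of booleans and that a result list shorter than k is still divided by k (a convention the slides do not state):

```python
def precision_at_k(ranked_relevance, k=3):
    """Fraction of the top-k results that are relevant.

    ranked_relevance: booleans in rank order, True where the result is relevant.
    """
    return sum(ranked_relevance[:k]) / k

print(round(precision_at_k([True, False, True, True]), 2))  # 0.67
print(round(precision_at_k([False, False, True]), 2))       # 0.33
```

With k = 3 the only possible per-query scores are 0.00, 0.33, 0.67, and 1.00.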
Discussion
• Issues:
  1. inadequate narrowing or ranking of search space
  2. noise caused by other numbers
Distance < 115000
Future Work
• Scalability
  o Indexing data frame extracted terms
• Precision vs Recall trade-offs
• Pay-as-you-go search construction
Related Work
• Question-Answering Systems
• Keyword search over databases and semantic stores
Questions?
Results (Test-Dev Set)
| Query | Keyword | Condition Removed Keyword | Data Frame Augmentation |
|---|---|---|---|
| Find me a Wii game. | 0.33 | 0.33 | 0.33 |
| Find me a Honda for under 15 thousand dollars. | 0.67 | 1.00 | 1.00 |
| roller coaster more than 150ft high | 0.33 | 0.33 | 0.67 |
| mountains at least 15K ft | 1.00 | 0.67 | 1.00 |
| games under $25 | 0.00 | 0.33 | 0.67 |
| mountains less than 4 km | 0.00 | 0.00 | 0.33 |
| ps games < 40 bucks | 0.33 | 0.00 | 0.33 |
| coasters longer than 1000 feet | 0.33 | 1.00 | 1.00 |
| car for under 6 grand newer than 1990 with less than 115K miles | 0.33 | 0.33 | 0.67 |
| more than 15K miles under 10 grand newer than 2000 | 0.00 | 0.00 | 0.00 |
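The summary percentages reported for the test-dev set appear to be plain means of these per-query scores. The short check below, with the scores copied from the table above in row order, is one way to confirm that reading; the variable names are illustrative.

```python
# Per-query Precision@3 on the test-dev set, in the row order of the table above.
scores = {
    "Keyword": [0.33, 0.67, 0.33, 1.00, 0.00, 0.00, 0.33, 0.33, 0.33, 0.00],
    "Condition Removed Keyword": [0.33, 1.00, 0.33, 0.67, 0.33, 0.00, 0.00, 1.00, 0.33, 0.00],
    "Data Frame Augmentation": [0.33, 1.00, 0.67, 1.00, 0.67, 0.33, 0.33, 1.00, 0.67, 0.00],
}

# Mean score per method, expressed as a whole-number percentage.
summary = {name: round(100 * sum(vals) / len(vals)) for name, vals in scores.items()}
print(summary)
# {'Keyword': 33, 'Condition Removed Keyword': 40, 'Data Frame Augmentation': 60}
```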
Results (Blind Test Set)
| Query | Keyword | Condition Removed Keyword | Data Frame Augmentation |
|---|---|---|---|
| Find me a Wii game. | 0.67 | 0.67 | 0.67 |
| Find me a Honda for under 15 thousand dollars. | 0.67 | 1.00 | 1.00 |
| roller coaster more than 150ft high | 0.67 | 0.67 | 0.67 |
| mountains at least 5K ft | 0.33 | 0.33 | 0.67 |
| games under $25 | 0.67 | 0.67 | 1.00 |
| mountains less than 4 km | 0.00 | 0.00 | 0.00 |
| ps games < 40 bucks | 0.33 | 0.33 | 0.33 |
| coasters longer than 1000 feet | 0.67 | 0.67 | 0.67 |
| car for under 6 grand newer than 1990 with less than 115K miles | 0.67 | 0.00 | 1.00 |
| more than 15K miles under 10 grand newer than 2000 | 0.33 | 0.33 | 0.33 |