![Page 1: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/1.jpg)
Fast and Intelligent Search
In Very Large Amounts of Data
Hannah Bast
Max-Planck-Institute for Informatics
Saarbrücken
Kick-off meeting for Cluster of Excellence
Multimodal Computing and Interaction
November 13th, 2008
![Page 2: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/2.jpg)
General theme of my group
Searching for information
Fancy and Fast, On Lots of Data
Terabytes of data, hundreds of millions of documents
Query times in a fraction of a second
Beyond Google-style keyword search
+ always open for other real-world algorithmic problems
currently: route planning in large transportation networks
![Page 3: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/3.jpg)
Searching for Information
Problems we have recently worked on
– efficient prefix search
– efficient faceted search
– efficient error-tolerant search
– efficient semantic search
– efficient snippet generation
– efficient index construction
– efficient 3D shape retrieval
Our system: the CompleteSearch engine
– efficient
– does all of the above (not the shapes though)
There is a demo this afternoon at 2.30 pm
joint work with the graphics people
joint work with the database people
planned joint work with the CL people
planned: efficient music retrieval
![Page 4: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/4.jpg)
Recent Output
Installations
– CompleteSearch DBLP (several million hits / month)
– www.absolventa.de uses CompleteSearch (job search)
– many more: mailing list archives, library search, …
Publications
– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
– Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
– Jan’08: Meyer-Struckmann Award 15,000 €
– Oct’08: Alcatel-Lucent Award 20,000 €
– big press coverage (e.g., it was on the Heise newsticker)
![Page 5: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/5.jpg)
Faceted Search
Problem
– Data: objects with ids and labels
– Query: set of object ids
– Answer: multi-set of labels of the respective objects
– This talk: exactly one label per object
ids:    1          2          3          4          5
labels: year:2001  year:1997  year:2003  year:2001  year:2008
Query: I = {1, 3, 4}
Answer: {year:2001, year:2003, year:2001}
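The example above can be reproduced in a few lines of Python. This is only a minimal in-memory sketch, assuming the object ids are 1 to 5; the names `labels` and `facet_answer` are made up for illustration, and the disk-access issue is ignored entirely.

```python
from collections import Counter

# Hypothetical in-memory data: object id -> its single label (as in the example).
labels = {
    1: "year:2001",
    2: "year:1997",
    3: "year:2003",
    4: "year:2001",
    5: "year:2008",
}

def facet_answer(labels, query_ids):
    """Return the multi-set of labels of the queried objects as a Counter."""
    return Counter(labels[i] for i in query_ids)

answer = facet_answer(labels, {1, 3, 4})
print(answer)  # year:2001 is counted twice, year:2003 once
```

A `Counter` is used because the answer is a multi-set: the same label may occur for several objects.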
![Page 6: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/6.jpg)
Faceted Search
Problem
– Data: objects with ids and labels
– Query: set of object ids
– Answer: multi-set of labels of the respective objects
– This talk: exactly one label per object
a1 a2 a3 a4 a5
Query: I = {1, 3, 4}
Answer: {a1, a3, a4}
Trivial if labels are in an array in main memory
– but if data is on disk, we have block access to the data
– each read gives us a whole block of B labels
– we have to minimize the number of reads / IO operations
typical: B=10,000
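To make the cost model concrete, here is a small sketch that counts block reads for the plain layout. It assumes that label a_i is stored at position i and that one IO reads one aligned block of B consecutive labels; the function name `ios_plain_layout` is hypothetical.

```python
def ios_plain_layout(query_ids, block_size):
    """Number of block reads when label a_i sits at position i (1-based)
    and the array is cut into aligned blocks of `block_size` labels."""
    blocks = {(i - 1) // block_size for i in query_ids}
    return len(blocks)

B = 10_000
close_ids = ios_plain_layout({1, 2, 3}, B)              # ids share one block
spread_ids = ios_plain_layout({1, 10_001, 20_001}, B)   # one block per id
print(close_ids, spread_ids)
```

Under these assumptions a query whose ids are spread out costs one IO per id, even though only one label out of each block of 10,000 is actually needed.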
![Page 7: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/7.jpg)
IO-efficient Faceted Search
Precomputation:
– given n elements a1,…,an
– organize in array of size N ≥ n
Query:
– given I = {i_1,…, i_m} ⊆ {1,…,n}
– return elements a_{i_1},…, a_{i_m}
using as few IOs as possible
Extreme solutions:
– space: n, #IOs: min{n / B, |I|} (optimal space)
– space: B ∙ (n choose B), #IOs: |I| / B (optimal #IOs)
How much space is needed for which IO-efficiency?
Example: n = 8, N = 24, B = 4
a1 a2 a3 a4 a5 a6 a7 a8
a4 a7 a5 a3 a1 a8 a2 a6
a3 a6 a4 a2 a7 a1 a8 a5
Query I = {1, 6, 8}: get a1, a6, a8 with 1 IO by reading the block a1 a8 a2 a6
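The example with n = 8, N = 24, B = 4 can be checked with a brute-force sketch. It assumes that one IO reads one aligned block of B array positions; `min_ios` is a hypothetical helper that tries all block combinations, so it is only usable on tiny inputs and is not the space-efficient algorithm of the talk.

```python
from itertools import combinations

def min_ios(layout, block_size, query_ids):
    """Smallest number of aligned blocks of `layout` that together contain
    all queried elements, found by brute force (tiny examples only)."""
    blocks = [set(layout[p:p + block_size])
              for p in range(0, len(layout), block_size)]
    query = set(query_ids)
    for count in range(1, len(blocks) + 1):
        for chosen in combinations(blocks, count):
            if query <= set().union(*chosen):
                return count
    return None  # some queried element is not stored at all

# The array of size N = 24 from the example (element a_i written as i).
layout = [1, 2, 3, 4, 5, 6, 7, 8,
          4, 7, 5, 3, 1, 8, 2, 6,
          3, 6, 4, 2, 7, 1, 8, 5]

print(min_ios(layout, 4, {1, 6, 8}))  # the block 1 8 2 6 covers the query
print(min_ios(layout, 4, {2, 5}))     # no single block contains both 2 and 5
```

Even with three copies of every element, some two-element queries still need two IOs in this layout, which is the tension the question about space versus IO-efficiency is about.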
![Page 8: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/8.jpg)
A simple lower bound
Theorem:
– if we want < |I| IOs for every query I
– we need ≥ n^2 / (4∙B) space
Proof:
1. construct graph G with n vertices and m edges
edge {i, j} iff ai and aj can be read in one IO
m ≤ 2B ∙ N
2. by assumption, every I = {i, j} can be read with 1 IO, hence edge {i, j} exists
m ≥ (n choose 2) ≈ n^2 / 2
3. together: 2B ∙ N ≥ n^2 / 2, hence N ≥ n^2 / (4∙B)
The short queries alone make the problem hard
Example: n = 4, N = 8, B = 2
a1 a2 a3 a4
a1 a4 a2 a3
(figure: graph G on the vertices a1, a2, a3, a4)
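The graph G from the proof can be built explicitly for the small example with n = 4, N = 8, B = 2. The sketch below assumes that one IO reads one aligned block of B positions (the slide does not spell out the exact block model); `one_io_graph` is a hypothetical name.

```python
from itertools import combinations

def one_io_graph(layout, block_size):
    """Edges {i, j} such that a_i and a_j share at least one aligned block."""
    edges = set()
    for p in range(0, len(layout), block_size):
        block = set(layout[p:p + block_size])
        edges.update(frozenset(pair) for pair in combinations(sorted(block), 2))
    return edges

n, B = 4, 2
layout = [1, 2, 3, 4, 1, 4, 2, 3]   # N = 8, element a_i written as i
edges = one_io_graph(layout, B)

# Pairs that cannot be read with a single IO in this layout.
all_pairs = {frozenset(p) for p in combinations(range(1, n + 1), 2)}
missing = all_pairs - edges
print(sorted(sorted(e) for e in edges))
print(sorted(sorted(e) for e in missing))
```

Under this assumption G is a 4-cycle, so the queries {1, 3} and {2, 4} still need 2 IOs: this particular layout does not answer every query I with fewer than |I| IOs.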
![Page 9: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/9.jpg)
Restrict to large queries
Theorem:
– if we want < |I| IOs for all queries with |I| ≥ M
– we need ≥ n^2 / (4∙B∙M) space
Proof sketch:
1. construct graph G as before
m ≤ 2B ∙ N
2. consider an arbitrary I with |I| ≥ M
I is not independent in G (otherwise |I| IOs would be necessary)
hence G has no independent set of size ≥ M
3. Turán’s theorem implies m ≥ (n choose 2) / M
Example: n = 4, N = 8, B = 2
a1 a2 a3 a4
a1 a4 a2 a3
(figure: graph G on the vertices a1, a2, a3, a4)
So there is hope for queries of size linear in n,
and we indeed have a space-efficient algorithm for that case (but no time to explain it here, sorry)
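For reference, the two edge bounds from the proof sketch combine as follows. This is just the arithmetic written out, using the symbols of the slide (m edges, array size N, block size B, query-size threshold M); the constant c is introduced here only for illustration.

```latex
\begin{align*}
2BN \;\ge\; m \;\ge\; \frac{1}{M}\binom{n}{2} = \frac{n(n-1)}{2M}
\quad\Longrightarrow\quad
N \;\ge\; \frac{n(n-1)}{4BM} \;\approx\; \frac{n^2}{4BM}.
\end{align*}
% For queries of linear size, M = c \cdot n with a constant 0 < c \le 1:
\begin{align*}
N \;\gtrsim\; \frac{n^2}{4B \cdot c\,n} = \frac{n}{4Bc},
\end{align*}
% which is at most linear in n, so this lower bound does not rule out
% a space-efficient solution for large queries.
```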
![Page 10: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/10.jpg)
Turán numbers (extremal set theory)
Definition: for n ≥ k ≥ r
T(n, k, r) = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets
For r = 2: minimal number of edges in an n-vertex graph, where all independent sets have size < k
Turán’s theorem:
– lim n→∞ T(n, k, r) / (n choose r) exists
– exact value of limit unknown for all k > r ≥ 3
Lower bound
– T(n, k, r) ≥ (r / k)^(r-1) ∙ (n choose r)
Paul (Pál) Turán
*1910 in Budapest, †1976 in Budapest
Erdős number 1
Very natural application in the context of faceted search!
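The definition of T(n, k, r) can be made tangible by computing it for very small parameters. The following is a brute-force sketch directly from the definition; the function name `turan_number` is hypothetical, and the running time grows so fast that it is only meant for n around 5.

```python
from itertools import combinations

def turan_number(n, k, r):
    """Brute-force T(n, k, r): the minimal number of r-subsets of {1,...,n}
    such that every k-subset contains at least one of them (tiny n only)."""
    r_subsets = [frozenset(c) for c in combinations(range(1, n + 1), r)]
    k_subsets = [frozenset(c) for c in combinations(range(1, n + 1), k)]
    for size in range(len(r_subsets) + 1):
        for family in combinations(r_subsets, size):
            if all(any(s <= ks for s in family) for ks in k_subsets):
                return size
    return None  # unreachable for n >= k >= r >= 1

# r = 2: minimal number of edges so that all independent sets have size < k.
print(turan_number(4, 3, 2))
print(turan_number(5, 3, 2))
```

For r = 2 and k = 3 the optimum is reached by two disjoint cliques of nearly equal size, for example the triangle {1, 2, 3} plus the edge {4, 5} when n = 5.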
![Page 11: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/11.jpg)
Route Planning
Route planning in road networks
– from a single source to a single target (point-to-point)
– weighted graph, edge costs = travel times
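As a baseline for this problem statement, a point-to-point query on a weighted graph can be answered with plain Dijkstra. The sketch below is the textbook algorithm, not transit node routing; the function name `travel_time` and the toy network `roads` are made up for illustration.

```python
import heapq

def travel_time(graph, source, target):
    """Plain Dijkstra: minimum total travel time from source to target,
    or None if target is unreachable. graph[u] is a list of (v, minutes)."""
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, u = heapq.heappop(queue)
        if u == target:
            return dist
        if dist > best.get(u, float("inf")):
            continue  # stale queue entry
        for v, cost in graph.get(u, []):
            new_dist = dist + cost
            if new_dist < best.get(v, float("inf")):
                best[v] = new_dist
                heapq.heappush(queue, (new_dist, v))
    return None

# Hypothetical toy road network; edge costs are travel times in minutes.
roads = {
    "A": [("B", 10), ("C", 4)],
    "C": [("B", 3), ("D", 12)],
    "B": [("D", 5)],
}
print(travel_time(roads, "A", "D"))  # fastest route is A -> C -> B -> D
```

On a continental road network this baseline visits a large part of the graph per query, which is why precomputation-based schemes are of interest.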
![Page 12: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/12.jpg)
Transit Node Routing
We invented transit node routing
– 100 times faster than previous best scheme
– Oct’08 SaarLB Award 25,000 €
(together with Stefan Funke, now University of Greifswald)
– integration with previous best scheme published in Science
(joint work with P. Sanders and D. Schultes, Uni Karlsruhe)
– big press coverage
– we are currently trying to market the idea
(via Algorithmic Solutions, a spin-off from MPII D1)
There is a demo this afternoon at 2.00 pm
![Page 15: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/15.jpg)
Google Transit
I am currently @ Google in Zürich
– as “visiting scientist”
– great experience; I can highly recommend it
– one of my projects there is Google Transit
– public transportation networks are completely different from road networks
they can both be modeled as graphs
and that’s about it with the similarity
– the scale is an even bigger challenge there
one node per arrival / departure event
– will publish what I have done at the end of the year
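The remark "one node per arrival / departure event" matches what is commonly called a time-expanded graph. The sketch below builds such a graph from a tiny, made-up timetable; it is a simplified textbook-style model (no minimum transfer times, no footpaths) and says nothing about how Google Transit is actually implemented. All names are hypothetical.

```python
from collections import defaultdict

# Hypothetical timetable:
# (trip, from_station, departure_min, to_station, arrival_min)
timetable = [
    ("bus 1", "X", 0, "Y", 10),
    ("bus 2", "X", 30, "Y", 40),
    ("tram 7", "Y", 15, "Z", 25),
]

def time_expanded_graph(connections):
    """One node (station, time) per departure or arrival event, ride edges
    along connections, and waiting edges between consecutive events of a station."""
    edges = defaultdict(list)   # node -> list of (next node, minutes)
    events = defaultdict(set)   # station -> set of event times
    for _trip, dep_st, dep_t, arr_st, arr_t in connections:
        edges[(dep_st, dep_t)].append(((arr_st, arr_t), arr_t - dep_t))
        events[dep_st].add(dep_t)
        events[arr_st].add(arr_t)
    for station, times in events.items():
        ordered = sorted(times)
        for earlier, later in zip(ordered, ordered[1:]):
            edges[(station, earlier)].append(((station, later), later - earlier))
    return edges

graph = time_expanded_graph(timetable)
nodes = set(graph) | {v for outs in graph.values() for v, _ in outs}
num_nodes = len(nodes)
print(num_nodes)
```

Three connections between three stations already produce six nodes, which hints at why the scale is a bigger challenge than for road networks.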
Thank you!
![Page 18: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/18.jpg)
Precomputation of the transit nodes
From distances to paths
(figure labels: 24 min, 20 min, 23 min; Start, Destination)
![Page 19: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/19.jpg)
Overview
How I work
Information retrieval
– overview of problems & results
– our CompleteSearch engine
– recent result: faceted search
Route planning
– ultrafast routing in road networks
– public transportation routing @ Google
![Page 20: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/20.jpg)
Recent Output
Installations
– CompleteSearch DBLP (several million hits / month)
– www.absolventa.de uses CompleteSearch (job search)
– many more: mailing list archives, library search, …
Publications
– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
– Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
– Jan’08: Meyer-Struckmann Award 15,000 €
– Oct’08: Alcatel-Lucent Award 20,000 €
– Jul’09 : ...... 25,000 €
![Page 21: Fast and Intelligent Search In Very Large Amounts of Data Hannah Bast Max-Planck-Institute for Informatics Saarbrücken Kick-off meeting for Cluster of](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649ed35503460f94be2a11/html5/thumbnails/21.jpg)
How I work
I grew up in theoretical computer science
– well-defined, standard problems
– the goal is theorems
– the more difficult / original, the better
– often art for art’s sake
– good to learn the art of clear & precise thinking
Then I moved to more applied problems
– work starts with a real problem
– finding the right abstraction is half of the challenge
– think about it, but keep in mind the real problem
– implement + experiment
– build a system and use it / let it be used
necessity is the mother of all inventions