The Simigle Image Search Engine

Wei Dong

2010-09-23

http://www.simigle.com/

Challenges

• Large dataset
  – ~100 million images w/ single server

• High confidence
  – False positive rate < 10^-6

• High recall
  – Recall ~ 80%

• Online search

• High throughput
  – Still a long way to go

System Overview

• Loosely coupled search servers
  – Easy to replicate

• Read-only database and images

• A cluster for crawling and indexing images

• Clients w/ various browsers
  – JSON, JPEG, HTML

• Software techniques:
  – C++, Boost, POCO
  – JavaScript, jQuery
  – C++, Java, Hadoop

Search Server Architecture

• Query

• Session cache (by UUID)

• Retrieval cache (by SHA1)

• Search process (on cache miss)
  – Feature extraction
  – Feature search
  – Query expansion

• Thumbnail database

• Feature index (four shown in the diagram)
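
The request path above can be sketched roughly as follows. This is a minimal illustration, not the Simigle implementation: the class name `SearchServer`, the `search_process` callable, and the choice to compute the SHA1 over the raw query image bytes are all assumptions made for the example.

```python
import hashlib
import uuid


class SearchServer:
    """Toy request path: session cache -> retrieval cache -> search process."""

    def __init__(self, search_process):
        # Hypothetical callable: image bytes -> list of results.
        self.search_process = search_process
        self.session_cache = {}    # session UUID -> results
        self.retrieval_cache = {}  # SHA1 of query image bytes (assumed key) -> results

    def query(self, image_bytes, session_id=None):
        # A returning client presents its session UUID and skips the search.
        if session_id is not None and session_id in self.session_cache:
            return session_id, self.session_cache[session_id]

        key = hashlib.sha1(image_bytes).hexdigest()
        if key not in self.retrieval_cache:
            # Cache miss: run feature extraction, feature search, query expansion.
            self.retrieval_cache[key] = self.search_process(image_bytes)
        results = self.retrieval_cache[key]

        # Open a new session so later requests can fetch further result pages.
        session_id = str(uuid.uuid4())
        self.session_cache[session_id] = results
        return session_id, results
```

Keying the retrieval cache by a content hash means two clients uploading the same image share one search, while the session cache serves follow-up requests from the same client.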

Main Techniques

• Entropy-filtered local image features
  – High confidence

• Graph-based query expansion
  – High recall

• Compact sketch representation
  – Smaller database, faster search

• Flexible bit-vector indexing
  – Online search

• Content-aware disk layout
  – High throughput thumbnail retrieval

Entropy-Filtered Local Feature

• Feature detection w/ Difference-of-Gaussian (DoG)

• Entropy-based filtering for high confidence

• DoG detects more regions than needed.

• Some plain regions can cause false positives (like A, D).

• We only keep regions with high entropy (rich content, like B, C)

• 10x reduction of error rate

• Fewer features have to be indexed

[ Unpublished ]
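
The filtering idea can be illustrated with a small sketch. The actual entropy measure is unpublished, so everything here is an assumption: entropy is taken over a quantized intensity histogram of the region's pixels, and the bin count and `threshold` value are arbitrary illustrative choices.

```python
import math
from collections import Counter


def shannon_entropy(pixels, bins=16, max_value=256):
    """Entropy (in bits) of a region's quantized intensity histogram."""
    counts = Counter(p * bins // max_value for p in pixels)
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def keep_high_entropy(regions, threshold=2.0):
    """Drop plain regions; keep those with rich content (assumed threshold)."""
    return [r for r in regions if shannon_entropy(r) >= threshold]
```

A flat region whose pixels all fall into one histogram bin has zero entropy and is discarded, while a region whose intensities spread evenly over all 16 bins reaches the maximum of 4 bits.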

Graph-Based Query Expansion

• We can find more results if we use the initial results to search again

• Keep searching until we find no more

• Problem: expansion hits a lot of false positives

• We use a graph-partitioning method [1] to cut off expansion.

• Recall improves from 43% to ~80% w/ the same false positive rate [2].

[1] Andersen, et al. Local graph partitioning using PageRank vectors. FOCS '06.
[2] Unpublished.
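
The basic expansion loop (search again from the initial results until nothing new turns up) might look like the sketch below. It deliberately shows only the naive loop: the graph-partitioning cut-off is unpublished and is not reproduced here. The `search` callable and the `max_rounds` cap are hypothetical stand-ins introduced for the example.

```python
def expand_query(query, search, max_rounds=10):
    """Naive query expansion: re-search from every new result.

    `search` is a hypothetical callable mapping an image id to the ids of
    its matches. `max_rounds` is a crude stand-in for a principled cut-off.
    """
    found = set()
    frontier = [query]
    for _ in range(max_rounds):
        next_frontier = []
        for item in frontier:
            for hit in search(item):
                if hit != query and hit not in found:
                    found.add(hit)
                    next_frontier.append(hit)
        if not next_frontier:
            break  # no new results: expansion has converged
        frontier = next_frontier
    return found
```

Without a cut-off, a single false positive in one round pulls in its own neighbourhood in the next, which is the failure mode the slide describes.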

Compact Sketch Representation

• Raw features are large, 5~10KB/image
  – About 80 features / image
  – 128 bytes / feature (SIFT), or 64 bytes / feature (SURF) with lower quality
  – Encodes all information about a region

• We only need to tell if two features are extremely similar

• 128-bit sketch with random space partitioning techniques

Dong, et al. Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces. SIGIR ’08.
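
As a rough illustration of how a feature vector can be reduced to a 128-bit sketch (16 bytes instead of 128 bytes for a SIFT descriptor), the sketch below uses random hyperplanes: each bit records which side of a random plane the vector falls on. This is a stand-in technique chosen for brevity and is not necessarily the construction used in the cited paper; `make_sketcher` and its parameters are hypothetical names.

```python
import random


def make_sketcher(dim, bits=128, seed=0):
    """Build a function mapping a `dim`-dimensional vector to a `bits`-bit int."""
    rng = random.Random(seed)
    # One random hyperplane per output bit (assumed construction).
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def sketch(vec):
        value = 0
        for plane in planes:
            dot = sum(w * x for w, x in zip(plane, vec))
            value = (value << 1) | (1 if dot >= 0 else 0)
        return value

    return sketch


def hamming(a, b):
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")
```

Two features are then treated as near-identical when the Hamming distance between their sketches is small, which only needs an XOR and a bit count.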

Flexible Bit-Vector Indexing

• Search for sketches w/ <= 3 bits different.

• Divide the 128 bits into 4 blocks; with at most 3 differing bits, at least one block is identical.

• State of the art [1] is equal partitioning.

• We find optimal partitioning with dynamic programming [2]
  – Faster
  – More flexible

[1] Manku, et al. Detecting near-duplicates for web crawling. WWW '07.
[2] Unpublished.
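
The block argument can be made concrete with a small index. The sketch below implements only the equal-partitioning baseline (four 32-bit blocks, one hash table per block); the dynamic-programming partitioning is unpublished and not shown. The class name `BlockIndex` and its methods are hypothetical.

```python
from collections import defaultdict

BITS, BLOCKS = 128, 4
WIDTH = BITS // BLOCKS          # 32 bits per block (equal partitioning)
MASK = (1 << WIDTH) - 1


def blocks(sketch):
    """Split a 128-bit sketch into its four 32-bit blocks."""
    return [(sketch >> (i * WIDTH)) & MASK for i in range(BLOCKS)]


class BlockIndex:
    def __init__(self):
        # One table per block position: block value -> sketches containing it.
        self.tables = [defaultdict(list) for _ in range(BLOCKS)]

    def add(self, sketch):
        for table, block in zip(self.tables, blocks(sketch)):
            table[block].append(sketch)

    def search(self, query, max_diff=3):
        """Return indexed sketches within `max_diff` bits of `query`."""
        hits = set()
        for table, block in zip(self.tables, blocks(query)):
            # Any match must agree with the query on at least one block.
            for cand in table.get(block, ()):
                if bin(cand ^ query).count("1") <= max_diff:
                    hits.add(cand)
        return hits
```

Each lookup probes four tables and verifies the candidates with a full Hamming-distance check, so a sketch that shares a block with the query but differs in more than 3 bits overall is rejected.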

Content-Aware Disk Layout

• Query results range from a few to 1000s

• 20~100 thumbnails / page

• If thumbnails are randomly stored on disk, throughput will be limited by disk seeks

• We store similar images together on disk and load a batch of them with one disk seek

• The results of a single query can be covered with a few disk seeks.

[ Unpublished ]
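
The effect of layout on seek count can be shown with a toy model. The assumptions are loud ones: thumbnails occupy numbered disk slots, and any run of consecutive slots is read with a single seek. The function `count_seeks` is hypothetical and says nothing about how the actual layout is computed.

```python
def count_seeks(positions):
    """Seeks needed to read the given slots, one seek per contiguous run."""
    slots = sorted(set(positions))
    return sum(
        1 for i, s in enumerate(slots) if i == 0 or s != slots[i - 1] + 1
    )
```

Under this model, four result thumbnails stored at slots 10 to 13 cost one seek, while the same four scattered at slots 5, 7, 42 and 900 cost four.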

Conclusion

• We present a system for similar web image retrieval
  – High capacity (~100 million images / server)
  – High confidence (10^-6 error rate)
  – High recall (~80% recall)
  – Online search (searches return in seconds)

• Future work: further improve responsiveness and throughput.
