HIPI: Computer Vision at Large Scale
Chris Sweeney
Liu Liu
Intro to MapReduce
SIMD at Scale
Mapper / Reducer
MapReduce, Main Takeaway
Data Centric, Data Centric, Data Centric!
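To make the Mapper / Reducer split concrete, here is a minimal single-process sketch in plain Java. It is not Hadoop code: the class `MiniMapReduce`, the method `run`, and the example job `countByResolution` are names made up for illustration. The mapper sees one record at a time and emits key/value pairs, the framework groups values by key, and the reducer sees one key with all of its values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Illustrative only: a sequential, in-memory imitation of the MapReduce data flow.
class MiniMapReduce {
    static <I, K extends Comparable<K>, V, O> Map<K, O> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, O> reducer) {
        // Map + shuffle: each record is mapped independently, values are grouped by key.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapper.apply(input)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        // Reduce: one call per distinct key.
        Map<K, O> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Example job: count images per resolution; each input is {width, height}.
    static Map<String, Integer> countByResolution(List<int[]> sizes) {
        Function<int[], List<Map.Entry<String, Integer>>> mapper =
                size -> List.of(Map.entry(size[0] + "x" + size[1], 1));
        BiFunction<String, List<Integer>, Integer> reducer =
                (resolution, ones) -> ones.size();
        return run(sizes, mapper, reducer);
    }

    public static void main(String[] args) {
        List<int[]> sizes = List.of(
                new int[] {2592, 1944}, new int[] {640, 480}, new int[] {2592, 1944});
        System.out.println(countByResolution(sizes));
    }
}
```

Because the mapper never looks at more than one record, the map phase can be spread over as many nodes as there are input splits, which is the "SIMD at scale" reading of the model.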
Hadoop, a Java Implementation
An open-source implementation of MapReduce, developed largely at Yahoo!
The cluster we worked on has 625.5 nodes, with a map task capacity of 2502 and a reduce task capacity of 834
Computer Vision at Scale
The “computational vision”
The sheer size of datasets:
PCA of Natural Images (1992): 15 images, 4096 patches
High-perf Face Detection (2007): 75,000 samples
IM2GPS (2008): 6,472,304 images
HIPI Workflow
HIPI Image Bundle Setup
Moral of the story:
Many small files kill performance in a distributed file system.
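The idea behind bundling can be sketched in a few lines: concatenate many small files into one large file, each record prefixed with its length, so a job reads one big sequential stream instead of opening thousands of tiny files. The layout below is an assumption chosen for illustration, not the actual HIPI Image Bundle format, and `TinyBundle`, `pack`, and `unpack` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a length-prefixed bundle, NOT the real HIPI Image Bundle layout.
class TinyBundle {
    // Packs files as repeated [4-byte big-endian length][payload] records.
    static byte[] pack(List<byte[]> files) {
        int total = 0;
        for (byte[] file : files) {
            total += Integer.BYTES + file.length;
        }
        ByteBuffer out = ByteBuffer.allocate(total);
        for (byte[] file : files) {
            out.putInt(file.length);
            out.put(file);
        }
        return out.array();
    }

    // Walks the bundle front to back and returns the payloads in their original order.
    static List<byte[]> unpack(byte[] bundle) {
        List<byte[]> files = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(bundle);
        while (in.hasRemaining()) {
            byte[] file = new byte[in.getInt()];
            in.get(file);
            files.add(file);
        }
        return files;
    }
}
```

A real bundle would also need an index or split points so that different mappers can start reading at different offsets; this sketch leaves that out.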
Redo PCA in Natural Images at Scale
[Figure: the first 15 principal components with 15 images (Hancock, 1992)]
Redo PCA in Natural Images at Scale
Comparison:
Hancock, 1992
HIPI, 100
HIPI, 1,000
HIPI, 10,000
HIPI, 100,000
Optimize HIPI Performance
Culling: because decompression is costly
Decompress only when needed
A boolean cull(ImageHeader header) method for conditional decompression
Culling, to inspect specific camera effects
Canon PowerShot S500, at 2592x1944
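A culling check of this kind might look like the sketch below. Only the signature `boolean cull(ImageHeader header)` comes from the slides. Everything else is assumed for illustration: the `ImageHeader` class here is a hypothetical stand-in with three fields, the camera model string is a guess at what the metadata would contain, and returning `true` is taken to mean "skip this image without decompressing it".

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: ImageHeader is a hypothetical stand-in, not HIPI's class.
class CullSketch {
    // Cheap metadata that can be read without decoding any pixels.
    static final class ImageHeader {
        final String cameraModel;
        final int width;
        final int height;

        ImageHeader(String cameraModel, int width, int height) {
            this.cameraModel = cameraModel;
            this.width = width;
            this.height = height;
        }
    }

    // Assumed semantics: true means "cull", i.e. skip the image before decompression.
    static boolean cull(ImageHeader header) {
        boolean wanted = header.cameraModel.equals("Canon PowerShot S500")
                && header.width == 2592
                && header.height == 1944;
        return !wanted;
    }

    // Returns the headers of the images that would go on to be decompressed.
    static List<ImageHeader> selectForDecoding(List<ImageHeader> headers) {
        List<ImageHeader> kept = new ArrayList<>();
        for (ImageHeader header : headers) {
            if (!cull(header)) {
                kept.add(header); // only these images pay the decompression cost
            }
        }
        return kept;
    }
}
```

The saving comes from the order of operations: the header is inspected first, and the expensive pixel decode happens only for images that survive the check.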
HIPI, Glance at Performance Figures
Empty job (only decompressing and looping over images); 5 runs, minimum time reported, in seconds; lower is better.
[Chart: time in seconds (y-axis, 0 to 450); x-axis ticks at 10, 100, 1000, 10000, 100000. Series: Many Small Files, Hadoop Sequence File, HIPI Image Bundle]
HIPI, Glance at Performance Figures
Im2gray job (converting images to grayscale); 5 runs, minimum time reported, in seconds; lower is better.
[Chart: time in seconds (y-axis, 0 to 500); x-axis ticks at 10, 100, 1000, 10000, 100000. Series: Many Small Files, Hadoop Sequence File, HIPI Image Bundle]
HIPI, Glance at Performance Figures
Covariance job (computing the covariance matrix of patches, 100 patches per image); 1~3 runs*, minimum time reported, in seconds; lower is better.
[Chart: time in seconds (y-axis, 0 to 8000); x-axis ticks at 10, 100, 1000, 10000, 100000. Series: Many Small Files, Hadoop Sequence File, HIPI Image Bundle]
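One common way to fit a covariance computation into the map/reduce shape is to have each mapper accumulate sufficient statistics (count, sum, and sum of outer products) over its patches, then merge them in the reducer. The sketch below shows that approach under stated assumptions: it is not necessarily how the covariance job measured here was implemented, `CovarianceSketch` and `Stats` are hypothetical names, and patches are assumed to be flattened into `double[]` vectors of equal length.

```java
// Illustrative only: covariance from mergeable partial sums.
class CovarianceSketch {
    static final class Stats {
        long n;                  // number of patches seen
        final double[] sum;      // element-wise sum of patches
        final double[][] outer;  // sum of outer products x * x^T

        Stats(int dimension) {
            sum = new double[dimension];
            outer = new double[dimension][dimension];
        }

        // "Map" side: fold one flattened patch into the running statistics.
        void add(double[] patch) {
            n++;
            for (int i = 0; i < patch.length; i++) {
                sum[i] += patch[i];
                for (int j = 0; j < patch.length; j++) {
                    outer[i][j] += patch[i] * patch[j];
                }
            }
        }

        // "Reduce" side: partial statistics from different mappers simply add up.
        void merge(Stats other) {
            n += other.n;
            for (int i = 0; i < sum.length; i++) {
                sum[i] += other.sum[i];
                for (int j = 0; j < sum.length; j++) {
                    outer[i][j] += other.outer[i][j];
                }
            }
        }

        // Population covariance: E[x x^T] - E[x] E[x]^T.
        double[][] covariance() {
            int d = sum.length;
            double[][] cov = new double[d][d];
            for (int i = 0; i < d; i++) {
                for (int j = 0; j < d; j++) {
                    cov[i][j] = outer[i][j] / n - (sum[i] / n) * (sum[j] / n);
                }
            }
            return cov;
        }
    }
}
```

The one-pass formula is simple to distribute but can lose precision when the mean is large relative to the variance; a two-pass version (mean first, then centered outer products) trades an extra job for better numerical behavior.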
HIPI, Glance at Performance Figures
Culling job (decompressing all images vs. decompressing only the images we care about); 1~3 runs, minimum time reported, in seconds; lower is better.
[Chart: time in seconds (y-axis, 0 to 700); x-axis ticks at 10, 100, 1000, 10000, 100000. Series: Without Culling, With Culling]
Conclusion
Everything at large scale gets better.
HIPI provides an image-centric interface that performs on par with or better than the leading alternative
The cull method provides significant improvement and convenience
HIPI offers noticeable improvements!
Future Work
Release HIPI as an open-source project.
Work on deeper integration with Hadoop.
Make the HIPI workload more configurable.
Make the workload more balanced.