flickr: computer vision at scale with hadoop and storm (huy nguyen)

Presented by: Huy Nguyen @huyng Computer vision at scale with Hadoop and Storm Copyright © 2014 Yahoo, Inc.

Upload: yahoo-developer-network

Post on 08-Sep-2014


DESCRIPTION

This presentation focuses on how Flickr recently brought scalable computer vision and deep learning to its multi-billion photo collection using Apache Hadoop and Storm.

Video of the presentation is available here: https://www.youtube.com/watch?v=OrhAbZGkW8k

From the May 14, 2014 San Francisco Hadoop User Group meeting. For more information on the SFHUG see http://www.meetup.com/hadoopsf/

If you'd like to work on interesting challenges like this at Flickr in San Francisco, please see http://www.flickr.com/jobs

TRANSCRIPT

Page 1: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

Presented by: Huy Nguyen @huyng

Computer vision at scale with Hadoop and Storm

Copyright © 2014 Yahoo, Inc.

Page 2

Outline

▪ Computer vision at 10,000 feet

▪ Hadoop training system

▪ Storm continuous stream processing

▪ Hadoop batch processing

▪ Q&A

Page 3

Page 4

How can we help Flickr users better organize, search, and browse their photo collection?

photo credit: Quinn Dombrowski

Page 5

What is AutoTagging?

Any photo you upload to Flickr is now automatically classified using computer vision software

Class       Probability
Flowers     0.98
Outdoors    0.95
Cat         0.001
Grass       0.6

[Diagram label: Classifiers]

Page 6

Demo

Page 7

Auto Tags Feeding Search

Page 8

Our Goals

▪ Train thousands of classifiers

▪ Classify millions of photos per day

▪ Compute auto tags on backlog of billions of photos

▪ Enable reprocessing bi-weekly and every quarter

▪ Do it all within 90 days of being acquired!

Page 9

Our system for training classifiers

▪ High level overview of image classification

▪ Hadoop workflow

Page 10

Classification in a nutshell

How do we define a classifier?

A classifier is simply the line “z = wx - b” that divides positive and negative examples with the largest margin. It is entirely defined by w and b.

Training is simply adjusting w and b to find the line with the largest margin dividing negative and positive example data.

Classification is done by computing z, given some data point x; the sign of z says which side of the line x falls on.
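As a minimal sketch of the formula on this slide, the snippet below computes z = wx - b for one data point and treats a positive z as a positive example. The function names and the 2-D numbers are made up for illustration; they are not from the talk.

```python
from typing import Sequence


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Return z = w.x - b for one data point x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def classify(w: Sequence[float], x: Sequence[float], b: float) -> bool:
    """True when x falls on the positive side of the line."""
    return decision_value(w, x, b) > 0.0


# Hypothetical classifier and data points, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
print(decision_value(w, [1.0, 0.5], b))  # 2*1 - 1*0.5 - 0.5 = 1.0
print(classify(w, [0.0, 1.0], b))        # z = -1.5, so False
```

Everything the classifier "knows" lives in w and b, which is why a trained classifier can be shipped around as a small record.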

Page 11

Image classification in a nutshell

To classify images we need to convert raw pixels into a “feature” to plug into x

[Diagram label: compute feature]

Page 12

How are image classifiers made?

Class name: flower

▪ Positive examples (approx. 10K)

▪ Negative examples (approx. 100K)

Compute features → Train classifier → w, b

Repeat a few thousand times, once for each class
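The talk does not say which solver was used for training. As a rough sketch of "adjust w and b to find a wide-margin line", the code below uses plain hinge-loss subgradient steps, a common stand-in for a linear SVM. The function name, hyperparameters, and toy 2-D features are all assumptions made for illustration.

```python
import random


def train_linear_classifier(examples, dim, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Fit (w, b) for z = w.x - b with hinge-loss subgradient steps.

    examples: list of (feature_vector, is_positive) pairs.
    """
    rng = random.Random(seed)
    data = list(examples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, is_positive in data:
            y = 1.0 if is_positive else -1.0
            z = sum(wi * xi for wi, xi in zip(w, x)) - b
            # L2 shrinkage keeps w small, which is what favours a wide margin.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if y * z < 1.0:  # example is misclassified or inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b -= lr * y
    return w, b


# Hypothetical 2-D "features"; real ones would come from the compute-features step.
training_sets = {
    "flower": [([2.0, 2.0], True), ([3.0, 1.0], True), ([2.5, 3.0], True),
               ([-2.0, -1.0], False), ([-1.0, -3.0], False), ([-3.0, -2.0], False)],
}

# One independent classifier per class name.
classifiers = {name: train_linear_classifier(rows, dim=2)
               for name, rows in training_sets.items()}
```

Because each class is trained independently from its own positive and negative examples, the "repeat for each class" step parallelises naturally.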

Page 13

Hadoop Classifier Training

Page 14

Hadoop Classifier Training

HDFS: Pixel Data + Label Data

Pixel Data (photo_id, pixel_data)

1 <base64>
2 <base64>
3 <base64>
4 <base64>
..
n <base64>

Label Data (class, photo_id, label)

dog 1 TRUE
dog 2 FALSE
dog 3 TRUE
cat 1 FALSE
cat 2 TRUE
cat 3 FALSE
..

Page 15

Hadoop Classifier Training - JOIN

[Diagram: Pixels and Labels → map, map, map → reduce]

1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat, TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
..

▪ Mappers emit (photo_id, pixels) OR (photo_id, class, label)

▪ Reducers take advantage of sorting to emit (photo_id, pixels, json(label data))
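The join described above is a reduce-side join. Below is a small in-memory sketch of the idea; a plain sort stands in for Hadoop's shuffle, and the function names and the 0/1 tag trick are illustrative assumptions rather than the actual job code.

```python
import json
from itertools import groupby
from operator import itemgetter


def map_pixels(line):
    photo_id, pixels_b64 = line.split(maxsplit=1)
    # Tag 0 sorts before tag 1, so the pixel record reaches the reducer first.
    yield (photo_id, 0, pixels_b64)


def map_labels(line):
    cls, photo_id, label = line.split()
    yield (photo_id, 1, (cls, label == "TRUE"))


def reduce_join(sorted_records):
    """Expects records sorted by (photo_id, tag); emits one joined row per photo."""
    for photo_id, group in groupby(sorted_records, key=itemgetter(0)):
        pixels, labels = None, []
        for _, tag, value in group:
            if tag == 0:
                pixels = value
            else:
                labels.append(value)
        if pixels is not None:  # skip labels that have no pixel data
            yield photo_id, pixels, json.dumps(labels)


pixel_lines = ["1 <pixelsb64>", "2 <pixelsb64>"]
label_lines = ["dog 1 TRUE", "dog 2 FALSE", "cat 1 FALSE", "cat 2 TRUE"]

records = [r for line in pixel_lines for r in map_pixels(line)]
records += [r for line in label_lines for r in map_labels(line)]
records.sort(key=lambda r: (r[0], r[1]))  # stands in for the shuffle/sort phase
joined = list(reduce_join(records))
```

Keying both inputs by photo_id is what brings a photo's pixels and all of its labels to the same reducer call.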

Page 16

Hadoop Classifier Training - TRAIN

1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat, TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
..

[Diagram: map, map, map (compute feature) → reduce (training classifier) → dog clf., cat clf.]

▪ Mappers compute features and emit a (class, feat64, label) for each class associated with the image

▪ Reducers take advantage of sorting to train, and emit (class, w, b)
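A compact in-memory sketch of this map and reduce pair follows. The feature function and the centroid-difference trainer are deliberately trivial stand-ins (the talk does not show either), features are passed as Python lists rather than base64 strings, and a dictionary stands in for Hadoop's shuffle.

```python
import base64
import json
from collections import defaultdict


def compute_feature(pixels_b64):
    """Hypothetical feature: [mean byte value, fraction of bytes above 127]."""
    raw = base64.b64decode(pixels_b64)
    n = max(len(raw), 1)
    return [sum(raw) / (255.0 * n), sum(v > 127 for v in raw) / n]


def train_mapper(photo_id, pixels_b64, labels_json):
    feat = compute_feature(pixels_b64)  # computed once per image
    for cls, label in json.loads(labels_json):
        yield cls, (feat, label)        # keyed by class for the reducer


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_reducer(cls, values):
    """Stand-in trainer: line halfway between the two class centroids."""
    pos = centroid([f for f, label in values if label])
    neg = centroid([f for f, label in values if not label])
    w = [p - q for p, q in zip(pos, neg)]
    b = sum(wi * (p + q) / 2 for wi, p, q in zip(w, pos, neg))
    return cls, w, b


def b64(values):
    return base64.b64encode(bytes(values)).decode("ascii")


joined = [
    ("1", b64([250, 240, 255]), '[["cat", false], ["dog", true]]'),
    ("2", b64([5, 10, 0]), '[["cat", true], ["dog", false]]'),
]
by_class = defaultdict(list)  # stands in for the shuffle between map and reduce
for record in joined:
    for cls, value in train_mapper(*record):
        by_class[cls].append(value)

models = {}
for cls, values in by_class.items():
    _, w, b = train_reducer(cls, values)
    models[cls] = (w, b)
```

Re-keying by class is the important step: it means each reducer call sees every example for one class and nothing else.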

Page 17

Hadoop Classifier Training

▪ Workflow coordinated with OOZIE

Page 18

Hadoop Classifier Training

▪ 10,000 mappers

▪ 1,000 reducers

▪ A few thousand classifiers in less than 6 hours

▪ Replaced OpenMPI-based system
  ▪ Easier deployment
  ▪ Access to more hardware
  ▪ Much simpler code

Page 19

Areas for improvement

▪ Reducers have all training data in memory
  ▪ A potential bottleneck if we want to expand the training set for any given set of classifiers

▪ Known path to avoiding this bottleneck
  ▪ Incrementally update models as new data arrives
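The slide does not say how incremental updates would be done. One common approach is online learning, where the model is nudged by one example at a time so that only w and b stay in memory. The class below is a sketch of that idea using a hinge-loss update; the class name, learning rate, and data are assumptions.

```python
class OnlineLinearClassifier:
    """Keeps only w and b in memory; each example is seen once and discarded."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) - self.b

    def update(self, x, is_positive):
        y = 1.0 if is_positive else -1.0
        if y * self.decision(x) < 1.0:  # wrong side of, or inside, the margin
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * y


# Hypothetical stream of (feature, label) pairs arriving over time.
stream = [([2.0, 2.0], True), ([-2.0, -1.0], False),
          ([3.0, 1.0], True), ([-1.0, -3.0], False)]

clf = OnlineLinearClassifier(dim=2)
for x, is_positive in stream:
    clf.update(x, is_positive)
```

Memory use is then independent of the size of the training set, at the cost of results that depend on the order in which examples arrive.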

Page 20

Our system for tagging the flickr corpus

▪ Batch processing with Hadoop

▪ Stream processing with Storm

Page 21

Batch processing in Hadoop

[Diagram labels: www, HDFS, AutotagsMapper (×3), HDFS, IndexingReducer, Search Engine Indexer]

Page 22

Hadoop Auto tagging

HDFS: Pixel Data → Autotag Results

Pixel Data (photo_id, pixel_data)

1 <base64>
2 <base64>
3 <base64>
4 <base64>
..
n <base64>

Autotag Results (photo_id, results)

1 { "cat": .98, "dog": 0.3 }
2 { "cat": .97, "dog": 0.2 }
3 { "cat": .2, "dog": 0.3 }
4 { "cat": .5, "dog": 0.9 }
..
n { "cat": .90, "dog": 0.23 }
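A map-only job is enough for this step, since each photo is scored independently. The sketch below shows what one mapper call might look like. The classifier weights, the feature function, and the sigmoid used to turn z into a 0-to-1 score are all assumptions; the talk does not say how the probabilities are produced.

```python
import base64
import json
import math

# Hypothetical trained classifiers: class name -> (w, b).
CLASSIFIERS = {"cat": ([4.0, 0.0], 2.0), "dog": ([-4.0, 0.0], -2.0)}


def compute_feature(pixels_b64):
    """Hypothetical feature: [mean byte value, fraction of bytes above 127]."""
    raw = base64.b64decode(pixels_b64)
    n = max(len(raw), 1)
    return [sum(raw) / (255.0 * n), sum(v > 127 for v in raw) / n]


def autotags_mapper(line):
    """Turn 'photo_id <base64>' into 'photo_id<TAB>{json scores}'."""
    photo_id, pixels_b64 = line.split(maxsplit=1)
    x = compute_feature(pixels_b64)  # computed once, shared by all classifiers
    results = {}
    for cls, (w, b) in CLASSIFIERS.items():
        z = sum(wi * xi for wi, xi in zip(w, x)) - b
        results[cls] = round(1.0 / (1.0 + math.exp(-z)), 3)  # assumed squashing
    return f"{photo_id}\t{json.dumps(results)}"


bright = base64.b64encode(bytes([255, 255, 255])).decode("ascii")
line_out = autotags_mapper(f"1 {bright}")
```

The expensive feature computation happens once per photo, while each additional classifier costs only one dot product.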

Page 23

Hadoop classification performance

▪ 10,000 mappers

▪ Processed the entire corpus in about 1 week

Page 24

Main challenge: packaging our code

▪ Heterogeneous Python & C++ codebase

▪ How do we get all of the shared library dependencies into a heterogeneous system distributed across many machines?

Page 25

Main challenge: packaging our code

▪ Tried: pyinstaller, didn’t work

▪ Rolled our own solution using strace
  ▪ /autotags_package
    ▪ /lib
    ▪ /lib/python2.7/site-packages
    ▪ /bin
    ▪ /app
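The slide only says that strace was used; the details are not given. One plausible way to do it is to run the program once under something like `strace -f -e trace=open,openat -o trace.log`, then scan the log for files that were opened successfully and copy those into the package. The parser below is a sketch of that second step; the regular expression, function names, and sample paths are illustrative assumptions.

```python
import re

# Matches lines such as:  1234 open("/usr/lib/libjpeg.so.8", O_RDONLY) = 3
OPEN_RE = re.compile(
    r'\b(?:open|openat)\((?:AT_FDCWD, )?"([^"]+)"[^)]*\)\s+=\s+(-?\d+)'
)


def files_opened(strace_lines):
    """Return the set of paths that were opened successfully (fd >= 0)."""
    opened = set()
    for line in strace_lines:
        match = OPEN_RE.search(line)
        if match and int(match.group(2)) >= 0:
            opened.add(match.group(1))
    return opened


def shared_libraries(paths):
    """Keep only paths whose file name looks like a shared library."""
    return sorted(p for p in paths if ".so" in p.rsplit("/", 1)[-1])


# Hypothetical strace output; failed lookups (ENOENT) must be ignored.
sample_log = [
    '1234 open("/usr/lib/libjpeg.so.8", O_RDONLY|O_CLOEXEC) = 3',
    '1234 open("/usr/lib/libmissing.so", O_RDONLY) = -1 ENOENT (No such file or directory)',
    '1234 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3',
    '1234 open("/app/autotags.py", O_RDONLY) = 4',
    '1234 read(4, "import os", 4096) = 9',
]

opened = files_opened(sample_log)
libs = shared_libraries(opened)
```

Tracing a real run catches libraries loaded at runtime by Python extension modules, which static dependency listings can miss.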

Page 26

Stream processing (our first attempt)

[Diagram labels: www, Redis, PHP Workers (×3), fork autotags a.jpg, searchdb]

Page 27

Why we started looking into Storm

Page 28

Why we started looking into Storm

Stage                  Time
Feature computation    300 ms
Classification         10 ms
Classifier load        300 ms

Classifier loading accounted for 49% of total processing time
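The 49% figure follows directly from the three timings on this slide. A quick check, with variable names chosen here for illustration:

```python
# Per-photo timings from the slide, in milliseconds.
stage_ms = {"classifier_load": 300, "feature_computation": 300, "classification": 10}

total_ms = sum(stage_ms.values())                   # 610 ms per photo
load_share = stage_ms["classifier_load"] / total_ms
without_load_ms = total_ms - stage_ms["classifier_load"]

print(f"{load_share:.1%}")   # 49.2%
print(without_load_ms)       # 310
```

If the classifiers stay loaded between photos, the per-photo cost drops from 610 ms to 310 ms, which is the motivation for a long-running process.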

Page 29

Solutions we evaluated at the time

▪ Mem-mapping model

▪ Batching up operations

▪ Daemonize our autotagging code

Page 30

Enter Storm

Page 31

Page 32

Storm computational framework

▪ Spouts
  ▪ The source of your data stream. Spouts “emit” tuples

▪ Tuples
  ▪ An atomic unit of work

▪ Bolts
  ▪ Small programs that process your tuple stream

▪ Topology
  ▪ A graph of bolts and spouts

Page 33

Defining a topology

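TopologyBuilder builder = new TopologyBuilder();

// Spout "images", with a parallelism hint of 10
builder.setSpout("images", new ImageSpout(), 10);

// Bolt "autotag" (parallelism 3) reads from "images"; tuples are distributed randomly
builder.setBolt("autotag", new AutoTagBolt(), 3)
       .shuffleGrouping("images");

// Bolt "dbwriter" (parallelism 2) reads from "autotag"
builder.setBolt("dbwriter", new DBWriterBolt(), 2)
       .shuffleGrouping("autotag");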

Page 34

Defining a bolt

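import storm  # Storm's Python multilang helper module

class Autotag(storm.BasicBolt):

    def __init__(self):
        # Runs once when the bolt process starts, so the classifiers are
        # loaded a single time instead of once per photo.
        # (How self.classifier gets created is not shown on the slide.)
        self.classifier.load()

    def process(self, tuple):
        # <process tuple here>
        pass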

Page 35

Our final solution

[Diagram labels: www, Redis, AutotagsSpout, AutotagsBolt, Search Engine Indexer Bolt, DB Writer Bolt, Topology]
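To make the shape of this pipeline concrete, here is a small in-process imitation in plain Python. It does not use Storm's API. The class names mirror the diagram labels, but the wiring (both downstream bolts reading from the autotags bolt), the loader, and the toy "bright" classifier are assumptions.

```python
from collections import deque

load_calls = []


def load_classifiers():
    """Hypothetical loader; in the talk this is the step that cost about 300 ms."""
    load_calls.append(1)
    return {"bright": lambda pixels: sum(pixels) / (255.0 * len(pixels))}


class AutotagsBolt:
    def __init__(self, loader):
        # The expensive load happens once per bolt instance, not once per photo.
        self.classifiers = loader()

    def process(self, tup):
        photo_id, pixels = tup
        return photo_id, {name: clf(pixels) for name, clf in self.classifiers.items()}


class CollectingBolt:
    """Stand-in for the indexer and DB writer bolts: records what it receives."""

    def __init__(self):
        self.seen = []

    def process(self, tup):
        self.seen.append(tup)


def run_topology(queue, autotag, sinks):
    """Drain the queue (the spout's job) and push each tuple through the bolts."""
    while queue:
        tagged = autotag.process(queue.popleft())
        for sink in sinks:
            sink.process(tagged)


queue = deque([("a.jpg", [255, 255]), ("b.jpg", [0, 0]), ("c.jpg", [255, 0])])
autotag = AutotagsBolt(load_classifiers)
indexer, db_writer = CollectingBolt(), CollectingBolt()
run_topology(queue, autotag, [indexer, db_writer])
```

Three photos are processed but the classifiers are loaded only once, which is the property the earlier fork-per-photo design lacked.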

Page 36

Storm architecture

▪ Supervisors (10 nodes): spouts and bolts live here

▪ Nimbus (1 node)

▪ Zookeepers (3 nodes)

Page 37

Fault tolerant against node failures

Nimbus server dies
▪ No new jobs can be submitted
▪ Current jobs stay running
▪ Simply restart

Supervisor node dies
▪ All unprocessed jobs for that node get reassigned
▪ Supervisor restarted

Zookeeper node dies
▪ Restart node; no data loss as long as not too many die at once (ZooKeeper needs a majority of its nodes up, so a 3-node ensemble tolerates one failure)

Page 38

storm commands

storm jar topology-jar-path class

Page 39

storm commands

storm list

Page 40

storm commands

storm kill topology-name

Page 41

storm commands

storm deactivate topology-name

Page 42

storm commands

storm activate topology-name

Page 43

Is Storm just another queuing system?

Page 44

Yes, and more

▪ A framework for separating application design from scaling concerns

▪ Has a great deployment story
  ▪ Atomic start, kill, deactivate for free
  ▪ No more rsync

▪ No single point of failure

▪ Wide industry adoption

▪ Is programming language agnostic

Page 45

Try storm out!

https://github.com/apache/incubator-storm/tree/master/examples/storm-starter

https://github.com/nathanmarz/storm-deploy

Page 46

Thank you

Presentation copyright © 2014 Yahoo, Inc.