Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

DESCRIPTION

In this talk, Adam Laiacano from Tumblr gives an "Introduction to Digital Signal Processing in Hadoop". Adam introduces the concepts of digital signals, filters, and their interpretation in both the time and frequency domains, and he works through a few simple examples of low-pass filter design and application. The talk is much more application-focused than theoretical, and it assumes no prior knowledge of signal processing. Adam also works through how these filters can be used either in a real-time stream or in batch mode in Hadoop (with Scalding), and he hopes to show some examples of how to detect trendy, meme-ish blogs on Tumblr. This talk was recorded at the NYC Machine Learning Meetup at Pivotal Labs.

Bio: Adam Laiacano is a Data Scientist and Engineer at Tumblr, a blogging network with over 140 million blogs, where he is responsible for collecting and analyzing large volumes of data to gain a better understanding of trends and activity within the Tumblr community. He holds a Bachelor of Science degree in Electrical Engineering from Northeastern University and designed signal detection systems for low-power atomic clocks before joining Tumblr.

TRANSCRIPT

Page 1: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

digital signal processing

in hadoop

with scalding

Adam Laiacano
[email protected]

@adamlaiacano

Thursday, October 17, 13

Page 2: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Overview

• Intro to digital signals and filters

• sampling

• frequency domain

• FIR / IIR filters

• Very quick intro to Scalding

• Filtering tons of signals at once

• Application: Finding trending blogs on tumblr

Page 3: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

1 sample / day

Page 4: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

7-day average 1 sample / day

Page 5: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Some Definitions

Signal - Any series of data (Volts, posts, etc.) that is measured at regular intervals.

Sampling period, Ts - Time between samples (my example was Ts = 1 day)

Sampling frequency fs - 1 / Ts

Nyquist frequency - Highest frequency that can be represented = fs/2

Filter - A system to reduce or enhance certain aspects (phase, magnitude) of a signal.

Stopband - The frequency range we want to eliminate

Passband - The frequency range we want to preserve

Cutoff frequency, fc - The boundary of the stopband/passband
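
As a quick worked example of the definitions above, the sketch below computes the sampling and Nyquist frequencies for the one-sample-per-day case. The function names are illustrative and do not come from the talk.

```python
SECONDS_PER_DAY = 86_400.0

def sampling_frequency(ts_seconds: float) -> float:
    """fs = 1 / Ts, in Hz when Ts is given in seconds."""
    return 1.0 / ts_seconds

def nyquist_frequency(ts_seconds: float) -> float:
    """Highest representable frequency, fs / 2."""
    return sampling_frequency(ts_seconds) / 2.0

ts = SECONDS_PER_DAY                 # one sample per day, as in the example
fs = sampling_frequency(ts)
fn = nyquist_frequency(ts)
# The fastest cycle a daily signal can represent repeats every 1 / fn seconds.
shortest_period_days = (1.0 / fn) / SECONDS_PER_DAY

print(f"fs = {fs:.3e} Hz, Nyquist = {fn:.3e} Hz")
print(f"shortest representable period = {shortest_period_days:.1f} days")
```

In other words, with daily samples nothing that repeats faster than every two days can be represented.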

Page 6: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Signals

Page 7: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Sampling

Original Analog 10 samples/period

Page 8: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Sampling

1 sample/period 2 samples/period
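
The effect of the sampling rates named on these slides can be reproduced with a small sketch. It samples a sine wave at a chosen number of samples per period; the phase offset of 0.3 rad is an arbitrary assumption, picked only so that the samples do not all land on zero crossings.

```python
import math

def sample_sine(samples_per_period: int, periods: int = 4, phase: float = 0.3):
    """Sample sin(2*pi*t + phase), with t measured in periods, at a fixed rate."""
    n = samples_per_period * periods
    return [math.sin(2 * math.pi * k / samples_per_period + phase) for k in range(n)]

dense = sample_sine(10)   # 10 samples/period: the oscillation is clearly visible
two = sample_sine(2)      # 2 samples/period: alternates between +v and -v
one = sample_sine(1)      # 1 sample/period: every sample lands on the same value

print([round(v, 3) for v in dense[:10]])
print([round(v, 3) for v in two])
print([round(v, 3) for v in one])
```

At one sample per period the sine is indistinguishable from a constant, which is why the Nyquist limit sits at two samples per period.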

Page 9: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Filters

Page 10: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

FILTER

Page 11: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Passband Stopband

fc fn

Low-Pass Filter

Page 12: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Passband Stopband

fc fn

Low-Pass Filter

Closer to reality

Page 13: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Moving Average Filter

y[t] = 1/7 * x[t] + 1/7 * x[t-1] + 1/7 * x[t-2] + ... + 1/7 * x[t-6]
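
A minimal Python sketch of this 7-point moving average follows. It assumes samples before the start of the series are zero, which the slide does not specify, so the first six outputs are a warm-up period.

```python
def moving_average(x, window=7):
    """Causal moving average: y[t] = (x[t] + x[t-1] + ... + x[t-window+1]) / window.

    Samples before the start of the series are treated as zero.
    """
    y = []
    for t in range(len(x)):
        recent = x[max(0, t - window + 1): t + 1]
        y.append(sum(recent) / window)
    return y

daily = [7.0] * 10                  # hypothetical constant series
smoothed = moving_average(daily)
print(smoothed)
```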

Page 14: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

FIR Digital Filter

y[t] = h[0] * x[t] + h[1] * x[t-1] + h[2] * x[t-2] + ... + h[N-1] * x[t-(N-1)]

R code:

y <- filter(x, h)
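
For readers who prefer Python, the FIR equation can be sketched directly as a double loop. This is a causal version that assumes x[t] = 0 for t < 0; it is not meant to reproduce the R call exactly, since R's `stats::filter` centers the window by default (`sides = 2`) and other `filter` functions differ again.

```python
def fir_filter(x, h):
    """Causal FIR filter: y[t] = sum over k of h[k] * x[t-k], with x[t] = 0 for t < 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, coeff in enumerate(h):
            if t - k >= 0:
                acc += coeff * x[t - k]
        y.append(acc)
    return y

h_avg = [1 / 7] * 7                     # the 7-day moving average written as an FIR filter
x = [0, 0, 7, 7, 7, 7, 7, 7, 7, 7]      # hypothetical step input
y = fir_filter(x, h_avg)
print([round(v, 2) for v in y])
```

The output ramps up over seven samples before settling, which is the moving average's response to a step.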

Page 15: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Frequency Domain

x = 1.0 * sin(0.5*2*pi*t) + 0.5 * sin(250*2*pi*t) + 0.1 * sin(400*2*pi*t)
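
To see what this signal looks like in the frequency domain without any plotting, the sketch below measures the amplitude at a few frequencies with a direct DFT-style correlation. The sampling rate of 1000 Hz and the 2-second duration are assumptions (the slide does not state them), chosen so that every component falls exactly on a DFT bin.

```python
import cmath
import math

FS = 1000.0      # assumed sampling rate in Hz (not given on the slide)
N = 2000         # 2 seconds of data, so 0.5 Hz falls exactly on a DFT bin

def signal(t):
    return (1.0 * math.sin(0.5 * 2 * math.pi * t)
            + 0.5 * math.sin(250 * 2 * math.pi * t)
            + 0.1 * math.sin(400 * 2 * math.pi * t))

x = [signal(n / FS) for n in range(N)]

def amplitude_at(x, freq_hz, fs):
    """Amplitude of the sinusoid at freq_hz, from a single DFT-style correlation."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fs) for n, v in enumerate(x))
    return 2 * abs(acc) / len(x)

for f in (0.5, 100.0, 250.0, 400.0):
    print(f"{f:6.1f} Hz -> amplitude {amplitude_at(x, f, FS):.3f}")
```

The three components come back with the amplitudes from the formula, and a frequency that is not in the signal (100 Hz here) comes back near zero.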

Page 16: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Frequency Domain

Page 17: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Frequency Domain

h = [-0.0201, -0.0584, -0.0612, -0.0109, 0.0513, 0.0332, -0.0566, -0.0857, 0.0634, 0.3109, 0.4344, 0.3109, 0.0634, -0.0857, -0.0566, 0.0332, 0.0513, -0.0109, -0.0612, -0.0584, -0.0201]

21-point low-pass filter with 250Hz cutoff
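
One way to inspect a set of coefficients like these is to evaluate the filter's gain at a few frequencies. Because the sampling rate is not stated on the slide, the sketch below expresses frequency as a fraction of the sampling rate (0 to 0.5) rather than in Hz; the `gain` helper is illustrative.

```python
import cmath
import math

h = [-0.0201, -0.0584, -0.0612, -0.0109, 0.0513, 0.0332, -0.0566,
     -0.0857, 0.0634, 0.3109, 0.4344, 0.3109, 0.0634, -0.0857,
     -0.0566, 0.0332, 0.0513, -0.0109, -0.0612, -0.0584, -0.0201]

def gain(h, f_norm):
    """Magnitude of the frequency response at f_norm = f / fs (0 to 0.5)."""
    response = sum(c * cmath.exp(-2j * math.pi * f_norm * k) for k, c in enumerate(h))
    return abs(response)

for f_norm in (0.0, 0.05, 0.1, 0.25, 0.4, 0.5):
    print(f"f/fs = {f_norm:4.2f} -> gain {gain(h, f_norm):.4f}")
```

The coefficients are symmetric around the center tap, so the filter delays every frequency by the same 10 samples.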

http://t-filter.appspot.com/fir/index.html

Page 18: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

FIR vs IIR

FIR:

y[t] = h[0] * x[t] + ... + h[N-1] * x[t-(N-1)]

IIR:

y[t] = h[0] * x[t] + ... + h[N-1] * x[t-(N-1)] - g[1] * y[t-1] - ... - g[M] * y[t-M]

Page 19: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Delta Function

Page 20: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

FIR vs IIR

y[t] = 1/7 * x[t] + 1/7 * x[t-1] + ... + 1/7 * x[t-6]

y[t] = 1/2 * x[t] + 1/2 * y[t-1]
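
The difference between these two filters shows up when each is fed a delta function (a single 1 followed by zeros). The sketch below assumes zero initial conditions for both; the helper names are illustrative.

```python
def impulse(n):
    """Delta function: 1 at t = 0, then zeros."""
    return [1.0] + [0.0] * (n - 1)

def moving_average_7(x):
    """FIR: y[t] = 1/7 * (x[t] + ... + x[t-6]), with x = 0 before the start."""
    return [sum(x[max(0, t - 6): t + 1]) / 7 for t in range(len(x))]

def one_pole(x, a=0.5, b=0.5):
    """IIR: y[t] = a * x[t] + b * y[t-1], with y[-1] = 0."""
    y, prev = [], 0.0
    for v in x:
        prev = a * v + b * prev
        y.append(prev)
    return y

delta = impulse(12)
fir_response = moving_average_7(delta)
iir_response = one_pole(delta)
print([round(v, 4) for v in fir_response])   # exactly zero once the impulse leaves the window
print([round(v, 4) for v in iir_response])   # halves every step, still nonzero at the end
```

The FIR output is finite in length because it depends only on the last seven inputs, while the IIR output keeps feeding back on itself.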

Page 21: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

IIR - be careful!

y[t] = 0.5 * x[t] + 1.1 * y[t-1]
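
The reason for the warning is that a feedback coefficient larger than 1 makes the output grow without bound, even for a single nonzero input. A small sketch, assuming zero initial conditions and a delta-function input:

```python
def one_pole(x, a, b):
    """y[t] = a * x[t] + b * y[t-1], starting from y[-1] = 0."""
    y, prev = [], 0.0
    for v in x:
        prev = a * v + b * prev
        y.append(prev)
    return y

delta = [1.0] + [0.0] * 49                 # delta function, 50 samples

stable = one_pole(delta, a=0.5, b=0.5)     # feedback below 1: the response decays
unstable = one_pole(delta, a=0.5, b=1.1)   # the filter on this slide: the response grows

print(f"stable after 50 samples:   {stable[-1]:.3e}")
print(f"unstable after 50 samples: {unstable[-1]:.3e}")
```

Each output of the second filter is 1.1 times the previous one, so the response to one impulse never dies out.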

Page 22: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Impulse Response

Page 23: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Recap - FIR Filters

• FIR filters are weighted sums of previous input.

• Can think of them as a generalized Moving Average

• Required to apply:

• Filter h of length N

• Previous N inputs x
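
Since only the filter h and the previous N inputs are needed, an FIR filter can be applied one sample at a time in a stream. The class below is a hypothetical sketch of that idea (it is not from the talk), using a fixed-length deque as the input history.

```python
from collections import deque

class StreamingFir:
    """Applies an FIR filter one sample at a time, keeping only the last N inputs."""

    def __init__(self, h):
        self.h = list(h)
        # Most recent input first; the history starts as all zeros.
        self.history = deque([0.0] * len(self.h), maxlen=len(self.h))

    def push(self, sample):
        self.history.appendleft(sample)      # the oldest input drops off the far end
        return sum(c * v for c, v in zip(self.h, self.history))

smoother = StreamingFir([1 / 7] * 7)
outputs = [smoother.push(v) for v in [7, 7, 7, 7, 7, 7, 7, 0]]
print([round(v, 2) for v in outputs])
```

Memory use stays constant at N samples per signal no matter how long the stream runs.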

Page 24: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Page 25: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Super General Overview

• DSL on top of Cascading, written in Scala

• Cascading: Workflow language for dealing with lots of data. Often in Hadoop.

• Similar to Pig or Hive, but easier to extend (no UDFs! one language!).

• Feels like “real programming” - compiler! types!

• Is awesome

Page 26: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Less General Overview

• Similar to split/apply/combine paradigm (plyr, pandas)

• Load data into Pipes (like data.frames)

• Each pipe has one or more Fields (columns)

• Perform row-wise operations with map (d$a+d$b)

• Perform field-wise operations in groupBy

Page 27: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Hello World

Page 28: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Scalding Resources

• The best resource is the Scalding wiki page. https://github.com/twitter/scalding/wiki/Fields-based-API-Reference

• Edwin Chen’s post about recommendations. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

• Source code is FULL of undocumented features! https://github.com/twitter/scalding

Page 29: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

import Matrix._

Page 30: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

[Figure: DataVector and SlidingFilter; the highlighted region is the sliding subset of the input]

Page 31: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

[Figure: DataVector and SlidingFilter drawn as rows padded with zeros; the highlighted region is the sliding filter]

Page 32: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

[Figure: DataVector^T * FilterMatrix = FilteredOutput^T, where FilterMatrix is zero outside a diagonal band]

Page 33: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

[Figure: DataVector^T * FilterMatrix = FilteredOutput^T, where FilterMatrix is zero outside a diagonal band]

Page 34: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

N = number of blogs
M = number of samples

X^T * H = Y^T

X^T is N x M, H is M x M, Y^T is N x M
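
The matrix form can be checked on a toy example. The sketch below builds an M x M banded matrix H from the filter coefficients and multiplies a small N x M data matrix by it, one signal per row. The orientation of the band (H[s][t] = h[t - s]) is one consistent choice for a causal filter and is an assumption; the slides do not spell it out.

```python
def filter_matrix(h, m):
    """M x M matrix with H[s][t] = h[t - s], so (x^T * H)[t] = sum_k h[k] * x[t-k]."""
    H = [[0.0] * m for _ in range(m)]
    for s in range(m):
        for k, coeff in enumerate(h):
            if s + k < m:
                H[s][s + k] = coeff
    return H

def rows_times_matrix(X_t, H):
    """Multiply an N x M matrix (one signal per row) by an M x M matrix."""
    m = len(H)
    return [[sum(row[s] * H[s][t] for s in range(m)) for t in range(m)] for row in X_t]

h = [1 / 7] * 7
X_t = [
    [7.0] * 10,                  # hypothetical blog 1
    [0.0] * 5 + [14.0] * 5,      # hypothetical blog 2
]
Y_t = rows_times_matrix(X_t, filter_matrix(h, 10))
for row in Y_t:
    print([round(v, 2) for v in row])
```

Every row is filtered by the same H, which is what makes a single matrix multiplication enough to filter all blogs at once.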

Page 35: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Matrix Filter: Square Waves

Page 36: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Matrix Filter: Square Waves

Page 37: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Matrix Filter: Square Waves

Page 38: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

• Scalding has a Matrix library!

• Stores data in a Pipe as ('row, 'col, 'val)

• Ideal for sparse matrices

• L0, L1, L2 norm, inverse, +, -, *

• QR Factorization: http://bit.ly/1hxWF17

• More!

import Matrix._
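
To illustrate why a ('row, 'col, 'val) layout suits sparse matrices, here is a plain-Python sketch of multiplying two matrices stored as triples. It mimics a join followed by a grouped sum, which is the general shape of such a computation in a MapReduce setting; it is not Scalding's implementation and uses none of its API.

```python
from collections import defaultdict

def sparse_multiply(a, b):
    """Multiply two sparse matrices given as lists of (row, col, val) triples."""
    b_by_row = defaultdict(list)
    for row, col, val in b:
        b_by_row[row].append((col, val))
    out = defaultdict(float)
    for row, inner, a_val in a:                  # "join" on a.col == b.row
        for col, b_val in b_by_row.get(inner, []):
            out[(row, col)] += a_val * b_val     # "group by" (row, col) and sum
    return sorted((r, c, v) for (r, c), v in out.items() if v != 0.0)

# A = [[1, 0, 2], [0, 3, 0]] and B = [[0, 4], [6, 0], [0, 5]], zeros omitted
a = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
b = [(0, 1, 4.0), (2, 1, 5.0), (1, 0, 6.0)]
print(sparse_multiply(a, b))
```

Only nonzero entries are stored or touched, so a mostly-zero banded filter matrix costs little.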

Page 39: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Tumblr Social Graph

• 140+ Million Nodes

• 3.5 Billion Edges

• About 100GB of raw text data

• 3 columns: fromId, toId, timestamp

GOAL: Calculate followers / day for every blog, apply 1-week moving average.
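
The first half of this goal, turning follow events into a followers-per-day signal, can be sketched on a few rows in plain Python. The sketch assumes toId is the blog being followed and that timestamp is in Unix seconds; neither is stated on the slide, and the function names are illustrative.

```python
from collections import Counter, defaultdict

SECONDS_PER_DAY = 86_400

def followers_per_day(edges):
    """Count new followers per (blog, day) from (from_id, to_id, timestamp) rows.

    Assumes to_id is the blog being followed and timestamp is Unix seconds.
    """
    counts = Counter((to_id, ts // SECONDS_PER_DAY) for _, to_id, ts in edges)
    per_blog = defaultdict(dict)
    for (blog, day), n in counts.items():
        per_blog[blog][day] = n
    return per_blog

def daily_series(day_counts, first_day, last_day):
    """Fill in days with no new followers so the signal is evenly sampled."""
    return [day_counts.get(d, 0) for d in range(first_day, last_day + 1)]

edges = [(1, 9, 10), (2, 9, 20), (3, 9, 86_400 * 2 + 5), (1, 8, 86_400 + 1)]
series = daily_series(followers_per_day(edges)[9], 0, 3)
print(series)
```

Filling the empty days with zeros matters: the filter assumes one sample per day, so gaps have to appear as explicit samples.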

Page 40: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Apply Low-Pass filter to 140,000,000 blogs

Page 41: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Find blogs whose follower counts have been accelerating for the most consecutive days.

“Accelerating”: New Followers Today > New Followers Yesterday

Page 42: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Blog A

Blog B

Blog C

Page 43: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Blog A

Blog B

Blog C

Page 44: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Blog A

Blog B

Blog C

Page 45: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

• Binary input: 1 if more followers today than yesterday, otherwise 0

• Filter the binary signal to produce a value between 0 and 1

• Anything above a threshold (0.75) is “accelerating”

Days of consecutive acceleration
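
The three steps above can be sketched end to end for a single blog. The slides do not say which filter is applied to the binary signal, so this sketch assumes a causal 7-day moving average; the threshold of 0.75 is the one from the slide, and the input series are made up.

```python
def accelerating_days(new_followers, window=7, threshold=0.75):
    """Longest run of days whose filtered 'did it grow?' signal exceeds threshold.

    The smoothing filter is a causal moving average, chosen for illustration.
    """
    # 1 if more new followers today than yesterday, otherwise 0.
    binary = [0] + [1 if today > prev else 0
                    for prev, today in zip(new_followers, new_followers[1:])]
    # Filter the binary signal to produce a value between 0 and 1.
    filtered = [sum(binary[max(0, t - window + 1): t + 1]) / window
                for t in range(len(binary))]
    longest = current = 0
    for value in filtered:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest

steady_growth = [1, 2, 3, 5, 8, 7, 9, 12, 15, 20, 26, 30]   # hypothetical counts
flat = [5, 5, 6, 5, 5, 6, 5, 5, 6, 5, 5, 6]                 # hypothetical counts
print(accelerating_days(steady_growth), accelerating_days(flat))
```

Filtering before thresholding means a single down day (the dip from 8 to 7 above) does not break an otherwise consistent streak.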

Page 46: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

15 days

36 days

48 days

Days of consecutive acceleration (filtered signal)

Page 47: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

15 days

36 days

48 days

Days of consecutive acceleration (filtered signal)

Page 48: Introduction to Digital Signal Processing in Hadoop by Adam Laiacano

Thanks!

@adamlaiacano
adamlaiacano.tumblr.com

github.com/alaiacano/dsp-scalding