Revolutionize Text Mining with Spark and Zeppelin

© Hortonworks Inc. 2011 – 2016. All Rights Reserved. Revolutionize Text Mining with Spark and Zeppelin. April 2017. Yanbo Liang, Apache Spark committer, Software engineer @ Hortonworks

Upload: dataworks-summithadoop-summit

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Revolutionize Text Mining with Spark and Zeppelin

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Revolutionize Text Mining with Spark and Zeppelin

April 2017

Yanbo Liang
Apache Spark committer
Software engineer @ Hortonworks

Page 2: Revolutionize Text Mining with Spark and Zeppelin

Agenda

Text mining workflow on Big Data
Text mining with Spark and MLlib
Spark and Zeppelin as the platform

Page 3: Revolutionize Text Mining with Spark and Zeppelin

Text Mining: Practical Applications

• Text classification
  – Spam filtering
  – Fraud detection
• Text clustering
• Sentiment analysis
• Entity extraction
• Recommendations
• Automatic labeling
• Contextual advertising

Page 4: Revolutionize Text Mining with Spark and Zeppelin

Traditional Text Mining

• Commercial software
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R

Page 5: Revolutionize Text Mining with Spark and Zeppelin

Traditional Text Mining

• Commercial software
  – IBM SPSS, RapidMiner, SAS
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R

Page 6: Revolutionize Text Mining with Spark and Zeppelin

Text Mining on Big Data

Page 7: Revolutionize Text Mining with Spark and Zeppelin

Text Mining on Big Data

[Diagram labels: Data Scientists, Software engineers]

Page 8: Revolutionize Text Mining with Spark and Zeppelin

Why Apache Spark MLlib

• Scalable machine learning algorithms on top of Spark
  – Alternating Least Squares on Spotify data
    • 50+ million users x 30+ million songs, 50 billion ratings
    • For rank 10 with 10 iterations, ~1 hour running time
• Workflow utilities
  – ML pipeline
  – Model import/export
  – Cross validation
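The cross validation listed among the workflow utilities can be pictured with a small plain-Python sketch. It only shows how rows are divided into k train/validation splits. It is not Spark's API, and the function name `k_fold_indices` is a hypothetical helper introduced for illustration.

```python
def k_fold_indices(n_rows, k):
    """Yield (training, validation) index lists for k-fold cross validation.

    Hypothetical helper: rows are dealt round-robin into k folds, and each
    fold serves as the validation set exactly once.
    """
    if k < 2 or k > n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Six rows split into three folds.
splits = list(k_fold_indices(6, 3))
for training, validation in splits:
    print("train:", training, "validate:", validation)
```

A model would be fitted on each training split and scored on the matching validation split, with the scores averaged to compare parameter settings.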

Page 9: Revolutionize Text Mining with Spark and Zeppelin

Text Mining workflow

• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring

[Side labels: Data Science, Software engineering]

Page 10: Revolutionize Text Mining with Spark and Zeppelin

Text Mining workflow

• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring

[Side labels: Data Science, Software engineering]

Page 11: Revolutionize Text Mining with Spark and Zeppelin

Load data

Text | Label
I bought the game… | 4
Do NOT bother try… | 1
This shirt is awesome… | 5
never got it. Seller… | 1
I ordered this to… | 3

[Workflow stages: Dataset, Feature engineering, Model training, Model evaluation]

Page 12: Revolutionize Text Mining with Spark and Zeppelin

Extract features

Text | Label | Words | Features
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …]
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …]
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …]
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …]
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …]

[Workflow stages: Dataset, Feature engineering, Model training, Model evaluation]
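The Text → Words → Features step can be sketched in plain Python. This is only an illustration of the idea (lowercase, split into words, count each vocabulary term), not the code behind the slide. The sample reviews and the helper names `tokenize`, `build_vocabulary` and `count_vector` are assumptions made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(token_lists):
    """Give each distinct token an index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def count_vector(tokens, vocab):
    """Return term counts as a dense list aligned with the vocabulary order."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocab]


# Hypothetical sample reviews (not the full texts from the slide).
reviews = ["I bought the game", "Do NOT bother", "This shirt is awesome"]
words = [tokenize(r) for r in reviews]
vocab = build_vocabulary(words)
features = [count_vector(w, vocab) for w in words]

for review, vector in zip(reviews, features):
    print(review, "->", vector)
```

Each review becomes one fixed-length vector, which is the shape a training algorithm expects.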

Page 13: Revolutionize Text Mining with Spark and Zeppelin

Fit a model

Text | Label | Words | Features | Probability | Prediction
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …] | 0.84 |
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …] | 0.62 |
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …] | 0.95 |
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …] | 0.71 |
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …] | 0.74 |

[Workflow stages: Dataset, Feature engineering, Model training, Model evaluation]

Page 14: Revolutionize Text Mining with Spark and Zeppelin

Evaluate

Text | Label | Words | Features | Probability | Prediction
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …] | 0.84 |
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …] | 0.62 |
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …] | 0.95 |
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …] | 0.71 |
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …] | 0.74 |

[Workflow stages: Dataset, Feature engineering, Model training, Model evaluation]
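Evaluation comes down to comparing the label column with the prediction column. A minimal plain-Python sketch of one such metric, accuracy, follows. The labels are the five shown in the table. The predictions are hypothetical, since the transcript does not preserve them.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot evaluate an empty dataset")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


labels = [4, 1, 5, 1, 3]        # labels from the slide's table
predictions = [4, 1, 5, 2, 3]   # hypothetical model output
score = accuracy(labels, predictions)
print(score)
```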

Page 15: Revolutionize Text Mining with Spark and Zeppelin

Key abstraction of Spark ML pipeline

• Transformer
  – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
• Estimator
  – ML algorithms for training models (e.g., NaiveBayes).
• Evaluator
  – These evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).
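The relationship between the three abstractions can be sketched in plain Python: an Estimator's `fit` returns a model, that model is itself a Transformer, and an Evaluator turns transformed rows into a single metric. The classes below mimic that contract only. They are not Spark classes, and `Lowercaser`, `MajorityLabel` and `AccuracyEvaluator` are toy names invented for this illustration.

```python
from collections import Counter


class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    def fit(self, rows):
        raise NotImplementedError


class Evaluator:
    def evaluate(self, rows):
        raise NotImplementedError


class Lowercaser(Transformer):
    """Feature transformer: adds a 'words' column to every row."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class MajorityLabelModel(Transformer):
    """Trained model: a transformer that adds a 'prediction' column."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(r, prediction=self.label) for r in rows]


class MajorityLabel(Estimator):
    """Toy algorithm: learns the most frequent label and returns a model."""

    def fit(self, rows):
        most_common = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return MajorityLabelModel(most_common)


class AccuracyEvaluator(Evaluator):
    """Computes the share of rows where prediction equals label."""

    def evaluate(self, rows):
        return sum(r["label"] == r["prediction"] for r in rows) / len(rows)


# Hypothetical rows for the demonstration.
rows = [
    {"text": "Great shirt", "label": 1},
    {"text": "Never arrived", "label": 0},
    {"text": "Love it", "label": 1},
]
prepared = Lowercaser().transform(rows)
model = MajorityLabel().fit(prepared)
scored = model.transform(prepared)
metric = AccuracyEvaluator().evaluate(scored)
print(metric)
```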

Page 16: Revolutionize Text Mining with Spark and Zeppelin

Spark’s Text Mining algorithms

• LDA for topic model
• Word2Vec an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates important words of a document with respect to the corpus
• And much more
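The hashing TF-IDF idea can be shown in a few lines of plain Python. This is a sketch under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for whatever hash function the library uses, the bucket count of 16 is arbitrary, and the smoothed IDF `log((N + 1) / (df + 1))` is one common definition (check the MLlib documentation for the exact formula Spark applies).

```python
import math
import zlib


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length count vector using the hashing trick."""
    vec = [0] * num_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1
    return vec


def idf_weights(tf_vectors):
    """Smoothed inverse document frequency for every hash bucket."""
    n_docs = len(tf_vectors)
    n_features = len(tf_vectors[0])
    weights = []
    for j in range(n_features):
        df = sum(1 for v in tf_vectors if v[j] > 0)
        weights.append(math.log((n_docs + 1) / (df + 1)))
    return weights


def tf_idf(tf_vector, weights):
    """Scale term counts by the bucket weights."""
    return [tf * w for tf, w in zip(tf_vector, weights)]


# Hypothetical two-document corpus.
docs = [["spark", "spark", "zeppelin"], ["spark", "notebook"]]
tfs = [hashing_tf(d) for d in docs]
weights = idf_weights(tfs)
vectors = [tf_idf(tf, weights) for tf in tfs]
spark_bucket = zlib.crc32(b"spark") % 16
print(vectors[0][spark_bucket])
```

A word that appears in every document, such as "spark" here, gets weight zero, so only words that distinguish a document from the rest of the corpus contribute to its vector.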

Page 17: Revolutionize Text Mining with Spark and Zeppelin

MLlib Text Mining Pipeline - classification

[Pipeline diagram, stages in order: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTFIDF, StringIndexer, NaiveBayes, LogisticRegression, SVM, MLP, text classification]

Page 18: Revolutionize Text Mining with Spark and Zeppelin

MLlib Text Mining Pipeline – topic model

[Pipeline diagram, stages in order: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTFIDF, LDA, topic model]

Page 19: Revolutionize Text Mining with Spark and Zeppelin

MLlib Text Mining Pipeline - recommendation

[Pipeline diagram, stages in order: Dataset, RegexTokenizer, Word2Vec, recommendation]

Page 20: Revolutionize Text Mining with Spark and Zeppelin

MLlib Text Mining Pipeline

[Pipeline diagram, stages in order: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTFIDF, StringIndexer, NaiveBayes, LogisticRegression, SVM, MLP, LDA, Word2Vec; outputs: text classification, topic model, recommendation]
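The first stages of these pipelines share one pattern: each stage takes the dataset, adds or rewrites a column, and hands the result to the next stage. A plain-Python sketch of that chaining follows. The stage functions borrow their names from the diagram for readability but are hypothetical stand-ins, not Spark's RegexTokenizer or StopWordsRemover, and the stop-word list is a tiny made-up sample.

```python
import re
from functools import reduce

# Tiny illustrative stop-word list (an assumption, not Spark's default list).
STOP_WORDS = {"the", "a", "is", "this", "to", "it"}


def regex_tokenizer(pattern=r"\W+"):
    """Build a stage that splits lowercased text on the given regex."""
    def stage(rows):
        return [
            dict(r, tokens=[t for t in re.split(pattern, r["text"].lower()) if t])
            for r in rows
        ]
    return stage


def stop_words_remover(stop_words):
    """Build a stage that drops stop words from the 'tokens' column."""
    def stage(rows):
        return [
            dict(r, tokens=[t for t in r["tokens"] if t not in stop_words])
            for r in rows
        ]
    return stage


def run_pipeline(stages, rows):
    """Apply the stages in order, feeding each stage's output to the next."""
    return reduce(lambda data, stage: stage(data), stages, rows)


rows = [{"text": "This shirt is awesome!"}]
result = run_pipeline([regex_tokenizer(), stop_words_remover(STOP_WORDS)], rows)
print(result[0]["tokens"])
```

Because every stage has the same shape, swapping the final stage (a classifier, LDA, or Word2Vec in the diagrams) leaves the earlier stages untouched.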

Page 21: Revolutionize Text Mining with Spark and Zeppelin

Demo

• load the file contents and the categories
• extract feature vectors suitable for machine learning
• train a linear model to perform categorization
• use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

https://github.com/yanboliang/dataworks-munich-2017
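The grid search step can be sketched independently of any library: enumerate every combination of candidate settings, score each one, and keep the best. The parameter names `num_features` and `reg_param`, their candidate values, and the `toy_score` function are all assumptions for illustration. In the real demo the score would come from fitting the pipeline and measuring a validation metric.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the candidate values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))


def grid_search(param_grid, score_fn):
    """Return the parameter combination with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(param_grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: one feature-extraction setting, one classifier setting.
param_grid = {
    "num_features": [1000, 10000],
    "reg_param": [0.01, 0.1, 1.0],
}


def toy_score(params):
    """Stand-in for 'fit with these params and return a validation metric'."""
    return (-abs(params["num_features"] - 10000) / 10000
            - abs(params["reg_param"] - 0.1))


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

The number of fits grows as the product of the candidate list lengths (six here), which is why the grid is usually kept small.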

Page 22: Revolutionize Text Mining with Spark and Zeppelin

Customizing ML Pipelines

• MLlib 2.1 includes:
  – 30+ feature transformers (Tokenizer, Word2Vec, …)
  – 25+ models (for classification, regression, clustering, …)
  – Model tuning & evaluation
• But some applications require customized
  – Transformers & Models

Page 23: Revolutionize Text Mining with Spark and Zeppelin

Options for customization

• Existing use cases:
  – spark-corenlp
  – spark-vlbfgs
• Extend abstractions
  – Transformer
  – Estimator & Model
  – Evaluator

Page 24: Revolutionize Text Mining with Spark and Zeppelin

Spark virtual environment

[Diagram: Data Scientist A and Data Scientist B; five boxes labeled Python2.7]

Page 25: Revolutionize Text Mining with Spark and Zeppelin

Spark virtual environment

[Diagram: Data Scientist A and Data Scientist B; five boxes labeled Python2.7 and three boxes labeled Python3.5]

Page 26: Revolutionize Text Mining with Spark and Zeppelin

Text Mining workflow

• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring

[Side labels: Data Science, Software engineering]

Duplicated and error-prone

Page 27: Revolutionize Text Mining with Spark and Zeppelin

ML persistence

• Prototype (Python/R)
• Create Pipeline
• Load Pipeline (Java/Scala)
  – Model.load("s3n://…")
• Deploy in production

[Side labels: Data Science, Software engineering]

Persist model or Pipeline: model.save("s3n://…")
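The save-then-load handoff can be illustrated with a plain-Python round trip: one side writes the fitted parameters to storage, the other side rebuilds an equivalent model from them without retraining. This is a sketch of the idea only. `KeywordModel`, its JSON file format and the temporary local path are assumptions, and Spark's own persistence format and storage locations are different.

```python
import json
import os
import tempfile


class KeywordModel:
    """Toy 'trained model': per-word weights plus a decision threshold."""

    def __init__(self, weights, threshold=0.0):
        self.weights = dict(weights)
        self.threshold = threshold

    def predict(self, text):
        score = sum(self.weights.get(w, 0.0) for w in text.lower().split())
        return 1 if score > self.threshold else 0

    def save(self, path):
        """Write the learned parameters to a JSON file."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"weights": self.weights, "threshold": self.threshold}, f)

    @classmethod
    def load(cls, path):
        """Rebuild a model from parameters saved by save()."""
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return cls(state["weights"], state["threshold"])


# "Data science" side: a model with hypothetical learned weights is saved.
model = KeywordModel({"awesome": 1.5, "never": -2.0})
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.json")
    model.save(model_path)
    # "Software engineering" side: the same model is loaded for scoring.
    restored = KeywordModel.load(model_path)

same_prediction = (
    restored.predict("This shirt is awesome") == model.predict("This shirt is awesome")
)
print(same_prediction)
```

Because both sides read the same saved artifact, the pipeline does not have to be re-implemented by hand in a second language.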

Page 28: Revolutionize Text Mining with Spark and Zeppelin

Data scientists work with software engineers

[Diagram labels: Data Scientists, Software engineers]

Explore data
Create pipeline
Find best params
Save model
Load model
Deploy in production
Scoring on batch/streaming data