Series-O-Rama: Search & Recommend TV series with SQL
http://bit.ly/dMh7kb

Guillaume Cabanac
February 15th, 2011

Page 1: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Series-O-Rama

Search & Recommend TV series with SQL

http://bit.ly/dMh7kb

Guillaume Cabanac

February 15th, 2011

Page 2: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Toulouse: A Picture is Worth a Thousand Words

Series-O-Rama: Search & Recommend TV series with SQL

Guillaume Cabanac

[Figure: four numbered pictures, 1 to 4]

Toulouse: population 437 000, students 97 000

Aberdeen: population 210 400, students ?? ???

Capbreton: 3h ride

Collioure: 2h30 ride

Ax-les-Thermes: 1h40 ride

Page 3: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Telly Addicts Need Help to Find TV Series

• Main topics of Grey’s Anatomy? → Text mining, visualization

• Series about ‘plane crash island’ → Search engine

• What should I watch next? → Recommender system

[Images: en.wikipedia.org, amazon.com]

Page 4: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Text Mining: Let’s Crunch Subtitles

• Main topics of Grey’s Anatomy? → Text mining, visualization

• Series about ‘plane crash island’ → Search engine

• What should I watch next? → Recommender system

[Figures: Cold Case, Grey’s Anatomy]

Page 5: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

What’s in a Subtitle File?

File name: Title – Season – Episode – Language.srt

1 episode = 1 plain text file

Each entry in the file contains:

• Synchronization: start --> stop

• Dialogue

We can easily extract words:

[ a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www ]
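The word extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-entry subtitle excerpt is made up, the `extract_words` helper is a hypothetical name, and the talk's own implementation (Java and Oracle) is not shown in the slides.

```python
import re
from collections import Counter

# Hypothetical two-entry excerpt in SubRip (.srt) layout:
# entry number, "start --> stop" timing line, then the dialogue.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
I'm hungry. Something great is going to happen tonight.

2
00:00:04,000 --> 00:00:06,000
Night again, and I love my town.
"""

# Matches a synchronization line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def extract_words(srt_text):
    """Count lowercase words found in the dialogue lines of an .srt text."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, entry numbers and timing lines.
        # Limitation: a dialogue line made only of digits is skipped too.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

words = extract_words(SAMPLE_SRT)
print(sorted(words.items()))
```

Splitting on letters only turns "I'm" into the two tokens `i` and `m`, which matches the single-letter entries (`m`, `s`) visible in the slide's word list.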

Page 6: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

DB technology at Work!

[Home]

7 527 files = 337 MB

100% Java and Oracle

Page 7: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

DB technology at Work!

[Search engine]

Ranked list of results

Page 8: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

DB technology at Work!

[Infos]

Most popular terms

Most related series

Page 9: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

DB technology at Work!

[Recommendations]

Page 10: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

DB technology at Work!

[Recommendations]

I liked / I disliked

What should I watch next?

Page 11: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

DB technology at Work!

[Recommendations]

Ranked list of recommendations

Page 12: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

How Does this Work?

Page 13: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Architecture and Data Model

[Architecture diagram. Offline: subtitles, indexing, DB. Online: DB, searching, browsing, recommending, GUI.]

Dict = { idT, term }

idT   term
8     plane
27    killer
29    crash

Posting = { idT*, idS*, nb }

idT   idS   nb
27    45    89
8     45    3
8     12    90
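As a rough sketch, the two relations above could be declared as follows. The talk used Oracle; SQLite (through Python's `sqlite3` module) is swapped in here only so the example is self-contained. The column types and the reading of the asterisks as key columns are assumptions, not something the slide states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dict (
    idT  INTEGER PRIMARY KEY,      -- term identifier
    term TEXT NOT NULL UNIQUE      -- stemmed term
);
CREATE TABLE Posting (
    idT INTEGER NOT NULL REFERENCES Dict(idT),
    idS INTEGER NOT NULL,          -- series identifier
    nb  INTEGER NOT NULL,          -- occurrences of the term in the series
    PRIMARY KEY (idT, idS)         -- assumed meaning of idT*, idS*
);
""")

# Sample rows taken from the slide.
conn.executemany("INSERT INTO Dict VALUES (?, ?)",
                 [(8, "plane"), (27, "killer"), (29, "crash")])
conn.executemany("INSERT INTO Posting VALUES (?, ?, ?)",
                 [(27, 45, 89), (8, 45, 3), (8, 12, 90)])

# Which series mention 'plane', and how often?
rows = conn.execute("""
    SELECT d.term, p.idS, p.nb
    FROM Dict d JOIN Posting p ON p.idT = d.idT
    WHERE d.term = 'plane'
    ORDER BY p.nb DESC
""").fetchall()
print(rows)
```

Keeping one row per (term, series) pair means a term lookup is a single join between Dict and Posting.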

Page 14: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Theory: Text Indexing Pipeline

Tokenization + lowercase
    [the, plane, crashed, ..., planes, ..., is]

Stopwords removal
    [plane, crashed, ..., planes, ...]

Stemming: Porter’s Stemmer (1980), http://qaa.ath.cx/porter_js_demo.html
    [plane, crash, ..., plane, ...]

Counting
    {(plane, 48), (crash, 15) ...}

Sample text:

In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
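The four stages can be chained in a short Python sketch. Two things here are assumptions made for illustration: the stopword list is a tiny hand-picked set, and `toy_stem` is a deliberately crude suffix stripper standing in for Porter's stemmer, which applies many more rules.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (a real one has hundreds of entries).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def toy_stem(word):
    """Crude stand-in for Porter's stemmer: strips only '-ed' and '-s'."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def index_text(text):
    """Tokenize + lowercase, remove stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    stems = [toy_stem(t) for t in kept]
    return Counter(stems)

counts = index_text("The plane crashed. The planes crashed. The plane is lost.")
print(counts.most_common())
```

With this input, `plane` and `planes` collapse to the same stem and `crashed` becomes `crash`, mirroring the example lists on the slide.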

Page 15: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Theory: Vector Space Model, Term Weighting

[Figure: term frequencies over the vocabulary for Dexter and Lost, with the maximum (max) marked for each series; the term ‘survive’ is the one being compared]

Raw TF: dexter > lost

Normalization, TF / max(TF): dexter < lost
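Written out, the normalization named on this slide divides the raw count of a term in a series by the highest count of any term in that series. The symbols below are notation chosen here, with nb(t, S) standing for the number of occurrences of term t in series S:

```latex
\mathrm{tf}(t, S) = \frac{\mathrm{nb}(t, S)}{\max_{t'} \mathrm{nb}(t', S)}
```

The result lies between 0 and 1, so a series with many subtitle files no longer wins simply because its raw counts are larger.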

Page 16: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Theory: Best Match Retrieval

1 TV series = 1 vector, with one component per term (1, 45, 1467, 6790, ..., n)

Now, we know how to:

• Find the most popular terms for a TV series

• Compute the similarity between TV series

• Find TV series matching a query

Page 17: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Theory: More on Term Weighting

1 TV series = 1 vector, with one component per term (1, 45, 1467, 6790, ..., n)

So far, all terms are supposed to be equally representative… but ‘survive’ is far more unusual than ‘people’.

‘survive’ better represents Lost than ‘people’ does.

IDF: Inverse Document Frequency

Page 18: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Theory: The Big Picture: TF*IDF

An important term for series S is frequent in S and globally unusual.
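The slides do not give the exact IDF formula, so the sketch below assumes one common textbook definition, the natural logarithm of N / df, where N is the number of series and df the number of series containing the term. The collection size and document frequencies are made-up values.

```python
import math

def idf(n_series, n_series_with_term):
    """Assumed definition: ln(N / df). The talk's exact formula is not shown."""
    return math.log(n_series / n_series_with_term)

def weight(tf, idf_value):
    """TF*IDF: frequent in the series (tf) and globally unusual (idf)."""
    return tf * idf_value

N = 100                 # hypothetical number of series in the collection
common = idf(N, 90)     # a term found in 90 series, e.g. 'people'
rare = idf(N, 5)        # a term found in 5 series, e.g. 'survive'

# Same normalized frequency, very different importance.
print(weight(0.5, common), weight(0.5, rare))
```

A term present in every series gets an IDF of 0 under this definition, so it contributes nothing to a score however frequent it is.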

Page 19: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Theory … and Practice

Series = { idS, name, maxNb }

idS   name     maxNb
12    Lost     540
45    Dexter   125

Dict = { idT, term, idf }

idT   term     idf
8     plane    1.25
27    killer   2.87
29    crash    3.07

Posting = { idT*, idS*, nb, tf }

idT   idS   nb   tf
27    45    89   0.71
8     45    3    0.02
8     12    90   0.16
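One way the tf column could be filled from nb and maxNb is a single UPDATE with a correlated subquery. This is a sketch under assumptions: SQLite replaces the talk's Oracle database, the column types are guesses, and the maxNb values are copied from the slide rather than recomputed, because the slide shows only a few postings per series.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL,
                      PRIMARY KEY (idT, idS));
INSERT INTO Series VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
INSERT INTO Posting (idT, idS, nb) VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
""")

# tf = nb / maxNb, using the maxNb of the series each posting belongs to.
conn.execute("""
    UPDATE Posting
    SET tf = 1.0 * nb / (SELECT maxNb FROM Series
                         WHERE Series.idS = Posting.idS)
""")

tf = {(idT, idS): value
      for idT, idS, value in conn.execute("SELECT idT, idS, tf FROM Posting")}
for key, value in sorted(tf.items()):
    print(key, round(value, 4))
```

The computed values (about 0.712, 0.024 and 0.167) agree with the slide's two-decimal figures if those were truncated rather than rounded.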

Page 20: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Description of a TV Series

Lost

Many surnames need to be filtered out

Page 21: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Retrieval of TV Series: queries with 1 term

Query: survive ⋈

Importance of normalization

• Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645

• Blade: nb/maxNb = 9/163 = 0.05521
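A one-term query can be sketched as a join between Dict, Posting and Series, ranked by nb/maxNb. The counts are the ones on the slide; the identifiers (1 and 2), the column types and the use of SQLite instead of Oracle are assumptions made to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER);
INSERT INTO Series  VALUES (1, 'Stargate Atlantis', 1116), (2, 'Blade', 163);
INSERT INTO Dict    VALUES (1, 'survive');
INSERT INTO Posting VALUES (1, 1, 63), (1, 2, 9);
""")

def search(term):
    """Rank the series containing `term` by normalized frequency nb/maxNb."""
    return conn.execute("""
        SELECT s.name, p.nb, ROUND(1.0 * p.nb / s.maxNb, 5) AS tf
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series  s ON s.idS = p.idS
        WHERE d.term = ?
        ORDER BY 1.0 * p.nb / s.maxNb DESC
    """, (term,)).fetchall()

results = search("survive")
for name, nb, tf in results:
    print(name, nb, tf)
```

The raw counts differ by a factor of seven (63 against 9), yet after normalization the two series score almost the same, which is the point the slide makes.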

Page 22: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Retrieval of TV Series: queries with n terms

Query: survive mulder ⋈

67 | The Vampire Diaries

term    | tf    | idf   | tf * idf
survive | 0.028 | 0.107 | 0.028 * 0.107 = 0.003
mulder  | 0.007 | 3.977 | 0.007 * 3.977 = 0.028
score = 0.003 + 0.028 = 0.031

18 | X-Files

term    | tf    | idf   | tf * idf
survive | 0.014 | 0.107 | 0.014 * 0.107 = 0.001
mulder  | 1.000 | 3.977 | 1.000 * 3.977 = 3.977
score = 0.001 + 3.977 = 3.978

⁞
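The multi-term score above is a sum of tf * idf over the query terms, which maps naturally onto GROUP BY and SUM. The sketch below reuses the slide's tf and idf values; the term identifiers (1 and 2), the column types and the use of SQLite instead of Oracle are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL);
INSERT INTO Series  VALUES (67, 'The Vampire Diaries'), (18, 'X-Files');
INSERT INTO Dict    VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                           (1, 18, 0.014), (2, 18, 1.000);
""")

def search(terms):
    """Score each series as the sum of tf * idf over the query terms."""
    placeholders = ", ".join("?" for _ in terms)
    sql = f"""
        SELECT s.name, SUM(p.tf * d.idf) AS score
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series  s ON s.idS = p.idS
        WHERE d.term IN ({placeholders})
        GROUP BY s.idS, s.name
        ORDER BY score DESC
    """
    return conn.execute(sql, list(terms)).fetchall()

ranking = search(["survive", "mulder"])
for name, score in ranking:
    print(name, round(score, 3))
```

Only the placeholders are interpolated into the SQL text; the query terms themselves are passed as bound parameters.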

Page 23: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Computing Similarities Among TV Series (1/2)

Similar to House?

First, let’s compute the numerator, where:

• Ai = terms from House

• Bi = terms from another TV series

Numerator: the sum of the products Ai × Bi
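The formula shown on the slide did not survive in this transcript. Assuming it is the cosine measure usually paired with the vector space model, which is an assumption and not something the transcript confirms, the similarity between House (A) and another series (B) over n terms would read:

```latex
\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
```

The numerator only needs the terms that both series share, since every other product is zero.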

Page 24: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Computing Similarities Among TV Series (2/2)

Similar to House?
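A possible SQL formulation of this step, assuming the similarity is the cosine measure, is a self-join on shared terms for the numerator plus one aggregate per series for the norms. Everything below is illustrative: the `Weight` table, its rows and the `similar_to` helper are hypothetical, and SQLite stands in for the talk's Oracle database.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Weight (idS INTEGER, idT INTEGER, w REAL, PRIMARY KEY (idS, idT));
-- Hypothetical term weights; series 1 plays the role of House.
INSERT INTO Weight VALUES (1, 10, 3.0), (1, 11, 4.0),
                          (2, 10, 3.0), (2, 11, 4.0),
                          (3, 11, 1.0), (3, 12, 1.0);
""")

def similar_to(ids):
    """Rank the other series by cosine similarity with series `ids`."""
    # Vector norm of each series (square root taken in Python, since
    # SQLite's SQRT function is not available in every build).
    norms = {s: math.sqrt(sq) for s, sq in conn.execute(
        "SELECT idS, SUM(w * w) FROM Weight GROUP BY idS")}
    # Numerator: sum of Ai * Bi over the terms both series share.
    dots = conn.execute("""
        SELECT b.idS, SUM(a.w * b.w)
        FROM Weight a JOIN Weight b ON b.idT = a.idT
        WHERE a.idS = ? AND b.idS <> ?
        GROUP BY b.idS
    """, (ids, ids)).fetchall()
    ranked = [(other, dot / (norms[ids] * norms[other])) for other, dot in dots]
    return sorted(ranked, key=lambda r: r[1], reverse=True)

ranking = similar_to(1)
for other, sim in ranking:
    print(other, round(sim, 4))
```

Series that share no term with the reference series never appear in the join, so they are simply absent from the ranking instead of being listed with a similarity of 0.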

Page 25: Series-O-Rama Search & Recommend TV series with SQL bit.ly/dMh7kb

Thank you

http://www.irit.fr/~Guillaume.Cabanac