Apply Logistic Regression Model in Making Celebrity's Popularity Ranking System


Upload: vietsoftware-international-inc

Post on 20-Jul-2015


Page 1: Apply Logistic Regression model in Making Celebrity's popularity ranking system

Author: Thuc VX <[email protected]>

Company: NHN Vietnam – Search service division


Project description

- Build a ranking system for celebrities in the Line2day service

Architecture

Systems collaboration

- "Celeb ranking" runs every day at 23:00

- Line2day gets the scores every day at 05:00


General architecture

Collecting data module


- The repository is a text file whose content is in JSON format

- The data collection module is run every day by crontab

- The data file is served over HTTP by an Apache web server

Platform

Web server: Apache web server

Data format for information exchange: JSON, using the JSON library for Java (http://www.json.org)

HTTPS client: JSSE in the JDK

Programming language: Java for all modules (the Java JDK is already installed on server gvbatch01)

Celeb Ranking theory

Features

1. YouTube channel's subscribers

2. Official Facebook page's likes

3. Number of articles in Line2day (returned by the Line2day API when getting the celeb list)

4. Number of Google search results:

5. Number of documents in the vnews collection:

Recency attributes

Idx  Name            Description                                             Coefficient
1    videoRecency    Number of videos in line2day in last 3 days             1
2    articleRecency  Number of articles in line2day in last 3 days           0.01
3    vnewsRecency    Number of documents in vnews collection in last 3 days  0.005
4    photoRecency    Number of photos in line2day in last 3 days             0.001
5    postRecency     Number of posts of celeb in facebook in last 1 day      0.005


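Ranking function

Use the Logistic Regression model.

Logistic Regression function:

F(t) = 1 / (1 + e^(-t))

Suppose:

t = w0 + w1 . x1 + w2 . x2 + ... + wk . xk, where x1, ..., xk are the feature values and w0, ..., wk are the weights learned by the model

Score(t) = F(t) + β . Recency

where β holds the coefficients of the recency attributes and Recency holds their values.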
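The ranking function above can be sketched in Java as follows. This is only an illustration under stated assumptions: F is taken to be the standard logistic function, "β . Recency" is read as a dot product between the coefficients of the recency table and the recency counts, and the class name, method names, input counts and the value of t are all hypothetical.

```java
class CelebScoreSketch {

    // Standard logistic function: F(t) = 1 / (1 + e^(-t)).
    static double logistic(double t) {
        return 1.0 / (1.0 + Math.exp(-t));
    }

    // Dot product of two equal-length vectors.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Score(t) = F(t) + beta . Recency, with "beta . Recency" read as a dot product (assumption).
    static double score(double t, double[] recencyCoefficients, double[] recencyCounts) {
        return logistic(t) + dot(recencyCoefficients, recencyCounts);
    }

    public static void main(String[] args) {
        // Coefficients from the recency table: video, article, vnews, photo, post.
        double[] beta = {1, 0.01, 0.005, 0.001, 0.005};
        // Hypothetical recency counts for one celebrity.
        double[] recency = {2, 10, 40, 100, 4};
        // Hypothetical linear predictor produced by the trained model.
        double t = 0.0;
        System.out.println("score = " + score(t, beta, recency));
    }
}
```

Because F(t) is bounded by 1 while the recency term is not, a celebrity with many recent videos can outrank one with a higher base popularity; whether that is intended depends on how the coefficients were chosen.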

Machine learning tool: http://devwiki.nhncorp.vn/index.php/Logregdrvr

Phase 2

Features

Idx  Name              Description                                                                      Apply in
1    Youtube           Youtube channel's subscribers                                                    Phase 1
2    Facebook          Official facebook's likes
3    Line2day          Number of articles in Line2day - returned from Line2day API when get celeb list
4    Google Web        Number of relevant documents in Google Web search
5    Google News       Number of relevant documents in Google News search
5.1  GoogleTrend       Result from Google trend
6    articleClicks     Number of clicks on articles talking about celeb                                 Phase 2
7    LikesPerPost      Average number of likes per post in Facebook
8    musicVideoClicks  Number of clicks on videos talking about celeb                                   Phase 3
9    PhotoClicks       Number of clicks on photos talking about celeb
10   Rating            Celeb Rating Score
11   Followers         Number of followers in line2day: this feature from Mobile App
12   Photos            Number of photos in line2day
13   Videos            Number of videos in line2day

Note:


- To support the Celeb Rating feature, the Line2day mobile app and website must provide a function that lets readers vote for a celeb.

- Try other Google search parameters: http://moz.com/ugc/the-ultimate-guide-to-the-google-search-parameters

On articleClicks feature

- Check the correlation between #clicks and the pre-classification

- Use #clicks and evaluate the model

Recency attributes

Idx  Name                  Description                                         Coefficient
1    videoRecency          Number of videos in line2day in last 3 days         1
2    articleRecency        Number of articles in line2day in last 3 days       0.01
4    photoRecency          Number of photos in line2day in last 3 days         0.001
5    postRecency           Number of posts of celeb in facebook in last 1 day  0.005
6    FacebookTalkingAbout  TalkingAbout in Facebook

PreProcess feature data

Check for abnormal Google search results:

- Check the number of search results returned by Google search

- Process:

1. Detect abnormal data: by min / max limits or automatically by the application

2. Fix abnormal data

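Method for detecting abnormal data:

We use a statistical method:

- Calculate the modified Z-score:

M_i = 0.6745 . (x_i - median(x)) / MAD

Where MAD = median(|x_i - median(x)|) is the median absolute deviation of the values.

- If the absolute value of the modified Z-score is greater than 3.5, the value is likely an outlier.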
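A minimal Java sketch of this outlier check follows. It assumes the commonly used definition of the modified Z-score, 0.6745 . (x - median) / MAD with MAD the median absolute deviation, and compares the absolute value against the 3.5 threshold. The class name, method names and sample counts are hypothetical.

```java
import java.util.Arrays;

class ModifiedZScoreSketch {

    // Median of a non-empty array (the input is not modified).
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Modified Z-score of every value: 0.6745 * (x - median) / MAD.
    static double[] modifiedZScores(double[] values) {
        double med = median(values);
        double[] absDev = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            absDev[i] = Math.abs(values[i] - med);
        }
        double mad = median(absDev);
        if (mad == 0.0) {
            // The score is undefined when at least half of the values equal the median.
            throw new ArithmeticException("MAD is zero; modified Z-score is undefined");
        }
        double[] scores = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scores[i] = 0.6745 * (values[i] - med) / mad;
        }
        return scores;
    }

    // Flags values whose absolute modified Z-score exceeds the threshold (3.5 in the text).
    static boolean[] outliers(double[] values, double threshold) {
        double[] scores = modifiedZScores(values);
        boolean[] flags = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            flags[i] = Math.abs(scores[i]) > threshold;
        }
        return flags;
    }

    public static void main(String[] args) {
        // Hypothetical search-result counts; the last one is suspiciously large.
        double[] counts = {10, 12, 11, 13, 12, 1000};
        System.out.println(Arrays.toString(outliers(counts, 3.5)));
    }
}
```

Search-result counts are often heavily skewed, so applying the check to log-transformed counts may be worth trying; the source does not say which scale was used.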

Method for fixing data:

1. Build a regression function between the search result count and the classification: just right.

2. Every new celebrity must have a classification number.


3. An existing celeb gets its classification from its popularity.

4. From the classification value, we calculate the Google search result count with the regression function.

Generate data for missing social URL

- Build a regression function between the Facebook (or Youtube) feature and the Google search feature

- If the social URL is missing, predict the value of that feature with the regression function
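Both the repair of abnormal Google results and the handling of missing social URLs rely on a fitted regression function. The source does not state its form, so the sketch below assumes an ordinary least-squares line with a single predictor. The class name, method names and sample data are hypothetical.

```java
class SimpleRegressionSketch {
    final double slope;
    final double intercept;

    SimpleRegressionSketch(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }

    // Ordinary least-squares fit of y = intercept + slope * x.
    static SimpleRegressionSketch fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        return new SimpleRegressionSketch(slope, meanY - slope * meanX);
    }

    // Predicted y for a given x, e.g. a feature value for a celeb whose data is missing.
    double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        // Hypothetical training pairs: x = known feature, y = feature to be predicted.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        SimpleRegressionSketch model = fit(x, y);
        System.out.println("prediction for x = 5: " + model.predict(5));
    }
}
```

For the Google repair step, x would be the classification and y the search result count; for a missing social URL, x would be the Google search feature and y the Facebook or Youtube feature.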

Name search service

Requirements:

- Count as accurately as possible the number of times a celebrity appears on the Internet

- Count only articles that contain the celeb's name, not articles about other people with the same name

- Synonym dictionary: support names written in different formats, such as Tara, t-ara, ...

Solution:

The search string is managed with CMV (a Line2day tool). After the search string is edited, the data is imported into the Line2day database, and Line2day returns it to the Celeb Ranking System through its API.

Search string: provided by Line2day. This string carries two pieces of information:

- Different forms of the name: name1, name2, ...

- Different keywords: keyword1, keyword2, ...

The query sent to the search engine has the following format:

Query = ("name1" | "name2" ...) & ("keyword1" | "keyword2" | ...)
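Building a query string in this format can be sketched in Java as follows. The class name, method names and sample names and keywords are hypothetical, and the sketch assumes the terms contain no double quotes that would need escaping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class NameQuerySketch {

    // Wraps every term in double quotes and joins them with " | " inside parentheses.
    static String quoteAndJoin(List<String> terms) {
        return terms.stream()
                .map(term -> "\"" + term + "\"")
                .collect(Collectors.joining(" | ", "(", ")"));
    }

    // Query = ("name1" | "name2" ...) & ("keyword1" | "keyword2" | ...)
    static String buildQuery(List<String> names, List<String> keywords) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("at least one name is required");
        }
        String nameGroup = quoteAndJoin(names);
        // Assumption: with no keywords, the query is just the name group.
        if (keywords.isEmpty()) {
            return nameGroup;
        }
        return nameGroup + " & " + quoteAndJoin(keywords);
    }

    public static void main(String[] args) {
        // Hypothetical name variants and disambiguating keywords.
        List<String> names = Arrays.asList("Tara", "t-ara");
        List<String> keywords = Arrays.asList("singer", "kpop");
        System.out.println(buildQuery(names, keywords));
    }
}
```

The name group covers the synonym requirement, while the keyword group narrows the results to the intended person rather than others with the same name.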

Evaluation - Evaluating the logistic regression model

http://kiemtailieu.com/khoa-hoc-tu-nhien/tai-lieu/danh-gia-mo-hinh-hoi-qui-logistic/2.html

Use the Pearson correlation: the correlation coefficient between the target value and the predicted value

Use the Kendall tau-b correlation, for example in R: cor.test(exer, smoke, method="kendall")
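Since the modules are written in Java, the two correlation measures can also be computed directly. The sketch below uses the textbook definitions: Pearson as covariance over the product of standard deviations, and Kendall tau-b as (C - D) / sqrt((C + D + Tx) . (C + D + Ty)), where C and D count concordant and discordant pairs and Tx, Ty count pairs tied only in x or only in y. The class name, method names and sample values are hypothetical.

```java
class RankCorrelationSketch {

    // Pearson correlation coefficient; returns NaN if either input has zero variance.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
            syy += (y[i] - meanY) * (y[i] - meanY);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Kendall tau-b with tie correction, O(n^2) pair comparison.
    static double kendallTauB(double[] x, double[] y) {
        long concordant = 0;
        long discordant = 0;
        long tiedOnlyX = 0;
        long tiedOnlyY = 0;
        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                double sx = Math.signum(x[i] - x[j]);
                double sy = Math.signum(y[i] - y[j]);
                if (sx == 0 && sy == 0) {
                    continue; // tied in both variables: counted in neither factor
                } else if (sx == 0) {
                    tiedOnlyX++;
                } else if (sy == 0) {
                    tiedOnlyY++;
                } else if (sx == sy) {
                    concordant++;
                } else {
                    discordant++;
                }
            }
        }
        double pairs = concordant + discordant;
        return (concordant - discordant) / Math.sqrt((pairs + tiedOnlyX) * (pairs + tiedOnlyY));
    }

    public static void main(String[] args) {
        // Hypothetical target classifications and predicted scores.
        double[] target = {1, 2, 2, 3};
        double[] predicted = {1, 2, 3, 3};
        System.out.println("pearson = " + pearson(target, predicted));
        System.out.println("tau-b   = " + kendallTauB(target, predicted));
    }
}
```

Kendall tau-b looks only at the ordering of the pairs, which suits a ranking task; Pearson also reacts to how far apart the values are.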


Use DCG (Discounted Cumulative Gain)

This rate lies in the range [0, 1]. A higher rate indicates a better model.
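A rate bounded by [0, 1] suggests the normalized form of DCG (nDCG), which divides the DCG of the predicted ranking by the DCG of the ideal ranking. The source does not show the exact formula, so the sketch below assumes the common form DCG = sum of rel_i / log2(i + 1) over rank positions i = 1..n, with rel_i taken as the pre-judged classification of the celeb placed at position i. The class name, method names and sample grades are hypothetical.

```java
import java.util.Arrays;

class NdcgSketch {

    // DCG = sum over positions i = 1..n of rel_i / log2(i + 1).
    static double dcg(double[] relevanceInRankOrder) {
        double sum = 0.0;
        for (int i = 0; i < relevanceInRankOrder.length; i++) {
            // Array index i corresponds to rank position i + 1, so the discount is log2(i + 2).
            sum += relevanceInRankOrder[i] / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    // nDCG = DCG of the predicted order / DCG of the ideal (descending relevance) order.
    static double ndcg(double[] relevanceInRankOrder) {
        double[] ascending = relevanceInRankOrder.clone();
        Arrays.sort(ascending);
        double[] ideal = new double[ascending.length];
        for (int i = 0; i < ascending.length; i++) {
            ideal[i] = ascending[ascending.length - 1 - i];
        }
        double idealDcg = dcg(ideal);
        // Convention used in this sketch: 0 when every relevance grade is 0.
        return idealDcg == 0.0 ? 0.0 : dcg(relevanceInRankOrder) / idealDcg;
    }

    public static void main(String[] args) {
        // Hypothetical pre-judged grades, listed in the order the model ranked the celebs.
        double[] predictedOrder = {3, 1, 2};
        System.out.println("nDCG = " + ndcg(predictedOrder));
    }
}
```

With non-negative grades, a ranking that matches the ideal order scores exactly 1, and any misordering of unequal grades lowers the value.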

Diagnostics

- Residuals: between the real rank and the predicted rank

- Confidence interval

- Outliers: identify the observations with the highest residuals

- Covariance: between pre-judged and predicted values