Organizing Big Data for Text in Rakuten

Organizing Big Data for Text in Rakuten October 28, 2017 Keiji Shinzato Rakuten Institute of Technology Rakuten, Inc.

Upload: rakuten-inc

Post on 21-Jan-2018

TRANSCRIPT

Page 1: Organizing Big Data for Text in Rakuten

Organizing Big Data for Text in Rakuten

October 28, 2017

Keiji Shinzato

Rakuten Institute of Technology

Rakuten, Inc.

Page 2: Organizing Big Data for Text in Rakuten

Understanding Text in Rakuten

• Search Queries

• Reviews from Users

• Product Descriptions

• Etc.

Valuable Information

• User Interest

• User Experience

• Product Features

• Etc.

Page 3: Organizing Big Data for Text in Rakuten

The number of products has grown 2.6-fold over the past five years.

• 100M products in 2012 [1] to 258M products in 2017 [2]

How long would it take to read the descriptions of 258M products?

A. 4 years
B. 40 years
C. 400 years
D. 4,000 years

[1] https://corp.rakuten.co.jp/about/history.html

[2] https://www.rakuten.co.jp/ (as of October 2, 2017)

Page 4: Organizing Big Data for Text in Rakuten

Technology to organize big data for text is critical!
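The slides do not state a reading speed, so the quiz can only be checked under an assumption. The sketch below converts an assumed number of seconds per description into years for 258M products. The function name, the sample reading times, and the nonstop-reading assumption are illustrative and not from the talk.

```python
# Assumption: reading nonstop, 365-day years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_to_read(num_products: int, seconds_per_description: float) -> float:
    """Years of nonstop reading at an assumed time per description."""
    return num_products * seconds_per_description / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical reading times; the talk does not give one.
    for seconds in (1, 10, 60):
        years = years_to_read(258_000_000, seconds)
        print(f"{seconds:>2} s per description: {years:.1f} years")
```

Under these assumptions, one minute per description already corresponds to several centuries of nonstop reading.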

Page 5: Organizing Big Data for Text in Rakuten

• Information Extraction from Product Data

• Sentiment Analysis on Review Data

Page 6: Organizing Big Data for Text in Rakuten

Application

Unstructured Data

Crafted from sleek spazzolato leather (black), the Dorian shopper is an elegant carryall that's perfect for your essentials. 10"H x 13"L x 6"D.
RALPH LAUREN

Structured Data

Attribute    Value
Brand        Ralph Lauren
Color        Black
Material     Leather
Size         10"H x 13"L x 6"D

Faceted Navigation / Recommendation / Market Research

The bag image is designed by Freepik (http://www.freepik.com/free-vector/set-of-woman-s-bags-in-flat-style_960523.htm)

Page 7: Organizing Big Data for Text in Rakuten

Difficulty

• Ambiguity
  • パーカー ("Parker", a luxury pen brand, or "parka", a hoodie), PUMA (a sports brand and a knife brand)

• Diversity (long tail)
  • 風と光 (Kaze to Hikari, a natural foods company)

Dictionary-based approach

• It is easy to control system behavior by editing entries in the dictionary.

• It is easy to understand the causes of errors.

Page 8: Organizing Big Data for Text in Rakuten

Input: Product Titles and their Genres

Morphological Analysis
  • Tokenization
  • PoS Tagging

Extraction (uses the Brand Dictionary)
  • List tokens matched with the dictionary entries.
  • Extract the leftmost candidate.

Normalization (uses the Synonym Dictionary)
  • Retrieve brand IDs corresponding to extracted brands.

Output: Data with Brand IDs
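The pipeline above could be sketched roughly as follows. This is a minimal illustration, not Rakuten's implementation: whitespace splitting stands in for morphological analysis (Japanese titles would need a real tokenizer), PoS tagging and genre filtering are left out for brevity, and the toy dictionaries and all function names are assumptions.

```python
from typing import List, Optional

# Toy dictionaries for illustration only.
BRAND_DICTIONARY = {"NIKE", "ナイキ", "SONY", "ソニー"}
SYNONYM_DICTIONARY = {
    "NIKE": "B2449", "ナイキ": "B2449",
    "SONY": "B2450", "ソニー": "B2450",
}


def tokenize(title: str) -> List[str]:
    """Stand-in for morphological analysis: split on whitespace."""
    return title.split()


def extract_brand(tokens: List[str]) -> Optional[str]:
    """List tokens found in the brand dictionary and keep the leftmost one."""
    candidates = [tok for tok in tokens if tok in BRAND_DICTIONARY]
    return candidates[0] if candidates else None


def normalize(brand: Optional[str]) -> Optional[str]:
    """Map an extracted brand expression to its brand ID, if any."""
    return SYNONYM_DICTIONARY.get(brand) if brand is not None else None


def brand_id_for_title(title: str) -> Optional[str]:
    """Run the three steps in order: tokenize, extract, normalize."""
    return normalize(extract_brand(tokenize(title)))
```

For example, `brand_id_for_title("ナイキ running shoes SONY")` returns the ID of the leftmost match, ナイキ.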

Page 9: Organizing Big Data for Text in Rakuten

Brand expression    Relevant Genre
力王    Unknown
中部電磁器工業    Computers & Networking
キメラパーク    Unknown
シュガーローズ    Women's Clothing
サスクワッチファブリックス    Women's Clothing
藤栄    Home Decor, Housewares & Furniture
ミキモト    Unknown
エドウィンゴルフ    Sports & Outdoors
AKI WORLD    Sports & Outdoors
工房飛竜    Toys, Hobbies & Games
パーカー    Home & Office Supplies
ハイライトキャバレー    Men's Clothing
杉野    Unknown
カウネット    Kitchen, Dining & Bar

The dictionary stores brand expressions together with their relevant genres.

• 190K entries

Relevant genres are critical for disambiguation.

• Use only brand expressions whose relevant genre matches the genre of the given product.

• For example, extract パーカー only for products in Home & Office Supplies.
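The genre check described above could look like the following sketch. The data structure and the function name are assumptions made for illustration. Rows marked "Unknown" are left out because the slide does not say how they are handled.

```python
from typing import Dict, List, Set

# A few rows from the brand dictionary table; structure is an assumption.
RELEVANT_GENRES: Dict[str, Set[str]] = {
    "パーカー": {"Home & Office Supplies"},
    "シュガーローズ": {"Women's Clothing"},
    "エドウィンゴルフ": {"Sports & Outdoors"},
}


def brand_candidates(tokens: List[str], product_genre: str) -> List[str]:
    """Keep only tokens whose dictionary entry lists the product's genre."""
    return [
        tok for tok in tokens
        if product_genre in RELEVANT_GENRES.get(tok, set())
    ]
```

With this filter, パーカー is treated as a brand in a Home & Office Supplies title but ignored in a Men's Clothing title, where it most likely means a hoodie.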

Page 10: Organizing Big Data for Text in Rakuten

Screenshot of https://item.rakuten.co.jp/brandol-ec/gu-295710-j8400-8106/ as of October 10th, 2017.

Page 11: Organizing Big Data for Text in Rakuten

1. Create training data: Brand dictionary + Product data in ICHIBA → Annotated text

2. Train and run machine learning models → Candidates

3. Assign new relevant genres to existing brands

4. Check manually → New entries

5. Update the brand dictionary

Page 13: Organizing Big Data for Text in Rakuten

Synonym Dictionary

Genre    ID    Synonym
:    :    :
Shoes, Bags, …    B2449    NIKE, ナイキ
Electronics    B2450    SONY, ソニー
:    :    :

Extraction Results

Genre    Product    Brand ID    Label
Shoes    ナイキ    B2449    NIKE
Shoes    NIKE    B2449    NIKE
Bags    NIKE    B2449    NIKE
Interior    ナイキ    --    ナイキ

Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm)

The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm)

The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)

Page 14: Organizing Big Data for Text in Rakuten

Synonym Dictionary

Genre    ID    Synonym
:    :    :
Shoes, Bags, …    B2449    NIKE, ナイキ
Electronics    B2450    SONY, ソニー
:    :    :

Extraction Results

Genre    Product    Brand ID
Shoes    ナイキ    B2449
Shoes    NIKE    B2449
Bags    NIKE    B2449
Interior    ナイキ    --

Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm)

The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm)

The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)

Page 15: Organizing Big Data for Text in Rakuten

Synonym Dictionary

Genre    ID    Synonym
:    :    :
Shoes, Bags, …    B2449    NIKE, ナイキ
Electronics    B2450    SONY, ソニー
:    :    :

Extraction Results

Genre    Product    Brand ID
Shoes    ナイキ    B2449
Shoes    NIKE    B2449
Bags    NIKE    B2449
Interior    ナイキ    B3510

NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html

Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm)

The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm)

The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)

Page 16: Organizing Big Data for Text in Rakuten

Synonym Dictionary

Genre    ID    Synonym
:    :    :
Shoes, Bags, …    B2449    NIKE, ナイキ
Electronics    B2450    SONY, ソニー
:    :    :

Information about when a synonym can be used (its genre) is important.

Extraction Results

Genre    Product    Brand ID
Shoes    ナイキ    B2449
Shoes    NIKE    B2449
Bags    NIKE    B2449
Interior    ナイキ    B3510

NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html

Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm)

The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm)

The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
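A genre-conditioned synonym lookup along these lines might be sketched as below. The keying by (genre, surface form) and the helper names are assumptions. The Interior entry for B3510 is inferred from the extraction results rather than shown in the dictionary excerpt.

```python
from typing import Dict, Iterable, Optional, Tuple

# (genre, surface form) -> brand ID; this layout is an assumption.
SYNONYMS: Dict[Tuple[str, str], str] = {}


def add_entry(genres: Iterable[str], brand_id: str, synonyms: Iterable[str]) -> None:
    """Register every synonym of a brand under each genre where it is valid."""
    surfaces = list(synonyms)
    for genre in genres:
        for surface in surfaces:
            SYNONYMS[(genre, surface)] = brand_id


add_entry(["Shoes", "Bags"], "B2449", ["NIKE", "ナイキ"])
add_entry(["Electronics"], "B2450", ["SONY", "ソニー"])
# Assumed entry, inferred from the Interior row of the extraction results.
add_entry(["Interior"], "B3510", ["ナイキ"])


def brand_id(genre: str, surface: str) -> Optional[str]:
    """Return the brand ID for a surface form in a genre, or None if unknown."""
    return SYNONYMS.get((genre, surface))
```

Keying on the genre lets the same surface form ナイキ resolve to B2449 for shoes and to B3510 for interior products.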

Page 17: Organizing Big Data for Text in Rakuten

Find candidates automatically, and then check them manually.

• JAN code

• Wikipedia

• Semantic similarity

206K triplets of <genre, brand id, synonyms>

Page 18: Organizing Big Data for Text in Rakuten

Brands were manually assigned to 500 randomly selected product titles.

• Percentage of product titles that include a brand: 69.6% (348/500)

Performance

• Precision: 89.2% (224/251)

• Recall: 64.4% (224/348)

We can automatically extract correct brands for roughly 100M of the 260M products!
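The figures above can be reproduced with a few lines of arithmetic. The projection step is one possible reading of the 100M estimate, since the slide does not show its derivation: it scales the share of sampled titles with a correctly extracted brand (224/500) to 260M products. Variable names are illustrative.

```python
correct = 224      # titles with a correctly extracted brand
extracted = 251    # titles for which the system output a brand
with_brand = 348   # titles that actually contain a brand
sampled = 500      # titles in the evaluation sample

precision = correct / extracted
recall = correct / with_brand

# Assumption: the sample is representative of all 260M products.
projected_correct = correct / sampled * 260_000_000

if __name__ == "__main__":
    print(f"precision = {precision:.1%}, recall = {recall:.1%}")
    print(f"projected correct extractions = {projected_correct / 1e6:.0f}M")
```

Under that reading the projection comes out slightly above 100M, which is consistent with the slide's rounded claim.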

Page 19: Organizing Big Data for Text in Rakuten

• Information Extraction from Product Data

• Sentiment Analysis on Review Data

Page 20: Organizing Big Data for Text in Rakuten

Review: "I ordered this a week ago, but no response from the store."

176,502 reviews

Aspects: Stock, Information, Payment, Service, Package, Shipping

Snapshot of https://review.rakuten.co.jp/shop/4/261122_261122/cpmj-i0h5i-97x3lm_1_1/?l2-id=review_PC_sl_body_05 as of October 16th, 2017.

Page 21: Organizing Big Data for Text in Rakuten

• What aspects should we design?
• How do we build a system that performs this analysis?

Input: Merchant Review
s1: Item was nicely packaged.
s2: A tracking # was given, but never worked.
s3: Will shop again.

Output: Aspect / Sentiment Polarity
s1: Package / Pos
s2: Shipping / Neg
s3: Repeat / Pos

The robot image is designed by Freepik (http://www.freepik.com/free-vector/cute-robots-collection_713858.htm)
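One way to represent this input and output in code is sketched below. The class and function names are hypothetical. Labels are kept as a list so that a sentence could carry more than one aspect/polarity pair.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnalyzedSentence:
    text: str
    # (aspect, polarity) pairs assigned to this sentence.
    labels: List[Tuple[str, str]] = field(default_factory=list)


review = [
    AnalyzedSentence("Item was nicely packaged.", [("Package", "Pos")]),
    AnalyzedSentence("A tracking # was given, but never worked.", [("Shipping", "Neg")]),
    AnalyzedSentence("Will shop again.", [("Repeat", "Pos")]),
]


def aspects_with_polarity(sentences: List[AnalyzedSentence], polarity: str) -> List[str]:
    """Collect the aspects that received the given polarity, in sentence order."""
    return [
        aspect
        for sentence in sentences
        for aspect, label_polarity in sentence.labels
        if label_polarity == polarity
    ]
```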

Page 22: Organizing Big Data for Text in Rakuten

#    Aspect    Example

1    配送 (Shipping)    迅速な配送ありがとうございました。 (Thank you for the quick shipping.)

2    対応 (Service)    今まで買い物した店舗で一番対応が遅かった。 (I’ve never seen such slow service!)

3    連絡 (Communication)    注文受付の自動送信メールが届いたきり一週間何の連絡もなし。 (No contact for a week after ordering it.)

4    店舗 (Shop)    信頼できるショップ様でした。 (They are a reliable store.)

5    商品 (Item)    安全に使用できそうで、これからが楽しみです。 (I’m looking forward to using this product.)

6    リピート (Repeat)    また利用したいと思います。 (I’m going to purchase an item again.)

7    梱包 (Package)    梱包も破損のないよう、しっかりとされていました。 (It was tightly packaged to prevent damage.)

Page 23: Organizing Big Data for Text in Rakuten

#    Aspect    Example

8    品揃え (Stock/variety)    商品が多いので助かります。 (They have a big inventory.)

9    情報 (Information)    マネキンの身長を記載してあったのでかなり参考になりました。 (The description about the height of a mannequin is very useful.)

10    キャンセル/返品 (Cancel/return)    しかしたまに断りなく遅れたりキャンセルされている点に不満です。 (I’m not satisfied because they suddenly canceled without any notification.)

11    価格 (Price)    商品が安く、購入でき、まんぞくです。 (I’m satisfied with purchasing the item at a low price.)

12    楽天 (Rakuten)    楽天の全サービスに信用がなくなりました。 (Because of this experience, I can’t trust any services in Rakuten.)

13    支払い (Payment)    決済方法にEdyが使える方がよいと思います。 (It would be better if Rakuten Edy were acceptable.)

14    その他 (Other)    レビューがもう少し増えるといいですね。 (I hope the number of reviews increases.)

Page 25: Organizing Big Data for Text in Rakuten

Annotated 1,510 reviews (5,277 sentences)

• 配送も迅速で良かったです。 (I was very pleased at how quickly I received it.) → Shipping/Positive

• いつになっても商品が来ず、問い合わせても返信がない。 (No shipment, no reply to inquiry.) → Shipping/Negative, Communication/Negative

Annotation effort: 103 hours by one well-trained annotator
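To put the annotation effort in perspective, the average time per review and per sentence follows directly from the numbers above. This is only arithmetic on the reported totals, and the variable names are illustrative.

```python
HOURS = 103
REVIEWS = 1_510
SENTENCES = 5_277

minutes_per_review = HOURS * 60 / REVIEWS
seconds_per_sentence = HOURS * 3600 / SENTENCES

if __name__ == "__main__":
    print(f"{minutes_per_review:.1f} minutes per review")
    print(f"{seconds_per_sentence:.0f} seconds per sentence")
```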

Page 26: Organizing Big Data for Text in Rakuten

Train models using the passive-aggressive algorithm and conditional random fields (CRF).

Features:

• Bag-of-words, aspect dictionary, sentiment polarity dictionary, and syntactic information

Performance

• Aspect classification: Precision 82.6%, Recall 46.8%

• Sentiment classification: Precision 84.8%, Recall 77.5%
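As a rough illustration of the passive-aggressive part only, the sketch below trains a binary sentiment classifier with PA-I updates on whitespace bag-of-words features. It is not the system described in the talk: the CRF, the dictionaries, and the syntactic features are omitted, and the toy English training sentences, the function names, and the parameters `epochs` and `c` are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(sentence: str) -> Dict[str, int]:
    """Whitespace bag-of-words; a stand-in for the richer feature set."""
    return dict(Counter(sentence.lower().split()))


def score(weights: Dict[str, float], features: Dict[str, int]) -> float:
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def train_pa(data: List[Tuple[str, int]], epochs: int = 5, c: float = 1.0) -> Dict[str, float]:
    """PA-I training for binary labels in {+1, -1}."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for sentence, label in data:
            features = bag_of_words(sentence)
            if not features:
                continue
            # Hinge loss: zero once the example is classified with margin >= 1.
            loss = max(0.0, 1.0 - label * score(weights, features))
            if loss > 0.0:
                norm_sq = sum(value * value for value in features.values())
                tau = min(c, loss / norm_sq)  # step size capped by c (PA-I)
                for name, value in features.items():
                    weights[name] = weights.get(name, 0.0) + tau * label * value
    return weights


def predict(weights: Dict[str, float], sentence: str) -> int:
    return 1 if score(weights, bag_of_words(sentence)) >= 0 else -1


# Invented toy data: +1 = positive, -1 = negative.
TOY_DATA = [
    ("quick shipping thank you", 1),
    ("item was nicely packaged", 1),
    ("slow service no reply", -1),
    ("tracking number never worked", -1),
]
WEIGHTS = train_pa(TOY_DATA)
```

The update only changes the weights when an example is misclassified or falls inside the margin, which is what makes the algorithm "passive" on easy examples and "aggressive" on hard ones.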

Page 27: Organizing Big Data for Text in Rakuten

• It is important to develop techniques that automatically extract valuable information from Big Data for Text.
  • e.g., users' experiences described in reviews

• Rakuten develops techniques in-house to exploit Big Data for Text in its services.
  • Information extraction from product descriptions
  • Sentiment analysis on reviews of merchants

Page 28: Organizing Big Data for Text in Rakuten