[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Building Structured Data from Product Descriptions Keiji Shinzato

Upload: rakuten-inc

Post on 07-Nov-2014

DESCRIPTION

Rakuten Technology Conference 2013 "Building Structured Data from Product Descriptions" Keiji Shinzato (Rakuten)

TRANSCRIPT

Page 1: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Building Structured Data from Product Descriptions

Keiji Shinzato

Page 2: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany.

Product information extraction

Type           Red
Grape variety  Sangiovese
Region         Italy, Tuscany

Page 3: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

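Background

Unstructured data

ベリンダ・コーリー キアンティ2011 750ml

トスカーナ州 キャンティ地区のサンジョベーゼ種を主体につくられる、イタリアを代表する赤ワインの一つ。
(One of Italy's representative red wines, made mainly from Sangiovese grapes of the Chianti district in Tuscany.)

Structured data

Attribute  Value
Type       赤 (Red)
Region     イタリア, トスカーナ州キャンティ地区 (Italy, Chianti district of Tuscany)
Grape      サンジョベーゼ (Sangiovese)
Vintage    2011

• Structured data plays a crucial role in making Rakuten a more attractive service.
– Faceted navigation, recommendation, and market analysis.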

Page 4: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Faceted navigation

Reference: http://www.amazon.com/
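
To make concrete how structured data enables faceted navigation, the following is a minimal Python sketch. The product records and the helper functions `facet_counts` and `filter_by` are hypothetical illustrations and are not taken from the talk.

```python
from collections import Counter

# Hypothetical structured product records (attribute -> value).
PRODUCTS = [
    {"name": "Wine A", "Type": "Red", "Region": "Italy"},
    {"name": "Wine B", "Type": "Red", "Region": "Chile"},
    {"name": "Wine C", "Type": "White", "Region": "Italy"},
]

def facet_counts(products, attribute):
    """Count how many products carry each value of one attribute."""
    return Counter(p[attribute] for p in products if attribute in p)

def filter_by(products, **selected):
    """Keep only products whose attributes match every selected facet."""
    return [p for p in products
            if all(p.get(attr) == value for attr, value in selected.items())]

if __name__ == "__main__":
    print(dict(facet_counts(PRODUCTS, "Type")))
    print([p["name"] for p in filter_by(PRODUCTS, Type="Red", Region="Italy")])
```

The counts drive the facet sidebar (how many items each value would leave), and the filter narrows the result list once a shopper selects values.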

Page 5: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Background

• Structured data plays a crucial role in making Rakuten a more attractive service.
– Faceted navigation, recommendation, and market analysis.

• An unsupervised methodology is required.
– 100 million products / 40,000 categories.

Page 6: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

A table is a useful clue, but…

Product page including a table

Montes Alpha M 2009
WINE > CHILE

Type    Red
Region  Chile
Grape   Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Year    2009

Product page consisting of sentences

Montes Alpha M 2009
WINE > CHILE

Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a …

38%

Page 7: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Product information extraction

• Issue 1: How do we know the attributes for a category?

• Issue 2: How do we extract attribute values from full texts?

Product page (unstructured)

Montes Alpha M 2009
WINE > CHILE

Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …

Structured data

Attribute  Value
Type       Red
Region     Chile
Grape      Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Vintage    2009
Company    Montes

Page 8: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Reference: http://item.rakuten.co.jp/redbox/odm3000728/

Attribute name collection

Analyze a large amount of table data to collect the attributes of an object.

Attribute names of Wine

Attribute values
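
One way to picture attribute name collection is to count how often each table header occurs across the product pages of a category and keep the frequent ones. The sketch below assumes that tables have already been parsed into dictionaries; the function `collect_attribute_names`, the `min_ratio` threshold and the sample tables are hypothetical, since the slide does not say how names are selected.

```python
from collections import Counter

def collect_attribute_names(tables, min_ratio=0.5):
    """Return header names that occur in at least `min_ratio` of the tables."""
    counts = Counter()
    for table in tables:
        # Count each header once per table, ignoring surrounding whitespace.
        counts.update({name.strip() for name in table})
    threshold = min_ratio * len(tables)
    # Sort by frequency (descending), then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, n in ranked if n >= threshold]

# Hypothetical tables parsed from product pages in one category (wine).
WINE_TABLES = [
    {"Type": "Red", "Region": "Chile", "Grape": "Merlot"},
    {"Type": "White", "Region": "France", "Shipping": "Free"},
    {"Type": "Red", "Grape": "Syrah", "Region": "Italy"},
]

if __name__ == "__main__":
    print(collect_attribute_names(WINE_TABLES))
```

Headers that appear in only a few tables (here the made-up "Shipping") fall below the threshold and are not treated as attributes of the category.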

Page 9: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Attribute value database (wine)

Columns: ぶどう品種 (Grape variety), 内容量 (Volume), 産地 (Region), 生産者 (Winery), 味わい (Taste)

Grape variety   Volume  Region     Winery            Taste
Chardonnay      750ML   France     Farnese           Dry
Chardonnay100%  720ML   Italy      Mas de Monistrol  Full body
Merlot          375ML   Spain      Leroy             Medium body
Riesling        500ML   Chile      M. Chapoutier     Slightly sweet
Syrah           1500ML  German     Mastroberardino   Sweet
Grenache        360ML   Australia  Santero           Medium dry
Merlot          200ML   America    Saltarelli        Extremely sweet
Tempranillo     3000ML  Bordeaux   Cavicchioli       Medium dry
Sangiovese      1800ML  Champagne  Fontodi           Red Full body
Syrah100%       1000ML  Argentina  Ca'Rugate         Middle sweet

Precision is high, but coverage is low.
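
The trade-off between precision and coverage can be illustrated with plain dictionary matching against such a database. This is a sketch under assumptions: `build_value_database`, `lookup`, the comma splitting of cells and the sample tables are hypothetical and only stand in for the real construction step.

```python
from collections import defaultdict

def build_value_database(tables):
    """Map each attribute name to the set of values seen in table cells."""
    database = defaultdict(set)
    for table in tables:
        for attribute, cell in table.items():
            # A single cell may list several values separated by commas.
            for value in cell.split(","):
                if value.strip():
                    database[attribute].add(value.strip())
    return dict(database)

def lookup(database, text):
    """Dictionary matching: report every known value that appears in the text."""
    lowered = text.lower()
    return sorted((attribute, value)
                  for attribute, values in database.items()
                  for value in values
                  if value.lower() in lowered)

# Hypothetical tables parsed from wine product pages.
TABLES = [
    {"Grape variety": "Chardonnay", "Region": "France"},
    {"Grape variety": "Merlot, Syrah", "Region": "Chile"},
]

if __name__ == "__main__":
    db = build_value_database(TABLES)
    # "Merlot" is in the database and is found; "Petit Verdot" is not, so it is missed.
    print(lookup(db, "A blend of Merlot and Petit Verdot."))
```

Values that never appeared in a table cannot be found by lookup alone, which is why the approach goes on to learn extraction rules from annotated text.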

Page 10: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Product information extraction

• Issue 1: How do we know the attributes for each category?

• Issue 2: How do we extract attribute values from product descriptions?

Page 11: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Unsupervised attribute value extraction - distant supervision approach -

Construction: Semi-structured data → Database
  Database entries: <Region, Margaux>, <Color, White>, …

Annotation: Database → Product page including entries in the database
  Chateau d’Issan 1994: This is a wine from Margaux....

Generation: Annotated product pages → Rule
  Rule: wine from x ⇒ x is a Region

The rule is generated through a machine learning algorithm.

Page 12: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Corpus with attribute-value annotations (wine)

• J: <産地>アルザス</産地>で最も香り豊かと言われるスパイシーで華やかなワイン。
  E: A spicy, gorgeous wine said to be the most aromatic in <production_area> Alsace </production_area>.

• J: 最もお手頃で、<生産者>ドメーヌ・ペゴー</生産者>の美味しさを気軽に楽しめる、とっても嬉しい一本なのです
  E: A very welcome bottle: the most affordable one, letting you easily enjoy the taste of <winery> Domaine Pegau </winery>.

• J: <ぶどう品種>ソーヴィニヨン・ブラン</ぶどう品種>種の特長がよく表れたワイン。
  E: A wine in which the characteristics of the <grape_variety> Sauvignon Blanc </grape_variety> variety are well expressed.

• J: <タイプ>白</タイプ>身魚の塩焼きやシンプルな味付けのソテー、焼き牡蠣、豚のしょうが焼き、ボンゴレビアンコなどと。
  E: Grilled or sautéed <type> white </type> fish, grilled oysters, pork sauté with ginger, vongole bianco, and others.
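
The annotation step of distant supervision can be sketched as string matching between database values and sentences. The following is an assumption-laden illustration: the function `annotate`, the longest-match-first policy and the tiny `DATABASE` are hypothetical, and English sentences stand in for the Japanese corpus.

```python
import re

def annotate(sentence, database):
    """Wrap every database value found in the sentence with its attribute tag."""
    # Map each value to its attribute; on conflicts the last attribute wins.
    value_to_attr = {value: attr
                     for attr, values in database.items()
                     for value in values}
    if not value_to_attr:
        return sentence
    # Try longer values first so a long value beats a shorter overlapping one.
    alternatives = sorted(value_to_attr, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in alternatives))

    def tag(match):
        attr = value_to_attr[match.group(0)]
        return f"<{attr}> {match.group(0)} </{attr}>"

    return pattern.sub(tag, sentence)

# Hypothetical database entries, reusing the English tag names of the corpus.
DATABASE = {
    "production_area": {"Alsace"},
    "grape_variety": {"Sauvignon Blanc"},
    "type": {"white"},
}

if __name__ == "__main__":
    print(annotate("A wine from Alsace.", DATABASE))
    # Matching is purely string based, so "white" is tagged even in "white fish".
    print(annotate("Grilled white fish.", DATABASE))
```

Because no human checks the matches, labels produced this way are noisy, and the learning step has to cope with that noise.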

Page 14: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Extraction rule generation

• Algorithm: Conditional random fields [Lafferty+ 2001]

• Chunk tag: Start/End (IOBES) model [Sekine+ 1998]

• Features:
– Token: Surface form of the token.
– Base: Base form of the token.
– PoS: Part-of-speech tag of the token.
– Char. type: Types of characters in the token.
– Prefix: Double character prefix of the token.
– Suffix: Double character suffix of the token.
– The above features of ±3 tokens surrounding the token.

They are frequently employed in the task of Japanese named entity recognition.
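
The chunk tags and part of the feature set listed above can be sketched in plain Python, as input for any CRF toolkit. This is an illustration under assumptions: `iobes_tags`, `char_type` and `token_features` are hypothetical names, the character-type classes are a simplification, "double character" is read here as the first or last two characters, and the Base and PoS features are left out because they need a morphological analyzer.

```python
def iobes_tags(tokens, spans):
    """Tag tokens with IOBES labels.

    `spans` is a list of (start, end, label) token ranges, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"          # single-token value
        else:
            tags[start] = f"B-{label}"          # beginning of the value
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # inside the value
            tags[end - 1] = f"E-{label}"        # end of the value
    return tags

def char_type(token):
    """Coarse character-type signature of a token (a simplified stand-in)."""
    kinds = set()
    for ch in token:
        if ch.isdigit():
            kinds.add("digit")
        elif ch.isalpha():
            kinds.add("alpha")
        else:
            kinds.add("other")
    return "+".join(sorted(kinds))

def token_features(tokens, i, window=3):
    """Features of token i plus those of up to `window` tokens on each side."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            features[f"{offset}:token"] = tok
            features[f"{offset}:char_type"] = char_type(tok)
            features[f"{offset}:prefix"] = tok[:2]
            features[f"{offset}:suffix"] = tok[-2:]
    return features

if __name__ == "__main__":
    tokens = ["wine", "from", "Sauvignon", "Blanc", "2009"]
    print(iobes_tags(tokens, [(2, 4, "Grape"), (4, 5, "Vintage")]))
    print(token_features(tokens, 0))
```

Each token then contributes one feature dictionary and one tag, and a sequence of such pairs forms one training example for the sequence labeler.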

Page 16: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Unsupervised attribute value extraction - distant supervision approach -

Terre di matraja Bianco 2012

This is a wine from Tuscany....

Apply

Rule: wine from x ⇒ x is a Region
Rule: 1800 < x <= 2013 ⇒ x is a Vintage

Attribute  Value
Region     Tuscany
Vintage    2012
Grape      Chardonnay
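
To show what applying the two rules amounts to, here is a hand-written approximation in Python. In the talk the rules come from a trained model; the regular expressions and the functions `extract_region`, `extract_vintage` and `apply_rules` below are hypothetical stand-ins, and the reading of x as a run of capitalized words is an assumption.

```python
import re

def extract_region(text):
    """Rule: 'wine from x' => x is a Region (x = the following capitalized words)."""
    match = re.search(r"wine from ((?:[A-Z][\w'-]*)(?: [A-Z][\w'-]*)*)", text)
    return match.group(1) if match else None

def extract_vintage(text):
    """Rule: 1800 < x <= 2013 => x is a Vintage (first four-digit number in range)."""
    for number in re.findall(r"\b\d{4}\b", text):
        if 1800 < int(number) <= 2013:
            return number
    return None

def apply_rules(title, description):
    """Apply both rules and return the attribute-value pairs that fired."""
    result = {}
    region = extract_region(description)
    if region:
        result["Region"] = region
    vintage = extract_vintage(f"{title} {description}")
    if vintage:
        result["Vintage"] = vintage
    return result

if __name__ == "__main__":
    print(apply_rules("Terre di matraja Bianco 2012",
                      "This is a wine from Tuscany...."))
```

Only the two rules shown on the slide are covered here, so the Grape value in the table would need a further rule.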

Page 17: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Performance (F-score)

[Bar chart: F-score for the Wine and Shampoo categories, with ML and without ML. Values shown: 60.1 pt., 71.5 pt., 43.8 pt., 24.1 pt.]
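
For readers unfamiliar with the metric, the sketch below computes precision, recall and the balanced F-score over attribute-value pairs. It assumes the balanced (F1) variant and exact matching of pairs, neither of which the slide states; the function `f_score` and the sample pairs are hypothetical and unrelated to the reported numbers.

```python
def f_score(extracted, gold):
    """Precision, recall and F-score over sets of (attribute, value) pairs."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical system output and gold annotations for one product page.
EXTRACTED = [("Region", "Tuscany"), ("Vintage", "2012"), ("Type", "White")]
GOLD = [("Region", "Tuscany"), ("Vintage", "2012"),
        ("Grape", "Chardonnay"), ("Type", "Red")]

if __name__ == "__main__":
    precision, recall, f1 = f_score(EXTRACTED, GOLD)
    print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```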

Page 18: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Wine / Japanese

An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany.

Type           Red
Grape variety  Sangiovese
Region         Italy, Tuscany

Page 19: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Shampoo / Japanese

"MCH Natural shampoo 1000ml" is a shampoo consisting of cypress oil and charcoal.

Category      Shampoo
Product name  MCH Natural shampoo 1000ml
Ingredient    Cypress oil, Charcoal

Page 20: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Video game / French

Product type  Nintendo 64, Nintendo DS
Saga          Mario

Page 21: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Conclusion

• Developing a technique for extracting product information from unstructured data.
– Independent of any category and language.

• Useful services can be realized on structured product data.

• Our paper is available on the web.
– ACL anthology: http://aclweb.org/anthology//I/I13/

Page 22: [RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions

Thank you for listening!