George Moiseev - Classification of E-commerce Websites by Product Categories

Classification of E-commerce Websites by Product Categories. Case Study. Moiseev George, Higher School of Economics, Faculty of Computer Science. Higher School of Economics, Moscow, 2016. www.hse.ru



Classification of E-commerce Websites by Product Categories

Case Study

Moiseev George Higher School of Economics

Faculty of Computer Science

Higher School of Economics, Moscow, 2016

www.hse.ru


Outline

• Introduction

• Preprocessing

• Feature extraction

• Classification and evaluation

• Experimental results


Problem Statement

• Retrieve e-commerce websites (e-shops)

• Classify e-shops by the type of product they sell

*Customer-to-customer websites are not counted as e-commerce shops


Applications

• Market research

• Statistics gathering

• Organizing a knowledge base

• Goods search


Dataset

The dataset was provided by datainsight.ru

It contains two training subsets labeled by experts:

1. 1312 e-commerce and 1077 non-e-commerce websites

2. 1448 websites in 15 product categories


Preprocessing

Downloading a website:

• Start from the main page

• Download every internal hyperlink on a page that has not been downloaded before

• Check whether an identical page has already been downloaded through another hyperlink

What should be saved from pages other than the main page:

1. Nothing

2. Only metadata

3. Everything
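
The download procedure above can be illustrated with a minimal sketch. It is not the author's crawler: the `fetch` callable, the `SITE` dictionary and the example URLs are hypothetical stand-ins for real HTTP requests, and duplicate pages are detected here by hashing the full HTML, which is only one possible way to check for equal pages.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first crawl of internal links, skipping duplicate pages.

    `fetch` is a hypothetical callable mapping a URL to HTML text (or None).
    """
    host = urlparse(start_url).netloc
    seen_urls = {start_url}
    seen_hashes = set()
    pages = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # same content reached through another link
            continue
        seen_hashes.add(digest)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]
            # Follow only internal links that were not queued before.
            if urlparse(target).netloc == host and target not in seen_urls:
                seen_urls.add(target)
                queue.append(target)
    return pages


# Toy in-memory "website" standing in for real HTTP requests (assumption).
SITE = {
    "http://shop.example/": '<a href="/a">A</a><a href="/b">B</a>'
                            '<a href="http://other.example/">X</a>',
    "http://shop.example/a": '<p>Phones</p><a href="/">home</a>',
    "http://shop.example/b": '<p>Phones</p><a href="/">home</a>',  # same as /a
}
pages = crawl("http://shop.example/", SITE.get)
print(sorted(pages))
```

In this toy run the external link is ignored and `/b` is dropped because its content equals that of `/a`.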


Preprocessing

Each webpage is stored in two versions:

• Raw page:

– Only JavaScript and obvious advertisements are removed

• Cleaned page:

– Extract only the text content of markup tags

– Tokenization: split the text into sentences and words

– Stemming: reduce words to their root or base form

– Lowercase conversion

– Stopword filtering
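
A rough sketch of the "cleaned page" steps follows, using only the Python standard library. The stopword list is a tiny illustrative sample, `naive_stem` is a crude stand-in for a real stemmer (the slides do not say which one was used), and sentence splitting is omitted for brevity.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "for", "our"}


class TextExtractor(HTMLParser):
    """Keeps text found inside markup tags, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_page(html):
    """Extract tag text, lowercase, tokenize, drop stopwords, stem."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    tokens = re.findall(r"\w+", text)
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


html = ("<html><script>var x = 1;</script><h1>Buy Phones</h1>"
        "<p>The best phones and cases in our shop</p></html>")
print(clean_page(html))
```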


Feature Extraction

There are many methods and models for automatic text feature extraction:

• Bag of words

• n-grams

• word2vec

• TF-IDF (shown in the picture)

• Mutual information

• Chi-square

• …
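
As a small illustration of the first two items in the list, the sketch below builds a bag of words and bigram counts from a token list. The tokens are made up for the example and are not taken from the dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokens from a cleaned page.
tokens = ["buy", "phone", "best", "phone", "case"]

bag_of_words = Counter(tokens)        # term -> number of occurrences
bigrams = Counter(ngrams(tokens, 2))  # (term, next term) -> occurrences

print(bag_of_words.most_common(1))
print(sorted(bigrams))
```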


Feature extraction

Proposed approach:

The term weighting formula for the i-th term in the k-th website is derived from TF-IDF as follows:

$$W_{ik} = \frac{tf_{ik} \log\frac{N}{n_i}}{\sqrt{\sum_{j=1}^{N} \left(tf_{ij} \log\frac{N}{n_j}\right)^2}}$$

where n_i is the number of websites in which the i-th term appears, N is the total number of websites in the sample, and tf_{ik} is computed as:

$$tf_{ik} = \sum_{t}^{T} w(t)\, f(i, k, t)$$

where w(t) is inversely proportional to the frequency of tag t, and f(i, k, t) is the frequency of the i-th term in the t-th tag of the k-th website.
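
The tag-weighted scheme described above might be sketched as follows. Several choices here are assumptions rather than the author's implementation: w(t) is taken as one over the total number of tokens observed inside tag t, each website is represented as a hypothetical `{tag: [tokens]}` dictionary, and the weights of each website are normalized by the Euclidean norm over its own terms.

```python
import math
from collections import Counter, defaultdict


def tag_weights(sites):
    """w(t) = 1 / (total tokens inside tag t); one possible reading of
    'inversely proportional to the frequency of tag t' (assumption)."""
    totals = Counter()
    for site in sites:
        for tag, tokens in site.items():
            totals[tag] += len(tokens)
    return {tag: 1.0 / count for tag, count in totals.items() if count}


def weighted_tfidf(sites):
    """Return one {term: weight} dictionary per website."""
    n_sites = len(sites)
    w = tag_weights(sites)

    # tf_ik = sum over tags t of w(t) * f(i, k, t)
    tf = []
    for site in sites:
        tf_k = defaultdict(float)
        for tag, tokens in site.items():
            for term, freq in Counter(tokens).items():
                tf_k[term] += w[tag] * freq
        tf.append(dict(tf_k))

    # n_i = number of websites in which term i appears
    n = Counter(term for tf_k in tf for term in tf_k)

    vectors = []
    for tf_k in tf:
        raw = {term: value * math.log(n_sites / n[term])
               for term, value in tf_k.items()}
        # Assumed normalization: Euclidean norm over this website's terms.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({term: (v / norm if norm else 0.0)
                        for term, v in raw.items()})
    return vectors


# Hypothetical two-site sample.
sites = [
    {"title": ["phone", "shop"], "p": ["phone", "case", "delivery"]},
    {"title": ["book", "shop"], "p": ["book", "delivery", "novel"]},
]
vectors = weighted_tfidf(sites)
print({term: round(v, 3) for term, v in vectors[0].items()})
```

Terms that occur on every website, such as "shop" in this sample, receive a weight of zero because log(N / n_i) vanishes, while a term that also appears in the title outweighs one found only in body text.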


Classification and evaluation

• A Support Vector Machine is used as the classifier.

• Multiclass classification is performed in a “one-vs-all” manner.

• Precision, recall and F-score are used for evaluation.

• The overall performance of product type classification is measured by the average F-score across all categories.
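
The evaluation measures above can be computed as in the following sketch. It assumes that "F-score" means the balanced F1 measure and that the average is an unweighted mean over categories; the category labels and predictions are invented for the example.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def average_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 over the categories in y_true."""
    labels = sorted(set(y_true))
    scores = [precision_recall_f1(y_true, y_pred, label)[2] for label in labels]
    return sum(scores) / len(scores)


# Hypothetical expert labels and classifier predictions.
y_true = ["books", "books", "toys", "toys", "food", "food"]
y_pred = ["books", "toys", "toys", "toys", "food", "books"]

for label in sorted(set(y_true)):
    print(label, precision_recall_f1(y_true, y_pred, label))
print("average F1:", average_f1(y_true, y_pred))
```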


Results

F-score of the e-commerce class in binary classification:

Website information used                      Pure TF-IDF   TF-IDF with tag weighting
Only main page                                0.85          0.89
Main page + meta and title from other pages   0.89          0.94
Main page + whole other pages                 0.86          0.92


Results

Average F-score of e-shop categorization by type of product sold:

Website information used                      Pure TF-IDF   TF-IDF with tag weighting
Only main page                                0.67          0.72
Main page + meta and title from other pages   0.74          0.79
Main page + whole other pages                 0.73          0.81




Moiseev George

[email protected]