Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain
Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse
AxIS Research Team, INRIA Sophia Antipolis and Rocquencourt
Motivations
To show, on the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge:
- the benefits of our InterSite pre-processing method, proposed by Tanasa in his PhD thesis (2005), and
- the benefits of a new crossed clustering method, developed by Lechevallier & Verde (published in 2003 and 2004), applied to Web logs.
2 main viewpoints: user and Web site load
Plan
1. Intersite Data Pre-Processing
   - introduction of user’s intersite visit: « Group of SessionIDs »
   - first statistical Intersite analysis
2. Crossed Clustering Approach
   - confusion table with classes of time periods and classes of product types
   - analysis on the most used shop: shop 4
3. Conclusions
Data pre-processing
Initial data:

Table 1. Format of page requests

| ShopID | Date | IP address | SessionID | Page | Referrer |
|---|---|---|---|---|---|
| 11 | 1074585663 | 213.151.91.186 | 939dad92c4…84208dca | / | |
| 11 | 1074585670 | 213.151.91.186 | 87ee02ddcff…7655bb9e | /ct/?c=148 | http://www.shop2.cz |

Table 2. Number of requests per shop

| ShopID | Site name (shop) | #Requests |
|---|---|---|
| 10 | www.shop1.cz | 509,688 |
| 11 | www.shop2.cz | 400,045 |
| 12 | www.shop3.cz | 645,724 |
| 14 | www.shop4.cz | 1,290,870 |
| 15 | www.shop5.cz | 308,367 |
| 16 | www.shop6.cz | 298,030 |
| 17 | www.shop7.cz | 164,447 |
Data pre-processing
Tanasa & Trousse (IEEE Intelligent Systems, 2004)
Tanasa’s thesis (2005)
Table 3. Transformed log lines

| Datetime | IP | SessionID | URL | Referrer |
|---|---|---|---|---|
| 2004-01-20 09:01:03 | 213.151.91.186 | 939dad92c4…84208dca | http://www.shop2.cz/ | - |
| 2004-01-20 09:01:10 | 213.151.91.186 | 87ee02ddcff…7655bb9e | http://www.shop2.cz/ct/?c=148 | http://www.shop2.cz/ |
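
The transformation from the raw request format (Table 1) to the transformed log line (Table 3) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' tool: the function name `transform`, the fixed UTC+1 offset (which reproduces the two example timestamps) and the placeholder SessionIDs are hypothetical, while the ShopID to site mapping is taken from Table 2.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit

# ShopID -> site name, as listed in Table 2.
SITE_BY_SHOP = {
    10: "www.shop1.cz", 11: "www.shop2.cz", 12: "www.shop3.cz",
    14: "www.shop4.cz", 15: "www.shop5.cz", 16: "www.shop6.cz",
    17: "www.shop7.cz",
}

# Assumption: timestamps are rendered in a fixed UTC+1 offset (no DST handling).
LOCAL_TZ = timezone(timedelta(hours=1))


def transform(shop_id, epoch, ip, session_id, page, referrer=""):
    """Turn one raw request (Table 1 layout) into a transformed log line (Table 3 layout)."""
    when = datetime.fromtimestamp(epoch, tz=LOCAL_TZ).strftime("%Y-%m-%d %H:%M:%S")
    url = f"http://{SITE_BY_SHOP[shop_id]}{page}"
    if not referrer:
        referrer = "-"            # an empty referrer is written as "-"
    elif urlsplit(referrer).path == "":
        referrer += "/"           # normalise "http://host" to "http://host/"
    return (when, ip, session_id, url, referrer)


# Placeholder SessionIDs ("sid-a", "sid-b") stand in for the truncated ones in the tables.
line1 = transform(11, 1074585663, "213.151.91.186", "sid-a", "/")
line2 = transform(11, 1074585670, "213.151.91.186", "sid-b", "/ct/?c=148",
                  "http://www.shop2.cz")
```

With these assumptions, `line1` and `line2` carry the same Datetime, URL and Referrer values as the two rows of Table 3.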
Data pre-processing
• Data structuration
  - A SessionID corresponds to a single visit on one shop.
  - Towards the notion of a user’s intersite visit: we group the SessionIDs that belong to a single user (same IP) into a « Group of SessionIDs ». To do so, we compare the Referrer with the URLs previously accessed (within a reasonable time window).
  - 522,410 SessionIDs grouped into 397,629 Groups, equivalent to a 23.88% reduction.
• Data fusion, data cleaning
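
The grouping rule described above (same IP, Referrer matching a URL accessed shortly before) could look roughly like the sketch below. It is a simplified illustration, not the method from Tanasa's thesis: the function name `group_session_ids`, the tuple layout and the 1800-second window are assumptions, since the slides only mention "a reasonable time window".

```python
from collections import defaultdict


def group_session_ids(requests, window=1800):
    """Assign each SessionID to a group id.

    requests: iterable of (timestamp, ip, session_id, url, referrer),
              sorted by timestamp (seconds).
    window:   assumed maximum gap, in seconds, between the earlier access
              and the first request of a new SessionID.
    """
    group_of = {}                  # session_id -> group id
    history = defaultdict(list)    # ip -> [(timestamp, url, session_id)]
    next_group = 0
    for ts, ip, sid, url, referrer in requests:
        if sid not in group_of:
            # Look for a recent request from the same IP whose URL is the
            # referrer of this new SessionID's first request.
            match = next(
                (s for t, u, s in reversed(history[ip])
                 if ts - t <= window and u == referrer),
                None,
            )
            if match is not None:
                group_of[sid] = group_of[match]
            else:
                group_of[sid] = next_group
                next_group += 1
        history[ip].append((ts, url, sid))
    return group_of


# Hypothetical requests: "b" follows a link from a page seen in "a" (same IP),
# "c" comes from another IP, "d" arrives long after the time window.
sample = [
    (0, "10.0.0.1", "a", "http://www.shop2.cz/", "-"),
    (7, "10.0.0.1", "b", "http://www.shop4.cz/ct/?c=1", "http://www.shop2.cz/"),
    (10, "10.0.0.2", "c", "http://www.shop4.cz/", "http://www.shop2.cz/"),
    (5000, "10.0.0.1", "d", "http://www.shop1.cz/", "http://www.shop2.cz/"),
]
groups = group_session_ids(sample)
```

Here four SessionIDs fall into three groups, the same kind of reduction as the one reported for the full dataset.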
Relational DB model
Data summarisation
[Figure: number of visits per hour of the day (0 to 23), one curve per weekday (Monday to Sunday). Panel (a): y-axis Visits, 0 to 6000. Panel (b): y-axis Groups, 0 to 300.]
Fig. 1. Visits per days and hours: (a) globally, (b) multi-shop
Data pre-processing
• Low number of new visits on Saturdays and Sundays during lunch time
• High number of new visits on Tuesdays and Wednesdays
• Same results in (a) and (b)
Crossed Clustering Approach for Time Periods/Product Analysis
Data: selection of the /ls pages of shop 4 (the most used shop)
Method developed by Yves Lechevallier & Rosanna Verde (2003,2004)
[Figure: number of accesses per shop (ShopIDs 10, 11, 12, 14, 15, 16, 17; y-axis Access, 0 to 1,400,000), broken down by page type: /ct, /ls, /dt, /znacka, /akce, others]
Crossed Clustering Approach for Time Periods/Product Analysis
Relational DB model: a crossed table is easily added.
Row: an individual (one weekday, one hour); 7 days × 24 hours = 168 individuals
Column: a multi-categorical variable representing the number of products requested by users in the given time slice
Crossed Clustering Approach for Time Periods/Product Analysis

Table 4. Quantity of products requested by weekday x hour and registered on shop 4

| Weekday x Hour | Product (number of requests) |
|---|---|
| Monday_0 | Built-in electric hobs (10), Built-in dish washers 60cm (64), Corner single sinks (50), ... |
| Monday_1 | Free standing combi refrigerators (44), Corner single sinks (50), Built-in hoods (60), ... |
| … | … |
| Sunday_22 | Built-in microwave ovens (27), Built-in dish washers 45cm (38), Built-in dish washers 60cm (85), ... |
| Sunday_23 | Built-in freezers (56), Kitchen taps with shower (45), Garbage disposers (32), ... |
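
A crossed table of this shape (one row per weekday and hour, product counts in the columns) can be built from timestamped product requests along these lines. This is a minimal sketch, not the authors' relational model: the function name `build_crossed_table`, the input layout and the fixed UTC+1 offset are assumptions, and the sample requests are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumption: a fixed UTC+1 offset is used to derive weekday and hour.
LOCAL_TZ = timezone(timedelta(hours=1))


def build_crossed_table(requests):
    """Count requested product types per 'Weekday_Hour' individual.

    requests: iterable of (epoch_seconds, product_type) pairs.
    Returns a mapping 'Weekday_Hour' -> Counter(product_type -> requests).
    """
    table = defaultdict(Counter)
    for epoch, product_type in requests:
        when = datetime.fromtimestamp(epoch, tz=LOCAL_TZ)
        individual = f"{WEEKDAYS[when.weekday()]}_{when.hour}"
        table[individual][product_type] += 1
    return table


# Invented sample: two requests on Tuesday 09:xx, one an hour later.
sample = [
    (1074585663, "Built-in hoods"),
    (1074585670, "Built-in hoods"),
    (1074585663 + 3600, "Corner single sinks"),
]
crossed = build_crossed_table(sample)
```

On the full /ls selection of shop 4 the table would contain at most 7 × 24 = 168 such individuals.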
Crossed Clustering Approach for Time Period/Product Analysis

Table 5. Confusion table

| | Product_1 | Product_2 | Product_3 | Product_4 | Product_5 | Total |
|---|---|---|---|---|---|---|
| Period_1 | 2847 | 5084 | 3284 | 2265 | 2471 | 15951 |
| Period_2 | 11305 | 31492 | 12951 | 1895 | 9610 | 67253 |
| Period_3 | 33107 | 55652 | 36699 | 5345 | 20370 | 151173 |
| Period_4 | 22682 | 46322 | 30200 | 5165 | 27659 | 132028 |
| Period_5 | 9576 | 20477 | 19721 | 2339 | 7551 | 59664 |
| Period_6 | 1783 | 3515 | 2549 | 392 | 11240 | 19479 |
| Period_7 | 15019 | 14297 | 8608 | 1397 | 6014 | 45335 |
| Total | 96319 | 176839 | 114012 | 18798 | 84915 | 490883 |

Highlighted cell: Period_6 / Product_5, i.e. 57.7% of the Period_6 total.
Crossed Clustering Approach for Time Period/Product Analysis
Example of one surprising result:
- the class Product 5 is defined by a single product type, « Free standing combi refrigerators »
- this product type is consulted predominantly on Fridays from 17:00 to 20:00 (class Period 6)
- 57.7% of the requests in this period class (11,240 of 19,479) concern this product type
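
The 57.7% figure can be re-derived from the Period 6 row of Table 5. The sketch below simply recomputes row and column shares from the published counts; the helper names `row_share` and `column_share` are hypothetical.

```python
# Counts copied from Table 5 (columns Product_1 .. Product_5).
TABLE5 = {
    "Period_1": [2847, 5084, 3284, 2265, 2471],
    "Period_2": [11305, 31492, 12951, 1895, 9610],
    "Period_3": [33107, 55652, 36699, 5345, 20370],
    "Period_4": [22682, 46322, 30200, 5165, 27659],
    "Period_5": [9576, 20477, 19721, 2339, 7551],
    "Period_6": [1783, 3515, 2549, 392, 11240],
    "Period_7": [15019, 14297, 8608, 1397, 6014],
}


def row_share(table, period, product_index):
    """Percentage of a period class's requests that fall in one product class."""
    row = table[period]
    return 100 * row[product_index] / sum(row)


def column_share(table, period, product_index):
    """Percentage of a product class's requests that fall in one period class."""
    column_total = sum(row[product_index] for row in table.values())
    return 100 * table[period][product_index] / column_total


p6_p5_row = round(row_share(TABLE5, "Period_6", 4), 1)        # 57.7
p6_p5_col = round(column_share(TABLE5, "Period_6", 4), 1)     # 13.2
```

The row share (57.7%) matches the highlighted value, whereas the column share is only 13.2%, so the percentage reads as a share of the Period 6 requests rather than a share of all Product 5 requests.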
Conclusions
1. Intersite Data Pre-Processing
   - structuration into user’s intersite visits: « Group of SessionIDs »
   - first statistical Intersite analysis
   - anomalies and recommendations for the dataset
2. Crossed Clustering Approach
   - first application of such a method on time periods of Web logs and in the e-commerce domain
   - promising results
Data pre-processing
Inconsistency problems:
- table kategorie: repeated entries, and different entries with the same ID
- for some page types (dt, df), the given parameter actually represented a specific product, not the product description from the products table
- extra parameters equivalent to the given ones for some page types: e.g. for the ct page type, id is equivalent to the given c parameter
- missing values (descriptions) in tables: 3 values in the product table and 64 in the category table
- multiple-site SessionIDs: 13 cross-server visits had the same SessionID on the visited sites (up to 4 sites), whereas the SessionID should change on each new site
- multiple-IP SessionIDs: 3690 visits (SessionIDs) were made from more than one IP (anonymization proxies?)
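
The last two anomalies (SessionIDs seen on several sites or from several IPs) are straightforward to detect once requests are loaded. The following is a small sketch of such a check, assuming requests are available as (ShopID, IP, SessionID) tuples; the function name `find_session_anomalies` and the sample data are hypothetical.

```python
from collections import defaultdict


def find_session_anomalies(requests):
    """Return (multi_site, multi_ip) lists of suspicious SessionIDs.

    requests: iterable of (shop_id, ip, session_id) tuples.
    """
    shops = defaultdict(set)   # session_id -> set of ShopIDs it appears on
    ips = defaultdict(set)     # session_id -> set of IPs it was used from
    for shop_id, ip, session_id in requests:
        shops[session_id].add(shop_id)
        ips[session_id].add(ip)
    multi_site = sorted(s for s, seen in shops.items() if len(seen) > 1)
    multi_ip = sorted(s for s, seen in ips.items() if len(seen) > 1)
    return multi_site, multi_ip


# Invented sample: "s1" appears on two shops, "s2" is used from two IPs.
sample = [
    (11, "10.0.0.1", "s1"),
    (14, "10.0.0.1", "s1"),
    (14, "10.0.0.2", "s2"),
    (14, "10.0.0.3", "s2"),
    (12, "10.0.0.4", "s3"),
]
multi_site, multi_ip = find_session_anomalies(sample)
```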