A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

Page 1: A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

ENTER 2015 Research Track Slide Number 1

A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

Estela Mariné-Roig and Salvador Anton Clavé

Research Group on Territorial Analysis and Tourism Studies (GRATET)

University Rovira i Virgili, Catalonia, Spain

[email protected]

[email protected]

http://www.urv.cat/en_index.html

ENTER 2015 Research Track Slide Number 2

Introduction and aim

UGC data are a good source of information for DMOs, stakeholders and tourists.

Travel blogs and online travel reviews (OTRs) record the first-hand experiences of travellers. They have mostly been analysed with content analysis and narrative analysis (Banyai & Glover, 2012) in the areas of service quality, destination image and reputation, UGC, experiences and behaviour, and mobility patterns (Lu & Stepchenkova, 2014).

Such UGC data have grown exponentially in recent years, and handling them is now considered to require Big Data technologies.

However, in most studies of UGC data the collection is done “by hand” (Lu & Stepchenkova, 2014), which is usually non-random, very time-consuming and non-representative.

This article aims to propose a method for semi-automatic downloading, arranging, cleaning, debugging, and analysing large-scale travel blog and OTR data.

ENTER 2015 Research Track Slide Number 3

Web mining background

Web mining uses data mining techniques to find useful information or to extract knowledge from the hyperlink structure and content of webpages (Liu, 2011).

To automate the extraction process, a web crawler programme is needed first, capable of following the hyperlink structure and downloading the linked webpages.

There is abundant literature on data mining related to tourism and some on massive downloads.
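
As an illustration of the crawler idea described above, the following is a minimal sketch of a breadth-first crawler with a depth limit. It is not the tool used in the study: the names `LinkExtractor`, `crawl` and `fetch`, and the in-memory toy site at `example.org`, are assumptions made so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_level=None):
    """Breadth-first crawl from start_url.

    fetch(url) must return the page HTML, or None if the page is unavailable.
    max_level=0 downloads only the start page; None means no level limit.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, level = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        pages[url] = page
        if max_level is not None and level >= max_level:
            continue  # do not follow links beyond the level limit
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            target = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return pages


# Hypothetical toy site held in memory instead of real HTTP requests.
SITE = {
    "http://example.org/": '<a href="/blog/1.html">one</a> <a href="/blog/2.html">two</a>',
    "http://example.org/blog/1.html": '<a href="/blog/3.html">three</a>',
    "http://example.org/blog/2.html": "no links here",
    "http://example.org/blog/3.html": "a page two levels down",
}

pages = crawl("http://example.org/", SITE.get, max_level=1)
```

With `max_level=1` the sketch downloads the start page and the pages it links to directly, but not the page two levels down. A real crawler would replace `SITE.get` with an HTTP client and add politeness delays and error handling.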

ENTER 2015 Research Track Slide Number 4

Methodology

Abburu and Babu (2013) propose a framework for web data extraction and analysis based on three basic steps: finding URLs of webpages, extracting information from webpages, and data analysis.

This system architecture is divided into three modules: web crawling, information extraction, and mining.

In this research we add cleaning and debugging phases to eliminate the noise present in the webpages, so that the content analysis phase starts from quality information in the original HTML format. The resulting webpages contain only what the user wrote.

The methodology is applied to the case of Catalonia to analyse about 85,000 travel diaries created between 2004 and 2013.

ENTER 2015 Research Track Slide Number 5

Destination selected for the case study (Catalonia)

Attributes:
• Millenary history
• Mediterranean destination
• Bathed by 580 km of shoreline
• Own culture and language (Catalan)
• Wealthy historical and natural heritage
• Third European region (overnight stays)
• Foreign tourists in 2013: 15,631,500
• Nine regional tourism brands:

Tourist brand          Abbr.
Barcelona              Barna
Costa Barcelona        cBarc
Costa Brava            cBrav
Costa Daurada          cDaur
Paisatges Barcelona    pBarc
Pirineus               Pyren
Terres de l’Ebre       tEbre
Terres de Lleida       tLlei
Val d’Aran             vAran
(unclassified)         unCla

ENTER 2015 Research Track Slide Number 6

Selection of the most suitable websites hosting UGC data

Weighted formula applied (Marine-Roig, 2014a): TBRH = 1*B(V) + 1*B(P) + 2*B(S)
o Borda count (B): Method that ranks options in order of preference

Webometrics:
o Visibility (V):
  • Indexed pages in search engines (Google.com, Bing.com)
  • Link-based ranks (Google page rank PR, Yandex topical citation index CY)
o Popularity (P): Visit-based ranks (Alexa.com, Compete.com, Quantcast.com)
o Size (S): Number of UGC entries related to the case study

Websites hosting UGC data selected:
o 1st TripAdvisor.com (TA): Hosts online travel reviews (OTRs)
o 2nd VirtualTourist.com (VT): Hosts travel blogs, travelogues and OTRs
o 3rd TravelBlog.org (TB): Hosts travel blogs
o 4th TravelPod.com (TP): Hosts travel blogs and a few OTRs
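
The weighted formula above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which Borda variant is used or how the individual indicators are combined into one ordering per criterion, so the sketch assumes the common variant in which the best of n options gets n-1 points and the worst gets 0, and it takes one already-decided ordering per criterion as input. The sites "A", "B", "C" and the function names are hypothetical.

```python
def borda_points(ordering):
    """Borda count for one criterion: best of n options gets n-1 points, worst gets 0."""
    n = len(ordering)
    return {site: n - 1 - position for position, site in enumerate(ordering)}


def tbrh_scores(visibility, popularity, size, weights=(1, 1, 2)):
    """TBRH = 1*B(V) + 1*B(P) + 2*B(S); each argument lists the sites from best to worst."""
    totals = {}
    for weight, ordering in zip(weights, (visibility, popularity, size)):
        for site, points in borda_points(ordering).items():
            totals[site] = totals.get(site, 0) + weight * points
    return totals


# Hypothetical orderings for three made-up sites, best first.
scores = tbrh_scores(
    visibility=["A", "B", "C"],
    popularity=["B", "A", "C"],
    size=["A", "C", "B"],
)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because size carries a weight of 2, a site that leads on the number of relevant entries can outrank one that is slightly more visible or popular.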

ENTER 2015 Research Track Slide Number 7

Webometrics of the top four websites hosting travel diaries

                     TA           TB         TP         VT
Indexed pages
  Google.com         18,600,000   478,000    759,000    1,120,000
  Bing.com           23,800,000   320,000    448,000    415,000
Link-based rank
  Google PR          8            6          6          7
  Yandex CY          1,600        110        350        375
Visit-based rank
  Compete.com        51           38,742     11,824     2,500
  Quantcast.com      127          36,067     9,279      2,065
  Alexa.com          182          21,123     21,324     4,156
Size
  Entries            72,874       2,988      2,116      7,791
TBRH
  Rank               1            3          4          2

ENTER 2015 Research Track Slide Number 8

Gathering process on websites

Filters:
o Level (0, 1, ... no level limit)

Inclusive / exclusive
o URL
  • Protocol (HTTP, FTP, ...)
  • Server
  • Domain
  • Directories (folders)
  • Filename
  • File type (html, jpg, ...)
o Content. Search
  • for all keywords
  • for exact word sequence
  • inside HTML tags

Simplified flow diagram of the downloading process: (figure not reproduced in this transcript)
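
The URL and content filters listed above can be sketched as follows. This is a simplified illustration, not the downloader used in the study: the function names are hypothetical, the domain component is left out, and the "inside HTML tags" search is not covered.

```python
import posixpath
from urllib.parse import urlparse


def url_parts(url):
    """Split a URL into the components that the filters act on."""
    parsed = urlparse(url)
    directories, filename = posixpath.split(parsed.path)
    filetype = posixpath.splitext(filename)[1].lstrip(".").lower()
    return {
        "protocol": parsed.scheme,
        "server": parsed.hostname or "",
        "directories": directories,
        "filename": filename,
        "filetype": filetype,
    }


def url_passes(url, include=None, exclude=None):
    """Inclusive filters must all match; any matching exclusive filter rejects the URL.

    include and exclude map a component name to a set of values.
    """
    parts = url_parts(url)
    for component, allowed in (include or {}).items():
        if parts[component] not in allowed:
            return False
    for component, banned in (exclude or {}).items():
        if parts[component] in banned:
            return False
    return True


def content_passes(text, keywords=(), phrase=None):
    """Keep a page only if it has all keywords and, if given, the exact word sequence."""
    lowered = text.lower()
    if not all(keyword.lower() in lowered for keyword in keywords):
        return False
    return phrase is None or phrase.lower() in lowered


# Hypothetical URLs and filter settings.
keep_page = url_passes(
    "http://www.example.org/blog/catalonia/entry1.html",
    include={"protocol": {"http", "https"}, "filetype": {"html", "htm"}},
)
keep_image = url_passes(
    "http://www.example.org/img/photo.jpg",
    exclude={"filetype": {"jpg", "png", "gif"}},
)
```

In a crawler, `url_passes` would be applied before a link is queued and `content_passes` after a page is downloaded.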

ENTER 2015 Research Track Slide Number 9

UGC data arrangement

Structure of folders and files:

root\website\brand\destination\date_lang_[isFrom]_pageName_[theme].htm
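
A minimal sketch of how file paths following this pattern could be built and later grouped. It assumes that the square brackets mark optional fields; the function name and the field values in the example (date format, language code, origin, theme) are hypothetical, since the slide does not specify them.

```python
from pathlib import PureWindowsPath


def diary_path(root, website, brand, destination, date, lang, page_name,
               is_from=None, theme=None):
    """Build root\\website\\brand\\destination\\date_lang_[isFrom]_pageName_[theme].htm.

    is_from and theme are treated as optional (an assumption about the brackets).
    """
    fields = [date, lang]
    if is_from:
        fields.append(is_from)
    fields.append(page_name)
    if theme:
        fields.append(theme)
    return PureWindowsPath(root, website, brand, destination, "_".join(fields) + ".htm")


# Hypothetical diary entry.
path = diary_path("root", "TA", "Barna", "Barcelona",
                  date="20130715", lang="en", page_name="SagradaFamilia",
                  is_from="UK", theme="sights")

# The folder levels can be read back from the path for classification.
website, brand, destination = path.parts[1:4]
```

Because website, brand and destination are folder levels, and date and language lead the file name, ordinary file utilities (sorting, wildcards, folder selection) are enough to select a subset of diaries.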

ENTER 2015 Research Track Slide Number 10

UGC data cleaning

Aims:

The cleaning and debugging phases are essential to be able to obtain quality information, limited to the web content as written and posted by the diary author, and overcoming the most significant errors.

Before: 52 KB. After: 2 KB (both without pictures).

Sample of removed HTML elements:
• <meta ... />
• <form ... </form>
• <iframe ... </iframe>
• <div id="header">... </div>
• <!-- [comment] -->
• <div id="comment">... </div>
• <div id="footer">... </div>
• <script type ... </script>
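
The removal of the elements sampled above could look roughly like the sketch below, which uses only the Python standard library. It is an assumption-laden illustration rather than the cleaning tool used in the study: the class and function names are hypothetical, and real pages would need site-specific rules for which `id` values mark noise.

```python
from html.parser import HTMLParser

DROP_TAGS = {"form", "iframe", "script"}        # removed together with their content
DROP_DIV_IDS = {"header", "comment", "footer"}  # <div id="..."> blocks to remove
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without an end tag


class DiaryCleaner(HTMLParser):
    """Re-emits the HTML it is fed, minus the noise elements listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities such as &iacute; as written
        self.kept = []
        self.open_dropped = []  # tags currently open inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            return
        if self.open_dropped:
            if tag not in VOID_TAGS:
                self.open_dropped.append(tag)
            return
        if tag in DROP_TAGS or (tag == "div" and dict(attrs).get("id") in DROP_DIV_IDS):
            self.open_dropped.append(tag)
            return
        self.kept.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "meta" and not self.open_dropped:
            self.kept.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.open_dropped:
            self.kept.append(f"</{tag}>")
        elif tag in self.open_dropped:
            # Close the dropped element, tolerating unclosed tags inside it.
            while self.open_dropped.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.open_dropped:
            self.kept.append(data)

    def handle_entityref(self, name):
        if not self.open_dropped:
            self.kept.append(f"&{name};")

    def handle_charref(self, name):
        if not self.open_dropped:
            self.kept.append(f"&#{name};")

    # Comments are dropped because the inherited handle_comment does nothing.


def clean_diary(html_text):
    cleaner = DiaryCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.kept)


# Hypothetical page fragment.
raw = ('<html><head><meta charset="utf-8"/>'
       '<script type="text/javascript">var x = 1;</script></head>'
       '<body><div id="header"><div>Menu</div></div><!-- advert -->'
       '<div id="entry"><p>Gaud&iacute; is great</p></div>'
       '<div id="footer">Terms</div></body></html>')
cleaned = clean_diary(raw)
```

Only the entry written by the author survives, still in HTML, which matches the aim of keeping the original format for the later content analysis.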

ENTER 2015 Research Track Slide Number 11

UGC data debugging (encoding and common mistakes)

Encoding: HTML entities

Gaudí: UTF-8 (GaudÃ--), HTML number (Gaud&#237;), HTML name (Gaud&iacute;)

Mistakes:

Correct noun    Misspellings
Barcelona       Bathelona, Barcellona, Barthelonaaaa, Bar-tha-lona, Bar-the-lona ...
Gaudí           Gaudi, Gaudì, Gaüdi, Gaudie, Gaudii, Goudi, Goudí, Guadi, Gualdi ...
Parc Güell      Parc Guël, Güel, Guéll, Guelle; Park Gueil, Guel, Güelle, Guelli; Güelle ...

ISO 8859-15 (ASCII Latin-1 extended characters), with decimal codes:

     _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F
C_   À   Á   Â   Ã   Ä   Å   Æ   Ç   È   É   Ê   Ë   Ì   Í   Î   Ï
     192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
D_   Ð   Ñ   Ò   Ó   Ô   Õ   Ö   ×   Ø   Ù   Ú   Û   Ü   Ý   Þ   ß
     208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
E_   à   á   â   ã   ä   å   æ   ç   è   é   ê   ë   ì   í   î   ï
     224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
F_   ð   ñ   ò   ó   ô   õ   ö   ÷   ø   ù   ú   û   ü   ý   þ   ÿ
     240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

HTML entities:

Char  Number   Name
À     &#192;   &Agrave;
Á     &#193;   &Aacute;
Â     &#194;   &Acirc;
Ã     &#195;   &Atilde;
Ä     &#196;   &Auml;
Å     &#197;   &Aring;

UTF-8 encoding:

Char  HEX     Symb
À     c3 80   Ã€
Á     c3 81   Ã?
Â     c3 82   Ã‚
Ã     c3 83   Ãƒ
Ä     c3 84   Ã„
Å     c3 85   Ã…
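
The debugging step can be illustrated with a short sketch that brings the three encodings of "Gaudí" back to one form and maps some of the listed misspellings to the correct noun. The function names and the variant dictionary are assumptions for illustration (the variants are taken from the table above); the mojibake repair only covers UTF-8 bytes that were read as Latin-1, not text mangled through other code pages.

```python
import html
import re


def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1; otherwise return text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def normalise_encoding(text):
    """Resolve HTML entities (&#237;, &iacute;) and then repair mojibake."""
    return fix_mojibake(html.unescape(text))


# Misspellings from the table above, keyed by the correct noun (partial list).
VARIANTS = {
    "Barcelona": ["Bathelona", "Barcellona", "Bar-tha-lona", "Bar-the-lona"],
    "Gaudí": ["Gaudi", "Gaudì", "Gaudie", "Gaudii", "Goudi", "Guadi", "Gualdi"],
}


def correct_names(text):
    """Replace each known misspelling, matched as a whole word, by the correct noun."""
    for correct, variants in VARIANTS.items():
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        text = re.sub(pattern, correct, text)
    return text


from_number = normalise_encoding("Gaud&#237;")
from_name = normalise_encoding("Gaud&iacute;")
from_mojibake = normalise_encoding("Gaud\u00c3\u00ad")  # UTF-8 bytes c3 ad shown as Latin-1
fixed = correct_names("Goudi designed much of Bar-tha-lona")
```

All three encoded forms end up as the same string, so later keyword counts do not split one name across several spellings.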

ENTER 2015 Research Track Slide Number 12

Results: Trends

        2004   2005   2006   2007   2008   2009   2010   2011   2012    2013
TA      40     38     81     117    204    608    1,421  5,933  28,387  36,045
TB      22     139    254    427    662    415    328    362    231     148
TP      29     100    236    276    258    226    238    218    189     346
VT      1,474  1,498  1,023  1,031  762    413    398    635    306     251

Barna   1,177  1,374  1,191  1,309  1,367  1,295  1,742  5,828  24,211  30,875
cBarc   34     42     53     70     79     34     63     115    325     560
cBrav   201    204    163    238    191    134    177    332    1,448   1,707
cDaur   61     46     82     117    134    121    288    698    2,599   2,498
pBarc   57     45     38     37     45     20     35     89     412     927
Pyren   10     20     12     25     8      10     22     14     62      149
tLlei   6      1      1      3      5      5      11     19     16      16
tEbre   4      3      0      1      1      5      1      2      3       10
vAran   1      0      7      0      0      3      2      3      9       11
unCla   14     40     47     51     56     35     44     48     28      37

Trends in web hosting and Catalan brands

Monthly distribution of travel blogs and OTRs (TA, TB, TP, & VT)
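
The yearly counts in the trends table can be cross-checked with a few lines of arithmetic. The sketch below copies the figures from the table and compares the totals by hosting website with the totals by brand; the variable and function names are illustrative only.

```python
YEARS = list(range(2004, 2014))

BY_SITE = {
    "TA": [40, 38, 81, 117, 204, 608, 1421, 5933, 28387, 36045],
    "TB": [22, 139, 254, 427, 662, 415, 328, 362, 231, 148],
    "TP": [29, 100, 236, 276, 258, 226, 238, 218, 189, 346],
    "VT": [1474, 1498, 1023, 1031, 762, 413, 398, 635, 306, 251],
}

BY_BRAND = {
    "Barna": [1177, 1374, 1191, 1309, 1367, 1295, 1742, 5828, 24211, 30875],
    "cBarc": [34, 42, 53, 70, 79, 34, 63, 115, 325, 560],
    "cBrav": [201, 204, 163, 238, 191, 134, 177, 332, 1448, 1707],
    "cDaur": [61, 46, 82, 117, 134, 121, 288, 698, 2599, 2498],
    "pBarc": [57, 45, 38, 37, 45, 20, 35, 89, 412, 927],
    "Pyren": [10, 20, 12, 25, 8, 10, 22, 14, 62, 149],
    "tLlei": [6, 1, 1, 3, 5, 5, 11, 19, 16, 16],
    "tEbre": [4, 3, 0, 1, 1, 5, 1, 2, 3, 10],
    "vAran": [1, 0, 7, 0, 0, 3, 2, 3, 9, 11],
    "unCla": [14, 40, 47, 51, 56, 35, 44, 48, 28, 37],
}


def yearly_totals(table):
    """Sum each year's column across all rows of the table."""
    return [sum(column) for column in zip(*table.values())]


site_totals = dict(zip(YEARS, yearly_totals(BY_SITE)))
brand_totals = dict(zip(YEARS, yearly_totals(BY_BRAND)))
entries_per_site = {site: sum(row) for site, row in BY_SITE.items()}
grand_total = sum(entries_per_site.values())
```

Both breakdowns give the same total for every year, the per-site sums equal the Size row of the webometrics table (72,874, 2,988, 2,116 and 7,791 entries), and the grand total of 85,769 is the "about 85,000" diaries mentioned in the methodology.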

ENTER 2015 Research Track Slide Number 13

Results: Top keywords

Rank  Keyword          Count    Site-wide density  Average weight  Remark
1     barcelona        197,723  3.77 %             56.26           Capital of Catalonia
2     great            51,525   0.98 %             23.73           Good feeling
3     tour             49,221   0.94 %             18.08
4     sagrada familia  38,341   0.73 %             60.75           Gaudi’s masterpiece
5     gaudi            33,187   0.63 %             19.66           Architect A. Gaudi
6     city             28,155   0.54 %             11.70
7     place            26,597   0.51 %             15.73
8     good             26,098   0.50 %             15.02           Good feeling
9     visit            25,973   0.49 %             14.86
10    amazing          25,242   0.48 %             24.18           Good feeling
11    park             24,962   0.47 %             28.38
12    basilica         23,618   0.45 %             81.68           Religious building
13    park guell       23,367   0.44 %             62.06           Gaudi’s work
14    beautiful        23,322   0.44 %             23.01           Good feeling
15    way              22,996   0.44 %             15.02

Site Content Analyzer (SCA) was applied to the dataset
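
To make the Count and density columns concrete, here is a minimal sketch of a keyword count over a set of texts. It is not Site Content Analyzer's algorithm: it assumes that density is the keyword count divided by the total number of words, it handles single words only (not phrases such as "sagrada familia"), it does not compute the average weight, and the stop-word list and sample texts are invented.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"a", "and", "in", "is", "of", "the", "was"})  # illustrative list


def keyword_density(texts, top=5):
    """Return (keyword, count, density in %) for the most frequent non-stop-words."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[^\W\d_]+", text.lower()))  # runs of letters only
    total = len(words)
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [(word, count, 100 * count / total) for word, count in counts.most_common(top)]


# Hypothetical cleaned diary texts.
diaries = [
    "Barcelona is great",
    "The Sagrada Familia in Barcelona was amazing",
]
top_keywords = keyword_density(diaries, top=1)
```

In this toy input "barcelona" appears twice among ten words, a density of 20 %; the table above reports the same kind of ratio over the whole dataset.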

ENTER 2015 Research Track Slide Number 14

Top keywords: Barcelona, Gaudi and two of Gaudi’s works

Barcelona: Guell Park / Mosaic Dragon

Basilica of La Sagrada Familia (Passion façade) / Antoni Gaudi

ENTER 2015 Research Track Slide Number 15

Conclusions

The proposed methodology facilitates the massive gathering of UGC data from the most suitable sources for a specific case study.

The hierarchical territorial structure of folders and the composition of each diary’s file name enable multiple classifications using utilities that order and manipulate files.

This structure also allows the analysis to focus on a specific place, language or subject.

The cleaning and debugging phases are essential to obtain quality information, limited to what has been written by the diary author.

The HTML dataset is prepared for any offline content analysis in future work and most phases of this method are useful for the content analysis of other web data sources.

ENTER 2015 Research Track Slide Number 16

Thank you for your attention!

[email protected]

[email protected]