finding blue oceans in tourism


Finding Blue Oceans in Tourism: Using Text Mining to Identify Business Opportunities in Tourism

Samira dos Santos Nogueira

Dissertation presented as a partial requirement for obtaining

the master’s degree in Statistics and Information

Management.


NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação

Universidade Nova de Lisboa

Finding Blue Oceans in Tourism: Using Text Mining to Identify Business Opportunities in Tourism

by

Samira dos Santos Nogueira

Dissertation presented as a partial requirement for obtaining a master’s degree in Statistics and

Information Management, with a specialization in Market Research and CRM.

Advisor / Co Advisor: Diego Costa Pinto, Mauro Castelli

March 2021


DECLARATION OF ORIGINALITY

I declare that the work described in this document is my own and not from someone else. All the

assistance I have received from other people is duly acknowledged and all the sources (published or not

published) are referenced.

This work has not been previously evaluated or submitted to NOVA Information Management

School or elsewhere.

Lisbon, 2 March 2021


_______________________________________________________




ACKNOWLEDGEMENTS

THIS WORK IS DEDICATED TO MY WHOLE FAMILY, WHO GAVE ME THE SUPPORT I NEEDED TO COMPLETE THE MASTER'S DEGREE. TO THE ARAÚJO, GOMES AND NOGUEIRA FAMILIES, THANK YOU VERY MUCH.

I ALSO DEDICATE THIS WORK TO MY FRIENDS, WHO GAVE ME THE EMOTIONAL SUPPORT TO KEEP PURSUING MY DREAM.

A SPECIAL DEDICATION TO RODOLFO SALDANHA, WHO HELPED ME WITH THE DATA SCIENCE PART OF THE RESEARCH.

THANK YOU ALL VERY MUCH!


ABSTRACT

The growing volume of data being produced and made available is bringing innovation to well-known areas. One of them is tourism, where big data is particularly useful for offering ever more personalized options to travelers. The main type of data that influences consumers' preferences and decisions is online reviews posted on specialized websites or social networks, because consumers tend to take the opinions and reviews of other travelers into consideration before deciding on a destination or where to stay. In this study, a sentiment analysis of more than 1,300 reviews retrieved from TripAdvisor shows the main attributes that predict positive and negative online reviews. A Naïve Bayes algorithm was used and achieved 75% accuracy on the sentiment analysis. The next step complemented the sentiment analysis by using its results to build a Blue Ocean-inspired strategy that speaks to practitioners in the tourism and hospitality sector. The findings indicate that the factors targeted for improvement are developing venues for events, establishing a feeling of safety for consumers, and fostering brand attachment.

Keywords: data science, sentiment analysis, blue ocean, text mining, tourism


INDEX

1. Introduction
2. Literature review
2.1. Big Data
2.2. Data Science
2.3. Tourism
2.3.1. Smart Cities
2.3.2. Smart Tourism
2.4. Marketing strategy
2.4.1. Innovation
2.4.2. Blue Ocean Strategy
2.5. Previous Work
3. Conceptual model
4. Methodology
4.1. Data Collection
4.2. Machine Learning Approach
4.3. Blue Ocean Approach
5. Findings
5.1. Blue Ocean Findings
6. Discussion
7. Theoretical and methodological contributions
7.1. Theoretical implications
7.2. Managerial implications
7.3. Limitations and future research
8. Bibliography
9. Attachment


LIST OF FIGURES

Figure 1 - Big Data in Tourism Research
Figure 2 - Data Mining tasks
Figure 3 - Factors that influence tourists experience
Figure 4 - Strategy Canvas
Figure 5 - Four frameworks from Blue Ocean
Figure 6 - New Conceptual Model
Figure 7 - Tourism Factors
Figure 8 - Tourism topics
Figure 9 - Sentiment analysis distribution
Figure 10 - Frequency dictionary of the words
Figure 11 - Bag of words cloud
Figure 12 - First classifier model
Figure 13 - Second classifier model
Figure 14 - Third classifier model
Figure 15 - Confusion Matrix
Figure 16 - Industry Canvas
Figure 17 - Industry x 3 to 1 stars strategy canvas
Figure 18 - Industry x 4 and 5 stars strategy canvas
Figure 19 - 3 to 1 stars category x 4 and 5 stars category strategy canvas
Figure 20 - 4 actions framework based on 3 to 1 stars category


1. INTRODUCTION

According to Statista (2021), the total amount of data created, captured, copied, and consumed worldwide was forecast to grow rapidly and reach 59 zettabytes in 2020. The rapid development of digitalization contributes to the ever-growing global data sphere. Big Data increasingly attracts attention from different sectors because of its impact on people's lives and the cultural changes it brings.

One of these areas of study is tourism, a sector that, for some cities and regions, is a driver of the local economy (Wood et al., 2013) and that can benefit from applied Big Data techniques. Big Data focused on tourism can benefit the sector because it gives managers a more data-driven approach and an opportunity to improve customer relationship management, both by attracting new travelers and retaining existing ones and by identifying existing issues and points for improvement in the business (Neidhardt, Rümmele, & Werthner, 2017).

In the future, the Tourism and Hospitality area is expected to embrace Big Data (BD) and Big Data Analytics at different levels and speeds and for different purposes. In particular, BD will increasingly contribute to (a) frame novel research questions and hypotheses if combined with an underpinning conceptual framework; (b) enrich research designs and methods; (c) improve the generalizability of research findings across different institutional, economic, social and geographical contexts; (d) generate relevant managerial insights and business intelligence by means of (digital) data analytics in real-time; and (e) advance BD technological applications in the verticals of Tourism and Hospitality (Mariani, 2020, p.02).

One of the principal sources of such information is online reviews. This type of data can significantly influence online booking intention (Phillips, Zigan, Silva, & Schegg, 2015; Zhao, Wang, Guo, & Law, 2015), and more than 60% of travelers use other consumers’ comments as a source of information when making travel plans (Cró & Martins, 2017; Fang, Ye, Kucukusta, & Law, 2016).

Regarding academic production, a growing number of studies connect data science and tourism, but more studies are still needed to connect the results to concrete business actions (Li, Xu, Tang, Wang, & Li, 2018). With this in mind, deepening the use of data science in tourism makes it possible to improve customer satisfaction by providing the information needed to stand out from the competition (Amadio & Procaccino, 2016).

Based on this, the present study proposes to connect results from data science techniques to a marketing strategy, specifically Blue Ocean, to find business opportunities in tourism. This is the first approach of its kind, and one of its goals is to be data-driven from the beginning to the end of the process. For this, a text mining technique will be used to perform a sentiment analysis of online reviews posted on a social media platform (TripAdvisor). In this process, tourism factors and the principal subtopics of tourism will be identified and, from the results, a Strategy Canvas will be created for the tourism industry and its subsequent categories, to understand the value curve and what can be improved. Based on a quantity criterion, the category chosen for a Four Actions Framework with marketing strategies was tourist accommodation rated 3 stars or below, according to the TripAdvisor classification.

The place chosen for the analysis is the city of São Luís, in the state of Maranhão, Brazil. Brazil, and specifically São Luís, was chosen because the country has many tourist cities. The methodology developed in this study can be applied in professional practice by tourism and hotel managers responsible for decision making, helping to improve tourism in cities, São Luís being one of them.

The first section of this study is a literature review, which introduces the concepts needed to understand this work, such as Big Data, Data Science, Data Mining, Text Mining, Tourism, Smart Cities and Smart Tourism, Marketing Strategy, Innovation, and Blue Ocean Strategy. The second section presents the conceptual model developed, followed by a description of the methodology applied, divided into Data Collection, Machine Learning Approach, and Blue Ocean Approach. The subsequent sections are dedicated to the results: Findings presents the results of the data analysis, specifically the sentiment analysis and Blue Ocean. The next section discusses the results in more depth and creates marketing strategies based on the Blue Ocean Strategy Canvas, plotted in a Blue Ocean tool, the Four Actions Framework. Finally, theoretical and managerial implications are discussed, along with potential questions for future research.


2. LITERATURE REVIEW

2.1. BIG DATA

Big Data can be understood as referring "to the large amount of data characterized by the large volume, variety, and speed, requiring new processes to enable better decision making, the discovery of insight and optimization of processes" (Siddiqa et al., 2016). Although there is no definitive definition of Big Data, the best known is the one proposed by Laney (2001), which characterizes Big Data by three Vs (Volume, Velocity, Variety).

The importance of Big Data for business is that it allows longitudinal studies, thanks to constant or regular data capture, easy data storage, and low cost (Xu, 2019), since it makes a large amount of data available without much effort or human resources.

Big Data can come from many different sources, such as smartphones and IoT devices. For tourism, the area under study, the division of these data sources is shown in Figure 1.

Figure 1 – Big Data in Tourism Research

As Figure 1 shows, the data sources related to tourism can be divided into three main categories: UGC data (generated by users), including online textual data and online photo data; device data (generated by devices), including GPS data, mobile roaming data, Bluetooth data, etc.; and transaction data (generated by operations), including web search data, webpage visiting data, and online booking data (J. Li, Xu, Tang, Wang, & Li, 2018). This research uses UGC data, produced by users in the form of online textual data.


2.2. DATA SCIENCE

To analyze this large amount of data, there is a specific science: data science. According to Provost

and Fawcett (2013, p.03), data science "is a set of fundamental principles that support and guide the

extraction of information and knowledge from data."

One of the main objectives of analyzing big data is to reveal patterns and trends (Chiappa et al., 2015). Patterns and trends are important information for businesses in all areas, especially those that seek innovation, because they can reveal new opportunities to explore and ways to grow the business.

Today, because of the variety of technological devices and data sources, many types of data are available, including voice, text, images, and videos, and these different formats pose a new challenge for analysis. Companies, in addition to dealing with large volumes of data, now need to be able to handle new data types (Davenport & Dyche, 2013). In other words, different techniques must be applied depending on the type of data available.

There are many possible techniques for analyzing these data. The most common data mining methods either describe the data (descriptive techniques, such as cluster analysis) or make predictions from it (predictive techniques, such as linear regression). For this research, text mining, specifically sentiment analysis, will be applied to the data.

2.2.1. Data Mining and Machine Learning

Data Mining, in basic terms, is finding useful patterns in data; it is also referred to as knowledge discovery, machine learning, and predictive analytics (Chauhan & Kaur, 2015). It derives computational techniques from the disciplines of statistics, artificial intelligence, machine learning, database theory, and pattern recognition, and uses modeling and algorithms to extract knowledge from data, preferably from large datasets (Chauhan & Kaur, 2015).

Data mining requires large datasets to find patterns. Much of the process lends itself to automation, since algorithms can be created to identify these patterns, and this is where machine learning comes in (Bruno, 2019).

Before we talk about Machine Learning, it is crucial to understand another concept: modeling. A

model is a “specification of a mathematical (or probabilistic) relationship that exists between different

variables” (Grus, 2015, p. 141).

Machine Learning is used broadly in data science to refer to the techniques applied to analyze and extract information from data. At its core, it is a branch of artificial intelligence that aims to enable machines to perform their jobs skillfully using intelligent software. Because it uses sophisticated statistical methods and needs data to learn patterns, it can be considered multidisciplinary (Mohammed, Khan, & Bashier, 2016). In summary, machine learning is valuable for creating and using models learned from data (Grus, 2015).


There are many kinds of machine learning, and many different algorithms to choose from

depending upon the data available (structured or unstructured), data size, and the goals of the study

(Bruno, 2019).

2.2.2. Types of Data Mining

Data mining problems can be divided into supervised and unsupervised learning models. Supervised data mining tries to infer a function or relationship from labeled training data and uses this function to map new, unlabeled data; it predicts the value of the output variables from a set of input variables and needs a sufficient number of labeled records to learn the model. Unsupervised learning, on the other hand, uncovers hidden patterns in unlabeled data: there are no output variables to predict, and the objective is to find patterns based on the relationships between the data points themselves. Data mining techniques can be grouped into classification, regression, association analysis, anomaly detection, time series, and text mining tasks, as shown in Figure 2 (Chauhan & Kaur, 2015).

Figure 2 – Data Mining tasks

For predictions, which are related to supervised learning, two forms are commonly applied: classification and regression. In unsupervised learning, whose goal is to find interesting and useful generalities within the data (Bruno, 2019), the most common forms are clustering and association.


For this research, the analysis will use Text Mining techniques, specifically a supervised sentiment analysis approach, because of the characteristics of the data and because the desired output is already available.

2.2.3. Text Mining

In basic terms, text mining “is a data mining application where the input data is text, which can be in the form of documents, messages, emails, or web pages” (Chauhan & Kaur, 2015, p.9). To perform data mining on textual data, the text files are converted into document vectors in which each unique word is considered an attribute; what matters is reducing these attributes to the important ones, from which knowledge can be extracted (Chauhan & Kaur, 2015).
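The conversion of text into document vectors can be sketched as follows. This is a minimal illustration in Python, not the code used in this study: the two reviews are invented, and the helper names `build_vocabulary` and `to_document_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect every unique lowercase word; sorting gives a stable attribute order."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_document_vector(document, vocabulary):
    """Count how often each vocabulary word (attribute) occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical reviews, invented for illustration only.
reviews = ["great location great staff", "dirty room rude staff"]
vocabulary = build_vocabulary(reviews)
vectors = [to_document_vector(review, vocabulary) for review in reviews]

for review, vector in zip(reviews, vectors):
    print(review, "->", vector)
```

Each position in a vector corresponds to one unique word, so reducing the attributes to the important ones amounts to dropping columns of this representation.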

Text Mining is a valuable technique for organizations since it allows them to understand everything from consumer opinions to the brand's reputation in an online environment. In tourism, text mining is even more important since most information in an online environment is conveyed through text (Nave, Rita, & Guerreiro, 2018).

The concept of Text Mining is closely related to Opinion Mining. Opinion Mining can be understood as the use of natural language processing to determine whether a piece of content is positive, negative, or neutral, whereas text mining is used to identify and extract subjective information from feedback (Afzaal & Usman, 2016).

One popular text mining technique, which is also used in this study, is sentiment analysis. Sentiment analysis focuses on extracting the relevance of a product’s features based on the polarity (positive or negative) of the sentiments that consumers express in review sentences. Sentiment analysis usually relies on Natural Language Processing (NLP), supervised or unsupervised learning, and association rules. For sentiment classification, text mining and mutual information are used (Aciar, 2010).

Sentiment analysis methods can be divided into two categories: the first is based on machine learning methods (such as neural networks), and the second is based on dictionary methods that use predefined sentiment dictionaries, such as WordNet, HowNet, and LIWC, which contain terms related to feelings and their polarity values. The combination of these two categories also shows great potential (Q. Li, Li, Zhang, Hu, & Hu, 2019).

If a machine learning algorithm is used, the entire sentiment analysis process, including text classification, is done manually. If a dictionary-based approach is used, there is already a list of words that can be associated with each process (property, subjectivity, and sentiment) (Fuchs & Lexhagen, 2013). Examples of such dictionaries are SentiStrength, which calculates the positive and negative sentiment scores of a short text, with positive values ranging from 1 (not positive) to 5 (extremely positive) and negative values ranging from -1 (not negative) to -5 (extremely negative); SentiWordNet, a sentiment lexicon holding the polarity scores of opinion words, with approximately 3 million words including nouns, verbs, adverbs, and adjectives; and Opinion Lexicon, which is one of the oldest dictionaries. It is also important to mention the lexicon-based approach, which takes the sentiment orientation of a text document to be the average of the sentiment orientations of its words and phrases (Ramanathan & Meyyappan, 2019).
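The lexicon-based averaging described above can be sketched in a few lines of Python. The lexicon below is a toy example with invented scores, not SentiStrength, SentiWordNet, or Opinion Lexicon, and the function name `average_sentiment` is hypothetical. As a simplifying assumption, only single words found in the lexicon contribute to the average, and a text with no known words scores 0.0.

```python
# Toy lexicon with invented polarity scores (assumption, for illustration only).
TOY_LEXICON = {"great": 1.0, "clean": 0.5, "dirty": -1.0, "rude": -0.5}


def average_sentiment(text, lexicon):
    """Average the polarity of the words that appear in the lexicon."""
    scores = [lexicon[word] for word in text.lower().split() if word in lexicon]
    if not scores:
        return 0.0  # no opinion words found: treat the text as neutral
    return sum(scores) / len(scores)


print(average_sentiment("great clean room", TOY_LEXICON))
print(average_sentiment("dirty room rude staff", TOY_LEXICON))
```

A positive average suggests a positive review and a negative average a negative one; where to place the threshold for a neutral class is a design choice that this sketch leaves open.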

The sentiment analysis technique is often divided into two consecutive steps: (a) detecting which

text segments contain the dimensions and (b) determining the polarity and strength of the sentiment of

each of these dimensions (Pang & Lee, 2004).

Sentiment analysis can also be split into contextual and conceptual semantic sentiment analysis. In the first, sentiment is inferred from the co-occurrence patterns of words: a change of context may change a word’s sentiment, which shifts according to the neighboring words. In the second, sentiment is often extracted from external knowledge sources such as ontologies and semantic networks (Ramanathan & Meyyappan, 2019).

In the tourism area, sentiment analysis helps to capture tourists' feelings and opinions in real time, so that appropriate measures can be taken in response to what is being expressed (Q. Li et al., 2019).

2.2.4. Analyzing textual data

To analyze textual data, it is first necessary to collect the data from the relevant social media (one of the main sources of this type of data) or from reviews on websites, which can be done with web crawling technology (Xiang et al., 2015, 2017; Xu et al., 2015). A web crawler, which can be understood as a robot or spider, downloads web pages, extracts uniform resource locators (URLs) from their hypertext markup language (HTML), and fetches them (Thelwall, 2001). A crawler is implemented in this study to extract the data for analysis iteratively.
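The URL extraction step of a crawler can be illustrated with Python's standard library. This is a sketch under stated assumptions, not the crawler implemented in this study: the HTML snippet and the `example.com` addresses are invented, the class and function names are hypothetical, and the download step is left out so that the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in one HTML page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Invented page content, for illustration only.
sample_html = (
    '<a href="/hotel/1">Hotel 1</a> '
    '<a href="https://example.com/hotel/2">Hotel 2</a>'
)
links = extract_links(sample_html, "https://example.com/list")
print(links)
```

A full crawler would place the extracted URLs in a queue, download each page in turn, and repeat the extraction, which is the iterative behavior described above.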

The second step is data preprocessing and pattern discovery. In data preprocessing, popular operations are data cleaning, tokenization, normalization, word stemming/lemmatization, and part-of-speech tagging (POST) (J. Li et al., 2018). Data cleaning aims to detect and remove inaccurate or useless records from online text data, such as misspellings (Xiang et al., 2015), stop words (Xiang et al., 2015; Xu & Li, 2016; Xu et al., 2015), and non-target-language and low-frequency words (Guo et al., 2017), leaving only the valuable information. Tokenization breaks the textual data up into words, phrases, or other meaningful elements, namely tokens (J. Li et al., 2018). Word stemming/lemmatization identifies word roots and treats all words with the same root as one token (Xu & Li, 2016). POST labels each word in a sentence with a part-of-speech (POS) tag, such as noun, adjective, or adverb (J. Li et al., 2018). After that, it is necessary to transform the text into vectors, because vector representations at the granularity of words, sentences, documents, etc., are the basis for the related machine learning, and pre-trained models of these vectors provide the input for other models (Q. Li et al., 2019).
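A few of these preprocessing operations (cleaning, tokenization, stop-word removal, and stemming) can be sketched in plain Python. The stop-word list is a tiny invented sample and `crude_stem` is a deliberately naive, hypothetical stand-in for a real stemmer; neither reflects the tools used in this study.

```python
import re

# Tiny invented stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "was", "were", "and", "is"}


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens (basic cleaning)."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes; a toy substitute for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [crude_stem(token) for token in tokenize(text) if token not in STOP_WORDS]


print(preprocess("The rooms were cleaned and the staff was amazing!"))
```

The resulting token lists are what would then be transformed into vectors for the machine learning step.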

Regarding pattern discovery, the popular techniques applied to online textual data, especially in tourism research, are latent Dirichlet allocation (LDA), sentiment analysis, statistical analysis, clustering and categorization, text summarization, and dependency modeling. LDA is a topic model for identifying the abstract “topics” in textual data. Sentiment analysis aims to classify textual data into sentiment categories (positive, negative, or neutral). Statistical analysis, the most basic technique for analyzing data, encompasses descriptive statistics (e.g., mean and variance), t-tests, correlation matrices, and others. Text summarization automatically produces a summary of one or more documents, distilling the key information from the original texts. Dependency modeling aims to capture the relationship between textual data (for example, online reviews) and factors (such as hotel performance) (J. Li et al., 2018).

One of the major steps in text mining, especially sentiment analysis, is topic classification. In tourism, review comments are generally short but cover several aspects of travel, such as transportation, entertainment, accommodation, and food, among others. For this reason, text classification in tourism generally involves extracting topics from the message text so that all the aspects mentioned can be analyzed further (Q. Li et al., 2019).

Regarding the algorithms used, text classification has traditionally been based on machine learning, using Naive Bayes, maximum entropy, Support Vector Machines, and the K-nearest neighbor algorithm. Its main characteristics are the use of keywords or topics that reflect the character of the document and the automatic classification of the text. Today, the most prominent text extraction techniques are TF-IDF, information divergence, and deep learning approaches. Other methods, such as N-grams, are also used but have disadvantages, such as the loss of text information (Q. Li et al., 2019).
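To make the Naive Bayes classifier concrete, the sketch below implements a multinomial variant with add-one (Laplace) smoothing in plain Python. It is an illustration only and not the model trained in this study: the four labeled reviews are invented, the function names are hypothetical, and words unseen during training are simply ignored at prediction time.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen in training
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training reviews, for illustration only.
training = [
    ("great clean room".split(), "positive"),
    ("friendly staff great location".split(), "positive"),
    ("dirty room rude staff".split(), "negative"),
    ("noisy dirty bathroom".split(), "negative"),
]
model = train_naive_bayes(training)
print(predict("great friendly staff".split(), model))
print(predict("dirty noisy room".split(), model))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word-class pair from driving a class probability to zero.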

Other steps that can be applied are the Bag-of-Words language model, TF-IDF, and word embeddings.
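As an illustration of TF-IDF weighting, the sketch below uses one common variant: term frequency as the share of a document's tokens, and inverse document frequency as the natural logarithm of the number of documents divided by the number of documents containing the term. Other variants exist, and this is not necessarily the one used in this study; the documents are invented and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented tokenized reviews, for illustration only.
docs = [["great", "staff"], ["rude", "staff"]]
weights = tf_idf(docs)
print(weights)
```

A term that appears in every document, such as "staff" here, receives a weight of zero, which is how TF-IDF plays down words that do not help tell documents apart.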

As noted above, several areas, especially tourism, can benefit from predictive analytics techniques.

2.3. TOURISM

Tourism is one of the largest industries worldwide, contributing 6 trillion dollars annually to the global economy and nearly 260 million jobs, and 69 million more jobs are expected to be created by 2021 (Tsiotsou, Mild, & Sudharshan, 2012). Tourism depends heavily on stakeholders because it combines various services, such as accommodation, transportation, dining, recreation, and travel, all of which affect customer satisfaction (Tsiotsou et al., 2012). Figure 3 shows some of the principal factors that influence a tourist's experience.

Figure 3 – Factors that influence tourist’s experience.

Due to economic development and national income growth, the demand for tourism has increased significantly. At the same time, with the popularity of web technologies, travelers use the internet to look for information, such as online reviews on travel websites, before leaving (Mehmood, Ahmad, & Kim, 2019). Reviews with a strongly positive or strongly negative message, as well as the quality of these reviews, have a great influence on consumers' buying behavior (Dickinger & Mazanec, 2015), and the user-generated content of online reviews is today recognized as an important component in the construction of a destination's image (Yeoh, Othman, & Ahmad, 2013).

In tourism, the destination image is one of the main factors that influence travelers when choosing a destination. The destination image reflects the tourist market, including the country image, the city image, the scenic spot image, etc. (Q. Li et al., 2019). Destination images should be promoted in line with reality, and all promotion agents have to speak the same language, because the higher the tourist's expectation, the greater the disappointment if the destination cannot meet it (Gassiot & Coromina, 2013).

The tourism industry is among the most innovative worldwide, given its ability to incorporate technological and societal advances through new business creation (Hjalager, 2015). Thus, in the context of the tourist, "the focus is on the management of tourist data and marketing strategy and the ability to directly reach users through social media allows creating new opportunities for service providers" (Pantano, Priporas, & Stylos, 2017).


Data collection and analysis are necessary for a country to make public policy and to develop the travel and tourism industry (Mehmood et al., 2019), and this is where Smart Cities and Smart Tourism come into play.

2.3.1. Smart Cities

The Smart City is a modern concept that emerged from the problems that a city encounters due to a higher influx of citizens into urban areas (Lobao et al., 2019).

In a Smart City, it is crucial to understand that citizens and technologies are the keys for a city to become smart (Lobao et al., 2019). Put simply, a Smart City is a city that uses citizens' data to improve essential services such as transportation.

Although a smart city can improve the services a city provides to citizens and tourists, in real life most cities are far from this scenario and still in the earliest phase. It is necessary to rethink governance, design, and creation to increase innovative solutions in which data is the basis of everything, and for a Smart City to work, the diverse stakeholders need to be interconnected (Lobao et al., 2019).

If data is the basis for a Smart City, access to this data should be facilitated, and this is where open data comes in. Open data, according to the Open Data Guide, is data that can be freely used, re-used, and redistributed by anyone, subject only, at most, to the requirement to attribute and share alike (Dietrich et al., 2012, p.6).

Given this concept, tourism can greatly benefit from free data, now more than ever available in many forms and on many platforms. This is the basis of the Smart Tourism concept.

2.3.2. Smart Tourism

Big Data and Smart Tourism are deeply connected: the use of big data makes it possible to offer the right services that suit users’ preferences at the right time, mainly through the adoption of information and communication technology (ICT) (Brennan, Koo, & Bae, 2018). As Chiappa et al. (2015) state, “With the availability of massive tourists’ data, destinations are expected to offer personalized services to each different type of tourists to exceed their prior expectation and subsequently enhance their tourism experience. Presumably, such experience would enrich how tourists value their trip.”

Another definition worth highlighting comes from Buhalis & Amaranggana (2014), who describe the Smart Tourism Destination as a destination embedded with technology whose priority is improving tourists' travel experience, and whose characteristics are the efficient gathering and distribution of information, the efficient allocation of tourism resources, and the distribution of the sector's benefits to the local society.

Bringing smartness into tourism destinations requires dynamically interconnecting stakeholders

through a technological platform on which information relating to tourism activities could be exchanged

instantly. Smart Tourism Destinations should make optimal use of Big Data by offering the right services

that suit users’ preference at the right time (Chiappa et al., 2015).

One of the biggest advantages of smart tourism is the possibility of offering personalized services at the destination and even personalized travel packages. Personalization can be defined as “the process of collecting and utilizing personal information about the needs and preferences of customers to create offers and information, which perfectly fits the needs of the customers” (Frank and Harnisch 2014 as cited in Yang et al. 2005).

Another advantage that smart tourism could gain from big data is an enhanced experience for the tourist through an advanced feedback loop, better access to real-time information, and advanced customer service through the Internet of Things, addressing factors that could lead to a negative experience (Chiappa et al., 2015).

Smart tourism is likely to grow and to affect the area even more in the coming years. That is why it is important to connect data science to business strategy and innovation, to see how the results of data analysis could be implemented in real scenarios.

2.4. MARKETING STRATEGY

Marketing Strategy, in a general way, is the “total sum of the integration of segmentation, targeting,

differentiation, and positioning strategies designed to create, communicate, and deliver an offer to a

target market” (El-Ansary, 2006).

A similar concept is that strategic marketing is “an organization’s integrated pattern of decisions that

specify its crucial choices concerning products, markets, marketing activities and marketing resources in

the creation, communication or delivery of products that offer value to customers in exchanges with the

organization and thereby enables the organization to achieve specific objectives” (Varadarajan, 2010, p.

128).

From a classical point of view, Kotler (1991) defined marketing strategy as a plan to achieve the

organization’s objectives by specifying what resources should be allocated to marketing and how these

resources should be used to take advantage of opportunities that are expected to arise in the future.

One important concept related to marketing strategy is market segmentation: the process of dividing a market into different subsets of consumers with the same needs or characteristics and selecting one or more segments to target with a distinct marketing mix (Shiffman et al., 2004).

Tourism destinations also have to deal with competition, and to do so they need to differentiate themselves from other destinations. According to Potter (1990), differentiation is the ability to provide the buyer with exceptional and superior value in terms of quality, special features, or assistance services. This requires developing competitive intelligence. As Nasri (2011) writes, competitive intelligence provides information about competitors and their strategies, objectives, research, strengths, weaknesses, and so on, giving companies the opportunity to understand their position relative to competitors.

Competitive intelligence helps companies to (1) gain a better understanding of their business environment and industry; (2) learn about the corporate and business strategies of competitors; (3) forecast opportunities and threats; (4) anticipate the research and development of competitors’ strategies; (5) validate or deny industry rumors; (6) take effective decisions; and (7) act instead of reacting (Gémar & Jiménez-Quintero, 2015).

In the tourism industry, marketing strategies have been adopted to respond to current challenges, achieve a competitive advantage, and increase effectiveness (Tsiotsou et al., 2012). Innovation is necessary to keep meeting consumers' needs.

2.4.1. Innovation

There are many definitions of innovation. Put simply, innovation can be understood as “an idea, practice, or object that is perceived as new by an individual or other unit of adoption” and that helps to save costs or improve the quality of existing processes (Mohd Zawawi et al., 2016).

A similar definition describes innovation as “activities and processes

of creation and implementation of new knowledge to produce distinctive products, services and

processes to meet the customers’ needs and preferences in different ways as well as to make process,

structure, and technology more sophisticated that can bring prosperity among individuals, groups and

into the entire society” (Akram et al., 2011).

In tourism, text mining of online data has great potential to inspire innovation among practitioners and to transform the industry. For example, applying text mining to tourists' reviews makes it possible to develop personalized recommendation systems based on the tourist's profile, which can increase customer satisfaction (Q. Li et al., 2019).

Another area in which innovation in tourism can help is competitive analysis. When based on data and text mining techniques, a competitive analysis must come from an automated system that relies upon text mining tools to summarize an archive of reviews spanning multiple suppliers, and then to identify relationships (Amadio & Procaccino, 2016).

The objective of innovation is to create value for the business. In a competitive market, innovation allows businesses to produce distinctive products and services that meet customers' increasingly demanding tastes and preferences (Akram et al., 2011). Nowadays there are many tools and methods available for a company or brand to innovate its services or products, such as Design Thinking and Blue Ocean Strategy.

2.4.2. Blue Ocean Strategy

Blue Ocean Strategy is a bestselling book that has sold more than 4 million copies and been translated into 46 languages across five continents. It has been embraced by many organizations and argues that cutthroat competition results in nothing but a bloody red ocean of rivals fighting over a shrinking profit pool. For organizations to achieve lasting success, they need to create blue oceans: unexplored new market spaces ready to grow (Kim & Mauborgne, 2019).

Red Oceans represent all the industries in existence today. This is the known market space. Blue Oceans, in contrast, denote all the industries not in existence today. This is the unknown market space. Here competition is irrelevant because the rules of the game are waiting to be set (Chan Kim & Mauborgne, 2015).

One of the main objectives of Blue Ocean Strategy is to create value innovation for the business. Value innovation is created in the region where a company's actions favorably affect both its cost structure and its value proposition to buyers. Cost savings are made by eliminating and reducing the factors an industry competes on. Buyer value is lifted by raising and creating elements the industry has never offered. Over time, costs are reduced further as scale economies kick in due to the high sales volumes that superior value generates. It is called value innovation because, instead of focusing on beating the competition, the focus is on making the competition irrelevant by creating a leap in value for buyers and the company, thereby opening new and uncontested market space. In this sense, value innovation is more than innovation: it is a strategy that embraces the entire system of a company's activities (Chan Kim & Mauborgne, 2015).

The book presents a systematic approach, with diverse tools, for each brand to capture and define its blue ocean. In total, the book presents 14 tools that help an organization stand out from its competitors. Two of them were selected for this research: the Strategy Canvas and the Four Actions Framework.

The Strategy Canvas is a central diagnostic tool and an action framework that graphically captures

the current strategic landscape and the prospects for an organization (Kim & Mauborgne, 2019).

Figure 4 – Strategy Canvas

The horizontal axis on the strategy canvas is related to the range of factors that the industry

competes on and invests in, while the vertical axis captures the offering level that clients receive across all

these key competing factors. A value curve or strategic profile is the graphic depiction of a company’s

relative performance across its industry’s factors of competition (Kim & Mauborgne, 2019).

The Strategy Canvas is important because it serves two purposes: (a) it captures the current state of play in the known market space, which allows users to see the factors that the industry competes on and invests in, what buyers receive, and what the strategic profiles of the major players are; and (b) it propels users to action by reorienting their focus from competitors to alternatives and from customers to noncustomers of the industry, making it possible to visualize how a blue ocean strategic move breaks away from the existing red ocean reality (Kim & Mauborgne, 2019).

When applying this tool, it is important to analyze the scores of the competing factors, which in our case are the tourism factors. A high score means that a company offers buyers more, and hence invests more, in that factor.

To produce innovation, the Strategy Canvas should create a Value Curve. The value curve, the

basic component of the strategy canvas, is a graphic depiction of a company's relative performance across

its industry's factors of competition. As you shift your strategic focus from current competition to

alternatives and noncustomers, you gain insight into how to redefine the problem the industry focuses on

and thereby reconstruct buyer value elements that reside across industry boundaries (Chan Kim &

Mauborgne, 2015).

The Four Actions Framework is used “to reconstruct buyer value elements in crafting a new value curve or strategic profile. To break the trade-off between differentiation and low cost in creating a new value curve, the framework poses four key questions, shown in the diagram, to challenge an industry’s strategic logic” (Kim & Mauborgne, 2019). This tool is important because the grid pushes companies not only to ask all four questions in the four actions framework but also to act on all four to create a new value curve (Chan Kim & Mauborgne, 2015).

Figure 5 – Four frameworks from Blue Ocean

Once the sentiment analysis of the reviews collected from TripAdvisor has produced its results (positive, negative, or neutral), a numeric value will be assigned to each category: +1 if the sentence is positive, -1 if it is negative, and 0 if it is neutral. From the summary of these results, a Strategy Canvas will be created based on the topics described in the Conceptual Model, followed by a Four Actions Framework based on the results.

2.5. PREVIOUS WORK

As addressed in the work “Big data in tourism research: A literature review”, the application of big data in tourism is still recent (a little more than 10 years), and of the three data sources indicated in figure 1, research involving UGC data (data produced by users, such as online textual data and online photo data) is the predominant type of work, with 47% of the publications so far. The main subjects addressed were tourist sentiment analysis, tourism marketing, and tourism recommendation (Li et al., 2018).

Although research using online textual data is advanced, there is still room for improvement. As noted in the same study, knowledge produced by data science methods such as text mining is lacking in tourism product design and tourism marketing. In other words, the connection between the results of online textual data analysis and their practical application still needs to be developed.

In the last three years, more than 40 papers have been published on text mining in the tourism industry, especially on analyzing online reviews through sentiment analysis, a popular method for classifying consumer sentiment. This research is one of the first to take the results of sentiment analysis and combine them with a marketing strategy tool, based mainly on statistical measures such as the mean.

However, one earlier paper has a proposal similar to this work, connecting the results of text mining with tourism marketing. The paper “Competitive Analysis of Online Reviews Using Exploratory Text Mining”, by William Amadio and J. Drew Procaccino, examined the usefulness of analyzing text-based online reviews using text mining tools and visual analytics for SWOT analysis, as applied to the hotel industry, to develop competitive actions. The findings showed that the hotels selected for the study competed on almost the same characteristics, and the SWOT analysis helped develop strategies for each of them.

It is important to highlight that the SWOT analysis in Amadio and Procaccino's paper was not guided by data, and this was one of the motivations for creating a data-driven blue ocean strategy in this study.

3. CONCEPTUAL MODEL

The tourist destination analyzed is São Luís, in the state of Maranhão, Brazil. The city was chosen because its economy is influenced in large part by local tourism, and the tourism strategy needs to improve so that the local economy can benefit from the actions and campaigns implemented by the state government. Maranhão has the “Observatório do Turismo”, which conducts research and seeks to help the state, municipalities, and the hotel sector create strategies and tourism policies appropriate to each local reality (source: Observatory of the Tourism of Maranhão). However, such research can be expensive and provides limited spatial and temporal coverage (Wood et al., 2013). The implementation of text mining methods based on data from social networks therefore aims to make it easier to acquire information about the area, so that more specific campaigns can be developed according to the reality of the region, at a lower cost and with a greater benefit.

Following Xu (2019), the reviews to be collected will be divided between hotel segments: budget, midlevel, and luxury. Budget hotels focus on providing good value for money by offering standardized accommodation, limited services, and cheaper room rates compared with upgraded hotels; midlevel hotels are in the midrange of functionality and price; and luxury hotels focus on providing customers with additional pleasure and comfort through premium products and services (Xu, 2019). In this paper, budget hotels are tourist accommodations with 1 or 2 stars, midlevel hotels are those with 3 stars, and luxury hotels are those with 4 or 5 stars.

Hotels are categorized by their number of stars. This approach is used because hotels with different star levels charge different prices and offer customers different levels of quality in the attributes of their products and services. Although hotels with higher star levels usually offer higher-quality core attributes and more varied auxiliary attributes, hoteliers should know that this will not necessarily lead to customer satisfaction, because the higher price raises customer expectations; when the perceived attributes do not meet those expectations, customers are dissatisfied (Xu, 2019).

To obtain an overall view of the reviews, we first combined all of them into a single text block to identify the key tourism factors pointed out by the travelers. Then, in terms of hotel rating, we separated the reviews into two parts, with text blocks for budget/midlevel (three-star and below) and for luxury (four- and five-star) hotels (H. Li, Ye, & Law, 2013). The data gathered are review content, review date, city, and hotel star rating (H. Li et al., 2013).
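The segment definitions and the two-way split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `(text, stars)` layout of a review are hypothetical and are not taken from the original analysis.

```python
def hotel_segment(stars):
    """Return the segment for a star rating, as defined in the text."""
    if stars in (1, 2):
        return "budget"
    if stars == 3:
        return "midlevel"
    if stars in (4, 5):
        return "luxury"
    raise ValueError(f"unexpected star rating: {stars!r}")


def analysis_group(stars):
    """Pool budget and midlevel hotels, keeping luxury hotels apart."""
    return "luxury" if hotel_segment(stars) == "luxury" else "budget/midlevel"


def text_blocks(reviews):
    """Join review texts into one block per analysis group.

    `reviews` is an iterable of (text, stars) pairs (hypothetical layout).
    """
    blocks = {"budget/midlevel": [], "luxury": []}
    for text, stars in reviews:
        blocks[analysis_group(stars)].append(text)
    return {group: " ".join(texts) for group, texts in blocks.items()}
```

Keeping the three-way segment and the two-way analysis group as separate functions makes it easy to report results at either level of detail.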

After gathering the data and applying all the necessary text pre-processing steps, sentiment analysis will be performed on the review content. As already mentioned, if the result is positive, the value +1 is assigned; if it is negative, the value -1 is assigned; and if it is neutral, the value 0 is assigned.

To better understand the factors and topics that should be analyzed to create a strategy, a new conceptual model is proposed, based on the tourism factors presented in figure 3, with key topics that give more guidance to the data-driven approach proposed in this paper.

For example, if a review contains the words “quarto” (room) or “cama” (bed), it belongs to the facilities topic; if it contains “atendimento” (service) or “café” (coffee), it belongs to the services topic. Both topics fall under the Accommodation factor. The aim is to create a well-defined methodology for classifying the reviews without too much manual work. This study therefore relies on the factors and the subtopics of each factor to guide the classification of the reviews, as shown in figure 6.

Figure 6 – New Conceptual Model

The keyword topics created from this conceptual model and its topics are cleaning, beaches, security, food, facilities, events, customer service, services, brand, location, cost-benefit, touristic spots, breakfast, transportation, and restaurants.

According to the touristic factors, the topic related to the factor “Expenditures” is brand. The topics related to the factor “Activities and Satisfaction” are touristic spots, transportation, and restaurants. The topics related to the factor “Visit” are location, security, and beaches. The remaining topics belong to the factor “Accommodation”: cleaning, food, facilities, events, customer service, services, cost-benefit, and breakfast. No comment, whether positive, negative, or neutral, was found regarding the factor “Travel,” which is why no keyword was created for this factor.

Regarding their meaning: “brand” refers to investment in the hotel brand as a marketing strategy to create value for the guest; “touristic spots” covers all the tourist attractions cited in the reviews; “transportation” refers to the transport available in the city (taxi, bus, cost of Uber, and others); “restaurants” refers to the city's restaurants; “location” is the location of the tourist accommodation; “security” is the sense of security in the area where the accommodation is located; “beaches” refers to the city's beaches, since it is a coastal city; “cleaning” refers to the cleanliness of the rooms and infrastructure; “food” refers to the options and quality of the food provided by the accommodation (lunch, dinner, snacks); “facilities” refers to the whole structure of the hotel; “events” refers to the services and structure the accommodation provides for hosting external events; “customer service” is how employees treat guests; “services” covers all services in general provided by the accommodation; “cost-benefit” is the cost-benefit of the stay as perceived by the guest; and “breakfast” refers to the quality of the breakfast provided by the accommodation. It is important to highlight that “cleaning,” “customer service,” and “breakfast” are all services provided by the accommodation, but they had to be split into specific topics because some reviews contain specific criticism of them.

4. METHODOLOGY

4.1. DATA COLLECTION

The first step was to obtain the data, which was downloaded through web scraping. A prior inspection identified the variables needed to analyze the dataset; the main variables scraped were the review, date, and location.

The timeframe selected covers all reviews made in 2018 and 2019. This timeframe is important because once we get through this pandemic, we will emerge in a very different world compared to the one before the outbreak (Donthu & Gustafsson, 2020, p.284). In addition, only reviews written in Portuguese by Brazilians were selected, because TripAdvisor splits its reviews by language and translating the reviews could lose a lot of information and bias the results.

By the end of the web scraping, reviews had been downloaded from 54 establishments, including hotels, inns, and hostels, for a total of 1,392 reviews, which makes this a small dataset. One dataset was created for each hotel, but all classification and sentiment analysis were performed on all the reviews combined in a single dataset.

Because the dataset was small, the strategy was to (1) classify the reviews based on the keywords provided by the conceptual model and (2) label the sentiment of each review as positive, negative, or neutral, and then run the sentiment analysis algorithm to confirm whether the classification was made properly. This labeling step was important because a supervised machine learning algorithm needs access to the desired output in advance. The algorithm chosen was Naïve Bayes.

The reviews were classified by factor and main topic. Both were assigned using keywords such as “location,” “facilities,” “uber,” and “beach,” which are listed in figure 6. The factors became binary (dummy) variables, and the main topic was a categorical variable. A review can have more than one factor and more than one topic, but only one sentiment: positive, negative, or neutral.
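A minimal sketch of this keyword-based classification is given below. The keyword lists are illustrative assumptions: only “quarto,” “cama,” “atendimento,” “café,” and “uber” come from the text, while “localizacao,” “taxi,” and “praia” are hypothetical additions, and the complete lists are those of figure 6. The function and variable names are likewise hypothetical.

```python
import re
import unicodedata

# Illustrative keyword lists only; the complete lists are those of figure 6.
TOPIC_KEYWORDS = {
    "facilities": ["quarto", "cama"],
    "services": ["atendimento", "cafe"],
    "location": ["localizacao"],        # assumed keyword
    "transportation": ["uber", "taxi"],  # "taxi" is an assumed keyword
    "beaches": ["praia"],               # assumed keyword
}
TOPIC_TO_FACTOR = {
    "facilities": "Accommodation",
    "services": "Accommodation",
    "location": "Visit",
    "beaches": "Visit",
    "transportation": "Activities and Satisfaction",
}
FACTORS = ["Accommodation", "Visit", "Activities and Satisfaction",
           "Expenditures", "Travel"]


def normalize(text):
    """Lowercase and strip accents so that "café" matches "cafe"."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_review(text):
    """Return (topics, factor dummies) for one review."""
    tokens = set(re.findall(r"[a-z]+", normalize(text)))
    topics = sorted(t for t, kws in TOPIC_KEYWORDS.items() if tokens & set(kws))
    factors = {TOPIC_TO_FACTOR[t] for t in topics}
    dummies = {f: int(f in factors) for f in FACTORS}
    return topics, dummies


topics, dummies = classify_review(
    "O quarto é ótimo e o café da manhã excelente, perto da praia")
```

Here a review can receive several topics and several factors at once, which matches the description above; the sentiment label is handled separately.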

4.2. MACHINE LEARNING APPROACH

For the sentiment analysis, the variable “label” was created: “+1” if the review was positive, “-1” if it was negative, and “0” if it was neutral. These values were important because the average and distribution of the labels were used in the next step of the analysis, with the Blue Ocean tools.
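The label encoding and the two summary statistics mentioned above can be sketched as follows. The names and the toy data are hypothetical and serve only to illustrate the computation.

```python
from collections import Counter

# Numeric value assigned to each sentiment, as described in the text.
LABEL_VALUE = {"positive": 1, "negative": -1, "neutral": 0}


def encode_labels(sentiments):
    """Turn sentiment names into the +1 / 0 / -1 labels."""
    return [LABEL_VALUE[s] for s in sentiments]


def label_summary(labels):
    """Return the mean label and the share of each label value."""
    counts = Counter(labels)
    n = len(labels)
    mean = sum(labels) / n
    distribution = {value: counts.get(value, 0) / n for value in (1, 0, -1)}
    return mean, distribution


sentiments = ["positive", "positive", "neutral", "negative"]  # toy data
labels = encode_labels(sentiments)
mean, distribution = label_summary(labels)
```

Because the labels are +1, 0, and -1, the mean label equals the share of positive reviews minus the share of negative reviews, so it can be read as a net sentiment score between -1 and +1.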

The dataset was cleaned beforehand because it contained stray characters and misspelled words that could interfere both with the classification of reviews into topics and factors, which was done using functions, and with the pre-processing steps of the sentiment analysis.

Sentiment analysis was performed in Google Colaboratory, for safety reasons. Data analysis and exploratory analysis were carried out on the whole dataset, especially on the variables for the tourism factors and the label that classifies each review's sentiment. Because the reviews were in Portuguese, a Portuguese spaCy model, loaded with spacy.load("pt"), was used to process the words before the sentiment analysis.

Several data preparation steps had to be applied before the sentiment analysis: pre-processing, handling of abbreviations, creation of bigrams, creation of new stop words, bag-of-words, and TF-IDF. After that, the Naive Bayes algorithm was applied. Naive Bayes was chosen because it is a popular algorithm for text mining, is easy to apply, and copes with datasets that are not very large.
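To make the sequence of steps concrete, the sketch below chains stop-word removal, bigram creation, TF-IDF weighting, and a multinomial Naive Bayes classifier using only the Python standard library. It is not the implementation used in this research, which relied on spaCy and other tooling; the stop-word list, the smoothing constant, the class and function names, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter

STOP_WORDS = {"o", "a", "e", "de", "da", "do", "muito"}  # assumed, illustrative


def tokenize(text):
    """Lowercase, drop stop words, and append bigrams to the unigrams."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}


def tfidf(doc, idf):
    """Bag-of-words counts weighted by IDF; unseen terms are dropped."""
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


class NaiveBayes:
    """Multinomial Naive Bayes over weighted term vectors."""

    def fit(self, vectors, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = {t for v in vectors for t in v}
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            weight = Counter()
            for v in rows:
                weight.update(v)  # accumulates the TF-IDF weights per term
            total = sum(weight.values()) + alpha * len(self.vocab)
            self.log_like[c] = {t: math.log((weight[t] + alpha) / total)
                                for t in self.vocab}
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t]
                for t, w in vector.items() if t in self.vocab)
        return max(self.classes, key=score)


# Toy labeled reviews (invented): +1 positive, 0 neutral, -1 negative.
reviews = [
    ("quarto limpo e atendimento otimo", 1),
    ("cafe da manha otimo", 1),
    ("quarto sujo e atendimento ruim", -1),
    ("hotel fica no centro", 0),
]
docs = [tokenize(text) for text, _ in reviews]
labels = [label for _, label in reviews]
idf = fit_idf(docs)
model = NaiveBayes().fit([tfidf(d, idf) for d in docs], labels)
prediction = model.predict(tfidf(tokenize("atendimento otimo"), idf))
```

With so few examples the classifier is only a demonstration of the mechanics; on a real dataset the same steps would normally be evaluated on held-out reviews with the measures discussed next.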

The measures used to assess the performance of the Naive Bayes models were Accuracy, Precision, Recall, and F1 score. Recall is the proportion of Real Positive cases that are correctly Predicted Positive. Precision, in contrast, denotes the proportion of Predicted Positive cases that are correctly Real Positives (Powers, 2011). The relation between the two measures can be understood from the table below:

Table 1. Systematic and traditional notations in a binary contingency table. Shading indicates

correct (light=green) and incorrect (dark=red) rates or counts in the contingency table.

To understand Accuracy, it is first necessary to understand two other measures: Inverse Recall and Inverse Precision. Inverse Recall is the proportion of Real Negative cases that are correctly Predicted Negative. Conversely, Inverse Precision is the proportion of Predicted Negative cases that are indeed Real Negatives (Powers, 2011). Accuracy therefore explicitly takes into account the classification of negatives and is expressible both as a weighted average of Precision and Inverse Precision and as a weighted average of Recall and Inverse Recall (Powers, 2011). Lastly, the F1 score plays a similar role to accuracy, as a general measure of the system's performance, but the difference is that the F1 score does not take true negatives into account, which could affect the results if true negatives are crucial to the analysis (Powers, 2011).
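These definitions can be checked numerically with a small sketch that treats one class as "positive" and all other classes as "negative". The function name and the toy predictions are hypothetical; the final lines verify the two weighted-average identities for Accuracy stated above.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix measures for one class against all the others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    inverse_precision = tn / (tn + fn) if tn + fn else 0.0
    inverse_recall = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall,
        "inverse_precision": inverse_precision,
        "inverse_recall": inverse_recall,
        "accuracy": (tp + tn) / len(pairs),
        "f1": f1,
    }


y_true = [1, 1, 1, 0, -1, -1]   # toy labels
y_pred = [1, 1, 0, 1, -1, 1]    # toy predictions
m = binary_metrics(y_true, y_pred, positive=1)

n = len(y_true)
real_pos = (m["tp"] + m["fn"]) / n   # share of real positives
pred_pos = (m["tp"] + m["fp"]) / n   # share of predicted positives
# Accuracy as a weighted average of Recall and Inverse Recall ...
by_recall = real_pos * m["recall"] + (1 - real_pos) * m["inverse_recall"]
# ... and as a weighted average of Precision and Inverse Precision.
by_precision = (pred_pos * m["precision"]
                + (1 - pred_pos) * m["inverse_precision"])
```

Note that the F1 score is computed without `tn`, which is the sense in which it ignores true negatives.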

After the sentiment analysis, the bag-of-words it generated was also used in the Blue Ocean step, where it could be compared with the topic classification and provided in-depth information about which aspects of the tourism industry the analysis should focus on.

4.3. BLUE OCEAN APPROACH

For the Blue Ocean step, the whole dataset was first analyzed, and the average of the labels and their distribution were calculated. The topics to analyze were selected based on how frequently they appeared in the reviews and then compared with the frequency dictionary and the bag-of-words created in the machine learning step. For example, the word “café” appeared in the bag-of-words and the frequency dictionary as one of the most cited in the reviews. It is also part of “services” in the keyword classification, so it was analyzed separately from the other services. Words like “quartos” and “hotel” fell within the topic “facilities,” so in that case the whole topic was analyzed, because many points regarding facilities were cited.

For decision-making, the average metric was used. If a topic's average was below the overall industry average, the topic was considered below the level; if it was above the overall industry average, the topic was considered above the level; and if it was approximately equal to the tourism chain average, the topic was considered at the same level. Based on that, the Strategy Canvas chart for Blue Ocean was created, to be compared with other segments of the tourism industry. The same methodology was applied when analyzing the budget/midlevel hotels (1 to 3 stars) and the luxury hotels (4 and 5 stars).
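The comparison rule can be sketched as below. The text does not say how close "approximately equal" is, so the `tolerance` of 0.15 is purely an assumption, as are the function names and the toy reviews.

```python
from collections import defaultdict


def topic_means(reviews):
    """Mean label per topic; `reviews` holds (topics, label) pairs."""
    labels_by_topic = defaultdict(list)
    for topics, label in reviews:
        for topic in topics:
            labels_by_topic[topic].append(label)
    return {t: sum(v) / len(v) for t, v in labels_by_topic.items()}


def topic_level(topic_mean, industry_mean, tolerance=0.15):
    """Classify a topic relative to the overall average.

    `tolerance` is an assumed band for "approximately equal".
    """
    if topic_mean > industry_mean + tolerance:
        return "above"
    if topic_mean < industry_mean - tolerance:
        return "below"
    return "same"


# Toy data (invented): each review has a list of topics and one label.
reviews = [
    (["services", "facilities"], 1),
    (["services"], 1),
    (["facilities"], -1),
    (["location"], 0),
    (["services", "location"], 1),
]
industry_mean = sum(label for _, label in reviews) / len(reviews)
means = topic_means(reviews)
levels = {t: topic_level(m, industry_mean) for t, m in means.items()}
```

In this toy example the overall average is 0.4, so "services" (1.0) is above the level, "facilities" (0.0) is below it, and "location" (0.5) falls within the assumed band and counts as the same level.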

The other tool used was the Strategy Canvas. To construct it, the position of the 15 topics in the industry had to be compared with the position of the same topics in the category whose value curve we want to perceive. The part of the hotel industry chosen for analysis was budget and midlevel tourist accommodation, classified by TripAdvisor as 3, 2, and 1 stars. This decision was made because more establishments fall into this category than into the 4- and 5-star category, so improvements derived from the results can be applied in more establishments.

To construct the Strategy Canvas, the horizontal axis held the names of the 15 topics, and the vertical axis took as input the average of the labels for each topic, derived from the reviews of tourists who were in the city in 2018 and 2019. The topics were analyzed with the same method that was applied to the whole industry.
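The data behind the two curves of the canvas, one for the whole industry and one for the 1- to 3-star segment, can be sketched as follows. The dictionary layout of a review, the function name, and the toy values are assumptions for illustration; plotting is left out so that the sketch depends only on the standard library.

```python
def value_curve(reviews, topics, min_stars=1, max_stars=5):
    """Mean label per topic for hotels within a star range.

    Returns None for a topic that no review in the range mentions.
    """
    curve = {}
    for topic in topics:
        labels = [r["label"] for r in reviews
                  if topic in r["topics"]
                  and min_stars <= r["stars"] <= max_stars]
        curve[topic] = sum(labels) / len(labels) if labels else None
    return curve


# Toy reviews (invented values).
reviews = [
    {"stars": 2, "topics": ["services"], "label": -1},
    {"stars": 3, "topics": ["services", "location"], "label": 1},
    {"stars": 5, "topics": ["services"], "label": 1},
    {"stars": 4, "topics": ["location", "breakfast"], "label": 1},
]
topics = ["services", "location", "breakfast"]  # horizontal axis

industry_curve = value_curve(reviews, topics)               # all hotels
segment_curve = value_curve(reviews, topics, max_stars=3)   # 1 to 3 stars
```

Each dictionary maps a topic on the horizontal axis to its height on the vertical axis, so the two curves can be drawn on the same chart with any plotting tool.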

Based on the results, the topics were placed in the Four Actions Framework using the following rules: if a topic is below the industry but was present in a large number of reviews, it should be Raised; if a topic is far below the industry and rarely cited compared with the industry, with a threshold of below 5% of the reviews, it should be Eliminated; if a topic is rated positively in the industry but did not appear in the reviews of the 3-, 2-, and 1-star category, or appeared in only 5% of them, it should be Created; and if a topic was rated positively in the industry but negatively in the 3-, 2-, and 1-star category, it should be Reduced.
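One possible reading of these rules is sketched below. Several choices are assumptions rather than part of the method as stated: the rules are checked in the order in which the text lists them, "a large number of reviews" is taken as a share of at least 30%, "far below" is taken as a gap of at least 0.5 in mean label, and a topic that matches no rule is left as "keep". Only the 5% threshold comes from the text.

```python
def four_actions(industry_mean, segment_mean, segment_share,
                 rare=0.05, frequent=0.30, far_gap=0.5):
    """Place one topic in the Four Actions Framework.

    industry_mean: mean label of the topic across the whole industry
    segment_mean:  mean label in the 1- to 3-star segment (None if absent)
    segment_share: share of segment reviews that cite the topic
    `frequent`, `far_gap`, and the rule order are assumptions.
    """
    if segment_mean is None:  # topic never cited in the segment
        return "create" if industry_mean > 0 else "keep"
    below = segment_mean < industry_mean
    if below and segment_share >= frequent:
        return "raise"
    if industry_mean - segment_mean >= far_gap and segment_share < rare:
        return "eliminate"
    if industry_mean > 0 and segment_share <= rare:
        return "create"
    if industry_mean > 0 and segment_mean < 0:
        return "reduce"
    return "keep"


# Toy inputs (invented): (industry mean, segment mean, segment share).
examples = {
    "services": (0.6, 0.2, 0.80),
    "events": (0.5, -0.8, 0.02),
    "brand": (0.7, None, 0.0),
    "food": (0.4, -0.2, 0.10),
    "location": (0.3, 0.5, 0.50),
}
actions = {topic: four_actions(*args) for topic, args in examples.items()}
```

Because the rules can overlap (a rarely cited topic may qualify for both Eliminate and Create), the order of the checks matters, and a different order would be an equally defensible reading of the text.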

The results were plotted on the Strategy Canvas alongside the industry curve, creating a value curve that can be seen in the next chapter, Findings.

5. FINDINGS

The data analysis showed that most of the reviews fall under the factor Accommodation, present in 1,372 reviews (over 98% of them). This factor includes all the characteristics of the premises (bedroom, bed, breakfast, etc.), which are a major concern for tourists, in line with the research in the area. The factor Visit also has significant weight in the reviews, being cited in 711 of them, mainly because of the topic “location,” which also appears in the bag-of-words. Activities and Satisfaction came third, appearing in 396 reviews, mainly because of comments on the city's transportation (taxi, bus), the price of an Uber to a tourist attraction, and restaurants and malls near the hotel, inn, or hostel. The other two factors, Travel (12 reviews) and Expenditure (82 reviews), were not significant in number, so topics like beach and comparison with other cities had a lower volume than the others.

Figure 7 – Tourism Factors

Regarding the topics, the most cited were services, facilities, location, cost-benefit, and customer service, with special attention to services and facilities, which appear in almost all the reviews, indicating that these two are important to guests. This is consistent with Accommodation being the most cited factor, since both topics are part of it. “Services” was cited in 1,231 reviews (about 88%), followed by “Facilities” in 1,093 reviews (79%). “Location” comes third, present in 674 reviews (48%). “Cost-benefit” is cited in 230 reviews, followed by “Touristic spots” with 109, “Breakfast” with 29, “Restaurants” with 28, “Cleaning” also with 28, “Events” with 23, “Security” with 17, “Food” with 14, “Beaches” with 13, and lastly “Brand” with 2, as shown below, ordered from the most to the least cited.

Figure 8 - Tourism topics

For the sentiment analysis, the overall sentiment of each review was considered, using the variable “label,” which holds the positive, negative, and neutral scores described in the methodology. The percentage distribution of the labels shows that the majority of the reviews are positive (50.6%), followed by neutral (31.4%) and, in last place, negative (18%).

Figure 9 – Sentiment analysis distribution

The frequency dictionary of the bag-of-words shows the 30 most cited words in the reviews; the number 30 was chosen arbitrarily. Looking at the dictionary, the top five words relate to the facilities topic, in line with the previous results. Divided by topic, five words belong to “facilities” and five to “services.” Other words relate to topics such as “location,” “beach,” “restaurants” and “touristic points.” The remaining words are adjectives, and none of them is negative, which may explain why the majority of the reviews are positive.
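A frequency dictionary of this kind can be sketched in a few lines of Python. The example below is an illustration only and not the code used in this study: the helper name `top_words`, the whitespace tokenizer, the stop-word set and the sample reviews are all assumptions.

```python
from collections import Counter

def top_words(reviews, n=30, stopwords=frozenset()):
    """Return the n most frequent tokens across all reviews (hypothetical helper)."""
    counts = Counter()
    for review in reviews:
        # Naive tokenization: lowercase and split on whitespace.
        tokens = review.lower().split()
        counts.update(tok for tok in tokens if tok not in stopwords)
    return counts.most_common(n)

# Made-up sample reviews, assumed to be already stripped of accents and punctuation.
sample_reviews = [
    "quarto otimo e cafe otimo",
    "quarto limpo",
    "atendimento otimo",
]

print(top_words(sample_reviews, n=2, stopwords={"e"}))
```

In practice the stop-word list and the tokenizer would need to suit Brazilian Portuguese; the cut-off `n` plays the role of the 30 words shown in the frequency dictionary.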

Figure 10 – Frequency dictionary of the words

A word cloud was also created from the bag-of-words. In the word cloud, the bigger the word, the more often it was cited in the reviews. Accordingly, “preco,” “area,” “solicito,” “servico,” “funcionario” and “otimo” are the most cited words in the reviews, all related to the topic service, which demonstrates the importance of this topic to the tourist's travel experience and matches the results shown in Figure 8. Unlike the frequency dictionary, the bag-of-words cloud shows both positive adjectives (like “otima”) and negative ones (like “ridiculo”), demonstrating the plurality of themes that matter to the tourist.


Figure 11 – Bag of words cloud

For the sentiment analysis itself, the Naive Bayes algorithm was applied, treating the task as a classification problem with three classes.
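To make the idea concrete, the sketch below implements a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing in plain Python. It is an assumption-laden illustration, not the implementation used in this study: the function names, the tokenized toy reviews and their labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the class with the highest log-posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up, pre-tokenized training reviews with three sentiment classes.
train_docs = [
    ["otimo", "atendimento"],
    ["funcionario", "solicito", "otimo"],
    ["quarto", "sujo", "ridiculo"],
    ["banheiro", "sujo"],
    ["hotel", "centro"],
    ["quarto", "centro"],
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, ["atendimento", "otimo"]))
```

With real data, the reviews would first go through the cleaning and bag-of-words steps described in the methodology, and a held-out test set would be used to compute accuracy, precision, recall and f1-score.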

Looking at the results of the first classifier, the number of observations in the positive class is significantly greater than in the other classes. This shows that the model is biased toward positive observations and tends to assign most observations to this class. As for the measures, the accuracy of the first model is 69.37%, which is taken here as meaningful, since results above 50% are considered meaningful. Precision is 75.09%, recall is 58.07% and the f1-score is 59.69%, which confirms the bias toward the positive class and indicates that some observations were classified as false negatives.


Figure 12 – First classifier model

The results of the second model can also be considered correct and meaningful, because all the measures (precision, recall, and f1-score) are greater than 0.5 and evenly balanced across the negative, neutral, and positive classes, with generally higher values than in the first model. Here the positive class receives the most observations, followed by the negative and neutral classes. Overall, accuracy is 75.05%, precision is 72.19%, recall is 74.66% and the f1-score is 73.10%, confirming that this model is meaningful.

Figure 13 – Second classifier model


Lastly, the third model can also be considered correct and meaningful. Here too, the measures for the positive, neutral, and negative classes are above 0.5 and evenly balanced, with almost the same values as the second model. The positive class again receives the most observations, followed by the negative and neutral classes, respectively. Overall, accuracy is 74.69%, precision is 71.69%, recall is 74.65% and the f1-score is 72.76%, confirming that this model is also meaningful.

Figure 14 – Third classifier model

For this step, we can conclude that the classification was correct and meaningful, with a small bias toward the positive class, and that the second model performed best of the three.

Finally, another measure applied to the model was the confusion matrix. A confusion matrix is a method for visualizing classification results that reports the labels predicted by the model against the labels already assigned in the dataset (Konkiewicz, 2019). In other words, it is another way to view true and false negatives and positives.

True Positives (TP) are the number of predictions where the classifier correctly predicts the positive class as positive; True Negatives (TN) are the number of predictions where the classifier correctly predicts the negative class as negative; False Positives (FP) are the number of predictions where the classifier incorrectly predicts the negative class as positive; and False Negatives (FN) are the number of predictions where the classifier incorrectly predicts the positive class as negative (Mohajon, 2020). Confusion matrices usually have only two classes, but for this study a confusion matrix with three classes was created, corresponding to the sentiment of the reviews (positive, negative, and neutral).

For a confusion matrix with three classes, it is important to look at the numbers on the main diagonal, which show how many reviews the classifier assigned to their correct class. For a perfect model, all the non-zero values would lie on the main diagonal. In the confusion matrix of this model, the diagonal holds most of the observations: 39 (correctly predicted as negative), 57 (correctly predicted as positive), and 113 (correctly predicted as neutral), so the predictions of the model can be considered largely correct.

Non-zero values outside the main diagonal are observations for which the model made the wrong prediction. Here, the 16 at the intersection of the first row and the second column indicates that 16 observations were classified as positive by the model although their label is negative. Similarly, the 2 in the first row and third column indicates that 2 observations were classified as neutral although their correct class is negative. In the same way, 11 observations were classified as negative although their labels are positive, and the 25 at the intersection of the second row and third column shows that 25 observations were classified as neutral although their labels are positive. As for the neutral labels, no observation was classified as negative, but 15 observations were classified as positive.

The confusion matrix therefore suggests that the model is particularly good at handling neutral observations, because the majority of observations labeled neutral were also classified as neutral by the model, while it has some difficulty classifying observations of the remaining classes.
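The per-class reading above can be checked by recomputing recall and precision from the counts quoted in the text. The sketch below assumes the layout described there (rows are the labels in the dataset, columns are the model's predictions, in the order negative, positive, neutral); the variable and function names are illustrative.

```python
LABELS = ["negative", "positive", "neutral"]

# Rows: label in the dataset; columns: label predicted by the model.
# Counts are the ones reported in the text for the confusion matrix.
MATRIX = [
    [39, 16, 2],    # labeled negative
    [11, 57, 25],   # labeled positive
    [0, 15, 113],   # labeled neutral
]

def accuracy(matrix):
    """Share of observations on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_metrics(matrix, labels):
    """Precision, recall and f1 for each class, computed one class against the rest."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        labeled_total = sum(matrix[i])                    # row sum
        predicted_total = sum(row[i] for row in matrix)   # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / labeled_total if labeled_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

results = per_class_metrics(MATRIX, LABELS)
print(f"accuracy: {accuracy(MATRIX):.4f}")
for label in LABELS:
    print(label, {name: round(value, 4) for name, value in results[label].items()})
```

Under these assumptions, recall is about 0.88 for the neutral class against about 0.68 for negative and 0.61 for positive, which is the pattern described in the paragraph above.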

In the end, we can conclude that the sentiment analysis produced by this model is statistically meaningful, taking into consideration the results of the Naive Bayes classifier with three classes and the confusion matrix.


Figure 15 – Confusion Matrix

5.1. BLUE OCEAN FINDINGS

The average sentiment score across all reviews was 0.33, so a topic-by-topic analysis was carried out to see where each topic stands relative to it. A topic well below the overall average was treated as a negative factor; a topic close to the average (2 points above or below) was treated as an “average” or “neutral” factor; and a topic above the average was treated as a positive factor. In addition to the average, the proportional distribution of the positive, negative, and neutral labels was considered when positioning each topic.
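The averaging rule can be written as a small function. This is a sketch under two assumptions: that “2 points” means 0.02 on the sentiment scale (which matches the averages reported for the topics in this section), and that only the average is used, leaving out the label distribution that was also considered in the study. The name `classify_topic` is hypothetical.

```python
def classify_topic(topic_mean, overall_mean, tolerance=0.02):
    """Position a topic relative to the overall average (illustrative rule only)."""
    # Round the difference to two decimals so that values such as
    # 0.35 - 0.33 are compared as 0.02 despite floating-point noise.
    delta = round(topic_mean - overall_mean, 2)
    if abs(delta) <= tolerance:
        return "neutral"
    return "positive" if delta > 0 else "negative"

OVERALL_MEAN = 0.33  # average of all reviews, as reported in the text

for topic, mean in [("Events", 0.35), ("Customer service", 0.37),
                    ("Facilities", 0.30), ("Cleaning", -0.68)]:
    print(topic, classify_topic(mean, OVERALL_MEAN))
```

The same function can be reused for a single category by passing that category's own average as `overall_mean`.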

Across the entire dataset, 9 topics are positive, 1 is neutral, and 5 are negative. The positive topics are brand, customer service, breakfast, location, cost-benefit, touristic spots, transportation, restaurants, and services in general; the only neutral topic is events; and the negative topics are cleaning, food, facilities, security, and beaches.

Of the positive topics, “Brand” averaged 0.5, “Customer service” 0.37, “Breakfast” 0.62, “Location” 0.54, “Cost-benefit” 0.55, “Touristic spots” 0.55, “Transportation” 0.73, “Restaurants” 0.86 and “Services in general” 0.37. The neutral topic, “Events,” averaged 0.35. Of the negative topics, “Cleaning” averaged -0.68, “Food” 0.29, “Facilities” 0.30, “Security” 0 and “Beaches” -0.15. The graph below shows the distribution of averages for the entire dataset.

Figure 16 – Industry Canvas

The category analyzed here is budget and mid-level touristic accommodation (referred to in this section as 3 to 1 star). The average of all data in this category was 0.19, well below the overall average (0.33), yet in volume it is the most representative category, with approximately 65% of the reviews. This is one reason why this category was selected for the Blue Ocean analysis; the other is that most of the touristic accommodations, almost 90%, fall within this segment.

Regarding the distribution of topics, 8 topics are positive and 6 are negative; none was classified as neutral. The positive topics are touristic spots, transportation, services in general, customer service, cost-benefit, location, breakfast, and restaurants; the negative topics are cleaning, food, beaches, events, security, and facilities.

Of the positive topics, “Touristic spots” averaged 0.57, “Transportation” 0.75, “Services” 0.23, “Customer service” 0.23, “Cost-benefit” 0.48, “Location” 0.47, “Breakfast” 0.65 and “Restaurants” 0.85. Of the negative topics, “Cleaning” averaged -0.77, “Food” 0.14, “Beaches” -0.5, “Events” -0.57, “Security” -0.06 and “Facilities” 0.15. It is important to note that “Brand” was not mentioned in this category, so no average could be obtained for this topic.

Because this category was chosen to build the value curve using the Strategy Canvas and the 4 Actions Framework, the averages obtained were used to produce the following Strategy Canvas:


Figure 17 – Industry x 3 to 1 stars strategy canvas

In the Strategy Canvas, most topics are below the industry average, except for touristic spots, breakfast, and transportation; restaurants are practically level with the industry (0.85 against 0.86). It is important to note that “cleaning,” “beaches” and “security,” which were already negative in the analysis of the entire industry, are even more negative in this category, indicating that they are critical points in urgent need of improvement for tourist accommodations of 3, 2 and 1 stars. Another topic worth mentioning is “events,” which is the second most negative topic in this category, in sharp contrast to the industry, where its average was positive (0.35). This may be because tourist accommodations of 3 stars and below do not have the structure needed to hold events in their spaces or because, where space and structure exist, they do not satisfy the customer. In any case, this topic deserves further investigation.
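The comparison behind this Strategy Canvas can be reproduced from the averages reported in this section. The sketch below is illustrative: the dictionaries simply transcribe those averages (“Brand” is left out of the category because it was not mentioned there), and the helper name `gaps` is an assumption.

```python
# Topic averages for the whole industry, as reported in the text.
INDUSTRY = {
    "Cleaning": -0.68, "Food": 0.29, "Facilities": 0.30, "Security": 0.0,
    "Beaches": -0.15, "Events": 0.35, "Customer service": 0.37, "Services": 0.37,
    "Breakfast": 0.62, "Location": 0.54, "Cost-benefit": 0.55,
    "Touristic spots": 0.55, "Transportation": 0.73, "Restaurants": 0.86,
    "Brand": 0.5,
}

# Topic averages for the 3 to 1 star category, as reported in the text.
BUDGET = {
    "Cleaning": -0.77, "Food": 0.14, "Facilities": 0.15, "Security": -0.06,
    "Beaches": -0.5, "Events": -0.57, "Customer service": 0.23, "Services": 0.23,
    "Breakfast": 0.65, "Location": 0.47, "Cost-benefit": 0.48,
    "Touristic spots": 0.57, "Transportation": 0.75, "Restaurants": 0.85,
}

def gaps(segment, industry):
    """Difference between segment and industry averages for shared topics."""
    return {topic: round(segment[topic] - industry[topic], 2)
            for topic in segment if topic in industry}

canvas_gaps = gaps(BUDGET, INDUSTRY)
above_industry = sorted(t for t, gap in canvas_gaps.items() if gap > 0)
widest_gap = min(canvas_gaps, key=canvas_gaps.get)

print("Above industry:", above_industry)
print("Widest negative gap:", widest_gap, canvas_gaps[widest_gap])
```

With these figures, breakfast, touristic spots and transportation come out above the industry, restaurants sit 0.01 below it, and events show the widest negative gap.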

In general, based on the review averages, tourists who stayed in 3-, 2- and 1-star accommodations see the cleanliness of the facilities, the pollution of the beaches, and safety in the area where the accommodation is located as critical factors that must be improved, because in marketing a customer who writes a negative review of an establishment wants it to improve. They also consider the breakfast served in the accommodations satisfactory, as well as the availability of transport nearby, the quality of the restaurants, and the sights of the city. The other topics need more attention and investment from the industry.

Based on the Strategy Canvas created, the proposed value curve for this segment runs from the topic “Cleaning” to “Cost-benefit,” with special attention to the topics “Events” and “Brand,” where the gap in the value curve was largest.

As noted, 3-, 2- and 1-star accommodations averaged well below the general dataset, which leads us to look at the luxury hotels. In the analysis below, the same method was applied to the luxury hotels (referred to in this section as 4- and 5-star hotels) so that the analysis of the topics could be deepened.

The overall average of the luxury hotels dataset was 0.59, and this category represents approximately 35% of all reviews of the hotel sector of São Luís do Maranhão. This shows that the niche contributes significantly to the average of all reviews for the hotel sector in the analyzed city.

Regarding the distribution of topics, 7 topics were considered positive, 3 neutral and 5 negative. The positive topics are security, customer service, location, cost-benefit, transportation, restaurants, and events; the neutral topics are breakfast, facilities, and services; the negative topics are cleaning, food, brand, touristic spots, and beaches.

Of the positive topics, “Security” averaged 1, “Customer service” 0.63, “Location” 0.71, “Cost-benefit” 0.68, “Transportation” 0.71, “Restaurants” 0.87 and “Events” 0.75. Of the neutral topics, “Breakfast” averaged 0.58, “Facilities” 0.58 and “Services” 0.60. Of the negative topics, “Cleaning” averaged -0.20, “Food” 0.43, “Brand” 0.5, “Touristic spots” 0.48 and “Beaches” 0.14. The Strategy Canvas below compares the distribution of averages across the industry with that of the 4- and 5-star category:

Figure 18 – Industry x 4 and 5 stars strategy canvas

Compared with the budget and mid-level category, the Strategy Canvas of the two niches is as follows:


Figure 19 – 3 to 1 stars category x 4 and 5 stars category strategy canvas

Considering the Strategy Canvas built against the industry, the value curve for the luxury hotels category should run from “Touristic spots” to “Transportation.” The same value curve can be seen in the Strategy Canvas built against the 3- to 1-star accommodations.

Because the focus of the Blue Ocean strategy here is on budget and mid-level hotels, the next section discusses another Blue Ocean tool, the 4 Actions Framework, together with proposed marketing strategies for this segment.


6. DISCUSSION

Based on the Blue Ocean results, some tourism topics are perceived differently by tourists in the two categories, such as events, brand, tourist spots, breakfast, security, and transport. This can be explained mainly by the different characteristics of the categories.

The topics “events” and “security” were evaluated positively in 4- and 5-star hotels but negatively in accommodations of 3 stars and below, while “brand” appeared only in the former. For events, this can be explained by 4- and 5-star hotels having a better-defined structure for offering spaces and organizing events, which weighs heavily in the reviews of this category. It is also notable that “brand” was mentioned only in the 4- and 5-star category, and in a positive way, which suggests that accommodations of 3 stars and below could make more of brand investment, since “brand” had a positive average when mentioned. For “security,” the number of mentions differed greatly between the two categories: in accommodations of 3 stars and below it was mentioned 16 times (in approximately 2% of reviews), whereas in the 4- and 5-star category it was mentioned only once, in a positive way. This result suggests that the feeling of insecurity is more present for tourists staying in accommodations of 3 stars and below than for tourists in luxury hotels.

The topics “tourist spots,” “breakfast” and “transport” are perceived more positively by guests of accommodations of 3 stars and below. For “tourist spots,” this can be associated with the fact that many hostels, hotels, and inns are in the historic city center, which concentrates a high number of tourist spots, making this topic more positive than for luxury hotels, which are mostly concentrated in the coastal part of the city. For “transport,” the availability of transport is also more diverse in the historic center than in the coastal part, which may have contributed to this topic being perceived more positively than in luxury hotels. The other topic that draws attention is “breakfast,” which is also perceived more positively by guests of accommodations of 3 stars and below, indicating that it is a strong point of this category.

All other topics are seen more positively by guests of luxury hotels, except cleanliness, which is seen as negative in both niches, and restaurants, which is rated almost identically positive in both.

This shows that the hygiene and cleanliness of the rooms is a critical factor that needs very careful attention from the entire hotel sector of the city analyzed. As for restaurants, there is an opportunity to establish partnerships between tourist accommodations and the restaurants around them, as this topic was rated the most positive of all the tourism topics addressed.

To build the 4 Actions Framework, the threshold for selecting the topics to be analyzed was set below 5%, and each topic was placed in “Eliminate,” “Create” or “Raise” depending on whether its average was below or above the general average of the dataset (0.19). The number of times each topic was mentioned in the reviews was also considered to verify its relevance. The topics selected on these criteria were restaurants, transport, breakfast, cleaning, beaches, security, food, brand, and events.

Based on what has been developed so far, the proposed 4 Actions Framework for accommodations of 3 stars and below is as follows:

Figure 20 – 4 actions framework based on 3 to 1 stars category

Based on the applied methodology, no topic was found that needed to be reduced below the industry standard.

Under Raise are topics that carry significant weight in review citations but score below the tourism industry average; investment in these points can be increased, since the same topics were also well rated in the 4- and 5-star segment.

Under Eliminate are the critical topics that were rated negatively in both segments and relative to the industry: “cleanliness,” “beaches” and “safety.” The purpose of these frames is to understand what the negative evaluation suggests. For “cleaning,” the negative rating indicates that cleaning is not done properly, so the frame is to eliminate “uncleanliness” and achieve more efficient cleaning. This topic deserves particular attention because the world experienced a pandemic in 2020, which made cleanliness fundamental to the credibility of a tourist accommodation that year. For “beaches,” the pollution of the beaches caused the negative rating, so the proposal is to eliminate “pollution of the beaches,” which is also very important because most 4- and 5-star hotels are located by the sea. The third topic, safety, reflects the insecurity guests feel around the location of the hotels. Two of these topics (“beaches” and “safety”) fall under government responsibility, but they are worth emphasizing because they also shape the image of the destination.

Under Create are topics that were rated very positively but were not often mentioned in the segment of 3 stars and below. Several marketing strategy opportunities arise here, such as partnering with restaurants and transportation providers. In general, the city's restaurants and the diversity of transport (as well as Uber fares) were positively evaluated by tourists, so a partnership between tourist accommodations and these establishments would benefit both sides by adding value to the tourist experience. Another proposal is to widen the choice of food and improve the quality of what the accommodations offer, since this factor is also rated positively in the luxury segment. A further suggestion is to invest in the brand, namely in branding, since a positive brand stays engraved in the tourist's mind. Another fundamental point is events, which were very well rated in the luxury segment but were rated negatively and hardly mentioned in the segment of 3 stars and below. Many accommodations of 3 stars and below do not offer space for events; where feasible, doing so would be a great opportunity to expand the services they offer.

Regarding the results by topic, the most critical ones, which are also placed in the Eliminate frame, are factors external to the accommodations, related to tourism policy and government. These topics deserve sufficient attention if the city of São Luís is to become more competitive in tourism, because studies have found that tourism competitiveness influences tourism flows and gross domestic product (GDP) (Hossein, Bazargani, & Kiliç, 2021).

The proposals of the 4 Actions Framework were created by comparing the value curve of the industry with that of the budget and mid-level category, and the value curve of this same segment with that of luxury hotels. The strategies described meet the needs of the industry, since tourism depends heavily on its stakeholders and the analysis shows their importance as well as how they are perceived, especially restaurants and transport, which were positively evaluated.


7. THEORETICAL AND METHODOLOGICAL CONTRIBUTIONS

7.1. THEORETICAL IMPLICATIONS

This research makes an important theoretical contribution. Although many previous studies perform sentiment analysis on review data (J. Li et al., 2018), this is, to the author's knowledge, the first approach that uses the results of the sentiment analysis to create a marketing strategy using Blue Ocean tools, guided entirely by data, since the aim of this study is to be data-driven from beginning to end.

This study also sits at the intersection of three areas, Marketing Strategy, Tourism, and Data Science, opening the door for other approaches and methodologies that can help innovate in those areas and providing a methodology that can be applied by researchers as well as practitioners. Interdisciplinary research is important for many reasons, one of which is that it reaches a wider audience (Glod, 2016).

Another theoretical contribution comes from the new conceptual model, which deepens the tourism factors that influence travel through the creation of the topics. These topics and their importance, especially those related to facilities and services, corroborate the findings of Filieri, Galati, and Raguseo (2021), who show that different hotel attributes play different roles in predicting the helpfulness of extremely positive and extremely negative reviews, thus adding knowledge about the importance of product and service attributes in electronic word-of-mouth. In this study, the results show that the topics related to facilities (which include room, bed, bathroom, and others) and services were highly cited and that some of them, like cleaning, were evaluated as extremely negative by tourists in the reviews.

Another point of agreement is that Filieri, Galati, and Raguseo (2021) show that the most important hotel attributes in reviews with extremely negative ratings include hospitality, bathroom, room, and price/quality ratio. Consumers consider the hospitality attribute particularly important; it is relevant and frequently discussed in reviews with both extremely positive ratings (EPRs) and extremely negative ratings (ENRs). These attributes can be seen in the sentiment analysis step of this study, specifically in the frequency dictionary and the bag-of-words approach. Hospitality can be related to “funcionarios” and “atendimento”; bathroom in Brazilian Portuguese is “banheiro”; room is “quarto”; and price/quality ratio can be related to “preço.” All these attributes belong to a topic or are themselves a topic (like cost-benefit).

7.2. MANAGERIAL IMPLICATIONS

For managers, this study also makes a major contribution, since the results of the sentiment analysis are used to create marketing strategies for the categories of interest. The methodology can be applied to a specific category, as in this study, but also to a specific hotel. The only caveat is that it should be applied by a professional with a moderate understanding of data science, especially of programming languages like Python (used in this paper) and R, and an understanding of what Blue Ocean is and what its tools aim to do. In short, the method should be applied by a professional who understands at least two of the three areas involved (since the tourism topics have already been created), with the decision on which marketing strategy to apply left to the manager or marketing manager.

This methodology can also be applied by governmental institutions, specifically those that manage a city's tourism. Many tourism topics, such as beaches, security, restaurants, touristic spots, and transportation (33% of all the topics), were cited, and some of them were rated extremely negatively (beaches, security). This tells such institutions that government investment in the city can give a positive return, since a satisfying travel experience involves factors related to the hotel, namely products and services (Filieri, Galati, & Raguseo, 2021), but also natural and cultural resources (Hossein, Bazargani, & Kiliç, 2021). The government institution can therefore focus on the topics related to the city in general and apply the results in a marketing strategy tool such as Blue Ocean.

7.3. LIMITATIONS AND FUTURE RESEARCH

The data-driven approach proposed in this paper brings together three areas: Tourism, Data Science and Marketing Strategy. This research results from a master's degree in Statistics and Information Management with Data Science as a secondary area, so the author's knowledge of Python was limited, which made the work harder because everything based on machine learning in this paper was learned during the process. As a result, some steps that were done manually could have been automated.

Another limitation was that accented characters were not recognized in the data downloaded from the web pages, so before applying the data cleaning steps of the sentiment analysis, a preliminary cleaning of the words was necessary for them to be recognized in the later text mining steps.
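One common way to make accented Portuguese words consistent before text mining is to strip the diacritics with Python's standard `unicodedata` module. The sketch below is an assumption about how such a preliminary cleaning could look, not a description of the cleaning performed in this study; the helper name `strip_accents` is hypothetical, and this step does not repair text that was decoded with the wrong encoding.

```python
import unicodedata

def strip_accents(text):
    """Remove diacritics so that, for example, 'preço' becomes 'preco'."""
    # NFKD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ["preço", "ótimo", "serviço", "funcionário"]:
    print(word, "->", strip_accents(word))
```

Applying the function to every review before tokenization yields unaccented forms such as “preco,” “servico” and “otimo,” like those that appear in the word cloud.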

The major limitation was that, because this methodology is innovative in its own way, most of the approach was created from scratch based on other papers, and the information had to be studied in depth to achieve a satisfactory result and keep the approach data-driven from beginning to end.

For future research, it is important to apply this methodology to the coronavirus pandemic period, starting in 2020, because the topics analyzed could change or other topics (like hygiene) could emerge, creating a new value curve. The period of this analysis could also be compared with the period after the pandemic to check whether any factor or topic has changed, given the unusual times humanity was facing.


It would also be important to apply this methodology to a larger dataset, because some steps, like text classification, could then be performed automatically by an algorithm, which would be another contribution to the areas.

Another direction for future research is to apply this methodology to specific types of accommodation (like inns or hostels) to identify more precisely what could be improved in each type.

Another suggestion for future research is to try other algorithms when implementing the sentiment analysis models. The methodology can also be applied by combining data from multiple sources to get a broader view of the topics. The same method can be applied to different seasons, since seasonality is one of the most important factors for tourism and can account for a significant proportion of the variation in tourism demand (Vatsa, 2020). Finally, the method can be applied taking into account the nationality of the tourists (domestic or international) to see whether the perception of the topics changes.


8. BIBLIOGRAPHY

Aciar, S. (2010). Mining Context Information from Consumer's Reviews. 2nd Workshop on Context-Aware Recommender Systems (CARS-2010).

Akram, K., Siddiqui, S. H., Nawaz, M. A., Ghauri, T. A., & Cheema, A. K. (2011). Role of knowledge management to bring innovation: an integrated approach. International Bulletin of Business Administration. (May 2014). Retrieved from https://www.econ-jobs.com/research/9515-Role-of-Knowledge-Management-to-Bring-Innovation-An-Integrated-Approach.pdf

Amadio, W. J., & Procaccino, J. D. (2016). Competitive analysis of online reviews using exploratory text mining. Tourism and Hospitality Management. https://doi.org/10.20867/thm.22.2.3

Afzaal, M., & Usman, M. (2016). A novel framework for aspect-based opinion classification for tourist places. The 10th International Conference on Digital Information Management, ICDIM 2015. https://doi.org/10.1109/ICDIM.2015.7381850

Brennan, B. S., Koo, C., & Bae, K. M. (2018). Smart Tourism: A Study of Mobile Application Use by Tourists Visiting South Korea. Asia-Pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology, 8(10), 1–9. https://doi.org/10.21742/AJMAHS.2018.10.15

Bruno, L. (2019). Introducing machine learning concepts with WEKA. Journal of Chemical Information and Modeling, 53(9), 1689–1699. https://doi.org/10.1017/CBO9781107415324.004

Buhalis, D., & Amaranggana, A. (2014). Smart Tourism Destinations. In Information and Communication Technologies in Tourism (pp. 553–564). https://doi.org/10.1007/978-3-319-03973-2

Cacho. C.L. Philip Chen, C.-Y. Zhang, Data-intensive applications, challenges, techniques, and technologies: A survey on Big Data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.01.015

Chauhan, R., & Kaur, H. (2015). Predictive Analytics and Data Mining. In Business Intelligence. https://doi.org/10.4018/978-1-4666-9562-7.ch019

Chan Kim, W., & Mauborgne, R. (2015). Creating Blue Oceans. Engineering and Technology Magazine, 93–94.

Chiappa, G. Del, Zara, A., Murphy, H. C., Dang, Y., Chen, M., Fountoulaki, P., … Jung, T. (2015). Smart Tourism Destinations Enhancing Tourism Experience Through Personalization of Services. (February), 763–774. https://doi.org/10.1007/978-3-319-14343-9

Cró, S., & Martins, A. M. (2017). The importance of security for hostel price premiums: European empirical evidence. Tourism Management, 60, 159–165.

Davenport, H. T., & Dyche, J. (2013). Big Data in Big Companies. Retrieved January 5, 2015, from http://www.sas.com/resources/asset/Big-Data-in-Big-Companies.pdf

Dickinger, A., & Mazanec, J. A. (2015). Significant word items in hotel guest reviews: A feature extraction approach. Tourism Recreation Research. https://doi.org/10.1080/02508281.2015.1079964

Dietrich, D., Gray, J., McNamara, T., Poikola, A., Pollock, R., Tait, J., & Zijlstra, T. (2012). What is Open Data? (1.0.0). Open Knowledge Foundation. Retrieved from http://opendatahandbook.org/guide/en/what-is-open-data/

Donthu, N., & Gustafsson, A. (2020). Effects of COVID-19 on business and research. Journal of Business Research, 117(January), 284–289. https://doi.org/10.1016/j.jbusres.2020.06.008

El-Ansary, A. I. (2006). Marketing strategy: Taxonomy and frameworks. European Business Review, 18(4), 266–293. https://doi.org/10.1108/09555340610677499

Fang, B., Ye, Q., Kucukusta, D., & Law, R. (2016). Analysis of the perceived value of online tourism reviews: Influence of readability and reviewer characteristics. Tourism Management, 52, 498–506.

Feifei Xu, Nicholas Nash & Lorraine Whitmarsh (2019). Big data or small data? A methodological review of sustainable tourism. Journal of Sustainable Tourism. doi: 10.1080/09669582.2019.1631318

Filieri, R., Galati, F., & Raguseo, E. (2021). The impact of service attributes and category on eWOM helpfulness: An investigation of extremely negative and positive ratings using latent semantic analytics and regression analysis. Computers in Human Behavior, 114(February 2020), 106527. https://doi.org/10.1016/j.chb.2020.106527

Fuchs, M., & Lexhagen, M. (2013). Sentiment Analysis Extracting Decision-Relevant Knowledge from UGC. Information and Communication Technologies in Tourism 2014, (January).

Gassiot, A., & Coromina, L. (2013). Destination image of Girona: an online text-mining approach. International Journal of Management Cases.

Gémar, G., & Jiménez-Quintero, J. A. (2015). Text mining social media for competitive analysis. Tourism & Management Studies.

Glod, B. (2016). The 5 Significant Advantages of Interdisciplinary Research. Retrieved December 29, 2021, from https://theihs.org/blog/5-advantages-of-interdisciplinary-research/

Grus, J. (2015). Data science from scratch. Sebastopol, CA: O'Reilly Media.

Hossein, R., Bazargani, Z., & Kiliç, H. (2021). 
Tourism competitiveness and tourism sector performance : Empirical insights from new data. Journal of Hospitality and Tourism Management, 46(October 2020), 73–82. https://doi.org/10.1016/j.jhtm.2020.11.011 Hjalager, A. (2013). 100 Innovations That Transformed Tourism. Journal Of Travel Research, 54(1), 3-21. doi: 10.1177/0047287513516390. Kim, C., & Mauborgne, R. (2019). blueoceanstrategy.com. Retrieved December 10, 2019, from https://www.blueoceanstrategy.com/blue-ocean-strategy-book/ Konkiewicz, K. (2019). Reading a confusion matrix. Retrieved from towardsdatascience.com website: https://towardsdatascience.com/reading-a-confusion-matrix-60c4dd232dd4 Kotler, Philip, Andreasen, & R., A. 1991. Strategic marketing for nonprofit organizations (4th ed.). Englewood Cliffs [N.J.]: Prentice-Hall. Laney, D. (2001) 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, 6.

https://doi.org/10.1108/09555340610677499

50

Li, J., Xu, L., Tang, L., Wang, S., & Li, L. (2018). Big data in tourism research: A literature review. Tourism Management, 68, 301–323. https://doi.org/10.1016/j.tourman.2018.03.009 Li, H., Ye, Q., & Law, R. (2013). Determinants of Customer Satisfaction in the Hotel Industry: An Application of Online Review Analysis. Asia Pacific Journal of Tourism Research. https://doi.org/10.1080/10941665.2012.708351 Li, Q., Li, S., Zhang, S., Hu, J., & Hu, J. (2019). A review of text corpus-based tourism big data mining. Applied Sciences (Switzerland). https://doi.org/10.3390/app9163300 Lobao, F., Aparicio, M., & Neto, M. D. C. (2019). SMART TOURISM -CITY TOURISM RADAR : A Tourism Monitoring Tool at the City of Lisbon SMART TOURISM – CITY TOURISM RADAR : A Tourism Monitoring Tool at the City of Lisbon. (October). Mariani, M. (2019). Big Data and analytics in tourism and hospitality: a perspective article. Tourism Review, 75(1), 299–303. https://doi.org/10.1108/TR-06-2019-0259. Mehmood, F., Ahmad, S., & Kim, D. H. (2019). Design and development of a real-time optimal route recommendation system using big data for tourists in Jeju Island. Electronics (Switzerland), Vol. 8. https://doi.org/10.3390/electronics8050506. Mohajon, J. (2020). Confusion Matrix for Your Multi-Class Machine Learning Model. Retrieved from towardsdatascience.com website: https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826 Mohd Zawawi, N. F., Abd Wahab, S., Al-Mamun, A., Sofian Yaacob, A., Kumar AL Samy, N., & Ali Fazal, S. (2016). Defining the Concept of Innovation and Firm Innovativeness: A Critical Analysis from Resorce-Based View Perspective. International Journal of Business and Management, 11(6), 87. https://doi.org/10.5539/ijbm.v11n6p87. Nave, M., Rita, P., & Guerreiro, J. (2018). A decision support system framework to track consumer sentiments in social media. Journal of Hospitality Marketing and Management. 
https://doi.org/10.1080/19368623.2018.1435327. Neidhardt, J., Rümmele, N., & Werthner, H. (2017). Predicting happiness: user interactions and sentiment analysis in an online travel forum. Information Technology and Tourism. https://doi.org/10.1007/s40558-017-0079-2. Pantano, E., Priporas, C., & Stylos, N. (2017). ‘You will like it!’ using open data to predict tourists' response to a tourist attraction. Tourism Management, 60, 430-438. doi: 10.1016/j.tourman.2016.12.020. Phillips, P., Zigan, K., Silva, M. M. S., & Schegg, R. (2015). The interactive effects of online reviews on the determinants of Swiss hotel performance: A neural network analysis. Tourism Management, 50, 130-141. Powers, D. M. W. (2011). Evaluation: From Precision , Recall and F-Measure To Roc , Informedness ,

Markedness & Correlation − R. 2(1), 37–63.

Provost, F., & Fawcett, T. (2013). Data science for business: [what you need to know about data mining and data-analytic thinking]. Sebastopol, Calif.: O'Reilly.

https://doi.org/10.1080/10941665.2012.708351

https://doi.org/10.1108/TR-06-2019-0259

https://doi.org/10.5539/ijbm.v11n6p87

https://doi.org/10.1080/19368623.2018.1435327

51

Ramanathan, V., & Meyyappan, T. (2019). Twitter text mining for sentiment analysis on people’s feedback about Oman tourism. 2019 4th MEC International Conference on Big Data and Smart City, ICBDSC 2019. https://doi.org/10.1109/ICBDSC.2019.8645596. Rita, P., Rita, N., & Oliveira, C. (2018). Data science for hospitality and tourism. Worldwide Hospitality and Tourism Themes, 10(6), 717-725. https://doi.org/10.1108/WHATT-07-2018-0050. Statista. (2020). Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2024. Retrieved December 29, 2021, from https://www.statista.com/statistics/871513/worldwide-data-created/.

Secretaria de Turismo do Estado do Maranhão (2017, October 30). Observatório do Turismo do Maranhão [Web Page]. Retrieved from https://sites.google.com/view/observatorioturismomaranhao/p%C3%A1gina-inicial

Siddiqa, A., Hashem, I. A. T., Yaqoob, I., Marjani, M., Shamshirband, S., Gani, A., et al. (2016). A survey of big data management: Taxonomy and state-of-the-art. Journal of Network and Computer Applications, 71, 151–166.

Thelwall, M. (2001). A web crawler design for data mining. Journal of Information Science, 27(5), 319e325.

Travel, H., Are, B., & Marketing, C. D. (n.d.). GETTING TO PEAK How Travel Brands Are Making the Climb. 1–26. Tsiotsou, R. H., Mild, A., & Sudharshan, D. (2012). Tourism Market ( c ) E m er al ro up Pu bl is hi ( c ) E m ro up Pu bl is. (July). Varadarajan, R. (2010). Strategic marketing and marketing strategy: Domain, definition, fundamental issues and foundational premises. Journal of the Academy of Marketing Science, 38, 119–140. Vatsa, P. (2020). Annals of Tourism Research Seasonality and cycles in tourism demand — redux. Annals of Tourism Research, (xxxx), 103105. https://doi.org/10.1016/j.annals.2020.103105 Witten, I. H., Frank, E., & Hall, M. a. (2011). Data Mining: Practical Machine Learning Tools and Techniques (Google eBook). In Complementary literature None. Retrieved from http://books.google.com/books?id=bDtLM8CODsQC&pgis=1. Wood, S. A., Guerry, A. D., Silver, J. M., & Lacayo, M. (2013). Using social media to quantify nature-based tourism and recreation. Scientific Reports, 3. https://doi.org/10.1038/srep02976. Xiang, Z., Du, Q., Ma, Y., & Fan, W. (2017). A comparative analysis of major online review platforms: Implications for social media analytics in hospitality and tourism. Tourism Management, 58,51e65. Xu, H., Yuan, H., Ma, B., & Qian, Y. (2015). Where to go and what to play: Towards summarizing popular information from massive tourism blogs. Journal of Information Science, 41(6), 830-854. Xu, X., & Li, Y. (2016). The antecedents of customer satisfaction and dissatisfaction toward various types of hotels: A text mining approach. International Journal of Hospitality Management, 55,57e69. Xu, X. (2019). Examining the Relevance of Online Customer Textual Reviews on Hotels’ Product and Service Attributes. Journal of Hospitality and Tourism Research. https://doi.org/10.1177/1096348018764573.

https://doi.org/10.1109/ICBDSC.2019.8645596

https://doi.org/10.1108/WHATT-07-2018-0050

http://books.google.com/books?id=bDtLM8CODsQC&pgis=1

https://doi.org/10.1038/srep02976

52

Yang, Y., Williams, M. H., MacKinnon, L. M., & Pooley, R. (2005).A service-oriented personalization mechanism in pervasive environments. s.l., IEEE. Zhao, X., Wang, L., Guo, X., & Law, R. (2015). The influence of online reviews to online hotel booking intentions. International Journal of Contemporary Hospitality Management, 27(6), 1343-1364.

53

9. ATTACHMENT

import numpy as np

import pandas as pd

import re

import nltk

import matplotlib.pyplot as plt

%matplotlib inline

!pip install PyDrive


from pydrive.auth import GoogleAuth

from pydrive.drive import GoogleDrive

from google.colab import auth

from oauth2client.client import GoogleCredentials

auth.authenticate_user()

gauth = GoogleAuth()

gauth.credentials = GoogleCredentials.get_application_default()

drive = GoogleDrive(gauth)

downloaded = drive.CreateFile({'id':"1VbyRP78BGCx1X3DStaPOriMJAAkewoPw"}) # replace the

downloaded.GetContentFile('todo_data.xlsx') # replace the file name with your file

df = pd.read_excel('todo_data.xlsx')

df.head()

   Id  Reviews                                            Label  main_topic                          Expenditures  Tra
0   1  Fomos em família passar oito dias no Blue Tree...      1  serviços:atendimento,café, limpeza           NaN
1   2  Fiz uma reserva para ficar                            -1  serviços:atendimento                         NaN

df1 = df.fillna(0)

df1.head(5)

   Id  Reviews                                            Label  main_topic                          Expenditures  Tra
0   1  Fomos em família passar oito dias no Blue Tree...      1  serviços:atendimento,café, limpeza           0.0
1   2  Fiz uma reserva para ficar                            -1  serviços:atendimento                           0    0

DATA ANALYSIS


# Create the per-category counts used in the charts below

# Expenditures: number of reviews flagged with this category
expenditures = len(df1.loc[df1['Expenditures'] == 1])
expenditures

# Travel
travel = len(df1.loc[df1['Travel'] == 1])
travel

# Activities/Satisfaction
activities_satisfaction = len(df1.loc[df1['Activities/Satisfaction'] == 1])
activities_satisfaction

# Visit
visit = df1['Visit'].sum()
visit

# Accommodation
accommodation = df1['Accommodation'].sum()
accommodation

categories = [expenditures, travel, activities_satisfaction, visit, accommodation]

# Pie chart; slices are drawn counter-clockwise, starting at startangle
labels = 'Expenditures','Travel','Activities/Satisfaction','Visit','Accommodation'
sizes = categories
explode = (0.2, 0.2, 0, 0, 0)  # offset the first two slices ('Expenditures' and 'Travel')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=180, textprops={'fontsize': 12})
ax1.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.show()

#Bar chart

plt.figure(figsize=(8,6))

plt.bar(labels, categories,color=['yellow', 'red', 'purple', 'blue', 'green'])

plt.title('Categories')

plt.xlabel('')

plt.ylabel('Values')

plt.show()

GENERAL SENTIMENT ANALYSIS

# Inspecting the data frame: base size and share of each sentiment label
total_base = sum(df['Reviews'].value_counts())
print("Base Size: {0:.0f}".format(total_base))
print("Percentual Negativos: {0:.2f}%".format(100*sum(df[df['Label'] == -1]['Label'].value_counts())/total_base))
print("Percentual Positivos: {0:.2f}%".format(100*sum(df[df['Label'] == 1]['Label'].value_counts())/total_base))
print("Percentual Neutro: {0:.2f}%".format(100*sum(df[df['Label'] == 0]['Label'].value_counts())/total_base))

Base Size: 1391
Percentual Negativos: 17.97%
Percentual Positivos: 50.61%
Percentual Neutro: 31.42%
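The label shares printed above can also be obtained in a single pass over the labels. The following is a minimal sketch using only the standard library, assuming labels coded as -1, 0 and 1 as in the `Label` column. The helper `label_shares` and the toy list are illustrative and are not part of the original notebook.

```python
from collections import Counter

def label_shares(labels):
    """Return {label: percentage of all labels}, rounded to 2 decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

# Toy labels (hypothetical), not the thesis dataset
toy_labels = [1, 1, 1, 0, 0, -1, 1, 0]
shares = label_shares(toy_labels)
print(shares)
```

Counting once and dividing by the total avoids filtering the data frame separately for each label.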

# Pie chart sentiment analysis:

labels = 'Negatives','Positives','Neutral'

sizes = [17.97,50.61,31.42]

fig1, ax1 = plt.subplots()

ax1.pie(sizes, colors=['red','green','blue'], labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=180, textprops={'fontsize': 12})

ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

PERFORMING SENTIMENT ANALYSIS

1. DIVIDING THE DATASET TO CONTAIN ONLY REVIEWS

features = df1.iloc[:, 1].astype('str')  # keep only the Reviews column, as text
features

0       Fomos em família passar oito dias no Blue Tree...
1       Fiz uma reserva para ficar uma semana, ao cheg...
2       Fomos em 05 pessoas e posso afirmar que a pous...
3       - funcionários atenciosos; - café bom; - Basta...
4       Hotel com acesso a minha rota de trabalho. Am...
                              ...
1386    Vista linda, praia calma, único defeito é o ac...
1387    Vista para o mar. área nobre se slz. Um lugar ...
1388    você já deve ter lido bastante sobre o tamanho...
1389    Voltamos de Barreirinhas e ficamos hospedados ...
1390    Vou relatar minha hospedagem no hotel Portas d...
Name: Reviews, Length: 1391, dtype: object

data = pd.DataFrame(features)
data = data.rename(columns = {'Reviews':'text'})
data['text'] = data['text'].astype(str)  # make sure every review is a plain string
#data.reset_index(inplace= True)
data.dtypes

text    object
dtype: object

!python -m spacy download pt

Collecting pt_core_news_sm==2.2.5
Successfully installed pt-core-news-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('pt_core_news_sm')
✔ Linking successful
You can now load the model via spacy.load('pt')

import spacy

import matplotlib.pyplot as plt

import nltk

from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer

from nltk.util import ngrams

from unicodedata import normalize

from wordcloud import WordCloud

nlp = spacy.load("pt")

# Pre-processing
def pre_process(data):
    # expand the standalone abbreviation "n" to "nao"
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bn\b', 'nao',x, flags=re.IGNORECASE))
    # collapse characters repeated three or more times to a single one
    data['text'] = data['text'].apply(lambda x: re.sub(r'(\w)(\1{2,})', r'\1',x))
    # lowercase
    data['text'] = data['text'].apply(lambda x: x.lower())
    # replace punctuation and other non-word characters with a space
    data['text'] = data['text'].apply(lambda x: re.sub(r'[\W*]+', ' ',x))
    # remove digits
    data['text'] = data['text'].apply(lambda x: re.sub(r'[0-9]', '',x))
    # collapse repeated whitespace, e.g. the gaps left where digits were removed
    data['text'] = data['text'].apply(lambda x: re.sub(r'\s+', ' ',x))
    # Basic abbreviations
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bpq\b', 'porque',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bvc\b', 'você',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bvcs\b', 'você',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\btb\b', 'também',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\btbm\b', 'também',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bpra\b', 'para',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bsr\b', 'senhor',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bta\b', 'está',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bq\b', 'que',x))
    return data

# Join frequent two-word expressions into single tokens
def bgram(data):
    data['text'] = data['text'].apply(lambda x: re.sub(r'boa localização', 'boa_localização',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'localização privilegiada', 'localização_privilegiada',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'atendimento bom', 'atendimento_bom',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'café bom', 'café_bom',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'excelente atendimento', 'excelente_atendimento',x))
    return data
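To make the effect of these cleaning rules easier to follow, the sketch below applies a reduced version of them to a single string instead of a data frame column. The function `normalize_review`, the shortened abbreviation table and the example sentence are assumptions made for illustration only; they are not taken from the original notebook.

```python
import re

# Hypothetical, shortened abbreviation table mirroring some substitutions above
ABBREVIATIONS = {"pq": "porque", "pra": "para", "q": "que", "n": "nao"}

def normalize_review(text):
    """Apply a reduced version of the cleaning steps to one review."""
    text = re.sub(r"(\w)\1{2,}", r"\1", text)   # 3+ repeated characters -> one
    text = text.lower()
    text = re.sub(r"\W+", " ", text)            # punctuation -> space
    text = re.sub(r"[0-9]", "", text)           # drop digits
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    # expand abbreviations on whole words only
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split(" "))

example = normalize_review("Quartooo limpo!!! Voltaria pq o cafe e bom, nota 10")
print(example)
```

Working on one string at a time makes each rule easy to check in isolation before it is applied to the whole `text` column.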

# Adding new stop words
new_sw = ['o','a','e','dele', '',' ']
for word in new_sw:
    nlp.Defaults.stop_words.add(word)
stopwords_set = nlp.Defaults.stop_words

# Second cleaning step
!pip install Unidecode

Collecting Unidecode
Successfully installed Unidecode-1.1.1

import codecs
import unidecode

def pre_process2(text):
    # `decoded` is computed here but not used below: tokenization runs on the original text
    try:
        decoded = unidecode.unidecode(codecs.decode(text, 'unicode_escape'))
    except Exception:
        decoded = unidecode.unidecode(text)
    token = nlp(text)
    final_tokens = []
    for t in token:
        # drop stop words, punctuation, whitespace and number-like tokens
        if t.is_stop or t.is_punct or t.is_space or t.like_num:
            pass
        else:
            if t.lemma_ == '-PRON-':
                final_tokens.append(str(t))
            else:
                # strip accents from the lemma and keep only ASCII characters
                sc_removed = normalize('NFKD', str(t.lemma_)).encode('ASCII', 'ignore').decode('ASCII')
                if len(sc_removed) > 1:
                    final_tokens.append(sc_removed)
    joined = ' '.join(final_tokens)
    # collapse runs of a repeated character to at most two
    spell_corrected = re.sub(r'(.)\1+', r'\1\1', joined)
    return spell_corrected

# Alternative cleaner; it is not called in the code shown here.
# `contraction_mapping` is not defined in this attachment and must be supplied before use.
def spacy_cleaner3(text):
    try:
        decoded = unidecode.unidecode(codecs.decode(text, 'unicode_escape'))
    except Exception:
        decoded = unidecode.unidecode(text)
    apostrophe_handled = re.sub("’", "'", decoded)
    expanded = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in apostrophe_handled.split(" ")])
    parsed = nlp(expanded)
    final_tokens = []
    for t in parsed:
        if t.is_punct or t.is_space or t.like_num or t.like_url:
            pass
        else:
            if t.lemma_ == '-PRON-':
                final_tokens.append(str(t))
            else:
                # keep ASCII letters only
                sc_removed = re.sub("[^a-zA-Z]", '', str(t.lemma_))
                if len(sc_removed) > 1:
                    final_tokens.append(sc_removed)
    joined = ' '.join(final_tokens)
    spell_corrected = re.sub(r'(.)\1+', r'\1\1', joined)
    return spell_corrected
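The accent removal applied to each lemma in `pre_process2` can be examined on its own. A minimal sketch, assuming only the standard library; the helper name `strip_accents` and the word list are illustrative:

```python
from unicodedata import normalize

def strip_accents(word):
    """Decompose accented characters (NFKD) and drop the non-ASCII marks."""
    return normalize("NFKD", word).encode("ascii", "ignore").decode("ascii")

cleaned = [strip_accents(w) for w in ["localização", "café", "manhã"]]
print(cleaned)
```

This is why the cleaned tokens reported later in this attachment appear without diacritics (for example `localizacao`, `cafe` and `manha`).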

# Creating bigrams
def bigramReturner(text):
    token = nltk.word_tokenize(text)
    bigrams = list(ngrams(token,2))
    return bigrams
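For readers without NLTK at hand, the same idea of pairing each token with its successor can be sketched with the standard library. The function `simple_bigrams` and the sample sentence are hypothetical, and whitespace splitting stands in for `nltk.word_tokenize`:

```python
def simple_bigrams(text):
    """Return consecutive token pairs from a whitespace-tokenized string."""
    tokens = text.split()
    # zip pairs token i with token i + 1; the last token has no successor
    return list(zip(tokens, tokens[1:]))

pairs = simple_bigrams("cafe manha excelente")
print(pairs)
```

A text with fewer than two tokens yields an empty list, since there is no pair to form.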

pre_process(data)

bgram(data)

data['clean_text'] = [pre_process2(i) for i in data.text]

data['label'] = df['Label']

data.head(5)

   text                                               clean_text                                         label
0  fomos em família passar oito dias no blue tree...  familia passar dia blue tree tawer confessar f...      1
1  fiz uma reserva para ficar uma semana ao chega...  fazer reservar ficar semana chegar manha pagar...     -1
2  fomos em pessoas e posso afirmar que a pousada...  pessoa afirmar pousar aconchegante espacoso or...      1

EXPLORATORY ANALYSIS

# Function to create a word cloud
def print_wordcloud(data, bg_color):
    # `data` is joined into one string; for a FreqDist this iterates over its keys,
    # so each distinct word enters the text once
    words = ' '.join(data)
    wordcloud = WordCloud(stopwords=stopwords_set,
                          background_color=bg_color,
                          width=3000,
                          height=2000
                          ).generate(words)
    plt.figure(1, figsize=(15, 15))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

# Functions for the bag of words
def get_all_words(text):
    all_words = []
    for words in text:
        all_words.extend(words.split())
    return all_words

def get_bag_of_words(all_words):
    return nltk.FreqDist(all_words)

# All words of the cleaned text
all_words = get_all_words(data["clean_text"])  # choose the column to be analyzed
bag_of_words = get_bag_of_words(all_words)
word_features = bag_of_words.keys()

# Inspecting the word frequencies of the dictionary
bag_of_words.most_common(30)

[('hotel', 1485), ('cafe', 981), ('manha', 939), ('ficar', 703), ('atendimento', 427),
 ('quarto', 407), ('restaurante', 375), ('piscina', 354), ('banheiro', 336), ('funcionario', 330),
 ('cama', 330), ('excelente', 321), ('localizacao', 315), ('dia', 309), ('praia', 274),
 ('luis', 263), ('visto', 257), ('localizar', 253), ('confortavel', 243), ('ter', 235),
 ('centrar', 235), ('opcao', 234), ('haver', 229), ('mar', 228), ('recepcao', 220),
 ('hospedar', 215), ('ar', 211), ('historico', 203), ('recomendar', 201), ('noite', 197)]

# Word cloud of the dictionary
print_wordcloud(bag_of_words, 'black')
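The frequency table above is a plain word count over the cleaned reviews. A minimal sketch of the same computation with `collections.Counter`, assuming whitespace-separated tokens; the function `bag_of_words_counts` and the toy reviews are illustrative and are not the thesis data:

```python
from collections import Counter

def bag_of_words_counts(texts):
    """Count how often each whitespace-separated token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

# Toy cleaned reviews (hypothetical)
toy_reviews = ["hotel cafe manha", "hotel piscina", "cafe manha excelente hotel"]
bow = bag_of_words_counts(toy_reviews)
top = bow.most_common(1)
print(top)
```

`most_common(n)` returns the `n` most frequent tokens with their counts, which is the form of the list shown above.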


SENTIMENT ANALYSIS

from sklearn import naive_bayes

from sklearn import metrics

from sklearn.model_selection import cross_val_predict

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import accuracy_score, confusion_matrix

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import precision_score, recall_score, f1_score


# Applying TF-IDF

tvec = TfidfVectorizer(max_features=3000, ngram_range=(1, 3))
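To clarify what the vectorizer computes, the sketch below reproduces the TF-IDF weighting by hand for unigrams. It assumes the smoothed formula documented for scikit-learn's default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. The function `tfidf` and the toy documents are illustrative, and the sketch ignores the `max_features` and `ngram_range` settings used above.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Toy documents (hypothetical)
rows = tfidf(["hotel cafe bom", "hotel piscina", "cafe ruim"])
print(rows[0])
```

Terms that occur in fewer documents receive a higher weight, so in the first toy document `bom` outweighs `hotel`.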

# Applying the Naive Bayes model

naive = naive_bayes.MultinomialNB()
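The classifier can be illustrated with a small hand-written multinomial Naive Bayes. This is only a sketch under stated assumptions: it works on raw word counts rather than the TF-IDF weights used in this attachment, it uses add-one (Laplace) smoothing with `alpha=1.0`, and the names `train_nb` and `predict_nb` as well as the toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for y in class_docs:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_prior = math.log(class_docs[y] / len(labels))
        log_like = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
        model[y] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    def score(y):
        log_prior, log_like = model[y]
        # words outside the training vocabulary are ignored
        return log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(model, key=score)

# Toy training set (hypothetical): 1 = positive, -1 = negative
docs = ["atendimento excelente", "cafe excelente", "quarto sujo", "banheiro sujo barulho"]
labels = [1, 1, -1, -1]
model = train_nb(docs, labels)
print(predict_nb(model, "quarto excelente"), predict_nb(model, "banheiro sujo"))
```

Smoothing keeps a word that never occurred in one class from forcing that class's probability to zero.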

# Cross-validation function for Naive Bayes
def nb_cv(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    accuracy = []
    precision = []
    recall = []
    f1 = []
    for train, test in kfold.split(X, Y):
        # fit on the training folds and evaluate on the held-out fold
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
        scores = nb_fit.score(X[test],Y[test])
        accuracy.append(scores * 100)
        precision.append(precision_score(Y[test], prediction, average=average_method)*100)
        print(' neg neut pos')
        print('precision:', precision_score(Y[test], prediction, average=None))
        recall.append(recall_score(Y[test], prediction, average=average_method)*100)
        print('recall: ',recall_score(Y[test], prediction, average=None))
        f1.append(f1_score(Y[test], prediction, average=average_method)*100)
        print('f1 score: ',f1_score(Y[test], prediction, average=None))
        print('-'*27)
    # mean and standard deviation of each metric over the folds
    print("accuracy: %.2f%% (+/- %.2f%%)" % (np.mean(accuracy), np.std(accuracy)))
    print("precision: %.2f%% (+/- %.2f%%)" % (np.mean(precision), np.std(precision)))
    print("recall: %.2f%% (+/- %.2f%%)" % (np.mean(recall), np.std(recall)))
    print("f1 score: %.2f%% (+/- %.2f%%)" % (np.mean(f1), np.std(f1)))
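The cross-validation above relies on stratified folds, which keep the class proportions roughly equal in every fold. The following sketch shows one simple way to build such folds; it is not scikit-learn's implementation, and the function `stratified_folds` and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5, seed=777):
    """Assign sample indices to folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels (hypothetical): 10 positive, 5 neutral, 5 negative
labels = [1] * 10 + [0] * 5 + [-1] * 5
folds = stratified_folds(labels)
per_fold = [Counter(labels[i] for i in fold) for fold in folds]
print(per_fold)
```

With an imbalanced label distribution such as the one in this dataset, stratification ensures that every test fold contains examples of the minority class.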

def nb_teste(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    accuracy = []
    precision = []
    recall = []
    f1 = []
    for train, test in kfold.split(X, Y):
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
        # per-class scores, in sorted label order (-1, 0, 1)
        precision.append(precision_score(Y[test], prediction, average=None))
        recall.append(recall_score(Y[test], prediction, average=None))
        f1.append(f1_score(Y[test], prediction, average=None))
    df_precision = pd.DataFrame(precision, columns =['Negative', 'Neutral', 'Positive'])
    df_recal = pd.DataFrame(recall, columns =['Negative', 'Neutral', 'Positive'])
    df_f1 = pd.DataFrame(f1, columns =['Negative', 'Neutral', 'Positive'])
    df2 = pd.concat([df_precision,df_recal,df_f1], axis=0)
    return df2

from sklearn.pipeline import Pipeline

original_pipeline = Pipeline([
    ('vectorizer', tvec),
    ('classifier', naive)
])

nb_cv(5, data['clean_text'], data['label'], original_pipeline, 'macro')
# include labels to continue

 neg neut pos
precision: [1.         0.575      0.74331551]
recall:    [0.24       0.52272727 0.9858156 ]
f1 score:  [0.38709677 0.54761905 0.84756098]

 neg neut pos
precision: [1.         0.4939759  0.72432432]
recall:    [0.2        0.47126437 0.95035461]
f1 score:  [0.33333333 0.48235294 0.82208589]

 neg neut pos
precision: [1.         0.55263158 0.71657754]
recall:    [0.3        0.48275862 0.95035461]
f1 score:  [0.46153846 0.51533742 0.81707317]

 neg neut pos
precision: [0.94444444 0.54166667 0.7287234 ]
recall:    [0.34       0.44827586 0.97163121]
f1 score:  [0.5        0.49056604 0.83282675]

 neg neut pos
precision: [0.9047619  0.59722222 0.74054054]
recall:    [0.38       0.48863636 0.97857143]
f1 score:  [0.53521127 0.5375     0.84307692]

accuracy: 69.37% (+/- 1.72%)
precision: 75.09% (+/- 1.27%)
recall: 58.07% (+/- 2.40%)
f1 score: 59.69% (+/- 2.99%)

from imblearn.pipeline import make_pipeline

from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

ROS_pipeline = make_pipeline(tvec, RandomOverSampler(random_state=777),naive)

SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777),naive)

nb_cv(5, data.clean_text, data.label, ROS_pipeline, 'macro')

accuracy: 75.05% (+/- 2.04%)
precision: 72.19% (+/- 2.50%)
recall: 74.66% (+/- 2.92%)
f1 score: 73.10% (+/- 2.61%)
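Random oversampling balances the classes by repeating randomly chosen minority-class samples until every class is as large as the biggest one. The sketch below illustrates the idea on raw texts; it is not imbalanced-learn's implementation, and the function `random_oversample` and the toy data are hypothetical. In the pipeline used in this attachment the sampler acts on the vectorized training folds rather than on the raw texts.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=777):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for text, y in zip(texts, labels):
        by_class.setdefault(y, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for y, items in by_class.items():
        # draw with replacement from the class's own samples
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([y] * target)
    return out_texts, out_labels

# Toy data (hypothetical): three positive, two neutral, one negative review
texts = ["a", "b", "c", "d", "e", "f"]
labels = [1, 1, 1, 0, 0, -1]
new_texts, new_labels = random_oversample(texts, labels)
balanced = Counter(new_labels)
print(balanced)
```

Because only existing samples are repeated, the method adds no new information; it only changes how much weight each class carries during training.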

nb_cv(5, data.clean_text, data.label, SMOTE_pipeline, 'macro')

accuracy: 74.69% (+/- 2.43%)
precision: 71.69% (+/- 3.00%)
recall: 74.65% (+/- 3.10%)
f1 score: 72.76% (+/- 3.06%)

PLOTS

from sklearn.utils.multiclass import unique_labels

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    `classes` must hold the class names in sorted label order (-1, 0, 1).
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data.
    # Labels are -1, 0 and 1, so shift them to positions 0, 1 and 2 in `classes`.
    classes = classes[unique_labels(y_true, y_pred) + 1]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

np.set_printoptions(precision=2)

# Predictions for the last cross-validation fold
def nb_prediction(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    for train, test in kfold.split(X, Y):
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
    return prediction

# Labels of the last cross-validation fold (same folds, since the seed is fixed)
def nb_Ytest(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    for train, test in kfold.split(X, Y):
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
    return Y[test]

## Plot non-normalized confusion matrix
# Class names in sorted label order (-1, 0, 1)
class_names = np.array(['Negative','Neutral','Positive'])
# Predictions are passed as the first argument, so the rows of the matrix
# (the 'True label' axis) hold predicted classes and the columns hold the fold's labels
plot_confusion_matrix(nb_prediction(5, data.clean_text, data.label, ROS_pipeline, 'macro'),
                      nb_Ytest(5, data.clean_text, data.label, ROS_pipeline, 'macro'), classes=class_names,
                      title='Confusion matrix, without normalization')
plt.show()

Confusion matrix, without normalization
[[ 39  16   2]
 [ 11  57  25]
 [  0  15 113]]
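The summary figures can be recomputed directly from this matrix. The sketch below is an illustration with hypothetical helper names: it derives the overall accuracy and the share of correct cases per row and per column. Which of the two shares is recall and which is precision depends on whether the true labels or the predictions were passed as the first argument of `confusion_matrix`, because the first argument defines the rows.

```python
# Confusion matrix as printed above (classes in the order negative, neutral, positive)
cm = [[39, 16, 2],
      [11, 57, 25],
      [0, 15, 113]]

def accuracy(matrix):
    """Share of all cases that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def diagonal_share_by_row(matrix):
    """Diagonal cell divided by its row total, one value per class."""
    return [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]

def diagonal_share_by_column(matrix):
    """Diagonal cell divided by its column total, one value per class."""
    return [matrix[j][j] / sum(row[j] for row in matrix) for j in range(len(matrix))]

acc = accuracy(cm)
row_share = diagonal_share_by_row(cm)
col_share = diagonal_share_by_column(cm)
print(round(acc, 4), row_share, col_share)
```

The diagonal holds 209 of the 278 cases in this fold, an accuracy of about 75%, in line with the cross-validated accuracy reported for the oversampled pipeline.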

BLUE OCEAN DATA

#dataset for blue ocean

data_bo = data

data_bo['main_topic'] = df1['main_topic']

data_bo.head(5)

   text                               clean_text                          label  main_topic
0  fomos em família passar oito dias  familia passar dia blue tree tawer      1  serviços:atendimento,café, limpeza
