Finding Blue Oceans in Tourism:
Using Text Mining to Identify Business Opportunities in Tourism
Samira dos Santos Nogueira
Dissertation presented as a partial requirement for obtaining
the master’s degree in Statistics and Information
Management.
NOVA Information Management School
Instituto Superior de Estatística e Gestão de Informação
Universidade Nova de Lisboa
Finding Blue Oceans in Tourism:
by
Samira dos Santos Nogueira
Dissertation presented as a partial requirement for obtaining a master’s degree in Statistics and
Information Management, with a specialization in Market Research and CRM.
Advisor / Co Advisor: Diego Costa Pinto, Mauro Castelli
March 2021
DECLARATION OF ORIGINALITY
I declare that the work described in this document is my own and not from someone else. All the
assistance I have received from other people is duly acknowledged and all the sources (published or not
published) are referenced.
This work has not been previously evaluated or submitted to NOVA Information Management
School or elsewhere.
Lisbon, 2 March 2021
Samira dos Santos Nogueira
_______________________________________________________
ACKNOWLEDGEMENTS
THIS WORK IS DEDICATED TO MY WHOLE FAMILY, WHO GAVE ME THE SUPPORT I NEEDED
TO COMPLETE THE MASTER'S DEGREE. TO THE ARAÚJO, GOMES AND NOGUEIRA FAMILIES, THANK YOU VERY MUCH.
I ALSO DEDICATE THIS WORK TO MY FRIENDS, WHO GAVE ME THE EMOTIONAL SUPPORT TO
PURSUE MY DREAM.
A SPECIAL DEDICATION TO RODOLFO SALDANHA, WHO HELPED ME WITH THE DATA SCIENCE PART
OF THE RESEARCH.
THANK YOU ALL VERY MUCH!
ABSTRACT
The amount of data produced and available is bringing innovation to well-known areas. One
of them is tourism, where big data is particularly useful for offering ever more personalized
options to travelers. The main type of data that influences consumers' preferences and decisions is
online reviews posted on specialized websites or social networks, because consumers
tend to take into consideration the opinions and reviews of other travelers before deciding on a
destination or where to stay. In this study, a sentiment analysis of more than 1,300 reviews retrieved
from TripAdvisor shows the main attributes that predict positive and negative online reviews.
Naïve Bayes was used as the classification algorithm and achieved 75% accuracy on the sentiment
analysis. The next step complemented the sentiment analysis by using its results to build a Blue
Ocean-inspired strategy that speaks to practitioners in the tourism and hospitality sector. The
findings indicate that the targeted factors for improvement are developing venues for events,
establishing a feeling of safety for consumers, and fostering brand attachment.
Keywords: data science, sentiment analysis, blue ocean, text mining, tourism
INDEX
1. Introduction
2. Literature review
2.1. Big Data
2.2. Data Science
2.3. Tourism
2.3.1. Smart Cities
2.3.2. Smart Tourism
2.4. Marketing strategy
2.4.1. Innovation
2.4.2. Blue Ocean Strategy
2.5. Previous Work
3. Conceptual model
4. Methodology
4.1. Data Collection
4.2. Machine Learning Approach
4.3. Blue Ocean Approach
5. Findings
5.1. Blue Ocean Findings
6. Discussion
7. Theoretical and methodological contributions
7.1. Theoretical implications
7.2. Managerial implications
7.3. Limitations and future research
8. Bibliography
9. Attachment
LIST OF FIGURES
Figure 1 - Big Data in Tourism Research
Figure 2 - Data Mining tasks
Figure 3 - Factors that influence tourists' experience
Figure 4 - Strategy Canvas
Figure 5 - Four frameworks from Blue Ocean
Figure 6 - New Conceptual Model
Figure 7 - Tourism Factors
Figure 8 - Tourism topics
Figure 9 - Sentiment analysis distribution
Figure 10 - Frequency dictionary of the words
Figure 11 - Bag of words cloud
Figure 12 - First classifier model
Figure 13 - Second classifier model
Figure 14 - Third classifier model
Figure 15 - Confusion Matrix
Figure 16 - Industry Canvas
Figure 17 - Industry x 3 to 1 stars strategy canvas
Figure 18 - Industry x 4 and 5 stars strategy canvas
Figure 19 - 3 to 1 stars category x 4 and 5 stars category strategy canvas
Figure 20 - 4 actions framework based on 3 to 1 stars category
1. INTRODUCTION
According to Statista (2021), the total amount of data created, captured, copied, and consumed
worldwide is forecast to increase rapidly, reaching 59 zettabytes in 2020. The rapid development of
digitalization contributes to the ever-growing global data sphere. Big Data increasingly attracts attention
from different sectors because of its impacts and cultural changes in people’s lives.
One of these areas is tourism, where Big Data techniques can benefit
a sector that, for some cities and regions, drives the local economy (Wood et al., 2013). Big Data
applied to tourism benefits the sector because it gives managers a more data-driven approach
and creates the opportunity to improve customer relationship management, both by attracting
new travellers and retaining existing ones, and to identify points for improvement and existing
issues in the business (Neidhardt, Rümmele, & Werthner, 2017).
In the future, the perspective is that the Tourism and Hospitality area will embrace Big Data and
Big Data Analytics at different levels, speeds and for different purposes. In particular, BD will increasingly
contribute to (a) frame novel research questions and hypotheses if combined with an underpinning
conceptual framework; (b) enrich research designs and methods; (c) improve the generalizability of
research findings across different institutional, economic, social and geographical contexts; (d) generate
relevant managerial insights and business intelligence by means of (digital) data analytics in real-time; and
(e) advance BD technological applications in the verticals of Tourism and Hospitality (Mariani, 2020, p.02).
Online reviews are one of the principal data sources. This type of data can
significantly influence online booking intention (Phillips, Zigan, Silva, & Schegg, 2015; Zhao, Wang,
Guo, & Law, 2015), and more than 60% of travelers use other consumers’ comments as a source of
information when making travel plans (Cró & Martins, 2017; Fang, Ye, Kucukusta, & Law, 2016).
Regarding academic production, a growing number of studies connect data science
and tourism, but more studies are still needed to connect the results to concrete actions for
businesses (Li, Xu, Tang, Wang, & Li, 2018). In this context, deepening the use of data science in tourism
helps improve customer satisfaction by providing the information needed to stand out from the competition
(Amadio & Procaccino, 2016).
Based on this information, this study proposes to connect results from data science techniques to
a marketing strategy, specifically Blue Ocean, to find business opportunities in tourism. This is the first
approach of its type, and one of the goals is to be data-driven from the beginning to the end of the
process. To this end, a text mining technique is used to perform a sentiment analysis of the reviews
posted online on a social media platform (TripAdvisor). In this process, the tourism factors and
the principal subtopics of tourism are identified and, from the results, a Strategy Canvas is created for
the tourism industry and its categories, to understand the value curve and what can be improved. Based on
a quantity criterion, the category chosen for a 4 Actions Framework with marketing strategies was
tourist accommodation rated 3 stars or below, according to the TripAdvisor classification.
The place chosen for the analysis is the city of São Luís, in the State of Maranhão, Brazil.
Brazil, and specifically São Luís, was chosen because the country has many tourist cities. The
methodology developed in this study can be applied in the professional field by tourism and hotel
managers responsible for decision making, helping to improve tourism
in cities such as São Luís.
The first section of this study is a literature review, which presents the concepts
necessary to understand this paper, such as Big Data, Data Science, Data Mining, Text Mining, Tourism,
Smart Cities and Smart Tourism, Marketing Strategy, Innovation, and Blue Ocean Strategy. In the second
section, the conceptual model is presented, followed by a description of the methodology
applied in this paper, divided into Data Collection, Machine Learning Approach, and Blue Ocean Approach.
The subsequent sections are dedicated to the results: the Findings section presents
the results of the data analysis, specifically the sentiment analysis and the Blue Ocean analysis. The next
section discusses the results in more depth and derives marketing strategies from the Blue
Ocean Strategy Canvas, plotted in a Blue Ocean tool, the 4 Actions Framework. Finally, theoretical and
managerial implications are discussed, with a proposal of potential questions for future research.
2. LITERATURE REVIEW
2.1. BIG DATA
Big Data can be understood as referring "to the large amount of data characterized by the
large volume, variety, and speed, requiring new processes to enable better decision making, the discovery
of insight and optimization of processes" (Siddiqa et al., 2016). Although there is no single accepted
definition of Big Data, the best known is the one proposed by Laney (2001), who characterizes big data
by the 3 Vs (Volume, Velocity, Variety).
Big Data matters to business because it allows longitudinal studies, thanks to constant or regular
data capture, easy data storage, and low cost (Xu, 2019), since it makes a large
amount of data available without much effort or human resources.
Big Data can come from many different sources, such as smartphones and IoT devices.
The data sources related to tourism, the area under study,
are divided as shown in Figure 1.
Figure 1 – Big Data in Tourism Research
As seen, the data sources related to tourism can be divided into three main categories: UGC data
(generated by users), including online textual data and online photo data; device data (by devices),
including GPS data, mobile roaming data, Bluetooth data, etc.; and transaction data (by operations), including
web search data, webpage visiting data, and online booking data (J. Li, Xu, Tang, Wang, & Li, 2018). This
research uses UGC data, produced by users in the form of online textual data.
2.2. DATA SCIENCE
To analyze this large amount of data, there is a specific science: data science. According to Provost
and Fawcett (2013, p.03), data science "is a set of fundamental principles that support and guide the
extraction of information and knowledge from data."
One of the main objectives of analyzing big data is to reveal patterns and trends (Chiappa et
al., 2015). Patterns and trends are important information for businesses in all areas, especially
those that seek innovation, because they can reveal new opportunities to explore and grow a business.
Today, because of the variety of technological devices and data sources, there are many types of
data available: voice, text, images, and videos. These different formats pose a new challenge for
analysis. Companies, in addition to dealing with large volumes of data, now need to be
able to handle new data types (Davenport & Dyche, 2013). In other words, different techniques must be
applied depending on the type of data available.
There are many possible techniques for analyzing these data. The most common data mining methods
either describe the data (descriptive techniques, such as cluster analysis) or make predictions
(predictive techniques, such as linear regression). For this research, text mining, specifically sentiment
analysis, is applied to the data.
2.2.1. Data Mining and Machine Learning
Data Mining, in basic terms, is the process of finding useful patterns in data; it is also referred to as
knowledge discovery, machine learning, and predictive analytics (Chauhan & Kaur, 2015). It draws
computational techniques from the disciplines of statistics, artificial intelligence, machine learning,
database theory, and pattern recognition, and uses modeling and algorithms to extract knowledge from
data, preferably from large datasets (Chauhan & Kaur, 2015).
Data mining requires large datasets to find patterns. Much of the process lends itself to
automation, since algorithms can identify these patterns easily, and that is where machine learning
comes in (Bruno, 2019).
Before discussing Machine Learning, it is crucial to understand another concept: modeling. A
model is a “specification of a mathematical (or probabilistic) relationship that exists between different
variables” (Grus, 2015, p. 141).
Machine Learning is used broadly in data science to refer to the techniques applied to
analyze and extract information from data. It is a branch of artificial intelligence that
aims to enable machines to perform their jobs skillfully using intelligent software. Because it uses
sophisticated statistical methods and needs data to learn patterns, it is considered multidisciplinary
(Mohammed, Khan, and Bashier, 2016). In summary, machine learning is valuable for creating and using models
learned from data (Grus, 2015).
There are many kinds of machine learning, and many different algorithms to choose from
depending upon the data available (structured or unstructured), data size, and the goals of the study
(Bruno, 2019).
2.2.2. Types of Data Mining
Data Mining problems can be divided into supervised and unsupervised learning models.
Supervised data mining tries to infer a function or relationship from labeled training data and uses
this function to map new, unlabeled data. It predicts the value of the output
variables from a set of input variables and needs a sufficient number of labeled records to learn the
model from the data. Unsupervised learning, on the other hand, uncovers hidden patterns in unlabeled
data: there are no output variables to predict, and the objective is to find patterns based on the
relationships between the data points themselves. Data Mining techniques can be grouped into classification,
regression, association analysis, anomaly detection, time series, and text mining tasks, as shown in
Figure 2 (Chauhan & Kaur, 2015).
Figure 2 – Data Mining tasks
For predictions, which are related to supervised learning, two forms are commonly applied:
classification and regression. In unsupervised learning, which aims to find interesting and useful
generalities within the data (Bruno, 2019), the most common forms are clustering and association.
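The contrast between the two learning modes can be illustrated with a minimal sketch. It is not part of the study's pipeline: the data values, the labels, and the function names `fit_centroids`, `predict`, and `two_means` are all hypothetical, and a nearest-centroid rule and a one-dimensional two-means split are used only as simple stand-ins for classification and clustering.

```python
# Toy 1-D data (hypothetical values, for illustration only).
labeled = [(2.0, "short"), (3.0, "short"), (10.0, "long"), (12.0, "long")]

def fit_centroids(records):
    """Supervised step: learn one mean value per label from labeled records."""
    sums, counts = {}, {}
    for value, label in records:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, value):
    """Map a new, unlabeled value to the label of the nearest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def two_means(values, iterations=10):
    """Unsupervised step: split unlabeled values into two groups (1-D k-means)."""
    low, high = min(values), max(values)
    group_low, group_high = list(values), []
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_high:  # all values identical: nothing to split
            break
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)

centroids = fit_centroids(labeled)            # uses the labels
predicted = predict(centroids, 9.0)           # label for an unseen value
clusters = two_means([2.0, 3.0, 10.0, 12.0])  # never sees any label
```

The supervised functions need the labels to learn anything, whereas `two_means` groups the same values using only the distances between them.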
For this research, the analysis will use Text Mining techniques, specifically a supervised sentiment
analysis approach, due to the characteristics of the data and the desired output already available.
2.2.3. Text Mining
As a basic concept, text mining “is a data mining application where the input data is text, which
can be in the form of documents, messages, emails, or web pages” (Chauhan & Kaur, 2015, p.9). To
perform data mining on textual data, the text files are converted into document vectors where each
unique word is considered an attribute; what matters is then to reduce these attributes to the important
ones, from which knowledge can be extracted (Chauhan & Kaur, 2015).
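The conversion of texts into document vectors can be sketched as follows. This is a minimal illustration under simple assumptions (whitespace tokenization, raw word counts); the example reviews and the function names `build_vocabulary` and `to_vector` are hypothetical and do not come from the study's data or code.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect every unique lowercase word across all documents, sorted."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(document, vocabulary):
    """Represent one document as word counts, one position per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

reviews = ["great hotel great staff", "bad hotel"]  # hypothetical reviews
vocabulary = build_vocabulary(reviews)
vectors = [to_vector(review, vocabulary) for review in reviews]
```

Each unique word becomes one attribute (one vector position), which is why the vocabulary usually has to be reduced before modeling.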
Text Mining is a valuable technique for organizations since it allows them to understand
everything from consumer opinions to the brand's reputation in an online environment. In
tourism, text mining is even more important since most information in an online environment is conveyed
through text (Nave, Rita, & Guerreiro, 2018).
The concept of Text Mining is closely related to Opinion Mining. Opinion Mining can be
understood as the use of natural language processing to determine whether a piece of
content is positive, negative, or neutral, whereas text mining is used to identify and extract subjective
information from feedback (Afzaal & Usman, 2016).
One popular text mining technique, which is also used in this study, is sentiment
analysis. Sentiment analysis focuses on extracting the relevance of a product’s features based
on the polarity (positive or negative) of the sentiment expressed in consumer review sentences.
Sentiment analysis usually relies on Natural Language Processing (NLP), supervised or unsupervised
learning, and association rules. For sentiment classification, text mining and mutual information are
used (Aciar, 2010).
Sentiment Analysis can be divided into two categories: the first comprises machine learning
methods (such as neural networks) and the second comprises dictionary-based methods, which use
predefined sentiment dictionaries such as WordNet, HowNet, and LIWC that hold terms related to
feelings and their polarity values. The combination of these two categories also shows great potential
for the results (Q. Li, Li, Zhang, Hu, & Hu, 2019).
With a Machine Learning algorithm, the entire sentiment analysis process,
including text classification, is done manually. With a dictionary-based approach, there is already a list of
words that can be associated with each process (property, subjectivity, and sentiment) (Fuchs &
Lexhagen, 2013). Examples of dictionaries are SentiStrength, which calculates the positive and negative
sentiment scores of a short text, with positive sentiment ranging from 1 (not positive) to 5 (extremely
positive) and negative sentiment ranging from -1 (not negative) to -5 (extremely negative); SentiWordNet,
a sentiment lexicon holding a polarity score for opinion words, with approximately 3 million words
including nouns, verbs, adverbs, and adjectives; and Opinion Lexicon, which is one of the oldest dictionaries.
It is also important to mention the lexicon-based approach, which takes the sentiment orientation of a given
text document to be the average of the sentiment orientations of its words and phrases (Ramanathan &
Meyyappan, 2019).
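The averaging idea behind the lexicon-based approach can be sketched in a few lines. The mini-lexicon below is hypothetical and uses a simple -1/+1 scale; it is not SentiStrength, SentiWordNet, or Opinion Lexicon, and the names `lexicon_score` and `polarity` are illustrative only.

```python
# Hypothetical mini-lexicon; real sentiment dictionaries are far larger
# and use their own scoring scales.
LEXICON = {"clean": 1.0, "friendly": 1.0, "dirty": -1.0, "noisy": -1.0}

def lexicon_score(text, lexicon=LEXICON):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    scores = [lexicon[word] for word in text.lower().split() if word in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

def polarity(text, lexicon=LEXICON):
    """Turn the average score into a positive / negative / neutral label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

mixed = lexicon_score("clean room but noisy street")
label = polarity("friendly staff and clean room")
```

Words missing from the lexicon are simply ignored here, which is one reason purely dictionary-based scores can miss domain-specific vocabulary.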
The sentiment analysis technique is often divided into two consecutive steps: (a) detecting which
text segments contain the dimensions and (b) determining the polarity and strength of the sentiment of
each of these dimensions (Pang & Lee, 2004).
Sentiment analysis can also be split into Contextual and Conceptual Semantic Sentiment Analysis.
In the first, sentiment is inferred from the co-occurrence patterns of words: changing the context may
change a word’s sentiment, which depends on the neighboring words. In the second,
sentiment is extracted from external knowledge sources such as ontologies and semantic networks
(Ramanathan & Meyyappan, 2019).
In the tourism area, sentiment analysis helps to capture tourists' feelings and opinions in
real time, so that appropriate measures can be taken in response (Q. Li et al., 2019).
2.2.4. Analyzing textual data
To analyze textual data, it is first necessary to collect the data from the relevant social media (one
of the main sources of this type of data) or from reviews on websites, which can be done with web crawling
technology (Xiang et al., 2015, 2017; Xu et al., 2015). A web crawler, also known as a robot
or spider, downloads web pages, extracts uniform resource locators (URLs) from their
hypertext markup language (HTML), and fetches them (Thelwall, 2001). A crawler is implemented in
this study to extract the data for analysis iteratively.
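The URL-extraction step of a crawler can be sketched with Python's standard library. This is only an illustration of the idea, not the crawler used in this study: the HTML snippet, the `example.com` addresses, and the names `LinkExtractor` and `extract_links` are hypothetical, and the sketch parses a string instead of downloading real pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page address.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return all link targets in the given HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content; a real crawler would download it first.
page = '<a href="/reviews?page=2">Next</a> <a href="https://example.com/hotel">Hotel</a>'
links = extract_links(page, "https://example.com/list")
```

A full crawler would repeat this step for each extracted URL, keeping track of pages already visited so that the same review page is not fetched twice.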
The second step is data preprocessing and pattern discovery. For data preprocessing,
popular operations are data cleaning, tokenization, normalization, word stemming/lemmatization, and
part-of-speech tagging (POST) (J. Li et al., 2018). Data cleaning detects and removes
inaccurate or useless records from online text data, such as misspellings (Xiang et al., 2015), stop words
(Xiang et al., 2015; Xu & Li, 2016; Xu et al., 2015), non-target languages, and low-frequency words (Guo et
al., 2017), leaving the valuable information. Tokenization breaks up the textual data into words,
phrases, or other meaningful elements, called tokens (J. Li et al., 2018). Word stemming/lemmatization
identifies each word's root and treats all words with the same root as one token (Xu & Li, 2016).
POST labels each word in a sentence with a POS tag such as noun, adjective, or
adverb (J. Li et al., 2018). After that, it is necessary to transform the text into vectors, because the vector
representation at the granularity of words, sentences, documents, etc., is the basis for the related machine
learning, and the pre-trained models of these vectors provide the input for other models
(Q. Li et al., 2019).
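Some of the preprocessing operations described above can be sketched as a small pipeline. The stop-word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins for the real resources (full stop-word lists, proper stemmers or lemmatizers) that an actual study would use; the function names are illustrative.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "was", "and", "were", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and reduce the remaining tokens to rough roots."""
    return [naive_stem(token) for token in tokenize(text) if token not in STOP_WORDS]

tokens = preprocess("The rooms were cleaned and the staff was amazing!")
```

The crude rule turns "amazing" into "amaz", which shows why stemming output is meant for matching tokens with a shared root rather than for reading.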
For pattern discovery, the popular techniques applied to online textual data, especially
in tourism research, are latent Dirichlet allocation (LDA), sentiment analysis,
statistical analysis, clustering and categorization, text summarization, and dependency modeling. LDA is a
topic model for identifying the abstract “topics” in textual data; sentiment analysis aims to classify
textual data into sentiment categories (positive, negative, or neutral); statistical analysis, the most basic
technique for analyzing data, encompasses descriptive statistics (e.g., mean and variance), t-tests, correlation
matrices, and others; text summarization automatically produces a summary of one or more
documents, refining the key information from the original texts; and dependency modeling
aims to capture the relationship between textual data (for example, online reviews) and factors (such as
hotel performance) (J. Li et al., 2018).
One of the major steps in text mining, especially sentiment analysis, is topic
classification. In the tourism area, review comments are generally short but
cover several aspects of travel, such as transportation, entertainment, accommodation,
and food, among others. For this reason, text classification in tourism generally involves
extracting topics from the message text so that all the aspects mentioned can be extracted for further
analysis (Q. Li et al., 2019).
Regarding the algorithms used, text classification is traditionally based on machine learning and uses
Naive Bayes, maximum entropy, Support Vector Machines, and the K-nearest neighbor algorithm. Its main
characteristics are the use of keywords or topics that reflect the character of the document and the
automatic classification of the text. Today, the most prominent text extraction techniques are TF-IDF
and information divergence, along with deep learning approaches. Other methods, such as
N-grams, are also used but have disadvantages (such as the loss of text information) (Q. Li et
al., 2019).
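Since Naive Bayes is the algorithm used later in this study, a minimal sketch of a multinomial Naive Bayes text classifier may help fix the idea. It is not the implementation used in the study: the training sentences are hypothetical, add-one (Laplace) smoothing is an assumption made here, and the names `train_naive_bayes` and `classify` are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count words per class and documents per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: share of training documents in this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities (an assumption here).
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled reviews, for illustration only.
training = [
    ("great location friendly staff", "positive"),
    ("clean room great breakfast", "positive"),
    ("dirty room rude staff", "negative"),
    ("noisy street dirty bathroom", "negative"),
]
model = train_naive_bayes(training)
prediction = classify("great clean room", model)
```

The classifier treats each word as independent given the class, which is the "naive" assumption that keeps training down to simple counting.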
Other steps that can be applied are the Bag-of-Words language model, TF-IDF, and word embeddings.
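As an illustration of TF-IDF weighting, the sketch below uses one common unsmoothed variant: term frequency (the word's share of the document) multiplied by the logarithm of the inverse document frequency. Libraries often use smoothed variants, so the exact numbers differ; the example reviews and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dictionary per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

reviews = ["hotel pool great", "hotel noisy", "hotel breakfast great"]  # hypothetical
weights = tf_idf(reviews)
```

A word such as "hotel", which appears in every review, receives a weight of zero, while a word that appears in a single review, such as "noisy", is weighted most heavily.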
As noted, several areas can benefit from the techniques of predictive analytics, especially tourism.
2.3. TOURISM
Tourism is one of the largest industries worldwide, contributing 6 trillion dollars annually
to the global economy with nearly 260 million jobs worldwide, and by 2021 it is expected that 69 million
more jobs are going to be created (Tsiotsou, Mild, & Sudharshan, 2012). Tourism is an industry that
depends heavily on stakeholders because it combines various services such as accommodation,
transportation, dining, recreation, and travel, all of which affect customer satisfaction (Tsiotsou
et al., 2012). Figure 3 shows some of the principal factors that influence the
tourist experience.
Figure 3 – Factors that influence tourist’s experience.
Due to economic development and national income, the demand for tourism has increased
significantly. At the same time, with the popularity of web technologies, travelers use the internet to
look for information before leaving, such as online reviews on travel websites (Mehmood, Ahmad, & Kim,
2019). Reviews that carry a strongly positive or strongly negative message, as well
as the quality of these reviews, have a great influence on the consumer's buying behavior (Dickinger &
Mazanec, 2015). The user-generated content of online reviews is accordingly recognized today
as an important component in the construction of a destination's image (Yeoh, Othman, & Ahmad, 2013).
In tourism, the destination image is one of the main factors that influence travelers when
choosing a destination. The destination image reflects the tourist market, including the national
image, the city image, the scenic spot image, etc. (Q. Li et al., 2019). Destination images should be
promoted in line with reality, and all promotion agents should speak the same language,
because the higher the tourist's expectation, the higher the disappointment
if the destination cannot meet it (Gassiot & Coromina, 2013).
The tourism industry is amongst the most innovative worldwide, given its ability to incorporate
technological and societal advances through new business creation (Hjalager, 2015). Thus, in the context
of the tourist, "the focus is on the management of tourist data and marketing strategy and the ability to
directly reach users through social media allows creating new opportunities for service providers"
(Pantano, Priporas, & Stylos, 2017).
Data collection and analysis are necessary for a country to make public policies and to develop
the global travel and tourism industry (Mehmood et al., 2019), and that is where Smart Cities and Smart
Tourism come in.
2.3.1. Smart Cities
Smart City is a modern concept that emerged from the problems a city encounters due to a
growing influx of citizens into urban areas (Lobao et al., 2019).
In a Smart City, it is crucial to understand that citizens and technologies are the keys for a city to
become smart (Lobao et al., 2019). Put simply, a city is smart when it uses citizens' data to
improve essential services such as transportation.
Although a smart city can improve the services a city provides to citizens and tourists, in real life
most cities are far from this scenario and still in the earliest phase. It is necessary to rethink
governance, design, and creation to foster innovative solutions in which data is the foundation, and
the diverse stakeholders must interconnect for a Smart City to work (Lobao et al., 2019).
If data is the foundation of a Smart City, access to this data should be facilitated, and that is
where open data comes in. Open Data, according to the Open Data Guide, is data that can be freely used,
re-used, and redistributed by anyone, subject only, at most, to the requirement to attribute and share alike
(Dietrich et al., 2012, p.6).
Given this concept, tourism can greatly benefit from free data, now more than ever available
in many forms and platforms. This leads to the concept of Smart Tourism.
2.3.2. Smart Tourism
Big Data and Smart Tourism are deeply connected: the use of big data makes it
possible to offer the right services that suit users' preferences at the right time, mainly through the
adoption of information and communication technology (ICT) (Brennan, Koo, & Bae, 2018). As Chiappa et al.
(2015) state, “With the availability of massive tourists’ data, destinations are expected to offer personalized
services to each different type of tourists to exceed their prior expectation and subsequently enhance
their tourism experience. Presumably, such experience would enrich how tourists value their trip.”
Another concept worth highlighting comes from Buhalis & Amaranggana (2014), who describe the
Smart Tourism Destination as a destination embedded with technology, whose priority is to improve
the tourist's travel experience and whose characteristics are the efficient gathering and distribution
of information, the efficient allocation of tourism resources, and the distribution of the sector's
benefits to the local society.
Bringing smartness into tourism destinations requires dynamically interconnecting stakeholders
through a technological platform on which information relating to tourism activities could be exchanged
instantly. Smart Tourism Destinations should make optimal use of Big Data by offering the right services
that suit users’ preference at the right time (Chiappa et al., 2015).
One of the biggest advantages of smart tourism is the possibility of providing personalized services
at the destination and even offering personalized travel packages. Personalization can be defined as
“the process of collecting and utilizing personal information about the needs and preferences of
customers to create offers and information, which perfectly fits the needs of the customers” (Yang et
al., 2005, as cited in Frank & Harnisch, 2014).
Another advantage that smart tourism can gain from big data is an enhanced tourism experience
through an advanced feedback loop, enhanced access to real-time information, and advanced customer
service through the Internet of Things, addressing factors that potentially shape a negative
experience (Chiappa et al., 2015).
Smart tourism is likely to grow and to affect the sector even more in the coming years. That is why
it is important to connect data science to business strategy and innovation, to see how the results of
data analysis could be implemented in real scenarios.
2.4. MARKETING STRATEGY
Marketing strategy, in general terms, is the “total sum of the integration of segmentation, targeting,
differentiation, and positioning strategies designed to create, communicate, and deliver an offer to a
target market” (El-Ansary, 2006).
A similar concept is that strategic marketing is “an organization’s integrated pattern of decisions that
specify its crucial choices concerning products, markets, marketing activities and marketing resources in
the creation, communication or delivery of products that offer value to customers in exchanges with the
organization and thereby enables the organization to achieve specific objectives” (Varadarajan, 2010, p.
128).
From a classical point of view, Kotler (1991) defined marketing strategy as a plan to achieve the
organization’s objectives by specifying what resources should be allocated to marketing and how these
resources should be used to take advantage of opportunities that are expected to arise in the future.
One important concept related to marketing strategy is market segmentation: the process of dividing a
market into different subsets of consumers with the same needs or characteristics and selecting one or
more segments to target with a distinct marketing mix (Shiffman et al., 2004).
In tourism, destinations also need to deal with competition and therefore need to differentiate
themselves from other destinations. According to Potter (1990), differentiation is the ability to
provide the buyer with exceptional and superior value in terms of quality, special features, or
assistance services. This requires developing competitive intelligence. As Nasri (2011) writes,
competitive intelligence provides information about competitors and their strategies, objectives,
research, strengths, weaknesses, and so on, giving companies the opportunity to understand their
position relative to competitors.
Competitive intelligence helps companies to (1) gain a better understanding of their business
environment and industry; (2) learn about the corporate and business strategies of competitors; (3)
forecast opportunities and threats; (4) anticipate the research and development of competitors’
strategies; (5) validate or deny industry rumors; (6) take effective decisions; and (7) act instead of
reacting (Gémar & Jiménez-Quintero, 2015).
In the tourism industry, marketing strategies have been adopted to respond to current challenges,
achieve a competitive advantage, and increase effectiveness (Tsiotsou et al., 2012). Innovation is
necessary to keep meeting consumers' needs.
2.4.1. Innovation
There are many definitions of innovation. In simple terms, innovation can be understood as “an idea,
practice, or object that is perceived as new by an individual or other unit of adoption” and that helps
to save costs or improve the quality of existing processes (Mohd Zawawi et al., 2016).
A similar definition is that innovation comprises “activities and processes
of creation and implementation of new knowledge to produce distinctive products, services and
processes to meet the customers’ needs and preferences in different ways as well as to make process,
structure, and technology more sophisticated that can bring prosperity among individuals, groups and
into the entire society” (Akram et al., 2011).
In tourism, text mining of online data has great potential to inspire innovations for practitioners
and to transform the industry. For example, applying text mining to tourists' reviews makes it possible
to develop personalized recommendation systems matched to the tourist's profile, which can increase
customer satisfaction (Q. Li et al., 2019).
Another area where innovation in tourism can help is competitive analysis. When based on data and
text mining techniques, a competitive analysis must come from an automated system that relies on text
mining tools to summarize an archive of reviews spanning multiple suppliers and then to identify
relationships (Amadio & Procaccino, 2016).
The objective of innovation is to create value for the business. In a competitive market, innovation
allows businesses to produce distinctive products and services that meet customers' tastes and
preferences, which are increasingly demanding (Akram et al., 2011). Today there are many tools and
methods available for a company or brand to innovate its services or products, such as Design Thinking
and Blue Ocean Strategy.
2.4.2. Blue Ocean Strategy
Blue Ocean Strategy is a bestselling book that has sold more than 4 million copies and been translated
into 46 languages across five continents. Embraced by many organizations, it argues that cutthroat
competition results in nothing but a bloody red ocean of rivals fighting over a shrinking profit pool,
and that to achieve lasting success organizations need to create blue oceans: unexplored new market
spaces ready to grow (Kim & Mauborgne, 2019).
Red Oceans represent all the industries in existence today. This is the known market space. In
contrast, Blue Oceans denote all the industries not in existence today. This is the unknown market
space. Here competition is irrelevant because the rules of the game are waiting to be set (Chan Kim &
Mauborgne, 2015).
One of the main objectives of Blue Ocean is to create value innovation for the business. Value
innovation is created in the region where a company's actions favorably affect both its cost structure
and its value proposition to buyers. Cost savings are made by eliminating and reducing the factors an
industry competes on. Buyer value is lifted by raising and creating elements the industry has never
offered. Over time, costs are reduced further as scale economies kick in due to the high sales volumes
that superior value generates. It is called value innovation because, instead of focusing on beating
the competition, you focus on making the competition irrelevant by creating a leap in value for buyers
and your company, thereby opening new and uncontested market space. In this sense, value innovation is
more than innovation. It is about a strategy that embraces the entire system of a company's activities
(Chan Kim & Mauborgne, 2015).
The book presents a systematic approach, with a range of tools, for each brand to capture and define
its blue ocean. In total, the book presents 14 tools to help an organization stand out from its
competitors. Two of them were selected for this research: the Strategy Canvas and the Four Actions
Framework.
The Strategy Canvas is a central diagnostic tool and an action framework that graphically captures
the current strategic landscape and the prospects for an organization (Kim & Mauborgne, 2019).
Figure 4 – Strategy Canvas
The horizontal axis on the strategy canvas is related to the range of factors that the industry
competes on and invests in, while the vertical axis captures the offering level that clients receive across all
these key competing factors. A value curve or strategic profile is the graphic depiction of a company’s
relative performance across its industry’s factors of competition (Kim & Mauborgne, 2019).
The Strategy Canvas is important because it serves two purposes: (a) it captures the current state of
play in the known market space, which allows users to see the factors that the industry competes on and
invests in, what buyers receive, and what the strategic profiles of the major players are; and (b) it
propels users to action by reorienting their focus from competitors to alternatives and from customers
to noncustomers of the industry, and allows them to visualize how a blue ocean strategic move breaks
away from the existing red ocean reality (Kim & Mauborgne, 2019).
When applying this tool, it is important to analyze the scores of the factors, which in our case are
the tourism factors. A high score means that a company offers buyers more, and hence invests more, in
that factor.
To produce innovation, the Strategy Canvas should yield a value curve. The value curve, the
basic component of the strategy canvas, is a graphic depiction of a company's relative performance across
its industry's factors of competition. As you shift your strategic focus from current competition to
alternatives and noncustomers, you gain insight into how to redefine the problem the industry focuses on
and thereby reconstruct buyer value elements that reside across industry boundaries (Chan Kim &
Mauborgne, 2015).
The Four Actions Framework is used “to reconstruct buyer value elements in
crafting a new value curve or strategic profile. To break the trade-off between differentiation and low cost
in creating a new value curve, the framework poses four key questions, shown in the diagram, to challenge
an industry’s strategic logic” (Kim & Mauborgne, 2019). This tool is important because the grid pushes
companies not only to ask all four questions in the four actions framework but also to act on all four to
create a new value curve (Chan Kim & Mauborgne, 2015).
Figure 5 – Four frameworks from Blue Ocean
Once the sentiment of the reviews collected from TripAdvisor has been classified (positive, negative,
or neutral), a numeric value will be assigned to each category: +1 if the sentence is positive, -1 if
it is negative, and 0 if it is neutral. From the summary of these results, a Strategy Canvas will be
created based on the topics described in the Conceptual Model, followed by a Four Actions Framework
based on the results.
2.5. PREVIOUS WORK
As discussed in “Big data in tourism research: A literature review”, the application of big data in
tourism is still recent (a little more than 10 years old). Of the three data sources indicated in
figure 1, research involving UGC data (data produced by users, such as online textual data and online
photo data) is the predominant type of work, accounting for 47% of the publications to date. The main
subjects addressed were tourist sentiment analysis, tourism marketing, and tourism recommendation (Li
et al., 2018).
Although research using online textual data is advanced, there is still room for improvement. As
noted in the same study, knowledge produced by data science methods such as text mining is lacking in
tourism product design and tourism marketing. In other words, the connection between the results of
online textual data analysis and practical applications still needs to be developed.
In the last three years, more than 40 papers have been produced on text mining in the tourism
industry, especially analyzing online reviews through sentiment analysis, a popular method for
classifying consumer sentiment. This research is one of the first to take the results of sentiment
analysis and combine them with a marketing strategy tool, based mainly on statistical measures such as
the mean.
However, another paper has a proposal similar to the one developed in this work (connecting the
results of text mining with tourism marketing). The paper Competitive Analysis of Online Reviews Using
Exploratory Text Mining, by William Amadio & J. Drew Procaccino, examined the usefulness of analyzing
text-based online reviews using text mining tools and visual analytics for SWOT analysis, as applied to
the hotel industry to develop competitive actions. The findings showed that the hotels selected for the
study competed on almost the same characteristics and that SWOT analysis helped develop strategies for
each one of them.
It is important to highlight that the SWOT analysis in the paper by Amadio & Procaccino was not
itself guided by data, which was one of the motivations for creating a data-driven blue ocean strategy
in this study.
3. CONCEPTUAL MODEL
The tourist destination analyzed is São Luís, in the state of Maranhão, Brazil. This city was chosen
because its economy is influenced in large part by local tourism, and the tourism strategy needs to
improve so that the local economy can benefit from the actions and campaigns implemented by the state
government. Maranhão has the “Observatório do Turismo”, which conducts research and seeks to help the
state, the municipalities, and the hotel sector create strategies and tourism policies appropriate to
each local reality (source: Observatory of the Tourism of Maranhão). However, such research can be
expensive and provides limited spatial and temporal coverage (Wood et al., 2013). The use of text
mining methods on data from social networks therefore aims to make it easier to acquire information
about the area, so that campaigns can be tailored to the reality of the region at a lower cost and with
a greater benefit.
Following Xu (2019), the collected reviews will be divided by hotel segment: budget, midlevel, and
luxury. Budget hotels focus on providing good value for money by offering standardized accommodation,
limited services, and cheaper room rates than upgraded hotels; midlevel hotels are in the midrange of
functionality and price; and luxury hotels focus on providing customers with added pleasure and comfort
through premium products and services (Xu, 2019). In this paper, budget hotels are tourist
accommodations with 1 and 2 stars, midlevel hotels are those with 3 stars, and luxury hotels are those
with 4 and 5 stars.
Hotels will be categorized by their number of stars. This approach is used because hotels with
different star levels charge different prices and offer different levels of quality in the attributes
of their products and services. Although hotels with higher star levels usually offer higher-quality
core attributes and more varied auxiliary attributes, hoteliers should know that this will not
necessarily lead to customer satisfaction, because the higher price raises customer expectations; when
the perceived attributes do not meet those expectations, customers are dissatisfied (Xu, 2019).
To obtain an overall view of the reviews, we first combined all of them into a single text block to
identify the key tourism factors pointed out by the travelers. Then, in terms of hotel rating, we
separated the reviews into two parts, with text blocks for budget/midlevel (three-star and below) and
for luxury (four- and five-star) hotels (H. Li, Ye, & Law, 2013). The data to be gathered are review
content, review date, city, and hotel star rating (H. Li et al., 2013).
After the data are gathered and all the necessary text pre-processing steps applied, sentiment
analysis will be performed on the review content. As already mentioned, a positive result is assigned
the value +1, a negative result the value -1, and a neutral result the value 0.
Because the factors and topics to be analyzed must be better understood in order to create a
strategy, a new conceptual model is proposed. It is based on the tourism factors presented in figure 3,
with key topics that give more guidance to the data-driven approach proposed in this paper.
For example, if a review contains the word “quarto” (room) or “cama” (bed), it belongs to the
facilities topic; if it contains “atendimento” (service) or “café” (coffee), it belongs to the services
topic. Both topics fall under the Accommodation factor. The aim is a well-defined methodology for
classifying the reviews without too much manual work. This study therefore relies on the factors and
the subtopics of each factor to guide the classification of the reviews, as shown in figure 6.
Figure 6 – New Conceptual Model
The keyword topics created from this conceptual model are cleaning, beaches, security, food,
facilities, events, customer service, services, brand, location, cost-benefit, touristic spots,
breakfast, transportation, and restaurants.
In terms of the tourism factors, the topic related to the factor “Expenditures” is brand. The topics
related to the factor “Activities/Satisfaction” are touristic spots, transportation, and restaurants.
The topics related to the factor “Visit” are location, security, and beaches. The remaining topics
belong to the factor “Accommodation”: cleaning, food, facilities, events, customer service, services,
cost-benefit, and breakfast. No comments (positive, negative, or neutral) were found regarding the
factor “Travel,” which is why no keyword was created for this factor.
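The keyword rule described above can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the code used in the study: the topic-to-factor mapping
follows the conceptual model, the keywords “quarto,” “cama,” “atendimento,” and “café” come from the
example in the text, and the remaining keywords, as well as the names classify_review and
factor_dummies, are hypothetical.

```python
import re

# Topic -> factor, following the conceptual model described in the text.
TOPIC_TO_FACTOR = {
    "brand": "Expenditures",
    "touristic spots": "Activities/Satisfaction",
    "transportation": "Activities/Satisfaction",
    "restaurants": "Activities/Satisfaction",
    "location": "Visit",
    "security": "Visit",
    "beaches": "Visit",
    "cleaning": "Accommodation",
    "food": "Accommodation",
    "facilities": "Accommodation",
    "events": "Accommodation",
    "customer service": "Accommodation",
    "services": "Accommodation",
    "cost-benefit": "Accommodation",
    "breakfast": "Accommodation",
}

# Keyword -> topic. The first four pairs come from the text; the last two
# are hypothetical entries added only to make the example richer.
KEYWORD_TO_TOPIC = {
    "quarto": "facilities",
    "cama": "facilities",
    "atendimento": "services",
    "café": "services",
    "praia": "beaches",          # assumed keyword
    "uber": "transportation",    # assumed keyword
}

FACTORS = sorted(set(TOPIC_TO_FACTOR.values()))


def classify_review(text):
    """Return the sorted topics and factors whose keywords occur in the review."""
    tokens = set(re.findall(r"\w+", text.lower()))
    topics = sorted({t for kw, t in KEYWORD_TO_TOPIC.items() if kw in tokens})
    factors = sorted({TOPIC_TO_FACTOR[t] for t in topics})
    return topics, factors


def factor_dummies(text):
    """Encode the factors of a review as binary (dummy) variables."""
    _, factors = classify_review(text)
    return {factor: int(factor in factors) for factor in FACTORS}


topics, factors = classify_review(
    "O quarto era limpo e o atendimento rápido, perto da praia"
)
```

The sketch matches whole tokens only, so a plural such as “quartos” would need its own keyword or a
lemmatization step before matching.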
Regarding their meaning: “brand” refers to investment in the hotel brand as a marketing strategy to
create value for the guest; “touristic spots” covers all the tourist attractions cited in the reviews;
“transportation” refers to the transport available in the city (taxi, bus, cost of Uber, and others);
“restaurants” refers to the restaurants of the city; “location” is the location of the tourist
accommodation; “security” is the sense of security in the area where the accommodation is located;
“beaches” refers to the beaches of the city, since it is a coastal city; “cleaning” refers to the
cleanliness of the rooms and infrastructure; “food” covers the options and quality of the food provided
by the accommodation (lunch, dinner, snacks); “facilities” refers to the whole structure of the hotel;
“events” covers the services and structure provided by the accommodation for hosting external events;
“customer service” is the way employees treat guests; “services” covers all services in general
provided by the accommodation; “cost-benefit” is the cost-benefit of the stay as perceived by the
guest; and “breakfast” refers to the quality of the breakfast provided by the accommodation. It is
important to highlight that “cleaning,” “customer service,” and “breakfast” are all services provided
by the accommodation, but they had to be split into specific topics because some reviews contain
specific comments about them.
4. METHODOLOGY
4.1. DATA COLLECTION
The first step was to obtain the data, which were downloaded by web scraping. A preliminary
inspection identified the variables needed to analyze the dataset; the main variables scraped were the
review, the date, and the location.
The timeframe selected covers all reviews posted in 2018 and 2019. This timeframe is important
because, once we get through this pandemic, we will emerge in a very different world compared to the
one before the outbreak (Donthu & Gustafsson, 2020, p. 284). In addition, only reviews made by
Brazilians and written in Portuguese were selected, because TripAdvisor splits its reviews by language
and translating the reviews could lose a lot of information and bias the result.
By the end of the web scraping, reviews had been downloaded from 54 establishments, including hotels,
inns, and hostels, for a total of 1,392 reviews, which is a small dataset. One dataset was created per
hotel, but all classification and sentiment analyses were run on all the reviews combined in a single
dataset.
Because the dataset was small, the strategy was to (1) classify the reviews based on keywords provided
by the conceptual model and (2) classify the sentiment of the reviews as positive, negative, or
neutral, and then run the sentiment analysis algorithm to confirm whether the classification had been
done properly. This labeling step was important because a supervised machine learning algorithm needs
access to the desired output in advance. The algorithm chosen was Naïve Bayes.
The reviews were classified by factor and main topic. Both were assigned using keywords such as
“location,” “facilities,” “uber,” and “beach,” which can be checked in figure 6. The factors became
binary (dummy) variables, and the main topic was a categorical variable. A review can have more than
one factor and more than one topic, but only one sentiment: positive, negative, or neutral.
4.2. MACHINE LEARNING APPROACH
For the sentiment analysis, the variable “label” was created: “+1” if the review was positive, “-1” if
it was negative, and “0” if it was neutral. These values were important because the average and
distribution of the labels were used in the next step of the analysis of the dataset, with the Blue
Ocean tools.
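As a small illustration of how such a label can be encoded and summarized, the sketch below maps the
three sentiments to +1, 0, and -1 and computes their average and distribution. The sample sentiments
and the names LABELS and summarize are made up for the example and are not taken from the study's
dataset or code.

```python
from collections import Counter
from statistics import mean

# Numeric encoding of the three sentiment classes, as described in the text.
LABELS = {"positive": 1, "neutral": 0, "negative": -1}


def summarize(sentiments):
    """Return the mean label and the share of each label value."""
    labels = [LABELS[s] for s in sentiments]
    counts = Counter(labels)
    distribution = {value: counts[value] / len(labels) for value in (1, 0, -1)}
    return mean(labels), distribution


# Hypothetical sample of five manually classified reviews.
sample = ["positive", "positive", "neutral", "negative", "positive"]
average, distribution = summarize(sample)
```

With this encoding the average lies between -1 and +1, so a value above zero indicates that positive
reviews outnumber negative ones.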
The dataset was cleaned beforehand because it contained stray characters and misspelled words that
could interfere both with the function-based classification of words into topics and factors and with
the pre-processing steps of the sentiment analysis.
Sentiment analysis was performed in Google Colaboratory, for safety reasons. Data analysis and
exploratory analysis were carried out on the whole dataset, especially on the variables for the tourism
factors and the label that classifies each review by sentiment. Because the reviews were in
Portuguese, a Portuguese spaCy model, loaded with spacy.load("pt"), was used to process the words
before the sentiment analysis.
Several data preparation steps were necessary before the sentiment analysis: pre-processing,
handling abbreviations, creating bigrams, adding new stop words, building the bag-of-words, and
computing TF-IDF. The Naive Bayes algorithm was then applied. Naive Bayes was chosen because it is a
popular algorithm for text mining, is easy to apply, and copes well with datasets that are not very
large.
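To make the idea concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier
over a bag-of-words with bigrams, written with the Python standard library only. It is not the
pipeline used in the study: it uses raw counts instead of TF-IDF weights, the stop-word list is a tiny
illustrative subset, and the six training reviews, the class NaiveBayes, and the function tokenize are
assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"o", "a", "e", "de", "do", "da", "um", "uma"}  # illustrative subset


def tokenize(text):
    """Lowercase, keep word tokens, drop stop words, and append bigrams."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes on term counts with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)          # number of reviews per class
        self.term_counts = defaultdict(Counter)    # class -> term -> count
        for text, label in zip(texts, labels):
            self.term_counts[label].update(tokenize(text))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, text):
        terms = [t for t in tokenize(text) if t in self.vocab]  # skip unseen terms
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)               # log prior
            score += sum(math.log((counts[t] + 1) / denom) for t in terms)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, manually labeled reviews (+1 positive, 0 neutral, -1 negative).
train = [
    ("quarto otimo e atendimento excelente", 1),
    ("cafe da manha otimo", 1),
    ("quarto sujo e atendimento pessimo", -1),
    ("hotel ridiculo e caro", -1),
    ("hotel fica no centro", 0),
    ("quarto com cama de casal", 0),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
prediction = model.predict("atendimento otimo")
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the
add-one smoothing keeps a term that is missing from one class from zeroing out that class entirely.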
The measures used to assess the performance of the Naive Bayes models were Accuracy, Precision,
Recall, and F1 score. Recall is the proportion of Real Positive cases that are correctly Predicted
Positive. Precision, conversely, denotes the proportion of Predicted Positive cases that are correctly
Real Positives (Powers, 2011). The relation between the two measures can be understood in the table
below:
Table 1. Systematic and traditional notations in a binary contingency table. Shading indicates
correct (light=green) and incorrect (dark=red) rates or counts in the contingency table.
To understand Accuracy, it is first necessary to understand two other measures: Inverse Recall and
Inverse Precision. Inverse Recall is the proportion of Real Negative cases that are correctly
Predicted Negative. Conversely, Inverse Precision is the proportion of Predicted Negative cases that
are indeed Real Negatives (Powers, 2011). Accuracy thus explicitly takes into account the
classification of negatives and is expressible both as a weighted average of Precision and Inverse
Precision and as a weighted average of Recall and Inverse Recall (Powers, 2011). Finally, the F1 score
plays a similar role to accuracy, as a general measure of the efficiency of the system, but differs in
that it does not take true negatives into account, which could affect statistical results if true
negatives are crucial for the analysis of the results (Powers, 2011).
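The definitions above can be checked numerically on a binary contingency table. The sketch below is
only an illustration: the counts are hypothetical, the function name binary_metrics is an assumption,
and it presumes that every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the measures discussed in the text from contingency-table counts.

    tp/fn are Real Positives predicted positive/negative;
    fp/tn are Real Negatives predicted positive/negative.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)            # Predicted Positives that are Real Positives
    recall = tp / (tp + fn)               # Real Positives that are Predicted Positive
    inverse_precision = tn / (tn + fn)    # Predicted Negatives that are Real Negatives
    inverse_recall = tn / (tn + fp)       # Real Negatives that are Predicted Negative
    return {
        "precision": precision,
        "recall": recall,
        "inverse_precision": inverse_precision,
        "inverse_recall": inverse_recall,
        "accuracy": (tp + tn) / total,
        "f1": 2 * precision * recall / (precision + recall),
        # Accuracy as a weighted average of Recall and Inverse Recall,
        # weighted by the share of Real Positives and Real Negatives.
        "accuracy_from_recalls": (
            recall * (tp + fn) / total + inverse_recall * (tn + fp) / total
        ),
    }


# Hypothetical counts for 100 reviews.
m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

The F1 score never uses tn, which is why it can differ noticeably from accuracy when true negatives
are numerous.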
After the sentiment analysis, the bag-of-words generated was also used in the Blue Ocean step, where
it could be compared with the topic classification and provided in-depth information on which aspects
of the tourism industry the analysis should focus on.
4.3. BLUE OCEAN APPROACH
For the Blue Ocean step, the whole dataset was first analyzed and the average and distribution of the
labels were calculated. The topics to analyze were selected based on how frequently they appeared in
the reviews and then compared with the frequency dictionary and the bag-of-words created in the machine
learning step. For example, the word “café” appeared in the bag-of-words and the frequency dictionary
as one of the most cited words in the reviews. Although the keyword classification places it under
“services,” it was analyzed separately from the other services. Words like “quartos” and “hotel” fell
within the topic “facilities,” so the whole topic was analyzed instead, because in that case many
points regarding facilities were cited.
For decision-making, the average metric was used. If a topic's average was below the overall average
of the industry, the topic is below the level; if it was above the overall average, the topic is above
the level; and if it was approximately equal to the overall average, the topic is at the same level. On
that basis, the Strategy Canvas chart for Blue Ocean was created, to be compared with other segments of
the tourism industry. The same methodology was applied when analyzing the budget/midlevel hotels (1 to
3 stars) and the luxury hotels (4 and 5 stars).
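The decision rule above can be written as a short function. This is a sketch under assumptions: the
text does not say how close “approximately equal” must be, so the tolerance tol is a hypothetical
parameter, and the label lists and the name topic_level are invented for the example.

```python
from statistics import mean


def topic_level(topic_labels, industry_labels, tol=0.05):
    """Compare a topic's mean label with the overall mean of the industry.

    tol is an assumed tolerance for "approximately equal"; it is not a
    value given in the text.
    """
    diff = mean(topic_labels) - mean(industry_labels)
    if abs(diff) <= tol:
        return "same level"
    return "above the level" if diff > 0 else "below the level"


# Hypothetical labels: the industry mean is 0.25.
industry = [1, 1, 0, -1, 1, 0, 1, -1]
levels = {
    "location": topic_level([1, 1, 1, 0], industry),    # topic mean 0.75
    "cleaning": topic_level([-1, 0, -1, 1], industry),  # topic mean -0.25
    "services": topic_level([1, 0, 0, 0], industry),    # topic mean 0.25
}
```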
The other tool used was the Strategy Canvas. Constructing it required comparing the position of the
15 topics in the industry with the position of the same topics in the category whose value curve we
want to observe. The part of the hotel industry chosen for analysis was budget and midlevel
accommodation, classified by TripAdvisor with 3, 2, and 1 stars. This decision was made because more
establishments fall into this category than into the 4- and 5-star category, so the resulting
improvements can be applied in more establishments.
In the Strategy Canvas, the horizontal axis shows the names of the 15 topics, and the vertical axis
shows the average of the labels for each topic, from reviews by tourists who visited the town in 2018
and 2019. The topics were analyzed with the same method that was applied to the whole industry.
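The data behind such a canvas can be sketched as one average label per topic, computed once for the
whole industry and once for the 1- to 3-star segment. The review records below are hypothetical, and
the function name value_curve and the tuple layout (stars, label, topics) are assumptions made for the
example, not the structure of the study's dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical review records: (hotel stars, sentiment label, topics cited).
reviews = [
    (2, 1, ["services", "location"]),
    (3, -1, ["facilities"]),
    (5, 1, ["services", "facilities"]),
    (4, 0, ["location"]),
    (1, -1, ["services"]),
]


def value_curve(records, max_stars=5):
    """Average label per topic over reviews of hotels with at most max_stars."""
    by_topic = defaultdict(list)
    for stars, label, topics in records:
        if stars <= max_stars:
            for topic in topics:
                by_topic[topic].append(label)
    return {topic: mean(labels) for topic, labels in by_topic.items()}


industry_curve = value_curve(reviews)                  # all establishments
budget_mid_curve = value_curve(reviews, max_stars=3)   # 1- to 3-star segment
```

Plotting the two dictionaries over the same topic order, with topics on the horizontal axis and the
average label on the vertical axis, gives the two curves that the canvas compares.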
Based on the results, the topics were placed in the Four Actions Framework using this methodology: if
the topic is below the industry but was present in a large number of reviews, it should be Raised; if
the topic is far below the industry and rarely cited in comparison with the industry, with a threshold
of below 5%, it should be Eliminated; if the topic is rated positively in the industry but did not
appear in the reviews of the 3-, 2-, and 1-star category, or appeared in 5% of the reviews, it should
be Created; and if the topic was rated positively in the industry but negatively in the 3-, 2-, and
1-star category, it should be Reduced.
The results were plotted on the Strategy Canvas against the industry curve, producing a value curve
that can be seen in the next chapter, Findings.
5. FINDINGS
The data analysis showed that most of the reviews fall under the factor Accommodation, present in
1,372 reviews (about 98% of them). This factor includes all the characteristics of the premises
(bedroom, bed, breakfast, etc.), is a major factor for tourists, and is in line with the research in
the area. The factor Visit also has significant importance in the reviews: it was cited in 711 reviews,
mainly because of the topic “location,” whose presence in the bag-of-words shows that it is important
for tourists. Activities and Satisfaction came third, appearing in 396 reviews, mainly through comments
on city transportation (taxi, bus), the price of an Uber to a tourist attraction, restaurants, and
malls near the hotel, inn, or hostel. The other two factors, Travel (12 reviews) and Expenditure (82
reviews), were not statistically significant, so topics like beach and comparison with other cities had
a lower volume than the others.
Figure 7 – Tourism Factors
Regarding the topics, the most cited were services, facilities, location, cost-benefit, and customer
service. The topics services and facilities deserve special attention: they are present in almost all
the reviews, which indicates that these two are important for the guest. This is consistent with
Accommodation being the most cited factor, since both topics are part of it. “Services” was cited in
1,231 reviews (about 88% of the reviews), followed by “Facilities” in 1,093 reviews (79%). “Location”
comes third, present in 674 reviews (48%). “Cost-benefit” is cited in 230 reviews, followed by
“Touristic spots” with 109 reviews. “Breakfast” was specifically cited in 29 reviews, “Restaurants” in
28, “Cleaning” also in 28, “Events” in 23, “Security” in 17, “Food” in 14, and “Beaches” in 13.
Finally, “Brand” was cited in 2 reviews. The topics are shown below, ordered from most to least cited.
Figure 8 - Tourism topics
The sentiment analysis considered the overall sentiment of each review, using the variable “label,”
which holds the positive, negative, and neutral scores presented in the methodology. The percentage
distribution of the labels shows that the majority of the reviews are positive (50.6%), followed by
neutral (31.4%) and, in last place, negative (18%).
Figure 9 – Sentiment analysis distribution
The frequency dictionary of the bag-of-words shows the 30 most cited words in the reviews (the number
30 was chosen arbitrarily). In the dictionary, the top 5 words relate to the facilities topic, in line
with the previous results. Divided by topic, 5 words belong to “facilities” and 5 to “services.” Other
words relate to topics such as “location,” “beach,” “restaurants,” and “touristic points.” The
remaining words are adjectives, and none of them is negative, which may explain why the majority of the
reviews are positive.
Figure 10 – Frequency dictionary of the words
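A frequency dictionary of this kind can be sketched with the standard library alone. The example below is an assumption-laden illustration: `frequency_dictionary` is a hypothetical helper, the reviews are invented, and tokenization and cleaning are assumed to have been done beforehand.

```python
from collections import Counter


def frequency_dictionary(tokenized_reviews, top_n=30):
    """Count every token across all reviews and return the top_n most frequent."""
    counts = Counter()
    for tokens in tokenized_reviews:
        counts.update(tokens)
    return counts.most_common(top_n)


# Hypothetical, already cleaned and tokenized reviews (not the thesis data).
reviews = [
    ["quarto", "otimo", "servico"],
    ["quarto", "preco", "otimo"],
    ["quarto", "funcionario"],
]
print(frequency_dictionary(reviews, top_n=2))
```

The `top_n` parameter plays the role of the arbitrarily chosen 30 in the text.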
A word cloud was also created from the bag-of-words. In the word cloud, the bigger the word,
the more often it was cited in the reviews. Accordingly, “preco,” “area,” “solicito,” “servico,” “funcionario”
and “otimo” are the most cited words in the reviews, all related to the topic service, demonstrating the
importance of this topic for the travel experience of the tourist and matching the results shown in Figure 8.
Unlike the frequency dictionary, the word cloud shows both positive adjectives (like “otima”) and
negative adjectives (like “ridiculo”), demonstrating the plurality of themes that matter to the tourist.
Figure 11 – Bag of words cloud
Regarding the sentiment analysis per se, the Naive Bayes algorithm was applied. The task was
treated as a classification problem with three classes (positive, negative, and neutral).
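To make the three-class setup concrete, the sketch below implements a small multinomial Naive Bayes classifier with Laplace smoothing in pure Python. Several points are assumptions rather than details from this study: the multinomial variant, the smoothing value `alpha`, the function names `train_naive_bayes` and `predict`, and the toy training reviews, which only borrow a few words from the word cloud.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing (alpha)."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    class_counts = Counter(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "log_prior": {c: math.log(n / len(labels)) for c, n in class_counts.items()},
        "word_counts": word_counts,
        "total_words": {c: sum(word_counts[c].values()) for c in class_counts},
    }


def predict(model, tokens):
    """Return the class with the highest log posterior for a tokenized review."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label, log_prior in model["log_prior"].items():
        denom = model["total_words"][label] + model["alpha"] * vocab_size
        score = log_prior
        for tok in tokens:
            if tok in model["vocab"]:  # words never seen in training are skipped
                count = model["word_counts"][label][tok]
                score += math.log((count + model["alpha"]) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy reviews (already tokenized); not the thesis dataset.
train_docs = [
    ["otimo", "servico", "funcionario", "solicito"],
    ["otima", "localizacao", "otimo", "cafe"],
    ["ridiculo", "preco", "quarto", "sujo"],
    ["servico", "ridiculo", "banheiro", "sujo"],
    ["quarto", "normal", "preco", "medio"],
    ["hotel", "normal", "localizacao", "media"],
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, ["funcionario", "otimo"]))
```

Working in log space avoids numerical underflow when reviews are long, and the smoothing term keeps a word that is absent from one class from zeroing out that class entirely.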
Looking at the results of the first classifier, the number of observations in the positive class is
significantly greater than the number of observations in the other classes. This shows that the model is
biased towards positive observations and tends to classify most of the observations into this class.
Regarding the measures, the accuracy of the first model is 69.37%, which is considered meaningful here
because it is above 50%. Precision is 75.09%, recall is 58.07% and the f1-score is 59.69%; the gap between
precision and recall is consistent with the bias towards the positive class, with observations of the other
classes being counted as false negatives.
Figure 12 – First classifier model
The results of the second model can be considered correct and meaningful, because all the
measures (precision, recall, and f1-score) are greater than 0.5 and evenly distributed across the negative,
neutral, and positive classes, with higher values than in the first model. Here the positive class receives
the most observations, followed by the negative and neutral classes. Overall, this model has an accuracy
of 75.05%, a precision of 72.19%, a recall of 74.66% and an f1-score of 73.10%, confirming that it is
meaningful.
Figure 13 – Second classifier model
Lastly, the third model can also be considered correct and meaningful. Here, the measures for the
positive, neutral, and negative classes are also above 0.5 and evenly distributed, with almost the same
values as in the second model. The positive class continues to receive the most observations, followed by
the negative and neutral classes, respectively. Overall, accuracy is 74.69%, precision is 71.69%, recall is
74.65% and the f1-score is 72.76%, confirming that this model is also meaningful.
Figure 14 – Third classifier model
For this step, we can conclude that the classification was correct and meaningful, with a small
bias towards classifying observations as positive, and that the second model achieved the best results of
the three.
Lastly, another measure applied to the model was the confusion matrix. A confusion matrix is a
method for visualizing classification results that reports the labels predicted by the model (prediction)
against the labels previously assigned in the dataset (Konkiewicz, 2019). In other words, it is another way
to view true and false negatives and positives.
True Positives (TP) are the number of predictions where the classifier correctly predicts the
positive class as positive; True Negatives (TN) are the number of predictions where the classifier correctly
predicts the negative class as negative; False Positives (FP) are the number of predictions where the
classifier incorrectly predicts the negative class as positive; and False Negatives (FN) are the number of
predictions where the classifier incorrectly predicts the positive class as negative (Mohajon, 2020). It is
common to find confusion matrices with only two classes, but for this study a confusion matrix with 3
classes was created, one for each sentiment of the reviews (positive, negative, and neutral).
For a confusion matrix with 3 classes, it is important to look at the numbers on the diagonal. In a
perfect model, all the non-zero values would be on the main diagonal of the matrix, since the numbers on
the diagonal show how many reviews the classifier assigned to their correct class. In the confusion matrix
of the model, considering all the observations, the prediction is largely correct, since the diagonal holds
the numbers 39 (correctly predicted as negative), 57 (correctly predicted as positive), and 113 (correctly
predicted as neutral).
Non-zero values outside the main diagonal represent the number of observations for which the
model provided the wrong prediction. In this case, the 16 at the intersection of the first row and the
second column of the matrix indicates that 16 observations were classified as positive by the model, but
their label is negative. Similarly, the 2 in the first row and third column indicates that 2 observations were
classified as neutral by the model, but their correct class is negative. Following this approach, 11
observations were classified as negative by the model but their labels are positive, and the 25 at the
intersection of the second row and third column shows that 25 observations were classified by the model
as neutral but their labels are positive. Regarding neutrals, no observation was classified by the model as
negative, but 15 observations were classified as positive by the model when their labels were neutral.
As a result, the confusion matrix suggests that the model is particularly "good" when dealing
with neutral observations, because the majority of observations labeled as neutral were also classified as
neutral by the model, while it has some difficulty classifying the observations of the remaining classes.
In the end, we can conclude that the sentiment analysis of this model is statistically valuable,
taking into consideration the results of the Naive Bayes classifier with 3 classes and the confusion matrix.
Figure 15 – Confusion Matrix
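The per-class measures can be derived directly from the confusion matrix. The sketch below rebuilds the matrix from the values quoted in the prose, assuming that rows are the dataset labels, columns are the model predictions, and the class order is negative, positive, neutral. This layout is a reading of the text, not a copy of Figure 15, and the helper names are hypothetical.

```python
CLASSES = ["negative", "positive", "neutral"]

# Rows: label in the dataset; columns: label predicted by the model.
# Values as quoted in the text; the layout is an assumption (see lead-in).
CONFUSION = [
    [39, 16, 2],    # labeled negative
    [11, 57, 25],   # labeled positive
    [0, 15, 113],   # labeled neutral
]


def accuracy(matrix):
    """Share of observations on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_metrics(matrix, classes):
    """Precision, recall and F1 for each class of a square confusion matrix."""
    metrics = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


print(f"accuracy: {accuracy(CONFUSION):.4f}")
for name, m in per_class_metrics(CONFUSION, CLASSES).items():
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Under this reading, 209 of the 278 observations lie on the diagonal, an accuracy of roughly 75%, and the neutral class has the highest recall (113 of 128), in line with the observation that the model handles neutral reviews best.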
5.1. BLUE OCEAN FINDINGS
The average sentiment of all reviews was 0.33, so a topic-by-topic analysis was done to check the
position of each topic relative to it. If well below the general average, the topic was considered a negative
factor; if close to the average (within 2 points, that is 0.02, above or below), it was considered an
“average” or “neutral” factor; if above the average, it was considered a positive factor. In addition to the
average, the proportional distribution of the positive, negative, and neutral labels was considered when
analyzing the position of each topic.
That said, when analyzing the entire dataset, 9 topics are positive, 1 is neutral, and 5 are
negative. The positive topics are brand, customer service, breakfast, location, cost-benefit, touristic
spots, transportation, restaurants, and services in general; the only neutral topic is events; finally, the
negative topics are cleaning, food, facilities, security, and beaches.
Of the positive topics, “Brand” had an average of 0.5, “Customer service” 0.37, “Breakfast” 0.62,
“Location” 0.54, “Cost-benefit” 0.55, “Touristic spots” 0.55, “Transportation” 0.73, “Restaurants” 0.86 and
“Services in general” 0.37. The neutral topic, “Events,” averaged 0.35. Among the negative topics,
“Cleaning” had an average of -0.68, “Food” 0.29, “Facilities” 0.30, “Security” 0, and “Beaches” -0.15. The
graph below shows the distribution of averages for the entire dataset.
Figure 16 – Industry Canvas
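The positioning rule based on the averages can be sketched as follows. This is an illustration under assumptions: the “2 points” band is read as 0.02, the function name `classify_topic` is hypothetical, and the sketch covers only the average-based part of the rule, not the additional check on the proportion of labels. The averages are those reported above for the entire dataset.

```python
from collections import Counter

OVERALL_AVERAGE = 0.33

# Topic averages for the entire dataset, as reported in the text.
INDUSTRY_AVERAGES = {
    "brand": 0.5, "customer service": 0.37, "breakfast": 0.62,
    "location": 0.54, "cost-benefit": 0.55, "touristic spots": 0.55,
    "transportation": 0.73, "restaurants": 0.86, "services in general": 0.37,
    "events": 0.35, "cleaning": -0.68, "food": 0.29, "facilities": 0.30,
    "security": 0.0, "beaches": -0.15,
}


def classify_topic(topic_average, overall_average, tolerance=0.02):
    """Position a topic relative to the overall average (tolerance is assumed)."""
    # Rounding guards against floating-point noise at the edge of the band.
    diff = round(topic_average - overall_average, 2)
    if abs(diff) <= tolerance:
        return "neutral"
    return "positive" if diff > 0 else "negative"


positions = {
    topic: classify_topic(avg, OVERALL_AVERAGE)
    for topic, avg in INDUSTRY_AVERAGES.items()
}
print(Counter(positions.values()))
```

With these inputs the rule gives 9 positive topics, 1 neutral topic and 5 negative topics, matching the counts stated in the text.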
The category analyzed here is budget and mid-level touristic accommodation (referred to in this
section as 3 to 1 star). The average of all data in this category was 0.19, well below the general average of
the whole dataset (0.33). In terms of quantity, however, it is the most representative category, with
approximately 65% of the reviews. This is one of the reasons why this category was selected for the Blue
Ocean analysis; the other is that most of the touristic accommodations, almost 90%, belong to this
segment.
Regarding the distribution of topics, 8 topics are positive and 6 are negative; none was classified
as neutral. The positive topics are touristic spots, transportation, services in general, customer service,
cost-benefit, location, breakfast, and restaurants; the negative topics are cleaning, food, beaches, events,
security, and facilities.
Of the positive topics, “Touristic spots” had an average of 0.57, “Transportation” 0.75, “Services”
0.23, “Customer service” 0.23, “Cost-benefit” 0.48, “Location” 0.47, “Breakfast” 0.65 and “Restaurants”
0.85. Of the negative topics, “Cleaning” had an average of -0.77, “Food” 0.14, “Beaches” -0.5, “Events”
-0.57, “Security” -0.06 and “Facilities” 0.15. It is important to note that “Brand” was not mentioned in this
category, so it was not possible to obtain the average for this topic.
Because this category was chosen to create the value curve using the Strategy Canvas and the 4
Actions Framework, the averages obtained were used to produce the following Strategy Canvas:
Figure 17 – Industry x 3 to 1 stars strategy canvas
In the Strategy Canvas, it is possible to notice that most topics are below the industry average,
except for touristic spots, breakfast, and transportation, which are slightly above it, and restaurants,
which is practically level with it (0.85 against 0.86). It is important to note that “cleaning”, “beaches” and
“security”, which were already negative in the analysis of the entire industry, are even more negative in
this category, indicating that they are critical points that urgently need improvement in tourist
accommodations of 3, 2 and 1 stars. Another topic worth mentioning is “events”, which in this category is
the second most negative topic, in sharp contrast to the industry as a whole, where it averaged 0.35. This
result may arise because tourist accommodations of 3 stars or fewer may not have the necessary
structure to hold events in their spaces or because, when there is space and structure, it does not satisfy
the customer. In any case, this topic deserves investigation and a deeper look.
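The comparison read off the Strategy Canvas can also be checked numerically by subtracting the industry average from the category average for each topic. The sketch below uses the averages reported in the text; the function name `canvas_gaps` is hypothetical, and “Services” in the 3 to 1 star category is assumed to correspond to “Services in general” in the industry figures.

```python
# Averages for the entire dataset (industry), as reported in the text.
INDUSTRY = {
    "touristic spots": 0.55, "transportation": 0.73, "services": 0.37,
    "customer service": 0.37, "cost-benefit": 0.55, "location": 0.54,
    "breakfast": 0.62, "restaurants": 0.86, "cleaning": -0.68, "food": 0.29,
    "beaches": -0.15, "events": 0.35, "security": 0.0, "facilities": 0.30,
    "brand": 0.5,
}

# Averages for the 3 to 1 star category ("brand" was not mentioned there).
BUDGET = {
    "touristic spots": 0.57, "transportation": 0.75, "services": 0.23,
    "customer service": 0.23, "cost-benefit": 0.48, "location": 0.47,
    "breakfast": 0.65, "restaurants": 0.85, "cleaning": -0.77, "food": 0.14,
    "beaches": -0.5, "events": -0.57, "security": -0.06, "facilities": 0.15,
}


def canvas_gaps(category, industry):
    """Category average minus industry average for every shared topic."""
    return {
        topic: round(category[topic] - industry[topic], 2)
        for topic in category
        if topic in industry
    }


gaps = canvas_gaps(BUDGET, INDUSTRY)
for topic, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{topic:18s} {gap:+.2f}")
```

Sorting by gap puts “events” first, since it has by far the largest shortfall against the industry, while touristic spots, transportation and breakfast are the only topics with a positive gap and restaurants sits 0.01 below the industry.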
In general, based on review averages, tourists who have stayed in 3-, 2- and 1-star tourist
accommodations consider the cleanliness of the facilities, the pollution of the beaches, and safety around
the accommodation to be critical factors that must be improved; from a Marketing perspective, when
customers write a negative review about an establishment, it is because they want it to be improved.
They also consider the breakfasts served in the tourist accommodations to be satisfactory, as well as the
availability of transport in the vicinity, the quality of the restaurants, and the sights of the city. The other
topics need more attention and investment by the industry.
Based on the Strategy Canvas created, the proposed value curve for this segment runs from the
topic “Cleaning” to “Cost-benefit,” with special attention to the topics “Events” and “Brand,” where the
gap in the value curve was largest.
As noted, 3-, 2- and 1-star tourist accommodations averaged well below the general dataset,
which leads us to look at the luxury hotels. In the analysis below, the same method was applied to the
luxury hotels (referred to in this section as 4- and 5-star hotels) so that the analysis of the topics could be
deepened.
The overall average of the luxury hotels dataset was 0.59, and this category represents
approximately 35% of all reviews of the São Luís do Maranhão hotel sector. This shows that this niche
contributes significantly to the average of all reviews of the hotel sector in the analyzed city.
Regarding the distribution of topics, 7 topics were considered positive, 3 neutral and 5 negative.
The positive topics are security, customer service, location, cost-benefit, transportation, restaurants, and
events; the neutral topics are breakfast, facilities, and services; the negative topics are cleaning, food,
brand, touristic spots, and beaches.
Of the positive topics, “Security” averaged 1, “Customer service” 0.63, “Location” 0.71,
“Cost-benefit” 0.68, “Transportation” 0.71, “Restaurants” 0.87 and “Events” 0.75. Of the neutral topics,
“Breakfast” averaged 0.58, “Facilities” 0.58, and “Services” 0.60. Of the negative topics, “Cleaning” had an
average of -0.20, “Food” 0.43, “Brand” 0.5, “Touristic spots” 0.48 and “Beaches” 0.14. The Strategy Canvas
below compares the distribution of averages across the industry with the 4- and 5-star category:
Figure 18 – Industry x 4 and 5 stars strategy canvas
Compared with the budget and mid-level category, the Strategy Canvas of the two niches is as
follows:
Figure 19 – 3 to 1 stars category x 4 and 5 stars category strategy canvas
Considering the Strategy Canvas created in comparison with the industry, the value curve for the
luxury hotels category should run from “Touristic spots” to “Transportation.” The same value curve can be
seen in the Strategy Canvas created in comparison with the touristic accommodations of 3 to 1 star.
Because the focus of the Blue Ocean strategy is on budget and mid-level hotels, the next section
discusses another Blue Ocean strategy tool, the 4 actions framework, with a proposal of marketing
strategies for this segment.
6. DISCUSSION
Based on the Blue Ocean results, it is possible to notice that some tourism topics are perceived
differently by tourists depending on the category, such as events, brand, tourist spots, breakfast, security,
and transport. This can be explained mainly by the different characteristics of the categories.
“Events,” “security” and “brand” were evaluated positively in 4- and 5-star hotels but negatively
in tourist accommodations from 3 stars down. This can be explained by the fact that 4- and 5-star hotels
have a better-defined structure for offering event spaces and organizing events, which weighs heavily in
the reviews of this category. Another item that draws attention is that “brand” was mentioned only in
the 4- and 5-star category, and in a positive way, which suggests that investment in brand could be better
explored by tourist accommodations from 3 stars down, since “brand” had a positive average when
mentioned. For the topic “security,” the number of mentions differed greatly between the two categories
analyzed: in tourist accommodations from 3 stars down it was mentioned 16 times (in approximately 2%
of reviews), while in the 4- and 5-star category it was mentioned only once, in a positive way. This result
suggests that the feeling of insecurity is more present among tourists who stay in accommodations from
3 stars down than among tourists in luxury hotels.
The topics “tourist spots”, “breakfast” and “transport” are perceived more positively by guests of
tourist accommodations from 3 stars down. For “tourist spots,” this can be associated with the fact that
many hostels, hotels, and inns are in the historic city center, which concentrates a high number of tourist
spots, making this topic more positive than for luxury hotels, which are mostly concentrated in the
coastal part of the city. Regarding “transport,” the availability of transport in the historic center of the city
is also more diverse than in the coastal part, which may have contributed to this topic also being
perceived more positively than in luxury hotels. The other topic that draws attention is “breakfast,” which
is also perceived more positively by guests of tourist accommodations from 3 stars down, indicating that
this is a strong point of this category.
All other topics are seen more positively by guests of luxury hotels, except cleanliness, which is
seen as negative in both niches, and restaurants, which is rated almost identically positive in both.
This demonstrates that the hygiene and cleanliness of the rooms is a critical factor that needs to
be looked at very carefully by the entire hotel sector in the analyzed city. Regarding restaurants, there is
an opportunity to establish partnerships between tourist accommodations and restaurants in their
surroundings, as this topic was evaluated as the most positive among all the tourism topics addressed.
To build the 4 actions framework, the threshold for selecting the topics to be analyzed was set at
below 5% and, depending on whether its average was below or above the general average of the dataset
(0.19), each topic was placed in “Eliminate”, “Create” or “Raise”. The number of times each topic was
mentioned in the reviews was also considered to verify its relevance. The topics selected on these criteria
were restaurants, transport, breakfast, cleaning, beaches, security, food, brand, and events.
Based on what has been developed so far, the proposal for the 4 actions framework for tourist
accommodations from 3 stars down is as follows:
Figure 20 – 4 actions framework based on 3 to 1 stars category
Based on the applied methodology, no topic was found that needed to be reduced below the
industry standard.
The Raise frame of the 4 actions framework contains topics that carry significant weight in the
reviews but are below the average of the tourist industry; investment in these points can be increased,
since in the 4- and 5-star segment these topics were also well evaluated.
Eliminate contains the critical topics that were evaluated as negative in the two segments and in
the industry as a whole: “cleanliness,” “beaches” and “safety”. The purpose of these frames is to
understand what the negative evaluation suggests. In the case of “cleaning,” the negative evaluation
indicates that it is not done properly, so the frame proposes to eliminate “uncleanliness” so that more
efficient cleaning can be achieved. This topic is fundamental because in 2020 the world experienced a
pandemic, which made cleanliness essential for the credibility of a tourist accommodation during that
year. In the case of “beaches,” the pollution of the beaches caused this topic to be rated negatively, so
the proposal is to eliminate “pollution of the beaches”; this is also a very important topic, since most
4- and 5-star hotels are located by the sea. The other topic, safety, reflects the insecurity guests feel
around the location of the hotels. It is important to mention that two of these topics (“beaches” and
“safety”) fall under governmental responsibility, but they also affect the image of the destination, which
is why it is important to emphasize them.
Create contains topics that were rated as very positive but were not often mentioned in the
segment from 3 stars down. Several marketing strategy opportunities arise here, such as partnering with
restaurants and transportation providers. In general, the restaurants in the city and the diversity of
transport (as well as Uber prices) were positively evaluated by tourists; a partnership between tourist
accommodations and these establishments would therefore benefit both, since it would add value to the
tourist experience. Another proposal is to increase the choice of foods and improve the quality of those
offered by tourist accommodations, since this factor is also evaluated as positive in the luxury segment.
Another suggestion is to invest in the brand, namely branding, as a positive brand stays engraved in the
tourist's mind. Another fundamental point is events, which were very well rated in the luxury segment
but were rated negatively and hardly mentioned in the segment from 3 stars down. Many of the tourist
accommodations from 3 stars down do not offer space for events; offering it, where possible, would be a
great opportunity to expand the services of these accommodations.
Regarding the results of the topics, the most critical ones, which are also placed in the Eliminate
frame, are factors external to the accommodations, related to tourism policy and government. It is
important to highlight that these topics should receive enough attention for the city of São Luís to
become more competitive in tourism, because studies have found that tourism competitiveness
influences tourism flow and gross domestic product (GDP) (Hossein, Bazargani, & Kiliç, 2021).
The proposals of the 4 actions framework were created by comparing the value curve of the
industry with that of the budget and mid-level category, and the value curve of this same segment with
that of luxury hotels. The strategies described meet the needs of the industry, since tourism depends
heavily on its stakeholders, and the analysis shows their importance as well as the way they are perceived,
especially restaurants and transport, which were positively evaluated.
7. THEORETICAL AND METHODOLOGICAL CONTRIBUTIONS
7.1. THEORETICAL IMPLICATIONS
This research makes an important theoretical contribution. Although many previous studies
perform sentiment analysis on review data (J. Li et al., 2018), this is the first approach that uses the
results of the sentiment analysis to create a marketing strategy using Blue Ocean tools, all guided by data,
since the proposal of this study is to be data-driven from beginning to end.
Also, this study is an intersection between 3 areas (Marketing Strategy, Tourism, and Data Science),
opening the door for other approaches and methodologies that can help innovate in those 3 areas and
providing a methodology that can be applied by researchers as well as practitioners. Interdisciplinary
research is important for many reasons, one of which is that it reaches a wider audience (Glod, 2016).
Another theoretical contribution comes from the new conceptual model, which deepens the tourism
factors that influence travel through the creation of the topics. Those topics and their importance,
especially those related to facilities and services, corroborate the findings of Filieri, Galati & Raguseo
(2021), who show that different hotel attributes play different roles in predicting the helpfulness of
extremely positive and extremely negative reviews, thus adding knowledge about the importance of
product/service attributes in electronic word-of-mouth. In this study, the results show that the topics
related to facilities (which include room, bed, bathroom, and others) and services were highly cited, and
some of them, like cleaning, were evaluated as extremely negative by the tourists in the reviews.
Another point of congruence is that the research of Filieri, Galati, and Raguseo (2021) shows that the
most important hotel attributes in reviews considered extremely negative include hospitality, bathroom,
room, and price/quality ratio. Consumers consider the hospitality attribute particularly important; it is
relevant and frequently discussed in both EPR (extremely positive rating) and ENR (extremely negative
rating) reviews. Those attributes can be seen in the sentiment analysis step of this study, specifically in
the frequency dictionary and the bag-of-words approach. Hospitality can be related to “funcionarios” and
“atendimento”; bathroom in Brazilian Portuguese is “banheiro”; room is “quarto”; and price/quality ratio
can be related to “preço”. Each of these attributes either belongs to a topic or is itself a topic (like
cost-benefit).
7.2. MANAGERIAL IMPLICATIONS
For managers, this study also makes a major contribution, since the results of the sentiment
analysis are used to create marketing strategies for the categories of interest. This methodology can be
applied to a specific category, as was done in this study, but also to a specific hotel. The only caveat is
that the methodology should be applied by a professional who has a moderate understanding of data
science, especially of programming languages like Python (used in this paper) and R, and an
understanding of what Blue Ocean is and what its tools aim to achieve. In summary, this method should
be applied by a professional who understands at least two of the 3 areas involved (since the tourism
topics were already created), with the decision on the marketing strategy to be applied left to the
manager or marketing manager.
This methodology can also be applied by governmental institutions, specifically those responsible for
managing the tourism of a city. It was found that many tourism topics related to the city, like beaches,
security, restaurants, touristic spots, and transportation (33% of all the topics), were cited, and some of
them were rated extremely negatively (beaches, security). This indicates that government investment in
the city can give a positive return, since a satisfying travel experience involves factors related to the hotel,
such as products and services (Filieri, Galati, & Raguseo, 2021), but also natural and cultural resources
(Hossein, Bazargani, & Kiliç, 2021). So, the government institution can focus on the topics related to the
city in general, also applying the results to a marketing strategy tool like Blue Ocean.
7.3. LIMITATIONS AND FUTURE RESEARCH
The data-driven approach proposed in this paper unifies 3 areas: Tourism, Data Science and
Marketing Strategy. This research results from a master's degree in Statistics and Information
Management, with Data Science as a secondary area, so the author's knowledge of Python was limited.
This made the work harder, because everything based on machine learning in this paper was learned
during the process. As a result, some steps that could have been automated were not.
Another limitation was that the data downloaded from the web pages did not preserve accented
characters, so before applying the data cleaning steps of the sentiment analysis, it was necessary to do a
preliminary cleaning of the words so that they could be recognized in the later steps of text mining.
The major limitation was that, because this methodology is innovative in its own way, most of the
approach was created from scratch based on other papers, and it was necessary to study them in depth
to achieve a satisfactory result and keep the approach data-driven from beginning to end.
For future research, it is important to apply this methodology to the period of the coronavirus
pandemic, starting from 2020, because the topics analyzed could have changed or other topics (like
hygiene) could have emerged, creating a new value curve. The period of this analysis could also be
compared with the period after the pandemic to check whether any factor or topic has changed, given
the unusual times that humanity was facing.
It would also be important to apply this methodology to a larger dataset, because some steps,
like text classification, could then be done automatically by an algorithm, which would be another
contribution to the areas.
Another point for future research is to implement this methodology for specific types of
accommodation (like inns or hostels) to be more precise about what could be improved in each type.
Another suggestion for future research is to try other algorithms when implementing the
sentiment analysis models. This methodology can also be applied by combining data from multiple data
sources, to get a broader view of the topics. The same method can also be applied to different seasons,
since seasonality is one of the most important factors for tourism and can account for a significant
proportion of the variation in tourism demand (Vatsa, 2020). Lastly, this method can be applied
considering the origin of the tourists (domestic or international) to see whether the perception of the
topics changes.
8. BIBLIOGRAPHY
Aciar, S. (2010). Mining Context Information from Consumer ’ s Reviews. 2nd Workshop on Context-Aware
Recommender Systems (CARS-2010).
Akram, K; Siddiqui, S.H; Nawaz M.A.; Ghauri, T.A; Cheema, A. K. . (2011). Role of knowledge management to bring innovation: an integrated approach. International Bulletin of Business Administration. (May 2014). Retrieved from https://www.econ-jobs.com/research/9515-Role-of-Knowledge-Management-to-Bring-Innovation-An-Integrated-Approach.pdf Amadio, W. J., & Procaccino, J. D. (2016). Competitive analysis of online reviews using exploratory text mining. Tourism and Hospitality Management. https://doi.org/10.20867/thm.22.2.3 Afzaal, M., & Usman, M. (2016). A novel framework for aspect-based opinion classification for tourist places. The 10th International Conference on Digital Information Management, ICDIM 2015. https://doi.org/10.1109/ICDIM.2015.7381850 Brennan, B. S., Koo, C., & Bae, K. M. (2018). Smart Tourism: A Study of Mobile Application Use by Tourists Visiting South Korea. Asia-Pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology, 8(10), 1–9. https://doi.org/10.21742/AJMAHS.2018.10.15 Bruno, L. (2019). Introducing machine learning concepts with WEKA. Journal of Chemical Information and Modeling, 53(9), 1689–1699. https://doi.org/10.1017/CBO9781107415324.004. Buhalis, D., & Amaranggana, A. (2014). Smart Tourism Destinations. In Information and Communication Technologies in Tourism (pp. 553–564). https://doi.org/10.1007/978-3-319-03973-2 Cacho. C.L. Philip Chen, C.-Y. Zhang, Data-intensive applications, challenges, techniques, and technologies: A survey on Big Data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.01.015. Chauhan, R., & Kaur, H. (2015). Predictive Analytics and Data Mining. In Business Intelligence. https://doi.org/10.4018/978-1-4666-9562-7.ch019. Chan Kim, W., & Marborgne, R. (2015). Creating Blue Oceans. Engineering and Technology Magazine, 93–94.
Chiappa, G. Del, Zara, A., Murphy, H. C., Dang, Y., Chen, M., Fountoulaki, P., … Jung, T. (2015). Smart Tourism Destinations Enhancing Tourism Experience Through Personalization of Services. (February), 763–774. https://doi.org/10.1007/978-3-319-14343-9. Cró, S., & Martins, A. M. (2017). The importance of security for hostel price premiums: European empirical evidence. Tourism Management, 60, 159-165. Davenport, H.T., & Dyche, J. (2013). Big Data in Big Companies, Retrieved January 5, 2015 from http://www.sas.com/resources /asset/Big-Data-in-Big-Companies.pdf. Dickinger, A., & Mazanec, J. A. (2015). Significant word items in hotel guest reviews: A feature extraction approach. Tourism Recreation Research. https://doi.org/10.1080/02508281.2015.1079964 Dietrich, D., Gray, J., McNamara, T., Poikola, A., Pollock, R., Tait, J., & Zijlstra, T. (2012). What is Open Data?
49
(1.0.0). Open Knowledge Foundation. Retrieved from http://opendatahandbook.org/guide/en/what-is-open-data/. Donthu, N., & Gustafsson, A. (2020). Effects of COVID-19 on business and research. Journal of Business Research, 117(January), 284–289. https://doi.org/10.1016/j.jbusres.2020.06.008 El-Ansary, A. I. (2006). Marketing strategy: Taxonomy and frameworks. European Business Review, 18(4), 266–293. https://doi.org/10.1108/09555340610677499. Fang, B., Ye, Q., Kucukusta, D., & Law, R. (2016). Analysis of the perceived value of online tourism reviews: Influence of readability and reviewer characteristics. Tourism Management, 52,498-506. Feifei Xu, Nicholas Nash & Lorraine Whitmarsh (2019): Big data or small data? A methodological review of sustainable tourism, Journal of Sustainable Tourism, doi: 10.1080/09669582.2019.1631318. Filieri, R., Galati, F., & Raguseo, E. (2021). The impact of service attributes and category on eWOM helpfulness: An investigation of extremely negative and positive ratings using latent semantic analytics and regression analysis. Computers in Human Behavior, 114(February 2020), 106527. https://doi.org/10.1016/j.chb.2020.106527 Fuchs, M., & Lexhagen, M. (2013). Sentiment Analysis Extracting Decision-Relevant Knowledge from UGC. Information and Communication Technologies in Tourism 2014, (January). Gassiot, A., & Coromina, L. (2013). Destination image of Girona: an online text-mining approach. International Journal of Management Cases. Gémar, G., & Jiménez-Quintero, J. A. (2015). Text mining social media for competitive analysis. Tourism & Management Studies. Glod, B. (2016). The 5 Significant Advantages of Interdisciplinary Research No Title. Retrieved December 29, 2021, from https://theihs.org/blog/5-advantages-of-interdisciplinary-research/. Grus, J. (2015). Data science from scratch. Sebastopol, CA: O'Reilly Media. Hossein, R., Bazargani, Z., & Kiliç, H. (2021). 
Tourism competitiveness and tourism sector performance : Empirical insights from new data. Journal of Hospitality and Tourism Management, 46(October 2020), 73–82. https://doi.org/10.1016/j.jhtm.2020.11.011 Hjalager, A. (2013). 100 Innovations That Transformed Tourism. Journal Of Travel Research, 54(1), 3-21. doi: 10.1177/0047287513516390. Kim, C., & Mauborgne, R. (2019). blueoceanstrategy.com. Retrieved December 10, 2019, from https://www.blueoceanstrategy.com/blue-ocean-strategy-book/ Konkiewicz, K. (2019). Reading a confusion matrix. Retrieved from towardsdatascience.com website: https://towardsdatascience.com/reading-a-confusion-matrix-60c4dd232dd4 Kotler, Philip, Andreasen, & R., A. 1991. Strategic marketing for nonprofit organizations (4th ed.). Englewood Cliffs [N.J.]: Prentice-Hall. Laney, D. (2001) 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, 6.
Li, J., Xu, L., Tang, L., Wang, S., & Li, L. (2018). Big data in tourism research: A literature review. Tourism Management, 68, 301–323. https://doi.org/10.1016/j.tourman.2018.03.009 Li, H., Ye, Q., & Law, R. (2013). Determinants of Customer Satisfaction in the Hotel Industry: An Application of Online Review Analysis. Asia Pacific Journal of Tourism Research. https://doi.org/10.1080/10941665.2012.708351 Li, Q., Li, S., Zhang, S., Hu, J., & Hu, J. (2019). A review of text corpus-based tourism big data mining. Applied Sciences (Switzerland). https://doi.org/10.3390/app9163300 Lobao, F., Aparicio, M., & Neto, M. D. C. (2019). SMART TOURISM -CITY TOURISM RADAR : A Tourism Monitoring Tool at the City of Lisbon SMART TOURISM – CITY TOURISM RADAR : A Tourism Monitoring Tool at the City of Lisbon. (October). Mariani, M. (2019). Big Data and analytics in tourism and hospitality: a perspective article. Tourism Review, 75(1), 299–303. https://doi.org/10.1108/TR-06-2019-0259. Mehmood, F., Ahmad, S., & Kim, D. H. (2019). Design and development of a real-time optimal route recommendation system using big data for tourists in Jeju Island. Electronics (Switzerland), Vol. 8. https://doi.org/10.3390/electronics8050506. Mohajon, J. (2020). Confusion Matrix for Your Multi-Class Machine Learning Model. Retrieved from towardsdatascience.com website: https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826 Mohd Zawawi, N. F., Abd Wahab, S., Al-Mamun, A., Sofian Yaacob, A., Kumar AL Samy, N., & Ali Fazal, S. (2016). Defining the Concept of Innovation and Firm Innovativeness: A Critical Analysis from Resorce-Based View Perspective. International Journal of Business and Management, 11(6), 87. https://doi.org/10.5539/ijbm.v11n6p87. Nave, M., Rita, P., & Guerreiro, J. (2018). A decision support system framework to track consumer sentiments in social media. Journal of Hospitality Marketing and Management. 
https://doi.org/10.1080/19368623.2018.1435327.
Neidhardt, J., Rümmele, N., & Werthner, H. (2017). Predicting happiness: user interactions and sentiment analysis in an online travel forum. Information Technology and Tourism. https://doi.org/10.1007/s40558-017-0079-2.
Pantano, E., Priporas, C., & Stylos, N. (2017). ‘You will like it!’ using open data to predict tourists' response to a tourist attraction. Tourism Management, 60, 430-438. doi: 10.1016/j.tourman.2016.12.020.
Phillips, P., Zigan, K., Silva, M. M. S., & Schegg, R. (2015). The interactive effects of online reviews on the determinants of Swiss hotel performance: A neural network analysis. Tourism Management, 50, 130-141.
Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1), 37–63.
Provost, F., & Fawcett, T. (2013). Data science for business: [what you need to know about data mining and data-analytic thinking]. Sebastopol, Calif.: O'Reilly.
Ramanathan, V., & Meyyappan, T. (2019). Twitter text mining for sentiment analysis on people’s feedback about Oman tourism. 2019 4th MEC International Conference on Big Data and Smart City, ICBDSC 2019. https://doi.org/10.1109/ICBDSC.2019.8645596. Rita, P., Rita, N., & Oliveira, C. (2018). Data science for hospitality and tourism. Worldwide Hospitality and Tourism Themes, 10(6), 717-725. https://doi.org/10.1108/WHATT-07-2018-0050. Statista. (2020). Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2024. Retrieved December 29, 2021, from https://www.statista.com/statistics/871513/worldwide-data-created/.
Secretaria de Turismo do Estado do Maranhão (2017, October 30). Observatório do Turismo do Maranhão [Web Page]. Retrieved from https://sites.google.com/view/observatorioturismomaranhao/p%C3%A1gina-inicial
Siddiqa, A., Hashem, I. A. T., Yaqoob, I., Marjani, M., Shamshirband, S., Gani, A., et al. (2016). A survey of big data management: Taxonomy and state-of-the-art. Journal of Network and Computer Applications, 71, 151–166.
Thelwall, M. (2001). A web crawler design for data mining. Journal of Information Science, 27(5), 319–325.
Travel, H., Are, B., & Marketing, C. D. (n.d.). GETTING TO PEAK How Travel Brands Are Making the Climb. 1–26.
Tsiotsou, R. H., Mild, A., & Sudharshan, D. (2012). Tourism Market. (July).
Varadarajan, R. (2010). Strategic marketing and marketing strategy: Domain, definition, fundamental issues and foundational premises. Journal of the Academy of Marketing Science, 38, 119–140.
Vatsa, P. (2020). Annals of Tourism Research Seasonality and cycles in tourism demand — redux. Annals of Tourism Research, (xxxx), 103105. https://doi.org/10.1016/j.annals.2020.103105
Witten, I. H., Frank, E., & Hall, M. a. (2011). Data Mining: Practical Machine Learning Tools and Techniques (Google eBook). In Complementary literature None. Retrieved from http://books.google.com/books?id=bDtLM8CODsQC&pgis=1.
Wood, S. A., Guerry, A. D., Silver, J. M., & Lacayo, M. (2013). Using social media to quantify nature-based tourism and recreation. Scientific Reports, 3. https://doi.org/10.1038/srep02976.
Xiang, Z., Du, Q., Ma, Y., & Fan, W. (2017). A comparative analysis of major online review platforms: Implications for social media analytics in hospitality and tourism. Tourism Management, 58, 51–65.
Xu, H., Yuan, H., Ma, B., & Qian, Y. (2015). Where to go and what to play: Towards summarizing popular information from massive tourism blogs. Journal of Information Science, 41(6), 830-854.
Xu, X., & Li, Y. (2016). The antecedents of customer satisfaction and dissatisfaction toward various types of hotels: A text mining approach. International Journal of Hospitality Management, 55, 57–69.
Xu, X. (2019). Examining the Relevance of Online Customer Textual Reviews on Hotels’ Product and Service Attributes. Journal of Hospitality and Tourism Research. https://doi.org/10.1177/1096348018764573.
Yang, Y., Williams, M. H., MacKinnon, L. M., & Pooley, R. (2005).A service-oriented personalization mechanism in pervasive environments. s.l., IEEE. Zhao, X., Wang, L., Guo, X., & Law, R. (2015). The influence of online reviews to online hotel booking intentions. International Journal of Contemporary Hospitality Management, 27(6), 1343-1364.
9. ATTACHMENT
import numpy as np
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt
%matplotlib inline
!pip install PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
downloaded = drive.CreateFile({'id':"1VbyRP78BGCx1X3DStaPOriMJAAkewoPw"}) # replace the
downloaded.GetContentFile('todo_data.xlsx') # replace the file name with your file
df = pd.read_excel('todo_data.xlsx')
df.head()
   Id  Reviews                                            Label  main_topic                          Expenditures  Tra
0   1  Fomos em família passar oito dias no Blue Tree...      1  serviços:atendimento,café, limpeza  NaN
1   2  Fiz uma reserva para ficar                            -1  serviços:atendimento                NaN
df1=df.fillna(0)
df1.head(5)
   Id  Reviews                                            Label  main_topic                          Expenditures  Tra
0   1  Fomos em família passar oito dias no Blue Tree...      1  serviços:atendimento,café, limpeza  0.0
1   2  Fiz uma reserva para ficar                            -1  serviços:atendimento                0             0
DATA ANALYSIS
# Count the reviews flagged in each category, for analysis and plotting
# Expenditures
expenditures = int((df1['Expenditures'] == 1).sum())
expenditures
# Travel
travel = int((df1['Travel'] == 1).sum())
travel
# Activities/Satisfaction
activities_satisfaction = int((df1['Activities/Satisfaction'] == 1).sum())
activities_satisfaction
# Visit
visit = df1['Visit'].sum()
visit
# Accommodation
accommodation = df1['Accommodation'].sum()
accommodation
categories = [expenditures, travel, activities_satisfaction, visit, accommodation]
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Expenditures','Travel','Activities/Satisfaction','Visit','Accommodation'
sizes = categories
explode = (0.2, 0.2, 0, 0, 0) # offset the first two slices (Expenditures and Travel)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=180,textprops={'fontsize': 12})
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
#Bar chart
plt.figure(figsize=(8,6))
plt.bar(labels, categories,color=['yellow', 'red', 'purple', 'blue', 'green'])
plt.title('Categories')
plt.xlabel('')
plt.ylabel('Values')
plt.show()
GENERAL SENTIMENT ANALYSIS
# Inspecting the data frame
total_base = sum(df['Reviews'].value_counts())
print("Base Size: {0:.0f}".format(total_base))
print("Negative share: {0:.2f}%".format(100 * (df['Label'] == -1).sum() / total_base))
print("Positive share: {0:.2f}%".format(100 * (df['Label'] == 1).sum() / total_base))
print("Neutral share: {0:.2f}%".format(100 * (df['Label'] == 0).sum() / total_base))
Base Size: 1391
Negative share: 17.97%
Positive share: 50.61%
Neutral share: 31.42%
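The class shares printed above can also be computed in a single pass with a counter. The following is a minimal, library-free sketch: the `labels` list is a hypothetical stand-in for the `Label` column, rebuilt from the class counts implied by the reported percentages (250 negative, 437 neutral and 704 positive reviews out of 1391), so it is an assumption and not the original data.

```python
from collections import Counter

# Hypothetical stand-in for df['Label']: counts inferred from the printed shares.
labels = [-1] * 250 + [0] * 437 + [1] * 704

def class_shares(values):
    """Return {class: percentage of all items}, rounded to two decimals."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cls: round(100 * n / total, 2) for cls, n in counts.items()}

shares = class_shares(labels)
print(shares)
```

Computing every share from one counter keeps the three percentages consistent with each other and makes the class imbalance (roughly half of the reviews are positive) easy to see before any model is trained.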
# Pie chart sentiment analysis:
labels = 'Negatives','Positives','Neutral'
sizes = [17.97,50.61,31.42]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, colors=['red','green','blue'], labels=labels, autopct='%1.1f%%',
shadow=True, startangle=180,textprops={'fontsize': 12})
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
PERFORMING SENTIMENT ANALYSIS
1. DIVIDING THE DATASET TO CONTAIN ONLY REVIEWS
features = df1.iloc[:, 1]
features.astype('str')
0 Fomos em família passar oito dias no Blue Tree...
1 Fiz uma reserva para ficar uma semana, ao cheg...
2 Fomos em 05 pessoas e posso afirmar que a pous...
3 - funcionários atenciosos; - café bom; - Basta...
4 Hotel com acesso a minha rota de trabalho. Am...
...
1386 Vista linda, praia calma, único defeito é o ac...
1387 Vista para o mar. área nobre se slz. Um lugar ...
1388 você já deve ter lido bastante sobre o tamanho...
1389 Voltamos de Barreirinhas e ficamos hospedados ...
1390 Vou relatar minha hospedagem no hotel Portas d...
Name: Reviews, Length: 1391, dtype: object
data = pd.DataFrame(features)
data = data.rename(columns = {'Reviews':'text'})
# Store the reviews as strings so that the regex clean-up can process every row
data['text'] = data['text'].astype(str)
#data.reset_index(inplace= True)
data.dtypes
text    object
dtype: object
!python -m spacy download pt
import spacy
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams
from unicodedata import normalize
from wordcloud import WordCloud
nlp = spacy.load("pt")
# Pre-processing
def pre_process(data):
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bn\b', 'nao',x, flags=re.IGNORECASE)) # expand the shorthand "n" to "nao"
    data['text'] = data['text'].apply(lambda x: re.sub(r'(\w)(\1{2,})', r'\1',x)) # 3: collapse a character repeated three or more times
    data['text'] = data['text'].apply(lambda x: x.lower()) # 5: lowercase
    data['text'] = data['text'].apply(lambda x: re.sub(r'[\W*]+', ' ',x)) # 6: replace non-word characters with a space
    data['text'] = data['text'].apply(lambda x: re.sub(r'[0-9]', '',x)) # 7: remove digits
    data['text'] = data['text'].apply(lambda x: re.sub(r'\b \b', ' ',x)) # 9
    # Basic abbreviations
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bpq\b', 'porque',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bvc\b', 'você',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bvcs\b', 'você',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\btb\b', 'também',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\btbm\b', 'também',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bpra\b', 'para',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bsr\b', 'senhor',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bta\b', 'está',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'\bq\b', 'que',x))
    return data
# Join frequent two-word expressions with an underscore so they are kept as one token
def bgram(data):
    data['text'] = data['text'].apply(lambda x: re.sub(r'boa localização', 'boa_localização',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'localização privilegiada', 'localização_privilegiada',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'atendimento bom', 'atendimento_bom',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'café bom', 'café_bom',x))
    data['text'] = data['text'].apply(lambda x: re.sub(r'excelente atendimento', 'excelente_atendimento',x))
    return data
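The regex clean-up performed by `pre_process` can be tried on a single string without a DataFrame. The sketch below is illustrative only: `normalize_review`, the `ABBREVIATIONS` table (a subset of the substitutions listed above) and the sample sentence are hypothetical names and data introduced here, not part of the original notebook.

```python
import re

# Hypothetical abbreviation table mirroring part of the substitutions listed above.
ABBREVIATIONS = {"pq": "porque", "vc": "você", "tb": "também", "pra": "para", "q": "que"}

def normalize_review(text):
    """Apply the same kind of regex clean-up to one review string."""
    text = re.sub(r"\bn\b", "nao", text, flags=re.IGNORECASE)  # shorthand "n" -> "nao"
    text = re.sub(r"(\w)\1{2,}", r"\1", text)                  # "OOOO" -> "O"
    text = text.lower()
    text = re.sub(r"\W+", " ", text)                           # punctuation -> single space
    text = re.sub(r"[0-9]", "", text)                          # drop digits
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{short}\b", full, text)             # expand whole-word abbreviations
    return text.strip()

example = normalize_review("Hotel MUITOOOO bom, vc n vai se arrepender!!!")
print(example)  # hotel muito bom você nao vai se arrepender
```

Keeping the abbreviations in a dictionary means new shorthand forms can be added without writing another `re.sub` line for each of them.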
# Adding new stop words
new_sw = ['o','a','e','dele', '',' ']
for word in new_sw:
    nlp.Defaults.stop_words.add(word)
stopwords_set = nlp.Defaults.stop_words
# Second cleaning stage
!pip install Unidecode
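import codecs
import unidecode
def pre_process2(text):
    try:
        decoded = unidecode.unidecode(codecs.decode(text, 'unicode_escape'))
    except Exception:
        decoded = unidecode.unidecode(text)
    # Note: `decoded` is not used below; spaCy parses the original text
    token = nlp(text)
    final_tokens = []
    for t in token:
        # Skip stop words, punctuation, whitespace and number-like tokens
        if t.is_stop or t.is_punct or t.is_space or t.like_num:
            pass
        else:
            if t.lemma_ == '-PRON-':
                final_tokens.append(str(t))
            else:
                # Strip accents from the lemma, keeping ASCII characters only
                sc_removed = normalize('NFKD', str(t.lemma_)).encode('ASCII', 'ignore').decode('ASCII')
                if len(sc_removed) > 1:
                    final_tokens.append(sc_removed)
    joined = ' '.join(final_tokens)
    # Reduce any run of a repeated character to exactly two occurrences
    spell_corrected = re.sub(r'(.)\1+', r'\1\1', joined)
    return spell_corrected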
# Alternative cleaner; it is not called in the code shown here and relies on a
# `contraction_mapping` dictionary that is not defined in it
def spacy_cleaner3(text):
    try:
        decoded = unidecode.unidecode(codecs.decode(text, 'unicode_escape'))
    except Exception:
        decoded = unidecode.unidecode(text)
    apostrophe_handled = re.sub("’", "'", decoded)
    expanded = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in apostrophe_handled.split(" ")])
    parsed = nlp(expanded)
    final_tokens = []
    for t in parsed:
        if t.is_punct or t.is_space or t.like_num or t.like_url:
            pass
        else:
            if t.lemma_ == '-PRON-':
                final_tokens.append(str(t))
            else:
                sc_removed = re.sub("[^a-zA-Z]", '', str(t.lemma_))
                if len(sc_removed) > 1:
                    final_tokens.append(sc_removed)
    joined = ' '.join(final_tokens)
    spell_corrected = re.sub(r'(.)\1+', r'\1\1', joined)
    return spell_corrected
# Creating bigrams
def bigramReturner(text):
    token = nltk.word_tokenize(text)
    bigrams = list(ngrams(token,2))
    return bigrams
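`bigramReturner` relies on NLTK for both tokenization and n-gram construction. As a dependency-free illustration of what a bigram list looks like, the sketch below pairs each token with its successor. It assumes plain whitespace tokenization, which is simpler than NLTK's tokenizer, and the `bigrams` helper and the sample phrase are hypothetical.

```python
def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))

pairs = bigrams("cafe da manha excelente".split())
print(pairs)  # [('cafe', 'da'), ('da', 'manha'), ('manha', 'excelente')]
```

A text with n tokens yields n - 1 bigrams, so single-word reviews produce an empty list.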
pre_process(data)
bgram(data)
data['clean_text'] = [pre_process2(i) for i in data.text]
data['label'] = df['Label']
data.head(5)
   text                                               clean_text                                         label
0  fomos em família passar oito dias no blue tree...  familia passar dia blue tree tawer confessar f...      1
1  fiz uma reserva para ficar uma semana ao chega...  fazer reservar ficar semana chegar manha pagar...     -1
2  fomos em pessoas e posso afirmar que a pousada...  pessoa afirmar pousar aconchegante espacoso or...      1
EXPLORATORY ANALYSIS
# Function to create a word cloud
def print_wordcloud(data, bg_color):
    words = ' '.join(data)
    wordcloud = WordCloud(stopwords=stopwords_set,
                          background_color=bg_color,
                          width=3000,
                          height=2000
                          ).generate(words)
    plt.figure(1, figsize=(15, 15))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
# Bag-of-words functions
def get_all_words(text):
    all_words = []
    for words in text:
        all_words.extend(words.split())
    return all_words
def get_bag_of_words(all_words):
    return nltk.FreqDist(all_words)
# All words in the cleaned text
all_words = get_all_words(data["clean_text"]) # Choose the column to analyse
bag_of_words = get_bag_of_words(all_words)
word_features = bag_of_words.keys()
# Analysing the word frequencies in the dictionary
bag_of_words.most_common(30)
[('hotel', 1485), ('cafe', 981), ('manha', 939), ('ficar', 703), ('atendimento', 427), ('quarto', 407), ('restaurante', 375), ('piscina', 354), ('banheiro', 336), ('funcionario', 330), ('cama', 330), ('excelente', 321), ('localizacao', 315), ('dia', 309), ('praia', 274), ('luis', 263), ('visto', 257), ('localizar', 253), ('confortavel', 243), ('ter', 235),
('centrar', 235), ('opcao', 234), ('haver', 229), ('mar', 228), ('recepcao', 220), ('hospedar', 215), ('ar', 211), ('historico', 203), ('recomendar', 201), ('noite', 197)]
# Word cloud of the dictionary
print_wordcloud(bag_of_words, 'black')
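The bag-of-words step above boils down to counting tokens across all cleaned reviews. A minimal sketch with the standard library follows. The `clean_reviews` list is hypothetical sample data standing in for `data["clean_text"]`, and `bag_of_words` is an illustrative helper, not the notebook's function.

```python
from collections import Counter

# Hypothetical cleaned reviews, standing in for data["clean_text"].
clean_reviews = ["hotel cafe manha excelente", "hotel piscina cafe", "hotel praia"]

def bag_of_words(texts):
    """Count how often each whitespace-separated token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

bow = bag_of_words(clean_reviews)
top = bow.most_common(2)
print(top)  # [('hotel', 3), ('cafe', 2)]
```

Note that joining the keys of a frequency dictionary, as `print_wordcloud` does when it receives `bag_of_words`, passes each distinct word only once. If the cloud is meant to reflect the counts themselves, the `wordcloud` package offers `WordCloud.generate_from_frequencies`, which accepts a word-to-frequency mapping; whether that matches the intended figure is a judgment for the author.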
SENTIMENT ANALYSIS
from sklearn import naive_bayes
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score
# Applying TF-IDF
tvec = TfidfVectorizer(max_features=3000, ngram_range=(1, 3))
# Applying the Naive Bayes model
naive = naive_bayes.MultinomialNB()
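To make the TF-IDF weighting concrete, the sketch below computes it by hand for three short documents. It follows what is, to my understanding, scikit-learn's default scheme (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization of each row), but it is restricted to unigrams and has no feature cap, whereas the notebook uses `ngram_range=(1, 3)` and `max_features=3000`. The `tfidf` function and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Unigram TF-IDF with smoothed idf and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(["hotel cafe bom", "hotel piscina", "hotel praia linda"])
print(rows[0])
```

Because "hotel" appears in every document, its idf is the minimum value of 1, so it receives a lower weight than rarer terms such as "cafe". This is the reason TF-IDF is preferred over raw counts when a few words dominate the corpus.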
# Cross-validation function for the Naive Bayes model
def nb_cv(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    accuracy = []
    precision = []
    recall = []
    f1 = []
    for train, test in kfold.split(X, Y):
        # Fit on the training fold and evaluate on the held-out fold
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
        scores = nb_fit.score(X[test],Y[test])
        accuracy.append(scores * 100)
        precision.append(precision_score(Y[test], prediction, average=average_method)*100)
        print(' neg neut pos')
        print('precision:', precision_score(Y[test], prediction, average=None))
        recall.append(recall_score(Y[test], prediction, average=average_method)*100)
        print('recall: ',recall_score(Y[test], prediction, average=None))
        f1.append(f1_score(Y[test], prediction, average=average_method)*100)
        print('f1 score: ',f1_score(Y[test], prediction, average=None))
        print('-'*27)
    # Mean and standard deviation of each metric across the folds
    print("accuracy: %.2f%% (+/- %.2f%%)" % (np.mean(accuracy), np.std(accuracy)))
    print("precision: %.2f%% (+/- %.2f%%)" % (np.mean(precision), np.std(precision)))
    print("recall: %.2f%% (+/- %.2f%%)" % (np.mean(recall), np.std(recall)))
    print("f1 score: %.2f%% (+/- %.2f%%)" % (np.mean(f1), np.std(f1)))
# Collect the per-class precision, recall and F1 of every fold in one data frame
def nb_teste(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    accuracy = []
    precision = []
    recall = []
    f1 = []
    for train, test in kfold.split(X, Y):
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
        scores = nb_fit.score(X[test],Y[test])
        precision.append(precision_score(Y[test], prediction, average=None))
        recall.append(recall_score(Y[test], prediction, average=None))
        f1.append(f1_score(Y[test], prediction, average=None))
    df_precision = pd.DataFrame(precision, columns =['Negative', 'Neutral', 'Positive'])
    df_recal = pd.DataFrame(recall, columns =['Negative', 'Neutral', 'Positive'])
    df_f1 = pd.DataFrame(f1, columns =['Negative', 'Neutral', 'Positive'])
    df2 = pd.concat([df_precision,df_recal,df_f1], axis=0)
    return df2
from sklearn.pipeline import Pipeline
original_pipeline = Pipeline([
('vectorizer', tvec),
('classifier', naive)
])
nb_cv(5, data['clean_text'], data['label'], original_pipeline, 'macro')
# include labels to continue
neg neut pos
precision: [1. 0.575 0.74331551]
recall: [0.24 0.52272727 0.9858156 ]
f1 score: [0.38709677 0.54761905 0.84756098]
neg neut pos
precision: [1. 0.4939759 0.72432432]
recall: [0.2 0.47126437 0.95035461]
f1 score: [0.33333333 0.48235294 0.82208589]
neg neut pos
precision: [1. 0.55263158 0.71657754]
recall: [0.3 0.48275862 0.95035461]
f1 score: [0.46153846 0.51533742 0.81707317]
neg neut pos
precision: [0.94444444 0.54166667 0.7287234 ]
recall: [0.34 0.44827586 0.97163121]
f1 score: [0.5 0.49056604 0.83282675]
neg neut pos
precision: [0.9047619 0.59722222 0.74054054]
recall: [0.38 0.48863636 0.97857143]
f1 score: [0.53521127 0.5375 0.84307692]
accuracy: 69.37% (+/- 1.72%)
precision: 75.09% (+/- 1.27%)
recall: 58.07% (+/- 2.40%)
f1 score: 59.69% (+/- 2.99%)
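The summary figures above are macro averages: each class contributes equally, regardless of how many reviews it has. As a small worked example, the sketch below averages the per-class precision and recall reported for the first fold. The `macro_average` helper is hypothetical; the numbers are copied from the output above.

```python
def macro_average(per_class_scores):
    """Unweighted mean of the per-class scores (every class counts equally)."""
    return sum(per_class_scores) / len(per_class_scores)

# Per-class values (negative, neutral, positive) reported for the first fold.
fold1_precision = [1.0, 0.575, 0.74331551]
fold1_recall = [0.24, 0.52272727, 0.9858156]

macro_precision = macro_average(fold1_precision)
macro_recall = macro_average(fold1_recall)
print(round(macro_precision, 4), round(macro_recall, 4))  # 0.7728 0.5828
```

The low macro recall comes almost entirely from the negative class (0.24), which shows how the unbalanced training data leads the baseline model to miss most negative reviews even though its accuracy looks acceptable.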
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
ROS_pipeline = make_pipeline(tvec, RandomOverSampler(random_state=777),naive)
SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777),naive)
nb_cv(5, data.clean_text, data.label, ROS_pipeline, 'macro')
neg neut pos
precision: [0.72881356 0.64516129 0.88976378]
recall: [0.86 0.68181818 0.80141844]
f1 score: [0.78899083 0.66298343 0.84328358]
neg neut pos
precision: [0.63492063 0.60714286 0.87022901]
recall: [0.8 0.5862069 0.80851064]
f1 score: [0.7079646 0.59649123 0.83823529]
neg neut pos
precision: [0.63793103 0.5875 0.82857143]
recall: [0.74 0.54022989 0.82269504]
f1 score: [0.68518519 0.56287425 0.82562278]
neg neut pos
precision: [0.67164179 0.68 0.86764706]
recall: [0.9 0.5862069 0.83687943]
f1 score: [0.76923077 0.62962963 0.85198556]
neg neut pos
precision: [0.68421053 0.61290323 0.8828125 ]
recall: [0.78 0.64772727 0.80714286]
f1 score: [0.72897196 0.62983425 0.84328358]
accuracy: 75.05% (+/- 2.04%)
precision: 72.19% (+/- 2.50%)
recall: 74.66% (+/- 2.92%)
f1 score: 73.10% (+/- 2.61%)
nb_cv(5, data.clean_text, data.label, SMOTE_pipeline, 'macro')
neg neut pos
precision: [0.72881356 0.64772727 0.87878788]
recall: [0.86 0.64772727 0.82269504]
f1 score: [0.78899083 0.64772727 0.84981685]
neg neut pos
precision: [0.62121212 0.57692308 0.85074627]
recall: [0.82 0.51724138 0.80851064]
f1 score: [0.70689655 0.54545455 0.82909091]
neg neut pos
precision: [0.61290323 0.57317073 0.85074627]
recall: [0.76 0.54022989 0.80851064]
f1 score: [0.67857143 0.55621302 0.82909091]
neg neut pos
precision: [0.68181818 0.66666667 0.85820896]
recall: [0.9 0.59770115 0.81560284]
f1 score: [0.77586207 0.63030303 0.83636364]
neg neut pos
precision: [0.68253968 0.625 0.8976378 ]
recall: [0.86 0.625 0.81428571]
f1 score: [0.76106195 0.625 0.85393258]
accuracy: 74.69% (+/- 2.43%)
precision: 71.69% (+/- 3.00%)
recall: 74.65% (+/- 3.10%)
f1 score: 72.76% (+/- 3.06%)
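Both oversampling pipelines raise the recall of the minority classes by balancing the training data. The idea behind random oversampling can be sketched without any library: minority-class items are drawn again, with replacement, until every class is as large as the biggest one. The function and the toy data below are hypothetical and only illustrate the principle; they are not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=777):
    """Duplicate randomly chosen minority-class items until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels

texts = ["ruim", "ok", "normal", "bom", "otimo", "excelente", "lindo"]
labels = [-1, 0, 0, 1, 1, 1, 1]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))
```

Resampling should touch only the training portion of each fold, otherwise duplicated reviews leak into the evaluation. As far as I understand, building the pipeline with imbalanced-learn's `make_pipeline`, as done above, has this effect because samplers are applied during fitting and skipped at prediction time.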
PLOTS
from sklearn.utils.multiclass import unique_labels
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    # (`classes` is indexed by label value, so classes[label] must be that label's name)
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
np.set_printoptions(precision=2)
# Re-run the cross-validation and return the predictions of the last fold
def nb_prediction(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    for train, test in kfold.split(X, Y):
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
        scores = nb_fit.score(X[test],Y[test])
    return prediction
# Re-run the cross-validation and return the true labels of the last fold
def nb_Ytest(splits, X, Y, pipeline, average_method):
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    for train, test in kfold.split(X, Y):
        nb_fit = pipeline.fit(X[train], Y[train])
        prediction = nb_fit.predict(X[test])
        scores = nb_fit.score(X[test],Y[test])
    return Y[test]
## Plot non-normalized confusion matrix
# plot_confusion_matrix indexes `classes` by label value, so position 0 must be Neutral,
# position 1 Positive and position -1 (the last one) Negative
class_names = np.array(['Neutral','Positive','Negative'])
# True labels are passed first and predictions second, matching the axis titles
plot_confusion_matrix(nb_Ytest(5, data.clean_text, data.label, ROS_pipeline, 'macro'),
                      nb_prediction(5, data.clean_text, data.label, ROS_pipeline, 'macro'), classes=class_names,
                      title='Confusion matrix, without normalization')
plt.show()
Confusion matrix, without normalization
[[ 39  11   0]
 [ 16  57  15]
 [  2  25 113]]
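Per-class precision and recall can be read directly off a confusion matrix. The sketch below uses the cell counts reported above, arranged here under the assumption that rows are true classes and columns are predicted classes, in the order negative, neutral, positive. The `per_class_metrics` helper is hypothetical.

```python
# Rows: true class, columns: predicted class (negative, neutral, positive).
cm = [[39, 11, 0],
      [16, 57, 15],
      [2, 25, 113]]

def per_class_metrics(matrix):
    """Return a (precision, recall) pair for each class of a square confusion matrix."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column total
        actual_k = sum(matrix[k])                          # row total
        metrics.append((tp / predicted_k, tp / actual_k))
    return metrics

metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))
for name, (p, r) in zip(["negative", "neutral", "positive"], metrics):
    print(f"{name}: precision={p:.4f} recall={r:.4f}")
print(f"accuracy={accuracy:.4f}")
```

With this orientation the values agree with the last fold of the random-oversampling run (for example, negative precision 0.6842 and recall 0.78), which is a useful check that the matrix axes are labelled the right way round.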
BLUE OCEAN DATA
#dataset for blue ocean
data_bo = data
data_bo['main_topic'] = df1['main_topic']
data_bo.head(5)
   text                               clean_text                          label  main_topic
0  fomos em família passar oito dias  familia passar dia blue tree tawer      1  serviços:atendimento,café, limpeza