INFORMATICS ENGINEERING (07 T)
INVESTIGATION OF CLOUD
COMPUTING TECHNOLOGY ON THE
VISUALISATION AND CLASSIFICATION
ALGORITHMS
Tomas Pranckevičius
October 2013
Technical Report MII-DS-07T-13-14
Institute of Mathematics and Informatics, Akademijos str. 4, Vilnius LT 08663,
Lithuania
http://www.mii.lt/
Vilnius University
INSTITUTE OF MATHEMATICS AND
INFORMATICS
LITHUANIA
Abstract
This technical report reviews the current state and open issues of Cloud computing technology, using the Naive Bayes classifier as a case study, and examines the predicted impact of future advances in Cloud computing technology. A brief history of Cloud computing technology is outlined first. The report then focuses on the advantages and limitations of the Map Reduce paradigm and of implementing the Naive Bayes classifier in a Cloud computing environment using the Mahout library. It is concluded that further technological advances in Cloud data processing will continue to improve the quality of visualisation and classification algorithms. It is also suggested that specialised algorithms will increasingly be incorporated into other types of technology, such as modern web-service-based data mining systems. Science, economics, finance and medicine are fields that apply such tools and algorithms for data visualisation and classification, and for improving decision making based on data analysis.
Keywords: Cloud computing, Map Reduce, Naive Bayes, classification and visualisation algorithms.
Contents
Introduction
1 Research
  1.1 Research Methodology
  1.2 The Object
  1.3 The Goal
  1.4 The Tasks
  1.5 Main questions
  1.6 Theoretical research
2 Computing Technology
  2.1 Review of Cloud Computing Technology
    2.1.1 Architecture
    2.1.2 Characteristics
    2.1.3 Service models
    2.1.4 Deployment models
  2.2 Map Reduce Paradigm
    2.2.1 Apache Mahout Library
    2.2.2 Designing algorithms for Map Reduce
    2.2.3 Selection of algorithms
Conclusions
References
Introduction
Many of today's computing advancements and issues are based on data processing that must cope with growing data sets in parallel. Science, economics, finance and medicine are fields that apply such tools and algorithms for data visualisation and classification in order to detect or indicate data clusters and other specific areas needed for decision making. The purpose of this research is to investigate the issues and advantages of the impact of Cloud computing technology on visualisation and classification algorithms; that is, this technical report reviews the theory, a computer experiment and the current state issues of Cloud computing technology using the Map Reduce paradigm and algorithm design, the Mahout library and the Naive Bayes classifier.
1 Research
1.1 Research Methodology
The methods applied in this research are mainly based on theoretical research, a constructive approach and experiment:
1. Analysis of the scientific and experimental achievements in the fields of data visualisation, classification and Cloud computing technologies, using information retrieval, organisation, analysis, benchmarking and aggregation methods.
2. Software development and conceptual methods.
3. Theoretical methods used to prove theorems and to test the convergence of algorithms; the principle of mathematical induction is applied to prove statements.
4. Experimental analysis of statistical data and results, used to evaluate the aggregation method: selection of a random sample, generation of two random "identical" data groups, variable analysis, and measurement of dependent variables under given conditions.
1.2 The Object
The object: visualisation and classification algorithms based on modern Cloud computing technology solutions.
1.3 The Goal
The goal: to investigate the impact of Cloud computing technology on visualisation and classification algorithms.
1.4 The Tasks
The following tasks of the research have to be completed for the goal of the research to be reached:
1. Review Cloud computing technologies and the classification and visualisation algorithms.
2. Identify scientific issues in tasks arising while applying visualisation and classification algorithms in Cloud computing technologies.
3. Select and describe the methodology used in fulfilling the tasks of the dissertation.
4. Overview research methods for assessing the impact of implementing visualisation and classification algorithms in Cloud computing technology.
5. Identify key research issues and formulate concerns for experimental research and analysis.
6. Develop principles and a methodology for analysing the development of visualisation and classification algorithms on Cloud computing technologies.
7. Analyse and select visualisation and classification algorithms suitable for Cloud computing technologies.
8. Select evaluation criteria.
9. Investigate the main issues of visualisation and classification algorithms applied to Cloud computing technologies.
10. Perform a comparative implementation and analysis of visualisation and classification algorithms in Cloud computing technologies.
11. Apply and modify visualisation and classification algorithms used in Cloud computing technologies.
12. Perform experimental research and a comparative analysis of implemented, applied and modified existing algorithms using real data sets.
13. Perform a statistical analysis of the data obtained.
14. Summarise the conclusions and identify the essential findings.
15. Prepare and publish the findings.
1.5 Main questions
The core questions for future research, experimentation and analysis:
1. Which features of Cloud computing enable the execution of visualisation and classification algorithms?
2. What kind of Cloud computing technology and resources should be used in this investigation?
3. What are visualisation and classification algorithms useful for?
4. Which algorithms exactly are these?
5. What resources are available for testing and implementation?
6. Why Cloud computing?
7. What kind of real data should be used?
1.6 Theoretical research
The implementation of visualisation and classification algorithms in Cloud computing technology is based on theoretical research:
1. Overview of distributed parallelism and scalability technologies;
2. Overview of foundation of classification and visualization algorithms;
3. Designing algorithms for distributed parallel and scalable technologies;
4. Computer experimentation;
5. Improving algorithms.
2 Computing Technology
2.1 Review of Cloud Computing Technology
From a technological point of view, the computer industry has witnessed several major turning points. First, technology slowly migrated from universities and public institutions to the first personal computers: most mainframe systems were decentralised into client-server systems, and with the rapid spread of technological progress the first personal computers reached people's homes. Then came the second technological transformation, when computers were connected to the global network that today is called the Internet. The third and crucial turning point came when users started to share the available IT infrastructure and services by purchasing or renting them from Cloud computing service providers. In technological terms, it is a transition from client-server to centralised systems with distributed parallelism, like a cyclical return to the past, but done in order to optimise time or cost and to focus on the core assets of an enterprise. For example, millions of users every day open their web mail accounts, edit documents online, store files on the Internet, use social networks or media sites, and upload a wide range of data, and all of this happens independently of their location.
Almost all of these services are based on Cloud computing technologies, because in the end the information resides on the Internet or in a data centre. The user's view of Cloud computing is simple: hardware and software are provided through the Internet as a service. Users may enter a web address in a browser, authenticate to the information systems and use any services or applications provided through local or remote data centres. To do this, mostly all that is needed is access to the network, and then all remote assets, including data, software and hardware capabilities, are available.
But there is also a professional or scientific view of Cloud-computing-based technologies, which is not so simple because of the variety of elements included in this process. One of them is data science. This report is dedicated to data science provided through Cloud computing with web-based software, including data visualisation and classification algorithms.
The idea of Cloud computing technology is not a new one. John McCarthy already envisioned in the 1960s that computing facilities would be provided to the general public like a utility (Parkhill, 1966). The term "Cloud" has also been used in various contexts, such as describing the business model of providing services across the Internet, but the term really started to gain popularity after Eric Schmidt used it in 2006 (Qi Zhang, Lu Cheng, Raouf Boutaba, 2010). Cloud computing technologies are becoming more reasonable to use in today's technology world, and they offer many advantages: cost reduction, because less hardware is needed (virtualization); lower energy consumption; no up-front investment and a pay-as-you-go model; lower operating costs; rapid allocation and de-allocation; scalability when service demands change; accessibility through different devices; and reduced business risk through outsourcing to infrastructure providers.
2.1.1 Architecture
Today, Cloud computing architecture can be described as a layered model with four abstractions: hardware, infrastructure, platform and application (NIST) (Peter Mell, Timothy Grance, September 2011).
Fig. 1. Architecture layered model
The hardware layer is responsible for managing the physical resources of the Cloud. The infrastructure layer creates a pool of resources using virtualization technologies. The platform layer consists of the operating system and application frameworks. The application layer runs the actual Cloud applications in a physical or virtual environment.
Virtualization was first introduced by Gerald J. Popek and Robert P. Goldberg in their 1974 article "Formal Requirements for Virtualizable Third Generation Architectures".
2.1.2 Characteristics
According to the National Institute of Standards and Technology (NIST) (Peter Mell, Timothy Grance, September 2011), there are five essential characteristics of Cloud computing:
1. On-demand self-service.
2. Broad network access.
3. Resource pooling.
4. Rapid elasticity.
5. Measured service.
The idea behind these is to describe a variety of technical characteristics, i.e. the possibility to extend computing capabilities without human interaction, to ensure availability over the network, to provide computing resources to different consumers with a sense of location independence, or to automatically monitor, control and optimise resources.
Fig. 2. Physical and virtual infrastructure model
2.1.3 Service models
Recommendations for the service models of Cloud computing were defined by NIST (Peter Mell, Timothy Grance, September 2011). Service models describe Cloud computing as part of the service industry, providing intangible products and services. Cloud service providers can be divided into infrastructure providers, which manage Cloud platforms and lease resources, and service providers, which rent resources from infrastructure providers. These service models are divided into:
1. SaaS (Software as a Service) – the capability to use the provider's applications, accessible from client devices. The consumer is not responsible for managing or controlling the Cloud infrastructure, or even individual application capabilities, except application configuration settings.
2. PaaS (Platform as a Service) – the capability for the consumer to deploy consumer-created applications onto the Cloud infrastructure. Programming languages, libraries, services and tools are supported by the service provider.
3. IaaS (Infrastructure as a Service) – the capability provided to the consumer as computing resources: processing, storage and networks. The consumer is able to deploy and run arbitrary software, which can include operating systems and applications, and to manage network configuration, but does not manage or control the underlying physical hardware infrastructure.
MODEL  LAYER           TECHNOLOGY
SaaS   Application     Cloud applications
PaaS   Platform        Operating system and application frameworks
IaaS   Infrastructure  Virtualization technologies
Fig. 3. Service models
2.1.4 Deployment models
According to NIST (Peter Mell, Timothy Grance, September 2011), there are four deployment models of Cloud computing, based on private or public availability of the Cloud service:
1. Private Cloud – infrastructure provisioned for exclusive use by a single organization; it may be owned, managed or operated by that organization, a third party, or a combination of them.
2. Community Cloud – infrastructure provisioned for use by a specific community of consumers from organizations that have shared concerns, such as mission, security, requirements, policy and compliance considerations; it may be owned, managed or operated by the organizations in the community, a third party, or a combination of them.
3. Public Cloud – infrastructure provisioned for open use by the general public; it may be owned, managed or operated by a business, academic or government organization.
4. Hybrid Cloud – infrastructure composed of two or more Cloud models.
Below, the matrix of Cloud computing models is presented in the same way as described above.
MODEL      ORGANIZATION  MANAGING  OWNER     PREMISES
Private    Internal      Multiple  Multiple  On/Off
Community  Shared        Multiple  Multiple  On/Off
Public     Public        Multiple  Multiple  On
Hybrid     Multiple      Multiple  Multiple  On/Off
Fig. 4. Matrix of Cloud computing models
2.2 Map Reduce Paradigm
The Map Reduce framework is used to make classification algorithms applicable to large-scale data (Lijuan Zhou, Hui Wang, Wenbo Wang, 2012). It is mostly known as a product of the Apache Hadoop foundation, and its advantages are:
- The leading implementation of the Map Reduce paradigm.
- A high level of scalability for parallel computing.
- A high level of tolerance of infrastructure failure and change.
- Several copies are created across the cluster, so it remains fault tolerant.
- It uses HDFS.
Disadvantages (http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/):
- Hadoop is run by a master node, and specifically a namenode, which is a single point of failure.
- HDFS compression could be better.
- HDFS stores three copies of everything, whereas many DBMSs and file systems are satisfied with two.
- Hive (the canonical way to do SQL joins and similar operations in Hadoop) is slow.
Process of Map Reduce (for instance, on a set of documents D1, ..., DN):
Map: analyse each document D into terms T1, ..., TN and output (key, value) pairs (T1, D1), ..., (TN, DN).
Reduce: for each term T, collect all (key, value) pairs whose key is T and output the result pair (T, (D1, ..., DN)).
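The map and reduce steps above can be sketched as a minimal single-process simulation in Python (illustrative only; a real Map Reduce job runs distributed on a Hadoop cluster, and the document contents here are invented):

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (term, doc_id) pair for every term in every document."""
    for doc_id, text in documents.items():
        for term in text.split():
            yield (term, doc_id)

def reduce_phase(pairs):
    """Group pairs by term, producing (T, (D1, ..., DN)) for each term T."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return dict(grouped)

docs = {"D1": "cloud data", "D2": "cloud computing"}
index = reduce_phase(map_phase(docs))
# index["cloud"] now lists every document containing the term "cloud"
```

The result is an inverted index: each term becomes a key whose value is the list of documents that contain it, which is exactly the (key, value) grouping the reduce step describes.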
Alternatives that do not use HDFS (http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/):
- BashReduce
- Disco Project
- Spark
- GraphLab
- Storm
- HPCC Systems (from LexisNexis)
Fig. 5. Map Reduce process
2.2.1 Apache Mahout Library
The Apache Mahout™ library provides machine learning capabilities and data mining algorithms such as clustering, classification, collaborative filtering and frequent pattern mining. The core implementation of the clustering, classification and collaborative filtering algorithms is based on the Map Reduce paradigm.
2.2.2 Designing algorithms for Map Reduce
Analysis of the principles and methodology for developing visualisation and classification algorithms applied in Cloud computing technologies.
The general processing flow is as follows:
- Input data is "split" among multiple mapper processes, which execute in parallel.
- The output of each mapper is partitioned by key and locally sorted.
- Mapper outputs with the same key land on the same reducer and are consolidated there.
- A merge sort happens at the reducer, so all keys arriving at the same reducer are sorted.
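These split, partition and merge-sort steps can be simulated in plain Python; the hash partitioner and the two-reducer setup below are illustrative assumptions, not Hadoop's actual code:

```python
def partition(key, num_reducers):
    """Assign a key to a reducer, as a hash partitioner would."""
    return hash(key) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route each (key, value) pair to its reducer and sort each bucket by key."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:          # one list of pairs per mapper
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]  # merge-sorted per reducer

# Two mappers emit word counts; the shuffle groups keys onto reducers.
mappers = [[("cloud", 1), ("data", 1)], [("cloud", 1)]]
reducers = shuffle(mappers, num_reducers=2)
# Both ("cloud", 1) pairs land in the same reducer bucket, sorted by key.
```

The essential guarantee is visible here: all pairs with the same key reach the same reducer, and every reducer sees its keys in sorted order.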
Fig. 6. Designing algorithms for Map Reduce
2.2.3 Selection of algorithms
Real data in the natural and social sciences are often high-dimensional, and such data are difficult to understand. Moreover, human beings comprehend visual information more quickly than textual information. The goal of projection (visualisation) methods is to represent the input data items in a lower-dimensional space so that certain properties of the structure of the data set are preserved as faithfully as possible. Visualisation of multidimensional data is a complex problem followed by extensive research, because it allows the investigator to (Gintautas Dzemyda, Olga Kurasova, Julius Žilinskas, 2008):
1. Observe data clusters.
2. Estimate the inter-nearness between the multidimensional points.
3. Make proper decisions.
Most data types in real-world applications cannot be directly illustrated by 2-D or 3-D graphics. Several techniques have been commonly used to visualise data, including point plots and histograms. However, these traditional techniques are too limited for analysing highly dimensional data. During the last decades a number of novel techniques have been developed, classified into the following types (Keim, 2002) (Evangelos Triantaphyllou, Giovanni Felici, 2006):
1. Geometrically transformed displays, such as landscapes and parallel coordinates
as in scalable framework.
2. Icon-based displays, such as needle icons and star icons.
3. Dense pixel displays, such as the recursive pattern, circle segments techniques
and the graph sketches.
4. Stacked display, such as tree maps or dimensional stacking.
The most widely known algorithms related to visualisation are:
1. Multidimensional scaling.
2. Relative multidimensional scaling.
3. Diagonal majorization.
4. Sammon's projection.
5. Relational perspective map.
And classification algorithms such as:
1. Naive Bayes trees.
2. C4.5.
For analysis and better understanding, the Naive Bayes algorithm is selected in this research.
2.2.3.1 Naive Bayes Classifier
Classification algorithms can be adopted for the classification of documents, images, spam filtering and other data sets.
More details on how the Naive Bayes classifier is implemented can be found on the Mahout wiki page. A step-by-step description of how to create a training set, train the Naive Bayes classifier and then use it to classify new tweets is given in the Chimpler tutorial listed in the additional references.
The Naive Bayes classifier is a probabilistic model and can be implemented easily and quickly with the Map Reduce paradigm. It is used for probability estimation and can be improved after fault analysis or by using additional algorithms. The Naive Bayes classifier is trained on a known data set: all data is classified according to known similarities, and unknown objects are assigned to a class. Mostly the Naive Bayes classifier is used to classify text entries, such as:
- Art (book, music, movie, …)
- Event (travel, concert, …)
- Health (beauty, SPA, …)
- Home (kitchen, furniture, garden, …)
- Technology (desktop computer, laptop, smartphone, smart tv)
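A minimal multinomial Naive Bayes text classifier can be sketched in Python to illustrate the idea; this is not Mahout's implementation, and the training sentences below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, samples):
        """samples: list of (text, label) pairs from a known training set."""
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in samples:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        """Return the label with the highest (log) posterior probability."""
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            score = math.log(self.label_counts[label] / total)   # class prior
            n = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace smoothing avoids zero probability for unseen words
                p = (self.word_counts[label][word] + 1) / (n + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.fit([("piano concert tour", "Event"),
        ("kitchen furniture garden", "Home"),
        ("laptop smartphone smart tv", "Technology")])
```

For example, `nb.predict("new smartphone and laptop")` selects "Technology", because that class assigns the highest smoothed likelihood to the words of the query.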
Suppose we have data about fruits, described by their colour and shape.
The Bayes classifier is trained so that it selects classes as accurately as possible.
The type of an object is classified based on its properties, for example:
- We see fruits that are red and round.
- Question: what is the most likely kind of these fruits?
- To answer, we rely on the sample of selected data, i.e. red and round.
- Thus, in the future we can classify all red and round fruits as fruits of that particular kind.
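The fruit example can be worked out directly with Bayes' rule under the naive independence assumption; the observation counts below are an invented toy data set:

```python
def posterior(kind, features, observations, all_kinds):
    """P(kind | features) up to a constant, with Laplace smoothing."""
    total_fruits = sum(observations[k]["total"] for k in all_kinds)
    p = observations[kind]["total"] / total_fruits        # prior P(kind)
    for f in features:
        # P(feature | kind), smoothed so unseen features are not impossible
        p *= (observations[kind].get(f, 0) + 1) / (observations[kind]["total"] + 2)
    return p

# Toy counts: how often each feature was observed per fruit kind.
observations = {
    "apple":  {"red": 8, "round": 9,  "total": 10},
    "cherry": {"red": 9, "round": 10, "total": 10},
    "banana": {"red": 0, "round": 0,  "total": 10},
}
kinds = list(observations)
scores = {k: posterior(k, ["red", "round"], observations, kinds) for k in kinds}
best = max(scores, key=scores.get)   # the most likely kind of a red, round fruit
```

With these counts, "cherry" wins because it has the highest likelihood of being both red and round; "banana" is almost ruled out, but smoothing keeps its probability nonzero.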
Data sequence diagrams 1-4
Conclusions
This study was conducted to clarify findings and to analyse the use of classification algorithms on Cloud computing technologies.
1. The findings indicate that Cloud computing technology may be adopted and successfully used to work with Big Data sets, including clustering, classification, collaborative filtering, frequent pattern mining, visualisation and other kinds of algorithms.
2. The findings indicate that the Map Reduce paradigm can be adopted to run classification and other kinds of algorithms on Big Data sets utilising computer clusters, but many algorithms still have technical and implementation limitations.
3. The critical problem of massive data mining is the parallelization of data mining algorithms. Cloud computing uses the computing model known as MapReduce, which means that existing data mining algorithms and parallel strategies cannot be applied directly to a cloud computing platform for massive data mining, so some transformation must be done. Based on this, and on the characteristics of massive data mining algorithms, the cloud computing model has been optimized and extended to make it more suitable for massive data mining. Therefore, this report adopts the Hadoop distributed system infrastructure, which provides the storage capacity of HDFS and the computing capability of MapReduce, to implement parallel classification algorithms.
References
Evangelos Triantaphyllou, Giovanni Felici. (2006). Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. New York: Springer.
Gintautas Dzemyda, Olga Kurasova, Julius Žilinskas. (2008). Daugiamačių duomenų vizualizavimo metodai. Vilnius: Mokslo aidai.
Janna Anderson, Lee Rainie. (2012). The Future of the Internet. Washington, USA: Pew Research Center.
Lijuan Zhou, Hui Wang, Wenbo Wang. (2012). Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment. TELKOMNIKA, Vol. 10, pp. 1087-1092.
Parkhill, D. F. (1966). The Challenge of the Computer Utility. Reading: Addison-Wesley.
Peter Mell, Timothy Grance. (September 2011). The NIST Definition of Cloud Computing. Gaithersburg: National Institute of Standards and Technology, U.S. Department of Commerce.
Qi Zhang, Lu Cheng, Raouf Boutaba. (2010). Cloud computing: state-of-the-art and research challenges. Journal of Internet Services and Applications, 7-18.
Additional references:
1. http://www.computerweekly.com/feature/Software-defined-datacentres-demystified
2. http://en.wikipedia.org/wiki/Curiosity_(rover)
3. https://files.ifi.uzh.ch/dbtg/sdbs13/T10.0.pdf
4. http://www.wired.com/wiredenterprise/wp-content/uploads/2012/10/ff_googleinfrastructure2_large.jpg
5. http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
6. http://www.fi.upm.es/?id=tablon&acciongt=consulta1&idet=707
7. http://ercim-news.ercim.eu
8. http://en.wikipedia.org/wiki/Big_data
9. http://www.computerweekly.com/news/2240173897/CERN-adopts-OpenStack-private-cloud-to-solve-big-data-challenges
10. http://www.openstack.org/software/
11. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
12. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
13. http://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf
14. http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
15. http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/