INFORMATICS ENGINEERING (07 T)
INVESTIGATION OF CLOUD
COMPUTING TECHNOLOGY ON THE
VISUALISATION AND CLASSIFICATION
ALGORITHMS
Tomas Pranckevičius
October 2013
Technical Report MII-DS-07T-13-14
Institute of Mathematics and Informatics, Akademijos str. 4, Vilnius LT 08663,
Lithuania
http://www.mii.lt/
Vilnius University
INSTITUTE OF MATHEMATICS AND
INFORMATICS
LITHUANIA
Abstract
This technical report reviews the current state and open issues of Cloud computing technology, using the Naive Bayes classifier as a case study, and examines the predicted impact of future advances in Cloud computing technology. A brief history of Cloud computing technology is outlined first. The report then focuses on the advantages and limitations of the Map Reduce paradigm and of implementing the Naive Bayes classifier in a Cloud computing environment using the Mahout library. It is concluded that further technological advances in Cloud data processing will continue to improve the quality of visualisation and classification algorithms. It is also suggested that specialised algorithms will increasingly be incorporated into other types of technology, such as modern web-service-based data mining systems. Science, economics, finance and medicine are fields that apply such tools and algorithms for data visualisation and classification, and for improving decision making based on data analysis.
Keywords: Cloud computing, Map Reduce, Naive Bayes, classification and visualisation algorithms.
Contents
Introduction
1 Research
  1.1 Research Methodology
  1.2 The Object
  1.3 The Goal
  1.4 The Tasks
  1.5 Main questions
  1.6 Theoretical research
2 Computing Technology
  2.1 Review of Cloud Computing Technology
    2.1.1 Architecture
    2.1.2 Characteristics
    2.1.3 Service models
    2.1.4 Deployment models
  2.2 Map Reduce Paradigm
    2.2.1 Apache Mahout Library
    2.2.2 Designing algorithms for Map Reduce
    2.2.3 Selection of algorithms
Conclusions
References
Introduction
Many of today's computing advancements and issues are based on data processing that must cope with growing data sets in parallel. Science, economics, finance and medicine are fields that apply such tools and algorithms for data visualisation and classification in order to detect or indicate data clusters and other specific areas needed for decision making. The purpose of this research is to investigate the issues and advantages of the impact of Cloud computing technology on visualisation and classification algorithms; that is, this technical report reviews the theory, a computer experiment and the current state issues of Cloud computing technology using the Map Reduce paradigm and algorithm design, the Mahout library and the Naive Bayes classifier.
1 Research
1.1 Research Methodology
The methods applied in this research are mainly based on theoretical research, a constructive approach and experiment:
1. Analysis of the scientific and experimental achievements in the fields of data visualisation, classification and Cloud computing technologies, using information retrieval, organisation, analysis, benchmarking and aggregation methods.
2. Software development and conceptual methods.
3. Theoretical methods used to prove theorems and to test the convergence of algorithms; the principle of mathematical induction is applied to prove statements.
4. Experimental analysis of statistical data and results, used to evaluate the aggregation method: selection of a random sample, generation of two random "identical" data groups, variable analysis, and measurement of dependent variables under given conditions.
1.2 The Object
The object: visualisation and classification algorithms based on modern Cloud computing technology solutions.
1.3 The Goal
The goal: to investigate the impact of Cloud computing technology on visualisation and classification algorithms.
1.4 The Tasks
The following tasks of the research have to be completed for the goal of the research to be reached:
1. Review Cloud computing technologies and the classification and visualisation algorithms.
2. Identify scientific issues in tasks arising while applying visualisation and classification algorithms in Cloud computing technologies.
3. Select and describe the methodology used in fulfilling the tasks of the dissertation.
4. Overview research methods for assessing the impact of implementing visualisation and classification algorithms in Cloud computing technology.
5. Identify key research issues and formulate concerns for experimental research and analysis.
6. Develop principles and a methodology for analysing the development of visualisation and classification algorithms on Cloud computing technologies.
7. Analyse and select visualisation and classification algorithms suitable for Cloud computing technologies.
8. Select evaluation criteria.
9. Investigate the main issues of visualisation and classification algorithms applied to Cloud computing technologies.
10. Perform a comparative implementation and analysis of visualisation and classification algorithms in Cloud computing technologies.
11. Apply and modify visualisation and classification algorithms used in Cloud computing technologies.
12. Perform experimental research and a comparative analysis of implemented, applied and modified existing algorithms using real data sets.
13. Perform a statistical analysis of the data obtained.
14. Summarise the conclusions and identify the essential findings.
15. Prepare and publish the findings.
1.5 Main questions
The core questions for future research, experimentation and analysis:
1. Which features of Cloud computing enable the execution of visualisation and classification algorithms?
2. What kind of Cloud computing technology and resources should be used in this investigation?
3. What are visualisation and classification algorithms useful for?
4. Which algorithms exactly are these?
5. What resources are available for testing and implementation?
6. Why Cloud computing?
7. What kind of real data should be used?
1.6 Theoretical research
The implementation of visualisation and classification algorithms in Cloud computing technology is based on theoretical research:
1. Overview of distributed parallelism and scalability technologies;
2. Overview of foundation of classification and visualization algorithms;
3. Designing algorithms for distributed parallel and scalable technologies;
4. Computer experimentation;
5. Improving algorithms.
2 Computing Technology
2.1 Review of Cloud Computing Technology
From a technological point of view, the computer industry has witnessed several major turning points. First, technology slowly migrated from universities and public institutions to the first personal computers: most mainframe systems were decentralised into client-server systems, and with the rapid spread of technological progress the first personal computers reached people's homes. Then came the second technological transformation, when computers were connected to the global network that today is called the Internet. The third and crucial turning point came when users started to share the available IT infrastructure and services by purchasing or renting them from Cloud computing service providers. In technological terms, it is a transition from client-server to centralised systems with distributed parallelism, like a cyclical return to the past, but done in order to optimise time or cost and to focus on the core assets of an enterprise. For example, millions of users every day open their web mail accounts, edit documents online, store files on the Internet, use social networks or media sites, and upload a wide range of data, and all of this happens independently of their location.
Almost all of these services are based on Cloud computing technologies, because in the end the information resides on the Internet or in a data centre. The user's view of Cloud computing is simple: hardware and software are provided through the Internet as a service. Users may enter a web address in a browser, authenticate to the information systems and use any services or applications provided through local or remote data centres. To do this, mostly all that is needed is access to the network, and then all remote assets, including data, software and hardware capabilities, are available.
But there is also a professional or scientific view of Cloud-computing-based technologies, which is not so simple because of the variety of elements included in this process. One of them is data science. This report is dedicated to data science provided through Cloud computing with web-based software, including data visualisation and classification algorithms.
The idea of Cloud computing technology is not a new one. John McCarthy already envisioned in the 1960s that computing facilities would be provided to the general public like a utility (Parkhill, 1966). The term "Cloud" has also been used in various contexts, such as describing the business model of providing services across the Internet, but the term really started to gain popularity after Eric Schmidt used it in 2006 (Qi Zhang, Lu Cheng, Raouf Boutaba, 2010). Cloud computing technologies are becoming more reasonable to use in today's technology world, and they offer many advantages: cost reduction, because less hardware is needed (virtualization); lower energy consumption; no up-front investment and a pay-as-you-go model; lower operating costs; rapid allocation and de-allocation; scalability when service demands change; accessibility through different devices; and reduced business risk through outsourcing to infrastructure providers.
2.1.1 Architecture
Today, Cloud computing architecture can be described as a layered model with four abstractions: hardware, infrastructure, platform and application (NIST) (Peter Mell, Timothy Grance, September 2011).
Fig. 1. Architecture layered model
The hardware layer is responsible for managing the physical resources of the Cloud. The infrastructure layer creates a pool of resources using virtualization technologies. The platform layer consists of the operating system and application frameworks. The application layer runs the actual Cloud applications in a physical or virtual environment.
Virtualization was first introduced by Gerald J. Popek and Robert P. Goldberg in their 1974 article "Formal Requirements for Virtualizable Third Generation Architectures".
2.1.2 Characteristics
According to the National Institute of Standards and Technology (NIST) (Peter Mell, Timothy Grance, September 2011), there are five essential characteristics of Cloud computing:
1. On-demand self-service.
2. Broad network access.
3. Resource pooling.
4. Rapid elasticity.
5. Measured service.
The idea behind these is to describe a variety of technical characteristics, i.e. the possibility to extend computing capabilities without human interaction, to ensure availability over the network, to provide computing resources to different consumers with a sense of location independence, or to automatically monitor, control and optimise resources.
Fig. 2. Physical and virtual infrastructure model
2.1.3 Service models
Recommendations for the service models of Cloud computing were defined by NIST (Peter Mell, Timothy Grance, September 2011). Service models describe Cloud computing as part of the service industry, providing intangible products and services. Cloud service providers can be divided into infrastructure providers, which manage Cloud platforms and lease resources, and service providers, which rent resources from infrastructure providers. These service models are divided into:
1. SaaS (Software as a Service) – the capability to use the provider's applications, accessible from client devices. The consumer is not responsible for managing or controlling the Cloud infrastructure, or even individual application capabilities, except application configuration settings.
2. PaaS (Platform as a Service) – the capability for the consumer to deploy consumer-created applications onto the Cloud infrastructure. Programming languages, libraries, services and tools are supported by the service provider.
3. IaaS (Infrastructure as a Service) – the capability provided to the consumer as computing resources: processing, storage and networks. The consumer is able to deploy and run arbitrary software, which can include operating systems and applications, and to manage network configuration, but does not manage or control the underlying physical hardware infrastructure.
MODEL  LAYER           TECHNOLOGY
SaaS   Application     Cloud applications
PaaS   Platform        Operating system and application frameworks
IaaS   Infrastructure  Virtualization technologies
Fig. 3. Service models
2.1.4 Deployment models
According to NIST (Peter Mell, Timothy Grance, September 2011), there are four deployment models of Cloud computing, based on private or public availability of the Cloud service:
1. Private Cloud – infrastructure provisioned for exclusive use by a single organization; it may be owned, managed or operated by that organization, a third party, or a combination of them.
2. Community Cloud – infrastructure provisioned for use by a specific community of consumers from organizations that have shared concerns, such as mission, security, requirements, policy and compliance considerations; it may be owned, managed or operated by the organizations in the community, a third party, or a combination of them.
3. Public Cloud – infrastructure provisioned for open use by the general public; it may be owned, managed or operated by a business, academic or government organization.
4. Hybrid Cloud – infrastructure composed of two or more Cloud models.
Below, the matrix of Cloud computing models is presented in the same way as described above.
MODEL      ORGANIZATION  MANAGING  OWNER     PREMISES
Private    Internal      Multiple  Multiple  On/Off
Community  Shared        Multiple  Multiple  On/Off
Public     Public        Multiple  Multiple  On
Hybrid     Multiple      Multiple  Multiple  On/Off
Fig. 4. Matrix of Cloud computing models
2.2 Map Reduce Paradigm
The Map Reduce framework is used to make classification algorithms applicable to large-scale data (Lijuan Zhou, Hui Wang, Wenbo Wang, 2012). It is mostly known as a product of the Apache Hadoop foundation, and its advantages are:
- The leading implementation of the Map Reduce paradigm.
- A high level of scalability for parallel computing.
- A high level of tolerance of infrastructure failure and change.
- Several copies are created across the cluster, so it remains fault tolerant.
- It uses HDFS.
Disadvantages (http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/):
- Hadoop is run by a master node, and specifically a namenode, which is a single point of failure.
- HDFS compression could be better.
- HDFS stores three copies of everything, whereas many DBMSs and file systems are satisfied with two.
- Hive (the canonical way to do SQL joins and similar operations in Hadoop) is slow.
Process of Map Reduce (for instance, on a set of documents D1, ..., DN):
Map: analyse each document D into terms T1, ..., TN and output (key, value) pairs (T1, D1), ..., (TN, DN).
Reduce: for each term T, collect all (key, value) pairs whose key is T and output the result pair (T, (D1, ..., DN)).
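The map and reduce steps above can be sketched as a minimal single-process simulation in Python (illustrative only; a real Map Reduce job runs distributed on a Hadoop cluster, and the document contents here are invented):

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (term, doc_id) pair for every term in every document."""
    for doc_id, text in documents.items():
        for term in text.split():
            yield (term, doc_id)

def reduce_phase(pairs):
    """Group pairs by term, producing (T, (D1, ..., DN)) for each term T."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return dict(grouped)

docs = {"D1": "cloud data", "D2": "cloud computing"}
index = reduce_phase(map_phase(docs))
# index["cloud"] now lists every document containing the term "cloud"
```

The result is an inverted index: each term becomes a key whose value is the list of documents that contain it, which is exactly the (key, value) grouping the reduce step describes.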
Alternatives that do not use HDFS (http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/):
- BashReduce
- Disco Project
- Spark
- GraphLab
- Storm
- HPCC Systems (from LexisNexis)
Fig. 5. Map Reduce process
2.2.1 Apache Mahout Library
The Apache Mahout™ library provides machine learning capabilities and data mining algorithms such as clustering, classification, collaborative filtering and frequent pattern mining. The core implementation of the clustering, classification and collaborative filtering algorithms is based on the Map Reduce paradigm.
2.2.2 Designing algorithms for Map Reduce
Analysis of the principles and methodology for developing visualisation and classification algorithms applied in Cloud computing technologies.
The general processing flow is as follows:
- Input data is "split" among multiple mapper processes, which execute in parallel.
- The output of each mapper is partitioned by key and locally sorted.
- Mapper outputs with the same key land on the same reducer and are consolidated there.
- A merge sort happens at the reducer, so all keys arriving at the same reducer are sorted.
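These split, partition and merge-sort steps can be simulated in plain Python; the hash partitioner and the two-reducer setup below are illustrative assumptions, not Hadoop's actual code:

```python
def partition(key, num_reducers):
    """Assign a key to a reducer, as a hash partitioner would."""
    return hash(key) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route each (key, value) pair to its reducer and sort each bucket by key."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:          # one list of pairs per mapper
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]  # merge-sorted per reducer

# Two mappers emit word counts; the shuffle groups keys onto reducers.
mappers = [[("cloud", 1), ("data", 1)], [("cloud", 1)]]
reducers = shuffle(mappers, num_reducers=2)
# Both ("cloud", 1) pairs land in the same reducer bucket, sorted by key.
```

The essential guarantee is visible here: all pairs with the same key reach the same reducer, and every reducer sees its keys in sorted order.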
Fig. 6. Designing algorithms for Map Reduce
2.2.3 Selection of algorithms
Real data in the natural and social sciences are often high-dimensional, and such data are difficult to understand. Moreover, human beings comprehend visual information more quickly than textual information. The goal of projection (visualisation) methods is to represent the input data items in a lower-dimensional space so that certain properties of the structure of the data set are preserved as faithfully as possible. Visualisation of multidimensional data is a complex problem followed by extensive research, because it allows the investigator to (Gintautas Dzemyda, Olga Kurasova, Julius Žilinskas, 2008):
1. Observe data clusters.
2. Estimate the inter-nearness between the multidimensional points.
3. Make proper decisions.
Most data types in real-world applications cannot be directly illustrated by 2-D or 3-D graphics. Several techniques have been commonly used to visualise data, including point plots and histograms. However, these traditional techniques are too limited for analysing highly dimensional data. During the last decades a number of novel techniques have been developed, classified into the following types (Keim, 2002) (Evangelos Triantaphyllou, Giovanni Felici, 2006):
1. Geometrically transformed displays, such as landscapes and parallel coordinates
as in scalable framework.
2. Icon-based displays, such as needle icons and star icons.
3. Dense pixel displays, such as the recursive pattern, circle segments techniques
and the graph sketches.
4. Stacked display, such as tree maps or dimensional stacking.
The most widely known algorithms related to visualisation are:
1. Multidimensional scaling.
2. Relative multidimensional scaling.
3. Diagonal majorization.
4. Sammon's projection.
5. Relational perspective map.
And classification algorithms such as:
1. Naive Bayes trees.
2. C4.5.
For analysis and better understanding, the Naive Bayes algorithm is selected in this research.
2.2.3.1 Naive Bayes Classifier
Classification algorithms can be adopted for the classification of documents, images, spam filtering and other data sets.
More details on how the Naive Bayes classifier is implemented can be found on the Mahout wiki page. A step-by-step description of how to create a training set, train the Naive Bayes classifier and then use it to classify new tweets is given in the Chimpler tutorial listed in the additional references.
The Naive Bayes classifier is a probabilistic model and can be implemented easily and quickly with the Map Reduce paradigm. It is used for probability estimation and can be improved after fault analysis or by using additional algorithms. The Naive Bayes classifier is trained on a known data set: all data is classified according to known similarities, and unknown objects are assigned to a class. Mostly the Naive Bayes classifier is used to classify text entries, such as:
- Art (book, music, movie, …)
- Event (travel, concert, …)
- Health (beauty, SPA, …)
- Home (kitchen, furniture, garden, …)
- Technology (desktop computer, laptop, smartphone, smart tv)
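A minimal multinomial Naive Bayes text classifier can be sketched in Python to illustrate the idea; this is not Mahout's implementation, and the training sentences below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, samples):
        """samples: list of (text, label) pairs from a known training set."""
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in samples:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        """Return the label with the highest (log) posterior probability."""
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            score = math.log(self.label_counts[label] / total)   # class prior
            n = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace smoothing avoids zero probability for unseen words
                p = (self.word_counts[label][word] + 1) / (n + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.fit([("piano concert tour", "Event"),
        ("kitchen furniture garden", "Home"),
        ("laptop smartphone smart tv", "Technology")])
```

For example, `nb.predict("new smartphone and laptop")` selects "Technology", because that class assigns the highest smoothed likelihood to the words of the query.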
Suppose we have data about fruits, described by their colour and shape.
The Bayes classifier is trained so that it selects classes as accurately as possible.
The type of an object is classified based on its properties, for example:
- We see fruits that are red and round.
- Question: what is the most likely kind of these fruits?
- To answer, we rely on the sample of selected data, i.e. red and round.
- Thus, in the future we can classify all red and round fruits as fruits of that particular kind.
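The fruit example can be worked out directly with Bayes' rule under the naive independence assumption; the observation counts below are an invented toy data set:

```python
def posterior(kind, features, observations, all_kinds):
    """P(kind | features) up to a constant, with Laplace smoothing."""
    total_fruits = sum(observations[k]["total"] for k in all_kinds)
    p = observations[kind]["total"] / total_fruits        # prior P(kind)
    for f in features:
        # P(feature | kind), smoothed so unseen features are not impossible
        p *= (observations[kind].get(f, 0) + 1) / (observations[kind]["total"] + 2)
    return p

# Toy counts: how often each feature was observed per fruit kind.
observations = {
    "apple":  {"red": 8, "round": 9,  "total": 10},
    "cherry": {"red": 9, "round": 10, "total": 10},
    "banana": {"red": 0, "round": 0,  "total": 10},
}
kinds = list(observations)
scores = {k: posterior(k, ["red", "round"], observations, kinds) for k in kinds}
best = max(scores, key=scores.get)   # the most likely kind of a red, round fruit
```

With these counts, "cherry" wins because it has the highest likelihood of being both red and round; "banana" is almost ruled out, but smoothing keeps its probability nonzero.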
Data sequence diagrams 1-4
Conclusions
This study was conducted to clarify findings and to analyse the use of classification algorithms on Cloud computing technologies.
1. The findings indicate that Cloud computing technology may be adopted and successfully used to work with Big Data sets, including clustering, classification, collaborative filtering, frequent pattern mining, visualisation and other kinds of algorithms.
2. The findings indicate that the Map Reduce paradigm can be adopted to run classification and other kinds of algorithms on Big Data sets utilising computer clusters, but many algorithms still have technical and implementation limitations.
3. The critical problem of massive data mining is the parallelization of data mining algorithms. Cloud computing uses the computing model known as MapReduce, which means that existing data mining algorithms and parallel strategies cannot be applied directly to a cloud computing platform for massive data mining, so some transformation must be done. Based on this, and on the characteristics of massive data mining algorithms, the cloud computing model has been optimized and extended to make it more suitable for massive data mining. Therefore, this report adopts the Hadoop distributed system infrastructure, which provides the storage capacity of HDFS and the computing capability of MapReduce, to implement parallel classification algorithms.
References
Evangelos Triantaphyllou, Giovanni Felici. (2006). Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. New York: Springer.
Gintautas Dzemyda, Olga Kurasova, Julius Žilinskas. (2008). Daugiamačių duomenų vizualizavimo metodai. Vilnius: Mokslo aidai.
Janna Anderson, Lee Rainie. (2012). The Future of the Internet. Washington, USA: Pew Research Center.
Lijuan Zhou, Hui Wang, Wenbo Wang. (2012). Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment. TELKOMNIKA, Vol. 10, pp. 1087-1092.
Parkhill, D. F. (1966). The Challenge of the Computer Utility. Reading: Addison-Wesley.
Peter Mell, Timothy Grance. (September 2011). The NIST Definition of Cloud Computing. Gaithersburg: National Institute of Standards and Technology, U.S. Department of Commerce.
Qi Zhang, Lu Cheng, Raouf Boutaba. (2010). Cloud computing: state-of-the-art and research challenges. Journal of Internet Services and Applications, 7-18.
Additional references:
1. http://www.computerweekly.com/feature/Software-defined-datacentres-demystified
2. http://en.wikipedia.org/wiki/Curiosity_(rover)
3. https://files.ifi.uzh.ch/dbtg/sdbs13/T10.0.pdf
4. http://www.wired.com/wiredenterprise/wp-content/uploads/2012/10/ff_googleinfrastructure2_large.jpg
5. http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
6. http://www.fi.upm.es/?id=tablon&acciongt=consulta1&idet=707
7. http://ercim-news.ercim.eu
8. http://en.wikipedia.org/wiki/Big_data
9. http://www.computerweekly.com/news/2240173897/CERN-adopts-OpenStack-private-cloud-to-solve-big-data-challenges
10. http://www.openstack.org/software/
11. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
12. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
13. http://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf
14. http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
15. http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/