
Airo International Research Journal July, 2016

Volume VIII, ISSN: 2320-3714

Impact Factor 0.75 to 3.19


ANALYSIS OF BIG DATA MAP-REDUCE MODEL ON CLOUDS

Zeeshan Ahmed

Research Scholar, Kalinga University

Supervisor: Dr. Rupak Sharma, Asst. Prof.

ABSTRACT

MapReduce plays a critical role as a leading framework for big data analytics. In this paper, we consider a cloud architecture that provides MapReduce services based on the big data collected from end users all over the world. Existing work handles MapReduce jobs with a traditional computation-centric approach in which all input data distributed across multiple clouds are aggregated to a virtual cluster that resides in a single cloud. Its poor efficiency and high cost for big data support motivate us to propose a novel data-centric architecture with three key techniques, namely, cross-cloud virtual clusters, data-centric job placement, and network-coding-based traffic routing. Our design leads to an optimization framework whose objective is to minimize both the computation and the transmission cost of running a set of MapReduce jobs in geo-distributed clouds. We further design a parallel algorithm by decomposing the original large-scale problem into several distributively solvable subproblems that are coordinated by a high-level master problem. Finally, we conduct real-world experiments and extensive simulations to show that our proposal significantly outperforms existing work.

KEYWORDS: Big data, MapReduce, cloud, optimization.

INTRODUCTION

Cloud computing offers enormous computation power and storage capacity. Its roots go back to the 1950s and the use of mainframe computers. Around the globe, rapid advances in technology are fueling innovation, propelling economic development and shaping the future. By 2020, more than one third of all data is expected to live in or pass through the cloud, and data production in 2020 is projected to be many times greater than it was in 2009; individuals create about 70% of all data, while enterprises store about 80% of it.

The "cloud" is a shared resource that is extremely powerful, not only because it is shared by a large number of users but also because it can be accessed dynamically according to demand. It is called a "cloud" because of its dynamic changes in scale, abstract boundary, and uncertain location, like a real cloud in nature; nevertheless, it exists in reality. The cloud is not a fixed set of hardware, software, or services: it is a combination of many technologies, and its scope keeps growing as newly emerging technologies join the mix. The National Institute of Standards and Technology (NIST) of the U.S. Department of Commerce defines it as follows: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." This cloud model is composed of five essential characteristics, three service models, and four deployment models.

Cloud computing is on-demand computing: shared resources and information are provided to computers and other devices as requested. It has become a highly valuable utility because of its high computing power, low-cost services, high performance, scalability, and accessibility as well as availability. Cloud vendors are experiencing growth rates of around 50% per annum. Broadly, cloud computing combines traditional computing techniques with networking technologies such as distributed computing, parallel computing, utility computing, network storage, virtualization, load balancing, and high availability. Distributed computing, for example, splits a large computation into small portions, assigns them to multiple computers, and then gathers and assembles all of the results. Parallel computing, meanwhile, aggregates a large number of computational resources to process a single task, which is a very efficient answer to parallel problems.
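To make the split/gather pattern just described concrete, here is a minimal Python sketch of distributed-style computation on a single machine; the unit of work (squaring numbers) and the worker count are invented for illustration and are not part of any system discussed in this paper.

```python
# Illustrative split/compute/gather: divide a large job into small
# portions, hand them to several worker processes, then collect and
# assemble the partial results.
from multiprocessing import Pool

def square(x):
    # Stand-in for one small portion of a large computation.
    return x * x

def sum_of_squares(data, workers=4):
    with Pool(processes=workers) as pool:
        partial_results = pool.map(square, data)  # scatter the portions
    return sum(partial_results)                   # gather and assemble

if __name__ == "__main__":
    print(sum_of_squares(range(1000)))  # 332833500
```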

Cloud computing comes into focus whenever one considers what IT always needs: a way to increase capacity or add capabilities on the fly without investing in new infrastructure, training new personnel, or licensing new software. It is a method of data processing, storage, and delivery that provides highly centralized physical resources to remote clients on demand. Its advantages include on-demand self-service, ubiquitous network access, location-independent resource pooling, rapid resource elasticity, usage-based pricing, transference of risk, and so on. Users can employ these services to process their business jobs in a pay-as-you-go fashion while saving the enormous capital investment of their own IT infrastructure. Cloud computing covers any subscription-based or pay-per-use service that extends IT's existing capabilities in real time over the Internet. The idea is not new, but this form of cloud computing is getting new life from Amazon.com, Sun, IBM, and others who now offer storage and virtual servers that IT can access on demand. Early enterprise adopters mainly use utility computing for supplemental, non-mission-critical needs, but one day they may replace parts of the data center.

REVIEW OF LITERATURE

Murugeshwari et al. (2013) aimed to mine the data sets held by multiple custodians in a semi-honest model, securely and without data disclosure among the custodians involved: no custodian discloses its information. To reduce computational complexity, the scheme partitions data horizontally. The proposed research includes a well-planned architecture that achieves privacy preservation in data mining using a hybrid model combining the commutative Rivest-Shamir-Adleman (RSA) algorithm with the C5.0 algorithm to generate classification rules. The study used real-world data from the UCI repository, and experiments were conducted on parameters such as time complexity, accuracy and error rate. The new model preserved the expected privacy without information loss, required less computation time, reduced the error rate and improved accuracy.

Bhuyan et al. (2012) preserved data privacy by using a perturbation technique based on alias names. In a centralized data evaluation, the work performs data classification and feature selection for a data mining decision model, building a structural information model. A gain ratio technique is applied for better feature selection performance as a centralized computational task, since not all features need privacy preservation of confidential data to obtain the best model. A chi-square test was used for data classification by the centralized data mining model on its own processing unit, and the PPDM alias data model provided a data mining technique for building the best model without violating individuals' privacy. The proposed data miner achieved the best feature selection, and two types of experimental tests were conducted in the study.

Ousseynou et al. (2012) proposed extended k-anonymity definitions and used them to prove that a given data mining model does not violate the k-anonymity of the individuals represented in the learning examples. The extension provides a tool to measure the anonymity retained during data mining, and the model was shown to be applicable to various data mining problems such as association rule mining, classification and clustering. The work described two data mining algorithms that exploit the extension to guarantee the generation of k-anonymous output, with experimental results for one of them, and showed that the method contributes new and efficient ways to anonymize data and preserve patterns during anonymization. The k-anonymity protection model was originally proposed to protect the identities of subjects in disclosed databases; however, it may still be possible to infer private data directly from a k-anonymous data set, an attack called attribute linkage. K-anonymity also suffers from another attack based on data mining results: data mining models and patterns can pose privacy threats regardless of whether k-anonymity is satisfied. The work discussed how privacy requirements characterized by k-anonymity are violated by data mining results and introduced an approach for limiting such privacy breaches, using the Adult data set from the UCI Knowledge Discovery in Databases (KDD) archive to prove its effectiveness.

Kenig et al. (2012) presented a practical approximation algorithm that solves the k-anonymization problem with an approximation guarantee of O(ln k). The algorithm improves upon an algorithm due to Aggarwal et al. that offers an approximation guarantee of O(k), and generalizes that of Park and Shim, which was limited to the case of generalization by suppression. The proposed algorithm uses techniques introduced therein for mining closed frequent generalized records. The experiments showed that the significance of the algorithm is not limited to the theory of k-anonymization: it achieves lower information loss than the leading approximation algorithm as well as the leading heuristic algorithms. A modified version that issues l-diverse k-anonymizations also achieves lower information loss than the corresponding modified versions of the leading algorithms.

Navarro-Arribas et al. (2012) discussed the formal protection model named k-anonymity and the accompanying policies for its deployment. A release provides k-anonymity protection if the information for every person in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. The work examined re-identification attacks that can be realized on releases adhering to k-anonymity unless the accompanying policies are respected. The k-anonymity protection model is important because it is the basis of real-world systems such as Datafly, µ-Argus and k-Similar that guarantee privacy protection. Query log anonymization is an important process performed before sensitive data are published, ensuring the anonymity of users in the logs, a problem already observed in released logs from well-known companies. The authors presented the anonymization of query logs using microaggregation; the proposal ensures the k-anonymity of the users in the query log while preserving its utility. They evaluated the proposal on real query logs, showed the privacy and utility achieved, and provided estimations for the use of such data in data mining processes based on clustering.

Mohan et al. (2012) noted that it is valuable for organizations to have their data analyzed by external agents, but a program that computes over potentially sensitive data risks leaking information through its output. Differential privacy provides a theoretical framework for processing data while protecting the privacy of individual records in a data set, but it has seen only limited use owing to the loss in output accuracy, the difficulty of making programs differentially private, the lack of mechanisms describing the privacy budget in a programmer's utilitarian terms, and the challenging requirement that data owners and analysts manually distribute a limited privacy budget between queries. The authors presented the design and evaluation of a system, GUPT, that overcomes these challenges. Unlike existing differentially private systems such as Privacy Integrated Queries (PINQ), it guarantees differential privacy for programs developed without privacy in mind, makes no trust assumptions about the analysis program, and is secure against known classes of side-channel attacks. GUPT uses a data sensitivity model that degrades data privacy over time, enabling efficient allocation of privacy levels for varied user applications while guaranteeing an overall constant privacy level and maximizing each application's utility. GUPT also introduced techniques that improve output accuracy while achieving the same privacy level. These approaches enable GUPT to execute a wide variety of data analysis programs while providing both utility and privacy.
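GUPT's internal sample-and-aggregate pipeline is not reproduced in this paper, but the Laplace mechanism that differentially private systems of this kind build on can be sketched in a few lines of Python. The toy records, the count query and the epsilon value below are invented for illustration.

```python
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def private_count(records, predicate, epsilon):
    # A counting query has sensitivity 1 (adding or removing one record
    # changes the count by at most 1), so noise of scale 1/epsilon gives
    # epsilon-differential privacy for this single query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 45, 52, 61, 29, 40]  # invented toy records
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
```

Smaller values of epsilon inject more noise and hence give stronger privacy, which is exactly the accuracy/privacy trade-off the GUPT work targets.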

Matthew Hall (2012) proposed a Cumulative Multi-Niching Genetic Algorithm for multimodal function optimization. The cumulative multi-niching genetic algorithm (CMN GA) is designed to speed up optimization problems containing computationally expensive multimodal objective functions. Instead of discarding individuals from the population, the CMN GA makes use of the information from every objective function evaluation as it explores the design space. A fitness-related control of population density over the design space minimizes unnecessary objective function evaluations, and the algorithm's novel arrangement of genetic operations offers fast and robust convergence to multiple local optima. The CMN GA shows better convergence ability and offers an order-of-magnitude reduction in the number of objective function evaluations needed to attain a specified level of convergence.

Richa Garg and Saurabh Mittal (2014) described optimization by genetic algorithm. Genetic algorithms are search and optimization techniques that create solutions to optimization problems by means of methods inspired by natural evolution; optimization is an essential problem in engineering as well as in economics, and the genetic algorithm offers more nearly optimal solutions. The work also covered genetic algorithm concepts and design for the optimization of process controllers. A genetic algorithm is a search heuristic that imitates the process of natural selection, applying nature-inspired operators to optimize process controllers. The concept and design procedure of the genetic algorithm are described as an optimization tool, and the ability and usability of genetic algorithms are investigated for the development of control applications: genetic algorithms are used in the direct torque control of induction motor drives, the speed control of gas turbines, and the speed control of DC servo motors to tune control parameters.
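The studies above and below all build on the same basic genetic-algorithm loop: evaluate fitness, select parents, recombine, mutate. The following is a generic Python sketch of that loop under simplifying assumptions (a bit-string encoding and a toy "one-max" fitness function); it does not reproduce the encodings or operators of any of the cited papers.

```python
import random

GENES, POP, GENERATIONS, MUT_RATE = 20, 30, 50, 0.02

def fitness(ind):
    # Toy objective: maximize the number of 1-bits ("one-max").
    return sum(ind)

def select(pop):
    # Tournament selection: the fitter of two random individuals wins.
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # Single-point crossover.
    point = random.randrange(1, GENES)
    return p1[:point] + p2[point:]

def mutate(ind):
    # Flip each gene independently with a small probability.
    return [1 - g if random.random() < MUT_RATE else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]
print(max(fitness(ind) for ind in pop))  # close to GENES after evolution
```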

Basheer M. Al-Maqaleh and Hamid Shahbazkia (2012) proposed a genetic algorithm for discovering classification rules in data mining. Data mining aims to discover knowledge from large volumes of data, and rule mining is taken as the functional mining method for obtaining valuable knowledge from data stored in database systems. A genetic-algorithm-based approach for mining classification rules from large databases is presented, in which the accuracy, coverage and comprehensibility of the rules are emphasized. The design of the encoding, the genetic operators and the fitness function of the genetic algorithm are described. The work also reviewed the genetic algorithm as an optimization technique: various techniques such as ant colony optimization, simulated annealing, greedy approaches and genetic algorithms are metaheuristic search optimization techniques that mainly aim at global optimization, and several of them are studied under the genetic algorithm for optimization.

Yang Xu et al. (2014) designed a genetic-algorithm-based multilevel association rules mining method for big data sets. Multilevel association rule mining is a significant domain for discovering interesting relations between data elements across multiple concept levels. Many algorithms depend on exhaustive search methods such as Apriori and FP-growth; when used in big data applications, however, these methods suffer extreme computational cost in searching for association rules. To speed up the multilevel association rule search and avoid this excessive computation, a genetic-based method with three key innovations is designed: a category tree is built to describe the multilevel application data sets as domain knowledge; a special tree encoding scheme based on the category tree is used to construct the heuristic multilevel association mining algorithm; and the genetic algorithm relies on the tree encoding scheme to minimize the association rule search space. The method is helpful for mining multilevel association rules in big-data-related applications.

Priyanka Sharma and Saroj (2015) designed the discovery of classification rules using a distributed genetic algorithm. A distributed genetic algorithm for the discovery of classification rules is presented in which local selection and reproduction techniques are employed to evolve the rules within demes, and diversity is improved by transferring rules among selected demes. A subsumption operator is used to reduce the complexity of the discovered rule set. The productivity of the distributed genetic algorithm for discovering classification rules is compared with a traditional crowding GA on UCI and KEEL repository data sets. The work also planned the mining of optimized data using clustering along with fuzzy association rules and genetic algorithms. Data mining, also called knowledge discovery in databases, is an innovative area of database research. The designed technique enhances the data with clustering and extracts fuzzy association rules by means of multi-objective genetic algorithms. The algorithm works in two phases: in the first phase it condenses the data through clustering to minimize the number of evaluations, and in the second phase it employs multi-objective genetic algorithms to locate the most favorable number of fuzzy association rules by means of a threshold value and a fitness function.

CLOUD ARCHITECTURE

In cloud computing the components shown in Figure 1 are loosely coupled. The architecture is broadly divided into two parts:

i. Front end: the client side; the best example is the internet browser.

ii. Back end: the cloud itself.

The success of cloud computing depends upon how its services are accessed and executed, which is an important challenge for the coming decades. In PaaS and SaaS, practical problems such as license management need to be settled, and current research is also addressing questions of interoperability and federation of cloud platforms. Cloud computing has been envisioned as the next-generation architecture of IT enterprise. It moves data to large data centers, where the management of the data and services may not be fully trustworthy.

Figure 1 Cloud Architecture

The architecture supports secure and efficient dynamic operations on data blocks, including data update, delete and append, and the proposed scheme is highly efficient and scalable.

ESSENTIAL CHARACTERISTICS OF CLOUD COMPUTING

Cloud services show five essential characteristics that demonstrate their relation to, and differences from, traditional computing approaches:

• On-demand self-service: A consumer can unilaterally provision computing capabilities as required, automatically, without human interaction with the service provider.

• Broad network access: Computing capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., cell phones, laptops, and PDAs) as well as by other traditional or cloud-based software services.

• Resource pooling: A provider pools computing resources to serve several consumers using a multi-tenant model, dynamically assigning and reassigning physical and virtual resources according to consumer demand. There is a degree of location independence in that the client generally has no control over or knowledge of the exact location of the provided resources.

• Rapid elasticity: Capabilities can be rapidly and elastically provisioned, in some cases automatically, and rapidly released to quickly scale out and scale in. To the consumer, the capabilities often appear to be unlimited and can be purchased in any quantity at any time.

• Measured service: Cloud systems automatically control and optimize resource use by leveraging a metering capability appropriate to the type of service. Usage can be monitored, controlled, and reported, providing transparency for both the provider and the consumer.

CLOUD SERVICE MODELS

In general, clouds offer services at three distinct levels, as shown in Figure 2: IaaS, PaaS, and SaaS. However, some providers can expose services at multiple levels.

• Software as a Service (SaaS) delivers software that is remotely accessible to consumers through the Internet with a usage-based pricing model. For example, Live Mesh from Microsoft allows files and folders to be shared and synchronized across multiple devices. SaaS simplifies the delivery of large applications in a remote and seamless way: it helps the service provider through simplified software installation and maintenance and centralized control over versioning, and it helps the end user by allowing access anywhere, anytime, to share data and collaborate more easily, while keeping the data stored safely in the infrastructure.

• Platform as a Service (PaaS) offers a high-level integrated environment to build, test, and deploy custom applications, as in Google's App Engine. Within this layer resides the middleware framework, a portable component for both grid and cloud systems; examples include WSO2 Stratos, Windows Azure, and the HIMAN middleware. It gives programmers sets of software elements with which to assemble large-scale applications.

• Infrastructure as a Service (IaaS) provisions hardware, software, and equipment to deliver software application environments with a resource-usage-based pricing model. The infrastructure can scale up and down dynamically based on application resource needs. Typical examples are the Amazon EC2 (Elastic Compute Cloud) service, Eucalyptus, and Microsoft Private Cloud. It allows access to large-scale resources on which any software stack can be installed.

Figure 2 Cloud Services model


CLOUD DEPLOYMENT MODELS

There are four deployment models for cloud services, with derivative variations that address specific requirements:

i. Public Cloud. The cloud is made available to the general public or a large industry group and is owned by an organization selling cloud services.

ii. Private Cloud. The cloud is operated solely for a single organization. It may be managed by the organization or by a third party, and may exist on-premises or off-premises.

iii. Community Cloud. The cloud is shared by several organizations to support a specific community that has shared concerns. It may be managed by the organizations or by a third party and may exist on-premises or off-premises.

iv. Hybrid Cloud. The cloud infrastructure consists of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability.

CLOUD SECURITY USING DATA ANONYMIZATION

Although a 100-percent secure cloud infrastructure is recognized to be impossible, we explore the possibility of anonymizing data to augment the cloud security infrastructure. Data anonymization makes data worthless to others while still allowing IT to process it in a useful way. Several formal models of security can help improve data anonymization, including k-anonymity and l-diversity.

As noted earlier, the success of cloud computing depends on how its service and deployment models are accessed and executed, which is an important challenge for the next decade. In PaaS and SaaS, practical problems such as license management need to be resolved, and current research also addresses questions of interoperability and federation of clouds. The scope of hosted applications will keep growing, to applications such as online games and video processing, raising new research problems such as quality of service and management.

K-anonymity attempts to make each record indistinguishable from a defined number (k) of other records. For example, consider a data set that contains two attributes, gender and birthday: the data set is k-anonymized if, for any record, at least k-1 other records have the same gender and birthday. In general, the higher the value of k, the more privacy is achieved.

L-diversity improves anonymization beyond what k-anonymity provides. The difference between the two is that while k-anonymity requires each combination of quasi-identifiers to have at least k entries, l-diversity requires that there be at least l different sensitive values for each combination of quasi-identifiers. Other data anonymization techniques include adding fictitious records to the data, hashing, truncation, permutation, and value shifting, to name a few.
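Both definitions can be checked mechanically. The following Python sketch groups records into equivalence classes over their quasi-identifiers and tests k-anonymity and l-diversity; the toy records, the quasi-identifiers (gender, birthday) and the sensitive attribute (disease) are invented to mirror the example above.

```python
from collections import defaultdict

records = [
    {"gender": "F", "birthday": "1990-01-01", "disease": "flu"},
    {"gender": "F", "birthday": "1990-01-01", "disease": "asthma"},
    {"gender": "M", "birthday": "1985-06-15", "disease": "flu"},
    {"gender": "M", "birthday": "1985-06-15", "disease": "flu"},
]
QUASI_IDENTIFIERS = ("gender", "birthday")

def equivalence_classes(rows):
    # Group records sharing the same quasi-identifier combination.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in QUASI_IDENTIFIERS)].append(row)
    return groups.values()

def is_k_anonymous(rows, k):
    # Every quasi-identifier combination must occur at least k times.
    return all(len(g) >= k for g in equivalence_classes(rows))

def is_l_diverse(rows, l, sensitive="disease"):
    # Every equivalence class needs at least l distinct sensitive values.
    return all(len({r[sensitive] for r in g}) >= l
               for g in equivalence_classes(rows))

print(is_k_anonymous(records, k=2))  # True: each combination occurs twice
print(is_l_diverse(records, l=2))    # False: the male class has only "flu"
```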

Cloud computing may adopt the same controls as any IT environment. However, the cloud service models, the operational models, and the supporting technologies change the risk landscape for an organization with respect to traditional IT. The notions involved and the possible risks a user should assess before committing are the following:

• Data anonymization: removal of personally identifiable information (PII) from a data set.

• Information loss: the loss of information caused by anonymization.

• Privacy preservation: preserving the privacy of the data and its owners at the chosen level of anonymization.

• Privileged user access: sensitive data should be processed outside the enterprise only with the assurance that it is accessible to, and propagated among, privileged users only.

• Data segregation: the user's data should be fully segregated from the data of other users.

• Regulatory compliance: a cloud provider should have external audits and security certifications, and its infrastructure should comply with regulatory security requirements.

• Data location: the cloud provider should commit to storing and processing data in specific jurisdictions and to obeying local privacy requirements on behalf of the customer.

• Recovery: the provider should offer an efficient replication and recovery mechanism that fully exploits the potential of a cloud in the event of a disaster.

• Investigative support: support for forensics and investigation should be ensured with a contractual commitment.

• Long-term viability: a user's data should remain accessible even when the provider is acquired by another company or the user moves to another provider.

PROPOSED CLOUD TOPOLOGY

Figure 3 shows a sample cloud topology for privacy preservation over incremental data sets.

Figure 3 Sample Cloud Topology

Generally, a cloud framework consists of main data centers that are connected with each other; each main data center has a number of sub-data centers, and the sub-data centers are interconnected with one another. A sub-data center may have a further arrangement of sub-data centers or may connect directly to the users. In Figure 3, a main data center is indicated as D, a sub-data center is represented as SD, and SP denotes the service provider, who gathers an enormous volume of data and stores these privacy-sensitive data sets on the cloud in order to use the cloud facilities to process this massive data.

A health service provider, for example, gives health services to clients in the cloud through a cloud health service application. Using this cloud health service, clients can manage their own health as well as their family's health: a client in the cloud can obtain supportive information such as symptom analysis, medical diagnosis and health plans by uploading personal health records into the service. Many hospitals also upload their patients' health data into the service, so the health service provider can gather a massive amount of data and store these privacy-sensitive data sets on the cloud to use the cloud facilities. Establishments such as hospitals, pharmacies and governments also analyze or share these data, and the storage and computation of these data take place in the cloud framework.
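As a rough illustration, the hierarchy of Figure 3 can be modeled as a nested mapping from main data centers to sub-data centers to users; the names D1, SD1a and the user assignments below are invented, and a real deployment would also record the inter-data-center links.

```python
# Toy model of the Figure 3 topology: main data centers (D) own
# sub-data centers (SD), which serve users.
topology = {
    "D1": {"SD1a": ["user1", "user2"], "SD1b": ["user3"]},
    "D2": {"SD2a": ["user4"], "SD2b": ["user5", "user6"]},
}

def locate(user):
    # Walk the hierarchy to find which (main, sub) pair serves a user.
    for dc, subs in topology.items():
        for sd, users in subs.items():
            if user in users:
                return dc, sd
    return None

print(locate("user5"))  # ('D2', 'SD2b')
```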

PROPOSED INCREMENTAL CLUSTERING OVER INCREMENTAL CLOUD DATA

This section explains the proposed incremental clustering technique for privacy preservation over incremental cloud data. A sample block diagram of the procedure of the proposed framework is shown in Figure 4.

Figure 4 Sample Block Diagram of the proposed technique

The process in Figure 4 is as follows. Initially, a set of records is given and a set of privacy-related features, called quasi-identifiers, is chosen. The records are then grouped using k-means clustering. The k-anonymity constraint is then checked, and the clusters are rearranged on the basis of the k-anonymity constraint. Thereafter, the information loss of each cluster is checked after this rearrangement. When a new record arrives, it is checked against each cluster to identify which cluster satisfies the k-anonymity constraint for the new record, and the record is then grouped with the cluster that satisfies the constraint.

Incremental clustering is thus a technique that adds a new record to the corresponding cluster after the initial set of records has been clustered. The quasi-identifiers are set for the new record, and they must be the same attributes chosen before clustering the initial set of records. After the quasi-identifiers are chosen, the data in the quasi-identifier attributes are anonymized. The k-anonymity constraint, after anonymizing the quasi-identifiers of the new record, is checked against each cluster, and if the constraint matches any cluster, the new record is merged with that cluster. Figure 5 shows the process of adding a new record with respect to the k-anonymity constraint and illustrates the algorithm of the proposed technique.

Figure 5 Process of adding a new record
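The incremental step can be sketched as follows, under simplifying assumptions: the quasi-identifiers are numeric, clusters are represented by centroids, and every existing cluster already holds at least k records, so merging the new record into the nearest cluster preserves the k-anonymity constraint. The anonymization of the quasi-identifier values and the per-cluster information-loss check described above are omitted here.

```python
import math

K = 2  # k-anonymity parameter; every cluster already has >= K records

clusters = [
    {"centroid": [25.0, 60.0], "records": [[24, 58], [26, 62]]},
    {"centroid": [50.0, 80.0], "records": [[49, 79], [51, 81]]},
]

def add_record(record):
    # Merge the new record into the nearest cluster and update its
    # centroid; cluster sizes only grow, so k-anonymity is preserved.
    best = min(clusters, key=lambda c: math.dist(record, c["centroid"]))
    best["records"].append(record)
    n = len(best["records"])
    best["centroid"] = [sum(r[i] for r in best["records"]) / n
                        for i in range(len(record))]
    return best

add_record([27, 61])
print([len(c["records"]) for c in clusters])  # [3, 2]
```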

BIG DATA PROCESSING MODELS: MAPREDUCE ON CLOUDS

Data processing is another important aspect to consider in the context of big data, since understanding its characteristics points toward the best possible solutions. With the diversification of computing scenarios, different processing approaches are being ported to the cloud or to large infrastructures in order to address application challenges. The most popular of them is MapReduce processing, together with the frameworks that implement this paradigm. The main drawback of the coding approach of Hadoop MapReduce is that developers need to write many lines of basic Java code, requiring extra effort and time for code review. To simplify this, Apache offers alternatives such as the Pig Latin and Hive SQL languages, which help in constructing MapReduce programs easily. On the other hand, the advantage is that MapReduce gives more control for writing complex business logic compared with Pig and Hive.
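The paradigm itself is small enough to sketch without any framework. The following Python fragment counts words with an explicit map phase, a simulated shuffle, and a reduce phase; in Hadoop the same logic would be a Java Mapper and Reducer class (or a few lines of Pig Latin or Hive SQL), and the input documents here are invented.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: aggregate all values that share the same key.
    return word, sum(counts)

documents = ["Big data on clouds", "big data analytics"]  # toy input splits

shuffle = defaultdict(list)  # the shuffle groups intermediate pairs by key
for doc in documents:
    for word, one in map_phase(doc):
        shuffle[word].append(one)

results = dict(reduce_phase(w, c) for w, c in shuffle.items())
print(results)  # {'big': 2, 'data': 2, 'on': 1, 'clouds': 1, 'analytics': 1}
```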

CONCLUSION

This work proposes a flexible, scalable, dynamic and cost-effective privacy-preserving framework based on MapReduce on the cloud, called PK-Anonymity. The privacy-preserving framework can anonymize large-scale data sets and manage the anonymized data sets in a highly flexible, scalable, efficient and practical fashion. It provides a flexible privacy framework for conventional big data, but not for streaming data; with the help of the Storm tool, streaming data can be handled effectively, and the unique identifiers are likewise anonymized with Storm. Several data processing systems are integrated to perform the anonymization, and PK-Anonymization can be used to anonymize different big data sets in an effective way. Sometimes the loss of information is minimal yet the level of anonymization is not adequate, so that the privacy of the big data would be lost; to perform good anonymization, maintaining the level of anonymization is therefore vital.

REFERENCES:

[1]. Murugeshwari, B., Kumar, C.J. & Sarukesi, K. 2013, 'Preservation of the Privacy for Multiple Custodian Systems with Rule Sharing', Journal of Computer Science, vol. 9, no. 9, p. 1086.

[2]. Priyadarsini, R.P., Valarmathi, M.L. & Sivakumari, S. 2011, 'Gain Ratio Based Feature Selection Method for Privacy Preservation', ICTACT Journal on Soft Computing, vol. 01, no. 04, pp. 201-205.

[3]. Bhuyan, H.K., Mohanty, M. & Das, S.R. 2012, 'Privacy Preserving for Feature Selection in Data Mining Using Centralized Network', International Journal of Computer Science, vol. 9, pp. 67-85.

[4]. Ousseynou Sane, Fodé Camara, Samba Ndiaye & Yahya Slimani 2012, 'An Approach to Overcome Inference Channels on k-anonymous Data'.

[5]. Kenig, B. & Tassa, T. 2012, 'A practical approximation algorithm for optimal k-anonymity', Data Mining and Knowledge Discovery, vol. 25, no. 1, pp. 134-168.

[6]. Navarro-Arribas, G., Torra, V., Erola, A. & Castellà-Roca, J. 2012, 'User k-anonymity for privacy preserving data mining of query logs', Information Processing & Management, vol. 48, no. 3, pp. 476-487.

[7]. Matthew Hall 2012, 'Cumulative Multi-Niching Genetic Algorithm for Multimodal Function Optimization', (IJARAI) International Journal of Advanced Research in Artificial Intelligence, vol. 1, no. 9.

[8]. Richa Garg & Saurabh Mittal 2014, 'Optimization by Genetic Algorithm', International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, issue 4, April 2014.

[9]. Basheer M. Al-Maqaleh & Hamid Shahbazkia 2012, 'A Genetic Algorithm for Discovering Classification Rules in Data Mining', International Journal of Computer Applications, vol. 41, no. 18.

[10]. Yang Xu, Mingming Zeng, Quanhui Liu & Xiaofeng Wang 2014, 'A Genetic Algorithm Based Multilevel Association Rules Mining for Big Datasets', Hindawi Publishing Corporation, Mathematical Problems in Engineering, vol. 2014.

[11]. Priyanka Sharma & Saroj 2015, 'Discovery of Classification Rules Using Distributed Genetic Algorithm', Elsevier, Proceedings of the International Conference on Information and Communication, vol. 46, pp. 276-284.