Airo International Research Journal July, 2016
Volume VIII, ISSN: 2320-3714
Impact Factor 0.75 to 3.19
ANALYSIS OF BIG DATA MAP-REDUCE MODEL ON CLOUDS
Zeeshan Ahmed
Research Scholar, Kalinga University
Supervisor: Dr. Rupak Sharma, Asst. Prof.
ABSTRACT
Map Reduce plays a critical role as a leading framework for big data analytics. In this paper, we
consider a cloud architecture that provides Map Reduce services based on the big data collected from
end users all over the world. Existing work handles Map Reduce jobs with a traditional computation-centric approach in which all input data distributed across multiple clouds are aggregated to a virtual cluster that resides in a single cloud. Its poor efficiency and high cost for big data support motivate us to
propose a novel data-centric architecture with three key techniques, namely, cross-cloud virtual
cluster, data-centric job placement, and network coding based traffic routing. Our design leads to an
optimization framework with the objective of minimizing both computation and transmission cost for
running a set of Map Reduce jobs in geo-distributed clouds. We further design a parallel algorithm by
decomposing the original large-scale problem into several distributively solvable subproblems that are
coordinated by a high-level master problem. Finally, we conduct real-world experiments and
extensive simulations to show that our proposal significantly outperforms the existing works.
KEYWORDS: Big data, Map Reduce, cloud, optimization.
INTRODUCTION
Cloud computing offers gigantic computation power and storage capacity. Its roots go back to the 1950s and the use of mainframe computers. Around the globe, rapid advances in technology are fueling innovation and propelling economic growth. By 2020, more than one third of all data is expected to live in or pass through the cloud, and data production in 2020 is projected to be many times greater than it was in 2009; individuals create about 70% of all data, while enterprises store about 80% of it.

The "cloud" is a shared resource that is extremely powerful, not only because it is shared by a large number of users, but also because it can be accessed dynamically according to demand. It is called a "cloud" because of its dynamically changing scale, abstract boundaries and uncertain location, much like a real cloud in nature; nevertheless, it exists in the real world. The cloud is not a fixed set of hardware, software or services; it is the combination and integration of big data technologies, and its scope keeps growing as newly emerging technologies continue to join it. The National Institute of Standards and Technology (NIST) of the U.S. Department of Commerce defines cloud computing as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models."

Cloud computing is on-demand computing: shared resources and data are provided to computers and other devices on demand. It has now become a highly valued utility because of its high computing power, cheap services, high performance, scalability, accessibility and availability. Cloud vendors are experiencing growth rates of around 50% per annum. Broadly, cloud computing is the combination of traditional computing and networking technologies such as distributed computing, parallel computing, utility computing, network storage, virtualization, load balancing and high availability. For example, distributed computing splits a large computation into small portions, assigns them to multiple computers to calculate, and then gathers all the results and assembles them together. Meanwhile, parallel computing aggregates a large number of computational resources to process a single task, which is a very efficient answer to parallel problems.
Cloud computing comes into focus only when you consider what IT always needs: a way to increase capacity or add capabilities on the fly without investing in new infrastructure, training new personnel, or licensing new software. To provide highly concentrated physical resources to remote clients on demand, cloud computing acts as a technique for data processing, storage and delivery. The benefits of cloud computing include on-demand self-service, ubiquitous network access, location-independent resource pooling, rapid resource elasticity, usage-based pricing, transference of risk, and so on. Users can employ these services to process their business jobs in a pay-as-you-go fashion while saving a huge capital investment in their own IT infrastructure. Cloud computing encompasses any subscription-based or pay-per-use service that, in real time over the Internet, extends IT's existing capabilities. The idea is not new, but this form of cloud computing is getting new life from Amazon.com, Sun, IBM, and others who now offer storage and virtual servers that IT can access on demand. Early enterprise adopters mainly use utility computing for supplemental, non-mission-critical needs, but one day they may replace parts of the data center.
REVIEW OF LITERATURE
Murugeshwari et al (2013) aimed to mine the data set available with each custodian in a semi-honest model, securely and without data disclosure among the custodians involved; no custodian discloses information. In the new scheme, data partitioning was done horizontally to reduce computational complexity. The proposed research includes a well-planned architecture implementation to achieve privacy preservation in data mining, using a hybrid data mining model that combines the commutative Rivest-Shamir-Adleman (RSA) algorithm with the C5.0 algorithm to generate classification rules. The study used real-world data from the UCI repository, and experiments were conducted on parameters such as time complexity, accuracy and error rate. The new model preserved the expected privacy without information loss, required less computation time, reduced the error rate and improved accuracy.
Bhuyan et al (2012) preserved data privacy by using a perturbation technique based on alias names. In centralized data evaluation, the new work performs data classification and feature selection for a data mining decision model in order to build a structural information model. A gain-ratio technique is applied for better feature selection performance in the centralized computational task; not all features preserve the privacy of confidential data in the best model. A chi-square test was used for data classification by the centralized data mining model, using its own processing unit, and the alias data model of privacy-preserving data mining (PPDM) was developed to build the best model without violating individuals' privacy. The proposed data miner performed the best feature selection, and two types of experimental tests were conducted in this study.
Ousseynou et al (2012) proposed extended k-anonymity definitions and used them to prove that a given data mining model does not violate the k-anonymity of individuals represented in the learning examples. The new extension provides a tool to measure the anonymity retained during data mining, and the authors showed that the model is applicable to various data mining problems such as association rule mining, classification and clustering. They described two data mining algorithms that exploit the extension to guarantee k-anonymous output, providing experimental results for one of them. Finally, they showed that the proposed method contributes new and efficient ways to anonymize data and preserve patterns during anonymization. The k-anonymity protection model was originally proposed to protect the identities of subjects in disclosed databases. However, it may be possible to infer private data directly from a k-anonymous data set; this is called attribute linkage. K-anonymity also suffers another attack based on data mining results: data mining models and patterns can pose privacy threats regardless of whether k-anonymity is satisfied. The authors discussed how privacy requirements characterized by k-anonymity are violated by data mining results and introduced an approach for limiting privacy breaches. They used the adult data set from the UCI Knowledge Discovery in Databases (KDD) archive and demonstrated the effectiveness of the approach.
Kenig et al (2012) presented a practical
approximation algorithm that enables solving
the k-anonymization problem with an
approximation guarantee of O(ln k). That
algorithm improves an algorithm due to
Aggarwal et al that offers an approximation
guarantee of O(k), and generalizes that of Park
and Shim that was limited to the case of
generalization by suppression. The proposed algorithm used techniques introduced therein for mining closed frequent generalized records. The experiments showed that the significance of the proposed algorithm is not limited to the theory of k-anonymization: it achieves lower information losses than the leading approximation algorithm, as well as the leading heuristic algorithms. A modified version of the proposed algorithm that issues l-diverse k-anonymizations also achieves lower information losses than the corresponding modified versions of the leading algorithms.
Navarro-Arribas et al (2012) examined the formal protection model named k-anonymity and the accompanying policies for deployment. A release provides k-anonymity protection if the information for every person in the release cannot be distinguished from that of at least k-1 other individuals whose information is also in the release. The work examined re-identification attacks that can be realized on releases adhering to k-anonymity unless the accompanying policies are respected. The k-anonymity protection model is important because it is the basis for real-world systems such as Datafly, µ-Argus and k-Similar, which guarantee privacy protection. Query log anonymization is an important process performed prior to the publication of sensitive data; it ensures the anonymity of users in the logs, a problem already found in released logs from well-known companies. The authors presented the anonymization of query logs using microaggregation; the proposal ensures the k-anonymity of the users in the query log while preserving its utility. They evaluated the proposal on real query logs, showed the privacy and utility achieved, and provided estimations for the use of such data in clustering-based data mining processes.
Mohan et al (2012) observed that it is valuable for organizations to have their data analyzed by external agents, but a program that computes over potentially sensitive data risks leaking information through its output. Differential privacy provides a theoretical framework for processing data while protecting the privacy of individual records in data sets. However, it has seen only limited use due to the loss of output accuracy, the difficulty of making programs differentially private, the lack of mechanisms for describing the privacy budget in the programmer's utilitarian terms, and the challenging requirement that data owners and analysts manually distribute a limited privacy budget between queries. The authors presented the design and evaluation of a system, GUPT, that overcomes these challenges. Unlike existing differentially private systems such as Privacy Integrated Queries (PINQ), it guarantees differential privacy to programs developed without privacy in mind, makes no trust assumptions about the analysis program, and is secure against known classes of side-channel attacks. GUPT uses a data sensitivity model that degrades data privacy over time, enabling efficient allocation of privacy levels for varied user applications while guaranteeing an overall constant privacy level and maximizing each application's utility. GUPT also introduced techniques that improve output accuracy while achieving the same privacy levels. These approaches enable GUPT to execute a wide variety of data analysis programs while providing both utility and privacy.
Matthew Hall (2012) planned a Cumulative Multi-Niching Genetic Algorithm (CMN GA) for multimodal function optimization, described to speed up optimization problems containing computationally expensive multimodal objective functions. By retaining rather than rejecting individuals from the population, the CMN GA makes use of the information from objective function evaluations as it explores the design space. A fitness-related population density control over the design space minimizes needless objective function evaluations. The algorithm's novel arrangement of genetic operations offers fast and robust convergence to multiple local optima. The CMN GA shows better convergence ability and offers an order-of-magnitude reduction in the number of objective function evaluations needed to attain a specified level of convergence.
Richa Garg and Saurabh Mittal (2014) designed optimization by genetic algorithm. The genetic algorithm is a search and optimization technique that creates solutions to optimization problems by means of methods inspired by natural evolution; optimization is an essential problem in engineering and in economics, and the genetic algorithm offers a more optimal solution. They also planned "Genetic Algorithms: Concepts, Design for Optimization of Process Controllers". A genetic algorithm is a search heuristic that imitates the process of natural selection. Genetic algorithms are applied to process controllers for optimization by means of natural operators. The concept and design procedure of the genetic algorithm is described as an optimization tool, and the ability and usability of genetic algorithms are investigated for the development of control applications. Genetic algorithms are used in the direct torque control of an induction motor drive, the speed control of a gas turbine, and the speed control of a DC servo motor for the tuning of control parameters.
Basheer M. Al-Maqaleh and Hamid Shahbazkia (2012) planned a genetic algorithm for discovering classification rules in data mining. Data mining aims to discover knowledge from large volumes of data, and rule mining is taken as the functional mining method for obtaining valuable knowledge from data stored in database systems. A genetic algorithm-based approach for mining classification rules from large databases is presented. The accuracy, coverage and comprehensibility of the rules are emphasized, and the execution time of the genetic algorithm is reduced. The encoding scheme, genetic operators and fitness function of the genetic algorithm are planned. A review of the genetic algorithm as an optimization technique is also presented: various optimization techniques such as ant colony optimization, simulated annealing, the greedy approach and the genetic algorithm are metaheuristic search optimization techniques that mainly aim at global optimization, and several such techniques are studied under the genetic algorithm for optimization.
Yang Xu et al (2014) designed a genetic algorithm based multilevel association rules mining for big datasets. Multilevel association rule mining is a significant domain for discovering interesting relations between data elements with multiple levels of concepts. Many algorithms depend on exhaustive search methods such as Apriori and FP-growth; however, when used in big data applications, these methods suffer extreme computational cost in searching for association rules. To speed up the multilevel association rule search and avoid the excessive computation, a genetic-based method with three key innovations is designed. A category tree is designed to describe the multilevel application data sets as domain knowledge; a special tree encoding scheme based on the category tree is used to construct the heuristic multilevel association mining algorithm; and the genetic algorithm built on the tree encoding scheme minimizes the association rule search space. The method is helpful for mining multilevel association rules in big data related applications.
Priyanka Sharma and Saroj (2015) designed the discovery of classification rules using a distributed genetic algorithm. Local selection and reproduction techniques are employed to evolve the rules within demes, and diversity is improved by migrating rules among the selected demes. A subsumption operator is used to reduce the complexity of the discovered rule set. The performance of the distributed genetic algorithm for discovering classification rules is compared with a traditional crowding GA on UCI and KEEL repository data sets. They also planned mining for optimized data using clustering along with fuzzy association rules and genetic algorithms. Data mining, also called knowledge discovery in databases, is recognized as an innovative area of database research. The designed technique enhances the data with clustering and fuzzy association rules by means of multi-objective genetic algorithms. The algorithm is used in two phases: in the first phase, it prepares the data with clustering to minimize the number of evaluations; in the second phase, multi-objective genetic algorithms are employed to locate the most favorable number of fuzzy association rules by means of a threshold value and a fitness function.
CLOUD ARCHITECTURE
In cloud computing, the components shown in Figure 1 are loosely coupled. The architecture is broadly divided into two parts, as follows:
i. Front end: the client part. The best example of this is the web browser.
ii. Back end: the cloud itself.
The success of cloud computing depends upon how its services are accessed and executed; this is an important challenge for the coming decades. In PaaS and SaaS, practical problems such as license management need to be settled, and current research is also addressing questions of interoperability and federation of cloud platforms. Cloud computing has been envisioned as the next-generation architecture of the IT enterprise. It moves data to large data centers, where the management of the data and services may not be fully trustworthy.
Figure 1 Cloud Architecture
It supports secure and efficient dynamic operations on tasks and data blocks, including data update, delete and append. The proposed scheme is highly efficient and scalable.
ESSENTIAL CHARACTERISTICS OF
CLOUD COMPUTING
Cloud services show five essential characteristics that demonstrate their relation to, and differences from, traditional computing approaches:
• On-demand self-service: A consumer can unilaterally provision computing capabilities as required, automatically, without human interaction with the service provider.
• Broad network access: Computing capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs) as well as by other traditional or cloud-based software services.
• Resource pooling: A provider pools computing resources to serve several consumers using a multi-tenant model, which dynamically assigns and reassigns physical and virtual resources according to consumer demand. There is a degree of location independence in that the customer generally has no control over or knowledge of the exact location of the provided resources.
• Rapid elasticity: Capabilities can be rapidly and elastically provisioned, in many cases automatically, and rapidly released, to quickly scale out and scale in. To the consumer, the capabilities appear to be boundless and can be purchased in any quantity at any time.
• Measured service: Cloud systems automatically control and optimize resource usage by leveraging a metering capability appropriate to the kind of service. Usage can be monitored, controlled, and reported, providing transparency for both the provider and the consumer.
CLOUD SERVICE MODELS
In general, clouds offer services, as shown in Figure 2, at three distinct levels: IaaS, PaaS, and SaaS. However, some providers can expose services at multiple levels.
• Software as a Service (SaaS) delivers software that is remotely accessible by consumers through the Internet with a usage-based pricing model. For example, Live Mesh from Microsoft allows documents and folders to be shared and synchronized across multiple devices. SaaS delivers even large applications in a remote and seamless way. For the service provider, it simplifies software installation and maintenance and centralizes control over versioning; for the end user, it enables access anywhere and anytime, easier sharing of data and collaboration, and safe storage of data in the infrastructure.
• Platform as a Service (PaaS) offers a high-level integrated environment to build, test, and deploy custom applications, as in Google's App Engine. Within this layer resides the middleware framework, a portable component for both grid and cloud systems. Examples include WSO2 Stratos, Windows Azure, and our middleware HIMAN. It gives programmers sets of software elements with which to assemble large-scale applications.
• Infrastructure as a Service (IaaS) provisions hardware, software, and equipment to deliver software application environments with a resource usage-based pricing model. The infrastructure can scale up and down dynamically based on application resource needs. Typical examples are the Amazon EC2 (Elastic Compute Cloud) service, Eucalyptus, and Microsoft Private Cloud. It allows access to large-scale resources on which any stack can be installed.
Figure 2 Cloud service models
CLOUD DEPLOYMENT MODELS
There are four deployment models for cloud
services, with derivative variations that
address specific requirements:
i. Public Cloud. The cloud is made available to the general public or a large industry group and is owned by an organization selling cloud services.
ii. Private Cloud. The cloud is operated solely for a single organization. It may be managed by the organization or by a third party, and may exist on-premises or off-premises.
iii. Community Cloud. The cloud is shared by several organizations to support a specific community that has shared concerns. It may be managed by the organizations or by a third party and may exist on-premises or off-premises.
iv. Hybrid Cloud. The cloud infrastructure consists of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability.
CLOUD SECURITY USING DATA
ANONYMIZATION
Although a 100-percent secure cloud infrastructure is recognized as impossible, this work explores the possibility of anonymizing data to augment the cloud security infrastructure. Data anonymization makes data worthless to others while still allowing IT to process it in a useful way. Several formal models of privacy can help improve data anonymization, including k-anonymity and l-diversity. The success of cloud computing depends upon how the service and deployment models are accessed and executed; this is an important challenge in the next decade. In PaaS and SaaS, practical problems such as license management need to be resolved, and current research also addresses questions of interoperability and federation of clouds. The scope of hosted applications will grow larger, to include applications such as online games and video processing, which will raise new research problems such as quality of service and management.
K-anonymity attempts to make each record indistinguishable from a defined number (k) of other records. For example, consider a data set that contains two attributes: gender and birthday. The data set is k-anonymized if, for any record, at least k-1 other records have the same gender and birthday. In general, the higher the value of k, the more privacy is achieved.
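To make the definition concrete, the following minimal Python sketch (the record layout, function name and toy values are illustrative assumptions, not from this paper) checks whether a data set is k-anonymous with respect to a chosen set of quasi-identifiers:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination in `records`
    occurs at least k times (each record hides among k-1 others)."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Toy data set with the two attributes from the example above.
data = [
    {"gender": "F", "birthday": "1990-01-01", "disease": "flu"},
    {"gender": "F", "birthday": "1990-01-01", "disease": "cold"},
    {"gender": "M", "birthday": "1985-05-05", "disease": "flu"},
    {"gender": "M", "birthday": "1985-05-05", "disease": "asthma"},
]
print(is_k_anonymous(data, ["gender", "birthday"], k=2))  # True
```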
L-diversity improves anonymization beyond what k-anonymity provides. The difference between the two is that while k-anonymity requires each combination of quasi-identifiers to have k entries, l-diversity requires that there be l different sensitive values for each combination of quasi-identifiers. Other data anonymization techniques include adding fictitious records to the data, hashing, truncation, permutation, and value shifting, to name a few.
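A matching sketch for l-diversity, under the same assumed record layout as above, groups records by their quasi-identifier combination and counts the distinct sensitive values in each group:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if each quasi-identifier group contains at least
    l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

# The same toy records as in the k-anonymity sketch above.
data = [
    {"gender": "F", "birthday": "1990-01-01", "disease": "flu"},
    {"gender": "F", "birthday": "1990-01-01", "disease": "cold"},
    {"gender": "M", "birthday": "1985-05-05", "disease": "flu"},
    {"gender": "M", "birthday": "1985-05-05", "disease": "asthma"},
]
print(is_l_diverse(data, ["gender", "birthday"], "disease", l=2))  # True
```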
Cloud computing may adopt the same controls as any other IT environment. However, the cloud service models, the operational models, and the supporting technologies change the risk landscape for an organization with respect to traditional IT. There are several considerations and possible risks a user should assess before committing:
• Data anonymization: removal of personally identifiable information (PII) from a data set.
• Information loss: the loss of detail caused by anonymization.
• Privacy preservation: preserving the privacy of the data and its owners at the chosen level of anonymization.
• Privileged user access: sensitive data should be processed outside the enterprise only with the assurance that it is accessible to, and propagated among, privileged users only.
• Data segregation: user data should be fully segregated from the data of other users.
• Regulatory compliance: a cloud provider should have external audits and security certifications, and its infrastructure should comply with regulatory security requirements.
• Data location: the cloud provider should commit to storing and processing data in specific jurisdictions and to obeying local privacy requirements on behalf of the customer.
• Recovery: the provider should offer an efficient replication and recovery mechanism so as to fully exploit the potential of a cloud in the event of a disaster.
• Investigative support: support for forensics and investigation should be ensured with a contractual commitment.
• Long-term viability: user data should remain accessible even if the provider is acquired by another company or the user moves to another provider.
PROPOSED CLOUD TOPOLOGY
Figure 3 shows a sample cloud topology for privacy preservation over incremental data sets.
Figure 3 Sample Cloud Topology
Generally, a cloud framework consists of main data centers that are connected with one another; each main data center has a number of sub-data centers, and the sub-data centers are interconnected with each other. A sub-data center may have another layer of sub-data centers beneath it, or it may be directly connected to the users. In Figure 3, a main data center is denoted by D and the sub-data centers are represented by SD. The symbol SP in Figure 3 is the service provider, who gathers an enormous volume of data and stores these privacy-sensitive data sets on the cloud in order to use the cloud facilities to process this massive data.
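As a rough illustration, the hierarchy in Figure 3 could be modeled as follows; the labels D1, D2, SD1-SD4 and the placement_options helper are hypothetical, introduced only to make the D/SD/SP roles concrete:

```python
# Hypothetical model of Figure 3: D = main data center,
# SD = sub-data center; an SP (service provider) attaches to one D.
topology = {"D1": ["SD1", "SD2"], "D2": ["SD3", "SD4"]}
peers = {"D1": ["D2"], "D2": ["D1"]}

def placement_options(main_dc):
    """Sub-data centers where an SP attached to `main_dc` could place
    a data set: its own sub-data centers plus those reachable through
    peered main data centers."""
    options = list(topology[main_dc])
    for p in peers.get(main_dc, []):
        options.extend(topology[p])
    return options

print(placement_options("D1"))  # ['SD1', 'SD2', 'SD3', 'SD4']
```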
For example, a health service provider gives health services to clients in the cloud through a cloud health service application. Using this cloud health service, clients can manage their own health as well as their family's health. A client in the cloud can get supportive information such as symptom analysis, medical diagnosis and health plans by uploading personal health records into this service. Many hospitals also upload their patients' health data into the service, so the health service provider can gather a massive amount of data and store these privacy-sensitive data sets on the cloud to use the cloud facilities. Establishments such as hospitals, pharmacies and governments also analyze or share these data. The storage and computation of these data take place in the cloud framework.
PROPOSED INCREMENTAL
CLUSTERING OVER INCREMENTAL
CLOUD DATA
This section explains the proposed incremental clustering technique for privacy preservation over incremental cloud data. A sample block diagram of the procedure of the proposed framework is shown in Figure 4.
Figure 4 Sample Block Diagram of the proposed technique
Figure 4 is explained as follows. Initially, a set of records is given, and a set of privacy-related attributes, called quasi-identifiers, is chosen. The records are then grouped using k-means clustering. The k-anonymity constraint is then checked, and the clusters are rearranged based on the k-anonymity constraint. Thereafter, the information loss is computed for each cluster after the rearrangement. When a new record arrives, it is checked against each cluster to identify which cluster matches the k-anonymity constraint for the new record. The new record is then merged into the cluster that satisfies the k-anonymity constraint.
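One possible rendering of this pipeline is sketched below. It is a simplified illustration rather than the paper's implementation: quasi-identifiers are assumed to be numeric vectors, the k-anonymity constraint is enforced by dissolving undersized clusters, information loss is approximated by within-cluster scatter, and all function names are hypothetical.

```python
import random

def centroid(cluster):
    """Mean vector of a cluster of numeric quasi-identifier records."""
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, n_clusters, iters=20):
    """Plain k-means; returns the non-empty clusters."""
    centers = random.sample(points, n_clusters)
    for _ in range(iters):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            clusters[min(range(n_clusters),
                         key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def enforce_k_anonymity(clusters, k):
    """Dissolve clusters smaller than k and move their records to the
    nearest surviving cluster, so every cluster holds >= k records."""
    small = [p for c in clusters if len(c) < k for p in c]
    kept = [list(c) for c in clusters if len(c) >= k]
    if not kept:                      # degenerate case: one big cluster
        return [small]
    for p in small:
        kept[min(range(len(kept)),
                 key=lambda i: dist2(p, centroid(kept[i])))].append(p)
    return kept

def information_loss(cluster):
    """Simple loss proxy: total squared distance to the centroid
    (records generalized into one group lose this much detail)."""
    c = centroid(cluster)
    return sum(dist2(p, c) for p in cluster)

random.seed(1)
records = [[random.uniform(0, 100), random.uniform(0, 100)] for _ in range(30)]
clusters = enforce_k_anonymity(kmeans(records, n_clusters=5), k=4)
for i, c in enumerate(clusters):
    print(f"cluster {i}: size={len(c)}, loss={information_loss(c):.1f}")
```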
Incremental clustering is a technique that adds a new record to a corresponding cluster after the initial set of records has been clustered. The quasi-identifiers set for the new record should be the same attributes chosen before clustering the initial set of records. After choosing the quasi-identifiers, the data in the quasi-identifier attributes should be anonymized. After anonymizing the quasi-identifiers of the new record, the k-anonymity constraint is checked against each cluster; if the new record matches the k-anonymity constraint of a cluster, it is merged into that cluster. Figure 5 shows the process of adding a new record with respect to the k-anonymity constraint and illustrates the algorithm of the proposed technique.
Figure 5 Process of adding a new record
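Continuing the sketch above (and reusing its centroid, dist2 and information_loss helpers), the incremental step of Figure 5 might look like the following; the nearest-centroid matching rule and the max_loss_increase bound are assumptions added for illustration:

```python
def add_record(clusters, record, max_loss_increase=None):
    """Incremental step sketched in Figure 5: place an (already
    anonymized) new record into the cluster whose centroid is nearest,
    optionally rejecting the merge if it would raise that cluster's
    information loss by more than `max_loss_increase`."""
    best = min(range(len(clusters)),
               key=lambda i: dist2(record, centroid(clusters[i])))
    before = information_loss(clusters[best])
    clusters[best].append(record)
    if (max_loss_increase is not None and
            information_loss(clusters[best]) - before > max_loss_increase):
        clusters[best].pop()          # reject: hold the record in a
        clusters.append([record])     # pending group until k-1 more arrive
    return clusters

clusters = add_record(clusters, [55.0, 42.0])
print([len(c) for c in clusters])
```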
BIG DATA PROCESSING MODELS:
MAPREDUCE ON CLOUDS
Data processing is another important aspect that should be considered in the context of Big Data, as understanding its characteristics points toward the best possible solutions. With the diversification of computing scenarios, distinct processing approaches are being ported to the cloud or to large infrastructures in order to address application challenges. The most popular of them is MapReduce processing, along with the frameworks that implement this paradigm. The main drawback of the coding approach of Hadoop MapReduce is that Hadoop developers need to write several lines of basic Java code, requiring extra effort and time for code review. Therefore, to simplify this, Apache offers alternatives such as the Pig Latin and Hive SQL languages, which assist in constructing MapReduce programs easily. On the other hand, the advantage of MapReduce is that it gives more control for writing complex business logic when compared to Pig and Hive.
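To make the paradigm concrete, the sketch below simulates the map-shuffle-reduce data flow in-process in Python; a real Hadoop job would express the same mapper and reducer in Java (or via Pig/Hive as noted above), and all names here are illustrative:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    """Reduce: sum all counts emitted for one key."""
    yield key, sum(values)

def mapreduce(records, mapper, reducer):
    """In-process simulation of the MapReduce data flow:
    map -> shuffle (group values by key) -> reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            shuffled[k].append(v)
    results = {}
    for k in sorted(shuffled):
        for out_key, out_val in reducer(k, shuffled[k]):
            results[out_key] = out_val
    return results

lines = ["big data on clouds", "map reduce on clouds"]
print(mapreduce(lines, map_phase, reduce_phase))
# {'big': 1, 'clouds': 2, 'data': 1, 'map': 1, 'on': 2, 'reduce': 1}
```

Because each map call and each reduce call is independent of the others, a framework such as Hadoop can run them in parallel across a cluster, which is what makes the model attractive for big data on clouds.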
CONCLUSION
This work proposes a flexible, scalable, dynamic and cost-effective privacy-preserving framework based on MapReduce on the cloud, called PK-Anonymity. The privacy-preserving framework can anonymize large-scale data sets and manage the anonymized data sets in a highly flexible, scalable, efficient and cost-effective fashion. It provides a flexible privacy framework for conventional big data, but not for streaming data; with the assistance of the STORM tool, streaming data can be updated effectively, and the unique id is anonymized with STORM. Several data processing systems are integrated to perform the anonymization, and PK-Anonymization can be used to anonymize various big data sets in an effective way. Sometimes the loss of information is minimal but the level of anonymization is not adequate, so the privacy of the big data may be lost; to perform good anonymization, maintaining the level of anonymization is vital. In summary, this work proposes a flexible, adaptable, dynamic and practical privacy-protecting framework based on MapReduce on the cloud.
REFERENCES:
[1]. Murugeshwari, B, Kumar, CJ & Sarukesi, K 2013, 'Preservation of the Privacy for Multiple Custodian Systems with Rule Sharing', Journal of Computer Science, vol. 9, no. 9, p. 1086.
[2]. Priyadarsini, RP, Valarmathi, ML & Sivakumari, S 2011, 'Gain Ratio Based Feature Selection Method for Privacy Preservation', ICTACT Journal on Soft Computing, vol. 01, no. 04, pp. 201-205.
[3]. Bhuyan, HK, Mohanty, M & Das, SR 2012, 'Privacy Preserving for Feature Selection in Data Mining Using Centralized Network', International Journal of Computer Science, vol. 9, pp. 67-85.
[4]. Ousseynou Sane, Fodé Camara, Samba Ndiaye & Yahya Slimani 2012, 'An Approach to Overcome Inference Channels on k-anonymous Data'.
[5]. Kenig, B & Tassa, T 2012, 'A practical approximation algorithm for optimal k-anonymity', Data Mining and Knowledge Discovery, vol. 25, no. 1, pp. 134-168.
[6]. Navarro-Arribas, G, Torra, V, Erola, A & Castellà-Roca, J 2012, 'User k-anonymity for privacy preserving data mining of query logs', Information Processing & Management, vol. 48, no. 3, pp. 476-487.
[7]. Matthew Hall, 'Cumulative Multi-Niching Genetic Algorithm for Multimodal Function Optimization', (IJARAI) International Journal of Advanced Research in Artificial Intelligence, vol. 1, no. 9, 2012.
[8]. Richa Garg and Saurabh Mittal, 'Optimization by Genetic Algorithm', International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, issue 4, April 2014.
[9]. Basheer M. Al-Maqaleh and Hamid Shahbazkia, 'A Genetic Algorithm for Discovering Classification Rules in Data Mining', International Journal of Computer Applications, vol. 41, no. 18, 2012.
[10]. Yang Xu, Mingming Zeng, Quanhui Liu and Xiaofeng Wang, 'A Genetic Algorithm Based Multilevel Association Rules Mining for Big Datasets', Hindawi Publishing Corporation, Mathematical Problems in Engineering, vol. 2014.
[11]. Priyanka Sharma and Saroj, 'Discovery of Classification Rules Using Distributed Genetic Algorithm', Elsevier, Proceedings of the International Conference on Information and Communication, vol. 46, 2015, pp. 276-284.