big data issues challenges tools n good practices

Avita Katal, Mohammad Wazid, RH Goudar Dept of CSE, Graphics Era University Dehradun, India Published in : CONTEMPORARY COMPUTING (IC3), 2013 SIXTH INTERNATIONAL CONFERENCE ON Big Data : Issues, Challenges, Tools and Good Practices 1 Ravi

Upload: ravi-ganghas

Post on 12-Jul-2015


TRANSCRIPT

Page 1: Big data issues challenges tools n good practices

Avita Katal , Mohammad Wazid, RH Goudar

Dept. of CSE, Graphics Era University, Dehradun, India

Published in: Contemporary Computing (IC3), 2013 Sixth International Conference on

Big Data : Issues, Challenges, Tools and Good Practices

1

Ravi

Page 2: Big data issues challenges tools n good practices

Motivation

Data stores are growing by 50% each year, and that rate of increase is accelerating [8]

The type of data is also changing. Over 80% of it will be unstructured data, which does not work well with relational databases [8]

The main difficulty is that data volume is increasing rapidly in comparison to computing resources


Page 3: Big data issues challenges tools n good practices

Defining Big Data

Big data is defined as a large amount of data which requires new technologies and architectures so that it becomes possible to extract value from it through capture and analysis.

It is a recent, emerging technology that can bring huge benefits to business organizations.


Page 4: Big data issues challenges tools n good practices

Properties of Big Data

Big data can be defined by the following properties associated with it.

Variety : Data being produced is not only traditional (structured) but also semi-structured, and comes from various sources.

Volume : Data is expected to grow into the zettabyte range in the near future.

Velocity : The speed at which data arrives from various sources.


Page 5: Big data issues challenges tools n good practices

Properties of Big Data...

Page 6: Big data issues challenges tools n good practices

Properties of Big Data...

Variability : It considers the inconsistencies of data flow.

Complexity : It is difficult to link, match, cleanse, and transform data coming from various sources across systems.

Value : Queries can be run against the stored data to deduce important results.


Page 7: Big data issues challenges tools n good practices

Related Work

Collaborative research on methodologies for big data analysis and design.[1]

Databases required for big data [2]

Architectural considerations for big data [3]

Concept of big data with market solutions [4]

Scientific Data Infrastructure (SDI) generic architectural model [5]


Page 8: Big data issues challenges tools n good practices

Related Work...

How big data analytics is different from traditional analytics [6]

Analysis of social media sites like Facebook, Flickr, and Google+ [7]


Page 9: Big data issues challenges tools n good practices

Importance of Big Data

Log Storage in IT Industries

IT industries store large amounts of data as logs to deal with problems which occur rarely.

Big data analytics is used on this data to pinpoint the points of failure.

Traditional systems are not able to handle these logs.

Sensor Data

The massive amount of sensor data is also a big challenge for Big data.


Page 10: Big data issues challenges tools n good practices

Importance of Big Data...

Risk Analysis

It’s important for financial institutions to model data in order to calculate risk.

A lot of potentially useful data is underutilized because of its volume; it should be integrated to determine risk patterns more accurately.

Social Media

The largest use of Big data is for social media and customer sentiment.

Keeping an eye on what customers are saying is a way of getting feedback.

The customer feedback can then be used to make decisions and add value to the business.


Page 11: Big data issues challenges tools n good practices

Big Data Challenges and Issues

Privacy and Security

This is the most important issue with Big data, with conceptual, technical, and legal significance.

A person's personal information, when combined with large external data sets, allows new private facts about that person to be inferred.

Big data used by law enforcement will increase the chances that certain tagged people suffer adverse consequences.


Page 12: Big data issues challenges tools n good practices

Big Data Challenges and Issues...

Data Access and Sharing of Information

If data is to be used to make accurate decisions in time, it must be available in an accurate, complete, and timely manner.

Storage and Processing Issues

Many companies are struggling to store the large amount of data they are producing.

Outsourcing storage to the cloud may seem like an option, but long upload times and constant updates to the data preclude it.

Processing a large amount of data also takes a lot of time.


Page 13: Big data issues challenges tools n good practices

Big Data Challenges and Issues...

Analytical Challenges

What if data volume gets so large that we don’t know how to deal with it?

Does all data need to be stored?

Does all data need to be analyzed?

Which data points are really important?

How can data be used to best advantage?

Skill Requirement : Being a new and emerging technology, Big data needs to attract organizations and young people with diverse new skill sets.


Page 14: Big data issues challenges tools n good practices

Big Data Challenges and Issues...

Technical Challenges

Fault Tolerance

Scalability

Quality of Data

Heterogeneous Data


Page 15: Big data issues challenges tools n good practices

Tools and Techniques Available

Hadoop is an open source project hosted by the Apache Software Foundation for managing Big data

Hadoop consists of two main components:

The Hadoop Distributed File System (HDFS), a distributed file system that stores the data on multiple separate servers (each of which has its own processor(s))

MapReduce, the framework that understands and assigns work to the nodes in a cluster [9]


Page 16: Big data issues challenges tools n good practices

Advantages of Hadoop

Hadoop provides the following advantages[9]

Data read/write performance is increased by distributing the data across the cluster, allowing each processor to work in parallel

It’s scalable: new nodes can be added as needed without making changes to the existing system

It’s cost effective because it brings parallel computing to commodity servers


Page 17: Big data issues challenges tools n good practices

Advantages of Hadoop…

It’s flexible: it can absorb any type of data, structured or not, from any number of sources

It’s fault tolerant: it handles failures intrinsically by always storing multiple copies of the data and automatically loading a copy when a fault is detected
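The fault-tolerance idea above (keep several copies, read from a surviving copy when a node fails) can be illustrated with a toy in-memory model. This is only a sketch and not the HDFS API: the `ReplicatedStore` class, its method names, the node names, and the replication factor of 3 are all assumptions made for illustration.

```python
import random


class ReplicatedStore:
    """Toy model of block replication (hypothetical, not the HDFS API)."""

    def __init__(self, node_names, replication=3):
        # Each node is modelled as a dict of block_id -> data.
        self.nodes = {name: {} for name in node_names}
        self.alive = set(node_names)
        self.replication = replication

    def put(self, block_id, data):
        """Store the block on `replication` distinct live nodes."""
        targets = random.sample(sorted(self.alive), self.replication)
        for name in targets:
            self.nodes[name][block_id] = data
        return targets

    def fail(self, name):
        """Simulate a node failure."""
        self.alive.discard(name)

    def get(self, block_id):
        """Read the block from any live node that still holds a copy."""
        for name in sorted(self.alive):
            if block_id in self.nodes[name]:
                return self.nodes[name][block_id]
        raise KeyError(f"no live replica of {block_id!r}")


store = ReplicatedStore(["node-a", "node-b", "node-c", "node-d"])
replicas = store.put("block-1", b"Toronto,20")

# Two of the three replicas fail; the read still succeeds from the third.
store.fail(replicas[0])
store.fail(replicas[1])
recovered = store.get("block-1")
print(recovered)
```

A real cluster would also re-replicate the block to restore the target number of copies; the sketch leaves that out to stay short.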


Page 18: Big data issues challenges tools n good practices

Hadoop

How do you use Hadoop?

The developer writes a program that conforms to the MapReduce programming model

The developer specifies the format of the data to be processed in their program


Page 19: Big data issues challenges tools n good practices

Hadoop

How does MapReduce work?[10]

Each Hadoop program performs two tasks:

Map - Breaks all of the data down into key/value pairs

Reduce - Takes the output from the map step as input and combines those data key/value pairs into a smaller set of key/value pairs


Page 20: Big data issues challenges tools n good practices

Map Reduce - Example

MapReduce example [10]: Assume you have five files, and each file contains two columns that represent a city and the corresponding temperature recorded in that city on the various measurement days: Toronto, 20; New York, 22; Rome, 32; Toronto, 4; Rome, 33; New York, 18

We want to find the maximum temperature for each city across all of the data files

We then create five map tasks, where each mapper works on one of the five files; the mapper task goes through the data and returns the maximum temperature for each city. For the file above, this results in: (Toronto, 20) (New York, 22) (Rome, 33)


Page 21: Big data issues challenges tools n good practices

Map Reduce – Example…

Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:

(Toronto, 18) (New York, 32) (Rome, 37)

(Toronto, 32) (New York, 33) (Rome, 38)

(Toronto, 22) (New York, 20) (Rome, 31)

(Toronto, 31) (New York, 19) (Rome, 30)

All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows: (Toronto, 32) (New York, 33) (Rome, 38)
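The maximum-temperature example can be traced in a few lines of plain Python. This is a single-process sketch of the map, group, and reduce steps, not Hadoop code: the function names and the explicit `shuffle` step are assumptions for illustration, while the data values are the ones given on the slides.

```python
from collections import defaultdict

# Records from the one input file shown in the example.
FILE_1 = [
    ("Toronto", 20), ("New York", 22), ("Rome", 32),
    ("Toronto", 4), ("Rome", 33), ("New York", 18),
]

# Intermediate results given for the other four mapper tasks.
OTHER_MAPPER_OUTPUTS = [
    [("Toronto", 18), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("New York", 19), ("Rome", 30)],
]


def map_max_temperature(records):
    """Map task: emit one (city, max temperature) pair per city in one file."""
    local_max = {}
    for city, temp in records:
        if city not in local_max or temp > local_max[city]:
            local_max[city] = temp
    return list(local_max.items())


def shuffle(mapper_outputs):
    """Group all intermediate values by key (city)."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for city, temp in output:
            grouped[city].append(temp)
    return grouped


def reduce_max_temperature(grouped):
    """Reduce task: collapse each city's values to a single maximum."""
    return {city: max(temps) for city, temps in grouped.items()}


mapper_outputs = [map_max_temperature(FILE_1)] + OTHER_MAPPER_OUTPUTS
result = reduce_max_temperature(shuffle(mapper_outputs))
print(result)  # {'Toronto': 32, 'New York': 33, 'Rome': 38}
```

In a real cluster the five map tasks would run in parallel on different nodes and the framework would do the grouping; here everything runs sequentially so the data flow is easy to follow.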


Page 22: Big data issues challenges tools n good practices

Big Data – Good Practices

Creating dimensions of all the data being stored is good practice.

All the dimensions should have durable surrogate keys that can’t be changed and are unique.

Expect to integrate structured and unstructured data

Generality of technology is needed. Building it around key/value pairs works.
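The advice on durable surrogate keys can be sketched as follows. The `DimensionTable` class and its `key_for` method are hypothetical names chosen for illustration; the point is only that each natural key receives a unique integer key once, and that this key never changes even when the descriptive attributes do.

```python
import itertools


class DimensionTable:
    """Toy dimension mapping natural keys to durable surrogate keys."""

    def __init__(self):
        self._next_key = itertools.count(1)   # monotonically increasing ids
        self._by_natural_key = {}             # natural key -> surrogate key
        self.rows = {}                        # surrogate key -> attributes

    def key_for(self, natural_key, **attributes):
        """Return the surrogate key, assigning a new one only on first sight."""
        surrogate = self._by_natural_key.get(natural_key)
        if surrogate is None:
            surrogate = next(self._next_key)
            self._by_natural_key[natural_key] = surrogate
            self.rows[surrogate] = {"natural_key": natural_key}
        # Attributes may change over time; the surrogate key does not.
        self.rows[surrogate].update(attributes)
        return surrogate


cities = DimensionTable()
toronto = cities.key_for("toronto", name="Toronto")
rome = cities.key_for("rome", name="Rome")
toronto_again = cities.key_for("toronto", name="Toronto, ON")
print(toronto, rome, toronto_again)  # 1 2 1
```

Fact records would then store only the surrogate key, so renaming or correcting a dimension row does not require rewriting the facts.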


Page 23: Big data issues challenges tools n good practices

Big Data – Good Practices…

As value of big data becomes more apparent, privacy concerns grow.

Data quality needs to be better.

Limit on scalability of records.

Business and IT leaders should work together to create more value from data.

Investment in data quality and metadata reduces processing time.


Page 24: Big data issues challenges tools n good practices

Conclusions

New concept of big data, its importance and existing projects.

Many challenges and issues exist which need to be brought up.

Big data will help business grow.

Hadoop Tool


Page 25: Big data issues challenges tools n good practices

References

[1] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, William Money, “Big Data: Issues and Challenges Moving Forward”, IEEE, 46th Hawaii International Conference on System Sciences, 2013.

[2] Sam Madden, “From Databases to Big Data”, IEEE, Internet Computing, May-June 2012.

[3] Kapil Bakshi, “Considerations for Big Data: Architecture and Approach”, IEEE, Aerospace Conference, 2012.

[4] Sachchidanand Singh, Nirmala Singh, “Big Data Analytics”, IEEE, International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, 2012.

[5] Yuri Demchenko, Zhiming Zhao, Paola Grosso, Adianto Wibisono, Cees de Laat, “Addressing Big Data Challenges for Scientific Data Infrastructure”, IEEE, 4th International Conference on Cloud Computing Technology and Science, 2012.


Page 26: Big data issues challenges tools n good practices

References...

[6] Martin Courtney, “The Larging-up of Big Data”, IEEE, Engineering & Technology, September 2012.

[7] Matthew Smith, Christian Szongott, Benjamin Henne, Gabriele von Voigt, “Big Data Privacy Issues in Public Social Media”, IEEE, 6th International Conference on Digital Ecosystems Technologies (DEST), 18-20 June 2012.

[8] Why Every Database Must Be Broken Soon https://blogs.vmware.com/vfabric/2013/03/why-every-database-must-be-broken-soon.html

[9] What is Hadoop? http://www-01.ibm.com/software/data/infosphere/hadoop/

[10] What is MapReduce? http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce


Page 27: Big data issues challenges tools n good practices

Thank You.