© 2012 IBM Corporation
IBM Security Systems
© 2013 IBM Corporation
Big Data Analytics Lecture Series
Kalapriya Kannan
IBM Research Labs
July 2013
Small changes/additions done by Dr. Enis Karaarslan, 2014
What is the aim of the course
Focus is on “Systems” and applications for cloud-based storage and processing of BIG DATA.
+Big Data - Definition
+Big Data - Analytics
+Big Data - Storage (HDFS)
+Big Data - Computing (Map/Reduce)
+Big Data - Database (HBase)
+Big Data - Graph DB (Titan)
+Big Data - Streaming (Storm)
Pre-Requisite
“Nothing” – All of you are equally qualified.
A virtual machine, run through either VMware Player or VirtualBox.
Acknowledgements:
– IBM material, examples, machines, etc.
– IBM external talks, publicly available material, and the authors of the same.
– Several Internet materials – Thanks to the “Internet”
– Apache documentation and examples
Mantra
“Learning is not just restricted to listening, it is actively asking relevant questions”
After 6 hrs of lecture
Get convinced about “Big Data”.
Understand why we need a different paradigm.
Ascertain with confidence the need to look at data computing in a different way.
Realize the potential of big data
–All of you are skilled enough to get into it.
What we will not do
–Do research on why things have evolved into the current trends as they stand.
–We will try to be hands-on, but this is not guaranteed.
Today’s 1 hr.
Introduction to Big Data
Kalapriya Kannan
IBM Research Labs
July 2013
What are we going to understand?
What is Big Data?
Why did we end up here?
To whom does it matter?
Where is the money?
Are we ready to handle it?
What are the concerns?
Tools and Technologies
–Is Big Data <=> Hadoop?
Simple to start
What is the maximum file size you have dealt with so far?
– Movies/files/streaming video that you have used?
– What have you observed?
What is the maximum download speed you get?
Simple computation
– How much time does it take just to transfer the data?
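The "simple computation" above can be sketched in a few lines. This is only an illustration: the file size (1 TB) and the link speed (100 Mbit/s) are assumed values that do not come from the slides, and the function name `transfer_seconds` is made up for this sketch. Protocol overhead is ignored, so real transfers would be slower.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


# Hypothetical numbers, chosen only for illustration.
ONE_TERABYTE = 10**12        # bytes
LINK_SPEED = 100 * 10**6     # 100 Mbit/s, expressed in bits per second

seconds = transfer_seconds(ONE_TERABYTE, LINK_SPEED)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
# prints: 80000 s, about 22.2 hours
```

Even before any processing happens, simply moving one terabyte over such a link takes most of a day.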
640 K ought to be enough for everybody
● Google processes 20 PB a day (2008); 1 PB = 10^15 bytes
● Wayback Machine has 3 PB + 100 TB/month (3/2009)
● Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
● eBay has 6.5 PB of user data + 50 TB/day (5/2009)
● CERN’s Large Hadron Collider (LHC) generates 15 PB a year
The Earthscope
The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
What is big data?
“Every day, we create 2.5 quintillion (10^18) bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few.
This data is “big data.”
Huge amount of data
There are huge volumes of data in the world:
+From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data.
+In 2011, the same amount was created every two days.
+In 2013, the same amount of data is created every 10 minutes.
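To get a feel for these figures, the arithmetic can be written out. This sketch takes the slide's numbers at face value (5 exabytes every two days in 2011, every 10 minutes in 2013) and assumes decimal units (1 EB = 10^18 bytes). The variable and function names are made up for this sketch.

```python
EXABYTE = 10**18                 # bytes, decimal convention (an assumption)
BATCH_BYTES = 5 * EXABYTE        # the "5 billion gigabytes" from the slide


def bytes_per_second(batch_bytes: int, period_seconds: int) -> float:
    """Average creation rate if batch_bytes appear once per period."""
    return batch_bytes / period_seconds


rate_2011 = bytes_per_second(BATCH_BYTES, 2 * 24 * 3600)   # every two days
rate_2013 = bytes_per_second(BATCH_BYTES, 10 * 60)         # every 10 minutes

print(f"2011: {rate_2011:.2e} bytes/s")
print(f"2013: {rate_2013:.2e} bytes/s")
print(f"growth factor: {rate_2013 / rate_2011:.0f}x")
```

Two days are 2,880 minutes, so going from "every two days" to "every 10 minutes" means the creation rate grew by a factor of 288 in two years.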
Big data spans three dimensions: Volume, Velocity and Variety
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
– Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
– Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
– Scrutinize 5 million trade events created each day to identify potential fraud
– Analyze 500 million daily call detail records in real time to predict customer churn faster
– The latest I have heard is that even a 10-nanosecond delay is too much.
Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
– Monitor hundreds of live video feeds from surveillance cameras to target points of interest
– Exploit the 80% data growth in images, video and documents to improve customer satisfaction
Finally….
‘Big data’ is similar to ‘small data’, but bigger
...but bigger data requires different approaches:
Techniques, tools, architecture… with an aim to solve new problems
Or old problems in a better way
To whom does it matter
Research Community
Business Community - New tools, new capabilities, new infrastructure, new business models, etc.
On sectors
Financial Services..
What do revenues look like….
The Social Layer in an Instrumented Interconnected World
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
What does Big Data trigger?
From “Big Data and the Web: Algorithms for Data Intensive Scalable Computing”, Ph.D Thesis, Gianmarco
BIG DATA is not just HADOOP
Manage & store huge volume of any data: Hadoop File System, MapReduce
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Structure and control data: Data Warehousing
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
Understand and navigate federated big data sources: Federated Discovery and Navigation
Types of tools typically used in a Big Data scenario
Where is the processing hosted?
–Distributed server/cloud
Where is the data stored?
–Distributed storage (e.g., Amazon S3)
What is the programming model?
–Distributed processing (MapReduce)
How is the data stored and indexed?
–High-performance schema-free database
What operations are performed on the data?
–Analytic/semantic processing (e.g., RDF/OWL)
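The MapReduce programming model named above can be illustrated with the classic word-count example. This is a minimal single-process sketch in plain Python, not Hadoop code. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data is big", "data streams"]))
# prints: {'big': 2, 'data': 2, 'is': 1, 'streams': 1}
```

Each document can be mapped independently and each key reduced independently, so both phases can be distributed across machines.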
When dealing with Big Data is hard
When the operations on data are complex:
–E.g., simple counting is not a complex problem.
–Modeling and reasoning with data of different kinds can get extremely complex
Good news with big data:
–Often, because of the vast amount of data, modeling techniques can get simpler (e.g., smart counting can replace complex model-based analytics)…
–…as long as we deal with the scale.
Time for thinking
What do you do with the data?
– Let's take an example:
• “From application developers to video streamers, organizations of all sizes face the challenge of capturing, searching, analyzing, and leveraging as much as terabytes of data per second—too much for the constraints of traditional system capabilities and database management tools.”
Why Big-Data?
Key enablers for the appearance and growth of ‘Big-Data’ are:
+Increase in storage capabilities
+Increase in processing power
+Availability of data
THINK