Mingon Kang, Ph.D.
Department of Computer Science, University of Nevada, Las Vegas
CS 789 ADVANCED BIG DATA ANALYTICS
BIG DATA
* The contents are adapted from Dr. Jeongkyu Lee@UB
Era of Big Data
[Figure: timeline of computing eras, 1970 to 2030]
Eras: Main Frame Computer, PC, Internet, Mobile Computer, IT everywhere
Milestones: www, PC, Broadband, SNS, Mobile, Virtual Reality, AI
2011: Amount of Digital Information = 1.8 ZB
2020: maybe 50 times more??
Data Size | Data Type | Data Characteristic
EB (Exa Byte), '90 = 100 EB | Structured Data (RDBMS, Office Info) | Organized Data
Beginning of ZB, 2011 = 1.8 ZB | Unstructured Data (MM, SNS, email) | Complex, SNS Data
ZB Era, 2020 = x 50 data | Object, Spatial (IoT, RFID, Sensor) | Real-time Data
What is BIG DATA?
Wiki said in 2012 …
data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time
Wiki says NOW …
A broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, … The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.
What is BIG DATA?
Gartner says …
Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization
Oxford English Dictionary says …
big data n. Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.
3V: Volume (Scale)
Data Volume
44x increase from 2009 to 2020
From 0.8 ZB to 35 ZB
Data volume is increasing exponentially
Exponential increase in collected/generated data
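As a quick check on the figures above, the sketch below recomputes the overall growth factor from the two quoted volumes and derives the yearly growth rate they would imply. The constant-rate (compound growth) assumption is ours, not the slide's; the slide only gives the two endpoints.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb, end_zb = 0.8, 35.0
start_year, end_year = 2009, 2020

# Overall multiplier between the two endpoints.
growth_factor = end_zb / start_zb

# Implied compound annual growth rate, ASSUMING the same rate every year.
years = end_year - start_year
annual_rate = growth_factor ** (1 / years) - 1

print(f"Overall growth: {growth_factor:.2f}x over {years} years")
print(f"Implied annual growth: {annual_rate:.1%}")
```

The overall factor lands close to the "44x" on the slide, and the implied yearly rate shows what "exponential" means here: the total grows by a roughly constant percentage each year rather than by a constant amount.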
3V: Variety (Complexity)
Various formats, types, and structures
Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
Static data vs. streaming data
A single application can be generating/collecting many types of data
To extract knowledge ➔ all these types of data need to be linked together
3V: Velocity (Speed)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions ➔ missing opportunities
Examples
E-Promotions: Based on your current location, your purchase history, what you like ➔ send promotions right now for store next to you
Healthcare monitoring: sensors monitoring your activities and body ➔ any abnormal measurements require immediate reaction
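The healthcare example can be sketched as a small streaming check that looks at each reading as it arrives instead of waiting for a batch. This is only an illustration under our own assumptions: the function name `detect_anomalies`, the window size, the threshold, and the sample heart-rate values are all hypothetical, and a rolling z-score is just one simple way to flag "abnormal measurements".

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the rolling mean of recent values."""
    recent = deque(maxlen=window)  # sliding window of the last normal readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Flag values more than `threshold` standard deviations from the mean.
            # A perfectly flat window (sigma == 0) is never flagged in this sketch.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline window
        recent.append(value)

# Hypothetical heart-rate stream (beats per minute) with one spike.
heart_rate = [72, 74, 71, 73, 72, 75, 73, 140, 74, 72]
alerts = list(detect_anomalies(heart_rate))
print(alerts)
```

Because the function is a generator, it can consume an unbounded stream and raise an alert the moment a reading arrives, which is the point of the velocity requirement.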
4V: Veracity
Who’s Generating Big Data
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
How to use big data
What’s driving Big Data
- Ad-hoc querying and reporting
- Basic data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis and huge data mining
- All types of data, and many sources
- Very large datasets
- More of a real-time
How to use Big Data
• Big Data, like Business Intelligence, can be used to improve stuff.
• It can also be used to solve problems (i.e. answer “big questions”).
How to use Big Data
• Suppose you have a Combine Harvester.
• Sensors are becoming increasingly cheap, so it would be quite easy to cover the harvester in sensors (temperature, GPS, pressure, capacity, etc…).
• This will generate some big data, especially if all the harvesters in Europe are equipped with the same sensors.
How to use Big Data
But what would you use this data for?
• Finding the most economical driving style by monitoring driving habits, tracking the position of the harvester and fuel levels in the tank.
• Monitoring vibrations and temperature patterns in the parts to predict when parts might break. This could then tie into a system that automatically orders parts.
• Tracking the harvester’s position and yield, to identify the most fertile areas and those which require fertilisers.
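The last use case (position plus yield) can be sketched as a simple aggregation: bucket each GPS sample into a grid cell and average the yield per cell. Everything named here is an assumption for illustration: the function `mean_yield_by_cell`, the 100 m cell size, the coordinate units, the sample values, and the 6.0 cut-off for "needs fertiliser" are all hypothetical.

```python
from collections import defaultdict

def mean_yield_by_cell(samples, cell_size=100.0):
    """Average yield per square grid cell.

    `samples` is an iterable of (x_m, y_m, yield_value) tuples, where x_m and y_m
    are field coordinates in metres (an assumed format, not from the slides).
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of yields, sample count]
    for x, y, yld in samples:
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell][0] += yld
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}

# Hypothetical readings from one pass over a field.
samples = [(10, 20, 8.0), (60, 90, 9.0), (150, 30, 4.0), (180, 70, 5.0)]
averages = mean_yield_by_cell(samples)

# Cells whose average falls below an assumed cut-off are fertiliser candidates.
low_yield_cells = [cell for cell, avg in averages.items() if avg < 6.0]
print(averages, low_yield_cells)
```

With a fleet of harvesters reporting continuously, the same aggregation would have to run over far more samples than fit on one machine, which is where the storage and processing tools discussed later come in.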
How to use Big Data
• How did Google use Big Data?
• They stored search history for every user as well as what every user clicked.
• This data seemed needless and pointless (Data Exhaust).
• What do you think that Google did with this data?
How to use Big Data
• Google used that data to power a spell-checker. Because if I search for “bansnas” and click on something relating to “bananas”, the chances are that I meant to search for “bananas” in the first place.
• About 2 billion searches a day are made on Google.
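The idea of learning corrections from clicks can be sketched in a few lines. This is a toy illustration of the principle described above, not Google's actual method: the log format, the function `build_corrections`, and the `min_share` threshold are all assumptions.

```python
from collections import Counter, defaultdict

def build_corrections(click_log, min_share=0.6):
    """Map a query to the term users most often clicked, when that term
    differs from the query and accounts for at least `min_share` of its clicks."""
    clicks = defaultdict(Counter)  # query -> counts of clicked terms
    for query, clicked_term in click_log:
        clicks[query][clicked_term] += 1

    corrections = {}
    for query, counts in clicks.items():
        term, n = counts.most_common(1)[0]
        if term != query and n / sum(counts.values()) >= min_share:
            corrections[query] = term
    return corrections

# Hypothetical (query, clicked term) pairs.
log = [
    ("bansnas", "bananas"),
    ("bansnas", "bananas"),
    ("bansnas", "bandanas"),
    ("bananas", "bananas"),
]
corrections = build_corrections(log)
print(corrections)
```

No dictionary is involved: the "correct" spelling emerges purely from what many users did next, which is why the volume of searches matters.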
How to use Big Data
• The LAPD uses PredPol to predict crimes.
• https://www.predpol.com/
How to use Big Data
• The LAPD mined 13 million crime reports with a specialised algorithm, using three attributes of each crime:
1. Type of Crime
2. Place of Crime
3. Time of Crime
• These 13 million reports represent 80 years of crime data.
• They then built mission maps covering dangerous areas and patrolled them to minimise crime.
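To make the three inputs concrete, here is a deliberately simple sketch that counts past reports by type, place, and time block and ranks the most frequent combinations. It is not PredPol's algorithm, which the slides do not describe; the function `top_hotspots`, the cell labels, the 6-hour blocks, and the sample reports are all hypothetical.

```python
from collections import Counter

def top_hotspots(reports, k=2):
    """Rank (crime_type, place, hour_block) combinations by report count.

    `reports` is an iterable of (crime_type, place, hour_of_day) tuples.
    """
    counts = Counter(
        (crime_type, place, hour // 6)  # four 6-hour blocks per day: 0..3
        for crime_type, place, hour in reports
    )
    return counts.most_common(k)

# Hypothetical historical reports.
reports = [
    ("burglary", "cell_A", 22), ("burglary", "cell_A", 23), ("burglary", "cell_A", 19),
    ("theft", "cell_B", 14), ("theft", "cell_B", 15),
    ("burglary", "cell_C", 3),
]
hotspots = top_hotspots(reports)
print(hotspots)
```

A ranked list like this is the raw material for a "mission map": the top entries say where, when, and for which kind of crime a patrol is most likely to matter, under the assumption that the past resembles the future.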
How to use Big Data
It worked. It reduced:
• Property crime by 12% and burglary by 26%
How to manage Big data
Challenges in Handling Big Data
The Bottleneck is in technology
New architecture, algorithms, techniques are needed
Also in technical skills
Experts in using the new technology and dealing with big data
Storing Big Data
• Here are a few tools that can be used to store Big Data.
Traditional Large-Scale Computation
Distributed System: Problems
Distributed Systems: Data Storage
Data-Driven World
Data Become the Bottleneck
Requirements for a new approach
Partial Failure Support
Data Recoverability
Component Recovery
Consistency
Scalability
Newbie for Big Data
- Hadoop Eco-System
Hadoop History
Core Hadoop Concepts
Very High-level Overview
Fault Tolerance
Hadoop in IBM
Hadoop in Oracle
Hadoop in Teradata
Hadoop in Microsoft
Hadoop in EMC