PowerPoint presentation -...
TRANSCRIPT
Contents
• Why Big Data?
• Characteristics of Big Data
• Big Data Technologies and Tools
• Fundamental Big Data Paradigms
• Data Mining
• Visualization
SLIDE 1
Why Big Data?
SLIDE 2
We are drowning in data but starving for knowledge!
Why Big Data?
• The Model of Generating/Consuming Data has Changed
– Old Model: a few companies generate data, all others consume data
– New Model: all of us generate data, and all of us consume data
SLIDE 3
Who generates and uses data?
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
SLIDE 4
Evolution
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
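The difference between the first two stages can be illustrated with a small sketch: OLTP workloads consist of many short transactions that touch individual rows, while OLAP workloads run aggregate queries over the accumulated history. The example uses Python's built-in sqlite3 module with an in-memory database purely for illustration. The sales table and its rows are invented, and a real OLAP system would normally run on a separate data warehouse rather than on the transactional database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL)")

# OLTP style: each purchase is a short, independent transaction on one row.
purchases = [("north", 20.0), ("south", 35.5), ("north", 12.5), ("south", 4.5)]
for store, amount in purchases:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales (store, amount) VALUES (?, ?)", (store, amount)
        )

# OLAP style: one analytical query that scans and aggregates the history.
totals = conn.execute(
    "SELECT store, COUNT(*), SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
conn.close()
```

The inserts stand in for the day-to-day operational load, and the GROUP BY query stands in for the kind of summary a data warehouse is built to answer.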
SLIDE 5
Big Data
• “Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities” (zdnet.com)
• The big deal about big data is the potential for getting more value more quickly from more data, at a lower cost and with greater agility. (Brian Hopkins, zdnet)
SLIDE 6
Big Data
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
SLIDE 7
Characteristics of Big Data: Volume
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially
Exponential increase in collected/generated data
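As a quick check on the figures above, the following sketch computes the growth factor implied by going from 0.8 ZB to 35 ZB, along with the equivalent compound annual growth rate. The two volumes and the years come from the slide. The steady year-over-year growth model and the function names are assumptions made for the example.

```python
# Figures taken from the slide; the steady-growth model is an assumption.
START_ZB = 0.8   # data volume in 2009, in zettabytes
END_ZB = 35.0    # projected data volume in 2020, in zettabytes
YEARS = 2020 - 2009


def growth_factor(start: float, end: float) -> float:
    """Overall multiplication factor between two volumes."""
    return end / start


def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual rate that turns `start` into `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


factor = growth_factor(START_ZB, END_ZB)
rate = annual_growth_rate(START_ZB, END_ZB, YEARS)
print(f"growth factor: {factor:.2f}x")
print(f"implied annual growth: {rate:.1%}")
```

The factor comes out just under 44, which matches the rounded "44x" on the slide.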
SLIDE 9
Characteristics of Big Data: Variety
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
To extract knowledge, all these types of data need to be linked together
SLIDE 10
Characteristics of Big Data: Velocity
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
– E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction
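The healthcare monitoring example can be sketched as a stream processor that reacts to each reading as it arrives instead of storing it for later batch analysis. This is a minimal illustration, not a clinical method: the function name detect_abnormal, the window size, the z-score threshold of 3, and the heart-rate readings are all assumptions made up for the example.

```python
from collections import deque
from statistics import mean, stdev
from typing import Iterable, Iterator, Tuple


def detect_abnormal(readings: Iterable[float],
                    window: int = 5,
                    threshold: float = 3.0) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far from the recent baseline.

    Each reading is compared with the mean and standard deviation of the
    previous `window` normal readings, so the decision is made as soon as
    the value arrives.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Skip the check when the window is flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


# Hypothetical heart-rate stream (beats per minute) with one spike.
heart_rate = [72, 74, 71, 73, 75, 72, 140, 74, 73]
alerts = list(detect_abnormal(heart_rate))
print(alerts)
```

Because the function is a generator, an alert is produced at the moment the abnormal reading is consumed, which is the point of the "immediate reaction" requirement.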
SLIDE 11
Big Data Bubble?
© 2013 KDnuggets
Gartner Hype Cycle
Big Data
Gartner VP says Big Data is
Falling into the Trough of
Disillusionment, Jan 2013
SLIDE 14
Challenges
• The bottleneck is in technology
– New architecture, algorithms, techniques are needed
• Also in technical skills
– Experts in using the new technology and dealing with big data
SLIDE 15
Business Intelligence
• Statistics
• Data mining
• Knowledge Discovery in Data (KDD)
• Predictive Analytics
• Business Analytics
• Data Science
• Data Analytics
• …
Same Core Idea:
Finding Useful Patterns in Data
Different Emphasis
SLIDE 21
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
SLIDE 23
Why?
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Why?
SLIDE 24
What is it?
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
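As a toy illustration of discovering patterns by automatic means, the sketch below counts which pairs of items are bought together most often in a handful of invented shopping baskets. Pair counting is only one simple technique chosen for the example. The slide does not prescribe any particular algorithm, and the baskets, the function name frequent_pairs, and the min_support parameter are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_pairs(baskets: List[Set[str]],
                   min_support: int) -> Dict[FrozenSet[str], int]:
    """Return item pairs that occur together in at least `min_support` baskets."""
    counts: Counter = Counter()
    for basket in baskets:
        # sorted() gives a deterministic pair order within each basket
        for pair in combinations(sorted(basket), 2):
            counts[frozenset(pair)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Invented example data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
patterns = frequent_pairs(baskets, min_support=3)
for pair, n in patterns.items():
    print(sorted(pair), n)
```

On real data the number of candidate pairs grows quickly, which is one reason scalable algorithms matter for "large quantities of data".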
SLIDE 25
Origins
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Diagram labels: Machine Learning/Pattern Recognition, Statistics/AI, Database systems, Data Mining]
SLIDE 26
CRISP-DM
• Why Should There be a Standard Process?
– The data mining process must be reliable and repeatable by people with little data mining background.
SLIDE 27
CRISP-DM
• Why Should There be a Standard Process?
– Allows projects to be replicated
– Aid to project planning and management
– Allows the scalability of new algorithms
SLIDE 28
CRoss-Industry Standard Process for Data Mining
“The CRISP-DM Model: The New Blueprint for Data Mining”, Colin Shearer, Journal of Data Warehousing, Volume 5, Number 4, pp. 13-22, 2000
SLIDE 29
CRISP-DM
• Business Understanding:
– Project objectives and requirements understanding, Data mining problem definition
• Data Understanding:
– Initial data collection and familiarization, Data quality problems identification
• Data Preparation:
– Table, record and attribute selection, Data transformation and cleaning
• Modeling:
– Modeling techniques selection and application, Parameters calibration
• Evaluation:
– Business objectives & issues achievement evaluation
• Deployment:
– Result model deployment, Repeatable data mining process implementation
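The technical phases listed above (Data Understanding through Evaluation) can be sketched end to end on a toy problem. Everything in this example is an assumption made for illustration: the customer records, the single-threshold "model", and the function names are hypothetical, and CRISP-DM itself does not mandate any particular technique.

```python
from typing import List, Optional, Tuple

# Invented records: (monthly_spend, churned). None marks a missing value.
Record = Tuple[Optional[float], bool]
raw: List[Record] = [
    (10.0, True), (40.0, False), (None, False), (12.0, True),
    (45.0, False), (15.0, True), (50.0, False), (14.0, True),
]


def understand(data: List[Record]) -> int:
    """Data Understanding: report a basic quality problem (missing values)."""
    return sum(1 for spend, _ in data if spend is None)


def prepare(data: List[Record]) -> List[Tuple[float, bool]]:
    """Data Preparation: drop records with a missing attribute."""
    return [(spend, label) for spend, label in data if spend is not None]


def build_model(train: List[Tuple[float, bool]]) -> float:
    """Modeling: pick the spend threshold with the fewest training errors."""
    candidates = sorted(spend for spend, _ in train)

    def errors(t: float) -> int:
        return sum((spend < t) != label for spend, label in train)

    return min(candidates, key=errors)


def evaluate(threshold: float, test: List[Tuple[float, bool]]) -> float:
    """Evaluation: accuracy of the rule 'spend < threshold means churn'."""
    hits = sum((spend < threshold) == label for spend, label in test)
    return hits / len(test)


missing = understand(raw)
clean = prepare(raw)
train, test = clean[:5], clean[5:]   # simple hold-out split
threshold = build_model(train)
accuracy = evaluate(threshold, test)
print(missing, threshold, accuracy)
```

Business Understanding and Deployment are organizational steps and are left out of the sketch.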
SLIDE 31
CRISP-DM
• Business Understanding
– Determine Business Objectives
– Assess Situation
– Determine Data Mining Goals
– Produce Project Plan
• Data Understanding
– Collect Initial Data
– Describe Data
– Explore Data
– Verify Data Quality
• Data Preparation
– Select Data
– Clean Data
– Construct Data
– Integrate Data
– Format Data
• Modeling
– Select Modeling Technique
– Generate Test Design
– Build Model
– Assess Model
• Evaluation
– Evaluate Results
– Review Process
– Determine Next Steps
• Deployment
– Plan Deployment
– Plan Monitoring & Maintenance
– Produce Final Report
– Review Project
SLIDE 32
CRISP-DM
• Knowledge acquisition techniques
Knowledge Acquisition, Representation, and Reasoning. Turban, Aronson, and Liang, Decision Support Systems and Intelligent Systems, 7th Edition, Prentice Hall, 2005
SLIDE 34
DM Tools
• Open Source
• Weka
• Orange
• R-Project
• KNIME
• Commercial
• SPSS
• Clementine
• SAS Miner
• Matlab
• …
SLIDE 35
DM Tools
• Weka 3.6
– Java
– Excellent library, average interface
– http://www.cs.waikato.ac.nz/ml/weka/
• Orange
• R-Project
• KNIME
SLIDE 36
DM Tools
• Weka 3.6
• Orange
– C++ and Python
– Average library, good interface
– http://orange.biolab.si/
• R-Project
• KNIME
SLIDE 37
DM Tools
• Weka 3.6
• Orange
• R-Project
– Similar to Matlab and Maple
– Powerful libraries, average interface. Too slow for file access!
– http://cran.es.r-project.org/
• KNIME
SLIDE 38
DM Tools
• Weka 3.6
• Orange
• R-Project
• KNIME
– Java
– Includes Weka, Python and R-Project
– Powerful libraries, good interface
– http://www.knime.org/download-desktop
SLIDE 39