Distributed Data Management Summer Semester 2013
TU Kaiserslautern
Dr.-Ing. Sebastian Michel
[email protected]‐saarland.de
Distributed Data Management, SoSe 2013, S. Michel 1
BIG DATA
source: Dilbert by Scott Adams (cropped)
(The Big data Challenge)
Lecture 2
What is Big Data?
• Massive amounts of data from a variety of sources
– Web search logs
– social networks and blogs
– RFID and other sensor data
– sales data
– scientific data
& it is a big buzzword!
What is Big Data? (Cont’d)
• Big data is often associated with NoSQL and MapReduce tools to process it.
• Processed in and across gigantic data centers
• The term “Big Data” denotes not only size but things we want to/can do with it (benefits)
Traditional Handling
• Data warehousing, e.g., at Walmart, eBay, etc. Also super big and constantly growing.
• But you know your data, know what you are looking for
• Schema is “small” enough to allow human input (admin)
• It is “just” YOUR data
“Simple” Case: Shopping Patterns
• Famous story:
– statistician at target.com (large retailer in the US)
– task: figure out whether a woman is pregnant, even if she doesn’t want the retailer to know
– even more: roughly in which week/month
– Why? To sell products!
Read more: e.g., http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0
“Simple” Case: Use of Search Logs
• Swine Flu epidemic of 2009
• Google tracks epidemic by following searches for flu-related topics.
source: Google
What is different now?
• Large amounts of heterogeneous data
• Take all the PBs together, not only your own (→ from TB to PB and EB)
• Manual human input hardly scales
• Who understands complex data and its schema anyway (if there is one)?
• Far more data than we can handle (with traditional means, and most probably beyond that)
• It is now beyond asking SQL queries.
The 4th Paradigm
• For scientific discovery, traditionally
– experimental (for thousands of years)
– theoretical (for hundreds of years)
– computational (like simulations) (for a few decades)
– Now: data driven (i.e., discovery through analyzing huge amounts of data)
Read on: http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Data Science: What it takes
• many fields touched
– math, statistics
– data engineering
– pattern recognition and learning
– natural language processing
– visualization
– uncertainty modeling
– data warehousing
– high performance computing
The BIG Data Challenge: The 4 Vs
• Volume – Lots of data
• Velocity – Changing / growing data
• Variety – Heterogeneity
• Verity – True or not?
Addressed in this lecture
According to Gartner and others.
Example: Trend Mining in Twitter
• Mine trends in text streams (Twitter, RSS feeds, etc.)
• No human input. Massive amount of noisy unstructured text data.
• Want to find trends like:
#benedictXVI #retirement
#schavan #guttenberg
#armstrong #doping
#cyprus #bankruptcy
Example (Cont’d): Sliding Window Model and Objective
• Data is valid only for a certain time (sliding window)
• Now: Detect change in co-occurrence, thus emerging trend!
[Figure: occurrences of tag A and tag B within the sliding window along an evolving time axis]
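The sliding window idea above can be sketched in a few lines of Python. This is only an illustration, not the method used in the lecture: the class name `SlidingWindowCooccurrence`, the integer timestamps, and the choice to count a tag pair once per document are all assumptions made for the example.

```python
from collections import Counter, deque
from itertools import combinations


class SlidingWindowCooccurrence:
    """Counts tag-pair co-occurrences among documents in (now - window, now]."""

    def __init__(self, window):
        self.window = window
        self.events = deque()         # (timestamp, tag pairs of one document)
        self.pair_counts = Counter()  # tag pair -> count inside the window

    def add(self, timestamp, tags):
        # Every unordered pair of distinct tags in a document co-occurs once.
        pairs = list(combinations(sorted(set(tags)), 2))
        self.events.append((timestamp, pairs))
        self.pair_counts.update(pairs)
        self._expire(timestamp)

    def _expire(self, now):
        # Documents older than the window are no longer valid: remove them.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_pairs = self.events.popleft()
            self.pair_counts.subtract(old_pairs)

    def count(self, tag_a, tag_b):
        return self.pair_counts[tuple(sorted((tag_a, tag_b)))]
```

Feeding documents in timestamp order and reading `count(tag_a, tag_b)` after each step gives a time series per tag pair. A sudden change in that series is the signal the slide refers to.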
Example (Cont’d): Prediction Model and Trend Ranking
[Plot: correlation (y-axis, 0 to 1) over time steps 1 to 10, showing the prediction and the prediction error]
• Intensity of trend as prediction error
• Exponential smoothing forecast
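A minimal sketch of the two points above, assuming simple (single) exponential smoothing and the absolute one-step-ahead error as the trend intensity. The slides do not specify the smoothing variant, the parameter `alpha`, or the correlation measure, so the function names and defaults here are hypothetical.

```python
def trend_intensity(series, alpha=0.5):
    """Absolute one-step-ahead prediction errors under exponential smoothing."""
    if not series:
        return []
    forecast = series[0]  # assumption: start the forecast at the first value
    errors = [0.0]
    for value in series[1:]:
        # Error of the forecast made before seeing `value`.
        errors.append(abs(value - forecast))
        # Exponential smoothing update for the next step.
        forecast = alpha * value + (1 - alpha) * forecast
    return errors


def rank_trends(correlations, alpha=0.5):
    """Rank tag pairs by their most recent prediction error, largest first."""
    scores = {
        pair: trend_intensity(series, alpha)[-1]
        for pair, series in correlations.items()
        if series
    }
    return sorted(scores, key=scores.get, reverse=True)
```

A pair whose correlation stays flat is predicted well and gets a low score. A pair whose correlation jumps unexpectedly produces a large error and moves to the top of the ranking.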
Data Sources are Heterogeneous
• super fast, not controlled (noisy), text, little structure
• super fast, structured
• static, structured, administered
… so is the Data
Music
Publications
Health Data
KB of Entire Wikipedia
Why is Big Data Interesting?
• Novel insights about customers
– Beyond pure shopping cart analyses and purchase history
– Beyond running separate surveys/polls
• Social media involvement
• Demographic data
• (Purchase) trend prediction in social media (=> investment)
• Why? Money
Need to be Careful
• Not only are facts often wrong
• Statistics can also give wrong clues.
• With enough data you can “tell” anything
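The last point can be illustrated with a small simulation, offered here only as a sketch: among enough unrelated random series, one will correlate strongly with any given target purely by chance. The function names, the sample sizes, and the seed are arbitrary choices for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_points=10, n_candidates=2000, seed=0):
    """Strongest |correlation| between a random target and random candidates."""
    rng = random.Random(seed)
    target = [rng.random() for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise, unrelated to the target.
        candidate = [rng.random() for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best
```

With only ten observations and two thousand candidate series, the best match typically looks like a strong relationship even though every series is pure noise.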