A simple introduction to big data and Hadoop
BIG DATA
BIG DATA EXAMPLE
• Social media (likes, friends, videos, pictures, tweets, …)
• Mobile signals, sensors, clicks
• Online shopping, stocks
• Code
• …
BUY A BOOK FROM AMAZON
• Knows what you searched for
• Knows everything you have EVER bought
• Knows how much you are willing to pay
• Asks Facebook (friends, likes, hangouts, …)
• Knows who else is buying what
BIG DATA USAGE?
WHAT IS BIG DATA?
• Any data that you cannot store on one PC
• The 3 Vs (Volume, Velocity, Variety)
APACHE HADOOP
• Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
DISTRIBUTED STORAGE: HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
SUPERCOMPUTER OR NORMAL COMPUTERS?
WHY HDFS?
• What if something goes wrong (hardware failure)?
• What is the cost of a supercomputer?
• How easily can we add capacity?

• Automatically handles hardware failure
• Automatically backs up data
• To add capacity, just buy new cheap computers
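
The failure-handling idea above can be illustrated with a toy model. This is only a sketch and not the HDFS API: the function names, the node names and the round-robin placement are all assumptions made for illustration (real HDFS uses a rack-aware placement policy). The replication factor of 3 matches the HDFS default, while the block size is shrunk to a few bytes so the example stays readable (the real default is 128 MB).

```python
BLOCK_SIZE = 4    # bytes per block in this toy model (real HDFS default: 128 MB)
REPLICATION = 3   # number of copies kept of every block (the HDFS default)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Map each block index to `replication` distinct nodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        idx: [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        for idx in range(len(blocks))
    }


def read_file(blocks, placement, live_nodes):
    """Reassemble the file, failing only if some block has no live replica."""
    parts = []
    for idx, block in enumerate(blocks):
        if not any(node in live_nodes for node in placement[idx]):
            raise IOError(f"block {idx} has no surviving replica")
        parts.append(block)
    return b"".join(parts)
```

With five cheap machines and three copies of every block, any two machines can fail and `read_file` still returns the whole file. Adding capacity in this model means appending another name to the node list.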
DISTRIBUTED PROCESSING (MAPREDUCE)
• Count the number of trees in the United States?
• Solution 1: ask Superman?
• Solution 2: ask 1000 people?
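
"Ask 1000 people" is the MapReduce idea: each person counts the trees on their own plot (map), the counts are grouped by state (shuffle), and each group is added up (reduce). The sketch below runs the three steps in one Python process. It does not use Hadoop, and the function names and sample plots are made up for illustration. On a real cluster the map calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_plot(plot):
    """Map: one surveyor turns one plot into a (state, tree count) pair."""
    state, trees = plot
    return state, len(trees)


def shuffle(pairs):
    """Shuffle: collect all counts that share the same state key."""
    grouped = defaultdict(list)
    for state, count in pairs:
        grouped[state].append(count)
    return grouped


def reduce_counts(state, counts):
    """Reduce: add up every surveyor's count for one state."""
    return state, sum(counts)


def count_trees(plots):
    """Run map, shuffle and reduce in sequence and return trees per state."""
    mapped = [map_plot(plot) for plot in plots]  # independent, so parallelizable
    grouped = shuffle(mapped)
    return dict(reduce_counts(state, counts) for state, counts in grouped.items())


# Hypothetical sample data: each plot is (state, list of trees found there).
plots = [
    ("Oregon", ["fir", "pine"]),
    ("Texas", ["oak"]),
    ("Oregon", ["cedar"]),
]
per_state = count_trees(plots)
total = sum(per_state.values())
```

Because each `map_plot` call looks at only its own plot, adding more surveyors (machines) speeds up the count without changing the code.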
BIG DATA USAGE IN COMPUTER SCIENCE
• Mining repositories
• Ownership (plagiarism, copyright)
• Detecting code smells
• Auto commenting
• Predicting bugs, bug reports
OTHER TOPICS
• Data scientist
• NoSQL
• Machine learning