Introduction to Big Data Hadoop
Dr. Sandeep G. Deshmukh
Introduction to
Contents
❑ Big Data
❑ Distributed Systems
❑ Hadoop
  ➢ Hadoop Distributed File System (HDFS)
  ➢ MapReduce
Show of Hands
Introduction to Big Data
Big data is data that exceeds the processing capacity of
conventional database systems.
The data is too big, moves too fast, or doesn’t fit the strictures of
your database architectures.
To gain value from this data, you must choose an alternative way
to process it.
https://www.oreilly.com/ideas/what-is-big-data
Definition
Volume
Quantity of data
Data sets too large to store and analyze using traditional databases
Velocity
Speed at which data is generated
Speed at which data is moving around and analyzed
Analyze data while it is being generated without even putting it into databases
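A minimal sketch of analyzing data while it is being generated, assuming a hypothetical sensor_feed generator that stands in for a live source. Each reading updates a running mean and is then discarded, so nothing is stored in a database. The function names and sample values are illustrative only.

```python
from typing import Iterable, Iterator


def running_mean(readings: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all readings seen so far, one value per reading."""
    count = 0
    total = 0.0
    for value in readings:
        # Only two numbers are kept in memory, however long the stream runs.
        count += 1
        total += value
        yield total / count


def sensor_feed() -> Iterator[float]:
    """Hypothetical data source standing in for a live feed."""
    for value in (10.0, 20.0, 30.0, 40.0):
        yield value


means = list(running_mean(sensor_feed()))
print(means)
```

Because running_mean is a generator, each result is available as soon as its reading arrives; collecting into a list here is only for display.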
Variety
Different types of data that we can use
Veracity
Messiness or trustworthiness of the data
Volume makes up for quality
E.g., tweets with spelling mistakes and shortened words (u -> you, thr -> there)
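One way messy text is often cleaned up before analysis is a simple lookup of known short forms. The sketch below assumes a hypothetical SHORT_FORMS table holding only the two examples from the slide; a real table would be far larger, and this is not a complete approach to veracity.

```python
import re

# Hypothetical lookup table; both entries are the examples from the slide.
SHORT_FORMS = {"u": "you", "thr": "there"}


def normalize(text):
    """Replace known short forms with full words, leaving other words untouched."""
    def expand(match):
        word = match.group(0)
        # Look up the whole word, so "thr" is expanded but "thrive" is not.
        return SHORT_FORMS.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", expand, text)


cleaned = normalize("r u thr?")
print(cleaned)
```

Words missing from the table (such as "r" here) pass through unchanged, which shows why some messiness always remains in the data.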
Value
Getting value out of Big Data!!!
Definition
“Big data” is
high-volume, -velocity and -variety information assets
that demand cost-effective, innovative forms of information processing
for enhanced insight and decision making
By Gartner
Definition
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.
Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.
Wikipedia
Use Case: Big Data in Oil & Gas Drilling
http://analytics-magazine.org/images/stories/novdec12/big-data.jpg
Use Case: Uber - Pay Surge Pricing if Battery is Low
Further Reading
● A Brief History of Big Data Everyone Should Read
● Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity
● What is big data? - OpenSource.com
● What is big data? - O’Reilly
● 5 Big Data Use Cases To Watch
● Best Big Data Analytics Use Cases
● The 5 game changing big data use cases
● Big Data - The 5 Vs Everyone Must Know
● Top SlideShare Presentations on Big Data
● Google Data Center 360° Tour
Distributed Systems
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
Distributed Systems: Principles and Paradigms, 2nd Edition, Andrew S. Tanenbaum, Maarten Van Steen, 2006
http://www.mypearsonstore.com/bookstore/distributed-systems-principles-and-paradigms-9780132392273?xid=PSED
Definition
Distributed Systems: Principles and Paradigms, 2nd Edition, Andrew S. Tanenbaum, Maarten Van Steen, 2006
Forms of Transparency in Distributed Systems

Transparency   Description
Access         Hide differences in data representation and how a resource is accessed
Location       Hide where a resource is located
Migration      Hide that a resource may move to another location
Relocation     Hide that a resource may be moved to another location while in use
Replication    Hide that a resource is replicated
Concurrency    Hide that a resource may be shared by several competitive users
Failure        Hide the failure and recovery of a resource
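A toy sketch of what replication and failure transparency can look like from the caller's side. The Replica and TransparentStore classes are hypothetical and run in a single process; they only illustrate the idea that the user calls put and get without knowing how many copies exist or which node has failed.

```python
class Replica:
    """One copy of the data; `alive` simulates whether the node is reachable."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}


class TransparentStore:
    """Facade that hides how many copies exist and which nodes have failed."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        # Replication transparency: the caller writes once, every live copy is updated.
        for replica in self.replicas:
            if replica.alive:
                replica.data[key] = value

    def get(self, key):
        # Failure transparency: unreachable replicas are skipped silently.
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


store = TransparentStore([Replica("node-a"), Replica("node-b")])
store.put("report", "v1")
store.replicas[0].alive = False  # simulate a node failure
value = store.get("report")      # still served, from node-b
print(value)
```

The caller never names a node, which is also a simple form of location transparency.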
● A distributed system consists of components (i.e., computers) that are autonomous
● Users (be they people or programs) think they are dealing with a single system. This means that one way or the other the autonomous components need to collaborate. How to establish this collaboration lies at the heart of developing distributed systems.
A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages.
The components interact with each other in order to achieve a common goal.
Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.
Wikipedia
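A small sketch of components that coordinate only by passing messages. It assumes threads and in-memory queues as stand-ins for networked computers and network links, which is a simplification: a real distributed system would exchange messages over a network and would have to handle independent failures.

```python
import queue
import threading


def worker(inbox, outbox):
    """Component that interacts with others only through messages."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work
            break
        # Reply with a message of its own instead of sharing state.
        outbox.put(("count", len(message.split())))


inbox, outbox = queue.Queue(), queue.Queue()
thread = threading.Thread(target=worker, args=(inbox, outbox))
thread.start()  # the worker now runs concurrently with the sender

for line in ["big data", "distributed systems pass messages"]:
    inbox.put(line)
inbox.put(None)
thread.join()

results = []
while not outbox.empty():
    results.append(outbox.get())
print(results)
```

The sender and the worker share no variables; the two queues are the only channel between them, and together the two components achieve the common goal of counting words.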
Definition
Further Reading
● Distributed Computing - Wikipedia
● Distributed computing
● Characteristics of distributed system
Miscellaneous Concepts
Big Data Primers: Size does matter
Big Data Primers: Vertical Vs Horizontal Scaling
Vertical Scaling Horizontal Scaling
Big Data Primers: The scale of infrastructure
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
  ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
  ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
  ᵒ https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• [email protected]
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders