Big Data
DESCRIPTION
An introduction to Big Data technology, Hadoop, and data.
TRANSCRIPT
-
By:
Andrew B. Osmond
-
About Me
FB : http://facebook.com/ab.osmond
Office: Gedung N202, Fakultas Teknik Elektro, Universitas Telkom
Gmail : [email protected]
Tel-U : [email protected]
-
Why Is Data So Big?
-
Data Anywhere
Big Data refers to massive, often unstructured data that exceeds the processing capabilities of traditional data management tools.
Big Data can take up terabytes or petabytes of storage space in diverse formats, including text, video, sound, and images.
Traditional relational database management systems cannot deal with such large masses of data.
-
Data Anywhere
-
What can we do with the big data?
-
Big Data Architecture
-
Nature of Data
-
Working With Data
Datasource
Data Scrubbing
Data Formats
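The "Data Scrubbing" step above can be sketched in a few lines. This is a minimal, hypothetical example (the dataset and its columns are invented for illustration): trim stray whitespace and drop records whose age field is missing or non-numeric.

```python
import csv
import io

# A hypothetical raw dataset with common problems: stray whitespace,
# a missing value, and an unparseable age.
raw = io.StringIO(
    "name,age,city\n"
    "  Andi , 21 , Bandung\n"
    "Budi,,Jakarta\n"
    "Citra,abc,Surabaya\n"
)

def scrub(rows):
    """Trim whitespace and drop rows whose age is missing or non-numeric."""
    for row in rows:
        cleaned = {k.strip(): v.strip() for k, v in row.items()}
        if cleaned["age"].isdigit():
            cleaned["age"] = int(cleaned["age"])
            yield cleaned

clean_rows = list(scrub(csv.DictReader(raw)))
print(clean_rows)  # only the valid row for "Andi" survives
```

Real scrubbing pipelines add many more rules (deduplication, unit normalization, outlier checks), but the shape is the same: read, validate, emit or drop.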
-
Datasource
-
Open Data
Open data is data that can be used, re-used, and redistributed freely by anyone, for any purpose.
Examples:
World Health Organization data is available at http://www.who.int/research/en/
Machine learning datasets are available at http://bitly.com/bundles/bigmlcom/2
The World Bank data is available at http://data.worldbank.org/
Hilary Mason's research-quality datasets are available at https://bitly.com/bundles/hmason/1
-
Text Files
Text files are commonly used for data storage because they are easy to transform into different formats, and it is often easier to recover a damaged file and continue processing the remaining contents than with other formats.
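The recovery property mentioned above follows from text files being one record per line: a damaged line can simply be skipped. A small sketch (the pipe-delimited record format here is an illustrative assumption, not a standard):

```python
# Text files store one record per line, so a damaged line can be
# skipped and processing continues with the rest of the file.
records = [
    "1|sensor-a|23.5",
    "2|sensor-b|??corrupted??",
    "3|sensor-c|24.1",
]

parsed = []
for line in records:
    ident, name, value = line.split("|")
    try:
        parsed.append((int(ident), name, float(value)))
    except ValueError:
        continue  # recover: skip the bad record, keep going

print(parsed)  # [(1, 'sensor-a', 23.5), (3, 'sensor-c', 24.1)]
```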
-
SQL Database
-
NoSQL Database
Document Store
MongoDB (http://www.mongodb.com), CouchDB (http://couchdb.apache.org/)
Key value store
Apache Cassandra, Dynamo, HBase, Amazon SimpleDB
Graph-based store
Neo4j, InfoGrid, Horton
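The three NoSQL data models listed above can be sketched with plain Python structures. This is only a toy illustration of the data shapes; real systems such as MongoDB, Cassandra, and Neo4j add distribution, indexing, and query languages on top.

```python
# Document store: a collection of self-describing, schema-free documents.
documents = {
    "user:1": {"name": "Andi", "tags": ["big-data", "hadoop"]},
}

# Key-value store: opaque values looked up by a single key.
kv = {"session:abc": b"serialized-session-bytes"}

# Graph store: nodes plus explicit edges (relationships).
nodes = {"andi": {"type": "person"}, "telkom": {"type": "university"}}
edges = [("andi", "WORKS_AT", "telkom")]

print(documents["user:1"]["name"])           # document lookup by id
print(kv["session:abc"])                     # value lookup by key
print([e for e in edges if e[0] == "andi"])  # traverse outgoing edges
```

The access pattern is what distinguishes them: documents are queried by content, key-value pairs only by key, and graphs by following relationships.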
-
Document Store
-
Key Value Store
-
Graph Store
-
Leading Technologies
Relational databases have failed to store and process Big Data at scale.
As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.
These technologies include: Hadoop, MapReduce, and NoSQL.
-
Hadoop
Open-source framework
Java-based programming framework
Processes and stores large datasets
Distributed computing environment
Components: HDFS, MapReduce
-
Hadoop vs. SQL
Hadoop: data is stored as compressed files across n commodity servers.
SQL: data is stored as tables and columns with relations between them.
Hadoop: fault tolerant; if one node fails, the system still works.
SQL: if any one node crashes, it raises an error so as to maintain consistency.
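Hadoop's fault tolerance comes from replicating each data block on several nodes, so losing one node loses no data. A hedged single-machine sketch (node names and the replication factor of 3 are illustrative; HDFS's default replication factor is also 3):

```python
import random

# Each block is stored on REPLICATION distinct nodes out of a cluster of 5.
nodes = {f"node{i}": set() for i in range(5)}
REPLICATION = 3

def store(block_id):
    """Place one block on REPLICATION randomly chosen nodes."""
    for node in random.sample(sorted(nodes), REPLICATION):
        nodes[node].add(block_id)

for block in ["blk-1", "blk-2", "blk-3"]:
    store(block)

# Simulate one node crashing: drop it entirely.
nodes.pop("node0")

# Every block is still reachable from the surviving replicas.
surviving = set().union(*nodes.values())
assert {"blk-1", "blk-2", "blk-3"} <= surviving
print("all blocks survived the crash")
```

With 3 replicas per block, any single-node failure leaves at least 2 live copies, which is why the slide can claim the system still works after a node fails.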
-
Map Reduce
A programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
Hadoop provides the physical implementation of MapReduce.
It is a combination of two Java functions: Mapper() and Reducer().
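The Mapper/Reducer split described above can be sketched on a single machine. This is a minimal word-count illustration of the model, not Hadoop's Java API: the mapper emits (key, value) pairs, a shuffle/sort step groups them by key, and the reducer folds each group.

```python
from itertools import groupby

def mapper(line):
    """Map phase: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all counts for one word."""
    return (word, sum(counts))

lines = ["big data needs big tools", "hadoop processes big data"]

# Map: run the mapper over every input line.
pairs = [pair for line in lines for pair in mapper(line)]
# Shuffle/sort: group the pairs by key (Hadoop does this across machines).
pairs.sort(key=lambda p: p[0])
# Reduce: fold each key group into a single result.
result = dict(
    reducer(word, (c for _, c in group))
    for word, group in groupby(pairs, key=lambda p: p[0])
)
print(result["big"])  # 3
```

Because every map task and every reduce task is independent, Hadoop can schedule them on different machines, which is what makes the model scale.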
-
Map Reduce Algorithm