04/11/2023 Pham Thai Hoa
Hadoop Presentation 2012
Presenter: Pham Thai Hoa
Email: [email protected]
http://mobion.com/hoa
Topics
- Introduction to Hadoop
- Introduction to Hive
- Introduction to Logger
- Using Hadoop at Mobion
- Warehouse at Mobion
- Q&A
What is Hadoop
- A framework for distributed processing
- Inspired by Google's architecture: MapReduce and GFS
- A top-level Apache project
- Open source
- Has two important elements:
  + MapReduce core
  + Hadoop Distributed File System (HDFS)
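The MapReduce model named above can be illustrated with a minimal in-memory sketch. This is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only to show the three steps of a word-count job on a single machine.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

In a real cluster the map and reduce steps run on many nodes in parallel and the shuffle moves data over the network; the sketch only shows the data flow.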
Why use Hadoop
- Fault-tolerant hardware is expensive
- Hadoop is designed to run on cheap commodity hardware
- It automatically handles data replication and node failure
- It does the hard work, so you can focus on processing data
- It supports three modes: Local, Pseudo-Distributed, and Fully-Distributed
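To make "data replication and node failure" concrete, here is a toy sketch of the idea. It is an assumption-laden illustration, not HDFS's real placement policy: blocks are placed round-robin, and the helper names (`place_blocks`, `handle_node_failure`) are hypothetical.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement


def handle_node_failure(placement, failed, live_nodes, replication=3):
    """Drop the failed node, then copy under-replicated blocks to other live nodes."""
    for holders in placement.values():
        holders.discard(failed)
        candidates = [node for node in live_nodes if node not in holders]
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
    return placement


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_blocks(["b1", "b2"], nodes)
    placement = handle_node_failure(placement, "n2", ["n1", "n3", "n4"])
    for block, holders in placement.items():
        print(block, sorted(holders))
```

The point of the sketch is that losing one node never loses a block, and the system restores the replica count on its own without operator action.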
Data Flow into Hadoop
Who uses Hadoop
- Amazon: builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools
- Yahoo: more than 100,000 CPUs in >40,000 computers running Hadoop
- Facebook: uses Hadoop to store copies of internal log and dimension data sources, and as a source for reporting/analytics and machine learning
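The streaming API mentioned for Amazon lets any program act as a mapper or reducer by reading lines and writing tab-separated key/value lines. The sketch below shows that contract with two hypothetical functions, `mapper` and `reducer`; in an actual streaming job each would be a separate script reading standard input, and the framework would perform the sort that `sorted()` stands in for here.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per key; assumes input lines arrive sorted by key."""
    keyed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    sample = ["search index build", "index update"]
    # sorted() is a local stand-in for the framework's sort between map and reduce.
    for out in reducer(sorted(mapper(sample))):
        print(out)
```

Because the contract is only lines in and lines out, pre-existing tools in other languages can be reused without being rewritten in Java.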
What is Hive
- Hive is a data warehouse system for Hadoop
- Uses MapReduce for execution
- Uses HDFS for storage
- Stores metadata in an RDBMS
- Scalability and performance
- Interoperability
- Uses a SQL-like language called HiveQL
Data Flow into Hive
Hive Data Model
- Tables
  + Typed columns (int, float, string, …)
  + Also array/map/struct for JSON-like data
- Partitions
  + e.g., to range-partition tables by date
- Buckets
  + Hash partitions within ranges (useful for sampling, join optimization)
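How a row maps to a partition and a bucket can be sketched as follows. The table root, the `dt` and `user_id` column names, and the bucket count are assumptions for illustration, and `zlib.crc32` is only a deterministic stand-in: Hive uses its own hash function.

```python
import zlib


def partition_dir(table_root, dt):
    """Partitions map to one subdirectory per partition value, e.g. .../dt=2012-01-01."""
    return f"{table_root}/dt={dt}"


def bucket_index(user_id, num_buckets):
    """Buckets split a partition by hash of a column (crc32 used as a stand-in hash)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def locate(table_root, row, num_buckets=4):
    """Return the (partition directory, bucket number) a row would land in."""
    return partition_dir(table_root, row["dt"]), bucket_index(row["user_id"], num_buckets)


if __name__ == "__main__":
    row = {"dt": "2012-01-01", "user_id": "u42"}
    print(locate("/warehouse/logs", row))
```

A query filtered on the date only has to read one directory, and sampling can read a single bucket instead of the whole partition.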
Hive Metastore
- Database: a namespace containing a set of tables
- Holds table/partition definitions (column types, mappings to HDFS directories)
- Statistics
- Implemented with the DataNucleus ORM; runs on Derby, MySQL, and many other relational databases
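The kind of information the metastore holds can be pictured with a small data-structure sketch. The class and field names below are hypothetical and do not reflect the real metastore schema; they only mirror the bullets above (a database as a namespace, tables with column types, and mappings to HDFS directories).

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: dict                 # column name -> type name
    location: str                 # HDFS directory holding the table's data
    partitions: dict = field(default_factory=dict)  # partition spec -> HDFS directory

    def add_partition(self, spec):
        """Register a partition and return the directory it maps to."""
        self.partitions[spec] = f"{self.location}/{spec}"
        return self.partitions[spec]


@dataclass
class Database:
    name: str
    tables: dict = field(default_factory=dict)      # table name -> Table

    def create_table(self, name, columns, location):
        """Add a table definition to this namespace."""
        self.tables[name] = Table(name, columns, location)
        return self.tables[name]


if __name__ == "__main__":
    db = Database("default")
    logs = db.create_table("logs", {"user_id": "string", "event": "string"}, "/warehouse/logs")
    print(logs.add_partition("dt=2012-01-01"))
```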
Introduction to Logger
- A logging system has three broad components:
  + Client Code Interface
  + Distribution System
  + Do Something Usefullizer
- Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures.
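One way to picture robustness to network and node failures is a store-and-forward buffer: messages that cannot be delivered are kept locally and resent later. The sketch below is an assumption about the general technique, not Scribe's code or API; `BufferedForwarder` and its `send` callback are hypothetical names.

```python
from collections import deque


class BufferedForwarder:
    """Queue (category, message) pairs and forward them when the downstream is reachable."""

    def __init__(self, send):
        # `send` is a callable(category, message) that raises ConnectionError on failure.
        self.send = send
        self.pending = deque()

    def log(self, category, message):
        """Accept a message from client code and try to deliver it right away."""
        self.pending.append((category, message))
        self.flush()

    def flush(self):
        """Deliver queued messages in order; stop at the first failure and keep the rest."""
        while self.pending:
            category, message = self.pending[0]
            try:
                self.send(category, message)
            except ConnectionError:
                return False
            self.pending.popleft()
        return True


if __name__ == "__main__":
    forwarder = BufferedForwarder(lambda category, message: print(category, message))
    forwarder.log("web", "GET /index")
```

The client code never blocks on a failed downstream server; it keeps logging, and the queue drains in order once the connection returns.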
Why use Scribe
- Scalability and performance
- Event notification library
- Thrift framework
- Hadoop is optional
- Client usage
- Distributed Scribe system
- Over 1 million messages per second for logging
- Hierarchical stores
Warehouse at Mobion
- Log Collector
- Log/Data Transformer
- Data Analyzer
- Web Reporter
- Define logs
- Integrate logs (into applications)
- Analyze logs/data
- Develop reports (API, Mobion, Music …)
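The four components listed above (collector, transformer, analyzer, reporter) can be sketched as a chain of functions. Everything specific here is an assumption for illustration: the tab-separated `timestamp, app, event` log format and the function names are hypothetical and are not taken from Mobion's system.

```python
from collections import Counter


def collect(raw_lines):
    """Log Collector: gather raw lines and drop empty ones."""
    return [line.strip() for line in raw_lines if line.strip()]


def transform(lines):
    """Log/Data Transformer: parse 'timestamp<TAB>app<TAB>event' into records."""
    records = []
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        ts, app, event = parts
        records.append({"ts": ts, "app": app, "event": event})
    return records


def analyze(records):
    """Data Analyzer: count events per (app, event) pair."""
    return Counter((r["app"], r["event"]) for r in records)


def report(counts):
    """Web Reporter: render counts as sorted tab-separated rows."""
    return [f"{app}\t{event}\t{n}" for (app, event), n in sorted(counts.items())]


if __name__ == "__main__":
    raw = ["2012-01-01T00:00:00\tmusic\tplay", "2012-01-01T00:00:09\tchat\tsend"]
    print("\n".join(report(analyze(transform(collect(raw))))))
```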
Warehouse at Mobion
- Data mining
- Music recommendation
- Spam detection
- Application performance
- Export data and import it into MySQL for web reports
- Analytics system
Q&A
- Why use Hadoop?
- Why use Hive?
- Why do we need a logging system?
- What is the warehouse system architecture?
- Do we use these systems for voting, chat, messages, and feeds?
- How can we use them for recommendations and suggestions?
Reference Links
- http://facebook.com
- http://highscalability.com/product-scribe-facebooks-scalable-logging-system
- http://hadoop.apache.org/
- http://hive.apache.org/
- http://wiki.apache.org/hadoop/PoweredBy
- http://www.apache.org/foundation/thanks.html
THANK YOU