JavaOne 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Open Source Big Data in OPC

Edelweiss Kammermann, Frank Munz

JavaOne 2017

munz & more #2

About Me

→ Computer Engineer, BI and Data Integration Specialist
→ Over 20 years of consulting and project management experience in Oracle technology
→ Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
→ Director of Community of LAOUC
→ Head of BI Team CMS at ITConvergence
→ Writer and frequent speaker at international conferences: Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum
→ Oracle ACE Director

Uruguay

Dr. Frank Munz

• Founded munz & more in 2007
• 17 years of Oracle Middleware, Cloud, and Distributed Computing
• Consulting and High-End Training
• Wrote two Oracle WLS books and one Cloud book

Hadoop

What is Big Data?

→ Volume: the large amount of data
→ Variety: the wide range of data formats and schemas, including unstructured and semi-structured data
→ Velocity: the speed at which data is created or consumed
→ Oracle adds a fourth V to this definition:
→ Value: data has intrinsic value, but it must be discovered

What is Oracle Big Data Cloud Compute Edition?

→ Big Data platform that integrates the Oracle Big Data solution with open source tools
→ Fully elastic
→ Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service
→ Access, data, and network security
→ REST access to all the functionality

Big Data Cloud Service – Compute Edition (BDCS-CE)

BDCS-CE Notebook: Interactive Analysis

→ Apache Zeppelin Notebook (version 0.7) to work with data interactively

What is Hadoop?

→ An open source software platform for distributed storage and processing
→ Manages huge volumes of unstructured data
→ Parallel processing of large data sets
→ Highly scalable
→ Fault-tolerant
→ Two main components:
   → HDFS: Hadoop Distributed File System, for storing information
   → MapReduce: programming framework that processes information

Hadoop Components: HDFS

→ Stores the data on the cluster
→ NameNode: the block registry
→ DataNode: holds the blocks themselves

HDFS cartoon by Mvarshney
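The split between a block registry and block containers can be illustrated with a toy model. This is only a sketch in plain Python, not HDFS code: the block size, the replication factor, the node names, and the round-robin placement are all assumptions made for the illustration.

```python
BLOCK_SIZE = 8    # bytes per block in this toy model (an assumption; real HDFS blocks are far larger)
REPLICATION = 2   # hypothetical replication factor for the sketch


def store_file(name, data, datanodes, namenode):
    """Split data into blocks, copy each block to REPLICATION DataNodes,
    and record the block locations in the NameNode registry."""
    node_ids = sorted(datanodes)
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namenode[name] = []
    for index, block in enumerate(blocks):
        block_id = f"{name}#{index}"
        # Simple round-robin placement (an assumption, not the HDFS policy).
        targets = [node_ids[(index + r) % len(node_ids)] for r in range(REPLICATION)]
        for node in targets:
            datanodes[node][block_id] = block        # DataNodes hold the bytes
        namenode[name].append((block_id, targets))   # NameNode holds only metadata


def read_file(name, datanodes, namenode):
    """Reassemble a file by asking the registry where each block lives."""
    parts = [datanodes[targets[0]][block_id] for block_id, targets in namenode[name]]
    return b"".join(parts)


datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}
namenode = {}
payload = b"open source big data in the cloud"
store_file("talk.txt", payload, datanodes, namenode)
restored = read_file("talk.txt", datanodes, namenode)
```

The registry never stores file contents, which is why the NameNode can stay small while the DataNodes scale out.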

Hadoop Components: MapReduce

→ Retrieves data from HDFS
→ A MapReduce program is composed of:
   → Map() method: performs filtering and sorting of the <key, value> inputs
   → Reduce() method: summarizes the <key, value> pairs provided by the Mappers
→ Code can be written in many languages (Perl, Python, Java, etc.)
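The Map() and Reduce() roles can be sketched as a word count in plain Python. This is a single-process illustration under stated assumptions, not Hadoop code: the function names are hypothetical, and the grouping step that a real framework performs between the two phases is modelled by a small shuffle() helper.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key and sort the keys, as the framework would
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reducer: summarize all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data in the cloud", "open source big data"])
```

Because each mapper call sees only one line and each reducer call sees only one key, both phases can be spread over many machines.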

MapReduce Example

Code Example

Hive

What is Hive?

→ An open source data warehouse software on top of Apache Hadoop
→ Analyzes and queries data stored in HDFS
→ Structures the data into tables
→ Tools for simple ETL
→ SQL-like queries (HiveQL)
→ Procedural language with HPL/SQL
→ Metadata storage in an RDBMS
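The idea of structuring data into a table and querying it with SQL-like statements can be tried locally. The sketch below is an assumption-heavy stand-in: it uses Python's built-in sqlite3 module instead of Hive so that it runs without a cluster, and the table name and data are hypothetical. The query itself uses only SELECT, COUNT, GROUP BY, ORDER BY, and LIMIT, which HiveQL offers as well.

```python
import sqlite3

# A query of the kind one would write in HiveQL (table and column names are hypothetical).
QUERY = """
SELECT word, COUNT(*) AS occurrences
FROM words
GROUP BY word
ORDER BY occurrences DESC, word
LIMIT 3
"""


def top_words(words):
    """Load words into an in-memory table and run the aggregate query.
    sqlite3 stands in for Hive here purely so the example is runnable."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


top = top_words("big data in the cloud open source big data big".split())
```

In Hive the same statement would be compiled into distributed jobs over files in HDFS rather than executed against a local database file.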

Hadoop & Hive Demo

Revisited: Map Reduce I/O

Source: Hadoop Application Architecture book

Spark

• Orders of magnitude faster than M/R
• Higher-level Scala, Java, or Python API
• Runs standalone, on Hadoop, or on Mesos
• Principle: run an operation on all data -> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, DataFrames, or Datasets

https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Resilient Distributed Datasets

Where do they come from?

• Read in from HDFS, local FS, S3, HBase
• Parallelize an existing collection
• Transform another RDD -> RDDs are immutable

DataFrame: a collection of data grouped into named columns. Supports text, JSON, Apache Parquet, and sequence files.

Lazy Evaluation


Transformations (nothing is executed): map(), flatMap(), reduceByKey(), groupByKey()

Actions (trigger execution): collect(), count(), first(), takeOrdered(), saveAsTextFile(), …
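The difference between transformations and actions can be made visible with a toy class. This is a minimal sketch in plain Python, not the Spark implementation: LazyDataset, flat_map, and the shout helper are hypothetical names, and only two transformations and two actions are modelled.

```python
from itertools import chain


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)

    # Transformations: return a new, immutable plan and execute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def flat_map(self, func):
        return LazyDataset(self._source, self._steps + (("flat_map", func),))

    def _run(self):
        """Chain the recorded steps into one lazy pipeline over the source."""
        items = iter(self._source)
        for kind, func in self._steps:
            mapped = map(func, items)
            items = chain.from_iterable(mapped) if kind == "flat_map" else mapped
        return items

    # Actions: walk the pipeline and produce a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def shout(word):
    calls.append(word)   # record every real invocation
    return word.upper()


lines = LazyDataset(["big data", "in the cloud"])
words = lines.flat_map(str.split).map(shout)
executed_before_action = len(calls)   # nothing has run yet, only a plan exists
result = words.collect()              # the action triggers execution
executed_after_action = len(calls)
```

Deferring the work lets an engine see the whole chain before running it, which is what makes pipelining and recomputation of lost partitions possible.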

http://spark.apache.org/docs/2.1.1/programming-guide.html

Transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
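The semantics of these four transformations can be mimicked on ordinary Python lists. The helpers below are hypothetical, single-machine equivalents written only to show what each operation returns; they are not Spark APIs, and the sorted output order is an assumption made for reproducibility (Spark gives no such ordering guarantee).

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def rdd_map(func, data):
    """One output element per input element."""
    return [func(item) for item in data]


def flat_map(func, data):
    """Zero or more output elements per input element, flattened into one list."""
    return list(chain.from_iterable(func(item) for item in data))


def group_by_key(pairs):
    """(K, V) pairs -> (K, [V, ...]) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_by_key(func, pairs):
    """(K, V) pairs -> (K, V) pairs, folding the values of each key with func."""
    return [(key, reduce(func, values)) for key, values in group_by_key(pairs)]


pairs = [("a", 1), ("b", 2), ("a", 3)]
lengths = rdd_map(len, ["a b", "c"])
tokens = flat_map(str.split, ["a b", "c"])
grouped = group_by_key(pairs)
summed = reduce_by_key(lambda x, y: x + y, pairs)
```

Note how reduce_by_key needs only a (V, V) => V function, so partial results can be combined before data is moved between machines, whereas group_by_key has to ship every value.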

Spark Demo


Apache Zeppelin Notebook


Word Count and Histogram


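# t: an RDD of text lines. Split each line into words, emit (word, 1), sum the counts per word.
res = t.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Five most frequent words: takeOrdered sorts ascending, so negate the count.
res.takeOrdered(5, key=lambda x: -x[1])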
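What the negated key does in the top-N step can be checked without a cluster. The following is a local sketch with hypothetical helper names and sample text: it counts words with collections.Counter, selects the top entries with heapq.nsmallest (which, like takeOrdered, returns the smallest elements by key), and renders a simple text histogram.

```python
import heapq
from collections import Counter


def word_counts(lines):
    """Local equivalent of the flatMap / map / reduceByKey chain."""
    return Counter(word for line in lines for word in line.split(" "))


def take_ordered(pairs, n, key):
    """Return the n smallest elements according to key."""
    return heapq.nsmallest(n, pairs, key=key)


def histogram(pairs):
    """One text bar per (word, count) pair."""
    return [f"{word:<6}{'#' * count}" for word, count in pairs]


counts = word_counts(["to be or not to be", "to do or not"])
# Negating the count turns "smallest first" into "most frequent first".
top = take_ordered(counts.items(), 2, key=lambda x: -x[1])
bars = histogram(top)
```

In a Zeppelin notebook the histogram would typically come from the built-in chart display rather than from text bars.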

Zeppelin Notebooks


Big Data Compute Service CE


Kafka

Partitioned, replicated commit log


[Figure: immutable log of messages with offsets 0 1 2 3 4 … n; a Producer appends to the log, Consumer A and Consumer B read from it]

https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
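The immutable log with offsets can be modelled in a few lines. This is a conceptual sketch in plain Python, not the Kafka client API: CommitLog, Consumer, and poll are hypothetical names, and everything lives in memory.

```python
class CommitLog:
    """Append-only list of messages; the list index is the offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Producer side: add a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages):
        """Return messages starting at offset; the log itself is unchanged."""
        return self._messages[offset:offset + max_messages]

    def __len__(self):
        return len(self._messages)


class Consumer:
    """Each consumer tracks its own position in the shared log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = CommitLog()
offsets = [log.append(f"order-{i}") for i in range(5)]

consumer_a = Consumer(log)
consumer_b = Consumer(log)
first_batch_a = consumer_a.poll(max_messages=2)   # A has read offsets 0 and 1
all_for_b = consumer_b.poll()                     # B reads everything independently
```

Reading does not remove anything, so a slow consumer and a fast consumer can share the same log, and a consumer can rewind by resetting its offset.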

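[Figure: Broker 1, Broker 2, and Broker 3. Topic A is split into partitions A(1), A(2), A(3), each with a leader (Partition/Leader) and a replica Repl A(1), Repl A(2), Repl A(3) (Replication/Follower). A Producer writes to the partitions. ZooKeeper holds state / HA.]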
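How partitions, leaders, and followers could be spread over three brokers can be sketched as follows. This is an illustration under stated assumptions, not Kafka's actual logic: the round-robin replica assignment is a simplification, zlib.crc32 stands in for the hash used by Kafka's default partitioner, and the broker names are hypothetical.

```python
import zlib

BROKERS = ["broker1", "broker2", "broker3"]


def partition_for(key, num_partitions):
    """Map a message key to a partition. crc32 is a stand-in hash (an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin layout: the first replica of each partition acts as leader,
    the remaining replicas are followers on other brokers."""
    layout = {}
    for partition in range(num_partitions):
        replicas = [brokers[(partition + i) % len(brokers)] for i in range(replication_factor)]
        layout[partition] = {"leader": replicas[0], "followers": replicas[1:]}
    return layout


layout = assign_replicas(num_partitions=3, brokers=BROKERS, replication_factor=2)
target = partition_for("customer-42", num_partitions=3)
```

With this layout every broker leads one partition and follows another, so losing a single broker leaves a copy of each partition available.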

Example for Stream/Table Duality

https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/

- 1 topic
- 1 partition
- Contains every article published since 1851
- Multiple producers / consumers

Kafka Clients

SDKs (high/low level Kafka API)
- OOTB: Java, Scala
- Confluent: Python, C, C++

Connect (configuration only; integrate external systems)
- Confluent: HDFS sink, JDBC source, S3 sink, Elasticsearch sink
- Plugin .jar file
- JDBC: change data capture (CDC)

Streams (data in motion; stream/table duality)
- Real-time data ingestion
- Microservices
- KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring
- .jar file runs anywhere

Lightweight
- Language agnostic
- Easy for mobile apps
- Easy to tunnel through FW etc.

Oracle Event Hub Cloud Service

• PaaS: Managed Kafka 0.10.2

• Two deployment modes

– Basic (Broker and ZK on 1 node)

– Recommended (distributed)

• REST Proxy

– Separate server(s) running REST Proxy


Event Hub


Event Hub Service


You must open ports to allow access for external clients

• Kafka Broker (from OPC connect string)

• ZooKeeper on port 2181


Scaling


horizontal / vertical (up)

Event Hub REST Interface


https://129.151.91.31:1080/restproxy/topics/a12345orderTopic

Service = Topic

Interesting to Know

• Event Hub topics are prefixed with ID domain

• With the Kafka CLI, topics with the ID domain prefix can be created

• Topics without ID domain are not shown in OPC console
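The prefixing rule and the REST endpoint from the slides can be combined into a small request builder. This is a hedged sketch: the payload shape and the Content-Type follow Confluent REST Proxy conventions and are assumptions here, not confirmed Event Hub documentation; the split of a12345orderTopic into an ID domain "a12345" and a service name "orderTopic" is inferred from the slides; the helper names are hypothetical. The request is only built, never sent.

```python
import json
import urllib.request

# Base URL taken from the slide; the example does not contact it.
BASE_URL = "https://129.151.91.31:1080/restproxy"


def topic_name(id_domain, service_name):
    """Event Hub topics are prefixed with the ID domain."""
    return f"{id_domain}{service_name}"


def build_produce_request(base_url, topic, values):
    """Build (but do not send) a POST that would publish values to a topic.
    Payload shape and media type are assumptions based on the Confluent REST Proxy."""
    body = json.dumps({"records": [{"value": value} for value in values]})
    return urllib.request.Request(
        url=f"{base_url}/topics/{topic}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


topic = topic_name("a12345", "orderTopic")
request = build_produce_request(BASE_URL, topic, [{"orderId": 1}])
```

Sending the request would additionally need credentials and certificate handling for the service instance, which are left out of this sketch.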

Conclusion

TL;DR #bigData #openSource #OPC

Open Source: entry point to Oracle Big Data world / Low(er) setup times / Check for resource usage & limits in Big Data OPC / BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka / Attend a hands-on workshop! / Next level: Oracle Big Data tools

@EdelweissK @FrankMunz

www.linkedin.com/in/frankmunz/ www.munzandmore.com/blog

facebook.com/cloudcomputingbook facebook.com/weblogicbook

@frankmunz

youtube.com/weblogicbook

-> more than 50 web casts

Don’t be

email: ekammermann@itconvergence.com

Twitter: @EdelweissK

500+ Technical Experts Helping Peers Globally

3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate

bit.ly/OracleACEProgram

Nominate yourself or someone you know: acenomination.oracle.com

Connect:
@oracleace
Facebook.com/oracleaces
oracle-ace_ww@oracle.com

Sign up for Free Trial

http://cloud.oracle.com
