Hello and welcome to this online, self-paced course titled Administering and Managing the
Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled
Using Oracle R Advanced Analytics for Hadoop (ORAAH). My name is Lauran Serhal. I
am a curriculum developer at Oracle and I have helped educate customers on Oracle
products since 1995. I'll be guiding you through this course, which consists of lectures,
demos, and review sessions.
The goal of this lesson is to describe Oracle R Advanced Analytics for Hadoop (ORAAH) and
identify the benefits of using simple R functions.
Using Oracle R Advanced Analytics for Hadoop - 1
Introduction
Before we begin, take a look at some of the features of this course player. If you’ve viewed a
similar self-paced course in the past, feel free to skip this slide.
Menu
This is the Menu tab. It’s set up to automatically progress through the course in a linear
fashion, but you can also review the material in any order. Just click a slide title in the outline
to display its contents.
Notes
Click the Notes tab to view the audio transcript for each slide.
Search
Use the Search field to find specific information in the course.
Player Controls
Use these controls to pause, play, or move to the previous or next slide. Use the interactive
progress bar to fast forward or rewind the current slide. Some interactive slides in this course
may contain additional navigation and controls. The view for certain slides may change so
that you can see additional details.
Resources (Optional)
Click the Resources button to access any attachments associated with this course.
Glossary (Optional)
Click the Glossary button to view key terms and their definitions.
Using Oracle R Advanced Analytics for Hadoop - 2
So, you know the title of the course, but you may be asking yourself, “Is this the right course
for me?” Click the bars to learn about the course objectives, target audience, and
prerequisites.
Using Oracle R Advanced Analytics for Hadoop - 3
What can you expect to get out of this course? Here are the core learning objectives.
After completing this course, you should be able to do the following:
1. Define the Hadoop ecosystem and its components, including the Hadoop Distributed File
System (HDFS), MapReduce, Spark, YARN, and some other related projects.
2. Complete the BDA Site Checklists.
3. Run the Oracle BDA Configuration Utility.
4. Install the Oracle BDA Mammoth software on the Oracle BDA.
5. Secure data on the Oracle BDA.
6. Work with the Oracle Big Data Connectors.
Using Oracle R Advanced Analytics for Hadoop - 4
Who is this course for? Here is the intended audience.
• Application Developers
• Database Administrators
• Hadoop/Big Data Cluster Administrators
• Hadoop Programmers
Using Oracle R Advanced Analytics for Hadoop - 5
Before taking this course, you should have some exposure to Big Data, and optionally some
basic database knowledge.
Using Oracle R Advanced Analytics for Hadoop - 6
In this course, we cover the following lessons:
In the Introduction to the Hadoop Ecosystem lesson, you define the Hadoop ecosystem and describe the Hadoop core components and some related projects. You also learn about the components of HDFS and review MapReduce, Spark, and YARN.
In the Introduction to the Oracle BDA lesson, you identify the Oracle Big Data Appliance (BDA) and its hardware and software components.
In the Oracle BDA Pre-Installation Steps lesson, you learn how to download and complete the BDA Site Checklists. You also learn how to download and run the Oracle BDA Configuration Utility and then review the generated configuration files.
In the Working With Mammoth lesson, you learn how to download the Oracle BDA Mammoth Software Deployment Bundle from My Oracle Support. You also learn how to install a CDH or NoSQL cluster based on your specifications. You then learn how to install the Oracle BDA Mammoth Software Deployment Bundle using the Mammoth utility.
In the Securing the Oracle BDA lesson, you learn how to secure data on the Oracle Big Data Appliance.
In the Working With the Oracle Big Data Connectors lessons, you learn how to use Oracle SQL Connector for Hadoop Distributed File System, Oracle Loader for Hadoop, Oracle Data Integrator, Oracle XQuery for Hadoop, and Oracle R Advanced Analytics for Hadoop.
Using Oracle R Advanced Analytics for Hadoop - 7
Now that you’ve learned about the other Oracle Big Data Connectors, let’s take a look at
the Oracle R Advanced Analytics for Hadoop (ORAAH) Big Data Connector.
After completing this lesson, you should be able to:
• Describe Oracle Advanced Analytics, Oracle Data Mining, and Oracle R Enterprise at a
high level
• Describe Oracle R Advanced Analytics for Hadoop (ORAAH) and identify the benefits of
using simple R functions
Let's get started.
Using Oracle R Advanced Analytics for Hadoop - 8
Oracle SQL Connector for Hadoop Distributed File System (previously Oracle Direct Connector
for HDFS): Enables an Oracle external table to access data stored in Hadoop Distributed File System
(HDFS) files or a table in Apache Hive. The data can remain in HDFS or the Hive table, or it can be
loaded into an Oracle database.
Oracle Loader for Hadoop: Provides an efficient and high-performance loader for fast movement of
data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop pre-partitions
the data if necessary and transforms it into a database-ready format. It optionally sorts records by
primary key or user-defined columns before loading the data or creating output files.
Oracle Data Integrator: Extracts, loads, and transforms data from sources such as files and databases
into Hadoop and from Hadoop into Oracle or third-party databases. Oracle Data Integrator provides a
graphical user interface to utilize the native Hadoop tools and transformation engines such as Hive,
HBase, Sqoop, Oracle Loader for Hadoop, and Oracle SQL Connector for HDFS.
Oracle XQuery for Hadoop (OXH): Runs transformations expressed in the XQuery language by
translating them into a series of MapReduce jobs, which are executed in parallel on the Hadoop cluster.
Oracle R Advanced Analytics for Hadoop (ORAAH): Provides a general computation framework, in
which you can use the R language to write your custom logic as mappers or reducers.
OSCH, OLH, ODI, and OXH are covered in two other lessons. ORAAH is covered in this lesson.
Using Oracle R Advanced Analytics for Hadoop - 9
ORAAH:
• Is a collection of R packages that enables Big Data analytics from an R environment.
• Enables a data scientist or analyst to:
- Work on data from multiple platforms from the R environment
- Benefit from the R ecosystem while leveraging the advantages of the massively
distributed Hadoop computational infrastructure
• Leverages the cluster compute infrastructure for parallel distributed computation while
shielding the R user from Hadoop’s complexity through a small number of easy-to-use
APIs.
Note: For a complete description of the ORAAH features, refer to the latest ORAAH 2.5.0
Release Notes at:
http://download.oracle.com/otn/other/bigdata/ORAAH-2.5.0-ReleaseNotes.pdf
Using Oracle R Advanced Analytics for Hadoop - 10
Before we get started, to use Oracle R Advanced Analytics for Hadoop, you should be familiar
with the following:
• MapReduce and Spark programming concepts
• R programming
• Statistical methods
Using Oracle R Advanced Analytics for Hadoop - 11
Oracle Advanced Analytics (OAA) is an Option to Oracle Database Enterprise Edition, which
extends the database into a comprehensive advanced analytics platform for big data.
• The Oracle Advanced Analytics Option is a comprehensive advanced analytics platform
comprising Oracle Data Mining and Oracle R Enterprise.
• Oracle Data Mining delivers in-Database predictive analytics, data mining, and text
mining.
• Oracle R Enterprise enables R programmers to leverage the power and scalability of
Oracle Database, while also delivering R’s world-class statistical analytics, advanced
numerical computations, and interactive graphics.
With these two technologies, OAA brings powerful computations to the database, resulting in
dramatic improvements in information discovery, scalability, security, and savings.
Data analysts, data scientists, statistical programmers, application developers, and DBAs can
develop and automate sophisticated analytical methodologies inside the database and gain
competitive advantage by leveraging the OAA Option.
Using Oracle R Advanced Analytics for Hadoop - 12
ORAAH is a set of packages that provides an interface between the local R environment,
Oracle Database, and Cloudera Distribution for Hadoop. From the ORAAH interface, you can
use simple R to do the following:
• Execute Machine Learning algorithms directly on the Hadoop cluster using either
MapReduce or Spark
• Sample data in HDFS
• Copy data between Oracle Database and HDFS
• Schedule R programs to execute as MapReduce and Spark jobs
• Return the results to Oracle Database, HDFS, or your client
Notes:
• ORAAH is one of the Oracle Big Data Connectors. It is a separate product from ORE.
• ORAAH is available only as part of the Oracle Big Data Connectors.
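As an illustrative sketch of the first two tasks above, here is what copying data between the R session and HDFS might look like. This assumes a client where the ORAAH packages (loaded via the ORCH library) are installed and configured for the cluster; the data set name and the `dfs.name` argument are assumptions to verify against the ORAAH reference.

```r
library(ORCH)   # loads the ORAAH package collection (assumed installed)

df <- data.frame(id = 1:10, value = rnorm(10))

# Copy the R data frame into HDFS; hdfs.put() returns an HDFS object
# identifier that later ORAAH calls can reference.
dfs.id <- hdfs.put(df, dfs.name = "demo_data")

# Pull the data back from HDFS into local R memory as a data frame.
local.df <- hdfs.get(dfs.id)
head(local.df)
```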
Using Oracle R Advanced Analytics for Hadoop - 13
ORAAH includes a collection of R packages that provides:
• Interfaces to work with the:
- Apache Hive tables
- Apache Hadoop compute infrastructure
- Local R environment
- Oracle Database tables
• Predictive analytic techniques
- ORAAH implements the techniques in either Java or R as distributed, parallel
MapReduce jobs, thereby leveraging all nodes of your Hadoop cluster
Using Oracle R Advanced Analytics for Hadoop - 14
Using simple R functions, you can perform the following tasks:
• Access and transform HDFS data using a Hive-enabled transparency layer
• Use the R language for writing mappers and reducers
• Copy data between R memory, the local file system, HDFS, Hive, and Oracle Database
instances
• Manipulate Hive data transparently from R
• Execute R programs as Hadoop MapReduce jobs and return the results to any of those
locations
• Submit MapReduce jobs from R for both non-cluster (local) execution and Hadoop
cluster execution
We will provide some examples of the various ORAAH functions in this lesson.
Using Oracle R Advanced Analytics for Hadoop - 15
This architecture diagram depicts ORAAH as an interface between the local R environment,
Oracle Database, and Cloudera CDH.
This architecture enables:
• Expanded user population that can build models on Hadoop
• Accelerated rate at which business problems are tackled
• Analytics that scale with data volumes, variables, and techniques
• Transparent access to the Hadoop Cluster
• The ability to:
- Manipulate data in HDFS, database, and the file system—all from R
- Write and execute MapReduce jobs with R
- Leverage CRAN R packages to work on HDFS-resident data
- Run specialized algorithms (Logistic Regression and Neural Networks) from R
using a Spark in-memory context on data in HDFS, with up to 200x better
performance than the same algorithms in MapReduce
- Move from lab to production without requiring knowledge of Hadoop internals,
Hadoop CLI, or IT infrastructure
Using Oracle R Advanced Analytics for Hadoop - 16
ORAAH provides access from a local R client to Apache Hadoop using functions with these
prefixes:
• hadoop: Identifies functions that provide an interface to Hadoop MapReduce
• hdfs: Identifies functions that provide an interface to HDFS
• orch: Identifies a variety of functions; orch is a general prefix for ORCH functions
• ore: Identifies functions that provide an interface to a Hive data store
• spark: Identifies a variety of functions that provide an interface between the local R
instance and a Spark cluster
On the next few pages, we will look at some of the functions from the hadoop, hdfs, orch,
and ore categories.
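A quick sketch of this naming convention, with one call per prefix (arguments are elided or assumed; treat exact signatures as assumptions to check against the ORAAH reference):

```r
library(ORCH)

hdfs.pwd()                   # hdfs.*  : show the current HDFS directory
hdfs.ls()                    # hdfs.*  : list files in that directory
ore.connect(type = "HIVE")   # ore.*   : connect to a Hive data store
ore.ls()                     # ore.*   : list Hive tables visible to R
spark.connected()            # spark.* : TRUE if a Spark context is active
# hadoop.exec(), hadoop.run()  : hadoop.* MapReduce interface (shown later)
# orch.lm(), orch.keyval(), ...: orch.* general and analytic functions
```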
Using Oracle R Advanced Analytics for Hadoop - 17
This table describes functions that you use when creating and running MapReduce programs.
Using Oracle R Advanced Analytics for Hadoop - 18
ORAAH contains a variety of functions that enable an R user to interact with HDFS.
You can explore files in HDFS: change directories, list the files in a directory, show the
current directory, create new directories, move files, copy files, determine file sizes, and
sample files.
This table describes some of the functions that execute HDFS commands from within the R
environment.
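A brief sketch of navigating HDFS from R (the directory and file names are hypothetical, and argument details are assumptions):

```r
library(ORCH)

hdfs.pwd()                  # show the current HDFS working directory
hdfs.cd("/user/oracle")     # change to another HDFS directory
hdfs.ls()                   # list the files in the current directory
hdfs.mkdir("demo")          # create a new HDFS directory
hdfs.size("demo_data")      # report the size of an HDFS file
hdfs.sample("demo_data")    # read a small sample of the file's records
```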
Using Oracle R Advanced Analytics for Hadoop - 19
In addition, ORAAH contains a variety of functions that enable an R user to interact with an
HDFS system. You can interact with HDFS content in the ORAAH environment, including the
following: discovery (or creation) of metadata and easy access to in-memory R objects,
database objects, and local files.
This table describes the functions for copying data between platforms, including R data
frames, HDFS files, local files, and tables in an Oracle database.
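As a sketch of these copy operations (assuming a configured ORAAH client; the push/pull behavior described in the comments is an assumption based on the function descriptions):

```r
library(ORCH)

# R memory -> HDFS, and back again.
dfs.id <- hdfs.put(cars)     # built-in R data frame copied into HDFS
df     <- hdfs.get(dfs.id)   # copied back into local R memory

# With a database connection established via orch.connect(), hdfs.push()
# and hdfs.pull() move data between Oracle Database tables and HDFS;
# hdfs.upload() and hdfs.download() exchange data with the local file system.
```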
Using Oracle R Advanced Analytics for Hadoop - 20
ORAAH includes a few commands to connect to a distributed Spark in-memory cluster
(usually through YARN) and create a Spark context that can be used for machine learning
algorithms, in particular Multi-Layer Neural Networks (orch.neural) and Logistic
Regression (orch.glm2).
For an example of usage on a cluster with one billion records, including the detailed steps,
refer to the following link: https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for
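A minimal sketch of the Spark connection lifecycle (the master string and memory setting are illustrative assumptions, not verified defaults):

```r
library(ORCH)

# Create a Spark context through YARN for subsequent Spark-backed algorithms.
spark.connect(master = "yarn-client", memory = "2g")

spark.connected()    # verify that the Spark context is active

spark.disconnect()   # release the Spark context when finished
```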
Using Oracle R Advanced Analytics for Hadoop - 21
This table describes the functions for establishing a connection to Oracle Database.
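For illustration, a database connection might be established as follows; every connection detail here is a hypothetical placeholder:

```r
library(ORCH)

# Connect this R session to an Oracle Database instance so that data can be
# exchanged between database tables and HDFS.
con <- orch.connect(host = "dbhost.example.com", user = "scott",
                    sid = "orcl", passwd = "tiger")

orch.disconnect(con)   # close the connection when finished
```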
Using Oracle R Advanced Analytics for Hadoop - 22
ORAAH also includes functions that expose statistical algorithms through the orch API.
These predictive algorithms are available for execution on a Hadoop cluster.
This table describes the native analytic functions.
Using Oracle R Advanced Analytics for Hadoop - 23
You can connect to Hive and analyze and transform Hive table objects using R functions that have an ore prefix, such as ore.connect. If you are also using Oracle R Enterprise, then
you will recognize these functions. The ore functions in Oracle R Enterprise create and
manage objects in an Oracle database, and the ore functions in Oracle R Advanced Analytics
for Hadoop create and manage objects in a Hive database.
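A sketch of connecting to Hive from ORAAH ("sales" is a hypothetical table name; ore.sync() and ore.get() follow the Oracle R Enterprise pattern and are assumptions here):

```r
library(ORCH)

ore.connect(type = "HIVE")   # connect the R session to Hive

ore.sync()                   # map Hive tables into the R session
ore.ls()                     # list the available Hive tables

# The Hive table behaves like an R data frame; operations on it are
# translated to HiveQL and executed on the cluster.
sales <- ore.get("sales")
head(sales)
```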
Using Oracle R Advanced Analytics for Hadoop - 24
Without Oracle R Advanced Analytics for Hadoop, you would need Java skills to write
MapReduce programs in order to access the Hadoop data.
For example, you need the mapper and reducer programs shown on this page to perform a
simple word-count task on text data that is stored in Hadoop.
Using Oracle R Advanced Analytics for Hadoop - 25
With ORAAH, the same word-count task is performed with a simple R script. ORAAH
functions greatly simplify access to Hadoop data by leveraging the familiar R interface.
In this example, the R script uses several common ORAAH functions to perform the task, including hdfs.put(), hadoop.exec(), and orch.keyval().
The R script:
• Loads the R data into HDFS and creates a function named wordcount by using the
input data
• Specifies and invokes the MapReduce job with the hadoop.exec() function
• Splits words and outputs each word in the mapper step
• Sums the count of each word in the reducer step
• Specifies the job configuration and returns the MapReduce output as the result
With ORAAH, mapper and reducer functions can be written in R, greatly simplifying access to
Hadoop data.
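The steps above can be sketched as the following script (the corpus file name is hypothetical, and the exact argument names of hadoop.exec() are assumptions; the mapper/reducer pattern follows the ORAAH examples):

```r
library(ORCH)

corpus <- readLines("corpus.txt")   # hypothetical local text file
input  <- hdfs.put(corpus)          # load the text data into HDFS

res <- hadoop.exec(
  dfs.id  = input,
  mapper  = function(key, value) {
    # Split the line into words and emit a (word, 1) pair for each one.
    for (w in strsplit(value, "[[:space:]]+")[[1]])
      orch.keyval(w, 1)
  },
  reducer = function(key, values) {
    # Sum the counts emitted for each word.
    orch.keyval(key, sum(unlist(values)))
  }
)

hdfs.get(res)   # return the word counts to local R memory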
Using Oracle R Advanced Analytics for Hadoop - 26
Using ORAAH with Hive tables is transparent. You write R code as if it operated on an R
data frame, and ORAAH translates the commands into the corresponding HiveQL (Hive
Query Language) commands. The commands are executed across the entire cluster, and
only the results are returned to the end user, so ORAAH can operate on very large-scale data.
This example shows how to use the newly defined variable in any type of computation and
how to check the content of this new variable. The example also shows how to review the first
six records of the new variable.
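A sketch of what such a computation might look like, assuming a Hive connection is already open and using a hypothetical "sales" table with hypothetical column names:

```r
library(ORCH)

sales <- ore.get("sales")                    # hypothetical Hive table

sales$margin <- sales$revenue - sales$cost   # define a new derived variable;
                                             # translated to HiveQL on the cluster
class(sales$margin)                          # check the new variable's type
head(sales$margin)                           # review its first six records
```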
Using Oracle R Advanced Analytics for Hadoop - 27
In this example, ORAAH is used to launch open-source R algorithms and functions,
including plots, in MapReduce. The job returns an ORAAH packed object that can be
unpacked after execution to show the graphical contents to the user.
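A rough sketch of the pack/unpack pattern (the input data set and the histogram computation are illustrative assumptions; consult the ORAAH reference for the exact orch.pack()/orch.unpack() usage):

```r
library(ORCH)

# Inside the mapper, an arbitrary R object (here, histogram data produced by
# an open-source R function) is serialized with orch.pack() for transport.
res <- hadoop.exec(
  dfs.id = input,                 # a previously attached HDFS data set
  mapper = function(key, value) {
    h <- hist(as.numeric(value), plot = FALSE)
    orch.keyval(key, orch.pack(h))
  }
)

# Client side: retrieve the packed results with hdfs.get() and restore the
# R objects with orch.unpack() before plotting them locally.
out <- hdfs.get(res)
```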
Using Oracle R Advanced Analytics for Hadoop - 28
This example illustrates how to use ORAAH with a MapReduce Linear Regression algorithm.
It is as simple as attaching the file in HDFS (first line of code) and then invoking the
orch.lm() model with the formula (exactly as in R) and the data settings. The Mappers and
Reducers settings are optional.
The result is an R model object that can be queried, with summary() for example, to give
results exactly as expected if we were using the open-source R lm().
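A sketch of that pattern (the HDFS data set name and column names are hypothetical):

```r
library(ORCH)

# Attach an existing HDFS data set; "ontime_DB" is a hypothetical name.
dat <- hdfs.attach("ontime_DB")

# Fit a linear model on the cluster; the formula syntax matches base R lm().
fit <- orch.lm(ARRDELAY ~ DISTANCE + DEPDELAY, dfs.dat = dat)

summary(fit)   # query the model object exactly as you would with lm()
```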
Using Oracle R Advanced Analytics for Hadoop - 29
The new Spark-based algorithms can be executed in Spark by adding a single
spark.connect() call with the connection settings, and then invoking an algorithm that
supports Spark. Currently, the available algorithms are Logistic Regression (orch.glm2)
and Neural Networks (orch.neural). They require a formula and an attached HDFS file
as inputs, and run from 100x to 200x faster than their MapReduce counterparts, depending
on data volumes.
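A sketch of running one of these algorithms in Spark (connection settings, the HDFS data set name, and the column names are all assumptions):

```r
library(ORCH)

spark.connect(master = "yarn-client", memory = "4g")

dat <- hdfs.attach("ontime_DB")   # hypothetical HDFS data set

# Spark-backed Logistic Regression; the formula interface is the same as
# for the MapReduce algorithms.
fit <- orch.glm2(CANCELLED ~ DISTANCE + DEPDELAY, dfs.dat = dat)

spark.disconnect()
```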
You can find performance examples for these two algorithms at:
https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for
Using Oracle R Advanced Analytics for Hadoop - 30
In this lesson, you should have learned how to:
• Describe Oracle Advanced Analytics, Oracle Data Mining, and Oracle R Enterprise at a
high level
• Describe Oracle R Advanced Analytics for Hadoop (ORAAH) and identify the benefits of
using simple R functions
Using Oracle R Advanced Analytics for Hadoop - 31
In this course, we discussed the following lessons:
• Introduction to the Hadoop Ecosystem
• Introduction to the Oracle BDA
• Oracle BDA Pre-Installation Steps
• Working With Mammoth
• Securing the Oracle BDA
• Working With the Oracle Big Data Connectors:
- Oracle SQL Connector for Hadoop Distributed File System (HDFS)
- Oracle Loader for Hadoop (OLH)
- Oracle Data Integrator (ODI)
- Oracle XQuery for Hadoop (OXH)
- Oracle R Advanced Analytics for Hadoop (ORAAH)
Using Oracle R Advanced Analytics for Hadoop - 32
The Oracle Learning Library offers other self-paced courses about the Oracle Big Data
Appliance and other related topics. Visit the Oracle Learning Library to learn about the
courses.
Using Oracle R Advanced Analytics for Hadoop - 33
Oracle University offers In-Class courses about Oracle Big Data and other related topics.
Visit Oracle University to learn about the following courses:
• Oracle Big Data Fundamentals
• XML Fundamentals
• Oracle Database 12c: Use XML DB
• Oracle NoSQL Database for Developers
• Oracle NoSQL Database for Administrators
• Oracle R Enterprise Essentials
Using Oracle R Advanced Analytics for Hadoop - 34
The Oracle Learning Library offers many free demonstrations and tutorials.
And, of course, the Oracle Big Data Appliance documentation and online help embedded
within the product are also valuable resources.
Using Oracle R Advanced Analytics for Hadoop - 35
Using Oracle R Advanced Analytics for Hadoop - 36
Using Oracle R Advanced Analytics for Hadoop - 37