Atlanta R Users Group - hpcccdn.hpccsystems.com/presentations/r_usergroup_06222012.pdf
TRANSCRIPT
Risk Solutions
http://hpccsystems.com Welcome
Atlanta R Users Group HPCC Systems Architecture Overview & R Integration
Arjuna Chala, Architect Integrations, HPCC Systems / LexisNexis
Agenda
12:00-12:30pm: Welcome Lunch / Meet & Greet
12:30-1:30pm: HPCC Systems Architecture Overview & R Integration Demo
1:30-1:50pm: Q&A / Open Discussion
1:50-2:00pm: Raffle / Kindle Fire giveaway / Close
Twitter event hashtag:
#hpccmeetup
hpccsystems.com
Contents
- Introducing HPCC
- How does LexisNexis use HPCC?
- ECL
- R and HPCC: A match made in Heaven?
What is HPCC?
Thor Architecture
Thor Architecture (contd.)
Roxie Architecture
• Distributed Architecture
• Highly Concurrent
• Low Latency
• Highly Scalable
• Highly Redundant
HPCC Trivia
You have several million records that need to be cleaned, linked and mined. Which HPCC component would you use?
To Summarize - Three main HPCC components
1. HPCC Data Refinery (Thor)
Massively Parallel Extract Transform and Load (ETL) engine
– Built from the ground up as a parallel data environment. Leverages inexpensive locally attached storage. Doesn’t require a SAN infrastructure.
• Enables data integration on a scale not previously available:
– Current LexisNexis person data build process generates 350 Billion intermediate results at peak
• Suitable for:
– Massive joins/merges
– Massive sorts & transformations
– Programmable using ECL
2. HPCC Data Delivery Engine (Roxie)
A massively parallel, high throughput, structured query response engine
• Ultra fast low latency and highly available due to its read-only nature.
• Allows indices to be built onto data for efficient multi-user retrieval of data
• Suitable for:
– Volumes of structured queries
– Full text ranked Boolean search
– Programmable using ECL
3. Enterprise Control Language (ECL)
An easy to use, declarative data-centric programming language optimized for large-scale data management and query processing
• Highly efficient; automatically distributes workload across all nodes.
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
• Large library of efficient modules to handle common data manipulation tasks
How does LN use HPCC?
“Getting Caught in the Act” - A LexisNexis Use Case
“Where is John Smith Now?” - A LexisNexis Use Case
Demo Time - SALT
Insurance Collusion in Louisiana - A (yet another) LN Use Case
[Figure: BEFORE / AFTER HPCC]
We do have some fun once in a while - A (fun) LN Use Case
HPCC Trivia
Name two attributes that make Roxie a great data delivery engine.
And Finally…
Big Data
Open Source Components
4 Petabytes of Data
30,000 Data Sources
12 million background checks a year
Supporting 90 percent of the Fortune 500 companies
99% of all U.S. auto insurance claims
50 billion records
Several million records daily
250 million unique identities
ECL
ECL is SQL on Steroids
SELECT
  ECL: persons
  SQL: Select * from persons
FILTER
  ECL: persons(firstName='Jim')
  SQL: Select * from persons where firstName='Jim'
SORT
  ECL: SORT(persons, firstName)
  SQL: Select * from persons order by firstName
COUNT
  ECL: COUNT(persons(firstName='TOM'))
  SQL: Select COUNT(*) from persons where firstName='TOM'
GROUP
  ECL: DEDUP(persons, firstName, ALL)
  SQL: Select * from persons group by firstName
AGGREGATE
  ECL: SUM(persons, age)
  SQL: Select SUM(age) from persons
Cross Tab
  ECL: TABLE(persons, {state; stateCount := COUNT(GROUP);}, state)
  SQL: Select persons.state, COUNT(*) from persons group by state
JOIN
  ECL: JOIN(persons, states, LEFT.state=RIGHT.code)
  SQL: Select * from persons, states where persons.state=states.code
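To make the ECL/SQL pairs above concrete, here is a minimal Python sketch of what several of them compute on a tiny in-memory dataset. It illustrates the semantics only and is not ECL or HPCC code; the sample records and the `name` field on `states` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data; firstName, age, state and code follow the slide.
persons = [
    {"firstName": "Jim", "age": 30, "state": "GA"},
    {"firstName": "Tom", "age": 41, "state": "FL"},
    {"firstName": "Jim", "age": 25, "state": "GA"},
]
states = [{"code": "GA", "name": "Georgia"}, {"code": "FL", "name": "Florida"}]

# FILTER: persons(firstName='Jim')
jims = [p for p in persons if p["firstName"] == "Jim"]

# SORT: SORT(persons, firstName)
by_name = sorted(persons, key=lambda p: p["firstName"])

# AGGREGATE: SUM(persons, age)
total_age = sum(p["age"] for p in persons)

# GROUP: DEDUP(persons, firstName, ALL) keeps one record per firstName
# (in this sketch, the first one encountered).
seen = set()
deduped = []
for p in persons:
    if p["firstName"] not in seen:
        seen.add(p["firstName"])
        deduped.append(p)

# Cross tab: TABLE(persons, {state; stateCount := COUNT(GROUP);}, state)
state_counts = Counter(p["state"] for p in persons)

# JOIN: inner join of persons to states on persons.state = states.code
code_to_state = {s["code"]: s for s in states}
joined = [
    {**p, **code_to_state[p["state"]]}
    for p in persons
    if p["state"] in code_to_state
]
```

On a cluster the same operations would run across many nodes; the sketch only shows the result each one is meant to produce.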
ECL for ETL
Basic Data Structure
PersonRec := RECORD
  STRING50 firstName;
  STRING50 lastName;
  UNSIGNED1 age;
END;
Transformations
IMPORT Std;  // Std.Str.ToUpperCase comes from the ECL standard library
PersonRec personTransform(PersonRec person) := TRANSFORM
  // PersonRec has no upperFirstName field, so the uppercased value replaces firstName
  SELF.firstName := Std.Str.ToUpperCase(person.firstName);
  SELF := person;  // copy the remaining fields unchanged
END;
upperPersons := PROJECT(persons, personTransform(LEFT));
OUTPUT(upperPersons);
Functions used in the context of transformations: PROJECT, ROLLUP, JOIN, ITERATE, NORMALIZE, DENORMALIZE
All functions: http://hpccsystems.com/community/docs/ecl-language-reference/html/built-in-functions-and-actions
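The PROJECT/TRANSFORM pattern above applies a record-level function to every record of a dataset. A minimal Python sketch of that idea follows, assuming the transform uppercases the first name and copies the remaining fields. The names `project` and `person_transform` are illustrative, not part of any HPCC API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class PersonRec:
    # Mirrors the RECORD layout on the slide.
    first_name: str
    last_name: str
    age: int


def person_transform(person: PersonRec) -> PersonRec:
    # Set one field explicitly and copy the rest, like SELF := person in ECL.
    return replace(person, first_name=person.first_name.upper())


def project(records: List[PersonRec],
            transform: Callable[[PersonRec], PersonRec]) -> List[PersonRec]:
    # Apply the transform to each input record independently.
    return [transform(r) for r in records]


# Hypothetical sample data.
persons = [PersonRec("jim", "smith", 30), PersonRec("tom", "jones", 41)]
upper_persons = project(persons, person_transform)
```

Because each record is transformed independently of the others, this kind of operation is a natural fit for running in parallel across nodes.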
Enterprise Control Language (ECL)
• Declarative programming language: Describe what needs to be done and not how to do it
• Powerful: Unlike Java, high level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Higher level code means fewer programmers and a shorter time to delivery
• Extensible: As new attributes are defined, they become primitives that other programmers can use
• Implicitly parallel: Parallelism is built into the underlying platform. The programmer need not be concerned with it
• Maintainable: A high level programming language, no side effects and attribute encapsulation provide for more succinct, reliable and easier to troubleshoot code
• Complete: Unlike Pig and Hive, ECL provides for a complete programming paradigm.
• Homogeneous: One language to express data algorithms across the entire HPCC platform, including data ETL and high speed data delivery.
Demo Time - ECL
HPCC Trivia
What does “ECL” stand for? Is ECL meant to be imperative?
Finally… R and HPCC
A match made in Heaven?
Seen this Before?
“Data don’t make any sense, we will have to resort to statistics”
And the next thing you know…
With HPCC and R you can…
[Diagram. Stage labels: Data Sources; Analyze, Mine, Model; Big Data Processing; Business Intelligence. Component labels: Unstructured Data, Structured Data, RDBMS, DW, HPCC, R, Visualization. Connector labels: JDBC, ECL, SQL, Input Data, Results, Status.]
…provide an end to end modeling/analytical solution
Use the power of HPCC in R
How did we do it in R?
S4 Classes -> Generates ECL code -> Executes on HPCC -> Results back to R
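The slide gives only the outline of this pipeline. As a rough illustration of the general pattern (objects that describe operations, ECL text generated from them, execution delegated elsewhere, results returned to the caller), here is a minimal Python sketch. It is not the R package's design: every class and function name below is hypothetical, and `fake_execute` is a placeholder that contacts no cluster.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SortStep:
    """Hypothetical object describing one ECL SORT definition."""
    dataset: str
    field: str
    result_name: str

    def to_ecl(self) -> str:
        return f"{self.result_name} := SORT({self.dataset}, {self.field});"


@dataclass
class OutputStep:
    """Hypothetical object describing one ECL OUTPUT action."""
    dataset: str

    def to_ecl(self) -> str:
        return f"OUTPUT({self.dataset});"


def generate_ecl(steps) -> str:
    # Concatenate the ECL text produced by each step, one statement per line.
    return "\n".join(step.to_ecl() for step in steps)


def run(steps, execute: Callable[[str], List[dict]]) -> List[dict]:
    # `execute` stands in for whatever submits ECL to a cluster and
    # returns rows; it is injected so the sketch stays self-contained.
    return execute(generate_ecl(steps))


def fake_execute(ecl: str) -> List[dict]:
    # Placeholder: simply echoes the ECL text it was asked to run.
    return [{"submitted_ecl": ecl}]


steps = [
    SortStep("persons", "firstName", "sortedPersons"),
    OutputStep("sortedPersons"),
]
ecl_text = generate_ecl(steps)
results = run(steps, fake_execute)
```

Keeping code generation separate from execution means the generated ECL can be inspected or tested without a running cluster.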
Q&A Thank You
Web: http://hpccsystems.com
Email: [email protected]
Contact us: 877.316.9669