Page 1: Atlanta R Users Group - HPCC Systems Architecture Overview & R Integration (cdn.hpccsystems.com/presentations/R_UserGroup_06222012.pdf)

Risk Solutions

http://hpccsystems.com Welcome

Atlanta R Users Group HPCC Systems Architecture Overview & R Integration

Arjuna Chala, Architect Integrations, HPCC Systems / LexisNexis

Agenda

12:00-12:30pm: Welcome Lunch / Meet & Greet

12:30-1:30pm: HPCC Systems Architecture Overview & R Integration Demo

1:30-1:50pm: Q&A / Open Discussion

1:50-2:00pm: Raffle / Kindle Fire giveaway / Close

Twitter event hashtag: #hpccmeetup

Page 2: Contents

- Introducing HPCC
- How does LexisNexis use HPCC?
- ECL
- R and HPCC – A match made in Heaven?

Page 3: What is HPCC?

Page 4: Thor Architecture

Page 5: Thor Architecture (contd.)

Page 6: Roxie Architecture

• Distributed Architecture
• Highly Concurrent
• Low Latency
• Highly Scalable
• Highly Redundant

Page 7: HPCC Trivia

You have several million records that need to be cleaned, linked and mined. Which HPCC component will you use?

Page 8: To Summarize - Three main HPCC components

1. HPCC Data Refinery (Thor)

• Massively Parallel Extract Transform and Load (ETL) engine
  – Built from the ground up as a parallel data environment. Leverages inexpensive locally attached storage. Doesn’t require a SAN infrastructure.
• Enables data integration on a scale not previously available:
  – Current LexisNexis person data build process generates 350 Billion intermediate results at peak
• Suitable for:
  – Massive joins/merges
  – Massive sorts & transformations
  – Programmable using ECL

2. HPCC Data Delivery Engine (Roxie)

• A massively parallel, high throughput, structured query response engine
• Ultra fast, low latency and highly available due to its read-only nature
• Allows indices to be built onto data for efficient multi-user retrieval of data
• Suitable for:
  – Volumes of structured queries
  – Full text ranked Boolean search
  – Programmable using ECL

3. Enterprise Control Language (ECL)

• An easy to use, declarative data-centric programming language optimized for large-scale data management and query processing
• Highly efficient; automatically distributes workload across all nodes
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
• Large library of efficient modules to handle common data manipulation tasks

Page 9: How does LN use HPCC?

Page 10: “Getting Caught in the Act” - A LexisNexis Use Case

Page 11: “Getting Caught in the Act” - A LexisNexis Use Case

Page 12: “Where is John Smith Now?” - A LexisNexis Use Case

Page 13: Demo Time - SALT

Page 14: Insurance Collusion in Louisiana - A (yet another) LN Use Case

Page 15: Insurance Collusion in Louisiana - A (yet another) LN Use Case

BEFORE AFTER HPCC

Page 16: We do have some fun once in a while - A (fun) LN Use Case

Page 17: HPCC Trivia

Name two attributes that make Roxie a great data delivery engine.

Page 18: And Finally…

Big Data

Open Source Components

4 Petabytes of Data

30,000 Data Sources

12 million background checks a year

Supporting 90 percent of the Fortune 500 companies

99% of all U.S. auto insurance claims

50 billion records

Several million records daily

250 million unique identities

Page 19: ECL

Page 20: ECL is SQL on Steroids

SELECT
  ECL: persons
  SQL: SELECT * FROM persons

FILTER
  ECL: persons(firstName = 'Jim')
  SQL: SELECT * FROM persons WHERE firstName = 'Jim'

SORT
  ECL: SORT(persons, firstName)
  SQL: SELECT * FROM persons ORDER BY firstName

COUNT
  ECL: COUNT(persons(firstName = 'TOM'))
  SQL: SELECT COUNT(*) FROM persons WHERE firstName = 'TOM'

GROUP
  ECL: DEDUP(persons, firstName, ALL)
  SQL: SELECT * FROM persons GROUP BY firstName

AGGREGATE
  ECL: SUM(persons, age)
  SQL: SELECT SUM(age) FROM persons

Cross Tab
  ECL: TABLE(persons, {state; stateCount := COUNT(GROUP);}, state)
  SQL: SELECT persons.state, COUNT(*) FROM persons GROUP BY state

JOIN
  ECL: JOIN(persons, states, LEFT.state = RIGHT.code)
  SQL: SELECT * FROM persons, states WHERE persons.state = states.code
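The ECL side of this comparison can be tried end to end on a small inline dataset. The following is a minimal sketch, not part of the original deck: the record layouts, the sample rows, the `StateRec` fields and the `NAMED` result labels are all assumptions made for illustration. The JOIN relies on ECL's default behaviour of returning the fields of both inputs when no TRANSFORM is supplied, which is why the two layouts use distinct field names.

```ecl
// Hypothetical layouts; only the persons/states names come from the slide
PersonRec := RECORD
  STRING50  firstName;
  STRING50  lastName;
  UNSIGNED1 age;
  STRING2   state;
END;

StateRec := RECORD
  STRING2  code;
  STRING20 name;
END;

// Hypothetical inline sample data
persons := DATASET([{'Jim', 'Smith', 34, 'GA'},
                    {'Tom', 'Jones', 51, 'FL'},
                    {'Jim', 'Brown', 29, 'GA'}], PersonRec);
states  := DATASET([{'GA', 'Georgia'}, {'FL', 'Florida'}], StateRec);

// One action per row of the comparison
OUTPUT(persons(firstName = 'Jim'), NAMED('Filtered'));            // FILTER
OUTPUT(SORT(persons, firstName), NAMED('Sorted'));                // SORT
OUTPUT(COUNT(persons(firstName = 'Tom')), NAMED('TomCount'));     // COUNT
OUTPUT(DEDUP(persons, firstName, ALL), NAMED('OnePerFirstName')); // GROUP
OUTPUT(SUM(persons, age), NAMED('TotalAge'));                     // AGGREGATE
OUTPUT(TABLE(persons, {state; stateCount := COUNT(GROUP);}, state),
       NAMED('CountByState'));                                    // Cross Tab
OUTPUT(JOIN(persons, states, LEFT.state = RIGHT.code),
       NAMED('PersonsWithState'));                                // JOIN
```

Note that a filter in ECL is just the dataset name followed by a parenthesised condition, so filters compose with every other operation without a separate WHERE clause.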

Page 21: ECL for ETL

Basic Data Structure

    PersonRec := RECORD
      STRING50 firstName;
      STRING50 lastName;
      UNSIGNED1 age;
    END;

Transformations

    IMPORT Std;

    // Upper-case the first name; copy every other field unchanged
    PersonRec personTransform(PersonRec person) := TRANSFORM
      SELF.firstName := Std.Str.ToUpperCase(person.firstName);
      SELF := person;
    END;

    // Apply the transform to each record of persons
    upperPersons := PROJECT(persons, personTransform(LEFT));
    OUTPUT(upperPersons);

Functions used in the context of transformations: PROJECT, ROLLUP, JOIN, ITERATE, NORMALIZE, DENORMALIZE

All functions: http://hpccsystems.com/community/docs/ecl-language-reference/html/built-in-functions-and-actions
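PROJECT is the only transformation function the slide demonstrates. As a hedged illustration of one of the others, here is a minimal ROLLUP sketch. The sample rows, the `keepOldest` transform and the result label are hypothetical and were not shown in the talk; the sketch only assumes the `PersonRec` layout from this slide.

```ecl
PersonRec := RECORD
  STRING50  firstName;
  STRING50  lastName;
  UNSIGNED1 age;
END;

// Hypothetical inline sample data
persons := DATASET([{'Jim', 'Smith', 34},
                    {'Tom', 'Jones', 51},
                    {'Ann', 'Smith', 29}], PersonRec);

// Called for each adjacent pair that satisfies the ROLLUP condition:
// l is the record kept so far, r is the next matching record.
PersonRec keepOldest(PersonRec l, PersonRec r) := TRANSFORM
  SELF := IF(l.age >= r.age, l, r);  // keep whichever record has the higher age
END;

// ROLLUP only compares adjacent records, so sort on the match field first
byLastName       := SORT(persons, lastName);
oldestByLastName := ROLLUP(byLastName,
                           LEFT.lastName = RIGHT.lastName,
                           keepOldest(LEFT, RIGHT));

OUTPUT(oldestByLastName, NAMED('OldestPerLastName'));
```

Where PROJECT produces one output record per input record, ROLLUP collapses each run of matching records into one, which is why the explicit SORT comes first.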

Page 22: Enterprise Control Language (ECL)

• Declarative programming language: Describe what needs to be done and not how to do it

• Powerful: Unlike Java, high-level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Higher-level code means fewer programmers and a shorter time to delivery

• Extensible: As new attributes are defined, they become primitives that other programmers can use

• Implicitly parallel: Parallelism is built into the underlying platform. The programmer need not be concerned with it

• Maintainable: A high-level programming language with no side effects and attribute encapsulation provides for more succinct, more reliable and easier-to-troubleshoot code

• Complete: Unlike Pig and Hive, ECL provides for a complete programming paradigm.

• Homogeneous: One language to express data algorithms across the entire HPCC platform, including data ETL and high speed data delivery.
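The "declarative" and "extensible" points can be illustrated with a short sketch. It is a hypothetical example rather than code from the talk: the record layout is borrowed from the ETL slide, and the sample rows and the attribute names `adults`, `adultSmiths` and `adultCount` are assumptions.

```ecl
PersonRec := RECORD
  STRING50  firstName;
  STRING50  lastName;
  UNSIGNED1 age;
END;

// Hypothetical inline sample data
persons := DATASET([{'Jim', 'Smith', 34},
                    {'Tom', 'Jones', 51},
                    {'Ann', 'Smith', 15}], PersonRec);

// Each definition names a result; none of them executes on its own
adults      := persons(age >= 18);
adultSmiths := adults(lastName = 'Smith');  // builds on adults like a primitive
adultCount  := COUNT(adults);

// Only actions trigger work; how and where it runs is left to the platform
OUTPUT(adultSmiths, NAMED('AdultSmiths'));
OUTPUT(adultCount, NAMED('AdultCount'));
```

Nothing here says how to iterate over the records or how to spread the work across nodes. The definitions only describe the results wanted, and later definitions reuse earlier ones by name.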

Page 23: Demo Time - ECL

Page 24: HPCC Trivia

What does “ECL” stand for? Is ECL meant to be imperative?

Page 25: Finally… R and HPCC

A match made in Heaven?

Page 26: Seen this Before?

“Data don’t make any sense, we will have to resort to statistics”

Page 27: And the next thing you know…

Page 28: With HPCC and R you can… provide an end-to-end modeling/analytical solution

[Diagram. Stage labels: Data Sources; Analyze, Mine, Model; Big Data Processing; Business Intelligence. Components: Unstructured Data, Structured Data, RDBMS, DW, R, HPCC, Visualization. Connector labels: JDBC, ECL, SQL, Input Data, Results, Status.]

Page 29: Use the power of HPCC in R

Page 30: How did we do it in R?

S4 Classes -> Generates ECL code -> Executes on HPCC -> Results back to R

Page 31: Q&A

Thank You

Web: http://hpccsystems.com
Email: [email protected]

Contact us: 877.316.9669