MetaScale Case Study: Hadoop Extends DataStage ETL Capacity


Upload: metascale

Post on 18-Dec-2014


Category: Technology



DESCRIPTION

Knowing how to mix hardware with proprietary and open source software can lead to improved performance and reduced costs, as a MetaScale team showed with a client who was running out of capacity using IBM DataStage for ETL.

TRANSCRIPT

Page 1: MetaScale Case Study: Hadoop Extends DataStage ETL Capacity

CASE STUDY

CHALLENGE

Client was processing millions of records a day. This volume was exceeding the capability of its ETL based systems. Client was nearing capacity on current proprietary systems. To expand these systems, client would need to pay more licensing fees. Client sought to expand its processing capability without being dependent upon proprietary systems and their high licensing costs.

SOLUTION

MetaScale encouraged client to investigate open source options including Hadoop and PIG. MetaScale guided client in the development of a data hub model on Hadoop, which allowed client to process more data in substantially less time. Because Hadoop minimizes, or even eliminates ETL, the client was able to source the data once and re-use as needed. With this process, the client not only reduced licensing and hardware costs, it was able to re-use the data in its other businesses, spreading value across the enterprise.

Hadoop Extends DataStage ETL Capacity at Lower Cost

Knowing how to mix hardware with proprietary and open source software can lead to improved performance and reduced costs, as a MetaScale team showed with a client who was running out of capacity using IBM DataStage for ETL.

Running on proprietary IBM software and data structures, DataStage has been marketed as a powerful ETL platform. MetaScale’s client was using it to process millions of records daily. When the client added more data and transformations, however, the DataStage performance fell apart. Vendors advised the client to buy bigger servers and more DataStage licenses from them to handle its growing volumes.

The ETL work-type and volumes under review went well beyond what the client’s DataStage existing implementation was designed to do. The extraction and load worked fine, but in the transformation processing DataStage was beyond its capability.

The client was also concerned about the costs to expand processing on the system since DataStage is an expensive system. The initial license costs increased with regular upgrades to software and hardware as volumes grew.

Using the Strengths of DataStage and Hadoop

The client firm decided to explore a new option using Hadoop and the open source PIG language for the transformation processing, because MetaScale showed them that it would run much faster on Hadoop than on DataStage. At the same time, the client’s DataStage group looked at performance and cost to expand capacity on the proprietary IBM platform.

Creating a data hub model on Hadoop, where data is sourced once but re-used many times, is becoming a new standard. Performing transformations or analytics without the need to copy or move data is now a best practice for big data, but that doesn’t mean it is always right. MetaScale knew that this was one of those good use-cases. But the client teams weren’t sure, so they kept their options open by studying the DataStage solution while experimenting with the Hadoop approach in parallel.

Software engineers at MetaScale reverse-engineered the ETL jobs from DataStage into PIG and sourced the data onto Hadoop, followed by extensive testing to ensure that record counts and transformations came out the same. The data was then dropped back into DataStage, like a puzzle piece, for the data to be consumed as usual.

On three data sets of a billion records each, Hadoop did the transformations in approximately an hour, sometimes less, compared to 10 hours in DataStage. While moving the data back into DataStage takes time, the total elapsed time required was still substantially less than leaving the entire workload on the IBM platform. Hadoop and PIG were also shown to be able to scale to 2 billion records with only a marginal increase in processing time and a modest cost for additional hardware.

Poised for Growth After Just Five Weeks

Because the data is now on Hadoop and staged for re-use, the client and MetaScale have the opportunity to use it again in other areas of the business. The whole project and process of incorporating Hadoop took five weeks. Now the business has plenty of capacity with many of the transformations now running on Hadoop.

“This was a real business problem that presented itself,” said a director of IT at the client. “With MetaScale staff, we had just started looking into PIG on Hadoop and we had a theory that Hadoop could be an ETL killer: This was an opportunity to prove it. We had already had good success doing something similar -- dropping processing off one platform and into Hadoop and then putting it back onto the original platform and still getting it done significantly faster than the old process.
This could be for anybody who has intensive processing spread across multiple environments, anybody who has big data and has to keep on adding more infrastructure capacity.”

MetaScale thinks the Hadoop approach to ETL transformation could be used in any vertical and use-case where conventional ETL is used and particularly when heavy processing is required. MetaScale is also working with the client to create a data hub, a project which will enable extensive data re-use and offload the processing of it to Hadoop.

“Hadoop is good for transformation of bigger and bigger data, and the price is right,” the client added. “If you do more processing in DataStage, you add more licenses for more capacity. Hadoop is open source, so adding capacity is massively cheaper and performance is many times faster. If we build new applications and need batch transformation, we don’t need to go to third-party packages. We are disrupting the traditional approach to software solutions. With a private cloud infrastructure, it becomes easier not to put money upfront on big hardware or licenses.”

The client stayed with DataStage temporarily for some of the ETL process because the system was already in place, but over time all of the workload is moving to Hadoop. If a client were starting from scratch, MetaScale would advise bypassing the ETL process and running the analytics on Hadoop.
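The case study says the ported PIG jobs were tested to ensure that record counts and transformations came out the same as in DataStage, but it does not describe how. The Python sketch below shows one way such a parity check could look. It is an illustration only: the function names (`row_digest`, `compare_outputs`, `outputs_match`), the in-memory row format, and the sample records are all assumptions, not part of MetaScale's tooling. It compares two outputs by record count and by an order-independent fingerprint of each record.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Mapping


def row_digest(row: Mapping[str, object]) -> str:
    """Fingerprint one record; column order does not affect the result."""
    canonical = "\x1f".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_outputs(
    legacy_rows: Iterable[Mapping[str, object]],
    hadoop_rows: Iterable[Mapping[str, object]],
) -> Dict[str, int]:
    """Compare two job outputs by record count and record content.

    Hypothetical helper: 'legacy' stands for the original ETL output and
    'hadoop' for the output of the ported job. Row order is ignored, and
    duplicate records are counted rather than collapsed.
    """
    legacy = Counter(row_digest(row) for row in legacy_rows)
    hadoop = Counter(row_digest(row) for row in hadoop_rows)
    return {
        "legacy_count": sum(legacy.values()),
        "hadoop_count": sum(hadoop.values()),
        # Counter subtraction keeps only the positive differences.
        "only_in_legacy": sum((legacy - hadoop).values()),
        "only_in_hadoop": sum((hadoop - legacy).values()),
    }


def outputs_match(report: Mapping[str, int]) -> bool:
    """True when counts agree and no record is unique to either side."""
    return (
        report["legacy_count"] == report["hadoop_count"]
        and report["only_in_legacy"] == 0
        and report["only_in_hadoop"] == 0
    )


if __name__ == "__main__":
    # Made-up sample records, used only to exercise the helpers.
    legacy_sample = [
        {"id": 1, "amount": "10.00"},
        {"id": 2, "amount": "5.50"},
    ]
    hadoop_sample = [
        {"amount": "5.50", "id": 2},
        {"id": 1, "amount": "10.00"},
    ]
    report = compare_outputs(legacy_sample, hadoop_sample)
    print(report, "match" if outputs_match(report) else "MISMATCH")
```

At a billion records per data set, a check like this would have to run as a distributed job rather than in a single Python process. The sketch only shows the logic: compare counts first, then compare content without depending on row order.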

At MetaScale, we leverage our Fortune 100 heritage to help enterprises create value from a suite of innovative products and services. Part of the Sears Holdings family of companies (Sears Holdings Corporation-NASDAQ SHLD), MetaScale offers a compelling mix of scale, speed, skills, and end-to-end consulting services.

Visit us:

www.metascale.com

© 2013 Sears Brands LLC. All Rights Reserved