CASE STUDY

Hadoop Extends DataStage ETL Capacity at Lower Cost

CHALLENGE
Client was processing millions of records each day. This volume was exceeding the capability of its ETL based systems. Client was nearing capacity on current proprietary systems. To expand these systems, client would need to pay more licensing fees. Client sought to expand its processing capability without being dependent upon proprietary systems and their high licensing costs.

SOLUTION
MetaScale encouraged client to investigate open source options including Hadoop and PIG. MetaScale guided client in the development of a data hub model on Hadoop, which allowed client to process more data in substantially less time. Because Hadoop minimizes, or even eliminates ETL, the client was able to source the data once and re-use as needed. With this process, the client not only reduced licensing and hardware costs, it was able to re-use the data in its other businesses, spreading value across the enterprise.

Knowing how to mix hardware with proprietary and open source software can lead to improved performance and reduced costs, as a MetaScale team showed with a client who was running out of capacity using IBM DataStage for ETL.

Running on proprietary IBM software and data structures, DataStage has been marketed as a powerful ETL platform. MetaScale’s client was using it to process millions of records daily. When the client added more data and transformations, however, the DataStage performance fell apart. Vendors advised the client to buy bigger servers and more DataStage licenses from them to handle its growing volumes.

The ETL work‐type and volumes under review went well beyond what the client’s DataStage existing implementation was designed to do. The extraction and load worked fine, but in the transformation processing DataStage was beyond its capability.

The client was also concerned about the costs to expand processing on the system since DataStage is an expensive system. The initial license costs increased with regular upgrades to software and hardware as volumes grew.

Using the Strengths of DataStage and Hadoop

The client firm decided to explore a new option using Hadoop and the open source PIG language for the transformation processing, because MetaScale showed them that it would run much faster on Hadoop than on DataStage. At the same time, the client’s DataStage group looked at performance and cost to expand capacity on the proprietary IBM platform.

Creating a data hub model on Hadoop, where data is sourced once but re‐used many times, is becoming a new standard. Performing transformations or analytics without the need to copy or move data is now a best practice for big data, but that doesn’t mean it is
always right. MetaScale knew that this was one of those good use‐cases. But the client teams weren’t sure, so they kept their options open by studying the DataStage solution while experimenting with the Hadoop approach in parallel.

Software engineers at MetaScale reverse‐engineered the ETL jobs from DataStage into PIG and sourced the data onto Hadoop, followed by extensive testing to ensure that record counts and transformations came out the same. The data was then dropped back into DataStage, like a puzzle piece, for the data to be consumed as usual.

On three data sets of a billion records each, Hadoop did the transformations in approximately an hour, sometimes less, compared to 10 hours in DataStage. While moving the data back into DataStage takes time, the total elapsed time required was still substantially less than leaving the entire workload on the IBM platform. Hadoop and PIG were also shown to be able to scale to 2 billion records with only a marginal increase in processing time and a modest cost for additional hardware.

Poised for Growth After Just Five Weeks

Because the data is now on Hadoop and staged for re‐use, the client and MetaScale have the opportunity to use it again in other areas of the business. The whole project and process of incorporating Hadoop took five weeks. The business now has plenty of capacity, with many of the transformations running on Hadoop.

“This was a real business problem that presented itself,” said a director of IT at the client. “With MetaScale staff, we had just started looking into PIG on Hadoop and we had a theory that Hadoop could be an ETL killer: This was an opportunity to prove it. We had already had good success doing something similar: dropping processing off one platform and into Hadoop and then putting it back onto the original platform and still getting it done significantly faster than the old process. This could be for anybody who has intensive processing spread across multiple environments, anybody who has big data and has to keep on adding more infrastructure capacity.”

MetaScale thinks the Hadoop approach to ETL transformation could be used in any vertical and use‐case where conventional ETL is used, particularly when heavy processing is required. MetaScale is also working with the client to create a data hub, a project which will enable extensive data re‐use and offload the processing of it to Hadoop.

“Hadoop is good for transformation of bigger and bigger data, and the price is right,” the client added. “If you do more processing in DataStage, you add more licenses for more capacity. Hadoop is open source, so adding capacity is massively cheaper and performance
is many times faster. If we build new applications and need batch transformation, we don’t need to go to third‐party packages. We are disrupting the traditional approach to software solutions. With a private cloud infrastructure, it becomes easier not to put money upfront on big hardware or licenses.”

The client stayed with DataStage temporarily for some of the ETL process because the system was already in place, but over time all of the workload is moving to Hadoop. If a client were starting from scratch, MetaScale would advise bypassing the ETL process and running the analytics on Hadoop.
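The case study mentions extensive testing to ensure that record counts and transformations came out the same on Hadoop as in DataStage, but does not describe how that testing was done. One way such a reconciliation check could look is sketched below in Python. It is only an illustration under stated assumptions: the function names `dataset_fingerprint` and `outputs_match`, the field separator, and the sample rows are all hypothetical and do not come from the project.

```python
import hashlib
from typing import Iterable, Sequence, Tuple


def dataset_fingerprint(rows: Iterable[Sequence[str]]) -> Tuple[int, int]:
    """Return (record_count, order-independent checksum) for a set of rows.

    Hypothetical helper: each row is a sequence of string fields.
    """
    count = 0
    checksum = 0
    for row in rows:
        # Assumption: the unit-separator character never appears inside a field.
        encoded = "\x1f".join(row).encode("utf-8")
        digest = int.from_bytes(hashlib.sha256(encoded).digest(), "big")
        # Summing row digests modulo 2**256 ignores row order
        # but still reflects duplicated or missing rows.
        checksum = (checksum + digest) % (1 << 256)
        count += 1
    return count, checksum


def outputs_match(legacy_rows: Iterable[Sequence[str]],
                  hadoop_rows: Iterable[Sequence[str]]) -> bool:
    """True when both outputs have the same record count and the same checksum."""
    return dataset_fingerprint(legacy_rows) == dataset_fingerprint(hadoop_rows)


if __name__ == "__main__":
    # Made-up sample rows standing in for the two platforms' outputs.
    legacy = [("1001", "WIDGET", "19.99"), ("1002", "GADGET", "5.00")]
    hadoop = [("1002", "GADGET", "5.00"), ("1001", "WIDGET", "19.99")]
    print(outputs_match(legacy, hadoop))
```

Comparing a count plus an order-independent checksum avoids sorting or joining a billion records just to confirm that two pipelines agree. A mismatch only signals that something differs, so locating the offending rows would need a further, more detailed comparison.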
At MetaScale, we leverage our Fortune 100 heritage to help enterprises create value from a suite of innovative products and services. Part of the Sears Holdings family of companies (Sears Holdings Corporation-NASDAQ SHLD), MetaScale offers a compelling mix of scale, speed, skills, and end-to-end consulting services.
Visit us:
www.metascale.com
© 2013 Sears Brands LLC. All Rights Reserved