hpcc presentation

19
RED/082311 RED/082311 Solving Business Problems with High Performance Computing

Upload: subrata-debnath

Post on 19-Jun-2015

346 views

Category:

Documents


5 download

DESCRIPTION

Dr. John Holt\'s Power Point Presentation from the 9/13/12 Joint CBIG & DAMA Event

TRANSCRIPT

  • 1. Solving Business Problems with High Performance ComputingRED/082311

2. HPCC Architecture http://hpccsystems.com HPCC is based on a distributed shared-nothing architecture, and uses commodityPC servers HPCC uses a consolidated central network switch for best for performance, but canuse a decentralized network of satellite switches. Computing efficiencies in HPCCsubstantially reduce the number of required nodes and, hence, network portscompared to other distributed platforms like Hadoop. HPCC Distributed File System is record oriented (supporting fixed length, variablelength field delimited and XML records) HPCC DFS is tightly coupled with the processing layer, exploiting data locality toa higher degree to minimize network transfers and maximize throughput HPCC applications are built using ECL, a declarative data flow oriented languagesupporting the automation of algorithm parallelization and work distributionWHT/082311 Risk Solutions 3. Three Main Components http://hpccsystems.com Massively Parallel Extract Transform and Load (ETL) engine HPCC Data Refinery (Thor) Built from the ground up as a parallel data environment. Leverages inexpensive locallyattached storage. Doesnt require a SAN infrastructure. Enables data integration on a scale not previously available: Current LexisNexis person data build process generates 350 Billion intermediate results atpeak Suitable for: Massive joins/merges Massive sorts & transformations Any N2 problem A massively parallel, high throughput, structured query response engineHPCC Data Delivery Engine (Roxie) Ultra fast due to its read-only nature. Allows indices to be built onto data for efficient multi-user retrieval of data Suitable forVolumes of structured queriesFull text ranked Boolean search An easy to use , data-centric programming language optimized for large-scale Enterprise Control Language (ECL) data management and query processing Highly efcient; Automatically distributes workload across all nodes. Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing Large library of efcient modules to handle common data manipulation tasksWHT/082311Risk Solutions HPCC Systems3 4. The HPCC Platform Delivers Benefitsin Speed, Capacity and Cost Savings http://hpccsystems.comSpeed Scales to extreme workloads quickly and easily Increase speed of development leads to faster production/delivery Improved developer productivityCapacity Enables massive joins, merges, sorts, transformations or tough O(n2) problems Increases business responsiveness Accelerates creation of new services via rapid prototyping capabilities Offers a platform for collaboration and innovation leading to better resultsCost Savings Commodity hardware and fewer people can do much more in less time Uses IT resources efficiently via sharing and higher system utilizationWHT/082311 Risk Solutions 4 5. Scalability: Little Degradation in Performancehttp://hpccsystems.comScalability Scales to support 1000+TB (up to petabytes) of data Purposely built system to do massive I/ORapidly performs complex queries on structured and unstructured data to link to a variety of data sources Suitable for massive joins/mergers, beyond limits of relational DBs Scale increases with the addition of low-cost, commodity serversHPCC in production Current production systems range from 20 to 2000 nodes Currently supports over 150,000 customers, millions of end users Currently handling over 20 million transactions per day for our online and batch products. and innovationleading to better resultsComplex query example demonstrates below Transaction latencies increase logarithmically while data sizes grow linearlyQuery Time13 sec10 sec 50100 150200 250 300WHT/082311 TB of Storage Risk Solutions 5 6. Enterprise Control Language (ECL)http://hpccsystems.comEnterprise Control Language (ECL) is a programming language developed over 10 years by a group of ex- Borland compiler writers.Designed specifically for algorithmic development on top of large data sets.Functional language optimized for large-scale data processing at scale.Abstracts the underlying platform configuration from thealgorithm development allowing efficient, expressivedata manipulationAutomatic parallelization and optimization of sequentialalgorithms for parallel and distributed processing (i.e.,the programmer does not need to understand how tomanage the parallel processing environment)Large library of efficient modules to handle common data manipulation tasksInbuilt data primitives (sort, distribute, join) highly optimized for distributed data processing tasks WHT/082311 Risk Solutions 6 7. Many platforms, fragmented, heterogeneous :Complex & Expensive http://hpccsystems.com WHT/082311Risk Solutions 8. One Platform End-to-End: Simple http://hpccsystems.comWHT/082311 Consistent and elegant HW&SW architecture across the complete platformRisk Solutions 9. Performance comparison: PigMix style http://hpccsystems.com PigMix Benchmark Using Subset (Based on 1M Row Base) (times in seconds less is better) Hadoop 10-node (4GB Memory, 2-250GB per node), HPCC Thor 10-way, community_3.0.4-10 ECL vs. Pig (times ECL vs. Java (times PigMix Script Pig Runtime (secs)Java Runtime (secs)ECL Thor Runtime (secs)faster)faster) L2 73 68 0.96 76.36 71.13 L3156229 1.99 78.51115.25 L4 9510414.02 6.78 7.42 L5 85224 7.0212.1131.90 L6 96100 1.80 53.30 55.52 L7 90110 1.40 64.29 78.57 L8 66 79 0.89 74.07 88.66 L916212822.39 7.24 5.72 L10 23714618.0813.11 8.08 L11 414503 342.65 1.21 1.47 L12 108218 2.74 39.47 79.68 L1387 73 2.09 41.69 34.98 L14 1427 1.29109.745.41 L15986 108.58 0.90 0.06 L16977 1.40 69.194.99 L17 1536 5.41 28.291.11 Total 21592008 532.70 4.05 3.77WHT/082311 Risk Solutions 10. Platform is now Open Source http://hpccsystems.comHPCC Systems: http://hpccsystems.comComparison with Hadoop: http:// hpccsystems.com/Why-HPCC/HPCC-vs- HadoopDownloads: http://hpccsystems.com/downloadWHT/082311Risk Solutions 11. HPCC Systems Case StudiesRED/082311WHT/082311 12. Case Example 1: Choice Point Migrationhttp://hpccsystems.comScenarioReed Elsevier buys ChoicePoint for $4.1 billion and integrates it into LexisNexis Risk Solutions Spun off from Equifax, Inc. in 1997, ChoicePoint grew quickly through acquisition. ChoicePoint kept its business decentralized and individual units operated different via duplicative, technology platforms. Its largest division, Insurance, operated on an expensive and outdated IBM mainframe-based architecture, while its public records data center used an old, inflexible relational database built off an Oracle/ Webmethods, Sun Microsystems-based architecture. ChoicePoint struggled under high information technology (IT) management costs and its growth was constricted due to its inability to adapt its technology and develop new solutions to meet evolving business needs.TaskMigrate ChoicePoint, including the Insurance business to LexisNexis HPCC platform without disrupting businessResult LexisNexis anticipates 70% reduction in new product time-to-market as a result of moving ChoicePoint to HPCC platform. 2 years into the integration, LexisNexis is on track to generate substantial savings within the first 5 years. LexisNexis has seen an immediate 8-10% improvement in hit rates as a function of better matching and linking technology, which is a derivative of the HPCC technology The response time of interactive transactions for certain products has improved by a factor of 200% on average, simply as a function of moving to the HPCC.WHT/082311 Risk Solutions12 13. Case Example 2: Insurance Scoring, From 100 Days to 30Minuteshttp://hpccsystems.comScenarioOne of the top 3 insurance providers using traditional analytics products on multiple statistical modelplatforms with disparate technologies Insurer issues request to re-run all past reports for a customer: 11.5M reports since 2004 Using their current technology infrastructure it would take 100 days to run these 11.5M reportsTaskMigration of 75 different models, 74,000 lines of code and approximately 700 definitions to the HPCC Models were migrated in 1 man month. Using a small development system (and only one developer), we ran 11.5 million reports in 66 minutes Performance on a production-size system: 30 minutes Testing demonstrated our ability to work in batch or onlineResult Reduced manual work, increases reliability, and created capability to do new scores Decreased development time from 1 year to several weeks; decreased run time from 100 days to 30minutes One HPCC (one infrastructure) translates into less people maintaining systems.WHT/082311 Risk Solutions13 14. Case Example 3: Detecting Insurance Collusion inLouisianahttp://hpccsystems.comScenarioThis view of carrier data showsseven known fraud claims andan additional linked claim.The Insurance company dataonly finds a connectionbetween two of the sevenclaims, and only identified oneother claim as being weaklyconnected.WHT/082311 Risk Solutions HPCC Systems 14 15. Case Example 3: Continued http://hpccsystems.comTaskAfter adding the Link ID to thecarrier Data, LexisNexisHPCC technology then addedFamily 12 additional degrees ofrelativeResultThe results showed twofamily groupsinterconnected on all ofthese seven claims.Family 2The links were much strongerthan the carrier datapreviously supported.WHT/082311 Risk Solutions HPCC Systems 15 16. Case Example 4: Network Traffic Analysis in Seconds http://hpccsystems.comScenarioConventional network sensor and monitoring solutionsare constrained by inability to quickly ingest massivedata volumes for analysis- 15 minutes of network traffic can generate 4Terabytes of data, which can take 6 hours to process- 90 days of network traffic can add up to 300+TerabytesTaskDrill into all the data to see if any US governmentsystems have communicated with any suspect systemsof foreign organizations in the last 6 months- In this scenario, we look specifically for trafficoccurring at unusual hours of the dayResultIn seconds, the HPCC sorted through months ofnetwork traffic to identify patterns and suspiciousbehaviorWHT/082311 Risk Solutions16 17. Case Example 5: Suspicious Equity Stripping Clusterhttp://hpccsystems.comScenarioIs mortgage fraud just an activity of isolatedindividuals, or an industry?TaskRank the nature, connectedness, and proximity ofsuspicious property transactions for every identityin the U.S.ResultFlorida Proof of ConceptHighest ranked influencersIdentified known ringleaders in flippingand equity stripping schemes.Typically not connected directly tosuspicious transactions.Clusters with high levels of potential collusion.Clusters offloading property, generating defaultsWHT/082311Risk Solutions 17 18. Case Example 6: Mining Wikipedia Interest http://hpccsystems.comSenario21 Billion Rows of Wikipedia Hourly PageviewStatistics for a year.TaskGenerate meaningful statistics to understandaggregated global interest in all Wikipedia pages acrossthe 24 hours of the day built off all english wikipediapageview logs for 12 months.ResultProduces page statistics that can be queried in secondsto visualize which key times of day each Wikipediapage is more actively viewed.Helps gain insight into both regional and time of daykey interest periods in certain topics.This result can be leveraged with Machine Learning tocluster pages with similar 24hr Fingerprints.WHT/082311 Risk Solutions HPCC Systems 18 19. Conclusion http://hpccsystems.com The HPCC platform is built from inexpensive commodity hardware. ECL, a declarative data flow language, provides automaticparallelization and work distribution. Data Linking, Clustering, and Mining applications leveraging billions ofrecords are very feasible. The HPCC Platform is now Open Source The HPCC Platform is Enterprise ready, with many years of use as theBig Data platform for LexisNexis Risk SolutionsQuestions?WHT/082311 Risk Solutions19