
Big Data (Reference) Architectures David Lyle, VP Product Strategy, Informatica Products


DESCRIPTION

Big Data for the IW2015 presentations

TRANSCRIPT


Big Data (Reference) Architectures
David Lyle, VP Product Strategy, Informatica Products

Big Data and Informatica: Level-set
How does Informatica play with Hadoop, appliances, cloud, and NoSQL?

The Value of a Virtual Data Machine (like Vibe): Integration Flexibility. Same skills, multiple deployment modes.

(Diagram: deployment modes (Hadoop, cloud, server, desktop) and integration patterns (data federation, data integration hub, data virtualization, embedded data quality in apps). Callouts: development, deployment, skills leverage, future-proof investment, development acceleration.)

Vibe is the industry's first and only embeddable virtual data machine to access, aggregate, and manage data regardless of data type, source, volume, compute platform, or user. It lets you map once and deploy anywhere: you can take logic that may have been defined on-premise, move it to the cloud, then move it to Hadoop, or embed it in an application without recoding.

This makes your architecture faster, more flexible, and future-proof.

Business Benefit
Five times faster turn-around from business idea to solution.
Adapt the technology to your business, not vice versa.
Utilize all your data, regardless of location, type, or volume.

IT Benefit
Five times faster project delivery.
Eliminate skills gaps for adopting new technologies and approaches.
Reduce the cost of maintaining a complex assortment of technologies.


No-code visual development environment.
Preview results at any point in the data flow.
PowerCenter developers are now Hadoop developers.

Data Integration & Quality on Hadoop: Hive-QL. The entire Informatica mapping is translated to Hive Query Language, for example:

FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
        FROM lineitem
        GROUP BY L_ORDERKEY) T1
  JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;

The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).

Advanced mapping transformations are executed on Hadoop through User Defined Functions (UDFs) using Vibe, as part of the generated MapReduce jobs.
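When the flow calls for a pushed-down transformation, the Informatica runtime registers and invokes it for you. Purely as an illustration of the generic Hive UDF mechanism (the jar path, class name, and function name below are hypothetical, not Informatica's actual artifacts), a hand-registered UDF would look roughly like this:

-- Hypothetical, hand-written illustration of the generic Hive UDF mechanism.
-- Register a custom transformation packaged as a jar, then call it inline
-- from Hive-QL, much as the generated script calls its pushed-down functions.
ADD JAR hdfs:///libs/custom_address_udf.jar;
CREATE TEMPORARY FUNCTION standardize_address AS 'com.example.hive.udf.StandardizeAddress';

SELECT customer_id,
       standardize_address(raw_address) AS clean_address
FROM customer_staging;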

Informatica enables you to define the data processing flow (e.g. ETL, data quality, etc.) with transformations and rules using a visual design UI; we also call these flows mappings. When these mappings are deployed and run, Informatica optimizes the end-to-end flow from source to target to generate Hive-QL scripts. Transformations that don't map to HQL are run as User Defined Functions (UDFs) via the Vibe virtual data machine that resides on each of the Hadoop nodes. Because the design is separated from the deployment, you can take existing PowerCenter mappings and run them on Hadoop. In fact, the source and target data don't have to reside in Hadoop: Informatica will stream the data from the source into Hadoop for processing and then deliver it to the target, whether on Hadoop or another system. PowerCenter Big Data Edition (PC BDE) is more complete and better performing than Talend (6x), Syncsort (2x), and Pig hand coding (3.5x), with product and sales two years ahead of IBM.

Why Informatica for Big Data & Hadoop (Informatica on Hadoop, and why customers care):
Visual development environment: increase productivity up to 5x over hand-coding.
100K+ trained Informatica developers globally: use existing and readily available skills for big data.
200+ high-performance connectors (legacy and new): move all types of customer data into Hadoop faster.
100+ pre-built transforms for ETL and data quality: the broadest out-of-the-box transformations on Hadoop.
100+ pre-built parsers for complex data formats: analyze and integrate all types of data faster.
Vibe "Map Once, Deploy Anywhere" virtual data machine: an insurance policy as new data types and technologies change.
Reference architectures to get started: accelerate customer success with proven solutions.

Modern Data Architecture: Right Technology for Right Use Case
(Diagram, shown twice as a build: data sources (transactions/OLTP/OLAP, social media and web logs, machine/device and scientific data, documents and emails, cloud data sources) land in the store suited to the use case: the data warehouse, EDW & DW appliances, a traditional grid, Hadoop, and KV, column-family, and document NoSQL stores. Callouts: "One skill", "Manage the hairball".)

Trendy Quote #1: "ETL is dead." SOA created a new integration hairball faster than the original hairball we are trying to replace.

Big Data isn't the challenge/opportunity. New data?

Slam Data together w/o Data Integration

How do we get the best of all worlds?

Trendy Quote #2: "Just use Schema on Read."
Load the data as-is and apply your own lens to the data to read it back out.
Something to consider, but hmmm.
Or, when is it appropriate to use a NoSQL datastore?
An e-commerce style model is required? (Think Amazon.com.)
A staging area model?
What are the implications of no schema?
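For reference, here is what schema on read typically looks like in Hive: the files are loaded as-is, and a table definition (your "lens") is applied only at query time. This is a minimal sketch with hypothetical paths and field names, assuming the JSON SerDe shipped with Hive's HCatalog is available:

-- Schema on read: the JSON files under /data/raw/clicks are stored untouched;
-- this external table is just one lens applied when reading them back out.
CREATE EXTERNAL TABLE clicks_raw (
  user_id  STRING,
  url      STRING,
  ts       STRING,
  referrer STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/raw/clicks/';

-- Another team could define a different table over the same files with a
-- different set of fields; nothing was decided at load time.
SELECT url, COUNT(*) AS hits
FROM clicks_raw
GROUP BY url;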

(NoSQL Modeling Implications: a series of diagram builds working through a document-model example, leading up to the TV Shows discussion below.)

NoSQL Modeling Implications
But even in the TV Shows model:
Click on an actor and see their entire television career.
Click on a timeframe and see all the other shows released the same week.
Etc.
In other words, RELATIONS! I.e., the only thing MongoDB is good at is storing arbitrary pieces of JSON. "Arbitrary" means you don't care AT ALL what's inside the JSON. You don't even look. There is no schema, not even an implicit schema as there was in the TV Shows example.

Do you have a decision tree for data stores? A common approach to making consistent decisions about technology choices?

Modern Data Architecture
Data Warehouse / Appliance: structured data, causal, clean, slow-ish model changes; ad-hoc and interactive queries.
KV, Document, Column-family, Graph NoSQL: frequent model changes; document access; no joins. There is no SQL, no joins, no constraints, no transactions. Suited to large quantities of data with no relationships between elements, frequent structure changes, validation not crucial, etc.
Columnar databases (Sybase IQ, Vertica): structured data, frequent queries that are mostly the same, etc.

Modern Data Architecture
Keys for values, documents, etc., are held in MDM; MDM is the index for everything, tying the environment together. Then create a virtualization layer on top of this MDM/index-to-values-to-data world, giving a single view of everything. Everything is federated.
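To make the "MDM is the index for everything" idea concrete, here is a minimal Hive-QL sketch in which all table and column names (mdm_customer_xref, dw_customer_orders, web_activity_summary) are hypothetical. In practice the cross-reference would be maintained by the MDM hub, and the joins would typically be executed by the virtualization layer across systems rather than by one engine over local copies:

-- Hypothetical cross-reference maintained by the MDM hub: one golden key
-- mapped to the keys each store uses for the same customer.
CREATE TABLE IF NOT EXISTS mdm_customer_xref (
  master_customer_key STRING,
  dw_customer_id      BIGINT,
  crm_doc_id          STRING,
  clickstream_user_id STRING
);

-- A "single view" that ties warehouse facts and Hadoop-landed interaction
-- data together through the golden key.
CREATE VIEW IF NOT EXISTS customer_360 AS
SELECT x.master_customer_key,
       o.total_order_amount,
       w.page_views_last_30d
FROM mdm_customer_xref x
LEFT JOIN dw_customer_orders   o ON o.customer_id = x.dw_customer_id
LEFT JOIN web_activity_summary w ON w.user_id     = x.clickstream_user_id;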

How about what folks are calling the "Managed Data Lake"?

One Environment to Manage Big Data/NoSQL: Development, Management, Security & Provisioning

(Diagram: sources (machine/device and cloud data, documents and emails, relational and mainframe systems, social media and web logs) are ingested via stream, load, replicate, services, events, and topics into Hadoop, where data is archived, profiled, parsed, cleansed, transformed (ETL), and matched, then loaded to document, column-family, and KV NoSQL stores, the data warehouse, and the Master Data (Key) Management System for analytics teams.)

Lower Costs: lower HW/SW costs; optimized end-to-end performance; rich pre-built connectors and a library of transforms for ETL, data quality, parsing, and profiling.
Increased Productivity: up to 5x productivity gains with a no-code visual development environment; no need for Hadoop expertise for data integration.
Proven Path to Innovation: 5,000+ customers, 500+ partners, 100,000+ trained Informatica developers; enterprise scalability, security, and support.

Topic 1: Data Warehouse Optimization
Where is my bottleneck?
How much is this costing me?
Is this a good candidate for Hadoop?
What are other good candidates for Hadoop data preparation?

EXISTING: Ab Initio, Teradata, Exadata. Data preparation is done on the expensive platform.
(Diagram: sources (machine/device and cloud data, documents and emails, relational and mainframe systems, social media and web logs) are streamed, loaded, and replicated into an expensive data warehouse processing/staging area that feeds the data warehouse.)

Duplicate the transformation processing on Hadoop, then compare the data between the DW and the Test DW.
(Diagram: the sources now also feed a Hadoop processing area that loads a Test Data Warehouse alongside the existing Data Warehouse.)
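One simple way to do the "compare data between DW and Test DW" step is a reconciliation query once both results are visible to Hive. This is only a sketch with hypothetical table names (dw_daily_revenue exported from the existing warehouse, hadoop_daily_revenue produced by the new Hadoop flow), not a description of Informatica's own data validation tooling:

-- Both result sets have the same grain: one row per order_date.
-- Return only the dates where the two builds disagree.
SELECT COALESCE(d.order_date, h.order_date) AS order_date,
       d.revenue AS dw_revenue,
       h.revenue AS hadoop_revenue
FROM dw_daily_revenue d
FULL OUTER JOIN hadoop_daily_revenue h
  ON d.order_date = h.order_date
WHERE d.order_date IS NULL                 -- present only in the Hadoop build
   OR h.order_date IS NULL                 -- present only in the DW build
   OR ABS(d.revenue - h.revenue) > 0.01;   -- values differ beyond tolerance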

When tested, cut over to Hadoop for data preparation.
(Diagram: the Hadoop processing area now prepares the data and loads the Data Warehouse directly.)

Data Ingestion and Extraction: moving terabytes of data per hour.
(Diagram: sources (transactions/OLTP/OLAP, social media and web logs, documents and email, industry-standard formats, machine/device and scientific data) are moved by replication, streaming, and batch load into a low-cost store and archive, and extracted out to the data warehouse, MDM, and applications.)

Informatica provides several ways to get data into and out of Hadoop depending on the data types, data volumes, and data latencies. You can use PowerCenter with PowerExchange to batch load or trickle-feed data into Hadoop. With Informatica Data Replication and CDC you can replicate hundreds of gigabytes to terabytes of data per hour, from thousands of tables or entire databases, into lower-cost appliances and Hadoop. For big data generated in real time, such as market trade data, log files, and machine device data, you can use Informatica Ultra Messaging to stream millions of transactions per second into Hadoop or appliances like EMC Greenplum. You can also use PowerCenter with PowerExchange to extract data from Hadoop and move it into other systems, such as a data warehouse, in batch or near real time.
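PowerCenter/PowerExchange, Data Replication, and Ultra Messaging handle these movements as products. Purely as a point of reference for what the batch-load path looks like on the Hive side, here is a minimal hand-rolled sketch with hypothetical paths, table names, and columns (order_ts is assumed to start with 'YYYY-MM-DD'):

-- Extracted files have been landed in HDFS under /data/landing/orders/
-- (for example by the ingestion tool or an hdfs dfs -put).
CREATE EXTERNAL TABLE IF NOT EXISTS stg_orders (
  order_id BIGINT,
  cust_id  BIGINT,
  order_ts STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/orders/';

-- Managed target table, partitioned by load date for downstream processing.
CREATE TABLE IF NOT EXISTS orders (
  order_id BIGINT,
  cust_id  BIGINT,
  order_ts STRING,
  amount   DOUBLE
)
PARTITIONED BY (load_date STRING);

-- Batch-load the landed files into the partitioned table.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE orders PARTITION (load_date)
SELECT order_id, cust_id, order_ts, amount,
       substr(order_ts, 1, 10) AS load_date
FROM stg_orders;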

(Reference architecture diagram. Data sources: transactions/OLTP/OLAP, social media and web logs, machine/device and scientific data, documents and emails. Data ingestion: archiving, replication, data streaming, change data capture, batch load. Data management: data integration & data quality, data governance, data security, and the virtual data machine. Data delivery: data virtualization, event-based processing, data integration hub. Applications and consumers: data warehouse, MDM/PIM, applications, analytics & operational dashboards, mobile apps, real-time alerts, visualization, agile analytics, advanced analytics, machine learning.)

Topic: Hybrid DW, on-premise and cloud
What should I do? Where?
What are others doing?
How do I manage the whole thing?

1. On-premise Data Warehouse: Conceptual
(Diagram: siloed data sources (complex, heterogeneous, growing and changing) feed ETL logic/code in a processing area (extract, cleanse, conform, relate, load) that loads the enterprise data warehouse on a DW appliance.)

Pre-cloud data warehouse architectures pull data from sources into a staging area for processing prior to loading. This processing should involve the most efficient possible extraction logic (like CDC), cleansing algorithms, conformance rules that integrate disparate processes into a canonical corporate representation, logic that relates master data, events, transactions, etc., to their proper multi-system counterparts, and finally efficient loading of the changes into the EDW.
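Purely to illustrate what a cleanse/conform rule does, here is a small hand-written Hive-QL/SQL sketch with hypothetical tables (crm_customers and erp_customers as siloed sources, ref_country_codes as the canonical reference, stg_customer_conformed as an existing staging target). The next paragraph explains why you would not want a warehouse built on piles of hand-code like this:

-- Standardize customer names and conform country values from two siloed
-- sources to a canonical code before loading the staging target.
INSERT OVERWRITE TABLE stg_customer_conformed
SELECT s.source_system,
       s.customer_id,
       TRIM(UPPER(s.customer_name))            AS customer_name,
       COALESCE(r.iso_country_code, 'UNKNOWN') AS country_code
FROM (
  SELECT 'CRM' AS source_system, id AS customer_id,
         name AS customer_name, country AS raw_country
  FROM crm_customers
  UNION ALL
  SELECT 'ERP' AS source_system, cust_no AS customer_id,
         cust_name AS customer_name, country_cd AS raw_country
  FROM erp_customers
) s
LEFT JOIN ref_country_codes r
  ON UPPER(TRIM(s.raw_country)) = r.source_value;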

However, all too many data warehouses use hand-code for this heavy lifting, which has always represented 70-80% of the effort. This creates several serious problems: maintenance and changes are incredibly expensive and time-consuming, and all too often the highly value-add processing (CDC, cleansing, proper conformance, etc.) is not done in the first place because it is too difficult in hand-code.

2. On-premise Data Warehouse: with DI Platform

(Diagram: the same siloed data sources (complex, heterogeneous, growing and changing) now feed a data integration platform providing an optimizer and executor; transformations and business rules; access control, encryption, and masking; master & reference data management; and metadata management. The processing area (extract, cleanse, conform, relate, load) loads the enterprise data warehouse on a DW appliance.)

Investing in a data integration platform helps to solve those two problems: it makes changes to existing logic and creation of new logic much more agile, reduces errors, and allows high value-add CDC, cleansing, and similar logic to be added to the flow, making the data in the data warehouse far more valuable, useful, and trusted by the business.

2. On-premise Data Warehouse: with DI Platform
(Diagram: ERP and CRM apps and legacy/RDBMS sources feed the DI platform (optimizer/executor; transformations and business rules; access control, encryption, masking; master & reference data management; metadata management) and the processing area (extract, cleanse, conform, relate, load), which loads the enterprise data warehouse on a DW appliance.)

The value of the platform is that it helps manage the complexity, heterogeneity, growth, and rate of change of business systems, allowing the DW to change faster, better, and cheaper.

3. New Cloud Apps and Big Data Sets

(Diagram: in addition to ERP/CRM apps and legacy/RDBMS systems, the sources now include files, SaaS apps, and logs, JSON, and social data, all feeding the DI platform and processing area that load the enterprise data warehouse on the DW appliance.)

New data from cloud applications, interaction data from social media or machine logs, and other large data sets we used to ignore are now seen as highly valuable information that can help us understand new business opportunities, process bottlenecks, or cost efficiencies.

3. New cloud apps and big data sets are costly for scalability and agility

(Diagram: the same architecture as the previous slide, with the DW appliance, its storage, and the data pipes drawn larger to show the growth required to absorb the new data.)

However, adding these to existing data warehouse systems or appliances on-premise can be extremely expensive: the hardware and storage for these systems would grow very quickly (note the size change in the DW appliance and storage, as well as the pipes required), and it can be difficult to plan for the seasonal capacity changes these systems require. Is there a better way?

4a. Consider a Hybrid DW Augmentation Strategy

(Diagram: ERP/CRM apps and legacy/RDBMS sources continue to feed the on-premise processing area, enterprise data warehouse, and DW appliance, while files, SaaS apps, and logs, JSON, and social data feed a cloud data warehouse with its own processing area.)

An approach better suited to this style of new data is to use the elasticity and pay-as-you-go advantages of the cloud. To get agility and quality advantages similar to those of on-premise data warehouse architectures, a similar data integration platform technology should be used along with the processing and storage technology for the data warehouse in the cloud.

Metadata, master data, and reference data should be shareable between the two environments.

4b. Flexibly Add Other Analytic Technologies

(Diagram: the cloud environment now includes a cloud DW, Hadoop, and NoSQL stores with a shared processing area, alongside the on-premise enterprise data warehouse and DW appliance, fed by ERP/CRM apps, legacy/RDBMS systems, files, SaaS apps, and logs, JSON, and social data.)

Technologies in the cloud offer further opportunities for new storage and processing approaches for different types of applications and analysis. A data warehouse offers strong relational analysis benefits, but sometimes other kinds of analysis are beneficial; for instance, array processing and other statistical approaches may be better suited to Hadoop. NoSQL systems may also be appropriate sources or targets for storing data that needs scale-out and easy access to enormous amounts of information whose model changes frequently.

These also gain significant agility and cost advantages by being deployed in the cloud, but to avoid creating silos and greater fragmentation, it is crucial to take a holistic approach to managing the data, quality, and change management of the overall environment.

4c. Over time, perhaps migrate the DW to the cloud

(Diagram: the sources now feed a cloud environment containing the full processing area (extract, cleanse, conform, relate, load), the enterprise data warehouse, the cloud DW, Hadoop, and NoSQL stores, with the on-premise DW appliance migrating to the cloud over time.)

Where appropriate, moving to a completely cloud-based architecture may provide the most agility and cost-effectiveness.

See information's potential. And put it to work.
Join a unique community focused on the power and potential of information to transform your career, your organization, and your world.

It's a unique community focused on unleashing the power and potential of information to transform your career, your organization, and your world. Get inspired by top technology leaders who share insights and ideas about how to be more successful by using information in innovative ways. You should join! Visit www.informatica.com/potential-at-work/pawpres
