big data


TRANSCRIPT

Page 1: Big data

JuanMa Rebes – IBM Systems and Solutions, Arrow ECS [email protected]

Page 2: Big data

• 2.5 quintillion bytes of data created every day

• 90% of data was created in the last 2 years

• 35 zettabytes in 2020!

[Diagram: the IBM Big Data Platform. Analytic applications (BI/reporting, exploration/visualization, functional apps, industry apps, predictive analytics, content analytics) run on platform capabilities: visualization & discovery, application development, systems management, accelerators, a Hadoop system, stream computing, a data warehouse, and information integration & governance]

– Law enforcement: identify criminals and threats from video and audio feeds

– Manufacturing: analyze and correlate log records to improve service and predict failures

– Telco: address customer satisfaction, predict churn, and match promotions in real time

– Healthcare: detect life-threatening conditions at hospitals in time to intervene

– Retail: multi-channel customer sentiment and experience analysis

– Financial services: make risk decisions based on real-time transactional data

Handling the large Volume, Variety, Velocity, and Veracity of data to find new insights and improve business outcomes.

Page 3: Big data

How to know if a big data solution is right for your organization

– Before making the decision to invest in a big data solution, evaluate the data available for analysis; the insight that might be gained from analyzing it; and the resources available to define, design, create, and deploy a big data platform. Asking the right questions is a good place to start.


Page 4: Big data

Can my current data environment be expanded?

– Are the current datasets very large — on the order of terabytes or petabytes?

– Does the existing warehouse environment contain a repository of all data generated or acquired?

– Is there a significant amount of cold or low-touch data that is not being analyzed to derive business insight?

– Do you have to throw data away because you are unable to store or process it?

– Do you want to be able to perform data exploration on complex and large amounts of data?

– Do you want to be able to do analysis of non-operational data?

– Are you interested in using your data for traditional and new types of analytics?

– Are you trying to delay an upgrade to your existing data warehouse?

– Are you looking for ways to lower your overall cost of doing analytics?


Ask the following questions to determine if you can augment the existing data warehouse platform

Page 5: Big data

Is the data complexity increasing?

The variety of data might demand a big data solution if:

– The data content and structure cannot be anticipated or predicted.

– The data format varies, including structured, semi-structured, and unstructured data.

– The data can be generated by users and machines in any format, for example: Microsoft® Word files, Microsoft Excel® spreadsheets, Microsoft PowerPoint presentations, PDF files, social media, web and software logs, email, photos and video footage from cameras, information-sensing mobile devices, aerial sensory technologies, genomics, and medical records.

– New types of data have emerged from sources that weren't previously mined for insight.

– Domain entities take on different meanings in different contexts.


Has the variety of data increased?

Page 6: Big data

Is the data complexity increasing?

You may want to consider a big data solution if:

– The data is sized in petabytes and exabytes and, in the near future, might grow to zettabytes.

– The data volume is posing technical and economic challenges to store, search, share, analyze, and visualize using traditional methods, such as relational database engines.

– The data processing can take advantage of massively parallel processing on available hardware (a minimal illustration follows below).


Has the volume of data increased?
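That last point is the heart of Hadoop-style processing: split the data, work on the pieces independently, then merge the results. A minimal single-machine sketch of the pattern (the sample data, chunk count, and word-count task are illustrative assumptions, not from the deck):

```python
# Divide-and-conquer parallelism -- the same pattern Hadoop MapReduce
# applies across a cluster, shown here on one machine.
from multiprocessing import Pool
from collections import Counter

def count_words(chunk: list[str]) -> Counter:
    # "Map" step: each worker counts words in its own slice of the data.
    c = Counter()
    for line in chunk:
        c.update(line.split())
    return c

if __name__ == "__main__":
    lines = ["big data needs parallel processing"] * 100_000
    chunks = [lines[i::4] for i in range(4)]       # split the work 4 ways
    with Pool(4) as pool:
        partials = pool.map(count_words, chunks)   # process chunks in parallel
    total = sum(partials, Counter())               # "Reduce" step: merge results
    print(total.most_common(3))
```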

Page 7: Big data

Is the data complexity increasing?

Consider whether your data:

– Is changing rapidly and must be responded to immediately

– Has overwhelmed traditional technologies and methods, which are no longer adequate to handle data coming in real time


Has the velocity of the data increased or changed?

Page 8: Big data

Is the data complexity increasing?

Consider a big data solution if:

– The authenticity or accuracy of the data is unknown.

– The data includes ambiguous information.

– It's unclear whether the data is complete.


Is your data trustworthy?

A big data solution might be appropriate if there is reasonable complexity in the volume, variety, velocity, or veracity of the data.

For more complex data, assess any risks associated with implementing a big data solution. For less complex data, traditional solutions should be assessed.

Page 9: Big data

What is the impact on existing IT governance?

– Security and privacy— In keeping with local regulations, what data can the solution access? What data can be stored? What data should be encrypted in motion? At rest? Who is allowed to see the raw data and the insights?

– Standardization of data— Are there standards governing the data? Is the data in a proprietary format? Is some of the data in a non-standard format?

– Timeframe in which the data is available— Is the data available in a timeframe that allows action to be taken in a timely fashion?

– Ownership of data— Who owns the data? Does the solution have appropriate access and permission to use the data?

– Allowable uses— How is the data allowed to be used?


Consider the following governance-related issues in the context of your situation

Page 10: Big data

Are the right skills on board and the right people aligned?

Before undertaking a new big data project, make sure the right people are on board:

– Do you have buy-in from stakeholders and other business sponsors who are willing to invest in the project?

– Are data scientists available who understand the domain, who can look at the massive quantity of data and who can identify ways to generate meaningful and useful insights from the data?


Specific skills are required to understand and analyze the requirements and maintain the big data solution. These skills include industry knowledge, domain expertise, and technical knowledge of big data tools and technologies. Data scientists with expertise in modeling, statistics, analytics, and math are key to the success of any big data initiative.

Page 11: Big data

Using big data type to classify big data characteristics

– Analysis type. Whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types may be required by the use case (a minimal sketch follows this list):

– Fraud detection; analysis must be done in real time or near real time

– Trend analysis for strategic business decisions; analysis can be in batch mode

– Processing methodology. The type of technique to be applied for processing data (e.g., predictive, analytical, ad-hoc query, and reporting). Business requirements determine the appropriate processing methodology. A combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.

– Data frequency and size. How much data is expected, and at what frequency does it arrive? Knowing the frequency and size helps determine the storage mechanism, storage format, and the necessary preprocessing tools. Data frequency and size depend on data sources:

– On demand, as with social media data

– Continuous feed, real-time (weather data, transactional data)

– Time series (time-based data)
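A minimal sketch of the two analysis types above, assuming a toy transaction stream and fraud threshold (both illustrative, not from the deck):

```python
from typing import Iterator

def events() -> Iterator[dict]:
    yield from ({"amount": a} for a in (120, 9_500, 40, 12_000))

# Real-time: score each event as it arrives (e.g., fraud detection).
def realtime(stream: Iterator[dict]) -> None:
    for e in stream:
        if e["amount"] > 10_000:          # simplistic fraud rule
            print("ALERT:", e)

# Batch: accumulate events, then analyze together (e.g., trend analysis).
def batch(stream: Iterator[dict]) -> None:
    data = list(stream)                   # collected over hours or days
    print("avg amount:", sum(e["amount"] for e in data) / len(data))

realtime(events())
batch(events())
```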


Page 12: Big data

Using big data type to classify big data characteristics

– Data type. Type of data to be processed — transactional, historical, master data, and others. Knowing the data type helps segregate the data in storage.

– Content format. Format of incoming data — structured (RDBMS, for example), unstructured (audio, video, and images, for example), or semi-structured. Format determines how the incoming data needs to be processed and is key to choosing tools and techniques and defining a solution from a business perspective (see the sketch after this list).

– Data source. Sources of data (where the data is generated) — web and social media, machine-generated, human-generated, etc. Identifying all the data sources helps determine the scope from a business perspective. The figure shows the most widely used data sources.

– Data consumers. A list of all of the possible consumers of the processed data:

– Business processes

– Business users

– Enterprise applications

– Individual people in various business roles

– Part of the process flows

– Other data repositories or enterprise applications
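A minimal sketch of routing incoming data by content format, as suggested above; the detection heuristics and target stores are illustrative assumptions:

```python
import json

def classify(payload: bytes) -> str:
    try:
        json.loads(payload)
        return "semi-structured"           # JSON documents
    except ValueError:                     # includes UnicodeDecodeError
        pass
    try:
        text = payload.decode("utf-8")
        return "structured" if "," in text else "unstructured"  # toy CSV check
    except UnicodeDecodeError:
        return "unstructured"              # binary: images, audio, video

def route(payload: bytes) -> str:
    # Each format goes to a store/pipeline suited to it.
    return {"structured": "warehouse",
            "semi-structured": "document-store",
            "unstructured": "distributed-file-system"}[classify(payload)]

print(route(b'{"user": 1, "action": "click"}'))  # document-store
print(route(b"id,action\n1,click"))              # warehouse
print(route(b"\x89PNG\r\n\x1a\n"))               # distributed-file-system
```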


Page 13: Big data


Big data classification

Page 14: Big data

Logical layers of a big data solution

– Big data sources

– Data massaging and store layer

– This layer is responsible for acquiring data from the data sources and, if necessary, converting it to a format that suits how the data is to be analyzed. For example, an image might need to be converted so it can be stored in a Hadoop Distributed File System (HDFS) store or a Relational Database Management System (RDBMS) warehouse for further processing. Compliance regulations and governance policies dictate the appropriate storage for different types of data. (A minimal sketch follows this list.)

– Analysis layer

– Decisions must be made with regard to how to manage the tasks to:

– Produce the desired analytics

– Derive insight from the data

– Find the entities required

– Locate the data sources that can provide data for these entities

– Understand what algorithms and tools are required to perform the analytics.

– Consumption layer

– The consumers can be visualization applications, human beings, business processes, or services.
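A minimal sketch of the data massaging idea above: records arriving in different formats are normalized to one schema before being written to storage. The field names and in-memory "store" are illustrative assumptions:

```python
import csv, io, json

def normalize(raw: str, fmt: str) -> dict:
    # Convert a source record into the common shape the analysis layer expects.
    if fmt == "json":
        rec = json.loads(raw)
    elif fmt == "csv":
        rec = next(csv.DictReader(io.StringIO(raw), fieldnames=["user", "action"]))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return {"user": str(rec["user"]), "action": rec["action"]}

store = []                                  # stand-in for HDFS/RDBMS storage
store.append(normalize('{"user": 1, "action": "login"}', "json"))
store.append(normalize("2,purchase", "csv"))
print(store)  # both records now share one schema
```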


Page 15: Big data

Logical layers of a big data solution


Page 16: Big data

Big Data Sources - Enterprise Systems

– Enterprise Legacy Systems

– Customer relationship management systems

– Billing operations

– Mainframe applications

– Enterprise resource planning

– Web applications and other data sources augment the enterprise-owned data. Such applications can expose the data using custom protocols and mechanisms.

– Data Management Systems (DMS)

– DMS store legal data, processes, policies, and various other kinds of documents, such as Microsoft® Excel® spreadsheets and Microsoft Word documents.

– These documents can be converted into structured data that can be used for analytics. The document data can be exposed as domain entities, or the data massaging and storage layer can transform it into the domain entities.


Page 17: Big data

Big Data Sources

– Data stores— Includes enterprise data warehouses, operational databases, and transactional databases. Typically structured and consumed directly or transformed easily to suit requirements. May or may not be stored in the distributed file system, depending on the context of the situation.

– Smart devices— Smartphones, meters, healthcare devices, and so on. For the most part, data from smart devices is analyzed in real time, but it can be analyzed in batch as well.

– Aggregated data providers— Huge volumes of data pour in, in a variety of formats, produced at different velocities, and made available by various data providers and sensors: geographical information, human-generated content, and sensor data.


Page 18: Big data

Data massaging and store layer

Because incoming data characteristics can vary, components in the data massaging and store layer must be capable of reading data at various frequencies, in various formats and sizes, and over various communication channels:

– Data acquisition— Acquires data from various data sources and sends the data to the data digest component or stores it in specified locations. This component must be intelligent enough to choose whether and where to store the incoming data. It must be able to determine whether the data should be massaged before it can be stored or whether the data can be sent directly to the business analysis layer. (A minimal sketch follows this list.)

– Data digest— Responsible for massaging the data in the format required to achieve the purpose of the analysis. This component can have simple transformation logic or complex statistical algorithms to convert source data. The analysis engine determines the specific data formats that are required. The major challenge is accommodating unstructured data formats, such as images, audio, video, and other binary formats.

– Distributed data storage— Responsible for storing the data from data sources. Often, multiple data storage options are available in this layer, such as distributed file storage (DFS), cloud, structured data sources, NoSQL, etc.
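A minimal sketch of the data acquisition decision described above; the routing rule and zone names are illustrative assumptions:

```python
# Per record, decide: send to the digest step, or store directly.
def acquire(record: dict) -> str:
    if record.get("format") in ("image", "audio", "video"):
        return digest(record)             # unstructured data needs massaging
    return store(record, "raw-zone")      # already analysis-ready

def digest(record: dict) -> str:
    # Stand-in for transformation logic or statistical preprocessing.
    record["massaged"] = True
    return store(record, "curated-zone")

def store(record: dict, zone: str) -> str:
    # Stand-in for distributed storage (DFS, NoSQL, cloud, ...).
    return f"stored in {zone}: {record}"

print(acquire({"format": "csv", "payload": "1,click"}))
print(acquire({"format": "image", "payload": "<bytes>"}))
```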


Page 19: Big data

Analysis layer

This is the layer where business insight is extracted from the data:

– Analysis-layer entity identification— Responsible for identifying and populating the contextual entities. This is a complex task that requires efficient high-performance processes. The data digest component should complement this entity identification component by massaging the data into the required format. Analysis engines will need the contextual entities to perform the analysis.

– Analysis engine— Uses other components (specifically, entity identification, model management, and analytic algorithms) to process and perform the analysis. The analysis engine can have various workflows, algorithms, and tools that support parallel processing.

– Model management— Responsible for maintaining various statistical models and for verifying and validating these models by continuously training the models to be more accurate. The model management component then promotes these models, which can be used by the entity identification or analysis engine components.
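A minimal sketch of the model management cycle above: validate a retrained candidate against a holdout set and promote it only if it beats the production model. The models and scoring are illustrative assumptions:

```python
def validate(model, holdout) -> float:
    # Stand-in for real validation: fraction of holdout the model gets right.
    return sum(model(x) == y for x, y in holdout) / len(holdout)

def manage(current, candidate, holdout):
    # Promote the candidate only if it beats the model in production.
    return candidate if validate(candidate, holdout) > validate(current, holdout) else current

holdout = [(x, x > 5) for x in range(10)]
current = lambda x: x > 7          # production model
candidate = lambda x: x > 5        # newly trained model
promoted = manage(current, candidate, holdout)
print("promoted candidate:", promoted is candidate)   # True
```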


Page 20: Big data

Consumption layer

The outcome of the analysis is consumed by customers, vendors, partners, and suppliers.

Business processes can also be triggered: for example, the process to create a new order when a customer accepts an offer, or the process to block the use of a credit card when a customer reports fraud.

Another option is a recommendation engine (RE) that can match customers with the products they like. An RE analyzes available information and provides personalized, real-time recommendations (a minimal sketch follows below).

For internal consumers, the ability to build reports and dashboards for business users enables stakeholders to make informed decisions and to design appropriate strategies. To improve operational effectiveness, real-time business alerts can be generated from the data, and operational key performance indicators can be monitored.
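A minimal sketch of the recommendation engine idea, assuming cosine similarity between a customer preference vector and product feature vectors (the vectors and catalog are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

products = {
    "running shoes": [0.9, 0.1, 0.0],   # features: sporty, formal, tech
    "dress shirt":   [0.0, 0.9, 0.1],
    "smart watch":   [0.6, 0.1, 0.9],
}
customer = [0.8, 0.0, 0.5]              # inferred from purchase history

# Rank products by similarity to the customer's tastes.
ranked = sorted(products, key=lambda p: cosine(customer, products[p]), reverse=True)
print("recommend:", ranked[0])          # smart watch
```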


Page 21: Big data

Vertical layers

The aspects that affect all of the components of the logical layers (big data sources, data massaging and storage, analysis, and consumption) are covered by the vertical layers:

– Information integration

– Big data governance

– Systems management

– Quality of service


Page 22: Big data

Vertical layers – Information Integration

Big data applications acquire data from various origins, providers, and data sources and store it in data storage systems such as HDFS, NoSQL stores, and MongoDB. This vertical layer is used by various components (data acquisition, data digest, model management, and transaction interceptor, for example) and is responsible for connecting to various data sources. Integrating information across data sources with varying characteristics (protocols and connectivity, for example) requires quality connectors and adapters. Accelerators are available to connect to most of the known and widely used sources, including social media adapters and weather data adapters. This layer can also be used by components to store information in big data stores and to retrieve information from big data stores for processing. Most big data stores have services and APIs available to store and retrieve information. (A minimal adapter sketch follows.)
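A minimal sketch of the connector/adapter pattern described above: each source hides its protocol behind a common interface, so downstream components see one uniform record stream. The adapter classes are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class SourceAdapter(ABC):
    @abstractmethod
    def fetch(self) -> Iterator[dict]:
        """Yield records from the underlying source in a common shape."""

class SocialMediaAdapter(SourceAdapter):
    def fetch(self) -> Iterator[dict]:
        # A real adapter would page through a REST API here.
        yield {"source": "social", "text": "loving the new phone!"}

class WeatherAdapter(SourceAdapter):
    def fetch(self) -> Iterator[dict]:
        # A real adapter would poll a weather feed here.
        yield {"source": "weather", "temp_c": 21.5}

def ingest(adapters: list[SourceAdapter]) -> list[dict]:
    # Downstream components see one uniform record stream.
    return [rec for a in adapters for rec in a.fetch()]

print(ingest([SocialMediaAdapter(), WeatherAdapter()]))
```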


Page 23: Big data

Vertical layers – Big Data Governance

Strong guidelines and processes are required to monitor, structure, store, and secure data as it enters the enterprise and as it is processed, stored, analyzed, and purged or archived. In addition to normal data governance considerations, governance for big data includes additional factors:

– Managing high volumes of data in a variety of formats.

– Continuously training and managing the statistical models required to pre-process unstructured data and analytics. Keep in mind that this is an important step when dealing with unstructured data.

– Setting policy and compliance regulations for external data regarding its retention and usage.

– Defining the data archiving and purging policies (a minimal sketch follows this list).

– Creating the policy for how data can be replicated across various systems.

– Setting data encryption policies.
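A minimal sketch of an archiving and purging policy check, as called for above; the data classes and retention periods are illustrative assumptions:

```python
from datetime import date

# Policy: keep hot, then archive; purge after the retention period ends.
RETENTION = {
    "transactions": {"archive_after_days": 90, "purge_after_days": 7 * 365},
    "web_logs":     {"archive_after_days": 30, "purge_after_days": 365},
}

def disposition(data_class: str, created: date, today: date) -> str:
    policy = RETENTION[data_class]
    age = (today - created).days
    if age >= policy["purge_after_days"]:
        return "purge"
    if age >= policy["archive_after_days"]:
        return "archive"
    return "retain"

today = date(2015, 1, 27)
print(disposition("web_logs", date(2014, 11, 1), today))   # archive
print(disposition("web_logs", date(2013, 1, 1), today))    # purge
```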


Page 24: Big data

Vertical layers – Quality of Service (Data)

– Completeness in identifying all of the data elements required

– Timeliness for providing data at an acceptable level of freshness

– Accuracy in verifying that the data respects data accuracy rules

– Adherence to a common language (data elements fulfill the requirements expressed in plain business language)

– Consistency in verifying that the data from multiple systems respects the data consistency rules

– Technical conformance in meeting the data specification and information architecture guidelines

– Data frequency. How frequently is fresh data available? Is it on-demand, continuous, or offline?

– Size of fetch. This attribute helps define the size of the data that can be fetched and consumed per fetch (see the sketch after this list).

– Filters. Standard filters remove unwanted data and noise in the data and leave only the data required for analysis.
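A minimal sketch of "size of fetch" and "filters" working together; the feed contents and noise rule are illustrative assumptions:

```python
from typing import Iterator

FEED = [{"id": i, "value": i % 7} for i in range(25)]

def fetch(feed: list[dict], fetch_size: int) -> Iterator[list[dict]]:
    # Bound each fetch so consumers never pull more than they can handle.
    for start in range(0, len(feed), fetch_size):
        yield feed[start:start + fetch_size]

def noise_filter(page: list[dict]) -> list[dict]:
    # Keep only records relevant to the analysis (toy rule: value > 0).
    return [rec for rec in page if rec["value"] > 0]

for page in fetch(FEED, fetch_size=10):
    print(len(noise_filter(page)), "useful records in this fetch")
```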


Page 25: Big data

Vertical layers – Quality of Service (privacy and security)

– Data can originate from different regions and countries and must be treated accordingly. Decisions must be made about data masking and the storage of such data. Consider the following data access policies:

– Data availability / Criticality / Authenticity

– Data sharing and publishing

– Data storage and retention, including questions such as: Can the external data be stored? If so, for how long? What kind of data can be stored?

– Constraints of data providers (political, technical, regional)

– Social media terms of use


Page 26: Big data

Vertical layers – Systems Management

– Critical for big data because it involves many systems across clusters and boundaries of the enterprise:

– Managing the logs of systems, virtual machines, applications, and other devices

– Correlating the various logs and helping investigate and monitor the situation

– Monitoring real-time alerts and notifications

– Using a real-time dashboard showing various parameters

– Referring to reports and detailed analysis about the system

– Setting and abiding by service-level agreements

– Managing storage and capacity

– Archiving and managing archive retrieval

– Performing system recovery, cluster management, and network management

– Policy management


Page 27: Big data

What insights are possible with big data technologies?


Page 28: Big data

What insights are possible with big data technologies?


Page 29: Big data

What insights are possible with big data technologies?


Page 30: Big data

What insights are possible with big data technologies?

– Compliance and regulatory reporting

– Risk analysis and management

– CRM and customer loyalty programs

– Credit risk, scoring, and analysis

– High-speed arbitrage trading

– Trade surveillance

– Abnormal trading pattern analysis


Financial services

Page 31: Big data

What insights are possible with big data technologies?

– Large-scale click-stream analytics

– Ad targeting, analysis, forecasting, and optimization

– Abuse and click-fraud prevention

– Social graph analysis and profile segmentation

– Campaign management and loyalty programs


Web and digital media

Page 32: Big data

What insights are possible with big data technologies?

– Fraud detection

– Threat detection

– Cyber-security

– Compliance and regulatory analysis

– Energy consumption and carbon footprint management


Public sector

Page 33: Big data

What insights are possible with big data technologies?

– Health insurance fraud detection

– Campaign and sales program optimization

– Brand management

– Patient care quality and program analysis

– Medical device and pharmaceutical supply-chain management

– Drug discovery and development analysis


Health and life sciences

Page 34: Big data

What insights are possible with big data technologies?

– Mashups: Mobile user location and precision targeting

– Machine-generated data

– Online dating: A leading online dating service uses sophisticated analysis to measure the compatibility between individual members, so it can suggest good matches

– Online gaming

– Predictive maintenance of aircraft and automobiles


Miscellaneous

Page 35: Big data

Introducing… A NEW GENERATION OF IBM Power Systems

Designed for Big Data

Superior Cloud Economics

Open Innovation Platform

Open Innovation to put data to work

Page 36: Big data

Designed for Big Data – optimized performance:

– Processors: flexible, fast execution of analytics algorithms (4X threads per core vs. x86)

– Memory: large, fast workspace to maximize business insight (4X memory bandwidth vs. x86)

– Data bandwidth: bring massive amounts of information to compute resources in real time (2.4X more I/O bandwidth than POWER7)

Optimized for a broad range of data and analytics: delivering insights 82X faster; industry solutions 5X faster.

82X is based on IBM internal tests as of April 17, 2014 comparing IBM DB2 with BLU Acceleration on Power with a comparably tuned competitor row store database server on x86 executing a materially identical 2.6TB BI workload in a controlled laboratory environment. Test measured 60 concurrent user report throughput executing identical Cognos report workloads. Competitor configuration: HP DL380p, 24 cores, 256GB RAM, Competitor row-store database, SuSE Linux 11SP3 (Database) and HP DL380p, 16 cores, 384GB RAM, Cognos 10.2.1.1, SuSE Linux 11SP3 (Cognos). IBM configuration: IBM S824, 24 cores, 256GB RAM, DB2 10.5, AIX 7.1 TL2 (Database) and IBM S822L, 16 of 20 cores activated, 384GB RAM, Cognos 10.2.1.1, SuSE Linux 11SP3 (Cognos).  Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment.

Page 37: Big data

Announcing new open innovation to put data to work:

– POWER8 processor & architecture: first generation of systems built on POWER8 innovative design, optimized for big data & analytics

– IBM Solution for BLU Acceleration: next-generation in-memory database technology for analytics at the speed of thought

– IBM Solution for Analytics: enables rapid deployment of business and predictive analytics

– IBM Solution for Hadoop: innovation that optimizes unstructured big data performance

Page 38: Big data

Power Systems can deliver insight to the point of impact with big data & analytics accelerators:

– IBM FlashSystem: 75% less storage (1)

– Next Generation In-Memory: 82X faster insights (2)

– POWER8 with CAPI Flash Accelerators: 24:1 consolidation (3)

(1) Source: COCC Case Study http://bit.ly/1iQemuu
(2) 82X is based on IBM internal tests as of April 17, 2014, comparing IBM DB2 with BLU Acceleration on Power with a comparably tuned competitor row-store database server on x86 executing a materially identical 2.6TB BI workload in a controlled laboratory environment. The test measured 60-concurrent-user report throughput executing identical Cognos report workloads. Competitor configuration: HP DL380p, 24 cores, 256GB RAM, competitor row-store database, SuSE Linux 11SP3 (database) and HP DL380p, 16 cores, 384GB RAM, Cognos 10.2.1.1, SuSE Linux 11SP3 (Cognos). IBM configuration: IBM S824, 24 cores, 256GB RAM, DB2 10.5, AIX 7.1 TL2 (database) and IBM S822L, 16 of 20 cores activated, 384GB RAM, Cognos 10.2.1.1, SuSE Linux 11SP3 (Cognos). Results may not be typical and will vary based on actual workload, configuration, applications, queries, and other variables in a production environment.
(3) 24:1 system consolidation ratio (12:1 rack density improvement) based on a single IBM S824 (24 cores, POWER8 3.5 GHz, 256GB RAM, AIX 7.1) with 40 TB of memory-based flash replacing 24 HP DL380p (24 cores, E5-2697 v2 2.7 GHz, 256GB RAM, SuSE Linux 11SP3)

Page 39: Big data

IBM can help you build your solution on the platform that was designed for big data & analytics.

[Diagram: all data (structured and unstructured) feeding key business processes, industry solutions, IBM Watson cognitive computing, and business & predictive analytics]

Page 40: Big data

IBM Solution for Hadoop - Power Systems Edition

Speed matters— Higher ingest rates deliver 37% faster insights than competitive Hadoop solutions, with 31% fewer data nodes. (1)

Availability matters— Better reliability and resiliency, with 73% fewer outages and 92% fewer performance problems than x86. (2)

[Diagram: PowerLinux 7R2 servers + DCS3700 storage]

NEW. A storage-dense integrated platform optimized to simplify and accelerate unstructured big data & analytics

Integrated platform solution for Hadoop, ready for analytics software

1) Based on STG performance testing compared to the Cloudera/HP published benchmark
2) Solitaire Interglobal paper, "Power Boost Your Big Data Analytics Strategy" – http://www.ibm.com/systems/power/solutions/assets/bigdata-analytics.html?LNK=wf

Page 41: Big data

PowerLinux vs. x86 Big Data Architecture – Overview of Differences

– Architecture type: x86 Classic is monolithic (compute + storage combined); PowerLinux is modular (compute + storage separate). PowerLinux advantages: 1. tailor the compute/storage mix to achieve an ideal match for workload characteristics; 2. monitor storage performance.

– Compute/storage ratio: fixed on x86; variable on PowerLinux. Advantage: achieve top performance by tailoring the compute/storage mix to actual workload characteristics.

– Storage: internal on x86; external (DCS3700P) on PowerLinux. Advantages: 1. central monitoring of I/O performance; 2. upgrade storage and compute separately.

– Storage connectivity: SAS on x86; SAS/FC on PowerLinux.

– Hadoop copies: 3 on x86; 2 on PowerLinux. Advantages: 1. use less raw storage; 2. fewer data nodes needed for equivalent capacity/performance.

– RAID: RAID0 on x86; RAID5 on PowerLinux. Advantages: 1. better failure resiliency; 2. better cluster performance during disk failure; 3. use fewer Hadoop copies.

– Network: 10 Gb Ethernet/TOR switches on both.

Page 42: Big data

37% Faster Business Insights with BigInsights on Linux on Power

– 10 TB TeraSort Benchmark

– Achieved a 4216-second result for the 10 TB TeraSort, beating the HP/Cloudera absolute result by 22% with a smaller cluster

– Achieved a 1.37X normalized performance advantage over HP/Cloudera

Page 43: Big data

Get Better Storage Efficiency with Big Data Running on Linux on Power

Example sizing comparison (x86 vs. PowerLinux):

– User space (user data after compression): 1 PB vs. 1 PB

– Hadoop copies: 3 vs. 2

– Temp space: 25% vs. 25%

– Raw storage required: 3.25 PB vs. 2.25 PB

– Data nodes required (4 TB disks, 12 data disks/node): 68 vs. 40, i.e., 41% fewer data nodes

For equivalent data volume and compute performance, big data running on a Linux on Power solution requires less raw storage and fewer data nodes, leading to significant up-front and TCO savings. The arithmetic behind the raw-storage figures is sketched below.
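A minimal sketch of the sizing arithmetic above (assuming decimal units; the helper names are illustrative). The deck's PowerLinux node count (40) reflects its external DCS3700 storage layout rather than the 12-internal-disk arithmetic, so nodes are computed here only for the x86 case:

```python
import math

# Raw storage = user data x Hadoop copies + temp space (25% of user data).
def raw_storage_pb(user_pb: float, copies: int, temp_frac: float = 0.25) -> float:
    return user_pb * copies + user_pb * temp_frac

# Nodes needed when capacity comes from each node's internal disks.
def data_nodes(raw_pb: float, disks_per_node: int = 12, disk_tb: int = 4) -> int:
    return math.ceil(raw_pb * 1000 / (disks_per_node * disk_tb))

x86_raw = raw_storage_pb(1.0, copies=3)   # 1*3 + 0.25 = 3.25 PB
plx_raw = raw_storage_pb(1.0, copies=2)   # 1*2 + 0.25 = 2.25 PB
print(x86_raw, "PB ->", data_nodes(x86_raw), "nodes")   # 3.25 PB -> 68 nodes
print(plx_raw, "PB (40 nodes per the deck's external-storage layout)")
```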