

Post on 11-Jul-2020


Page 1: NextGen Infrastructure for Big - JBoss · Big Data Characteristics Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text. Structured
Page 2:

NextGen Infrastructure for Big Data Analytics

Page 3:

So What is Big Data?

Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

       Edd Dumbill, program chair for the O’Reilly Strata Conference

Page 4:

Big Data Characteristics

•  Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text. Structured (logs, business transactions), semi-structured and unstructured.
•  Velocity: Streaming data and large volume data movement. Moves at very high rates.
•  Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs). Valuable for mining patterns, trends and relationships.

Source : IBM Research

Extracting business insights from a large volume, variety and velocity of data, beyond what was previously possible!

Page 5:

Big Data Product Metrics Choices

Page 6:

Big Data Enriches the Information Management Ecosystem

•  Who Ran What, Where, and When? Audit MapReduce jobs and tasks
•  Managing a Governance Initiative
•  OLTP Optimization (SAP, checkout, +++)
•  Master Data Enrichment via Life Events, Hobbies, Roles, +++
•  Establishing Information as a Service
•  Active Archive Cost Optimization

Page 7:

The Infrastructure

•  The internet has spawned an explosion in data growth in the form of data sets, called Big Data, which are so large that they are difficult to store, manage and analyze using traditional database and storage architectures.

•  This new data is heavily unstructured, voluminous, rapidly streaming and difficult to harness. Looking at volume alone, out of the three "V"s of Big Data (Volume, Variety and Velocity), gives a sense of the massive infrastructure required:
   –  A jet engine produces ~10 PB of data every 30 minutes of flight time.
   –  Google processes ~20 PB of data per day.
   –  If one exabyte of data were placed on DVDs, stored in slim jewel cases and loaded into Boeing 747 aircraft, it would take 13,513 planes to transport that one exabyte.

•  To capitalize on the Big Data trend, a new breed of Big Data technologies such as Hadoop, Hive, Pig, Avro and NoSQL has emerged. These use parallelized processing on commodity hardware to capture and analyze these new data sets, and provide price/performance that is 10 times better than existing database, data warehousing and business intelligence systems.

•  Here, we look at higher-end systems, where a lot of data comes from all of the business processes, from managing inventories to analyzing data for trends for future products. These systems run many different applications, with many different uses of the same massive amount of data. We will see how these pieces fit together with the evolving need for new infrastructure around storage, networking, virtualization and cloud.

Page 8:

Key Technologies Infrastructure Required for Big Data

•  Cloud Infrastructure
•  Virtualization
•  Networking
•  Storage
   –  In Memory Database
   –  Tiered Storage Software
   –  De-Duplication
   –  Data Protection

Page 9:

Cloud Infrastructure

Page 10:

Virtualization Infrastructure: Workload Consolidation / TCO Savings

Page 11:

Big Data Infrastructure - Map Reduce

•  20-40 nodes/rack
•  16 cores
•  48 GB RAM
•  6-12 × 2 TB disks
•  1-2 GigE to node

•  Easy to use, developer writes few functions
•  Moves compute to data
•  Schedules work on the HDFS node with the data
•  Scans through the data
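The point that a developer writes only a few functions can be sketched as follows. This is a minimal, single-process illustration rather than Hadoop's actual API: `map_words`, `reduce_counts` and the toy driver `run_job` are hypothetical names introduced here, and the driver only stands in for the framework that would normally distribute the work to the nodes holding the data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map function: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce function: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy driver (hypothetical): map every record, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["big data moves fast", "big data is big"]
    print(run_job(lines, map_words, reduce_counts))
```

In a real cluster the grouping step and the scheduling of map tasks next to the data are handled by the framework; only the two small functions are application code.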

Page 12:

Big Data Infrastructure - HDFS

•  Immutable file structure
   –  Read, write, sync
   –  No random writes

•  Storage server used for computation
   –  Move computation to data

•  Fault tolerant and easy management
   –  Built-in redundancy
   –  Tolerates disk and node failure
   –  Automatically manages addition/removal of disks
   –  One operator per 8k nodes

•  Not a SAN, but high-bandwidth network access to data via Ethernet

•  Typically used to solve problems not feasible with traditional systems:
   –  Large storage capacity: > 100 PB raw
   –  Large I/O and computational bandwidth: > 4k nodes/cluster, scale by adding commodity hardware (MR cluster)
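As a rough sanity check on these figures, the sketch below multiplies out a cluster's raw capacity. It assumes the upper end of the node specification from the Map Reduce slide (12 disks of 2 TB each) and 4,000 nodes. The replication factor of 3 is an assumption not stated in this deck, and the function names are illustrative.

```python
def raw_capacity_pb(nodes: int, disks_per_node: int, disk_tb: float) -> float:
    """Raw capacity in PB, using decimal units (1 PB = 1000 TB)."""
    return nodes * disks_per_node * disk_tb / 1000.0


def usable_capacity_pb(raw_pb: float, replication: int = 3) -> float:
    """Capacity left for unique data when every block is stored `replication` times."""
    return raw_pb / replication


if __name__ == "__main__":
    raw = raw_capacity_pb(nodes=4000, disks_per_node=12, disk_tb=2.0)
    print(f"raw: {raw:.0f} PB")  # 96 PB, the same order as the ~100 PB quoted
    print(f"usable at 3x replication (assumed): {usable_capacity_pb(raw):.0f} PB")
```

Under these assumptions a 4k-node cluster lands just under 100 PB raw, so exceeding that figure means more nodes, more disks per node, or larger disks.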

Page 13:

Oracle Big Data System

Page 14:

EMC Big Picture

Page 15:

Storage Infrastructure

Storage Efficiency

•  Virtualization
   –  Mapping P>V, VM Management
•  Performance
   –  In-Memory DB, Auto-Tiering SSD/HDD
•  Cost Reduction
   –  Thin Provisioning
   –  De-Duplication
•  Availability
   –  RAID/Auto-Discover HA, Snapshots, CDP, Cloning, DRS
•  Security
   –  Encryption/DLP

Service Efficiency

•  Storage-as-a-Service
   –  Service Catalogs by Workload etc.
   –  Policy Infrastructure
•  Service Level Attributes
   –  Service Measurements
•  Performance Analytics
   –  IOPS/Response Time, Bandwidth
•  Automation
   –  Unified SAN/NAS Protocols
   –  Auto-learning workload forensics
   –  Provisioning to match workloads
   –  Assured auto recovery

Page 16:

Problem: seeks are expensive

§  CPU & transfer speed, RAM & disk size – double every 18-24 months
§  Seek time nearly constant (~5%/year)
§  Time to read entire drive is growing – scalable computing must go at transfer rate
§  Example: Updating a terabyte DB, given: 10MB/s transfer, 10ms/seek, 100B/entry (10Billion entries), 10kB/page (1Billion pages)
§  Updating 1% of entries (100Million) takes:
   –  1000 days with random B-Tree updates
   –  100 days with batched B-Tree updates
   –  1 day with sort & merge
§  To process 100TB datasets
   §  on 1 node:
      –  scanning @ 50MB/s = 23 days
   §  on 1000 node cluster:
      –  scanning @ 50MB/s = 33 min
      –  MTBF = 1 day
§  Need framework for distribution – efficient, reliable, easy to use
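The 100 TB scan times can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly parallel scan with no coordination overhead; `scan_seconds` is an illustrative helper, not part of any framework.

```python
TB = 10**12  # bytes, decimal units (assumption)
MB = 10**6   # bytes, decimal units (assumption)


def scan_seconds(dataset_bytes: int, rate_bytes_per_s: int, nodes: int = 1) -> float:
    """Time to scan a dataset once when each node reads at the given rate."""
    return dataset_bytes / (rate_bytes_per_s * nodes)


one_node = scan_seconds(100 * TB, 50 * MB)             # 2,000,000 s
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)  # 2,000 s

if __name__ == "__main__":
    print(f"1 node:     {one_node / 86400:.1f} days")  # 23.1 days
    print(f"1000 nodes: {cluster / 60:.1f} min")       # 33.3 min
```

The same division shows why sequential transfer rate, not seek time, sets the floor: the only ways to scan faster are a higher per-node rate or more nodes.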

Page 17:

New Data and Management Economics

Page 18:

Storage Infrastructure – Big Data Targets

Value Potential of Using Big Data by Data Intensive Verticals

Page 19:

Key Takeaways

•  Big data is creating a paradigm shift in the IT industry
   o  Leverage the opportunity to optimize your computing infrastructure after due diligence in the selection of vendors/products, industry testing and interoperability.

•  Optimize Big Data analytics for query response time vs. number of users
   o  Improve query response time for a given number of users (IOPS), or serve more users for a given query response time.

•  Select automated storage management software
   o  Data forensics and tiered placement
   o  Every workload has a unique I/O access signature.
   o  Historical performance data for a LUN can identify performance skews.

•  Optimize infrastructure to meet the needs of applications/SLAs
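One way to read "performance skew" is the share of a LUN's I/O that lands on its busiest regions. The sketch below is only an illustration under that assumption: the per-extent I/O counts are made-up sample data, and `io_skew` is a hypothetical helper, not a function of any storage product.

```python
from typing import Sequence


def io_skew(io_counts: Sequence[int], hot_fraction: float = 0.2) -> float:
    """Share of total I/O served by the hottest `hot_fraction` of extents."""
    if not io_counts:
        raise ValueError("io_counts must not be empty")
    total = sum(io_counts)
    if total == 0:
        return 0.0
    ranked = sorted(io_counts, reverse=True)
    hot_n = max(1, int(len(ranked) * hot_fraction))  # at least one extent
    return sum(ranked[:hot_n]) / total


if __name__ == "__main__":
    # Hypothetical historical I/O counts for ten extents of one LUN.
    history = [900, 50, 20, 10, 10, 5, 3, 1, 1, 0]
    print(f"hottest 20% of extents serve {io_skew(history):.0%} of I/O")
```

A value far above `hot_fraction` suggests that a small part of the LUN would benefit from placement on a faster tier, while a value close to it indicates an even access pattern.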
