TRANSCRIPT
NextGen Infrastructure for Big Data Analytics
So What is Big Data?
Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
Ed Dumbill, program chair for the O’Reilly Strata Conference
Big Data Characteristics
• Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text; structured (logs, business transactions), semi-structured, and unstructured.
• Velocity: Streaming data and large-volume data movement; moves at very high rates.
• Volume: Scale from terabytes to petabytes (1K TB) to zettabytes (1B TB); valuable for mining patterns, trends, and relationships.
Source : IBM Research
Extracting business insights from large volume, variety, and velocity of data, beyond what was previously possible!
Big Data Product Metrics Choices
Big Data Enriches the Information Management Ecosystem
Who Ran What, Where, and When?
Audit MapReduce Jobs and tasks
Managing a Governance Initiative
OLTP Optimization (SAP, checkout, +++)
Master Data Enrichment via Life Events, Hobbies, Roles, +++
Establishing Information as a Service
Active Archive Cost Optimization
The Infrastructure
• The internet has spawned an EXPLOSION in data growth in the form of data sets, called Big Data, which are so large they are difficult to store, manage, and analyze using traditional DB and storage architectures.
• Not only is this new data heavily unstructured, voluminous, fast-streaming, and difficult to harness; if we look at the scale of just "Volume" among the three "V"s of Big Data (Volume, Variety, and Velocity), the massive infrastructure requirement becomes clear from figures like these:
  – A jet engine produces ~10 PB of data every 30 minutes of flight time.
  – Google processes ~20 PB of data per day.
  – If one exabyte's worth of data were placed onto DVDs, stored in thin jewel cases, and loaded into Boeing 747 aircraft, it would take 13,513 planes to transport that one exabyte.
• To capitalize on the Big Data trend, a new breed of Big Data technologies such as Hadoop, Hive, Pig, Avro, and NoSQL has emerged. These leverage parallelized processing on commodity hardware to capture and analyze these new data sets, and provide price/performance that is 10 times better than existing database/data-warehousing/business-intelligence systems.
• Here, we will look at higher-end systems, where a lot of data comes from all of the business processes, from managing inventories to analyzing data for trends in future products. In these systems there are many different applications – many different uses of the same massive amount of data – and we will see how all of these pieces fit together with the evolving need for new infrastructure around storage, networking, virtualization, and cloud.
Key Technologies Infrastructure Required for Big Data
• Cloud Infrastructure • Virtualization • Networking • Storage
– In Memory Database – Tiered Storage Software – De-Duplication – Data Protection
Cloud Infrastructure
Virtualization Infrastructure: Workload Consolidation / TCO Savings
Big Data Infrastructure - Map Reduce
• 20–40 nodes/rack • 16 cores • 48 GB RAM • 6–12 × 2 TB disks • 1–2 GigE to each node
• Easy to use: the developer writes just a few functions
• Moves compute to data
• Schedules work on the HDFS node that holds the data
• Scans through the data
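To make the "developer writes a few functions" point concrete, here is a minimal single-process sketch of the MapReduce programming model (not Hadoop itself): the developer supplies only a map function and a reduce function, and the framework handles grouping keys between the two phases.

```python
from collections import defaultdict

def map_fn(line):
    """Map phase: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce phase: sum the counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(lines):
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_mapreduce(["big data", "big infrastructure"])
# result == {"big": 2, "data": 1, "infrastructure": 1}
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and the shuffle moves intermediate pairs over the network; the user code stays exactly this small.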
Big Data Infrastructure - HDFS
• Immutable file structure – read, write, sync; no random writes
• Storage server used for computation – move computation to data
• Fault tolerant and easy to manage – built-in redundancy; tolerates disk and node failures; auto-manages addition/removal of disks; one operator per 8K nodes
• Not a SAN, but high-bandwidth network access to data via Ethernet
• Typically used to solve problems not feasible with traditional systems: large storage capacity (> 100 PB raw)
• Large I/O computational bandwidth; > 4K nodes/cluster; scale by adding commodity HW; MR cluster
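The "move computation to data" bullet can be illustrated with a toy scheduler (this is an illustrative sketch, not Hadoop's actual scheduling code): given which nodes hold a replica of each block, prefer to run each task on a node that already has the data, falling back to a network read only when no data-local node is free.

```python
def schedule(block_locations, free_nodes):
    """Return {block: node} assignments, preferring data-local nodes.

    block_locations: {block_id: [nodes holding a replica]}
    free_nodes: nodes with spare task slots
    """
    assignments = {}
    available = set(free_nodes)
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in available]
        # Data-local if possible; otherwise run anywhere free and
        # pull the block over the network.
        node = local[0] if local else next(iter(available))
        assignments[block] = node
    return assignments

locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
print(schedule(locations, ["node-b", "node-c"]))
# {'blk_1': 'node-b', 'blk_2': 'node-c'}
```

With 3-way replication, each block usually has three candidate nodes, so data-local placement succeeds for the vast majority of tasks even on a busy cluster.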
Oracle Big Data System
EMC Big Picture
Storage Infrastructure
Storage Efficiency
• Virtualization: physical-to-virtual (P2V) mapping, VM management
• Performance: in-memory DB, auto-tiering (SSD/HDD)
• Cost reduction: thin provisioning, de-duplication
• Availability: RAID, auto-discover HA, snapshots, CDP, cloning, DRS
• Security: encryption, DLP
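Of the cost-reduction techniques above, de-duplication is easy to sketch. Below is a toy block-level dedup store (an assumption-laden demo, not any vendor's implementation): each fixed-size chunk is stored once, keyed by its content hash, and each file keeps only a recipe of hashes.

```python
import hashlib

CHUNK = 4  # tiny chunk size for the demo; real systems use 4-128 KB

class DedupStore:
    def __init__(self):
        self.chunks = {}   # sha256 hex digest -> chunk bytes
        self.recipes = {}  # filename -> ordered list of digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            d = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(d, chunk)  # duplicate chunks stored once
            digests.append(d)
        self.recipes[name] = digests

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.recipes[name])

store = DedupStore()
store.put("a.log", b"AAAABBBBAAAA")  # "AAAA" appears twice
store.put("b.log", b"AAAACCCC")      # shares "AAAA" with a.log
print(len(store.chunks))             # 3 unique chunks instead of 5
```

Production arrays layer compression, variable-size chunking, and collision handling on top, but the savings mechanism is exactly this: identical chunks across files collapse to one stored copy.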
Service Efficiency
• Storage-as-a-Service: service catalogs by workload, policy infrastructure
• Service-level attributes: service measurements
• Performance analytics: IOPS, response time, bandwidth
• Automation: unified SAN/NAS protocols, auto-learning workload forensics, provisioning to match workloads, assured auto recovery
Problem: seeks are expensive
§ CPU speed, transfer speed, RAM, and disk size double every 18–24 months
§ Seek time is nearly constant (improving ~5%/year)
§ Time to read an entire drive is growing – scalable computing must go at transfer rate
§ Example: updating a terabyte DB, given 10 MB/s transfer, 10 ms/seek, 100 B/entry (10 billion entries), 10 kB/page (100 million pages)
§ Updating 1% of entries (100 million) takes:
  – 1000 days with random B-tree updates
  – 100 days with batched B-tree updates
  – 1 day with sort & merge
§ To process 100 TB datasets:
  – on 1 node: scanning @ 50 MB/s = 23 days
  – on a 1000-node cluster: scanning @ 50 MB/s = 33 min, but MTBF = 1 day at that scale
§ Need a framework for distribution – efficient, reliable, easy to use
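The scan-time figures above are simple back-of-the-envelope arithmetic, which a few lines of Python can confirm (using decimal units, 1 TB = 10^12 bytes):

```python
# Scan-time check: 100 TB dataset, 50 MB/s sequential read per node.
DATASET = 100 * 10**12   # 100 TB in bytes
RATE = 50 * 10**6        # 50 MB/s per node, in bytes/s

one_node_days = DATASET / RATE / 86400            # seconds -> days
cluster_minutes = DATASET / (RATE * 1000) / 60    # 1000 nodes, -> minutes

print(f"1 node: {one_node_days:.1f} days")        # ~23.1 days
print(f"1000 nodes: {cluster_minutes:.1f} min")   # ~33.3 min
```

This is the core argument for MapReduce-style clusters: parallel sequential scans turn a multi-week job into minutes, at the price of needing a framework that survives the daily node failures a 1000-node cluster statistically guarantees.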
New Data and Management Economics
Storage Infrastructure – Big Data Targets
Value Potential of Using Big Data by Data Intensive Verticals
Key Takeaways
• Big Data is creating a paradigm shift in the IT industry
  o Leverage the opportunity to optimize your computing infrastructure, after due diligence in vendor/product selection, industry testing, and interoperability.
• Optimize Big Data analytics for query response time vs. number of users
  o Improving query response time for a given number of users (IOPS), or serving more users at a given query response time.
• Select automated storage management software
  o Data forensics and tiered placement
  o Every workload has a unique I/O access signature.
  o Historical performance data for a LUN can identify performance skews.
• Optimize infrastructure to meet the needs of applications/SLAs