one billion objects in 2gb: big data analytics on small clusters with doradus olap randy guck...
Post on 31-Dec-2015
214 Views
Preview:
TRANSCRIPT
One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP
Randy GuckPrincipal EngineerDell Software Group
What is Doradus? Storage and query service
Leverages Cassandra NoSQL DB
Pure Java
- Stateless
- Embeddable or standalone
Open source: Apache 2.0 License
30 Doradus: The Tarantula Nebula Source: Hubble Space Telescope
Why Use Doradus? Easy to use: no client driver
Spider storage manager
- Good for unstructured data
OLAP storage manager
- Near real time data warehousing
Compared to Cassandra alone:
- Data model, searching, analytics
Compared to Hadoop:
- Fast data loads and queries
- Dense storage: less hardware
Cassandra
Data
Applications
REST API
OLAPSpide
r
Doradus
A Multi-Node Cluster
Doradus
Cassandra
Data
Node 2
Cassandra
Data
Doradus
Cassandra
Data
Node 1 Node 3
Applications
REST API
Secondary Doradusinstances are optional
Why Did We Build Doradus OLAP? Some tough customer requirements:
- Statistical queries most important
- Need to scan millions of objects/second
- User-customizable “insights” = millions of possible queries
Couldn’t use indexes, pre-computed queries, etc.
Disk physics
- ~100's of random reads/second
- ~1000's of serial reads/second
Needed a radically new approach!
Doradus OLAP Combines ideas from:
- Online Analytical Processing: data arranged in static cubes
- Columnar databases: Column-oriented storage and compression
- NoSQL databases: Sharding
Features:
- Fast loading: up to 500K objects/second/node
- Dense storage: 1 billion objects in 2 GB!
- Fast cube merging: typically seconds
- No indexes!
Example: Message Tracking Schema
Message Participant Address
PersonManager
Employees
PersonAddress
AttachmentsMessage
ParticipantsMessage
AddressParticipants
Attachment
DQL Object Queries Builds on Lucene syntax
- Full text queries
Adds link paths
- Directed graph searches
- Quantifiers and filters
- Transitive searches
Other features
- Stateless paging
- Sorting
Examples:
- LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]
- ALL(Participants).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support
- Employees^(4).Office='San Jose’
DQL Aggregate Queries Metric functions
- COUNT, AVERAGE, MIN, MAX, DISTINCT, ...
Multi-level grouping
Grouping functions
- BATCH, BOTTOM, FIRST, LAST, LOWER, SETS, TERMS, TOP, TRUNCATE, UPPER, WHERE, ...
Examples:
- metric=COUNT(*), AVERAGE(Size), MIN(Participants.Address.Person.Birthdate)
- metric=DISTINCT(Attachments.Extension);groups=Tags,
Participants.Address.Person.Department;query=Attachments.Size > 100000
- metric=AVERAGE(Size);groups=TOP(10,Participants.Address.Email)
OLAP Data Loading
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
Sources
OLAP Data Loading
Batch 1
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
Batch 2
Batch 3
...
Sources Batches
Batch 4
OLAP Data Loading
Batch 1
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
Batch 2
Batch 3
...
2014-03-01
2014-02-28
2014-02-27
Sources Batches Shards
Batch 4
Merge
OLAP Data Loading
Batch 1
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
Batch 2
Batch 3
...
2014-03-01
2014-02-28
2014-02-27
Sources Batches Shards OLAP Store
Batch 4
Merge
Storing Batches
Field ValuesID 5amhvv7J2otBu48Z6PE5cA 7CgvDf5mOU78jNVc58eu cZpz2q4Jf8Rc2HK9Cg08 ...
Size 48120 5435 24220 ...
SendDate 1280246462000 1279354872112 1279357261413 ...
Priority 0 0 1 ...
Subject.txt ballades encash nautch colloquy geared
nettlier outdoors culvert hypothec winder
stolons ungot guiding rupiahs outgone
...
Subject 1 2 0 ...
...
Data is sorted by object ID and stored as columnar, compressed blobs
Key ColumnsEmail/Message/2014-03-01/{Batch GUID}/ID [compressed data]
Email/Message/2014-03-01/{Batch GUID}/Size [compressed data]
Email/Message/2014-03-01/{Batch GUID}/SendDate [compressed data]
... ......
OLAP Table
Field Value Arrays
Compressed rows
Merging Batches
Key Columns
Email/Message/2014-03-01/ID [compressed data]
Email/Message/2014/03-01/Size [compressed data]
Email/Message/2014-03-01/SendDate [compressed data]
... ...
Email/Person/2014-03-01/ID [compressed data]
Email/Person/2014-03-01/FirstName [compressed data]
Email/Person/2014-03-01/LastName [compressed data]
... ...
Email/Address/2014-03-01/ID [compressed data]
Email/Address/2014-03-01/Person [compressed data]
Email/Address/2014/-03-01/Message [compressed data]
... ...
Email/Message/2014-02-28/ID [compressed data]
Email/Message/2014-02-28/Size [compressed data]
...
Batch #1: Shard 2014-03-01Message Table
ID ...
Size ...
SendDate ...
...
Batch #2: Shard 2014-03-01Message Table
ID ...
Size ...
SendDate ...
...
...
OLAP Store
Person Table
ID ...
FirstName ...
Lastname ...
...
Address Table
ID ...
Person ...
Messages ...
...
Person Table
ID ...
FirstName ...
Lastname ...
...
Address Table
ID ...
Person ...
Messages ...
...
Message table dataShard 2014-03-01
Person table dataShard 2014-03-01
Address table dataShard 2014-03-01
Data for other shards
OLAP Query Execution Example query:
- Count messages with Size between 1000-10000 andHasBeenSent=false in shards 2014-03-01 to 2014-03-31
How many rows are read?
- 2 fields x 31 shards = 62 rows
- Typically represents millions of values
Value arrays are scanned in memory
Physical rows are read on “cold” start only
- Multiple caching levels for “warm” and “hot” data
1 Billion Objects in 2GB?Example Security Event (CSV format):
Fixed Fields Variable FieldsComputer Name MAILSERVER18 1 MAILSERVER18$Log Name Security 2Time Stamp Sun, 22 Jan 2013 08:09:50 UTC 3 WorkstationType Success Audit 4 (0x0,0x142999A)Source Security 5 3Category Logon/Logoff 6 KerberosEvent ID 540 7 KerberosUser Domain NT AUTHORITYUser Name SYSTEM User SID S-1-5-18
MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos
Events Schema
EventsInsertio
nStrings
Fields:• ComputerName (text)• LogName (text)• Timestamp (timestamp)• Type (text)• Source (text)
Fields:• Index (integer)• Value (text)• Event (link)
Count: 115 Million Count: 880 Million
Params
Event (inverse)
• Category (text)• EventID (integer)• UserDomain (text)• UserSID (text)• Params (link)
Event Schema Load Load stats:
Total shards: 860
Total events: 114,572,247
Total ins strings: 879,529,753
Total objects: 994,102,000
Total load time: 2 hours, 2 minutes, 36 seconds (MacBook Air)
Space usage::nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 127.0.0.1 1.96 GB 100.0% 860887ef-2027-431a-a425-c67a9445d0e6 -9176223118562734495 rack1
Demo 1) Count all Events in all shards
- 860 shards => 115M events
2) Find the top 5 hours-of-the-day when certain privileged events fail:
- Event IDs are any of 577, 681, 529
- Event type is ‘Failure Audit’
- Insertion string 8 is (0x0,0x3E7)
- Event occurred in first half of 2005 (181 shards)
Doradus OLAP Summary Advantages:
Simple REST API
All fields are searchable without indexes
Ad-hoc statistical searches
Support for graph-based queries
Near real time data warehousing
Dense storage = less hardware
Horizontally scalable when needed
Doradus OLAP Summary Good for applications where data:
Is continuous/streaming
Is structured to semi-structured
Can be loaded in batches
Is partitionable, especially by time
Is typically queried in a subset of shards
Emphasizes statistical queries
top related