data analytics with nosql
![Page 1: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/1.jpg)
Data Analytics with NOSQL
Mukundan Agaram, Chris Weiss
![Page 2: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/2.jpg)
Some initial thoughts about data...
Continual issues with large scale web apps
– Data growth + query response time
  ● Data growth => performance degradation
  ● Explosion of big data “analytics” use cases
– Increase in unstructured data
  ● More interconnectivity, more formats, lack of structure...
  ● Document oriented data (XML/JSON) is difficult to manage and search
– Distributed server configurations
  ● Large systems, more distribution and HA
Cloud services have aggravated these issues
![Page 3: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/3.jpg)
Agenda for the night
● What is NOSQL?
● Varieties of NOSQL
● Key Industry Use Cases
● Applications for Data Analytics
● Landscape
● Demos/Walkthroughs
● Closing Discussions
![Page 4: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/4.jpg)
What is NOSQL?
● “...mechanism for storage and retrieval of data that is modeled in means other than tabular relations used in relational databases.” (Wikipedia)
● Non SQL or Non-relational
● Not Only SQL
● Technically around since the late 1960s...
  – E.g. IDMS, IMS, MUMPS, Cache, BerkeleyDB
![Page 5: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/5.jpg)
What is NOSQL?
● Drivers for modern day NOSQL
  – Web 2.0
  – Big Data
  – Facebook, Google, Amazon, Expedia etc.
  – Horizontal scaling to clusters of computers
    ● Achilles heel for RDBMS
  – Cost
  – Provide
    ● HA
    ● Data partitioning (a.k.a. sharding)
    ● Speed
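To make the horizontal scaling and sharding drivers above concrete, here is a minimal sketch of hash-based sharding in plain Python. The `ShardedStore` class and `shard_for` helper are hypothetical names invented for illustration; they are not taken from the talk or from any product, and the "nodes" are just in-memory dictionaries.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that spreads keys across several in-memory 'nodes'."""

    def __init__(self, num_shards: int) -> None:
        self.num_shards = num_shards
        self.nodes = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.nodes[shard_for(key, self.num_shards)][key] = value

    def get(self, key: str):
        return self.nodes[shard_for(key, self.num_shards)].get(key)


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})

# Each node holds only a slice of the data, so reads and writes spread out.
sizes = [len(node) for node in store.nodes]
print(sizes)
```

One caveat with this sketch: plain modulo hashing moves most keys when the number of shards changes, which is why distributed stores often use consistent hashing or a similar scheme instead.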
![Page 6: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/6.jpg)
NOSQL - Drawbacks and Barriers
● Compromise on consistency (CAP Theorem)
● Custom query languages vs. SQL
● Lack of standardized interfaces
● Existing investments in RDBMS
● Most lack true ACID transactions.
  – Use an “eventually” consistent model
  – Data is replicated with a conflict resolution algorithm
  – Methods for conflict resolution and distribution vary significantly
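As one possible illustration of replication with conflict resolution, the sketch below uses a last-write-wins rule. This is only one of several strategies and is an assumption made for the example, not a method named in the talk; the `Versioned` record, the `merge` function and the sample cart data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # writer's clock; assumed comparable across replicas
    replica_id: str   # tie-breaker when timestamps are equal


def resolve_lww(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the version with the later (timestamp, replica_id)."""
    return max(a, b, key=lambda v: (v.timestamp, v.replica_id))


def merge(local: dict, remote: dict) -> dict:
    """Merge a remote replica's state into a copy of the local state."""
    merged = dict(local)
    for key, incoming in remote.items():
        current = merged.get(key)
        merged[key] = incoming if current is None else resolve_lww(current, incoming)
    return merged


# Two replicas accepted different writes for the same key.
replica_a = {"cart:7": Versioned("2 items", 100.0, "a")}
replica_b = {
    "cart:7": Versioned("3 items", 105.0, "b"),
    "cart:9": Versioned("1 item", 90.0, "b"),
}

# After exchanging state in both directions, the replicas converge.
state_a = merge(replica_a, replica_b)
state_b = merge(replica_b, replica_a)
print(state_a == state_b)
```

Last-write-wins is simple but silently discards the losing write; other systems keep both versions for the application to reconcile, or use structures such as vector clocks or CRDTs.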
![Page 7: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/7.jpg)
CAP Theorem
● a.k.a. Brewer's theorem
● Impossible for a distributed computer system to simultaneously provide all three of
  – Consistency
    ● All nodes see the same data at the same time
  – Availability
    ● Every request receives a response
  – Partition Tolerance
    ● Fault tolerance to partitioning because of network failures
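The trade-off can be sketched with a toy two-replica cluster. This is a deliberately simplified model written for illustration, not a description of how any real database behaves; the `Cluster` class and its "CP" and "AP" modes are hypothetical. During a network partition the cluster must either refuse the write (staying consistent but unavailable) or accept it on one side (staying available while the replicas diverge).

```python
class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas that normally replicate every write to each other."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.left = Replica("left")
        self.right = Replica("right")
        self.partitioned = False

    def write(self, replica: Replica, key: str, value) -> bool:
        """Return True if the write was accepted."""
        peer = self.right if replica is self.left else self.left
        if self.partitioned:
            if self.mode == "CP":
                return False           # refuse: consistent, but not available
            replica.data[key] = value  # accept locally: available, but divergent
            return True
        replica.data[key] = value
        peer.data[key] = value
        return True

    def consistent(self) -> bool:
        return self.left.data == self.right.data


def run(mode: str):
    cluster = Cluster(mode)
    cluster.write(cluster.left, "x", 1)   # replicated to both sides
    cluster.partitioned = True            # the network splits
    accepted = cluster.write(cluster.left, "x", 2)
    return accepted, cluster.consistent()


print(run("CP"))  # (False, True)
print(run("AP"))  # (True, False)
```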
![Page 8: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/8.jpg)
CAP alignment for NOSQL
Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
![Page 9: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/9.jpg)
NOSQL direction
The landscape is morphing...
● Current NOSQL industry focus
  – Address large distributed systems in reaction to the CAP theorem
● The newer breed of NOSQL addresses important aspects such as ACID
● There is a new buzz word …
  – NewSQL
![Page 10: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/10.jpg)
Database Evolution
![Page 11: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/11.jpg)
NOSQL Model Classification
| Model | Description |
|---|---|
| Key Value Stores & Caches | Data is represented as a collection of (K,V) pairs. In-memory, persistent or eventually persistent. |
| Document Databases | Data is stored in JSON document structures. |
| RDF, OWL & Triple Stores | Meaningful way to connect information. Supports inference over triples (S,P,O). Can be represented graphically. Queried with SPARQL. |
| Wide Column Databases | Extensible record set. Stores data tables as sections of columns. Great for EDW. |
| Graph Databases | Stores data as a graph G(V,E). Great for correlation analysis, recommendation engines and fraud detection. |
| Multi-model Databases | Combination of two or more of the varieties above. |
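To compare the shapes these models give to data, the sketch below expresses the same small fact (a customer who lives in a city and ordered a product) in each of them using plain Python structures. The record, keys and field names are hypothetical and chosen only for illustration; real products add indexing, persistence and query languages on top of shapes like these.

```python
import json

# Key-value: an opaque value looked up by key.
kv_store = {"customer:42": '{"name": "Alice", "city": "Lansing"}'}

# Document: a nested, self-describing structure.
document = {
    "_id": 42,
    "name": "Alice",
    "address": {"city": "Lansing"},
    "orders": [{"sku": "A-1", "qty": 2}],
}

# Triple store: (subject, predicate, object) statements.
triples = [
    ("customer:42", "hasName", "Alice"),
    ("customer:42", "livesIn", "Lansing"),
    ("customer:42", "ordered", "sku:A-1"),
]

# Wide column: row key -> column family -> column -> value.
wide_column = {"42": {"profile": {"name": "Alice", "city": "Lansing"},
                      "orders": {"A-1": 2}}}

# Graph: vertices V and labeled edges E.
vertices = {"customer:42": {"name": "Alice"}, "city:Lansing": {}, "sku:A-1": {}}
edges = [("customer:42", "LIVES_IN", "city:Lansing"),
         ("customer:42", "ORDERED", "sku:A-1")]


def city_from_kv() -> str:
    return json.loads(kv_store["customer:42"])["city"]


def city_from_document() -> str:
    return document["address"]["city"]


def city_from_triples() -> str:
    return next(o for s, p, o in triples
                if s == "customer:42" and p == "livesIn")


def city_from_wide_column() -> str:
    return wide_column["42"]["profile"]["city"]


def city_from_graph() -> str:
    target = next(dst for src, label, dst in edges
                  if src == "customer:42" and label == "LIVES_IN")
    return target.split(":", 1)[1]


answers = [city_from_kv(), city_from_document(), city_from_triples(),
           city_from_wide_column(), city_from_graph()]
print(answers)
```

The same question ("which city?") gets the same answer in every model; what differs is how the lookup is expressed and which other questions stay cheap as the data grows.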
![Page 12: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/12.jpg)
NOSQL Models
● Key-Value
  – Cache (EHCache, BigMemory, Coherence, Memcached)
  – Store (Redis, Riak, Aerospike, Oracle NoSQL)
● Document (MongoDB, CouchDB, Amazon DynamoDB)
● Wide Column (Cassandra, HBase, Vertica)
● Graph (Neo4j, Titan, Giraph)
● Multi-model (OrientDB, ArangoDB, Sqrrl)
![Page 13: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/13.jpg)
Source: www.db-engines.com
![Page 14: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/14.jpg)
Consider NOSQL for...
● Enabling “big data” and “web” scale
  – Massive distribution through horizontal scaling
● Performant queries (alternatives to RDBMS)
  – Denormalization and large horizontal scalability
● Massive write volumes (Facebook, Twitter)
● Fast and dynamic access to key data
● Flexible schemas and data types
● Data/Schema Migration
● Developer centric environments
![Page 15: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/15.jpg)
Consider NOSQL for...
● Diverse data organization options
  – Hierarchical correlation
  – Graph correlation
  – Semantic relationships
  – Set based analytics
● Caching in end usage format
● Data Archival
● Big Data Analytics
  – Cumulative metrics and insights
  – Correlation
![Page 16: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/16.jpg)
Where RDBMS/SQL is better...
● OLTP
● Data Integrity
● SQL centricity
● Complex relationships
  – With the exception of graph NOSQL
● Maturity, stability and standardization
![Page 17: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/17.jpg)
Use Cases
● Log management (unstructured data)
● Data synchronization (online vs. offline sources)
  – Shopping cart, Field sales/services, PoS, Gaming, Transportation/telemetry
● User profile management
● Customer 360 degree view
● Fraud detection
● Medical/Healthcare diagnosis
● Data Archival
● Recommendation Engines
![Page 18: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/18.jpg)
Applications for Data Analytics
● Complements (part of) Hadoop and Big Data
● Acts as the persistence infrastructure for larger machine learning use cases
  – Predictive Analytics
  – Fraud/Anomaly/Outlier Detection
  – Recommendation engines
● Provides a backdrop for interesting data visualization initiatives
  – Integrate with visualization packages such as Tableau
![Page 19: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/19.jpg)
Interesting links
● Redis in Practice: Who's online?
  www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/
● Inventory list of NOSQL systems
  www.nosql-database.org
● Database Engine ranking and analytics
  www.db-engines.com
● Visual guide to NOSQL systems
  www.blog.nahurst.com/visual-guide-to-nosql-systems
![Page 20: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/20.jpg)
Case Studies / Demos
● Retail fraud detection
  – Neo4j
  – Contrasting with OrientDB
  – TinkerPop/Gremlin/Blueprints
● 360 degree single view of voter information
  – MongoDB
● Schema on read
  – Hadoop
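The demos themselves are not reproduced in this transcript. As a rough idea of why a graph model suits fraud detection, the sketch below links accounts to the identifiers they use (cards, phones, addresses) and flags groups of accounts connected through shared identifiers. It is plain Python rather than Neo4j or Gremlin code, and the sample data, the `suspicious_rings` function and the three-account threshold are all assumptions made for the example.

```python
from collections import defaultdict, deque

# Hypothetical edges: (account, identifier that account uses).
edges = [
    ("acct:1", "card:9001"), ("acct:2", "card:9001"),
    ("acct:2", "phone:555-0100"), ("acct:3", "phone:555-0100"),
    ("acct:4", "card:9002"),
    ("acct:5", "addr:12 Elm St"), ("acct:6", "addr:12 Elm St"),
]


def build_graph(edge_list):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    """Breadth-first search over every unvisited node."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def suspicious_rings(edge_list, min_accounts=3):
    """Return groups of accounts linked through shared identifiers."""
    rings = []
    for component in connected_components(build_graph(edge_list)):
        accounts = sorted(n for n in component if n.startswith("acct:"))
        if len(accounts) >= min_accounts:
            rings.append(accounts)
    return rings


rings = suspicious_rings(edges)
print(rings)  # [['acct:1', 'acct:2', 'acct:3']]
```

Accounts 1 and 3 never share an identifier directly; they are linked only through account 2. That kind of multi-hop relationship is natural to traverse in a graph and awkward to express with joins.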
![Page 21: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/21.jpg)
![Page 22: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/22.jpg)
![Page 23: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/23.jpg)
Gremlin Blueprints Architecture
Neo4j OrientDB TitanGraph ArangoDB
![Page 24: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/24.jpg)
Qualified Voter – Use Case
● Tracks registration information for all voters in Michigan
● Uses a tabular geography model
● Highly normalized schema
  – Data partitioned into subsets
    ● Enable local application instances and row level security
● Expensive queries when doing reporting
● Expensive queries for performing “single view” of voter
● Several tables with tens of millions of records
![Page 25: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/25.jpg)
Voter Schema
![Page 26: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/26.jpg)
Find the first 100 voters in Ingham county with status and school district
-- Oracle-style query: implicit joins across 12 tables, limited with ROWNUM.
SELECT V.VOTER_IDENTIFICATION_NUMBER, V.FIRST_NAME, V.LAST_NAME, G.CODE AS GENDER,
       IDS.NAME AS ID_STATUS, UST.NAME AS UOCAVA_STATUS,
       VA.ADDRESS_LINE_ONE, VA.CITY, VA.ZIP_CODE,
       DIS.NAME AS SCHOOL_DISTRICT
FROM VOTER V, VOTER_ADDRESS VA, GENDER G,
     IDENTIFICATION_STATUS IDS, UOCAVA_STATUS UST, VOTER_STATUS_TYPE VST,
     STREET_RANGE SI, DISTINCT_POLITICAL_AREA DPA, DISTINCT_POLITICAL_AREA_DIS DPAD,
     DISTRICT DIS, DISTRICT_TYPE DT, COUNTY CO
-- Voter joined to its address and lookup tables; active voters only
WHERE V.ID = VA.VOTER_ID AND V.GENDER_ID = G.ID AND V.IDENTIFICATION_STATUS_ID = IDS.ID
  AND V.UOCAVA_STATUS_ID = UST.ID AND V.VOTER_STATUS_TYPE_ID = VST.ID AND VST.NAME = 'Active'
-- Address resolved to a political area, restricted to Ingham county
  AND VA.STREET_RANGE_ID = SI.ID AND SI.DISTINCT_POLITICAL_AREA_ID = DPA.ID
  AND VA.IS_ACTIVE = 'Y'
  AND DPA.COUNTY_ID = CO.ID AND CO.NAME = 'Ingham'
-- Political area resolved to its school district
  AND DPA.ID = DPAD.DISTINCT_POLITICAL_AREA_ID AND DPAD.DISTRICT_ID = DIS.ID
  AND DIS.DISTRICT_TYPE_ID = DT.ID AND DT.NAME = 'School'
  AND ROWNUM <= 100;
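For contrast with the twelve-table join above, here is a minimal sketch of the "single view" idea with one denormalized document per voter. The document layout, the sample values and the `single_view` function are assumptions made for this example (field names loosely follow the SQL column names); the actual MongoDB schema from the demo is not shown in this transcript, and the filter runs over an in-memory list rather than a database.

```python
# Hypothetical denormalized voter documents: lookups and geography are embedded.
voters = [
    {
        "voter_identification_number": "100001",
        "first_name": "Pat", "last_name": "Example", "gender": "F",
        "id_status": "Verified", "uocava_status": "None",
        "voter_status": "Active",
        "address": {"line_one": "1 Main St", "city": "Lansing",
                    "zip_code": "48933", "county": "Ingham", "is_active": True},
        "districts": [{"type": "School", "name": "Example School District"},
                      {"type": "Congressional", "name": "Example District"}],
    },
    {
        "voter_identification_number": "100002",
        "first_name": "Sam", "last_name": "Sample", "gender": "M",
        "id_status": "Verified", "uocava_status": "None",
        "voter_status": "Cancelled",
        "address": {"line_one": "2 Oak Ave", "city": "Lansing",
                    "zip_code": "48933", "county": "Ingham", "is_active": True},
        "districts": [{"type": "School", "name": "Example School District"}],
    },
]


def single_view(documents, county, limit=100):
    """Return up to `limit` rows for active voters in `county`, one per school district."""
    rows = []
    for doc in documents:
        address = doc["address"]
        if (doc["voter_status"] != "Active"
                or not address["is_active"]
                or address["county"] != county):
            continue
        for district in doc["districts"]:
            if district["type"] != "School":
                continue
            rows.append({
                "voter_identification_number": doc["voter_identification_number"],
                "first_name": doc["first_name"],
                "last_name": doc["last_name"],
                "gender": doc["gender"],
                "id_status": doc["id_status"],
                "uocava_status": doc["uocava_status"],
                "address_line_one": address["line_one"],
                "city": address["city"],
                "zip_code": address["zip_code"],
                "school_district": district["name"],
            })
            if len(rows) >= limit:
                return rows
    return rows


rows = single_view(voters, county="Ingham")
print(rows)
```

Everything the report needs lives in one document, so no joins are required at read time. The cost moves to write time: embedded values such as district names are duplicated across voters and must be kept in sync when they change.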
![Page 27: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/27.jpg)
![Page 28: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/28.jpg)
![Page 29: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/29.jpg)
Expensive in terms of IO
● Multiple objects read
● Two stage IO:
  – Read index
  – Read entire table row
● Selected and WHERE clause columns assembled and then filtered
● Resources for larger volume query would be high: memory, CPU, fast disk
![Page 30: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/30.jpg)
Parting conclusions
● NOSQL is a mixed bag of fruit
● This space is growing
● There are hundreds of products
● Best value is realized from identifying the correct use case
  – Functional requirements
  – Non-functional requirements
![Page 31: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/31.jpg)
Finally you can use NOSQL for...
![Page 32: Data analytics with NOSQL](https://reader031.vdocuments.us/reader031/viewer/2022022205/58cee6081a28ab333d8b51e3/html5/thumbnails/32.jpg)
Thank You!!
Questions?