Understanding BigObject®
A White Paper
2016/04/01
BigObject®
White Paper Understanding BigObject
4/7/2016 Page 1
Contents
1. Introduction: What is BigObject®?
2. What does BigObject® do?
3. How is BigObject® different from other database-grade platforms?
4. What type of database is built in BigObject®: relational, hierarchical, or NoSQL?
5. What schema does BigObject® adopt?
6. How is BigObject® optimized for such high performance?
7. How does BigObject® perform in comparison with other analytic databases?
8. What is smart query, and why use it?
9. Is BigObject® secure?
10. Does BigObject® support column-based tables?
11. Does BigObject® support standard SQL?
12. What is ‘in-data’ computing?
13. How does BigObject® collect data?
14. What concurrency control does BigObject® use to guarantee isolation? How does it impact its performance?
15. Is BigObject® a reliable database?
1. Introduction: What is BigObject®?
BigObject® is a database-grade Smart Data Analytics platform that delivers extraordinary
performance and the power of smart query. It is designed to store and analyze large
volumes of data with extremely high performance. It utilizes an extended relational data
model with two variations: the key-value model and the tree model. BigObject® analytics
supports SQL-like queries and an in-database computing framework for smart queries.
2. What does BigObject® do?
BigObject® supports analytic queries that can be expressed in SQL, and provides a
programming framework for smart queries (defined later in this Q&A). It enables data
analytics, such as multi-dimensional analysis, anomaly detection, complex data pattern
detection and associative analysis, in both batch and streaming processing.
In addition to SQL, three layers make up BigObject’s analytics stack: (1) Find-By-Filter,
(2) Find-By-Pattern and (3) Find-By-Association. SQL and Find-By-Filter are both used
for multi-dimensional analytics and BI-related applications. Find-By-Filter uses the
‘FIND’ syntax, which is similar to SQL’s ‘SELECT’ (record-level filtering in general),
except that it focuses on entity-level (group-level) queries with group-level filtering. In
the presence of large volumes of data, what people want is meaningful information about
entities or groups rather than individual facts or records. Find-By-Pattern is used for
analytics that follow complex data patterns, such as the ‘80-20 rule,’ to identify
noteworthy entities within a dataset. It adopts terms such as ‘significant’ or ‘important’
to describe entities qualitatively; in practice, these terms are developed through a process
of concept formation based on data patterns, common properties or previously defined
terms. Find-By-Association performs analysis based on associative links and identifies
associated entities within a dataset: A is said to be associated with B if A and B have
some property (i.e., attribute values) in common. The latter two are sometimes referred
to as ‘knowledge-seeking queries’, which are quite difficult to express in SQL.
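Group-level filtering can be approximated in standard SQL with GROUP BY/HAVING. The sketch below uses plain SQLite from Python (not BigObject’s FIND syntax, which is product-specific) to contrast a record-level predicate with an entity-level one on a toy sales table.

```python
import sqlite3

# Toy sales table: each row is a record; the entity is the customer.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("alice", 120.0), ("alice", 80.0),
                 ("bob", 15.0), ("carol", 300.0), ("carol", 5.0)])

# Record-level filtering: individual rows whose amount exceeds 100.
rows = con.execute(
    "SELECT customer, amount FROM sales WHERE amount > 100").fetchall()

# Group-level (entity-level) filtering: customers whose TOTAL spend
# exceeds 150 -- the predicate applies to the group, not the record.
groups = con.execute(
    "SELECT customer, SUM(amount) AS total FROM sales "
    "GROUP BY customer HAVING total > 150 ORDER BY customer").fetchall()

print(rows)    # individual records
print(groups)  # entities selected by a group-level predicate
```
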
In order to support complex user-defined patterns with user-created terminology, BigObject®
provides a Lua-based programming framework called “In-Place Programming”, which
allows developers to define new terms (e.g., for data patterns or behaviors) that describe
the qualities of identified entities/groups. This process is inspired by the concept-
formation process used in human communities.
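As a rough, non-BigObject illustration of this concept-formation idea, the Python sketch below defines a new term, ‘vital’ (a made-up name), over an 80-20-style pattern and uses it to qualify entities.

```python
# Concept formation sketch: name a data pattern (the '80-20 rule') and use
# the new term ('vital') to qualify entities. Not BigObject's Lua framework.
def vital_entities(totals, share=0.8):
    """Entities that jointly account for `share` of the overall total,
    taken greedily from the largest contributor down."""
    grand = sum(totals.values())
    acc, vital = 0.0, []
    for name, value in sorted(totals.items(), key=lambda kv: -kv[1]):
        if acc >= share * grand:
            break
        vital.append(name)
        acc += value
    return vital

revenue = {"p1": 500, "p2": 250, "p3": 120, "p4": 80, "p5": 50}
top = vital_entities(revenue)
print(top)  # the few entities that carry ~80% of the total
```
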
3. How is BigObject® different from other database-grade
platforms?
Apart from performance, there are two major differences: (1) support for multi-faceted data
models — the relational model, hierarchical model and key-value model, which are
mutually transformable — and (2) support for smart queries that identify interesting or
associated entities.
4. What type of database is built in BigObject®: relational,
hierarchical, or NoSQL?
The built-in database of BigObject® is an extended relational database with multi-faceted data
models. It supports relational (tables), hierarchical (trees) and key-value (semi-
structured data) models, which are mutually transformable. The key-value model is
commonly used for raw datasets whose schemas cannot be determined in advance, or
might be altered dynamically. The relational model defines relational formulas and
provides record-level query capability, while the hierarchical model is designed for
efficient computation in smart queries and for fast access to intermediate and queried
datasets.
a. Key-Value Model
This dataset consists of a ‘compilation’ of records, each of which contains two parts, the
key and the value. A key can be an ID attribute or a fixed list of ID attributes. The
value is an integer, Boolean value, real number, string or structure for a complex
value. A dataset may contain multiple records with the same key but different
values. Given a subset of keys, a key-value dataset can be transformed into a table
by an operation called Trans-Pivot.
b. Hierarchical Model
This dataset consists of a ‘tree’ of records, each of which contains an ID and a value in
the form of an integer, Boolean value, real number, string or a structure for a
complex value. Given a set of tables and an ordered set of attributes selected from
the table schema, a hierarchical data model can be constructed through an operation
called ‘Trans-Join.’
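Trans-Pivot and Trans-Join are BigObject-specific operations; as a rough illustration only, the Python sketch below mimics their shape on toy data. The record layout, function names and arguments here are assumptions for the sketch, not BigObject’s actual API.

```python
from collections import defaultdict

def trans_pivot(records, key_attrs):
    """Sketch of Trans-Pivot: group key-value records by a chosen subset of
    key attributes and pivot the remaining (field, value) pairs into columns."""
    rows = defaultdict(dict)
    for rec in records:
        key = tuple(rec["key"][a] for a in key_attrs)
        rows[key][rec["field"]] = rec["value"]
    # Emit one table row per distinct key.
    return [dict(zip(key_attrs, k), **v) for k, v in sorted(rows.items())]

def trans_join(rows, path):
    """Sketch of Trans-Join: fold flat table rows into a tree keyed by an
    ordered list of attributes; leaves keep each row's remaining fields."""
    tree = {}
    for row in rows:
        node = tree
        for attr in path:
            node = node.setdefault(row[attr], {})
        node.setdefault("_records", []).append(
            {k: v for k, v in row.items() if k not in path})
    return tree

# Key-value dataset -> table
kv = [
    {"key": {"id": 1}, "field": "name", "value": "sensor-a"},
    {"key": {"id": 1}, "field": "temp", "value": 21.5},
    {"key": {"id": 2}, "field": "name", "value": "sensor-b"},
]
table = trans_pivot(kv, ["id"])

# Table -> tree, keyed by the ordered attributes (region, store)
sales = [
    {"region": "east", "store": "s1", "sales": 10},
    {"region": "east", "store": "s2", "sales": 7},
    {"region": "west", "store": "s3", "sales": 4},
]
tree = trans_join(sales, ["region", "store"])
print(table)
print(tree["east"]["s1"])
```
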
[Figure: BigObject® Multifaceted Database]
5. What schema does BigObject® adopt?
For simplicity, we often use the Star schema (or its Snowflake variant) to describe BI-like
analytics such as multi-dimensional analysis. The Star schema is a logical arrangement
of tables in a multi-dimensional database such that the entity-relationship diagram
resembles a star (or snowflake). It supports an intuitive, fact-based style of analysis
that is both easy to describe and easy to understand. However, BigObject’s data models
are not limited to such multi-dimensional databases, and likewise its analytics are not
limited to multi-dimensional analysis. BigObject’s data models are designed to handle
the following data types. Assume a dataset is a collection of records.
1. Fact Data
Fact data in which the number of attributes/columns cannot be determined in advance
(i.e., the dataset cannot satisfy the first normal form). This type of data is best
represented by the key-value model.
2. Single-valued Data
Single-valued data that can satisfy the second normal form, third normal form and Boyce-
Codd normal form (2nd, 3rd and 3.5th). This type of data is best represented by the
relational model.
3. Multi-valued Data
Multi-valued data that can satisfy the fourth and fifth normal form. This type of data is
best represented by the hierarchical model, or a combination of the relational model
and hierarchical model. Time-series data is such an example.
6. How is BigObject® optimized for such high performance?
The key idea behind the optimization is “push-down logic”. This approach works by pushing
code down to where data is stored and reducing the amount of data extracted from the
data space for computing. Thus, BigObject® is optimized from top to bottom following
the guidelines below:
1. Enable smart queries to be performed within the database (to reduce the amount of
data extracted from the database).
2. Enable User-Defined Functions (UDFs) to be performed in the same space where the
data is kept (to avoid large data retrieval via SQL).
3. Enable an infinite and persistent memory space for both computing and storage, so
that data is placed with strong locality (to avoid excessive disk I/O).
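The pull-versus-push contrast can be sketched with plain SQLite in Python: both paths compute the same sum, but the push-down path moves only a single value out of the data space instead of every record.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (page TEXT, dur REAL)")
con.executemany("INSERT INTO visits VALUES (?, ?)",
                [("a", 1.0), ("a", 2.0), ("b", 4.0)])

# Pull-based: extract every record, then compute in the client program.
pulled = sum(d for (d,) in con.execute("SELECT dur FROM visits"))

# Push-down: ship the computation to the data; one value crosses the boundary.
(pushed,) = con.execute("SELECT SUM(dur) FROM visits").fetchone()

print(pulled, pushed)  # same answer, very different data movement
```
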
7. How does BigObject® perform in comparison with other
analytic databases?
A benchmark study was conducted to shed light on the comparative performance of
BigObject® and several other advanced analytic databases.
1. Benchmark Study
The goal of this study was to design a fair and replicable benchmark to evaluate the
computing performance among popular databases, data warehouse systems and
BigObject. Dataset and query statements are based on the U.C. Berkeley AMPLab’s
benchmark (https://amplab.cs.berkeley.edu/benchmark/).
There are two tables:
Rankings: 17,999,999 rows, 1.1 GB
UserVisits: 154,999,997 rows, 25 GB
To keep the benchmark fair, all tests were run on Amazon AWS on the following
machines:
BigObject, MySQL and a NewSQL database: Amazon EC2, instance type r3.xlarge
(vCPU: 4, ECU: 13, memory: 30.5 GB).
RedShift: instance type ds2.xlarge (vCPU: 4, ECU: 14, memory: 31 GB).
According to U.C. Berkeley AMPLab’s benchmark, three sets of testing statements are
used, “Scan Query (1a~1c)”, “Aggregation Query (2-)“, and “Join Query (3a~3d)”.
The results are shown in the following table.
Query   BigObject®   RedShift    MySQL      New SQL
1a      0.246s       0.81s       10.202s    6.904s
1b      0.319s       0.655s      13.916s    7.051s
1c      1.193s       2.025s      127.666s   24.577s
2-      45.979s      86.772s     >6hrs      1781.060s
3a      3.14s        27.848s     >10hrs     108.760s
3b      5.7s         32.047s     >10hrs     150.536s
3c      31.187s      122.297s    >10hrs     >6hrs
3d      70.824s      200.98s     >10hrs     >9hrs
[Figure: Performance comparison between BigObject® and RedShift]
8. What is smart query, and why use it?
Although various commercial and open-source relational databases, OLAP systems and
NoSQL databases have been developed, users cannot easily identify meaningful
patterns in the vast volumes of data stored in these systems using standard query
languages. Databases today support SQL or SQL-like languages, but apart from the
group-by function and basic statistics such as ‘sum’ and ‘mean’, their support for
group-level (or entity-level) queries is limited. Without sufficient group-level or
entity-level support, the traditional approach to identifying meaningful patterns at the
group level offers no alternative but to extract all of the relevant entries from the
database and plug them into a separate program’s data structures for processing.
In the presence of large volumes of data, what people are interested in is meaningful
information about entities or groups rather than individual data entries or records. SQL
is capable of grouping data by entities or attributes, but fails to describe the characteristics
or patterns of these entities and groups. A group- or entity-level query can be processed to
identify meaningful patterns in entities or groups, wherein member records are matched
against specific data patterns. In addition, complex data patterns that may be difficult
to express in SQL or other SQL-like languages become possible.
User-Defined Functions (UDFs) and Stored Procedures, which many databases run on the
database server, fall into the same paradigm, wherein all needed data records are
extracted from the database and passed through SQL. This type of approach becomes
difficult in both implementation and performance as larger volumes of data are
processed. There is therefore a pressing need for a querying mechanism that can express
complex data patterns in a high-level manner, and use push-down logic to identify the
entities that match specified data patterns, without extracting large volumes of data.
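A small Python sketch of the kind of group-level pattern that is awkward in plain SQL: selecting the entities whose member records rise strictly over time. The term ‘trending’ is invented for this sketch; it is not BigObject terminology.

```python
from itertools import groupby

# (entity, time, value) records, pre-sorted by entity then time.
data = [
    ("a", 1, 10), ("a", 2, 12), ("a", 3, 15),
    ("b", 1, 9),  ("b", 2, 9),  ("b", 3, 8),
]

def trending(records):
    """Group-level predicate: the entity's values rise strictly over time."""
    vals = [v for _, _, v in records]
    return all(x < y for x, y in zip(vals, vals[1:]))

# Match whole entities against the pattern, not individual records.
matches = [ent for ent, recs in groupby(data, key=lambda r: r[0])
           if trending(list(recs))]
print(matches)
```
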
9. Is BigObject® secure?
Yes. BigObject® is a secure database that supports data encryption to protect data at rest (in
files) and in transit, and it also provides a set of APIs for security modules handling
authentication, authorization and auditing.
Data in files: Data in BigObject® is stored in files, which are protected by encryption using
AES (the Advanced Encryption Standard). Encrypting such files at rest helps protect
them should physical security measures fail.
Data in transit: Data in transit, during either upload or download, is protected with
SSL/HTTPS against eavesdropping by unauthorized users.
With BigObject’s API support, BigObject® works with authentication modules such as
OpenRadius and authorization servers such as OAuth2. Access logs are often
implemented in the authentication module to reduce the system’s I/O overhead. The
BigObject® module itself ships without built-in authentication, but it is API-ready to
connect to third-party authentication and authorization modules.
10. Does BigObject® support column-based tables?
Yes. In general, BigObject® implements a ‘column-group’ design, where the columns of a
table can be divided into groups based on the application’s needs. A pure column-based
table can be obtained by treating each column as its own column group; a row-based
table can be obtained by placing all columns in a single column group. In BigObject,
each column group is backed by one memory-mapped file.
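A minimal sketch of the column-group idea, with plain Python lists standing in for BigObject’s per-group memory-mapped files. The class and method names here are illustrative, not BigObject’s API.

```python
# Column-group sketch: one logical table, its columns split across
# independent stores (one "file" per group; lists stand in for files).
class ColumnGroupTable:
    def __init__(self, groups):
        self.groups = groups                    # {group_name: [column names]}
        self.store = {g: [] for g in groups}    # one store per column group

    def insert(self, row):
        # Each row is split so every group persists only its own columns.
        for g, cols in self.groups.items():
            self.store[g].append(tuple(row[c] for c in cols))

    def scan(self, group):
        # A scan touching one group never reads the other groups' storage.
        return self.store[group]

t = ColumnGroupTable({"ids": ["user"], "metrics": ["clicks", "dur"]})
t.insert({"user": "u1", "clicks": 3, "dur": 1.5})
t.insert({"user": "u2", "clicks": 7, "dur": 0.4})
print(t.scan("metrics"))
```

With one column per group this degenerates to a pure column store; with all columns in one group it degenerates to a row store, matching the two extremes described above.
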
11. Does BigObject® support standard SQL?
BigObject® supports SQL-like languages with smart query capability.
12. What is ‘in-data’ computing?
In-Data Computing (also known as In-Place Computing) is an abstract model in which all data
is kept in an infinite and persistent memory space used for both storage and computing.
In-data computing is a data-centric approach: data is computed in the same space where
it is stored. Instead of moving data to the code, code is moved to the data space for
processing. With today’s 64-bit architectures and virtualization technology, BigObject®
presents a persistent and nearly infinite memory space in which data lives and is
processed.
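The storage side of this model can be sketched with a memory-mapped file in Python: the file is simultaneously the persistent store and the address space the code computes in, so no separate load/store step moves data in or out.

```python
import mmap
import os
import tempfile

# Reserve a small file-backed region to stand in for the persistent space.
path = os.path.join(tempfile.mkdtemp(), "space.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), 0)   # map the file into the address space
    mem[0:5] = b"hello"              # "compute" directly in the mapped data
    mem.flush()                      # make the update persistent
    mem.close()

# After unmapping, the result is already on disk -- nothing was copied out.
with open(path, "rb") as f:
    head = f.read(5)
print(head)
```
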
13. How does BigObject® collect data?
There are several ways:
REST API: write a client program that issues SQL insert statements via the BigObject®
REST API.
Binary API: upload an Avro file via TCP.
Uploader: upload a CSV file to BigObject® from the BigObject® shell. BigObject® can
process data in both CSV and MS Excel formats (the xls file format, but not xlsx).
FluentD: use FluentD to collect data from different sources.
MySQL: use MySQL as the staging (or primary) database to sync data into BigObject®,
via the replication feature that MySQL and many other relational databases offer.
14. What concurrency control does BigObject® use to
guarantee isolation? How does it impact its performance?
BigObject® adopts a simple but highly efficient locking procedure, similar to object-based
read-write locking.
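A minimal sketch of object-based read-write locking in Python (an illustration of the flavor of locking described, not BigObject’s implementation): many readers share an object’s lock concurrently, while a writer holds it exclusively.

```python
import threading

class RWLock:
    """Readers-preferred read-write lock: concurrent readers, exclusive writer."""
    def __init__(self):
        self._readers = 0
        self._mutex = threading.Lock()    # guards the reader count
        self._writer = threading.Lock()   # held while any access is in flight

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()    # first reader blocks writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()    # last reader admits writers

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()

lock = RWLock()          # one lock per object (table, tree, ...)
lock.acquire_read()
lock.acquire_read()      # readers share freely
lock.release_read()
lock.release_read()
lock.acquire_write()     # a writer now has exclusive access
lock.release_write()
print("ok")
```

This matches the use case described below: many large read-only queries, with occasional writes taking the object exclusively.
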
15. Is BigObject® a reliable database?
BigObject® is a data analysis platform that delivers extraordinary performance and the power
of smart query. It is designed to store and analyze large volumes of data with extremely
high performance. BigObject is designed to complement relational databases and bring
analytic capability to transactional applications. It is not designed for OLTP (On-Line
Transaction Processing), so there is no support for the notion of “transactions” (i.e.,
Begin/End Transaction) as in an RDBMS.
Nevertheless, BigObject is a reliable analytic platform that keeps all analytic data safe and
sound. Data atomicity, consistency, isolation and persistence are key parts of
BigObject’s architectural design at its most basic level, which also embodies a
calculated tradeoff between performance and data reliability. To support various use
cases and requirements, BigObject provides different reliability mechanisms with
minimal performance overhead. Full data isolation is always guaranteed during
concurrent data processing. In the event of system or power failures, different levels of
protection address data atomicity, consistency and persistence.
1. Data Protection with Full Isolation
BigObject adopts a read-write locking mechanism to protect against anomalies that may be
caused by concurrent execution. In other words, BigObject guarantees the Isolation
property (of ACID). The current implementation supports object-level (e.g., table, tree)
locks. The ideal use case is large read-only queries with a small number of writes, or
situations where the write phase can be operated in a quarantined manner.
2. Levels of Data Protection Against System and Power Failures
Level 0 (no crash consistency): Data in BigObject can always be rebuilt from the sources of
records if needed. All BigObject data is managed in memory-mapped files.
Level 0 assures data consistency in the following scenarios:
1. Each single-record write operation (insert, update, delete or put) is atomic.
2. An update becomes persistent in one of the following cases: (1) a sync command is issued
explicitly, (2) the OS flushes changes back to disk implicitly, or (3) the database is
checkpointed, suspended or shut down normally.
3. Data remains consistent and up to date as long as the system is booted and shut down
normally.
Level 1 (crash consistency): In the case of data corruption or loss from a system or power
failure, the database can be restored to the state of the last snapshot. Users can restore
the state before the catastrophic event, or roll back to any snapshot point in time in the
event of an application error. A snapshot can be user-issued, scheduled or BigObject-
driven. BigObject provides APIs for LVM2-based snapshots (on Linux), Volume
Shadow Copy (on Windows) or ProphetStor1’s SAN-based snapshots to provide crash
consistency. Level 1 data protection assures data integrity as follows:
1 ProphetStor is a software-defined storage company whose technology manages heterogeneous enterprise storage.
1. Data states at checkpoints are consistent.
2. Databases can be restored to the last snapshot state from the history of checkpoints.
Level 2 (WAL-based): Write-ahead logging (WAL) is used to tighten the recovery point
objective to the last committed state. An operation becomes committed and durable
once it has been acknowledged.
This level of reliability guarantees the full ACID properties of a transactional database.
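A toy write-ahead-logging sketch in Python, illustrative only and not BigObject’s mechanism: each operation is fsync’d to the log before being applied, so a restart replays the log back to the last committed state.

```python
import json
import os
import tempfile

class WalStore:
    """Tiny key-value store whose only durability mechanism is a WAL."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Recovery: re-apply every logged operation in order.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    op = json.loads(line)
                    self.data[op["k"]] = op["v"]

    def put(self, k, v):
        # Log first, fsync, and only then apply -- durable before acknowledged.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"k": k, "v": v}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[k] = v

log = os.path.join(tempfile.mkdtemp(), "wal.log")
s = WalStore(log)
s.put("x", 1)
s.put("x", 2)
recovered = WalStore(log)   # simulate a restart: replay the log
print(recovered.data)
```
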
3. Industry Common Practice
BigObject offers crash-consistent data protection out of the box, without sacrificing
performance, so users need not rebuild their BigObject database when such accidents
occur. This is particularly useful when the analytic platform receives constant updates
from streaming data.
Today’s hardware and power supplies are assumed to be much more reliable than in the
past. By taking advantage of this assumption as a design tradeoff, BigObject improves
performance while avoiding the excessive disk I/O that redundancy would otherwise
cost.