
Understanding BigObject®

A White Paper

2016/04/01

BigObject®


Contents

1. Introduction: What is BigObject®?
2. What does BigObject® do?
3. How is BigObject® different from other database-grade platforms?
4. What type of database is built in BigObject®: relational, hierarchical, or NoSQL?
5. What schema does BigObject® adopt?
6. How is BigObject® optimized for such high performance?
7. How does BigObject® perform in comparison with other analytic databases?
8. What is smart query, and why use it?
9. Is BigObject® secure?
10. Does BigObject® support column-based tables?
11. Does BigObject® support standard SQL?
12. What is ‘in-data’ computing?
13. How does BigObject® collect data?
14. What concurrency control does BigObject® use to guarantee isolation? How does it impact its performance?
15. Is BigObject® a reliable database?


1. Introduction: What is BigObject®?

BigObject® is a database-grade Smart Data Analytics platform that delivers extraordinary performance and the power of smart query. It is designed to store and analyze large volumes of data with extremely high performance, and it utilizes an extended relational data model with two variations: the key-value model and the tree model. BigObject® analytics supports SQL-like queries and an in-database computing framework for smart query.

2. What does BigObject® do?

BigObject® supports analytic queries expressed in SQL and provides a programming framework for smart queries (defined later in this Q&A). It enables data analytics such as multi-dimensional analysis, anomaly detection, complex data pattern detection and associative analysis, in both batch and streaming processing.

In addition to SQL, three layers make up BigObject’s analytics stack: (1) Find-By-Filter, (2) Find-By-Pattern and (3) Find-By-Association. SQL and Find-By-Filter are both used for multi-dimensional analytics or BI-related applications. Find-By-Filter uses the ‘FIND’ syntax, which is similar to SQL’s ‘SELECT’ (record-level filtering in general), except that it focuses on entity-level (group-level) queries with group-level filtering. In the presence of large volumes of data, what people are interested in is meaningful information about entities or groups rather than individual facts or records. Find-By-Pattern is used for analytics that follow complex data patterns, such as the ‘80-20 rule’, to identify noteworthy entities within a dataset. It adopts terms such as ‘significant’ or ‘important’ to describe entities qualitatively; in practice, these terms are developed through a process of concept formation based on data patterns, common properties or previously developed terms. Find-By-Association is for analysis based on associative links: it identifies associated entities within a dataset, such that A is said to be associated with B if A and B have some property (i.e., attribute values) in common. The latter two are sometimes referred to as ‘knowledge-seeking queries’, which are quite difficult to express in SQL.
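To make the record-level versus entity-level distinction concrete, the sketch below contrasts the two styles of filtering. The FIND statement is hypothetical: this paper does not specify the exact grammar, so the table, column and syntax details are illustrative only.

    -- Record-level filtering (standard SQL): returns individual rows
    SELECT * FROM sales WHERE amount > 100;

    -- Entity-level filtering (hypothetical FIND syntax): returns customers
    -- whose aggregate behavior satisfies a group-level condition
    FIND customer FROM sales
    WHERE SUM(amount) > 10000;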

To support complex user-defined patterns with user-created terminology, BigObject® provides a Lua-based programming framework called “In-Place Programming”, which allows developers to define new terms (for data patterns or behaviors) that describe the qualities of identified entities or groups. This process is inspired by the concept formation process used in human communities.
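A minimal sketch of what such a user-defined term might look like in Lua. The function name, entity fields and calling convention are assumptions for illustration; the paper does not document the actual In-Place Programming API.

    -- Hypothetical sketch only: defines a new qualitative term, 'significant',
    -- for customer entities, following the 80-20 intuition described above.
    -- Field names (total_spend) and the group_total argument are illustrative.
    function significant(customer, group_total)
      -- a customer is 'significant' if it contributes more than 20% of
      -- the group's total spend
      return customer.total_spend > 0.2 * group_total
    end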


3. How is BigObject® different from other database-grade platforms?

Apart from performance, there are two major differences: (1) support for multi-faceted data models — the relational, hierarchical and key-value models, which are mutually transformable; and (2) support for smart queries that identify interesting or associated entities.

4. What type of database is built in BigObject®: relational, hierarchical, or NoSQL?

The built-in database of BigObject® is an extended relational database with multi-faceted data models. It supports relational (tables), hierarchical (trees) and key-value (semi-structured data) models, which are mutually transformable. The key-value model is commonly used for raw datasets whose schemas cannot be determined in advance or might be altered dynamically. The relational model defines relational formulas and provides record-level query capability, while the hierarchical model is designed for efficient computation in smart queries and for fast access to intermediate and queried datasets.

a. Key-Value Model

A key-value dataset consists of a collection of records, each of which contains two parts: the key and the value. A key can be an ID attribute or a fixed list of ID attributes. The value is an integer, Boolean value, real number, string, or a structure for a complex value. A dataset may contain multiple records with the same key but different values. Given a subset of keys, a key-value dataset can be transformed into a table by an operation called Trans-Pivot.
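As an illustration (the data and layout are hypothetical; only the operation’s behavior is described in this paper), consider key-value records keyed by (store, product):

    (store=S1, product=P1, key="qty",   value=10)
    (store=S1, product=P1, key="price", value=5.0)
    (store=S1, product=P2, key="qty",   value=3)

    Trans-Pivot on the key subset {qty, price} yields the table:

    store  product  qty  price
    S1     P1       10   5.0
    S1     P2       3    (null)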

b. Hierarchical Model

A hierarchical dataset consists of a tree of records, each of which contains an ID and a value in the form of an integer, Boolean value, real number, string, or a structure for a complex value. Given a set of tables and an ordered set of attributes selected from the table schema, a hierarchical data model can be constructed through an operation called ‘Trans-Join.’
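For example (hypothetical data), given a table sales(customer, product, qty) and the ordered attribute set (customer, product), Trans-Join would construct a tree with customers at the first level and products beneath them:

    root
    +-- C1
    |   +-- P1  (qty: 10)
    |   +-- P2  (qty: 3)
    +-- C2
        +-- P1  (qty: 7)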


[Figure: BigObject® Multifaceted Database]

5. What schema does BigObject® adopt?

For simplicity, we often use the star schema (or its snowflake variant) to describe BI-like analytics such as multi-dimensional analysis. The star schema is a logical arrangement of tables in a multi-dimensional database such that the entity-relationship diagram resembles a star (or snowflake). It supports an intuitive, fact-based analysis that is both easy to describe and easy to understand. However, BigObject’s data models are not limited to such multi-dimensional databases, and likewise the analytics involved are not limited to multi-dimensional analysis. BigObject’s data models are designed to handle the following data types (assume a dataset is a collection of records); a star schema sketch follows the list.

1. Fact Data

Fact data wherein the number of attributes/columns cannot be determined in advance (i.e., the dataset cannot satisfy the first normal form). This type of data is best represented by the key-value model.

2. Single-valued Data

Single-valued data that can satisfy the second, third and Boyce-Codd normal forms (2NF, 3NF and BCNF, sometimes called 3.5NF). This type of data is best represented by the relational model.

3. Multi-valued Data

Multi-valued data that can satisfy the fourth and fifth normal forms (4NF and 5NF). This type of data is best represented by the hierarchical model, or a combination of the relational and hierarchical models. Time-series data is one such example.
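For instance, a minimal star schema for multi-dimensional analysis might look as follows in generic SQL; the table and column names are illustrative, not taken from BigObject:

    -- Fact table at the center of the star
    CREATE TABLE sales (
      order_date  DATE,
      product_id  INT,      -- references the product dimension
      customer_id INT,      -- references the customer dimension
      qty         INT,
      amount      DECIMAL(10,2)
    );

    -- Dimension tables forming the points of the star
    CREATE TABLE product  (product_id INT, name VARCHAR(64), category VARCHAR(32));
    CREATE TABLE customer (customer_id INT, name VARCHAR(64), region VARCHAR(32));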


6. How is BigObject® optimized for such high performance?

The key idea behind the optimization is “push-down logic”: pushing code down to where the data is stored, thereby reducing the amount of data extracted from the data space for computing. BigObject® is therefore optimized from top to bottom following the guidelines below:

1. Enable smart queries to be performed within the database (to reduce the amount of data extracted from the database).

2. Enable User-Defined Functions (UDFs) to be performed in the same space where the data is kept (to avoid large data retrieval via SQL).

3. Enable an infinite and persistent memory space for both computing and storage, so that data is placed with strong locality (to avoid excessive disk I/O).
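As a simple illustration of the push-down idea in plain SQL (table names hypothetical): the first query ships every row to the client for processing, while the second pushes the computation into the database so that only the aggregated result leaves the data space.

    -- Pull-based: extract all rows, compute outside the database
    SELECT customer_id, amount FROM sales;

    -- Push-down: compute where the data lives, return only results
    SELECT customer_id, SUM(amount) AS total
    FROM sales
    GROUP BY customer_id;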

7. How does BigObject® perform in comparison with other analytic databases?

A benchmark study was recently conducted to shed light on the comparative performance of BigObject® and several other advanced analytic databases.

1. Benchmark Study

The goal of this study was to design a fair and replicable benchmark to evaluate the computing performance of popular databases, data warehouse systems and BigObject. The dataset and query statements are based on the U.C. Berkeley AMPLab benchmark (https://amplab.cs.berkeley.edu/benchmark/).

There are two tables:

Rankings: 17,999,999 rows, 1.1 GB

UserVisits: 154,999,997 rows, 25 GB

To keep the benchmark fair, all tests were run on Amazon AWS on the following machines:

BigObject, MySQL and a NewSQL database: Amazon EC2, instance type r3.xlarge (vCPU: 4, ECU: 13, memory: 30.5 GB).

Redshift: instance type ds2.xlarge (vCPU: 4, ECU: 14, memory: 31 GB).

Following the AMPLab benchmark, three sets of test statements were used: Scan Query (1a~1c), Aggregation Query (2-) and Join Query (3a~3d). The results are shown in the following table.

Page 7: Understanding BigObject White Paper 20160401

White Paper Understanding BigObject

4/7/2016 Page 6

Query   BigObject®   Redshift    MySQL     NewSQL
1a      0.246s       0.81s       10.202s   6.904s
1b      0.319s       0.655s      13.916s   7.051s
1c      1.193s       2.025s      127.666s  24.577s
2-      45.979s      86.772s     >6hrs     1781.060s
3a      3.14s        27.848s     >10hrs    108.760s
3b      5.7s         32.047s     >10hrs    150.536s
3c      31.187s      122.297s    >10hrs    >6hrs
3d      70.824s      200.98s     >10hrs    >9hrs
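For reference, the AMPLab statements take roughly the following forms, where the parameter X varies across the variants (see the benchmark site above for the exact statements):

    -- Scan query (1a~1c)
    SELECT pageURL, pageRank FROM rankings WHERE pageRank > X;

    -- Aggregation query (2-)
    SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
    FROM uservisits
    GROUP BY SUBSTR(sourceIP, 1, X);

    -- Join query (3a~3d); 'X' here stands for the varying cutoff date
    SELECT sourceIP, SUM(adRevenue) AS totalRevenue, AVG(pageRank) AS avgPageRank
    FROM rankings R JOIN uservisits UV ON R.pageURL = UV.destURL
    WHERE UV.visitDate BETWEEN '1980-01-01' AND 'X'
    GROUP BY sourceIP
    ORDER BY totalRevenue DESC
    LIMIT 1;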

The performance comparison between BigObject® and Redshift is charted below.

[Figure: response-time comparison, BigObject® vs. Redshift]

8. What is smart query, and why use it?

Although various commercial and open-source relational databases, OLAP systems and NoSQL databases have been developed, users cannot easily identify meaningful patterns in the vast volumes of data stored in these systems using standard query languages. Databases today support SQL or SQL-like languages; apart from the group-by function and basic statistics such as ‘sum’ and ‘mean’, support for group-level (or entity-level) queries is limited. Without sufficient group-level or entity-level support, the traditional approach to identifying meaningful patterns at the group level offers no alternative but to extract all of the relevant entries from the database and plug them into separate program data structures for processing.

In the presence of large volumes of data, what people are interested in is meaningful information about entities or groups rather than individual data entries or records. SQL is capable of grouping data by entities or attributes, but fails to describe the characteristics or patterns of those entities and groups. A group- or entity-level query can be processed to identify meaningful patterns in entities or groups, wherein member records are matched against specific data patterns. In addition, complex data patterns that may be difficult to express in SQL or other SQL-like languages also become possible.

UDFs (User-Defined Functions) and stored procedures, which many databases support on database servers, fall into the same paradigm, wherein all needed data records are extracted from the database through SQL. This type of approach becomes difficult in both implementation and performance as larger volumes of data are processed. There is therefore an urgent need for a querying mechanism that can easily express complex data patterns in a high-level manner, and use top-down logic to identify relevant entities that meet specified data patterns, without needing to extract large volumes of data.
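As an illustration (schema hypothetical), even a simple group-level pattern such as “customers whose single top product accounts for over 80% of their spend” already forces nested aggregation in standard SQL, and patterns much beyond this quickly stop being expressible at all:

    SELECT customer_id
    FROM (
      SELECT customer_id,
             MAX(product_total) / SUM(product_total) AS top_share
      FROM (
        -- per-customer, per-product spend
        SELECT customer_id, product_id, SUM(amount) AS product_total
        FROM sales
        GROUP BY customer_id, product_id
      ) per_product
      GROUP BY customer_id
    ) per_customer
    WHERE top_share > 0.8;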

9. Is BigObject® secure?

Yes. BigObject® is a secure database that supports data encryption to protect data in file and in transit, and it provides a set of APIs for security modules handling authentication, authorization and auditing.

Data in file: data in BigObject® is stored in files protected by AES (Advanced Encryption Standard) encryption. Encrypting such files at rest helps protect them should physical security measures fail.

Data in transit: data in transit, during either upload or download, is protected with SSL/HTTPS against eavesdropping by unauthorized users.

Through its API support, BigObject® works with authentication modules such as OpenRadius and authorization servers such as OAuth2. Access logs are often implemented in the authentication module to reduce the system’s I/O overhead. The BigObject® module itself ships without built-in authentication, but it is API-ready to connect to third-party authentication and authorization modules.

10. Does BigObject® support column-based tables?

Yes. In general, BigObject® implements a ‘column-group’ design, where the columns of a table can be divided into groups based on the application’s needs. A pure column-based table can be designed by treating each column as its own column group; a row-based table can be designed by placing all columns in a single column group. In BigObject, each column group is backed by one memory-mapped file.
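As an illustration (column names hypothetical), a four-column table could be laid out in any of the following ways:

    columns: (id, ts, price, qty)

    pure column store:  {id} {ts} {price} {qty}   -> four memory-mapped files
    pure row store:     {id, ts, price, qty}      -> one memory-mapped file
    mixed grouping:     {id, ts} {price, qty}     -> two memory-mapped files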

11. Does BigObject® support standard SQL?

BigObject® supports an SQL-like language extended with smart query capability, rather than the full SQL standard.


12. What is ‘in-data’ computing?

In-data computing (also known as in-place computing) is an abstract model in which all data is kept in an infinite and persistent memory space for both storage and computing. It is a data-centric approach: data is computed in the same space where it is stored, and instead of moving data to the code, code is moved into the data space for processing. With today’s 64-bit architectures and virtualization technology, BigObject® realizes a persistent and nearly infinite memory space in which data lives and works.

13. How does BigObject® collect data?

There are several ways:

REST API: by writing a client program that issues SQL insert statements via the BigObject® REST API (see the sketch after this list).

Binary API: by uploading an Avro file via TCP.

Uploader: by uploading a CSV file to BigObject® from the BigObject® shell. BigObject® can process data in both CSV and MS Excel formats (i.e., the xls file format, but not xlsx).

FluentD: by using FluentD to collect data from different sources.

MySQL: by using MySQL as a staging (or primary) database and syncing data into BigObject; MySQL’s built-in replication makes this pattern available to many deployments that already run a relational database.
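For instance, a client program would send ordinary SQL insert statements such as the one below through the REST API; the table, columns and values are illustrative, and the exact endpoint and payload format are documented separately, not in this paper.

    INSERT INTO sales VALUES ('2016-04-01', 'C001', 'P042', 3, 89.97);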

14. What concurrency control does BigObject® use to guarantee isolation? How does it impact its performance?

BigObject® adopts a simple but highly efficient locking procedure, similar to object-based read-write locking. The current implementation locks at the object level (e.g., a table or tree), so the performance impact is smallest for read-heavy workloads; see question 15 for details.

15. Is BigObject® a reliable database?

BigObject® is a data analytics platform that delivers extraordinary performance and the power of smart query, and it is designed to store and analyze large volumes of data with extremely high performance. BigObject is designed to complement relational databases in transaction applications by bringing analytic capability to them. It is not designed for OLTP (On-Line Transaction Processing), so there is no support for the notion of “transactions” as in an RDBMS (i.e., Begin/End Transaction in a Relational Database Management System).


Nevertheless, BigObject is a reliable analytic platform that keeps all analytic data safe and sound. Data atomicity, consistency, isolation and persistence are key parts of BigObject’s architectural design at the most basic level, including a calculated tradeoff between performance and data reliability. To support various use cases and requirements, BigObject provides different reliability mechanisms with minimal performance overhead. Full data isolation is always guaranteed under concurrent data processing, and in the event of system or power failures, different levels of protection address data atomicity, consistency and persistence.

1. Data Protection with Full Isolation

BigObject adopts a read-write locking mechanism to protect against anomalies that may be caused by concurrent execution; in other words, BigObject guarantees the isolation property (the I in ACID). The current implementation supports object-level (e.g., table, tree) locks. The ideal use case is large read-only queries with a small number of writes, or situations where the write phase can be operated in a quarantined manner.

2. Levels of Data Protection Against System and Power Failures

Level 0 (no crash consistency): data in BigObject can always be rebuilt from the source records if needed. All BigObject data is managed in memory-mapped files. Level 0 assures data consistency in the following scenarios:

1. Each single-record write operation (insert, update, delete or put) is atomic.

2. An update becomes persistent in one of the following cases: (1) a sync command is issued explicitly, (2) the OS flushes changes back to disk implicitly, or (3) the database is checkpointed, suspended or shut down normally.

3. Data remains consistent and up to date as long as the system is booted and shut down normally.

Level 1 (crash consistency): in the case of data corruption or loss from a system or power failure, the database can be restored to the state of the last snapshot. Users can restore the state preceding the catastrophic event, or roll back to any point in time in the event of an application error. A snapshot can be user-issued, scheduled or BigObject-driven. BigObject supports APIs for LVM2-based snapshots (for Linux), Volume Shadow Copy (for Windows) or ProphetStor’s SAN-based snapshots to provide crash consistency. (ProphetStor is a software-defined storage company whose technology manages heterogeneous enterprise storage.)

Level 1 data protection assures data integrity as follows:

1. Data states at checkpoints are consistent.

2. Databases can be restored to the last snapshot state from the history of checkpoints.

Level 2 (WAL-based): write-ahead logging (WAL) is used to bring the recovery point objective down to the last committed state; an operation becomes committed and durable once it has been acknowledged. This level of reliability guarantees the full ACID properties of a transactional database.

3. Industry Common Practice

BigObject offers crash-consistent data protection out of the box, without sacrificing performance, so users need not rebuild their BigObject database when such accidents occur. This is particularly useful when the analytic platform receives constant updates from streaming data.

Today’s hardware and power supplies are assumed to be much more reliable than in the past. By taking advantage of this assumption as a design tradeoff, BigObject improves performance while avoiding the excessive disk I/O that redundancy would otherwise cost.