Introduction to BigObject and In-Place Programming Framework

Analytic Database for Interactive Queries and Real-time Computations

Upload: bigobject

Post on 12-Aug-2015

What is BigObject?

BigObject is an analytic database based on the In-Place Computing model. It differs from a traditional database in several ways:

1. It serves as a secondary database and does not support transactions. In return, it is capable of delivering a 1,000x performance improvement in data analytics, for example, processing billions of records in seconds on a single server.

2. Data is organized in both a relational model and a hierarchical model.

3. It supports In-Place Programming, where patterns, filters, or actions are defined and executed inside the database.

4. It supports query power beyond SQL in three layers:

   Discover Associations: measure the associative links between any two data elements.

   Find by Pattern: identify significant factors based on statistical principles.

   Find by Filter: refine the datasets by specific attributes or quantifiers.

Copyright © 2015 MacroData, Inc. All Rights Reserved.

Who should use BigObject?

(1) Developers with no statistics training who want to solve data science problems and interpret the data captured from their applications (weblogs, user reviews, sales orders, and so on).

(2) Data scientists and analysts who want to extract particular datasets in real time and use them for further analysis with R, Excel, BI tools, and so on.

What kinds of data and problems does BigObject address?

Through internal experiments and collaborations with customers, we have tried sales records, weblogs, Twitter datasets, sensor data, book reviews, and stock quotes. Performance is best on structured, hierarchical data, essentially a star schema. Users can upload CSV files or copy data directly from a relational database.

The power of BigObject is most apparent in data-heavy computations. Examples include a correlation matrix that contains the associative links between any two data nodes, and a discrepancy calculation that compares datasets from different timespans.

How does BigObject work in my system?

The only requirement is a 64-bit Linux environment. You can pull the Docker image to run BigObject on your system and connect it to your current databases and applications via the ODBC driver, RESTful APIs, and DB loaders.

Use Cases

Find Significant Factor

Purpose: run a statistical test to see whether an attribute is an influential factor in the data distribution.

Example: FIND significant() Product.name IN Customer.state BY SUM(qty) FROM sales

=> Check whether the sales of each product vary substantially across states.
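The slide does not say which statistical test significant() runs. Purely as an illustration of the idea, the sketch below computes a chi-square statistic of independence on a product-by-state table of summed quantities. The function name `chi_square`, the choice of test, and the sample data are all assumptions, not BigObject's implementation; a larger statistic (for the given degrees of freedom) suggests the distribution depends more strongly on the attribute.

```python
from collections import defaultdict

def chi_square(records):
    """Chi-square statistic for a (product, state) table of summed quantities.

    `records` is an iterable of (product, state, qty) tuples with positive
    quantities. This is an assumed stand-in for the test behind
    significant(), which the source does not specify.
    """
    cell = defaultdict(float)
    row = defaultdict(float)
    col = defaultdict(float)
    total = 0.0
    for product, state, qty in records:
        cell[(product, state)] += qty
        row[product] += qty
        col[state] += qty
        total += qty

    stat = 0.0
    for p in row:
        for s in col:
            # Expected quantity if product and state were independent.
            expected = row[p] * col[s] / total
            observed = cell.get((p, s), 0.0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof

# Hypothetical sample data.
sales = [("A", "CA", 10), ("A", "NY", 20), ("B", "CA", 30), ("B", "NY", 40)]
stat, dof = chi_square(sales)
print(round(stat, 4), dof)
```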

Find Pareto Principle

Purpose: run a statistical test to see whether the data distribution meets the 80/20 rule (80% of the effects come from 20% of the causes).

Example: FIND 10 pareto(80,20) Product.brand IN Customer.state BY sum(qty) FROM sales

=> Return up to 10 brands for which 80% of revenue comes from the first 20% of state sales.
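One way to read the pareto(80,20) check is sketched below: for each brand, rank the states by summed quantity, take the top 20% of states, and test whether they account for at least 80% of that brand's total. This reading, the function name `pareto_brands`, and the sample data are assumptions for illustration only.

```python
import math
from collections import defaultdict

def pareto_brands(records, effect_pct=80, cause_pct=20, limit=10):
    """Return up to `limit` brands whose top `cause_pct`% of states account
    for at least `effect_pct`% of the brand's total quantity.

    `records` is an iterable of (brand, state, qty) tuples with positive
    quantities. The interpretation of pareto(80,20) is an assumption.
    """
    by_brand = defaultdict(lambda: defaultdict(float))
    for brand, state, qty in records:
        by_brand[brand][state] += qty

    hits = []
    for brand, states in by_brand.items():
        totals = sorted(states.values(), reverse=True)
        # At least one state always counts as the "top" group.
        top_n = max(1, math.ceil(len(totals) * cause_pct / 100))
        share = sum(totals[:top_n]) / sum(totals)
        if share * 100 >= effect_pct:
            hits.append(brand)
    return hits[:limit]

# Hypothetical sample: brand X is concentrated in one state, brand Y is not.
sales = [("X", s, q) for s, q in zip("ABCDE", [90, 5, 3, 1, 1])]
sales += [("Y", s, 20) for s in "ABCDE"]
print(pareto_brands(sales))
```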

Find Distinct Count

Purpose: find the distinct (non-repeated) count of entities.

Example: FIND dcount() Product.brand IN Customer.state FROM sales
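Assuming the statement above counts the distinct brands that appear within each state, the same computation can be sketched in a few lines. The function name `dcount` mirrors the statement but is otherwise hypothetical, as is the sample data.

```python
from collections import defaultdict

def dcount(records):
    """Count distinct brands per state from (brand, state) pairs."""
    seen = defaultdict(set)
    for brand, state in records:
        seen[state].add(brand)  # a set ignores repeated brands
    return {state: len(brands) for state, brands in seen.items()}

# Hypothetical sample: "Acme" is sold twice in CA but counted once.
sales = [("Acme", "CA"), ("Acme", "CA"), ("Bolt", "CA"), ("Acme", "NY")]
print(dcount(sales))
```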

Find the Discrepancy

Purpose: find the difference in measures between two parallel datasets.

Example: FIND AMinusB("Male","Female") Product.brand IN Customer.gender BY sum(qty) FROM sales

=> Find the difference in purchases of each brand between male and female customers.
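The discrepancy described above amounts to two grouped sums and a subtraction. A minimal sketch, assuming the result is SUM(qty) for the first group minus SUM(qty) for the second, per brand; the function name `a_minus_b` and the sample data are hypothetical.

```python
from collections import defaultdict

def a_minus_b(records, a="Male", b="Female"):
    """Per brand, SUM(qty) for group `a` minus SUM(qty) for group `b`.

    `records` is an iterable of (brand, gender, qty) tuples. Rows whose
    gender is neither `a` nor `b` are ignored.
    """
    diff = defaultdict(float)
    for brand, gender, qty in records:
        if gender == a:
            diff[brand] += qty
        elif gender == b:
            diff[brand] -= qty
    return dict(diff)

# Hypothetical sample data.
sales = [("Acme", "Male", 5), ("Acme", "Female", 2), ("Bolt", "Female", 4)]
print(a_minus_b(sales))
```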

BUILD ASSOCIATION

Construct a correlation matrix between entities in one or two different dimensions, such as product to product association, or product to brand association.

Example: BUILD ASSOCIATION prod2prod (Product.name) BY Customer.id FROM sales

=> Build an association matrix among products. The association is calculated as the number of customers who bought both products.
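The association defined above (the number of customers who bought both products) can be sketched as a co-occurrence count. The function name `build_association` and the sample data are hypothetical; this illustrates the definition, not how BigObject stores the matrix.

```python
from collections import defaultdict
from itertools import combinations

def build_association(records):
    """Count, for each pair of products, the customers who bought both.

    `records` is an iterable of (customer_id, product_name) pairs.
    """
    baskets = defaultdict(set)
    for customer, product in records:
        baskets[customer].add(product)

    matrix = defaultdict(int)
    for products in baskets.values():
        for a, b in combinations(sorted(products), 2):
            matrix[(a, b)] += 1
            matrix[(b, a)] += 1  # keep the matrix symmetric
    return dict(matrix)

# Hypothetical sample: two customers bought both tea and milk.
sales = [("c1", "tea"), ("c1", "milk"),
         ("c2", "tea"), ("c2", "milk"), ("c2", "sugar"),
         ("c3", "tea")]
print(build_association(sales)[("tea", "milk")])
```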

GET FREQ

Get the subject that is most associated with a given entity.

GET PROB

Get the probability distribution. The results are normalized by the total count of the hint_attribute belonging to the query attribute.

GET FACT

Print out all the association results.
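The lookups above can be pictured as simple operations on one row of an association matrix. In the sketch below the matrix is a plain dictionary keyed by (entity, subject) pairs; `get_freq` and `get_prob` are hypothetical helper names, and normalizing by the row total is an assumed reading of the normalization described for GET PROB.

```python
def get_freq(matrix, entity):
    """Return (subject, count) pairs for `entity`, most associated first."""
    row = [(b, n) for (a, b), n in matrix.items() if a == entity]
    return sorted(row, key=lambda item: (-item[1], item[0]))

def get_prob(matrix, entity):
    """Normalize the counts for `entity` so they sum to 1 (assumed reading)."""
    row = get_freq(matrix, entity)
    total = sum(n for _, n in row)
    if total == 0:
        return []
    return [(b, n / total) for b, n in row]

# Hypothetical association counts between products.
matrix = {("tea", "milk"): 2, ("tea", "sugar"): 1,
          ("milk", "tea"): 2, ("sugar", "tea"): 1}
print(get_freq(matrix, "tea")[0])   # the most associated subject
print(get_prob(matrix, "tea"))
```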

Terminology

In-Place Computing

In-Place Computing is the foundation of BigObject. It approximates an abstract model where data objects are stored and computed in a flat and infinite address space. To implement this model, we introduced a set of principles, including:

Macro Data Structure, or MDS

MDS is the basic unit of data objects used in the In-Place Computing model. An MDS is 1) large enough to hold any big data, 2) persistent, 3) relocatable, and 4) splittable or mergeable.

That means it can be split, merged, copied, and moved to any other address space without modifying its internal references.
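One common way to make a structure relocatable is to store references as offsets from the start of the structure instead of absolute addresses. The source does not say that MDS works this way, so the sketch below is only an illustration under that assumption; the record layout and the function names are hypothetical.

```python
import struct

# Hypothetical record layout: a 4-byte signed value followed by a 4-byte
# offset to the next record, measured from the start of the buffer.
# An offset of -1 marks the end of the chain.
RECORD = struct.Struct("<ii")

def build_chain(values):
    """Pack values into one buffer as a chain linked by relative offsets."""
    buf = bytearray(RECORD.size * len(values))
    for i, value in enumerate(values):
        nxt = (i + 1) * RECORD.size if i + 1 < len(values) else -1
        RECORD.pack_into(buf, i * RECORD.size, value, nxt)
    return buf

def walk_chain(buf):
    """Follow the offsets from record 0 and return the stored values."""
    out = []
    pos = 0 if len(buf) else -1
    while pos != -1:
        value, pos = RECORD.unpack_from(buf, pos)
        out.append(value)
    return out

original = build_chain([3, 1, 4])
moved = bytes(original)  # a copy at a different address, bytes unchanged
print(walk_chain(moved))
```

Because every reference is relative to the buffer itself, the copy can be read without rewriting any of its internal links.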

Memory Mapping

Each object is memory-mapped and backed by a file. The size of memory-mapped files can be larger than the combined size of physical memory and swap space.
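As a small illustration of a file-backed memory map (not BigObject's own code), the sketch below uses Python's standard `mmap` module to write integers through a mapping and then read them back from the backing file. The file name `object.bin` and the function name `roundtrip` are hypothetical, and the example assumes a non-empty list of values.

```python
import mmap
import os
import struct
import tempfile

def roundtrip(values):
    """Write 64-bit integers through a memory map, then reread the file.

    `values` must be non-empty, because an empty file cannot be mapped.
    """
    size = 8 * len(values)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "object.bin")  # hypothetical file name
        with open(path, "w+b") as f:
            f.truncate(size)  # size the backing file before mapping it
            with mmap.mmap(f.fileno(), size) as mem:
                for i, v in enumerate(values):
                    struct.pack_into("<q", mem, i * 8, v)
                mem.flush()  # push the changes to the backing file
        with open(path, "rb") as f:
            data = f.read()
    return list(struct.unpack(f"<{len(values)}q", data))

print(roundtrip([10, 20, 30]))
```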

Table Object

Star-schema or snowflake-schema data will be arranged as table objects in BigObject space where attributes are row-based and measures are column-based.

Tree Object

We structure the memory layout to preserve both sibling locality and descendant locality, where the relevant nodes are byte-to-byte contiguous in memory with no irrelevant nodes mixed in between.

The hierarchy of a tree object is semantically associated with a meaning or property of the data, described by attributes in dimensions. It thus mirrors the natural human approach of narrowing a big problem down into sub-problems (i.e., divide and conquer). Each tree node holds a “Key-Value” pair in which the aggregated value of its leaf nodes is kept.
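To make descendant locality and aggregated node values concrete, the sketch below builds a small hierarchy and flattens it in depth-first pre-order, so that every subtree occupies one contiguous slice of the output. This is an assumed illustration only: the source does not describe its actual layout, the sketch does not address sibling locality, and the function names are hypothetical.

```python
def build_tree(records):
    """Nest (path, value) records into dicts of {key: [total, children]}."""
    root = {}
    for path, value in records:
        level = root
        for key in path:
            node = level.setdefault(key, [0, {}])
            node[0] += value  # every ancestor aggregates the leaf value
            level = node[1]
    return root

def flatten(tree, out=None):
    """Lay nodes out in pre-order as (key, total, subtree_size) tuples."""
    if out is None:
        out = []
    for key, (total, children) in tree.items():
        index = len(out)
        out.append(None)  # reserve the slot ahead of the descendants
        flatten(children, out)
        out[index] = (key, total, len(out) - index)
    return out

# Hypothetical sample: state -> brand hierarchy with summed quantities.
records = [(("CA", "Acme"), 5), (("CA", "Bolt"), 3), (("NY", "Acme"), 2)]
layout = flatten(build_tree(records))
print(layout)
```

With this layout, the subtree rooted at position `i` is the slice `layout[i : i + layout[i][2]]`.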

In-Place Programming

In-Place Programming is the mechanism that implements the In-Place Computing model. It currently supports the Lua programming language.

This programming framework allows (1) a piece of program to be attached to an object node for execution and (2) an arithmetic expression over big objects to be expressed and evaluated.

Program Tag

A program tag is a piece of program containing statements that execute at the node level of tree objects. Program tags can be designed to behave differently at different levels and can be parameterized at invocation.
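The idea of a program tag can be pictured as a function that is applied to every node of a tree, receives the node's level, and accepts parameters at invocation. The sketch below is written in Python only to illustrate the concept; it is not BigObject's Lua API, and `apply_tag`, `flag_large`, and the node layout are all hypothetical.

```python
def apply_tag(node, tag, level=0, **params):
    """Run `tag` on `node`, then on every descendant, passing the level."""
    tag(node, level, **params)
    for child in node.get("children", []):
        apply_tag(child, tag, level + 1, **params)

def flag_large(node, level, threshold):
    """Hypothetical tag: the cutoff halves with each level below the root."""
    limit = threshold / (2 ** level)
    node["flag"] = node["value"] >= limit

# Hypothetical tree: one state with two brands.
tree = {"key": "CA", "value": 8, "children": [
    {"key": "Acme", "value": 5},
    {"key": "Bolt", "value": 3},
]}
apply_tag(tree, flag_large, threshold=8)  # parameter supplied at invocation
print([tree["flag"]] + [c["flag"] for c in tree["children"]])
```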

Resources

Official Website

http://bigobject.io/

GitHub

https://github.com/macrodatalab

Bitbucket

https://bitbucket.org/macrodata

Docker Hub

https://registry.hub.docker.com/u/macrodata/bigobject/
