in memory olap engine

Post on 19-May-2015

In memory OLAP engine

Samuel Pelletier, Kaviju inc., samuel@kaviju.com

OLAP ?

• An acronym for OnLine Analytical Processing.

• In simple words, a system to query a multidimensional data set and get answers fast enough for interactive reports.

• A well known implementation is an Excel Pivot Table.

Why build something new

• I wanted something fast and memory efficient for simple queries with millions of facts.

• SQL queries do not work well for millions of facts with multiple dimensions, especially with a large number of rows.

• There are specialized tools for OLAP from Microsoft, Oracle and others but they are large and expensive, too much for my needs.

• Generic cheap toolkits are not memory efficient; this is the cost of their simplicity.

• I wanted a solution that is simple to deploy, with minimal dependencies.

Memory usage and time to retrieve 1 000 000 invoice lines

• Fetching EOs uses 1.2 GB of RAM in 13-19 s.

• Fetching raw rows uses 750 MB of RAM in 5-8 s.

• Fetching as POJOs with JDBC uses 130 MB in 4.0 s.

• Reading from a file as POJOs uses 130 MB in 1.4 s.

• For 7 M rows, EOs would require 8.4 GB for gazillions of small objects (bad for the GC).

Time to compute sum of sales for 1 000 000 invoice lines

• 2.1 s for "select sum(sales)..." in FrontBase with table in RAM.

• 0.5 s for @sum.sales on EOs.

• 0.12 s for @sum.sales on raw rows.

• 0.5 s for @sum.sales on POJOs.

• 0.009 s for a loop with direct attribute access on POJOs.
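
The last figure corresponds to a plain loop that reads a public field on each object. A minimal sketch of such a loop, assuming a hypothetical `Line` POJO (not a class from the talk); timings will vary by machine and are not meant to reproduce the numbers above:

```java
import java.util.ArrayList;
import java.util.List;

final class DirectSumSketch {
    // Hypothetical fact POJO with a directly accessible measure field.
    static final class Line {
        final float sales;

        Line(float sales) {
            this.sales = sales;
        }
    }

    // Sums the measure with direct field access: no key-value lookup, no boxing.
    static double sumSales(List<Line> lines) {
        double total = 0.0;
        for (Line line : lines) {
            total += line.sales;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            lines.add(new Line(1.5f));
        }
        long start = System.nanoTime();
        double total = sumSales(lines);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("sum=" + total + " in " + elapsedMs + " ms");
    }
}
```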

Some concepts

• Facts are the elements being analyzed. An example is invoice lines.

• Facts contain measures like quantities, prices or amounts.

• Facts are linked to dimensions used to filter and aggregate them. For invoice lines, we have product, invoice, date, etc.

• Dimensions are often part of a hierarchy, for example, products are in a product category, dates are in a month and in a week.
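
These concepts can be illustrated with a small sketch. All class and method names here (`Category`, `Product`, `Line`, `amountByProduct`, `amountByCategory`) are hypothetical and chosen for illustration only. A fact carries its measures and a link to a dimension, and the dimension links to its parent in the hierarchy:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConceptsSketch {
    // Parent level of the hierarchy.
    static final class Category {
        final String name;

        Category(String name) {
            this.name = name;
        }
    }

    // Dimension entity, linked to its parent category.
    static final class Product {
        final String name;
        final Category category;

        Product(String name, Category category) {
            this.name = name;
            this.category = category;
        }
    }

    // Fact: one invoice line with a dimension link and one measure.
    static final class Line {
        final Product product;
        final double amount;

        Line(Product product, double amount) {
            this.product = product;
            this.amount = amount;
        }
    }

    // Aggregates the measure by the product dimension.
    static Map<String, Double> amountByProduct(List<Line> lines) {
        Map<String, Double> totals = new HashMap<>();
        for (Line line : lines) {
            totals.merge(line.product.name, line.amount, Double::sum);
        }
        return totals;
    }

    // Aggregates the same facts one level up, by product category.
    static Map<String, Double> amountByCategory(List<Line> lines) {
        Map<String, Double> totals = new HashMap<>();
        for (Line line : lines) {
            totals.merge(line.product.category.name, line.amount, Double::sum);
        }
        return totals;
    }
}
```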

Sample Invoice dimension hierarchy

[Diagram: hierarchy of dimensions attached to InvoiceLine. Nodes shown: Invoice, Date, Month, Week, Ship to, Sold to, Client type, Product, Salesman, SalesManager.]

Measures: Shipped Qty, Sales, Profits

Steps to implement an engine

• Create the Engine class.

• Create required classes to model the dimension hierarchy.

• Create the Value class for your facts.

• Create the Group class that will compute summarized results.

• Create the dimensions definition classes.

Engine class

• The Engine class extends OlapEngine with Group and Value types:

    public class SalesEngine extends OlapEngine<GroupEntry,Value>

• Create the objects required for the data model and lookup table used to load the facts.

• Load the facts into Value objects.

• Create and register the dimensions.

Create required model objects

    public class Product {
        public final int code;
        public final String name;
        public final ProductCategory category;

        public Product(int code, String name, ProductCategory category) {
            super();
            this.code = code;
            this.name = name;
            this.category = category;
        }
    }

    private void loadProducts() {
        productsByCode = new HashMap<Integer, Product>();

        WOResourceManager resourceManager = ERXApplication.application().resourceManager();
        String fileName = "olapData/products.txt";
        try ( InputStream fileData = resourceManager.inputStreamForResourceNamed(fileName, null, null);) {
            InputStreamReader fileReader = new InputStreamReader(fileData, "utf-8");
            BufferedReader reader = new BufferedReader(fileReader);
            String line;
            while ( (line = reader.readLine()) != null) {
                String[] cols = line.split("\t", -1);
                Product product = new Product(Integer.parseInt(cols[0]), cols[0], categoryWithID(cols[1]));
                productsByCode.put(product.code, product);
            }
        }
        ...
    }

Load the facts and create dimensions

    private void loadInvoiceLines() {
        ...
        loadProductCategories();
        loadProducts();

        InvoiceDimension invoiceDim = new InvoiceDimension(this);
        SalesmanDimension salesmanDim = new SalesmanDimension(this);
        while ( (line = reader.readLine()) != null) {
            String[] cols = line.split("\t", -1);

            InvoiceLine invoiceLine = new InvoiceLine(valueIndex++, Short.parseShort(cols[1]));
            invoiceLine.shippedQty = Integer.parseInt(cols[6]);
            invoiceLine.sales = Float.parseFloat(cols[7]);
            invoiceLine.profits = Float.parseFloat(cols[8]);
            lines.add(invoiceLine);
            invoiceDim.addLine(invoiceLine, cols[0], cols);

            invoiceLine.salesmanNumber = Integer.parseInt(cols[12]);
            salesmanDim.addIndexEntry(invoiceLine.salesmanNumber, invoiceLine);
            ...
        }
        }
        addDimension(productDimension);
        addDimension(productDimension.createProductCategoryDimension());
        ...
        lines.trimToSize();
        setValues(lines);
    }

Value and GroupEntry classes

• The Value class contains your basic facts (invoice lines, for example):

    public class InvoiceLine extends OlapValue<Sales>

• The GroupEntry class is used to compute summarized results:

    public class Sales extends GroupEntry<InvoiceLine>

• These are tightly coupled: a GroupEntry represents a computed result for an array of Values, and the metrics are found in both classes.
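
The coupling can be sketched without the library. In this minimal example, `Fact` stands in for a Value and `Totals` for a GroupEntry; both names, and the `group` helper, are assumptions made for illustration and are not part of the engine's API. The value holds compact per-fact measures and the group accumulates the same measures in wider types:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class GroupingSketch {
    // Stand-in for a Value: one fact with its measures.
    static final class Fact {
        final int productCode;
        final int shippedQty;
        final float sales;

        Fact(int productCode, int shippedQty, float sales) {
            this.productCode = productCode;
            this.shippedQty = shippedQty;
            this.sales = sales;
        }
    }

    // Stand-in for a GroupEntry: accumulates the same measures for many facts.
    static final class Totals {
        int shippedQty;
        double sales; // double here to limit rounding error when summing many floats

        void addEntry(Fact fact) {
            shippedQty += fact.shippedQty;
            sales += fact.sales;
        }
    }

    // Groups facts by an arbitrary key and accumulates one Totals per key.
    static <K> Map<K, Totals> group(List<Fact> facts, Function<Fact, K> keyOf) {
        Map<K, Totals> groups = new LinkedHashMap<>();
        for (Fact fact : facts) {
            groups.computeIfAbsent(keyOf.apply(fact), k -> new Totals()).addEntry(fact);
        }
        return groups;
    }
}
```

A call such as `GroupingSketch.group(facts, f -> f.productCode)` would return one `Totals` per product code.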

Value Class

    public class InvoiceLine extends OlapValue<Sales> {
        public Invoice invoice;
        public final short lineNumber;
        public Product product;

        public int shippedQty;
        public float sales;
        public float profits;

        public int salesmanNumber;
        public int salesManagerNumber;

        public InvoiceLine(int valueIndex, short lineNumber) {
            super(valueIndex);
            this.lineNumber = lineNumber;
        }
    }

GroupEntry class

    public class Sales extends GroupEntry<InvoiceLine> {
        private int shippedQty;
        private double sales = 0.0;
        private double profits = 0.0;

        public Sales(GroupEntryKey<Sales, InvoiceLine> key) {
            super(key);
        }

        @Override
        public void addEntry(InvoiceLine entry) {
            shippedQty += entry.shippedQty;
            sales += entry.sales;
            profits += entry.profits;
        }

        @Override
        public void optimizeMemoryUsage() {
        }

        // The accessor's signature is missing from the transcript; the name getSales is assumed.
        public double getSales() {
            return sales;
        }

        ...
    }

Dimensions classes

• Dimensions implement the engine indexes and key extraction for result aggregation.

• Dimensions are usually linked to another class representing an entity like Invoice, Client, Product or ProductCategory.

• Entities are value-object POJOs for optimal speed and memory usage. You may add a method to get the corresponding EO.

• Dimensions are either leaf (a group of facts) or group (a group of leaf entries).
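
The leaf/group distinction can be pictured as two levels of index. The sketch below is a simplified illustration and not the engine's actual data structure; the class `DimensionIndexSketch` and its method names are hypothetical. A leaf index maps a key to the positions of its facts, and a group index maps a parent key to leaf keys:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DimensionIndexSketch {
    // Leaf index: product code -> positions of the facts that have this product.
    private final Map<Integer, List<Integer>> factsByProduct = new HashMap<>();
    // Group index: category id -> product codes (leaf keys) in this category.
    private final Map<Integer, Set<Integer>> productsByCategory = new HashMap<>();

    void addFact(int productCode, int factIndex) {
        factsByProduct.computeIfAbsent(productCode, k -> new ArrayList<>()).add(factIndex);
    }

    void addProductToCategory(int categoryId, int productCode) {
        productsByCategory.computeIfAbsent(categoryId, k -> new LinkedHashSet<>()).add(productCode);
    }

    // Resolves a group key by collecting the facts of each of its leaf entries.
    List<Integer> factsForCategory(int categoryId) {
        List<Integer> result = new ArrayList<>();
        Set<Integer> products =
                productsByCategory.getOrDefault(categoryId, Collections.<Integer>emptySet());
        for (Integer productCode : products) {
            result.addAll(factsByProduct.getOrDefault(productCode, Collections.<Integer>emptyList()));
        }
        return result;
    }
}
```

With this layout, the group level stores only leaf keys, so the per-fact positions are stored once, in the leaf index.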

Product dimension class

    public class ProductDimension extends OlapLeafDimension<Sales,Integer,InvoiceLine> {

        public ProductDimension(OlapEngine<Sales, InvoiceLine> engine) {
            super(engine, "productCode");
        }

        @Override
        public Integer getKeyForEntry(InvoiceLine entry) {
            return entry.product.code;
        }

        @Override
        public Integer getKeyForString(String keyString) {
            return Integer.valueOf(keyString);
        }

        public ProductCategoryDimension createProductCategoryDimension() {
            long startTime = System.currentTimeMillis();
            ProductCategoryDimension dimension = new ProductCategoryDimension(engine, this);

            for (Product product : salesEngine().products()) {
                dimension.addIndexEntry(product.category.categoryID, product.code);
            }
            long fetchTime = System.currentTimeMillis() - startTime;
            engine.logMessage("createProductCategoryDimension completed in "+fetchTime+"ms.");
            return dimension;
        }

        private SalesEngine salesEngine() {
            return (SalesEngine) engine;
        }
    }

Product category dimension class

    public class ProductCategoryDimension extends OlapGroupDimension<Sales,Integer,InvoiceLine,ProductDimension,Integer> {

        public ProductCategoryDimension(OlapEngine<Sales, InvoiceLine> engine, ProductDimension childDimension) {
            super(engine, "productCategoryCode", childDimension);
        }

        @Override
        public Integer getKeyForEntry(InvoiceLine entry) {
            return entry.product.category.categoryID;
        }

        @Override
        public Integer getKeyForString(String keyString) {
            return Integer.valueOf(keyString);
        }
    }

Initialize and use in an app

• The engine can be used from multiple threads once loaded.

• I usually create a singleton for the engine; it can also be in your app class.
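
One way to get a singleton that reports "not ready" while loading is to publish it through a volatile field once construction is complete. The sketch below assumes that loading runs on a background thread, which the slides do not state; `EngineHolderSketch` and its nested `Engine` class are hypothetical stand-ins, not the real SalesEngine:

```java
final class EngineHolderSketch {
    // Hypothetical stand-in for the loaded engine; immutable after construction,
    // so it can be read from many threads without locking.
    static final class Engine {
        final int factCount;

        Engine(int factCount) {
            this.factCount = factCount;
        }
    }

    // volatile: a reader on another thread sees either null or a fully built engine.
    private static volatile Engine shared;

    // Starts loading on a background thread and returns immediately.
    static Thread createEngine() {
        Thread loader = new Thread(new Runnable() {
            @Override
            public void run() {
                Engine loaded = new Engine(1_000_000); // real code would read the facts here
                shared = loaded;                       // publish only once fully loaded
            }
        }, "engine-loader");
        loader.start();
        return loader;
    }

    // Returns null until loading has completed; callers must handle that case.
    static Engine sharedEngine() {
        return shared;
    }
}
```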

Use in an application

    public Application() {
        ...
        SalesEngine.createEngine();
    }

In the component that uses the engine:

    public OlapNavigator(WOContext context) {
        super(context);
        ....
        engine = SalesEngine.sharedEngine();
        if (engine == null) {
            // The engine may be null if it has not completed its loading...
        }
    }

    someFetchMethod() {
        OlapResult<Sales, InvoiceLine> result = engine.resultForRequest(query);

        rows = new NSArray<Sales>(result.getGroups());
        // sort or put inside an ERXDisplayGroup...
    }

Demo app

Java and memory

• To keep the garbage collector happy, it is better to set the maximum heap to at least 2-3 times the real usage.

• The GC can kill your app's performance if memory is starved. With default settings, it may even kill your server by using multiple cores for long periods, at least in Java 1.5 and 1.6.

• Java 1.7 contains a new collector, probably better.
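
To check the 2-3x guideline on a running JVM, the standard `Runtime` methods give a rough picture. A minimal sketch follows; the class and the `headroomRatio` helper are hypothetical, and the "used" figure includes garbage that has not been collected yet, so it is only an approximation of real usage:

```java
final class HeapHeadroomSketch {
    // Ratio of the maximum heap to the heap currently in use.
    static double headroomRatio(long maxBytes, long usedBytes) {
        return (double) maxBytes / (double) usedBytes;
    }

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        long used = runtime.totalMemory() - runtime.freeMemory(); // approximate live heap
        long max = runtime.maxMemory();                           // limit set by -Xmx
        System.out.printf("used=%d MB, max=%d MB, ratio=%.1f%n",
                used >> 20, max >> 20, headroomRatio(max, used));
    }
}
```

Run this after the engine has loaded; if the ratio is below 2, raising the maximum heap with the `-Xmx` JVM option is the usual adjustment.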

Q&A

Samuel Pelletier, samuel@kaviju.com
