search-based business intelligence and reverse data engineering with apache solr

Apache: Big Data Europe 2015 Search-based business intelligence and reverse data engineering with Apache Solr Mario-Leander Reimer Chief Technologist

Upload: mario-leander-reimer

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Search-based business intelligence and reverse data engineering with Apache Solr

Apache: Big Data Europe 2015

Search-based business intelligence and reverse data engineering with Apache Solr

Mario-Leander Reimer, Chief Technologist

Page 2: Search-based business intelligence and reverse data engineering with Apache Solr


This talk will …

Mario-Leander Reimer, 28 September 2015

o Give a brief overview of the AIR system’s architecture

o Show reverse data engineering using Solr and MIR

o Talk about the fight for our right to Solr

o Describe solutions for the problem of combinatorial explosion

o Outline a flexible and lightweight ETL approach for Solr

Page 3: Search-based business intelligence and reverse data engineering with Apache Solr

[Figure: AIR system architecture overview.

Components shown: AIR Loader (application cluster), AIR Repository (application cluster), Apache Solr with Solr Extensions, AIR Central (system), AIR Control (system), Jenkins, AIR Client (.NET WPF), JSF Web UI, REST API, browser search and display, AIR Fork DLL, AIR Call DLL, AIR iOS Lib, 3rd party application, 3rd party iOS app, AIR Bus, AIR DB, document storage, backend databases and systems.

Subsystems named in the diagram: Vehicles, Maintenance, Defects, Flat Rates, Service Bulletins, Documents, Measures, Parts, Retrofits, Repair Overview, Masterdata, Watchlist, Protocol, Solr Access, File Storage, WS Clients.

Frameworks: Spring Framework, JEE 5.

Users: mechanic, service technician, independent workshop.

Arrow labels: Launch, Query, Execute, Load.

Solr index: 20 languages, 800 GB.]

Page 4: Search-based business intelligence and reverse data engineering with Apache Solr

[Figure: loading and replication of the Solr index.

Components shown: AIR Control (Jenkins), AIR Loader (Spring Framework; subsystems Vehicles, Maintenance, Defects, Flat Rates, Service Bulletins, Documents, Repair Overview, ...), backend databases and systems, AIR Repository with Apache Solr Master and Apache Solr Slave (each with Solr Extensions), MIR (system).

Arrow labels: Execute, Load, Replicate, Search.

Solr index: 20 languages, 800 GB.]

Page 5: Search-based business intelligence and reverse data engineering with Apache Solr

Let's go back to when it all began …

Source: http://www.october212015.com/images/timecircuits.jpg

Page 6: Search-based business intelligence and reverse data engineering with Apache Solr


The project vision: find the right information in less than 3 clicks.


The situation:

o Users had to use up to 7 different applications for their daily work.

o Systems were poorly integrated.

o Finding the correct information was laborious and error-prone.

The idea:

o Combine the data into a consistent information network.

o Make the information network and its data searchable and navigable.

o Replace the existing applications with one easy-to-use application.

Page 7: Search-based business intelligence and reverse data engineering with Apache Solr


But how do we find the originating system for the desired data?


Where to find the vehicle data?

60 potential systems and 5,000 entities.

[Figure: vehicle data and other data spread across System A, System B, System C and System D.]

Page 8: Search-based business intelligence and reverse data engineering with Apache Solr


And how do we find the hidden relations between the systems and their data?


How is the data linked to each other?

400,000 potential relations.

[Figure: vehicle, customer, document and other data linked across System A, System B, System C and System D.]

Page 9: Search-based business intelligence and reverse data engineering with Apache Solr


Meta Information Research (MIR)

Source: http://www.thewallpapers.org/photo/31865/Mir_space_station_12_June_1998.jpg

Page 10: Search-based business intelligence and reverse data engineering with Apache Solr


MIR is a simple and lightweight data reverse engineering and analysis tool based on Solr.


o MIR manages meta information about the source systems (the data models and record descriptions)

o MIR lets you navigate and search the metadata and drill into it using facets

o MIR also manages the target data model and Solr schema description
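A search over the metadata index with faceted drill-down could be requested roughly as sketched below. This is only an illustration: the field names SYSTEM, TABLE and ATTRIBUTE are assumptions and are not taken from the MIR schema; the request parameters (q, fq, facet, facet.field, facet.mincount) are standard Solr parameters.

```python
from urllib.parse import urlencode, parse_qs


def build_metadata_query(term, facet_fields, filters=None, rows=20):
    """Build the query string for a Solr /select request with faceting.

    term         -- the main query, e.g. a wildcard query on an attribute name
    facet_fields -- fields to facet on (hypothetical metadata fields)
    filters      -- facet values already selected by the user (drill-down)
    """
    params = [
        ("q", term),
        ("rows", str(rows)),
        ("facet", "true"),
        ("facet.mincount", "1"),
    ]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in (filters or {}).items():
        # Each selected facet value becomes its own filter query.
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


# Look for attributes whose name contains "VIN", restricted to one system.
query_string = build_metadata_query(
    "ATTRIBUTE:*VIN*", ["SYSTEM", "TABLE"], {"SYSTEM": "System A"}
)
print(parse_qs(query_string)["fq"])
```

Each click on a facet value adds one more fq entry, which is what makes the drill-down cheap to implement on the client side.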


[Figure: MIR architecture.

Components shown: Meta Information Research (system) with the subsystems MIR User Interface, MIR Loader and MIR Generators, Apache Solr, backend databases and systems, sources (Java, XML), MagicDraw.

Arrow label: Read.

Metadata index: 25 MB.]

Page 11: Search-based business intelligence and reverse data engineering with Apache Solr


Wildcard queries

Faceted drill down

Tree view of systems, tables and attributes

Search results

Found potential synonyms for the chassis number

Page 12: Search-based business intelligence and reverse data engineering with Apache Solr


EAT YOUR OWN DOG FOOD.

The AIR Solr schema definition is modelled and defined within MIR.

Solr schema attributes

Solr entities for each release

Page 13: Search-based business intelligence and reverse data engineering with Apache Solr


def sourceGenerator = MIR + Solr + Maven;


Page 14: Search-based business intelligence and reverse data engineering with Apache Solr

But Solr is a full text search engine. You have to use an Oracle DB for your application data!

NO!

Page 15: Search-based business intelligence and reverse data engineering with Apache Solr


Some of the AIR requirements were ...


o Focus is on search. Transactions are not required.

o High demands on request volume and performance.

o Free navigation on data model and content.

o Support for full text search and faceted search.

o Offline capabilities.

o Scalability from low-end device to server to cloud.


Page 16: Search-based business intelligence and reverse data engineering with Apache Solr

Apache Solr outperformed Oracle significantly in query time as well as index size.

SELECT * FROM VEHICLE WHERE VIN='V%'
    Oracle: 383 ms | 384 ms | 383 ms
INFO_TYPE:VEHICLE AND VIN:V*
    Solr:   038 ms | 000 ms | 000 ms

SELECT * FROM MEASURE WHERE TEXT='engine'
    Oracle: 389 ms | 387 ms | 386 ms
INFO_TYPE:MEASURE AND TEXT:engine
    Solr:   092 ms | 000 ms | 000 ms

SELECT * FROM VEHICLE WHERE VIN='%X%'
    Oracle: 859 ms | 379 ms | 383 ms
INFO_TYPE:VEHICLE AND VIN:*X*
    Solr:   039 ms | 000 ms | 000 ms

Test data set: 150,000 records

Disk space: 132 MB Solr vs. 385 MB Oracle

Page 17: Search-based business intelligence and reverse data engineering with Apache Solr

Source: http://www.dirtbikerider.com/news/images/anotherimpressivegpweekendforhusqvarna_553db21addaaa.jpg

Dirt Race Use Case:

o Low-end devices

o No Internet

Page 18: Search-based business intelligence and reverse data engineering with Apache Solr


Running Solr and AIR-2-Go on Raspberry Pi Model B worked like a charm.


Running Debian Linux + JDK8

Jetty Container with the Solr and AIR WARs deployed

Reduced Solr data set with approx. 1.5 million documents

Model B Hardware Specs:

o ARMv6 CPU at 700 MHz

o 512 MB RAM

o 32 GB SD-Card

Page 19: Search-based business intelligence and reverse data engineering with Apache Solr

YOU GOTTA FIGHT FOR YOUR RIGHT TO SOLR!

Page 20: Search-based business intelligence and reverse data engineering with Apache Solr


No silver bullet. A careful schema design is crucial for your Solr performance.


Page 21: Search-based business intelligence and reverse data engineering with Apache Solr


[Figure: entity relationship overview with m:n relations between the entities. Entity counts:

o Vehicles: 33,071,137

o Technical Documents: 648,129

o Flat Rate Units: 14,830,197

o FRU Groups: 5,078,411

o Parts: 55,000

o Measures: 648,129

o Repair Instructions: 18,573

o Fault Indications: 6,180

o Packages: 1,678,667

o Types: 41,385]

Naive data denormalization can quickly lead to combinatorial explosion.

Num Docs: 55,777,706

Relationship Navigation
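Why naive denormalization explodes can be seen in a small sketch: flattening one entity with several independent m:n relations yields one row per combination, so the row count is the product of the relation sizes. The entity, field names and relation sizes below are made up for illustration and are not figures from the AIR data set.

```python
from itertools import product
from math import prod


def denormalize(entity, relations):
    """Flatten one entity into one row per combination of related values."""
    names = list(relations)
    rows = []
    for combo in product(*(relations[name] for name in names)):
        row = dict(entity)
        row.update(zip(names, combo))
        rows.append(row)
    return rows


def flattened_row_count(relation_sizes):
    """Number of rows produced by naive flattening: the product of all sizes."""
    return prod(relation_sizes)


# Hypothetical document related to 3 types, 4 packages and 5 measures.
document = {"INFO_TYPE": "TECHNICAL_DOCUMENT", "ID": "D1"}
relations = {
    "TYPE": ["T1", "T2", "T3"],
    "PACKAGE": ["P1", "P2", "P3", "P4"],
    "MEASURE": ["M1", "M2", "M3", "M4", "M5"],
}
rows = denormalize(document, relations)
print(len(rows), flattened_row_count([3, 4, 5]))
```

A single document already becomes 3 * 4 * 5 rows here; with hundreds of thousands of documents and larger relations the index size grows multiplicatively.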

Page 22: Search-based business intelligence and reverse data engineering with Apache Solr


Multi-valued fields can efficiently store 1..n relations but may result in false positives.


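{"INFO_TYPE":"AWPOS_GROUP",
 "NUMMER"   :[ "1134190" , "1235590" ],
 "BAUSTAND" :["1969-12-31T23:00:00Z","1975-12-31T23:00:00Z"],
 "E_SERIES" :[ "F10" , "E30" ]
}

The first value of each field is at index 0, the second at index 1.

q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:F10

q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:E30

The first query combines two values from index 0 and is a correct hit. The second combines NUMMER from index 0 with E_SERIES from index 1 and still matches: a false positive.

In case this doesn't matter, perform post-filtering in your application.

Note: recent Solr versions support nested child documents. Use those instead.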
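The false positive and a possible application-side post filter can be sketched as follows. The sketch assumes that values stored at the same index position of the multi-valued fields belong together and that the stored values come back in that order; the two helper functions are hypothetical and only mimic the matching behaviour, they do not call Solr.

```python
doc = {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER": ["1134190", "1235590"],
    "E_SERIES": ["F10", "E30"],
}


def multi_valued_match(document, **criteria):
    """Mimic an AND query on multi-valued fields: any value per field may match."""
    return all(value in document[field] for field, value in criteria.items())


def aligned_match(document, **criteria):
    """Post filter: all criteria must match at the same index position."""
    fields = list(criteria)
    for values in zip(*(document[field] for field in fields)):
        if all(value == criteria[field] for field, value in zip(fields, values)):
            return True
    return False


# Index 0 with index 0: a real hit for both checks.
print(multi_valued_match(doc, NUMMER="1134190", E_SERIES="F10"))
print(aligned_match(doc, NUMMER="1134190", E_SERIES="F10"))

# Index 0 with index 1: matched by the query, rejected by the post filter.
print(multi_valued_match(doc, NUMMER="1134190", E_SERIES="E30"))
print(aligned_match(doc, NUMMER="1134190", E_SERIES="E30"))
```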

Page 23: Search-based business intelligence and reverse data engineering with Apache Solr

Technical documents and their validity were expressed in a binary representation.

o Validity expressions may have up to 46 characteristics.

o Validity expressions use 5 different boolean operators (AND, NOT, …)

o Validity expressions can be nested and complex.

o Some characteristics are dynamic and not even known at index time.

Solution: transform the validity expressions into the equivalent JavaScript terms and evaluate these terms at query time using a custom function query filter.

Page 24: Search-based business intelligence and reverse data engineering with Apache Solr

Binary validity expression example.

Type(53078923) = 'Brand', Value(53086475) = 'BMW PKW'

Type(53088651) = 'E-Series', Value(53161483) = 'F10'

Type(64555275) = 'Transmission', Value(53161483) = 'MECH'

Page 25: Search-based business intelligence and reverse data engineering with Apache Solr

Transformation of binary validity terms into their JavaScript equivalent at index time.

AND(Brand='BMW PKW', E-Series='F10', Transmission='MECH')

((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))

{"INFO_TYPE": "TECHNISCHES_DOKUMENT",
 "DOKUMENT_TITEL": "Getriebe aus- und einbauen",
 "DOKUMENT_ART": "reparaturanleitung",
 "VALIDITY": "((BRAND=='BMW PKW')&&((E_SERIES=='F10')&&(...))",
 "BRAND": ["BMW PKW"],
 ...
}
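The index-time transformation can be pictured as a small recursive translation from an expression tree into a JavaScript term. The sketch below is not the AIR implementation: the tuple-based tree format and the "EQ" node are assumptions made for illustration, and of the five operators mentioned in the talk only AND and NOT are named there (OR is added here as an assumed third one).

```python
def js_name(characteristic):
    """Map a characteristic name to a JavaScript identifier, e.g. E-Series -> E_SERIES."""
    return characteristic.upper().replace("-", "_").replace(" ", "_")


def js_string(value):
    """Quote a value as a single-quoted JavaScript string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_js(expr):
    """Translate a nested validity expression into a JavaScript boolean term."""
    operator = expr[0]
    if operator == "EQ":
        _, characteristic, value = expr
        return f"({js_name(characteristic)}=={js_string(value)})"
    if operator == "NOT":
        return f"(!{to_js(expr[1])})"
    if operator in ("AND", "OR"):
        joiner = "&&" if operator == "AND" else "||"
        return "(" + joiner.join(to_js(child) for child in expr[1:]) + ")"
    raise ValueError(f"unsupported operator: {operator}")


validity = (
    "AND",
    ("EQ", "Brand", "BMW PKW"),
    ("EQ", "E-Series", "F10"),
    ("EQ", "Transmission", "MECH"),
)
print(to_js(validity))
```

For the example expression this produces the same term as shown on the slide, which is then stored in the VALIDITY field of the document.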

Page 26: Search-based business intelligence and reverse data engineering with Apache Solr


The JavaScript validity term is evaluated at query time using a custom function query.


&fq=INFO_TYPE:TECHNISCHES_DOKUMENT

&fq=DOKUMENT_ART:reparaturanleitung

&fq={!frange l=1 u=1 incl=true incu=true cache=false cost=500}

jsTerm(VALIDITY,eyJNT1RPUl9LUkFGVFNUT0ZGQVJUX01PVE9SQVJCRUlUU

1ZFUkZBSFJFTiI6IkIiLCJFX01BU0NISU5FX0tSQUZUU1RPRkZBUlQiOm51bG

wsIlNJQ0hFUkhFSVRTRkFIUlpFVUciOiIwIiwiQU5UUklFQiI6IkFXRCIsIkV

kJBVVJFSUhFIjoiWCcifQ==)

http://qaware.blogspot.de/2014/11/how-to-write-postfilter-for-solr-49.html

Base64 decode:

{
"BRAND":"BMW PKW", "E_SERIES":"F10","TRANSMISSION":"MECH"
}
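On the client side, the filter query can be assembled by serializing the vehicle context to JSON and Base64-encoding it. The sketch below is an assumption about how such a request might be built: jsTerm is the custom function from the talk (it is not part of stock Solr), and the helper names are hypothetical. The frange local parameters are copied from the slide.

```python
import base64
import json


def encode_context(context):
    """Serialize the vehicle context to compact JSON and Base64-encode it."""
    raw = json.dumps(context, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_context(token):
    """Inverse of encode_context, as the server side would have to do it."""
    return json.loads(base64.b64decode(token).decode("utf-8"))


def validity_filter(context, field="VALIDITY", cost=500):
    """Build the fq value that keeps only documents whose term evaluates to 1."""
    local_params = f"{{!frange l=1 u=1 incl=true incu=true cache=false cost={cost}}}"
    return f"{local_params}jsTerm({field},{encode_context(context)})"


context = {"BRAND": "BMW PKW", "E_SERIES": "F10", "TRANSMISSION": "MECH"}
print(validity_filter(context))
```

Setting cache=false together with a cost of 100 or more asks Solr to run the filter as a post filter where the query type supports it, so the expensive term evaluation only sees documents that already passed the cheaper filters.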

Page 27: Search-based business intelligence and reverse data engineering with Apache Solr

How often do we load data? How do we ensure data consistency?

Page 28: Search-based business intelligence and reverse data engineering with Apache Solr

A traditional approach using a DWH and ETL: too inflexible, heavyweight and expensive.

[Figure: System A, System B and System C, file and DB sources, a data warehouse and AIR Solr, connected by six ETL steps.]

ETL jobs would usually be implemented with Informatica

Significant business logic required depending on the source database

Page 29: Search-based business intelligence and reverse data engineering with Apache Solr

Flexible and lightweight ETL combined with Continuous Delivery and DevOps.

[Figure: lightweight ETL and delivery pipeline.

Components shown: AIR Loader, AIR Loader Slave, AIR Search, Apache Solr (two instances, each with a Solr index), Jenkins Master, Jenkins Slave, Apache Maven, Nexus Repository, Data Source A to Data Source n.

Roles: Developer, Operations.

Arrow labels: Start, Build & Deploy, Build, Run, Extract, Load, Replicate.]
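The extract and load steps of such a lightweight loader can be sketched in a few lines. This is not the AIR Loader itself: the row format, the upper-casing of column names and the batch size are assumptions for illustration. The only Solr-specific fact used is that the update handler accepts a JSON array of documents.

```python
import json
from itertools import islice


def transform(row, info_type):
    """Turn one source row into a Solr document; empty columns are dropped."""
    document = {key.upper(): value for key, value in row.items() if value is not None}
    document["INFO_TYPE"] = info_type
    return document


def batches(items, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def update_payloads(rows, info_type, batch_size=1000):
    """Yield JSON bodies that could be posted to Solr's update handler."""
    documents = (transform(row, info_type) for row in rows)
    for chunk in batches(documents, batch_size):
        yield json.dumps(chunk)


# Hypothetical rows as they might come from a database cursor.
source_rows = [
    {"vin": "V1", "e_series": "F10", "comment": None},
    {"vin": "V2", "e_series": "E30", "comment": "demo"},
    {"vin": "V3", "e_series": "F10", "comment": None},
]
payloads = list(update_payloads(source_rows, "VEHICLE", batch_size=2))
print(len(payloads))
```

Because the loader is plain code, it can be built, versioned and run by a CI server like any other artifact, which is the point of combining it with Continuous Delivery.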

Page 30: Search-based business intelligence and reverse data engineering with Apache Solr


Page 31: Search-based business intelligence and reverse data engineering with Apache Solr

Apache Solr has become a powerful tool for data analytics applications. Be creative.

Our next big project using Apache Solr is already on its way.

High performance application to predict and calculate the bill of materials for all required parts and orders.

Apache Solr as a compressed, scalable and high performance time series database.

FOSDEM’15 – Florian Lautenschlager, QAware GmbH

Leveraging the Power of SOLR and SPARK

Apache: Big Data 2015 – Johannes Weigend, QAware GmbH


Page 32: Search-based business intelligence and reverse data engineering with Apache Solr

Business intelligence is about asking the right questions about your data.

Page 33: Search-based business intelligence and reverse data engineering with Apache Solr

And with Apache Solr you can search and find the answers you are looking for.

Page 34: Search-based business intelligence and reverse data engineering with Apache Solr

https://twitter.com/leanderreimer/

https://slideshare.net/MarioLeanderReimer/

https://speakerdeck.com/lreimer/


Mario-Leander Reimer Chief Technologist, QAware GmbH