Apache: Big Data Europe 2015
Search-based business intelligence andreverse data engineering with Apache Solr
M a r i o - L e a n d e r R e i m e rC h i e f T e c h n o l o g i s t
Apache: Big Data Europe 2015
This talk will …
Mario-Leander Reimer 228. September 2015
o Give a brief overview of the AIR system’s architecture
o Show reverse data engineering using Solr and MIR
o Talk about the fight for our right to Solr
o Describe solutions for the problem of combinatorial explosion
o Outline a flexible and lightweight ETL approach for Solr
Apache: Big Data Europe 2015Apache: Big Data Europe 2015Mario-Leander Reimer 328. September 2015
A <<Anwendungscluster>>
AIR Repository
A <<Application Cluster>>
AIR Loader
Mechanic
A <<System>>
AIR Central
A <<Subsystem>>
Maintenance
I <<Subsystem>>
Apache Solr
A <<Client>>
AIR Client
I <<Subsystem>>
.NET WPF A <<Subsystem>>
Solr Extensions
A <<Subsystem>>
Defects
A <<Subsystem>>
Flat Rates
A <<Subsystem>>
Service Bulletins
ServiceTechnician
A <<Ext. System>>
3rd Party Application
A <<Subsystem>>
AIR Fork DLL
A <<Subsystem>>
AIR Call DLL
Launch
I <<Subsystem>>
Spring Framework
I <<Subsystem>>
JEE 5
A <<System>>
AIR Control
I <<Subsystem>>
Jenkins
A <<Subsystem>>
Documents
A <<Subsystem>>
Vehicles
A <<Subsystem>>
Measures
Backend Databasesand Systems
A <<Subsystem>>
Repair Overview
A <<Subsystem>>
...
A <<Subsystem>>
JSF Web UI
A <<Subsystem>>
REST API
IndependentWorkshop
A <<Client>>
Browser Search andDisplay
A <<Ext. System>>
3rd Party iOS App
A <<Subsystem>>
AIR iOS Lib
A <<Subsystem>>
Defects
A <<Subsystem>>
Flat Rates
A <<Subsystem>>
Service Bulletins
I <<Subsystem>>
Spring Framework
A <<Subsystem>>
Documents
A <<Subsystem>>
Parts
A <<Subsystem>>
WS Clients
A <<Subsystem>>
File Storage
A <<Subsystem>>
Solr Access
A <<Subsystem>>
Protocoll
A <<Subsystem>>
Watchlist
A <<Subsystem>>
Masterdata
A <<Subsystem>>
Retrofits
AIR DBDocument
Storage
A <<Ext. System>>
AIR Bus
I <<Ext. System>>
Backend Systems
Query
A <<Subsystem>>
Vehicles Execute
Load
20 Languages800 GB
Solr Index
A <<Subsystem>>
Maintenance
Apache: Big Data Europe 2015Apache: Big Data Europe 2015Mario-Leander Reimer 428. September 2015
A <<Anwendungscluster>>
AIR Repository
A <<Application Cluster>>
AIR Loader
A <<Subsystem>>
Maintenance
I <<Subsystem>>
Apache Solr Master
A <<Subsystem>>
Solr Extensions
A <<Subsystem>>
Defects
A <<Subsystem>>
Flat Rates
A <<Subsystem>>
Service Bulletins
A <<System>>
AIR Control
I <<Subsystem>>
Jenkins
A <<Subsystem>>
Documents
Backend Databasesand Systems
A <<Subsystem>>
Repair Overview
A <<Subsystem>>
...
I <<Subsystem>>
Spring Framework
A <<Subsystem>>
Vehicles Execute
Load
20 Languages800 GB
Solr Index
I <<Subsystem>>
Apache Solr Slave
A <<Subsystem>>
Solr Extensions
Replicate
A <<System>>
MIR
20 Languages800 GB
Solr Index
Search
Apache: Big Data Europe 2015
Let‘s go back to when it all began …
Source: http://www.october212015.com/images/timecircuits.jpg 5
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
The project vision: find the right information in less than 3 clicks.
6
The situation:
o Users had to use up to 7 different applications for their daily work.
o Systems were not really integrated nicely.
o Finding the correct information was laborious and error prone.
The idea:
o Combine the data into a consistent information network.
o Make the information network and its data searchable and navigable.
o Replace existing application with one easy to use application.
Mario-Leander Reimer28. September 2015
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
But how do we find the originating system for the desired data?
7Mario-Leander Reimer28. September 2015
Where to find the vehicle data?
60 potential systems and 5000 entities. Other data
Vehicle data
System A System B System C System D
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
And how do we find the hidden relations between the systems and their data?
8Mario-Leander Reimer28. September 2015
How is the data linked to each other?
400.000 potential relations. Other data
Vehicle
System A System B System C System D
Customer
Documents
Apache: Big Data Europe 2015
Meta Information Research (MIR)
9Source: http://www.thewallpapers.org/photo/31865/Mir_space_station_12_June_1998.jpg
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
MIR is a simple and lightweight data reverse engineering and analysis tool based on Solr.
10
o MIR manages meta information about the source systems (the data models and record descriptions)
o MIR allows to navigate and search in the metadata, you can drill into the metadata using facets
o MIR also manages the target data model and Solr schema description
Mario-Leander Reimer28. September 2015
MetadataIndex
A <<System>>
Meta Information Research
I <<Subsystem>>
Apache Solr
A <<Subsystem>>
MIR User Interface
Backend Databasesand Systems
A <<Subsystem>>
MIR Loader
A <<Subsystem>>
MIR Generators
Read
Sources (Java, XML)Magic Draw
25MB
Apache: Big Data Europe 2015 11
Wildcard queries
Facetteddrill down
Tree view ofsystems, tablesand attributes
Search results
Found potential synonyms for thechassis number
Apache: Big Data Europe 2015 12
EAT YOUR OWN DOG FOOD.
The AIR Solr schema definition ismodelled and defined within MIR.
Solr schemaattributesSolr entities for
each release
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
def sourceGenerator = MIR + Solr + Maven;
13Mario-Leander Reimer28. September 2015
14
But Solr is a full text
search engine. You have
to use an Oracle DB for
your application data!
NO!
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Some of the AIR requirements were ...
15
o Focus is on search. Transactions are not required.
o High demands on request volume and performance.
o Free navigation on data model and content.
o Support for full text search and facetted search.
o Offline capabilities.
o Scalability from low-end device to server to cloud.
Mario-Leander Reimer28. September 2015
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Apache Solr outperformed Oracle significantlyin query time as well as index size.
16Mario-Leander Reimer28. September 2015
SELECT * FROM VEHICLE WHERE VIN='V%'
INFO_TYPE:VEHICLE AND VIN:V*
SELECT * FROM MEASURE WHERE TEXT='engine'
INFO_TYPE:MEASURE AND TEXT:engine
SELECT * FROM VEHICLE WHERE VIN='%X%'
INFO_TYPE:VEHICLE AND VIN:*X*
| 038 ms | 000 ms | 000 ms
| 383 ms | 384 ms | 383 ms
| 092 ms | 000 ms | 000 ms
| 389 ms | 387 ms | 386 ms
| 039 ms | 000 ms | 000 ms
| 859 ms | 379 ms | 383 ms
Test data set: 150.000 records Disk space: 132 MB Solr vs. 385 MB Oracle
Apache: Big Data Europe 2015 1728. September 2015Source: http://www.dirtbikerider.com/news/images/anotherimpressivegpweekendforhusqvarna_553db21addaaa.jpg
Dirt Race Use Case:
o Low-end devices
o No Internet
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Running Solr and AIR-2-Go on Raspberry Pi Model B worked like a charm.
18
Running Debian Linux + JDK8
Jetty Container with the Solr andAIR WARs deployed
Reduced Solr data set with approx~1.5 Mio documents
Mario-Leander Reimer28. September 2015
Model B Hardware Specs:o ARMv6 CPU at 700Mhzo 512MB RAMo 32GB SD-Card
19
YOU GOTTA FIGHT FOR
YOUR RIGHT TO SOLR!
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
No silver bullet. A careful schema design is crucial for your Solr performance.
2028. September 2015 Mario-Leander Reimer
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
33.071.137Vehicles
648.129Technical
Documents
14.830.197Flat Rate Units
5.078.411FRU Groups
55.000Parts
648.129Measures
18.573Repair
Instructions
6.180Fault Indications
m
n
m
n
m
n
1.678.667Packages
n
m
n
n
m
n
n41.385Types
Naive data denormalization can quickly leadto combinatorial explosion.
21Mario-Leander Reimer28. September 2015
Num Docs: 55.777.706 Relationship Navigation
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Multi-valued fields can efficiently store 1..n relations but may result in false positives.
22Mario-Leander Reimer28. September 2015
{"INFO_TYPE":"AWPOS_GROUP",
"NUMMER" :[ "1134190" , "1235590" ]
"BAUSTAND" :["1969-12-31T23:00:00Z","1975-12-31T23:00:00Z"]
"E_SERIES" :[ "F10" , "E30" ]
}
In case this doesn‘t matter, perform a post filtering in your application.
Note: latest Solr versions support nested child documents. Use instead.
Index 0 Index 1
q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:F10
q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:E30
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Technical documents and their validity wereexpressed in a binary representation.
23
o Validity expressions may have up to 46 characteristics.
o Validity expressions use 5 different boolean operators (AND, NOT, …)
o Validity expessions can be nested and complex.
o Some characteristics are dynamic and not even known at index time.
Mario-Leander Reimer28. September 2015
Solution: transform the validity expressions into theequivalent JavaScript terms and evaluate these termsat query time using a custom function query filter.
Apache: Big Data Europe 2015
Binary validity expression example.
2428. September 2015
Type(53078923) = ‚Brand‘, Value(53086475) = ‚BMW PKW‘
Type(53088651) = ‚E-Series‘, Value(53161483) = ‚F10‘
Type(64555275) = ‚Transmission‘, Value(53161483) = ‚MECH‘
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Transformation of binary validity terms intotheir JavaScript equivalent at index time.
25Mario-Leander Reimer28. September 2015
((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))
AND(Brand='BMW PKW', E-Series='F10'‚ Transmission='MECH')
{"INFO_TYPE": "TECHNISCHES_DOKUMENT",
"DOKUMENT_TITEL": "Getriebe aus- und einbauen",
"DOKUMENT_ART": " reparaturanleitung",
"VALIDITY": "((BRAND=='BMW PKW')&&((E_SERIES=='F10')&&(...))",
„BRAND": [„BMW PKW"],
...
}
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
The JavaScript validity term is evaluated at query time using a custom function query.
26Mario-Leander Reimer28. September 2015
&fq=INFO_TYPE:TECHNISCHES_DOKUMENT
&fq=DOKUMENT_ART:reparaturanleitung
&fq={!frange l=1 u=1 incl=true incu=true cache=false cost=500}
jsTerm(VALIDITY,eyJNT1RPUl9LUkFGVFNUT0ZGQVJUX01PVE9SQVJCRUlUU
1ZFUkZBSFJFTiI6IkIiLCJFX01BU0NISU5FX0tSQUZUU1RPRkZBUlQiOm51bG
wsIlNJQ0hFUkhFSVRTRkFIUlpFVUciOiIwIiwiQU5UUklFQiI6IkFXRCIsIkV
kJBVVJFSUhFIjoiWCcifQ==)
http://qaware.blogspot.de/2014/11/how-to-write-postfilter-for-solr-49.html
Ba
se
64
de
co
de {
„BRAND":"BMW PKW", "E_SERIES":"F10","TRANSMISSION":"MECH"
}
27
How often do we load
data? How do we ensure
data consistency?
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
A traditional approach using a DWH and ETL: too inflexible, heavy weight and expensive.
28Mario-Leander Reimer28. September 2015
DataWarehouse
SystemB
SystemA
SystemC
File DB
File DB
AIR Solr
ETL
ETL
ETL
ETL
ETL
ETL
ETL jobs would usually beimplemented with Informatica
Significant business logicrequired depending on thesource database
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Flexible and lightweight ETL combined withContinuous Delivery and DevOps.
29Mario-Leander Reimer28. September 2015
H <<System>>
AIR Search
H <<System>>
AIR Loader Slave
I <<System>>
Jenkins Slave
I <<System>>
Apache Maven
Developer
Operations
Solr Index
A <<System>>
AIR Loader
I <<System>>
Apache Solr
Data Source A
I <<System>>
Jenkins Master
Sta
rt
I <<System>>
Nexus Repository
Bu
ild &
Dep
loy
Bu
ildR
un
Solr Index
I <<System>>
Apache SolrReplicate
Data Source n
Extract
Load
30
Apache: Big Data Europe 2015Apache: Big Data Europe 2015
Apache Solr has become a powerful tool fordata analytics applications. Be creative.
31
Our next big project using Apache Solr is already on its way.
High performance application to predict and calculate the bill of materials for all required parts and orders.
Apache Solr as a compressed, scalable and high performance time series database.
FOSDEM’15 – Florian Lautenschlager, QAware GmbH
Leveraging the Power of SOLR and SPARK
Apache: Big Data 2015 – Johannes Weigend, QAware GmbH
Mario-Leander Reimer28. September 2015
32
Business intelligence
is about asking the
right questions about
your data.
33
And with Apache Solr
you can search and
find the answers you
are looking for.
https://twitter.com/leanderreimer/
https://slideshare.net/MarioLeanderReimer/
https://speakerdeck.com/lreimer/
&
Mario-Leander Reimer Chief Technologist, QAware GmbH