Oracle OpenWorld presentation
Oracle – Big Data: The Intelligence Life-Cycle and Schema-Last Approach
Dr Neil Brittliff PhD
A little about myself…
• Awarded a PhD at the University of Canberra in March this year for my work in the Big Data space
• Currently employed as a Data Scientist within the Australian Government
• Have been employed by 5 law enforcement agencies
• Developed Cryptographic Software to support the Australian Medicare System
• First used Oracle products back in 1986
• Worked in the IT industry since 1982
• Resides in Canberra (capital of Australia)
  Canberra is the only capital city in Australia that is not named after a person
Interests
• Tennis (play) / Cricket (watch)
• Bushwalking and camping
• Piano Playing (very badly)
• Making stuff out of wood
• Enjoys the art of Programming (prefers the ‘C’ language)
• Pushing the limits of the Raspberry Pi
University of Canberra - 2015
Talk Structure
• Motivation
• Principles and Constraints
• Intelligence Life-Cycle
  • Collect & Collate
  • Analyse & Produce
  • Report & Disseminate
• Motivation Research
• What is a Schema
• The Problem with ETL
• Data Cleansing versus Data Triage
• A New Architecture
• Oracle Big Data
• The Schema-Last Approach
• Indexing Technologies and Exploitation
• User Reaction
• Observations and Opportunities
National Criminal Intelligence
The Law Enforcement community are also in the business of collecting and analysing criminal intelligence and data and, where possible, sharing the resulting information…
To do this, they need rich, contemporary, and comprehensive criminal intelligence…
The National Criminal Intelligence Fusion Capability brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.
Fusion capability identifies the threats and vulnerabilities through the use of data.
It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.
Australian Institute of Criminology
• While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved.
• Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining.
Darren Quick and Kim-Kwang Raymond Choo, Australian Institute of Criminology, September 2014
Objectives
• Support the Australian Criminal Intelligence Model
• Simple interface to exploit the data
• Data ingestion must be simple to do and minimise transformation
• Support the large variety of data sources
• Fast ingestion and retrieval times
• Enable exact and fuzzy searching
• Support ‘Identity Resolution’
• Support metadata
• Maintain the data’s integrity
• Preserve Data-Lineage/Provenance
• Reproduce the ingested data source exactly!
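The objectives of preserving lineage and reproducing the ingested source exactly can be checked mechanically. The following is a minimal sketch in Python, not the system described in this talk. It assumes the source is kept byte-for-byte and fingerprinted with SHA-256 at ingestion time; the names `IngestedSource`, `ingest` and `reproduces_exactly` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestedSource:
    """Hypothetical provenance record kept alongside the stored data."""
    name: str      # where the data came from
    payload: bytes  # the source exactly as received, never rewritten
    sha256: str     # fingerprint recorded at ingestion time


def ingest(name: str, payload: bytes) -> IngestedSource:
    # Record the fingerprint before anything else touches the bytes.
    return IngestedSource(name, payload, hashlib.sha256(payload).hexdigest())


def reproduces_exactly(stored: IngestedSource, reproduced: bytes) -> bool:
    # True only if the reproduced bytes match the original fingerprint.
    return hashlib.sha256(reproduced).hexdigest() == stored.sha256
```

Any transformation that alters even one byte of the reproduced source makes `reproduces_exactly` return `False`, which is one way to make the "exactly" in the objective testable.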
We don’t want this!
The Intelligence Life-Cycle
• Plan, prioritise & direct
• Collect & collate
• Analyse & produce
• Report & disseminate
• Evaluate & review
Intelligence – Data Source Classification
[Figure: data source classification. Sources are rated Low or High against Velocity, Variety, Volume, Veracity and Value. Low: 95%. High: 5%. Life-cycle phases shown: Collect & Collate, Analyse & Produce.]
Some Definitions:
A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data. Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The Twitter experience.
Schema is from the Greek word meaning ‘form’ or ‘figure’ and is a formal representation of a data model which has integrity constraints controlling permissible data values.
Data munging, sometimes referred to as data wrangling, means taking data that’s stored in one format and changing it into another format.
Life-cycle phase: Collect & Collate
Schema Application
[Figure: two pipelines compared.
 Schema First: Raw Data, Cleanse, Schema, Storage, Analyse.
 Schema Last: Raw Data, Triage, Storage, Schema, Analyse.]
Data Cleansing …
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
“Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.” Microsoft: 2012
Life-cycle phase: Collect & Collate
Data Sources – Always Increasing
[Figure: chart of increasing data sources, with a region labelled ‘Gap’.]
Life-cycle phase: Collect & Collate
Data Cleansing - Doesn’t WORK
“Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems.” David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.
“Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.” N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014.
“that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.” Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007.
Life-cycle phase: Collect & Collate
Data Cleansing – Loss of Format
Input Date      Cleansed Date   Comment
20 July 2014    20-07-2014      Australian Date
July-20-2014    20-07-2014      American Format (mmm-dd-yyyy)
2014-20-07      20-07-2014      Arabic Format (right to left)
20-07-14        20-07-2014      Data Ambiguity
July 2014       01-07-2014      Imputed Value

“If you torture the data long enough, it will confess.”
Clifton R. Musser
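The loss of format shown in the date table can be reproduced in a few lines. This is a minimal sketch in Python and not the cleansing tool discussed in the talk. The format list `CANDIDATE_FORMATS` and the functions `cleanse` and `triage` are hypothetical, chosen only to match the five input rows.

```python
from datetime import datetime

# Hypothetical formats a cleansing step might try, in order.
CANDIDATE_FORMATS = ["%d %B %Y", "%B-%d-%Y", "%Y-%d-%m", "%d-%m-%y", "%B %Y"]


def cleanse(raw: str) -> str:
    """Normalise to dd-mm-yyyy; the original representation is discarded."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue  # this format did not fit, try the next one
    raise ValueError(f"unrecognised date: {raw!r}")


def triage(raw: str) -> dict:
    """Keep the raw value next to its interpretation, so nothing is lost."""
    return {"raw": raw, "interpreted": cleanse(raw)}


if __name__ == "__main__":
    for value in ["20 July 2014", "July-20-2014", "2014-20-07", "20-07-14", "July 2014"]:
        print(triage(value))
```

After `cleanse`, the first four inputs are indistinguishable, and "July 2014" has gained a day of the month that the source never stated. Keeping the raw string, as `triage` does, leaves both facts recoverable.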
Life-cycle phase: Collect & Collate
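ETL vs Triage
[Figure: two flowcharts compared. Decision nodes (marked ‘?’) each carry an ‘n’ (no) branch.
 ETL: Initiate, Extract, Determine, Suitability?, Transform, Assessment?, Load, Report, Complete.
 Triage: Initiate, Triage, Load, Suitability?, Application, Verify?, Fuse, Resolve, Complete.]
Life-cycle phase: Collect & Collate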
We did our research …
Oracle’s BDA (Big Data Appliance)
Life-cycle phase: Collect & Collate
Data Storage/Collation
• Store the Data Semantically
• Built on a defined taxonomy/ontology
• Perfect to capture metadata
• Searched for the perfect Triple-Store
[Figure: a Triple (Subject, Predicate, Object), shown alongside Graph and List representations.]
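To make the Subject, Predicate, Object idea concrete, here is a minimal sketch in Python of how one row of tabular data could be held as triples and read back. It is an illustration under stated assumptions, not the triple store built for this work: the subject naming scheme (`source#rowN`) and the names `Triple`, `row_to_triples` and `triples_to_row` are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # assumed here to identify one row of one source
    predicate: str  # the column name
    object: str     # the cell value, stored as received


def row_to_triples(source: str, row_number: int, row: dict[str, str]) -> list[Triple]:
    """Emit one triple per cell; no column is renamed or reinterpreted."""
    subject = f"{source}#row{row_number}"
    return [Triple(subject, column, value) for column, value in row.items()]


def triples_to_row(triples: list[Triple]) -> dict[str, str]:
    """Rebuild the original row from the triples of a single subject."""
    return {t.predicate: t.object for t in triples}
```

Because the column name travels with every value, metadata needs no separate schema up front, and a row can be rebuilt from its triples without loss.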
Life-cycle phase: Collect & Collate
The Architecture
[Figure: architecture diagram spanning Collect & Collate, Analyse & Produce and Disseminate, with arrows showing data flow. Labelled components: Historical Data, New Data, Feeds, RDF / Modelling, Set Store, HBase, Semantic Store, Index (IIR), Index (SOLR), BDA, Data Exploration, Data Exploitation, SPARQL, R Language, Apache Pig, Palantir, Search Assistant.]
Schema Last …
‘Triaged’ Data                                          Schema
First Name, Middle Name, Last Name                      Full-Name
Street Number, Street Name, Suburb, State, Postcode     Full-Address
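The mapping from triaged fields to Full-Name and Full-Address can be read as a schema applied when the data is read, not when it is stored. The sketch below, in Python, shows one way that could look. It is an assumption-laden illustration and not the implementation behind this talk: `SCHEMA` and `apply_schema` are hypothetical names, and joining parts with a single space is an arbitrary choice.

```python
# Hypothetical schema: each model field lists the triaged fields it is built from.
SCHEMA = {
    "Full-Name": ["First Name", "Middle Name", "Last Name"],
    "Full-Address": ["Street Number", "Street Name", "Suburb", "State", "Postcode"],
}


def apply_schema(record: dict[str, str], schema: dict[str, list[str]]) -> dict[str, str]:
    """Build a schema view of a stored record without modifying the record."""
    view = {}
    for target, sources in schema.items():
        # Skip source fields that are absent or blank in this record.
        parts = [record[s].strip() for s in sources if record.get(s, "").strip()]
        view[target] = " ".join(parts)
    return view
```

Since the stored record is left untouched, a different schema can be applied to the same data later without re-ingesting it.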
Life-cycle phase: Collect & Collate
Models
ACC Search Engines – ‘Smackdown’
Feature                  SOLR                          IIR
License                  Apache License                Commercial
Storage                  Inverted List                 Third-party Database
Score Model              Inverse Document Frequency    Normalized Score
Programming Interface    REST                          SOAP API

Other features compared (per-engine marks not recoverable): Support Google Like search (‘Next Release’), Result Pagination, Homophone Support (‘Can use synonym support’), Phoneme Search, Spread indexes across multiple nodes, Schema-less Support, Geo-spatial.
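For readers unfamiliar with the ‘Inverse Document Frequency’ score model named in the comparison, the sketch below shows the textbook form of the measure in Python. It is not the exact scoring formula of either search engine, and the function name and the treatment of unseen terms (scored as 0.0) are assumptions made for this example.

```python
import math


def inverse_document_frequency(term: str, documents: list[set[str]]) -> float:
    """Textbook IDF: log(N / df), where df is the number of documents containing the term."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen nowhere carries no weight
    return math.log(len(documents) / containing)
```

A term that appears in few documents scores higher than one that appears in many, so a rare surname contributes more to a match than a common given name.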
Life-cycle phase: Collect & Collate
Collect & Collation Tool
Life-cycle phase: Collect & Collate
Bongo – Exploration
Life-cycle phase: Analyse & Produce
Palantir – Semantic Interface
Life-cycle phase: Report & Disseminate
User Reaction
[Charts: Time to Triage (< 1 hour; 1 to 24 hours; > 24 hours) and General Size % in megabytes (< 1; 1 to 100; 100 to 1000; > 1000).]
• Developed a Palantir Plugin to search the Fusion Data Holding
• Bulk Matching was a great success
• In general, user reaction has been positive
• Time to Triage was usually under an hour, whereas cleansing could take weeks!
Australian Crime Commission 2015
Ingestion Rate – The Improvement
Life-cycle phase: Collect & Collate
Observations…
• The Bulk Matcher
• Performance and Reliability
• Interaction with Palantir
• Configuration over Customisation
• Search for the ‘Single Source of Truth’
• Golden Record
• Acceptance of the Schema Last Approach
• Overwhelmed by Search Results
Further Reading and Contacts
• Strategic Thinking in Criminal Intelligence. Jerry H Ratcliffe. The Federation Press, 2009. ISBN 978 186287 734-4
• Intelligence-Led Policing. Jerry Ratcliffe. Routledge, 2008. ISBN 978-1-843292-339-8
• Data Matching: Concepts and Techniques and Record Linkage, Entity Resolution, and Duplicate Detection. Peter Christen. Springer, 2012. ISBN 978-3-642-31163-5
• Foundations of Semantic Web Technologies. Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph. CRC Press, 2010. ISBN 978-1-4200-9050-5
• Big Data: A revolution that will transform how we live, work, and think. Viktor Mayer-Schönberger and Kenneth Cukier. HMH, 2013. ISBN 978-0-544-00269-2
• The Schema Last Approach to Data Fusion. Neil Brittliff and Dharmendra Sharma. AusDM 2014
• A Triple Store Implementation to support Tabular Data. Neil Brittliff and Dharmendra Sharma. AusDM 2014
• Australian Institute of Criminology: http://www.aic.gov.au
• University of Canberra: http://www.Canberra.edu.au