Reasoning Over Big Data Stores
Eric Little, PhD
VP Data Science
Polytechnic School of Engineering - [email protected]
Slide 2
Who We Are & What We Do
OSTHUS, Inc. is the U.S. subsidiary of OSTHUS GmbH
Global presence - offices in Germany, U.S. & China
Provide advanced solutions, consulting and technology services for Pharmaceutical and Biotech R&D
Technology provider for the Allotrope effort, globally aligning several pharma and biotech companies
Slide 3
Semantic Technologies – Smart Data Piece
Semantic technologies provide several important features for emerging new technologies:
• Controlled vocabularies
• Taxonomies
• Metadata structures
• Ontology models
• Logical inference
Data today continues to evolve and grow in both size and complexity.
We need hybrid solutions that can provide real insights
Analytics is growing into a new kind of field – Data Science
Is data science about interacting with machines or humans?
Must be able to strike a balance between complexity of the data and simplicity of the presentation to the user
Slide 4
Metadata, Reference Data & Master Data
• While often lumped together, these are distinct kinds of data
• Semantic Technologies can help with the organization of these kinds of data – but should not be done in isolation
• Scalability is achieved using complementary approaches
[Figure: axes labeled "Increased conceptual complexity" and "Increased Scalability Issues"]
Slide 5
Where Semantics Can Fall Short
Graphs are good for information – not so good for high-bandwidth applications where speed and scalability are the primary drivers.
They can require highly specialized hardware, software techniques, or engineers
Semantics should be confined to the metadata aspects of the problem – use other tech for the rest
Slide 6
The Big Data Problem
Big Data is a real challenge – but it is starting to become a buzzword
Many “Big Data Problems” can be reduced to smaller data problems
Applications exist that require complex inferencing over very large data sets
A current client has lab readings from 40,000+ devices
How to do this effectively?
Slide 7
Why Not Just Build the Data Lake?
Data lakes are fine when you are gathering and storing the data
What happens later on when a lot of data is in there?
The benefits are that data can stay in its original form – no real ETL
But running analytics across disparate stores is very challenging
“Without metadata, every subsequent use of data means analysts start from scratch.” (Gartner 2014)
Slide 8
Reasoning Over Big Data Is A Growing Topic
There has been an inordinate amount of time and energy spent on just queries.
This is not reasoning though – it is just retrieval
What is Reasoning?
More than just automated query sets run in sequence or parallel
Reasoning is about inferring new information that isn’t in the raw data.
It is a heuristic – where one discovers or learns something new for themselves
Deductive, Inductive, Abductive
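The difference between retrieval and reasoning can be illustrated with a minimal Python sketch. The triples, the `part_of` predicate, and the function names are all hypothetical and are not taken from the talk; the example only shows the deductive case, where a transitivity rule yields facts that were never stored.

```python
# Hypothetical raw data: (subject, predicate, object) triples.
FACTS = {
    ("sensor_17", "part_of", "hplc_3"),
    ("hplc_3", "part_of", "lab_b"),
    ("lab_b", "part_of", "site_boston"),
}


def retrieve(facts, predicate, obj):
    """Retrieval: return only subjects explicitly stored with this predicate and object."""
    return {s for (s, p, o) in facts if p == predicate and o == obj}


def infer_transitive(facts, predicate):
    """Deduction: add (a, predicate, c) whenever (a, predicate, b) and (b, predicate, c) hold."""
    closed = set(facts)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        for (a, p1, b) in list(closed):
            for (c, p2, d) in list(closed):
                new = (a, predicate, d)
                if p1 == p2 == predicate and b == c and new not in closed:
                    closed.add(new)
                    changed = True
    return closed


# A plain query sees only what was stored.
stored = retrieve(FACTS, "part_of", "site_boston")

# The same query over the inferred closure also finds the derived facts.
derived = retrieve(infer_transitive(FACTS, "part_of"), "part_of", "site_boston")
```

Here `stored` contains only `lab_b`, while `derived` also contains `hplc_3` and `sensor_17`, which appear in no stored triple about `site_boston`.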
Slide 9
Types of Reasoning One Can Use
Logical Reasoning (does not always assume set theory)
Mathematical Reasoning (which is logical reasoning, but assumes set theory as the basis)
Slide 10
Reasoning Evolution
Slide 11
Types of Semantic Inference (Forward and Backward Chaining)
Backward chaining: uses Modus Ponens; finds a true consequent and affirms a related antecedent (verifies connection)
Forward chaining: uses Modus Ponens; finds a true antecedent and affirms a related consequent (new knowledge)
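The two chaining directions can be sketched in a few lines of Python. This follows the standard textbook formulation over propositional rules, not any particular reasoner; the rule set and the proposition names are hypothetical.

```python
# Each rule is (set of antecedents, consequent). All names are illustrative.
RULES = [
    ({"reading_out_of_range", "device_calibrated"}, "sample_anomalous"),
    ({"sample_anomalous"}, "batch_needs_review"),
]


def forward_chain(facts, rules):
    """Data-driven: fire every rule whose antecedents are known until nothing new appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= known and consequent not in known:
                known.add(consequent)  # modus ponens yields a new fact
                changed = True
    return known


def backward_chain(goal, facts, rules, seen=frozenset()):
    """Goal-driven: a goal holds if it is a known fact, or if some rule concludes it
    and every antecedent of that rule can itself be established."""
    if goal in facts:
        return True
    if goal in seen:  # guard against cyclic rule sets
        return False
    for antecedents, consequent in rules:
        if consequent == goal and all(
            backward_chain(a, facts, rules, seen | {goal}) for a in antecedents
        ):
            return True
    return False


observed = {"reading_out_of_range", "device_calibrated"}
all_known = forward_chain(observed, RULES)
goal_holds = backward_chain("batch_needs_review", observed, RULES)
```

Forward chaining materializes every derivable fact up front, which suits pre-compute phases; backward chaining does work only for the goal being asked about, which suits query-time verification.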
Slide 12
Ontology Layering Is Important for Scale
Data Source Models
Multi- & Single-Source Data Integration Models
Domain Models (Objs, Attributes, Process & Relations)
System Lvl Models (Rules)
Data Traceability (Provenance)
User Driven Ontologies
Upper-Lvl Models
Metadata Levels (Human Concepts)
Data-centric Levels (Machine Language)
Metaphysics – not just data models
Data Sources connected directly to higher classifications
Federation allows for improved scale
Slide 13
Combining Semantics and NoSQL
Get your semantics experts and your big data scientists on the same page
Utilize tables where possible – avoid multi-node graph hops
Use graphs for metadata – leave instance data in place when possible
Large graphs should be avoided
Lots of columns and rows are fine – joins across tables are not
Break graph information into other formats wherever possible
Pre-compute phases are important
Pre-compute multi-table joins based on SME input, known semantic patterns, business rules/logics, etc.
Use statistical methods to cluster data (e.g., normalcy calcs)
Use the tech that is right for the job
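The pre-compute advice can be illustrated with a small sketch that materializes a join once, so later queries scan a single wide table. It uses Python's built-in `sqlite3` purely as a stand-in for whatever store is actually in use; the schema, table names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical metadata table (small) and instance-data table (large).
    CREATE TABLE device (device_id TEXT PRIMARY KEY, device_type TEXT, lab TEXT);
    CREATE TABLE reading (device_id TEXT, ts TEXT, value REAL);

    INSERT INTO device VALUES ('d1', 'hplc', 'lab_a'), ('d2', 'balance', 'lab_b');
    INSERT INTO reading VALUES ('d1', '2015-01-01T00:00', 1.5),
                               ('d1', '2015-01-01T00:05', 1.7),
                               ('d2', '2015-01-01T00:00', 20.1);

    -- Pre-compute phase: pay for the join once and store the result as a wide table.
    CREATE TABLE reading_wide AS
        SELECT r.device_id, r.ts, r.value, d.device_type, d.lab
        FROM reading AS r
        JOIN device AS d ON r.device_id = d.device_id;
    """
)

# Query phase: filter on metadata columns without any join at read time.
rows = conn.execute(
    "SELECT device_id, value FROM reading_wide WHERE lab = ? ORDER BY ts",
    ("lab_a",),
).fetchall()
```

The trade-off is extra storage and a refresh step whenever the metadata changes, in exchange for join-free reads over many rows and columns.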
Slide 14
One Example of Using RDF in Cloud-scalable Applications
Example of a current approach being used – there are others
Can scale across multiple cloud nodes (where triple stores have issues)
Triples are indexed items
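The slide does not describe the architecture in detail, so the following is only one plausible reading of "triples are indexed items": each triple is stored under one key per term, and keys are hash-partitioned across nodes. The class, the key format, and the node count are assumptions made for illustration, with in-memory dictionaries standing in for cloud nodes.

```python
from collections import defaultdict
from zlib import crc32

NUM_NODES = 3  # hypothetical number of storage nodes


def node_for(key: str) -> int:
    """Stable hash partitioning: the same key always maps to the same node."""
    return crc32(key.encode("utf-8")) % NUM_NODES


class TripleIndex:
    """Stores each triple under three keys (subject, predicate, object)."""

    def __init__(self):
        # One dict per simulated node: index key -> set of triples.
        self.nodes = [defaultdict(set) for _ in range(NUM_NODES)]

    def add(self, s, p, o):
        triple = (s, p, o)
        for key in (f"s:{s}", f"p:{p}", f"o:{o}"):
            self.nodes[node_for(key)][key].add(triple)

    def lookup(self, position, term):
        """Single-term lookup touches exactly one node, with no graph traversal."""
        key = f"{position}:{term}"
        return set(self.nodes[node_for(key)].get(key, set()))


index = TripleIndex()
index.add("reading_1", "measured_by", "device_d1")
index.add("reading_2", "measured_by", "device_d2")
index.add("device_d1", "located_in", "lab_a")

by_predicate = index.lookup("p", "measured_by")
by_subject = index.lookup("s", "device_d1")
```

Each lookup is a key-value read that any horizontally scalable store can serve; multi-hop patterns would need several such reads, which is where pre-computation becomes relevant.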
THANK YOU – QUESTIONS?
Eric Little, PhD
VP Data Science
OSTHUS, [email protected]
(M) 321-480-4818
www.linkedin.com/pub/eric-little