Download - Reasoning over big data

Reasoning Over Big Data Stores
Eric Little, PhD
VP Data Science
Polytechnic School of Engineering - [email protected]

Who We Are & What We Do

OSTHUS, Inc. is the U.S. subsidiary of OSTHUS GmbH
Global presence - offices in Germany, U.S. & China
Provide advanced solutions, consulting and technology services for Pharmaceutical and Biotech R&D

Technology provider for the Allotrope effort, globally aligning several pharma and biotech companies


Semantic Technologies – Smart Data Piece

Semantic Technologies provide several important features for emerging new technologies:
• Controlled vocabularies
• Taxonomies
• Metadata structures
• Ontology models
• Logical inference

Data today continues to evolve and grow in both size and complexity.
We need hybrid solutions that can provide real insights

Analytics is growing into a new kind of field: Data Science
Is data science about interacting with machines or humans?
Must be able to strike a balance between complexity of the data and simplicity of the presentation to the user

Metadata, Reference Data & Master Data

• While often lumped together, these are distinct kinds of data

• Semantic Technologies can help with the organization of these kinds of data – but should not be done in isolation

• Scalability is achieved using complementary approaches

[Figure axes: Increased conceptual complexity vs. Increased Scalability Issues]

Graphs are good for information – not so good for high-bandwidth applications where speed and scalability are the primary drivers.

Can require highly specialized hardware, software techniques or engineers

Semantics should be confined to the metadata aspects of the problem – use other tech for the rest

Where Semantics Can Fall Short


Big Data is a real challenge, but it is starting to become a buzzword

Many “Big Data Problems” can be reduced to smaller data problems

Applications exist that require complex inferencing over very large data sets

A current client has lab readings from 40,000+ devices

How to do this effectively?

The Big Data Problem


Why Not Just Build the Data Lake?

Data lakes are fine when you are gathering and storing the data

What happens later on when a lot of data is in there?

The benefits are that data can stay in its original form - no real ETL
But running analytics across disparate stores is very challenging
“Without metadata, every subsequent use of data means analysts start from scratch.” (Gartner 2014)

Reasoning Over Big Data Is A Growing Topic

There has been an inordinate amount of time and energy spent on just queries.

This is not reasoning though – it is just retrieval

What is Reasoning?
More than just automated query sets run in sequence or parallel
Reasoning is about inferring new information that isn’t in the raw data.
It is a heuristic - where one discovers or learns something new for themselves
Deductive, Inductive, Abductive

Logical Reasoning (does not always assume set theory)

Mathematical Reasoning (which is logical reasoning, but assumes set theory as the basis)

Types of Reasoning One Can Use




Reasoning Evolution

Types of Semantic Inference (Forward and Backward Chaining)

Backward chaining: uses Modus Ponens
Starts from a true consequent (the goal) and affirms the related antecedent (verifies connection)

Forward chaining: uses Modus Ponens
Finds a true antecedent and affirms a related consequent (new knowledge)
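
The two chaining directions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reasoner used in the talk: the rule format, the rule names, and the facts are all hypothetical. Forward chaining starts from known antecedents and derives new consequents; backward chaining starts from a goal (a consequent) and checks whether its antecedents can be established.

```python
from typing import List, Optional, Set, Tuple

# A rule is (antecedents, consequent): if all antecedents hold, the consequent holds.
Rule = Tuple[Tuple[str, ...], str]

# Hypothetical rules and facts, invented for illustration only.
RULES: List[Rule] = [
    (("reading_out_of_range",), "device_suspect"),
    (("device_suspect", "calibration_overdue"), "needs_service"),
]
FACTS: Set[str] = {"reading_out_of_range", "calibration_overdue"}


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Data-driven: fire rules whose antecedents are known until nothing new appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if consequent not in known and all(a in known for a in antecedents):
                known.add(consequent)  # modus ponens: antecedents true, so consequent true
                changed = True
    return known


def backward_chain(goal: str, facts: Set[str], rules: List[Rule],
                   seen: Optional[Set[str]] = None) -> bool:
    """Goal-driven: a goal holds if it is a fact, or if some rule concluding it
    has antecedents that can all be established in turn."""
    seen = set() if seen is None else seen
    if goal in facts:
        return True
    if goal in seen:  # guard against cyclic rule sets
        return False
    seen = seen | {goal}
    return any(
        consequent == goal
        and all(backward_chain(a, facts, rules, seen) for a in antecedents)
        for antecedents, consequent in rules
    )


new_knowledge = forward_chain(FACTS, RULES) - FACTS
print(sorted(new_knowledge))
print(backward_chain("needs_service", FACTS, RULES))
```

Forward chaining suits cases where new data arrives and all consequences should be materialized; backward chaining suits cases where a specific question is asked and only the relevant rules need to be explored.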

Ontology Layering Is Important for Scale

Data Source Models

Multi- & Single-Source Data Integration Models

Domain Models (Objs, Attributes, Process & Relations)

System Lvl Models (Rules)

Data Traceability (Provenance)

User Driven Ontologies

Upper-Lvl Models

Meta-data Levels (Human Concepts)

Data-centric Levels (Machine Language)

Metaphysics - not just data models
Data Sources connected directly to higher classifications
Federation allows for improved scale

Get your semantics experts and your big data scientists on the same page

Utilize tables where possible - avoid multi-node graph hops
Use graphs for metadata - leave instance data in place when possible
Large graphs should be avoided
Lots of columns and rows are fine - joins across tables are not
Break graph information into other formats wherever possible

Pre-compute phases are important
Pre-compute multi-table joins based on SME input, known semantic patterns, business rules/logics, etc.
Use statistical methods to cluster data (e.g., normalcy calcs)
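
As a rough sketch of the pre-compute idea, the example below materializes a join once so that later queries read a single wide table instead of joining at query time. It uses Python's standard `sqlite3` module purely as a stand-in for whatever store is actually used; the table names, columns, and sample values are hypothetical and not taken from the talk.

```python
import sqlite3

# In-memory database standing in for a real data store (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device (device_id TEXT PRIMARY KEY, device_type TEXT);
CREATE TABLE reading (device_id TEXT, value REAL);

INSERT INTO device VALUES ('d1', 'hplc'), ('d2', 'hplc'), ('d3', 'balance');
INSERT INTO reading VALUES ('d1', 1.0), ('d1', 3.0), ('d2', 5.0), ('d3', 10.0);

-- Pre-compute phase: run the join once and store the result as a wide table.
CREATE TABLE reading_by_type AS
    SELECT d.device_type, r.device_id, r.value
    FROM reading r JOIN device d ON d.device_id = r.device_id;
CREATE INDEX idx_reading_by_type ON reading_by_type (device_type);
""")


def mean_for_type(device_type: str) -> float:
    """Query the pre-joined table; no join is needed at query time."""
    row = conn.execute(
        "SELECT AVG(value) FROM reading_by_type WHERE device_type = ?",
        (device_type,),
    ).fetchone()
    return row[0]


print(mean_for_type("hplc"))
```

Which joins are worth materializing would, per the points above, come from SME input, known semantic patterns, or business rules; the sketch simply assumes that "readings by device type" is one such pattern.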

Use the tech that is right for the job

Combining Semantics and NoSQL

One Example of Using RDF in Cloud-scalable Applications

Example of a current approach being used - there are others
Can scale across multiple cloud nodes (where triple stores have issues)
Triples are indexed items
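
To make "triples are indexed items" concrete, here is a minimal sketch of one common way to index RDF-style triples as keyed entries. The slide does not describe the actual index layout, so the three permutation indexes (SPO, POS, OSP), the class name, and the sample identifiers are all assumptions; plain Python dictionaries stand in for a distributed key-value or column store.

```python
from collections import defaultdict
from typing import Set


class TripleIndex:
    """Stores each triple under three key orderings so lookups avoid graph traversal."""

    def __init__(self) -> None:
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s: str, p: str, o: str) -> None:
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s: str, p: str) -> Set[str]:
        """All objects for a given subject and predicate."""
        return set(self.spo.get(s, {}).get(p, set()))

    def subjects(self, p: str, o: str) -> Set[str]:
        """All subjects that have the given predicate and object."""
        return set(self.pos.get(p, {}).get(o, set()))


# Hypothetical sample data.
idx = TripleIndex()
idx.add("device:d1", "rdf:type", "ex:HPLC")
idx.add("device:d2", "rdf:type", "ex:HPLC")
idx.add("device:d1", "ex:locatedIn", "lab:A")

print(sorted(idx.subjects("rdf:type", "ex:HPLC")))
print(idx.objects("device:d1", "ex:locatedIn"))
```

Because every lookup is a key access rather than a traversal, this layout can be partitioned by key across nodes, which is the property the scaling claim above depends on.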

THANK YOU - QUESTIONS?
Eric Little, PhD
VP Data Science
OSTHUS, [email protected]
(M) 321-480-4818
www.linkedin.com/pub/eric-little
