streaming hypothesis reasoning - william smith, jan 2016

SHyRe: Streaming Hypothesis Reasoning

WILLIAM SMITH, PATRICK PAULSON, MARK BORKUM, DEBORAH MCGUINNESS, BRENDA PRAGGASTIS, RUI YAN, YUE LIU

DAML 2016 – Seattle, WA
Smart Data Conference, 2015 – San Jose, California

January 26, 2016

The legends PROTECTED INFORMATION and PROPRIETARY INFORMATION apply to information describing Subject Inventions as defined in Contract No. DE-AC05-76RL01830 and any other information which may be properly withheld from public disclosure thereunder

DOE’s National Laboratories are Solving America’s Toughest Challenges

New Capability

Capability Needs

Mission Drivers

Analyzing Changing Online Landscapes

Seed LDRD Projects
- Signatures of Communities & Change
- Digital Currency Graph Forensics
- DarkNet Characterization
- Signatures in the Cloud

Signature Discovery Initiative (SDI)

Analysis in Motion (AIM)

National Security

Computing

Disrupting Illicit Trafficking

Nuclear Security

National Defense

Homeland Security

Special Programs

Seattle Innovation District

Asymmetric Resilient Cybersecurity (ARC)

Cyber-Physical Systems

Ubiquitous Sensing


Analysis in Motion

Streaming Data Characterization & Processing

- Library of foundational streaming algorithms, methods for extracting features from streams
- Data reduction techniques like semantic characterization

Hypothesis Generation & Testing

- Scalable symbolic deduction & incremental machine learning to track a stream
- Generate, update, and validate human-understandable hypotheses from streaming classifiers

Human-Machine Feedback

- Interaction with human interfaces to implicitly weight, tune, and modify underlying models
- Visual strategies for bidirectional communication of and interaction with multiple hypotheses

Work Environments

- Integration framework and testing range
- Instrumentation to measure overall accuracy, utility, and throughput

May 2, 2023

AIM Program Area 1

Streaming Data Characterization

Compression Analysis (CA)
- Video compression algorithms provide an efficient means of detecting and classifying events in a stream
- Nonstandard features
- Became full project at mid-year

Scalable Feature Extraction and Sampling (SFE)
- Given a dataset, can we find a minimum subset that provides similar accuracy as the entire dataset?
- Parallel setting using MPI
- Open source library (MaTEX)


AIM Program Area 3


User-Centered Hypothesis Definition (UCHD)

- Transitioned to new PM and new technical focus in February
- What does a machine-generated hypothesis look like to a human analyst?

Science of Interaction (SOI)
- Use user clickstream data as an indicator of user sensemaking
- Developed and open-sourced the Streaming Canvas software
- UI engineering for use cases
- User studies


AIM Program Area 3


Mitigating Cognitive Depletion in Streaming Environments (CD)
- Predict and mitigate human performance degradation
- Quantify increase in error and impulsivity based on time from last break
- Studies using Halo and exam data
- User study planned

Figure: Halo: Reach, Kills / Deaths


Streaming Analytics

CHALLENGE

Craft machine-generated hypotheses as data arrive, steering data collection and using human feedback to tune a multi-classifier system.

PNNL IMPACT

Developing niche in interactive streaming analytics at scale; basis for invited keynotes at IEEE HCBDR, AAAS Big Data in Life Science, Data Science Innovation Summit, Science of Multi-INT.

Developed streaming automated detection of first point of failure in lithium battery through electron microscopy.

PNNL streaming architecture used as reference model for special programs sponsors.

Collaborators: Rensselaer Polytechnic, Laboratory for Analytic Sciences.

TXT VIS STREAM GRAPH STATS DATA PROV CYBER


Data Provenance & Workflow at Extreme Scale

CHALLENGE

Ensuring reliable performance and reproducibility of complex and adaptive workflows in extreme scale environments.

PNNL IMPACT

- Workflow Performance Provenance ontology captures performance and reproducibility metrics across the complete system and application stack, helping to identify causal relationships.
- ProvEn uses PNNL’s provenance ontology to record, correlate, and analyze events; distinguished from mainstream provenance by focusing on process not just data heritage.
- PNNL is informing ASCR directions for future provenance investments.


Protected Information | Proprietary Information

Project Approach


National Security Computing Program Areas


INFRASTRUCTURE
- Data and workflow management
- HPC programming models and libraries
- Power, performance, and reliability modeling
- Resiliency theory
- Mobile and edge computing
- Embedded systems
- Systems engineering and agile development
- Cloud and streaming architectures

ANALYTICS
- Modeling and simulation
- Data quality and provenance
- Sampling strategies
- Experimental design
- Human language technology
- Computer vision
- Large graph analysis
- Recommender systems
- Social and behavioral science

DECISION SUPPORT
- Visualization
- Human-computer interaction
- User experience design
- Semantic computing
- Operations research
- Test environments
- Analytic tradecraft and critical thinking
- Situational awareness
- Collaborative systems
- Training systems

MISSION AREAS AND OPERATIONAL DEPLOYMENT

Cyber analysis | Bio-surveillance | Social media analysis | Forensics | Emergency preparedness and response

Law enforcement | Critical infrastructure resiliency | Trafficking networks | Power grid management


Project Goals

Research Question

How do we structure the Semantic technology stack to consume and reason over a volatile data stream, and what are the effects of this configuration when expressing streaming data models through commercial off-the-shelf (COTS) reasoners?

Goals of Project

- Build prototype frameworks that consume streaming data into a Semantic Web stack
- Model streaming data in a Description Logic (DL) ontology and reason over the new graph using a set of DL-compliant reasoners
- Model streaming data into an ontology, DL or comparable rule set, that can be compared across reasoning clients
- Study the effects of cache maintenance, primarily data eviction, on the Semantic Web stack and results across reasoners
- Develop an engineering proposal to convert the prototypes into a single platform that can be deployed on cloud networks (AWS, PIC)


Project Approach

Propositional data are streaming in at a certain rate, and we can only see some “window” of them at any given time. We sample the data in the window and add them to a fixed-size cache.

We need effective methods of sampling. The fixed-size cache differentiates our framing of the problem from agglomerative databases (i.e., “just store everything”).
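One way to picture sampling a stream into a fixed-size cache is classic reservoir sampling (Algorithm R). The slides do not say which sampling method SHyRe uses, so the sketch below is only an illustration under that assumption; the class name `ReservoirCache` and its methods are hypothetical.

```python
import random
from typing import Any, List, Optional


class ReservoirCache:
    """Fixed-size cache holding a uniform random sample of the stream (Algorithm R)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0  # total number of items offered so far
        self._rng = random.Random(seed)

    def offer(self, item: Any) -> Optional[Any]:
        """Offer one item; return the evicted item, or None if nothing was evicted."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Cache not yet full: always keep the item.
            self.items.append(item)
            return None
        # Cache full: keep the new item with probability capacity / seen.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            evicted = self.items[slot]
            self.items[slot] = item
            return evicted
        return None  # the new item is dropped, the cache is unchanged


# Illustrative stream of placeholder triples (not real data).
cache = ReservoirCache(capacity=3, seed=0)
for i in range(1000):
    cache.offer(("ex:s%d" % i, "ex:p", "ex:o"))
```

The cache never grows past `capacity`, so memory use is bounded however long the stream runs. The evicted item returned by `offer` is what a reasoner would need to retract from its graph.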

Deductive reasoning is performed continuously over the cache to answer queries and to corroborate or refute hypotheses as quickly as possible.

Low-latency, high-throughput reasoning on ephemeral data is a hard, open problem.

There will likely be many conclusions to bring to the attention of the user, and so ranking is needed in order to prioritize attention.

The idea of ranking is not so hard, but determining the correct ordering is.


Approach


Engineering Approach


Four Concurrent States


SHyRe Decision Tree


5 Possible Outcomes:

1. Query Pellet with built-in JENA RDF functionality
2. Query Pellet with SPARQL Query
3. Encode SPARQL to URL format and CURL a triplestore endpoint
4. Use SNARL protocol to query StarDog with SPARQL Query
5. Use AGQuery protocol to query AllegroGraph with SPARQL Query

a. *RDFS++ Reasoning
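Outcome 3 (encoding SPARQL to URL format) can be sketched with the Python standard library. The SPARQL 1.1 Protocol passes the query text in a `query` parameter on GET requests. The endpoint URL and query below are placeholders, not taken from the SHyRe deployment, and the helper name `sparql_get_url` is hypothetical.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL with the query text percent-encoded."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": query})


# Hypothetical endpoint and query, for illustration only.
ENDPOINT = "http://localhost:8080/sparql"
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"

url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

The printed URL can then be handed to curl or any other HTTP client. Real triplestores usually need extra settings (authentication, result format) that are not shown here.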


Engineering Approach


Use Case 1: Nuclear Magnetic Resonance


What is Nuclear Magnetic Resonance?


NMR Accomplishments to Date

Research Question Answered
- By consuming an undefined count of scans, can we assemble an NMR run, model compounds within an ontology of background data, and then reason across this new combined model of compound and spectrum ontology?

Logic Constraints Answered
- Streaming data – When is a spectrum fully assembled?
- How do we decide which functions to model in the ontology, and which to apply to a query?

SHyRe NMR Model
- Description Logic background ontology of compound classes and peaks (Pellet implementation)
- RDFS background ontology of compound classes and peaks (StarDog / AllegroGraph implementations)
- Consume and model an NMR run from a stream of spectrum scans
- Query the NMR run after applying the compound background ontology




Use Case 2: Shipping a Strategic Surprise


How do we detect a Strategic Surprise?


Strategic Surprise Accomplishments to Date

Research Question Answered
- Based on a company’s import records, can we determine whether it is entering a new line of business (LOB)?

Logic Constraints Answered
- Streaming data – have to determine if a record might be important in the future
- Explain reasoning to enable user intervention / interaction and integration with other models

SHyRe Strategic Surprise Model
- Model each company by the HSCODEs it imports
- Identify companies that represent all companies in a LOB (the exemplars of the LOB)
- Use training data to get HSCODEs used by each exemplar
- Count the number of matching HSCODEs between monitored company and exemplars
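The matching step in the model above can be sketched as a set intersection per LOB, updated as import records stream in. This is an illustration only. The class `LobMonitor`, the company name, and the code values are hypothetical placeholders, not real HS codes or SHyRe identifiers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class LobMonitor:
    """Tracks the HSCODEs each company imports and counts overlaps with LOB exemplars."""

    def __init__(self, exemplar_codes: Dict[str, Iterable[str]]) -> None:
        # LOB name -> HSCODEs used by that LOB's exemplar companies (from training data).
        self.exemplar_codes: Dict[str, Set[str]] = {
            lob: set(codes) for lob, codes in exemplar_codes.items()
        }
        # Company name -> HSCODEs seen so far in the stream.
        self.company_codes: Dict[str, Set[str]] = defaultdict(set)

    def observe(self, company: str, hscode: str) -> None:
        """Record one streamed import record."""
        self.company_codes[company].add(hscode)

    def match_counts(self, company: str) -> Dict[str, int]:
        """Number of distinct HSCODEs the company shares with each LOB's exemplars."""
        seen = self.company_codes.get(company, set())
        return {lob: len(seen & codes) for lob, codes in self.exemplar_codes.items()}


# Placeholder exemplar data and records, for illustration only.
monitor = LobMonitor({"lob_a": {"A1", "A2", "A3"}, "lob_b": {"B1", "B2"}})
for record in [("acme", "A1"), ("acme", "B1"), ("acme", "A2"), ("acme", "A1")]:
    monitor.observe(*record)
counts = monitor.match_counts("acme")
```

A rising count for a LOB the company has not previously matched is the kind of signal a downstream rule or analyst could flag. The threshold for flagging is not given in the slides and is left out here.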



Figure: Required Input Records to Produce Output (series: Inputs, Outputs; output data labels: 0, 0, 15, 88, 129, 169)



Input Import Records | Output Results | CPU (seconds) | CPU (inputs / second)
0 | 0 | 1.292 |
1 | 0 | 1.693 |
10,000 | 15 | 77.619 | 128.834
20,000 | 88 | 185.553 | 107.786
30,000 | 129 | 330.895 | 90.663
40,000 | 169 | 508.902 | 78.601

Required Input Records to Produce Output
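The last column of the table appears to be input records divided by CPU seconds. The short check below recomputes it from the other columns, using only the four rows where the slide reports a rate. The variable names are illustrative.

```python
# (input records, output results, CPU seconds) copied from the table above.
ROWS = [
    (10_000, 15, 77.619),
    (20_000, 88, 185.553),
    (30_000, 129, 330.895),
    (40_000, 169, 508.902),
]


def throughput(inputs: int, cpu_seconds: float) -> float:
    """Input records processed per CPU second."""
    return inputs / cpu_seconds


rates = [round(throughput(n, secs), 3) for n, _, secs in ROWS]
for (n, outputs, secs), rate in zip(ROWS, rates):
    print(f"{n:>6} inputs, {outputs:>3} outputs, {secs:>8.3f} s, {rate:>8.3f} inputs/s")
```

The recomputed rates agree with the slide and fall from about 129 to about 79 inputs per second as the input count grows, so the per-record cost rises with the amount of data processed.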


Project Challenges


Challenges

Reasoning Differences in Standards (RDFS / OWL EL/DL / RDFS++)


Reasoner | Difficulty
Pellet | Nearly complete OWL DL, but not currently maintained.
StarDog | Strict separation of A-Box / T-Box reasoning within OWL DL across embedded Pellet and StarDog systems. Creates oddly formed, verbose SPARQL queries.
AllegroGraph | Proprietary reasoning with inconsistent standards.

Complex cache eviction algorithms and unsupported SPARQL standards

Reasoner | Difficulty
Pellet | Requires complex internal storage algorithms to manipulate memory graphs
StarDog | SPARQL DELETE can only support literal triples. Variables within a DELETE invoke background graph indexing and frequently fail.
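Reading "literal triples" as triples with no variables, one workaround is for the cache to remember exactly which triples it evicts and to name each one in a SPARQL 1.1 `DELETE DATA` update, which does not allow variables. The sketch below only builds the update string. It assumes every term is an IRI, the function name is hypothetical, and how the update is sent to the store is not shown.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject IRI, predicate IRI, object IRI)


def delete_data_update(triples: Iterable[Triple]) -> str:
    """Build a SPARQL 1.1 DELETE DATA update that names every evicted triple explicitly."""
    lines = []
    for triple in triples:
        for term in triple:
            # Reject characters that would break out of the <...> IRI syntax.
            if any(ch in term for ch in '<>" \n\t'):
                raise ValueError("not a plain IRI: %r" % (term,))
        lines.append("  <%s> <%s> <%s> ." % triple)
    return "DELETE DATA {\n" + "\n".join(lines) + "\n}"


# Placeholder IRIs, for illustration only.
evicted = [
    ("http://example.org/s1", "http://example.org/p", "http://example.org/o1"),
    ("http://example.org/s2", "http://example.org/p", "http://example.org/o2"),
]
update = delete_data_update(evicted)
print(update)
```

Because the update contains no variables, the store has no pattern to match and only has to remove the listed triples. Literal objects and blank nodes would need additional handling that this sketch leaves out.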


Conclusions

Contract with Rensselaer Polytechnic Institute
- Rui Yan and Yue Liu joined SHyRe team advised by Prof. Deborah McGuinness

Complete: International Conference for Biomedical Ontologies Paper
- William Smith, Alan Chapell, Courtney Courley

Complete: Smart Data 2015 Conference
- William Smith, Deborah McGuinness, Rui Yan

Complete: Conference on Information Knowledge Management 2015 Paper
- Mark Borkum, William Smith, Deborah McGuinness, Rui Yan, Yue Liu

Complete: ISWC 2015 Workshop Paper
- Rui Yan, Brenda Praggastis, William Smith, Deborah McGuinness

In Progress: Skolemization/Currying to enable decidable reasoning
- Patrick Paulson

In Progress: Journal of Web Semantics, Streaming Edition Paper

William Smith

Human Centered Analytics

[email protected]

+1.206.528.3356

SHYRE: Streaming Hypothesis Reasoning
aim.pnnl.gov

