streaming hypothesis reasoning - william smith, jan 2016
TRANSCRIPT
SHyRe: Streaming Hypothesis Reasoning
WILLIAM SMITH, PATRICK PAULSON, MARK BORKUM, DEBORAH MCGUINNESS, BRENDA PRAGGASTIS, RUI YAN, YUE LIU
DAML 2016 – Seattle, WA
Smart Data Conference, 2015 – San Jose, California
January 26, 2016
The legends PROTECTED INFORMATION and PROPRIETARY INFORMATION apply to information describing Subject Inventions as defined in Contract No. DE-AC05-76RL01830 and any other information which may be properly withheld from public disclosure thereunder
DOE’s National Laboratories are Solving America’s Toughest Challenges
New Capability
Capability Needs
Mission Drivers
Analyzing Changing Online Landscapes
Seed LDRD Projects
- Signatures of Communities & Change
- Digital Currency Graph Forensics
- DarkNet Characterization
- Signatures in the Cloud
Signature Discovery Initiative (SDI)
Analysis in Motion (AIM)
National Security
Computing
Disrupting Illicit Trafficking
Nuclear Security
National Defense
Homeland Security
Special Programs
Seattle Innovation District
Asymmetric Resilient Cybersecurity (ARC)
Cyber-Physical Systems
Ubiquitous Sensing
Analysis in Motion
Streaming Data Characterization & Processing
- Library of foundational streaming algorithms, methods for extracting features from streams
- Data reduction techniques like semantic characterization
Hypothesis Generation & Testing
- Scalable symbolic deduction & incremental machine learning to track a stream
- Generate, update, and validate human-understandable hypotheses from streaming classifiers
Human-Machine Feedback
- Interaction with human interfaces to implicitly weight, tune, and modify underlying models
- Visual strategies for bidirectional communication of and interaction with multiple hypotheses
Work Environments
- Integration framework and testing range
- Instrumentation to measure overall accuracy, utility, and throughput
May 2, 2023
AIM Program Area 1
Streaming Data Characterization
Compression Analysis (CA)
- Video compression algorithms provide an efficient means of detecting and classifying events in a stream
- Nonstandard features
- Became full project at mid-year
Scalable Feature Extraction and Sampling (SFE)
- Given a dataset, can we find a minimum subset that provides similar accuracy as the entire dataset?
- Parallel setting using MPI
- Open source library (MaTEX)
AIM Program Area 3
Human-Machine Feedback
User-Centered Hypothesis Definition (UCHD)
- Transitioned to new PM and new technical focus in February
- What does a machine-generated hypothesis look like to a human analyst?
Science of Interaction (SOI)
- Use user clickstream data as an indicator of user sensemaking
- Developed and open-sourced the Streaming Canvas software
- UI engineering for use cases
- User studies
AIM Program Area 3
Human-Machine Feedback
Mitigating Cognitive Depletion in Streaming Environments (CD)
- Predict and mitigate human performance degradation
- Quantify increase in error and impulsivity based on time from last break
- Studies using Halo and exam data
- User study planned
Figure: Halo: Reach, Kills / Deaths
Streaming Analytics
CHALLENGE
Craft machine-generated hypotheses as data arrive, steering data collection and using human feedback to tune a multi-classifier system.
PNNL IMPACT
Developing niche in interactive streaming analytics at scale; basis for invited keynotes at IEEE HCBDR, AAAS Big Data in Life Science, Data Science Innovation Summit, Science of Multi-INT.
Developed streaming automated detection of first point of failure in lithium battery through electron microscopy.
PNNL streaming architecture used as reference model for special programs sponsors.
Collaborators: Rensselaer Polytechnic, Laboratory for Analytic Sciences.
TXT VIS STREAM GRAPH STATS DATA PROV CYBER
Data Provenance & Workflow at Extreme Scale
CHALLENGE
Ensuring reliable performance and reproducibility of complex and adaptive workflows in extreme scale environments.
PNNL IMPACT
Workflow Performance Provenance ontology captures performance and reproducibility metrics across the complete system and application stack, helping to identify causal relationships.
ProvEn uses PNNL’s provenance ontology to record, correlate, and analyze events; distinguished from mainstream provenance by focusing on process not just data heritage.
PNNL is informing ASCR directions for future provenance investments.
Protected Information | Proprietary Information
Project Approach
National Security Computing Program Areas
INFRASTRUCTURE
- Data and workflow management
- HPC programming models and libraries
- Power, performance, and reliability modeling
- Resiliency theory
- Mobile and edge computing
- Embedded systems
- Systems engineering and agile development
- Cloud and streaming architectures
ANALYTICS
- Modeling and simulation
- Data quality and provenance
- Sampling strategies
- Experimental design
- Human language technology
- Computer vision
- Large graph analysis
- Recommender systems
- Social and behavioral science
DECISION SUPPORT
- Visualization
- Human-computer interaction
- User experience design
- Semantic computing
- Operations research
- Test environments
- Analytic tradecraft and critical thinking
- Situational awareness
- Collaborative systems
- Training systems
MISSION AREAS AND OPERATIONAL DEPLOYMENT
Cyber analysis | Bio-surveillance | Social media analysis | Forensics | Emergency preparedness and response
Law enforcement | Critical infrastructure resiliency | Trafficking networks | Power grid management
Project Goals
Research Question
How do we structure the Semantic technology stack to consume and reason over a volatile data stream, and what are the effects of this configuration when expressing streaming data models through commercial off-the-shelf (COTS) reasoners?
Goals of Project
- Build prototype frameworks to consume streaming data into a Semantic Web stack
- Model streaming data in a Description Logic (DL) ontology and reason over the new graph using a set of DL-compliant reasoners
- Model streaming data into an ontology, DL or comparable rule set, that can be compared across reasoning clients
- Study the effects of cache maintenance, primarily data eviction, on the Semantic Web stack and results across reasoners
- Develop an engineering proposal to convert the prototypes into a single platform that can be deployed on cloud networks (AWS, PIC)
Project Approach
Propositional data are streaming in at a certain rate, and we can only see some “window” of them at any given time.
We sample the data in the window and add them to a fixed-size cache.
We need effective methods of sampling.
The fixed-size cache differentiates our framing of the problem from agglomerative databases (i.e., “just store everything”).
Deductive reasoning is continuously performed over the cache in order to try and answer queries and corroborate/refute hypotheses as quickly as possible.
Low-latency, high-throughput reasoning on ephemeral data is a hard, open problem.
There will likely be many conclusions to bring to the attention of the user, and so ranking is needed in order to prioritize attention.
The idea of ranking is not so hard, but determining the correct ordering is.
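The sample-into-a-fixed-size-cache step can be sketched as follows. This is a minimal illustration, not the SHyRe implementation: it assumes reservoir sampling (Algorithm R) as one possible sampling method, and the `FixedSizeCache` class, the `Triple` alias, and the `ex:` identifiers are hypothetical names introduced here.

```python
import random
from typing import List, Optional, Tuple

# Illustrative stand-in for one streamed proposition (subject, predicate, object).
Triple = Tuple[str, str, str]


class FixedSizeCache:
    """Fixed-capacity cache filled by reservoir sampling (Algorithm R)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[Triple] = []
        self.seen = 0  # propositions offered so far
        self._rng = random.Random(seed)

    def offer(self, item: Triple) -> Optional[Triple]:
        """Offer one proposition; return the evicted one, if any."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)  # cache not yet full: always keep
            return None
        slot = self._rng.randrange(self.seen)  # uniform in [0, seen)
        if slot < self.capacity:
            evicted = self.items[slot]
            self.items[slot] = item
            return evicted
        return None  # item was not sampled, so nothing was evicted


# Hypothetical stream of 1,000 propositions.
stream = [(f"ex:scan{i}", "ex:hasPeak", f"ex:peak{i % 7}") for i in range(1000)]
cache = FixedSizeCache(capacity=50, seed=42)
evictions = [e for e in map(cache.offer, stream) if e is not None]
print(len(cache.items), cache.seen, len(evictions))
```

Returning the evicted proposition is a design choice for this sketch: a downstream reasoner could use it to retract the corresponding statement, which is where the cache-maintenance effects studied in the project would show up.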
Approach
Engineering Approach
Four Concurrent States
SHyRe Decision Tree
5 Possible Outcomes:
1. Query Pellet with built in JENA RDF functionality
2. Query Pellet with SPARQL Query
3. Encode SPARQL to URL format and CURL a triplestore endpoint.
4. Use SNARL protocol to query StarDog with SPARQL Query
5. Use AGQuery protocol to query AllegroGraph with SPARQL Query
   a. *RDFS++ Reasoning
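Outcome 3 (encoding a SPARQL query into URL format) can be sketched with the Python standard library. The SPARQL 1.1 Protocol passes the query text in a URL-encoded `query` parameter on a GET request; the endpoint address, the query, and the `sparql_get_url` helper below are assumptions for illustration, not values from the project.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_get_url(endpoint: str, query: str) -> str:
    """Return a GET URL carrying the query in a URL-encoded `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical endpoint and query, for illustration only.
ENDPOINT = "http://localhost:8080/sparql"
QUERY = "SELECT ?s WHERE { ?s a <http://example.org/Compound> } LIMIT 10"

url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL recovers the original query text unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

The resulting URL can then be handed to a command-line HTTP client such as curl, typically with an `Accept` header (for example `application/sparql-results+json`) to select the result format.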
Engineering Approach
Use Case 1: Nuclear Magnetic Resonance
What is Nuclear Magnetic Resonance?
NMR Accomplishments to Date
Research Question Answered
- By consuming an undefined count of scans, can we assemble an NMR run, model compounds within an ontology of background data, and then reason across this new combined model of compound and spectrum ontology?
Logic Constraints Answered
- Streaming data – When is a spectrum fully assembled?
- How do we decide which functions to model in the ontology, and which to apply to a query?
SHyRe NMR Model
- Description Logic background ontology of compound classes and peaks (Pellet implementation)
- RDFS background ontology of compound classes and peaks (StarDog / AllegroGraph implementations)
- Consume and model an NMR run from a stream of spectrum scans
- Query the NMR run after applying the compound background ontology
Use Case 2: Shipping a Strategic Surprise
How do we detect a Strategic Surprise?
Strategic Surprise Accomplishments to Date
Research Question Answered
- Based on a company’s import records, can we determine if they are entering a new line of business (LOB)?
Logic Constraints Answered
- Streaming data – have to determine if a record might be important in the future
- Explain reasoning to enable user intervention / interaction and integration with other models
SHyRe Strategic Surprise Model
- Model each company by the HSCODEs it imports
- Identify companies that represent all companies in a LOB (exemplars of the LOB)
- Use training data to get HSCODEs used by each exemplar
- Count the number of matching HSCODEs between monitored company and exemplars
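The matching step in the model above can be sketched with set intersections. This is an illustrative sketch only: the company names, LOB labels, HSCODE values, and the two helper functions are hypothetical, and the real model may weight or threshold matches differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def hscodes_by_company(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Model each company by the set of HSCODEs seen in its import records."""
    profile: Dict[str, Set[str]] = defaultdict(set)
    for company, hscode in records:
        profile[company].add(hscode)
    return dict(profile)


def match_counts(monitored: Set[str], exemplars: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count HSCODEs shared between a monitored company and each LOB exemplar."""
    return {lob: len(monitored & codes) for lob, codes in exemplars.items()}


# Hypothetical training records: (company, HSCODE).
training = [
    ("ExemplarElectronics", "8471"),
    ("ExemplarElectronics", "8542"),
    ("ExemplarTextiles", "5208"),
    ("ExemplarTextiles", "6109"),
]
profiles = hscodes_by_company(training)
exemplars = {
    "electronics": profiles["ExemplarElectronics"],
    "textiles": profiles["ExemplarTextiles"],
}

# Hypothetical monitored company: a textile importer that starts importing electronics codes.
monitored = {"6109", "8471", "8542"}
counts = match_counts(monitored, exemplars)
print(counts)
```

Because the counts are derived from named HSCODE overlaps, the shared codes themselves can be shown to an analyst, which fits the stated constraint of explaining the reasoning.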
Strategic Surprise Accomplishments to Date
Figure: Required Input Records to Produce Output (series: Inputs and Outputs)
Strategic Surprise Accomplishments to Date
Input Import Records | Output Results | CPU (seconds) | CPU (inputs / second)
0 | 0 | 1.292 | -
1 | 0 | 1.693 | -
10,000 | 15 | 77.619 | 128.834
20,000 | 88 | 185.553 | 107.786
30,000 | 129 | 330.895 | 90.663
40,000 | 169 | 508.902 | 78.601
Required Input Records to Produce Output
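The throughput column is the input record count divided by CPU seconds. The short sketch below recomputes it from the reported figures and also derives the extra CPU time spent on each additional 10,000 records, which is a derived quantity not shown on the slide; the variable and function names are introduced here for illustration.

```python
# (input import records, CPU seconds) as reported in the table.
rows = [
    (10_000, 77.619),
    (20_000, 185.553),
    (30_000, 330.895),
    (40_000, 508.902),
]


def throughput(inputs: int, cpu_seconds: float) -> float:
    """Inputs processed per CPU second."""
    return inputs / cpu_seconds


rates = [throughput(n, s) for n, s in rows]

# CPU seconds spent on each additional batch of 10,000 records.
marginal = [later[1] - earlier[1] for earlier, later in zip(rows, rows[1:])]

for (n, s), rate in zip(rows, rates):
    print(f"{n:>6} inputs  {s:>8.3f} s  {rate:>8.3f} inputs/s")
print("marginal seconds per 10,000 records:", [round(m, 3) for m in marginal])
```

Throughput falls as more records are consumed, and each successive batch costs more CPU time than the one before it, so processing cost grows faster than linearly over this range.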
Project Challenges
Challenges
Reasoning Differences in Standards (RDFS / OWL EL/DL / RDFS++)
Reasoner | Difficulty
Pellet | Nearly complete OWL DL, but not currently maintained.
StarDog | Strict separation of A-Box / T-Box reasoning within OWL DL across embedded Pellet and StarDog systems. Creates oddly formed, verbose SPARQL queries.
AllegroGraph | Proprietary reasoning with inconsistent standards.
Complex cache eviction algorithms and unsupported SPARQL standards
Reasoner | Difficulty
Pellet | Requires complex internal storage algorithms to manipulate memory graphs
StarDog | SPARQL DELETE can only support literal triples. Variables within a DELETE invoke background graph indexing and frequently fail.
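One way to work within the DELETE limitation is to evict cache entries with SPARQL 1.1 `DELETE DATA`, which accepts only fully specified triples and no variables. The sketch below assumes that "literal triples" means such fully specified triples; the `delete_data_update` helper and the `example.org` identifiers are hypothetical, and this is not presented as the approach the project used.

```python
from typing import Iterable, Tuple

# An already-serialized RDF term, e.g. "<http://example.org/s>" or '"42"'.
Term = str


def delete_data_update(triples: Iterable[Tuple[Term, Term, Term]]) -> str:
    """Build a SPARQL 1.1 DELETE DATA request from fully specified triples."""
    lines = []
    for triple in triples:
        for term in triple:
            # DELETE DATA forbids variables and blank nodes.
            if term.startswith(("?", "$", "_:")):
                raise ValueError(f"not a ground term: {term}")
        lines.append("  {} {} {} .".format(*triple))
    return "DELETE DATA {\n" + "\n".join(lines) + "\n}"


# Hypothetical triples evicted from the cache.
evicted = [
    ("<http://example.org/scan1>", "<http://example.org/hasPeak>", '"7.26"'),
    ("<http://example.org/scan2>", "<http://example.org/hasPeak>", '"2.05"'),
]
update = delete_data_update(evicted)
print(update)
```

Listing each evicted triple explicitly means the cache layer has to remember exactly what it inserted, which is one source of the eviction complexity noted above.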
Conclusions
Contract with Rensselaer Polytechnic Institute
Rui Yan and Yue Liu joined SHyRe team advised by Prof. Deborah McGuinness
Complete: International Conference for Biomedical Ontologies Paper
William Smith, Alan Chapell, Courtney Courley
Complete: Smart Data 2015 Conference
William Smith, Deborah McGuinness, Rui Yan
Complete: Conference on Information Knowledge Management 2015 Paper
Mark Borkum, William Smith, Deborah McGuinness, Rui Yan, Yue Liu
Complete: ISWC 2015 Workshop Paper
Rui Yan, Brenda Praggastis, William Smith, Deborah McGuinness
In Progress: Skolemization/Currying to enable decidable reasoning
Patrick Paulson
In Progress: Journal of Web Semantics, Streaming Edition Paper
William Smith
Human Centered Analytics
+1.206.528.3356
SHYRE: Streaming Hypothesis Reasoning
aim.pnnl.gov