HOMELAND SECURITY RESEARCH AT DIMACS

Post on 15-Jan-2016


Working Group on Adverse Event/Disease Reporting, Surveillance, and Analysis

• Health surveillance is a core activity in public health.

• Concerns about bioterrorism have attracted attention to new surveillance methods:

– OTC drug sales

– Subway worker absenteeism

– Ambulance dispatches

• This creates a need for novel statistical methods for surveillance of multiple data streams.

[Figure labels: Disease Surveillance, Drug Safety Surveillance, Syndromic Surveillance, Vaccine Safety Surveillance]


Working Group on Privacy & Confidentiality of Health Data

• Privacy concerns are a major stumbling block to public health surveillance, in particular bioterrorism surveillance.

• Challenge: produce anonymous data specific enough for research.

• Exploring ways to remove identifiers (Social Security number, telephone number, ZIP code) from data sets.

• Exploring ways to aggregate or remove information in data sets.
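A minimal sketch of the two ideas above, removing identifiers and aggregating, might look like the following. The field names (`ssn`, `phone`, `zip`, `syndrome`), the sample records, and the choice of a three-digit ZIP prefix are assumptions made for illustration; the slides do not specify a method, and dropping these fields alone does not guarantee anonymity.

```python
from collections import Counter

# Hypothetical field names; the slides name only the kinds of identifiers.
DIRECT_IDENTIFIERS = ("ssn", "phone")


def deidentify(record, zip_digits=3):
    """Drop direct identifiers and coarsen the ZIP code to a prefix."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in cleaned:
        cleaned["zip"] = str(cleaned["zip"])[:zip_digits]
    return cleaned


def aggregate_counts(records, keys=("zip", "syndrome")):
    """Replace individual rows with counts per (coarsened) group."""
    return Counter(tuple(r.get(k) for k in keys) for r in records)


# Made-up records, for illustration only.
records = [
    {"ssn": "000-00-0001", "phone": "555-0100", "zip": "08854", "syndrome": "fever"},
    {"ssn": "000-00-0002", "phone": "555-0101", "zip": "08855", "syndrome": "fever"},
    {"ssn": "000-00-0003", "phone": "555-0102", "zip": "07102", "syndrome": "rash"},
]
anonymous = [deidentify(r) for r in records]
counts = aggregate_counts(anonymous)
```

The trade-off named in the challenge shows up in `zip_digits`: a shorter prefix reveals less about individuals but also leaves less geographic detail for research.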


Working Group on Analogies between Computer Viruses and Biological Viruses

• Can ideas for defending against biological viruses lead to ideas for defending against computer viruses?

• Concern: there is a large gap between the initial time of attack and the implementation of defensive strategies.

• “Public health” approach: Once a virus has infected a machine, it tries to connect to as many computers as possible, as fast as possible. A “throttle” limits the rate at which a computer can connect to new computers.

[Figure: number of infections over time, with phases labeled pre-attack, initial occurrence, and clean-up]
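The throttle idea can be sketched as a small rate limiter. This is an illustration under stated assumptions, not the working group's design: the class name `ConnectionThrottle`, the parameters `max_new_per_tick` and `recent_size`, and the notion of a discrete "tick" are all hypothetical. Connections to recently contacted hosts pass immediately; connections to new hosts are rate-limited and queued.

```python
from collections import deque


class ConnectionThrottle:
    """Limit how fast a machine may contact hosts it has not contacted recently."""

    def __init__(self, max_new_per_tick=1, recent_size=5):
        self.max_new_per_tick = max_new_per_tick
        self.recent = deque(maxlen=recent_size)  # recently contacted hosts
        self.pending = deque()                   # delayed requests to new hosts
        self.new_this_tick = 0

    def request(self, host):
        """Return True if the connection may proceed now, False if it is delayed."""
        if host in self.recent:
            return True  # familiar host: normal traffic is not slowed
        if self.new_this_tick < self.max_new_per_tick:
            self._allow(host)
            return True
        self.pending.append(host)
        return False

    def tick(self):
        """Advance one time step and release delayed hosts up to the rate limit."""
        self.new_this_tick = 0
        released = []
        while self.pending and self.new_this_tick < self.max_new_per_tick:
            host = self.pending.popleft()
            if host not in self.recent:
                self._allow(host)
            released.append(host)
        return released

    def _allow(self, host):
        self.recent.append(host)
        self.new_this_tick += 1


# A burst of connections to five new hosts, as a spreading virus might attempt.
throttle = ConnectionThrottle(max_new_per_tick=1)
results = [throttle.request(f"10.0.0.{i}") for i in range(5)]
released = throttle.tick()
```

Only the first new host gets through in the burst; the rest wait in the queue and are released one per tick, which slows spread without blocking ordinary traffic to familiar hosts.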


Working Group on Modeling Social Responses to Bioterrorism

• Models of the spread of infectious disease commonly assume passive bystanders and rational actors who will comply with health authorities.

• It is not clear how well this assumption applies to situations like a bioterrorist attack using smallpox or plague.

• An interdisciplinary group is discussing how to incorporate social behavior into models, models of public health decision making, and risk communication.

1947, NYC, smallpox outbreak


The Bioterrorism Sensor Location Problem

• Early warning is critical.

• This is a crucial factor underlying the government’s plans to place networks of sensors/detectors to warn of a bioterrorist attack.

The BASIS System


Two Fundamental Problems

• Sensor Location Problem (SLP):

– Choose an appropriate mix of sensors

– Decide where to locate them for best protection and early warning


Two Fundamental Problems

• Pattern Interpretation Problem (PIP): When sensors set off an alarm, help public health decision makers decide:

– Has an attack taken place?

– What additional monitoring is needed?

– What was its extent and location?

– What is an appropriate response?


Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages

Supported by Interagency KD-D Group


Motivation: monitoring email traffic

OBJECTIVE:

Monitor huge streams of textualized communication to automatically detect pattern changes and "significant" events


TECHNICAL PROBLEM:

• Given a stream of text in any language.

• Decide whether "new events" are present in the flow of messages.

• Event: a new topic, or a topic with an unusual level of activity.

• Retrospective or “Supervised” Event Identification: Classification into pre-existing classes.


TECHNICAL PROBLEM:

• Batch filtering: Relevant documents are given up front.

• Adaptive filtering: “Pay” for information about relevance as the process moves along.


MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR “UNSUPERVISED” LEARNING

• Classes change: new classes appear or existing classes change meaning

• A difficult problem in statistics

• Recent new C.S. approaches

“Semi-supervised Learning”:

• Algorithm suggests a new class

• Human analyst labels it and determines its significance


COMPONENTS OF AUTOMATIC MESSAGE PROCESSING

(1). Compression of Text -- to meet storage and processing limitations;

(2). Representation of Text -- put text in a form amenable to computation and statistical analysis;

(3). Matching Scheme -- computing similarity between documents;

(4). Learning Method -- build on judged examples to determine characteristics of a document cluster (“event”);

(5). Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.
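As an illustration of components (2) and (3), the sketch below represents each message by raw term frequencies and matches messages by cosine similarity. Both choices, along with the whitespace tokenizer, the function names, and the sample messages, are assumptions made for this example; the slides do not commit to a particular representation or matching scheme.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Representation: lowercase whitespace tokens mapped to raw counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Matching: cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up messages, for illustration only.
messages = [
    "outbreak reported in city hospital",
    "hospital reports outbreak in the city",
    "quarterly earnings rise",
]
vectors = [term_frequencies(m) for m in messages]
related = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
```

Here the two outbreak messages score well above the unrelated one, which shares no terms with them and therefore scores zero.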


COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II

•These distinctions are somewhat arbitrary.

•Many approaches to message processing span several of these components.

Existing methods don’t exploit the full power of the 5 components, synergies among them, and/or an understanding of how to apply them to text data.


COMPRESSION:

• Reduce the dimension before statistical analysis.

• We often have just one shot at the data as it comes “streaming by”.


COMPRESSION II:

• Recent results: a single pass through the data can reduce volume significantly without degrading performance significantly.

We believe that sophisticated dimension reduction methods in a preprocessing stage followed by sophisticated statistical tools in a detection/filtering stage can be a very powerful approach. Our methods so far give us some confidence that we are right.


COMPRESSION III:

•Three directions of work involving adaptation of nearest neighbor (NN) algorithms from theoretical computer science:

*Use of random projections into real subspaces. (Still promising, though not competitive for our data.)

*Random projections into Hamming cubes

*Efficient discovery of “deviant” cases in stream of vectorized entities
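To make the idea of projecting into a Hamming cube concrete, here is a sketch of sign random projection (random hyperplane hashing), a well-known technique that may differ from the construction the project actually used. Each real vector is mapped to a short bit string, one bit per random direction, and vectors separated by a small angle tend to receive codes with a small Hamming distance. The function names, bit count, and seed are assumptions for illustration.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions in dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def to_hamming_code(vector, hyperplanes):
    """Map a real vector to a bit tuple: one bit per hyperplane side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming_distance(a, b):
    """Count the positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


# Made-up term-count vector, for illustration only.
planes = make_hyperplanes(dim=4, n_bits=16)
doc = [3.0, 0.0, 1.0, 2.0]
code = to_hamming_code(doc, planes)
code_scaled = to_hamming_code([2.0 * x for x in doc], planes)
code_opposite = to_hamming_code([-x for x in doc], planes)
```

Scaling a vector leaves its code unchanged, since only the direction matters, while the opposite vector flips every bit. Nearest-neighbor search can then run on compact bit strings instead of the original high-dimensional vectors.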


MORE SOPHISTICATED STATISTICAL APPROACHES BEING STUDIED:

• Representations: Boolean representations; weighting schemes

• Matching Schemes: Boolean matching; nonlinear transforms of individual feature values

• Learning Methods: new kernel-based methods; more complex Bayes classifiers; boosting

• Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes
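One simple instance of rank-based fusion is to average each document's rank across methods. The sketch below is only an example of that general idea; the function names and scores are hypothetical, and the slides do not say which fusion rule the project used. Working with ranks sidesteps the fact that different methods produce scores on different scales.

```python
def ranks(scores):
    """Rank documents by score: the highest score gets rank 1 (ties broken by order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def fuse_by_mean_rank(score_lists):
    """Average each document's rank across methods; a lower fused value is better."""
    rank_lists = [ranks(scores) for scores in score_lists]
    n_docs = len(score_lists[0])
    return [sum(r[i] for r in rank_lists) / len(rank_lists) for i in range(n_docs)]


# Made-up scores for three documents from two methods on different scales.
method_a = [0.9, 0.2, 0.5]
method_b = [12.0, 3.0, 15.0]
fused = fuse_by_mean_rank([method_a, method_b])
```

In this example the two methods disagree on the top document, and the fused ranking treats documents 0 and 2 as tied ahead of document 1.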


DATA SETS USED:

• No readily available data set has all the characteristics of data on which we expect our methods to be used.

• However, many of our methods depend essentially only on term frequencies by document.

• Thus, many available data sets can be used for experimentation.


DATA SETS USED II:

TREC (Text Retrieval Conference) data: time-stamped subsets of the data (on the order of 10^5 to 10^6 messages)

Reuters Corpus Vol. 1 (8 x 10^5 messages)

Medline Abstracts (on the order of 10^7, with human indexing)


THE MONITORING MESSAGE STREAMS PROJECT TEAM:

Endre Boros, RUTCOR

Paul Kantor, SCILS

Dave Lewis, Consultant

Ilya Muchnik, DIMACS/CS

S. Muthukrishnan, CS

David Madigan, Statistics

Rafail Ostrovsky, Telcordia Technologies

Fred Roberts, Rutgers

Martin Strauss, AT&T Labs

Wen-Hua Ju, Avaya Labs (collaborator)