center for science of information

38
Science & Technology Centers Program Center for Science of Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego UIUC Center for Science of Information NSF Center for Science of Information: Overview & Results 1 National Science Foundation/Science & Technology Centers Program

Upload: deron

Post on 23-Feb-2016

38 views

Category:

Documents


0 download

DESCRIPTION

Center for Science of Information. NSF Center for Science of Information: Overview & Results. Outline. 1. Science of Information 2. Center Mission Integrated Research - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Bryn Mawr

Howard

MIT

Princeton

Purdue

Stanford

Texas A&M

UC Berkeley

UC San Diego

UIUC

1

Center forScience of Information

NSF Center for Science of Information:Overview & Results

National Science Foundation/Science & Technology Centers Program

Page 2: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Outline1. Science of Information 2. Center Mission Integrated Research Education and Diversity Knowledge Transfer

3. Research Graph Compression Deinterleaving Markov Processes

2

Page 3: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Shannon LegacyThe Information Revolution started in 1948, with the publication of:

A Mathematical Theory of Communication.

The digital age began.

Claude Shannon:Shannon information quantifies the extent to which a recipient of data can reduce its statistical uncertainty.“semantic aspects of communication are irrelevant . . .”

Fundamental Limits for Storage and Communication (Information Theory Field)

Applications Enabler/Driver:CD, iPod, DVD, video games, Internet, Facebook, WiFi, mobile, Google, . . Design Driver:universal data compression, voiceband modems, CDMA, multiantenna, discrete denoising, space-time codes, cryptography, . . .

3

Page 4: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

4

1. Back from infinity: Extend Shannon findings to finite size data structures (i.e., sequences, graphs), that is, develop information theory of various data structures beyond first-order asymptotics.

2. Science of Information: Extend Information Theory to meet new challenges in biology, massive data, communication, knowledge extraction, economics, …

In order to accomplish it we must understand new aspects of information in:

structure, time, space, and semantics, and

dynamic information, limited resources, complexity, representation-invariant information, and cooperation & dependency.

Post-Shannon Challenges

Page 5: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Post-Shannon ChallengesStructure:Measures are needed for quantifying information embodied in structures (e.g., information in material structures, nanostructures, biomolecules, gene regulatory networks, protein networks, social networks, financial transactions).

Szpankowski & Choi : Information containedin unlabeled graphs & universal graphical compression..Grama & Subramaniam : quantifying role of noise and incomplete data in biological network reconstruction.Neville: Outlining characteristics (e.g., weak dependence) sufficient for network models to be well-defined in the limit.Yu & Qi: Finding distributions of latent structures in social networks.

Page 6: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Post-Shannon ChallengesTime: Classical Information Theory is at its weakest in dealing with problems of delay (e.g., informationArriving late may be useless of has less value).

Verdu & Polyanskiy: major breakthrough in extending Shannon capacity theorem to finite blocklength information theory.Kumar: design reliable scheduling policies with delayconstraints for wireless networks.Weissman: real time coding system to investigate theimpact of delay on expected distortion.Subramaniam: reconstruct networks from dynamic biological data.Ramkrishna: quantify fluxes in biological networks byshowing that enzyme metabolism is regulated by a survival strategy in controlling enzyme syntheses.Space:Bialek: explores transmission of information in making spatial patterns in developing embryos – relation toinformation capacity of a communication channel.

Page 7: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Post-Shannon Challenges

7

Limited Resources: In many scenarios, information is limited by available computational resources (e.g., cell phone, living cell).

Bialek works on structure of molecular networks that optimize information flow, subject to constraints on the total number of molecules being used.

Verdu, Goldsmith investigates the minimum energy per bit as a function of data length in Gaussian channels.

Semantics:Is there a way to account for the meaning or semantics of information?

Sudan argues that the meaning of information becomes relevant whenever there is diversity across communicating parties and when parties themselves evolve over time.

Page 8: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Post-Shannon Challenges

8

Representation-invariance:How to know whether two representations of the same information are information equivalent?

Learnable Information (BigData):

Data driven science focuses on extracting information from data. How much information can actually be extracted from a given data repository? How much knowledge is in Google's database?

Atallah investigates how to enable the untrusted remote server to store and manipulate the client's confidential data. Clifton investigates differential privacy methods for graphs to determine conditions under which privacy is practically.Grama, Subramaniam and Szpankowski work on analysis of extreme-scale networks; such datasets are noisy and their analyses need fast probabilistic algorithms and compressed representations.

Page 9: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Post-Shannon Challenges

9

Cooperation & Dependency: How does cooperation impact information (nodes should cooperate in their own self-interest)? Cover initiates a theory of cooperation and coordination in networks, that is, they study the achievable joint distribution among network nodes, provided that the communication rates are given.

Dependency and rational expectation are critical ingredients in Sims' work on modern dynamic economic theory.

Coleman is studying statistical causality in neural systems by using Granger principles.

Quantum Information:The flow of information in macroscopic systems andmicroscopic systems may posses different characteristics.Aaronson and Shor lead these investigations. Aaronson develops computational complexity of linear systems.

Page 10: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Outline1. Science of Information

2. Center Mission1. Integrated Research2. Education and Diversity3. Knowledge Transfer

3. Research

10

Page 11: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Mission and Center Goals Advance science and technology through a new quantitative understanding of the representation, communication and processing of information in biological, physical, social and engineering systems.

Some Specific Center’s Goals:• define core theoretical principles governing transfer of information,• develop metrics and methods for information,• apply to problems in physical and social sciences, and engineering,• offer a venue for multi-disciplinary long-term collaborations,• explore effective ways to educate students, • train the next generation of researchers,• broaden participation of underrepresented groups,• transfer advances in research to education and industry

11

Education &Diversity

RESEARCH

Knowledge Transfer

Page 12: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

STC TeamBryn Mawr College: D. KumarHoward University: C. Liu, L. BurgeMIT: P. Shor (co-PI), M. Sudan (Microsoft)Purdue University: (lead): W. Szpankowski (PI)Princeton University: S. Verdu (co-PI)Stanford University: A. Goldsmith (co-PI)Texas A&M: P.R. KumarUniversity of California, Berkeley: Bin Yu (co-PI)University of California, San Diego: S. SubramaniamUIUC: O. Milenkovic.

Bin Yu, U.C. Berkeley

Sergio Verdú,Princeton

Peter Shor,MIT

Andrea Goldsmith,Stanford

Wojciech Szpankowski, Purdue

12

R. Aguilar, M. Atallah, C. Clifton, S. Datta, A. Grama, S. Jagannathan, A. Mathur, J. Neville, D. Ramkrishna, J. Rice, Z. Pizlo, L. Si, V. Rego, A. Qi, M. Ward, D. Blank, D. Xu, C. Liu, L. Burge, M. Garuba, S. Aaronson, N. Lynch, R. Rivest, Y. Polyanskiy, W. Bialek, S. Kulkarni, C. Sims, G. Bejerano, T. Cover, T. Weissman, V. Anantharam, J. Gallant, C. Kaufman, D. Tse,T.Coleman.

Page 13: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Center Participant Awards• Nobel Prize (Economics): Chris Sims

• National Academies (NAS/NAE) – Bialek, Cover, Datta, Lynch, Kumar, Ramkrishna, Rice, Rivest, Shor, Sims, Verdu.

• Turing award winner -- Rivest.

• Shannon award winners -- Cover and Verdu.

• Nevanlinna Prize (outstanding contributions in Mathematical Aspects of Information Sciences) -- Sudan and Shor.

• Richard W. Hamming Medal – Cover and Verdu.

• Humboldt Research Award -- Szpankowski.

13

Page 14: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

STC StaffDirector – Wojciech Szpankowski

Managing Director – Bob Brown

Education Director – Brent Ladd

Diversity Director – Barbara Gibson

Multimedia Specialist – Mike Atwell

Administrative Asst. – Kiya SmithKiya Smith

Administrative Asst.

Mike AtwellMultimedia Specialist

Brent LaddEducation Director

Bob BrownManaging Director

Barbara Gibson, Diversity Director

14

Page 15: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Integrated Research

15

Research Thrusts:

1. Life Sciences

2. Communication

3. Knowledge Management (BigData)

S. Subramaniam A. Grama

David Tse T. Weissman

S. Kulkarni M. Atallah

Create a shared intellectual space, integral to the Center’s activities, providing a collaborative research environment that crosses disciplinary and institutional boundaries.

Page 16: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

16

Opportunistic Research Workshops

Kickoff Workshop – Chicago, IL – October 6-7, 2010 (28 STC participants and students)1. Allerton Workshop, IL – September 28, 2010

2. Stanford Workshop – Palo Alto, CA – January 24, 2011 (Princeton – Stanford –

Berkeley)

3. UC San Diego Workshop – February 11, 2011 (Berkeley – UCSD)

4. Princeton Workshop – May 13, 2011 (Princeton – Purdue)

5. Purdue Workshop – September 9, 2011 (Berkeley – UCSD – Purdue)

6. Allerton Workshop – September 27, 2011 (Purdue –UIUC)

7. UCSD Workshop on Neuroscience, February 2012 (UCSD – Berkeley)

8. Princeton Grand & Petit Challenges Workshop, March 2012 (Center-wide)

9. Maui Workshop (ICC), (Industrial Round-Table), May, 2012 (Center-wide)

10. MIT Workshop, July 2012.

Page 17: Center for Science of Information

Science & Technology Centers Program

Center for Science of InformationGrand and Petits

Challenges1. Data Representation and Analysis• Succinct Data Structures (compressed structures that lend themselves to

operations efficiently)• Metrics: tradeoff storage efficiency and query cost• Validations on specific data (sequences, networks, structures, high dim. sets)• Streaming Data (correlation and anomaly detection, semantics, association)• Generative Models for data (dynamic model generations)• Identifying statistically correlated graph modules (e.g., given a graph with

edges weighted by node correlations and a generative model for graphs, identify the most statistically significantly correlated sun-networks)

• Information theoretic methods for inference of causality from data and modeling of agents.

• Complexity of flux models/networks

17

Page 18: Center for Science of Information

Science & Technology Centers Program

Center for Science of InformationGrand and Petits

Challenges2. Challenges in Life Science• Sequence Analysis

– Reconstructing sequences for emerging nano-pore sequencers.– Optimal error correction of NGS reads.

• Genome-wide associations– Rigorous approach to associations while accurately quantifying the prior in data

• Darwin Channel and Repetitive Channel• Semantic and syntactic integration of data from different datasets (with

same or different abstractions)• Construction of flux models guided measures of information theoretic

complexity of models• Quantification of query response quality in the presence of noise• Rethinking interactions and signals from fMRI• Develop in-vivo system to monitor 3D of animal brain (will allow to see the

interaction between different cells in cortex – flow of information, structural relations)

18

Page 19: Center for Science of Information

Science & Technology Centers Program

Center for Science of InformationGrand and Petits

Challenges3. Challenges in Economics

• To study the impact of information flow on dynamics of economic systems

• Impact of finite rate feedback and delay on system behavior

• Impact of agents with widely differing capabilities (information and computation) on overall system state, and on the state of individual agents.

• Choosing portfolios in the presence of information constraints.

19

Page 20: Center for Science of Information

Science & Technology Centers Program

Center for Science of InformationGrand and Petits

Challenges4. Challenges in Communications• Coordination over networks

– Game theoretic solution concept for coordination over networks– Applications in control.

• Relearning information theory in non-asymptotic regimes– Modeling energy, computational power, feedback, sampling,

and delay• Point to point communication with delay and/or finite rate feedback• Back from Infinity (analytic and non-asymptotic information theory)• Information and computation

a. Quantifying fundamental limits of in-network computation, and the computing capacity of networks for different functions

b. Complexity of distributed computation in wireless and wired networksc. Information theoretic study of aggregation for scalable query processing

20

Page 21: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Prestige Lecture Series

21

Page 22: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Outline1. Science of Information2. Center Mission

1. Integrated Research2. Education and Diversity3. Knowledge Transfer

3. Research Graph Compression Deinterleaving Markov Processes

22

Page 23: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

23

Information Theory of Data Structures: Following Ziv (1997) we propose toexplore finite size information theory of data structures (i.e., sequences,graphs), that is, to develop information theory of various data structuresbeyond first-order asymptotics. We focus here on information of graphicalstructures (unlabeled graphs).

F. Brooks, jr. (JACM, 50, 2003, “Three Great Challenges for . . .CS”):We have no theory that gives us a metric for the Information embodied instructure. This is the most fundamental gap in the theoretical underpinningsof information science and of computer science.

Networks (Internet, protein-protein interactions, and collaboration network)and Matter (chemicals and proteins) have structures.They can be abstracted by (unlabeled) graphs.

Structural Information

(Y. Choi, W.S., IEEE Trans. Info.Th., 58, 2012)

Page 24: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

24

Graph and Structural Entropy

24

Information Content of Unlabeled Graphs:A structure model S of a graph G is defined for an unlabeled version.Some labeled graphs have the same structure.

Graph Entropy vs Structural Entropy:

The probability of a structure S is: P(S) = N(S) · P(G)where N(S) is the number of different labeled graphs having the samestructure.

Page 25: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

25

Relationship between and

Page 26: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

26

Erdös-Rényi Graph Model

Page 27: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

27

Structural Entropy

Page 28: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

28

Compression Algorithm called Structural zip in short SZIP – Demo

Structural Zip (SZIP) Algorithm

Page 29: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

29

Asymptotic Optimality of SZIP

Page 30: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

Outline1. Science of Information2. Center Mission

1. Integrated Research2. Education and Diversity3. Knowledge Transfer

3. Research Graph Compression Deinterleaving Markov Processes

30

Page 31: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

31

Interleaved StreamsUser 1 output: a a b a b b a a a b a b a a b b b b a b a b b a b a a a a b b b

User 2 output: A B C A B C A B C A B C A B C A B C A B C A B C A B C A B

But only one log file is available, with no correlation between users: a A B a c b A a b b B a C a A B a b a b C a A B C A a b B b c

Questions:• Can we detect the existence of two users?• Can we de-interleave the two streams?

(G. Seroussi, W.S., M. Weinberger, IEEE Trans. Info. Th., 2012.)

Page 32: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

32

Probabilistic Model

Page 33: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

33

Problem StatementGiven a sample of , infer the partition(s) ( w.h.p. as ) P n

Is there a unique solution? (Not always!) Any partition of induces a representation of , but

• may not have finite memory or ‘s may not be independent• (switch) may not be memoryless (Markov)

For the problem was studied by Batu et al. (2004)• an approach to identify one valid IMP presentation of w.h.p.• greedy, very partial use of statistics slow convergence• but simple and fast

We propose an MDL approach for unknown • much better results, finds all partitions compatible with • Complex if implemented via exhaustive search, but randomized gradient

descent approximation achieves virtually the same results

' A P'iP 'iP'wP

1k

P

kP

Some Observations:

𝒛𝑛

Page 34: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

34

Uniqueness of RepresentationClearly, if is memoryless and then any re-partitioning of leads to another

partition compatible with• canonical partition (w.r.t. ): one in which such ‘s are singletons• canonical version of (both compatible with )

Theorem: If compatible with , then compatable with iff IMP presentation unique up to re-partitioning of “memoryless alphabets”

Proof idea (warm-up): :any string over with : sub-string “projected” over : a -tuple over with (there’s always one)

independent of

independent of ,

iP 1iA iAP

P iA* P

P ' P* '*

s iA ( ) 0iP s ' ' js s A ' jA

( ) 0iP u u k \ 'i jA A

( | ) ( | ) / ( ) ( | ') / ( )i w wP a s P a s P i P a s P i

( | ') ( | ' ) ( | )P a s P a s u P a u 's

( | )iP a s s ,a s

Page 35: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

35

Finding via Penalized ML

Page 36: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

36

Main Result

Theorem: Let and . For in the penalization we have

Proof Ideas:1. P is a FSM process.2. If P’ is the ``wrong’’ process (partition), then either one of the underlying process

or switch process will not be of finite order, or some of the independence assumptions will be violated.

3. The penalty for P’ will only increase.4. Rigorous proofs use large deviations results and Pinsker’s inequality over the

underlying state space.

1 2( , ,..., ; )m wP I P P P P max ( )i ik ord P

2

= argmin

Page 37: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

37

Experimental Results

Page 38: Center for Science of Information

Science & Technology Centers Program

Center for Science of Information

38

Generalization and Future Work A different Markov order can be estimated for each , as well as more general

tree models

Switch with memory• can be solved with similar tools

Future work: intersecting sub-alphabets• much more challenging!

iP