Every Bit Counts
A semantic approach to the binary representation of data and programs

Andrew Kennedy & Dimitrios Vytiniotis
Microsoft Research Cambridge
{akenn,dimitris}@microsoft.com

ICFP 2010, Baltimore, MD



Before the fun starts: “is encoding and decoding relevant?”

Sure:

• How to design easy-to-verify tamper-proof bytecode formats?
  – Semi-formal work for Java [Franz et al.]
• How to incorporate semantic and statistical info for more compact encodings and compression?
  – Java bytecode and .NET CLI are quite bulky formats; also work on JavaScript compression schemes etc.
  – Lots of work in the XML world, oracle-based PCC checking [Necula and Rahul], term compression [J. Cheney], …
• How to make it easy to prove the correctness of a codec?
  – Lots of work in the generic programming realm, also PADS [Fisher et al.] …
• … and offer all that in a nice DSL?
  – Easier in use and verification than pickler combinators [Kennedy]

Let’s play a “guess who” game

I have some PL researcher in mind. Can you guess who?

Do they do research in functional programming? Yes
Do they care a lot about minimal invariance? No
Do they work on polytypic programming? Yes
Are they taller than 1.90m? Yes
Are they on the ICFP committee? No

Guess which program

I have some program* in mind. Can you guess which?

Question Answer
Is it a function application? No
(It must be a λ-expression.) Is the argument an Int? Yes
Is the body a variable? No
Is the body a function application? No
(It must be another λ-expression.) Is its argument an Int? Yes
Is its body a variable? Yes
Is it the most recently introduced variable? No

Aha! You thought of λx:Int.λy:Int.x

Code: 0100110

* A closed program in STLC with Int base type.

The idea

Represent a codec by a strategy for playing a question & answer guessing-game

• Encode – ask questions of the data and record the answers as a bitstream

• Decode – interpret the bitstream as answers to the same Q&A strategy

Example play, set-theoretically

All well-typed programs
  – Function application expressions
  – Lambda expressions
      – Non-Int-argument lambdas
      – Int-argument lambdas
          – Int-argument lambdas with variable body
          – Int-argument lambdas with non-variable body

Is it a function application? No.
Is its argument an Int? Yes.
Is its body a variable? No.

λx:Int.λy:Int.x

Legend: set of possible data values; binary partition of set; singleton set.

From sets to types

• Possible set of data values: type $t$
• Binary partition of set: type isomorphism $t \cong t_1 + t_2$
• Singleton set: type isomorphism $t \cong 1$
• Strategy: possibly-infinite binary decision tree whose
  – nodes contain type isomorphisms $t \cong t_1 + t_2$
  – leaves contain type isomorphisms $t \cong 1$

Or, in code:
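A minimal Haskell sketch of such a decision tree follows. The names `ISO`, `Iso`, `Single`, `Split`, `unitGame`, `boolGame` and `isLeaf` are assumptions chosen to agree with the `Game t` type used in the signatures later in the talk; they are not necessarily the authors' exact definitions.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- A pair of functions assumed (not checked) to be mutual inverses.
data ISO t s = Iso { to :: t -> s, from :: s -> t }

-- A strategy for guessing a value of type t.
data Game t
  = Single (ISO t ())                                -- leaf: t ≅ 1, nothing left to ask
  | forall t1 t2.
      Split (ISO t (Either t1 t2)) (Game t1) (Game t2)  -- node: t ≅ t1 + t2, one question

-- The trivial game: () has exactly one value.
unitGame :: Game ()
unitGame = Single (Iso id id)

-- A one-question game for Bool (illustrative example).
boolGame :: Game Bool
boolGame = Split (Iso ask unask) unitGame unitGame
  where
    ask b = if b then Left () else Right ()
    unask = either (const True) (const False)

-- True when no question remains to be asked.
isLeaf :: Game t -> Bool
isLeaf (Single _) = True
isLeaf _          = False
```

The existential types `t1` and `t2` let each subtree play a game over a different type from its parent, which is what allows a question to "deconstruct" the value.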

A silly game: unary naturals

[Tree diagram: each node asks “isZero?”, with outgoing edges labelled 0 and 1; the tree continues indefinitely (…).]

Infinite tree, crucially relying on laziness (co-induction in Coq)
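The unary game can be sketched in Haskell as a tree that refers to itself. This is an illustration under assumptions: the names `succIso`, `unaryNatGame` and `questionsAsked` are hypothetical, naturals are represented as non-negative `Integer`s, and the `Game` type is the assumed one from the types slide.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data ISO t s = Iso { to :: t -> s, from :: s -> t }

data Game t
  = Single (ISO t ())
  | forall t1 t2. Split (ISO t (Either t1 t2)) (Game t1) (Game t2)

unitGame :: Game ()
unitGame = Single (Iso id id)

-- Nat ≅ 1 + Nat: zero on the left, successors on the right.
-- Only meaningful for non-negative inputs.
succIso :: ISO Integer (Either () Integer)
succIso = Iso ask unask
  where
    ask 0 = Left ()
    ask n = Right (n - 1)
    unask (Left ()) = 0
    unask (Right n) = n + 1

-- An infinite tree: the right subtree is the game itself.
-- Laziness means only the part actually visited is ever built.
unaryNatGame :: Game Integer
unaryNatGame = Split succIso unitGame unaryNatGame

-- Count how many questions are asked before a leaf is reached.
questionsAsked :: Game t -> t -> Int
questionsAsked (Single _) _ = 0
questionsAsked (Split i l r) x =
  1 + either (questionsAsked l) (questionsAsked r) (to i x)
```

Under these definitions a natural `n` costs `n + 1` questions, which is why the game is "silly" as a compression scheme even though it is a perfectly valid strategy.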

Generic encoding and decoding

enc :: Game t -> t -> [Bit]

Encode a value $x$ of type $t$ to a bitstream:

• If $t$ is a singleton, there’s no information to encode!
• Otherwise, use the map from $t$ to $t_1 + t_2$ to “ask” in which partition $x$ lives
• Emit a bit and continue on the left or right subtree with the deconstructed value

dec :: Game t -> [Bit] -> (t, [Bit])

May throw an error if the bitstream is too short.
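The two functions described above can be sketched as follows. Two assumptions are made for this illustration: a left answer is written as `I` and a right answer as `O`, and `dec` returns `Maybe` instead of throwing an error on a short bitstream, so its type differs slightly from the signature on the slide. The unary naturals game is repeated so that the block stands alone.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data ISO t s = Iso { to :: t -> s, from :: s -> t }

data Game t
  = Single (ISO t ())
  | forall t1 t2. Split (ISO t (Either t1 t2)) (Game t1) (Game t2)

data Bit = O | I deriving (Eq, Show)

-- Ask the question at each node, emit the answer, continue in the chosen subtree.
enc :: Game t -> t -> [Bit]
enc (Single _) _ = []
enc (Split i l r) x = case to i x of
  Left a  -> I : enc l a
  Right b -> O : enc r b

-- Replay the recorded answers against the same tree.
-- Returns the decoded value and the unread bits, or Nothing if the input runs out.
dec :: Game t -> [Bit] -> Maybe (t, [Bit])
dec (Single i) bs = Just (from i (), bs)
dec (Split _ _ _) [] = Nothing
dec (Split i l _) (I : bs) = fmap (\(a, rest) -> (from i (Left a), rest)) (dec l bs)
dec (Split i _ r) (O : bs) = fmap (\(b, rest) -> (from i (Right b), rest)) (dec r bs)

-- Example game: unary naturals (non-negative Integers only).
unaryNatGame :: Game Integer
unaryNatGame = Split (Iso ask unask) (Single (Iso id id)) unaryNatGame
  where
    ask 0 = Left ()
    ask n = Right (n - 1)
    unask = either (const 0) (+ 1)
```

With these definitions, `enc unaryNatGame 3` yields `[O,O,O,I]`, and `dec unaryNatGame [O,O,O,I,I,O]` yields `Just (3,[I,O])`: decoding consumes exactly the bits that encoding produced and leaves the rest untouched.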

Correctness for free*

[Diagram: $\mathit{enc}\ g$ maps a value $x$ in the set to a bitstring (e.g. 01001010), and $\mathit{dec}\ g$ maps that bitstring back to $x$.]

* If the ISOs are indeed isomorphisms

Non-ambiguity and non-redundancy for free*

[Diagram: values $x$ and $y$ in the set, and bitstrings 01001010 and 01001110.]

• Unambiguous codes
• Non-redundant codes

* If the ISOs are indeed isomorphisms
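These properties can be spot-checked on a finite sample. The sketch below tests, rather than proves, three of them for the unary naturals game: decoding undoes encoding even with trailing bits, no code is a prefix of another, and a bitstring that decodes completely is exactly the code of the value it decodes to. The checker names `roundTrips`, `prefixFree` and `nonRedundant` are hypothetical, as is the `Maybe`-returning `dec`.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (isPrefixOf)

data ISO t s = Iso { to :: t -> s, from :: s -> t }

data Game t
  = Single (ISO t ())
  | forall t1 t2. Split (ISO t (Either t1 t2)) (Game t1) (Game t2)

data Bit = O | I deriving (Eq, Show)

enc :: Game t -> t -> [Bit]
enc (Single _) _ = []
enc (Split i l r) x = case to i x of
  Left a  -> I : enc l a
  Right b -> O : enc r b

dec :: Game t -> [Bit] -> Maybe (t, [Bit])
dec (Single i) bs = Just (from i (), bs)
dec (Split _ _ _) [] = Nothing
dec (Split i l _) (I : bs) = fmap (\(a, rest) -> (from i (Left a), rest)) (dec l bs)
dec (Split i _ r) (O : bs) = fmap (\(b, rest) -> (from i (Right b), rest)) (dec r bs)

unaryNatGame :: Game Integer
unaryNatGame = Split (Iso ask unask) (Single (Iso id id)) unaryNatGame
  where
    ask 0 = Left ()
    ask n = Right (n - 1)
    unask = either (const 0) (+ 1)

-- Correctness: decoding an encoding returns the value and any trailing bits.
roundTrips :: Eq t => Game t -> [t] -> Bool
roundTrips g xs = and [dec g (enc g x ++ junk) == Just (x, junk) | x <- xs]
  where junk = [O, I, I]

-- Unambiguous: no code is a prefix of the code of a different value.
prefixFree :: Eq t => Game t -> [t] -> Bool
prefixFree g xs = and [not (enc g x `isPrefixOf` enc g y) | x <- xs, y <- xs, x /= y]

-- Non-redundant: a bitstring that decodes with nothing left over is the code of its value.
nonRedundant :: Game t -> Int -> Bool
nonRedundant g maxLen =
  and [enc g x == bs | n <- [0 .. maxLen], bs <- sequence (replicate n [O, I])
                     , Just (x, []) <- [dec g bs]]
```

A finite check like this is no substitute for the proof, but it does catch an `Iso` whose two halves are not actually inverses.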

Game combinators

• Dependent pairs: type of second component depends on value of first
• Cast a game from one type to another through an isomorphism
• Given games for $t_1$ and $t_2$, construct games for sum or product of $t_1$ and $t_2$

Combinators in action!

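Combinators = co-fixpoints

prodGame :: Game t1 -> Game t2 -> Game (t1, t2)

[Diagram: given g1 :: Game t1 and g2 :: Game t2, the tree for prodGame g1 g2 is the tree for g1 with a copy of g2 attached at each leaf.]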
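The combinators listed above can be sketched in Haskell as follows. Only the name and type of `prodGame` come from the slides; `castGame`, `sumGame`, `depGame`, `distI`, `constGame` and `boolGame` are hypothetical names for this illustration, and defining `prodGame` in terms of `depGame` is a choice made here for brevity.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data ISO t s = Iso { to :: t -> s, from :: s -> t }

data Game t
  = Single (ISO t ())
  | forall t1 t2. Split (ISO t (Either t1 t2)) (Game t1) (Game t2)

data Bit = O | I deriving (Eq, Show)

enc :: Game t -> t -> [Bit]
enc (Single _) _ = []
enc (Split i l r) x = case to i x of
  Left a  -> I : enc l a
  Right b -> O : enc r b

-- Cast: turn a game for t into a game for s, given s ≅ t.
castGame :: ISO s t -> Game t -> Game s
castGame (Iso f g) (Single j)    = Single (Iso (to j . f) (g . from j))
castGame (Iso f g) (Split j l r) = Split (Iso (to j . f) (g . from j)) l r

-- Sum: a single question decides which summand the value is in.
sumGame :: Game t1 -> Game t2 -> Game (Either t1 t2)
sumGame = Split (Iso id id)

-- (t, s) ≅ (a, s) + (b, s) whenever t ≅ a + b.
distI :: ISO t (Either a b) -> ISO (t, s) (Either (a, s) (b, s))
distI i = Iso fwd bwd
  where
    fwd (x, y) = either (\a -> Left (a, y)) (\b -> Right (b, y)) (to i x)
    bwd = either (\(a, y) -> (from i (Left a), y)) (\(b, y) -> (from i (Right b), y))

-- Dependent pairs: play the first game to a leaf, then the game chosen by that value.
depGame :: Game t -> (t -> Game s) -> Game (t, s)
depGame (Single i) f = castGame (Iso snd (\y -> (x, y))) (f x)
  where x = from i ()
depGame (Split i l r) f =
  Split (distI i) (depGame l (f . from i . Left)) (depGame r (f . from i . Right))

-- Product: the second game does not depend on the first value.
prodGame :: Game t1 -> Game t2 -> Game (t1, t2)
prodGame g1 g2 = depGame g1 (const g2)

-- Small example games.
constGame :: t -> Game t
constGame x = Single (Iso (const ()) (const x))

boolGame :: Game Bool
boolGame = Split (Iso ask unask) (constGame ()) (constGame ())
  where
    ask b = if b then Left () else Right ()
    unask = either (const True) (const False)
```

For example, `enc (prodGame boolGame boolGame) (True, False)` yields `[I,O]`. With `depGame boolGame (\b -> if b then boolGame else constGame False)`, the pair `(False, False)` costs one bit (`[O]`) because the second component is already determined by the first, while `(True, True)` costs two (`[I,I]`).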

No silly questions please, and Every Bit Counts!

• If possible, strategy should not ask “silly questions” that reveal no new information, e.g.
    Are you a number smaller than 5? Yes
    Are you a number smaller than 7? Of course I am!
• This corresponds to proper partitioning: for all isos $t \cong t_1 + t_2$ in the game, the sets $t_1$ and $t_2$ are non-empty

Theorem: Suppose $g$ has proper partitioning, and there is a leaf for every element of its domain. If $\mathit{dec}\ g\ \mathit{bs}$ fails then there is some extension $\mathit{bs}'$ of $\mathit{bs}$ such that $\mathit{dec}\ g\ \mathit{bs}'$ succeeds.

But what does that mean?

Theorem: … blablablah …

EVERY bitstring represents a non-empty set of elements in the domain

That feels highly compact! Can we take this domain to be the “set of well-typed programs”?
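As a small illustration of proper partitioning, here is a sketch of a game for an integer interval that repeatedly halves the interval. Every question splits a set of at least two elements into two non-empty parts, so no question is silly. The name `rangeGame` and the `Maybe`-returning `dec` are assumptions made for this example, and `rangeGame lo hi` requires `lo <= hi`.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data ISO t s = Iso { to :: t -> s, from :: s -> t }

data Game t
  = Single (ISO t ())
  | forall t1 t2. Split (ISO t (Either t1 t2)) (Game t1) (Game t2)

data Bit = O | I deriving (Eq, Show)

enc :: Game t -> t -> [Bit]
enc (Single _) _ = []
enc (Split i l r) x = case to i x of
  Left a  -> I : enc l a
  Right b -> O : enc r b

dec :: Game t -> [Bit] -> Maybe (t, [Bit])
dec (Single i) bs = Just (from i (), bs)
dec (Split _ _ _) [] = Nothing
dec (Split i l _) (I : bs) = fmap (\(a, rest) -> (from i (Left a), rest)) (dec l bs)
dec (Split i _ r) (O : bs) = fmap (\(b, rest) -> (from i (Right b), rest)) (dec r bs)

-- Game for the integers lo..hi (inclusive); precondition: lo <= hi.
-- When lo < hi we have lo <= mid < hi, so both halves are non-empty.
rangeGame :: Int -> Int -> Game Int
rangeGame lo hi
  | lo == hi  = Single (Iso (const ()) (const lo))
  | otherwise = Split (Iso ask (either id id)) (rangeGame lo mid) (rangeGame (mid + 1) hi)
  where
    mid = (lo + hi) `div` 2
    ask x = if x <= mid then Left x else Right x

-- All bitstrings of a given length.
allBits :: Int -> [[Bit]]
allBits n = sequence (replicate n [O, I])
```

For `rangeGame 0 3`, all four two-bit strings decode, each to a different number. For `rangeGame 0 2`, the string `[I]` is too short to decode, but its extension `[I,O]` decodes to `1`, which is the situation the theorem describes.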

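Simple types

$$\tau ::= \mathit{nat} \mid \tau \to \tau$$

$$\frac{\Gamma \vdash e_1 : \tau_1 \to \tau_2 \qquad \Gamma \vdash e_2 : \tau_1}{\Gamma \vdash e_1\,e_2 : \tau_2} \qquad \frac{\Gamma, x{:}\tau_1 \vdash e : \tau_2}{\Gamma \vdash \lambda x{:}\tau_1.\,e : \tau_1 \to \tau_2} \qquad \frac{}{\Gamma, x{:}\tau, \Gamma' \vdash x : \tau}$$

Problem: Devise a game for STLC with no “silly questions”!

• Idea 1: Parameterize game on environment (for open terms) and type.

Not every environment/type combination is inhabited. To avoid asking “silly” questions (at game construction time, not at encoding/decoding time) we have to solve inhabitation problems.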

Some ingenuity required

Idea 2: Parameterize on environment and pattern of form […] where […] is a wildcard

All environment/pattern combinations are inhabited, so there is no need to solve hard problems at game construction time

A provably EVERY BIT COUNTS encoding for STLC (and the proof did not kill us)

The STLC game

• Can we play a game for variables with this pattern in this environment?
• Are you a variable?
• Are you an application?

Application game:
1. Play game for argument
2. *Get* the argument and play game for the function using the argument’s type
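The typed game is not reproduced here. As a simplified analogue, the sketch below plays the same "are you a variable? are you an application?" questions for untyped lambda terms with de Bruijn indices, parameterized on the number of variables in scope. It is not the STLC game from the talk: there are no types or patterns, and the application case uses a plain product instead of playing the argument first. All names (`Exp`, `expGame`, `nonVarGame`, `rangeGame`, `castGame`) are hypothetical.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data ISO t s = Iso { to :: t -> s, from :: s -> t }
data Game t
  = Single (ISO t ())
  | forall t1 t2. Split (ISO t (Either t1 t2)) (Game t1) (Game t2)

data Bit = O | I deriving (Eq, Show)

enc :: Game t -> t -> [Bit]
enc (Single _) _ = []
enc (Split i l r) x = case to i x of
  Left a  -> I : enc l a
  Right b -> O : enc r b

dec :: Game t -> [Bit] -> Maybe (t, [Bit])
dec (Single i) bs = Just (from i (), bs)
dec (Split _ _ _) [] = Nothing
dec (Split i l _) (I : bs) = fmap (\(a, rest) -> (from i (Left a), rest)) (dec l bs)
dec (Split i _ r) (O : bs) = fmap (\(b, rest) -> (from i (Right b), rest)) (dec r bs)

castGame :: ISO s t -> Game t -> Game s
castGame (Iso f g) (Single j)    = Single (Iso (to j . f) (g . from j))
castGame (Iso f g) (Split j l r) = Split (Iso (to j . f) (g . from j)) l r

prodGame :: Game a -> Game b -> Game (a, b)
prodGame (Single i) g2 = castGame (Iso snd (\y -> (from i (), y))) g2
prodGame (Split i l r) g2 = Split (Iso fwd bwd) (prodGame l g2) (prodGame r g2)
  where
    fwd (x, y) = either (\a -> Left (a, y)) (\b -> Right (b, y)) (to i x)
    bwd = either (\(a, y) -> (from i (Left a), y)) (\(b, y) -> (from i (Right b), y))

-- Integers lo..hi by repeated halving; precondition: lo <= hi.
rangeGame :: Int -> Int -> Game Int
rangeGame lo hi
  | lo == hi  = Single (Iso (const ()) (const lo))
  | otherwise = Split (Iso ask (either id id)) (rangeGame lo mid) (rangeGame (mid + 1) hi)
  where
    mid = (lo + hi) `div` 2
    ask x = if x <= mid then Left x else Right x

-- Untyped terms; Var 0 is the most recently bound variable.
data Exp = Var Int | Lam Exp | App Exp Exp deriving (Eq, Show)

-- Game for terms whose free indices are below n.
-- With no variables in scope, "are you a variable?" would be silly, so it is skipped.
expGame :: Int -> Game Exp
expGame 0 = nonVarGame 0
expGame n = Split (Iso ask (either Var id)) (rangeGame 0 (n - 1)) (nonVarGame n)
  where
    ask (Var i) = Left i
    ask e       = Right e

-- Game for non-variable terms: "are you an application?"
nonVarGame :: Int -> Game Exp
nonVarGame n = Split (Iso ask (either (uncurry App) Lam))
                     (prodGame (expGame n) (expGame n)) (expGame (n + 1))
  where
    ask (App f a) = Left (f, a)
    ask (Lam b)   = Right b
    ask (Var _)   = error "nonVarGame: variables are outside this game's domain"
```

Under these assumptions, `enc (expGame 0) (Lam (Lam (Var 1)))` yields `[O,O,O,I,O]`, and `Lam (Var 0)` encodes as `[O,I]`: with a single variable in scope there is nothing left to ask once the term is known to be a variable.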

Pearly too! Haskell code for STLC, on one slide

See paper for details, games for several statistical compression schemes, and even more game transformations

Future directions

Do it for real! E.g. .NET CIL, ghc Core

Integrate arithmetic coding. Put probabilities on arcs of tree.

Develop “methodology” for codecs for typed programming languages. (=> No ingenuity required?)

What’s left, after the fun?

An elegant characterization of codecs as Q&A strategies, and a DSL to program them

1. Q&A strategies can give rise to non-redundant, compact coding schemes
2. Offer cheap verification
3. And are fun to program with

Download and play: http://research.microsoft.com/people/dimitris