
J Log Lang Inf (2012) 21:163–188
DOI 10.1007/s10849-011-9142-0

Probability as a Measure of Information Added

Peter Milne

Published online: 24 May 2011
© Springer Science+Business Media B.V. 2011

Abstract Some propositions add more information to bodies of propositions than do others. We start with intuitive considerations on qualitative comparisons of information added. Central to these are considerations bearing on conjunctions and on negations. We find that we can discern two distinct, incompatible, notions of information added. From the comparative notions we pass to quantitative measurement of information added. In this we borrow heavily from the literature on quantitative representations of qualitative, comparative conditional probability. We look at two ways to obtain a quantitative conception of information added. One, the most direct, mirrors Bernard Koopman's construction of conditional probability: by making a strong structural assumption, it leads to a measure that is, transparently, some function of a function P which is, formally, an assignment of conditional probability (in fact, a Popper function). P reverses the information added order and mislocates the natural zero of the scale so some transformation of this scale is needed but the derivation of P falls out so readily that no particular transformation suggests itself. The Cox–Good–Aczél method assumes the existence of a quantitative measure matching the qualitative relation, and builds on the structural constraints to obtain a measure of information that can be rescaled as, formally, an assignment of conditional probability. A classical result of Cantor's, subsequently strengthened by Debreu, goes some way towards justifying the assumption of the existence of a quantitative scale. What the two approaches give us is a pointer towards a novel interpretation of probability as a rescaling of a measure of information added.

Keywords Information · Probability · Comparative probability · Koopman · Cox

P. Milne (B)
University of Stirling, Stirling, United Kingdom
e-mail: [email protected]


1 Introduction

My purpose is to present, if not a new interpretation of probability as an explication of an everyday conception labelled 'probability', then, at least, new interpretations of two calculi of conditional probability, to be understood as governing rescalings of measures of information added. This accords with the spirit, if not the letter, of Kolmogorov's belief in the applicability of a formal, mathematical theory 'whose formulas can be applied both to probability calculus and to many other fields of pure and applied mathematics', complying with its 'intrinsic logical structure', but having 'nothing to do with the specific meaning of the theory', i.e. with probability as ordinarily understood (even by mathematicians) (Kolmogorov 1929, p. 48).1

In information theory, one starts with a (standard) probability distribution P over signal elements and takes −log P as a measure of the information—in the sense of "surprise value"—of the signal elements.2 I want to turn the orthodox information-theoretic relationship on its head: to start from qualitative considerations on the information added by a proposition α to a body of propositions Γ, and thence to obtain a quantitative, additive measure of information i, additive in the sense that

i(α ∧ β, Γ) = i(α, Γ) + i(β, Γ ∪ {α});3

under such a measure

e^(−i(α,Γ)×c)

will turn out to be, for the appropriate value of c, formally, an assignment of conditional probability (more or less).
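By way of a concrete illustration, here is a minimal sketch in Python, run in the opposite, orthodox direction: it assumes we already have a conditional probability P over a toy example (the table of values and the key names are purely illustrative) and checks that i = −log P is additive in the above sense and that exp(−i) takes us back to P (i.e. c = 1 when natural logarithms are used).

```python
import math

# A toy conditional probability over two independent fair-coin propositions a and b.
# Keys are (event, evidence) pairs; the numbers are illustrative.
P = {
    ('a', ()): 0.5,         # P(a)
    ('b', ()): 0.5,         # P(b)
    ('b', ('a',)): 0.5,     # P(b | a)
    ('a&b', ()): 0.25,      # P(a & b)
}

def info_added(event, evidence=()):
    """i = -log P: the information the event adds relative to the evidence."""
    return -math.log(P[(event, evidence)])

# additivity: i(a & b, {}) = i(a, {}) + i(b, {a})
assert math.isclose(info_added('a&b'), info_added('a') + info_added('b', ('a',)))
# the exponential rescaling exp(-i) takes us back to the conditional probability
assert math.isclose(math.exp(-info_added('b', ('a',))), P[('b', ('a',))])
print("additivity and the rescaling exp(-i) check out on this toy example")
```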

I shall borrow heavily from the literature on quantitative measures of comparative conditional probability in order that we may arrive at quantitative measures of information added. As we shall see, notions of (conditional) probability can then readily be extracted. Of the two conceptions of information added that we investigate, one yields a calculus much more familiar than the other: a slight variation on Popper functions versus a variation on what Charles Morgan and Edwin Mares call core probability functions (Morgan and Mares 1995).

What is novel here is mostly the interpretation put on familiar formal results. In the next section I shall go quite carefully through the considerations that put us in

1 The spirit but not the letter because Kolmogorov was thinking of applications of a formal probability calculus grounded in general measure theory.
2 See, e.g., Osteyee and Good (1974, Ch. I). Alternately, one considers information not a property of individual signal elements but of the totality. One then starts from a distribution of probabilities over a partition and lays down constraints on measures of "information" or "entropy" or "uncertainty" of the partition; for an exhaustive mathematical treatment see Aczél and Daróczy (1975), for a more recent survey in briefer compass see Csiszár (2008). An approach that has little in common with the present one, save that it does not presuppose a probability distribution, instead lays down axiomatic constraints on measures of information/entropy of whole Boolean algebras and derives a probability measure (Ingarden and Urbanik 1962).
3 Cf. Carnap and Bar-Hillel's notion of the 'content-measure of j relative to i' (Carnap and Bar-Hillel 1953, p. 150).


a position to appeal to those known results, in part because it is interesting in itself to see how much follows from intuitive and (apparently) weak assumptions—including the incompatibility of the two conceptions. In later sections I shall use but not give proofs of familiar results, largely because the proofs are lengthy and involved; adequate references to sources in the literature will be given.

2 Intuitive, Comparative Information Added

Our starting point is the thought that given a sentence α and a collection of sentences Γ, α adds information to Γ: in general the set Γ ∪ {α} is more informative than Γ, an obvious exception being when Γ entails α, for then the content of α is already contained in Γ. (Γ need not be consistent, but if it is not then, for the reason just given, no information can be added to it.) What has just been said holds good on two ways of thinking about information added: according to one, α adds the more information to Γ the more it tells us that is not already in Γ; according to the other, α adds the more information to Γ the more it rules out possibilities left open by Γ. The encyclopaedia editor values her contributor's piece the more the more it says that isn't already included elsewhere in the encyclopaedia; on the other hand, the detective values new evidence the more the more it rules out possibilities left open by the evidence gathered to date. To see the difference, suppose that Γ makes no claim as to the outcome of a lottery, and that α and β are possible outcomes. Then α and β are pretty much on a par as to what is novel in each; but the possibilities they rule out may differ a lot: all the difference between win and loss, perhaps, if Γ includes details of your lottery ticket. That's rough but it will do to be going on with; the second of the two notions in particular will be subject to refinement below.

On both conceptions, we can, for each sentence α and set of sentences Γ, consider the amount of information α adds to Γ. Although we will arrive at quantitative measures, for the moment this is not to be understood in any narrowly quantitative sense, any more than one thinks of strictly quantitative replies to 'How much do you love me?'—river deep, mountain high, perhaps, but there's no SI unit of love. But while quantitative analyses may be hard to make, comparisons are not, as in 'You love your dog/golf clubs/mother [delete as applicable] more than you love me!' (Ouch!) There's more structure, more logical structure at least, in the case of information added but even so, it would be nice to know what, if anything, entitles us to suppose that there is a quantitative measure. To get to that, we start with the qualitative. Until such point as a difference is explicitly marked, what follows holds good on either of the two ways of thinking about information added suggested above.

We read

〈α, Γ〉 ≼ 〈β, Δ〉

as the comparison

the sentence β adds at least as much information to the set of sentences Δ as the sentence α adds to the set of sentences Γ,


and

〈α, Γ〉 ≺ 〈β, Δ〉

as the comparison

the sentence β adds more information to the set of sentences Δ than the sentence α adds to the set of sentences Γ.

We write

〈α, Γ〉 ≈ 〈β, Δ〉

for

〈α, Γ〉 ≼ 〈β, Δ〉 and 〈β, Δ〉 ≼ 〈α, Γ〉.

Obviously,

〈α, Γ〉 ≺ 〈β, Δ〉 if, and only if, 〈α, Γ〉 ≼ 〈β, Δ〉 and 〈β, Δ〉 ⋠ 〈α, Γ〉.

2.1 General

This qualitative relation ≼ is evidently reflexive and transitive while ≺ is irreflexive and transitive. What other properties do they have? More particularly, in what ways are they responsive to logical structure and relations? Here are a few general principles the encyclopaedia editor and the detective can agree on.

(i) If α ∈ Γ then, for all β and Δ, 〈α, Γ〉 ≼ 〈β, Δ〉.

Justification Whatever β adds to Δ, the amount of information added cannot be less than when no information is added, as it is not if α already belongs to Γ.

(ii) For some α and Γ for which Γ ∪ {α} is consistent, 〈α, Γ ∪ {α}〉 ≺ 〈α, Γ〉.

It surely is not the case that the only way to really add information is to add a proposition inconsistent with what was there before, i.e., with Γ. This constraint ensures that the relations ≺ and ≼ are non-trivial (but not, as yet, necessarily interestingly non-trivial).

(iii) If α ⊢ β then 〈β, Γ〉 ≼ 〈α, Γ〉.4

When α entails β, everything β says is implicit, if not explicit, in α and so α must add at least as much information to any set of sentences Γ as does β.

(iv) If, for all β ∈ Δ, Γ ⊢ β and, for all γ ∈ Γ, Δ ⊢ γ then 〈α, Γ〉 ≈ 〈α, Δ〉.

One and the same sentence cannot add different amounts of information to bodies of propositions that have essentially the same content, for in effect the antecedent

4 The logic in use below is the ∧,∨, ¬-fragment of classical propositional logic.


captures the idea that Γ and Δ contain the same information, even if expressed in different ways. We are tacitly invoking a venerable tradition here in taking Γ and Δ to have the same content if everything in one is a consequence of the other. In effect, we equate the information in a body of propositions with what can be drawn from that body. This is in some respects an idealised notion: we idealise away from the matter of how information is (re-)presented and how it is to be extracted. Nevertheless, this does accord with a notion of information or content that we have. For example, different formulations of a scientific theory are different formulations of the same theory in virtue of having the same consequences.

2.2 Conjunction and Negation

What of particular logical operations?

(v)(a) If 〈α, Γ〉 ≼ 〈β, Δ〉 and 〈γ, Γ ∪ {α}〉 ≼ 〈δ, Δ ∪ {β}〉 then 〈α ∧ γ, Γ〉 ≼ 〈β ∧ δ, Δ〉;

(v)(b) if 〈α, Γ〉 ≺ 〈β, Δ〉 and 〈γ, Γ ∪ {α}〉 ≼ 〈δ, Δ ∪ {β}〉 then 〈α ∧ γ, Γ〉 ≺ 〈β ∧ δ, Δ〉 except perhaps when 〈γ, Γ ∪ {α}〉 has attained some upper limit;

(v)(c) if 〈α, Γ〉 ≼ 〈β, Δ〉 and 〈γ, Γ ∪ {α}〉 ≺ 〈δ, Δ ∪ {β}〉 then 〈α ∧ γ, Γ〉 ≺ 〈β ∧ δ, Δ〉 except perhaps when 〈α, Γ〉 has attained some upper limit.

(v)(d) 〈α ∧ γ, Γ ∪ {α}〉 ≈ 〈γ, Γ ∪ {α}〉 ≼ 〈α ∧ γ, Γ〉.

Speaking logically, conjunction just is a device for tacking together two items. With that in hand, for both the encyclopaedia editor and the detective what information is added by a conjunction α ∧ γ is therefore the same whether the conjuncts are added consecutively, or in one go as a conjunction. And so if α adds to Γ less/no more information than β adds to Δ, and if γ adds to Γ, over and above what α does, less/no more information than δ adds, over and above what β does, to Δ, the conjunction α ∧ γ must add less/cannot add more to Γ than β ∧ δ adds to Δ—unless one conjunct adds so much on its own that it completely swamps the contribution of the other conjunct, as may be possible if that conjunct adds some maximal amount (should there be a maximum).5 We obtain (v)(a)–(c) when we equate the amount of information added to Γ by γ, over and above the information added to Γ by α, with the amount of information added by γ to the set containing α together with all the sentences in Γ, and this makes sense because the information in Γ ∪ {α} just is what results from adding the information in α to the information already in Γ. So (v)(a)–(c) seem inescapable. As does (v)(d), for the amount of information α ∧ γ adds to Γ must be at least as great as what γ adds to Γ ∪ {α}, i.e., as what γ adds over and above what

5 As yet we haven't established that there is an upper limit to information added; we're just playing safe, introducing what is, with a little bit of thought, a fairly obvious qualification. The obvious way, but we do not insist the only way, to achieve swamping is by adding inconsistent information. The probability analogue is that P(α ∧ β|γ) = 0 when P(α|γ) = 0, no matter what the value of P(β|γ). Making the swamping metaphor vivid, −log P(α ∧ β|γ) = ∞ when −log P(α|γ) = ∞, no matter what the value of −log P(β|γ).


α adds and since α adds no information to Γ ∪ {α}, all the information α ∧ γ adds to Γ ∪ {α} is, given the logical role of conjunction, down to what γ adds to Γ ∪ {α}.

From (iii) it follows that logically equivalent sentences add the same amount of information, and from (iii) and (v)(d) it follows that

〈α, Γ〉 ≼ 〈α ∧ β, Γ〉, 〈β, Γ〉 ≼ 〈α ∧ β, Γ〉, 〈β, Γ ∪ {α}〉 ≼ 〈α ∧ β, Γ〉 and 〈α, Γ ∪ {β}〉 ≼ 〈α ∧ β, Γ〉.

Also, it follows from (i) and (iv) that if Γ is inconsistent, then, for all α, β and Δ, 〈α, Γ〉 ≼ 〈β, Δ〉.

As a strengthening of (iii) we have that 〈β, Γ〉 ≼ 〈α, Γ〉 when Γ ∪ {α} ⊢ β.

Proof By (iii), 〈β, Γ〉 ≼ 〈α ∧ β, Γ〉; by (iii) again, 〈α ∧ β, Γ〉 ≈ 〈α ∧ (¬α ∨ β), Γ〉; by (iv), 〈α ∧ (¬α ∨ β), Γ〉 ≈ 〈α ∧ (¬α ∨ β), Γ ∪ {¬α ∨ β}〉 as Γ ∪ {α} ⊢ β; by (v)(d), 〈α ∧ (¬α ∨ β), Γ ∪ {¬α ∨ β}〉 ≈ 〈α, Γ ∪ {¬α ∨ β}〉; finally, by (iv) again, 〈α, Γ ∪ {¬α ∨ β}〉 ≈ 〈α, Γ〉. □

If no information can be added to Γ ∪ {α}, i.e. for all β, 〈β, Γ ∪ {α}〉 ≈ 〈α, Γ ∪ {α}〉, then 〈α, Γ〉 ≈ 〈⊥, Γ〉, where ⊥ is a contradiction.

Proof If for all β, 〈β, Γ ∪ {α}〉 ≈ 〈α, Γ ∪ {α}〉, then, by (iii) and (v)(a), for all β, 〈α, Γ〉 ≈ 〈α ∧ α, Γ〉 ≈ 〈α ∧ β, Γ〉, and, in particular, 〈α, Γ〉 ≈ 〈α ∧ ⊥, Γ〉 ≈ 〈⊥, Γ〉. □

Having established the axioms (i)–(v), it is at this point, when we turn to look at negation, that our two conceptions of information added diverge—the encyclopaedia editor and the detective part company. Here are two principles, both of which may seem plausible, both of which are defensible, but which in conjunction with (i)–(v) entail that information added is an all-or-nothing affair. This is the point of incompatibility.

(vi1) 〈α, Γ ∪ Δ〉 ≼ 〈α, Γ〉.

(vi2) Provided there are propositions η and θ such that 〈η, Γ〉 ≺ 〈θ, Γ〉, if 〈α, Γ〉 ≼ 〈β, Δ〉 then 〈¬β, Δ〉 ≼ 〈¬α, Γ〉.

That in company with (i)–(v) they entail that information added is an all-or-nothing affair is readily shown.

Theorem For any relation ≼ satisfying (i)–(v), (vi1) and (vi2), 〈α, Γ〉 ≈ 〈α, Γ ∪ {α}〉 or 〈α, Γ〉 ≈ 〈⊥, Γ〉, for any α and Γ.

Proof If no information can be added to Γ ∪ {α} then 〈α, Γ〉 ≈ 〈⊥, Γ〉. Suppose, then, that information can be added to Γ ∪ {α}, i.e. that there are propositions η and θ such that 〈η, Γ ∪ {α}〉 ≺ 〈θ, Γ ∪ {α}〉. By (vi1), 〈¬α, Γ ∪ {α}〉 ≼ 〈¬α, Γ〉. Applying (vi2) to the latter, 〈¬¬α, Γ〉 ≼ 〈¬¬α, Γ ∪ {α}〉. As ¬¬α ⊣⊢ α, by (iii), 〈α, Γ〉 ≼ 〈α, Γ ∪ {α}〉. By (i), 〈α, Γ ∪ {α}〉 ≼ 〈α, Γ〉, hence 〈α, Γ ∪ {α}〉 ≈ 〈α, Γ〉.6 □

6 The argument here borrows some from the proof of Theorem 3.1, part (a), in Morgan and Mares (1995). That this is so might be more apparent if we took consistency of Γ and Δ as the criterion for order reversal


This consequence is, intuitively, utterly absurd, for it tells us that information can be added to the single proposition α (strictly, to the set {α}) only if α adds no information to the empty set. This suffices to show that (vi1) and (vi2) pertain to different conceptions of information added.

Back of (vi1) lies, I think, this: whatever information Γ contains bearing on α, Γ ∪ Δ contains at least that much, possibly more, and so of the information α adds to Γ some may already belong to Γ ∪ Δ and if it does not then α just adds to Γ ∪ Δ what it does to Γ. While grounding (vi1) perfectly adequately, the idea of information added at work here seems to supply no good reason to suppose, as (vi2) maintains, that the amount of information the negation of a proposition adds to a collection of propositions increases as the amount the proposition itself adds decreases, for it would seem that if Γ bears on α then it must also bear on ¬α, perhaps even to an equal extent but certainly not increasingly as the extent to which it bears on α decreases.7—Roughly, the thought behind (vi1) may be summarised thus: the very same article cannot add more information to the whole encyclopaedia than it adds to the volume that contains it, for it cannot overlap—share content—with more in the volume than in the whole encyclopaedia: additions are valued for their novelty.8

(vi2) is a natural consequence of a quite different conception of information added, a conception that sees the information α adds to Γ in terms of the possibilities left open by Γ that are ruled out by α. In this respect information added is akin to Karl Popper's notion of empirical content, the empirical content of a theory being the greater the more observable facts are incompatible with it (Popper 1959, §§31 and 34, 1972, pp. 232, 385); it is perhaps even closer akin to a conception of semantic information advocated by Carnap and Bar-Hillel: '[W]e take the content of a statement to be a class of those possible states of the universe which are excluded by this statement . . .' (Bar-Hillel 1952, quoted in Schroeder (2004, p. 392)).9 (vi2) is then extremely plausible: since classically either α or its negation is true, they jointly cover all the possibilities left open by Γ and each may rule out some of these, the more possibilities that α rules out the fewer that its negation rules out and vice versa, a sentence being the more informative relative to Γ, and hence adding the more information to Γ, the more possibilities it rules out. Here, with our talk of more and fewer possibilities being ruled out, we do seem to be getting much closer to a properly quantitative conception of information added. On this conception, the amount of information α adds to Γ is best thought of as something like a measure of the proportion of cases left open by Γ

Footnote 6 continued: under negation, for then we'd find that 〈α, Γ ∪ {α}〉 ≈ 〈α, Γ〉 whenever Γ ∪ {α} is consistent, in contradiction to (ii).
7 Cf. Ian Hacking's criticism of Bernard Koopman's use of the probability analogue of (vi2) (Hacking 1965, pp. 31–34).
8 Or suppose that we have a grasp of some notion of partial entailment: then α adds the more information to Γ the less it is entailed by Γ but α cannot be less of a consequence of Γ ∪ Δ than it is of Γ. Of course, some may dispute that monotonicity holds of partial entailment, although it's far from obvious to me why it should not. One thing we can say is that if it holds, the proof above shows that standard probability theory was never a good formal means to model partial entailment.
9 Cf. Michael Dummett's characterization of the content of an assertion in terms of what it rules out (Dummett 1976, pp. 82–84, 1978, pp. xl, 22).


that are ruled out by α. This is, of course, somewhat inexact as nothing said thus far indicates that Γ should leave only finitely many possibilities open, nor have we said anything concerning the individuation of "possibilities" and "cases", but inexact as it may be, it suffices to make clear both why (vi1) is not remotely plausible on this conception and why (vi2) is. Regarding (vi1), the proportion of possibilities left open by Γ ∪ Δ that α rules out may be far greater than the proportion of those left open by Γ alone that it rules out. Regarding (vi2), if α rules out no greater proportion of the possibilities left open by Γ than β rules out of those left open by Δ, then, exactly because α and its negation divide up the first lot, β and its negation the second lot, ¬β rules out no greater proportion of the possibilities left open by Δ than ¬α rules out of those left open by Γ. (It may be of interest to note that to the extent that the set of propositions Γ could be a set of beliefs (and certainly not every set of propositions could be the totality of beliefs of an agent anything like we are (see, e.g., Weirich 1983)), (vi2) can also be given a convincing reading in terms of "surprise value": should someone whose beliefs comprise exactly Γ find α's being the case more surprising than does another, whose beliefs comprise exactly Δ, find β's being the case, the former will find α's not being the case less surprising than the other finds β's not being the case.)

In the light of this justification it should come as no surprise that

(vi∗2) provided there are propositions η and θ such that 〈η, Γ〉 ≺ 〈θ, Γ〉, if 〈α, Γ〉 ≺ 〈β, Δ〉 then 〈¬β, Δ〉 ≺ 〈¬α, Γ〉.

Proof If 〈α, Γ〉 ≺ 〈β, Δ〉 and there are propositions η and θ such that 〈η, Γ〉 ≺ 〈θ, Γ〉, then 〈¬β, Δ〉 ≼ 〈¬α, Γ〉, by (vi2). Suppose that 〈¬α, Γ〉 ≼ 〈¬β, Δ〉. Then, as information can be added to Γ, by (iii), 〈β, Δ〉 ≈ 〈¬¬β, Δ〉 ≼ 〈¬¬α, Γ〉 ≈ 〈α, Γ〉, contrary to assumption. □

In different ways, (vi1) and (vi2) both establish the existence of maximal additions of information. In the case of (vi2) there is a uniform maximum: the information added by a contradiction to any set of propositions to which information can be added.

Proof (vi1): 〈α, Γ ∪ {α}〉 ≼ 〈α, Γ〉 ≼ 〈⊥, Γ〉 ≼ 〈⊥, ∅〉. (By (ii), 〈α, Γ ∪ {α}〉 ≺ 〈⊥, ∅〉.)
(vi2): if ∃η, θ such that 〈η, Γ〉 ≺ 〈θ, Γ〉, then, for all β and Δ, 〈¬⊥, Γ〉 ≼ 〈¬β, Δ〉, and so 〈β, Δ〉 ≈ 〈¬¬β, Δ〉 ≼ 〈¬¬⊥, Γ〉 ≈ 〈⊥, Γ〉. □

Because ⊥ ⊢ α, for every α, even absent (vi2), we have that 〈α, Γ〉 ≼ 〈⊥, Γ〉 and that 〈α, Γ〉 ≈ 〈⊥, Γ〉 when no information can be added to Γ ∪ {α}. What we won't be able to establish is that 〈⊥, Γ〉 ≈ 〈⊥, Δ〉 for all Γ and Δ to which information can be added. Because—with a particular purpose in mind—it will be useful to have this constraint later even in the absence of (vi2), we add:

(vi2°) 〈⊥, Γ〉 ≈ 〈⊥, Δ〉 for inconsistent ⊥ and all Γ and Δ to which it is possible to add information.

In tandem with (vi1), (vi2°) has this unlikely consequence:

if 〈α, Γ〉 ≺ 〈⊥, Γ〉 then 〈¬α, Γ〉 ≈ 〈⊥, Γ〉.—If α does not add to Γ as much as a contradiction does then ¬α does!


Proof By hypothesis, 〈α, Γ〉 ≺ 〈⊥, Γ〉, so, by (vi1), 〈α, Γ ∪ {¬α}〉 ≺ 〈⊥, Γ〉. But, by (v)(d) and (iii), 〈α, Γ ∪ {¬α}〉 ≈ 〈α ∧ ¬α, Γ ∪ {¬α}〉 ≈ 〈⊥, Γ ∪ {¬α}〉. By (vi2°) it is impossible to add information to Γ ∪ {¬α}, i.e. for all propositions η and θ, 〈η, Γ ∪ {¬α}〉 ≈ 〈θ, Γ ∪ {¬α}〉. But then, 〈¬α, Γ〉 ≈ 〈⊥, Γ〉. □

Perhaps all this does is show how badly (vi2°) sits with the overlap/encyclopaedia article view of information added.

The overlap/encyclopaedia article model is coherent. It values information added for its novelty (and that alone). Roughly, as already suggested, α adds the more information to Γ the more it tells us that Γ itself does not. We can provide a quantitative model under which (i)–(v) and (vi1) are satisfied:

Let L be a countably infinite set of propositions closed under negation and conjunction and let μ be a bounded measure defined on the power-set of L, i.e. μ is a non-negative additive set function and μ(L) < +∞. For finite Γ, define i(α, Γ) as μ(Cn(¬(γ1 ∧ γ2 ∧ · · · ∧ γn)) − Cn(¬α)), when Γ = {γ1, γ2, . . . , γn}, and as μ(L − Cn(¬α)) when Γ = ∅, where, for any proposition β in L, Cn(β) is β's consequence class, i.e. the set {δ ∈ L : β ⊢ δ}. For infinite Γ, set i(α, Γ) = inf {i(α, Δ) : Δ ⊂ Γ and Δ is finite}.

i is an additive measure satisfying (i)–(vi1). It does not satisfy (vi2°).10
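A minimal computational sketch of this construction may help fix ideas. It works over a small finite fragment of a two-atom propositional language rather than a countably infinite L, and takes μ to be simple counting measure on the fragment; both simplifications, and all the helper names, are illustrative assumptions rather than part of the construction itself.

```python
from itertools import product

ATOMS = ['p', 'q']
VALUATIONS = [dict(zip(ATOMS, vals)) for vals in product([False, True], repeat=len(ATOMS))]

def ev(formula, valuation):
    """Evaluate a formula built from ('atom', x), ('not', f) and ('and', f, g)."""
    tag = formula[0]
    if tag == 'atom':
        return valuation[formula[1]]
    if tag == 'not':
        return not ev(formula[1], valuation)
    return ev(formula[1], valuation) and ev(formula[2], valuation)  # 'and'

def entails(premise, conclusion):
    """Classical entailment: every valuation verifying the premise verifies the conclusion."""
    return all(ev(conclusion, v) for v in VALUATIONS if ev(premise, v))

p, q = ('atom', 'p'), ('atom', 'q')
# A finite stand-in for L: a handful of logically distinct propositions.
FRAGMENT = [p, q, ('not', p), ('not', q), ('and', p, q),
            ('not', ('and', p, q)), ('and', p, ('not', q)), ('and', ('not', p), ('not', q))]

def Cn(beta):
    """Consequence class of beta within the fragment: {delta in L : beta |- delta}."""
    return {i for i, delta in enumerate(FRAGMENT) if entails(beta, delta)}

def mu(index_set):
    """A bounded, non-negative additive set function: here, plain counting measure."""
    return len(index_set)

def conj(formulas):
    acc = formulas[0]
    for f in formulas[1:]:
        acc = ('and', acc, f)
    return acc

def info_added(alpha, gamma):
    """i(alpha, Gamma) = mu(Cn(not conj(Gamma)) - Cn(not alpha)), for finite Gamma."""
    if not gamma:
        return mu(set(range(len(FRAGMENT))) - Cn(('not', alpha)))
    return mu(Cn(('not', conj(gamma))) - Cn(('not', alpha)))

print(info_added(p, []), info_added(p, [q]), info_added(p, [p]))  # 6 1 0 on this fragment
assert info_added(p, [q]) <= info_added(p, [])   # (vi1): no more is added to a larger body
assert info_added(p, [p]) == 0                   # (i): nothing is added when alpha is in Gamma
```

On this toy model p adds less information to {q} than to the empty set, and none at all to {p}, in line with (i) and (vi1).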

Coherent as it may be, in focussing on novelty it ignores what is surely often a very important aspect of information in the everyday sense: information is useful for what it gives us in conjunction with what we already have. Information is valued not only, perhaps not even, for novelty but for what it allows us to add to what we had—and among what we can add are consequences of what's new and what we had, consequences of Γ ∪ {α}. Classically, inconsistent sets of propositions entail everything, so adding information that yields an inconsistent set is a very informative addition, too informative to be useful. We have what we might be tempted to think of as an off-the-scale information overload. Adding a contradiction, or some proposition that adds as much information as a contradiction would, the resulting set of propositions "explodes", it entails everything or at least yields as much information as a set that entails everything. The thought behind (vi2°) is that any other addition of information is small beer in comparison.

From here on I shall take for granted that our conception of information added endorses at least (vi2°). From (i)–(v) and (vi2°) we get:

if 〈α, Γ〉 ≺ 〈⊥, Γ〉 then 〈⊥, Γ ∪ {α}〉 ≈ 〈⊥, Γ〉.

10 The obvious construction sets i(α, Γ) = μ(Cn(α) − Cn(γ1 ∧ γ2 ∧ · · · ∧ γn)), taking the size of the set of consequences of α that are not consequences of Γ as a measure of α's novelty with respect to Γ, but the consequence class of a conjunction is more than the union of the consequence classes of its conjuncts except when one conjunct entails the other. Instead, then, we measure the novelty of α with respect to Γ by the size of the set of propositions whose negations entail all consequences of Γ but fail to entail α (a set that, like the set of consequences of α that are not consequences of Γ, is closed under conjunction). The consequence class of a disjunction is the intersection of the consequence classes of its disjuncts and, quite generally, X − (Y ∩ Z) = (X − Y) ∪ ((X ∩ Y) − Z).


2.3 A Refinement

The justification for (v) relied on the idea that a conjunction contains exactly the information in its two conjuncts, and so what information is added by a conjunction is the same as when the conjuncts are added consecutively. Now, suppose that α ∧ γ adds no more information to Γ than β ∧ δ adds to Δ, while α adds at least as much information to Γ as β adds to Δ. Then should it be the case that γ adds more to Γ over and above what α adds, than δ adds to Δ over and above what β adds, the excess information just disappears, which disappearance can be accounted for only if the information added to Δ by β is swamped by what δ adds to Δ over and above what β adds. Likewise, when α ∧ γ adds no more information to Γ than β ∧ δ adds to Δ, while γ adds at least as much to Γ over and above what α adds as δ adds to Δ over and above what β adds, then should it be the case that α adds more information to Γ than β adds to Δ this can only be because what δ adds to Δ over and above what β adds is swamped by what β adds to Δ. (Allowing that 〈γ, Γ ∪ {α}〉 and 〈δ, Δ ∪ {β}〉, respectively, 〈α, Γ〉 and 〈β, Δ〉, might be incomparable doesn't seem an option here if we can get as far as comparing the conjunctions and one of each of their conjuncts.)

This gives us

(vii)(a) If 〈α ∧ γ, Γ〉 ≼ 〈β ∧ δ, Δ〉 and 〈β, Δ〉 ≼ 〈α, Γ〉 then 〈γ, Γ ∪ {α}〉 ≼ 〈δ, Δ ∪ {β}〉 unless, perhaps, 〈β, Δ〉 attains some maximum.

and

(vii)(b) If 〈α ∧ γ, Γ〉 ≼ 〈β ∧ δ, Δ〉 and 〈δ, Δ ∪ {β}〉 ≼ 〈γ, Γ ∪ {α}〉 then 〈α, Γ〉 ≼ 〈β, Δ〉 unless, perhaps, 〈δ, Δ ∪ {β}〉 attains some maximum.

When ≼ is connected, i.e. when, for any α, β, Γ and Δ, 〈α, Γ〉 ≼ 〈β, Δ〉 or 〈β, Δ〉 ≼ 〈α, Γ〉, (vii)(a) is equivalent to (v)(c) and (vii)(b) to (v)(b).

These principles are more significant than may at first appear. (vii)(a) yields this principle:

(vii)(c) If 〈α ∧ γ, Γ〉 ≈ 〈⊥, Γ〉 then 〈α, Γ〉 ≈ 〈⊥, Γ〉 or 〈γ, Γ ∪ {α}〉 ≈ 〈⊥, Γ ∪ {α}〉.

However obvious the probability analogue, notice that without (vi2°), which has not been used in the derivation, 〈⊥, Γ〉 and 〈⊥, Γ ∪ {α}〉 may function solely as "local maxima", distinct relative maxima for additions of information to Γ and to Γ ∪ {α} respectively. (Of course, 〈⊥, Γ ∪ {α}〉 ≼ 〈⊥, Γ〉.)

More impressive are these consequences:

(vii)(d) if Γ ⊢ ¬(α ∧ γ), Δ ⊢ ¬(β ∧ δ), 〈α, Γ〉 ≼ 〈β, Δ〉 and 〈γ, Γ〉 ≼ 〈δ, Δ〉 then 〈α ∨ γ, Γ〉 ≼ 〈β ∨ δ, Δ〉;

(vii)(e) if for some ε and ζ, 〈ε, Γ〉 ≺ 〈ζ, Γ〉 and if Γ ⊢ ¬(α ∧ γ), Δ ⊢ ¬(β ∧ δ), 〈α, Γ〉 ≺ 〈β, Δ〉 and 〈γ, Γ〉 ≼ 〈δ, Δ〉 then 〈α ∨ γ, Γ〉 ≺ 〈β ∨ δ, Δ〉.

Proof (In part loosely following Koopman (1940a).) Firstly, we may suppose that there are propositions η and θ such that 〈η, Δ〉 ≺ 〈θ, Δ〉, for else (d) holds because 〈α ∨ γ, Γ〉 ≼ 〈α, Γ〉 ≼ 〈β, Δ〉 ≈ 〈β ∨ δ, Δ〉, and (e) holds vacuously as the constraint 〈α, Γ〉 ≺ 〈β, Δ〉 cannot be met.


Now, to prove (d). If there are no propositions ε and ζ such that 〈ε, Γ〉 ≺ 〈ζ, Γ〉, then 〈α ∨ γ, Γ〉 ≼ 〈β ∨ δ, Δ〉 holds trivially. So we suppose now that for some ε and ζ, 〈ε, Γ〉 ≺ 〈ζ, Γ〉. As Γ ⊢ ¬(α ∧ γ) and Δ ⊢ ¬(β ∧ δ), we have Γ ⊢ γ ≡ (¬α ∧ γ) and Δ ⊢ δ ≡ (¬β ∧ δ) and so, by the strengthening of (iii), 〈¬α ∧ γ, Γ〉 ≼ 〈¬β ∧ δ, Δ〉. By (vi2), 〈¬β, Δ〉 ≼ 〈¬α, Γ〉 and hence, by (vii)(a), 〈γ, Γ ∪ {¬α}〉 ≼ 〈δ, Δ ∪ {¬β}〉—unless 〈¬β, Δ〉 ≈ 〈⊥, Δ〉. Supposing, for the moment, that 〈¬β, Δ〉 ≺ 〈⊥, Δ〉, if there are propositions ε and ζ such that 〈ε, Γ ∪ {¬α}〉 ≺ 〈ζ, Γ ∪ {¬α}〉 then, either by (vi2) again, if there are propositions η and θ such that 〈η, Δ ∪ {¬β}〉 ≺ 〈θ, Δ ∪ {¬β}〉, or since otherwise 〈¬δ, Δ ∪ {¬β}〉 ≈ 〈¬β, Δ ∪ {¬β}〉, 〈¬δ, Δ ∪ {¬β}〉 ≼ 〈¬γ, Γ ∪ {¬α}〉 and so, by (v)(c), 〈¬β ∧ ¬δ, Δ〉 ≼ 〈¬α ∧ ¬γ, Γ〉.
If there are no propositions ε and ζ such that 〈ε, Γ ∪ {¬α}〉 ≺ 〈ζ, Γ ∪ {¬α}〉 then 〈α, Γ ∪ {¬α}〉 ≈ 〈¬α, Γ ∪ {¬α}〉 and, by (v)(a) and the strengthening of (iii), 〈¬α, Γ〉 ≈ 〈⊥, Γ〉. Then 〈¬β ∧ ¬δ, Δ〉 ≼ 〈⊥, Δ〉 ≈ 〈⊥, Γ〉 ≈ 〈¬α, Γ〉 ≼ 〈¬α ∧ ¬γ, Γ〉.
And if 〈¬β, Δ〉 ≈ 〈⊥, Δ〉 then 〈¬β ∧ ¬δ, Δ〉 ≼ 〈⊥, Δ〉 ≈ 〈¬β, Δ〉 ≼ 〈¬α, Γ〉 ≼ 〈¬α ∧ ¬γ, Γ〉.
Finally, by (vi2) yet again and (iii) applied to De Morgan equivalences, 〈α ∨ γ, Γ〉 ≼ 〈β ∨ δ, Δ〉.
To prove (e). As in the proof of (d), we have that 〈¬α ∧ γ, Γ〉 ≼ 〈¬β ∧ δ, Δ〉. By (vi∗2), 〈¬β, Δ〉 ≺ 〈¬α, Γ〉 and hence, by (vii)(a), 〈γ, Γ ∪ {¬α}〉 ≺ 〈δ, Δ ∪ {¬β}〉, noting that 〈¬β, Δ〉 does not take the maximum value. If there are propositions ε and ζ such that 〈ε, Γ ∪ {¬α}〉 ≺ 〈ζ, Γ ∪ {¬α}〉 then by (vi∗2) again, noticing that 〈¬β, Δ ∪ {¬β}〉 ≺ 〈δ, Δ ∪ {¬β}〉, 〈¬δ, Δ ∪ {¬β}〉 ≺ 〈¬γ, Γ ∪ {¬α}〉 and so, by (v)(a), 〈¬β ∧ ¬δ, Δ〉 ≺ 〈¬α ∧ ¬γ, Γ〉, noting that 〈α, Γ〉 cannot take the maximum value. By (vi∗2) yet again, which we may use since 〈α, Γ〉 ≺ 〈β, Δ〉, and (iii) applied to De Morgan equivalences, 〈α ∨ γ, Γ〉 ≺ 〈β ∨ δ, Δ〉.
If there are no propositions ε and ζ such that 〈ε, Γ ∪ {¬α}〉 ≺ 〈ζ, Γ ∪ {¬α}〉 then 〈¬α, Γ〉 ≈ 〈⊥, Γ〉. But then 〈⊥, Γ〉 ≈ 〈¬α, Γ〉 ≼ 〈¬α ∧ γ, Γ〉 ≈ 〈γ, Γ〉 ≼ 〈⊥, Γ〉; and, as 〈γ, Γ〉 ≼ 〈δ, Δ〉, 〈δ, Δ〉 ≈ 〈⊥, Γ〉 ≈ 〈⊥, Δ〉. Now, 〈α ∨ γ, Γ〉 ≈ 〈(α ∧ ¬γ) ∨ γ, Γ〉 and since ⊢ ¬((α ∧ ¬γ) ∧ γ), 〈α ∨ γ, Γ〉 ≈ 〈(α ∧ ¬γ) ∨ γ, Γ〉 ≈ 〈(α ∧ ¬γ) ∨ ⊥, Γ〉 ≈ 〈α ∧ ¬γ, Γ〉, by (d) which has just been demonstrated. But 〈α ∨ γ, Γ〉 ≼ 〈α, Γ〉 ≼ 〈α ∧ ¬γ, Γ〉, so 〈α ∨ γ, Γ〉 ≈ 〈α, Γ〉. And likewise 〈β ∨ δ, Δ〉 ≈ 〈β, Δ〉. And since we're given that 〈α, Γ〉 ≺ 〈β, Δ〉, 〈α ∨ γ, Γ〉 ≺ 〈β ∨ δ, Δ〉. □

We have made heavy use of (vi2) and its consequence (vi∗2) in deriving these principles so it's as well to remark that, were we to take the latter principles as primitive, we would then be able to derive (vi2) and (vi∗2) on the assumption that ≼ is connected.

So far, so good, but for aught we know, it might be that for any α and consistent Γ,

either 〈α, Γ〉 ≈ 〈α, Γ ∪ {α}〉 or 〈α, Γ〉 ≈ 〈⊥, Γ〉;

in other words, it might be all or nothing (as it must be if we endorse (vi1) along with (vi2))—α either adds no information to Γ or, which may not be different, it adds as much as a contradiction does. We'd like our comparisons of information added to be


a bit more fine-grained than that—in fact, a lot more fine-grained than that. We look first at Koopman's approach.

3 The Transformation of Quality into Quantity

I haven't insisted that the ordering ≼ is connected: that for any α, β, Γ and Δ,

〈α, Γ〉 ≼ 〈β, Δ〉 or 〈β, Δ〉 ≼ 〈α, Γ〉,

and I won't. What I do want to insist on is that the ordering as given can always be extended to a connected, reflexive and transitive ordering (a weak ordering). While any partial order can be extended to a total order (Szpilrajn 1930), something stronger is needed here because we want the constraints listed above on conjunctions and negations (and perhaps others) to hold good of any extension as well as of the original ordering. Sufficient conditions for the extension of a given order obeying the constraints listed above to a connected order of the same kind (obeying the same constraints) are not easily stated. (Malmnäs 1981 comes closest to stating what we need: qualitative conditions for the extensibility to a connected ordering.)—The motivation in requiring such extensibility is that there should be no necessary incomparability of information added. In the light of Szpilrajn's result, necessary incomparability in the sense in which I mean it must come about due to every totally ordered extension of ≼ failing to meet one or other of the constraints on conjunctions and negations. As I mean it, it's not enough to be able to think of an intuitive case where one cannot make comparisons of information added: merely formal extension has to be ruled out too, and, as indicated, that can only be on the grounds of failure to meet the logical constraints. It is hard to think of an intuitive reason why that should be the case. (Nonetheless, as I'll point out below, on Koopman's approach we can make significant headway without the assumption that the ordering as given can always be extended to a connected, reflexive and transitive ordering.)

We may need either to add further constraints or to refine previously given ones.

3.1 Koopman

Koopman has a straightforward way of securing a quantitative representation (Koopman 1940a,b). Adapted to the present setting, he first adds a further principle, his axiom of subdivision, governing, as we employ it, information added:

(viiiK) For any propositions α1, α2, . . . , αn, β1, β2, . . . , βn and bodies of propositions Γ and Δ, if Γ ⊢ α1 ∨ α2 ∨ · · · ∨ αn, Γ ⊢ ¬(αi ∧ αj), 1 ≤ i < j ≤ n, Δ ⊢ β1 ∨ β2 ∨ · · · ∨ βn, Δ ⊢ ¬(βi ∧ βj), 1 ≤ i < j ≤ n, 〈α1, Γ〉 ≼ 〈α2, Γ〉 ≼ · · · ≼ 〈αn, Γ〉 and 〈β1, Δ〉 ≼ 〈β2, Δ〉 ≼ · · · ≼ 〈βn, Δ〉, then 〈α1, Γ〉 ≼ 〈βn, Δ〉—provided there are propositions η and θ such that 〈η, Δ〉 ≺ 〈θ, Δ〉.

However ungainly this may look, it is in fact readily justified when we think of

information added in terms of the proportion of possibilities ruled out. We are entitled to use that justification here as Koopman adopts (vi2) which we justified invoking the


same conception of information. A visualizable, geometric analogy may help. You have divided two cakes, pie-chart fashion, each into n portions: the angle subtended at the centre by the largest slice of one cake cannot be smaller than the angle subtended at the centre by the smallest slice of the other cake.

This far I have been vague, deliberately so, about the domain of definition of the relations ≺ and ≼. Nearly all representation theorems make an assumption of plenitude: they lay claim to a rich field of application of the qualitative relationships in question in order to obtain an informative "uniqueness" result. Here, adapted to the present setting, is Koopman's:

Koopman's Plenitude Assumption For each n ∈ ℕ⁺, there is a set of propositions Γ_n, and individual propositions α^n_1, α^n_2, . . . , α^n_n, such that Γ_n ⊢ α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_n, Γ_n ⊢ ¬(α^n_i ∧ α^n_j), 1 ≤ i < j ≤ n, and 〈α^n_1, Γ_n〉 ≈ 〈α^n_2, Γ_n〉 ≈ · · · ≈ 〈α^n_n, Γ_n〉 ≺ 〈⊥, Γ_n〉.

n cards are laid out in a circle, on a circular table, in a circular room; on the side that is face down each card is decorated with the Chinese character numeral expression for exactly one of n natural numbers, and the sides that are uppermost are indistinguishable; a child is asked to pick exactly one of the cards, any one she wishes, and turn it over. This is what Γ_n tells us. The α_i's report the various possibilities for what is on the face of the card turned up: the n different numeral expressions. (α_i reports exactly the numeral, so if you don't know the Chinese system, you don't know, as we ordinarily say, which number was picked.—But the exercise isn't about picking numbers, it's about numerals, so you don't lose anything by that.)—There are other ways to motivate introducing the assumption (not that Koopman himself does anything to motivate it). In the probability literature this mostly involves fair lotteries or perfectly shuffled decks of cards, e.g. Good (1962, p. 142), which may be fair enough in context but I want to get away from that. This isn't about chance. It's about what you learn, the information added to what you already know, given what you are told about the set-up (Γ_n), when you find out which card was turned up, and no more than which card was turned up.11

The Plenitude Assumption gives (viiiK) something to work on. From these and the other principles, (vi1) excluded, Koopman is able to derive

for any μ, m, ν and n, 1 ≤ μ ≤ m, 1 ≤ ν ≤ n, 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_ν, Γ_n〉 ≼ 〈α^m_1 ∨ α^m_2 ∨ · · · ∨ α^m_μ, Γ_m〉 if, and only if, μ/m ≤ ν/n.

Let

P(β, Δ) = sup{q ∈ ℚ : for some m ∈ ℕ, for some n ∈ ℕ⁺, q = m/n and 〈β, Δ〉 ≼ 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_m, Γ_n〉}.12

11 It's virtually impossible to construct a scenario in which the only added information is the outcome. You learn of the outcome by some means. Absent a means to implant propositional information directly in the brain (mind!), you will be aware of the means by which the information is given to you—which is itself information (most likely added).
12 Here 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_m, Γ_n〉 is 〈⊥, Γ_n〉 when m = 0.
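The supremum definition can be mimicked numerically. In the sketch below the qualitative facts are simulated by a hidden value (an illustrative assumption standing in for the comparisons with the reference partitions), and that value is recovered purely from answers to comparison queries of the form 〈β, Δ〉 ≼ 〈α^n_1 ∨ · · · ∨ α^n_m, Γ_n〉, i.e. from facts of the form m/n ≤ P(β, Δ).

```python
from fractions import Fraction

HIDDEN_P = Fraction(5, 12)   # the value the comparisons implicitly encode (illustrative)

def at_least_as_informative(m, n):
    """Does the disjunction of m of the n equal cells add at least as much information
    to Gamma_n as beta adds to Delta?  It does just in case m/n <= P(beta, Delta)."""
    return Fraction(m, n) <= HIDDEN_P

def koopman_value(max_n=200):
    """Supremum of the rationals m/n passing the comparison test, over a finite grid."""
    best = Fraction(0)
    for n in range(1, max_n + 1):
        for m in range(0, n + 1):
            q = Fraction(m, n)
            if q > best and at_least_as_informative(m, n):
                best = q
    return best

print(koopman_value())   # approaches 5/12 as the stock of reference partitions grows
```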


In view of Koopman's theorem this set is a Dedekind cut, its supremum a real number in the interval [0, 1]. What's more, if we start out with a connected ordering ≼, we have imposed enough structure to ensure that (i) P(., Γ) is a standard probability measure, provided 〈α, Γ ∪ {α}〉 ≺ 〈α, Γ〉 for some α, (ii) P(., Γ) = 1 when 〈α, Γ ∪ {α}〉 ≈ 〈α, Γ〉 for all α, (iii) P(α ∧ β, Γ) = P(β, Γ ∪ {α}) × P(α, Γ), and (iv) P takes as values at least all rational numbers in the interval [0, 1].—P(., .) is, to within an insignificant nicety, a Popper function. For present purposes, a Popper function can be taken to be a function P : L × ℘(L) → [0, 1], where L is a set of propositions closed under negation and conjunction, satisfying these constraints:

• 0 ≤ P(α, Γ) ≤ 1;
• if α ∈ Γ then P(α, Γ) = 1;
• for some α and Γ, P(α, Γ) < 1;
• P(α ∧ β, Γ) = P(β ∧ α, Γ);
• if, for all β ∈ Δ, Γ ⊢ β and, for all γ ∈ Γ, Δ ⊢ γ then P(α, Γ) = P(α, Δ);
• if ∃β P(β, Γ) < 1 then P(α, Γ) + P(¬α, Γ) = 1;
• P(α ∧ β, Γ) = P(α, Γ) × P(β, Γ ∪ {α}).

The nicety concerns the second argument place, here taken by a set of propositions rather than, as is more common in the literature, a single proposition.13
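For readers who want something executable to poke at, here is a small sketch of one familiar way of realizing these constraints on a finite example: conditional probability by counting valuations, with everything assigned the value 1 relative to an unsatisfiable set of premises. It is an illustration of the constraints just listed, not the construction of P from ≼ given above, and the helper names are assumptions of the sketch.

```python
from itertools import product

ATOMS = ['p', 'q']
VALUATIONS = [dict(zip(ATOMS, vals)) for vals in product([False, True], repeat=len(ATOMS))]

def ev(formula, valuation):
    tag = formula[0]
    if tag == 'atom':
        return valuation[formula[1]]
    if tag == 'not':
        return not ev(formula[1], valuation)
    return ev(formula[1], valuation) and ev(formula[2], valuation)  # 'and'

def models(gamma):
    """Valuations satisfying every member of the set gamma."""
    return [v for v in VALUATIONS if all(ev(g, v) for g in gamma)]

def P(alpha, gamma=frozenset()):
    """Conditional probability by counting valuations; 1 when gamma has no models."""
    mods = models(gamma)
    if not mods:
        return 1.0
    return sum(ev(alpha, v) for v in mods) / len(mods)

p, q = ('atom', 'p'), ('atom', 'q')
not_pq = ('not', ('and', p, q))

assert P(p, {p}) == 1                                  # alpha in Gamma gives value 1
assert P(('and', p, q)) == P(p) * P(q, {p})            # the product rule
assert P(p) + P(('not', p)) == 1                       # complementation, since some value is < 1
assert P(q, {p, not_pq}) == 0 and P(('not', q), {p, not_pq}) == 1
print("the listed constraints hold on these spot checks")
```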

The function P reverses the order of information added. Moreover, since P(α, Γ) = P(α ∧ α, Γ) = P(α, Γ) × P(α, Γ ∪ {α}), the natural zero point of information added is mapped to 1, rather than 0. Thus P is not itself straightforwardly a measure of information added but, rather, a rescaling of such a measure. The most straightforward way to rectify P's failings is to take as measure of information added

i = 1 − P.

Another possibility is to take

i = − log P

where the base of the logarithms may be chosen arbitrarily. The latter is additive in this sense:

i(α ∧ β, Γ) = i(α, Γ) + i(β, Γ ∪ {α}).

There are, of course, countless other order-reversing bijections mapping [0, 1] into some subset of ℝ⁺ ∪ {0, +∞} with 1 being mapped to 0. Nothing I have said encourages us to prefer any one to any other. The important fact is that the function P is recoverable from all of them.

13 That the nicety is insignificant is seen, by inspection, from Koopman's proof (Koopman 1940a, Theorem 14): it is not essential that there be a single proposition rather than a possibly infinite set of propositions. (We might note that in his account of what I call the Cox–Good–Aczél method, Kevin Van Horn takes the second argument to stand for a rather loosely specified state of information that need not be representable by a single proposition (Van Horn 2003).)


Uniqueness When ≼ is connected, P is uniquely determined. If ≼ is not connected, we can consider the various probability distributions determined by connected extensions of ≼. For any such distribution P, P(α, Γ) ≤ P(β, Δ) when 〈β, Δ〉 ≼ 〈α, Γ〉. If P1 and P2 are two such distributions, the weighted average ζP1 + (1 − ζ)P2, 0 < ζ < 1, also "matches" the original ordering. Thus, associated with the original ordering ≼ is a unique convex, closed set of probability distributions. This can be used to define upper and lower conditional probabilities.14

Faithfulness The numerical representation may not be faithful (as Koopman was well aware): nothing rules out the possibility, for some q ∈ ℚ ∩ [0, 1], that (i) for all m ∈ ℕ and n ∈ ℕ⁺, if m/n < q, 〈β, Δ〉 ≼ 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_m, Γ_n〉, (ii) for all m ∈ ℕ and n ∈ ℕ⁺, if q ≤ m/n, 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_m, Γ_n〉 ≼ 〈γ, E〉, and (iii) 〈γ, E〉 ≺ 〈β, Δ〉.

The sophisticates among us distinguish between probability zero and impossibility (which was Koopman's purpose in admitting infidelity of quantitative representation). The ultra-sophisticates don't; instead they introduce infinitesimals, which are, by definition, non-Archimedean. But for the most part the literature on representation theorems in the theory of measurement has yet to catch up with infinitesimals.15 Faithful representation without infinitesimals requires an assumption that stamps out the sort of behaviour just indicated. We see exactly the form it should take in the present context:

(ixK) If 〈γ, E〉 ≺ 〈β, Δ〉 then, for some m ∈ ℕ and n ∈ ℕ⁺, 〈γ, E〉 ≺ 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_m, Γ_n〉 ≺ 〈β, Δ〉.

With this enforced, we have

P(α, Γ) = P(β, Δ) only when 〈α, Γ〉 ≈ 〈β, Δ〉,

(but I am not saying that we should endorse it).

Consistency There are two further constraints one may be inclined to add.

(iiR) If Γ is consistent then, for some α, 〈α, Γ ∪ {α}〉 ≺ 〈α, Γ〉.

(x) If Γ is consistent then 〈α, Γ〉 ≈ 〈⊥, Γ〉 only if Γ ⊢ ¬α.

14 Koopman, who did not make the assumption that ≼ has a connected extension, had a more direct approach to upper and lower probabilities:

P_*(β, Δ) = sup{q ∈ ℚ : for some m ∈ ℕ, for some n ∈ ℕ⁺, q = m/n and 〈β, Δ〉 ≼ 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_m, Γ_n〉};

P^*(β, Δ) = inf{q ∈ ℚ : for some m ∈ ℕ, for some n ∈ ℕ⁺, q = m/n and 〈α^n_1 ∨ α^n_2 ∨ · · · ∨ α^n_m, Γ_n〉 ≼ 〈β, Δ〉}.

These satisfy the more obvious of the axioms for upper and lower conditional probabilities that are to be found in the literature, but not all (see Good 1962); without the assumption of "completability", allowing for the extension of ≼ to a connected ordering satisfying all the adduced constraints, it cannot be shown that the resulting upper and lower conditional probabilities are the envelope of a set of Popper functions.
15 The only exception relevant in the present context of which I am aware is the work on the Cox–Good–Aczél approach of Stefan Arnborg and Gunnar Sjödin (see, e.g., Arnborg and Sjödin 2001).


The effect of the first is to turn the Popper function P into, within the same insignificant nicety, what the literature calls a Rényi function. Put another way, for each consistent Γ, the function P(.|Γ) is a standard probability distribution. The second, when taken together with (ixK), goes further, making P(.|Γ) strictly positive on the set {α ∈ L : Γ ⊬ ¬α}.16

Koopman's Plenitude Assumption (as I have called it) isn't at all crazy. What it does is provide the basis of a numerical representation of information added. It does so bluntly, to be sure, but, aesthetics apart, what is there to fault it?

3.2 Cox, Good, Aczél—and Cantor and Debreu

3.2.1 Cox, Good, Aczél

Making substantial use of (vi2), (viiiK), the plenitude assumption, and the assumption that ≼ is connected, we have obtained quantitative measures of information added that can all be rescaled as a unique conditional probability distribution. In particular, we have obtained additive measures of information added. Where we go next is a result that tells us that, under a different plenitude assumption, any quantitative measure of information added that is sensitive to the structural features listed in §§2.1–2.3 above can be rescaled as an additive measure. Further, if (vi2) is to be respected, there is a unique rescaling to a conditional probability distribution (Popper function), and if, instead, (vi1) and (vi2°) are to be respected there is a non-unique rescaling as a function having some of the features of a conditional probability distribution. (I shall make that more exact shortly.)

The mathematical cornerstone is the associativity equation, studied in the early nineteenth century by the Norwegian mathematician Niels Henrik Abel. It has been used as the basis for obtaining probability representations by R. T. Cox, I. J. Good, and János Aczél, amongst others. The most rigorous presentation to date is Jeff Paris's (1994, pp. 24–32). I'm not going to go into the details, but I do need to say something about how the argument goes.17

A numerical function i faithful to the constraints we have placed on the qualitative relation ≼ must satisfy these conditions (numbered to match the qualitative constraints):

(in) i(α, Γ) = 0 when α ∈ Γ.

(iin) For some α and Γ for which Γ ∪ {α} is consistent, i(α, Γ) > 0.

(iiin) If α ⊢ β then i(β, Γ) ≤ i(α, Γ).

(ivn) If, for all β ∈ Δ, Γ ⊢ β and, for all γ ∈ Γ, Δ ⊢ γ then i(α, Γ) = i(α, Δ).

16 Unlike Popper functions, Rényi functions are usually taken to be defined over Boolean algebras, rather than the propositions of a language. This is not an essential restriction. For straightforward axiomatizations of both Popper and Rényi functions over Boolean algebras, see, e.g., Roeper and Leblanc 1991, pp. 1–2, Table of Constraints.
17 Van Horn (2003) provides a good overview of what's involved in the Cox–Good–Aczél approach.


(vn)(a) i(α ∧ γ, Γ) is determined by i(α, Γ) and i(γ, Γ ∪ {α}). That is, where I is the domain of definition of i, there is a function F : I² → I such that

i(α ∧ γ, Γ) = F(i(α, Γ), i(γ, Γ ∪ {α})).

(vn)(b), (vn)(c) F is strictly increasing in both arguments except, perhaps, when at least one of them takes the maximum value of i permitted it.

(vn)(d) i(α ∧ γ, Γ ∪ {α}) = i(γ, Γ ∪ {α}) ≤ i(α ∧ γ, Γ).

(vi1n) i(α, Γ ∪ Δ) ≤ i(α, Γ).

(vi2n) If i(., Γ) is not constant then i(¬β, Δ) ≤ i(¬α, Γ) when i(α, Γ) ≤ i(β, Δ). When i(., Γ) is constant, i(¬α, Γ) = i(α, Γ) = 0. Since, by (ii), there is some Γ for which i(., Γ) is not constant, there is a function S : I → I, such that i(¬α, Γ) = S(i(α, Γ)). Since ¬¬α ⊣⊢ α, S is an order-inverting involution on I, i.e., S(y) < S(x) when x < y and S(S(x)) = x.

(vi2n°) If i(., Γ) is not constant then i(β, Δ) ≤ i(⊥, Γ), for all β and Δ. Thus i takes a maximum value in I so I contains an attained upper bound which, we must allow, may be infinite. I shall use ⊤ to designate this upper bound.

As the numerical representation encodes a connected ordering, (vii) imposes no new constraint. But we know that with (vi2), (vii)(a) and (vii)(c) entail that i(α ∨ β, Γ) is uniquely determined by i(α, Γ) and i(β, Γ) when Γ ⊢ ¬(α ∧ β). So there is a commutative function G defined on some subset of I × I such that G is strictly increasing in both its arguments, save perhaps when one of them assumes the maximum value, ⊤, and G(x, 0) = G(0, x) = x.

To these constraints on the numerical representation i we must add Paris's Plenitude Assumption and a plausible further constraint.18

Paris's Plenitude Assumption For any x, y and z in [0, ⊤]—or in [0, ⊤) when ⊤ = +∞—and for any ε > 0 there are propositions α, β and γ and a set Γ of propositions, such that

|i(γ, Γ ∪ {α, β}) − x| < ε, |i(β, Γ ∪ {α}) − y| < ε, and |i(α, Γ) − z| < ε.

Thus I comprises at least a dense subset of the interval [0, ⊤].

In the absence of (vi2n°), Paris's Plenitude Assumption in the form given here may fail. It does fail for the novelty measure in §2.2. And it certainly fails when we adopt both (vi1n) and (vi2n), for, as we know, i is then two-valued (all-or-nothing). That said, Paris's Plenitude Assumption is a consequence of Koopman's.

A conjunction entailing and being entailed by its conjuncts, we would expect small differences in the information added to some body of propositions by a pair of propositions to result in at best small differences in the information added by their conjunction. That being so, with (vi2n°) adopted, it is natural, possibly through the interpolation of notional values of i,

18 Paris is the most explicit about its role, hence the credit.


(viiiP) first, to take the domain in play to be the whole of the interval [0, ⊤], second, to take the function F to be defined on the whole of [0, ⊤] × [0, ⊤] (and strictly increasing in both arguments throughout [0, ⊤) × [0, ⊤)), and, third, to take F to be continuous in both arguments.19 We take S to be defined on the whole of [0, ⊤]—and so, being an order-inverting involution on [0, ⊤], S is necessarily continuous. We take the function G to be continuous in both arguments throughout its domain of definition, but leave in the lap of the gods exactly which subset of [0, ⊤] × [0, ⊤] that is.

The key step is to note that

(α ∧ β) ∧ γ ⊣⊢ α ∧ (β ∧ γ),

hence, for all x, y and z in [0, ⊤],

F(F(x, y), z) = F(x, F(y, z))

—the associativity equation. The key mathematical result is that there must then be a function f : [0, ⊤] → [0, +∞], unique up to multiplication by a positive real number, such that

f(F(x, y)) = f(x) + f(y),

with f(0) = 0 and f(⊤) = +∞. (See Paris (1994, pp. 26–27) or Aczél (1966, §6.2) for the details.)
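A concrete instance may make the rescaling step vivid. Suppose, purely for illustration, that F(x, y) = x + y + xy; this F satisfies the associativity equation, and f(x) = log(1 + x) linearises it in exactly the way the theorem promises, with f(0) = 0.

```python
import math
import random

def F(x, y):
    """An illustrative combination function satisfying the associativity equation."""
    return x + y + x * y

def f(x):
    """The additive rescaling for this F: f(F(x, y)) = f(x) + f(y), f(0) = 0."""
    return math.log1p(x)   # log(1 + x)

random.seed(0)
for _ in range(5):
    x, y, z = (random.uniform(0, 5) for _ in range(3))
    assert math.isclose(F(F(x, y), z), F(x, F(y, z)))     # associativity equation
    assert math.isclose(f(F(x, y)), f(x) + f(y))          # f linearises F
print("associativity holds and f(F(x, y)) = f(x) + f(y) on the sampled points")
```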

Let a be any positive real number. The function Q defined by

Q(α, Γ) = a^(−f∘i(α, Γ))

has these properties, some, at least, of which are familiar:

1. 0 ≤ Q(α, Γ) ≤ 1;
2. Q(α, Γ) = 1 when Γ ⊢ α;
3. if, for all β ∈ Δ, Γ ⊢ β and, for all γ ∈ Γ, Δ ⊢ γ then Q(α, Γ) = Q(α, Δ);
4. for some α and Γ for which Γ ∪ {α} is consistent, Q(α, Γ) < 1;
5. Q(α ∧ β, Γ) = Q(α, Γ) × Q(β, Γ ∪ {α}) = Q(β, Γ) × Q(α, Γ ∪ {β}).20

The need for 4. is the first formal indication that we are dealing with something like a Popper function here, rather than a completely standard probability distribution.

If a function Q has these properties, so does any function Q^b, for any positive real number b.—This corresponds to the uniqueness of f up to multiplication by a positive real number.

19 If ⊤ = +∞, continuity at ⊤ means that F(⊤, y) = lim_{x→+∞} F(x, y) and likewise, mutatis mutandis, for the other argument.
20 1., 2., and 5. are closely reminiscent of Morgan and Mares' axioms for a core probability function (Morgan and Mares 1995).


Obviously, this property is preserved if we go with the constraint (vi1) on ≼, for this imposes the additional constraint on Q:

i. Q(α, Γ) ≤ Q(α, Γ ∪ Δ),

and, indeed, this feature is still present when we take note of the constraint (vi2°) which yields

ii. if ∃β Q(β, Γ) < 1 then Q(⊥, Γ) = 0.

Any conception of probability that endorses 1.–5., i., and ii. has as consequence the remarkable

if ∃β Q(β, Γ) < 1 and Q(α, Γ) > 0 then Q(¬α, Γ) = 0.

What if, instead, we endorse (vi2) as a constraint on ≼? Now matters are much more interesting. We know, for example, that Q(¬α, Γ) is determined by Q(α, Γ) and that, when Γ ⊢ ¬(α ∧ β), Q(α ∨ β, Γ) is determined by Q(α, Γ) and Q(β, Γ). Two logical facts are important here:

a. (α ∨ β) ∨ γ ⊣⊢ α ∨ (β ∨ γ);
b. (α ∨ β) ∧ γ ⊣⊢ (α ∧ γ) ∨ (β ∧ γ).

The first means that, on its domain of definition, the function G satisfies

G(G(x, y), z) = G(x, G(y, z))

—the associativity equation again. The second links the behaviour of F and G:

F(G(x, y), z) = G(F(x, z), F(y, z)).

While it can scarcely be said to be obvious, it is in fact a consequence of these conditions that there is a unique positive real number c such that

6. if Γ ⊢ ¬(α ∧ β) then Q^c(α ∨ β, Γ) = Q^c(α, Γ) + Q^c(β, Γ).

(See Aczél (1966, §7.1.4) or Paris (1994, pp. 29–32) for details.) The value of c is determined by a straightforward consideration. S is an order-inverting involution defined on the interval [0, ⊤], and so there is a unique point λ in (0, ⊤) such that S(λ) = λ. Its significance is this:

if 〈α, Γ〉 ≺ 〈⊥, Γ〉 and 〈α, Γ〉 ≈ 〈¬α, Γ〉 then i(α, Γ) = λ.

It must then be the case that

(a^(−f(λ)))^c = 1/2.

Simplifying, we have that, given the quantitative measure of information i, there is a unique Popper function P, given by

P(α, Γ) = 2^(−f∘i(α, Γ)/f(λ)),


and such that

i(α, Γ) ≤ i(β, Δ) if, and only if, P(α, Γ) ≥ P(β, Δ).
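Here is a worked numerical instance of the whole rescaling, under the illustrative assumption that the given measure of information added happens to be i = 1 − P for some conditional probability P (so I = [0, 1] and ⊤ = 1). Then F(x, y) = x + y − xy, S(x) = 1 − x with fixed point λ = 1/2, f(x) = −log(1 − x) satisfies f(F(x, y)) = f(x) + f(y) with f(0) = 0, and 2^(−f∘i/f(λ)) recovers P, reversing the order of i and sending its natural zero to 1, as stated.

```python
import math

def F(x, y):             # i(alpha & beta, Gamma) from i(alpha, Gamma) and i(beta, Gamma + {alpha})
    return x + y - x * y

def S(x):                # i(not alpha, Gamma) from i(alpha, Gamma)
    return 1 - x

def f(x):                # the additive rescaling given by the associativity theorem
    return -math.log(1 - x) if x < 1 else math.inf

lam = 0.5                # the fixed point of S
assert S(lam) == lam and f(0) == 0

def P_from_i(x):
    """The probability rescaling 2^(-f(x)/f(lambda))."""
    return 2 ** (-f(x) / f(lam)) if x < 1 else 0.0

for x in [0.0, 0.2, 0.5, 0.9]:
    # order-reversing, zero sent to 1; here it recovers P = 1 - i exactly
    assert math.isclose(P_from_i(x), 1 - x)
# additivity of f o i corresponds to the product rule for P
x, y = 0.3, 0.6
assert math.isclose(P_from_i(F(x, y)), P_from_i(x) * P_from_i(y))
print("P(alpha, Gamma) = 2^(-f(i)/f(lambda)) recovers the underlying conditional probability")
```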

3.2.2 Cantor, Debreu

So far, so good, but what entitles us to the assumption that there is a quantitative representation of the connected and transitive ordering ≼? To answer this, let's go back to Koopman, for a moment. The important structural feature of Koopman's Plenitude Assumption is that it entails that there is a countably infinite set of pairs {〈αi, Γi〉 : i ∈ I} which, under ≼, form a dense, linearly ordered set with end-points. This set forms a "spine" against which, under the assumption of connectedness, all other pairs 〈β, Δ〉 can be compared and given a unique position. From a classical result of Cantor's we have that any countable, dense, linearly ordered set with end-points can be mapped isomorphically into both a bounded subset of the real numbers and into a set [a, +∞], a ∈ ℝ. Where j is the isomorphism,

j(α, Γ) ≤ j(β, Δ) just in case ⟨α, Γ⟩ ⪯ ⟨β, Δ⟩.

Clearly, the values to which the end-points are mapped are a matter of convention.

It is a consequence of Paris's Plenitude Assumption that there must be such a countably infinite set of pairs {⟨αi, Γi⟩ : i ∈ I}, densely and linearly ordered under ⪯. Consequently, the assumption that gives us leave to appeal to solutions of the associativity equation in establishing existence and uniqueness results regarding a "probability transform" of a measure of information added itself guarantees the existence of a quantitative representation of the sort to which the Cox–Good–Aczél representation theorem applies.
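To see how such a spine yields a numerical representation, here is a sketch in the spirit of Cantor's construction (an illustration, not the paper's argument): enumerate a countable dense linear order with end-points and assign real values one element at a time, always placing a new element strictly between the values already given to its neighbours. The dyadic rationals in [0, 1] stand in for the spine of pairs ⟨αi, Γi⟩ under ⪯; the enumeration order is deliberately not the magnitude order.

```python
from fractions import Fraction

def spine(depth):                        # bottom, top, then dyadic rationals in between
    items = [Fraction(0), Fraction(1)]
    items += [Fraction(k, 2 ** d) for d in range(1, depth + 1)
              for k in range(1, 2 ** d) if k % 2 == 1]
    return items                         # enumeration order, not magnitude order

def embed(items):
    j = {items[0]: 0.0, items[1]: 1.0}   # end-points are sent to conventional values
    for x in items[2:]:
        below = max(j[y] for y in j if y < x)
        above = min(j[y] for y in j if y > x)
        j[x] = (below + above) / 2       # strictly between its current neighbours
    return j

items = spine(6)
j = embed(items)
assert all((x < y) == (j[x] < j[y]) for x in items for y in items if x != y)
print(f"order-isomorphic image of {len(items)} spine elements inside [0, 1]")
```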

We saw, in effect, that in order for the resulting quantitative representation to be faithful, it must be the case that

for any β, γ, Δ and E, if ⟨β, Δ⟩ ≺ ⟨γ, E⟩ then, for some i ∈ I, ⟨β, Δ⟩ ≺ ⟨αi, Γi⟩ ≺ ⟨γ, E⟩.

This constraint has the effect of forcing {⟨αi, Γi⟩ : i ∈ I} to be densely ordered under ⪯. A similar-looking, but in fact very much weaker, constraint occurs in a theorem of Debreu (1954), cited (Krantz et al. 1971, p. 40):

If ⪯ is connected and transitive the following conditions are equivalent:
• there is a countable set {⟨αi, Γi⟩ : i ∈ I} such that for any β, γ, Δ and E, if ⟨β, Δ⟩ ≺ ⟨γ, E⟩ then, for some i ∈ I, ⟨β, Δ⟩ ⪯ ⟨αi, Γi⟩ ⪯ ⟨γ, E⟩;
• there is a real-valued function j such that, for any β, γ, Δ and E, ⟨β, Δ⟩ ⪯ ⟨γ, E⟩ if, and only if, j(β, Δ) ≤ j(γ, E).

While indicating a minimal, necessary and sufficient condition for the existence of a faithful representation, Debreu's ordering constraint is nowhere close to being the sort of plenitude assumption that we need for the Cox–Good–Aczél representation theorem. (It's compatible with Debreu's conditions that j takes a single value!)

What emerges from all this is that an assumption guaranteeing the existence of a countable, densely ordered spine, of the sort Koopman's Assumption gives us very directly, is the minimum we need to show that any measure of information added must be capable of rescaling as a unique conditional probability distribution (Popper function).

4 The Plurality of Measures of Information Added

What we have seen is that, given a rich enough field of application, any well behaved quantitative measure of information added that satisfies (vi2n) must be rescalable as a Popper function. Conversely, given a Popper function P defined on L × ℘(L), where L is a set of propositions closed under negation and conjunction, any order-inverting bijection i mapping [0, 1] into some subset of ℝ⁺ ∪ {0, +∞} with 1 being mapped to 0 yields the function i ∘ P, which satisfies the constraints (in)–(vn) and (vi2n) and thus serves as a measure of information added.
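To make the point concrete, here is a small illustration (my own) of two such rescalings applied to the same values of P: i₁ = −log₂ P and i₂ = 1 − P. Both send 1 to 0 and both invert order, so they induce the same comparative notion of information added, while disagreeing on the numbers themselves.

```python
import math

def i_log(p): return -math.log2(p) if p > 0 else math.inf   # maps [0, 1] onto [0, +infinity]
def i_lin(p): return 1.0 - p                                 # maps [0, 1] onto [0, 1]

ps = [1.0, 0.8, 0.5, 0.25, 0.1]
assert i_log(1.0) == 0.0 and i_lin(1.0) == 0.0               # nothing is added by what P rates certain
for p, q in zip(ps, ps[1:]):                                 # p > q, so q should add more information
    assert i_log(p) < i_log(q) and i_lin(p) < i_lin(q)
# same ordering of "information added", different numbers:
print([round(i_log(p), 3) for p in ps])
print([round(i_lin(p), 3) for p in ps])
```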

If we wish to discriminate between these measures, we must find additional constraints met by some but not others. It might, for example, be thought that additive measures are particularly to be prized since under an additive measure the amount of information added by a conjunction just is the sum of the amount added by one conjunct and the amount added over and above that by the other. Undeniably, that has a certain appeal, so we would do well to heed the words of Krantz et al. in a related context:

In some discussions of measurement, great emphasis is placed upon a particular representation and its uniqueness properties—in the case of extensive measurement, the emphasis is on the additivity of the representation and its uniqueness up to multiplication by a positive constant. However, despite its great appeal and universal acceptance, the additive representation is just one of the infinitely many, equally adequate representations that are generated by the family of strictly monotonic increasing functions from the reals onto the positive reals. The essential fact about the uniqueness of a representation is not the particular group of admissible transformations, but that all groups are isomorphic and, in the case of extensive measurement, are all one-parameter groups; that is, there is exactly one degree of freedom in any particular representation. Krantz et al. (1971, p. 102)

5 Information Added and Matters Mostly Epistemic

In the preface to A Treatise on Probability, Keynes said that there was much that is novel in the book and went on to warn, 'being novel, unsifted, inaccurate, or deficient' (Keynes 1921, p. viii). Doubtless there is much that is unsifted, inaccurate, or deficient in the above, but a claim to novelty may jar. Probability-based measures of information are not new. That is indeed so, but there are two points to bear in mind here. Firstly, most work on probability-based measures of information aims to capture a notion of "intrinsic information", not the relational notion of information added. It may well be that with a notion of information added in hand one can then go on to define a notion of intrinsic information—the obvious suggestion is i(α, ∅)—but that is at best a by-product. Secondly, we started with qualitative considerations bearing on the notion of information added and arrived at probabilistic structures; we did not start out with the aim of defining information added in probabilistic terms. In this regard the approach is different to that in the bulk of work in information theory, in which some statistical probability of occurrence is presupposed. In much the same way it differs too from Carnap and Bar-Hillel's account of semantic information. They begin their outline of a theory of semantic information by saying, 'The fundamental concepts of the theory of semantic information can be defined in a straightforward way on the basis of the theory of inductive probability' (Carnap and Bar-Hillel 1953, p. 148). The approach here is the antithesis of this. We have obtained, if you like, an information-theoretic conception of probability. We have yet to see how it might relate to "inductive probability", that is, to an epistemic notion of probability. Here we take only a few preliminary steps.

5.1 Information-Theoretic Entropy

To emphasise the point that the notion of probability obtained is, as yet, in some sense information-theoretic, it is instructive to consider the quantity

−∑_{i=1}^{n} P(αi, Γ) log P(αi, Γ),

when Γ ⊢ α1 ∨ α2 ∨ · · · ∨ αn and Γ ⊢ ¬(αi ∧ αj), 1 ≤ i < j ≤ n. It has the form of an expectation but in the present context it is pretty much meaningless. It is a sum of products of pairs of terms; one of the terms is a measure of the information αi adds to Γ on the (vi2) understanding of information added, the other is a rescaling of that quantity. We have, at this point, no reason to suppose that P(αi, Γ) is, in any sense, a probability of occurrence of αi.
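For definiteness, taking logarithms to base 2 and assuming some values of P(αi, Γ) over a four-cell partition, here is a tiny computation (illustrative only) that exhibits the structure just described: each summand is the product of the information αi adds to Γ and a rescaling of that very quantity, which is why, absent a reading of P as a probability of occurrence, the sum is not yet the expectation of anything.

```python
import math

P_alpha = [0.5, 0.25, 0.125, 0.125]                     # assumed values of P(alpha_i, Gamma)
info_added = [-math.log2(p) for p in P_alpha]            # information alpha_i adds to Gamma
rescaled = [2.0 ** (-i) for i in info_added]             # the same quantities, rescaled back
quantity = sum(r * i for r, i in zip(rescaled, info_added))
print(f"-sum P log2 P = {quantity:.3f} bits for this toy partition")   # 1.75 here
```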

5.2 Degrees of Belief and Information Added

Let Q be a probability distribution over the field generated from the partition (as Γ sees it) {α1, α2, . . . , αn}. The quantity

−∑_{i=1}^{n} Q(αi) log P(αi, Γ),

is maximised when Q(αi) = P(αi, Γ), 1 ≤ i ≤ n—as follows from a theorem due to Aczél and Pfanzagl (1966), cited (Krantz et al., 1971, p. 401). So if Q represents an agent's degrees of belief and −log P is her measure of information added, we find that the expected amount of information added is maximised when Q(αi) = P(αi, Γ), 1 ≤ i ≤ n. However, we cannot without further ado read out of this any argument for equating Q(αi) and P(αi, Γ), 1 ≤ i ≤ n, for Γ does not appear as an argument of Q and we have no reason to think that for all Δ such that Δ ⊢ α1 ∨ α2 ∨ · · · ∨ αn and Δ ⊢ ¬(αi ∧ αj), 1 ≤ i < j ≤ n, P(αi, Δ) = P(αi, Γ), 1 ≤ i ≤ n.

If we are to make progress here we need to fix Γ.

Take as Γ the stock of the agent's current full beliefs and let us suppose that what she believes fully is consistent.[21] The expected gain in information across the partition (as the agent now sees it) {α1, α2, . . . , αn} is maximised when her degree of belief in each possible event matches the rescaling of the information it adds to her stock of full beliefs. This is by no means an unhappy outcome. And we can even go a step further, noting that, because P(αi, Γ) now is a probability of occurrence of αi, −∑_{i=1}^{n} P(αi, Γ) log P(αi, Γ) now is meaningful, and think how to maximise this quantity: maximising expected information added is the maximum entropy principle by another name.

[21] Full belief in a proposition has been characterized as the doxastic attitude or state in which a person categorically accepts that proposition as true (see, e.g., Joyce 2009, p. 263).

But—there's always a but—but −log is, up to a positive linear transformation, the only well behaved function that maximises

−∑_{i=1}^{n} Q(αi) log P(αi, Γ),

for all possible distributions Q, when Q(αi) = P(αi, Γ), 1 ≤ i ≤ n (Aczél and Pfanzagl 1966; Krantz et al. 1971, p. 401). So this is so much pie in the sky without an argument for giving pride of place to additive measures of information added—and Krantz, Luce, Suppes and Tversky have warned us against too rashly doing that.

5.3 Information Added and Conditionals

Were we to have a good reason for fixing Γ to comprise the stock of the agent's current full beliefs and P(·, Γ) to be her distribution of degrees of belief, we would then get this neat connexion between information added and the indicative conditional as we are told, by Dorothy Edgington and others, it is understood, on condition that we use the measure −log P (which is, up to a multiplicative constant, the only additive measure):

the difference between the information added to Γ by α ∧ β and by α on its own is the information added to Γ by the indicative conditional 'if α then β',

for, according to Edgington, P(β, Γ ∪ {α}) just is the agent's degree of belief in the (non-truth-conditional) indicative conditional 'if α then β' when P(·, Γ) is her distribution of degrees of belief (see, e.g., Edgington 1986).

Appealing instead to the measure 1 − P, we obtain a result that is in at least one respect even better, for it holds irrespective of the interpretation of Γ. The material conditionals ¬α ∨ β and ¬α ∨ (α ∧ β) are logically equivalent. α ∧ β is, up to logical equivalence, the strongest proposition that can be inferred from the pair {α, ¬α ∨ β}. And, in classical propositional logic, ¬α ∨ β is, up to logical equivalence, the weakest proposition that, jointly with α, entails β (equivalently, α ∧ β). Now, when we use the measure 1 − P, we have that the difference in information added to Γ between α ∧ β and α is the information added to Γ by the logically weakest proposition that, jointly with α, permits β (equivalently, α ∧ β) to be deduced. The difference in information added to Γ between α ∧ β and α thus serves as a measure of the deductive "distance" or "gap" between Γ ∪ {α} and Γ ∪ {α, β}, equivalently, Γ ∪ {α, ¬α ∨ β}.

Under the measure i = 1 − P, i(α ∧ β, Γ) − i(α, Γ) = i(¬α ∨ β, Γ) = i(¬α ∨ (α ∧ β), Γ); thus the difference i(α ∧ β, Γ) − i(α, Γ) nicely maps an essential part of the deductive structure of classical propositional logic in a way that is not available to measures that are not linear functions of P. Of course, this is not to say that they—measures that are not linear functions of P—do not have other virtues, such as, in the case of −log P, the identification of the information added by β over and above what α adds to Γ with the information added by the Edgington "indicative conditional" 'if α then β' to Γ (for appropriate Γ).
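Both observations are easy to verify numerically. The following sketch (a toy model of my own: ordinary conditional probability over four equally weighted valuations of two atoms, with an empty background Γ) checks that with i = 1 − P the difference i(α ∧ β, Γ) − i(α, Γ) is the information added by the material conditional ¬α ∨ β, and that with i = −log P the same difference is −log P(β, Γ ∪ {α}), the quantity the text connects with the Edgington reading of the indicative conditional.

```python
import math
from itertools import product

WORLDS = list(product([True, False], repeat=2))        # valuations of atoms a, b

def P(prop, gamma):                                    # proportion of Gamma-worlds where prop holds
    base = [w for w in WORLDS if all(g(w) for g in gamma)]
    return sum(prop(w) for w in base) / len(base)

a        = lambda w: w[0]
b        = lambda w: w[1]
a_and_b  = lambda w: a(w) and b(w)
mat_cond = lambda w: (not a(w)) or b(w)                # the material conditional

gamma = []                                             # empty background for simplicity
# i = 1 - P : difference between conjunction and conjunct = material conditional
lhs = (1 - P(a_and_b, gamma)) - (1 - P(a, gamma))
rhs = 1 - P(mat_cond, gamma)
assert abs(lhs - rhs) < 1e-12

# i = -log P : the same difference = information added by b given Gamma plus a
lhs = -math.log(P(a_and_b, gamma)) - (-math.log(P(a, gamma)))
rhs = -math.log(P(b, gamma + [a]))
assert abs(lhs - rhs) < 1e-12
print("both identities check out on the toy model")
```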

5.4 Confirmation

Background knowledge/information Γ delimits a space of possibilities. Evidence e rules out possibilities and thus focuses attention on those possibilities left open by Γ that are compatible with e. In both ranges of possibilities, hypothesis h holds in some and not in others (unless it's trivial in some respect). We may say that e favours h just if, speaking loosely, the proportion of possibilities left open by Γ ∪ {e} in which h holds is greater than the proportion of possibilities left open by Γ alone in which it holds. In other words,

e favours h (against background Γ) if, and only if, ⟨h, Γ ∪ {e}⟩ ≺ ⟨h, Γ⟩.

Furthermore, since we are using range-of-possibilities talk, we are in (vi2)'s ambit of application. Hence we have some reason to hold that a measure of confirmation should be a function of i(h, Γ ∪ {e}) and i(h, Γ), hence of P(h, Γ ∪ {e}) and P(h, Γ), where P(·, Γ) represents the agent's degrees of belief, and, indeed, there are direct confirmation-theoretic grounds for thinking that any measure of confirmation should be a function of P(h, Γ ∪ {e}) and P(h, Γ) when Γ comprises the rational agent's background knowledge (see Milne 2011, Corollary 1.1).
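A toy illustration (mine, not the paper's): let the possibilities left open by Γ be the six outcomes of a die, let e be "the outcome is even" and h "the outcome is at least 4". The favouring relation displayed above comes down to comparing the proportion of h-possibilities before and after e is taken on board, i.e. to comparing P(h, Γ) with P(h, Γ ∪ {e}).

```python
possibilities = {1, 2, 3, 4, 5, 6}        # left open by the background Gamma
e = {2, 4, 6}                             # the evidence
h = {4, 5, 6}                             # the hypothesis

def proportion(hyp, left_open):
    return len(hyp & left_open) / len(left_open)

before = proportion(h, possibilities)             # P(h, Gamma)           = 3/6
after  = proportion(h, possibilities & e)         # P(h, Gamma plus e)    = 2/3
print(f"P(h, Gamma) = {before:.3f},  P(h, Gamma + e) = {after:.3f}")
print("e favours h" if after > before else "e does not favour h")
```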

6 Conclusion: Probability as a Rescaling of a Measure of Information Added

Starting from intuitive, qualitative considerations bearing on the notion of information added, we have isolated two distinct and incompatible ways to conceive information added. One notion quite directly gets us to Popper functions on the assumption that the ordering

the sentence β adds at least as much information to the set of sentences Δ as the sentence α adds to the set of sentences Γ

can be extended, at least formally, to a connected ordering satisfying the intuitive considerations relating to logical structure. On the assumption that both notions yield quantities measurable by a single number, we find that quantitative measures must be susceptible to rescaling as probability measures, although what exactly we mean by a probability measure differs with the two notions. In one case we have a Popper function; in the other a probability-like function satisfying the unusual constraint that P(¬α|Γ) = 0 when 0 < P(α|Γ) < 1. To obtain these representations we have to make the sort of plenitude assumptions common in the literature on the theory of measurement.

Four topics would benefit from further exploration. In no particular order they are:

• the properties of the highly non-standard probability-like functions obtained from the overlap/encyclopaedia (novelty-value) conception of information added;

• the exact conditions under which the ordering extends to a connected ordering (and an investigation of what happens when they are weakened);

• the properties of representations employing infinitesimals to ensure faithfulness to the ordering (and the conditions under which they may be obtained);

• relations, if any, between the "information-theoretic" reading of probability offered here and more familiar epistemic conceptions.

Much, then, remains to be done.

References

Aczél, J. (1966). Lectures on functional equations and their applications. New York and London: Academic Press. (Reprinted 2006, Mineola, NY: Dover.) (Supplemented English translation of Vorlesungen über Funktionalgleichungen und ihre Anwendungen. Basel: Birkhäuser, 1961.)
Aczél, J., & Daróczy, Z. (1975). On measures of information and their characterization. Vol. 115 of Mathematics in science and engineering. New York and London: Academic Press.
Aczél, J., & Pfanzagl, J. (1966). Remarks on the measurement of subjective probability and information. Metrika, 2, 91–105.
Arnborg, S., & Sjödin, G. (2001). On the foundations of Bayesianism. In A. Mohammad-Djarafi (Ed.), Bayesian inference and maximum entropy methods in science and engineering, 20th international workshop, Gif-sur-Yvette (France), 2000. Vol. 568 of AIP conference proceedings (pp. 61–71). American Institute of Physics.
Bar-Hillel, Y. (1952). Semantic information and its measures. In Transactions of the tenth conference on cybernetics (pp. 33–48). New York: Josiah Macy Jr. Foundation. (Reprinted in Language and information: Selected essays on their theory and application, pp. 298–310, Y. Bar-Hillel, Ed., 1964, Reading, MA: Addison-Wesley.)
Carnap, R., & Bar-Hillel, Y. (1953). Semantic information. British Journal for the Philosophy of Science, 4, 147–157.
Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10, 261–273.
Debreu, G. (1954). Representation of preference ordering by a numerical function. In R. Thrall, C. Coombs, & R. David (Eds.), Decision processes (pp. 159–165). New York: Wiley.
Dummett, M. A. E. (1976). What is a theory of meaning? (II). In G. Evans & J. McDowell (Eds.), Truth and meaning: Essays in semantics (pp. 67–137). Oxford: Oxford University Press. (Reprinted in Dummett, The seas of language, pp. 34–93, 1996, Oxford: Oxford University Press. Page reference to the reprint.)
Dummett, M. A. E. (1978). Truth and other enigmas. London: Duckworth.
Edgington, D. (1986). Do conditionals have truth conditions? Crítica, 18, 3–39. (Reprinted in Conditionals, pp. 176–201, F. Jackson, Ed., 1991, Oxford: Oxford University Press.)
Good, I. J. (1962). Subjective probability as the measure of an unmeasurable set. In E. Nagel, P. Suppes, & A. Tarski (Eds.), Logic, methodology and philosophy of science (pp. 319–329). Stanford: Stanford University Press. (Reprinted in Studies in subjective probability, second edition, pp. 133–146, H. Kyburg & H. Smokler, Eds., 1980, Huntington, NY: Krieger. Page reference to the reprint.)


Hacking, I. (1965). The logic of statistical inference. Cambridge: Cambridge University Press.
Ingarden, R. S., & Urbanik, K. (1962). Information without probability. Colloquium Mathematicum, 9, 131–150.
Joyce, J. M. (2009). Accuracy and coherence: Prospects for an alethic epistemology of partial belief. In F. Huber & C. Schmidt-Petri (Eds.), Degrees of belief. Vol. 342 of Synthese library (pp. 263–266). New York: Springer.
Keynes, J. M. (1921). A treatise on probability. London: Macmillan. (Reprinted 2004, Mineola, NY: Dover.)
Kolmogorov, A. (1929). General measure theory and probability calculus. Sbornik rabot Matematicheskogo Razdela, Kommunisticheskaya Akademiya, Sektsiya Estestvennikh i Tochnikh Nauk, 1, 8–21. (In Russian. English translation in A. N. Shiryayev (Ed.), Selected works of A. N. Kolmogorov, Vol. II, Probability theory and mathematical statistics (pp. 48–59). (G. Lundquist, Trans.) Dordrecht: Kluwer (1992).)
Koopman, B. O. (1940a). The axioms and algebra of intuitive probability. Annals of Mathematics, 41, 269–292.
Koopman, B. O. (1940b). The bases of probability. Bulletin of the American Mathematical Society, 46, 763–774. (Reprinted in H. Kyburg & H. Smokler (Eds.), Studies in subjective probability, second ed., pp. 117–131, 1980, Huntington, NY: Krieger.)
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement, Vol. I: Additive and polynomial representations. San Diego and London: Academic Press. (Reprinted 2006, Mineola, NY: Dover.)
Malmnäs, P.-E. (1981). From qualitative to quantitative probability. Vol. 7 of Stockholm studies in philosophy. Stockholm: Almqvist and Wiksell.
Milne, P. (2011). On measures of confirmation. British Journal for the Philosophy of Science (to appear).
Morgan, C., & Mares, E. (1995). Conditionals, probability, and non-triviality. Journal of Philosophical Logic, 24, 455–467.
Osteyee, D. B., & Good, I. J. (1974). Information, weight of evidence, the singularity between probability measures and signal detection. Vol. 376 of Lecture notes in mathematics. Berlin, Heidelberg and New York: Springer.
Paris, J. B. (1994). The uncertain reasoner's companion: A mathematical perspective. Vol. 39 of Cambridge tracts in theoretical computer science. Cambridge: Cambridge University Press.
Popper, K. R. (1959). Logic of scientific discovery. London: Hutchinson. (Expanded English translation of Logik der Forschung, Vienna: Springer, 1935.)
Popper, K. R. (1972). Conjectures and refutations (fourth ed.). London: Routledge and Kegan Paul. (First edition, 1963.)
Roeper, P., & Leblanc, H. (1991). Indiscernibility and identity in probability theory. Notre Dame Journal of Formal Logic, 32, 1–46.
Schroeder, M. J. (2004). An alternative to entropy in the measurement of information. Entropy, 6, 388–412.
Szpilrajn, E. (1930). Sur l'expansion de l'ordre partiel. Fundamenta Mathematicae, 16, 386–389.
Van Horn, K. S. (2003). Constructing a logic of plausible inference: A guide to Cox's theorem. International Journal of Approximate Reasoning, 34, 3–24.
Weirich, P. (1983). Conditional probabilities and probabilities given knowledge of a condition. Philosophy of Science, 50, 82–95.
