
Information Entropy and Granulation Co–Entropy of Partitions and Coverings: A Summary*

Daniela Bianucci and Gianpiero Cattaneo

Dipartimento di Informatica, Sistemistica e Comunicazione, Università di Milano – Bicocca
Viale Sarca 336/U14, I–20126 Milano, Italia
{bianucci,cattang}@disco.unimib.it

* The authors' work has been supported by the MIUR\PRIN project "Automata and Formal languages: mathematical and application driven studies" and by "Funds of Sovvenzione Globale INGENIO allocated by Fondo Sociale Europeo, Ministero del Lavoro e della Previdenza Sociale, and Regione Lombardia".

J.F. Peters et al. (Eds.): Transactions on Rough Sets X, LNCS 5656, pp. 15–66, 2009. © Springer-Verlag Berlin Heidelberg 2009

Abstract. Some approaches to the covering information entropy and some definitions of orderings and quasi–orderings of coverings will be described, generalizing the case of the partition entropy and ordering. The aim is to extend to coverings the general result of anti–tonicity (strictly decreasing monotonicity) of the partition entropy. In particular, an entropy for the case of incomplete information systems is discussed, with the expected anti–tonicity result, making use of a partial partition strategy in which the missing information is treated as a peculiar value of the system.

On the other side, an approach to generate a partition from a covering is illustrated. In particular, if we have a covering γ coarser than another covering δ with respect to a certain quasi order relation on coverings, the induced partition π(γ) turns out to be coarser than π(δ) with respect to the standard partial ordering on partitions. Thus, one can compare the two coverings through the entropies of the induced partitions.

Keywords: Measure distributions, probability distributions, partitions, partial partitions, coverings, partial ordering, quasi–ordering, entropy, co–entropy, isotonicity, anti–tonicity.

1 Introduction

Recently there has been great interest in the literature [20,21,15] in generalizing to the case of coverings the notion of entropy, as a measure of the information average, largely studied in the partition context of information theory by Shannon [26] (see the textbooks [16,1,25] as interesting introductions to this subject). The essential behavior we want to generalize is the isotonicity (i.e., strictly increasing monotonicity) of this information measure with respect to the natural partial order relation usually adopted on the family of all possible partitions of a given universe. The aim is to provide a strictly isotonic evaluation of the approximation of a given set in the context of rough set theory.

In [22,23] Pawlak introduced the roughness as a measure which quantitatively evaluates the boundary region of a set relative to its upper approximation in a given partition context. But, as shown by examples in section 2.8, the boundary region could remain invariant even if the partition changes. In this way a strict isotonicity of this roughness measure of a set is not guaranteed. Since this evaluation is based on the partition of the universe (which is independent from the set) and on the boundary of the set under consideration, in order to obtain a strictly isotonic measure of the roughness of a set one solution is to multiply the strictly isotonic measure of the partition granularity by the roughness measure of the set [9,5]. The strictly isotonic granularity measure considered in this work is the co–entropy of partitions.

One of the problems that arises in extending the partition approach to the covering context is that mutually equivalent formulations of the partial order relation on partitions yield different orderings and quasi–orderings on coverings. This leads to the fact that, if one wants to capture the fundamental property of isotonicity of an entropy in the covering context, the selection of the right (quasi) order relation becomes a crucial choice.

In recent years we have explored different (quasi) partial orderings on coverings, together with the most natural extensions to them of the partition entropy, often with negative results [5,2,4,6,12]. But as a final outcome of these negative attempts, we presented a new relation of partial ordering on coverings which allows one to obtain the requested isotonicity of the entropy, not directly, but in an indirect way [3]: to be precise, on the partition properly induced from a covering by a well defined procedure.

Since these investigations appeared only in papers published in various, different contexts, often with only brief descriptions (especially of proofs) for lack of space, we think it is now necessary to provide a unified view of them and of the obtained results (see sections 3, 4.2 and 4.4).

1.1 Entropy of Abstract Discrete Probability Distributions

In this subsection we discuss the abstract approach to information theory, abstract in the sense that it does not refer to a concrete universe X of objects, with the associated power set P(X) as the collection of all its subsets, but only to suitable finite sequences of numbers from the real unit interval [0, 1], each of which can be interpreted as a probability of occurrence of something. The main reason for this introduction is that both the case of partitions and that of coverings can be discussed as particular applications of this unified abstract framework.

First of all, let us introduce as information function (also called the Hartley measure of uncertainty, see [14]) the mapping I : (0, 1] → R assigning to any probability value p ∈ (0, 1] the real number

I(p) := − log(p)    (1)

interpreted as the uncertainty associated with an event whose occurrence probability is p. This is the unique function, up to an arbitrary positive constant multiplier, satisfying the following conditions:

(F-1) it is non–negative;
(F-2) it satisfies the so–called Cauchy functional condition I(p1 · p2) = I(p1) + I(p2);
(F-3) it is continuous;
(F-4) it is non–trivial (∃ p0 ∈ (0, 1] s.t. I(p0) ≠ 0).

The information function is considered as a measure of the uncertainty due to the knowledge of a probability: if the probability is 1, then there is no uncertainty and so its corresponding measure is 0. Moreover, any probability different from 1 (and 0) is linked to some uncertainty whose measure is greater than 0, in such a way that the lower the probability, the greater the corresponding uncertainty (strictly decreasing monotonicity of the uncertainty information): 0 < p1 ≤ p2 implies 0 ≤ I(p2) ≤ I(p1). Let us now introduce the two crucial notions of finite probability distribution and random variable.

A length N probability distribution is a vector p = (p1, p2, . . . , pN) satisfying the following conditions:

(pd-1) pi ≥ 0 for every i;
(pd-2) ∑_{i=1}^N pi = 1.

Trivially, from (pd-1) and (pd-2) it immediately follows that for every i, 0 ≤ pi ≤ 1. In this abstract context, a length N random variable is a vector a = (a1, a2, . . . , aN) in which each component is a real number: ai ∈ R for any i. For a fixed length N random variable a and a length N probability distribution p, the numbers ai are interpreted as the possible values of the random variable a and the quantities pi as the probability of occurrence of the event "a = ai" (thus, pi can be considered as a simplified notation of p(ai), a further simplification of another standard notation, p(a = ai)). The pair (p, a), consisting of an N–length probability distribution and an N–length random variable, constitutes a statistical scheme which in our finite case can be represented by the associated statistical matrix:

(p, a) = [ p1  . . .  pi  . . .  pN
           a1  . . .  ai  . . .  aN ]    (2)

Hence, the average (or mean, or expectation) value of the random variable a with respect to a probability distribution p is given by the quantity

Av(a, p) = ∑_{i=1}^N ai · pi

In particular, to any probability distribution p = (p1, p2, . . . , pN) it is possible to associate the uncertainty (information) random variable I[p] = (I(p1), I(p2), . . . , I(pN)), according to the statistical matrix

(p, I[p]) = [ p1     . . .  pi     . . .  pN
              I(p1)  . . .  I(pi)  . . .  I(pN) ]    (3)

whose average with respect to the probability distribution p is Av(p, I[p]) = ∑_{i=1}^N I(pi) · pi. This is the uncertainty average called, according to Shannon [26], the information entropy of the probability distribution, simply denoted by H(p) = Av(p, I[p]). Thus, taking into account (1), the entropy of the probability distribution p is explicitly expressed by the formula (with the convention 0 log 0 = 0):

H(p) = − ∑_{i=1}^N pi log pi    (4)

Since the information I(p) of a probability value p has been interpreted as a measure of the uncertainty due to the knowledge of this probability, the information entropy of a probability distribution p can be considered as a quantity which in a reasonable way measures the average uncertainty associated with this distribution, expressed as the mean value of the corresponding information random variable I[p]. Indeed, given a probability distribution p = (p1, p2, . . . , pN), its entropy H(p) = 0 iff one of the numbers p1, p2, . . . , pN is one and all the others are zero; this is just the case in which the result of the experiment can be predicted beforehand with complete certainty, so that there is no uncertainty as to its outcome. These probability distributions will be denoted by the conventional symbol pk = (δ^i_k)_{i=1,2,...,N}, where δ^i_k is the Kronecker delta centered in k. On the other hand, given a probability distribution p = (p1, . . . , pN), the entropy H(p) = log N iff pi = 1/N for all i = 1, . . . , N; this maximum of uncertainty corresponds to the uniform probability distribution pu = (1/N, 1/N, . . . , 1/N).

In all the other cases the entropy is a strictly positive number upper bounded by log N. In conclusion, the following order chain holds for any probability distribution p:

0 = H(pk) ≤ H(p) ≤ H(pu) = log N
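To make formula (4) concrete, here is a minimal Python sketch (an illustration, not code from the paper; base-2 logarithms are assumed) that computes the entropy of a finite probability distribution and exhibits the order chain above.

```python
# Entropy of a finite probability distribution, formula (4), with 0 log 0 = 0.
# Base-2 logarithms are assumed (entropy measured in bits).
from math import log2

def entropy(p):
    """H(p) = sum_i p_i log(1/p_i), skipping the zero-probability terms."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

N = 4
p_certain = [1, 0, 0, 0]                 # a Kronecker distribution p_k
p_generic = [0.5, 0.25, 0.125, 0.125]    # an intermediate case
p_uniform = [1 / N] * N                  # the uniform distribution p_u

# order chain: 0 = H(p_k) <= H(p) <= H(p_u) = log N
print(entropy(p_certain), entropy(p_generic), entropy(p_uniform))   # 0.0 1.75 2.0
```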

Measure Distributions and Probability Distributions. In investigating questions about information entropy, one more often has to do with the so–called measure distributions, i.e., real vectors of the kind m = (m1, m2, . . . , mN) under the conditions:

(md-1) mi ≥ 0 for every i;
(md-2) ∃ j0 such that m_{j0} ≠ 0.

The total measure of a measure distribution m is the quantity M(m) := ∑_{i=1}^N mi, with M(m) ≠ 0, which depends on the particular measure distribution m.

For any measure distribution m it is possible to construct the corresponding probability distribution, which depends on m:

p(m) = ( m1/M(m), m2/M(m), . . . , mN/M(m) )

which turns out to be the normalization of the measure distribution m with respect to its total measure M(m), i.e., p(m) = (1/M(m)) m. The entropy of p(m), denoted by H(m) instead of H(p(m)) in order to stress its dependence on the original measure distribution m, is the sum of two terms

H(m) = log M(m) − (1/M(m)) ∑_{i=1}^N mi log mi    (5)

If one defines as co–entropy the quantity (this too depending on the measure distribution m)

E(m) = (1/M(m)) ∑_{i=1}^N mi log mi    (6)

we have the following identity, which holds for any arbitrary measure distribution:

H(m) + E(m) = log M(m)    (7)

The name co–entropy assigned to the quantity E(m) arises from the fact that it "complements" the entropy H(m) with respect to the value log M(m), which depends on the distribution m. Of course, in the equivalence class of all measure distributions of identical total measure (m1 and m2 are equivalent iff M(m1) = M(m2)) this value is constant, whatever their length N may be.

From the above definition it is clear that the terms mi in (6) give a negative contribution if 0 < mi < 1, and so the co–entropy could be a negative quantity. There is nothing wrong with this result, but in some applications we shall interpret the co–entropy as a measure of the average granulation, and in this case it is interesting to express this quantity as a non–negative number. Therefore, in order to avoid this drawback of negative co–entropy, it is possible to consider the quantity

q(m) = min{mi ≠ 0 : i = 1, 2, . . . , N} > 0

and to construct the associated measure distribution obtained by a normalization of the original distribution m according to:

mq := ( m1/q(m), m2/q(m), . . . , mN/q(m) )

with mi/q(m) equal to 0 if mi = 0 and greater than or equal to 1 if mi ≠ 0. This measure distribution has the total measure M(mq) = ∑_{i=1}^N mi/q(m) = M(m)/q(m), and the associated co–entropy has the form

E(mq) = (1/M(m)) [ ∑_{i=1}^N mi log mi − M(m) log q(m) ]

obtaining as a final result the relationship

E(mq) = E(m) − log q(m).

In particular, taking into account that E(mq) ≥ 0, we have the inequality log q(m) ≤ E(m); that is, the original co–entropy E(m) may be a negative quantity, but it is lower bounded by log q(m).

From the point of view of the induced probability distributions, there is no change (they are invariant) with respect to the original distribution:

p(mq) = ( mi / (q(m) · M(mq)) )_{i=1,...,N} = ( mi / M(m) )_{i=1,...,N} = p(m)

and consequently also the entropies are invariant:

H(mq) = H(m)

and the above relationship between entropy and co–entropy expressed by equation (7) now assumes the form:

H(m) + E(mq) = log ( M(m) / q(m) )
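The relationships above can be checked numerically. The following sketch (an illustration assuming base-2 logarithms, not code from the paper) computes H(m), E(m) and E(mq) for an arbitrary measure distribution and verifies identity (7), the relation E(mq) = E(m) − log q(m), the invariance of the entropy, and the final identity.

```python
# Illustrative check (base-2 logs) of eqs. (5)-(7) and of the normalization by q(m).
from math import log2

def entropy_m(m):
    """H(m) = log M(m) - (1/M(m)) sum_i m_i log m_i, eq. (5)."""
    M = sum(m)
    return log2(M) - sum(mi * log2(mi) for mi in m if mi > 0) / M

def coentropy_m(m):
    """E(m) = (1/M(m)) sum_i m_i log m_i, eq. (6)."""
    M = sum(m)
    return sum(mi * log2(mi) for mi in m if mi > 0) / M

m = [0.5, 2.0, 3.0, 4.5]                  # a measure distribution satisfying (md-1), (md-2)
q = min(mi for mi in m if mi > 0)         # q(m)
m_q = [mi / q for mi in m]                # normalized measure distribution m_q
M = sum(m)

print(abs(entropy_m(m) + coentropy_m(m) - log2(M)) < 1e-12)         # identity (7)
print(abs(coentropy_m(m_q) - (coentropy_m(m) - log2(q))) < 1e-12)   # E(m_q) = E(m) - log q(m)
print(abs(entropy_m(m_q) - entropy_m(m)) < 1e-12)                   # H(m_q) = H(m)
print(abs(entropy_m(m) + coentropy_m(m_q) - log2(M / q)) < 1e-12)   # H(m) + E(m_q) = log(M(m)/q(m))
```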

2 Partitions

We treat now the role of entropy and co–entropy, as measures of average uncertainty and granulation respectively, in the concrete case of partitions of a fixed universe. First of all let us consider the case of partitions generated by information systems according to the Pawlak approach [22,24,17].

2.1 The Information System Approach to Rough Set Theory by Partitions

There is a natural way to induce partitions from (complete) information systems (IS), formalized by a triple IS := 〈X, Att, F〉 consisting of a nonempty finite set X of objects, the universe of the discourse, a nonempty finite set Att of attributes about the objects of the universe, and a mapping F : X × Att → val which assigns to any object x ∈ X the value F(x, a) ∈ val assumed by the attribute a ∈ Att.

Indeed, in this IS case the partition generated by a set of attributes A, denoted by π(A), consists of equivalence classes of indistinguishable objects with respect to the equivalence relation RA involving pairs of objects x, y ∈ X:

(In) (x, y) ∈ RA iff ∀ a ∈ A, F(x, a) = F(y, a).

The equivalence class generated by the object x ∈ X relatively to the set of attributes A is the granule of knowledge grA(x) := {y ∈ X : (x, y) ∈ RA}, characterized by an invariant set of values assumed by any object of the class. We will assume that an IS satisfies the following conditions, called coherence conditions in [10]:

(co1) The mapping F must be surjective; this means that if there exists a value v ∈ val which is not the result of the application of the information map F to some pair (x, a) ∈ X × Att, then this value has no interest with respect to the knowledge stored in the information system.

(co2) For any attribute a ∈ Att there exist at least two objects x1 and x2 such that F(x1, a) ≠ F(x2, a); otherwise this attribute does not supply any knowledge and can be suppressed.

Example 1. Let us imagine the following situation. Suppose you are a physician and you want to start collecting information about the health of some of your patients. The symptoms you are interested in are: the presence of fever, a sense of dizziness, blood pressure, headache and chest pain. But you are not interested in, for example, allergies. So, when organizing the data in your possession, you will consider just these five attributes and omit the allergy attribute. The result is a situation similar to the one presented in Table 1, where the set of objects is X = {p1, p2, p3, p4, p5, p6, p7, p8, p9, p10}, the family of attributes is Att = {Fever, Headache, Dizziness, Blood Pressure, Chest Pain} and the set of all possible values is val = {very high, high, low, normal, yes, no}.

Table 1. Medical complete information system

Patient  Fever      Headache  Dizziness  Blood Pressure  Chest Pain
p1       no         yes       yes        normal          yes
p2       high       no        yes        low             yes
p3       very high  no        no         low             no
p4       low        no        yes        low             yes
p5       low        yes       no         low             no
p6       high       no        yes        low             yes
p7       very high  no        yes        normal          no
p8       no         yes       yes        normal          yes
p9       no         yes       yes        low             yes
p10      no         yes       no         high            yes

If one considers the collection Att of all attributes, the universe turns out to be partitioned into the following equivalence classes:

π(Att) = { {p1, p8}, {p2, p6}, {p3}, {p4}, {p5}, {p7}, {p9}, {p10} }

The granule {p2, p6} can be considered as the support of the invariant knowledge: "The patient presents high fever and low blood pressure, but he/she has no headache; he/she reports feeling dizzy and having chest pain."

Similarly, if one considers the subfamily of attributes A = {Fever, Headache, Chest Pain}, the resulting partition of the universe under examination consists of the equivalence classes:

π(A) = { {p1, p8, p9, p10}, {p2, p6}, {p3, p7}, {p4}, {p5} }

where for instance the granule {p3, p7} is the support of the knowledge "The patient has very high fever, but he/she has neither headache nor chest pain.", invariant for any object of this equivalence class.
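As an illustration of relation (In), the following Python sketch (hypothetical code, with the data of Table 1 hard-coded) groups the objects by the tuple of values they assume on a set A of attributes and reproduces the partitions π(A) and π(Att) of Example 1.

```python
# Hypothetical sketch: partition pi(A) induced by an attribute subset A of Table 1,
# obtained by grouping objects with identical value tuples (relation (In)).
from collections import defaultdict

ATTS = ["Fever", "Headache", "Dizziness", "Blood Pressure", "Chest Pain"]
ROWS = {
    "p1":  ["no",        "yes", "yes", "normal", "yes"],
    "p2":  ["high",      "no",  "yes", "low",    "yes"],
    "p3":  ["very high", "no",  "no",  "low",    "no"],
    "p4":  ["low",       "no",  "yes", "low",    "yes"],
    "p5":  ["low",       "yes", "no",  "low",    "no"],
    "p6":  ["high",      "no",  "yes", "low",    "yes"],
    "p7":  ["very high", "no",  "yes", "normal", "no"],
    "p8":  ["no",        "yes", "yes", "normal", "yes"],
    "p9":  ["no",        "yes", "yes", "low",    "yes"],
    "p10": ["no",        "yes", "no",  "high",   "yes"],
}

def partition(attributes):
    """Partition pi(A) induced by the attribute subset A."""
    idx = [ATTS.index(a) for a in attributes]
    classes = defaultdict(set)
    for obj, values in ROWS.items():
        classes[tuple(values[i] for i in idx)].add(obj)
    return list(classes.values())

print(partition(["Fever", "Headache", "Chest Pain"]))
# [{'p1','p8','p9','p10'}, {'p2','p6'}, {'p3','p7'}, {'p4'}, {'p5'}]  (= pi(A) of Example 1)
print(partition(ATTS))   # reproduces pi(Att) of Example 1 (element order within blocks may vary)
```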


In any IS, given an attribute a ∈ Att one can define the set val(a) := {α : ∃ x ∈ X s.t. F(x, a) = α} containing all the possible values of a, and so the observation of this attribute a on an object x ∈ X yields the value F(x, a) ∈ val(a). To this fixed attribute a it is possible to assign a (surjective) mapping fa : X → val(a) defined by the correspondence x ↦ fa(x) := F(x, a). Noting that the global set of possible values of the information system is related to the "single" val(a) by the relation val = ∪_{a∈Att} val(a), each attribute a can be identified with the mapping fa ∈ val^X and so, introducing the collection of all such mappings Att(X) := {fa ∈ val^X : a ∈ Att}, in which Att plays the role of index set, an information system can also be formalized as a structure 〈X, Att(X)〉.

Thus, any fixed attribute a ∈ Att, with set of values val(a) = {α1, α2, . . . , αN}, generates a partition of the universe of objects

π(a) = { f_a^{-1}(α1), f_a^{-1}(α2), . . . , f_a^{-1}(αN) }

where the generic elementary set of π(a) is f_a^{-1}(αi) := {x ∈ X : fa(x) = αi}, i.e., the collection of all objects with respect to which the attribute a assumes the fixed value αi. The pair (a, αi) is interpreted as the elementary proposition "the attribute a has the value αi" and Ai := f_a^{-1}(αi) as the elementary event which tests the proposition (a, αi), in the sense that it is constituted by all objects with respect to which the proposition (a, αi) is "true" (x ∈ Ai iff fa(x) = αi). The event Ai is then the equivalence class (also denoted by [a, αi]) of all objects on which the attribute a assumes the value αi.

If we consider a set A consisting of two attributes a ∈ Att and b ∈ Att with corresponding sets of values val(a) = {α1, α2, . . . , αN} and val(b) = {β1, β2, . . . , βM}, then it is possible to define the mapping f_{a,b} : X → val(a, b), with val(a, b) := {(α, β) ∈ val(a) × val(b) : ∃ x ∈ X s.t. f_{a,b}(x) = (α, β)} ⊆ val(a) × val(b), which assigns to any object x the "value" f_{a,b}(x) := (fa(x), fb(x)). In this case we can consider the pair (a, b) ∈ Att^2 as a single attribute of the new information system 〈X, Att^2, {f_{a,b} : (a, b) ∈ Att^2}〉, always based on the original universe X. The partition generated by the attribute (a, b) is then the collection of all nonempty subsets of X of the form π(a, b) = {f_{a,b}^{-1}(αi, βj) ≠ ∅ : αi ∈ val(a) and βj ∈ val(b)}. The elementary events of the partition π(a, b) are thus the subsets of the universe of the following form, under the condition of being nonempty:

f_{a,b}^{-1}(αi, βj) := {x ∈ X : f_{a,b}(x) = (αi, βj)} = f_a^{-1}(αi) ∩ f_b^{-1}(βj)    (8)

Example 2. Making reference to Example 1 above, let us consider the two attributes F = Fever, with set of values val(F) = {very high, high, low, no}, and BP = Blood Pressure, with set of values val(BP) = {high, normal, low}. The corresponding two partitions are

π(F) = { [F = very high] = {p3, p7}, [F = high] = {p2, p6}, [F = low] = {p4, p5}, [F = no] = {p1, p8, p9, p10} }

π(BP) = { [BP = high] = {p10}, [BP = normal] = {p1, p7, p8}, [BP = low] = {p2, p3, p4, p5, p6, p9} }


The set of potential values of the pair of attributes A = {F, BP} is val(F) × val(BP) = {(very high, high), (very high, normal), (very high, low), (high, high), (high, normal), (high, low), (low, high), (low, normal), (low, low), (no, high), (no, normal), (no, low)}, with corresponding classes (v h = very high):

f_{F,BP}^{-1}(v h, high) = ∅          f_{F,BP}^{-1}(v h, normal) = {p7}       f_{F,BP}^{-1}(v h, low) = {p3}
f_{F,BP}^{-1}(high, high) = ∅         f_{F,BP}^{-1}(high, normal) = ∅         f_{F,BP}^{-1}(high, low) = {p2, p6}
f_{F,BP}^{-1}(low, high) = ∅          f_{F,BP}^{-1}(low, normal) = ∅          f_{F,BP}^{-1}(low, low) = {p4, p5}
f_{F,BP}^{-1}(no, high) = {p10}       f_{F,BP}^{-1}(no, normal) = {p1, p8}    f_{F,BP}^{-1}(no, low) = {p9}

where in particular f_{F,BP}^{-1}(very high, high), f_{F,BP}^{-1}(high, high), f_{F,BP}^{-1}(high, normal), f_{F,BP}^{-1}(low, high) and f_{F,BP}^{-1}(low, normal) are empty, so they are not possible equivalence classes of a partition. Thus, we can erase the pairs (very high, high), (high, high), (high, normal), (low, high) and (low, normal) as possible values, obtaining val(F, BP) = {(very high, normal), (very high, low), (high, low), (low, low), (no, high), (no, normal), (no, low)} ⊂ val(F) × val(BP), in order to regain the coherence condition of surjectivity (co1).

Hence, adopting the notation [(a, αi) & (b, βj)] to denote the elementary event f_{a,b}^{-1}(αi, βj) ≠ ∅, we have that [(a, αi) & (b, βj)] = [a, αi] ∩ [b, βj]. If (a, αi) & (b, βj) is interpreted as the conjunction "the attribute a has the value αi and the attribute b has the value βj" (i.e., & represents the logical connective "and" between propositions), then this result says that the set of objects in which this proposition is verified is just the set of objects in which simultaneously "a has the value αi" and "b has the value βj". On the other hand, making use of the notations C_{i,j} := f_{a,b}^{-1}(αi, βj), Ai = f_a^{-1}(αi) and Bj = f_b^{-1}(βj), we can reformulate the previous result as C_{i,j} = Ai ∩ Bj. In other words, elementary events from the partition π(a, b) are obtained as nonempty set theoretic intersections of elementary events from the partitions π(a) and π(b). This fact is denoted by π(a, b) = π(a) · π(b).

The generalization of this procedure to any family of attributes is straightforward. Indeed, let A = {a1, a2, . . . , ak} be such a family of attributes from an information system. Then it is possible to define the partition π(A) = π(a1, a2, . . . , ak) = π(a1) · π(a2) · . . . · π(ak) := {Ai ∩ Bj ∩ . . . ∩ Kp ≠ ∅ : Ai ∈ π(a1), Bj ∈ π(a2), . . . , Kp ∈ π(ak)}. If now one considers another family of attributes B = {b1, b2, . . . , bh}, then

π(A ∪ B) := π(a1, . . . , ak, b1, . . . , bh) = π(A) · π(B) (9)

2.2 The Partition Approach to Rough Set Theory

So the usual approach to rough set theory as introduced by Pawlak turns out to be a particular case of a more general approach formally (and essentially) based on a concrete partition space, that is, a pair (X, π) consisting of a nonempty (not necessarily finite) set X, the universe, with corresponding power set P(X) forming the collection of sets which can be approximated, and a partition π := {Ai ∈ P(X) : i ∈ I} of X (indexed by the index set I) whose elements are the elementary sets. The partition π can be characterized by the induced equivalence relation Rπ ⊆ X × X, defined as

(x, y) ∈ Rπ iff ∃ Aj ∈ π s.t. x, y ∈ Aj    (10)

In this case x, y are said to be indistinguishable with respect to Rπ, and the equivalence relation Rπ is called the indistinguishability relation induced by the partition π. In this indistinguishability context the partition π is considered as the support of some knowledge available on the objects of the universe, and so any equivalence class (i.e., elementary set) is interpreted as a granule (or atom) of knowledge contained in (or supported by) π. For any object x ∈ X we shall denote by grπ(x), called the granule generated by x relatively to π, the (unique) equivalence class from π which contains x (if x ∈ Ai, then grπ(x) = Ai).

A crisp set (we prefer, also in view of the forthcoming probability considerations, the term event) is any subset of X obtained as the set theoretic union of elementary subsets: E_J = ∪{Aj ∈ π : j ∈ J ⊆ I}. The collection of all such crisp sets, plus the empty set ∅, will be denoted by Eπ(X), and it turns out to be a Boolean algebra 〈Eπ(X), ∩, ∪, c, ∅, X〉 with respect to set theoretic intersection, union, and complementation. This Boolean algebra is atomic, and its atoms are just the elementary sets from the partition π. From the topological point of view, Eπ(X) contains both the empty set and the whole space, and moreover it is closed with respect to arbitrary set theoretic unions and intersections, i.e., it is a family of clopen subsets for a topology on X. In this way we can construct the concrete approximation space Rπ := 〈P(X), Eπ(X), lπ, uπ〉, consisting of:

(1) the Boolean (complete) atomic lattice P(X) of all approximable subsets of the universe X, whose atoms are the singletons;

(2) the Boolean (complete) atomic lattice Eπ(X) of all definable subsets of X, whose atoms are the equivalence classes of the partition π;

(3) the lower approximation map lπ : P(X) → Eπ(X), associating with any subset Y of X its lower approximation defined by the (clopen) crisp set

lπ(Y) := ∪{E ∈ Eπ(X) : E ⊆ Y} = ∪{A ∈ π : A ⊆ Y}

(4) the upper approximation map uπ : P(X) → Eπ(X), associating with any subset Y of X its upper approximation defined by the (clopen) crisp set

uπ(Y) := ∩{F ∈ Eπ(X) : Y ⊆ F} = ∪{B ∈ π : Y ∩ B ≠ ∅}

The rough approximation of a subset Y of X is then the clopen pair rπ(Y) := 〈lπ(Y), uπ(Y)〉, with lπ(Y) ⊆ Y ⊆ uπ(Y), which is the image of the subset Y under the rough approximation mapping rπ : P(X) → Eπ(X) × Eπ(X) described by the following diagram:


[Diagram: Y ∈ P(X) is mapped by lπ to lπ(Y) ∈ Eπ(X) and by uπ to uπ(Y) ∈ Eπ(X); the two images together yield the pair 〈lπ(Y), uπ(Y)〉.]

The boundary of Y is defined as the set bπ(Y) := uπ(Y) \ lπ(Y), whereas its exterior is eπ(Y) = uπ(Y)^c. Trivially, for any Y the collection π(Y) := {lπ(Y), bπ(Y), eπ(Y)} is a new partition of X generated by Y inside the original partition π. The elements E from Eπ(X) have been called crisp since their rough representation is of the form rπ(E) = (E, E), with empty boundary.
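A minimal sketch (illustrative only, with sets of integers standing for granules, not code from the paper) of the lower and upper approximation maps and of the boundary follows.

```python
# Illustrative sketch of the lower/upper approximation maps l_pi, u_pi and the boundary b_pi.
def lower(pi, Y):
    """l_pi(Y): union of the granules entirely contained in Y."""
    return {x for A in pi if A <= Y for x in A}

def upper(pi, Y):
    """u_pi(Y): union of the granules having nonempty intersection with Y."""
    return {x for A in pi if A & Y for x in A}

def boundary(pi, Y):
    """b_pi(Y) = u_pi(Y) \\ l_pi(Y)."""
    return upper(pi, Y) - lower(pi, Y)

pi = [{1, 2, 3}, {4, 5, 6}, {7}, {8}, {9, 10}]   # a partition of X = {1, ..., 10}
Y = {2, 3, 4, 5, 6, 7}
print(lower(pi, Y))      # {4, 5, 6, 7}
print(upper(pi, Y))      # {1, 2, 3, 4, 5, 6, 7}
print(boundary(pi, Y))   # {1, 2, 3}
```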

2.3 Measures and Partitions

Let us recall that, from the general theory of measure and integration on a (not necessarily finite) universe X (see [27]), a measurable space is a pair 〈X, E(X)〉, where E(X) is a σ–algebra of measurable sets, i.e., a collection of subsets of X satisfying the conditions: (ms1) ∅ ∈ E(X); (ms2) E ∈ E(X) implies E^c = X \ E ∈ E(X); (ms3) ⋃_{n∈N} En ∈ E(X) for any countable family of measurable subsets En ∈ E(X).

A measure on a measurable space is a mapping m : E(X) → R+ satisfying the conditions:

(m1) m(∅) = 0; and the additivity condition
(m2) for every countable family {An ∈ E(X) : n ∈ N} of pairwise disjoint subsets of X (Ai ∩ Aj = ∅ for i ≠ j) it is m(⋃_n An) = ∑_n m(An).

We are particularly interested in nontrivial and finite measures, i.e., those measures satisfying the finiteness condition
(m3) m(X) < ∞, in which case the structure 〈X, E(X), m〉 is a finite measure space,
and the non-triviality condition
(m4) m(X) ≠ 0.

Any nontrivial finite measure induces a probability measure p : E(X) → [0, 1], defined for any measurable set E ∈ E(X) as p(E) = m(E)/m(X), obtaining in this way the probability space 〈X, E(X), p〉. In this probability context the measurable sets from E(X) are also called events.

The Finite Universe Case. Let us assume that the universe of discourse is finite (|X| < ∞). The set Eπ(X) of all crisp elements induced from a (necessarily finite) partition π = {A1, A2, . . . , AN} of the universe X has the structure of an algebra of sets (the union condition (ms3) is necessarily applied to finite families only) for a measurable space 〈X, Eπ(X)〉. In this finite universe context, and looking at the possible probabilistic applications, the elements from Eπ(X) are the events generated by the partition π, and so the ones from the original partition π are also said to be the elementary events. Since we are particularly interested in the class Π(X) of all possible partitions of a finite universe X, with associated measure and probability considerations, we must take into account the two peculiar partitions which can be introduced whatever the universe X may be: the trivial partition πg = {X} and the discrete partition πd consisting of all singletons {x}, for x ranging in X.

Of course, if on the same universe one considers two different partitions, then the corresponding families of events are different from each other.

Example 3. Let us consider the universe X = {1, 2, 3, 4} and the two partitions π1 = {{1, 2, 3}, {4}} and π2 = {{1, 3}, {2, 4}}. The corresponding algebras of events are Eπ1(X) = {∅, {1, 2, 3}, {4}, X} and Eπ2(X) = {∅, {1, 3}, {2, 4}, X}, trivially different from each other.

In order to have a unique algebra of sets containing the algebras of sets induced by all possible partitions, we assume the power set P(X) of X as the common algebra of sets, obtaining the measurable space 〈X, P(X)〉. If we consider a nontrivial finite measure m : P(X) → R+ on this measurable space, then for any measurable subset (event) A = ⋃_{a∈A} {a} of X (with the involved singletons {a} pairwise disjoint and finite in number) it is possible to apply the additivity condition (m2), obtaining

m(A) = m(∪_{a∈A} {a}) = ∑_{a∈A} m(a)    (11)

where for simplicity we have written m(a) instead of m({a}). This means that, if X = {x1, x2, . . . , xN}, in order to construct the measure space under study it is sufficient to know the vector

m(πd) = (m(x1), m(x2), . . . , m(xN))

which satisfies the two conditions of a measure distribution according to the notion introduced in subsection 1.1.

An interesting example of this kind of measure is the so–called counting measure, assigning to any event E ∈ P(X) the measure mc(E) = |E|, i.e., the cardinality of the measurable set (event) under examination, which is obtained from the uniform measure distribution ∀ x ∈ X, mc(x) = 1.

2.4 Entropy (As Measure of Average Uncertainty) and Co–Entropy (As Measure of Average Granularity) of Partitions

From now on, even if not explicitly stated, the measure space 〈X, P(X), m〉 we take into account is finite (|X| < ∞) and we will also assume that the measure m is not degenerate, in the sense that the following condition of strict positivity is satisfied:

(m4a) ∀ x ∈ X, m(x) > 0.

Thus, from any partition π = {A1, A2, . . . , AN}, where each event Ai represents a granule of knowledge supported by π, the following two N–component vectors can be constructed:


(md) the measure distribution

m(π) = (m(A1), m(A2), . . . , m(AN)).

The quantity m(Ai) > 0 expresses the measure of the granule Ai; the total sum ∑_{i=1}^N m(Ai) = m(X) is constant with respect to the variation of the partition π in Π(X);

(pd) the probability distribution

p(π) = (p(A1), p(A2), . . . , p(AN)), with p(Ai) = m(Ai)/m(X).

The quantity p(Ai) ∈ (0, 1] describes the probability of the event Ai, and p(π) is a finite collection of non–negative real numbers (∀ i, p(Ai) ≥ 0) whose sum is one (∑_{i=1}^N p(Ai) = 1).

One must not confuse the measure m(Ai) of the "granule" Ai with the occurrence probability p(Ai) of the "event" Ai. They are two very different semantical concepts. Of course, both these distributions depend on the choice of the partition π, and if one changes the partition π inside the collection Π(X) of all possible partitions of X, then different distributions m(π) and p(π) are obtained.

Once the partition π is fixed, on the basis of these two distributions it is possible to introduce two really different discrete random variables:

(RV-G) The granularity random variable

G(π) := (log m(A1), log m(A2), . . . , log m(AN ))

where the real quantity G(Ai) := log m(Ai) represents the measure of the granularity associated with the knowledge supported by the "granule" Ai of the partition π. Some of these measures could be negative but, as stressed at the end of subsection 1.1, it is possible to "normalize" this measure distribution without affecting the entropy, in such a way that every granule turns out to have a non–negative granularity measure. From now on, even if not explicitly stated, we shall assume that all the involved measure distributions satisfy this condition.

(RV-U) The uncertainty random variable

I(π) := (− log p(A1), − log p(A2), . . . , − log p(AN))

where the non–negative real quantity I(Ai) := − log p(Ai) is interpreted (see [14], and also [1,25]) as a measure of the uncertainty related to the probability of occurrence of the "event" Ai of the partition π.

Also in the case of these two discrete random variables, their semantical/terminological confusion should be avoided. Indeed, G(Ai) involves the measure m(Ai) (granularity measure of the "granule" Ai), contrary to I(Ai) which involves the probability p(Ai) of occurrence of Ai (uncertainty measure of the "event" Ai). Note that under the assumption that the measure of the event Ai ∈ π satisfies m(Ai) ≥ 1, the corresponding granularity and uncertainty measures are both non–negative (see Figure 1).

Fig. 1. Graphs of the granularity G(m) and the uncertainty I(p) measures in the "positivity" domains m ∈ [1, M] and p = m/M ∈ [1/M, 1], with M = m(X)

Moreover, they are mutually "complementary" with respect to the fixed quantity log m(X), invariant with respect to the choice of the partition π:

G(Ai) + I(Ai) = log m(X)    (12)

The granularity measure G is strictly isotonic (monotonically increasing) with respect to set theoretic inclusion: A ⊂ B implies G(A) < G(B). On the contrary, the uncertainty measure I is strictly anti–tonic (monotonically decreasing): A ⊂ B implies I(B) < I(A).

As it happens for any discrete random variable, it is possible to calculate the average with respect to the fixed probability distribution p(π), obtaining the two results:

(GA) the granularity average with respect to p(π), expressed by the quantity

Av(G(π), p(π)) := ∑_{i=1}^N G(Ai) · p(Ai) = (1/m(X)) ∑_{i=1}^N m(Ai) · log m(Ai)    (13)

which in the sequel will be simply denoted by E(π);

(UA) the uncertainty average with respect to p(π), expressed by the quantity

Av(I(π), p(π)) := ∑_{i=1}^N I(Ai) · p(Ai) = − (1/m(X)) ∑_{i=1}^N m(Ai) · log ( m(Ai)/m(X) )    (14)

which is the information entropy H(π) of the partition π according to the Shannon approach to information theory [26] (see also [16,1] for introductory treatments).

Thus, the quantity E(π) furnishes a measure of the average granularity carried by the partition π as a whole, whereas the entropy H(π) furnishes a measure of the average uncertainty associated with the same partition. In conclusion, also in this case the average granularity must not be confused with the average uncertainty supported by π. Analogously to (12), which refers to a single event Ai, these averages satisfy the following identity, holding for any arbitrary partition π of the universe X:

H(π) + E(π) = log m(X)    (15)

Also in this case the two measures complement each other with respect to the constant quantity log m(X), which is invariant with respect to the choice of the partition π of X.
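Under the counting measure mc (and base-2 logarithms), the co-entropy (13), the entropy (14) and identity (15) can be checked with a few lines of Python (an illustrative sketch, not part of the paper):

```python
# Illustrative check of E(pi), H(pi) and identity (15) with the counting measure m_c(E) = |E|.
from math import log2

def coentropy(pi):
    """E(pi) = (1/m(X)) sum_i m(A_i) log m(A_i), eq. (13), with m = cardinality."""
    M = sum(len(A) for A in pi)
    return sum(len(A) * log2(len(A)) for A in pi) / M

def entropy(pi):
    """H(pi) = -sum_i p(A_i) log p(A_i), eq. (14), with p(A_i) = |A_i| / |X|."""
    M = sum(len(A) for A in pi)
    return -sum(len(A) / M * log2(len(A) / M) for A in pi)

pi = [{1, 2, 3}, {4, 5, 6}, {7}, {8}, {9, 10}]   # a partition of X = {1, ..., 10}
M = sum(len(A) for A in pi)                      # m(X) = 10
print(entropy(pi) + coentropy(pi), log2(M))      # both ~ 3.3219: identity (15)
```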

Remark 1. Let us recall that in [28] Wierman has interpreted the entropy H(π) of the partition π as a granularity measure, defined as the quantity which "measures the uncertainty (in bits) associated with the prediction of outcomes where elements of each partition sets Ai are indistinguishable." On the contrary, we prefer to distinguish the uncertainty measure of the partition π given by H(π) from the granularity measure of the same partition described by E(π).

Note that in [19] it is remarked that the Wierman "granularity measure" coincides with the Shannon entropy H(π), more correctly interpreted as the "information measure of knowledge" furnished by the partition π.

The co–entropy (average granularity) E(π) ranges in the real (closed) interval [0, log m(X)], with the minimum attained by the discrete partition πd = {{x1}, {x2}, . . . , {x|X|}}, the collection of all singletons from X, and the maximum attained by the trivial partition πg = {X}, consisting of the unique element X: that is, ∀ π ∈ Π(X), 0 = E(πd) ≤ E(π) ≤ E(πg) = log m(X).

From the point of view of rough set theory, the discrete partition is the one which generates the "best" sharpness of any subset Y of the universe X (∀ Y ∈ P(X), r_{πd}(Y) = 〈Y, Y〉), formalized by the fact that the boundary of any Y is b_{πd}(Y) = u_{πd}(Y) \ l_{πd}(Y) = ∅ (i.e., any subset is sharp). On the contrary, the trivial partition is the one which generates the "worst" sharpness of any subset Y of X (∀ Y ∈ P(X) \ {∅, X}, r_{πg}(Y) = 〈∅, X〉, with ∅ and X the unique crisp sets since r_{πg}(∅) = 〈∅, ∅〉 and r_{πg}(X) = 〈X, X〉), formalized by the fact that the boundary of any nontrivial subset Y (≠ ∅, X) is the whole universe, b_{πg}(Y) = X. For these reasons, the interval [0, log m(X)] is assumed as the reference scale for measuring roughness (or sharpness): the lower the value, the smaller the roughness (and the better the sharpness).

0 (maximum sharpness, minimum roughness)  ◦··············◦  log m(X) (minimum sharpness, maximum roughness)

2.5 Ordering on Partitions

On the family Π(X) of all partitions of the finite universe X, equipped with a non degenerate measure m (∀ x ∈ X, m(x) > 0) on the set of events P(X), one can introduce some binary relations according to the following definition.

Definition 1. Let us consider a universe X and two partitions π1, π2 ∈ Π(X) of X. We introduce the following four binary relations on Π(X):

(por1) π1 ⪯ π2 iff ∀ A ∈ π1, ∃ B ∈ π2 s.t. A ⊆ B;
(por2) π1 ⊑ π2 iff ∀ B ∈ π2, ∃ {A_{i1}, A_{i2}, . . . , A_{ih}} ⊆ π1 s.t. B = A_{i1} ∪ A_{i2} ∪ . . . ∪ A_{ih};
(por3) π1 ⋞ π2 iff ∀ x ∈ X, gr_{π1}(x) ⊆ gr_{π2}(x);
(por4) π1 ≤W π2 iff ∀ Ai ∈ π1, ∀ Bj ∈ π2, Ai ∩ Bj ≠ ∅ implies Ai ⊆ Bj.

As a first result we have that the just introduced binary relations on Π(X) are mutually equivalent, and so define the same binary relation, which turns out to be a partial order relation:

π1 ⪯ π2 iff π1 ⊑ π2 iff π1 ⋞ π2 iff π1 ≤W π2    (16)

Remark 2. Let us stress that the introduction on Π(X) of these partial order binary relations ⪯, ⊑, ⋞, and ≤W might seem a little bit redundant, but the reason for listing them in this partition context is essentially due to the fact that in the case of coverings of X they give rise to different relations, as we will see in the covering section.

Partition Lattice. Given a finite universe X, we want to investigate the structure of all its partitions Π(X) from the point of view of the ordering ⪯, with particular regard to the possible existence of a lattice structure.

In words we can say that the meet of two partitions π1 = {A1, A2, . . . , AM} and π2 = {B1, B2, . . . , BN} in Π(X), with respect to ⪯, is a partition, finer than both, such that each of its elementary sets (or blocks, for simplicity) is contained both in some block of π1 and in some block of π2, and such that there is no larger block sharing the same property. In particular, let us observe that the partition π1 ∧ π2 can be realized by taking into account all the possible intersections of a granule from π1 and a granule from π2, where some of the possible intersections Ai ∩ Bj, for Ai ∈ π1 and Bj ∈ π2, could be empty, against the standard requirement of a partition. In this case this intersection is simply not considered. Hence, we can state the following result with respect to the meet of two partitions.

Proposition 1 (Meet). Given two partitions π1 = {A1, A2, . . . , AM} and π2 = {B1, B2, . . . , BN} in Π(X), the lattice meet of π1 and π2 with respect to the partial ordering ⪯ is given by

π1 ∧ π2 := {Ai ∩ Bj ≠ ∅ : Ai ∈ π1 and Bj ∈ π2}.

This result can be extended to an arbitrary (necessarily finite) number of partitions {πj : j ∈ J}, in order to have that ∧_{j∈J} πj is the corresponding meet.

A join of two partitions π1 and π2 in Π(X) is a partition π such that each block of π1 and π2 is contained in some block of π, and such that no other partition with smaller blocks shares the same property. The formal result is the following.

Proposition 2 (Join). Given two partitions π1 = {A1, A2, . . . , AM} and π2 = {B1, B2, . . . , BN} in Π(X), the join of π1 and π2 is given as

π1 ∨ π2 = ∧{π ∈ Π(X) : π1 ⪯ π and π2 ⪯ π}.

Example 4. In the universe X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} let us consider the two partitions π1 = {{1, 2, 3}, {4, 5, 6}, {7}, {8}, {9, 10}} and π2 = {{1, 2}, {3, 4}, {5, 6, 7}, {8, 9, 10}}. Then their lattice meet is the new partition π1 ∧ π2 = {{1, 2}, {3}, {4}, {5, 6}, {7}, {8}, {9, 10}} and the lattice join is the partition π1 ∨ π2 = {{1, 2, 3, 4, 5, 6, 7}, {8, 9, 10}}.
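The meet and join of Proposition 1 and Proposition 2 can be computed directly; the following sketch (an illustration, not from the paper: the join is obtained here by merging blocks chained through nonempty intersections, which realizes the least upper bound for the ordering ⪯) reproduces Example 4.

```python
# Illustrative meet and join of two partitions (sets of frozen blocks over the same universe).
def meet(pi1, pi2):
    """Lattice meet: all nonempty intersections A ∩ B (Proposition 1)."""
    return [A & B for A in pi1 for B in pi2 if A & B]

def join(pi1, pi2):
    """Lattice join: merge blocks connected through nonempty intersections."""
    merged = []
    for B in [set(B) for B in pi1 + pi2]:
        for C in [C for C in merged if C & B]:
            merged.remove(C)
            B = B | C
        merged.append(B)
    return merged

pi1 = [{1, 2, 3}, {4, 5, 6}, {7}, {8}, {9, 10}]
pi2 = [{1, 2}, {3, 4}, {5, 6, 7}, {8, 9, 10}]
print(meet(pi1, pi2))   # [{1,2}, {3}, {4}, {5,6}, {7}, {8}, {9,10}]   (as in Example 4)
print(join(pi1, pi2))   # [{1,2,3,4,5,6,7}, {8,9,10}]                  (block order may vary)
```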

The extension of Proposition 2 to any family of partitions is now straightforward. As an important result we have the following.

Theorem 1 (Partition lattice). The structure 〈Π(X), ⪯, πd, πg〉 is a poset with respect to the partial ordering ⪯, bounded by the least partition πd, the discrete one, and the greatest partition πg, the trivial one. Formally

∀ π ∈ Π(X), πd ⪯ π ⪯ πg.

This poset is a (complete) lattice with respect to the lattice meet π1 ∧ π2 and the lattice join π1 ∨ π2 introduced above.

As usual in a poset, the induced strict ordering on partitions, denoted by π1 ≺ π2, is defined as π1 ⪯ π2 and π1 ≠ π2. This means that there must exist at least one equivalence class Bi ∈ π2 such that its partition with respect to π1 is formed by at least two subsets, i.e., ∃ {A_{i1}, A_{i2}, . . . , A_{ip}} ⊆ π1, with p ≥ 2, s.t. Bi = A_{i1} ∪ A_{i2} ∪ . . . ∪ A_{ip}.

2.6 Isotonic Behavior of Entropies and Co–Entropies of Partitions

Let us now consider two partitions π1 = {A1, A2, . . . , AM} and π2 = {B1, B2, . . . , BN} with corresponding probability distributions giving the two finite probability schemes

π1 = [ A1     A2     . . .  AM
       p(A1)  p(A2)  . . .  p(AM) ]

π2 = [ B1     B2     . . .  BN
       p(B1)  p(B2)  . . .  p(BN) ]

According to (4), the entropies of π1 and π2 are, respectively,

H(π1) = − ∑_{l=1}^M p(Al) log p(Al)        H(π2) = − ∑_{k=1}^N p(Bk) log p(Bk).

If we consider the probability of an event Bk of π2, we have the following conditional probabilities for the events Al of π1:

p(Al|Bk) = p(Al ∩ Bk)/p(Bk) = m(Al ∩ Bk)/m(Bk)

Let us recall that these quantities represent "the probability that the event Al of the scheme π1 occurs, given that the event Bk of the scheme π2 occurred" [16]. From the fact that π2 is a partition we have that Al = Al ∩ (∪k Bk) = ∪k (Al ∩ Bk), with these latter pairwise disjoint, and so we get that p(Al) = ∑_k p(Al ∩ Bk), leading to the result

p(Al) = ∑_k p(Al|Bk) p(Bk)    (17)

which can be expressed in the matrix form:

⎛ p(A1) ⎞   ⎛ p(A1|B1) . . . p(A1|Bk) . . . p(A1|BN) ⎞   ⎛ p(B1) ⎞
⎜   .   ⎟   ⎜     .              .              .    ⎟   ⎜   .   ⎟
⎜ p(Al) ⎟ = ⎜ p(Al|B1) . . . p(Al|Bk) . . . p(Al|BN) ⎟   ⎜ p(Bk) ⎟
⎜   .   ⎟   ⎜     .              .              .    ⎟   ⎜   .   ⎟
⎝ p(AM) ⎠   ⎝ p(AM|B1) . . . p(AM|Bk) . . . p(AM|BN) ⎠   ⎝ p(BN) ⎠

Taking inspiration from [13], we can interpret this result as describing a channel of a system which takes in input the events Bk with a given probability and produces as output an event Al characterized by some probability. Indeed, "a channel is described by a set of conditional probability p(Al|Bk), which are the probability that an input Bk [...] will appear as some Al [...]. In this model a channel is completely described by the matrix of conditional probabilities [...]. A row contains all the probabilities that a particular input Bk becomes the output Al."

Trivially, we also have that the following conditions are satisfied:

(1) ∀ k, l, p(Al|Bk) ≥ 0;

(2) the sum of the elements of a column is always 1:

∀ k, ∑_{l=1}^M p(Al|Bk) = 1

"this merely means that for each input Bk we are certain that something will come out, and the P(Al|Bk) [for k fixed and l varying] give the distribution of these probabilities." [13].

(3) if p(Al) is the probability of the input Al occurring, then

∑_{k=1}^N ∑_{l=1}^M p(Al|Bk) p(Bk) = 1

"This means that when something is put into the system, then certainly something comes out." [13].


From condition (2) we also have that, for any fixed Bk ∈ π2, the vector of conditional probabilities generated by the probability distribution π1 given the occurrence of the event Bk,

p(π1|Bk) := ( p(A1|Bk), . . . , p(Al|Bk), . . . , p(AM|Bk) )    (18)

is a probability distribution. Hence, we have the following entropies of π1 conditioned by Bk (or k–conditional entropies of π1):

H(π1|Bk) = − ∑_{l=1}^M p(Al|Bk) log p(Al|Bk).    (19)

This is a particular case of a more general result. Indeed, let π = {A1, A2, . . . , AN} be a partition of a universe X, and let C be a nonempty subset of X. Let p(π|C) = (p(A1|C), . . . , p(AN|C)) be the vector whose elements are (∀ i = 1, . . . , N):

p(Ai|C) = p(Ai ∩ C)/p(C) = m(Ai ∩ C)/m(C)

(i.e., each p(Ai|C) represents the probability of Ai conditioned by C). Then p(π|C) is a probability distribution whose entropy, called the entropy of π conditioned by C, according to the general definition (4) is

H(π|C) = − ∑_{Ai∈π} p(Ai|C) log p(Ai|C)    (20)

Conditioned Entropy of Partitions. Given two partitions π1 = (Al)_{l=1,...,M} and π2 = (Bk)_{k=1,...,N} of the universe X, let us consider their meet partition

π1 ∧ π2 = (Al ∩ Bk)_{l=1,...,M; k=1,...,N}

where some of the events Al ∩ Bk could be empty, against the usual definition of partition. But, without any loss of generality, we assume here the weaker position in which some set of a partition can be empty, with the corresponding probability equal to 0. With respect to the meet partition π1 ∧ π2 we have the "probability distribution of the joint occurrence p(Al ∩ Bk) of the events Al and Bk" [16]:

p(π1 ∧ π2) = ( p(Al ∩ Bk) = p(Bk) · p(Al|Bk) )_{l=1,...,M; k=1,...,N}    (21)

"Then the set of [meet] events Al ∩ Bk (1 ≤ l ≤ M, 1 ≤ k ≤ N), with the probabilities q_{lk} := p(Al ∩ Bk) [of the joint occurrence of the events Al and Bk], represents another finite scheme, which we call the product of the schemas π1 and π2." [16].

We can now consider two discrete random variables.


(RV-1) The uncertainty random variable of the partition π1 ∧ π2

I(π1 ∧ π2) = ( − log p(Al ∩ Bk) )_{l=1,...,M; k=1,...,N}

(RV-2) The uncertainty random variable of the partition π1 conditioned by the partition π2

I(π1|π2) = ( − log p(Al|Bk) )_{l=1,...,M; k=1,...,N}

The uncertainty of the partition π1 ∧ π2, as the average of the random variable (RV-1) with respect to the probability distribution p(π1 ∧ π2), is thus expressed by the meet entropy

H(π1 ∧ π2) = − ∑_{l,k} p(Al ∩ Bk) log p(Al ∩ Bk)    (22)

whereas we define as entropy of π1 conditioned by π2 the average of the discrete random variable (RV-2) with respect to the probability distribution p(π1 ∧ π2), expressed by the non–negative quantity:

H(π1|π2) := − ∑_{l,k} p(Al ∩ Bk) log p(Al|Bk)    (23)

In the case of the meet partition π1 ∧ π2 of the two partitions π1 and π2, with associated probability distribution p(π1 ∧ π2) of (21), and according to the general definition of entropy (4), after some easy calculations we obtain the following result about the entropy of the meet, in which the conditioned entropy (19) is involved:

H(π1 ∧ π2) = H(π2) + ∑_{k=1}^N p(Bk) · H(π1|Bk)    (24)

On the other hand, the following result about the conditioned entropy holds:

H(π1|π2) = ∑_{k=1}^N p(Bk) · H(π1|Bk)    (25)

As a consequence, the following interesting relationship between the meet entropy and the conditioned entropy is stated by the identity:

H(π1 ∧ π2) = H(π2) + H(π1|π2) = H(π1) + H(π2|π1)    (26)

If one takes into account that the condition π1 ⪯ π2 is equivalent to π1 ∧ π2 = π1, one has both the following results:

π1 ⪯ π2 implies H(π1) = H(π2) + ∑_{k=1}^N p(Bk) · H(π1|Bk)

π1 ⪯ π2 implies H(π1) = H(π2) + H(π1|π2) = H(π1) + H(π2|π1)


From these it immediately follows that H(π1) ≥ H(π2). Moreover, the condition of strict ordering π1 ≺ π2 implies that at least one of the addenda p(Bk) · H(π1|Bk) must be different from 0, obtaining the strict anti–isotonicity condition

π1 ≺ π2 implies H(π2) < H(π1)

hence, taking into account the relationship (15), we have that the co–entropy is a strictly isotonic mapping with respect to the partition ordering, i.e.,

π1 ≺ π2 implies E(π1) < E(π2)
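The identities above and the monotonicity behavior can be checked numerically. The following sketch (illustrative, assuming the counting measure and base-2 logarithms, on a hypothetical pair of comparable partitions) verifies identity (26) and the strict anti-isotonicity of H together with the strict isotonicity of E.

```python
# Illustrative check (counting measure, base-2 logs) of identity (26) and of the
# strict anti-isotonicity of H / isotonicity of E on comparable partitions.
from math import log2

def H(pi, M):
    """Entropy (14) of a partition of a universe of total measure M = |X|."""
    return -sum(len(A) / M * log2(len(A) / M) for A in pi)

def E(pi, M):
    """Co-entropy (13) of a partition."""
    return sum(len(A) / M * log2(len(A)) for A in pi)

def meet(pi1, pi2):
    return [A & B for A in pi1 for B in pi2 if A & B]

def H_cond(pi1, pi2, M):
    """Conditioned entropy (23): H(pi1 | pi2)."""
    return -sum(len(A & B) / M * log2(len(A & B) / len(B))
                for A in pi1 for B in pi2 if A & B)

M = 10                                                    # m(X) for X = {1, ..., 10}
pi1 = [{1, 2}, {3}, {4}, {5, 6}, {7}, {8}, {9, 10}]       # strictly finer than pi2
pi2 = [{1, 2, 3}, {4, 5, 6}, {7}, {8}, {9, 10}]

# identity (26): H(pi1 ∧ pi2) = H(pi2) + H(pi1 | pi2)
print(abs(H(meet(pi1, pi2), M) - (H(pi2, M) + H_cond(pi1, pi2, M))) < 1e-12)   # True
# strict anti-isotonicity of H and strict isotonicity of E
print(H(pi2, M) < H(pi1, M), E(pi1, M) < E(pi2, M))                            # True True
```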

Trivially, the probability distribution (21) leads to the relationships

∑_{k=1}^N p(Al ∩ Bk) = p(Al)    and    ∑_{l=1}^M p(Al ∩ Bk) = p(Bk)

which lead to the result

p(π1) = ( ∑_{k=1}^N p(A1 ∩ Bk), ∑_{k=1}^N p(A2 ∩ Bk), . . . , ∑_{k=1}^N p(AM ∩ Bk) )

p(π2) = ( ∑_{l=1}^M p(Al ∩ B1), ∑_{l=1}^M p(Al ∩ B2), . . . , ∑_{l=1}^M p(Al ∩ BN) )

from which the following result follows:

Proposition 3. Let π1 and π2 be two partitions of the same universe X. Then

H(π1 ∧ π2) ≤ H(π1) + H(π2) (27)

The equality holds iff, whatever l and k may be, it is p(Al ∩ Bk) = ∑_{k=1}^N p(Al ∩ Bk) · ∑_{l=1}^M p(Al ∩ Bk) = p(Al) · p(Bk), the so–called condition of (mutual) independence of all the events Al and Bk. In this case we will say that the two partitions π1 and π2 are independent (to be precise, two partitions π1, π2 ∈ Π(X) are said to be independent iff for any A ∈ π1, B ∈ π2 we have p(A ∩ B) = p(A) · p(B)).

Let us now give a direct proof of the case of two independent partitions.

Proof. We have:

H(π1 ∧ π2) = − ∑_{A∈π1, B∈π2} p(A ∩ B) · log p(A ∩ B).

Under the hypothesis that π1 and π2 are independent partitions, we obtain

H(π1 ∧ π2) = − ∑_{A∈π1, B∈π2} [p(A) · p(B)] · log[p(A) · p(B)]
           = − ( ∑_{A∈π1} p(A) · ∑_{B∈π2} p(B) · log p(B) + ∑_{B∈π2} p(B) · ∑_{A∈π1} p(A) · log p(A) )

Since we have that ∑_{A∈π1} p(A) = ∑_{B∈π2} p(B) = 1, we get the required result.

Comparing (26) with (27) we obtain the further inequality

H(π1|π2) ≤ H(π1)

with the equality iff the two partitions π1 and π2 are independent.

Conditioned Co–Entropy of Partitions. As a first result about the meet co–entropy, according to the general relationship (7) applied to (22), the following holds:

E(π1 ∧ π2) = (1/m(X)) ∑_{l,k} m(Al ∩ Bk) · log m(Al ∩ Bk) = E(π2) − H(π1|π2)    (28)

Moreover, having introduced the co–entropy of the partition π1 conditioned by the partition π2 as the quantity

E(π1|π2) := (1/m(X)) ∑_{l,k} m(Al ∩ Bk) · log [ (m(X)/m(Bk)) · m(Al ∩ Bk) ]
          = (1/m(X)) ∑_{l,k} [ m(Bk) · p(Al|Bk) ] · log [ m(X) · p(Al|Bk) ]

it is easy to show that E(π1|π2) ≥ 0. Furthermore, the expected relationship holds:

H(π1|π2) + E(π1|π2) = log m(X)

Note that from (28) it follows that

π1 ⪯ π2 implies E(π2) = E(π1) + H(π1|π2)    (29)
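Analogously, relations (28) and (29) can be verified numerically (again an illustrative sketch assuming the counting measure and base-2 logarithms, on the same hypothetical pair of partitions used above).

```python
# Illustrative check (counting measure, base-2 logs) of relations (28) and (29).
from math import log2

def E(pi, M):
    return sum(len(A) * log2(len(A)) for A in pi) / M

def H_cond(pi1, pi2, M):
    return -sum(len(A & B) / M * log2(len(A & B) / len(B))
                for A in pi1 for B in pi2 if A & B)

M = 10
pi1 = [{1, 2}, {3}, {4}, {5, 6}, {7}, {8}, {9, 10}]        # finer partition
pi2 = [{1, 2, 3}, {4, 5, 6}, {7}, {8}, {9, 10}]            # coarser partition
meet = [A & B for A in pi1 for B in pi2 if A & B]          # here equal to pi1

print(abs(E(meet, M) - (E(pi2, M) - H_cond(pi1, pi2, M))) < 1e-12)   # eq. (28): True
print(abs(E(pi2, M) - (E(pi1, M) + H_cond(pi1, pi2, M))) < 1e-12)    # eq. (29): True
```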

2.7 A Pointwise Approach to the Entropy and Co–Entropy of Partitions

In order to better understand the application to coverings of the Liang–Xu (LX) approach to quantifying information in the case of incomplete systems [21], let us now introduce (and compare with (13)) a new form of pseudo co–entropy related to a partition π = {A1, . . . , AN} by the following definition, in which the sum involves the "local" information given by all the equivalence classes grπ(x) for x ranging over the universe X:

ELX(π) := (1/m(X)) ∑_{x∈X} m(grπ(x)) · log m(grπ(x))    (30)


Trivially, ∀ π ∈ Π(X), 0 = ELX(πd) ≤ ELX(π) ≤ ELX(πg) = m(X) · log m(X). Moreover, it is easy to prove the following result, which shows that, at least at the level of partitions, this "local" notion of entropy is a little bit "pathological":

ELX(π) = (1/m(X)) ∑_{i=1}^N m(Ai)^2 · log m(Ai)    (31)

and so, from the fact that 1 ≤ m(Ai) ≤ m(Ai)^2, it follows that ∀ π, 0 ≤ E(π) ≤ ELX(π). The comparison between (13) and (31) highlights the very profound difference between these definitions and, in some sense, the "uselessness" of this latter notion.

With the aim to capture some relationship with respect to a pseudo–entropy,for any partition π let us consider the vector

μπ :=(

μπ(x) :=m(grπ(x))

m(X)s.t. x ∈ X

)

(32)

which is a pseudo–probability distribution since ∀x, 0 ≤ μπ(x) ≤ 1, but μ(π) :=∑

x∈X μπ(x) =∑N

i=1m(Ai)

2

m(X) ≥ 1; this latter quantity is equal to 1 when∑N

i=1 m(Ai)2 = m(X), and in this case the vector μπ defines a real probabilitydistribution. Moreover, for any partition it is μ(π) ≤ m(X), with μ(πt) = m(X).Applying in a pure formal way the formula (4) to this pseudo–distribution oneobtains

HLX(π) = −∑_{x∈X} μπ(x) · log μπ(x) = −∑_{i=1}^{N} (m(Ai)²/m(X)) · log (m(Ai)/m(X))   (33)

from which it follows that (and compare with (15)):

HLX(π) + ELX(π) = μ(π) · log m(X)

Hence, ELX(π) is complementary to the “pseudo–entropy” HLX(π) with respect to the quantity μ(π) · log m(X), which is not invariant but depends on the partition π through its “pseudo–measure” μ(π). For instance, in the case of the trivial partition it is HLX(πg) = 0 and ELX(πg) = m(X) · log m(X), with HLX(πg) + ELX(πg) = m(X) · log m(X). On the other hand, in the case of the discrete partition it is HLX(πd) = log m(X) and ELX(πd) = 0, with HLX(πd) + ELX(πd) = log m(X).

Of course, the measure distribution (32) can be normalized by the quantity μ(π), obtaining a real probability distribution

μ(n)(π) = ( μ(n)π(x) = |grπ(x)| / ∑_{i=1}^{N} m(Ai)²  :  x ∈ X )

But in this case the real entropy H(n)LX(π) = −∑_{x∈X} μ(n)π(x) · log μ(n)π(x) is linked to the above pseudo co–entropy (30) by the relationship

H(n)LX(π) + (1/μ(π)) · ELX(π) = log[ μ(π) · m(X) ],

in which the dependence on the partition π through the “measure” μ(π) is very hard to handle in applications.


2.8 Local Rough Granularity Measure in the Case of Partitions

From the point of view of the rough approximations of subsets Y of the universe X (considered as a measure space 〈X, P(X), m〉) with respect to its partitions π, we shall consider now the situation in which during the time evolution t1 → t2 one tries to relate the corresponding variation of partitions πt1 → πt2 with, for instance, the corresponding boundary modification bt1(Y) → bt2(Y) (see also Figure 2). Let us note that the partitions πi, i = 1, 2, are such that

if π1 ⪯ π2, then lπ2(Y) ⊆ lπ1(Y) ⊆ Y ⊆ uπ1(Y) ⊆ uπ2(Y)

i.e., the rough approximation of Y with respect to the partition π1, rπ1(Y) = (lπ1(Y), uπ1(Y)), is better than the rough approximation of the same subset with respect to π2, rπ2(Y) = (lπ2(Y), uπ2(Y)). This fact can be denoted by the binary relation of partial ordering on rough approximations: rπ1(Y) ≼ rπ2(Y).

This leads to a first but only qualitative valuation of the roughness, expressed by the following general law involving the boundaries of Y relatively to the two partitions:

π1 ⪯ π2 implies that ∀Y, bπ1(Y) ⊆ bπ2(Y)

Fig. 2. Qualitative variation of boundaries with variation of partitions

The delicate point is that the condition of strict ordering π1 ≺ π2 does notassure that the corresponding strict ordering ∀Y , bπ1(Y ) ⊂ bπ2(Y ) holds. It ispossible to give some very simple counter–examples (see for instance example 5)in which notwithstanding π1 ≺ π2 one has that ∃Y0: bπ1(Y0) = bπ2(Y0) [9,5], andthis is not a desirable behavior of such a qualitative valuation of roughness.

Example 5. In the universe X = {1, 2, 3, 4, 5, 6}, let us consider the two partitions π1 = {{1}, {2}, {3}, {4, 5, 6}} and π2 = {{1, 2}, {3}, {4, 5, 6}}, with respect to which π1 ≺ π2. The subset Y0 = {1, 2, 4, 6} is such that lπ1(Y0) = lπ2(Y0) = {1, 2} and uπ1(Y0) = uπ2(Y0) = {1, 2, 4, 5, 6}. This result implies that bπ1(Y0) = bπ2(Y0) = {4, 5, 6}.

On the other hand, in many practical applications (for instance in the attribute reduction procedure), it is interesting not only to have a possible qualitative


valuation of the roughness of a generic subset Y, but also a quantitative valuation formalized by a mapping E : Π(X) × 2^X → [0, K] (with K a suitable non–negative real number) assumed to satisfy (at least) the following two minimal requirements:

(re1) The strict monotonicity condition: for any Y ∈ 2^X, π1 ≺ π2 implies Eπ1(Y) < Eπ2(Y).

(re2) The boundary conditions: ∀Y ∈ 2^X, Eπd(Y) = 0 and Eπg(Y) = 1.

In the sequel, sometimes we will use Eπ : 2^X → [0, K] to denote the above mapping in the case in which the partition π ∈ Π(X) is considered fixed once for all. The interpretation of condition (re2) is possible under the assumption that a quantitative valuation of the roughness Eπ(Y) should be directly related to its boundary by the measure m(bπ(Y)). From this point of view, the value 0 corresponds to the discrete partition, for which the boundary of any subset Y is empty, and so its rough approximation is rπd(Y) = (Y, Y) with m(bπd(Y)) = 0, i.e., a crisp situation. On the other hand, the trivial partition is such that the boundary of any nontrivial subset Y (≠ ∅, X) is the whole universe, and so its rough approximation is rπg(Y) = (∅, X) with m(bπg(Y)) = m(X). For all other partitions π we must recall that πd ⪯ π ≺ πg and 0 = m(bπd(Y)) ≤ m(bπ(Y)) ≤ m(bπg(Y)) = m(X), i.e., the maximum of roughness (or minimum of sharpness) valuation is reached by the trivial partition πg.

This being stated, in literature one can find a lot of quantitative measures of roughness of Y relatively to a given partition π ∈ Π(X), formalized as mappings ρπ : 2^X → [0, 1] such that:

(rm1) the monotonicity condition holds: π1 ⪯ π2 implies that ∀Y ∈ 2^X, ρπ1(Y) ≤ ρπ2(Y);

(rm2) ∀Y ∈ 2^X, ρπd(Y) = 0 and ρπg(Y) = 1.

ρ(P )π (Y ) :=

m(bπ(Y ))m(uπ(Y ))

and ρ(C)π (Y ) :=

m(bπ(Y ))m(X)

with the latter (considered in [5]) producing a better description of the former(introduced by Pawlak in [24]) with respect to the absolute scale of sharpnesspreviously introduced, since whatever be the subset Y it is ρ

(C)π (Y ) ≤ ρ

(P )π (Y ).

These roughness measures satisfy the above “boundary” condition (re2), buttheir drawback is that the strict condition on partitions π1 ≺ π2 does not assure acorresponding strict behavior ∀Y , bπ1(Y ) ⊂ bπ2(Y ), and so the strict correlationρπ1(Y ) < ρπ2(Y ) cannot be inferred. It might happen that notwithstanding thestrict partition order π1 ≺ π2, the two corresponding roughness measures for acertain subset Y0 turn out to be equal ρπ1(Y0) = ρπ2(Y0) as illustrated in thefollowing example.


Example 6. Making reference to example 5 we have that, although π1 ≺ π2, for the subset Y0 we get ρπ1(Y0) = ρπ2(Y0) (for both roughness measures ρ^(P)π(Y0) and ρ^(C)π(Y0)).

Summarizing we can only state the following monotonicity with respect to thepartition ordering:

π1 ≺ π2 implies ∀ Y ⊆ X, ρπ1(Y ) ≤ ρπ2(Y )

Taking inspiration from [9], a local co–entropy measure of Y, in the sense of a “co–entropy” assigned not to the whole universe X but to any possible subset Y of it, is then defined as the product of the above (local) roughness measure times the (global) co–entropy:

Eπ(Y) := ρπ(Y) · E(π)   (34)

For a fixed partition π of X also this quantity ranges into the closed real interval [0, log m(X)] whatever be the subset Y, with the extreme values reached for Eπd(Y) = 0 and Eπg(Y) = log m(X), i.e., ∀Y ⊆ X it is

0 = Eπd(Y) ≤ Eπ(Y) ≤ Eπg(Y) = log m(X)

Moreover, for any fixed subset Y this local co–entropy is strictly monotonic withrespect to partitions:

π1 ≺ π2 implies ∀Y ⊆ X, Eπ1(Y ) < Eπ2(Y ) (35)

Making use of the above interpretation (see the end of section 2.3) of the real interval [0, log m(X)] as an absolute scale of sharpness, from this result we have that, according to our intuition, the finer is the partition the better is the sharpness of the rough approximation of Y, i.e., Eπ : Y ∈ P(X) → Eπ(Y) ∈ [0, log2 m(X)] can be considered as a (local) rough granularity mapping.

Example 7. Let us consider the universe X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}, its subset Y = {2, 3, 5, 8, 9, 10, 11}, and the following three different partitions of the universe X by granules:

π1 = {{2, 3, 5, 8, 9}, {1, 4}, {6, 7, 10, 11}},
π2 = {{2, 3}, {5, 8, 9}, {1, 4}, {6, 7, 10, 11}},
π3 = {{2, 3}, {5, 8, 9}, {1, 4}, {7, 10}, {6, 11}}

with π3 ≺ π2 ≺ π1. The lower and upper approximations of Y with respect to π1, π2 and π3 are equal, and given, respectively, by iπk(Y) = {2, 3, 5, 8, 9} and oπk(Y) = {2, 3, 5, 6, 7, 8, 9, 10, 11}, for k = 1, 2, 3. Note that necessarily eπ1(Y) = eπ2(Y) = eπ3(Y) = {1, 4}. Therefore, the corresponding roughness measures are exactly the same: ρπ1(Y) = ρπ2(Y) = ρπ3(Y), even though from the point of view of the granularity knowledge we know that the lower approximations of Y are obtained by different collections of granules: grπ2_i(Y) = {{2, 3}, {5, 8, 9}} = grπ3_i(Y), as a collection of two granules, is better (finer) than grπ1_i(Y) = {{2, 3, 5, 8, 9}}, a single granule; this fact is formally written as grπ2_i(Y) = grπ3_i(Y) ≺ grπ1_i(Y). Similarly, always from the granule knowledge point of view, we can see that the best partitioning for the upper approximation of Y is obtained with π3, since grπ1_o(Y) = {{2, 3, 5, 8, 9}, {6, 7, 10, 11}}, grπ2_o(Y) = {{2, 3}, {5, 8, 9}, {6, 7, 10, 11}}, and grπ3_o(Y) = {{2, 3}, {5, 8, 9}, {7, 10}, {6, 11}}, and thus grπ3_o(Y) ≺ grπ2_o(Y) ≺ grπ1_o(Y).

It is clear that the roughness measure ρπ(Y) is not enough when we want to catch any possible advantage in terms of granularity knowledge given by a different partitioning, even when the new partitioning does not change the internal and the closure approximation sets. On the contrary, this difference is measured by the local co–entropy (34) since, according to (35), and recalling that π3 ≺ π2 ≺ π1, we have the following strict monotonicity: Eπ3(Y) < Eπ2(Y) < Eπ1(Y).
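The computation behind Example 7 can be reproduced with a few lines of Python. The sketch below is ours (not the authors' code); it uses the boundary/universe roughness ρ^(C) and base-2 logarithms, and shows that the roughness is the same for π1, π2, π3 while the local co–entropy Eπ(Y) = ρπ(Y) · E(π) strictly decreases with finer partitions.

from math import log2

X = set(range(1, 12))
Y = {2, 3, 5, 8, 9, 10, 11}
pi1 = [{2, 3, 5, 8, 9}, {1, 4}, {6, 7, 10, 11}]
pi2 = [{2, 3}, {5, 8, 9}, {1, 4}, {6, 7, 10, 11}]
pi3 = [{2, 3}, {5, 8, 9}, {1, 4}, {7, 10}, {6, 11}]

def coentropy(pi):  # E(pi) with the counting measure
    return sum(len(a) * log2(len(a)) for a in pi) / len(X)

def roughness(pi, Y):  # rho^(C): boundary measure over m(X)
    lower = set().union(*[a for a in pi if a <= Y])
    upper = set().union(*[a for a in pi if a & Y])
    return len(upper - lower) / len(X)

for name, pi in (("pi1", pi1), ("pi2", pi2), ("pi3", pi3)):
    # same roughness (4/11) in all three cases, strictly decreasing local co-entropy
    print(name, roughness(pi, Y), roughness(pi, Y) * coentropy(pi))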

2.9 Application to Complete Information Systems: The Case ofFixed Universe

These considerations can be applied to the case of a complete Information System (IS) on a finite universe. Let us stress that in this subsection the universe is considered fixed, whereas it is the collection of attributes applied to X which changes. Indeed, in many applications it is of a certain interest to analyze the variations occurring inside two information systems labelled with two parameters t1 and t2, each of which is based on the same universe X. In particular, one has to do mainly with the following two cases:

(1) dynamics (see [11]), in which ISt1 = (X, Att1, F1) and ISt2 = (X, Att2, F2) are under the conditions that Att1 ⊂ Att2 and ∀x ∈ X, ∀ a1 ∈ Att1, F2(x, a1) = F1(x, a1). This situation corresponds to a dynamical increase of knowledge (t1 and t2 are considered as time parameters, with t1 < t2), for instance in a medical database in which one fixed decision attribute d ∈ Att1 ∩ Att2 is selected to state a certain disease related to all the remaining condition attributes (i.e., symptoms) Ci = Atti \ {d}. In this case the increase Att1 \ {d} ⊆ Att2 \ {d} corresponds to the fact that during the researches on the disease some symptoms which have been neglected at time t1 become relevant at time t2 under some new investigations.

(2) reduct, in which ISt1 = (X, Att1, F1) and ISt2 = (X, Att2, F2) are underthe conditions that Att2 ⊂ Att1 and ∀x ∈ X , ∀ a2 ∈ Att2, F2(x, a2) = F1(x, a2).In this case it is of a certain interest to verify if the corresponding partitions areinvariant πAtt2(ISt2) = πAtt1(ISt1), or not. In the former case one can considerISt2 as the result of the reduction of the initial attributes Att1 obtained by thesuppression from ISt1 of the superfluous attributes Att1 \ Att2.

From a general point of view, a reduction procedure can be formalized by a (strictly) monotonically decreasing sequence of attribute families RP := {At ⊆ Att s.t. t ∈ N and At ⊃ At+1}, with A0 = Att. In this case the following diagram holds, linking the family At with the generated partition π(At), whose co–entropy is E(At):


A0 = Att ⊃ A1 ⊃ . . . ⊃ At ⊃ At+1 ⊃ . . . ⊃ AT = ∅
   ↓         ↓            ↓        ↓                ↓
π(A0)  ⪯  π(A1)  ⪯ . . . ⪯ π(At) ⪯ π(At+1) ⪯ . . . ⪯ {X}
   ↓         ↓            ↓        ↓                ↓
E(A0)  ≤  E(A1)  ≤ . . . ≤ E(At) ≤ E(At+1) ≤ . . . ≤ log m(X)

The first row constitutes the attribute channel, the second row the partitionchannel (measured by the corresponding co–entropy, whose upper bound is thetrivial partition πt = {X}), and the last row the granularity channel (whose up-per bound corresponds to the maximum of roughness log m(X)) of the reductionprocedure. After the finite number of steps T = |Att|, one reaches the emptyset AT = ∅ with corresponding π(AT ) = πt = {X}, the trivial partition, andE(AT ) = log m(X). In this reduction context, the link between the situation atstep t and the corresponding one at t + 1 relatively to the co–entropy is givenby equation (29) which assumes now the form:

E(At+1) = E(At) + H(At|At+1) (36)

From a general point of view, a practical procedure of reduction consists of starting from an initial attribute family A0 and, according to some algorithmic criterion Alg, step by step one “constructs” the sequence of At, each a subset of the previous At−1. It is possible to fix a priori a suitable approximation value ε and then to stop the procedure at the first step t0 such that log m(X) − E(At0) ≤ ε. This assures that for any other further step t > t0 it is also log m(X) − E(At) ≤ ε. The family of attributes At0 is the ε–approximate reduct with respect to the procedure Alg. Note that in terms of approximation the following order chain holds: ∀ t > t0, E(At) − E(At0) ≤ log m(X) − E(At0) ≤ ε. On the other hand, for any triplet of steps t0 < t1 < t2 it is

H(At1 |At2) = E(At2) − E(At1) ≤ log m(X) − E(At0) ≤ ε

Example 8. In the complete information system illustrated in table 2

Table 2. Flats complete information system

Flat  Price  Rooms  Down-Town  Furniture  Floor  Lift
 1    high   3      yes        yes        3      yes
 2    high   3      yes        yes        3      no
 3    high   2      no         no         1      no
 4    high   2      no         no         1      yes
 5    high   2      yes        no         2      no
 6    high   2      yes        no         2      yes
 7    low    1      no         no         2      yes
 8    low    1      no         yes        3      yes
 9    low    1      no         no         2      no
10    low    1      yes        yes        1      yes


let us consider the following five (decreasing) families of attributes:

A0 = Att = {Price, Rooms, Down-Town, Furniture, Floor, Lift} ⊃
A1 = {Price, Rooms, Down-Town, Furniture, Floor} ⊃
A2 = {Price, Rooms, Down-Town, Furniture} ⊃
A3 = {Price, Rooms, Down-Town} ⊃
A4 = {Price, Rooms} ⊃ A5 = {Price}

The corresponding probability partitions are
π(A1) = π(A2) = {{1, 2}, {3, 4}, {5, 6}, {7, 9}, {8}, {10}},
π(A3) = {{1, 2}, {3, 4}, {5, 6}, {7, 8, 9}, {10}},
π(A4) = {{1, 2}, {3, 4, 5, 6}, {7, 8, 9, 10}}, and
π(A5) = {{1, 2, 3, 4, 5, 6}, {7, 8, 9, 10}}.

Note that π(A0) corresponds to the discrete partition πd. We can easily ob-serve that π(A0) ≺ π(A1) = π(A2) ≺ π(A3) ≺ π(A4) ≺ π(A5) and thatE(A0) = 0.00000 < E(A1) = 0.80000 = E(A2) < 1.07549 = E(A3) <1.80000 = E(A4) < 2.35098 = E(A5) < log m(X) = 3.32193. Moreover, tak-ing for instance E(A3) and E(A4) and according to (36), we have H(A3|A4) =E(A4) − E(A3) = 0.72451.

A0 = Att ⊃ A1 ⊃ A2 ⊃ A3 ⊃ A4 ⊃ A5 ⊃ AT = ∅
   ↓       ↓      ↓      ↓      ↓      ↓      ↓
π(A0) ≺ π(A1) = π(A2) ≺ π(A3) ≺ π(A4) ≺ π(A5) ≺ {X}
   ↓       ↓      ↓      ↓      ↓      ↓      ↓
E(A0) < E(A1) = E(A2) < E(A3) < E(A4) < E(A5) < log m(X)

2.10 Application to Complete Information Systems: The Case ofFixed Attributes

We have just studied complete information systems whose universe X is fixed, taking into account the possibility of a variation of the set of attributes, for instance when their collection increases. Let us now consider the other point of view, in which the set of attributes is fixed, whereas it is the universe of objects which increases in time. This approach can describe the situation of a hospital, specialized in some disease whose symptomatology has been characterized by a well recognized set of symptoms described as attributes. In this case the information table has the set of attributes fixed, and it is the set of patients which varies in time: ISt0 = 〈Xt0, {fa : Xt0 → val(a) | a ∈ Att}〉 and ISt1 = 〈Xt1, {fa : Xt1 → val(a) | a ∈ Att}〉.

For a fixed information system IS = 〈X, {fa : X → val(a) | a ∈ Att}〉, with X = {x1, x2, . . . , xN}, the finite set of attributes Att = {a1, a2, . . . , aH} gives rise to the collection of attribute values V := val(a1) × val(a2) × . . . × val(aH), which can be considered as an alphabet. Let us consider the information function fAtt : X → V which associates with any object x ∈ X the corresponding value

fAtt(x) = (fa1(x), fa2(x), . . . , faH(x)) ∈ V

Then, the information table generates the string of N letters in the alphabet V

x ≡ fAtt(x1), fAtt(x2), . . . , fAtt(xN) ∈ V^N


which describes a microstate of length N, and which can be represented in a grid of N cells in such a way that the site j of the grid is labelled by the letter fAtt(xj) ∈ V. The set V^N is then the phase space of the IS.

If we denote by V = {v1, v2, . . . , vL} the collection of all the letters of thealphabet V , then we can calculate the number ni(x) of cells of the microstate xwith the same letter vi, obtaining in this way the new string, called configuration,

n(x) ≡ n1(x), n2(x), . . . , nL(x) ∈ NL

Since the cardinality N = |X| of the universe X is related to this configuration by the identity N = ∑_i ni(x), we also refer to n(x) as a configuration involving N objects.

Clearly, two microstates x and y must be considered as belonging to the same macrostate if for any i = 1, 2, . . . , L they have the same number of cells with the letter vi, i.e., it is ni(x) = ni(y). This is of course an equivalence relation on the family of all possible N–length microstates.

The total number W(n) of microstates from the phase space V^N which are characterized by the macrostate n = (n1, n2, . . . , nL) ∈ N^L takes the following form:

W(n) = N! / (n1! n2! . . . nL!)

where ni represents the number of cells carrying the letter vi of the alphabet. If we define as entropy of the configuration n the quantity

h(n) = log W(n)

making use of the Stirling formula and setting pi = ni/N, one obtains the approximation

h(n) = log W(n) ≅ −N ∑_{i=1}^{L} pi log pi

from which the average entropy is defined as

H(n) = h(n)/N ≅ −∑_{i=1}^{L} pi log pi

3 Partial Partitions

Real information systems are often incomplete, meaning that they may present missing information. An incomplete information system is formalized as a triple IIS := 〈X, Att, F〉. Differently from the case of a complete information system, F is a mapping partially defined on a subset D(F) of X × Att. In this way also the mapping representation of an attribute a is partially defined on a subset Xa of X. We call the subset Xa := {x ∈ X : (x, a) ∈ D(F)} the definition domain of the attribute a.

In order to extend to an incomplete information system the previously de-scribed properties and considerations about entropy of partitions, we have atleast two different possibilities.


(i) First of all (see also [4]), let a, b be two attributes. Then, it is possible to define the (non–surjective) mapping fa,b : Xa ∪ Xb → val(a) × val(b) as

fa,b(x) := (fa(x), fb(x)) if x ∈ Xa ∩ Xb;  (fa(x), ∗) if x ∈ Xa ∩ (Xb)^c;  (∗, fb(x)) if x ∈ (Xa)^c ∩ Xb.

The generalization to any subset A of attributes is now straightforward, obtaining a mapping fA : XA → val(A), with XA = ⋃_{a∈A} Xa and val(A) = Π_{a∈A} val(a). Now, for any possible “value” α = (αi) ∈ val(A), one can construct the granule fA^{-1}(α) = {x ∈ XA : fA(x) = α} of X labelled by α, also denoted by [A, α]. The family of granules gr(A) = {[A, α] : α ∈ val(A)}, plus the null granule [A, ∗] = X \ XA (i.e., the collection of the states in which all the attributes are unknown), constitute a partition of the universe X, in which gr(A) is a partition of the subset XA of X (which can be considered as a “partial” partition of X).

(ii) Another possibility could be the following. We can consider the covering generated by a similarity relation. In case of incompleteness one often uses the similarity relation described in [18], according to which two objects x, y ∈ X are said to be similar if and only if

∀ a ∈ A ⊆ Att, either fa(x) = fa(y) or fa(x) = ∗ or fa(y) = ∗

We will start investigating the first of these two options, leaving the treatment of the second one to a further section (section 4) dedicated to coverings.

According to the considerations just discussed at point (i), any incomplete information system on the universe X can be formalized by a collection of surjective mappings fa : Xa → val(a), for a ∈ Att, with Att an index set of attributes, each of which is partially defined on a subset Xa of X. The attributes which are of certain interest for the objects of the universe are then identified with the corresponding mappings. Adding to val(a) the further null value ∗, we can extend the partially defined mapping fa to the globally defined one, denoted by fa^∗ : X → val^∗(a), which, to any object x ∈ X, assigns the value fa^∗(x) = fa(x) if x ∈ Xa, and the value fa^∗(x) = ∗ otherwise.

For any family of attributes A one can construct the “common” definition domain XA = ⋃_{a∈A} Xa and then it is possible to consider the multi–attribute mapping fA assigning to any object x ∈ XA the corresponding collection of values fA(x) = (fa^∗(x))_{a∈A}. Note that for x ∈ XA at least one of the fa^∗(x) ≠ ∗. Formally we can state that x ∉ XA iff ∀ a ∈ A, fa^∗(x) = ∗.

Let us denote by VA^∗ the range of the multi–attribute mapping fA^∗; then on the subset XA of the universe it generates a family of granules fA^{-1}(α) = {x ∈ XA : fA(x) = α} labelled by the multi–value α ∈ VA^∗. Let us denote by Aα = fA^{-1}(α) the generic one of the above granules; then in general ⋃_{α∈VA^∗} Aα = XA ⊆ X and so, even if their collection consists of nonempty and pairwise disjoint subsets of X, they are a partition of the domain XA, but not a partition of the universe X.

Recalling that XA^∗ = X \ XA, we can define the measures mA(Aα) = |Aα| and mA(XA^∗) = 0, and so mA(X) = mA(⋃_α Aα ∪ XA^∗) = ∑_α mA(Aα) + mA(XA^∗) = |XA|, with the natural extension to the σ–algebra of events EA(X) from X generated by the elementary events {Aα : α ∈ VA^∗} ∪ {XA^∗} plus all the subsets (of measure 0) of XA^∗, obtaining in this way a complete measure depending on the set of attributes A. In particular, the measure of the whole universe changes with the choice of A. The corresponding (globally normalized) probabilities are then p∗(Aα) = |Aα|/|X| and p∗(XA^∗) = 0. According to a widely used terminology, the collection π(A) = {Aα : α ∈ VA^∗} is a pseudo probability partition because we have p∗(⋃_α Aα) = |XA|/|X| ≤ 1, i.e., we do not always have the equality to 1. It is possible to define the entropy and related co–entropy of the pseudo probability partition generated by A as follows:

H(A) = (|XA|/|X|) log |X| − (1/|X|) ∑_{α∈VA^∗} |Aα| log |Aα| = −∑_{α∈VA^∗} p∗(Aα) log p∗(Aα)   (37a)

E(A) = ((|X| − |XA|)/|X|) · log |X| + (1/|X|) ∑_{α∈VA^∗} |Aα| log |Aα|   (37b)

Also in this case we have that

H(A) + E(A) = log |X |. (38)
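A small computational sketch may help to fix ideas. The incomplete table used below is a hypothetical toy example of ours (it is neither Table 2 nor Table 3), with "*" marking a missing value; the objects whose attribute values are all "*" are left outside the domain XA, and the entropy and co–entropy are computed as in (37a), (37b) and (38).

from math import log2

table = {            # object -> (a1, a2); "*" means undefined (hypothetical data)
    1: ("x", "p"), 2: ("x", "*"), 3: ("*", "p"),
    4: ("y", "q"), 5: ("*", "*"), 6: ("y", "q"),
}
X = list(table)

def partial_partition(attrs_idx):
    classes = {}
    for obj, row in table.items():
        key = tuple(row[i] for i in attrs_idx)
        if all(v == "*" for v in key):
            continue                      # obj lies outside the domain X_A
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

def H_and_E(pi):
    nX = len(X)
    H = sum(-(len(a) / nX) * log2(len(a) / nX) for a in pi)   # (37a)
    E = log2(nX) - H                                          # relation (38)
    return H, E

pi_A = partial_partition([0])             # A = {a1}
pi_B = partial_partition([0, 1])          # B = {a1, a2}, with A a subset of B
print(H_and_E(pi_A), H_and_E(pi_B))       # H(A) <= H(B) and E(B) <= E(A), as in Theorem 2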

Remark 3. In the context of partial partitions generated by families of attributes from an incomplete information system, let us consider two families of attributes A and B, with A ⊆ B. The mapping fB is defined on XB, which contains the definition domain XA of fA: XA ⊆ XB; also the two corresponding σ–algebras are related in the same way: EA(X) ⊆ EB(X). This latter result assures that π(B) ⊑ π(A) according to (1) (por2) generalized to the present case, but in general it is not assured that π(B) ⪯ π(A).

The following important result about isotonicity holds.

Theorem 2. Given an incomplete information system, let A ⊆ B be two col-lections of attributes, and π(B) and π(A) the corresponding pseudo–probabilitypartitions. Then we have

H(A) ≤ H(B).

Moreover, under the condition |XB| > |XA| the following strict isotonicityholds:

A ⊂ B implies H(A) < H(B).

Proof. Since A ⊆ B, following the discussion of remark 3, in the present case one obtains that, according to (1) (por2), π(B) ⊑ π(A), whereas π(B) ⪯ π(A) is not assured in general. Thus, we can state that there exists at least one Ah ∈ π(A) for which there exists {Bh^1, . . . , Bh^m} ⊆ π(B) s.t. Ah = Bh^1 ∪ . . . ∪ Bh^m. From the point of view of probabilities one has that

p∗(Ah) = |Ah|/|X| = ∑_{i=1}^{m} p∗(Bh^i)   (39)


If one follows the proof of the property of additivity of the entropy of partitions in [25, p. 84], taking into account (39), it is possible to prove that

p∗(Ah) log p∗(Ah) − ∑_{k=1}^{m} p∗(Bh^k) log p∗(Bh^k)
   = ( ∑_{i=1}^{m} p∗(Bh^i) ) log p∗(Ah) − ∑_{k=1}^{m} p∗(Bh^k) log p∗(Bh^k)
   = p∗(Ah) · H( p∗(Bh^1)/p∗(Ah), . . . , p∗(Bh^m)/p∗(Ah) ).

From this result one obtains that

H(B) = −∑_{α∈VB^∗} (|Bα|/|X|) log (|Bα|/|X|) = (|XB|/|X|) log |X| − (1/|X|) ∑_{α∈VB^∗} |Bα| log |Bα|
     = H( p∗(A1), . . . , p∗(Bh^1), . . . , p∗(Bh^m), . . . , p∗(AN), p∗(Bz1), . . . , p∗(Bzk) )
     = H( p∗(A1), . . . , p∗(Ah), . . . , p∗(AN) ) + p∗(Ah) · H( p∗(Bh^1)/p∗(Ah), . . . , p∗(Bh^m)/p∗(Ah) ) + H( p∗(Bz1), . . . , p∗(Bzk) )

Let us denote the last term H(p∗(Bz1), . . . , p∗(Bzk)) by H(B \ A). This term represents the part of the entropy H(B) regarding the classes of the partition π(B) that are not involved in π(A), i.e., the classes Bz1, . . . , Bzk ∈ π(B) whose elements belong to XB \ XA.

Thus we have

H(B) = H(A) + p∗(Ah) · H( p∗(Bh^1)/p∗(Ah), . . . , p∗(Bh^m)/p∗(Ah) ) + H(B \ A)

This result means that we always have H(B) ≥ H(A), and thus we can write

A ⊆ B implies H(A) ≤ H(B) (40)

In particular, if |XB| > |XA| then we have that H(B \ A) > 0, and from thistrivially H(A) < H(B) follows.

As a direct consequence of theorem 2, and making use of (40) and (38), we havethe following corollary regarding the co–entropy E(A).

Corollary 1. Let A ⊆ B be two collections of attributes. Then we have E(B) ≤E(A).

Remark 4. Let us stress that our newly defined co–entropy (37b) is a general-ization of the co–entropy (13) defined for complete information systems. In fact,in the particular case of a complete information system we have that XA = X ,and thus (37b) becomes (13).


3.1 Local Rough Granularity Measure in the Case of PartialPartitions

Let us consider a subset Y ⊆ X. Relatively to a definition domain XA, we may distinguish a subset YA = Y ∩ XA and YA^∗ = Y \ YA, thus obtaining Y = YA ∪ YA^∗. Similarly to what we have discussed in subsection 2.8, one can introduce two new notions of accuracy and roughness of Y ⊆ X relatively to the probability partition πA, as follows:

Definition 2. Given an incomplete information system, let us take Y ⊆ X. LetXA be a definition domain corresponding to A ⊆ Att. We can define the followingaccuracy and roughness coefficients relative to the rough approximation of Y :

απA(Y) := ( |lπA(Y)| + |eπA(Y)| ) / |XA|   (41a)

ρπA(Y) := 1 − απA(Y) = |bπA(Y)| / |XA|.   (41b)

The following holds

Proposition 4. Given Y ⊆ X, let A ⊆ B be two collections of attributes of an incomplete information system, and π(B) and π(A) the corresponding probability partitions. Then,

Y ⊆ XA implies ρπB(Y) ≤ ρπA(Y).

Proof. Since we have that A ⊆ B, we know that π(B) ⊑ π(A) and that XA ⊆ XB. Thus |XA| ≤ |XB| and |bπB(Y)| ≤ |bπA(Y)|. Hence we obtain

Y ⊆ XA and A ⊆ B implies ρπB(Y) ≤ ρπA(Y).   (42)

In analogy with the probability partition of a complete information system, we can define a rough entropy of Y relatively to the probability partition πA of an incomplete information system as:

Definition 3. Given an incomplete information system, let us consider Y ⊆ X,and A ⊆ Att. We can define the following rough entropy of Y relatively to theprobability partition πA:

EπA(Y ) = ρπA(Y ) · E(A). (43)

We have the following:

Proposition 5. Let us consider an incomplete information system. Given Y ⊆X, let A ⊆ B be two collections of attributes, XA ⊆ XB the correspondingdefinition domains, and π(B) and π(A) the corresponding probability partitions.If Y ⊆ XA then we have

EπB (Y ) ≤ EπA(Y )


Proof. As previously shown, Corollary 1 holds and, under the condition Y ⊆ XA, ρπB(Y) ≤ ρπA(Y) holds too. Hence, trivially we have that

Y ⊆ XA and A ⊆ B implies EπB(Y) ≤ EπA(Y).   (44)

If Y ⊈ XA we have in general that A ⊆ B does not imply ρπB(Y) ≤ ρπA(Y), as illustrated in example 9.

Example 9. Let us consider the incomplete information system illustrated intable 3 with universe X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}.

Table 3. Flats incomplete information system

Flat  Price  Rooms  Down-Town  Furniture
 1    high   2      *          no
 2    high   2      *          *
 3    high   2      *          no
 4    low    3      yes        no
 5    low    3      yes        *
 6    low    3      yes        no
 7    low    1      no         no
 8    low    1      *          *
 9    *      *      yes        yes
10    *      *      yes        *
11    *      *      *          *
12    *      *      *          no
13    *      *      *          no
14    *      *      *          *

Let us choose the set Y = {1, 2, 7, 8, 9} ⊆ X (we can think of having a decision attribute, for instance “Close to the railway station”, which partitions X into flats which are close to the railway station (flats {1, 2, 7, 8, 9}) and flats which are considered far ({3, 4, 5, 6, 10, 11, 12, 13, 14})).

Now let us consider two subsets of attributes, A, B ⊆ Att, with A = {Price, Rooms} and B = {Price, Rooms, Down-Town}. We thus obtain two definition domains XA = {1, 2, 3, 4, 5, 6, 7, 8} ⊆ XB = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, and the corresponding probability partitions πB = {{1, 2, 3}, {4, 5, 6}, {7}, {8}, {9, 10}} and πA = {{1, 2, 3}, {4, 5, 6}, {7, 8}}, with respect to which πB ⊑ πA. The subset Y is such that lπA(Y) = lπB(Y) = {7, 8}, uπA(Y) = {1, 2, 3, 7, 8}, uπB(Y) = {1, 2, 3, 7, 8, 9, 10}, eπA(Y) = eπB(Y) = {4, 5, 6}. This result implies that απB(Y) = 5/10 < απA(Y) = 5/8, ρπA(Y) = 3/8 < ρπB(Y) = 5/10. Moreover, as illustrated in table 4, the entropies and co–entropies in the case presented result to be respectively anti–tone and isotone as expected, whereas the local rough entropies do not respect the desired order.

Table 4. Entropies with different attribute families

     H     E     Eπ
A   1.35  2.45  0.92
B   1.89  1.91  0.95

Hence, we have a local rough entropy (43) which behaves isotonically only under the restriction Y ⊆ XA. Since this restriction is due to the definition we gave of ρπA(Y), we propose the following possible solution: we consider the whole set of attributes Att for the accuracy and roughness measure, thus obtaining constant values (given an incomplete information system), independent from the relationship between Y and the definition domains.

απAtt(Y) := ( |lπAtt(Y)| + |eπAtt(Y)| ) / |XAtt|   (45a)

ρπAtt(Y) := 1 − απAtt(Y) = |bπAtt(Y)| / |XAtt|.   (45b)

In this way, we can define the following measure of rough entropy of Y relatively to the probability partition πA of an incomplete information system:

EπA^Att(Y) := ρπAtt(Y) · E(A).   (46)

This new local rough entropy behaves isotonically, with respect to the partitionorder, for any Y ⊆ X , as trivially shown by the following proposition.

Proposition 6. Let us consider an incomplete information system. Given Y ⊆ X, let A ⊆ B be two collections of attributes, XA ⊆ XB the corresponding definition domains, and π(B) and π(A) the corresponding probability partitions. Then, we have

EπB^Att(Y) ≤ EπA^Att(Y).

Proof. Given Att, (45) is a constant value; thus from Corollary 1 we trivially obtain the result

∀Y ⊆ X, A ⊆ B implies EπB^Att(Y) ≤ EπA^Att(Y).   (47)

Example 10. Let us consider the example 9. If we take the whole set of attributes Att = {Price, Rooms, Down-Town, Furniture}, the corresponding partition is π(Att) = {{1, 3}, {2}, {4, 6}, {5}, {7}, {8}, {9}, {10}, {12, 13}}. With respect to the set Y, the lower approximation is lπAtt(Y) = {2, 7, 8, 9}, the upper is uπAtt(Y) = {1, 2, 3, 7, 8, 9}, and the external region is eπAtt(Y) = {4, 5, 6, 10, 12, 13}. From this result we have that απAtt(Y) = 10/12 and ρπAtt(Y) = 1/6. Thus we obtain that, in the present case, the local rough entropies (46) for π(A) and π(B) are respectively EπA^Att(Y) = 0.40898 > EπB^Att(Y) = 0.31832, as expected.
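The difference between the two local rough entropies (43) and (46) can also be seen on a toy case. The Python sketch below is ours and uses small hypothetical partial partitions (not the data of Table 3): in (43) the roughness refers to the current family, and the ordering is not guaranteed when Y is not included in XA; in (46) the roughness always refers to πAtt, and the ordering of Proposition 6 holds.

from math import log2

X = set(range(1, 9))
Y = {1, 2, 4, 7}
pi_A   = [{1, 2, 3}, {4, 5}]                  # hypothetical partial partition, X_A = {1,...,5}
pi_B   = [{1, 2}, {3}, {4, 5}, {6, 7}]        # X_B = {1,...,7}, corresponding to A ⊆ B
pi_Att = [{1}, {2}, {3}, {4, 5}, {6}, {7}, {8}]

def boundary(pi, Y):
    lower = set().union(*[c for c in pi if c <= Y])
    upper = set().union(*[c for c in pi if c & Y])
    return upper - lower

def E(pi):  # co-entropy (37b) of a partial partition, with |X| = 8
    covered = sum(len(c) for c in pi)
    return (len(X) - covered) / len(X) * log2(len(X)) + sum(len(c) * log2(len(c)) for c in pi) / len(X)

def rough_entropy_43(pi, Y):   # (43): roughness (41b) w.r.t. pi itself
    dom = set().union(*pi)
    return len(boundary(pi, Y)) / len(dom) * E(pi)

def rough_entropy_46(pi, Y):   # (46): roughness (45b) always w.r.t. pi_Att
    dom = set().union(*pi_Att)
    return len(boundary(pi_Att, Y)) / len(dom) * E(pi)

print(rough_entropy_43(pi_B, Y), rough_entropy_43(pi_A, Y))  # order not guaranteed in general
print(rough_entropy_46(pi_B, Y), rough_entropy_46(pi_A, Y))  # always E^Att(B) <= E^Att(A)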

4 Coverings

In this section we will treat the approach of extracting coverings from incompleteinformation systems by using the so–called similarity relation introduced in [18]


and previously illustrated. We will also illustrate an approach that allows one to extract a collection of “similarity” classes from coverings that are not induced by a similarity relation.

If one wants to use entropy or rough entropy – for example in order to reduce the information system – it is necessary to find an entropy that behaves “well” in the context of coverings. That is why in this section we will summarize the main approaches, and attempted approaches, to entropies for coverings.

4.1 Genuine Coverings

It is interesting to observe that in the case of a covering γ = {B1, B2, . . . , BN}of the universe X it may happen that some of its elements are in some senseredundant. In particular, if for two of its elements Bi ∈ γ and Bj ∈ γ it resultsthat Bi ⊆ Bj , then from the covering point of view the subset Bi can be consid-ered “irrelevant” since its covering role is performed by the larger set Bj . It isthus interesting to select those particular “genuine” coverings for which this re-dundancy is not involved. To this purpose we introduce the following definition:A covering γ = {B1, B2, . . . , BN} is said to be genuine iff there is no elementBi ∈ γ equal to the whole X and the following condition is satisfied:

∀Bi ∈ γ, ∀Bj ∈ γ with Bj ≠ Bi, Bi ≠ Bi ∩ Bj

or equivalently,

∀Bi ∈ γ, ∀Bj ∈ γ, Bi ⊆ Bj implies Bi = Bj .

The collection of all genuine coverings of X will be denoted by Γg(X) in the sequel. A canonical procedure to obtain a genuine covering γg from a generic covering γ = {B1, B2, . . . , BN} of X is the following:

(G1) construct all the maximal chains from γ with respect to the set theoretic inclusion: Ci(γ) = {Bi1, Bi2, . . . , BiM}, Cj(γ) = {Bj1, Bj2, . . . , BjM}, . . . , Cz(γ) = {Bz1, Bz2, . . . , BzM}.

(G2) collect all the maximal elements {BiM, BjM, . . . , BzM}; then this is a genuine covering of X, induced by γ and denoted by γg. Trivially, γg ⊆ γ, from which the following follow:

∀Bi ∈ γ, ∃BjM ∈ γg s.t. Bi ⊆ BjM   (48a)
∀BjM ∈ γg, ∃Bi ∈ γ s.t. BjM ⊆ Bi   (48b)

Example 11. Let us consider the universe X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and thecovering γ = {B1 = {1, 2, 3, 4, 5, 6}, B2 = {3}, B3 = {2, 3}, B4 = {3, 4, 6}, B5 ={4, 5, 6, 7, 8, 9, 10}, B6 = {7, 8, 9}, B7 = {7, 9}}. Then there are 3 maximal chainsC1 = {B2 = {3}, B3 = {2, 3}, B1 = {1, 2, 3, 4, 5, 6}}, C2 = {B2 = {3}, B4 ={3, 4, 6}, B1 = {1, 2, 3, 4, 5, 6}}, and C3 = {B7 = {7, 9}, B6 = {7, 8, 9}, B5 ={4, 5, 6, 7, 8, 9, 10}} and so the genuine covering induced by γ is γg = {B1 ={1, 2, 3, 4, 5, 6}, B5 = {4, 5, 6, 7, 8, 9, 10}}.
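The procedure (G1)–(G2) amounts to keeping the maximal elements of γ with respect to set inclusion. The following Python sketch is ours (it assumes no element of the covering equals the whole universe X, as required by the definition of genuine covering); it is checked against Example 11.

X = set(range(1, 11))
gamma = [{1, 2, 3, 4, 5, 6}, {3}, {2, 3}, {3, 4, 6},
         {4, 5, 6, 7, 8, 9, 10}, {7, 8, 9}, {7, 9}]

def genuine(cov):
    # keep only the maximal sets of the covering (proper-subset test "<")
    return [b for b in cov if not any(b < c for c in cov)]

print(genuine(gamma))   # [{1, 2, 3, 4, 5, 6}, {4, 5, 6, 7, 8, 9, 10}], as in Example 11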


Let us observe that a genuine covering is not always a minimal covering, where a minimal covering γm is a subcovering of γ consisting of the smallest possible number of sets.

Example 12. Let us consider the universe X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and thecovering γ = {B1 = {2, 3}, B2 = {3, 4, 6}, B3 = {1, 4, 5, 6, 7}, B4 = {5, 7, 8, 9},B5 = {7, 9}, B6 = {6, 7, 8, 9, 10}}. The minimal covering of γ is γm = {B1, B3,B6}, whereas the genuine covering is γg = {B1, B2, B3, B4, B6}. In fact, accord-ing to the above definition, in the genuine covering we have to add both sets B2

and B4 since they are not represented by any other set, i.e., there is no Bi, Bj ∈ γsuch that B2 ⊆ Bi and B4 ⊆ Bj .

4.2 Orderings and Quasi–Orderings on Coverings

Since an entropy behaves “well” when it is isotonic or anti–tonic with respect to acertain (quasi) order relation, we will now introduce some definitions of orderingsand quasi–orderings for coverings. In [5] and [3] one can find the definitions ofsome quasi–orderings (i.e., a reflexive and transitive, but in general non anti–symmetric relation [8, p. 20]) and an ordering for generic coverings, as extensionto this context of the formulations (por1)–(por4) of the previously discussedordering on partitions, with the first two and the fourth of the “global” kind andthe third one of the “pointwise” kind.

“Global” Orderings and Quasi–Orderings on Coverings. In the present section we take into account the generalization of only the first two global cases. The first quasi–ordering is the extension of (por1), given by the following binary relation for γ, δ ∈ Γ(X):

γ ⪯ δ iff ∀Ci ∈ γ, ∃Dj ∈ δ s.t. Ci ⊆ Dj   (49)

The corresponding strict quasi–order relation is γ ≺ δ iff γ ⪯ δ and γ ≠ δ.

Let us observe that in the class of all genuine coverings Γg(X) the binary relation ⪯ is an ordering [5]. In fact, let γ, δ be two genuine coverings of X such that γ ⪯ δ and δ ⪯ γ. Then for every C ∈ γ, and using γ ⪯ δ, we have that there exists D ∈ δ such that C ⊆ D; but from δ ⪯ γ it follows that there is also a C′ ∈ γ such that D ⊆ C′, and so C ⊆ D ⊆ C′ and by the genuine condition of γ necessarily C = D. Vice versa, for every D ∈ δ there exists C ∈ γ such that D = C.

Another quasi–ordering on Γ(X), which generalizes to coverings the formulation (por2), is defined by the following binary relation:

γ ⊑ δ iff ∀D ∈ δ, ∃{C1, C2, . . . , Cp} ⊆ γ s.t. D = C1 ∪ C2 ∪ . . . ∪ Cp   (50)

In the covering context, there is no general relationship between (49) and (50), since it is possible to give an example of two (genuine) coverings γ, δ for which γ ⪯ δ but not γ ⊑ δ, and of two other (genuine) coverings η, ξ for which η ⊑ ξ but not η ⪯ ξ.


Let us now illustrate a binary relation introduced by Wierman (in the covering context, in an unpublished work which he kindly sent to us) (see also [3]):

γ ⊑W δ iff ∀Ci ∈ γ, ∀Dj ∈ δ, Ci ∩ Dj ≠ ∅ implies Ci ⊆ Dj   (51)

This binary relation, which corresponds to (1) (por4) in Π(X), has the advantage of being anti–symmetric on the whole Γ(X); but it presents the drawback (as explained by Wierman himself) that it is not reflexive in the covering context, as illustrated in the following example.

Example 13. Let us consider a universe X = {1, 2, 3, 4, 5} and a covering γ = {C1 = {1, 2, 3}, C2 = {3, 4}, C3 = {2, 5}}. For the reflexivity of (51) we should have that ∀ δ ∈ Γ(X), δ ⊑W δ. But from this simple example we can see that this is not true. In fact γ ⊑W γ does not hold since, for instance, C1 ∩ C2 ≠ ∅ but we have neither C1 ⊆ C2 nor C2 ⊆ C1.

For this reason, in order to define an ordering on coverings, Wierman added the further condition γ = δ in the following way:

γ ≤W δ iff γ = δ or γ ⊑W δ   (52)

So, we can see that in the covering case it is difficult to maintain the threeproperties of reflexivity, transitivity and anti–symmetry at the same time unlessone adds more conditions in the definition of the binary relation, or restricts theapplicability on a subclass of coverings, such as the class of all genuine ones.

Another advantage of (52) (as illustrated by Wierman himself) is that the pair (Γ(X), ≤W) is a poset, lower bounded by the discrete partition πd = {{x1}, {x2}, . . . , {xm(X)}}, which is the least element, and upper bounded by the trivial partition πg = {X}, which is the greatest element. Moreover, it is a lattice.

Let us now illustrate how one can extract a partition π(γ) from a covering γ. We thought of a method consisting of two main steps (see [7]): first we create the covering completion γc, which consists of all the sets Ci of γ and of all the complements Ci^c; then, for each x ∈ X, we generate the granule gr(x) = ⋂{C ∈ γc : x ∈ C}. The collection of all granules gr(x) is a partition. The Wierman approach to generate a partition from a covering presents a different formulation, i.e., gr(x) = ⋂_{x∈C} C \ ⋃_{x∉C} C, which is equal to the just introduced granule.

Proposition 7. Given two coverings γ and δ one has that

γ ≤W δ implies π(γ) ⪯ π(δ)

This property (not described by Wierman) is very important since it allows us to compare two coverings through the entropies of their induced partitions, which behave anti–tonically with respect to the standard order relation (1) on Π(X). Hence we obtain the following important result:

Proposition 8. Given two coverings γ and δ the following holds:

γ ≤W δ implies H(π(δ)) ≤ H(π(γ)) and E(π(γ)) ≤ E(π(δ))
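The two-step extraction of π(γ) from a covering can be coded directly. The following Python sketch is ours (the covering is an illustrative example, not one from the text): it builds the completion γc of γ and intersects, for each point x, all the sets of γc containing x.

X = set(range(1, 6))
gamma = [{1, 2, 3}, {3, 4, 5}]

def induced_partition(cov, X):
    completion = list(cov) + [X - c for c in cov]   # the covering completion gamma_c
    granules = set()
    for x in X:
        gr = frozenset(set.intersection(*[c for c in completion if x in c]))
        granules.add(gr)
    return granules

print(induced_partition(gamma, X))   # {{1, 2}, {3}, {4, 5}} for this covering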


“Pointwise” Quasi–Orderings on Coverings. Let us now consider a covering γ = {C1, C2, . . . , CN} of X and let us see whether one can construct a collection of |X| similarity classes, where each class is generated by an object x ∈ X. The aim is to obtain a new covering induced by the original one, i.e., a new covering which expresses the original one via a collection of some kind of similarity classes. In [5] one can find the description of two possible kinds of similarity classes induced by an object x of the universe X: the lower granule γl(x) := ∩{Ci ∈ γ : x ∈ Ci} and the upper granule γu(x) := ∪{Cj ∈ γ : x ∈ Cj} generated by x. Of course, in the case of a trivial covering the upper granule of any point x is the whole universe X, and so this notion turns out to be “significant” only in the case of non trivial coverings.

Thus, given a covering γ of a universe X, for any x ∈ X we can define the granular rough approximation of x induced by γ as the pair rγ(x) := 〈γl(x), γu(x)〉, where x ∈ γl(x) ⊆ γu(x). The collections γu := {γu(x) : x ∈ X} and γl := {γl(x) : x ∈ X} of all such granules are both coverings of X, called the upper covering and the lower covering generated by γ. In particular, we obtain that for any covering γ of X the following hold: γl ⪯ γ ⪯ γu and γl ⊑ γ ⊑ γu. We can introduce now two more quasi–order relations on Γ(X), defined by the following binary relations:

γ ⪯u δ iff ∀x ∈ X, γu(x) ⊆ δu(x)
γ ⪯l δ iff ∀x ∈ X, γl(x) ⊆ δl(x)

In [5] we have shown that γ ⪯ δ implies γ ⪯u δ, but it is possible to give an example of two coverings γ, δ such that γ ⪯ δ and for which γ ⪯l δ does not hold. So it is important to consider a further quasi–ordering on coverings defined as

γ ⊴ δ iff δ ⪯l γ and γ ⪯u δ.   (53)

which can be equivalently formulated as:

γ ⊴ δ iff ∀x ∈ X, δl(x) ⊆ γl(x) ⊆ (???) ⊆ γu(x) ⊆ δu(x)

where the question marks represent an intermediate covering granule γ(x), which is something “hidden” in the involved structure. This pointwise behavior can be formally denoted by ∀x ∈ X, rγ(x) := 〈γl(x), γu(x)〉 ≼ 〈δl(x), δu(x)〉 =: rδ(x). In other words, ⊴ means that for any point x ∈ X the local approximation rγ(x) given by the covering γ is better than the local approximation rδ(x) given by the covering δ. So equation (53) can be summarized by γ ⊴ δ iff ∀x ∈ X, rγ(x) ≼ rδ(x) (this latter simply written in a more compact form as rγ ≼ rδ).

Orderings on Coverings in the Case of Incomplete Information Sys-tems. Let us now consider which (quasi) order relations can be defined in thecase of incomplete information systems IIS = 〈X, Att, F 〉. Let us start by de-scribing how one can extract coverings from an incomplete information system.For any family A of attributes it is possible to define on the objects of X thesimilarity relation SA introduced in [18]:


xSAy iff ∀ a ∈ A, either fa(x) = fa(y) or fa(x) = ∗ or fa(y) = ∗. (54)

This relation generates a covering of the universe X through the granules of information (also called similarity classes) sA(x) = {y ∈ X : (x, y) ∈ SA}, since X = ∪{sA(x) : x ∈ X} and x ∈ sA(x) ≠ ∅. In the sequel this kind of covering will be denoted by γ(A) := {sA(x) : x ∈ X} and their collection by Γ(IS) := {γ(A) ∈ Γ(X) : A ⊆ Att}.

Let us observe that, if in an incomplete information system 〈X, Att, F〉 we consider two subfamilies of attributes A, B ⊆ Att, with the induced coverings of X denoted by γ(A) and γ(B), the following holds:

B ⊆ A implies γ(A) ⪯ γ(B)   (55)

Unfortunately, in general B ⊆ A does not imply γ(A) ⊑ γ(B), as illustrated in the following example.

Example 14. Let us consider the incomplete information system represented intable 5.

Table 5. Flats incomplete information system

Flat  Price  Rooms  Down-Town  Furniture
f1    high   2      yes        *
f2    high   *      yes        no
f3    *      2      yes        no
f4    low    *      no         no
f5    low    1      *          no
f6    *      1      yes        *

If one considers the set of all attributes (i.e., A = Att(X)) and the in-duced similarity relation we obtain the following covering: γ(A) = {sA(f1) =sA(f3) = {f1, f2, f3}, sA(f2) = {f1, f2, f3, f6}, sA(f4) = {f4, f5}, sA(f5) ={f4, f5, f6}, sA(f6) = {f2, f5, f6}} This covering is not genuine since, for in-stance, sA(f1) = sA(f3) ⊂ sA(f2).

Let us now take the subfamily of attributes D = {Price, Rooms} and the in-duced similarity relation. The resulting covering is γ(D) = {sD(f1) = {f1, f2, f3},sD(f2) = {f1, f2, f3, f6}, sD(f3) = {f1, f2, f3, f4}, sD(f4) = {f3, f4, f5, f6},sD(f5) = {f4, f5, f6}, sD(f6) = {f2, f4, f5, f6}} of X . Also this covering is notgenuine since, for instance, sD(f1) ⊂ sD(f2).

It is easy to see that γ(A) ⪯ γ(D), but it is not true that γ(A) ⊑ γ(D): in fact, there is no collection of subsets sA(fi) ∈ γ(A) whose set union is sD(f4) = {f3, f4, f5, f6} ∈ γ(D).
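The similarity classes of Example 14 can be recomputed from Table 5 with a few lines of Python. The sketch below is ours (the helper names `similar` and `covering` are not from the paper); it reproduces γ(A) for the whole attribute set and γ(D) for D = {Price, Rooms}.

table = {  # flat -> (Price, Rooms, Down-Town, Furniture); "*" = missing, as in Table 5
    "f1": ("high", "2", "yes", "*"),
    "f2": ("high", "*", "yes", "no"),
    "f3": ("*", "2", "yes", "no"),
    "f4": ("low", "*", "no", "no"),
    "f5": ("low", "1", "*", "no"),
    "f6": ("*", "1", "yes", "*"),
}

def similar(x, y, attrs):  # the tolerance relation (54) restricted to the attribute indices in attrs
    return all(table[x][a] == table[y][a] or "*" in (table[x][a], table[y][a])
               for a in attrs)

def covering(attrs):  # x -> similarity class s_A(x)
    return {x: {y for y in table if similar(x, y, attrs)} for x in table}

print(covering(range(4)))   # gamma(A) for A = Att, as listed in Example 14
print(covering(range(2)))   # gamma(D) for D = {Price, Rooms}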

On the coverings generated from an incomplete information system we can usethe following pointwise binary relation [21], which corresponds to the general-ization to incomplete information systems of the formulation (por3) (1) of thestandard order relation on partitions; let us consider A, B ⊆ Att, we define:

γ(A) ≤s γ(B) iff ∀x ∈ X, sA(x) ⊆ sB(x) (56)


This is a partial order relation, and we have that

B ⊆ A implies γ(A) ≤s γ(B), (57)

but in general γ(A) ≤s γ(B) does not imply π(γ(A)) ⪯ π(γ(B)), as illustrated in the following example.

Example 15. Let us consider the incomplete information system of table 5 fromthe previous example 14. Let us consider the whole collection of attributesA = Att(X) and its induced covering γ(A), previously illustrated. As to thegranules generated by the completion γc(A) of γ(A), according to the procedureabove described, in the present case we have π(γ(A)) = {grA(f1) = grA(f3) ={f1, f3}, grA(f2) = {f2}, grA(f4) = {f4}, grA(f5) = {f5}, grA(f6) = {f6}}.

Let us now consider the subfamily of attributes B = {Price, Down-Town, Furniture}. As for A = Att(X), let us consider the induced similarity relation and the resulting covering γ(B) = {sB(f1) = sB(f2) = {f1, f2, f3, f6}, sB(f3) = sB(f6) = {f1, f2, f3, f5, f6}, sB(f4) = {f4, f5}, sB(f5) = {f3, f4, f5, f6}} of X. Trivially we have γ(A) ≤s γ(B). The partition generated by the completion γc(B) of γ(B), according to the same procedure used for the granules generated by the completion γc(A) of γ(A), in this example is: π(γ(B)) = {grB(f1) = grB(f2) = {f1, f2}, grB(f3) = grB(f6) = {f3, f6}, grB(f4) = {f4}, grB(f5) = {f5}}. Let us observe that there is no ordering relation of any kind between the two partitions π(γ(A)) = {grA(fi)} and π(γ(B)) = {grB(fj)}, although these partitions have been generated from the same information system starting from two families of attributes B and A = Att with B ⊂ A = Att.

The same happens when considering the quasi–orderings (49), (50): for instance, neither γ(A) ⪯ γ(B) nor γ(A) ⊑ γ(B) implies π(γ(A)) ⪯ π(γ(B)).

Let us now see what happens considering the partial order relation ≤W ofequation (52) in the case of coverings induced from an incomplete informationsystem. Let us start from an example:

Example 16. Let us consider the incomplete information system illustrated intable 6. Let us take the whole set of attributes (i.e., A = Att(X)) and the inducedsimilarity relation. The resulting covering is γ(A) = {sA(p1) = {p1, p4, p9, p10},sA(p2) = {p2, p9}, sA(p3)={p3}, sA(p4) = {p4, p1, p9}, sA(p5)={p5}, sA(p6)={p6, p10}, sA(p7) = {p7}, sA(p8) = {p8}, sA(p9) = {p9, p1, p2, p4}, sA(p10) ={p10, p1, p6}}.

Let us now consider the subset of attributes B = {Fever, Headache,Dizziness, BloodPressure} and the induced similarity relation. In this case thecovering is γ(B) = {sB(p1) = {p1, p4, p8, p9, p10}, sB(p2) = {p2, p9}, sB(p3) ={p3}, sB(p4) = {p4, p1, p7, p9}, sB(p5) = {p5}, sB(p6) = {p6, p10}, sB(p7) ={p7, p2, p4, p9, p10}, sB(p8) = {p8, p1, p10}, sB(p9) = {p9, p1, p2, p4, p7}, sB(p10) ={p10, p1, p6, p7, p8}}.

Thus we have B ⊆ A and γ(A) ⪯ γ(B), but we do not have γ(A) ≤W γ(B). In fact we can see that, for instance, sA(p9) ∩ sB(p1) ≠ ∅, but sA(p9) ⊈ sB(p1).


Table 6. Medical incomplete information system

Patient  Fever  Headache  Dizziness  Blood Pressure  Chest Pain
p1       low    yes       yes        *               yes
p2       high   *         yes        low             yes
p3       *      no        no         low             no
p4       low    *         yes        low             yes
p5       low    yes       no         low             no
p6       high   yes       no         *               yes
p7       *      no        yes        *               no
p8       *      yes       yes        high            no
p9       *      *         yes        low             yes
p10      *      *         *          high            yes

This means that, given two families of attribute sets A and B from an incompleteinformation system, and the corresponding induced coverings γ(A) and γ(B)respectively, unfortunately condition B ⊆ A does not imply γ(A) ≤W γ(B).

4.3 Entropies and Co–Entropies of Coverings: The “Global”Approach

In [5] one can find the following definitions of entropies with corresponding co–entropies for coverings, whose restrictions to partitions induce the standard en-tropy and co–entropy.

Let us consider a covering γ = {B1, B2, . . . , BN} of the universe X. Let us start from an entropy based on a probability distribution in which the probability of an elementary event is represented by p(Bi) = m(Bi)/|X|, where

m(Bi) = ∑_{x∈X} χBi(x) / ( ∑_{j=1}^{N} χBj(x) )

(χBi being the characteristic functional of the set Bi, for any point x ∈ X) (see also [5,2,6,12]). The resulting entropy is then:

H(γ) = −∑_{i=1}^{N} p(Bi) log p(Bi)   (58)

The corresponding co–entropy, which complements the entropy with respect to the quantity log |X|, is

E(γ) = (1/|X|) ∑_{i=1}^{N} m(Bi) log m(Bi)   (59)

A drawback of this co–entropy is that it can assume negative values, due to the fact that the quantities m(Bi) can also lie in the real interval [0, 1] (see [5,6]). Let us now describe a second approach to entropy and co–entropy for coverings. We can define the total outer measure of X induced from γ as m∗(γ) := ∑_{i=1}^{N} |Bi| ≥ |X| = mc(X) > 0. An alternative probability of occurrence of the elementary event Bi from the covering γ can be described as p∗(Bi) := |Bi|/m∗(γ), obtaining the vector p∗(γ) := (p∗(B1), p∗(B2), . . . , p∗(BN)), which is a probability distribution since trivially: (1) every p∗(Bi) ≥ 0; (2) ∑_{i=1}^{N} p∗(Bi) = 1. Hence we can define a second entropy of a covering as

H∗(γ) = log m∗(γ) − (1/m∗(γ)) ∑_{i=1}^{N} |Bi| log |Bi|   (60)

and the corresponding co–entropy as

E∗(γ) := (1/m∗(γ)) ∑_{i=1}^{N} |Bi| log |Bi|   (61)

This co–entropy complements the entropy (60) with respect to the quantity log m∗(γ), which depends on the choice of the covering γ. This fact represents a potential disadvantage when one studies the behavior of the entropy and co–entropy with respect to some quasi–ordering of coverings. In fact, showing an anti–monotonic behavior for the entropy would not lead automatically to a desired behavior of the co–entropy. Thus we now introduce a new definition for the co–entropy, by imposing that this new co–entropy complements the entropy (60) with respect to the fixed quantity log |X|.

E∗c(γ) := log |X| − log m∗(γ) + (1/m∗(γ)) ∑_{i=1}^{N} |Bi| log |Bi|   (62)

Let us now see a third approach to entropy and co–entropy for coverings, based on the probability of the elementary event Bi defined as pLX(Bi) := |Bi|/|X| (see [5,2]). In this case one can observe that the probability vector pLX(γ) := (pLX(B1), pLX(B2), . . . , pLX(BN)) does not define a probability distribution, since in general ∑_{i=1}^{N} pLX(Bi) ≥ 1. Keeping in mind this characteristic, we can define the following pseudo–entropy (originally denoted by H(g)LX)

H(g)(γ) = m∗(γ) · (log |X| / |X|) − (1/|X|) ∑_{i=1}^{N} |Bi| log |Bi|   (63)

The following quantity (originally denoted by E(g)LX) was firstly introduced as co–entropy

E(g)(γ) := (1/|X|) ∑_{i=1}^{N} |Bi| log |Bi|   (64)

But in this case also we have the drawback of a co–entropy that complements the entropy with respect to a quantity, m∗(γ) · log |X| / |X|, which depends on the covering γ. Again, in order to avoid this unpleasant situation, let us now define another co–entropy such that it complements the entropy (63) with respect to log |X|.

E(g)c(γ) := log |X| · ( (|X| − m∗(γ)) / |X| ) + (1/|X|) ∑_{i=1}^{N} |Bi| log |Bi|   (65)
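All the quantities (58)–(65) can be computed with a few lines of code. The following Python sketch is ours and uses an illustrative covering (not one of the paper's examples), with base-2 logarithms; it also checks that ∑ m(Bi) = |X| and that (58)–(59) sum to log |X|.

from math import log2

X = set(range(1, 11))
gamma = [{1, 2, 3, 4}, {3, 4, 5, 6, 7}, {7, 8, 9, 10}]

def m(B):  # measure of (58): each point is shared among the sets covering it
    n_cov = {x: sum(x in C for C in gamma) for x in X}
    return sum(1 / n_cov[x] for x in B)

# first approach, eqs. (58)-(59)
p = [m(B) / len(X) for B in gamma]
H = -sum(pi * log2(pi) for pi in p)
E = sum(m(B) * log2(m(B)) for B in gamma) / len(X)

# second approach, eqs. (60)-(62), built on the total outer measure m*(gamma)
m_star = sum(len(B) for B in gamma)
E_star = sum(len(B) * log2(len(B)) for B in gamma) / m_star
H_star = log2(m_star) - E_star
E_star_c = log2(len(X)) - log2(m_star) + E_star

# third approach, eqs. (63)-(65)
H_g = m_star * log2(len(X)) / len(X) - sum(len(B) * log2(len(B)) for B in gamma) / len(X)
E_g = sum(len(B) * log2(len(B)) for B in gamma) / len(X)
E_g_c = log2(len(X)) * (len(X) - m_star) / len(X) + E_g

assert abs(sum(m(B) for B in gamma) - len(X)) < 1e-9
assert abs(H + E - log2(len(X))) < 1e-9        # (58) and (59) sum to log |X|
print(H, E, H_star, E_star, E_star_c, H_g, E_g, E_g_c)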


Isotonic Behavior of Global Entropies and Co–Entropies of Coverings. The following example illustrates the non–isotonicity of the co–entropy E described in equation (59) with respect to both quasi–orderings ⪯ and ⊑ of equations (49) and (50).

Example 17. In the universe X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, let us consider the two genuine coverings γ = {C1 = {1, 4, 5}, C2 = {2, 4, 5}, C3 = {3, 4, 5}, C4 = {14, 15}, C5 = {4, 5, . . . , 13}} and δ1 = {D1 = {1, 4, 5} = C1, D2 = {2, 4, 5} = C2, D3 = {3, 4, . . . , 13} = C3 ∪ C5, D4 = {4, 5, . . . , 14, 15} = C4 ∪ C5}. Trivially, γ ≺ δ1 and γ ⊑ δ1. In this case E(γ) = 2.05838 < 2.18897 = E(δ1), as desired.

In the same universe, let us now take the genuine covering δ2 = {F1 = {1, 4, 5, . . . , 12, 13} = C1 ∪ C5, F2 = {2, 4, 5, . . . , 12, 13} = C2 ∪ C5, F3 = {3, 4, . . . , 12, 13} = C3 ∪ C5, F4 = {4, 5, . . . , 14, 15} = C4 ∪ C5}. Trivially, γ ≺ δ2 and γ ⊑ δ2. Unfortunately, in this case we obtain E(γ) = 2.05838 > 1.91613 = E(δ2).

As for the behavior of the entropies H , H∗ and H(g) of equations (58), (60) and(63) and of the co–entropies E, E∗ and E(g) of equations (59), (61) and (64) withrespect to the quasi–orderings for coverings (49) and (50), the reader can find adeep analysis and various examples in [5,2]. We here only recall that the entropiesH , H∗ and H(g) for coverings behave neither isotonically nor anti–tonically withrespect to these two quasi–order relations, even in the more favorable context ofgenuine coverings; the same unfortunately happens for the co–entropies E, E∗

and E(g).Let us now observe what happens with the order relation ≤W described in

(52) on Γ (X), starting from two examples.

Example 18. In the universe X = {1, 2, 3, . . . , 23, 24, 25}, let us consider the twogenuine coverings γ1 = {{1}, {2}, {5}, {3, 4, 24, 25}, {6, 7, . . . , 12, 13, 15, 16, . . . ,22, 23}, {14}, {24, 25}} and δ1 = {{1, 6, 7, . . . , 22, 23}, {2, 3, 4, 24, 25}, {3, 4, 5,24, 25}, {6, 7, . . . , 12, 13, 15, 16, . . . , 22, 23}}. Trivially, γ1 ≤W δ1. From the re-sults illustrated in table 7, we can observe that in this case the co–entropy Ebehaves anti–tonically, whereas the other co–entropies behave isotonically.

Table 7. Entropies and co–entropies for γ1 and δ1, with γ1 ≤W δ1

     E        H        E∗       E∗c      H∗       E(g)     E(g)c    H(g)
γ1   2.96967  1.67419  2.94396  2.83293  1.81093  3.17947  2.80796  1.83589
δ1   2.84882  1.79504  3.76819  2.88848  1.75538  6.93346  3.03262  1.61123

Example 19. Let us now consider the universe X = {1, 2, 3, . . . , 28, 29, 30} and the genuine coverings γ2 = {{1, 4}, {1, 5}, {6}, {2, 3, 15, 16, . . . , 28, 29}, {15, 16, . . . , 29, 30}} and δ2 = {{1, 4, 6}, {1, 5}, {6, 7, . . . , 13, 14}, {2, 3, 15, 16, . . . , 29, 30}}. Trivially, γ2 ≤W δ2. Except for the co–entropies E∗c and E(g)c, in this example we observe the opposite behavior of the co–entropies: in fact,


we now have that the co–entropy E behaves isotonically, while the other co–entropies behave anti–tonically, as reported in Table 8.

Table 8. Entropies and co–entropies for γ2 and δ2, with γ2 ≤W δ2

     E        H        E∗       E∗c      H∗       E(g)     E(g)c    H(g)
γ2   2.76179  2.14510  3.51058  2.89391  2.01298  5.38290  2.76589  2.14100
δ2   3.47265  1.43424  3.44821  3.35511  1.55179  3.67810  3.35097  1.55592

Let us also observe that the entropies H∗ and H(g) behave anti–tonically in both situations.

Example 20. In the universe X = {1, 2, . . . , 25}, let us consider the covering γ3 = {{1, 2, 6, 7, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23}, {13, 14, 15, 24}, {12, 13, 24}, {3, 4, 5, 25}} and the induced partition π(γ3) = {{1, 2, 6, 7, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23}, {13, 24}, {14, 15}, {12}, {3, 4, 5, 25}}. Trivially we have π(γ3) ≤W γ3 (this happens in general, as shown by Wierman). The entropies are: H∗(γ3) = 1.61582 > 1.60386 = H∗(π(γ3)) and H(g)(γ3) = 1.62517 > 1.60386 = H(g)(π(γ3)). Thus, contrary to the previous examples, in this case these two entropies behave isotonically.
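
Using the covering_entropies sketch given after equation (65) (a helper of ours), the values of Example 20 can be reproduced directly:

X = set(range(1, 26))
gamma3 = [{1, 2, 6, 7, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23},
          {13, 14, 15, 24}, {12, 13, 24}, {3, 4, 5, 25}]
pi_gamma3 = [{1, 2, 6, 7, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23},
             {13, 24}, {14, 15}, {12}, {3, 4, 5, 25}]
print(covering_entropies(X, gamma3)['H*'])       # about 1.61582
print(covering_entropies(X, pi_gamma3)['H*'])    # about 1.60386
print(covering_entropies(X, gamma3)['H(g)'])     # about 1.62517
print(covering_entropies(X, pi_gamma3)['H(g)'])  # about 1.60386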

From these examples we can draw some observations. The entropy (58) and co–entropy (59), although very interesting for the way the measure m(Bi) describes each set of a covering, unfortunately behave neither isotonically nor anti–tonically with respect to the ordering (52) or the quasi–orderings (49) and (50) (see the examples in [5,2]). The other two entropies, (60) and (63), behave neither isotonically nor anti–tonically with respect to the quasi–orderings (49) and (50), as the examples in [5,2] show; as far as the ordering (52) is concerned, we found examples in which these entropies behave anti–tonically, and a simple counterexample in which they behave isotonically. In this counterexample we compared a covering with its induced partition: although the generated partition is clearly finer than the covering, the entropies H∗ and H(g) of the partition π(γ3), computed according to (60) and (63), are smaller than the entropies H∗ and H(g) of the covering γ3 itself.

4.4 Pointwise Approaches to the Entropy and Co–Entropy of Coverings

In the section dedicated to partitions we have described a pointwise approach to the entropy and co–entropy of partitions. We have anticipated that the aim was to better understand the Liang–Xu (LX) approach to entropy in the case of incomplete information systems [21]. In the following subsections we will describe and analyze the Liang–Xu approach and some variations of it, including an approach described in [20].


Pointwise Lower and Upper Entropy and Co–Entropy of Coverings. Making use of the lower granules γl(x) and the upper granules γu(x), for x ranging over the space X, of a given covering γ, it is possible to introduce two (pointwise defined) LX entropies (resp., co–entropies), named the lower and upper LX entropies (resp., co–entropies) respectively (LX since we generalize to the covering context the Liang–Xu approach to quantifying information in the case of incomplete information systems, see [21]), according to the following:

HLX(γj) := − ∑_{x∈X} (|γj(x)|/|X|) log(|γj(x)|/|X|)    for j = l, u    (66a)

ELX(γj) := (1/|X|) ∑_{x∈X} |γj(x)| log|γj(x)|    for j = l, u    (66b)

with the relationships (compare with the case of partitions (15)):

HLX(γj) + ELX(γj) = (∑_{x∈X} |γj(x)| / |X|) · log|X|,    j = l, u

Since for every point x ∈ X the set theoretic inclusion γl(x) ⊆ γu(x) holds, with 1 ≤ |γl(x)| ≤ |γu(x)| ≤ |X|, it is possible to introduce the rough co–entropy approximation of the covering γ as the ordered pair of non–negative numbers rE(γ) := (ELX(γl), ELX(γu)), with 0 ≤ ELX(γl) ≤ ELX(γu) ≤ |X| · log|X|. For any pair of coverings γ and δ of X with γ finer than δ, one has that ELX(δl) ≤ ELX(γl) ≤ (???) ≤ ELX(γu) ≤ ELX(δu), and so the pair rE(γ) is included in the pair rE(δ); this expresses a condition of isotonicity of the lower–upper pairs of co–entropies relative to the corresponding quasi–ordering on coverings [5,2].
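
To make the pointwise quantities (66a) and (66b) concrete, here is a small Python sketch of ours. It assumes one common reading of the granules, namely γl(x) as the intersection and γu(x) as the union of the blocks of γ containing x; this choice is our assumption, the precise definitions being those given earlier in the paper. Logarithms are again in base 2.

from math import log2

def granules(X, gamma, kind='u'):
    # Lower ('l') or upper ('u') granule of every point x of X, under the
    # assumption gamma_l(x) = intersection and gamma_u(x) = union of the
    # blocks of gamma containing x.
    out = {}
    for x in X:
        blocks = [B for B in gamma if x in B]
        out[x] = set.intersection(*blocks) if kind == 'l' else set.union(*blocks)
    return out

def pointwise_LX(X, granule_of):
    # LX entropy (66a) and co-entropy (66b) of a granule map x -> gamma_j(x).
    n = len(X)
    H = -sum(len(g) / n * log2(len(g) / n) for g in granule_of.values())
    E = sum(len(g) * log2(len(g)) for g in granule_of.values()) / n
    return H, E

# Rough co-entropy approximation rE(gamma) = (E_LX(gamma_l), E_LX(gamma_u)):
# rE = (pointwise_LX(X, granules(X, gamma, 'l'))[1],
#       pointwise_LX(X, granules(X, gamma, 'u'))[1])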

As a final remark, recalling that in the rough approximation space of coverings the partitions are the crisp elements, since πl = π = πu for any π ∈ Π(X), the pointwise entropies (66a) and co–entropies (66b) collapse into the pointwise entropy and co–entropy for partitions described in subsection 2.7.

Pointwise Entropy and Co–Entropy of Coverings in the Case of Incomplete Information Systems. We will here illustrate two pointwise entropies and corresponding co–entropies of coverings generated by a similarity relation (54).

We start from the entropy and co–entropy defined in analogy with (66) (see also [21]):

HLX(γ(A)) := − ∑_{x∈X} (|sA(x)|/|X|) log(|sA(x)|/|X|)    (67a)

ELX(γ(A)) := (1/|X|) ∑_{x∈X} |sA(x)| log|sA(x)|    (67b)

with the following relationship:

HLX(γ(A)) + ELX(γ(A)) = (∑_{x∈X} |sA(x)| / |X|) · log|X|
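
As an illustrative sketch (ours), the similarity classes sA(x) of an incomplete information table and the quantities (67a) and (67b) can be computed as follows. We assume here the usual tolerance reading of the similarity relation: two objects are similar on A when their values agree on every attribute of A up to missing values (denoted '*'); this reading is our assumption, not a restatement of (54).

from math import log2

def similarity_classes(X, table, attrs, missing='*'):
    # s_A(x) for every x in X; table maps objects to {attribute: value}.
    # Assumed tolerance: x ~ y iff, for every a in attrs, the values are equal
    # or at least one of them is the missing value.
    def sim(x, y):
        return all(table[x][a] == table[y][a]
                   or table[x][a] == missing or table[y][a] == missing
                   for a in attrs)
    return {x: {y for y in X if sim(x, y)} for x in X}

def HE_LX(X, s):
    # Pointwise entropy (67a) and co-entropy (67b) of the covering gamma(A) = {s_A(x)}.
    n = len(X)
    H = -sum(len(s[x]) / n * log2(len(s[x]) / n) for x in X)
    E = sum(len(s[x]) * log2(len(s[x])) for x in X) / n
    return H, E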


When applied to complete information systems, the entropy and co–entropy just introduced reduce to the pointwise entropy and co–entropy of partitions of equations (33) and (31), but not to the standard partition entropy and co–entropy expressed by equations (14) and (13).

Another pointwise entropy has been introduced in [20]; it is described by the following equation:

HLSLW(γ(A)) := − ∑_{x∈X} (1/|X|) log(|sA(x)|/|X|)    (68)

Since we are also interested in co–entropies as measures of granularity, we introduce here the co–entropy corresponding to (68):

ELSLW(γ(A)) := (1/|X|) ∑_{x∈X} log|sA(x)|    (69)

Moreover we obtain:

HLSLW(γ(A)) + ELSLW(γ(A)) = log|X|    (70)

In the complete case this entropy and the corresponding co–entropy reduce to the standard entropy and co–entropy of partitions of equations (14) and (13).
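
Continuing the sketch given after (67b), the quantities (68) and (69) can be computed as follows; the final (commented) check simply verifies relationship (70) numerically on whatever similarity map s is supplied.

from math import log2

def HE_LSLW(X, s):
    # Entropy (68) and co-entropy (69) of the covering gamma(A) = {s_A(x)}.
    n = len(X)
    H = -sum(log2(len(s[x]) / n) for x in X) / n
    E = sum(log2(len(s[x])) for x in X) / n
    return H, E

# Numerical check of relationship (70): H_LSLW + E_LSLW = log|X|.
# H, E = HE_LSLW(X, s); assert abs(H + E - log2(len(X))) < 1e-9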

Isotonic Behavior of Pointwise Entropy and Co–Entropy in the Case of Incomplete Information Systems. In the following propositions it will be shown that the co–entropy (67b) behaves anti–tonically with respect to set inclusion of subfamilies of attributes. Moreover, from (55) and (57) we obtain that this co–entropy is isotonic with respect to the quasi–ordering on coverings of equation (49) and to the order relation ≤s of equation (56). A further result is that, in this context of incomplete information systems, the co–entropy (67b) is also isotonic with respect to the quasi–ordering of equation (50) [2].

Proposition 9. [21] Let 〈X, Att, F〉 be an incomplete information system, let A, B ⊆ Att be two families of attributes, and let γ(A) and γ(B) be the corresponding coverings of X. Then

γ(A) <s γ(B) implies ELX(γ(A)) ≤ ELX(γ(B)).

As an immediate consequence we have:

Corollary 2. [21] In an incomplete information system 〈X, Att, F〉 let us consider two families of attributes A, B ⊆ Att, and let γ(A) and γ(B) be the induced coverings of X. The following holds:

B ⊆ A implies ELX(γ(A)) ≤ ELX(γ(B)).

From these two properties we obtain the following further result, related to the quasi–order relation of equation (50) on Γ(X).


Proposition 10. Given an incomplete information system 〈X, Att, F〉 and two similarity relations defined on the objects of X for the two subfamilies of attributes A and B (A, B ⊆ Att), let γ(A) and γ(B) be the induced coverings of X. For any A, B ⊆ Att the following holds:

B ⊆ A and γ(A) finer than γ(B) in the sense of (50) implies ELX(γ(A)) ≤ ELX(γ(B)).    (71)

Moreover, if ∃ sB(xk) = sA(xk1) ∪ sA(xk2) ∪ . . . ∪ sA(xkp) with p ≥ 2, the strict anti–tonicity ELX(γ(A)) < ELX(γ(B)) holds.

Proof. Let the two coverings be respectively γ(A) = {sA(x1), sA(x2), . . . , sA(xN)} and γ(B) = {sB(x1), sB(x2), . . . , sB(xN)}. Since γ(A) is finer than γ(B) in the sense of (50), for every sB(xi) ∈ γ(B) there exists {sA(xi1), sA(xi2), . . . , sA(xip)} ⊆ γ(A) such that sB(xi) = sA(xi1) ∪ sA(xi2) ∪ . . . ∪ sA(xip). Let us consider the simple case in which sB(xk) = sA(xk1) ∪ sA(xk2) ∪ . . . ∪ sA(xkp) with p ≥ 2, and sB(xj) = sA(xj) for any j ≠ k. Then we have that

ELX(γ(B)) = (1/|X|) ∑_{xi∈X} |sB(xi)| · log|sB(xi)|
          = (1/|X|) ( ∑_{xj∈X, j≠k} |sA(xj)| · log|sA(xj)| + |sA(xk1) ∪ . . . ∪ sA(xkp)| · log|sA(xk1) ∪ . . . ∪ sA(xkp)| )

Since |sA(xk1) ∪ . . . ∪ sA(xkp)| · log|sA(xk1) ∪ . . . ∪ sA(xkp)| > |sA(xk)| log|sA(xk)|, we obtain that ELX(γ(A)) < ELX(γ(B)). Hence in general, given two coverings γ(A) and γ(B) induced by two similarity relations based respectively on the subfamilies of attributes A and B, with B ⊆ A, we have ELX(γ(A)) ≤ ELX(γ(B)). In particular, when p ≥ 2 the strict anti–tonicity ELX(γ(A)) < ELX(γ(B)) holds.
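
A toy numerical check (the data below are ours and purely hypothetical) of Corollary 2 and Proposition 10, reusing the similarity_classes and HE_LX helpers sketched after (67b):

X = {1, 2, 3, 4}
table = {1: {'a': 0, 'b': 0}, 2: {'a': 0, 'b': '*'},
         3: {'a': 1, 'b': 1}, 4: {'a': '*', 'b': 1}}
sA = similarity_classes(X, table, {'a', 'b'})   # larger attribute set A = {a, b}
sB = similarity_classes(X, table, {'a'})        # B = {a} is a subset of A
E_A = HE_LX(X, sA)[1]                           # about 3.38
E_B = HE_LX(X, sB)[1]                           # about 4.88
assert E_A <= E_B                               # E_LX(gamma(A)) <= E_LX(gamma(B))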

Similarly, for the entropy (68) the following holds [20]:

Proposition 11. Given an incomplete information system 〈X, Att, F〉, two subfamilies of attributes A and B (A, B ⊆ Att), and the induced coverings of X, respectively γ(A) and γ(B), the following holds:

γ(A) <s γ(B) implies HLSLW(γ(B)) ≤ HLSLW(γ(A)).    (72)

Moreover, if ∃ k ∈ {1, 2, . . . , |X|} such that sB(xk) ⊂ sA(xk), the strict anti–tonicity HLSLW(γ(B)) < HLSLW(γ(A)) holds.

This result implies that:

Corollary 3. Let 〈X, Att, F〉 be an incomplete information system and let A, B ⊆ Att be two subfamilies of attributes. Then it follows that:

B ⊆ A implies HLSLW(γ(B)) ≤ HLSLW(γ(A)).


In analogy with Proposition 9, Corollary 2 and Proposition 10, and thanks to the relationship (70), one can easily prove the following three properties of the co–entropy (69):

γ(A) <s γ(B) implies ELSLW(γ(A)) ≤ ELSLW(γ(B));
B ⊆ A implies ELSLW(γ(A)) ≤ ELSLW(γ(B));
B ⊆ A and γ(A) finer than γ(B) in the sense of (50) implies ELSLW(γ(A)) ≤ ELSLW(γ(B)).
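
Continuing the toy example given above (again with our hypothetical data), the first two of these properties can be checked numerically: with B ⊆ A we have sA(x) ⊆ sB(x) for every x, and the co–entropy (69) and entropy (68) behave as stated.

H_A, E_A = HE_LSLW(X, sA)
H_B, E_B = HE_LSLW(X, sB)
assert E_A <= E_B      # E_LSLW(gamma(A)) <= E_LSLW(gamma(B))
assert H_B <= H_A      # H_LSLW(gamma(B)) <= H_LSLW(gamma(A))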

5 Conclusions

We have analyzed the approach to information entropy in the case of abstract discrete probability distributions. We have then moved to the entropy of partitions of a universe X. Next, we have illustrated a first approach extending information entropies to the case of incomplete information systems, through the here called “partial” partitions. Finally, we have extended the approach of the entropy of partitions to the covering case. Concerning the study of entropies (resp., co–entropies) for coverings, we have shown here the important result that allows one to compare two coverings that are in the ≤W order relation (52) through the entropy and co–entropy of the induced partitions. On the other hand, the global entropies and co–entropies for coverings behave neither isotonically nor anti–tonically with respect to the (quasi) orderings illustrated here. However, in the context of incomplete information systems one can use one of the two pointwise entropies and corresponding co–entropies, which all behave well with respect to all the (quasi) order relations on coverings.

References

1. Ash, R.B.: Information theory. Dover Publications, New York (1990) (originally published by John Wiley & Sons, New York (1965))
2. Bianucci, D., Cattaneo, G.: Monotonic behavior of entropies and co-entropies for coverings with respect to different quasi-orderings. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 584–593. Springer, Heidelberg (2007)
3. Bianucci, D., Cattaneo, G.: On non-pointwise entropies of coverings: Relationship with anti-monotonicity. In: Wang, G., Li, T., Grzymala-Busse, J.W., Miao, D., Skowron, A., Yao, Y. (eds.) RSKT 2008. LNCS (LNAI), vol. 5009, pp. 387–394. Springer, Heidelberg (2008)
4. Bianucci, D., Cattaneo, G., Ciucci, D.: Entropies and co–entropies for incomplete information systems. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Slezak, D. (eds.) RSKT 2007. LNCS (LNAI), vol. 4481, pp. 84–92. Springer, Heidelberg (2007)
5. Bianucci, D., Cattaneo, G., Ciucci, D.: Entropies and co–entropies of coverings with application to incomplete information systems. Fundamenta Informaticae 75, 77–105 (2007)
6. Bianucci, D., Cattaneo, G., Ciucci, D.: Information entropy and co–entropy of crisp and fuzzy granulations. In: Masulli, F., Mitra, S., Pasi, G. (eds.) WILF 2007. LNCS (LNAI), vol. 4578, pp. 9–19. Springer, Heidelberg (2007)


7. Bianucci, D.: Rough entropies for complete and incomplete information systems. Ph.D. thesis (2007)
8. Birkhoff, G.: Lattice theory, 3rd edn. American Mathematical Society Colloquium Publication, vol. XXV. American Mathematical Society, Providence (1967)
9. Beaubouef, T., Petry, F.E., Arora, G.: Information–theoretic measures of uncertainty for rough sets and rough relational databases. Journal of Information Sciences 109, 185–195 (1998)
10. Cattaneo, G., Ciucci, D.: Algebraic structures for rough sets. In: Dubois, D., Gryzmala-Busse, J.W., Inuiguchi, M., Polkowski, L. (eds.) Transactions on Rough Sets II. LNCS, vol. 3135, pp. 208–252. Springer, Heidelberg (2004)
11. Cattaneo, G., Ciucci, D.: Investigation about time monotonicity of similarity and preclusive rough approximations in incomplete information systems. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 38–48. Springer, Heidelberg (2004)
12. Cattaneo, G., Ciucci, D., Bianucci, D.: Entropy and co-entropy of partitions and coverings with applications to roughness theory. In: Bello, R., Falcon, R., Pedrycz, W., Kacprzyk, J. (eds.) Granular Computing: at the Junction of Fuzzy Sets and Rough Sets. Studies in Fuzziness and Soft Computing, vol. 224, pp. 55–77. Springer, Heidelberg (2008)
13. Hamming, R.W.: Coding and information theory. Prentice–Hall, New Jersey (1986) (second edition of the 1980 first edition)
14. Hartley, R.V.L.: Transmission of information. The Bell System Technical Journal 7, 535–563 (1928)
15. Huang, B., He, X., Zhong, X.Z.: Rough entropy based on generalized rough sets covering reduction. Journal of Software 15, 215–220 (2004)
16. Khinchin, A.I.: Mathematical foundations of information theory. Dover Publications, New York (1957); translation of two papers that appeared in Russian in Uspekhi Matematicheskikh Nauk 3, 3–20 (1953); 1, 17–75 (1965)
17. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: Pal, S., Skowron, A. (eds.) Rough Fuzzy Hybridization, pp. 3–98. Springer, Singapore (1999)
18. Kryszkiewicz, M.: Rough set approach to incomplete information systems. Information Sciences 112, 39–49 (1998)
19. Liang, J., Shi, Z.: The information entropy, rough entropy and knowledge granulation in rough set theory. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 12, 37–46 (2004)
20. Liang, J., Shi, Z., Li, D., Wierman, M.J.: Information entropy, rough entropy and knowledge granulation in incomplete information systems. International Journal of General Systems 35(6), 641–654 (2006)
21. Liang, J., Xu, Z.: Uncertainty measure of randomness of knowledge and rough sets in incomplete information systems. In: Proc. of the 3rd World Congress on Intelligent Control and Automata, vol. 4, pp. 2526–2529 (2000)
22. Pawlak, Z.: Information systems - theoretical foundations. Information Systems 6, 205–218 (1981)
23. Pawlak, Z.: Rough sets. Int. J. Inform. Comput. Sci. 11, 341–356 (1982)
24. Pawlak, Z.: Rough sets: Theoretical aspects of reasoning about data. Kluwer Academic Publishers, Dordrecht (1991)
25. Reza, F.M.: An introduction to information theory. Dover Publications, New York (1994) (originally published by Mc Graw-Hill, New York (1961))
26. Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27, 379–423, 623–656 (1948)


27. Taylor, A.E.: General theory of functions and integration. Dover Publications, New York (1985)
28. Wierman, M.J.: Measuring uncertainty in rough set theory. International Journal of General Systems 28, 283–297 (1999)
29. Zhao, Y., Luo, F., Wong, S.K.M., Yao, Y.Y.: A general definition of an attribute reduct. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Slezak, D. (eds.) RSKT 2007. LNCS, vol. 4481, pp. 101–108. Springer, Heidelberg (2007)