Maximum entropy principle, Copy lemma and information inequalities

The Maximum entropy principle (MEP or MAXE) is a principle from statistical mechanics. It is applied in situations where only partial information about a complex system (i.e., a joint distribution of some random variables) is available, for instance when some of its marginals can be measured. Among all distributions compatible with these constraints, MAXE suggests selecting the one with maximal (Gibbs) entropy. This amounts to distributing probability mass as evenly as the constraints allow.

This article is about its application to information theory, where some marginal entropies of a system of random variables are prescribed. Crucially, it turns out that the maximum-entropy distribution under such constraints satisfies certain “special position” relations, namely conditional independences. MAXE is thus a tool for tweaking a given distribution into satisfying more independence relations at the expense of, naturally, changing some of its marginal entropies. This creates a trade-off between imposing independence relations and preserving marginal entropies which can sometimes be exploited. An example is given at the end.

This result is not new. In fact, it is a widely used technique and usually appears under the name Copy lemma, although the formulation I want to present here appears to be slightly more general. All of this is based on conversations I had at WUPES ’25 with Laszlo Csirmaz and Milan Studený, as well as during a visit to Andrei Romashchenko.

Basics of the almost-entropic region

We will work in the almost-entropic region. To get there, fix a finite ground set $N$ and consider discrete random vectors $\xi = (\xi_i : i \in N)$, where the $\xi_i$ can have any finite number of states. This vector determines a set function $h_\xi\colon 2^N \to \mathbb{R}$ which sends $I$ to the Shannon entropy of the marginal random vector $\xi_I$. This function $h_\xi$ is the entropy profile of $\xi$, viewed as a point in $\mathbb{R}^{2^N}$. The collection of all entropy profiles of $N$-variate discrete distributions is the entropy region $H_N^*$ and its closure in the Euclidean topology is a convex cone $\overline{H_N^*}$ called the almost-entropic region.

The elements of the dual cone of $\overline{H_N^*}$ are known as information inequalities. They are essential in proving bounds for optimization problems in information theory and thus certify the optimality of a given communication protocol.

The most fundamental information inequalities come from the non-negativity of conditional mutual information. This just means that entropy profiles are monotone with respect to set inclusion and submodular over the boolean lattice. Additionally $h(\emptyset) = 0$. Non-negative linear combinations of these basic inequalities are called Shannon-type inequalities. We will use the following abbreviations, where juxtaposition of sets denotes union:

$$h(I\mid K) = h(IK) - h(K), \qquad h(I:J\mid K) = h(IK) + h(JK) - h(IJK) - h(K).$$

The basic inequalities are the non-negativities of all possible $h(I\mid K)$ and $h(I:J\mid K)$. We will also treat the symbols $(I\mid K)$ and $(I:J\mid K)$ as functionals on the space $\mathbb{R}^{2^N}$.
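
To make the notation concrete, here is a minimal Python sketch (the helper names entropy_profile, cond_entropy and cond_mutual_info are mine, not standard API) which computes the entropy profile of a joint distribution, given as a numpy array with one axis per variable, and evaluates the functionals $(I\mid K)$ and $(I:J\mid K)$ on it.

```python
import itertools
import numpy as np

def entropy_profile(p):
    """Entropy profile of a joint distribution p (one numpy axis per variable):
    a dict sending each subset I of axes to the Shannon entropy of the marginal."""
    n = p.ndim
    h = {}
    for r in range(n + 1):
        for I in itertools.combinations(range(n), r):
            q = np.asarray(p.sum(axis=tuple(i for i in range(n) if i not in I))).ravel()
            q = q[q > 0]
            h[frozenset(I)] = float(-(q * np.log2(q)).sum())
    return h

def cond_entropy(h, I, K):
    """The functional (I | K) evaluated at h:  h(IK) - h(K)."""
    I, K = frozenset(I), frozenset(K)
    return h[I | K] - h[K]

def cond_mutual_info(h, I, J, K):
    """The functional (I : J | K) evaluated at h:  h(IK) + h(JK) - h(IJK) - h(K)."""
    I, J, K = frozenset(I), frozenset(J), frozenset(K)
    return h[I | K] + h[J | K] - h[I | J | K] - h[K]

# Example: two fair coins (axes 0, 1) and their XOR (axis 2).
p = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p[x, y, x ^ y] = 0.25
h = entropy_profile(p)
print(cond_mutual_info(h, {0}, {1}, set()))  # 0.0: the coins are independent ...
print(cond_mutual_info(h, {0}, {1}, {2}))    # 1.0: ... but not given their XOR
print(cond_entropy(h, {2}, {0, 1}))          # 0.0: the XOR is a function of the coins
```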

Maximum entropy principle

We will need a basic and well-known construction:

Conditional product lemma. Let $N$ be a finite set, $h \in H_N^*$ and pairwise disjoint sets $X, Y, W \subseteq N$. There exists $h' \in H_N^*$ such that $h'(I) = h(I)$ for all $I \subseteq XW$, all $I \subseteq YW$ and all $I \subseteq N \setminus XYW$, and also $h'(X\mid W) + h'(Y\mid W) = h'(XY\mid W)$.

Proof. Let $\xi$ be a system of random variables with $h = h_\xi$ and let $p$ denote its probability density function. Denote the state space of any marginal $\xi_I$ by $Q_I$ and set $Z = N \setminus XYW$. We define a new density $p'$ by

$$p'(x,y,z,w) = \begin{cases} p(x\mid w)\, p(y,z\mid w)\, p(w), & \text{if } p(w) > 0, \\ 0, & \text{otherwise}, \end{cases}$$

where $(x,y,z,w) \in Q_X \times Q_Y \times Q_Z \times Q_W$. Let $\xi'$ be the random vector defined by $p'$ and $h' = h_{\xi'}$. It is easy to see that the marginal distributions $\xi_{XW}$ and $\xi_{YZW}$ coincide with the respective marginals of $\xi'$, so in particular $h'$ takes the same values as $h$ on all of their subsets. Moreover, each conditional density $p'_{XYZ\mid W=w}$, for $p'(w) > 0$, factors into $p'_{X\mid W=w}$ and $p'_{YZ\mid W=w}$, which proves the slightly stronger conditional independence $h'(X:YZ\mid W) = 0$. Using the semigraphoid axioms (which follow from the submodular inequalities) yields $h'(X:Y\mid W) = 0$, which is equivalent to the required equation on conditional entropies. $\blacksquare$

First note that whenever $XYW$ is a strict subset of $N$, we may add the difference to either $X$ or $Y$ and obtain a stronger result. Secondly, the result also holds for almost-entropic points by continuity.
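
For entropic points the construction in the proof is easy to carry out numerically. The following sketch (the helper names H and conditional_product are mine) builds $p'$ for given disjoint sets of axes $X$, $Y$, $W$ and checks that the relevant marginal entropies are preserved while the conditional entropies become additive.

```python
import numpy as np

def H(p, I):
    """Shannon entropy (in bits) of the marginal of p on the axes in I."""
    q = np.asarray(p.sum(axis=tuple(i for i in range(p.ndim) if i not in I))).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def conditional_product(p, X, Y, W):
    """Conditional product of the disjoint axis sets X and Y over W:
    p'(x, y, z, w) = p(x | w) * p(y, z | w) * p(w), where z collects the
    remaining axes.  Returns an array of the same shape as p."""
    n = p.ndim
    def marg(keep):  # marginal on `keep`, with broadcastable (keepdims) shape
        return p.sum(axis=tuple(i for i in range(n) if i not in keep), keepdims=True)
    Z = set(range(n)) - X - Y - W
    p_w, p_xw, p_yzw = marg(W), marg(X | W), marg(Y | Z | W)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p_w > 0, p_xw * p_yzw / p_w, 0.0)

# Example: axes 0 and 1 are one shared fair coin, axis 2 an independent fair coin.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[0, 0, 1] = p[1, 1, 0] = p[1, 1, 1] = 0.25
X, Y, W = {0}, {1}, {2}
q = conditional_product(p, X, Y, W)

print(np.isclose(H(p, X | W), H(q, X | W)), np.isclose(H(p, Y | W), H(q, Y | W)))
# For q the conditional entropies add up: h'(X|W) + h'(Y|W) = h'(XY|W) ...
print((H(q, X | W) - H(q, W)) + (H(q, Y | W) - H(q, W)), H(q, X | Y | W) - H(q, W))
# ... while for p itself the sum on the left is strictly larger.
print((H(p, X | W) - H(p, W)) + (H(p, Y | W) - H(p, W)), H(p, X | Y | W) - H(p, W))
```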

With $N$ a fixed finite set, consider a simplicial complex $\Delta$ over $N$. Pairwise disjoint $X_1, \dots, X_k \subseteq N$ are dependent in $\Delta$ if there exists $F \in \Delta$ which intersects two distinct $X_i$’s; otherwise they are independent.

Maximum entropy principle. Let $h \in \overline{H_N^*}$ and $\Delta$ be given with $\bigcup \Delta = N$. There exists $h' \in \overline{H_N^*}$ such that $h'(I) = h(I)$ for all $I \in \Delta$ and $h'(X_1\cdots X_k\mid W) = \sum_{i=1}^{k} h'(X_i\mid W)$ for all $X_1, \dots, X_k \subseteq N$ which are independent in $\Delta$, where $W = N \setminus \bigcup_{i=1}^k X_i$.

Proof. Let $h'$ be a maximizer of $h'(N)$ subject to the constraints $h'(I) = h(I)$ for $I \in \Delta$. This is a linear optimization problem over the closed convex cone $\overline{H_N^*}$. By submodularity, we have the upper bound $h'(N) \le \sum_{I \in \max \Delta} h(I)$, where $\max \Delta$ denotes the set of facets of $\Delta$, since $N = \bigcup \max \Delta$. Hence the feasible region is compact and such a maximizer exists.

Take any collection of sets $X_1, \dots, X_k, W$ as in the claim. If the claim is violated, then the basic inequalities imply $h'(X_1\cdots X_k\mid W) < \sum_{i=1}^{k} h'(X_i\mid W)$. By grouping the $X_i$’s into two parts, we may assume that $k=2$ and call $X_1 = X$ and $X_2 = Y$. Thus $h'(XY\mid W) < h'(X\mid W) + h'(Y\mid W)$. The conditional product of $X$ and $Y$ over $W$ yields $h''$ with

$$\begin{aligned} h''(N) = h''(XYW) &= h''(XW) + h''(YW) - h''(W) \\ &= h'(XW) + h'(YW) - h'(W) > h'(XYW) = h'(N). \end{aligned}$$

At the same time, $h''$ preserves the value of $h'$ on all subsets of $XW$ and of $YW$. Consider any $I \in \Delta$. If it were not a subset of either of the two, then it would have to contain an element $x \in X$ as well as an element $y \in Y$. But this directly contradicts the independence of $X$ and $Y$ in $\Delta$, which follows from the independence of the initial $k$-tuple $X_1, \dots, X_k$. Hence $h''$ is feasible in the optimization problem we set up and contradicts the maximality of $h'$. $\blacksquare$
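
A classical special case may help to see what the statement buys. Take $N = \{a,b,c\}$ and let $\Delta$ be the complex generated by the facets $ab$ and $bc$. The singletons $a$ and $c$ are independent in $\Delta$ with $W = b$, so MAXE yields $h'$ which agrees with $h$ on $ab$ and $bc$ and satisfies

$$h'(ac\mid b) = h'(a\mid b) + h'(c\mid b), \qquad\text{i.e.}\qquad (a:c\mid b)(h') = 0.$$

This parallels the familiar fact that, among all joint distributions with prescribed (consistent) marginals $p_{ab}$ and $p_{bc}$, the one of maximal entropy is the conditional product $p(a\mid b)\,p(c\mid b)\,p(b)$, i.e., the Markov chain $a - b - c$.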

This proof is non-constructive, but it yields multiple instances of the conditional product independence statement simultaneously. An important (sort of) special case is the Copy lemma, which is really just a reformulation of the conditional product:

Copy lemma. Let $(\xi, \omega)$ be a pair of random variables. This system can be extended to a triple $(\xi, \xi', \omega)$ such that $(\xi', \omega)$ has the same distribution as $(\xi, \omega)$ and $\xi$ is conditionally independent of $\xi'$ given $\omega$.

The picture to have in mind is that a copy of the variables $(\xi, \omega)$ is created and then this copy is amalgamated with the original over $\omega$ in such a way that $\xi$ and $\xi'$ are “orthogonal” (i.e., independent) given the $\omega$ they share.

Figure: Amalgamation in the Copy lemma.

It is worth noting that the conditional product does not increase the state space of the random variables. We have no such control in the maximum entropy principle.
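
As a small sanity check, the copy construction is essentially one line of numpy. In the sketch below (the helper name make_copy is mine) both $\xi$ and $\omega$ are single discrete variables given by a joint array p[x, w]; the same code applies when they are vectors, after flattening their state spaces.

```python
import numpy as np

def make_copy(p):
    """Copy lemma construction for a pair (xi, omega) given by a joint array
    p[x, w]:  returns q[x, x', w] = p(x | w) * p(x' | w) * p(w)."""
    p_w = p.sum(axis=0)                                 # law of omega
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = np.where(p_w > 0, p / p_w, 0.0)          # p(x | w)
    return cond[:, None, :] * cond[None, :, :] * p_w

# Example: xi and omega are two correlated bits.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
q = make_copy(p)

# (xi', omega) has the same distribution as (xi, omega):
print(np.allclose(q.sum(axis=0), p), np.allclose(q.sum(axis=1), p))
# xi and xi' are independent given omega: each slice over w factors.
for w in range(p.shape[1]):
    s = q[:, :, w]
    print(np.allclose(s * s.sum(), np.outer(s.sum(axis=1), s.sum(axis=0))))
```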

Application: strengthening information inequalities

I learned the following example from Laszlo Csirmaz. Let $[abcd] = (a:b\mid c) + (a:b\mid d) + (c:d) - (a:b)$ denote the Ingleton functional. It may be checked by linear programming that the following is a valid Shannon-type inequality for five random variables:

$$[abcd] + (z:b\mid c) + (z:c\mid b) + (b:c\mid z) \ge -3\,(z:ad\mid bc).$$

This can be checked with oXitip via the one-line query "I(C;D) <= I(C;D|A)+I(C;D|B)+I(A;B)+I(B;Z|C)+I(C;Z|B)+I(B;C|Z) + 3 I(Z;A,D|B,C)", where the query’s variables A, B, C, D, Z correspond to $d, c, b, a, z$, respectively.
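
For readers who prefer a script, below is a sketch of the linear-programming check in Python with scipy (the subset encoding and helper names are mine). By LP duality, the inequality is Shannon-type exactly when the corresponding functional is a non-negative combination of the elemental inequalities, which is a linear feasibility problem.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

VARS = "abcdz"
N = frozenset(VARS)
DIM = 2 ** len(VARS) - 1          # coordinates: nonempty subsets of {a,b,c,d,z}

def idx(S):
    """Coordinate of the nonempty subset S (an iterable of variable names)."""
    return sum(1 << VARS.index(v) for v in S) - 1

def add_cmi(vec, I, J, K, coeff=1.0):
    """Add coeff * (I : J | K) = coeff * (h(IK) + h(JK) - h(IJK) - h(K)) to vec."""
    for S, sign in ((set(I) | set(K), +1), (set(J) | set(K), +1),
                    (set(I) | set(J) | set(K), -1), (set(K), -1)):
        if S:                      # h of the empty set is 0 and contributes nothing
            vec[idx(S)] += coeff * sign

# Elemental Shannon inequalities: h(N) - h(N \ i) >= 0 and (i : j | K) >= 0.
elemental = []
for i in VARS:
    row = np.zeros(DIM)
    row[idx(N)] += 1
    row[idx(N - {i})] -= 1
    elemental.append(row)
for i, j in itertools.combinations(VARS, 2):
    rest = [v for v in VARS if v not in (i, j)]
    for r in range(len(rest) + 1):
        for K in itertools.combinations(rest, r):
            row = np.zeros(DIM)
            add_cmi(row, i, j, K)
            elemental.append(row)
E = np.array(elemental)

# Target: [abcd] + (z:b|c) + (z:c|b) + (b:c|z) + 3 (z:ad|bc) >= 0.
target = np.zeros(DIM)
for I, J, K, coeff in [("a", "b", "c", 1), ("a", "b", "d", 1), ("c", "d", "", 1),
                       ("a", "b", "", -1),
                       ("z", "b", "c", 1), ("z", "c", "b", 1), ("b", "c", "z", 1),
                       ("z", "ad", "bc", 3)]:
    add_cmi(target, I, J, K, coeff)

# Shannon-type iff the target is a non-negative combination of the elemental rows.
res = linprog(np.zeros(len(E)), A_eq=E.T, b_eq=target, bounds=(0, None))
print("Shannon-type:", res.success)
```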

It appears possible that the left-hand side, let’s call it $\psi$, could become negative, but this actually cannot happen due to MAXE! Suppose that there is an almost-entropic point $h$ such that $\psi(h) < 0$. Let $\Delta$ be the simplicial complex generated by the marginals appearing in $\psi$ and notice that $z$ is independent of $ad$ in $\Delta$, i.e., neither $az$ nor $dz$ appears in any entropy evaluation in $\psi$. We can apply MAXE to obtain an almost-entropic point $h'$ with $(z:ad\mid bc) = 0$ and $\psi(h') = \psi(h)$. But this point would contradict the Shannon-type inequality above, which is absurd.

This reasoning proves the stronger inequality

$$[abcd] + (z:b\mid c) + (z:c\mid b) + (b:c\mid z) \ge 0,$$

which simplifies to the well-known Zhang–Yeung inequality for $z=a$. This is very close to how Zhang and Yeung originally proved their inequality, which was the first example of a non-Shannon information inequality.
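
Concretely, at $z=a$ the three extra terms become $(a:b\mid c)$, $(a:c\mid b)$ and $(b:c\mid a)$, so the strengthened inequality reads

$$2(a:b\mid c) + (a:b\mid d) + (c:d) + (a:c\mid b) + (b:c\mid a) \ge (a:b).$$

Using the identity $(a:b) - (a:b\mid c) = (b:c) - (b:c\mid a)$ and the chain rule $(c:ab) = (b:c) + (a:c\mid b)$, this rearranges to

$$2(a:b) \le (c:d) + (c:ab) + 3(a:b\mid c) + (a:b\mid d),$$

which is the Zhang–Yeung inequality $2I(C;D) \le I(A;B) + I(A;CD) + 3I(C;D\mid A) + I(C;D\mid B)$ with $(A,B,C,D)$ relabeled as $(c,d,a,b)$.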

For the purposes of proving unconditional information inequalities, the distinction between entropic and almost-entropic points is immaterial. Thus, MAXE appears to be more general than the Copy lemma. As far as I am aware, there is no known instance to date of a valid information inequality which can be proved with MAXE but not with iterated applications of the Copy lemma. In fact, even independently of MAXE, I do not know of any valid information inequality which cannot be proved by iterated use of the Copy lemma.