Maximum entropy principle, Copy lemma and information inequalities

The Maximum entropy principle (MEP or MAXE) is a principle from statistical mechanics. It is applied in situations where only partial information about a complex system (i.e., a joint distribution of some random variables) is available, for instance when some of its marginals can be measured. Among all distributions compatible with these constraints, MAXE suggests selecting the one with maximal (Gibbs) entropy. This amounts to distributing probability mass as evenly as the constraints allow.

This article is about its application to information theory, where some marginal entropies of a system of random variables are prescribed. Crucially, it turns out that the maximum-entropy distribution under such constraints satisfies certain “special position” relations, namely conditional independences. MAXE is thus a tool for tweaking a given distribution into satisfying more independence relations at the expense of, naturally, changing some of its marginal entropies. This creates a trade-off between imposing independence relations and preserving marginal entropies which can sometimes be exploited. An example is given at the end.

This result is not new. In fact, it is a widely used technique and usually appears under the name Copy lemma, although the formulation I want to present here appears to be slightly more general. All of this is based on conversations I had at WUPES ’25 with Laszlo Csirmaz and Milan Studený, as well as during a visit to Andrei Romashchenko.

Basics of the almost-entropic region

We will work in the almost-entropic region. To get there, fix a finite ground set $N$ and consider discrete random vectors $\xi = (\xi_i : i \in N)$, where the $\xi_i$ can have any finite number of states. This vector determines a set function $h_\xi\colon 2^N \to \mathbb{R}$ which sends $I$ to the Shannon entropy of the marginal random vector $\xi_I$. This function $h_\xi$ is the entropy profile of $\xi$, viewed as a point in $\mathbb{R}^{2^N}$. The collection of all entropy profiles of $N$-variate discrete distributions is the entropy region $H_N^*$ and its closure in the Euclidean topology is a convex cone $\overline{H_N^*}$ called the almost-entropic region.

The elements of the dual cone of $\overline{H_N^*}$ are known as information inequalities. They are essential in proving bounds for optimization problems in information theory and thus certify the optimality of a given communication protocol.

The most fundamental information inequalities come from the non-negativity of conditional mutual information. This just means that entropy profiles are monotone with respect to set inclusion and submodular over the boolean lattice. Additionally $h(\emptyset) = 0$. Non-negative linear combinations of these basic inequalities are called Shannon-type inequalities. We will use the following abbreviations, where juxtaposition of sets denotes union:

$$h(I\mid K) = h(IK) - h(K), \qquad h(I:J\mid K) = h(IK) + h(JK) - h(IJK) - h(K).$$

The basic inequalities are the non-negativities of all possible $h(I\mid K)$ and $h(I:J\mid K)$. We will also treat the symbols $(I\mid K)$ and $(I:J\mid K)$ as functionals on the space $\mathbb{R}^{2^N}$.
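
To make the notation concrete, here is a minimal Python sketch (the helper names entropy_profile, cond_entropy and cond_mutual_info are mine, not standard API) which computes the entropy profile of a joint distribution, given as a numpy array with one axis per variable, and evaluates the functionals $(I\mid K)$ and $(I:J\mid K)$ on it.

```python
import itertools
import numpy as np

def entropy_profile(p):
    """Entropy profile of a joint distribution p (one numpy axis per variable):
    a dict sending each subset I of axes to the Shannon entropy of the marginal."""
    n = p.ndim
    h = {}
    for r in range(n + 1):
        for I in itertools.combinations(range(n), r):
            q = np.asarray(p.sum(axis=tuple(i for i in range(n) if i not in I))).ravel()
            q = q[q > 0]
            h[frozenset(I)] = float(-(q * np.log2(q)).sum())
    return h

def cond_entropy(h, I, K):
    """The functional (I | K) evaluated at h:  h(IK) - h(K)."""
    I, K = frozenset(I), frozenset(K)
    return h[I | K] - h[K]

def cond_mutual_info(h, I, J, K):
    """The functional (I : J | K) evaluated at h:  h(IK) + h(JK) - h(IJK) - h(K)."""
    I, J, K = frozenset(I), frozenset(J), frozenset(K)
    return h[I | K] + h[J | K] - h[I | J | K] - h[K]

# Example: two fair coins (axes 0, 1) and their XOR (axis 2).
p = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p[x, y, x ^ y] = 0.25
h = entropy_profile(p)
print(cond_mutual_info(h, {0}, {1}, set()))  # 0.0: the coins are independent ...
print(cond_mutual_info(h, {0}, {1}, {2}))    # 1.0: ... but not given their XOR
print(cond_entropy(h, {2}, {0, 1}))          # 0.0: the XOR is a function of the coins
```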

Maximum entropy principle

We will need a basic and well-known construction:

Conditional product lemma. Let $N$ be a finite set, $h \in H_N^*$ and pairwise disjoint sets $X, Y, W \subseteq N$. There exists $h' \in H_N^*$ such that $h'(I) = h(I)$ for all $I \subseteq XW$, all $I \subseteq YW$ and all $I \subseteq N \setminus XYW$, and also $h'(X\mid W) + h'(Y\mid W) = h'(XY\mid W)$.

Proof. Let $\xi$ be a system of random variables with $h = h_\xi$ and let $p$ denote its probability density function. Denote the state space of any marginal $\xi_I$ by $Q_I$ and set $Z = N \setminus XYW$. We define a new density $p'$ by

$$p'(x,y,z,w) = \begin{cases} p(x\mid w)\, p(y,z\mid w)\, p(w), & \text{if } p(w) > 0, \\ 0, & \text{otherwise}, \end{cases}$$

where $(x,y,z,w) \in Q_X \times Q_Y \times Q_Z \times Q_W$. Let $\xi'$ be the random vector defined by $p'$ and $h' = h_{\xi'}$. It is easy to see that the marginal distributions $\xi_{XW}$ and $\xi_{YZW}$ coincide with the respective marginals of $\xi'$, so in particular $h'$ takes the same values as $h$ on all of their subsets. Moreover, each conditional density $p'_{XYZ\mid W=w}$, for $p'(w) > 0$, factors into $p'_{X\mid W=w}$ and $p'_{YZ\mid W=w}$, which proves the slightly stronger conditional independence $h'(X:YZ\mid W) = 0$. Using the semigraphoid axioms (which follow from the submodular inequalities) yields $h'(X:Y\mid W) = 0$, which is equivalent to the required equation on conditional entropies. $\blacksquare$

First note that whenever $XYW$ is a strict subset of $N$, we may add the difference to either $X$ or $Y$ and obtain a stronger result. Secondly, the result also holds for almost-entropic points by continuity.
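
For entropic points the construction in the proof is easy to carry out numerically. The following sketch (the helper names H and conditional_product are mine) builds $p'$ for given disjoint sets of axes $X$, $Y$, $W$ and checks that the relevant marginal entropies are preserved while the conditional entropies become additive.

```python
import numpy as np

def H(p, I):
    """Shannon entropy (in bits) of the marginal of p on the axes in I."""
    q = np.asarray(p.sum(axis=tuple(i for i in range(p.ndim) if i not in I))).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def conditional_product(p, X, Y, W):
    """Conditional product of the disjoint axis sets X and Y over W:
    p'(x, y, z, w) = p(x | w) * p(y, z | w) * p(w), where z collects the
    remaining axes.  Returns an array of the same shape as p."""
    n = p.ndim
    def marg(keep):  # marginal on `keep`, with broadcastable (keepdims) shape
        return p.sum(axis=tuple(i for i in range(n) if i not in keep), keepdims=True)
    Z = set(range(n)) - X - Y - W
    p_w, p_xw, p_yzw = marg(W), marg(X | W), marg(Y | Z | W)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p_w > 0, p_xw * p_yzw / p_w, 0.0)

# Example: axes 0 and 1 are one shared fair coin, axis 2 an independent fair coin.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[0, 0, 1] = p[1, 1, 0] = p[1, 1, 1] = 0.25
X, Y, W = {0}, {1}, {2}
q = conditional_product(p, X, Y, W)

print(np.isclose(H(p, X | W), H(q, X | W)), np.isclose(H(p, Y | W), H(q, Y | W)))
# For q the conditional entropies add up: h'(X|W) + h'(Y|W) = h'(XY|W) ...
print((H(q, X | W) - H(q, W)) + (H(q, Y | W) - H(q, W)), H(q, X | Y | W) - H(q, W))
# ... while for p itself the sum on the left is strictly larger.
print((H(p, X | W) - H(p, W)) + (H(p, Y | W) - H(p, W)), H(p, X | Y | W) - H(p, W))
```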

With $N$ a fixed finite set, consider a simplicial complex $\Delta$ over $N$. Pairwise disjoint $X_1, \dots, X_k \subseteq N$ are dependent in $\Delta$ if there exists $F \in \Delta$ which intersects two distinct $X_i$’s; otherwise they are independent.

Maximum entropy principle. Let $h \in \overline{H_N^*}$ and $\Delta$ be given with $\bigcup \Delta = N$. There exists $h' \in \overline{H_N^*}$ such that $h'(I) = h(I)$ for all $I \in \Delta$ and $h'(X_1\cdots X_k\mid W) = \sum_{i=1}^{k} h'(X_i\mid W)$ for all $X_1, \dots, X_k \subseteq N$ which are independent in $\Delta$, where $W = N \setminus \bigcup_{i=1}^k X_i$.

Proof. Let $h'$ be a maximizer of $h'(N)$ subject to the constraints $h'(I) = h(I)$ for $I \in \Delta$. This is a linear optimization problem over the closed convex cone $\overline{H_N^*}$. By submodularity, we have the upper bound $h'(N) \le \sum_{I \in \max \Delta} h(I)$, where $\max \Delta$ denotes the set of facets of $\Delta$, since $N = \bigcup \max \Delta$. Hence the feasible region is compact and such a maximizer exists.

Take any collection of sets $X_1, \dots, X_k, W$ as in the claim. If the claim is violated, then the basic inequalities imply $h'(X_1\cdots X_k\mid W) < \sum_{i=1}^{k} h'(X_i\mid W)$. By grouping the $X_i$’s into two parts, we may assume that $k=2$ and call $X_1 = X$ and $X_2 = Y$. Thus $h'(XY\mid W) < h'(X\mid W) + h'(Y\mid W)$. The conditional product of $X$ and $Y$ over $W$ yields $h''$ with

$$\begin{aligned} h''(N) = h''(XYW) &= h''(XW) + h''(YW) - h''(W) \\ &= h'(XW) + h'(YW) - h'(W) > h'(XYW) = h'(N). \end{aligned}$$

At the same time, $h''$ preserves the value of $h'$ on all subsets of $XW$ and of $YW$. Consider any $I \in \Delta$. If it were not a subset of either of the two, then it would have to contain an element $x \in X$ as well as an element $y \in Y$. But this directly contradicts the independence of $X$ and $Y$ in $\Delta$, which follows from the independence of the initial $k$-tuple $X_1, \dots, X_k$. Hence $h''$ is feasible in the optimization problem we set up and contradicts the maximality of $h'$. $\blacksquare$
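
A classical special case may help to see what the statement buys. Take $N = \{a,b,c\}$ and let $\Delta$ be the complex generated by the facets $ab$ and $bc$. The singletons $a$ and $c$ are independent in $\Delta$ with $W = b$, so MAXE yields $h'$ which agrees with $h$ on $ab$ and $bc$ and satisfies

$$h'(ac\mid b) = h'(a\mid b) + h'(c\mid b), \qquad\text{i.e.}\qquad (a:c\mid b)(h') = 0.$$

This parallels the familiar fact that, among all joint distributions with prescribed (consistent) marginals $p_{ab}$ and $p_{bc}$, the one of maximal entropy is the conditional product $p(a\mid b)\,p(c\mid b)\,p(b)$, i.e., the Markov chain $a - b - c$.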

This proof is non-constructive, but it yields multiple instances of the conditional product independence statement simultaneously. An important (sort of) special case is the Copy lemma, which is really just a reformulation of the conditional product:

Copy lemma. Let $(\xi, \omega)$ be a pair of random variables. This system can be extended to a triple $(\xi, \xi', \omega)$ such that $(\xi', \omega)$ has the same distribution as $(\xi, \omega)$ and $\xi$ is conditionally independent of $\xi'$ given $\omega$.

The picture to have in mind is that a copy of the variables $(\xi, \omega)$ is created and then this copy is amalgamated with the original over $\omega$ in such a way that $\xi$ and $\xi'$ are “orthogonal” (i.e., independent) given the $\omega$ they share.

Figure: Amalgamation in the Copy lemma.

It is worth noting that the conditional product does not increase the state space of the random variables. We have no such control in the maximum entropy principle.
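
As a small sanity check, the copy construction is essentially one line of numpy. In the sketch below (the helper name make_copy is mine) both $\xi$ and $\omega$ are single discrete variables given by a joint array p[x, w]; the same code applies when they are vectors, after flattening their state spaces.

```python
import numpy as np

def make_copy(p):
    """Copy lemma construction for a pair (xi, omega) given by a joint array
    p[x, w]:  returns q[x, x', w] = p(x | w) * p(x' | w) * p(w)."""
    p_w = p.sum(axis=0)                                 # law of omega
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = np.where(p_w > 0, p / p_w, 0.0)          # p(x | w)
    return cond[:, None, :] * cond[None, :, :] * p_w

# Example: xi and omega are two correlated bits.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
q = make_copy(p)

# (xi', omega) has the same distribution as (xi, omega):
print(np.allclose(q.sum(axis=0), p), np.allclose(q.sum(axis=1), p))
# xi and xi' are independent given omega: each slice over w factors.
for w in range(p.shape[1]):
    s = q[:, :, w]
    print(np.allclose(s * s.sum(), np.outer(s.sum(axis=1), s.sum(axis=0))))
```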

Application: strengthening information inequalities

I learned the following example from Laszlo Csirmaz. Let $[abcd] = (a:b\mid c) + (a:b\mid d) + (c:d) - (a:b)$ denote the Ingleton functional. It may be checked by linear programming that the following is a valid Shannon-type inequality for five random variables:

$$[abcd] + (z:b\mid c) + (z:c\mid b) + (b:c\mid z) \ge -3\,(z:ad\mid bc).$$

This can be checked with oXitip via the one-line query "I(C;D) <= I(C;D|A)+I(C;D|B)+I(A;B)+I(B;Z|C)+I(C;Z|B)+I(B;C|Z) + 3 I(Z;A,D|B,C)", where the query’s variables A, B, C, D, Z correspond to $d, c, b, a, z$, respectively.
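
For readers who prefer a script, below is a sketch of the linear-programming check in Python with scipy (the subset encoding and helper names are mine). By LP duality, the inequality is Shannon-type exactly when the corresponding functional is a non-negative combination of the elemental inequalities, which is a linear feasibility problem.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

VARS = "abcdz"
N = frozenset(VARS)
DIM = 2 ** len(VARS) - 1          # coordinates: nonempty subsets of {a,b,c,d,z}

def idx(S):
    """Coordinate of the nonempty subset S (an iterable of variable names)."""
    return sum(1 << VARS.index(v) for v in S) - 1

def add_cmi(vec, I, J, K, coeff=1.0):
    """Add coeff * (I : J | K) = coeff * (h(IK) + h(JK) - h(IJK) - h(K)) to vec."""
    for S, sign in ((set(I) | set(K), +1), (set(J) | set(K), +1),
                    (set(I) | set(J) | set(K), -1), (set(K), -1)):
        if S:                      # h of the empty set is 0 and contributes nothing
            vec[idx(S)] += coeff * sign

# Elemental Shannon inequalities: h(N) - h(N \ i) >= 0 and (i : j | K) >= 0.
elemental = []
for i in VARS:
    row = np.zeros(DIM)
    row[idx(N)] += 1
    row[idx(N - {i})] -= 1
    elemental.append(row)
for i, j in itertools.combinations(VARS, 2):
    rest = [v for v in VARS if v not in (i, j)]
    for r in range(len(rest) + 1):
        for K in itertools.combinations(rest, r):
            row = np.zeros(DIM)
            add_cmi(row, i, j, K)
            elemental.append(row)
E = np.array(elemental)

# Target: [abcd] + (z:b|c) + (z:c|b) + (b:c|z) + 3 (z:ad|bc) >= 0.
target = np.zeros(DIM)
for I, J, K, coeff in [("a", "b", "c", 1), ("a", "b", "d", 1), ("c", "d", "", 1),
                       ("a", "b", "", -1),
                       ("z", "b", "c", 1), ("z", "c", "b", 1), ("b", "c", "z", 1),
                       ("z", "ad", "bc", 3)]:
    add_cmi(target, I, J, K, coeff)

# Shannon-type iff the target is a non-negative combination of the elemental rows.
res = linprog(np.zeros(len(E)), A_eq=E.T, b_eq=target, bounds=(0, None))
print("Shannon-type:", res.success)
```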

It appears possible that the left-hand side, let’s call it $\psi$, could become negative, but this actually cannot happen due to MAXE! Suppose that there is an almost-entropic point $h$ such that $\psi(h) < 0$. Let $\Delta$ be the simplicial complex generated by the marginals appearing in $\psi$ and notice that $z$ is independent of $ad$ in $\Delta$, i.e., neither $az$ nor $dz$ appears in any entropy evaluation in $\psi$. We can apply MAXE to obtain an almost-entropic point $h'$ with $(z:ad\mid bc) = 0$ and $\psi(h') = \psi(h)$. But this point would contradict the Shannon-type inequality above, which is absurd.

This reasoning proves the stronger inequality

$$[abcd] + (z:b\mid c) + (z:c\mid b) + (b:c\mid z) \ge 0,$$

which simplifies to the well-known Zhang–Yeung inequality for $z=a$. This is very close to how Zhang and Yeung originally proved their inequality, which was the first example of a non-Shannon information inequality.
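
Concretely, at $z=a$ the three extra terms become $(a:b\mid c)$, $(a:c\mid b)$ and $(b:c\mid a)$, so the strengthened inequality reads

$$2(a:b\mid c) + (a:b\mid d) + (c:d) + (a:c\mid b) + (b:c\mid a) \ge (a:b).$$

Using the identity $(a:b) - (a:b\mid c) = (b:c) - (b:c\mid a)$ and the chain rule $(c:ab) = (b:c) + (a:c\mid b)$, this rearranges to

$$2(a:b) \le (c:d) + (c:ab) + 3(a:b\mid c) + (a:b\mid d),$$

which is the Zhang–Yeung inequality $2I(C;D) \le I(A;B) + I(A;CD) + 3I(C;D\mid A) + I(C;D\mid B)$ with $(A,B,C,D)$ relabeled as $(c,d,a,b)$.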

For the purposes of proving unconditional information inequalities, the distinction between entropic and almost-entropic points is immaterial. Thus, MAXE appears to be more general than the Copy lemma. As far as I am aware, there is no known instance to date of a valid information inequality which can be proved with MAXE but not with iterated applications of the Copy lemma. In fact, even independently of MAXE, I do not know of any valid information inequality which cannot be proved by iterated use of the Copy lemma.