Miscellaneous Thoughts

This page contains a disorganized collection of ideas that I'm learning and thinking about, but that have yet to coalesce into a coherent narrative. Feel free to look around; you might find something interesting!


Theorem: Let $f, g : [0, 1] \to \R$ be (non-strictly) increasing functions. Then

\[\qty(\int_0^1 f(x) \dd{x}) \qty(\int_0^1 g(x) \dd{x}) \le \int_0^1 f(x) \, g(x) \dd{x}. \]
First Proof: Observe that both sides of the desired inequality can be written as double integrals. For reasons that will later become apparent, we symmetrize the integrands with respect to the integration variables $x$ and $y.$
\[\begin{aligned} \qty(\int_0^1 f(x) \dd{x}) \qty(\int_0^1 g(x) \dd{x}) &= \int_0^1 \!\! \int_0^1 f(x) \, g(y) \dd{x} \dd{y} \\ &= \frac{1}{2} \int_0^1 \!\! \int_0^1 \qty[f(x) \, g(y) + f(y) \, g(x)] \dd{x} \dd{y} \end{aligned} \]
\[\begin{aligned} \int_0^1 f(x) \, g(x) \dd{x} &= \int_0^1 \!\! \int_0^1 f(x) \, g(x) \dd{x} \dd{y} \\ &= \frac{1}{2} \int_0^1 \!\! \int_0^1 \qty[f(x) \, g(x) + f(y) \, g(y)] \dd{x} \dd{y} \end{aligned} \]

This transforms the desired inequality into:

\[\begin{aligned} \qty(\int_0^1 f(x) \dd{x}) \qty(\int_0^1 g(x) \dd{x}) &\le \int_0^1 f(x) \, g(x) \dd{x} \\ \frac{1}{2} \int_0^1 \!\! \int_0^1 \qty[f(x) \, g(y) + f(y) \, g(x)] \dd{x} \dd{y} &\le \frac{1}{2} \int_0^1 \!\! \int_0^1 \qty[f(x) \, g(x) + f(y) \, g(y)] \dd{x} \dd{y} \end{aligned} \]

We now move all terms to the RHS in order to obtain a miraculous factorization.

\[\begin{aligned} 0 &\le \frac{1}{2} \int_0^1 \!\! \int_0^1 \qty[f(x) \, g(x) - f(x) \, g(y) - f(y) \, g(x) + f(y) \, g(y)] \dd{x} \dd{y} \\ 0 &\le \frac{1}{2} \int_0^1 \!\! \int_0^1 \qty[f(x) - f(y)] \qty[g(x) - g(y)] \dd{x} \dd{y} \\ \end{aligned} \]

(This is why we had to symmetrize the integrands earlier; otherwise, we would have ended up with $f(x) \qty[g(x) - g(y)]$ instead of $\qty[f(x) - f(y)] \qty[g(x) - g(y)]$.) Now, because $f$ and $g$ are both increasing, whenever $x \ge y,$ we have $f(x) - f(y) \ge 0$ and $g(x) - g(y) \ge 0.$ Similarly, whenever $x \le y,$ we have $f(x) - f(y) \le 0$ and $g(x) - g(y) \le 0.$ In either case, $\qty[f(x) - f(y)] \qty[g(x) - g(y)]$ is always non-negative because it is a product of two numbers with the same sign. Hence, the integral of this expression is also non-negative, and the desired inequality holds.

Second Proof: It is also possible to give an alternative argument that avoids symmetrization by invoking the mean value theorem. (For this argument, we additionally assume that $g$ is continuous, so that the function $h$ defined below is differentiable with $h' = g.$) Specifically, by applying the mean value theorem to the function $h: [0, 1] \to \R$ defined by
\[h(t) \coloneqq \int_0^t g(x) \dd{x}, \]

we may conclude that there exists $x_0 \in (0, 1)$ satisfying $g(x_0) = h(1) - h(0) = \int_0^1 g(x) \dd{x}.$ Now, the desired inequality is:

\[\begin{aligned} \qty(\int_0^1 f(x) \dd{x}) \qty(\int_0^1 g(x) \dd{x}) &\le \int_0^1 f(x) \, g(x) \dd{x} \\ \int_0^1 f(x) \, g(x_0) \dd{x} &\le \int_0^1 f(x) \, g(x) \dd{x} \\ 0 &\le \int_0^1 f(x) \qty[ g(x) - g(x_0) ] \dd{x} \end{aligned} \]

Because $g$ is an increasing function, we have:

\[\begin{aligned} g(x) - g(x_0) &\le 0 && \text{when } 0 \le x \le x_0. \\ g(x) - g(x_0) &\ge 0 && \text{when } x_0 \le x \le 1. \end{aligned} \]

Similarly, because $f$ is an increasing function, we have:

\[\begin{aligned} f(x) \qty[g(x) - g(x_0)] &\ge f(x_0) \qty[g(x) - g(x_0)] && \text{when } 0 \le x \le x_0. \\ f(x) \qty[g(x) - g(x_0)] &\ge f(x_0) \qty[g(x) - g(x_0)] && \text{when } x_0 \le x \le 1. \end{aligned} \]

Note that the two sign changes at $x = x_0$ cancel, yielding $f(x) \qty[g(x) - g(x_0)] \ge f(x_0) \qty[g(x) - g(x_0)]$ for all $0 \le x \le 1.$ It follows that

\[\int_0^1 f(x) \qty[ g(x) - g(x_0) ] \dd{x} \ge \int_0^1 f(x_0) \qty[ g(x) - g(x_0) ] \dd{x} = f(x_0) \qty[ g(x_0) - g(x_0) ] = 0 \]

which is the desired result.
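
As a quick numerical sanity check of the theorem (the particular functions $f(x) = x^2$ and $g(x) = e^x$ and the grid below are illustrative choices, not part of the argument above), here is a short NumPy sketch:

```python
import numpy as np

# Two (non-strictly) increasing functions on [0, 1].
f = lambda x: x**2
g = lambda x: np.exp(x)

# Approximate each integral by averaging over a uniform grid on [0, 1].
x = np.linspace(0.0, 1.0, 100_001)
lhs = f(x).mean() * g(x).mean()      # (integral of f) * (integral of g)
rhs = (f(x) * g(x)).mean()           # integral of f*g
print(lhs, rhs, lhs <= rhs)          # approximately 0.573, 0.718, True
```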


Suppose we want to solve a system of linear equations $A\vx = \vb.$ The simplest case of this problem occurs when the matrix $A \in \R^{N \times N}$ is symmetric and positive-definite. In this case, we can recast $A\vx = \vb$ as a convex optimization problem for the quadratic function

\[f(\vx) \coloneqq \frac{1}{2} \vx^T \! A \vx - \vb^T \vx. \]

Note that the positive-definiteness of $A$ guarantees that this function is strongly convex. Moreover, the unique global minimizer of this function occurs where

\[\nabla f(\vx) = A\vx - \vb = \vo \quad\iff\quad A\vx = \vb. \]

Now, the most straightforward way to minimize a convex function is the method of gradient descent (which is also called steepest descent in the context of solving linear equations). To perform gradient descent, we pick an initial guess $\vx_0$ (when no other information is available, $\vx_0 \coloneqq \vo$ is a common choice) and compute a sequence $\vx_0, \vx_1, \vx_2, \ldots$ of approximate solutions by iterating

\[\begin{aligned} \vx_{n+1} &\coloneqq \vx_n - \alpha_n \nabla f(\vx_n) \\ &\phantom{:}= \vx_n - \alpha_n (A\vx_n - \vb). \end{aligned} \]

In most applications of gradient descent, the step size $\alpha_n > 0$ is determined using a “line search” procedure (i.e., a one-dimensional optimization algorithm). However, there is no need to perform a line search in this case, because the optimal step size for a quadratic objective function can be calculated analytically. Given an initial point $\vx \in \R^N$ and a step direction $\vd \in \R^N,$ we optimize

\[\min_\alpha f(\vx - \alpha \vd) \]

by computing

\[\begin{aligned} f(\vx - \alpha \vd) &= \frac{1}{2} (\vx - \alpha \vd)^T \! A (\vx - \alpha \vd) - \vb^T (\vx - \alpha \vd) \\ &= \frac{1}{2} \vx^T \! A \vx - \alpha \vd^T \! A \vx + \frac{\alpha^2}{2} \vd^T \! A \vd - \vb^T \vx + \alpha \vb^T \vd \\ &= \frac{\alpha^2}{2} \vd^T \! A \vd - \alpha (A\vx - \vb)^T \vd + \qty(\frac{1}{2} \vx^T \! A \vx - \vb^T \vx) \end{aligned} \]

and differentiating with respect to $\alpha,$ as follows:

\[\pdv{\alpha} f(\vx - \alpha \vd) = \alpha \vd^T \! A \vd - (A\vx - \vb)^T \vd = 0 \quad\implies\quad \alpha = \frac{(A\vx - \vb)^T \vd}{\vd^T \! A \vd} \]

Using this optimal step size, the iteration equation for gradient descent becomes

\[\vx_{n+1} = \vx_n - \frac{(A\vx_n - \vb)^T (A\vx_n - \vb)}{(A\vx_n - \vb)^T \! A (A\vx_n - \vb)} (A\vx_n - \vb). \]

This equation can be simplified by defining the residual vector $\vr_n \coloneqq \vb - A\vx_n,$ upon which it simplifies to

\[\vx_{n+1} = \vx_n + \frac{\vr_n^T \vr_n}{\vr_n^T \! A \vr_n} \vr_n. \]

This completes the derivation of the method of gradient descent (a.k.a. steepest descent).
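
As a concrete illustration, here is a minimal NumPy sketch of the steepest descent iteration just derived; the function name, tolerance, and test problem are illustrative assumptions rather than anything prescribed by the derivation above.

```python
import numpy as np

def steepest_descent(A, b, x0=None, tol=1e-10, max_iter=10_000):
    """Solve A x = b for symmetric positive-definite A by steepest descent."""
    x = np.zeros_like(b) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x                      # residual r_n = b - A x_n
    for _ in range(max_iter):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ar = A @ r
        alpha = rr / (r @ Ar)          # optimal step size along r_n
        x = x + alpha * r              # x_{n+1} = x_n + alpha_n r_n
        r = r - alpha * Ar             # r_{n+1} = r_n - alpha_n A r_n
    return x

# Example usage on a random SPD system.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)            # symmetric positive-definite
b = rng.standard_normal(5)
print(np.allclose(A @ steepest_descent(A, b), b))   # expected: True
```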

For reference, the following display summarizes the steepest descent iteration just derived (left column) side by side with the conjugate gradient iteration developed later in this section (right column), written to emphasize their parallel structure.

\[\begin{aligned} \vr_0 & \coloneqq \vb - A\vx_0 \qquad & \qquad \vr_0 & \coloneqq \vb - A\vx_0 \\ r_0^2 & \coloneqq \norm{\vr_0}_2^2 & r_0^2 & \coloneqq \norm{\vr_0}_2^2 \\ & & \vp_0 & \coloneqq \vr_0 \end{aligned} \]
\[\begin{aligned} \vq_n & \coloneqq A\vr_n & \vq_n & \coloneqq A \vp_n \\ \alpha_n & \coloneqq \frac{r_n^2}{\vr_n^T \vq_n} & \alpha_n & \coloneqq \frac{r_n^2}{\vp_n^T \vq_n} \\ \vx_{n+1} & \coloneqq \vx_n + \alpha_n \vr_n \qquad & \qquad \vx_{n+1} & \coloneqq \vx_n + \alpha_n \vp_n \\ \vr_{n+1} & \coloneqq \vr_n - \alpha_n \vq_n & \vr_{n+1} & \coloneqq \vr_n - \alpha_n \vq_n \\ r_{n+1}^2 & \coloneqq \norm{\vr_{n+1}}_2^2 & r_{n+1}^2 & \coloneqq \norm{\vr_{n+1}}_2^2 \\ & & \beta_n & \coloneqq r_{n+1}^2 \divisionsymbol r_n^2 \\ & & \vp_{n+1} & \coloneqq \vr_{n+1} + \beta_n \vp_n \end{aligned} \]

Gradient descent is an extremely natural idea (it's the first thing you'd come up with after recasting $A\vx = \vb$ as a convex optimization problem), but it has a significant drawback. Namely, it often takes multiple steps in the same direction $\vd \in \R^N.$ This is wasteful, since a step $\alpha_1 \vd$ followed by another step $\alpha_2 \vd$ could have been combined into a single step $(\alpha_1 + \alpha_2) \vd.$

Wouldn't it be nice if we got the step size right the first time? That is, whenever we take a step in a direction $\vd,$ we should choose the step size $\alpha$ in such a way that we never need to step in the same direction $\vd$ again. All future step directions should be orthogonal to $\vd.$

In mathematical terms, we would like to ensure that after we take the step $\vx_{n+1} \coloneqq \vx_n - \alpha \vd,$ the remaining displacement between the approximate solution $\vx_{n+1}$ and the true solution $A^{-1} \vb$ is orthogonal to the step direction $\vd.$ This yields an equation which determines the step size $\alpha$ as follows:

\[\begin{aligned} \vd^T (\vx_{n+1} - A^{-1} \vb) &= 0 \\ \vd^T (\vx_n - \alpha \vd - A^{-1} \vb) &= 0 \\ \vd^T (\vx_n - A^{-1} \vb) &= \alpha \vd^T \! \vd \\ \alpha &= \frac{\vd^T (\vx_n - A^{-1} \vb)}{\vd^T \! \vd} \\ \end{aligned} \]

This equation has a serious problem. It requires us to know $A^{-1} \vb$ in order to compute the step size $\alpha,$ but the whole point of solving $A\vx = \vb$ is to find $A^{-1} \vb.$ If we had $A^{-1} \vb$ in hand, then we would already be done! Unfortunately, our dream of enforcing orthogonality between $\vd$ and $\vx_{n+1} - A^{-1} \vb$ turned out to be unworkable.

But wait! We derived this equation for $\alpha$ by demanding that $\vd$ be orthogonal to $\vx_{n+1} - A^{-1} \vb$ under the standard Euclidean inner product $\ev{\vx, \vy} \coloneqq \vx^T \vy.$ There is no reason to prefer this inner product over any other, and our idea can be saved by switching to another one. In particular, because $A$ is a symmetric positive-definite matrix, it defines the $A$-inner product

\[\ev{\vx, \vy}_{\! A} \coloneqq \vx^T \! A \vy. \]

If we ask for $\vd$ to be $A$-orthogonal to $\vx_{n+1} - A^{-1} \vb,$ then we instead obtain the following equation:

\[\begin{aligned} \vd^T \! A (\vx_{n+1} - A^{-1} \vb) &= 0 \\ \vd^T (A \vx_n - \alpha A \vd - \vb) &= 0 \\ \vd^T (A \vx_n - \vb) &= \alpha \vd^T \! A \vd \\ \alpha &= \frac{\vd^T (A \vx_n - \vb)}{\vd^T \! A \vd} \\ \end{aligned} \]

Remarkably, this is the same value of $\alpha$ that we derived earlier by minimizing $f(\vx - \alpha \vd)!$ It contains no occurrences of $A^{-1},$ and can be computed purely from quantities we already know.


In the following section, let $A \in \R^{N \times N}$ be a symmetric positive-definite matrix, and let $\vb, \vx_0 \in \R^N$ be arbitrary vectors. We define the sequences $(\alpha_n), (\beta_n) \subset \R$ and $(\vx_n), (\vr_n), (\vp_n) \subset \R^N$ by the initial conditions $\vr_0 \coloneqq \vp_0 \coloneqq \vb - A\vx_0$ and the following mutual recurrence relations:

\[\begin{aligned} \alpha_n &\coloneqq \frac{\norm{\vr_n}_2^2}{\vp_n^T \! A \vp_n} \\ \vx_{n+1} &\coloneqq \vx_n + \alpha_n \vp_n \\ \vr_{n+1} &\coloneqq \vr_n - \alpha_n A \vp_n \\ \beta_n &\coloneqq \frac{\norm{\vr_{n+1}}_2^2}{\norm{\vr_n}_2^2} \\ \vp_{n+1} &\coloneqq \vr_{n+1} + \beta_n \vp_n \end{aligned} \]

These sequences are well-defined until the first $n$ for which $\vr_n = \vo$ (i.e., until $\vx_n$ is the exact solution), at which point $\vp_n = \vo$ as well, and division by zero occurs in the definitions of $\alpha_n$ and $\beta_n.$
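
The recurrence above (the conjugate gradient method) translates directly into code. Here is a minimal NumPy sketch; the function name and stopping tolerance are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A using the recurrence above."""
    x = np.zeros_like(b) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x                  # r_0 = b - A x_0
    p = r.copy()                   # p_0 = r_0
    rr = r @ r
    max_iter = len(b) if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.sqrt(rr) < tol:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)      # alpha_n = ||r_n||^2 / (p_n^T A p_n)
        x = x + alpha * p          # x_{n+1} = x_n + alpha_n p_n
        r = r - alpha * Ap         # r_{n+1} = r_n - alpha_n A p_n
        rr_next = r @ r
        beta = rr_next / rr        # beta_n = ||r_{n+1}||^2 / ||r_n||^2
        p = r + beta * p           # p_{n+1} = r_{n+1} + beta_n p_n
        rr = rr_next
    return x
```

In exact arithmetic, the residuals $\vr_0, \vr_1, \ldots$ are mutually orthogonal (this is the first claim of the theorem below), so at most $N$ of them can be nonzero, and the iteration terminates in at most $N$ steps; in floating-point arithmetic, the recurrence is instead run until the residual norm is sufficiently small.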

Lemma: Each step direction $\vp_n$ is a linear combination of the residual vectors $\vr_0, \dots, \vr_n.$ In particular,
\[\vp_n = \sum_{k=0}^n \frac{\norm{\vr_n}_2^2}{\norm{\vr_k}_2^2} \vr_k. \]
\[\begin{aligned} \vp_0 &= \vr_0 \\ \vp_1 &= \vr_1 + \frac{\norm{\vr_1}_2^2}{\norm{\vr_0}_2^2} \vr_0 \\ \vp_2 &= \vr_2 + \frac{\norm{\vr_2}_2^2}{\norm{\vr_1}_2^2} \vp_1 \\ &= \vr_2 + \frac{\norm{\vr_2}_2^2}{\norm{\vr_1}_2^2} \vr_1 + \frac{\norm{\vr_2}_2^2}{\norm{\vr_0}_2^2} \vr_0 \end{aligned} \]
Theorem:
\[\begin{aligned} \vr_i^T \vr_j &= 0 \qquad\qquad i \ne j \\ \vp_i^T \! A \vp_j &= 0 \qquad\qquad i \ne j \\ \vr_i^T \vp_j &= \norm{\vr_j}_2^2 \qquad\, i \le j \\ \vr_i^T \! A \vp_i &= \vp_i^T \! A \vp_i \\ \vr_i^T \! A \vp_j &= 0 \qquad\qquad i \notin \{j, j+1\} \end{aligned} \]

Proof: We first verify that the theorem holds for $\vr_0,$ $\vp_0,$ and $\vr_1.$ Since $\vr_0 = \vp_0,$ the only nontrivial claim to verify is the orthogonality relation $\vr_0^T \vr_1 = 0:$

\[\begin{aligned} \vr_0^T \vr_1 &= \vr_0^T \qty(\vr_0 - \alpha_0 A \vp_0) \\ &= \vr_0^T \qty(\vr_0 - \frac{\norm{\vr_0}_2^2}{\vr_0^T \! A \vr_0} A \vr_0) \\ &= \norm{\vr_0}_2^2 - \frac{\norm{\vr_0}_2^2}{\vr_0^T \! A \vr_0} \vr_0^T \! A \vr_0 \\ &= 0. \end{aligned} \]

We now proceed by induction in two stages. Suppose that the theorem holds for the vectors $\vr_0, \dots, \vr_k$ and $\vp_0, \dots, \vp_{k-1}.$ We will show that the theorem also holds for $\vp_k$ and $\vr_{k+1}.$ To prove the theorem for $\vp_k,$ we need to verify the following four claims:

\[\begin{aligned} \vr_i^T \vp_k &= \norm{\vr_k}_2^2 \qquad & \qquad i &= 0, \dots, k \\ \vp_i^T \! A \vp_k &= 0 & i &= 0, \dots, k-1 \\ \vr_i^T \! A \vp_k &= 0 & i &= 0, \dots, k-1 \\ \vr_k^T \! A \vp_k &= \vp_k^T \! A \vp_k & & \end{aligned} \]

Let $i \in \{0, \dots, k\}.$ The first inductive claim follows from a direct calculation using the inductive hypothesis $\vr_i^T \vr_j = 0$ for $i \ne j$ and the formula for $\vp_k$ obtained earlier.

\[\vr_i^T \vp_k = \sum_{j=0}^k \frac{\norm{\vr_k}_2^2}{\norm{\vr_j}_2^2} \vr_i^T \vr_j = \sum_{j=0}^k \frac{\norm{\vr_k}_2^2}{\norm{\vr_j}_2^2} \delta_{ij} \norm{\vr_j}_2^2 = \norm{\vr_k}_2^2 \]

Now, let $i \in \{0, \dots, k-1\}.$ The preceding result implies that $\vr_i^T \vp_k = \vr_{i+1}^T \vp_k = \norm{\vr_k}_2^2.$ We obtain the second inductive claim by combining this with the iteration equation $\vr_{i+1} \coloneqq \vr_i - \alpha_i A \vp_i.$

\[\norm{\vr_k}_2^2 = \vr_{i+1}^T \vp_k = \vr_i^T \vp_k - \alpha_i \vp_i^T \! A \vp_k = \norm{\vr_k}_2^2 - \alpha_i \vp_i^T \! A \vp_k \]
\[\qquad\implies\qquad \alpha_i \vp_i^T \! A \vp_k = 0 \]

Since $\alpha_i > 0,$ we may conclude that $\vp_i^T \! A \vp_k = 0.$ We now apply the iteration equation $\vp_i = \vr_i + \beta_{i-1} \vp_{i-1}$ to this result to obtain the third inductive claim. (For $i = 0,$ we set $\beta_{-1} \coloneqq 0$ and $\vp_{-1} \coloneqq \vo.$)

\[0 = \vp_i^T \! A \vp_k = (\vr_i + \beta_{i-1} \vp_{i-1})^T \! A \vp_k = \vr_i^T \! A \vp_k + \beta_{i-1} \vp_{i-1}^T \! A \vp_k = \vr_i^T \! A \vp_k \]

Finally, the fourth inductive claim follows by combining the iteration equation $\vp_k = \vr_k + \beta_{k-1} \vp_{k-1}$ with a special case of the second inductive claim, namely, $\vp_{k-1}^T \! A \vp_k = 0.$

\[\vp_k^T \! A \vp_k = \qty(\vr_k + \beta_{k-1} \vp_{k-1})^T \! A \vp_k = \vr_k^T \! A \vp_k + \beta_{k-1} \vp_{k-1}^T \! A \vp_k = \vr_k^T \! A \vp_k \]

This completes the proof that the theorem holds for $\vp_k.$ We now turn our attention to proving the theorem for $\vr_{k+1}.$ This requires us to verify two additional claims:

\[\begin{aligned} \vr_j^T \vr_{k+1} &= 0 \qquad & \qquad j &= 0, \dots, k \\ \vr_{k+1}^T \! A \vp_j &= 0 & j &= 0, \dots, k-1 \end{aligned} \]

The first inductive claim requires two cases. For $j \in \{0, \dots, k-1\},$ we combine the iteration equation $\vr_{k+1} \coloneqq \vr_k - \alpha_k A \vp_k$ with the inductive hypotheses $\vr_j^T \vr_k = 0$ and $\vr_j^T \! A \vp_k = 0$ to write:

\[\vr_j^T \vr_{k+1} = \vr_j^T \qty(\vr_k - \alpha_k A \vp_k) = \vr_j^T \vr_k - \alpha_k \vr_j^T \! A \vp_k = 0 \]

On the other hand, for $j = k,$ we apply the definition $\alpha_k \coloneqq \norm{\vr_k}_2^2 / (\vp_k^T \! A \vp_k)$ together with the inductive hypothesis $\vr_k^T \! A \vp_k = \vp_k^T \! A \vp_k.$

\[\vr_k^T \vr_{k+1} = \vr_k^T \qty(\vr_k - \alpha_k A \vp_k) = \vr_k^T \vr_k - \alpha_k \vr_k^T A \vp_k = \vr_k^T \vr_k - \frac{\norm{\vr_k}_2^2}{\vp_k^T \! A \vp_k} \vr_k^T \! A \vp_k = 0 \]

Finally, the second inductive claim only requires one case. Let $j \in \{0, \dots, k-1\}.$ From the first inductive claim, we know that $\vr_j^T \vr_{k+1} = \vr_{j+1}^T \vr_{k+1} = 0.$ It follows that:

\[0 = \vr_{k+1}^T \vr_{j+1} = \vr_{k+1}^T \qty(\vr_j - \alpha_j A \vp_j) = \vr_{k+1}^T \vr_j - \alpha_j \vr_{k+1}^T A \vp_j = -\alpha_j \vr_{k+1}^T A \vp_j \]

Since $\alpha_j > 0,$ we may conclude that $\vr_{k+1}^T A \vp_j = 0$. This completes the proof.


Definition: A decision tree on $n$ variables is a full binary tree $T$ in which each internal node is labeled with an index $i \in \{1, \dots, n\}$ and each leaf is labeled with a result $r \in \{0,1\}.$

[TODO: Draw a picture of a decision tree.]

Every decision tree $T$ on $n$ variables defines a Boolean function $\{0,1\}^n \to \{0,1\}$ as follows:

[TODO: Illustrate examples of decision tree computation.]

Given a Boolean function $f : \{0,1\}^n \to \{0,1\},$ we say that a decision tree $T$ computes $f$ if $T(x) = f(x)$ for every bit string $x \in \{0,1\}^n.$

The depth of a decision tree $T$ is the length of a longest path in $T$ from the root to any leaf. Here, the length of a path is defined as the number of edges it traverses, so that a tree consisting only of a single leaf has depth zero.

We can construct a decision tree $T$ that computes any Boolean function $f : \{0,1\}^n \to \{0,1\}$ by simply writing the truth table for $f$ along the bottom row of a complete binary tree of depth $n.$ This is a highly inefficient decision tree having $2^n$ leaves and $2^n-1$ internal nodes. The nontrivial question is whether a given function $f$ admits a smaller decision tree.

If a Boolean function $f$ can be computed by a very shallow decision tree, then only a few bits of $x$ are needed to deduce the value of $f(x).$ This suggests that $f$ is, in some sense, a very simple function. We formalize this notion in the following definition:

Definition: The decision tree complexity of a Boolean function $f: \{0,1\}^n \to \{0,1\},$ denoted by $D(f),$ is the minimum depth of any decision tree $T$ that computes $f.$
Definition: Let $x \in \{0,1\}^n$ and $S \subseteq [n].$ The restriction of $x$ to $S,$ denoted by $x|_S,$ is the bit string $x_{i_1} x_{i_2} \cdots x_{i_\abs{S}},$ where $i_1 < i_2 < \cdots < i_{\abs{S}}$ denotes the enumeration of $S = \{i_1, i_2, \dots, i_{\abs{S}}\}$ in increasing order.

For example, if $x = 011010$ and $S = \{1, 3, 5\},$ then $x|_S = 011.$ Similarly, $x|_{\{2,4,6\}} = 100,$ and $x|_\varnothing$ is the empty string.

Definition: Let $f: \{0,1\}^n \to \{0,1\}$ and $x \in \{0,1\}^n.$ A certificate for $f$ at $x$ is a set $S \subseteq [n]$ with the following property: for all $x' \in \{0,1\}^n,$ if $x|_S = x'|_S,$ then $f(x) = f(x').$

Certificates always exist, since the full set $[n]$ is always a certificate for any $f$ at any $x.$ The nontrivial question is whether a given function admits smaller certificates. Intuitively, small certificates suggest that a function has low complexity, since only a small number of bits of $x$ are needed to deduce the value of $f(x).$

Definition: The certificate complexity of a Boolean function $f: \{0,1\}^n \to \{0,1\}$ at a bit string $x \in \{0,1\}^n$ is the smallest cardinality of any certificate for $f$ at $x.$

\[C_x(f) \coloneqq \min_{\substack{S \subseteq [n] \\ \text{$S$ is a certificate for $f$ at $x$}}} \abs{S} \]

The certificate complexity of $f$ is

\[C(f) \coloneqq \max_{x \in \{0,1\}^n} C_x(f). \]

This definition introduces two distinct but related notions: “certificate complexity of $f$” and “certificate complexity of $f$ at $x.$” This is analogous to the separate but related notions of “continuity of $f$” and “continuity of $f$ at $x$” in topology.
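
For small $n,$ both quantities can be computed by brute force. The following sketch assumes a Boolean function given as a Python callable on bit tuples (an illustrative representation, not anything fixed by the definitions above):

```python
from itertools import combinations, product

def certificate_complexity_at(f, x, n):
    """Smallest |S| such that fixing the bits of x on S determines f."""
    for size in range(n + 1):
        for S in combinations(range(n), size):
            if all(f(y) == f(x)
                   for y in product((0, 1), repeat=n)
                   if all(y[i] == x[i] for i in S)):
                return size
    return n  # unreachable: S = [n] is always a certificate

def certificate_complexity(f, n):
    return max(certificate_complexity_at(f, x, n)
               for x in product((0, 1), repeat=n))

# Example: the OR function on 3 bits has C(OR) = 3, because certifying
# the all-zeros input requires revealing every bit, even though every
# 1-input has a certificate of size 1.
print(certificate_complexity(lambda x: int(any(x)), 3))   # expected: 3
```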

Theorem: $C(f) \le D(f)$ for any Boolean function $f : \{0,1\}^n \to \{0,1\}.$
Proof: Let $T$ be a minimal-depth decision tree that computes $f.$ For any bit string $x \in \{0,1\}^n,$ let $S$ be the set of indices queried when computing $T(x).$ If $x' \in \{0,1\}^n$ satisfies $x|_S = x'|_S,$ then $x'$ follows the same root-to-leaf path through $T$ as $x,$ so $f(x') = T(x') = T(x) = f(x).$ Hence, $S$ is a certificate for $f$ at $x.$ Since $\abs{S}$ is at most the depth of $T,$ we conclude that $C_x(f) \le D(f)$ for every $x \in \{0,1\}^n,$ and therefore $C(f) \le D(f).$

$\deg(f)$

$\apxdeg(f)$


In this section, we denote by $\SphS^1$ the unit circle in the complex plane, i.e.,

\[\SphS^1 \coloneqq \{ z \in \C : \abs{z} = 1 \}. \]

Observe that $\SphS^1$ is a group under multiplication, and that the group-theoretic inverse of $z \in \SphS^1$ is simply its complex conjugate $z^*.$ Moreover, $\SphS^1$ is a topological group under the subspace topology inherited from the usual topology on $\C.$

Definition: The Pontryagin dual of a locally compact abelian group $G$ is the topological group $\widehat{G} \coloneqq \operatorname{Hom}_{\mathbf{TopGrp}}(G, \SphS^1)$ of continuous group homomorphisms from $G$ to the circle group $\SphS^1,$ endowed with the operation of pointwise multiplication and the topology of uniform convergence on compact sets.
Theorem: If $G$ is a locally compact abelian group, then so is $\widehat{G}.$

Having shown that $\widehat{G}$ is itself a locally compact abelian group, it now makes sense to consider the double Pontryagin dual $\widehat{\!\vphantom{\rule{1pt}{8.2pt}}\smash{\widehat{G}}}.$

Pontryagin Duality Theorem: For any locally compact abelian group $G,$ the evaluation homomorphism $G \hookrightarrow \widehat{\!\vphantom{\rule{1pt}{8.2pt}}\smash{\widehat{G}}}$ is an isomorphism of topological groups (i.e., a group isomorphism and a homeomorphism).

\[\ind_{=} : \prod_{A : \UU} \ \prod_{C : \prod\limits_{x : A} \prod\limits_{y : A} \qty[(x \mathrel{=_A} y) \to \UU]} \qty[\qty(\prod_{x : A} C(x, x, \refl_x)) \to \prod_{x : A} \ \prod_{y : A} \ \prod_{p : x \mathrel{=_A} y} C(x, y, p)] \]
\[\ind_{=}(A, C, f, x, x, \refl_x) \equiv f(x) \]
\[\transport : \prod_{A : \UU} \ \prod_{P : A \to \UU} \ \prod_{x : A} \ \prod_{y : A} \ \prod_{p : x \mathrel{=_A} y} \qty[P(x) \to P(y)] \]

We normally suppress the arguments $A,$ $x,$ and $y$ of $\transport,$ since these can be inferred from $P$ and $p.$ Thus, we write $\transport^P(p) : P(x) \to P(y).$

\[\begin{aligned} \transport \equiv (A : \UU) &\mapsto (P : A \to \UU) \mapsto \ind_{=}(A, \\ &(x : A) \mapsto (y : A) \mapsto (p : x \mathrel{=_A} y) \mapsto \qty[P(x) \to P(y)], \\ &(x : A) \mapsto \id_{P(x)}) \end{aligned} \]

\[\apd : \prod_{A : \UU} \ \prod_{P : A \to \UU} \ \prod_{f : \prod\limits_{x : A} P(x)} \ \prod_{x : A} \ \prod_{y : A} \ \prod_{p : x \mathrel{=_A} y} \qty[\transport^P(p, f(x)) \mathrel{=_{P(y)}} f(y)] \]

As with $\transport,$ we normally suppress the arguments $A,$ $P,$ $x,$ and $y$ of $\apd,$ since they can be inferred from $f$ and $p.$ Thus, we write

\[\apd_f : \prod_{p : x \mathrel{=_A} y} \qty[\transport^P(p, f(x)) \mathrel{=_{P(y)}} f(y)]. \]

To understand $\apd,$ think of $P$ as a proposition, and $f$ as a “proof constructor” for $P.$ In other words, for each $x : A,$ $f(x)$ is a proof term for $P(x).$ Given a path $p : x \mathrel{=_A} y,$ $\apd$ says that transporting the proof term $f(x)$ across the path $p$ yields a proof which is equal to $f(y).$

\[\begin{aligned} \apd \equiv (A : \UU) &\mapsto (P : A \to \UU) \mapsto \qty(f : \prod_{x : A} P(x)) \mapsto \ind_{=}(A, \\ &(x : A) \mapsto (y : A) \mapsto (p : x \mathrel{=_A} y) \mapsto \qty[\transport^P(p, f(x)) \mathrel{=_{P(y)}} f(y)], \\ &(x : A) \mapsto \refl_{f(x)}) \end{aligned} \]

Observe that when $y \equiv x$ and $p \equiv \refl_x,$ the type $\transport^P(p, f(x)) \mathrel{=_{P(y)}} f(y)$ judgmentally reduces to $f(x) \mathrel{=_{P(x)}} f(x).$ This type is inhabited by $\refl_{f(x)}.$


Definition: Let $f, g : \prod_{(x : A)} B(x).$ A homotopy from $f$ to $g$ is a dependent function of the following type:

\[(f \sim g) \equiv \prod_{x : A} \qty[ f(x) \mathrel{=_{B(x)}} g(x) ] \]

Definition: Let $f: A \to B.$ A quasi-inverse of $f$ is an ordered triple $(g, \alpha, \beta)$ consisting of a function $g: B \to A$ and two homotopies $\alpha: f \circ g \sim \id_B$ and $\beta: g \circ f \sim \id_A.$ The type of quasi-inverses of $f$ is denoted by:

\[\qinv(f) \equiv \sum_{g : B \to A} \qty[(f \circ g \sim \id_B) \times (g \circ f \sim \id_A)] \]

Definition: Let $f: A \to B.$ We say that $f$ is an equivalence from $A$ to $B$ if there exists an ordered quadruple $(g, \alpha, h, \beta)$ consisting of two functions $g, h : B \to A$ and two homotopies $\alpha : f \circ g \sim \id_B$ and $\beta: h \circ f \sim \id_A.$ The type of such ordered quadruples is denoted by:

\[\isequiv(f) \equiv \Bigg[ \sum_{g : B \to A} (f \circ g \sim \id_B) \Bigg] \times \Bigg[ \sum_{h : B \to A} (h \circ f \sim \id_A) \Bigg] \]

The type of equivalences from $A$ to $B$ is denoted by:

\[(A \simeq B) \equiv \sum_{f : A \to B} \isequiv(f) \]

Thus, we write $f: A \simeq B$ to indicate that $f$ is an equivalence from $A$ to $B.$

Theorem: For each $f : A \to B,$ the types $\qinv(f)$ and $\isequiv(f)$ are logically equivalent. In other words, the types $\qinv(f) \to \isequiv(f)$ and $\isequiv(f) \to \qinv(f)$ are inhabited.

The direction $\qinv(f) \to \isequiv(f)$ is easy, since we can send $(g, \alpha, \beta) \mapsto (g, \alpha, g, \beta).$


Definition: The nuclear norm (a.k.a. trace norm) of a matrix $A \in \C^{m \times n}$ is the quantity

\[\nucnorm{A} \coloneqq \sum_{i=1}^{\min(m,n)} \sigma_i(A) \]

where $\sigma_i(A)$ denotes the $i^{\text{th}}$ singular value of $A.$

The nuclear norm is a powerful tool in computational mathematics because it can be interpreted as a convex relaxation of the rank function. (More precisely, the nuclear norm is the convex envelope of the rank function when restricted to the unit ball in the $\ell^2 \to \ell^2$ operator norm.) In matrix recovery problems, adding a penalty term proportional to the nuclear norm encourages a solver to find low-rank solutions.

The nuclear norm also has a special interpretation for Hermitian matrices. Recall that the singular values of a Hermitian matrix $A \in \C^{n \times n}$ are the absolute values of its eigenvalues, i.e., $\sigma_i = \abs{\lambda_i}.$ It follows that $\nucnorm{A} = \sum_{i=1}^n \abs{\lambda_i}$ when $A$ is Hermitian. Moreover, when $A$ is both Hermitian and positive (semi)definite, we can drop the absolute value bars and write $\nucnorm{A} = \sum_{i=1}^n \lambda_i = \tr(A)$. This property is, presumably, the justification for the name “trace norm.”
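
Both observations in the preceding paragraph are easy to check numerically. A minimal NumPy sketch (the random test matrices are illustrative assumptions):

```python
import numpy as np

def nuclear_norm(A):
    """Sum of the singular values of A."""
    return np.linalg.svd(A, compute_uv=False).sum()

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# Hermitian case: nuclear norm = sum of |eigenvalues|.
H = B + B.conj().T
print(np.isclose(nuclear_norm(H), np.abs(np.linalg.eigvalsh(H)).sum()))  # True

# Hermitian positive semidefinite case: nuclear norm = trace.
P = B @ B.conj().T
print(np.isclose(nuclear_norm(P), np.trace(P).real))                     # True
```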

Because the definition of $\nucnorm{A}$ depends only on the singular values of $A,$ it is clear that the nuclear norm is unitarily invariant, i.e., that $\nucnorm{A} = \nucnorm{U\!AV}$ for all $U \in \LieU(m)$ and $V \in \LieU(n).$ (Here, $\LieU(n) \subseteq \C^{n \times n}$ denotes the set of $n \times n$ unitary matrices.) However, it is not at all evident that the nuclear norm is actually a norm! In order to prove this, we will first establish the following variational characterization:

Theorem: For all square matrices $A \in \C^{n \times n},$

\[\nucnorm{A} = \max_{U,V \in \LieU(n)} \trace \abs{U^\dagger \! A V} \]

where $\abs{U^\dagger \! A V}$ denotes the matrix obtained by taking the absolute value of each entry of $U^\dagger \! A V.$ Moreover, the maximum is achieved whenever $U^\dagger \! A V$ is a diagonal matrix.

Proof: Let $A = W\Sigma Z^\dagger$ denote a singular value decomposition of $A,$ where $W,Z \in \LieU(n)$ and $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n)$ is a diagonal matrix of singular values. Then

\[\max_{U,V \in \LieU(n)} \trace \abs{U^\dagger \! A V} \ge \trace \abs{W^\dagger \! A Z} = \trace \abs{\Sigma} = \sum_{i=1}^n \sigma_i = \nucnorm{A}. \]

We now must show that $\trace \abs{U^\dagger \! A V} \le \nucnorm{A}$ for all $U,V \in \LieU(n).$ Observe that as $U$ ranges over all unitary matrices, the product $WU$ also ranges over all unitary matrices. The same holds for $V$ and $ZV.$ Thus, under the change of variables $(U, V) \mapsto (WU, ZV),$ it suffices to show that

\[\trace \abs{U^\dagger W^\dagger \! A Z V} = \trace \abs{U^\dagger \Sigma V} \le \nucnorm{A} \]

for all $U,V \in \LieU(n).$ This is proven by the following computation:

\[\begin{aligned} \trace \abs{U^\dagger \Sigma V} &= \sum_{i=1}^n \abs{(U^\dagger \Sigma V)_{ii}} \\ &= \sum_{i=1}^n \abs{\sum_{j=1}^n (U^\dagger)_{ij} \sigma_j V_{ji}} \\ &\le \sum_{i=1}^n \sum_{j=1}^n \abs{U^*_{ji} \sigma_j V_{ji}} && \text{(triangle inequality)} \\ &= \sum_{j=1}^n \sigma_j \sum_{i=1}^n \abs{U_{ji}} \abs{V_{ji}} \\ &\le \sum_{j=1}^n \sigma_j \sum_{i=1}^n \frac{1}{2} \qty(\abs{U_{ji}}^2 + \abs{V_{ji}}^2) && \text{(AM-GM inequality)} \\ &= \frac{1}{2} \sum_{j=1}^n \sigma_j \qty(\sum_{i=1}^n \abs{U_{ji}}^2 + \sum_{i=1}^n \abs{V_{ji}}^2) \\ &= \frac{1}{2} \sum_{j=1}^n \sigma_j \cdot \qty(1 + 1) && \text{($U,V$ have orthonormal rows)} \\ &= \sum_{j=1}^n \sigma_j \\ &= \nucnorm{A} \end{aligned} \]

Corollary: If $A \in \C^{n \times n}$ is a Hermitian matrix, then

\[\nucnorm{A} = \max_{U \in \LieU(n)} \trace \abs{U^\dagger \! A U}. \]

Moreover, the maximum value is achieved when $U$ diagonalizes $A.$

Theorem: The nuclear norm is a norm on $\C^{n \times n}.$

Proof: Positive-definiteness and absolute homogeneity are clear, since the singular values $\sigma_i(A)$ are nonnegative and absolutely homogeneous functions of $A$, and every nonzero matrix has at least one nonzero singular value. To establish subadditivity, we perform the following computation:

\[\begin{aligned} \nucnorm{A + B} &= \max_{U,V \in \LieU(n)} \trace \abs{U^\dagger (A + B) V} \\ &= \max_{U,V \in \LieU(n)} \trace \abs{U^\dagger \! A V + U^\dagger \! B V} \\ &\le \max_{U,V \in \LieU(n)} \qty(\trace \abs{U^\dagger \! A V} + \trace \abs{U^\dagger \! B V}) \\ &\le \max_{U,V \in \LieU(n)} \trace \abs{U^\dagger \! A V} + \max_{W,Z \in \LieU(n)} \trace \abs{W^\dagger \! B Z} \\ &= \nucnorm{A} + \nucnorm{B} \end{aligned} \]

In the third step, we use the entrywise inequality $\abs{A+B} \le \abs{A} + \abs{B}.$

Definition: A density matrix is a Hermitian positive semidefinite matrix with trace 1.

Earlier, we observed that the nuclear norm $\nucnorm{A}$ is unitarily invariant because it only depends on the singular values of $A.$ Conversely, every unitarily invariant norm $\norm{\cdot} : \C^{n \times n} \to \R_{\ge 0}$ can be written as a function of the singular values, since we can always use the SVD $A = U\Sigma V^\dagger$ to write $\norm{A} = \norm{U \Sigma V^\dagger} = \norm{\Sigma}.$


Consider a quantum system with Hamiltonian $\hat{H},$ eigenvalues $E_n,$ and corresponding eigenvectors $\ket{n}\!.$ Its partition function is

\[Z \coloneqq \sum_n e^{-\beta E_n} = \sum_n \mel{n}{e^{-\beta \hat{H}}}{n} = \Trace(e^{-\beta \hat{H}}). \]

Recall that by introducing a resolution of the identity, we can verify the identity $Z = \Trace(e^{-\beta \hat{H}})$ in any basis. For example, in the position basis, we have:

\[\begin{aligned} Z &= \sum_n \mel{n}{e^{-\beta \hat{H}}}{n} \\ &= \sum_n \bra{n} \qty[\int \ket{\vx} \bra{\vx} \dd{\vx}] e^{-\beta \hat{H}} \qty[\int \ket{\vx'} \bra{\vx'} \dd{\vx'}] \ket{n} \\ &= \sum_n \iint \bra{n} \ket{\vx} \bra{\vx} e^{-\beta \hat{H}} \ket{\vx'} \bra{\vx'} \ket{n} \dd{\vx} \dd{\vx'} \\ &= \iint \bra{\vx'} \qty[\sum_n \ket{n} \bra{n}] \ket{\vx} \bra{\vx} e^{-\beta \hat{H}} \ket{\vx'} \dd{\vx} \dd{\vx'} \\ &= \iint \bra{\vx'} \ket{\vx} \bra{\vx} e^{-\beta \hat{H}} \ket{\vx'} \dd{\vx} \dd{\vx'} \\ &= \iint \delta(\vx - \vx') \bra{\vx} e^{-\beta \hat{H}} \ket{\vx'} \dd{\vx} \dd{\vx'} \\ &= \int \bra{\vx} e^{-\beta \hat{H}} \ket{\vx} \dd{\vx} \end{aligned} \]

Now, if the system consists of $N$ distinguishable particles in $d$ spatial dimensions, then the Hamiltonian $\hat{H}$ has the form

\[\hat{H} = \hat{T} + \hat{V} = \sum_{i=1}^N \frac{\hat{p}_i^2}{2m_i} + V(\hat{\vx}_1, \dots, \hat{\vx}_N). \]

In the classical limit $\hbar \to 0,$ the canonical commutation relation $[x_{ij}, p_{k\ell}] = i\hbar \delta_{ik} \delta_{j\ell} = O(\hbar)$ vanishes. It follows that the kinetic energy $\hat{T}$ and potential energy $\hat{V}$ operators commute, and we can write:

\[\begin{aligned} Z &= \Trace(e^{-\beta \hat{T}} e^{-\beta \hat{V}}) \\ &= \int \bra{\vx} e^{-\beta \hat{T}} e^{-\beta \hat{V}} \ket{\vx} \dd{\vx} \\ &= \int \bra{\vx} e^{-\beta \hat{T}} \qty[\int \ket{\vx'} \bra{\vx'} \dd{\vx'}] e^{-\beta \hat{V}} \ket{\vx} \dd{\vx} \\ &= \iint \bra{\vx} e^{-\beta \hat{T}} \ket{\vx'} \bra{\vx'} e^{-\beta \hat{V}} \ket{\vx} \dd{\vx} \dd{\vx'} \\ &= \iint \bra{\vx} e^{-\beta \hat{T}} \ket{\vx'} e^{-\beta V(\vx)} \delta(\vx - \vx') \dd{\vx} \dd{\vx'} \\ &= \int \bra{\vx} e^{-\beta \hat{T}} \ket{\vx} e^{-\beta V(\vx)} \dd{\vx} \\ &= \int \bra{\vx} \qty[\int \ket{\vp} \bra{\vp} \dd{\vp}] e^{-\beta \hat{T}} \qty[\int \ket{\vp'} \bra{\vp'} \dd{\vp'}] \ket{\vx} e^{-\beta V(\vx)} \dd{\vx} \\ &= \iiint \bra{\vx} \ket{\vp} \bra{\vp} e^{-\beta \hat{T}} \ket{\vp'} \bra{\vp'} \ket{\vx} e^{-\beta V(\vx)} \dd{\vx} \dd{\vp} \dd{\vp'} \\ &= \iiint \bra{\vx} \ket{\vp} e^{-\beta T(\vp)} \delta(\vp - \vp') \bra{\vp'} \ket{\vx} e^{-\beta V(\vx)} \dd{\vx} \dd{\vp} \dd{\vp'} \\ &= \iint \bra{\vx} \ket{\vp} e^{-\beta T(\vp)} \bra{\vp} \ket{\vx} e^{-\beta V(\vx)} \dd{\vx} \dd{\vp} \\ &= \frac{1}{(2\pi\hbar)^{Nd}} \iint e^{-\beta H(\vx, \vp)} e^{i\vp \cdot \vx/\hbar} e^{-i\vp \cdot \vx/\hbar} \dd{\vx} \dd{\vp} \\ &= \frac{1}{h^{Nd}} \iint e^{-\beta H(\vx, \vp)} \dd{\vx} \dd{\vp} \end{aligned} \]

In the penultimate step, we use the fact that

\[\bra{\vx} \ket{\vp} = \frac{e^{i\vp \cdot \vx / \hbar}}{(2\pi \hbar)^{Nd/2}} \qquad\text{ and }\qquad \bra{\vp} \ket{\vx} = \frac{e^{-i\vp \cdot \vx / \hbar}}{(2\pi \hbar)^{Nd/2}}. \]

This calculation explains the appearance of Planck's constant $h$ in the formula for the classical partition function. In classical thermodynamics, the value of $h$ is immaterial, since we always compute physically observable quantities by taking a derivative of the logarithm of $Z.$ However, the quantum-mechanical perspective shows that $h$ is the correct measure of phase-space volume.
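
As a concrete (and entirely illustrative) example of this limit, consider a one-dimensional harmonic oscillator with $H = p^2/2m + m\omega^2 x^2/2.$ The exact quantum partition function is $Z = \sum_n e^{-\beta\hbar\omega(n + 1/2)} = 1/(2\sinh(\beta\hbar\omega/2)),$ while the classical phase-space integral gives $Z_{\mathrm{cl}} = \frac{1}{h}\iint e^{-\beta H(x,p)} \dd{x}\dd{p} = 1/(\beta\hbar\omega),$ and the two agree as $\beta\hbar\omega \to 0$:

```python
import numpy as np

hbar, omega = 1.0, 1.0          # units chosen so that hbar*omega = 1 (assumption)

def z_quantum(beta):
    # Z = sum_n exp(-beta*hbar*omega*(n + 1/2)) = 1 / (2 sinh(beta*hbar*omega/2))
    return 1.0 / (2.0 * np.sinh(beta * hbar * omega / 2.0))

def z_classical(beta):
    # Z = (1/h) * double integral of exp(-beta*H(x,p)) = 1 / (beta*hbar*omega)
    return 1.0 / (beta * hbar * omega)

for beta in [1.0, 0.1, 0.01]:
    print(beta, z_quantum(beta), z_classical(beta))
# As beta*hbar*omega -> 0 (high temperature), the two values converge.
```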


Cayley's Theorem: Every group $G$ is isomorphic to a subgroup of $\Sym(G).$
Proof: Send each element $g \in G$ to the permutation of $G$ defined by $x \mapsto gx.$
Strong Cayley Theorem: For each subgroup $H$ of a group $G,$ there exists a homomorphism $\varphi: G \to \SymS_m,$ where $m \coloneqq \abs{G:H},$ whose kernel is the largest normal subgroup of $G$ contained in $H.$
Remark: There is no requirement that $G,$ $H,$ or $\abs{G:H}$ be finite. This statement still makes sense when $m$ is interpreted as a cardinal number.

Proof: Consider the action of $G$ by left multiplication on the set of left cosets of $H.$ This action identifies each element $g \in G$ with a permutation of a set containing $\abs{G:H} = m$ elements, i.e., an element of $\SymS_m.$ This defines the desired homomorphism $\varphi: G \to \SymS_m.$

Observe that $g \in \ker \varphi$ if and only if left multiplication by $g$ fixes every left coset of $H,$ i.e.,

\[\begin{aligned} && g(aH) = aH \qquad \forall a \in G \\ \iff && a^{-1} gaH = H \qquad \forall a \in G \\ \iff && a^{-1} ga \in H \qquad \forall a \in G \\ \iff && g \in aHa^{-1} \qquad \forall a \in G \\ \end{aligned} \]

This shows that $\ker\varphi = \bigcap_{a \in G} aHa^{-1}.$ Now, any normal subgroup $N \normalin G$ satisfying $N \subseteq H$ also satisfies $N = aNa^{-1} \subseteq aHa^{-1}$ for all $a \in G,$ which proves $N \subseteq \ker \varphi.$

Note that Cayley's theorem is the special case $H = \{1_G\}$ of the strong Cayley theorem.

Corollary: If a finite group $G$ has a subgroup $H \le G$ whose index $p \coloneqq \abs{G:H}$ is the smallest prime factor of $\abs{G},$ then $H \normalin G.$

Proof: Let $\varphi: G \to \SymS_p$ denote the homomorphism induced by the left action of $G$ on the collection of left cosets of $H.$ The strong Cayley theorem tells us that $K \coloneqq \ker \varphi$ is the largest normal subgroup of $G$ contained in $H.$ We will prove that, in fact, $K = H.$

Since we already know $K \subseteq H,$ to establish equality, it suffices to prove $\abs{K} \ge \abs{H}.$ We will do this by proving that every prime power which divides $\abs{H}$ also divides $\abs{K}.$

Observe via the first isomorphism theorem that $G/K$ is isomorphic to a subgroup of $\SymS_p.$ By Lagrange's theorem, $\abs{G:K}$ divides $\abs{\SymS_p} = p!.$ But since $\abs{G:K} = \abs{G:H}\abs{H:K} = p\abs{H:K},$ we may conclude that $\abs{H:K}$ divides $(p-1)!.$ In particular, $\abs{H:K}$ cannot have any prime factors $\ge p.$

Now, let $q$ be a prime for which $q^k$ divides $\abs{H},$ where $k \ge 1.$ Since $\abs{H} = \abs{H:K} \abs{K},$ we know that $\abs{H:K}$ and $\abs{K}$ must collectively share $k$ factors of $q.$ However, $q$ cannot be a prime factor of $\abs{H:K},$ since any prime factor of $\abs{H}$ is also a prime factor of $\abs{G} = \abs{G:H}\abs{H},$ and by hypothesis, $q \ge p.$ Therefore, $q^k$ divides $\abs{K}.$

This generalizes the fact that every subgroup of index 2 is normal. Thus, to prove that a group of order 99 is not simple, it suffices to find a subgroup of index 3.

Every group acts on itself by conjugation: $(g, x) \mapsto gxg^{-1}.$

Orbit-Stabilizer Theorem: Let $G \actson X.$ For any $x \in X,$

\[\abs{G} = \abs{Gx} \abs{G_x} \]

where $Gx$ denotes the orbit of $x$ under $G,$ and $G_x$ denotes the stabilizer subgroup of $x.$

Remark: This statement holds in cardinal arithmetic regardless of whether $\abs{G}$ is finite.
Proof: Elements of the orbit $Gx$ correspond to cosets of the stabilizer $G_x$ in $G.$ Hence, $\abs{G}$ equals the number of cosets $\abs{Gx}$ multiplied by the size $\abs{G_x}$ of each coset.
Burnside's Lemma: Let $G \actson X$ with both $G$ and $X$ finite. Then
\[\abs{X/G} = \frac{1}{\abs{G}} \sum_{g \in G} \abs{X^g} \]

where $\abs{X/G}$ denotes the number of distinct orbits, and $X^g$ denotes the set of elements of $X$ which are fixed by a given $g \in G.$

Proof: Consider the quantity $\sum_{g \in G} \sum_{x \in X} [gx = x],$ where $[gx = x]$ denotes the Iverson bracket (i.e., 1 if $gx = x,$ otherwise 0). On one hand, we can write

\[\sum_{g \in G} \sum_{x \in X} [gx = x] = \sum_{g \in G} \sum_{x \in X^g} 1 = \sum_{g \in G} \abs{X^g}. \]

However, by changing the order of summation, we can apply the orbit-stabilizer theorem to write

\[\sum_{g \in G} \sum_{x \in X} [gx = x] = \sum_{x \in X} \sum_{g \in G} [gx = x] = \sum_{x \in X} \abs{G_x} = \abs{G} \sum_{x \in X} \frac{1}{\abs{Gx}}. \]

To complete the proof, observe that $\abs{X/G} = \sum_{x \in X} \abs{Gx}^{-1},$ since each distinct orbit contributes a total of 1 to this sum.
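
The formula is easy to verify numerically for a small action. In the sketch below (an illustrative example, not part of the argument above), the cyclic group $\Z_n$ acts on length-$n$ bit strings by rotation, and the number of orbits counted directly agrees with the average number of fixed points:

```python
from itertools import product

def burnside_check(n):
    X = list(product((0, 1), repeat=n))
    rotate = lambda x, g: x[g:] + x[:g]          # action of g in Z_n

    # Count orbits directly.
    seen, orbits = set(), 0
    for x in X:
        if x not in seen:
            orbits += 1
            seen.update(rotate(x, g) for g in range(n))

    # Average number of fixed points over the group.
    avg_fixed = sum(sum(1 for x in X if rotate(x, g) == x) for g in range(n)) / n
    return orbits, avg_fixed

print(burnside_check(4))   # expected: (6, 6.0)
```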

Cauchy's Theorem: Let $G$ be a finite group. If a prime number $p$ divides $\abs{G},$ then $G$ contains an element of order $p.$

Proof: Consider the set

\[X \coloneqq \{ (x_1, x_2, \dots, x_p) \in G^p : x_1 x_2 \cdots x_p = 1_G \}. \]

Observe that $\abs{X} = \abs{G}^{p-1},$ since the first $p-1$ elements of $(x_1, x_2, \dots, x_p)$ can be chosen freely; the final element is then uniquely determined. Since (by hypothesis) $p$ divides $\abs{G},$ it follows that $p$ also divides $\abs{X}.$

Let $\Z_p \actson X$ by cyclic permutations. (Note that $x_1 x_2 \cdots x_p = 1_G$ implies $x_2 \cdots x_p x_1 = 1_G$.) This action partitions $X$ into orbits, and since the only subgroups of $\Z_p$ are itself and the trivial group, the orbit-stabilizer theorem implies that every orbit has size 1 or $p.$ Since $\abs{X} \equiv 0 \mod p,$ the number of orbits of size 1 must be divisible by $p.$ There is at least one such orbit, given by the element $(1_G, 1_G, \dots, 1_G) \in X,$ so we can conclude that there are at least $p-1$ others. Every such orbit yields an element of $G$ of order $p.$

Sylow's First Theorem: Let $G$ be a finite group, and $p$ a prime number. If $p^k$ divides $\abs{G},$ for any $k \ge 0,$ then $G$ contains a subgroup of order $p^k.$
Proof:

A function $f: \R^n \to \R$ is harmonic if it is annihilated by the Laplacian: $\nabla^2 f = 0.$ Think of $\nabla^2 f(\vx)$ as measuring the difference between the average value that $f$ assumes at points near $\vx$ and the value $f(\vx)$ itself. At a local minimizer $\vx^*,$ the quantity $\nabla^2 f(\vx^*)$ is nonnegative, because the value of $f$ at any point near $\vx^*$ is at least $f(\vx^*).$

Note that if $f$ is harmonic, then so is any partial derivative of $f,$ since partial derivative operators commute with the Laplacian.

Mean Value Property: Let $f: \R^n \to \R$ be a harmonic function, and let $\nu$ denote the normalized surface area measure on the unit sphere $S^{n-1}$ embedded in $\R^n.$ For any $\vx \in \R^n$ and $r \ge 0,$
\[f(\vx) = \int_{S^{n-1}} f(\vx + r\vy) \dd{\nu(\vy)}. \]

Proof: The claim is trivial for $r = 0,$ since the integrand is identically $f(\vx).$ We complete the proof by showing that the RHS is constant as a function of $r.$ To do this, we differentiate and apply the divergence theorem:

\[\begin{aligned} \pdv{r} \qty[\int_{S^{n-1}} f(\vx + r\vy) \dd{\nu(\vy)}] &= \int_{S^{n-1}} \vy \cdot \nabla f(\vx + r\vy) \dd{\nu(\vy)} \\ &= \int_{B^n} \nabla^2 f(\vx + r\vy) \dd{\mu(\vy)} = 0 \end{aligned} \]

Here, $B^n \subset \R^n$ denotes the volume bounded by $S^{n-1},$ and $\mu$ denotes a scalar multiple of Lebesgue measure on $B^n.$ Since $f$ was assumed to be harmonic, the integrand vanishes identically.
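
As a quick numerical illustration of the mean value property (the particular harmonic function, center, radius, and sample count below are all assumptions made for this example), averaging $f(x, y) = x^2 - y^2$ over a circle recovers its value at the center:

```python
import numpy as np

def f(x, y):
    return x**2 - y**2          # harmonic: f_xx + f_yy = 2 - 2 = 0

center, r = np.array([0.3, -0.7]), 2.5
theta = np.linspace(0.0, 2.0 * np.pi, 100_000, endpoint=False)
pts = center[:, None] + r * np.vstack([np.cos(theta), np.sin(theta)])
circle_average = f(pts[0], pts[1]).mean()

print(circle_average, f(*center))   # both approximately -0.40
```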

Corollary: The value of a harmonic function at a point is also equal to its average value over a ball centered at that point. (The sphericity of the ball is important; this statement does not hold for an arbitrarily-shaped neighborhood.)

The class of harmonic functions is remarkably well-behaved. For example, a converse to the mean value property also holds: any continuous function which satisfies the mean value property must be harmonic. Moreover, all harmonic functions are real analytic. (This is proven by establishing bounds on the growth rate of the $n^{\text{th}}$ derivative of a harmonic function.) As a corollary, harmonic functions are uniquely determined by their values in an arbitrarily small neighborhood (or, in fact, a germ at a single point).

Liouville's Theorem: Any function which is harmonic on all of $\R^n$ is either constant or unbounded (above and below).

One might naturally ask whether Laplace's equation $\nabla^2 f = 0$ admits weak solutions which are not classical solutions. This is not the case.

Weyl's Lemma: Every weak solution of Laplace's equation $\nabla^2 f = 0$ is a classical solution up to redefinition on a set of measure zero.

In this section, an operator is a map that takes a function $f : \R \to \R$ and returns a function $Lf : \R \to \R.$ In this context, we call $f$ the input signal and $Lf$ the output signal. We will particularly be interested in operators that are linear and translation-invariant, abbreviated LTI. (Technically, we ought to say “translation-equivariant,” but this slightly imprecise terminology is deeply entrenched.)

If we know what an LTI operator $L$ does to a function, then we also know what it does to any linear combination of translates of said function. Now, any function can be thought of as a linear combination of unit impulses $\delta,$ so if we know the impulse response $h \coloneqq L\delta,$ then we know how $L$ acts on any function. In particular, $Lf = L(\delta * f) = (L\delta) * f = h * f,$ where $*$ denotes convolution.

We interpret $h(t)$ as the amount of the input signal $f$ which is reverberated after a time delay of $t.$ We say that the operator $L$ is causal if $h(t) = 0$ for all $t < 0;$ this means that $L$ does not look into the future (i.e., the input signal at a future time cannot affect the output signal at a past time).

It is a remarkable fact that complex exponential signals $f(t) \coloneqq e^{i \omega t}$ are eigenfunctions of any LTI operator $L.$ The corresponding eigenvalue can be computed from the impulse response $h \coloneqq L \delta$ as follows:

\[(Lf)(t) = \int_{-\infty}^{\infty} h(u) f(t-u) \dd{u} = \int_{-\infty}^{\infty} h(u) e^{i \omega (t - u)} \dd{u} = e^{i \omega t} \int_{-\infty}^{\infty} h(u) e^{-i \omega u} \dd{u} \]

Thus, if we define the function $\hat{h}: \R \to \C$ by

\[\hat{h}(\omega) \coloneqq \int_{-\infty}^{\infty} h(u) e^{-i \omega u} \dd{u} \]

then we see that $\hat{h}(\omega)$ is the corresponding eigenvalue for the eigenfunction $t \mapsto e^{i \omega t}$ of the LTI operator with impulse response $h.$ The function $\hat{h}$ is called the Fourier transform of $h,$ and is commonly denoted either by the addition of a hat, as in $\hat{h},$ or by a calligraphic $\mathcal{F},$ as in $\hat{h} = \mathcal{F}h.$ This is clearly well-defined for $h \in L^1(\R),$ and we will later show that $\mathcal{F}$ can be extended to $L^2(\R).$
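
The discrete-time analogue of this eigenfunction property is easy to check numerically. In the sketch below (the impulse response and frequency are illustrative assumptions), convolving a sampled complex exponential with a finite impulse response $h$ multiplies it, away from the start-up transient, by the frequency response $\sum_k h[k]\, e^{-i\omega k}$:

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2])                 # a short causal impulse response
omega = 0.7                                   # angular frequency (radians/sample)

n = np.arange(200)
x = np.exp(1j * omega * n)                    # sampled complex exponential
y = np.convolve(x, h)[:len(n)]                # LTI output y = h * x

h_hat = np.sum(h * np.exp(-1j * omega * np.arange(len(h))))   # frequency response

# Away from the start-up transient, y[n] = h_hat * x[n].
print(np.allclose(y[len(h):], h_hat * x[len(h):]))            # expected: True
```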

It is natural to ask whether a function $h$ is uniquely determined by its Fourier transform $\hat{h}.$ This is the case, and in fact, there is an explicit inversion formula for computing $h$ from $\hat{h}.$

\[h(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{h}(\omega) e^{i\omega t} \dd{\omega} \]

This defines the inverse Fourier transform $\mathcal{F}^{-1}.$ [TODO: Prove the Fourier inversion formula for the case where $h$ and $\hat{h}$ are both in $L^1(\R)$.]

\[\mathcal{F}(f * h) = (\mathcal{F} f) (\mathcal{F} h) \]
\[\braket{f}{g} = \frac{1}{2\pi} \braket{\mathcal{F} f}{\mathcal{F} g} \]

Suppose we observe a stochastic process $Y_t = X_t + Z_t$ which is the sum of a signal component $X_t$ and a noise component $Z_t.$


Problem: How many edges does an icosahedron have?
Note: An icosahedron is a polyhedron (3D shape) that has 20 triangular faces.

Obviously, we could solve this problem by drawing a picture of an icosahedron and counting the number of edges we see. However, we can also solve this problem in a purely combinatorial fashion, by counting something in two different ways.

Solution: Let's say that an edge-face pair $(e, f)$ is an ordered pair consisting of a face $f$ and an edge $e$ which is incident to $f$. Since the faces of an icosahedron are triangular, each contributes three edge-face pairs. Hence, the number of edge-face pairs is $3\abs{F}.$ Similarly, each edge contributes two edge-face pairs, so the number of edge-face pairs is $2\abs{E}.$ This shows that $2\abs{E} = 3\abs{F},$ so an icosahedron has 30 edges.

This is a special case of a more general principle.

Theorem: Let $G$ be a bipartite graph with $n_L$ vertices on the left and $n_R$ vertices on the right. Suppose the vertices on the left all have degree $d_L,$ and the vertices on the right all have degree $d_R.$ Then $n_L d_L = n_R d_R.$
Proof: Both $n_L d_L$ and $n_R d_R$ count the number of edges in $G.$

Definition: A connected (pseudo-)Riemannian manifold $M$ is said to be extensible if $M$ can be isometrically embedded as an open submanifold of a strictly larger connected (pseudo-)Riemannian manifold. Otherwise, we say that $M$ is inextensible or maximal.

Hopf–Rinow Theorem: Let $M$ be a connected Riemannian manifold. The following are equivalent:

  1. $M$ is geodesically complete.
  2. $M$ is geodesically complete at some point $p \in M.$
  3. $M$ has the Heine–Borel property.
  4. $M$ is complete as a metric space.

Moreover, if any of these statements hold, then there exists a length-minimizing geodesic joining any two points of $M.$


We denote by $\Ord$ the class of ordinal numbers. Following von Neumann's construction, we identify each ordinal number $\alpha \in \Ord$ with the transitive set of all smaller ordinals, well-ordered with respect to the set membership relation ${\in}.$

Definition: A surreal number is a function from an ordinal number to the 2-element set $\{+, -\}.$ The class of surreal numbers is denoted by $\No.$

Clearly, there are at least as many surreal numbers as ordinal numbers, so $\No$ is a proper class. We will later see that $\Ord$ embeds into $\No$ in a natural way.

Notation: The domain of a surreal number $x \in \No$ is called its birthday, denoted by $\beta(x).$

Conway imagines that the surreal numbers are constructed inductively, with the empty sequence $()$ being created on “day zero.” (Recall that in von Neumann's construction, the empty set is the ordinal number zero.) On day one, its descendants $(+)$ and $(-)$ are born via concatenation of a ${+}$ or ${-}.$ Then, on day two, $(+)$ gives birth to $(++)$ and $(+-)$, while $(-)$ gives birth to $(-+)$ and $(--).$ Continuing in this fashion, $2^n$ new surreal numbers are born on the $n$th day. This process continues transfinitely, with $2^\alpha$ new surreal numbers born on each day $\alpha \in \Ord.$

Definition: We say that a surreal number $x \in \No$ is simpler than a surreal number $y \in \No,$ denoted by $x \simplereq y,$ if $x$ is an initial segment of $y.$ We write $x \simpler y$ to mean $x \simplereq y$ and $x \ne y.$
Theorem: Let $F, G \subset \No$ be sets of surreal numbers. (Note that this notation indicates that $F$ and $G$ are subsets of $\No,$ as opposed to subclasses.) If $F < G,$ then there exists a unique simplest surreal number $x \in \No$ satisfying $F < x < G.$

Proof sketch: If $F = G = \varnothing,$ then we take the empty sequence. If $F$ is non-empty but $G$ is empty, then we take a sufficiently long sequence of $+;$ similarly, if $F$ is empty but $G$ is non-empty, then we take a sufficiently long sequence of $-.$

If both $F$ and $G$ are non-empty, then there exists a least ordinal $\alpha$ such that any member of $F$ differs from any member of $G$ somewhere in their first $\alpha$ places. (Indeed, if there were no such $\alpha,$ then there would exist a member of $F$ that agrees with a member of $G$ in every place, contradicting our assumption that $F < G.$) Because we picked $\alpha$ to be the least such ordinal, for every $\gamma < \alpha,$ there exist members of $F$ and $G$ who agree in their first $\gamma$ places. This uniquely determines $x(\gamma)$ for all $\gamma < \alpha.$

We now consider two cases. If $\alpha$ is a limit ordinal, then we have a well-defined surreal number $x: \alpha \to \{{+}, {-}\}.$ If $x \notin F$ and $x \notin G,$ then $x$ works; otherwise, either $x \in F$ or $x \in G,$ but not both (since $F < G$). If $x \in F,$ then we append just enough $+$ signs to get a number that exceeds $F;$ similarly with $-$ signs for $G.$

On the other hand, if $\alpha$ is a successor ordinal, then we have a well-defined surreal number $x: (\alpha - 1) \to \{{+}, {-}\}.$ Let $F'$ denote the set of tails of $x$ in $F,$ and let $G'$ denote the set of tails of $x$ in $G.$ Note that neither $F'$ nor $G'$ are empty, since by construction, $x$ has tails in both $F$ and $G$ (even if the only tail is the empty tail). By the definition of $\alpha,$ for any $a \in F'$ and $b \in G',$ we have $a(0) < b(0).$ If neither $F'$ nor $G'$ contains the empty sequence, then $a(0) = {-}$ and $b(0) = {+},$ so $x$ works. Otherwise, suppose $F'$ contains the empty sequence. We append a single ${+}$ to $x$ to exceed $F,$ followed by just enough ${-}$ signs to slip under $G.$
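
For surreal numbers with finite birthdays, this proof sketch can be turned into a brute-force search. The sketch below represents a surreal number as a Python string of '+' and '-' characters (an illustrative encoding) and uses the standard ordering of sign sequences, in which the first differing position decides the comparison with ${-} <$ (end of sequence) $< {+}$; the first candidate found in order of increasing birthday is the simplest one.

```python
from itertools import product

def sign(seq, i):
    # Positions past the end of the sequence count as 0 ("stop").
    return 0 if i >= len(seq) else (1 if seq[i] == '+' else -1)

def less(x, y):
    # Lexicographic order on sign sequences with '-' < stop < '+'.
    for i in range(max(len(x), len(y))):
        if sign(x, i) != sign(y, i):
            return sign(x, i) < sign(y, i)
    return False   # x and y are equal

def simplest_between(F, G, max_birthday=12):
    # Search in order of increasing birthday (= length of the sign sequence).
    for n in range(max_birthday + 1):
        for signs in product('+-', repeat=n):
            x = ''.join(signs)
            if all(less(f, x) for f in F) and all(less(x, g) for g in G):
                return x
    return None

print(simplest_between(set(), set()))    # expected: ''    (zero, born on day zero)
print(simplest_between({'+'}, {'++'}))   # expected: '++-' (that is, 3/2)
```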

Notation: Given $F, G \subset \No$ with $F < G,$ we denote by $F \mid G$ the simplest surreal number $x \in \No$ satisfying $F < x < G.$ We call the pair of sets $(F, G)$ a representation of $x,$ and write $x = F \mid G.$
Theorem: Let $F, G \subset \No$ with $F < G,$ and let $\alpha$ be the least ordinal for which $\beta(x) < \alpha$ for all $x \in F \cup G.$ Then $\beta(F \mid G) \le \alpha.$

Note that the inequality $\beta(x) < \alpha$ in the statement of the preceding theorem cannot be replaced by $\beta(x) \le \alpha.$ Indeed, if $F = \{(+)\}$ and $G = \{(++)\},$ then $F \mid G = ({+}{+}{-}).$

Definition: The canonical representation of a surreal number $x \in \No$ is the representation $x = F \mid G$ where $F$ is the set of initial segments of $x$ which are less than $x$ itself, and $G$ is the set of initial segments of $x$ exceeding $x.$
Definition: The sum of two surreal numbers $a, b \in \No$ is defined recursively in terms of their canonical representations, where $a^L$ and $a^R$ range over the left and right options of $a$ (i.e., the elements of the sets $F$ and $G$ in the canonical representation $a = F \mid G$), and similarly for $b^L$ and $b^R$:
\[a + b \coloneqq \{ a^L + b,\ a + b^L \} \mid \{ a^R + b,\ a + b^R \} \]
Definition: We say that two positive surreal numbers $x, y \in \No$ are in the same Archimedean class, denoted by $x \sim y,$ if there exists an integer $n \in \N$ such that $nx \ge y$ and $x \le ny.$ On the other hand, if $nx < y$ for all $n \in \N,$ we write $x \ll y,$ or equivalently, $y \gg x.$

For any two positive surreal numbers $x, y \in \No,$ we either have $x \ll y,$ $x \sim y,$ or $x \gg y.$

Definition: For each $x \in \No,$ we define
\[\omega^x \coloneqq \{ 0, 2^n \omega^{x^L} \} \mid \{ 2^{-n} \omega^{x^R} \} \]
where $n$ ranges over $\N,$ and $x^L$ and $x^R$ range over the options in the canonical representation of $x.$

Theorem: Let $x, y \in \No$.

  • $\omega^x > 0.$
  • $x < y$ if and only if $\omega^x \ll \omega^y.$
  • $\omega^0 = 1.$
  • $\omega^{x+y} = \omega^x \omega^y.$
  • If $x$ is an ordinal, then $\omega^x$ agrees with ordinal exponentiation in the usual sense.
Theorem: A positive surreal number is the simplest element of its Archimedean class if and only if it can be written in the form $\omega^x$ for some $x \in \No$.

Theorem: Every surreal number $x \in \No$ admits a unique representation in the form

\[x = \sum_{\alpha < \beta} r_\alpha \omega^{y_\alpha} \]

where $\beta \in \Ord,$ $(y_\alpha)_{\alpha < \beta}$ is a strictly decreasing sequence of surreal numbers, and $(r_\alpha)_{\alpha < \beta}$ is a sequence of nonzero real numbers. If $x$ is an ordinal, then this representation coincides with the Cantor normal form.

Definition: This representation is called the Conway normal form of $x.$
Proof sketch: The technique is somewhat analogous to the algorithm for computing continued fractions. Repeatedly extract the largest Archimedean class from $x,$ and show that this procedure must terminate at some transfinite stage.

It follows from the existence of Conway normal forms that every surreal number can be uniquely written as the sum of a real number (the $\omega^0$ term), an infinitesimal number (the terms with negative exponents), and a purely infinite number (the terms with positive exponents).

Definition: For $x, y \in \No,$ let $r_y(x)$ denote the real-valued coefficient of $\omega^y$ in the Conway normal form of $x$, with $r_y(x) \coloneqq 0$ if $\omega^y$ does not occur. We say that a sequence $(x_n)_{n \in \N}$ of surreal numbers converges to $x,$ denoted by $x = \lim_{n \to \infty} x_n,$ if $r_y(x) = \lim_{n \to \infty} r_y(x_n)$ for all $y \in \No.$

If, for each fixed $y,$ the sequence $r_y(x_n)$ is eventually constant, then we say that $(x_n)_{n \in \N}$ converges absolutely to $x.$

Note that any (possibly multivariate) formal power series with real coefficients, even if wildly divergent on real-valued arguments, is absolutely convergent when fed infinitesimal surreal arguments.

This suggests a powerful design pattern for extending real functions $f: \R \to \R$ to the surreals.

This strategy is used in this paper to extend the natural logarithm to $\No.$


There exists a field called constructive gravity dedicated to the following question: given a collection of equations describing the dynamics of a matter field, what is the corresponding diffeomorphism-invariant theory of gravity? That is, what tensor fields are required to describe the spacetime geometry to which this matter couples, and what field equations do those tensor fields satisfy?

The methods of constructive gravity can describe more general spacetime geometries than the Lorentzian metrics of standard general relativity. For example, there are so-called area metric manifolds which, instead of carrying a metric $g_{ab}$ which measures tangent lengths, carry an area metric $G_{abcd}$ which measures tangent areas. Such a geometry is required to describe a spacetime in which light can spontaneously birefract.


\[\begin{aligned} x &= r \sin(\theta) \cos(\phi) \\ y &= r \sin(\theta) \sin(\phi) \\ z &= r \cos(\theta) \end{aligned} \]

\[\begin{aligned} \pdv{x}{r} &= \sin(\theta) \cos(\phi) & \pdv{x}{\theta} &= r \cos(\theta) \cos(\phi) & \pdv{x}{\phi} &= -r \sin(\theta) \sin(\phi) \\ \pdv{y}{r} &= \sin(\theta) \sin(\phi) & \pdv{y}{\theta} &= r \cos(\theta) \sin(\phi) & \pdv{y}{\phi} &= r \sin(\theta) \cos(\phi) \\ \pdv{z}{r} &= \cos(\theta) & \pdv{z}{\theta} &= -r \sin(\theta) & \pdv{z}{\phi} &= 0 \end{aligned} \]
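
These partial derivatives can be checked mechanically with a computer algebra system; here is a minimal SymPy sketch (assuming SymPy is available):

```python
import sympy as sp

r, theta, phi = sp.symbols('r theta phi', positive=True)
x = r * sp.sin(theta) * sp.cos(phi)
y = r * sp.sin(theta) * sp.sin(phi)
z = r * sp.cos(theta)

# Jacobian matrix of (x, y, z) with respect to (r, theta, phi).
J = sp.Matrix([x, y, z]).jacobian([r, theta, phi])
sp.pprint(sp.simplify(J))

# Its determinant is the familiar volume element r^2 sin(theta).
print(sp.simplify(J.det()))   # expected: r**2*sin(theta)
```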

Definition: Let $M$ be a smooth manifold. A tangent vector at a point $p \in M$ is an $\R$-linear map $X_p : C^\infty(M) \to \R$ satisfying the Leibniz identity
\[X_p(f g) = X_p(f) g(p) + f(p) X_p(g). \]

The set of all tangent vectors at the point $p$ is called the tangent space at $p,$ denoted by $T_p M.$


© David K. Zhang 2016–2021