##### Miscellaneous Thoughts

This page contains a disorganized collection of ideas that I'm learning and thinking about, but have yet to coalesce into a coherent narrative. Feel free to look around you might find something interesting!

Theorem: Let be (non-strictly) increasing functions. Then

First Proof: Observe that both sides of the desired inequality can be written as double integrals. For reasons that will later become apparent, we symmetrize the integrands with respect to the integration variables and

This transforms the desired inequality into:

We now move all terms to the RHS in order to obtain a miraculous factorization.

(This is why we had to symmetrize the integrands earlierotherwise, we would have ended up with instead of .) Now, because and are both increasing, whenever we have and Similarly, whenever we have and In either case, is always non-negative because it is a product of two numbers with the same sign. Hence, the integral of this expression is also non-negative, and the desired inequality holds.

Second Proof: It is also possible to give an alternative argument that avoids symmetrization by invoking the mean value theorem. Specifically, by applying the mean value theorem to the function defined by

we may conclude that there exists satisfying Now, the desired inequality is:

Because is an increasing function, we have:

Similarly, because is an increasing function, we have:

Note that the two sign changes at cancel, yielding for all It follows that

which is the desired result.

Suppose we want to solve a system of linear equations The simplest case of this problem occurs when the matrix is symmetric and positive-definite. In this case, we can recast as a convex optimization problem for the quadratic function

Note that the positive-definiteness of guarantees that this function is strongly convex. Moreover, the unique global minimizer of this function occurs where

Now, the most straightforward way to minimize a convex function is the method of gradient descent (which is also called steepest descent in the context of solving linear equations). To perform gradient descent, we pick an initial guess (when no other information is available, is a common choice) and compute a sequence of approximate solutions by iterating

In most applications of gradient descent, the step size is determined using a “line search” procedure (i.e., a one-dimensional optimization algorithm). However, there is no need to perform line search in this case, because the optimal step size for a quadratic objective function can be calculated analytically. Given an initial point and a step direction we optimize

by computing

and differentiating with respect to as follows:

Using this optimal step size, the iteration equation for gradient descent becomes

This equation can be simplified by defining the residual vector upon which it simplifies to

This completes the derivation of the method of gradient descent (a.k.a. steepest descent).

Gradient descent is an extremely natural idea it's the first thing you'd come up with after recasting as a convex optimization problem but it has a significant drawback. Namely, it often takes multiple steps in the same direction This is wasteful, since a step followed by a another step can be combined into a single step

Wouldn't it be nice if we got the step size right the first time? That is, whenever we take a step in a direction we should choose the step size in such a way that we never need to step in the same direction again. All future step directions should be orthogonal to

In mathematical terms, we would like to ensure that after we take the step the remaining displacement between the approximate solution and the true solution is orthogonal to the step direction This yields an equation which determines the step size as follows:

This equation has a serious problem. It requires us to know in order to compute the step size but the whole point of solving is to find If we had in hand, then we would already be done! Unfortunately, our dream of enforcing orthogonality between and turned out to be unworkable.

But wait! We derived this equation for by demanding that be orthogonal to under the standard Euclidean inner product There is no reason to prefer this inner product over any other, and our idea can be saved by switching to another one. In particular, because is a symmetric positive-definite matrix, it defines the -inner product

If we ask for to be -orthogonal to then we instead obtain the following equation:;

Remarkably, this is the same value of that we derived earlier by minimizing It contains no occurrences of and can be computed purely from quantities we already know.

In the following section, let be a symmetric positive-definite matrix, and let be arbitrary vectors. We define the sequences and by the initial conditions and the following mutual recurrence relations:

These sequences are well-defined until the first for which at which point division by zero occurs in the definition of

Lemma: Each step direction is a linear combination of the residual vectors In particular,
Theorem:

Proof: We first verify that the theorem holds for and Since the only nontrivial claim to verify is that

We now proceed by induction in two stages. Suppose that the theorem holds for the vectors and We will show that the theorem also holds for and To prove the theorem for we need to verify the following four claims:

Let The first inductive claim follows from a direct calculation using the inductive hypothesis for and the formula for obtained earlier.

Now, let The preceding result implies that We obtain the second inductive claim by combining this with the iteration equation

Since we may conclude that We now apply the iteration equation to this result to obtain the third inductive claim. (For we set and )

Finally, the fourth inductive claim follows by combining the iteration equation with a special case of the second inductive claim, namely,

This completes the proof that the theorem holds for We now turn our attention to proving the theorem for This requires us to verify two additional claims:

The first inductive claim requires two cases. For we combine the iteration equation with the inductive hypotheses and to write:

On the other hand, for we apply the definition together with the inductive hypothesis

Finally, the second inductive claim only requires one case. Let From the first inductive claim, we know that It follows that:

Since we may conclude that . This completes the proof.

Definition: A decision tree on variables is a full binary tree in which each internal node is labeled with an index and each leaf is labeled a result

[TODO: Draw a picture of a decision tree.]

Every decision tree on variables defines a Boolean function as follows:

[TODO: Illustrate examples of decision tree computation.]

Given a Boolean function we say that a decision tree computes if for every bit string

The depth of a decision tree is the length of a longest path in from the root to any leaf. Here, the length of a path is defined as the number of edges it traverses, so that a tree consisting only of a single leaf has depth zero.

We can construct a decision tree that computes any Boolean function by simply writing the truth table for along the bottom row of a complete binary tree of depth This is a highly inefficient decision tree having leaves and internal nodes. The nontrivial question is whether a given function admits a smaller decision tree.

If a Boolean function can be computed by a very shallow decision tree, then only a few bits of are needed to deduce the value of This suggests that is, in some sense, a very simple function. We formalize this notion in the following definition:

Definition: The decision tree complexity of a Boolean function denoted by is the minimum depth of any decision tree that computes
Definition: Let and The restriction of to denoted by is the bit string where denotes the enumeration of in increasing order.

For example, if and then Similarly, and is the empty string.

Definition: Let and A certificate for at is a set with the following property: for all if then

Certificates always exist, since the full set is always a certificate for any at any The nontrivial question is whether a given function admits smaller certificates. Intuitively, small certificates suggest that a function has low complexity, since only a small number of bits of are needed to deduce the value of

Definition: The certificate complexity of a Boolean function at a bit string is the smallest cardinality of any certificate for at

The certificate complexity of is

This definition introduces two distinct but related notions: “certificate complexity of f and “certificate complexity of f at x.” This is analogous to the separate but related notions of “continuity of f and “continuity of f at x in topology.

Theorem: for any Boolean function
Proof: Let be a minimal-depth decision tree that computes For any bit string the set of indices queried when computing

In this section, we denote by the unit circle in the complex plane, i.e.,

Observe that is a group under multiplication, and that the group-theoretic inverse of is simply its complex conjugate Moreover, is a topological group under the subspace topology inherited from the usual topology on

Definition: The Pontryagin dual of a locally compact abelian group is the topological group of continuous group homomorphisms from to the circle group endowed with the operation of pointwise multiplication and the topology of uniform convergence on compact sets.
Theorem: If is a locally compact abelian group, then so is

Having shown that is itself a locally compact abelian group, it now makes sense to consider the double Pontryagin dual

Pontryagin Duality Theorem: For any locally compact abelian group the evaluation homomorphism is an isomorphism of topological groups (i.e., a group isomorphism and a homeomorphism).

We normally suppress the arguments and of since these can be inferred from and Thus, we write

As with we normally suppress the arguments and of since they can be inferred from and Thus, we write

To understand think of as a proposition, and as a “proof constructor” for In other words, for each is a proof term for Given a path says that transporting the proof term across the path yields a proof which is equal to

Observe that when and the type judgmentally reduces to This type is inhabited by

Definition: Let A homotopy from to is a dependent function of the following type:

Definition: Let A quasi-inverse of is an ordered triple consisting of a function and two homotopies and The type of quasi-inverses of is denoted by:

Definition: Let We say that is an equivalence from to if there exists an ordered quadruple consisting of two functions and two homotopies and The type of such ordered quadruples is denoted by:

The type of equivalences from to is denoted by:

Thus, we write to indicate that is an equivalence from to

Theorem: For each the types and are logically equivalent. In other words, the types and are inhabited.

The direction is easy, since we can send

Definition: The nuclear norm (a.k.a. trace norm) of a matrix is the quantity

where denotes the singular value of

The nuclear norm is a powerful tool in computational mathematics because it can be interpreted as a convex relaxation of the rank function. (More precisely, the nuclear norm is the convex envelope of the rank function when restricted to the unit ball in the operator norm.) In matrix recovery problems, adding a penalty term proportional to the nuclear norm encourages a solver to find low-rank solutions.

The nuclear norm also has a special interpretation for Hermitian matrices. Recall that the singular values of a Hermitian matrix are the absolute values of its eigenvalues, i.e., It follows that when is Hermitian. Moreover, when is both Hermitian and positive (semi)definite, we can drop the absolute value bars and write . This property is, presumably, the justification for the name “trace norm.”

Because the definition of depends only on the singular values of it is clear that the nuclear norm is unitarily invariant, i.e., that for all and (Here, denotes the set of unitary matrices.) However, it is not at all evident that the nuclear norm is actually a norm! In order to prove this, we will first establish the following variational characterization:

Theorem: For all square matrices

where denotes the matrix obtained by taking the absolute value of each entry of Moreover, the maximum is achieved whenever is a diagonal matrix.

Proof: Let denote a singular value decomposition of where and is a diagonal matrix of singular values. Then

We now must show that for all Observe that as ranges over all unitary matrices, the product also ranges over all unitary matrices. The same holds for and Thus, under the change of variables it suffices to show that

for all This is proven by the following computation:

Corollary: If is a Hermitian matrix, then

Moreover, the maximum value is achieved when diagonalizes

Theorem: The nuclear norm is a norm on

Proof: Positive-definiteness and absolute homogeneity are clear, since the singular values are nonnegative and absolutely homogeneous functions of , and every nonzero matrix has at least one nonzero singular value. To establish subadditivity, we perform the following computation:

In the third step, we use the entrywise inequality

Definition: A density matrix is a Hermitian positive semidefinite matrix with trace 1.

Earlier, we observed that the nuclear norm is unitarily invariant because it only depends on the singular values of Conversely, every unitarily invariant norm can be written as a function of the singular values, since we can always use the SVD to write

Consider a quantum system with Hamiltonian eigenvalues and corresponding eigenvectors Its partition function is

Recall that by introducing a resolution of the identity, we can verify the identity in any basis. For example, in the position basis, we have:

Now, if the system consists of distinguishable particles in spatial dimensions, then the Hamiltonian has the form

In the classical limit the canonical commutation relation vanishes. It follows that the kinetic energy and potential energy operators commute, and we can write:

In the penultimate step, we use the fact that

This calculation explains the appearance of Planck's constant in the formula for the classical partition function. In classical thermodynamics, the value of is immaterial, since we always compute physically observable quantities by taking a derivative of the logarithm of However, the quantum-mechanical perspective shows that is the correct measure of phase-space volume.

Cayley's Theorem: Every group is isomorphic to a subgroup of
Proof: Send each element to the permutation of defined by
Strong Cayley Theorem: For each subgroup of a group there exists a homomorphism where whose kernel is the largest normal subgroup of contained in
Remark: There is no requirement that or be finite. This statement still makes sense when is interpreted as a cardinal number.

Proof: Consider the action of by left multiplication on the set of left cosets of This action identifies each element with a permutation of a set containing elements, i.e., an element of This defines the desired homomorphism

Observe that if and only if left multiplication by fixes every left coset of i.e.,

This shows that Now, any normal subgroup satisfying also satisfies for all which proves

Note that Cayley's theorem is the special case of the strong Cayley theorem.

Corollary: If a finite group has a subgroup whose index is the smallest prime factor of then

Proof: Let denote the homomorphism induced by the left action of on the collection of left cosets of The strong Cayley theorem tells us that is the largest normal subgroup of contained in We will prove that, in fact,

Since we already know to establish equality, it suffices to prove We will do this by proving that every prime power which divides also divides

Observe via the first isomorphism theorem that is isomorphic to a subgroup of By Lagrange's theorem, divides But since we may conclude that divides In particular, cannot have any prime factors

Now, let be a prime for which divides where Since we know that and must collectively share factors of However, cannot be a prime factor of since any prime factor of is also a prime factor of and by hypothesis, Therefore, divides

This generalizes the fact that every subgroup of index 2 is normal. Thus, to prove that a group of order 99 is not simple, it suffices to find a subgroup of index 3.

Every group acts on itself by conjugation:

Orbit-Stabilizer Theorem: Let For any

where denotes the orbit of under and denotes the stabilizer subgroup of

Remark: This statement holds in cardinal arithmetic regardless of whether is finite.
Proof: Elements of the orbit correspond to cosets of the stabilizer in Hence, equals the number of cosets multiplied by the size of each coset.
Burnside's Lemma: Let with both and finite. Then

where denotes the number of distinct orbits, and denotes the set of elements of which are fixed by a given

Proof: Consider the quantity where denotes the Iverson bracket (i.e., 1 if otherwise 0). On one hand, we can write

However, by changing the order of summation, we can apply the orbit-stabilizer theorem to write

To complete the proof, observe that since each distinct orbit contributes a total of 1 to this sum.

Cauchy's Theorem: Let be a finite group. If a prime number divides then contains an element of order

Proof: Consider the set

Observe that since the first elements of can be chosen freely; the final element is then uniquely determined. Since (by hypothesis) divides it follows that also divides

Let by cyclic permutations. (Note that implies .) This action partitions into orbits, and since the only subgroups of are itself and the trivial group, the orbit-stabilizer theorem implies that every orbit has size 1 or Since the number of orbits of size 1 must be divisible by There is at least one such orbit, given by the element so we can conclude that there are at least others. Every such orbit yields an element of of order

Sylow's First Theorem: Let be a finite group, and a prime number. If divides for any then contains a subgroup of order
Proof:

A function is harmonic if it is annihilated by the Laplacian: Think of as measuring the difference between and the average value that assumes at nearby points. At a local minimizer the quantity is positive, because the value of at any point near exceeds

Note that if is harmonic, then so is any partial derivative of since partial derivative operators commute with the Laplacian.

Mean Value Property: Let be a harmonic function, and let denote the normalized surface area measure on the unit sphere embedded in For any and

Proof: The claim is trivial for since the integrand is identically We complete the proof by showing that the RHS is constant as a function of To do this, we differentiate and apply the divergence theorem:

Here, denotes the volume bounded by and denotes a scalar multiple of Lebesgue measure on Since was assumed to be harmonic, the integrand vanishes identically.

Corollary: The value of a harmonic function at a point is also equal to its average value over a ball centered at that point. (The sphericity of the ball is important; this statement does not hold for an arbitrarily-shaped neighborhood.)

The class of harmonic functions is remarkably well-behaved. For example, a converse to the mean value property also holds: any continuous function which satisfies the mean value property must be harmonic. Moreover, all harmonic functions are real analytic. (This is proven by establishing bounds on the growth rate of the derivative of a harmonic function.) As a corollary, harmonic functions are uniquely determined by their values in an arbitrarily small neighborhood (or, in fact, a germ at a single point).

Liouville's Theorem: Any function which is harmonic on all of is either constant or unbounded (above and below).

One might naturally ask whether Laplace's equation admits weak solutions which are not classical solutions. This is not the case.

Weyl's Lemma: Every weak solution of Laplace's equation is a classical solution up to redefinition on a set of measure zero.

In this section, an operator is a map that takes a function and returns a function In this context, we call the input signal and the output signal. We will particularly be interested in operators that are linear and translation-invariant, abbreviated LTI. (Technically, we ought to say “translation-equivariant,” but this slightly imprecise terminology is deeply entrenched.)

If we know what an LTI operator does to a function, then we also know what it does to any linear combination of translates of said function. Now, any function can be thought of as a linear combination of unit impulses so if we know the impulse response then we know how acts on any function. In particular, where denotes convolution.

We interpret as the amount of the input signal which is reverberated after a time delay of We say that the operator is causal if for all this means that does not look into the future (i.e., the input signal at a future time cannot affect the output signal at a past time).

It is a remarkable fact that complex exponential signals are eigenfunctions of any LTI operator The corresponding eigenvalue can be computed from the impulse response as follows:

Thus, if we define the function by

then we see that is the corresponding eigenvalue for the eigenvector of the LTI operator with impulse response The function is called the Fourier transform of and is commonly denoted either by addition of a hat or a calligraphic as in This is clearly well-defined for and we will later show that can be extended to

It is natural to ask whether a function is uniquely determined by its Fourier transform This is the case, and in fact, there is an explicit inversion formula for computing from

This defines the inverse Fourier transform [TODO: Prove the Fourier inversion formula for the case where and are both in .]

Suppose we observe a stochastic process which is the sum of a signal component and a noise component

Problem: How many edges does an icosahedron have?
Note: An icosahedron is a polyhedron (3D shape) that has 20 triangular faces.

Obviously, we could solve this problem by drawing a picture of an icosahedron and counting the number of edges we see. However, we can also solve this problem in a purely combinatorial fashion, by counting something in two different ways.

Solution: Let's say that an edge-face pair is an ordered pair consisting of a face and an edge which is incident to . Since the faces of an icosahedron are triangular, each contributes three edge-face pairs. Hence, the number of edge-face pairs is Similarly, each edge contributes two edge-face pairs, so the number of edge-face pairs is This shows that so an icosahedron has 30 edges.

This is a special case of a more general principle.

Theorem: Let be a bipartite graph with vertices on the left and vertices on the right. Suppose the vertices on the left all have degree and the vertices on the right all have degree Then
Proof: Both and count the number of edges in

Definition: A connected (pseudo-)Riemannian manifold is said to be extensible if can be isometrically embedded as an open submanifold of a strictly larger connected (pseudo-)Riemannian manifold. Otherwise, we say that is inextensible or maximal.

Hopf–Rinow Theorem: Let be a connected Riemannian manifold. The following are equivalent:

1. is geodesically complete.
2. is geodesically complete at some point
3. has the HeineBorel property.
4. is complete as a metric space.

Moreover, if any of these statements hold, then there exists a length-minimizing geodesic joining any two points of

We denote by the class of ordinal numbers. Following von Neumann's construction, we identify each ordinal number with the transitive set of all smaller ordinals, well-ordered with respect to the set membership relation

Definition: A surreal number is a function from an ordinal number to the 2-element set The class of surreal numbers is denoted by

Clearly, there are at least as many surreal numbers as ordinal numbers, so is a proper class. We will later see that embeds into in a natural way.

Notation: The domain of a surreal number is called its birthday, denoted by

Conway imagines that the surreal numbers are constructed inductively, with the empty sequence being created on “day zero.” (Recall that in von Neumann's construction, the empty set is the ordinal number zero.) On day one, its descendants and are born via concatenation of a or Then, on day two, gives birth to and , while gives birth to and Continuing in this fashion, new surreal numbers are born on the th day. This process continues transfinitely, with new surreal numbers born on each day

Definition: We say that a surreal number is simpler than a surreal number denoted by if is an initial segment of We write to mean and
Theorem: Let be sets of surreal numbers. (Note that this notation indicates that and are subsets of as opposed to subclasses.) If then there exists a unique simplest surreal number satisfying

Proof sketch: If then we take the empty sequence. If is non-empty but is empty, then we take a sufficiently long sequence of similarly, if is empty but is non-empty, then we take a sufficiently long sequence of

If both and are non-empty, then there exists a least ordinal such that any member of differs from any member of somewhere in their first places. (Indeed, if there were no such then there would exist a member of that agrees with a member of in every place, contradicting our assumption that ) Because we picked to be the least such ordinal, for every there exist members of and who agree in their first places. This uniquely determines for all

We now consider two cases. If is a limit ordinal, then we have a well-defined surreal number If and then works; otherwise, either or but not both (since ). If then we append just enough signs to get a number that exceeds similarly with signs for

On the other hand, if is a successor ordinal, then we have a well-defined surreal number Let denote the set of tails of in and let denote the set of tails of in Note that neither nor are empty, since by construction, has tails in both and (even if the only tail is the empty tail). By the definition of for any and we have If neither nor contains the empty sequence, then and so works. Otherwise, suppose contains the empty sequence. We append a single to to exceed followed by just enough signs to slip under

Notation: Given with we denote by the simplest surreal number satisfying We call the pair of sets a representation of and write
Theorem: Let with and let be the least ordinal for which for all Then

Note that the inequality in the statement of the preceding theorem cannot be replaced by Indeed, if and then

Definition: The canonical representation of a surreal number is the representation where is the set of initial segments of which are less than itself, and is the set of initial segments of exceeding
Definition:
Definition: We say that two positive surreal numbers are in the same Archimedean class, denoted by if there exists an integer such that and On the other hand, if for all we write or equivalently,

For any two positive surreal numbers we either have or

Definition:

Theorem: Let .

• if and only if
• If is an ordinal, then agrees with ordinal exponentiation in the usual sense.
Theorem: A positive surreal number is the simplest element of its Archimedean class if and only if it can be written in the form for some .

Theorem: Every surreal number admits a unique representation in the form

where is a strictly decreasing sequence of surreal numbers, and is a sequence of nonzero real numbers. If is an ordinal, then this representation coincides with the Cantor normal form.

Definition: This representation is called the Conway normal form of
Proof sketch: The technique is somewhat analogous to the algorithm for computing continued fractions. Repeatedly extract the largest Archimedean class from and show that this procedure must terminate at some transfinite stage.

It follows from the existence of Conway normal forms that every surreal number can be uniquely written as the sum of a real number (the term), an infinitesimal number (the terms with negative exponents), and a purely infinite number (the terms with positive exponents).

Definition: For let denote the real-valued coefficient of in the Conway normal form of , with if does not occur. We say that a sequence of surreal numbers converges to denoted by if for all

If, for each fixed the sequence is eventually constant, then we say that converges absolutely to

Note that any (possibly multivariate) formal power series with real coefficients, even if wildly divergent on real-valued arguments, is absolutely convergent when fed infinitesimal surreal arguments.

This suggests a powerful design pattern for extending real functions to the surreals.

• Split a surreal number into infinitesimal, real, and purely infinite parts.
• Use any power series, even a divergent power series, to define on infinitesimals.
• Define as usual on the reals.
• Find an inductive definition of on purely infinite numbers, say, of the form

This strategy is used in this paper to extend the natural logarithm to

There exists a field called constructive gravity dedicated to the following question: given a collection of equations describing the dynamics of a matter field, what is the corresponding diffeomorphism-invariant theory of gravity? That is, what tensor fields are required to describe the spacetime geometry to which this matter couples, and what field equations do those tensor fields satisfy?

The methods of constructive gravity can describe more general spacetime geometries than the Lorentzian metrics of standard general relativity. For example, there are so-called area metric manifolds which, instead of carrying a metric which measures tangent lengths, carry an area metric which measures tangent areas. Such a geometry is required to describe a spacetime in which light can spontaneously birefract.

Definition: Let be a smooth manifold. A tangent vector at a point is an -linear map satisfying the Leibniz identity

The set of all tangent vectors at the point is called the tangent space at denoted by

© David K. Zhang 2016 2021