This page contains a disorganized collection of ideas that I'm learning and thinking about, but have yet to coalesce into a coherent narrative. Feel free to look around — you might find something interesting!
Theorem: Let be (non-strictly) increasing functions. Then
First Proof: Observe that both sides of the desired inequality can be written as double integrals. For reasons that will later become apparent, we symmetrize the integrands with respect to the integration variables and
This transforms the desired inequality into:
We now move all terms to the RHS in order to obtain a miraculous factorization.
(This is why we had to symmetrize the integrands earlier—otherwise, we would have ended up with instead of .) Now, because and are both increasing, whenever we have and Similarly, whenever we have and In either case, is always non-negative because it is a product of two numbers with the same sign. Hence, the integral of this expression is also non-negative, and the desired inequality holds. ☐
Second Proof: It is also possible to give an alternative argument that avoids symmetrization by invoking the mean value theorem. Specifically, by applying the mean value theorem to the function defined by
we may conclude that there exists satisfying Now, the desired inequality is:
Because is an increasing function, we have:
Similarly, because is an increasing function, we have:
Note that the two sign changes at cancel, yielding for all It follows that
which is the desired result. ☐
Suppose we want to solve a system of linear equations The simplest case of this problem occurs when the matrix is symmetric and positive-definite. In this case, we can recast as a convex optimization problem for the quadratic function
Note that the positive-definiteness of guarantees that this function is strongly convex. Moreover, the unique global minimizer of this function occurs where
Now, the most straightforward way to minimize a convex function is the method of gradient descent (which is also called steepest descent in the context of solving linear equations). To perform gradient descent, we pick an initial guess (when no other information is available, is a common choice) and compute a sequence of approximate solutions by iterating
In most applications of gradient descent, the step size is determined using a “line search” procedure (i.e., a one-dimensional optimization algorithm). However, there is no need to perform line search in this case, because the optimal step size for a quadratic objective function can be calculated analytically. Given an initial point and a step direction we optimize
and differentiating with respect to as follows:
Using this optimal step size, the iteration equation for gradient descent becomes
This equation can be simplified by defining the residual vector upon which it simplifies to
This completes the derivation of the method of gradient descent (a.k.a. steepest descent).
Gradient descent is an extremely natural idea — it's the first thing you'd come up with after recasting as a convex optimization problem — but it has a significant drawback. Namely, it often takes multiple steps in the same direction This is wasteful, since a step followed by a another step can be combined into a single step
Wouldn't it be nice if we got the step size right the first time? That is, whenever we take a step in a direction we should choose the step size in such a way that we never need to step in the same direction again. All future step directions should be orthogonal to
In mathematical terms, we would like to ensure that after we take the step the remaining displacement between the approximate solution and the true solution is orthogonal to the step direction This yields an equation which determines the step size as follows:
This equation has a serious problem. It requires us to know in order to compute the step size but the whole point of solving is to find If we had in hand, then we would already be done! Unfortunately, our dream of enforcing orthogonality between and turned out to be unworkable.
But wait! We derived this equation for by demanding that be orthogonal to under the standard Euclidean inner product There is no reason to prefer this inner product over any other, and our idea can be saved by switching to another one. In particular, because is a symmetric positive-definite matrix, it defines the -inner product
If we ask for to be -orthogonal to then we instead obtain the following equation:;
Remarkably, this is the same value of that we derived earlier by minimizing It contains no occurrences of and can be computed purely from quantities we already know.
In the following section, let be a symmetric positive-definite matrix, and let be arbitrary vectors. We define the sequences and by the initial conditions and the following mutual recurrence relations:
These sequences are well-defined until the first for which at which point division by zero occurs in the definition of
Lemma: Each step direction is a linear combination of the residual vectors In particular,
We first verify that the theorem holds for and Since the only nontrivial claim to verify is that
We now proceed by induction in two stages. Suppose that the theorem holds for the vectors and We will show that the theorem also holds for and To prove the theorem for we need to verify the following four claims:
Let The first inductive claim follows from a direct calculation using the inductive hypothesis for and the formula for obtained earlier.
Now, let The preceding result implies that We obtain the second inductive claim by combining this with the iteration equation
Since we may conclude that We now apply the iteration equation to this result to obtain the third inductive claim. (For we set and )
Finally, the fourth inductive claim follows by combining the iteration equation with a special case of the second inductive claim, namely,
This completes the proof that the theorem holds for We now turn our attention to proving the theorem for This requires us to verify two additional claims:
The first inductive claim requires two cases. For we combine the iteration equation with the inductive hypotheses and to write:
On the other hand, for we apply the definition together with the inductive hypothesis
Finally, the second inductive claim only requires one case. Let From the first inductive claim, we know that It follows that:
Since we may conclude that . This completes the proof.
A decision tree on variables is a full binary tree in which each internal node is labeled with an index and each leaf is labeled a result
[TODO: Draw a picture of a decision tree.]
Every decision tree on variables defines a Boolean function as follows:
[TODO: Illustrate examples of decision tree computation.]
Given a Boolean function we say that a decision tree computes if for every bit string
The depth of a decision tree is the length of a longest path in from the root to any leaf. Here, the length of a path is defined as the number of edges it traverses, so that a tree consisting only of a single leaf has depth zero.
We can construct a decision tree that computes any Boolean function by simply writing the truth table for along the bottom row of a complete binary tree of depth This is a highly inefficient decision tree having leaves and internal nodes. The nontrivial question is whether a given function admits a smaller decision tree.
If a Boolean function can be computed by a very shallow decision tree, then only a few bits of are needed to deduce the value of This suggests that is, in some sense, a very simple function. We formalize this notion in the following definition:
The decision tree complexity of a Boolean function denoted by is the minimum depth of any decision tree that computes
Let and The restriction of to denoted by is the bit string where denotes the enumeration of in increasing order.
For example, if and then Similarly, and is the empty string.
Let and A certificate for at is a set with the following property: for all if then
Certificates always exist, since the full set is always a certificate for any at any The nontrivial question is whether a given function admits smaller certificates. Intuitively, small certificates suggest that a function has low complexity, since only a small number of bits of are needed to deduce the value of
The certificate complexity of a Boolean function at a bit string is the smallest cardinality of any certificate for at
The certificate complexity of is
This definition introduces two distinct but related notions: “certificate complexity of f” and “certificate complexity of f at x.” This is analogous to the separate but related notions of “continuity of f” and “continuity of f at x” in topology.
Theorem: for any Boolean function
Let be a minimal-depth decision tree that computes For any bit string the set of indices queried when computing ☐
In this section, we denote by the unit circle in the complex plane, i.e.,
Observe that is a group under multiplication, and that the group-theoretic inverse of is simply its complex conjugate Moreover, is a topological group under the subspace topology inherited from the usual topology on
The Pontryagin dual of a locally compact abelian group is the topological group of continuous group homomorphisms from to the circle group endowed with the operation of pointwise multiplication and the topology of uniform convergence on compact sets.
Theorem: If is a locally compact abelian group, then so is
Having shown that is itself a locally compact abelian group, it now makes sense to consider the double Pontryagin dual
Pontryagin Duality Theorem: For any locally compact abelian group the evaluation homomorphism is an isomorphism of topological groups (i.e., a group isomorphism and a homeomorphism).
We normally suppress the arguments and of since these can be inferred from and Thus, we write
As with we normally suppress the arguments and of since they can be inferred from and Thus, we write
To understand think of as a proposition, and as a “proof constructor” for In other words, for each is a proof term for Given a path says that transporting the proof term across the path yields a proof which is equal to
Observe that when and the type judgmentally reduces to This type is inhabited by
Let A homotopy from to is a dependent function of the following type:
Let A quasi-inverse of is an ordered triple consisting of a function and two homotopies and The type of quasi-inverses of is denoted by:
Let We say that is an equivalence from to if there exists an ordered quadruple consisting of two functions and two homotopies and The type of such ordered quadruples is denoted by:
The type of equivalences from to is denoted by:
Thus, we write to indicate that is an equivalence from to
Theorem: For each the types and are logically equivalent. In other words, the types and are inhabited.
The direction is easy, since we can send
The nuclear norm (a.k.a. trace norm) of a matrix is the quantity
where denotes the singular value of
The nuclear norm is a powerful tool in computational mathematics because it can be interpreted as a convex relaxation of the rank function. (More precisely, the nuclear norm is the convex envelope of the rank function when restricted to the unit ball in the operator norm.) In matrix recovery problems, adding a penalty term proportional to the nuclear norm encourages a solver to find low-rank solutions.
The nuclear norm also has a special interpretation for Hermitian matrices. Recall that the singular values of a Hermitian matrix are the absolute values of its eigenvalues, i.e., It follows that when is Hermitian. Moreover, when is both Hermitian and positive (semi)definite, we can drop the absolute value bars and write . This property is, presumably, the justification for the name “trace norm.”
Because the definition of depends only on the singular values of it is clear that the nuclear norm is unitarily invariant, i.e., that for all and (Here, denotes the set of unitary matrices.) However, it is not at all evident that the nuclear norm is actually a norm! In order to prove this, we will first establish the following variational characterization:
Theorem: For all square matrices
where denotes the matrix obtained by taking the absolute value of each entry of Moreover, the maximum is achieved whenever is a diagonal matrix.
Let denote a singular value decomposition of where and is a diagonal matrix of singular values. Then
We now must show that for all Observe that as ranges over all unitary matrices, the product also ranges over all unitary matrices. The same holds for and Thus, under the change of variables it suffices to show that
for all This is proven by the following computation:
Corollary: If is a Hermitian matrix, then
Moreover, the maximum value is achieved when diagonalizes
Theorem: The nuclear norm is a norm on
Positive-definiteness and absolute homogeneity are clear, since the singular values are nonnegative and absolutely homogeneous functions of , and every nonzero matrix has at least one nonzero singular value. To establish subadditivity, we perform the following computation:
In the third step, we use the entrywise inequality
A density matrix is a Hermitian positive semidefinite matrix with trace 1.
Earlier, we observed that the nuclear norm is unitarily invariant because it only depends on the singular values of Conversely, every unitarily invariant norm can be written as a function of the singular values, since we can always use the SVD to write
Consider a quantum system with Hamiltonian eigenvalues and corresponding eigenvectors Its partition function is
Recall that by introducing a resolution of the identity, we can verify the identity in any basis. For example, in the position basis, we have:
Now, if the system consists of distinguishable particles in spatial dimensions, then the Hamiltonian has the form
In the classical limit the canonical commutation relation vanishes. It follows that the kinetic energy and potential energy operators commute, and we can write:
In the penultimate step, we use the fact that
This calculation explains the appearance of Planck's constant in the formula for the classical partition function. In classical thermodynamics, the value of is immaterial, since we always compute physically observable quantities by taking a derivative of the logarithm of However, the quantum-mechanical perspective shows that is the correct measure of phase-space volume.
Cayley's Theorem: Every group is isomorphic to a subgroup of
Send each element to the permutation of defined by ☐
Strong Cayley Theorem: For each subgroup of a group there exists a homomorphism where whose kernel is the largest normal subgroup of contained in
Remark: There is no requirement that or be finite. This statement still makes sense when is interpreted as a cardinal number.
Consider the action of by left multiplication on the set of left cosets of This action identifies each element with a permutation of a set containing elements, i.e., an element of This defines the desired homomorphism
Observe that if and only if left multiplication by fixes every left coset of i.e.,
This shows that Now, any normal subgroup satisfying also satisfies for all which proves
Note that Cayley's theorem is the special case of the strong Cayley theorem.
Corollary: If a finite group has a subgroup whose index is the smallest prime factor of then
Let denote the homomorphism induced by the left action of on the collection of left cosets of The strong Cayley theorem tells us that is the largest normal subgroup of contained in We will prove that, in fact,
Since we already know to establish equality, it suffices to prove We will do this by proving that every prime power which divides also divides
Observe via the first isomorphism theorem that is isomorphic to a subgroup of By Lagrange's theorem, divides But since we may conclude that divides In particular, cannot have any prime factors
Now, let be a prime for which divides where Since we know that and must collectively share factors of However, cannot be a prime factor of since any prime factor of is also a prime factor of and by hypothesis, Therefore, divides
This generalizes the fact that every subgroup of index 2 is normal. Thus, to prove that a group of order 99 is not simple, it suffices to find a subgroup of index 3.
Every group acts on itself by conjugation:
Orbit-Stabilizer Theorem: Let For any
where denotes the orbit of under and denotes the stabilizer subgroup of
Remark: This statement holds in cardinal arithmetic regardless of whether is finite.
Elements of the orbit correspond to cosets of the stabilizer in Hence, equals the number of cosets multiplied by the size of each coset.☐
Burnside's Lemma: Let with both and finite. Then
where denotes the number of distinct orbits, and denotes the set of elements of which are fixed by a given
Consider the quantity where denotes the Iverson bracket (i.e., 1 if otherwise 0). On one hand, we can write
However, by changing the order of summation, we can apply the orbit-stabilizer theorem to write
To complete the proof, observe that since each distinct orbit contributes a total of 1 to this sum.
Cauchy's Theorem: Let be a finite group. If a prime number divides then contains an element of order
Consider the set
Observe that since the first elements of can be chosen freely; the final element is then uniquely determined. Since (by hypothesis) divides it follows that also divides
Let by cyclic permutations. (Note that implies .) This action partitions into orbits, and since the only subgroups of are itself and the trivial group, the orbit-stabilizer theorem implies that every orbit has size 1 or Since the number of orbits of size 1 must be divisible by There is at least one such orbit, given by the element so we can conclude that there are at least others. Every such orbit yields an element of of order
Sylow's First Theorem: Let be a finite group, and a prime number. If divides for any then contains a subgroup of order
A function is harmonic if it is annihilated by the Laplacian: Think of as measuring the difference between and the average value that assumes at nearby points. At a local minimizer the quantity is positive, because the value of at any point near exceeds
Note that if is harmonic, then so is any partial derivative of since partial derivative operators commute with the Laplacian.
Mean Value Property: Let be a harmonic function, and let denote the normalized surface area measure on the unit sphere embedded in For any and
The claim is trivial for since the integrand is identically We complete the proof by showing that the RHS is constant as a function of To do this, we differentiate and apply the divergence theorem:
Here, denotes the volume bounded by and denotes a scalar multiple of Lebesgue measure on Since was assumed to be harmonic, the integrand vanishes identically.
Corollary: The value of a harmonic function at a point is also equal to its average value over a ball centered at that point. (The sphericity of the ball is important; this statement does not hold for an arbitrarily-shaped neighborhood.)
The class of harmonic functions is remarkably well-behaved. For example, a converse to the mean value property also holds: any continuous function which satisfies the mean value property must be harmonic. Moreover, all harmonic functions are real analytic. (This is proven by establishing bounds on the growth rate of the derivative of a harmonic function.) As a corollary, harmonic functions are uniquely determined by their values in an arbitrarily small neighborhood (or, in fact, a germ at a single point).
Liouville's Theorem: Any function which is harmonic on all of is either constant or unbounded (above and below).
One might naturally ask whether Laplace's equation admits weak solutions which are not classical solutions. This is not the case.
Weyl's Lemma: Every weak solution of Laplace's equation is a classical solution up to redefinition on a set of measure zero.
In this section, an operator is a map that takes a function and returns a function In this context, we call the input signal and the output signal. We will particularly be interested in operators that are linear and translation-invariant, abbreviated LTI. (Technically, we ought to say “translation-equivariant,” but this slightly imprecise terminology is deeply entrenched.)
If we know what an LTI operator does to a function, then we also know what it does to any linear combination of translates of said function. Now, any function can be thought of as a linear combination of unit impulses so if we know the impulse response then we know how acts on any function. In particular, where denotes convolution.
We interpret as the amount of the input signal which is reverberated after a time delay of We say that the operator is causal if for all this means that does not look into the future (i.e., the input signal at a future time cannot affect the output signal at a past time).
It is a remarkable fact that complex exponential signals are eigenfunctions of any LTI operator The corresponding eigenvalue can be computed from the impulse response as follows:
Thus, if we define the function by
then we see that is the corresponding eigenvalue for the eigenvector of the LTI operator with impulse response The function is called the Fourier transform of and is commonly denoted either by addition of a hat or a calligraphic as in This is clearly well-defined for and we will later show that can be extended to
It is natural to ask whether a function is uniquely determined by its Fourier transform This is the case, and in fact, there is an explicit inversion formula for computing from
This defines the inverse Fourier transform [TODO: Prove the Fourier inversion formula for the case where and are both in .]
Suppose we observe a stochastic process which is the sum of a signal component and a noise component
Problem: How many edges does an icosahedron have?
Note: An icosahedron is a polyhedron (3D shape) that has 20 triangular faces.
Obviously, we could solve this problem by drawing a picture of an icosahedron and counting the number of edges we see. However, we can also solve this problem in a purely combinatorial fashion, by counting something in two different ways.
Solution: Let's say that an edge-face pair is an ordered pair consisting of a face and an edge which is incident to . Since the faces of an icosahedron are triangular, each contributes three edge-face pairs. Hence, the number of edge-face pairs is Similarly, each edge contributes two edge-face pairs, so the number of edge-face pairs is This shows that so an icosahedron has 30 edges.
This is a special case of a more general principle.
Theorem: Let be a bipartite graph with vertices on the left and vertices on the right. Suppose the vertices on the left all have degree and the vertices on the right all have degree Then
Both and count the number of edges in ☐
A connected (pseudo-)Riemannian manifold is said to be extensible if can be isometrically embedded as an open submanifold of a strictly larger connected (pseudo-)Riemannian manifold. Otherwise, we say that is inextensible or maximal.
Hopf–Rinow Theorem: Let be a connected Riemannian manifold. The following are equivalent:
is geodesically complete.
is geodesically complete at some point
has the Heine–Borel property.
is complete as a metric space.
Moreover, if any of these statements hold, then there exists a length-minimizing geodesic joining any two points of
We denote by the class of ordinal numbers. Following von Neumann's construction, we identify each ordinal number with the transitive set of all smaller ordinals, well-ordered with respect to the set membership relation
A surreal number is a function from an ordinal number to the 2-element set The class of surreal numbers is denoted by
Clearly, there are at least as many surreal numbers as ordinal numbers, so is a proper class. We will later see that embeds into in a natural way.
The domain of a surreal number is called its birthday, denoted by
Conway imagines that the surreal numbers are constructed inductively, with the empty sequence being created on “day zero.” (Recall that in von Neumann's construction, the empty set is the ordinal number zero.) On day one, its descendants and are born via concatenation of a or Then, on day two, gives birth to and , while gives birth to and Continuing in this fashion, new surreal numbers are born on the th day. This process continues transfinitely, with new surreal numbers born on each day
We say that a surreal number is simpler than a surreal number denoted by if is an initial segment of We write to mean and
Theorem: Let be sets of surreal numbers. (Note that this notation indicates that and are subsets of as opposed to subclasses.) If then there exists a unique simplest surreal number satisfying
If then we take the empty sequence. If is non-empty but is empty, then we take a sufficiently long sequence of similarly, if is empty but is non-empty, then we take a sufficiently long sequence of
If both and are non-empty, then there exists a least ordinal such that any member of differs from any member of somewhere in their first places. (Indeed, if there were no such then there would exist a member of that agrees with a member of in every place, contradicting our assumption that ) Because we picked to be the least such ordinal, for every there exist members of and who agree in their first places. This uniquely determines for all
We now consider two cases. If is a limit ordinal, then we have a well-defined surreal number If and then works; otherwise, either or but not both (since ). If then we append just enough signs to get a number that exceeds similarly with signs for
On the other hand, if is a successor ordinal, then we have a well-defined surreal number Let denote the set of tails of in and let denote the set of tails of in Note that neither nor are empty, since by construction, has tails in both and (even if the only tail is the empty tail). By the definition of for any and we have If neither nor contains the empty sequence, then and so works. Otherwise, suppose contains the empty sequence. We append a single to to exceed followed by just enough signs to slip under
Given with we denote by the simplest surreal number satisfying We call the pair of sets a representation of and write
Theorem: Let with and let be the least ordinal for which for all Then
Note that the inequality in the statement of the preceding theorem cannot be replaced by Indeed, if and then
The canonical representation of a surreal number is the representation where is the set of initial segments of which are less than itself, and is the set of initial segments of exceeding
We say that two positive surreal numbers are in the same Archimedean class, denoted by if there exists an integer such that and On the other hand, if for all we write or equivalently,
For any two positive surreal numbers we either have or
Theorem: Let .
if and only if
If is an ordinal, then agrees with ordinal exponentiation in the usual sense.
Theorem: A positive surreal number is the simplest element of its Archimedean class if and only if it can be written in the form for some .
Theorem: Every surreal number admits a unique representation in the form
where is a strictly decreasing sequence of surreal numbers, and is a sequence of nonzero real numbers. If is an ordinal, then this representation coincides with the Cantor normal form.
This representation is called the Conway normal form of
The technique is somewhat analogous to the algorithm for computing continued fractions. Repeatedly extract the largest Archimedean class from and show that this procedure must terminate at some transfinite stage.☐
It follows from the existence of Conway normal forms that every surreal number can be uniquely written as the sum of a real number (the term), an infinitesimal number (the terms with negative exponents), and a purely infinite number (the terms with positive exponents).
For let denote the real-valued coefficient of in the Conway normal form of , with if does not occur. We say that a sequence of surreal numbers converges to denoted by if for all
If, for each fixed the sequence is eventually constant, then we say that converges absolutely to
Note that any (possibly multivariate) formal power series with real coefficients, even if wildly divergent on real-valued arguments, is absolutely convergent when fed infinitesimal surreal arguments.
This suggests a powerful design pattern for extending real functions to the surreals.
Split a surreal number into infinitesimal, real, and purely infinite parts.
Use any power series, even a divergent power series, to define on infinitesimals.
Define as usual on the reals.
Find an inductive definition of on purely infinite numbers, say, of the form
This strategy is used in this paper to extend the natural logarithm to
There exists a field called constructive gravity dedicated to the following question: given a collection of equations describing the dynamics of a matter field, what is the corresponding diffeomorphism-invariant theory of gravity? That is, what tensor fields are required to describe the spacetime geometry to which this matter couples, and what field equations do those tensor fields satisfy?
The methods of constructive gravity can describe more general spacetime geometries than the Lorentzian metrics of standard general relativity. For example, there are so-called area metric manifolds which, instead of carrying a metric which measures tangent lengths, carry an area metric which measures tangent areas. Such a geometry is required to describe a spacetime in which light can spontaneously birefract.