The first time I learned about the inner workings of the lfe package in R, I was told to read a paper by mathematician Israel Halperin called “The product of projection operators” (good luck finding it though. The only reason why I was able to access it is because I was sent a pdf at one point). This paper is concerned with generalizing a technique known as the method of alternating projections. Von Neumann first proved that by successively projecting a vector into two affine spaces, you converge on the projection into their intersection. Halperin managed to extend this result to remarkable generality, only requiring some minimal Banach space structure. I dutifully read this paper, but remember very little about the details. The level of generality was simply more than I really cared to take the time to understand, especially since my interest in it only went as far as understanding how an R package works.
It occurred to me recently that while the Halperin paper is difficult to grasp even at a conceptual level, the original von Neumann approach is so simple that extending it to an arbitrary number of projections in Euclidean space should be doable by fairly elementary means. Like most things I write, this probably won’t be of any use for the practitioner, except maybe to impart a sense of awe the next time a 20-way felm runs in mere minutes.
First some background. When I was first taught about the Frisch-Waugh-Lovell theorem, it was mostly framed as a formal justification for “adding controls” to linear regression models as a way to remove the confounding effects of other covariates. LFE uses the theorem in a completely different way: as a computational tool to speed up regressions. As a review, suppose we have a regression model:
Let be the partitioned design matrix. Then by the usual OLS, we have:
One version of Frisch-Waugh-Lovell tells us that an alternative expression for is given by:
where
is often known as the “residual maker” matrix because when applied to a vector
, it returns the residuals from an OLS regression of
on
. I leave a few elementary properties as an exercise:
Exercise 1:
is idempotent, i.e.
is symmetric, i.e.
- Conclude from above that
where
is an orthonormal matrix and
- Additionally, if
is not collinear, then there are exactly
1’s in the above.
(Side note: typing that matrix was not fun).

In particular, because of the above properties, is a projection matrix and therefore has the special property there is some
dimensional subspace of
,
such that
and
. In fact, the subspace is exactly the span of the first
columns of
(uh, it might be the first rows, I can never remember this).
The sorts of models lfe excels at are ones of the form
where each has the special property that it is a
dimensional vector such that exactly one entry is 1 and the rest are 0. Intuitively, we can think of the
as categories such that belonging in a given category has the effect of shifting the regression function up or down by some amount
. The shifts associated with the
‘s are known as fixed effects, and econometricians often just write as shorthand:
Exercise 2: Show that where
is the average value of
among those observations where
has a 1 in its
component. This is known as de-meaning
The fixed effects are usually not the parameters of interest, so we would like to estimate without estimating all of the fixed effects if possible. The power of lfe comes from its ability to abuse this fact. Demeaning in this manner is considerably more efficient than building out the entire design matrix and inverting its covariance matrix (or more realistically, doing QR decomposition with something like a Householder transformation or a Given’s rotation for numerical stability). In the case where
, we now have a complete description of what lfe does. It simply demeans and then regresses using the de-meaned variables. What happens when
? In this case, we need to project into the space associated with
. Unfortunately, it is not obvious how to project into this space. The demeaning formula no longer works and it is once again unclear how to do the projection without building up a huge design matrix. We will now need one more fact to finally motivate the use of the method of alternating projections.
Exercise 3: Show that if projects into
, then
is the orthogonal projection into
.
So at this point we know how to project into each individual easily, but we want to efficiently project into the intersection. In fact, the von Neumann result already shows us that if
, then
.

The natural next question then is: does ? The answer will turn out to be yes. Let us first express this in slightly more abstract terms. Essentially, we want to prove the following:
Theorem 4: Suppose are orthogonal projection matrices into spaces
. Then
where
is the orthogonal projection into the space
.
Orthogonality is a bit annoying to deal with, so let’s begin by proving something slightly easier.
Theorem 5: Suppose are projection matrices into spaces
. Then
where
is a projection into the space
.
For notational convenience, call and
. First, note that
for all
, so
. We next show that
is well defined for vectors not in
and has image
. This will show that
is a projection. Let
be
dimensional and extend an orthonormal basis on
,
to an orthonormal basis on
given by
. It will suffice to show that any
has
for some
(this will simultaneously show the existence of
and that its range is
. Recall that the operator norm for a matrix is given by
. The idea for the rest of the proof is now fairly straightforward. Because
are all projections, the size of
can never go up. Furthermore, because we are in a finite dimensional space (the lack of finite dimensions is why the Halperin paper has to use much less elementary techniques), we can bound this shrinkage factor away from 1. But in doing so, we immediately get an exponential estimate on the rate of convergence.
Exercise 6: For a nontrivial projection matrix , show that
and the supremum is attained if and only if
.
Since each is not in
for some
, then there is some first index such that
. Let
. Then
. Let
. It is then clear that for any vector
, we have
. Now fix some such
and consider the sequence
. If
for some
, then we must have
for all
by the above discussion. But then we are immediately done since then
. With the orthonormal basis we constructed before, we must have
(the
‘s on the RHS are superscripts, not exponents). Define
and
.
Exercise 7: Show that and
. Use this to conclude that
for some
as desired, completing the proof of theorem 5.
We now add the orthogonality back in. All we need to do is show that if each is orthogonal, then
. It will suffice to show that
for each
(since in this case,
using the notation of exercise 7). But another change of basis makes this easy. Extend
to an orthonormal basis
where
is the rank of
, and
. Then
by orthogonality of
. Additionally,
(since it is orthogonal to
). Then
.
In fact, as mentioned earlier, we have shown something even more remarkable.
Corollary 8: .
The rate of convergence of the alternating projections is at least linear. Roughly speaking, linear convergence means that there are some such that if we want
digits of precision, then we need to iterate at most
times and at least
times. Assuming the
from the above proof stays constant, we can now analyze the time complexity of felm fairly closely. Let
be the number of fixed effects,
be the desired number of digits of precision,
be the number of parameters of interest, and
be the size of the data. Demeaning is an
operation and we have to do it
times per iteration. By the linearity of convergence, we have that we need to project
times. At the end, we have to do normal OLS on the demeaned data, which is
Thus, we have that the time complexity of felm is
, so adding fixed effects incurs only a linear cost. By contrast, naive OLS on the entire system of fixed effects directly is
, so it will grow quadratically in the number of “ways” of our fixed effects.
In hindsight, none of these facts should be particularly surprising. The visual intuition from figure 2 is the main insight, and we’ve just been playing around the symbols to fill in the technical details. Even the rate of convergence should not be terribly surprising. Alternating projections just repeatedly multiplying a matrix that is “<1” in some sense by itself repeatedly. Doing this with a real number <1 gives exactly linear convergence, so it is unsurprising that this extends to matrices with the proper construction.
I conclude by musing that this sort of proof is really only something that makes sense to publish in some sort of informal setting such as a blog. For those more interested in the theoretical aspects of the projections, there is little reason to not just read the Halperin paper. It doesn’t draw on any particularly deep math, is considerably more general, and is much more representative of the functional analytic tools of the trade necessary for further exploration. On the other hand, for the practitioner, there is little value added in knowing the gory details of the proof. In this case, I doubt knowledge about why alternating projections works would even be useful for debugging unexpected lfe behavior. Nonetheless, I hope I have impressed upon you some of the beauty and elegance of the inner workings of my most used R package.
Bonus Exercise: The derivation of FWL purely from matrix algebra is tedious/intimidating, but straightforward. This is not my favorite proof of FWL (my favorite can be found in Theorem 2.1 here), but I think it is still informative in its own right and worth doing exactly one time in your life. I will outline it here. Using the notations from earlier, we have:
The main difficulty in getting something more tractable is inverting the partitioned matrix. Fortunately, manipulating partitioned matrices is an extremely well studied problem. Let
and define and
. Show the Shur complement formula:
(Hint: the proof is not hard, only tedious. The only trick is to write
and then solve/simplify). Now, use the Shur complement formula to show
(Hint: the idempotency of is your friend). Finally, use idempotency once more to get the form of the FWL theorem I stated in the post: