Interactions are hard.

I take a break from my usual pure theory to present something a little more applied. Suppose you estimate the OLS model

Y = \alpha + \beta_d D + \beta_x X + \beta_i D \cdot X + \varepsilon

where D is a dummy variable and X is continuous. For conceptual simplicity, suppose that the conditional expectation function truly as specified above and that the errors are homoskedastic. Upon estimating this model, you find out that not only is the interaction term not statistically significant, but so is \beta_d. This is a bit surprising since if you estimated the model

Y = \alpha + \beta_d D + \varepsilon

you find that \beta_d is highly significant. Additionally, the point estimates are totally different anyways. This points to the first subtlety of interpreting interaction coefficients. In the top model, in population, we have that

\beta_d = \mathbb E[Y | X = 0, D = 1] - \mathbb E[Y | X = 0, D = 0]

By contrast, in the bottom regression, we have

\beta_d = \mathbb E[Y | D = 1] - \mathbb E[Y | D = 0]

Even if we go so far as to assume that the population CEF is correctly specified by the interactions model, under unequal slopes, these two values are not in general the same. In fact, the only way they are the same is if either

  1. \mathbb E[X|D] = 0 for D = 0,1
  2. \beta_i = 0 in population
  3. Both of the above fail, but in a way that happens to just cancel out.

Now, suppose you find yourself in the situation (as I did) where variation in X was assigned according to an experiment. The experiment worked in the following way. First, a proportion p of all units are chosen at random from the population to be part of the experiment. Those units in the experiment are randomly assigned a value of X from some set [a,b] \ni c (in fact, assume \mathbb E[X] = c, which makes the analysis somewhat simpler in a bit). The rest of the units were assigned by default to a value of c. You now first run the model only on those units tagged as in the experiment. When you run the interacted model, you find that estimate of \beta_d has a high variance. Your idea is then to use all of the other units to decrease the variance of the estimator. Does this work? To my surprise, the standard errors hardly changed at all.

Since a model with interaction terms is equivalent to running OLS separately on each subsample, it suffices to analyze the standard errors of the intercept term on a single OLS model. I assume sampling of the following form. There is a fixed number n_1 of observations where X varies. Additionally, we have some sample n_2 of observations where X is fixed to be E[X] from the first sample, and \varepsilon is drawn from the same distribution as the first sample (this is why it is important that X is determined experimentally). What is the asymptotic behavior of the variance as n_2 \to \infty? Recall that for homoskedastic (and correctly specified) OLS, the variance estimator is given by

\hat\sigma^2_n\left(\mathbb X'\mathbb X\right)^{-1} = \dfrac{\hat\sigma^2_n}{n\sum X_i^2 - n^2 \bar X^2}\begin{pmatrix}\sum X_i^2 & - n \bar X \\ -n \bar X & n\end{pmatrix}

We are in particular interested in the 1,1 coordinate of the above matrix which I rewrite:

\hat\sigma^2_n \frac1n \dfrac{\frac1n \sum X_i^2}{\frac1n \sum X_i^2 - \bar X^2}

Denote by V_1 the variance of X in the original n_1 observations. Since sample variances and moments are variances and moments in their own rights, we can use the law of total variance to show that the above expression will converge in probability to

\hat\sigma^2_n\frac1n \dfrac{\mathbb E[X]^2}{\frac{n_1}{n_1+n_2}V_1} = \theta_p(1)

assuming \mathbb E[X] \neq 0 (the numerator comes from the assumption that the n_2 additional data points identically have X = \mathbb E[X]) where a sequence of random variables A_n is said to be \theta_p(f(n)) if both \frac{A_n}{f(n)} = O_p(1) and \frac{f(n)}{A_n} = O_p(1) (and A_n is O_p(1) means that it is “bounded in probability”, or \mathbb P(A_n \geq M) \to 0 as n\to\infty for some fixed, finite M). This is just a fancy way of saying that the random variable converges to a constant. In fact, in the above, since I assume my original sample of X_i to be fixed from the perspective of our asymptotics, the only non-constant stems from the uncertainty in estimating the standard errors. Therefore, I find exactly the result that increasing the sample past a certain point hardly changes the standard errors. This seems counterintuitive at first, but after some reflection, it seems to simply be a stark reminder that intercepts and interactions are not what we expect them to be.

Leave a comment