Splitting Linear Models

Last Thursday I watched Wolfgang Viechtbauer’s stream on Twitch. Wolfgang spoke about linear regression. (If you’re interested in the results: Wolfgang lists the R-code he wrote during the stream at https://www.wvbauer.com/doku.php/live_streams.)

During the stream we had an example with one categorical and one numeric predictor and we built a model with interaction between these two predictors. So the result is a straight line for each value of the categorival group.

The question was if it makes a difference if you build a simple linear model for each group.

So let’s look at it.

Example

Let’s start and build some sample data:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


options(tidyverse.quiet = TRUE)
library(tidyverse)
parameters <- tribble(
  ~class_name, ~intercept, ~slope, ~sd,
  "A", 5, 0.5, 5,
  "B", 6, -0.5, 5,
  "C", 7, 1, 5, 
)

generate_class <- function(n, class_name, intercept, slope, sd) {
  start <-  0
  end   <- 10
  values <- tibble(
    class_name = rep(class_name, times = n),
    x          = runif(n, start, end),
    y          = intercept + x * slope + rnorm(n, sd)
  )
  
  return(values)  
}

set.seed(42)
data <- pmap_dfr(parameters %>% mutate(n = 20), generate_class)

So we get three clouds of points. Each cloud follows its own line.

1
2
3
4


data %>%
  ggplot(aes(x, y, color = class_name)) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = lm, se = FALSE) 

Build one linear model

We’re building the model without an intercept by using 0 + as start of the formula so we get the intercept for each group directly (without building the sum of the reference group and the other groups.)

1
2


model_complete <- lm(y ~ 0 + class_name + class_name:x, data = data)
summary(model_complete)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


## 
## Call:
## lm(formula = y ~ 0 + class_name + class_name:x, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.94880 -0.65484  0.06566  0.66447  2.32326 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## class_nameA    8.78346    0.60220  14.586  < 2e-16 ***
## class_nameB   11.24628    0.42291  26.592  < 2e-16 ***
## class_nameC   11.43872    0.69364  16.491  < 2e-16 ***
## class_nameA:x  0.67056    0.09012   7.441 7.93e-10 ***
## class_nameB:x -0.55649    0.07967  -6.985 4.35e-09 ***
## class_nameC:x  1.09090    0.10240  10.653 6.95e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.071 on 54 degrees of freedom
## Multiple R-squared:  0.9948,	Adjusted R-squared:  0.9942 
## F-statistic:  1727 on 6 and 54 DF,  p-value: < 2.2e-16

Build three different models

Now build a linear model for each group.

1
2
3
4
5


model_split <- data %>%
  group_by(class_name) %>%
  group_map(~ coef(summary(lm(y ~ x, data = .x))))

model_split

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


## [[1]]
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 8.7834572  0.7482108 11.739282 7.181449e-10
## x           0.6705587  0.1119703  5.988718 1.153349e-05
## 
## [[2]]
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 11.2462761 0.40929811 27.476980 3.773904e-16
## x           -0.5564857 0.07710095 -7.217624 1.027562e-06
## 
## [[3]]
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 11.438716  0.5000135 22.87681 9.354783e-15
## x            1.090904  0.0738192 14.77805 1.653770e-11

Comparison

As you can see the coefficients are the same but the standard error of the coefficients are different.

“What’s the reason?” I was asking myself.

Simple example

So I began to look at a simpler example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


data <- tribble(
  ~class_name,  ~x, ~y,
  "A", 0, 1,
  "A", 10, 11,
  "A", 10, 21,
  "B", 0, 0,
  "B", 10, 0,
  "B", 10, 5,
)

data

1
2
3
4
5
6
7
8
9


## # A tibble: 6 x 3
##   class_name     x     y
##   <chr>      <dbl> <dbl>
## 1 A              0     1
## 2 A             10    11
## 3 A             10    21
## 4 B              0     0
## 5 B             10     0
## 6 B             10     5

1
2
3
4


data %>%
  ggplot(aes(x, y, color = class_name)) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = lm, se = FALSE) 

So we’ve only two groups. The linear model is simple:

1

lm(y ~ class_name * x, data) %>% summary()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


## 
## Call:
## lm(formula = y ~ class_name * x, data = data)
## 
## Residuals:
##          1          2          3          4          5          6 
## -4.441e-16 -5.000e+00  5.000e+00  1.648e-15 -2.500e+00  2.500e+00 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     1.0000     5.5902   0.179    0.875
## class_nameB    -1.0000     7.9057  -0.126    0.911
## x               1.5000     0.6847   2.191    0.160
## class_nameB:x  -1.2500     0.9682  -1.291    0.326
## 
## Residual standard error: 5.59 on 2 degrees of freedom
## Multiple R-squared:  0.8201,	Adjusted R-squared:  0.5501 
## F-statistic: 3.038 on 3 and 2 DF,  p-value: 0.2574

Step by Step computation

But how can we compute the linear model step by step?

The data

First let’s split the data into $ y $ and $ X $:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


y <- data %>% pull(y)
X <- data %>% 
  mutate(
    x0 = 1,
    x1 = class_name != "A",
    x2 = x,
    x3 = x1*x
    ) %>% 
  select(x0, x1, x2, x3) %>% 
  as.matrix()

X

1
2
3
4
5
6
7


##      x0 x1 x2 x3
## [1,]  1  0  0  0
## [2,]  1  0 10  0
## [3,]  1  0 10  0
## [4,]  1  1  0  0
## [5,]  1  1 10 10
## [6,]  1  1 10 10

1

y

1

## [1]  1 11 21  0  0  5

So $ y $ contains the values of the dependent variable. $ X $ is a matrix with $ 1 $ in the first column called $ x_0 $. $ x_1 .. x_3 $ are the predictors. $ x_3 $ is the interaction between $ x_1 $ and $ x_2 $.

The model

Now we can write our model as:

$$ y = X \beta + \epsilon $$ We’re searching for $ b $ so that $ \hat{y} = X b $ is the prediction which minimizes the sum of the squared residuals.

We get $ b $ as

$$ b = (X^t X)^{-1} X^t y $$

So let’s compute it: (solve computes the inverse of a matrix)

1
2
3
4


Xt <- t(X)

b <- solve(Xt %*% X) %*% Xt %*% y
round(b, 3)

1
2
3
4
5


##     [,1]
## x0  1.00
## x1 -1.00
## x2  1.50
## x3 -1.25

Standard Errors of the coefficients

Now we’re looking at the standard errors of the coefficients:

We can compute them by

$$ Cov(b) = \sigma^2 (X^t X)^{-1} $$

A predictior for $ \sigma^2 $ is the mean squared error: $$ s^2 = \frac{\sum{(y_i - \hat{y})^2}}{df} $$ In our case the degrees of freedom are $ df = n - 4 $.

1
2
3
4


yh <- X %*% b
df <- length(y) - 4
MSE <- sum((y - yh)**2) / df
MSE

1

## [1] 31.25

So the standard errors of the coefficients are the main diagonal of this matrix:

1

(solve(Xt %*% X) * MSE) %>% sqrt()

1

## Warning in sqrt(.): NaNs produced

1
2
3
4
5


##          x0       x1        x2        x3
## x0 5.590170      NaN       NaN 1.7677670
## x1      NaN 7.905694 1.7677670       NaN
## x2      NaN 1.767767 0.6846532       NaN
## x3 1.767767      NaN       NaN 0.9682458

These are the same as computed by lm:

1

lm(y ~ class_name * x, data) %>% summary()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


## 
## Call:
## lm(formula = y ~ class_name * x, data = data)
## 
## Residuals:
##          1          2          3          4          5          6 
## -4.441e-16 -5.000e+00  5.000e+00  1.648e-15 -2.500e+00  2.500e+00 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     1.0000     5.5902   0.179    0.875
## class_nameB    -1.0000     7.9057  -0.126    0.911
## x               1.5000     0.6847   2.191    0.160
## class_nameB:x  -1.2500     0.9682  -1.291    0.326
## 
## Residual standard error: 5.59 on 2 degrees of freedom
## Multiple R-squared:  0.8201,	Adjusted R-squared:  0.5501 
## F-statistic: 3.038 on 3 and 2 DF,  p-value: 0.2574

As you can see the covariance matrix isn’t diagnal. Points from “the other group” also effect the standard error.

So the errors are different to the errors of two independent models:

1
2
3


data %>%
  group_by(class_name) %>%
  group_map(~ coef(summary(lm(y ~ x, data = .x))))

1
2
3
4
5
6
7
8
9


## [[1]]
##             Estimate Std. Error   t value  Pr(>|t|)
## (Intercept)      1.0  7.0710678 0.1414214 0.9105615
## x                1.5  0.8660254 1.7320508 0.3333333
## 
## [[2]]
##                Estimate Std. Error      t value  Pr(>|t|)
## (Intercept) 2.56395e-16  3.5355339 7.251946e-17 1.0000000
## x           2.50000e-01  0.4330127 5.773503e-01 0.6666667

Separate Models

For completeness let’s check the manual way for one single model here:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


data_single <-data %>% 
  filter(class_name == "A")

y <- data_single %>% pull(y)
X <- data_single %>% 
  mutate(
    x0 = 1,
    x1 = class_name != "A",
    x2 = x,
    x3 = x1*x
    ) %>% 
  select(x0, x2) %>% 
  as.matrix()

X

1
2
3
4


##      x0 x2
## [1,]  1  0
## [2,]  1 10
## [3,]  1 10

1

y

1

## [1]  1 11 21

1
2
3
4


Xt <- t(X)

b <- solve(Xt %*% X) %*% Xt %*% y
round(b, 3)

1
2
3


##    [,1]
## x0  1.0
## x2  1.5

1
2
3
4


yh <- X %*% b
df <- length(y) - 2
MSE <- sum((y - yh)**2) / df
MSE

1

## [1] 50

1

(solve(Xt %*% X) * MSE) %>% sqrt()

1

## Warning in sqrt(.): NaNs produced

1
2
3


##          x0        x2
## x0 7.071068       NaN
## x2      NaN 0.8660254