Bayesian ANOVA Parameterisation

James Totterdell · 2019-09-02

Background

ANOVA methods are essentially about “batching” related model parameters to aid interpretability of the results (e.g. Gelman 2005). This batching is usually achieved by enforcing some kind of constraint on the parameters within the model. In classical inference, such constraints can be enforced in a number of ways and, to some extent, do not materially affect the inference due to the arbitrariness of the design matrix. However, in Bayesian inference, the parameterisation determines our priors, which in turn influence our inferences. Therefore, whilst in classical ANOVA the method of enforcing the usual sum-to-zero constraint (e.g. setting \(\alpha_1=0\) or \(\alpha_a = -\sum_{i=1}^{a-1}\alpha_i\)) doesn’t really matter beyond interpretation, in Bayesian inference a “fair” constraint is required.

To work through a sufficiently complex example (for more details see, for example, Rouder et al. (2012), or for a more rigorous treatment, Gu (2013)), consider a two-way factorial design with linear predictor \[ \begin{aligned} \eta &= 1\mu + X_\alpha\alpha + X_\beta\beta + X_\gamma\gamma \\ &= X\theta \end{aligned} \] where, up to repetition over replicates, the design matrices have the basic structure \[ \begin{aligned} X_\alpha &= I_a\otimes 1_b \\ X_\beta &= 1_a \otimes I_b \\ X_\gamma &= I_{ab}. \end{aligned} \]

a <- 4                                   # levels of factor A
b <- 3                                   # levels of factor B
g <- a * b                               # number of interaction cells
Xa <- kronecker(diag(1, a), rep(1, b))   # X_alpha = I_a (x) 1_b
Xb <- kronecker(rep(1, a), diag(1, b))   # X_beta  = 1_a (x) I_b
Xg <- diag(1, g)                         # X_gamma = I_ab
X <- cbind(1, Xa, Xb, Xg)                # full over-parameterised design
print(X)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
 [1,]    1    1    0    0    0    1    0    0    1     0     0     0     0
 [2,]    1    1    0    0    0    0    1    0    0     1     0     0     0
 [3,]    1    1    0    0    0    0    0    1    0     0     1     0     0
 [4,]    1    0    1    0    0    1    0    0    0     0     0     1     0
 [5,]    1    0    1    0    0    0    1    0    0     0     0     0     1
 [6,]    1    0    1    0    0    0    0    1    0     0     0     0     0
 [7,]    1    0    0    1    0    1    0    0    0     0     0     0     0
 [8,]    1    0    0    1    0    0    1    0    0     0     0     0     0
 [9,]    1    0    0    1    0    0    0    1    0     0     0     0     0
[10,]    1    0    0    0    1    1    0    0    0     0     0     0     0
[11,]    1    0    0    0    1    0    1    0    0     0     0     0     0
[12,]    1    0    0    0    1    0    0    1    0     0     0     0     0
      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]     0     0     0     0     0     0     0
 [2,]     0     0     0     0     0     0     0
 [3,]     0     0     0     0     0     0     0
 [4,]     0     0     0     0     0     0     0
 [5,]     0     0     0     0     0     0     0
 [6,]     1     0     0     0     0     0     0
 [7,]     0     1     0     0     0     0     0
 [8,]     0     0     1     0     0     0     0
 [9,]     0     0     0     1     0     0     0
[10,]     0     0     0     0     1     0     0
[11,]     0     0     0     0     0     1     0
[12,]     0     0     0     0     0     0     1

For interpretability, we would constrain the design so the parameters represent deflections about the overall mean \(\mu\). This is achieved by specifying sum-to-zero constraints \[ \begin{aligned} \sum_{j=1}^a \alpha_j &= 0 \\ \sum_{k=1}^b \beta_k &= 0 \\ \sum_{j=1}^{a} \text{vec}^{-1}(\gamma)_{jk} = \sum_{k=1}^b \text{vec}^{-1}(\gamma)_{jk} &=0. \end{aligned} \] If we were just to enforce \(\alpha_1=0\), say, then any \(\eta\) involving \(\alpha_1\) is a priori less variable than those involving the other \(\alpha\); similarly, if we used the \(\alpha_a = -\sum_{i=1}^{a-1}\alpha_i\) constraint, then any \(\eta\) involving \(\alpha_a\) is a priori more variable than those involving the other \(\alpha\). When comparing treatments, say, this advantages some comparisons and disadvantages others.
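
To see this concretely, here is a small simulation sketch (with \(\sigma_\alpha^2=1\); the exact numbers are illustrative): under the \(\alpha_a = -\sum_{i=1}^{a-1}\alpha_i\) constraint the last level is a priori about \(a-1\) times more variable than the others, whereas centring an unconstrained draw (formalised in the next paragraph) treats every level symmetrically.

# Prior variance by level: "last level" constraint versus centring (sigma_alpha^2 = 1)
set.seed(1)
a <- 4
n_draws <- 1e5
alpha_free <- matrix(rnorm(n_draws * (a - 1)), ncol = a - 1)
alpha_last <- cbind(alpha_free, -rowSums(alpha_free))   # alpha_a = -sum of the rest
round(apply(alpha_last, 2, var), 2)                     # approx 1, 1, 1, a - 1
alpha_unc  <- matrix(rnorm(n_draws * a), ncol = a)
alpha_cent <- alpha_unc - rowMeans(alpha_unc)           # alpha = alpha' - mean(alpha')
round(apply(alpha_cent, 2, var), 2)                     # approx 1 - 1/a for every level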

Assuming we start from unconstrained variables, for example \(\alpha^\prime\sim N(0,\sigma^2_\alpha I_a)\), then to fairly enforce the constraint we specify the transformation \(\alpha=\alpha^\prime - \bar\alpha^\prime 1_a\), which implies \(\alpha\sim N(0,\sigma_\alpha^2(I_a - a^{-1}J_a))\). In general, the above constraints are equivalent to specifying correlated priors \[ \begin{aligned} S_a &= I_a - a^{-1}J_a \\ S_b &= I_b - b^{-1}J_b \\ \alpha &\sim N(0, \Sigma_a = \sigma_\alpha^2S_a) \\ \beta &\sim N(0, \Sigma_b = \sigma_\beta^2S_b) \\ \gamma &\sim N(0, \Sigma_{ab} = \sigma_\gamma^2(S_a\otimes S_b)) \\ \end{aligned} \] which directly enforce the constraints posed previously, where we have used the fact that \(S_a S_a = S_a\), etc.

inv_vec <- function(A) matrix(A, a, b, byrow = TRUE)  # gamma is ordered with the factor-B index varying fastest
Sa <- diag(1, a) - 1/a                 # I_a - J_a / a
Sb <- diag(1, b) - 1/b                 # I_b - J_b / b
Sg <- kronecker(Sa, Sb)                # interaction prior correlation, S_a (x) S_b
R <- mvtnorm::rmvnorm(1, sigma = Sg)   # one prior draw of gamma (unit variance)
mR <- inv_vec(R)
c(round(colSums(mR), 5), round(rowSums(mR), 5))   # column and row sums are all zero
[1] 0 0 0 0 0 0 0
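
A quick check of the idempotence used above: \(S_a\) (and likewise \(S_a\otimes S_b\)) is the projection onto the relevant sum-to-zero subspace, which is why the centred prior covariance is simply \(\sigma_\alpha^2 S_a\).

all.equal(Sa %*% Sa, Sa)   # S_a is idempotent
all.equal(Sg %*% Sg, Sg)   # so is S_a (x) S_b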

This model is still unidentifiable: the design matrix \(X\) has rank 12, but the model involves 20 parameters.
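
A quick check of this directly from the design matrix:

c(columns = ncol(X), rank = qr(X)$rank)   # 20 columns but rank 12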

In the Bayesian framework, unidentifiability doesn’t necessarily pose a problem: we can still perform inference by the usual methods, whether approximation or sampling. In fact, we can fit the unidentifiable model and then apply the appropriate constraints after the fact. However, sampling will be inefficient due to the unidentifiability of the parameters, so ideally we re-parameterise first to enforce identifiability and improve sampling efficiency, and then back-transform once sampling is complete.

To apply the constraints after the fact, we wish to enforce the sum-to-zero constraints given above by subtracting the relevant means, with the interaction means absorbed into the main effects (and everything into the grand mean) so that the linear predictor is unchanged \[ \begin{aligned} \alpha_j^\star &= \alpha_j - \bar\alpha + \bar\gamma_{j+} - \bar\gamma,\quad j=1,...,a \\ \beta_k^\star &= \beta_k - \bar\beta + \bar\gamma_{+k} - \bar\gamma,\quad k=1,...,b \\ \gamma_{jk}^\star &= \gamma_{jk} - \bar\gamma_{j+} - \bar\gamma_{+k} + \bar\gamma,\quad j=1,...,a; k=1,...,b \\ \mu^\star &= \mu + \bar\alpha + \bar\beta + \bar\gamma, \end{aligned} \] where \(\bar\gamma_{j+}\) and \(\bar\gamma_{+k}\) are the row and column means of \(\text{vec}^{-1}(\gamma)\), so that \(\mu^\star + \alpha_j^\star + \beta_k^\star + \gamma_{jk}^\star = \mu + \alpha_j + \beta_k + \gamma_{jk}\). There may be an easier way, but to enforce this constraint we can use the following matrices \[ \begin{aligned} C_\mu &= \begin{bmatrix} a^{-1}1_a^\top & b^{-1}1_b^\top & (ab)^{-1}1_{ab}^\top\end{bmatrix} \\ C_\alpha &= \begin{bmatrix} S_a & 0_{a\times b} & (b^{-1}I_a - (ab)^{-1}J_a)\otimes1_b^\top \end{bmatrix} \\ C_\beta &= \begin{bmatrix} 0_{b\times a} & S_b & 1_a^\top \otimes (a^{-1}I_b - (ab)^{-1}J_b)\end{bmatrix} \\ C_\gamma &= \begin{bmatrix} 0_{ab\times (a+b)} & S_a\otimes S_b\end{bmatrix}\\ C &= \begin{bmatrix} 1 & C_\mu \\ 0 & C_\alpha \\ 0 & C_\beta \\ 0 & C_\gamma \end{bmatrix} \end{aligned} \]

Cm <- rbind(c(rep(1 / a, a), rep(1 / b, b), rep(1 / g, g)))                          # grand mean absorbs all the means
Ca <- cbind(Sa, matrix(0, a, b), kronecker(diag(1/b, a) - 1/g, rbind(rep(1, b))))    # centred alpha plus interaction row means
Cb <- cbind(matrix(0, b, a), Sb, kronecker(rbind(rep(1, a)), diag(1/a, b) - 1/g))    # centred beta plus interaction column means
Cg <- cbind(matrix(0, g, a + b), Sg)                                                 # doubly-centred interaction
C <- rbind(cbind(1, Cm), cbind(0, Ca), cbind(0, Cb), cbind(0, Cg))

beta <- rnorm(20)                 # an arbitrary unconstrained parameter vector
beta_star <- drop(C %*% beta)     # apply the constraint transformation
round(c(sum(beta_star[2:5]), sum(beta_star[6:8])), 3)   # alpha* and beta* sum to zero
[1] 0 0
B <- matrix(beta_star[9:20], a, b, byrow = TRUE)   # vec^{-1}(gamma*)
round(c(sum(B), rowSums(B), colSums(B)), 3)        # interaction sums to zero overall, by row, and by column
[1] 0 0 0 0 0 0 0 0
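
Because the interaction means are absorbed into the main effects (and the grand mean), the constrained parameters reproduce exactly the same linear predictor, i.e. \(XC = X\); a quick check:

all.equal(X %*% C, X)                                # the re-expression leaves the design unchanged
all.equal(drop(X %*% beta), drop(X %*% beta_star))   # so eta is identical under theta and C theta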

Re-parameterisation

The matrix \(S_a=(I_a-a^{-1}J_a)\) is not of full rank, but we can eigen-decompose it as \[ S_a = Q_a I_{a-1}Q_a^\top \implies \Sigma_a = \sigma_\alpha^2Q_aI_{a-1}Q_a^\top \] where \(Q_a\) is the \(a\times(a-1)\) matrix whose columns are the unit eigenvectors with non-zero eigenvalues. We can then apply the transformation \[ \alpha^\star = Q_a^\top\alpha\sim N(0, \sigma_\alpha^2 I_{a-1}) \] where \(\alpha^\star\) has length \(a-1\) and is identifiable. This transformation also implies a changed design matrix \[ X_\alpha\alpha = X_\alpha Q_aQ_a^\top\alpha = X_\alpha^\star\alpha^\star. \]

Similarly for the other term \(\beta\), and for \(\gamma\) with \(Q_{ab} = Q_a\otimes Q_b\). Therefore, we are now specifying priors \[ \begin{aligned} \alpha^\star &= Q_a^\top\alpha \sim N(0, \sigma^2_\alpha I_{a-1}) \\ \beta^\star &= Q_b^\top\beta \sim N(0, \sigma^2_\beta I_{b-1}) \\ Q_{ab} &= Q_a\otimes Q_b \\ \gamma^\star &= Q_{ab}^\top\gamma \sim N(0, \sigma^2_\gamma I_{(a-1)(b-1)}) \end{aligned} \] and model \[ \begin{aligned} \eta &= 1\mu + X_\alpha^\star\alpha^\star + X_\beta^\star\beta^\star + X_\gamma^\star\gamma^\star \\ &= X^\star\theta^\star \\ X_\alpha^\star &= X_\alpha Q_a \\ X_\beta^\star &= X_\beta Q_b \\ X_\gamma^\star &= X_\gamma Q_{ab}. \end{aligned} \]

Qa <- eigen(Sa)$vectors[, 1:(a-1)]   # unit eigenvectors of S_a with non-zero eigenvalue
Qb <- eigen(Sb)$vectors[, 1:(b-1)]   # unit eigenvectors of S_b with non-zero eigenvalue
Qg <- kronecker(Qa, Qb)
Q <- as.matrix(Matrix::bdiag(1, Qa, Qb, Qg))   # maps (mu, alpha*, beta*, gamma*) back to (mu, alpha, beta, gamma)

Xa_star <- Xa %*% Qa
Xb_star <- Xb %*% Qb
Xg_star <- Xg %*% Qg
X_star <- cbind(1, Xa_star, Xb_star, Xg_star)
print(X_star)
      [,1]       [,2]       [,3]       [,4]       [,5]       [,6]       [,7]
 [1,]    1  0.0000000  0.8660254  0.0000000  0.0000000  0.8164966  0.0000000
 [2,]    1  0.0000000  0.8660254  0.0000000 -0.7071068 -0.4082483  0.0000000
 [3,]    1  0.0000000  0.8660254  0.0000000  0.7071068 -0.4082483  0.0000000
 [4,]    1 -0.2677175 -0.2886751  0.7713586  0.0000000  0.8164966  0.0000000
 [5,]    1 -0.2677175 -0.2886751  0.7713586 -0.7071068 -0.4082483  0.1893048
 [6,]    1 -0.2677175 -0.2886751  0.7713586  0.7071068 -0.4082483 -0.1893048
 [7,]    1 -0.5341574 -0.2886751 -0.6175294  0.0000000  0.8164966  0.0000000
 [8,]    1 -0.5341574 -0.2886751 -0.6175294 -0.7071068 -0.4082483  0.3777063
 [9,]    1 -0.5341574 -0.2886751 -0.6175294  0.7071068 -0.4082483 -0.3777063
[10,]    1  0.8018748 -0.2886751 -0.1538291  0.0000000  0.8164966  0.0000000
[11,]    1  0.8018748 -0.2886751 -0.1538291 -0.7071068 -0.4082483 -0.5670111
[12,]    1  0.8018748 -0.2886751 -0.1538291  0.7071068 -0.4082483  0.5670111
            [,8]       [,9]      [,10]      [,11]       [,12]
 [1,]  0.0000000  0.0000000  0.7071068  0.0000000  0.00000000
 [2,]  0.0000000 -0.6123724 -0.3535534  0.0000000  0.00000000
 [3,]  0.0000000  0.6123724 -0.3535534  0.0000000  0.00000000
 [4,] -0.2185904  0.0000000 -0.2357023  0.0000000  0.62981162
 [5,]  0.1092952  0.2041241  0.1178511 -0.5454329 -0.31490581
 [6,]  0.1092952 -0.2041241  0.1178511  0.5454329 -0.31490581
 [7,] -0.4361377  0.0000000 -0.2357023  0.0000000 -0.50421065
 [8,]  0.2180688  0.2041241  0.1178511  0.4366592  0.25210533
 [9,]  0.2180688 -0.2041241  0.1178511 -0.4366592  0.25210533
[10,]  0.6547281  0.0000000 -0.2357023  0.0000000 -0.12560097
[11,] -0.3273640  0.2041241  0.1178511  0.1087736  0.06280049
[12,] -0.3273640 -0.2041241  0.1178511 -0.1087736  0.06280049
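
A couple of sanity checks on the decomposition: the columns of \(Q_a\) are orthonormal, \(Q_aQ_a^\top\) recovers \(S_a\), and \(X^\star = XQ\); post-multiplying draws of \(\theta^\star\) by \(Q^\top\) maps them back to the constrained parameterisation (used as sam_X_star %*% t(Q) below).

all.equal(crossprod(Qa), diag(a - 1))   # Qa' Qa = I_{a-1}
all.equal(tcrossprod(Qa), Sa)           # Qa Qa' = S_a
all.equal(X_star, X %*% Q)              # the reduced design is X Q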

The same setup can be used whether the model is one-way, multi-way, random-effects, or mixed-effects.

Example

As a start, consider sampling based on the unidentifiable parameterisation versus the re-parameterised model.

// log_reg
data {
  int<lower=0> N;
  int<lower=1> P;
  int<lower=0> y[N];
  int<lower=1> n[N];
  matrix[N, P] X;
  vector[P] mu0;
  matrix[P, P] Sigma0;
}
parameters {
  vector[P] beta;
}
model {
  target += multi_normal_lpdf(beta | mu0, Sigma0)
          + binomial_logit_lpmf(y | n, X*beta);
}
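
The Stan program above is assumed to be compiled into an object named log_reg (in an R Markdown post the stan chunk does this automatically, e.g. via its output.var option); a minimal sketch if compiling by hand, assuming the program is saved as log_reg.stan:

log_reg <- rstan::stan_model("log_reg.stan")   # compile once, reuse for both fits
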
library(rstan)
library(tidybayes)

# Data
n <- 1e2                                  # binomial sample size within each cell
mu <- 0.2
alpha <- seq(-0.25, 0.25, length.out = a)
beta  <- seq(-0.1, 0.1, length.out = b)
gamma <- rep(0, g)                        # no interaction in the true model
theta <- c(mu, alpha, beta, gamma)
P <- length(theta)
N <- nrow(X)
eta <- X %*% theta
y <- rbinom(N, n, plogis(eta))            # one binomial count per cell

# Fit
dat_X <- list(N = N, P = P, y = y, n = rep(n, N), X = X, mu0 = rep(0, P), Sigma0 = diag(10, P))
fit_X <- sampling(log_reg, data = dat_X, refresh = 0, iter = 2e4)
sam_X <- as.matrix(fit_X, "beta")
round(get_elapsed_time(fit_X), 2)
##         warmup sample
## chain:1  14.15  17.89
## chain:2  14.20  18.54
## chain:3  14.46  18.16
## chain:4  14.94  18.52
P_star <- ncol(X_star)
dat_X_star <- list(N = N, P = P_star, y = y, n = rep(n, N), X = X_star, mu0 = rep(0, P_star), Sigma0 = diag(10, P_star))
fit_X_star <- sampling(log_reg, data = dat_X_star, iter = 2e4, refresh = 0)
sam_X_star <- as.matrix(fit_X_star, "beta")
round(get_elapsed_time(fit_X_star), 2)
##         warmup sample
## chain:1   1.00   0.48
## chain:2   0.91   0.46
## chain:3   1.02   0.45
## chain:4   0.93   0.45

There is a large discrepancy in the time taken for the dynamic HMC to complete sampling. The re-parameterisation involves an identifiable set of parameters and presents a much simpler geometry for the sampler to work with. The resulting parameters are equivalent.
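
Wall-clock time is only part of the story; the effective sample sizes tell a similar tale. A sketch for checking this (output omitted):

# Compare effective sample size and Rhat across the two parameterisations
summary(fit_X)$summary[, c("n_eff", "Rhat")]
summary(fit_X_star)$summary[, c("n_eff", "Rhat")]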

print("Linear predictor means")
round(cbind(
  "True" = eta,
  "Unidentifiable" = apply(sam_X %*% t(X), 2, mean),
  "Identifiable" = apply(sam_X_star %*% t(X_star), 2, mean)
), 2)

print("Linear predictor variance")
round(cbind(
  "Unidentifiable" = sqrt(diag(var(sam_X %*% t(X)))),
  "Identifiable" = sqrt(diag(var(sam_X_star %*% t(X_star))))
), 2)

print("Constrained parameter means")
round(cbind(
  "True" = theta,
  "Unidentifiable" = apply(sam_X %*% t(C), 2, mean),
  "Identifiable" = apply(sam_X_star %*% t(Q), 2, mean)
), 2)

print("Constrained parameter variance")
round(cbind(
  "Unidentifiable" = sqrt(diag(var(sam_X %*% t(C)))),
  "Identifiable" = sqrt(diag(var(sam_X_star %*% t(Q))))
), 2)
[1] "Linear predictor means"
            Unidentifiable Identifiable
 [1,] -0.15          -0.12        -0.12
 [2,] -0.05           0.00         0.00
 [3,]  0.05          -0.08        -0.08
 [4,]  0.02           0.20         0.20
 [5,]  0.12           0.12         0.12
 [6,]  0.22           0.33         0.33
 [7,]  0.18           0.04         0.04
 [8,]  0.28           0.16         0.16
 [9,]  0.38           0.32         0.33
[10,]  0.35           0.49         0.49
[11,]  0.45           0.25         0.24
[12,]  0.55           0.45         0.45
[1] "Linear predictor variance"
      Unidentifiable Identifiable
 [1,]           0.20         0.20
 [2,]           0.20         0.20
 [3,]           0.20         0.20
 [4,]           0.20         0.20
 [5,]           0.20         0.20
 [6,]           0.21         0.20
 [7,]           0.20         0.20
 [8,]           0.20         0.20
 [9,]           0.20         0.20
[10,]           0.21         0.21
[11,]           0.20         0.20
[12,]           0.21         0.21
[1] "Constrained parameter means"
       True Unidentifiable Identifiable
 [1,]  0.20           0.18         0.18
 [2,] -0.25          -0.25        -0.25
 [3,] -0.08           0.04         0.04
 [4,]  0.08           0.00         0.00
 [5,]  0.25           0.22         0.22
 [6,] -0.10          -0.03        -0.03
 [7,]  0.00          -0.05        -0.05
 [8,]  0.10           0.07         0.08
 [9,]  0.00          -0.03        -0.03
[10,]  0.00           0.12         0.12
[11,]  0.00          -0.09        -0.09
[12,]  0.00           0.01         0.01
[13,]  0.00          -0.05        -0.05
[14,]  0.00           0.03         0.03
[15,]  0.00          -0.11        -0.11
[16,]  0.00           0.03         0.04
[17,]  0.00           0.07         0.07
[18,]  0.00           0.12         0.12
[19,]  0.00          -0.10        -0.10
[20,]  0.00          -0.02        -0.02
[1] "Constrained parameter variance"
      Unidentifiable Identifiable
 [1,]           0.06         0.06
 [2,]           0.10         0.10
 [3,]           0.10         0.10
 [4,]           0.10         0.10
 [5,]           0.10         0.10
 [6,]           0.08         0.08
 [7,]           0.08         0.08
 [8,]           0.08         0.08
 [9,]           0.14         0.14
[10,]           0.14         0.14
[11,]           0.14         0.14
[12,]           0.14         0.14
[13,]           0.14         0.14
[14,]           0.15         0.14
[15,]           0.14         0.14
[16,]           0.14         0.14
[17,]           0.14         0.14
[18,]           0.14         0.14
[19,]           0.14         0.14
[20,]           0.15         0.14

Definitions

Recall the Kronecker product definition \[ A\otimes B = \begin{pmatrix} a_{11} B & a_{12} B & \cdots & a_{1n}B \\ a_{21} B & a_{22} B & \cdots & a_{2n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} B & a_{m2} B & \cdots & a_{mn}B \end{pmatrix} \] and the vec operator definition \[ \text{vec}(A) = \begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \\ a_{12} \\ a_{22} \\ \vdots \\ a_{mn} \end{pmatrix} \] whereby if \(A\) is symmetric, \(\text{vec}(A) = \text{vec}(A^\top)\).
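
The two operators are linked by the identity \(\text{vec}(AZB^\top) = (B\otimes A)\,\text{vec}(Z)\), which underlies why the Kronecker prior covariance \(S_a\otimes S_b\) enforces zero row and column sums on \(\text{vec}^{-1}(\gamma)\). A quick numerical illustration (arbitrary dimensions):

A <- matrix(rnorm(6), 2, 3)
B <- matrix(rnorm(8), 4, 2)
Z <- matrix(rnorm(6), 3, 2)
all.equal(c(A %*% Z %*% t(B)), drop(kronecker(B, A) %*% c(Z)))   # vec(A Z B') = (B (x) A) vec(Z)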


References

Gelman, Andrew. 2005. “Analysis of Variance—Why It Is More Important Than Ever.” Ann. Statist. 33 (1): 1–53. https://doi.org/10.1214/009053604000001048.

Gu, Chong. 2013. Smoothing Spline ANOVA Models. Vol. 297. Springer Science & Business Media.

Rouder, Jeffrey N, Richard D Morey, Paul L Speckman, and Jordan M Province. 2012. “Default Bayes Factors for ANOVA Designs.” Journal of Mathematical Psychology 56 (5): 356–74.