Normal Linear Regression
We consider the following model \[ \begin{aligned} p(y|\beta,\Sigma) &= \text{Normal}(y|X\beta, \Sigma) \\ p(\beta) &= \text{Normal}(\beta|\mu_0, \Sigma_0) \end{aligned} \] often assuming \(\Sigma = \sigma^2 I_N\) with either \(p(\sigma^2) = \text{Inverse-Gamma}(\sigma^2|a_0, b_0)\) or \(p(\sigma) = \text{Half-}t(\sigma|a_0,b_0)\).
(Look at the generalisation of this using G-Wishart priors as in Maestrini and Wand 2020)
The latter can be expressed hierarchically as \[ \begin{aligned} p(\sigma^2|\lambda) &= \text{Inverse-Gamma}(b_0/2, b_0/\lambda) \\ p(\lambda) &= \text{Inverse-Gamma}(1/2,1/a_0^2) \\ \implies p(\sigma) &= \text{Half-}t(a_0,b_0), \end{aligned} \] that is, a Half-\(t\) density with scale \(a_0\) and \(b_0\) degrees of freedom.
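As a quick Monte Carlo sanity check of this representation (a sketch in Python; `a0` and `b0` are arbitrary test values), draws of \(\sigma\) generated through the inverse-gamma hierarchy should match draws of \(|a_0 t_{b_0}|\), i.e. a Half-\(t(a_0,b_0)\) variable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a0, b0 = 1.5, 4.0   # arbitrary test values: scale a0, degrees of freedom b0
n = 200_000

# sigma^2 | lambda ~ Inverse-Gamma(b0/2, b0/lambda), lambda ~ Inverse-Gamma(1/2, 1/a0^2)
lam = stats.invgamma.rvs(0.5, scale=1 / a0**2, size=n, random_state=rng)
sigma2 = stats.invgamma.rvs(b0 / 2, scale=b0 / lam, random_state=rng)
sigma = np.sqrt(sigma2)

# compare against |a0 * t_{b0}| draws, i.e. Half-t(a0, b0)
halft = np.abs(a0 * stats.t.rvs(df=b0, size=n, random_state=rng))

# the two samples should agree in distribution (compare a few quantiles)
qs = [0.25, 0.5, 0.75, 0.9]
print(np.quantile(sigma, qs))
print(np.quantile(halft, qs))
```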
The proposed product-form variational densities are \(q(\beta,\Sigma,\lambda)=q(\beta)q(\Sigma)q(\lambda)\) with optimal solutions \[ \begin{aligned} q^\star(\beta) &\propto \exp\left\{\mathbb E_{\Sigma}\left[\ln p(\beta|y,\Sigma)\right]\right\} \\ q^\star(\Sigma) &\propto \exp\left\{\mathbb E_{\beta,\lambda}\left[\ln p(\Sigma|y,\beta,\lambda)\right]\right\} \\ q^\star(\lambda) &\propto \exp\left\{\mathbb E_\Sigma\left[\ln p(\lambda|y,\Sigma)\right]\right\} \end{aligned} \] where the \(\lambda\) factor is dropped when the model does not include it.
From the model above, the joint density satisfies \(\ln p(y,\beta,\Sigma,\lambda) = \ln p(y|\beta,\Sigma) + \ln p(\beta) + \ln p(\Sigma|\lambda) + \ln p(\lambda)\).
Model Likelihood
The log-likelihood has the form \[ \ln p(y|\beta,\Sigma) = -\frac{1}{2}\left(\ln |\Sigma| +(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)+N\ln(2\pi)\right) \] so that \[ \begin{aligned} \mathbb E_q[\ln p(y|\beta,\Sigma)] &= -\frac{1}{2}\left(\mathbb E_q[\ln|\Sigma|] + \mathbb E_q[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)]+N\ln(2\pi)\right) \\ &= -\frac{1}{2}\left(\mathbb E_q[\ln|\Sigma|] + (y-X\mu_\beta)^\mathsf{T}\mathbb E_q[\Sigma^{-1}](y-X\mu_\beta) + \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\Sigma_\beta\right)+N\ln(2\pi)\right) \end{aligned} \] due to the assumed variational independence of \(\beta\) and \(\Sigma\).
To obtain the above, (letting \(\mu_\beta = \mathbb E_q[\beta]\) and \(\Sigma_\beta = \mathbb V_q[\beta]\)) \[ \begin{aligned} \mathbb E_q[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)] &= y^\mathsf{T}\mathbb E_q[\Sigma^{-1}]y-2y^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta]+\mathbb E_q[\beta^\mathsf{T}X^\mathsf{T}\Sigma^{-1}X\beta]\\ \mathbb E_q[\beta^\mathsf{T}X^\mathsf{T}\Sigma^{-1}X\beta] &= \mathbb E_q[\text{tr}\left(X^\mathsf{T}\Sigma^{-1}X\beta\beta^\mathsf{T}\right)] \\ &= \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\left\{\mathbb V_q[\beta]+\mathbb E_q[\beta]\mathbb E_q[\beta]^\mathsf{T}\right\}\right) \\ &= \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb V_q[\beta]\right)+(X\mathbb E_q[\beta])^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta]\\ y^\mathsf{T}\mathbb E_q[\Sigma^{-1}]y-2y^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta] &= (y-X\mathbb E_q[\beta])^\mathsf{T}\mathbb E_q[\Sigma^{-1}](y-X\mathbb E_q[\beta]) - (X\mathbb E_q[\beta])^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta] \\ \mathbb E_q[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)] &= (y-X\mu_\beta)^\mathsf{T}\mathbb E_q[\Sigma^{-1}](y-X\mu_\beta) + \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\Sigma_\beta\right) \end{aligned} \]
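The trace identity above is easy to verify by Monte Carlo (a sketch; the matrix `M` stands in for \(\mathbb E_q[\Sigma^{-1}]\) and all values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 8, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# an arbitrary Gaussian q(beta) and a fixed positive-definite stand-in for E_q[Sigma^{-1}]
mu_beta = rng.normal(size=d)
L = rng.normal(size=(d, d))
Sigma_beta = L @ L.T + np.eye(d)           # V_q[beta]
M = 0.7 * np.eye(N)                        # stands in for E_q[Sigma^{-1}]

# closed form: (y - X mu)^T M (y - X mu) + tr(X^T M X Sigma_beta)
r = y - X @ mu_beta
closed = r @ M @ r + np.trace(X.T @ M @ X @ Sigma_beta)

# Monte Carlo over beta ~ q(beta)
beta = rng.multivariate_normal(mu_beta, Sigma_beta, size=400_000)
resid = y[None, :] - beta @ X.T
mc = np.mean(np.einsum('ni,ij,nj->n', resid, M, resid))
print(closed, mc)   # should agree to a few significant figures
```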
Model Coefficients
For the regression coefficients, the full conditional is \[ \begin{aligned} \ln p(\beta | y, \Sigma) &\simeq -\frac{1}{2}\left[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta) + (\beta-\mu_0)^{\mathsf{T}}\Sigma_0^{-1}(\beta-\mu_0)\right] \\ &= -\frac{1}{2}\left[\left(\beta - M^{-1}m\right)^{\mathsf{T}} M\left(\beta - M^{-1}m\right)\right]\\ M &= X^{\mathsf{T}}\Sigma^{-1}X + \Sigma_0^{-1} \\ m &= X^{\mathsf{T}}\Sigma^{-1}y + \Sigma_0^{-1}\mu_0 \\ \implies p(\beta|y,\Sigma) &= \text{Normal}(\mu_{\beta|y,\Sigma}, \Psi_{\beta|y,\Sigma}) \\ \Psi_{\beta|y,\Sigma} &= \left(X^{\mathsf{T}}\Sigma^{-1}X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{\beta|y,\Sigma} &= \Psi_{\beta|y,\Sigma}\left(X^{\mathsf{T}}\Sigma^{-1}y + \Sigma_0^{-1}\mu_0\right). \end{aligned} \]
From which, the optimal density is \[ \begin{aligned} \mathbb E_{q(\Sigma)}[\ln p(\beta|y,\Sigma)] &\simeq -\frac{1}{2}\left[\left(\beta - M_q^{-1}m_q\right)^{\mathsf{T}} M_q\left(\beta - M_q^{-1}m_q\right)\right] \\ M_q &= X^{\mathsf{T}}\mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right] X + \Sigma_0^{-1} \\ m_q &= X^{\mathsf{T}}\mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right] y + \Sigma_0^{-1}\mu_0 \\ \implies q^\star(\beta) &= \text{Normal}(\mu_{q(\beta)}, \Psi_{q(\beta)}) \\ \Psi_{q(\beta)} &= \left(X^{\mathsf{T}}\mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right] X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{q(\beta)} &= \Psi_{q(\beta)}\left(X^{\mathsf{T}}\mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right] y + \Sigma_0^{-1}\mu_0\right). \end{aligned} \]
The exact formulation of this density depends upon the form of \(\Sigma\) and the resulting density \(q(\Sigma)\). Assuming that \(\Sigma=\sigma^2 I\), replace \(q(\Sigma)\) with \(q(\sigma^2)\).
Inverse-Gamma prior on \(\sigma^2\)
For the variance component, assuming \(\Sigma=\sigma^2 I\), \[ \begin{aligned} \ln p(\sigma^2|y,\beta) &\simeq -\frac{N}{2}\ln(\sigma^{2})-\frac{1}{2}\sigma^{-2}(y-X\beta)^{\mathsf{T}}(y-X\beta) - (a_0+1)\ln(\sigma^2) - \sigma^{-2}b_0\\ \implies p(\sigma^2|y,\beta) &= \text{Inverse-Gamma}(\sigma^2|a_{\sigma^2|\beta,y}, b_{\sigma^2|\beta,y}) \\ a_{\sigma^2|\beta,y} &= a_0 + \frac{N}{2} \\ b_{\sigma^2|\beta,y} &= b_0 + \frac{\lVert y-X\beta\rVert^2}{2}. \end{aligned} \] From which, the optimal density is \[ \begin{aligned} q^\star(\sigma^2) &\propto \exp\left\{\mathbb E_{q(\beta)}\left[\ln p(\sigma^2|y,\beta)\right]\right\} \\ \implies q^\star(\sigma^2) &= \text{Inverse-Gamma}(a_{q(\sigma^2)}, b_{q(\sigma^2)}) \\ a_{q(\sigma^2)} &= a_0 + \frac{N}{2} \\ b_{q(\sigma^2)} &= b_0 + \frac{\lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)})}{2}, \end{aligned} \] implying that \[ \begin{aligned} \mathbb E_{q(\sigma^2)}\left[\sigma^{-2}\right] &= \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} \\ \mathbb E_{q(\sigma^2)}\left[\ln(\sigma^2)\right] &= \ln\left(b_{q(\sigma^2)}\right) -\psi\left(a_{q(\sigma^2)}\right) \\ \mathbb H_{q(\sigma^2)}\left[\sigma^2\right] &= a_{q(\sigma^2)} + \ln b_{q(\sigma^2)} + \ln\Gamma \left(a_{q(\sigma^2)}\right) - (1+a_{q(\sigma^2)})\psi\left(a_{q(\sigma^2)}\right), \end{aligned} \] which are the expectations required by the updates for \(q^\star(\beta)\) and by the lower bound.
An alternative form for the \(b_{q(\sigma^2)}\) term, which reuses the precomputed statistics \(X^{\mathsf{T}}X\) and \(X^{\mathsf{T}}y\), is \[ \lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)}) = y^{\mathsf{T}}y-2\mu_{q(\beta)}^{\mathsf{T}}X^{\mathsf{T}}y+\text{tr}\left[(X^{\mathsf{T}}X)\left(\Psi_{q(\beta)}+\mu_{q(\beta)}\mu_{q(\beta)}^{\mathsf{T}}\right)\right]. \]
The updates are then \[ \begin{aligned} a_{q(\sigma^2)} &\leftarrow a_0 + N/2 \\ \text{Cycle:} \\ \Psi_{q(\beta)} &\leftarrow \left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{q(\beta)} &\leftarrow \Psi_{q(\beta)}\left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} X^{\mathsf{T}} y + \Sigma_0^{-1}\mu_0\right) \\ b_{q(\sigma^2)} &\leftarrow b_0 + \frac{\lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)})}{2} \end{aligned} \] with \(b_{q(\sigma^2)}\) initialised to some positive value (e.g. \(b_0\)), cycling until the change in \(\mathcal{L}(q)\) falls below a specified tolerance, indicating convergence.
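A minimal sketch of this cycle in Python (assuming numpy; the names are mine, and for brevity convergence is monitored on the change in \(b_{q(\sigma^2)}\) rather than \(\mathcal L(q)\)), using the precomputed-statistics form of \(b_{q(\sigma^2)}\) from above:

```python
import numpy as np

def cavi_normal_ig(X, y, mu0, Sigma0, a0, b0, tol=1e-8, max_iter=500):
    """CAVI for Bayesian linear regression with an Inverse-Gamma(a0, b0)
    prior on sigma^2: a sketch of the cycle above."""
    N, d = X.shape
    XtX, Xty = X.T @ X, X.T @ y           # precomputed statistics
    Sigma0_inv = np.linalg.inv(Sigma0)
    a_q = a0 + N / 2                      # fixed across iterations
    b_q = b0                              # initial value for b_{q(sigma^2)}
    for _ in range(max_iter):
        e_inv_s2 = a_q / b_q              # E_q[sigma^{-2}]
        Psi_q = np.linalg.inv(e_inv_s2 * XtX + Sigma0_inv)
        mu_q = Psi_q @ (e_inv_s2 * Xty + Sigma0_inv @ mu0)
        b_new = b0 + 0.5 * (y @ y - 2 * mu_q @ Xty
                            + np.trace(XtX @ (Psi_q + np.outer(mu_q, mu_q))))
        if abs(b_new - b_q) < tol * abs(b_q):
            b_q = b_new
            break
        b_q = b_new
    return mu_q, Psi_q, a_q, b_q
```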
The lower bound itself is \[ \begin{aligned} \mathcal{L}(q) &= \mathbb E_q[\ln p(y,\beta,\sigma^2) - \ln q(\beta,\sigma^2)] \\ &= \mathbb E_q[\ln p(y|\beta,\sigma^2)] + \mathbb E_q[\ln p(\beta)] + \mathbb E_q[\ln p(\sigma^2)] + \mathbb H_q[\beta] + \mathbb H_q[\sigma^2] \\ \mathbb E_q[\ln p(y|\beta,\sigma^2)] &= -\frac{1}{2}\left[N\mathbb E_q[\ln\sigma^2] + \mathbb E_q[\sigma^{-2}]\left(\lVert y-X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}}X\Psi_{q(\beta)})\right)+N\ln(2\pi)\right] \\ \mathbb E_q[\ln p(\beta)] &= -\frac{1}{2}\left[d\ln(2\pi)+\ln|\Sigma_0|+(\mu_{q(\beta)}-\mu_0)^\mathsf{T}\Sigma_0^{-1}(\mu_{q(\beta)}-\mu_0)+\text{tr}\left(\Sigma_0^{-1}\Psi_{q(\beta)}\right)\right]\\ \mathbb E_q[\ln p(\sigma^2)] &= a_0\ln(b_0)-\ln\Gamma(a_0)-(a_0+1)\mathbb E_q[\ln\sigma^2]-b_0\mathbb E_q[\sigma^{-2}] \end{aligned} \] with \(d=\dim(\beta)\), \(\mathbb H_q[\beta] = \frac{1}{2}\left[d\left(1+\ln(2\pi)\right)+\ln|\Psi_{q(\beta)}|\right]\), and \(\mathbb H_q[\sigma^2]\) as given above.
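These terms can be assembled directly from the variational parameters (a sketch pairing with the update loop above; the digamma and log-gamma functions come from scipy.special):

```python
import numpy as np
from scipy.special import gammaln, digamma

def elbo_normal_ig(X, y, mu0, Sigma0, a0, b0, mu_q, Psi_q, a_q, b_q):
    """Assemble L(q) from the expectations above (Sigma = sigma^2 I_N)."""
    N, d = X.shape
    e_inv_s2 = a_q / b_q                          # E_q[sigma^{-2}]
    e_log_s2 = np.log(b_q) - digamma(a_q)         # E_q[ln sigma^2]
    r = y - X @ mu_q
    sse = r @ r + np.trace(X.T @ X @ Psi_q)
    e_lik = -0.5 * (N * e_log_s2 + e_inv_s2 * sse + N * np.log(2 * np.pi))
    S0_inv = np.linalg.inv(Sigma0)
    e_pbeta = -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma0)[1]
                      + (mu_q - mu0) @ S0_inv @ (mu_q - mu0)
                      + np.trace(S0_inv @ Psi_q))
    e_ps2 = a0 * np.log(b0) - gammaln(a0) - (a0 + 1) * e_log_s2 - b0 * e_inv_s2
    h_beta = 0.5 * (d * (1 + np.log(2 * np.pi)) + np.linalg.slogdet(Psi_q)[1])
    h_s2 = a_q + np.log(b_q) + gammaln(a_q) - (1 + a_q) * digamma(a_q)
    return e_lik + e_pbeta + e_ps2 + h_beta + h_s2
```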
Half-\(t\) prior on \(\sigma\)
(Maestrini and Wand 2020)
The optimal density for \(\beta\) is unchanged from the previous section.
For \(\lambda\), \[ \begin{aligned} \ln p(\lambda|\sigma^2) &\simeq -\left(\tfrac{1}{2}+1\right)\ln\lambda - \frac{a_0^{-2}}{\lambda} - \frac{b_0}{2}\ln\lambda - \frac{b_0\sigma^{-2}}{\lambda}\\ \implies p(\lambda|\sigma^2) &= \text{Inverse-Gamma}\left(a_{\lambda|\sigma^2}, b_{\lambda|\sigma^2}\right) \\ a_{\lambda|\sigma^2} &=(b_0+1)/2 \\ b_{\lambda|\sigma^2} &= a_0^{-2}+b_0\sigma^{-2}. \end{aligned} \] From which, the optimal density is \[ \begin{aligned} q^\star(\lambda) &\propto \exp\left(\mathbb E_{q(\sigma^2)}[\ln p(\lambda|\sigma^2)]\right) \\ \implies q^\star(\lambda) &= \text{Inverse-Gamma}(a_{q(\lambda)}, b_{q(\lambda)}) \\ a_{q(\lambda)} &= (b_0+1)/2\\ b_{q(\lambda)} &= a_0^{-2}+b_0\mathbb E_{q(\sigma^2)}[\sigma^{-2}]. \end{aligned} \]
For \(\sigma^2\), \[ \begin{aligned} \ln p(\sigma^2|y,\beta,\lambda) &\simeq -\frac{N}{2}\ln(\sigma^2) - \frac{1}{2}\sigma^{-2}\lVert y-X\beta\rVert^2 -(b_0/2+1)\ln(\sigma^2)-\sigma^{-2}b_0/\lambda \\ \implies p(\sigma^2|y,\beta,\lambda) &= \text{Inverse-Gamma}(a_{\sigma^2|y,\beta,\lambda}, b_{\sigma^2|y,\beta,\lambda}) \\ a_{\sigma^2|y,\beta,\lambda} &= (b_0+N)/2 \\ b_{\sigma^2|y,\beta,\lambda} &= b_0/\lambda + \frac{\lVert y-X\beta\rVert^2}{2}, \end{aligned} \] from which the optimal density is \[ \begin{aligned} q^\star(\sigma^2) &\propto \exp\left(\mathbb E_{q(\beta,\lambda)}[\ln p(\sigma^2|y,\beta,\lambda)]\right) \\ \implies q^\star(\sigma^2) &= \text{Inverse-Gamma}(a_{q(\sigma^2)}, b_{q(\sigma^2)}) \\ a_{q(\sigma^2)} &= (b_0 + N)/2 \\ b_{q(\sigma^2)} &= b_0\mathbb E_{q(\lambda)}[\lambda^{-1}] + \frac{\lVert y-X\mu_{q(\beta)}\rVert^2+\text{tr}\left(X^\mathsf{T}X\Psi_{q(\beta)}\right)}{2}\\ &= b_0\frac{a_{q(\lambda)}}{b_{q(\lambda)}} + \frac{\lVert y-X\mu_{q(\beta)}\rVert^2+\text{tr}\left(X^\mathsf{T}X\Psi_{q(\beta)}\right)}{2}. \end{aligned} \]
The updates are then \[ \begin{aligned} a_{q(\lambda)} &\leftarrow (b_0+1)/2 \\ a_{q(\sigma^2)} &\leftarrow (b_0 + N)/2 \\ \text{Cycle:} \\ \Psi_{q(\beta)} &\leftarrow \left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{q(\beta)} &\leftarrow \Psi_{q(\beta)}\left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} X^{\mathsf{T}} y + \Sigma_0^{-1}\mu_0\right) \\ b_{q(\lambda)} &\leftarrow a_0^{-2} + b_0 \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} \\ b_{q(\sigma^2)} &\leftarrow b_0\frac{a_{q(\lambda)}}{b_{q(\lambda)}} + \frac{\lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)})}{2} \end{aligned} \] cycled to convergence as before.
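A sketch of this cycle (same assumptions as the inverse-gamma version above; a fixed number of iterations replaces the \(\mathcal L(q)\) stopping rule for brevity, and the initial values are crude placeholders):

```python
import numpy as np

def cavi_normal_halft(X, y, mu0, Sigma0, a0, b0, n_iter=200):
    """CAVI for the Half-t(a0, b0) prior on sigma: as above, with the
    extra Inverse-Gamma factor q(lambda)."""
    N, d = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    Sigma0_inv = np.linalg.inv(Sigma0)
    a_lam = (b0 + 1) / 2                      # fixed across iterations
    a_s2 = (b0 + N) / 2                       # fixed across iterations
    b_lam, b_s2 = 1.0, 1.0                    # crude initial values
    for _ in range(n_iter):
        e_inv_s2 = a_s2 / b_s2                # E_q[sigma^{-2}]
        Psi_q = np.linalg.inv(e_inv_s2 * XtX + Sigma0_inv)
        mu_q = Psi_q @ (e_inv_s2 * Xty + Sigma0_inv @ mu0)
        b_lam = a0 ** -2 + b0 * a_s2 / b_s2   # q(lambda) update
        r = y - X @ mu_q
        b_s2 = b0 * a_lam / b_lam + 0.5 * (r @ r + np.trace(XtX @ Psi_q))
    return mu_q, Psi_q, a_s2, b_s2, a_lam, b_lam
```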
\(t\) Linear Regression
(Wand et al. 2010)
We replace \(p(y|\beta,\Sigma) = \text{Normal}(y|X\beta, \Sigma)\) by \[ \begin{aligned} p(y_i|\beta,\sigma,\nu) &= \text{Student-}t(x_i^{\mathsf{T}}\beta, \sigma, \nu) \\ p(\nu) &= \text{Uniform}(\nu_0, \nu_1), \end{aligned} \] which is equivalent to the scale-mixture representation \[ \begin{aligned} p(y_i|\beta,\sigma,\lambda_i) &= \text{Normal}(x_i^{\mathsf{T}}\beta, \lambda_i\sigma^2) \\ p(\lambda_i|\nu) &= \text{Inverse-Gamma}(\nu/2, \nu/2) \\ p(\nu) &= \text{Uniform}(\nu_0, \nu_1). \end{aligned} \]
Assuming an inverse-gamma prior on \(\sigma^2\), the full conditionals satisfy \[ \begin{aligned} \ln p(\beta|y,\sigma^2,\nu,\lambda) &= \text{const} -\frac{1}{2}\left[\sigma^{-2}(y-X\beta)^{\mathsf{T}}\text{diag}(1/\lambda)(y-X\beta) + (\beta-\mu_0)^{\mathsf{T}}\Sigma_0^{-1}(\beta-\mu_0)\right] \\ \ln p(\sigma^2|y,\beta,\nu,\lambda) &= \text{const} -(a_0+n/2+1)\ln(\sigma^2) - \left[b_0+\frac{1}{2}\left((y-X\beta)^{\mathsf{T}} \text{diag}(1/\lambda)(y - X\beta)\right)\right]/\sigma^2 \\ \ln p(\lambda|y,\beta,\nu,\sigma^2) &= \text{const} + \sum_{i=1}^n\left[ -\left(\frac{\nu + 1}{2}+1\right)\ln(\lambda_i) - \frac{1}{2}\left[\nu + \sigma^{-2}(y_i-x_i^{\mathsf{T}}\beta)^2\right]/\lambda_i\right] \\ \ln p(\nu|y,\beta,\sigma^2,\lambda) &= \text{const} + n\left[\frac{\nu}{2}\ln(\nu/2)-\ln\Gamma(\nu/2)\right]-(\nu/2)1^{\mathsf{T}}(\ln\lambda+1/\lambda),\quad\nu_0<\nu<\nu_1. \end{aligned} \]
We use the factorisation \(q(\beta,\nu,\sigma^2,\lambda)=q(\beta,\nu)q(\sigma^2)q(\lambda)\) which, from the form of the full conditionals above (the conditionals for \(\beta\) and \(\nu\) interact only through \(\lambda\), so \(q(\beta,\nu)\) factorises further into \(q(\beta)q(\nu)\)), results in \[ \begin{aligned} q^\star(\beta) &= \text{Normal}(\mu_{q(\beta)}, \Sigma_{q(\beta)}) \\ q^\star(\sigma^2) &= \text{Inverse-Gamma}(a_{q(\sigma^2)}, b_{q(\sigma^2)}) \\ q^\star(\lambda_i) &= \text{Inverse-Gamma}(a_{q(\lambda_i)}, b_{q(\lambda_i)}) \\ q^\star(\nu) &= \frac{\exp\left\{n\left[\frac{\nu}{2}\ln(\nu/2)-\ln\Gamma(\nu/2)\right]-(\nu/2)C_\nu\right\}}{\mathcal{F}(0,n,C_\nu,\nu_0,\nu_1)} \\ C_\nu &= \sum_{i=1}^n \left(\mathbb E[\ln\lambda_i] + \mathbb E[\lambda_i^{-1}]\right) \\ &= \sum_{i=1}^n \left[\ln(b_{q(\lambda_i)}) - \psi\left(\tfrac{1}{2}(\mu_{q(\nu)}+1)\right) + \frac{a_{q(\lambda_i)}}{b_{q(\lambda_i)}}\right] \\ \mu_{q(\nu)} &= \frac{\mathcal{F}(1,n,C_\nu,\nu_0,\nu_1)}{\mathcal{F}(0,n,C_\nu,\nu_0,\nu_1)} \\ a_{q(\lambda_i)} &= \frac{\mu_{q(\nu)} + 1}{2}\\ b_{q(\lambda_i)} &= \frac{1}{2}\left[\mu_{q(\nu)} + \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}\left\{(y-X\mu_{q(\beta)})_i^2 + (X\Sigma_{q(\beta)}X^{\mathsf{T}})_{ii}\right\}\right] \\ D_\lambda &= \text{diag}(a_{q(\lambda_i)}/b_{q(\lambda_i)}) \\ a_{q(\sigma^2)} &= a_0 + n/2\\ b_{q(\sigma^2)} &= b_0 + \frac{1}{2}\left[(y-X\mu_{q(\beta)})^{\mathsf{T}} D_\lambda(y-X\mu_{q(\beta)})+\text{tr}(\Sigma_{q(\beta)}X^{\mathsf{T}} D_\lambda X)\right]\\ \Sigma_{q(\beta)} &= \left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} D_\lambda X + \Sigma_0^{-1}\right)^{-1}\\ \mu_{q(\beta)} &= \Sigma_{q(\beta)}\left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} D_\lambda y + \Sigma_0^{-1}\mu_0\right), \end{aligned} \] where \(\mathcal F(p,n,c,\nu_0,\nu_1)=\int_{\nu_0}^{\nu_1}\nu^p\exp\left\{n\left[\frac{\nu}{2}\ln(\nu/2)-\ln\Gamma(\nu/2)\right]-\frac{c\nu}{2}\right\}\,d\nu\) has no closed form and must be evaluated numerically.
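\(\mathcal F\) is a one-dimensional integral over a finite interval, so straightforward quadrature works (a sketch with scipy.integrate.quad; for large \(n\) the log-integrand should be shifted by its maximum before exponentiating to avoid underflow, which is omitted here):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def F(p, n, c, nu0, nu1):
    """F(p, n, c, nu0, nu1) = int_{nu0}^{nu1} nu^p g(nu) dnu, with
    ln g(nu) = n[(nu/2) ln(nu/2) - ln Gamma(nu/2)] - c nu / 2."""
    def integrand(nu):
        log_g = n * (0.5 * nu * np.log(0.5 * nu) - gammaln(0.5 * nu)) - 0.5 * c * nu
        return nu ** p * np.exp(log_g)
    val, _ = quad(integrand, nu0, nu1)
    return val

def update_mu_nu(n, C_nu, nu0, nu1):
    """Variational posterior mean of nu given the current C_nu."""
    return F(1, n, C_nu, nu0, nu1) / F(0, n, C_nu, nu0, nu1)
```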
The ELBO is then \[ \begin{aligned} \mathcal{L}(q) &= \mathbb E[\ln p(y,\beta,\sigma^2,\lambda,\nu) - \ln q(\beta,\sigma^2,\lambda,\nu)] \\ &= \mathbb E[\ln p(y|\beta,\sigma^2,\lambda)] + \mathbb E[\ln p(\beta)] + \mathbb E[\ln p(\sigma^2)] + \sum_{i=1}^n\mathbb E[\ln p(\lambda_i|\nu)] + \mathbb E[\ln p(\nu)] \\ &\qquad + \mathbb H_q[\beta] + \mathbb H_q[\sigma^2] + \sum_{i=1}^n\mathbb H_q[\lambda_i] + \mathbb H_q[\nu], \end{aligned} \] with each term evaluated under the optimal densities above, as in the Gaussian case.
Linear Mixed Models
Either of the previous likelihoods may be extended to a hierarchical model \[ \begin{aligned} p(y|\beta,\gamma,\Sigma) &= \text{Normal}(y|X\beta + Z\gamma, \Sigma) \\ p(\gamma|G) &= \text{Normal}(\gamma|0,G), \end{aligned} \] where we specify grouped parameters \(\gamma = (\gamma_1^{\mathsf{T}},...,\gamma_K^{\mathsf{T}})^{\mathsf{T}}\), \[ \begin{aligned} Z &= \underset{N\times \sum_k(J_kR_k)}{\underbrace{\begin{bmatrix} Z_1 & \cdots & \overset{N\times J_kR_k}{\overbrace{Z_k}} &\cdots &Z_K \end{bmatrix}}} \\ \gamma_k|\underset{R_k\times R_k}{\underbrace{\Sigma_k}} &\sim \text{Normal}\left(0, I_{J_k} \otimes \Sigma_k\right) \\ \gamma_{kj}|\Sigma_k &\sim \text{Normal}(0, \Sigma_k),\quad j=1,...,J_k\\ G &= \bigoplus_{k=1}^K I_{J_k}\otimes \Sigma_k \\ &= \text{diag}(I_{J_1}\otimes \Sigma_1,...,I_{J_K}\otimes \Sigma_K), \end{aligned} \] such that \(J_k\) gives the number of groups and \(R_k\) the dimension of each \(\gamma_{kj}\), e.g. (but not restricted to) \[ Z_k = I_{J_k}\otimes \underset{n_k\times R_k}{\underbrace{\tilde Z_k}}, \] with \(n_k = N/J_k\) observations per group.
This covers combinations of different hierarchical terms in the linear predictor, for example, \[ \begin{aligned} Z &= Z_1 \\ &= I_{10}\otimes 1_3\\ \gamma_1&\sim N(0, \tau_1^2I_{10}) \end{aligned} \] or \[ \begin{aligned} Z &= \begin{bmatrix} Z_1 & Z_2 & Z_3 \end{bmatrix} \\ Z_1 &= I_{3}\otimes 1_4\\ Z_2 &= I_{4} \otimes \begin{pmatrix}1&1\\1&2\\1&3\end{pmatrix} \\ Z_3 &= \begin{pmatrix} z_{3,1,1}&\cdots&z_{3,1,6}\\ \vdots &\ddots &\vdots \\ z_{3,12,1}&\cdots&z_{3,12,6} \end{pmatrix}\\ \gamma_1|\sigma_1^2&\sim N(0, \sigma_1^2I_{3}) \\ \gamma_2|\Sigma_2&\sim N(0, I_{4} \otimes\Sigma_2) \\ \gamma_3|\sigma_3^2 &\sim N(0, \sigma_3^2I_{6}) \\ G &= \text{bdiag}(\sigma_1^2 I_3,\; I_4\otimes\Sigma_2,\; \sigma_3^2 I_6). \end{aligned} \]
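The second example can be assembled with Kronecker products and a block-diagonal helper (a sketch; the \(Z_3\) entries and the variance values are arbitrary illustrations):

```python
import numpy as np
from scipy.linalg import block_diag

# the second example above: N = 12 observations, three random-effect blocks
Z1 = np.kron(np.eye(3), np.ones((4, 1)))        # 12 x 3, random intercepts
Zt2 = np.column_stack([np.ones(3), np.arange(1, 4)])
Z2 = np.kron(np.eye(4), Zt2)                    # 12 x 8, random intercept + slope
rng = np.random.default_rng(2)
Z3 = rng.normal(size=(12, 6))                   # 12 x 6, generic design
Z = np.hstack([Z1, Z2, Z3])

# G = bdiag(sigma1^2 I_3, I_4 (x) Sigma2, sigma3^2 I_6) for given variance values
sigma1_sq, sigma3_sq = 0.5, 2.0                 # illustrative values
Sigma2 = np.array([[1.0, 0.3], [0.3, 0.5]])
G = block_diag(sigma1_sq * np.eye(3),
               np.kron(np.eye(4), Sigma2),
               sigma3_sq * np.eye(6))
print(Z.shape, G.shape)                          # (12, 17) (17, 17)
```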
Similarly to the fixed-effects case, we can consider priors on the variance components, \[ \begin{aligned} \Sigma_k &\sim \text{Inverse-Wishart}(\xi_k,\Lambda_k),\quad \xi_k>2(R_k-1)\\ \Sigma_k|\Lambda_k &\sim \text{Inverse-Wishart}(\nu_k+2(R_k-1),\Lambda_k),\quad \nu_k>0 \\ \Lambda_k &\sim \text{Inverse-Wishart}\left(1, \left[\nu_k\text{diag}(a_{k1}^2,...,a_{kR_k}^2)\right]^{-1}\right), \end{aligned} \] where if \(R_k=1\) then \(\Sigma_k=\tau_k\) (so \(I_{J_k}\otimes\Sigma_k=\tau_kI_{J_k}\)) and, mirroring the Half-\(t\) construction above, \[ \begin{aligned} \tau_k &\sim \text{Inverse-Gamma}(\xi_k/2,\xi_k\lambda_k/2) \quad (\lambda_k \text{ fixed}),\ \text{or} \\ \tau_k|\lambda_k&\sim \text{Inverse-Gamma}(\xi_k/2,\xi_k/\lambda_k) \\ \lambda_k &\sim \text{Inverse-Gamma}\left(1/2,a_k^{-2}\right). \end{aligned} \]
The derivation for the optimal densities follows that for the linear regression model. Define \(C = [X \; Z]\), \(\zeta=(\beta^{\mathsf{T}} \; \gamma^{\mathsf{T}})^{\mathsf{T}}\), and \[ \Xi = \Sigma_0 \oplus G \implies \Xi^{-1} = \Sigma_0^{-1}\oplus G^{-1}. \]
Then, writing \(\Omega_k\) for the \(k\)-th random-effects covariance \(\Sigma_k\) (to avoid clashing with the error covariance \(\Sigma\)), \[ \begin{aligned} p(\beta,\gamma|y,\Sigma,G) &\propto p(\beta,\gamma|y,\Sigma,\Omega_1,...,\Omega_K) \\ &\propto p(y|\beta,\gamma,\Sigma)p(\beta)\prod_{k=1}^Kp(\gamma_k|\Omega_k) \\ \ln p(\beta,\gamma|y,\Sigma,\Omega_1,...,\Omega_K) &\simeq -\frac{1}{2}\left[\zeta^{\mathsf{T}}(C^{\mathsf{T}}\Sigma^{-1}C+\Xi^{-1})\zeta - 2\zeta^{\mathsf{T}}C^{\mathsf{T}}\Sigma^{-1} y\right] \\ \zeta|\text{rest} &\sim \text{Normal}(\mu_{\zeta|\text{rest}},\Sigma_{\zeta|\text{rest}}) \\ \Sigma_{\zeta|\text{rest}} &= \left(C^{\mathsf{T}}\Sigma^{-1}C + \Xi^{-1}\right)^{-1}\\ \mu_{\zeta|\text{rest}} &= \Sigma_{\zeta|\text{rest}}\left(C^{\mathsf{T}}\Sigma^{-1}y + \Xi^{-1}\begin{bmatrix}\mu_0\\ 0 \end{bmatrix}\right). \end{aligned} \] From which, the optimal density is \[ \begin{aligned} q(\zeta) &= \text{Normal}(\zeta|\mu_{q(\zeta)},\Sigma_{q(\zeta)}) \\ \Sigma_{q(\zeta)} &= \left(C^{\mathsf{T}}\mathbb E_q[\Sigma^{-1}]C + \mathbb E_q[\Xi^{-1}]\right)^{-1} \\ \mu_{q(\zeta)} &= \Sigma_{q(\zeta)} \left(C^{\mathsf{T}}\mathbb E_q[\Sigma^{-1}]y + \mathbb E_q[\Xi^{-1}]\begin{bmatrix}\mu_0\\ 0 \end{bmatrix}\right) \\ \mathbb E_q[\Xi^{-1}] &= \begin{bmatrix} \Sigma_0^{-1} & 0 \\ 0 & \bigoplus_{k=1}^K\left(I_{J_k}\otimes \mathbb E_q[\Omega_k^{-1}]\right)\end{bmatrix}. \end{aligned} \]
Suppose we again assume \(\Sigma = \sigma^2 I_N\) and \(p(\sigma^2)=\text{Inverse-Gamma}(a_0,b_0)\); then, similarly to before, \[ \begin{aligned} q(\sigma^2) &= \text{Inverse-Gamma}(a_{q(\sigma^2)}, b_{q(\sigma^2)}) \\ a_{q(\sigma^2)} &= a_0 + \tfrac{N}{2} \\ b_{q(\sigma^2)} &= b_0 + \tfrac{1}{2}\left\{\lVert y - C\mu_{q(\zeta)}\rVert^2+\text{tr}(C^{\mathsf{T}} C\Sigma_{q(\zeta)})\right\}. \end{aligned} \]
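One cycle of the \(q(\zeta)\) and \(q(\sigma^2)\) updates might look as follows (a sketch; `E_Omega_invs`, `Jks`, and the function name are mine, and \(\mathbb E_q[\Omega_k^{-1}]\) is taken as given from the \(q(\Omega_k)\) update derived below):

```python
import numpy as np
from scipy.linalg import block_diag

def update_zeta_sigma2(X, Z, y, mu0, Sigma0, a0, b0, E_Omega_invs, Jks, b_s2):
    """One cycle of the q(zeta), q(sigma^2) updates for the mixed model.
    E_Omega_invs[k] holds E_q[Omega_k^{-1}]; Jks[k] holds the group count J_k."""
    N = X.shape[0]
    C = np.hstack([X, Z])
    # E_q[Xi^{-1}] = bdiag(Sigma0^{-1}, I_{J_k} (x) E_q[Omega_k^{-1}], k = 1..K)
    blocks = [np.linalg.inv(Sigma0)]
    blocks += [np.kron(np.eye(J), EOi) for J, EOi in zip(Jks, E_Omega_invs)]
    Xi_inv = block_diag(*blocks)
    a_s2 = a0 + N / 2
    e_inv_s2 = a_s2 / b_s2                     # E_q[sigma^{-2}]
    Sigma_z = np.linalg.inv(e_inv_s2 * C.T @ C + Xi_inv)
    prior_mean = np.concatenate([mu0, np.zeros(C.shape[1] - len(mu0))])
    mu_z = Sigma_z @ (e_inv_s2 * C.T @ y + Xi_inv @ prior_mean)
    r = y - C @ mu_z
    b_s2 = b0 + 0.5 * (r @ r + np.trace(C.T @ C @ Sigma_z))
    return mu_z, Sigma_z, a_s2, b_s2
```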
The new variational densities are those for \(\Omega_k\). Consider a single \(\Omega_k\), written \(\Omega\), where \(\gamma_i | \Omega \sim \text{Normal}(0, \Omega)\), \(i=1,...,m\) (i.e. a standard random-effects model), \(\Omega\) is \(d\times d\), and \(\Omega \sim \text{Inverse-Wishart}(\xi_0, \Lambda_0)\). Then \[ \begin{aligned} p(\Omega|\text{rest}) &\propto p(\Omega|\gamma) \\ &\propto p(\gamma|\Omega)p(\Omega) \\ \ln p(\Omega|\gamma) &\simeq -\frac{m}{2} \ln |\Omega| - \frac{1}{2}\sum_{i=1}^m\gamma_i^\mathsf{T}\Omega^{-1}\gamma_i - \frac{\xi_0 + d + 1}{2}\ln|\Omega| - \frac{1}{2}\text{tr}(\Lambda_0\Omega^{-1}) \\ \mathbb {E}_q[\ln p(\Omega|\gamma)] &\simeq - \frac{\xi_0 + d + 1 + m}{2}\ln|\Omega| - \frac{1}{2}\sum_{i=1}^m \mathbb{E}_q[\gamma_i^\mathsf{T}\Omega^{-1}\gamma_i] - \frac{1}{2}\text{tr}(\Lambda_0\Omega^{-1}). \end{aligned} \] Note that \[ \begin{aligned} \mathbb{E}_q[\gamma_i^\mathsf{T}\Omega^{-1}\gamma_i] &= (\mathbb{E}_q[\gamma_i])^\mathsf{T}\Omega^{-1}\mathbb{E}_q[\gamma_i] + \text{tr}(\mathbb{V}_q[\gamma_i]\Omega^{-1}) \\ &= \text{tr}\left(\mathbb{E}_q[\gamma_i](\mathbb{E}_q[\gamma_i])^\mathsf{T}\Omega^{-1} + \mathbb{V}_q[\gamma_i]\Omega^{-1}\right) \\ &= \text{tr}\left(\left(\mathbb{E}_q[\gamma_i](\mathbb{E}_q[\gamma_i])^\mathsf{T} + \mathbb{V}_q[\gamma_i]\right)\Omega^{-1}\right). \end{aligned} \] So the optimal density is also inverse-Wishart, i.e. \[ q(\Omega) = \text{Inverse-Wishart}(\Omega| \xi_{q(\Omega)}, \Lambda_{q(\Omega)}), \] where \[ \begin{aligned} \xi_{q(\Omega)} &= \xi_0 + m \\ \Lambda_{q(\Omega)} &= \Lambda_0 + \sum_{i=1}^m \left(\mathbb{E}_q[\gamma_i](\mathbb{E}_q[\gamma_i])^\mathsf{T} + \mathbb{V}_q[\gamma_i]\right), \end{aligned} \] and where the required expectations are \[ \begin{aligned} \mathbb{E}_q[\Omega^{-1}] &= \xi_{q(\Omega)}\Lambda_{q(\Omega)}^{-1} \\ \mathbb{E}_q[\ln|\Omega|] &= \ln|\Lambda_{q(\Omega)}/2| - \sum_{i=1}^{d}\psi\left(\frac{\xi_{q(\Omega)} + 1 - i}{2}\right). \end{aligned} \]
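And the corresponding \(q(\Omega)\) update with its expectations (a sketch; the function and argument names are mine):

```python
import numpy as np
from scipy.special import digamma

def update_q_Omega(xi0, Lambda0, gamma_means, gamma_covs):
    """q(Omega) update for a single random-effects covariance block:
    gamma_means[i] = E_q[gamma_i], gamma_covs[i] = V_q[gamma_i]."""
    m = len(gamma_means)
    d = Lambda0.shape[0]
    xi_q = xi0 + m
    Lambda_q = Lambda0 + sum(np.outer(g, g) + V
                             for g, V in zip(gamma_means, gamma_covs))
    E_Omega_inv = xi_q * np.linalg.inv(Lambda_q)
    # E_q[ln|Omega|] = ln|Lambda_q/2| - sum_{i=1}^d psi((xi_q + 1 - i)/2)
    E_logdet = (np.linalg.slogdet(Lambda_q / 2)[1]
                - sum(digamma((xi_q + 1 - i) / 2) for i in range(1, d + 1)))
    return xi_q, Lambda_q, E_Omega_inv, E_logdet
```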