Skip to contents

Normal Linear Regression

We consider the following model p(y|β,Σ)=Normal(y|Xβ,Σ)p(β)=Normal(β|μ0,Σ0) \begin{aligned} p(y|\beta,\Sigma) &= \text{Normal}(y|X\beta, \Sigma) \\ p(\beta) &= \text{Normal}(\beta|\mu_0, \Sigma_0) \end{aligned} often assuming Σ=σ2IN\Sigma = \sigma^2 I_N with either p(σ2)=Inverse-Gamma(σ2|a0,b0)p(\sigma^2) = \text{Inverse-Gamma}(\sigma^2|a_0, b_0) or p(σ)=Half-t(σ2|a0,b0)p(\sigma) = \text{Half-}t(\sigma^2|a_0,b_0).

(Look at the generalisation of this using G-Wishart priors as in Maestrini and Wand 2020)

The latter can be expressed as p(σ2|λ)=Inverse-Gamma(b0/2,b0/λ)p(λ)=Inverse-Gamma(1/2,1/a02)p(σ)=Half-t(a0,b0). \begin{aligned} p(\sigma^2|\lambda) &= \text{Inverse-Gamma}(b_0/2, b_0/\lambda) \\ p(\lambda) &= \text{Inverse-Gamma}(1/2,1/a_0^2) \\ \implies p(\sigma) &= \text{Half-}t(a_0,b_0). \end{aligned}

The proposed product-form variational densities are q(β,Σ,λ)=q(β)q(Σ)q(λ)q(\beta,\Sigma,\lambda)=q(\beta)q(\Sigma)q(\lambda) with optimal solutions q(β)exp{𝔼Σ[lnp(β|y,Σ)]}q(Σ)exp{𝔼β,λ[lnp(Σ|y,β,λ)]}q(λ)exp{𝔼Σ[lnp(λ|y,Σ)} \begin{aligned} q^\star(\beta) &\propto \exp\left\{\mathbb E_{\Sigma}\left[\ln p(\beta|y,\Sigma)\right]\right\} \\ q^\star(\Sigma) &\propto \exp\left\{\mathbb E_{\beta,\lambda}\left[\ln p(\Sigma|y,\beta,\lambda)\right]\right\} \\ q^\star(\lambda) &\propto \exp\left\{\mathbb E_\Sigma[\ln p(\lambda|y,\Sigma)\right\} \end{aligned} where the λ\lambda parameter is dropped if not needed.

From the model above, the joint density is lnp(y,β,Σ)=lnp(y|β,Σ)+lnp(β)+lnp(Σ|λ)+lnp(λ)\ln p(y,\beta,\Sigma) = \ln p(y|\beta,\Sigma) + \ln p(\beta) + \ln p(\Sigma|\lambda) + \ln p(\lambda).

Model Likelihood

The log-likelihood has the form lnp(y|β,Σ)=12(ln|Σ|+(yXβ)𝖳Σ1(yXβ)+pln(2π)) \ln p(y|\beta,\Sigma) = -\frac{1}{2}\left(\ln |\Sigma| +(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)+p\ln(2\pi)\right) so that 𝔼q[lnp(y|β,Σ)]=12(𝔼q[ln|Σ|]+𝔼q[(yXβ)𝖳Σ1(yXβ)]+pln(2π))=12(𝔼q[ln|Σ|]+(yXμβ)𝖳𝔼q[Σ1](yXμβ)+tr(X𝖳𝔼q[Σ1]XΣβ)+pln(2π)) \begin{aligned} \mathbb E_q[\ln p(y|\beta,\Sigma)] &= -\frac{1}{2}\left(\mathbb E_q[\ln|\Sigma|] + \mathbb E_q[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)]+p\ln(2\pi)\right) \\ &= -\frac{1}{2}\left(\mathbb E_q[\ln|\Sigma|] + (y-X\mu_\beta)^\mathsf{T}\mathbb E_q[\Sigma^{-1}](y-X\mu_\beta) + \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\Sigma_\beta\right)+p\ln(2\pi)\right) \end{aligned} due to the assumed variational independence of β\beta and Σ\Sigma.

To obtain the above, (letting μβ=𝔼q[β]\mu_\beta = \mathbb E_q[\beta] and Σβ=𝕍q[β]\Sigma_\beta = \mathbb V_q[\beta]) 𝔼q[(yXβ)𝖳Σ1(yXβ)]=y𝖳𝔼q[Σ1]y2y𝖳𝔼q[Σ1]X𝔼q[β]+𝔼q[β𝖳X𝖳Σ1Xβ]𝔼q[β𝖳X𝖳Σ1Xβ]=𝔼q[tr(X𝖳Σ1Xββ𝖳)]=tr(X𝖳𝔼q[Σ1]X{𝕍q[β]+𝔼q[β]𝔼q[β]𝖳})=tr(X𝖳𝔼q[Σ1]X𝕍q[β])+(X𝔼q[β])𝖳𝔼q[Σ1]X𝔼q[β]y𝖳𝔼q[Σ1]y2y𝖳𝔼q[Σ1]X𝔼q[β]=(yX𝔼q[β])𝖳𝔼q[Σ1](yX𝔼q[β])(X𝔼q[β])𝖳𝔼q[Σ1]X𝔼q[β]𝔼q[(yXβ)𝖳Σ1(yXβ)]=(yXμβ)𝖳𝔼q[Σ1](yXμβ)+tr(X𝖳𝔼q[Σ1]XΣβ) \begin{aligned} \mathbb E_q[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)] &= y^\mathsf{T}\mathbb E_q[\Sigma{-1}]y-2y^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta]+\mathbb E_q[\beta^\mathsf{T}X^\mathsf{T}\Sigma^{-1}X\beta]\\ \mathbb E_q[\beta^\mathsf{T}X^\mathsf{T}\Sigma^{-1}X\beta] &= \mathbb E_q[\text{tr}\left(X^\mathsf{T}\Sigma^{-1}X\beta\beta^\mathsf{T}\right)] \\ &= \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\left\{\mathbb V_q[\beta]+\mathbb E_q[\beta]\mathbb E_q[\beta]^\mathsf{T}\right\}\right) \\ &= \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb V_q[\beta]\right)+(X\mathbb E_q[\beta])^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta]\\ y^\mathsf{T}\mathbb E_q[\Sigma{-1}]y-2y^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta] &= (y-X\mathbb E_q[\beta])^\mathsf{T}\mathbb E_q[\Sigma^{-1}](y-X\mathbb E_q[\beta]) - (X\mathbb E_q[\beta])^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\mathbb E_q[\beta] \\ \mathbb E_q[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta)] &= (y-X\mu_\beta)^\mathsf{T}\mathbb E_q[\Sigma^{-1}](y-X\mu_\beta) + \text{tr}\left(X^\mathsf{T}\mathbb E_q[\Sigma^{-1}]X\Sigma_\beta\right) \end{aligned}

Model Coefficients

For the regression coefficients, the full conditional is lnp(β|y,Σ)12[(yXβ)𝖳Σ1(yXβ)12(βμ0)𝖳Σ01(βμ0)]=12[(βM1m)𝖳M(βM1m)]M=X𝖳Σ1X+Σ01m=X𝖳Σ1y+Σ01μ0p(β|y,Σ)=Normal(μβ|y,Σ,Ψβ|y,Σ)Ψβ|y,Σ=(X𝖳Σ1X+Σ01)1μβ|y,Σ=Ψβ|y,Σ(X𝖳Σ1y+Σ01μ0). \begin{aligned} \ln p(\beta | y, \Sigma) &\simeq -\frac{1}{2}\left[(y-X\beta)^{\mathsf{T}}\Sigma^{-1}(y-X\beta) - \frac{1}{2}(\beta-\mu_0)^{\mathsf{T}}\Sigma_0^{-1}(\beta-\mu_0)\right] \\ &= -\frac{1}{2}\left[\left(\beta - M^{-1}m\right)^{\mathsf{T}} M\left(\beta - M^{-1}m\right)\right]\\ M &= X^{\mathsf{T}}\Sigma^{-1}X + \Sigma_0^{-1} \\ m &= X^{\mathsf{T}}\Sigma^{-1}y + \Sigma_0^{-1}\mu_0 \\ \implies p(\beta|y,\Sigma) &= \text{Normal}(\mu_{\beta|y,\Sigma}, \Psi_{\beta|y,\Sigma}) \\ \Psi_{\beta|y,\Sigma} &= \left(X^{\mathsf{T}}\Sigma^{-1}X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{\beta|y,\Sigma} &= \Psi_{\beta|y,\Sigma}\left(X^{\mathsf{T}}\Sigma^{-1}y + \Sigma_0^{-1}\mu_0\right). \end{aligned}

From which, the optimal density is 𝔼q(Σ)[lnp(β|y,Σ)]12[(βMq1mq)𝖳Mq(βMq1mq)]Mq=𝔼q(Σ)[Σ1]X𝖳X+Σ01mq=𝔼q(Σ)[Σ1]X𝖳y+Σ01μ0q(β)=Normal(μq(β),Ψq(β))Ψq(β)=(𝔼q(Σ)[Σ1]X𝖳X+Σ01)1μq(β)=Σq(β)(𝔼q(Σ)[Σ1]X𝖳y+Σ01μ0). \begin{aligned} \mathbb E_{q(\Sigma)}[\ln p(\beta|y,\Sigma)] &\simeq -\frac{1}{2}\left[\left(\beta - M_q^{-1}m_q\right)^{\mathsf{T}} M_q\left(\beta - M_q^{-1}m_q\right)\right] \\ M_q &= \mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right]X^{\mathsf{T}} X + \Sigma_0^{-1} \\ m_q &= \mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right] X^{\mathsf{T}} y + \Sigma_0^{-1}\mu_0 \\ \implies q^\star(\beta) &= \text{Normal}(\mu_{q(\beta)}, \Psi_{q(\beta)}) \\ \Psi_{q(\beta)} &= \left(\mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right] X^{\mathsf{T}} X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{q(\beta)} &= \Sigma_{q(\beta)}\left(\mathbb E_{q(\Sigma)}\left[\Sigma^{-1}\right] X^{\mathsf{T}} y + \Sigma_0^{-1}\mu_0\right). \end{aligned}

The exact formulation of this density depends upon the form of Σ\Sigma and the resulting density q(Σ)q(\Sigma). Assuming that Σ=σ2I\Sigma=\sigma^2 I, replace q(Σ)q(\Sigma) with q(σ2)q(\sigma^2).

Inverse-Gamma prior on σ2\sigma^2

For the variance component assuming Σ=σ2I\Sigma=\sigma^2 I, lnp(σ2|y,β)N2ln(σ2)12σ2[(yXβ)𝖳(yXβ)](a0+1)ln(σ2)σ2b0p(σ2|y,β)=Inverse-Gamma(σ2|aσ2|β,y,bσ2|β,y)aσ2|β,y=a0+N2bσ2|β,y=b0+yXβ22 \begin{aligned} \ln p(\sigma^2|y,\beta) &\simeq \frac{N}{2}\ln(\sigma^{2})-\frac{1}{2}\sigma^{-2}\left[(y-X\beta)^{\mathsf{T}}(y-X\beta)\right] - (a_0+1)\ln(\sigma^2) - \sigma^{-2}b_0\\ \implies p(\sigma^2|y,\beta) &= \text{Inverse-Gamma}(\sigma^2|a_{\sigma^2|\beta,y}, b_{\sigma^2|\beta,y}) \\ a_{\sigma^2|\beta,y} &= a_0 + \frac{N}{2} \\ b_{\sigma^2|\beta,y} &= b_0 + \frac{\lVert y-X\beta\rVert^2}{2} \end{aligned} From which, the optimal density is q(σ2)𝔼q(β)[lnp(σ2|y,β)]q(σ2)=Inverse-Gamma(aq(σ2),bq(σ2))aq(σ2)=a0+N2bq(σ2)=b0+yXμq(β)2+tr(X𝖳XΨq(β))2 \begin{aligned} q^\star(\sigma^2) &\propto \mathbb E_{q(\beta)}\left[\ln p(\sigma^2|y,\beta)\right] \\ \implies q^\star(\sigma^2) &= \text{Inverse-Gamma}(a_{q(\sigma^2)}, b_{q(\sigma^2)}) \\ a_{q(\sigma^2)} &= a_0 + \frac{N}{2} \\ b_{q(\sigma^2)} &= b_0 + \frac{\lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)})}{2} \end{aligned} implying that 𝔼q(σ2)[σ2]=aq(σ2)bq(σ2)𝔼q(σ2)[ln(σ2)]=ln(bq(σ2))ψ(aq(σ2))q(σ2)[σ2]=aq(σ2)+lnbq(σ2)+lnΓ(aq(σ2))(1+aq(σ2))ψ(aq(σ2)) \begin{aligned} \mathbb E_{q(\sigma^2)}\left[\sigma^{-2}\right] &= \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} \\ \mathbb E_{q(\sigma^2)}\left[\ln(\sigma^2)\right] &= \ln\left(b_{q(\sigma^2)}\right) -\psi\left(a_{q(\sigma^2)}\right) \\ \mathbb H_{q(\sigma^2)}\left[\sigma^2\right] &= a_{q(\sigma^2)} + \ln b_{q(\sigma^2)} + \ln\Gamma \left(a_{q(\sigma^2)}\right) - (1+a_{q(\sigma^2)})\psi\left(a_{q(\sigma^2)}\right) \end{aligned} in the variational parameters for q(β)q^\star(\beta).

An alternative form for the bq(σ2)b_{q(\sigma^2)} term which avoids repeated computation of statistics is to use yXμq(β)2+tr(X𝖳XΨq(β))=y𝖳y2μq(β)𝖳X𝖳y+tr[(X𝖳X)(Ψq(β)+μq(β)μq(β)𝖳)]. \begin{aligned} \lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)}) &= \\ y^{\mathsf{T}}y-2\mu_{q(\beta)}^{\mathsf{T}}X^{\mathsf{T}}y&+\text{tr}\left[(X^{\mathsf{T}}X)\left(\Psi_{q(\beta)}+\mu_{q(\beta)}\mu_{q(\beta)}^{\mathsf{T}}\right)\right]. \end{aligned}

The updates are then aq(σ2)a0+N/2Cycle:Ψq(β)(aq(σ2)bq(σ2)X𝖳X+Σ01)1μq(β)Ψq(β)(aq(σ2)bq(σ2)X𝖳y+Σ01μ0)bq(σ2)b0+yXμq(β)2+tr(X𝖳XΨq(β))2 \begin{aligned} a_{q(\sigma^2)} &\leftarrow a_0 + N/2 \\ \text{Cycle:} \\ \Psi_{q(\beta)} &\leftarrow \left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{q(\beta)} &\leftarrow \Psi_{q(\beta)}\left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} X^{\mathsf{T}} y + \Sigma_0^{-1}\mu_0\right) \\ b_{q(\sigma^2)} &\leftarrow b_0 + \frac{\lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)})}{2} \end{aligned} until the change in (q)\mathcal{L}(q) is below a specified tolerance level indicating convergence.

The lower bound itself is (q)=𝔼q[lnp(y,β,σ2)q(β,σ2)]=𝔼q[lnp(y|β,σ2)]+𝔼q[lnp(β)]+𝔼q[lnp(σ2)]+q[β]+q[σ2]𝔼q[lnp(y|β,σ2)]=𝔼q[lnp(β)]=12[dln(2π)+ln|Σ0|+(𝔼q[β]μ0)𝖳Σ01(𝔼q[β]μ0)+tr(Σ01𝕍q[β])]𝔼q[lnp(σ2)]=a0ln(b0)lnΓ(a0)(a0+1)𝔼q[lnσ2]b0𝔼q[σ2] \begin{aligned} \mathcal{L}(q) &= \mathbb E_q[\ln p(y,\beta,\sigma^2) - q(\beta,\sigma^2)] \\ &= \mathbb E_q[\ln p(y|\beta,\sigma^2)] + \mathbb E_q[\ln p(\beta)] + \mathbb E_q[\ln p(\sigma^2)] + \mathbb H_q[\beta] + \mathbb H_q[\sigma^2] \\ \mathbb E_q[\ln p(y|\beta,\sigma^2)] &= \\ \mathbb E_q[\ln p(\beta)] &= -\frac{1}{2}\left[d\ln(2\pi)+\ln|\Sigma_0|+(\mathbb E_q[\beta]-\mu_0)^\mathsf{T}\Sigma_0^{-1}(\mathbb E_q[\beta]-\mu_0)+\text{tr}\left(\Sigma_0^{-1}\mathbb V_q[\beta]\right)\right]\\ \mathbb E_q[\ln p(\sigma^2)] &= a_0\ln(b_0)-\ln\Gamma(a_0)-(a_0+1)\mathbb E_q[\ln\sigma^2]-b_0\mathbb E_q[\sigma^{-2}] \end{aligned}

Half-tt prior on σ\sigma

(Maestrini and Wand 2020)

The optimal density for β\beta is unchanged from the previous section.

For λ\lambda, lnp(λ|σ2)λa02(1/2+1)lnλb0/2lnλλσ2/b0p(λ|σ2)=Inverse-Gamma(aq(λ),bq(λ))aλ|σ2=(b0+1)/2bλ|σ2=a02+b0σ2 \begin{aligned} \ln p(\lambda|\sigma^2) &\simeq -\lambda a_0^2 -(1/2+1)\ln\lambda -b_0/2\ln\lambda -\lambda\sigma^2/b_0\\ \implies p(\lambda|\sigma^2) &= \text{Inverse-Gamma}\left(a_{q(\lambda)}, b_{q(\lambda)}\right) \\ a_{\lambda|\sigma^2} &=(b_0+1)/2 \\ b_{\lambda|\sigma^2} &= a_0^{-2}+b_0\sigma^{-2} \end{aligned} From which, the optimal density is q(λ)exp(𝔼q(σ2)[lnp(λ|σ2)])q(λ)=Inverse-Gamma(aq(λ),bq(λ))aq(λ)=(b0+1)/2bq(λ)=a02+b0𝔼q(σ2)[σ2]. \begin{aligned} q^\star(\lambda) &\propto \exp\left(\mathbb E_{q(\sigma^2)}[\ln p(\lambda|\sigma^2)]\right) \\ \implies q^\star(\lambda) &= \text{Inverse-Gamma}(a_{q(\lambda)}, b_{q(\lambda)}) \\ a_{q(\lambda)} &= (b_0+1)/2\\ b_{q(\lambda)} &= a_0^{-2}+b_0\mathbb E_{q(\sigma^2)}[\sigma^{-2}]. \end{aligned}

For σ2\sigma^2, lnp(σ2|y,β,λ)N2ln(σ2)12σ2||yXβ||2(b0/2+1)ln(σ2)σ2b0/λp(σ2|y,β,λ)=Inverse-Gamma(aσ2|y,β,λ,bσ2|y,β,λ)aσ2|y,β,λ=(b0+N)/2bσ2|y,β,λ=b0/λ+||yXβ||22 \begin{aligned} \ln p(\sigma^2|y,\beta,\lambda) &\simeq \frac{N}{2}\ln(\sigma^2) - \frac{1}{2}\sigma^{-2}||y-X\beta||^2 -(b_0/2+1)\ln(\sigma^2)-\sigma^{-2}b_0/\lambda \\ \implies p(\sigma^2|y,\beta,\lambda) &= \text{Inverse-Gamma}(a_{\sigma^2|y,\beta,\lambda}, b_{\sigma^2|y,\beta,\lambda}) \\ a_{\sigma^2|y,\beta,\lambda} &= (b_0+N)/2 \\ b_{\sigma^2|y,\beta,\lambda} &= b_0/\lambda + \frac{||y-X\beta||^2}{2} \end{aligned} from which, the optimal density is q(σ2)exp(𝔼q(β,λ)[lnp(σ2|y,β,λ)])q(σ2)=Inverse-Gamma(aq(σ2),bq(σ2))aq(σ2)=(b0+N)/2bq(σ2)=b0𝔼q(λ)[λ1]+||yXμq(β)||2+tr(X𝖳XΨq(β))2=b0aq(λ)bq(λ)+||yXμq(β)||2+tr(X𝖳XΨq(β))2. \begin{aligned} q^\star(\sigma^2) &\propto \exp\left(\mathbb E_{q(\beta,\lambda)}[\ln p(\sigma^2|y,\beta,\lambda)]\right) \\ \implies q^\star(\sigma^2) &= \text{Inverse-Gamma}(a_{q(\sigma^2)}, b_{q(\sigma^2)}) \\ a_{q(\sigma^2)} &= (b_0 + N)/2 \\ b_{q(\sigma^2)} &= b_0\mathbb E_{q(\lambda)}[\lambda^{-1}] + \frac{||y-X\mu_{q(\beta)}||^2+\text{tr}\left(X^\mathsf{T}X\Psi_{q(\beta)}\right)}{2}\\ &= b_0\frac{a_{q(\lambda)}}{b_{q(\lambda)}} + \frac{||y-X\mu_{q(\beta)}||^2+\text{tr}\left(X^\mathsf{T}X\Psi_{q(\beta)}\right)}{2}. \end{aligned}

The updates are then aq(λ)(b0+1)/2aq(σ2)(b0+N)/2Cycle:Ψq(β)(aq(σ2)bq(σ2)X𝖳X+Σ01)1μq(β)Ψq(β)(aq(σ2)bq(σ2)X𝖳y+Σ01μ0)bq(λ)a02+b0aq(σ2)bq(σ2)bq(σ2)b0aq(λ)bq(λ)+yXμq(β)2+tr(X𝖳XΨq(β))2 \begin{aligned} a_{q(\lambda)} &\leftarrow (b_0+1)/2 \\ a_{q(\sigma^2)} &\leftarrow (b_0 + N)/2 \\ \text{Cycle:} \\ \Psi_{q(\beta)} &\leftarrow \left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} X + \Sigma_0^{-1}\right)^{-1} \\ \mu_{q(\beta)} &\leftarrow \Psi_{q(\beta)}\left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} X^{\mathsf{T}} y + \Sigma_0^{-1}\mu_0\right) \\ b_{q(\lambda)} &\leftarrow a_0^{-2} + b_0 \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}} \\ b_{q(\sigma^2)} &\leftarrow b_0\frac{a_{q(\lambda)}}{b_{q(\lambda)}} + \frac{\lVert y - X\mu_{q(\beta)}\rVert^2+\text{tr}(X^{\mathsf{T}} X\Psi_{q(\beta)})}{2} \end{aligned}

tt Linear Regression

(Wand et al. 2010)

We replace p(y|β,σ2)=Normal(y|Xβ,Σ)p(y|\beta,\sigma^2) = \text{Normal}(y|X\beta, \Sigma) by p(yi|β,σ,ν)=Student-t(xi𝖳β,σ,ν)p(ν)=Uniform(ν0,ν1) \begin{aligned} p(y_i|\beta,\sigma,\nu) &= \text{Student-t}(x_i^{\mathsf{T}}\beta, \sigma, \nu) \\ p(\nu) &= \text{Uniform}(\nu_0, \nu_1) \end{aligned} which is equivalent to p(yi|β,σ,ν)=Normal(xi𝖳β,λiσ2)p(λi|ν)=Inverse-Gamma(ν/2,ν/2)p(ν)=Uniform(ν0,ν1). \begin{aligned} p(y_i|\beta,\sigma,\nu) &= \text{Normal}(x_i^{\mathsf{T}}\beta, \lambda_i\sigma^2) \\ p(\lambda_i|\nu) &= \text{Inverse-Gamma}(\nu/2, \nu/2) \\ p(\nu) &= \text{Uniform}(\nu_0, \nu_1). \end{aligned}

Assuming an inverse-gamma prior on σ2\sigma^2, the full-conditionals satisfy lnp(β|y,σ2,ν,λ)=constlnp(σ2|y,β,ν,λ)=const(a0+n/2)ln(σ2)[b0+12((yXβ)𝖳diag(1/λ)(yXβ))]/σ2lnp(λi|y,β,ν,σ2)=const+i=1n[12(ν+1)ln(λi)12[ν+σ2(yixi𝖳β)2]/λi]lnp(ν|y,β,σ2,λ)=const+n[ν2ln(ν/2)lnΓ(ν/2)](ν/2)1𝖳(lnλ+1/λ),ν0<ν<ν1 \begin{aligned} \ln p(\beta|y,\sigma^2,\nu,\lambda) &= \text{const} \\ \ln p(\sigma^2|y,\beta,\nu,\lambda) &= \text{const} -(a_0+n/2)\ln(\sigma^2) - \left[b_0+\frac{1}{2}\left((y-X\beta)^{\mathsf{T}} \text{diag}(1/\lambda)(y - X\beta)\right)\right]/\sigma^2 \\ \ln p(\lambda_i|y,\beta,\nu,\sigma^2) &= \text{const} + \sum_{i=1}^n\left[ -\frac{1}{2}(\nu + 1)\ln(\lambda_i) - \frac{1}{2}\left[\nu + \sigma^{-2}(y_i-x_i^{\mathsf{T}}\beta)^2\right]/\lambda_i\right] \\ \ln p(\nu|y,\beta,\sigma^2,\lambda) &= \text{const} + n\left[\frac{\nu}{2}\ln(\nu/2)-\ln\Gamma(\nu/2)\right]-(\nu/2)1^{\mathsf{T}}(\ln\lambda+1/\lambda),\quad\nu_0<\nu<\nu_1 \end{aligned}

We use the factorisation q(β,ν,σ2,λ)=q(β,ν)q(σ2)q(λ)q(\beta,\nu,\sigma^2,\lambda)=q(\beta,\nu)q(\sigma^2)q(\lambda) which, from the form of the full-conditionals above, results in q(β)=Normal(μq(β),Σq(β))q(σ2)=Inverse-Gamma(aq(σ2),bq(σ2))q(λi)=Inverse-Gamma(aq(λi),bq(λi))q(ν)=exp{n[ν2ln(ν/2)lnΓ(ν/2)](ν/2)Cν}(0,n,Cν,ν0,ν1)Cν=i=1n𝔼[lnλi]+𝔼[λi1]=i=1nln(bq(λi))ψ(12(μq(ν)+1))+aq(λi)bq(λi)μq(ν)=(1,n,Cν,ν0,ν1)(0,n,Cν,ν0,ν1)aq(λi)=μq(ν)+12bq(λi)=12[μq(ν)+aq(σ2)bq(σ2){(yXμq(β))i2+(XΣq(β)X𝖳)ii}]Dλ=diag(aq(λi)/bq(λi))aq(σ2)=a0+n/2bq(σ2)=b0+12[(yXμq(β))𝖳Dλ(yXμq(β))+tr(Σq(β)X𝖳DλX)]Σq(β)=(aq(σ2)bq(σ2)X𝖳DλX+Σ01)1μq(β)=Σq(β)(aq(σ2)bq(σ2)X𝖳Dλy+Σ01μ0). \begin{aligned} q^\star(\beta) &= \text{Normal}(\mu_{q(\beta)}, \Sigma_{q(\beta)}) \\ q^\star(\sigma^2) &= \text{Inverse-Gamma}(a_{q(\sigma^2)}, b_{q(\sigma^2)}) \\ q^\star(\lambda_i) &= \text{Inverse-Gamma}(a_{q(\lambda_i)}, b_{q(\lambda_i)}) \\ q^\star(\nu) &= \frac{\exp\left\{n\left[\frac{\nu}{2}\ln(\nu/2)-\ln\Gamma(\nu/2)\right]-(\nu/2)C_\nu\right\}}{\mathcal{F}(0,n,C_\nu,\nu_0,\nu_1)} \\ C_\nu &= \sum_{i=1}^n \mathbb E[\ln\lambda_i] + \mathbb E[\lambda_i^{-1}] \\ &= \sum_{i=1}^n \ln(b_{q(\lambda_i)}) - \psi\left(\frac{1}{2}(\mathbb \mu_{q(\nu)}+1)\right) + \frac{a_{q(\lambda_i)}}{b_{q(\lambda_i)}} \\ \mu_{q(\nu)} &= \frac{\mathcal{F}(1,n,C_\nu,\nu_0,\nu_1)}{\mathcal{F}(0,n,C_\nu,\nu_0,\nu_1)} \\ a_{q(\lambda_i)} &= \frac{\mu_{q(\nu)} + 1}{2}\\ b_{q(\lambda_i)} &= \frac{1}{2}\left[\mu_{q(\nu)} + \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}\left\{(y-X\mu_{q(\beta)})_i^2 + (X\Sigma_{q(\beta)}X^{\mathsf{T}})_{ii}\right\}\right] \\ D_\lambda &= \text{diag}(a_{q(\lambda_i)}/b_{q(\lambda_i)}) \\ a_{q(\sigma^2)} &= a_0 + n/2\\ b_{q(\sigma^2)} &= b_0 + \frac{1}{2}\left[(y-X\mu_{q(\beta)})^{\mathsf{T}} D_\lambda(y-X\mu_{q(\beta)})+\text{tr}(\Sigma_{q(\beta)}X^{\mathsf{T}} D_\lambda X)\right]\\ \Sigma_{q(\beta)} &= \left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} D_\lambda X + \Sigma_0^{-1}\right)^{-1}\\ \mu_{q(\beta)} &= \Sigma_{q(\beta)}\left(\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}X^{\mathsf{T}} D_\lambda y + \Sigma_0^{-1}\mu_0\right). \end{aligned}

The ELBO is then (q)=𝔼[lnp(y,β,σ2,λ,ν)lnq(β,σ2,λ,ν)]= \begin{aligned} \mathcal{L}(q) &= \mathbb E[\ln p(y,\beta,\sigma^2,\lambda,\nu) - \ln q(\beta,\sigma^2,\lambda,\nu)] \\ &= \end{aligned}

Linear Mixed Models

Either of the previous likelihoods may be extended to a hiearchical model p(y|β,σ2)=Normal(y|Xβ+Zγ,Σ)p(γ|G)=Normal(γ|0,G). \begin{aligned} p(y|\beta,\sigma^2) &= \text{Normal}(y|X\beta + Z\gamma, \Sigma) \\ p(\gamma|G) &= \text{Normal}(\gamma|0,G). \end{aligned} Where we specify grouped parameters, γ=(γ1𝖳,...,γK𝖳)𝖳\gamma = (\gamma_1^{\mathsf{T}},...,\gamma_K^{\mathsf{T}})^{\mathsf{T}}Z=[Z1ZkN×JkRkZK]N×k(JkRk)γk|ΣkRk×RkNormal(0,IJkΣk)γkj|ΣkNormal(0,Σk),j=1,...,JkG=k=1KIJkΣk=diag(IJ1Σ1,...,IJKΣK) \begin{aligned} Z &= \underset{N\times \sum_k(J_kR_k)}{\underbrace{\begin{bmatrix} Z_1 & \cdots & \overset{N\times J_kR_k}{\overbrace{Z_k}} &\cdots &Z_K \end{bmatrix}}} \\ \gamma_k|\underset{R_k\times R_k}{\underbrace{\Sigma_k}} &\sim \text{Normal}\left(0, I_{J_k} \otimes \Sigma_k\right) \\ \gamma_{kj}|\Sigma_k &\sim \text{Normal}(0, \Sigma_k),\quad j=1,...,J_k\\ G &= \bigoplus_{k=1}^K I_{J_k}\otimes \Sigma_k \\ &= \text{diag}(I_{J_1}\otimes \Sigma_1,...,I_{J_K}\otimes \Sigma_K) \end{aligned} such that Jk,RkJ_k,R_k hint at the structure of γk\gamma_k, e.g. (but not restricted to) Zk=IJkZ̃kRk×Rk. Z_k = I_{J_k}\otimes \underset{R_k\times R_k}{\underbrace{\tilde Z_k}}.

This covers combinations of different hierarchical terms in the linear predictor for example, Z=Z1=I1013γ1N(0,τ12I10) \begin{aligned} Z &= Z_1 \\ &= I_{10}\otimes 1_3\\ \gamma_1&\sim N(0, \tau_1^2I_{10}) \end{aligned} or Z=[Z1Z2Z3]Z1=I314Z2=I4(111213)Z3=(z3,1,1z3,1,6z3,12,1z3,12,6)γ1|σ12N(0,σ12I3)γ2|Σ2N(0,I4Σ2)γ3|σ32N(0,σ32I6)G=bdiag(σ12I3.I4Σ2,σ32I6) \begin{aligned} Z &= \begin{bmatrix} Z_1 & Z_2 & Z_3 \end{bmatrix} \\ Z_1 &= I_{3}\otimes 1_4\\ Z_2 &= I_{4} \otimes \begin{pmatrix}1&1\\1&2\\1&3\end{pmatrix} \\ Z_3 &= \begin{pmatrix} z_{3,1,1}&\cdots&z_{3,1,6}\\ \vdots &\ddots &\vdots \\ z_{3,12,1}&\cdots&z_{3,12,6} \end{pmatrix}\\ \gamma_1|\sigma_1^2&\sim N(0, \sigma_1^2I_{3}) \\ \gamma_2|\Sigma_2&\sim N(0, I_{4} \otimes\Sigma_2) \\ \gamma_3|\sigma_3^2 &\sim N(0, \sigma_3^2I_{6}) \\ G &= \text{bdiag}(\sigma_1^2 I_3. I_4\otimes\Sigma_2, \sigma_3^2 I_6) \end{aligned}

Similarly to the fixed effects case, we can consider priors on the variance components ΣkInverse-Wishart(ξk,Λk),ξk>2(Rk1)Σk|ΛkInverse-Wishart(νk+2(Rk1),Λk),νk>0ΛkInverse-Wishart(1,[νkdiag(ak12,...,akRk2)]1) \begin{aligned} \Sigma_k &\sim \text{Inverse-Wishart}(\xi_k,\Lambda_k),\quad \xi_k>2(R_k-1)\\ \Sigma_k|\Lambda_k &\sim \text{Inverse-Wishart}(\nu_k+2(R_k-1),\Lambda_k),\quad \nu_k>0 \\ \Lambda_k &\sim \text{Inverse-Wishart}\left(1, \left[\nu_k\text{diag}(a_{k1}^2,...,a_{kR_k}^2)\right]^{-1}\right) \end{aligned} where if Rk=1R_k=1 then Σk=τkIJk\Sigma_k=\tau_kI_{J_k} and τkInverse-Gamma(ξk/2,ξkλk/2)τk|λkInverse-Gamma(ξk/2,ξkλk/2)λkInverse-Gamma(1/2,ak2) \begin{aligned} \tau_k &\sim \text{Inverse-Gamma}(\xi_k/2,\xi_k\lambda_k/2) \\ \tau_k|\lambda_k&\sim \text{Inverse-Gamma}(\xi_k/2,\xi_k\lambda_k/2) \\ \lambda_k &\sim \text{Inverse-Gamma}\left(1/2,a_k^{-2}\right) \end{aligned}

The derivation for the optimal densities follows that for the linear regression model. Define C=[XZ]C = [X \ Z], ζ=(βγ)\zeta=(\beta^\top \ \gamma^\top)^\top, and Ξ=Σ0GΞ1=Σ01G1 \Xi = \Sigma_0 \oplus G \implies \Xi^{-1} = \Sigma_0^{-1}\oplus G^{-1}

Then p(β,γ|y,Σ,G)p(β,γ|y,Σ,Ω1,...,ΩK)p(y|β,γ,Σ)p(β)k=1Kp(γk|Ωk)lnp(β,γ|)12[ζ𝖳(CΣ1C+Ξ1)ζ2ζ𝖳C𝖳Σ1y]ζ|restNormal(μζ|rest,Σζ|rest)Σζ|rest=(C𝖳Σ1C+Ξ1)1μζ|rest=Σζ|rest(C𝖳Σ1y+Ξ1[μ00]). \begin{aligned} p(\beta,\gamma|y,\Sigma,G) &\propto p(\beta,\gamma|y,\Sigma,\Omega_1,...,\Omega_K) \\ &\propto p(y|\beta,\gamma,\Sigma)p(\beta)\prod_{k=1}^Kp(\gamma_k|\Omega_k) \\ \ln p(\beta,\gamma|) &\simeq -\frac{1}{2}\left[\zeta^{^{\mathsf{T}}}(C^\top\Sigma^{-1}C+\Xi^{-1})\zeta - 2\zeta^{\mathsf{T}}C^{\mathsf{T}}\Sigma^{-1} y\right] \\ \zeta|\text{rest} &\sim \text{Normal}(\mu_{\zeta|\text{rest}},\Sigma_{\zeta|\text{rest}}) \\ \Sigma_{\zeta|\text{rest}} &= \left(C^{\mathsf{T}}\Sigma^{-1}C + \Xi^{-1}\right)^{-1}\\ \mu_{\zeta|\text{rest}} &= \Sigma_{\zeta|\text{rest}}\left(C^{\mathsf{T}}\Sigma^{-1}y + \Xi^{-1}\begin{bmatrix}\mu_0\\ 0 \end{bmatrix}\right). \end{aligned} From which, the optimal density is q(ζ)=Normal(ζ|μζ,Σζ)Σζ=(C𝖳𝔼q[Σ1]C+𝔼q[Ξ1])1μζ=Σζ(C𝖳𝔼q[Σ1]y+𝔼q[Ξ1][μ00])𝔼q[Ξ1]=[Σ0100k=1K(IJk𝔼[Ωk1])] \begin{aligned} q(\zeta) &= \text{Normal}(\zeta|\mu_\zeta,\Sigma_\zeta) \\ \Sigma_\zeta &= \left(C^{\mathsf{T}}\mathbb E_q[\Sigma^{-1}]C + \mathbb E_q[\Xi^{-1}]\right)^{-1} \\ \mu_\zeta &= \Sigma_\zeta \left(C^{\mathsf{T}}\mathbb E_q[\Sigma^{-1}]y + \mathbb E_q[\Xi^{-1}]\begin{bmatrix}\mu_0\\ 0 \end{bmatrix}\right) \\ \mathbb E_q[\Xi^{-1}] &= \begin{bmatrix} \Sigma_0^{-1} & 0 \\ 0 & \bigoplus_{k=1}^K\left(I_{J_k}\otimes \mathbb E[\Omega_k^{-1}]\right)\end{bmatrix} \end{aligned}

Suppose we again assumed Σ=σ2IN\Sigma = \sigma^2 I_N and p(σ2)=Inv-Gamma(a0,b0)p(\sigma^2)=\text{Inv-Gamma}(a_0,b_0), then similar to before q(σ2)=Inverse-Gamma(aσ2,bσ2)aσ2=a0+N2bσ2=b0+12{yCμq(ζ)2+tr(C𝖳CΣq(ζ))}. \begin{aligned} q(\sigma^2) &= \text{Inverse-Gamma}(a_{\sigma^2}, b_{\sigma}^2) \\ a_{\sigma^2} &= a_0 + \tfrac{N}{2} \\ b_{\sigma^2} &= b_0 + \tfrac{1}{2}\left\{\lVert y - C\mu_{q(\zeta)}\rVert^2+\text{tr}(C^{\mathsf{T}} C\Sigma_{q(\zeta)})\right\}. \end{aligned}

The new variational densities are those for Ωk\Omega_k. Consider just one Ωk\Omega_k where γi|ΩN(0,Ω),i=1,...,m\gamma_i | \Omega \sim N(0, \Omega), i=1,...,m (i.e. a standard random effects model) and where ΩInverse-Wishart(ξ0,Λ0)\Omega \sim \text{Inverse-Wishart}(\xi_0, \Lambda_0) then p(Ω|rest)p(Ω|γ)p(γ|Ω)p(Ω)lnp(Ω|γ)m2ln|Ω|12i=1mγi𝖳Ω1γi(ξ0+p+1)/2ln|Ω|tr(Λ0Ω1)𝔼q[lnp(Ω|γ)]12(ξ0+p+1+m)ln|Ω|12i=1m𝔼q[γi𝖳Ω1γi]tr(Λ0Ω1) \begin{aligned} p(\Omega|\text{rest}) &\propto p(\Omega|\gamma) \\ &\propto p(\gamma|\Omega)p(\Omega) \\ \ln p(\Omega|\gamma) &\simeq -\frac{m}{2} \ln |\Omega| - \frac{1}{2}\sum_{i=1}^m\gamma_i^\mathsf{T}\Omega^{-1}\gamma_i - (\xi_0 + p + 1)/2\ln|\Omega| - \text{tr}(\Lambda_0\Omega^{-1}) \\ \mathbb {E}_q[\ln p(\Omega|\gamma)] &\simeq - \frac{1}{2}(\xi_0 + p + 1 + m)\ln|\Omega| - \frac{1}{2}\sum_{i=1}^m \mathbb{E}_q[\gamma_i^\mathsf{T}\Omega^{-1}\gamma_i] - \text{tr}(\Lambda_0\Omega^{-1}) \end{aligned} Note that 𝔼q[γi𝖳Ω1γi]=(𝔼q[γi])𝖳Ω1𝔼q[γi]+tr(𝕍q[γi]Ω1)=tr(𝔼q[γi](𝔼q[γi])𝖳Ω1+𝕍q[γi]Ω1)=tr((𝔼q[γi](𝔼q[γi])𝖳+𝕍q[γi])Ω1) \begin{aligned} \mathbb{E}_q[\gamma_i^\mathsf{T}\Omega^{-1}\gamma_i] &= (\mathbb{E}_q[\gamma_i])^\mathsf{T}\Omega^{-1}\mathbb{E}_q[\gamma_i] + \text{tr}(\mathbb{V}_q[\gamma_i]\Omega^{-1}) \\ &= \text{tr}\left(\mathbb{E}_q[\gamma_i](\mathbb{E}_q[\gamma_i])^\mathsf{T}\Omega^{-1} + \mathbb{V}_q[\gamma_i]\Omega^{-1}\right) \\ &= \text{tr}\left(\left(\mathbb{E}_q[\gamma_i](\mathbb{E}_q[\gamma_i])^\mathsf{T} + \mathbb{V}_q[\gamma_i]\right)\Omega^{-1}\right) \end{aligned} So the optimal density is also Inverse-Wishart, i.e. q(Ω)=Inverse-Wishart(Ω|ξq(Ω),Λq(Ω)) q(\Omega) = \text{Inverse-Wishart}(\Omega| \xi_{q(\Omega)}, \Lambda_{q(\Omega)}) where ξq(Ω)=ξ0+mΛq(Ω)=Λ0+i=1m(𝔼q[γi](𝔼q[γi])𝖳+𝕍q[γi]) \begin{aligned} \xi_{q(\Omega)} &= \xi_0 + m \\ \Lambda_{q(\Omega)} &= \Lambda_0 + \sum_{i=1}^m \left(\mathbb{E}_q[\gamma_i](\mathbb{E}_q[\gamma_i])^\mathsf{T} + \mathbb{V}_q[\gamma_i]\right) \end{aligned} and where the required expectations are 𝔼q[Ω1]=ξq(Ω)Λq(Ω)1𝔼q[ln|Ω|]=ln|Λq(Ω)/2|ψd(ξq(Ω)d+1) \begin{aligned} \mathbb{E}_q[\Omega^{-1}] &= \xi_{q(\Omega)}\Lambda_{q(\Omega)}^{-1} \\ \mathbb{E}_q[\ln|\Omega|] &= \ln|\Lambda_{q(\Omega)}/2| - \psi_d(\xi_{q(\Omega)} - d + 1) \end{aligned}

Examples

Notation

The symbol \simeq is used to indicate equality up to an additive constant, similar to \propto for multiplicative constants.

References

Wand, Matt, JT Ormerod, SA Padoan, and R Fruhwirth. 2010. “Variational Bayes for Elaborate Distributions.”