ML notes: Bayesian

Posted by Xuan on September 3, 2018

$\newcommand{\mc}{\mathcal} \newcommand{\mb}{\mathbb}$

Classical statistics notions

The world is assumed to lie in a parametric (density) family $\mc{P}_{\Theta} = \{ p(y \vert\theta) \;\vert \;\theta \in \Theta\}$, while the true $\theta$ is hidden. We have a dataset $\mc{D}$ whose sample points are drawn i.i.d. from $p(y \vert \theta)$.

The density for $\mc{D}$ under $\theta$ is: $p(\mc{D} \vert \theta) = \prod_i p(y_i \vert \theta)$.

  • For fixed $\theta$, $p(\mc{D} \vert \theta)$ is a density function on $\mc{Y}^n$
  • For fixed $\mc{D}$, the function $\theta \mapsto p(\mc{D} \vert \theta)$ is the likelihood: $L_{\mc{D}}(\theta) = p(\mc{D} \vert \theta)$

The whole story of modeling and learning is about designing a good statistic/point estimator of $\theta$, i.e. $\hat \theta = \hat \theta(\mc{D})$, and computing an optimal one.

Desirable properties of point estimators

  • Consistency: $\hat \theta_n \rightarrow \theta$ as $n \rightarrow \infty$
  • Efficiency: $\hat \theta_n$ is as accurate as we can make it from a sample of size $n$

The maximum likelihood estimator (MLE) for $\theta$ in $\mc{P}_{\Theta}$ is:

$\hat \theta_{\text{MLE}} = \arg \max_{\theta \in \Theta} L_{\mc{D}}(\theta) = \arg \max_{\theta \in \Theta} \prod_i p(y_i \vert \theta)$
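
As a quick numerical illustration (not from the original notes), here is a minimal sketch of the MLE for a Bernoulli parameter, where the closed form is the sample mean; the grid search is only a sanity check:

```python
# A minimal sketch (illustrative assumption): MLE for a Bernoulli parameter theta
# from i.i.d. 0/1 samples. The closed form is the sample mean; the grid search
# over the log-likelihood is just a sanity check.
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=100)        # i.i.d. draws from p(y | theta = 0.7)

theta_mle = data.mean()                      # closed-form MLE for Bernoulli

# Numerical check: maximize L_D(theta) = prod_i p(y_i | theta) on a grid
grid = np.linspace(1e-3, 1 - 1e-3, 999)
loglik = data.sum() * np.log(grid) + (len(data) - data.sum()) * np.log(1 - grid)
theta_grid = grid[np.argmax(loglik)]

print(theta_mle, theta_grid)                 # both close to the true 0.7
```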

Bayesian statistics

Bayesian statistics introduces a new distribution, specified before seeing any data: the prior $p(\theta)$ over $\Theta$.

A (parametric) Bayesian model consists of two parts:

  • A parametric family of densities: $\mc{P}_{\Theta} = \{ p(y\vert\theta) \;\vert \;\theta \in \Theta\}$
  • A prior distribution: $p(\theta)$

With the two parts combined, we have a joint density: $p(\mc{D}, \theta) = p(\mc{D} \vert \theta)p(\theta)$.

The posterior distribution for $\theta$ is $p(\theta \vert \mc{D})$, obtained from the joint density by Bayes' rule: $p(\theta \vert \mc{D}) = \frac{p(\mc{D} \vert \theta)\, p(\theta)}{p(\mc{D})} \propto p(\mc{D} \vert \theta)\, p(\theta)$. It represents the rationally updated belief about $\theta$ after seeing $\mc{D}$, while the prior is the initial belief before seeing $\mc{D}$.
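
As a sketch of how this update can be carried out numerically (the Bernoulli likelihood and uniform prior below are illustrative assumptions, not choices made in these notes), one can evaluate $p(\mc{D} \vert \theta)p(\theta)$ on a grid over $\Theta$ and normalize:

```python
# A minimal sketch: posterior p(theta | D) on a grid, assuming a Bernoulli
# likelihood and a uniform prior on [0, 1] (both illustrative choices).
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])             # observed D
grid = np.linspace(1e-3, 1 - 1e-3, 999)               # discretized Theta

prior = np.ones_like(grid)                             # p(theta): uniform
lik = grid ** data.sum() * (1 - grid) ** (len(data) - data.sum())  # p(D | theta)

unnorm = lik * prior                                   # p(D | theta) p(theta)
posterior = unnorm / np.trapz(unnorm, grid)            # normalize over the grid

print(np.trapz(posterior, grid))                       # ~1.0, i.e. a proper density
```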

Conjugate prior

Let $\pi$ be a family of prior distributions on $\Theta$, i.e. a set of densities of the form $p(\cdot): \Theta \rightarrow [0, \infty)$.

And let $\mc{P}_{\Theta} = \{ p(y\vert\theta) \;\vert \;\theta \in \Theta\}$ be the parametric family of our model, whose members are densities $p(\cdot \vert \theta): \mc{Y} \rightarrow [0, \infty)$.

$\pi$ is conjugate to the parametric family/model $\mc{P}_\Theta$ if, for any prior in $\pi$ and any dataset $\mc{D}$, the posterior $p(\theta \vert \mc{D})$ is still in $\pi$.

e.g.

Conjugate prior family     Model
Beta                       Bernoulli/Binomial
Dirichlet                  Categorical/Multinomial
Gamma                      Poisson
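
A minimal sketch of the first row (a Beta prior with a Bernoulli model), with illustrative hyperparameters: the posterior stays in the Beta family, with the observed counts simply added to the hyperparameters.

```python
# A minimal sketch of Beta-Bernoulli conjugacy: a Beta(a, b) prior plus n
# Bernoulli observations with k successes gives a Beta(a + k, b + n - k) posterior.
import numpy as np

def beta_bernoulli_update(a, b, data):
    """Return the posterior Beta hyperparameters after observing 0/1 data."""
    data = np.asarray(data)
    k = int(data.sum())            # number of successes
    n = data.size
    return a + k, b + (n - k)

a_post, b_post = beta_bernoulli_update(a=2.0, b=2.0, data=[1, 0, 1, 1, 1, 0])
print(a_post, b_post)              # Beta(6.0, 4.0): still in the Beta family
```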

Bayesian point estimators

Common options (a short numerical sketch follows this list):

  • posterior mean: $\hat \theta = \mb E[\theta \vert \mc{D}]$
  • maximum a posteriori (MAP): $\hat \theta = \arg \max_\theta p(\theta \vert \mc{D})$
    • which is also the mode of the posterior distribution
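
Here is a small sketch of both estimators for an assumed Beta(6, 4) posterior (e.g. from the Beta-Bernoulli update above), using the closed-form mean and mode of the Beta distribution:

```python
# A minimal sketch: posterior mean and MAP for an assumed Beta(a, b) posterior.
a, b = 6.0, 4.0                    # posterior hyperparameters (illustrative)

post_mean = a / (a + b)            # E[theta | D]
post_map = (a - 1) / (a + b - 2)   # mode of Beta(a, b), valid for a, b > 1

print(post_mean, post_map)         # 0.6 0.625
```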

Bayesian decision theory

Ingredients:

  • parameter space: $\Theta$
  • prior: $p(\theta)$
  • action space: $\mc{A}$
  • loss: $l: \mc{A} \times \Theta \rightarrow \mb R$

The posterior risk of an action $a \in \mc{A}$ is: $r(a) = \mb E[l(a, \theta)\vert \mc{D}] = \int l(a, \theta)p(\theta \vert \mc{D}) d\theta$, i.e. the expected loss under the posterior.

A Bayes action $a^\ast$ is an action that minimizes $r(a)$.

Example

If the action space is $\mc{A} = \Theta$, i.e. we are estimating the parameter $\theta$ itself, then different losses lead to different Bayes estimators (a numerical sketch follows the list below):

  • square loss $\Rightarrow$ posterior mean
  • absolute loss $\Rightarrow$ posterior median
  • zero-one loss $\Rightarrow$ posterior mode, i.e. MAP
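
A grid-based sketch of the first two rows, assuming a Beta(6, 4) posterior (an illustrative choice); zero-one loss degenerates for continuous $\theta$, but in the small-tolerance limit it recovers the posterior mode:

```python
# A minimal sketch: minimize the posterior risk r(a) = E[l(a, theta) | D] on a
# grid, assuming a Beta(6, 4) posterior.
import numpy as np
from scipy import stats

grid = np.linspace(1e-3, 1 - 1e-3, 2001)              # candidate actions / parameters
post = stats.beta.pdf(grid, 6, 4)                      # posterior density p(theta | D)
post /= np.trapz(post, grid)                           # renormalize on the grid

def bayes_action(loss):
    """Return argmin_a of the grid-approximated posterior risk r(a)."""
    risks = [np.trapz(loss(act, grid) * post, grid) for act in grid]
    return grid[np.argmin(risks)]

def square_loss(act, theta):       # minimizer -> posterior mean (~0.600)
    return (act - theta) ** 2

def absolute_loss(act, theta):     # minimizer -> posterior median (~0.607)
    return np.abs(act - theta)

print(bayes_action(square_loss), bayes_action(absolute_loss))
```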

Steps of Bayesian method

  • Define the model
    • choose a parametric family $\mc{P}_{\Theta} = \{ p(y \vert \theta) \;\vert \;\theta \in \Theta\}$
    • choose a prior: $p(\theta)$
  • Compute posterior: $p(\theta \vert \mc{D})$
  • Make decision
    • choose a loss $l(a, \theta)$
    • compute the Bayes action $a^\ast$ by minimizing the posterior risk (see the end-to-end sketch below)
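
Finally, an end-to-end sketch of these steps under assumed choices (Bernoulli model, Beta(2, 2) prior, square loss); none of these choices is prescribed by the notes themselves:

```python
# An end-to-end sketch of the Bayesian method, under assumed choices:
# Bernoulli model, Beta(2, 2) prior, square loss.
import numpy as np

# 1. Define the model: Bernoulli likelihood p(y | theta) + Beta(2, 2) prior
a0, b0 = 2.0, 2.0
rng = np.random.default_rng(1)
data = rng.binomial(1, 0.3, size=50)

# 2. Compute the posterior: the conjugate update keeps it in the Beta family
a, b = a0 + data.sum(), b0 + (len(data) - data.sum())

# 3. Make a decision: under square loss, the Bayes action is the posterior mean
theta_hat = a / (a + b)
print(theta_hat)                   # should land near the true 0.3
```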