Link

Minimum Risk Training (MRT)

Equations

\[\begin{gathered} \mathcal{D}=\{x^i,y^i\}_{i=1}^N \\ \\ \begin{aligned} \mathcal{R}(\theta)&=\sum_{i=1}^N{ \mathbb{E}_{\hat{y}\sim{P(\text{y}|x^i;\theta)}}[\Delta(\hat{y},y^i)] } \\ &=\sum_{i=1}^N{ \sum_{\hat{y}\in\mathcal{Y}(x^i)}{ P(\hat{y}|x^i;\theta)\Delta(\hat{y},y^i) } } \end{aligned} \\ \\ \hat{\theta}_\text{MRT}=\underset{\theta\in\Theta}{\text{argmin }}\mathcal{R}(\theta) \end{gathered}\] \[\begin{gathered} \begin{aligned} \tilde{\mathcal{R}}(\theta)&=\sum_{i=1}^N{ \mathbb{E}_{\hat{y}\sim{Q(\text{y}|x^i;\theta,\alpha)}}[ \Delta(\hat{y},y^i) ] } \\ &=\sum_{i=1}^N{ \sum_{\hat{y}\in\mathcal{S}(x^i)}{ Q(\hat{y}|x^i;\theta,\alpha)\Delta(\hat{y},y^i) } } \end{aligned} \\ \text{where }\mathcal{S}(x^i)\text{ is a sampled subset of the full search space }\mathcal{Y}(x^i), \\ \text{and }Q(\hat{y}|x^i;\theta,\alpha)\text{ is a distribution defined on the subspace }\mathcal{S}(x^i): \\ Q(\hat{y}|x^i;\theta,\alpha)=\frac{P(\hat{y}|x^i;\theta)^\alpha}{\sum_{y'\in\mathcal{S}(x^i)}{ P(y'|x^i;\theta)^\alpha }}. \end{gathered}\] \[\begin{gathered} \begin{aligned} \nabla_\theta\tilde{\mathcal{R}}(\theta) &=\alpha\sum_{i=1}^N{ \mathbb{E}_{\hat{y}\sim{P(\text{y}|x^i;\theta)^\alpha}}\Big[ \frac{\nabla_\theta{P(\hat{y}|x^i;\theta)}}{P(\hat{y}|x^i;\theta)}\times\big( \Delta(\hat{y},y^i)-\mathbb{E}_{y'\sim{P(\text{y}|x^i;\theta)^\alpha}}[ \Delta(y',y^i) ] \big) \Big] } \\ &=\alpha\sum_{i=1}^N{ \mathbb{E}_{\hat{y}\sim{P(\text{y}|x^i;\theta)^\alpha}}\Big[ \nabla_\theta\log{P(\hat{y}|x^i;\theta)}\times\big( \Delta(\hat{y},y^i)-\mathbb{E}_{y'\sim{P(\text{y}|x^i;\theta)^\alpha}}[ \Delta(y',y^i) ] \big) \Big] } \\ &\approx\alpha\sum_{i=1}^N{ \nabla_\theta\log{P(\hat{y}|x^i;\theta)\times\big( \Delta(\hat{y},y^i)-\frac{1}{K}\sum_{k=1}^K{ \Delta(y^k,y^i) } \big)} }\text{, where }\hat{y}\sim{P(\text{y}|x^i;\theta)^\alpha}. \end{aligned} \\ \\ \theta\leftarrow\theta-\eta\nabla_\theta{\tilde{\mathcal{R}}(\theta)} \end{gathered}\]

Evaluations