Autoregressive Decoding and Teacher Forcing
Inference
\[\hat{x}_t=\underset{x_t\in\mathcal{X}}{\text{argmax}}\log{P(x_t|\hat{x}_{<t};\theta)}\]
Autoregressive
\[\begin{aligned}
\hat{x}_1&=\underset{x_1\in\mathcal{X}}{\text{argmax}}\log{P(x_1|x_0;\theta)}\text{ where }x_0=\texttt{<BOS>}, \\
\hat{x}_2&=\underset{x_2\in\mathcal{X}}{\text{argmax}}\log{P(x_2|x_0,\hat{x}_1;\theta)}, \\
\hat{x}_3&=\underset{x_3\in\mathcal{X}}{\text{argmax}}\log{P(x_3|x_0,\hat{x}_1,\hat{x}_2;\theta)}, \\
&\;\;\vdots \\
\hat{x}_t&=\underset{x_t\in\mathcal{X}}{\text{argmax}}\log{P(x_t|x_0,\hat{x}_{<t};\theta)}.
\end{aligned}\]
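The step-by-step argmax above can be sketched as a greedy decoding loop. This is a minimal illustration, not a definitive implementation: the table `TOY_PROBS`, the helper `log_prob`, and the `<EOS>` stopping token are hypothetical and do not come from the notes. The toy model conditions only on the most recent token, standing in for $P(x_t|x_0,\hat{x}_{<t};\theta)$.

```python
import math

BOS, EOS = "<BOS>", "<EOS>"  # <EOS> is an assumed stopping token

# Hypothetical toy model: next-token probabilities given the previous token.
# A real model would condition on the whole prefix x_0, x_hat_1, ..., x_hat_{t-1}.
TOY_PROBS = {
    BOS: {"a": 0.6, "b": 0.3, EOS: 0.1},
    "a": {"a": 0.1, "b": 0.7, EOS: 0.2},
    "b": {"a": 0.2, "b": 0.1, EOS: 0.7},
}


def log_prob(token, prefix):
    """Return log P(token | prefix; theta) under the toy model."""
    return math.log(TOY_PROBS[prefix[-1]][token])


def greedy_decode(vocab, max_len=10):
    """Pick the argmax token at each step and feed it back as context."""
    prefix = [BOS]  # x_0
    for _ in range(max_len):
        # x_hat_t = argmax over the vocabulary of log P(x_t | prefix)
        next_token = max(vocab, key=lambda tok: log_prob(tok, prefix))
        prefix.append(next_token)  # the prediction becomes part of the context
        if next_token == EOS:
            break
    return prefix[1:]  # drop x_0


decoded = greedy_decode(["a", "b", EOS])
print(decoded)  # ['a', 'b', '<EOS>']
```

The key point is the `prefix.append(next_token)` line: at inference time each prediction $\hat{x}_t$ is fed back as input for step $t+1$.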
Training with MLE
\[\begin{gathered}
\mathcal{D}=\{x^i\}_{i=1}^N \\
\begin{aligned}
\hat{\theta}&=\underset{\theta\in\Theta}{\text{argmax}}\sum_{i=1}^N{\log{P(x^i;\theta)}} \\
&=\underset{\theta\in\Theta}{\text{argmax}}\sum_{i=1}^N{\sum_{j=1}^n{\log{P(x_j^i|x_{<j}^i;\theta)}}},
\end{aligned} \\
\text{where }x^i=x_{1:n}^i=\{x_1^i,\cdots,x_n^i\}.
\end{gathered}\]
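In the objective above, each term conditions on the ground-truth prefix $x_{<j}^i$ rather than on the model's own predictions $\hat{x}_{<j}$; this is what is usually called teacher forcing. The following sketch only evaluates the summed log-likelihood for a fixed $\theta$ and does not perform the argmax over $\theta$. The table `THETA`, the helper names, and the use of `<BOS>` as the context for the first token are assumptions made for illustration.

```python
import math

BOS = "<BOS>"

# Hypothetical parameters theta: next-token probabilities given the previous token.
THETA = {
    BOS: {"a": 0.5, "b": 0.5},
    "a": {"a": 0.25, "b": 0.75},
    "b": {"a": 0.5, "b": 0.5},
}


def cond_log_prob(token, prefix, theta):
    """Return log P(token | prefix; theta) under the toy model."""
    return math.log(theta[prefix[-1]][token])


def sequence_log_likelihood(seq, theta):
    """Sum over j of log P(x_j | x_{<j}; theta) with teacher forcing."""
    prefix = [BOS]
    total = 0.0
    for token in seq:
        total += cond_log_prob(token, prefix, theta)
        # Teacher forcing: extend the context with the ground-truth token,
        # not with the model's own argmax prediction.
        prefix.append(token)
    return total


def dataset_log_likelihood(dataset, theta):
    """Sum over i of log P(x^i; theta), the quantity maximized by MLE."""
    return sum(sequence_log_likelihood(seq, theta) for seq in dataset)


dataset = [["a", "b"], ["b", "a", "b"]]
objective = dataset_log_likelihood(dataset, THETA)
print(objective)
```

In practice $\hat{\theta}$ would be found by gradient ascent on this quantity (equivalently, by minimizing the negative log-likelihood), while the context at each step still comes from the data.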