Decoder with Masking
Equations
Dataset
\[\begin{gathered}
\mathcal{D}=\{x^i,y^i\}_{i=1}^N \\
x^i=\{x^i_1,\cdots,x^i_m\}\text{ and }y^i=\{y_0^i,y_1^i,\cdots,y_n^i\}, \\
\text{where }y_0=\text{<BOS>}\text{ and }y_n=\text{<EOS>}.
\end{gathered}\]
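As a minimal sketch (with a hypothetical toy sentence pair), the dataset is a list of source/target token sequences, where every target is wrapped in <BOS>/<EOS>:

```python
# Minimal sketch of D = {(x^i, y^i)}: each target sequence y^i is wrapped with
# <BOS>/<EOS>. The example sentence pair and token strings are hypothetical.
BOS, EOS = "<BOS>", "<EOS>"

def make_example(src_tokens, tgt_tokens):
    # x^i = (x_1, ..., x_m),  y^i = (y_0 = <BOS>, y_1, ..., y_n = <EOS>)
    return src_tokens, [BOS] + tgt_tokens + [EOS]

dataset = [
    make_example(["ich", "bin", "ein", "student"], ["i", "am", "a", "student"]),
]
```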
Decoder Block
Self-Attention with Masking
\[\begin{gathered}
h_{0,1:n}^\text{dec}=\text{emb}(y_{0:n-1})+\text{pos}(0,n-1) \\
\tilde{h}_{i,1:n}^\text{dec}=\text{LayerNorm}(\text{Multihead}_i(Q,K,V)+h_{i-1,1:n}^\text{dec}), \\
\text{where }Q=K=V=h_{i-1,1:n}^\text{dec}. \\
\end{gathered}\]
Masking
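A minimal PyTorch sketch of this sub-layer, assuming `nn.MultiheadAttention` for Multihead and the usual boolean look-ahead mask (a True entry blocks attention to a future position); the sizes are illustrative:

```python
import torch
import torch.nn as nn

# Masked self-attention sub-layer (post-LayerNorm, matching the equation above).
# d_model, n_heads, and the target length n are illustrative choices.
d_model, n_heads, n = 512, 8, 10
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

h_prev = torch.randn(2, n, d_model)   # h_{i-1,1:n}^dec for a batch of 2
# Causal (look-ahead) mask: position t may attend only to positions <= t.
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

attn_out, _ = self_attn(h_prev, h_prev, h_prev, attn_mask=causal_mask)  # Q = K = V
h_tilde = norm(attn_out + h_prev)     # residual connection + LayerNorm
```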
Attention
\[\begin{gathered}
\hat{h}_{i,1:n}^\text{dec}=\text{LayerNorm}(\text{Multihead}_i(Q,K,V)+\tilde{h}_{i,1:n}^\text{dec}), \\
\text{where }Q=\tilde{h}_{i,1:n}^\text{dec}\text{ and }K=V=h_{\ell_\text{enc},1:m}^\text{enc}.
\end{gathered}\]
Masking
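Here the masking presumably refers to hiding padded source positions from the decoder's queries. A minimal sketch using `key_padding_mask`, again with illustrative sizes:

```python
import torch
import torch.nn as nn

# Encoder-decoder attention sub-layer: queries come from the decoder, keys/values from
# the encoder output. The padding mask over the source positions is an assumption.
d_model, n_heads, n, m = 512, 8, 10, 12
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

h_tilde = torch.randn(2, n, d_model)   # \tilde{h}_{i,1:n}^dec (queries)
h_enc   = torch.randn(2, m, d_model)   # h_{l_enc,1:m}^enc (keys and values)
src_pad_mask = torch.zeros(2, m, dtype=torch.bool)
src_pad_mask[:, -2:] = True            # e.g. the last two source positions are <PAD>

attn_out, _ = cross_attn(h_tilde, h_enc, h_enc, key_padding_mask=src_pad_mask)
h_hat = norm(attn_out + h_tilde)       # residual connection + LayerNorm
```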
Feed-forward Networks
\[\begin{gathered}
\text{FFN}(h_{i,t})=\text{ReLU}(h_{i,t}\cdot{W_i^1})\cdot{W}_i^2 \\
\text{where }W_i^1\in\mathbb{R}^{d_\text{model}\times{d_\text{ff}}}\text{ and }W_i^2\in\mathbb{R}^{d_\text{ff}\times{d_\text{model}}}. \\
\\
h_{i,1:n}^\text{dec}=\text{LayerNorm}([\text{FFN}(\hat{h}_{i,1}^\text{dec});\cdots;\text{FFN}(\hat{h}_{i,n}^\text{dec})]+\hat{h}_{i,1:n}^\text{dec})
\end{gathered}\]
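A minimal sketch of the position-wise feed-forward sub-layer; d_ff = 2048 is the usual Transformer choice and an assumption here:

```python
import torch
import torch.nn as nn

# Position-wise FFN applied to every time-step, followed by Add & LayerNorm.
d_model, d_ff, n = 512, 2048, 10
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff, bias=False),   # W_i^1
    nn.ReLU(),
    nn.Linear(d_ff, d_model, bias=False),   # W_i^2
)
norm = nn.LayerNorm(d_model)

h_hat = torch.randn(2, n, d_model)   # \hat{h}_{i,1:n}^dec from the attention sub-layer
h_out = norm(ffn(h_hat) + h_hat)     # h_{i,1:n}^dec
```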
Decoder
\[\begin{gathered}
h_{\ell_\text{dec},1:n}^\text{dec}=\text{Block}_\text{dec}(h_{\ell_\text{dec}-1,1:n}^\text{dec},h_{\ell_\text{enc},1:m}^\text{enc}) \\
\cdots \\
h_{1,1:n}^\text{dec}=\text{Block}_\text{dec}(h_{0,1:n}^\text{dec},h_{\ell_\text{enc},1:m}^\text{enc}) \\
\end{gathered}\]
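A minimal sketch of stacking the decoder blocks, using PyTorch's off-the-shelf `nn.TransformerDecoderLayer` as a stand-in for Block_dec (the layer and head counts are illustrative):

```python
import torch
import torch.nn as nn

# Stack of l_dec decoder blocks; each block consumes the previous decoder states and
# the (fixed) encoder output. Sizes and layer count are illustrative.
d_model, n_heads, d_ff, l_dec, n, m = 512, 8, 2048, 6, 10, 12
blocks = nn.ModuleList([
    nn.TransformerDecoderLayer(d_model, n_heads, d_ff, batch_first=True)
    for _ in range(l_dec)
])

h_dec = torch.randn(2, n, d_model)   # h_{0,1:n}^dec = emb(y_{0:n-1}) + pos(...)
h_enc = torch.randn(2, m, d_model)   # encoder output h_{l_enc,1:m}^enc
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

for block in blocks:                 # h_{i,1:n}^dec = Block_dec(h_{i-1,1:n}^dec, h_enc)
    h_dec = block(h_dec, h_enc, tgt_mask=causal_mask)
```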
Generator
\[\begin{gathered}
\hat{y}_{1:n}=\text{softmax}(h_{\ell_\text{dec},1:n}^\text{dec}\cdot{W}_\text{gen}), \\
\text{where }h_{\ell_\text{dec},1:n}^\text{dec}\in\mathbb{R}^{\text{batch}\_\text{size}\times{n}\times\text{hidden}\_\text{size}}\text{ and }W_\text{gen}\in\mathbb{R}^{\text{hidden}\_\text{size}\times|V|}.
\end{gathered}\]
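A minimal sketch of the generator, assuming |V| = 32000 as an illustrative vocabulary size:

```python
import torch
import torch.nn as nn

# Linear projection of the final decoder states to the vocabulary, then softmax.
d_model, vocab_size, n = 512, 32000, 10
W_gen = nn.Linear(d_model, vocab_size, bias=False)

h_dec = torch.randn(2, n, d_model)              # h_{l_dec,1:n}^dec
y_hat = torch.softmax(W_gen(h_dec), dim=-1)     # shape: (batch_size, n, |V|)
```

In practice, training typically uses log-softmax with negative log-likelihood (or cross-entropy on the raw logits) rather than the softmax probabilities directly.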