Few-shot Learning with Smaller PLMs

Pattern-Exploiting Training (PET)

\[\begin{gathered} \mathcal{D}_{train}=\{(x_i, y_i)\}_{i=1}^N \\ |\mathcal{D}_{train}|=|\mathcal{D}_{dev}|=N \end{gathered}\]
\[\begin{gathered} x_{in}=\begin{cases} x^1, &\text{if the input is a single sentence} \\ (x^1,x^2), &\text{if the input is a pair of sentences} \end{cases} \end{gathered}\]
\[\begin{gathered} \mathcal{M}:\mathcal{Y}\rightarrow\mathcal{V}\text{, where }\mathcal{M}\text{ is a mapping function from class labels to words in the vocabulary }\mathcal{V}. \\ x_\text{prompt}=\mathcal{T}(x_{in})\text{, where }x_\text{prompt}\text{ contains exactly one [MASK] token.} \end{gathered}\]
\[\begin{aligned} P(y|x_{in})&=P(\text{[MASK]}=\mathcal{M}(y)|x_\text{prompt}) \\ &=\frac{\exp(\text{w}_{\mathcal{M}(y)}\cdot\text{h}_\text{[MASK]})}{\sum_{y'\in\mathcal{Y}}{\exp(\text{w}_{\mathcal{M}(y')}\cdot\text{h}_\text{[MASK]})}} \end{aligned}\]
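
The class probability above is a softmax taken only over the label words, not over the whole vocabulary. A minimal sketch of that step follows, assuming the [MASK] logits are already available as a plain dictionary from word to score; the function name `label_word_probs`, the example words, and the logit values are all hypothetical and only serve as illustration.

```python
import math

def label_word_probs(mask_logits, label_words):
    """Softmax over the label-word logits only.

    mask_logits: dict word -> logit, i.e. w_word . h_[MASK] (assumed given)
    label_words: dict label -> word, playing the role of the mapping M
    """
    # Keep only the logits of the words that M assigns to a class label.
    scores = {y: mask_logits[w] for y, w in label_words.items()}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

# Hypothetical logits at the [MASK] position for a sentiment task.
mask_logits = {"great": 2.0, "terrible": 0.0, "the": 5.0}
label_words = {"positive": "great", "negative": "terrible"}

probs = label_word_probs(mask_logits, label_words)
print(probs)
```

Note that the high logit for "the" has no effect on the result, because words outside the image of the mapping are excluded from the normalization.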

Regression

\[\begin{aligned} y&=v_\text{low}\times{P(\mathcal{M}(v_\text{low})|x_{in})}+v_\text{high}\times{P(\mathcal{M}(v_\text{high})|x_{in})} \\ &=v_\text{low}\times\big(1-P(\mathcal{M}(v_\text{high})|x_{in})\big)+v_\text{high}\times{P(\mathcal{M}(v_\text{high})|x_{in})} \end{aligned}\]
\[\begin{gathered} P(\mathcal{M}(v_\text{high})|x_{in})=\frac{ \exp( \text{w}_{\mathcal{M}(v_\text{high})}\cdot\text{h}_\text{[MASK]} ) }{ \sum_{w'\in\{\mathcal{M}(v_\text{low}),\mathcal{M}(v_\text{high})\}}{ \exp( \text{w}_{w'}\cdot\text{h}_\text{[MASK]} ) } } \end{gathered}\]
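
The regression output is an interpolation between the two endpoints of the value range, weighted by a two-way softmax over the label words of the endpoints. The sketch below works through that arithmetic, assuming the two [MASK] logits are given as plain floats; the function name `regress` and the numeric values are hypothetical.

```python
import math

def regress(logit_low, logit_high, v_low, v_high):
    """Interpolate between v_low and v_high using P(M(v_high) | x_in)."""
    # Two-way softmax over the two label-word logits, shifted for stability.
    top = max(logit_low, logit_high)
    e_low = math.exp(logit_low - top)
    e_high = math.exp(logit_high - top)
    p_high = e_high / (e_low + e_high)
    # y = v_low * (1 - p_high) + v_high * p_high
    return v_low * (1.0 - p_high) + v_high * p_high

# Equal logits give the midpoint of the range [0, 5].
print(regress(0.0, 0.0, 0.0, 5.0))  # 2.5
```

Because the two probabilities sum to one, the prediction always lies between the two endpoint values.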

Training with examples as demonstrations

\[\begin{gathered} \mathcal{T}(x_{in})=\tilde{\mathcal{T}}(x_{in},\text{[MASK]}) \end{gathered}\]
\[\begin{gathered} \mathcal{T}(x_{in});\tilde{\mathcal{T}}(x^{(1)},\mathcal{M}(y^{(1)}));\cdots;\tilde{\mathcal{T}}(x^{(|\mathcal{Y}|)},\mathcal{M}(y^{(|\mathcal{Y}|)})) \end{gathered}\]
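
The concatenation above can be sketched as string construction: the query keeps its [MASK], while each demonstration has the [MASK] slot filled with the label word of its class. The template "It was ...", the helper names `fill_template` and `build_prompt`, the example sentences, and the use of a single space as the separator are all assumptions made for illustration.

```python
def fill_template(sentence, word):
    """Hypothetical template: append 'It was <word>.' to the sentence."""
    return f"{sentence} It was {word}."

def build_prompt(x_in, demos, label_words, mask_token="[MASK]"):
    """Concatenate the masked query with one filled-in demonstration per class.

    demos: list of (sentence, label) pairs, one per class (assumed)
    label_words: dict label -> word, playing the role of the mapping M
    """
    # The query keeps the mask, so the prompt contains exactly one [MASK].
    parts = [fill_template(x_in, mask_token)]
    for sentence, label in demos:
        # Demonstrations have the mask slot replaced by the label word M(y).
        parts.append(fill_template(sentence, label_words[label]))
    return " ".join(parts)

label_words = {"positive": "great", "negative": "terrible"}
demos = [("A fun ride.", "positive"), ("A dull mess.", "negative")]
prompt = build_prompt("No reason to watch.", demos, label_words)
print(prompt)
```

Only the query segment contributes a [MASK] token, so the prediction rule from the classification setting applies to the longer prompt unchanged.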