Knowledge Distillation

Knowledge distillation trains a small student network to match the softened predictive distribution of a larger teacher, in addition to fitting the ground-truth labels.
Given a labeled dataset, the teacher parameters are first fit by maximum likelihood at temperature \(\tau = 1\):
\[
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}, \qquad
\hat{\theta}_T, \hat{W}_T = \operatorname*{arg\,max}_{\theta_T \in \Theta,\; W_T \in \mathcal{W}} \sum_{i=1}^{N} \log P(y_i \mid x_i;\, \theta_T, W_T, \tau = 1)
\]
Both networks produce class probabilities through a temperature-scaled softmax over the logits \(W \cdot h_i\), where \(h_i = f(x_i; \theta)\) is the feature representation of input \(x_i\):
\[
P(\cdot \mid x_i;\, \theta, W, \tau) = \operatorname{softmax}\!\left(\frac{W \cdot f(x_i; \theta)}{\tau}\right) = \operatorname{softmax}\!\left(\frac{W \cdot h_i}{\tau}\right).
\]
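A minimal NumPy sketch of the temperature-scaled softmax, with illustrative values for the feature vector \(h_i\), the classifier weights \(W\), and the temperature (all chosen here for demonstration, not taken from the source):

```python
import numpy as np

def tempered_softmax(logits, tau):
    """Softmax over logits / tau; tau = 1 recovers the standard softmax."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# h: illustrative penultimate-layer features f(x_i; theta); W: illustrative weights
h = np.array([1.0, 0.5, -0.3])
W = np.eye(3)
p_sharp = tempered_softmax(W @ h, tau=1.0)  # standard predictive distribution
p_soft = tempered_softmax(W @ h, tau=4.0)   # larger tau -> softer, more uniform distribution
```

Raising \(\tau\) flattens the distribution, which is what exposes the teacher's "dark knowledge" about relative class similarities during distillation.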
The distillation loss is the cross-entropy between the teacher's tempered distribution and the student's, summed over classes \(c \in \mathcal{C}\); it approximates the expected student log-likelihood under the teacher's predictive distribution:
\[
\begin{aligned}
\mathcal{L}_{\mathrm{KD}}(\theta_S, W_S)
&= -\sum_{i=1}^{N} \sum_{c \in \mathcal{C}} P(y = c \mid x_i;\, \hat{\theta}_T, \hat{W}_T, \tau) \log P(y = c \mid x_i;\, \theta_S, W_S, \tau) \\
&\approx -\,\mathbb{E}_{x \sim P(x)}\!\left[ \mathbb{E}_{y \sim P(\cdot \mid x;\, \hat{\theta}_T, \hat{W}_T, \tau)}\!\left[ \log P(y \mid x;\, \theta_S, W_S, \tau) \right] \right]
\end{aligned}
\]
The student is also trained on the hard labels with the standard cross-entropy loss (at \(\tau = 1\)):
\[
\mathcal{L}_{\mathrm{CE}}(\theta_S, W_S) = -\sum_{i=1}^{N} \log P(y_i \mid x_i;\, \theta_S, W_S)
\]
The full student objective interpolates the two losses with a weight \(\alpha \in [0, 1]\) and is minimized over the student parameters:
\[
\mathcal{L}(\theta_S, W_S) = (1 - \alpha)\,\mathcal{L}_{\mathrm{CE}}(\theta_S, W_S) + \alpha\,\mathcal{L}_{\mathrm{KD}}(\theta_S, W_S), \qquad
\hat{\theta}_S, \hat{W}_S = \operatorname*{arg\,min}_{\theta_S \in \Theta,\; W_S \in \mathcal{W}} \mathcal{L}(\theta_S, W_S)
\]
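The combined objective above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions (integer class labels, pre-computed teacher and student logits, batch-averaged losses); the batch values below are random placeholders, not data from the source:

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, tau, alpha):
    """(1 - alpha) * L_CE + alpha * L_KD, averaged over the batch.

    student_logits, teacher_logits: (N, C) pre-softmax scores.
    labels: (N,) integer class indices y_i.
    """
    n = student_logits.shape[0]
    # L_CE: student cross-entropy at tau = 1 against the hard labels
    p_student = softmax(student_logits)
    l_ce = -np.log(p_student[np.arange(n), labels]).mean()
    # L_KD: cross-entropy between tempered teacher and student distributions
    p_teacher_t = softmax(teacher_logits, tau)
    log_p_student_t = np.log(softmax(student_logits, tau))
    l_kd = -(p_teacher_t * log_p_student_t).sum(axis=-1).mean()
    return (1 - alpha) * l_ce + alpha * l_kd

# Illustrative batch: 2 examples, 3 classes
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(2, 3))
student_logits = rng.normal(size=(2, 3))
labels = np.array([0, 2])
loss = distillation_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5)
```

Setting `alpha=0` recovers plain supervised training, and `alpha=1` trains purely against the teacher's soft targets.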