Was two-stage knowledge distillation, as used for BinaryBERT in Table 7 (https://arxiv.org/pdf/2012.15701.pdf), applied to get these results?