Is BERT’s LayerNorm What You Think It Is?
© Author | Wang Kunze
Affiliation | The University of Sydney
Research Direction | NLP

The comparison between Batch Norm and Layer Norm has become a cliché in the field of algorithms. The question of why BERT uses Layer Norm instead of Batch Norm has been asked to death, and a casual search on Zhihu …
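Before digging into the question, here is a minimal PyTorch sketch (an illustrative assumption, not code from the original post) contrasting the two normalization axes on a batch of token embeddings: LayerNorm normalizes over the hidden dimension of each token independently, while BatchNorm normalizes each feature across the batch.

```python
# Minimal sketch (hypothetical example) contrasting the normalization axes
# of LayerNorm and BatchNorm on a batch of token representations.
import torch
import torch.nn as nn

batch, seq_len, hidden = 4, 16, 8          # toy sizes, chosen arbitrarily
x = torch.randn(batch, seq_len, hidden)    # (N, T, H) token embeddings

# LayerNorm: statistics are computed over the hidden dimension of EACH token,
# so every (sample, position) pair is normalized independently of the batch.
ln = nn.LayerNorm(hidden)
y_ln = ln(x)
print(y_ln[0, 0].mean().item(), y_ln[0, 0].std(unbiased=False).item())  # ~0 and ~1 per token

# BatchNorm: statistics are computed per feature across the batch (and sequence)
# dimensions, so samples in the batch influence each other's normalized output.
bn = nn.BatchNorm1d(hidden)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (N, H, T)
print(y_bn[..., 0].mean().item(), y_bn[..., 0].std(unbiased=False).item())  # ~0 and ~1 per feature
```

The sketch only illustrates the textbook distinction between the two layers; whether this is the whole story for BERT is exactly the question the rest of the post examines.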