What does Masked LM mean?

Before feeding a sequence of words into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words based on the context provided by the other, unmasked words in the sequence. In technical terms, predicting the output words requires:

Adding a classification layer on top of the encoder output.
Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
Calculating the probability of each word in the vocabulary with softmax (sketched just below).
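
The three steps above can be pictured as a small prediction head sitting on the encoder. The following is a minimal sketch in PyTorch, not BERT's exact implementation; the names (MaskedLMHead, hidden_size, vocab_size) and the tied embedding matrix passed in are assumptions for illustration.

```python
import torch
import torch.nn as nn


class MaskedLMHead(nn.Module):
    """Minimal sketch of a masked-LM prediction head (illustrative, not BERT's exact code)."""

    def __init__(self, hidden_size: int, vocab_size: int, token_embeddings: nn.Embedding):
        super().__init__()
        # 1. Classification layer on top of the encoder output.
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        # 2. Project to vocabulary size by reusing (tying) the token embedding matrix.
        self.decoder = nn.Linear(hidden_size, vocab_size)
        self.decoder.weight = token_embeddings.weight

    def forward(self, encoder_output: torch.Tensor) -> torch.Tensor:
        # encoder_output: (batch, seq_len, hidden_size)
        hidden = self.transform(encoder_output)
        logits = self.decoder(hidden)  # (batch, seq_len, vocab_size)
        # 3. torch.softmax(logits, dim=-1) would turn these scores into a probability
        #    for every word in the vocabulary at every position.
        return logits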
The BERT loss function only takes into account the prediction of the masked values and ignores the prediction of the unmasked words. As a result, the model converges more slowly than directional models, a characteristic that is compensated by its greater context awareness.
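
One common way to express "only the masked positions count" is to mark every unmasked position with an ignore label so the cross-entropy skips it. A minimal sketch, assuming logits from a head like the one above and a labels tensor where unmasked positions are set to -100; in practice the softmax is folded into the loss like this for numerical stability, so the loss consumes raw logits:

```python
import torch
import torch.nn.functional as F


def masked_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over masked positions only; positions labelled -100 are ignored."""
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (batch*seq_len, vocab_size)
        labels.view(-1),                   # flatten to (batch*seq_len,)
        ignore_index=-100,                 # unmasked positions contribute nothing
    )
```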

In practice, BERT's implementation is a little more elaborate and does not replace all 15% of the selected words with [MASK].
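
Concretely, of the 15% of positions selected for prediction, the BERT paper replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged, so the model cannot rely on [MASK] always being present. A minimal sketch of that selection; the mask_token_id and vocab_size arguments and the -100 label convention (matching the loss sketch above) are illustrative assumptions:

```python
import random


def mask_tokens(token_ids, mask_token_id, vocab_size, select_prob=0.15):
    """Return (corrupted ids, labels); labels stay -100 except at selected positions."""
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = not selected, ignored by the loss
    for i, token in enumerate(token_ids):
        if random.random() < select_prob:     # select roughly 15% of positions
            labels[i] = token                 # the model must recover the original token
            roll = random.random()
            if roll < 0.8:                    # 80%: replace with [MASK]
                corrupted[i] = mask_token_id
            elif roll < 0.9:                  # 10%: replace with a random token
                corrupted[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return corrupted, labels
```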

What is Next Sentence Prediction?
In the BERT training process, the model receives sentence pairs as input and learns to predict whether the second sentence in the pair is the sentence that follows the first in the original document. During training, 50% of the inputs are pairs in which the second sentence really is the next sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
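
A minimal sketch of how such sentence pairs could be built from a corpus; the corpus structure (a list of documents, each a list of sentences) and the 1/0 labels for "is next" / "is not next" are assumptions for illustration:

```python
import random


def make_nsp_pair(documents):
    """Build one (sentence_a, sentence_b, label) example; label 1 = actual next sentence.

    Assumes each document is a list of at least two sentences.
    """
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:
        return sentence_a, doc[idx + 1], 1        # 50%: the true next sentence
    other_doc = random.choice(documents)          # 50%: a random sentence from the corpus
    return sentence_a, random.choice(other_doc), 0  # (sketch does not exclude the same document)
```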