# The Cross Entropy Loss: a log-likelihood in disguise ðŸ¥·

The **cross-entropy loss** that is so much used in classification tasks is nothing butâ€¦ the **negative log-likelihood** in disguise! ðŸ¥¸ In this short post, we will clarify this point and make the connection explicit.

Letâ€™s consider a dataset of input-label pairs \(\{x^i, y^i\}_{i=1}^N\) similar to what we considered in the last post. For Italian names, \(x^i\) consisted of a set of characters that are used to predict \(y^i\), the next character. The model we used, `n`

-grams, consists of look-up-tables that, given \(x^i\), return an array of normalized probabilities that the model assigns to the next character \(y^i\) being the first, second,â€¦ `L`

-th element in the vocabulary. With the use of some notation, our model \(\mathcal{M}\) that takes as input the \(x^i\)â€™s and returns a length-`L`

vector of probabilities \(p^i = \mathcal{M}(x^i)\). Neural networks, as we will see, also fit in this setting.

We recall that the log-likelihood of a given sample corresponds to the negative log probability that the model assigns to the correct class \(y^i\):

\[\mathcal{l}^i = -\log [\mathcal{M}(x^i)]_{y^i} = -\log [p^i]_{y^i}\]where we used the notation \([\mathbf{x}]_{i}\) to indicate the \(i\)-th element of vector \(\mathbf{x}\) and lowercase \(\mathcal{l}\) to indicate the loss of one single data point.

At this stage, the \(y^i\)â€™s take values in \(\{0, 1,\dots, L\}\). If we instead represent them in one-hot-encoding, i.e.

\[\begin{aligned} \cdot &\rightarrow 0 \rightarrow [1,0,0,..., 0] \\ a &\rightarrow 1 \rightarrow [0,1,0,..., 0]\\ &\qquad\quad\vdots\\ Ã¹ &\rightarrow L \rightarrow [0, ...,0,0,1]\\ \end{aligned}\]the negative log-likelihood of a sample simply reads

\[\mathcal{l}^i = -\sum_{c=1}^L y^i_c\log (p^i_c),\]that is **the definition of cross-entropy loss**!!

I hope that this helped shed some light ðŸ”¦ on the widespread use of this seemingly obscure loss function, and its justification.