The cross-entropy loss that is so widely used in classification tasks is nothing but… the negative log-likelihood in disguise! 🥸 In this short post, we will clarify this point and make the connection explicit.

Let’s consider a dataset of input-label pairs $$\{x^i, y^i\}_{i=1}^N$$, similar to what we considered in the last post. For Italian names, $$x^i$$ consisted of a set of characters used to predict $$y^i$$, the next character. The model we used, n-grams, consists of look-up tables that, given $$x^i$$, return an array of normalized probabilities that the model assigns to the next character $$y^i$$ being the first, second, …, L-th element of the vocabulary. In other words, our model $$\mathcal{M}$$ takes as input the $$x^i$$’s and returns a length-L vector of probabilities $$p^i = \mathcal{M}(x^i)$$. Neural networks, as we will see, also fit in this setting.
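To make this concrete, here is a minimal sketch of such a model, assuming a tiny hypothetical 3-character vocabulary ('.', 'a', 'b') and made-up counts; the actual n-gram tables from the last post work the same way, just with the full vocabulary:

```python
# Hypothetical bigram counts: for each context character x, how often
# each vocabulary element followed it in the (made-up) training data.
counts = {
    ".": [0.0, 7.0, 3.0],   # after '.', we saw 7 'a's and 3 'b's
    "a": [2.0, 1.0, 2.0],
    "b": [4.0, 1.0, 0.0],
}

def model(x):
    """The look-up-table model M: return the length-L vector of
    normalized probabilities p = M(x) over the next character."""
    row = counts[x]
    total = sum(row)
    return [c / total for c in row]

p = model(".")  # a normalized probability vector: its entries sum to 1
```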

We recall that the negative log-likelihood of a given sample corresponds to the negative log probability that the model assigns to the correct class $$y^i$$:

$\mathcal{l}^i = -\log [\mathcal{M}(x^i)]_{y^i} = -\log [p^i]_{y^i}$

where we used the notation $$[\mathbf{x}]_{i}$$ to indicate the $$i$$-th element of vector $$\mathbf{x}$$ and lowercase $$\mathcal{l}$$ to indicate the loss of one single data point.
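In code, this loss is a single indexing operation followed by a log. A sketch with hypothetical values for $$p^i$$ and $$y^i$$:

```python
import math

# Hypothetical model output p^i = M(x^i) and correct label y^i.
p_i = [0.1, 0.7, 0.2]   # length-L vector of normalized probabilities
y_i = 1                 # index of the correct next character

# l^i = -log [p^i]_{y^i}: pick out the probability assigned to the
# correct class and take the negative log.
loss_i = -math.log(p_i[y_i])
```

Note that the loss is small when the model puts high probability on the correct class, and blows up as that probability approaches zero.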

At this stage, the $$y^i$$’s take values in $$\{0, 1,\dots, L-1\}$$. If we instead represent them in one-hot encoding, i.e.

\begin{aligned} \cdot &\rightarrow 0 \rightarrow [1,0,0,\dots, 0] \\ a &\rightarrow 1 \rightarrow [0,1,0,\dots, 0]\\ &\qquad\quad\vdots\\ ù &\rightarrow L-1 \rightarrow [0, \dots,0,0,1]\\ \end{aligned}
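One-hot encoding is itself a one-liner; a sketch (the vocabulary size `L` and indices are illustrative):

```python
def one_hot(y, L):
    """Length-L indicator vector with a 1 at index y and 0 elsewhere."""
    v = [0] * L
    v[y] = 1
    return v

one_hot(1, 3)  # e.g. 'a' -> index 1 -> [0, 1, 0]
```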

the negative log-likelihood of a sample simply reads

$\mathcal{l}^i = -\sum_{c=1}^L y^i_c\log (p^i_c),$

which is exactly the definition of the cross-entropy loss!
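We can verify the collapse numerically: with a one-hot $$y^i$$, only the term where $$y^i_c = 1$$ survives the sum, so the cross-entropy equals the negative log-likelihood. A sketch with the same hypothetical probability vector as above:

```python
import math

p_i = [0.1, 0.7, 0.2]   # hypothetical model output p^i = M(x^i)
y_onehot = [0, 1, 0]    # one-hot encoding of the correct class y^i = 1

# Cross-entropy: -sum_c y^i_c * log(p^i_c). All terms with y^i_c = 0
# vanish, leaving only -log(p^i_{y^i}).
cross_entropy = -sum(y_c * math.log(p_c) for y_c, p_c in zip(y_onehot, p_i))
nll = -math.log(p_i[1])
# cross_entropy and nll are the same number
```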

I hope this helped shed some light 🔦 on the widespread use of this seemingly obscure loss function, and on its justification.