
What is cross entropy

Cross-entropy is commonly used to quantify the difference between two probability distributions. In the context of machine learning, it is a measure of error for categorical multi-class classification problems. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

For example, suppose for a specific training instance the true label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is therefore:

    Pr(Class A)   Pr(Class B)   Pr(Class C)
        0.0           1.0           0.0

You can interpret the above true distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C.

Now, suppose your machine learning algorithm predicts the following probability distribution:

    Pr(Class A)   Pr(Class B)   Pr(Class C)
       0.228         0.619         0.153

How close is the predicted distribution to the true distribution? That is what the cross-entropy loss determines:

    H(p, q) = -sum_x p(x) * log(q(x))

where p(x) is the true probability distribution (one-hot) and q(x) is the predicted probability distribution. The sum is over the three classes A, B, and C. For the distributions above, the loss works out to 0.4797. Note that it does not matter what logarithm base you use, as long as you consistently use the same one. As it happens, the NumPy log() function computes the natural log (log base e).

Here is the above example expressed in Python using NumPy:

    import numpy as np

    p = np.array([0.0, 1.0, 0.0])        # true probability (one-hot)
    q = np.array([0.228, 0.619, 0.153])  # predicted probability; only Pr(B) affects the loss, since p is one-hot
    cross_entropy_loss = -np.sum(p * np.log(q))
    print(cross_entropy_loss)            # 0.4797...

So that is how "wrong" or "far away" your prediction is from the true distribution. A machine learning optimizer will attempt to minimize the loss (i.e. it will try to reduce the loss from 0.479 towards 0.0).

We see in the above example that the loss is 0.4797. Because we are using the natural log (log base e), the units are nats, so we say that the loss is 0.4797 nats. If the log were instead log base 2, the units would be bits.

To gain more intuition on what these loss values reflect, let's look at some extreme examples. Again, suppose the true (one-hot) distribution is:

    Pr(Class A)   Pr(Class B)   Pr(Class C)
        0.0           1.0           0.0

Now suppose your machine learning algorithm did a really great job and predicted class B with very high probability. When we compute the cross-entropy loss, we see that the loss is tiny, only 0.002. At the other extreme, suppose your algorithm did a terrible job and predicted class C with high probability instead; the resulting loss of 6.91 reflects the much larger error. And what happens in the middle of these two extremes? Suppose your algorithm can't make up its mind and predicts the three classes with nearly equal probability; the loss then lands in between, at roughly 1.1 nats. The sketch below reproduces all three of these numbers.
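Here is a minimal sketch of these three cases. The predicted distributions are assumed values, chosen only so that they reproduce the losses quoted above (0.002 and 6.91) and a near-uniform prediction for the middle case:

    import numpy as np

    def cross_entropy(p, q):
        # Cross-entropy between a true distribution p and a predicted distribution q.
        return -np.sum(p * np.log(q))

    p_true = np.array([0.0, 1.0, 0.0])          # true one-hot distribution (class B)

    q_good   = np.array([0.001, 0.998, 0.001])  # assumed: confident and correct
    q_bad    = np.array([0.001, 0.001, 0.998])  # assumed: confident but picks class C
    q_unsure = np.array([0.333, 0.333, 0.334])  # assumed: nearly uniform, undecided

    print(cross_entropy(p_true, q_good))    # ~0.002  (tiny loss)
    print(cross_entropy(p_true, q_bad))     # ~6.91   (large loss)
    print(cross_entropy(p_true, q_unsure))  # ~1.10   (somewhere in between)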


Cross-entropy is one out of many possible loss functions (another popular one is the SVM hinge loss). These loss functions are typically written as J(theta) and can be used within gradient descent, which is an iterative algorithm for moving the parameters (or coefficients) towards their optimum values. In the standard gradient descent update rule

    theta := theta - alpha * dJ(theta)/dtheta

you would replace J(theta) with H(p, q). But note that you need to compute the derivative of H(p, q) with respect to the parameters first.

So, to answer the original questions directly:

"Is it only a method to describe the loss function?" Correct: cross-entropy describes the loss between two probability distributions, and it is one of many possible loss functions.

"Then we can use, for example, the gradient descent algorithm to find the optimum values?" Yes, the cross-entropy loss function can be used as part of gradient descent.

Further reading: one of my other answers related to TensorFlow.

Adding to the above, the simplest form of cross-entropy loss is known as binary cross-entropy (used as the loss function for binary classification, e.g. with logistic regression), whereas the generalized version is categorical cross-entropy (used as the loss function for multi-class classification problems, e.g. with neural networks). When the model-computed (softmax) class probability gets close to 1 for the target label of a training instance (represented with one-hot encoding, e.g. [0, 1, 0]), the corresponding CCE loss decreases towards zero; otherwise it increases as the predicted probability corresponding to the target class becomes smaller.

On MSE vs. cross-entropy, I found some useful blog posts that discuss the difference, such as "What is the difference between MSE error and Cross-entropy error in NN" (Sep 5, 2017). Cross-entropy is also closely related to relative entropy, or KL divergence, which computes the distance between two probability distributions. For binary cross-entropy in particular, the loss becomes low when y and p are both high or both low at the same time, i.e. when the label and the prediction agree. The small sketch below makes both of these points concrete.
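This is a minimal sketch, not code from the original post: the helper names are mine, the BCE inputs are made-up labels and probabilities, and the three-class prediction reuses the assumed 0.228/0.619/0.153 example from above.

    import numpy as np

    def binary_cross_entropy(y, p):
        # BCE for a single example: y is the true label (0 or 1),
        # p is the predicted probability of the positive class.
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    print(binary_cross_entropy(1, 0.95))  # ~0.05: label and prediction agree, low loss
    print(binary_cross_entropy(1, 0.05))  # ~3.00: they disagree, high loss

    # Cross-entropy H(p, q) equals the entropy of p plus KL(p || q),
    # so for a fixed true distribution p, minimizing cross-entropy
    # minimizes the KL divergence between p and q.
    p = np.array([0.0, 1.0, 0.0])
    q = np.array([0.228, 0.619, 0.153])
    entropy_p = -np.sum(p[p > 0] * np.log(p[p > 0]))        # 0 for a one-hot distribution
    kl_pq = np.sum(p[p > 0] * np.log(p[p > 0] / q[p > 0]))  # KL(p || q)
    print(entropy_p + kl_pq)                                # ~0.4797, matching H(p, q) above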












