Backpropagation with max log likelihood

Backpropagation is most often explained using the mean square deviation

$J = \frac{1}{2}\sum_{j}\left(y_j - o_j\right)^2$

for the cost function. A more elegant approach, which usually works better, is to use the maximum log likelihood function for the cost calculation (see Logistic regression using max log likelihood for a detailed description).

In my article Backpropagation I used a neural net with 3 layers and 2 input features like:

[Figure: neural net with 3 layers and 2 input features]


With the mean square deviation the gradients for the learning of this net were:

$\frac{\partial J}{\partial w_{jk}^{(3)}} = -\left(y_k - o_k^{(3)}\right) o_k^{(3)}\left(1 - o_k^{(3)}\right) o_j^{(2)}$

$\frac{\partial J}{\partial w_{jk}^{(2)}} = \sum_m \left[-\left(y_m - o_m^{(3)}\right) o_m^{(3)}\left(1 - o_m^{(3)}\right) w_{km}^{(3)}\right] o_k^{(2)}\left(1 - o_k^{(2)}\right) o_j^{(1)}$

$\frac{\partial J}{\partial w_{jk}^{(1)}} = \sum_m \left[\sum_n \left[-\left(y_n - o_n^{(3)}\right) o_n^{(3)}\left(1 - o_n^{(3)}\right) w_{mn}^{(3)}\right] o_m^{(2)}\left(1 - o_m^{(2)}\right) w_{km}^{(2)}\right] o_k^{(1)}\left(1 - o_k^{(1)}\right) x_j$

(and analogously for the offsets, with the input factor $o_j^{(2)}$, $o_j^{(1)}$ or $x_j$ replaced by 1)


and with the local gradients

$\delta_k^{(3)} = -\left(y_k - o_k^{(3)}\right) o_k^{(3)}\left(1 - o_k^{(3)}\right)$


and

$\delta_k^{(2)} = \left(\sum_m \delta_m^{(3)} w_{km}^{(3)}\right) o_k^{(2)}\left(1 - o_k^{(2)}\right)$


The gradients became:

$\frac{\partial J}{\partial w_{jk}^{(3)}} = \delta_k^{(3)}\, o_j^{(2)}$

$\frac{\partial J}{\partial w_{jk}^{(2)}} = \delta_k^{(2)}\, o_j^{(1)}$

$\frac{\partial J}{\partial w_{jk}^{(1)}} = \delta_k^{(1)}\, x_j$


and

$\delta_k^{(1)} = \left(\sum_m \delta_m^{(2)} w_{km}^{(2)}\right) o_k^{(1)}\left(1 - o_k^{(1)}\right)$


were the local gradients for the first layer.
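It helps for what follows to see the local gradient of the last layer as two chain rule factors: the derivative of the cost with respect to the output, and the derivative of the sigmoid, which for an output $o$ is $o(1-o)$. For the mean square deviation cost that is

$\delta_k^{(3)} = \frac{\partial J}{\partial o_k^{(3)}} \cdot o_k^{(3)}\left(1 - o_k^{(3)}\right) = -\left(y_k - o_k^{(3)}\right) \cdot o_k^{(3)}\left(1 - o_k^{(3)}\right)$

because $\frac{\partial J}{\partial o_k^{(3)}} = -\left(y_k - o_k^{(3)}\right)$ for the mean square deviation. Exactly this factorisation is what changes with the max log likelihood cost below.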


If maximum log likelihood is used as cost function, that means

$J = -\sum_{k}\left[y_k \ln\left(o_k^{(3)}\right) + \left(1 - y_k\right)\ln\left(1 - o_k^{(3)}\right)\right]$
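For reference, a minimal C# sketch of how this cost could be computed for one training sample; the class and array names are illustrative only and not taken from the demo projects:

using System;

static class LogLikelihood
{
     // Negative log likelihood cost for one training sample with target
     // values y and net outputs o (names chosen for illustration only)
     public static double Cost(double[] y, double[] o)
     {
          double cost = 0.0;
          for (int j = 0; j < y.Length; j++)
               cost -= y[j] * Math.Log(o[j]) + (1.0 - y[j]) * Math.Log(1.0 - o[j]);
          return cost;
     }
}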


With the sigmoid function

$\sigma\left(f(x)\right) = \frac{1}{1 + e^{-f(x)}}$


similar to Logistic regression using max log likelihood. Only the f(x) inside the sigmoid function is more complex here, and more chain rule steps have to be applied to differentiate it. But that does not bother us, as we already have all these parts further above.
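All that is needed from the sigmoid is its derivative expressed in terms of the already computed output, which is why the dAct() call in the code below can work directly on actLayer.o[j]. A minimal sketch of such a pair of functions; the actual implementation in the demo projects may differ:

using System;

static class Sigmoid
{
     // Sigmoid activation o = Act(z) for the weighted input z
     public static double Act(double z)
     {
          return 1.0 / (1.0 + Math.Exp(-z));
     }

     // Derivative of the sigmoid, written in terms of the output o:
     // d/dz sigma(z) = sigma(z) * (1 - sigma(z)) = o * (1 - o)
     public static double dAct(double o)
     {
          return o * (1.0 - o);
     }
}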

In Logistic regression using max log likelihood we saw that the only difference between the log likelihood and mean square deviation approach was the vanished part

$\sigma\left(f(x)\right)\left(1 - \sigma\left(f(x)\right)\right)$


In backpropagation with the max log likelihood the same thing happens. Now

$\frac{\partial J}{\partial o_k^{(3)}} = -\frac{y_k - o_k^{(3)}}{o_k^{(3)}\left(1 - o_k^{(3)}\right)}$

and the part that drops out is

$o_k^{(3)}\left(1 - o_k^{(3)}\right)$


which is basically the outermost part of the differentiation for the local gradient of the last layer (of the mean square deviation approach).
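Written out for one output, the cancellation is just one line (same symbols as above):

$\delta_k^{(3)} = \frac{\partial J}{\partial o_k^{(3)}} \cdot o_k^{(3)}\left(1 - o_k^{(3)}\right) = -\frac{y_k - o_k^{(3)}}{o_k^{(3)}\left(1 - o_k^{(3)}\right)} \cdot o_k^{(3)}\left(1 - o_k^{(3)}\right) = -\left(y_k - o_k^{(3)}\right)$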

With this the gradients for max log likelihood become:

$\frac{\partial J}{\partial w_{jk}^{(3)}} = \delta_k^{(3)}\, o_j^{(2)}$

$\frac{\partial J}{\partial w_{jk}^{(2)}} = \delta_k^{(2)}\, o_j^{(1)}$

$\frac{\partial J}{\partial w_{jk}^{(1)}} = \delta_k^{(1)}\, x_j$


with

$\delta_k^{(3)} = -\left(y_k - o_k^{(3)}\right)$


for the local gradients of the last layer.

That affects only the last layer. All the layers further left are untouched and remain the same as in the mean square deviation approach.

So only one small modification in the BackwardProp() function is needed to switch the backpropagation algorithm to the maximum log likelihood approach:


Replace

actLayer.gradient[j] = -(y[j] - actLayer.o[j]) * actLayer.dAct(actLayer.o[j]);



in the first loop by


actLayer.gradient[j] = -(y[j] - actLayer.o[j]);



That’s


private void BackwardProp(double[] x, double[] y)
{
     int i, j, k;
     TLayer actLayer = net.ElementAt(layers - 2);   // layer currently being processed
     TLayer layerRight = net.ElementAt(layers - 1); // layer to the right of the current one
     double[] actX = new double[x.Length];
 
     for (j = 0; j < actLayer.x.Length; j++)
         actX[j] = actLayer.x[j];
     actLayer = net.ElementAt(layers - 1);
     //last layer
     for (j = 0; j < actLayer.featuresOut; j++)
     {
         costSum = costSum + ((y[j] - actLayer.o[j]) * (y[j] - actLayer.o[j])); // costSum still sums the squared error, as in the mean square deviation version
         // actLayer.gradient[j] = -(y[j] - actLayer.o[j]) * actLayer.dAct(actLayer.o[j]); // local gradient for mean square deviation
         actLayer.gradient[j] = -(y[j] - actLayer.o[j]); // local gradient for max log likelihood
     }
 
     for (j = 0; j < actLayer.featuresIn; j++)
     {
         for (k = 0; k < actLayer.featuresOut; k++)
              actLayer.deltaW[j, k] = actLayer.deltaW[j, k] + actLayer.gradient[k] * actX[j];
     }
 
     for (j = 0; j < actLayer.featuresOut; j++)
     {
         actLayer.deltaOffs[j] = actLayer.deltaOffs[j] + actLayer.gradient[j];
     }
 
     // all layers except the last one
     if (layers > 1)
     {
         for (i = layers - 2; i >= 0; i--)
         {
              actLayer = net.ElementAt(i);
              layerRight = net.ElementAt(i + 1);
              if (i > 0)
              {
                   TLayer layerLeft = net.ElementAt(i - 1);
                   for (j = 0; j < layerLeft.o.Length; j++)
                        actX[j] = layerLeft.o[j];
              }
              else
              {
                   for (j = 0; j < x.Length; j++)
                        actX[j] = x[j];
              }
              for (j = 0; j < actLayer.featuresOut; j++)
              {
                   actLayer.gradient[j] = 0;
                   for (k = 0; k < layerRight.featuresOut; k++)
                        actLayer.gradient[j] = actLayer.gradient[j] + (layerRight.gradient[k] * actLayer.dAct(actLayer.o[j]) * layerRight.w[j, k]);
                   actLayer.deltaOffs[j] = actLayer.deltaOffs[j] + actLayer.gradient[j];
                   for (k = 0; k < actLayer.featuresIn; k++)
                        actLayer.deltaW[k, j] = actLayer.deltaW[k, j] + actLayer.gradient[j] * actX[k];
              }
         }
     }
}



All the rest can remain the same.
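As a quick sanity check, independent of the demo projects, one can verify numerically that -(y - o) really is the derivative of the max log likelihood cost with respect to the weighted input of a single sigmoid output. A small self-contained sketch:

using System;

class GradientCheck
{
     static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

     // Negative log likelihood cost of one sigmoid output o = Sigmoid(z) with target y
     static double Cost(double z, double y)
     {
          double o = Sigmoid(z);
          return -(y * Math.Log(o) + (1.0 - y) * Math.Log(1.0 - o));
     }

     static void Main()
     {
          double z = 0.7, y = 1.0, eps = 1e-6;
          double o = Sigmoid(z);
          double analytic = -(y - o);                                         // local gradient used in BackwardProp()
          double numeric = (Cost(z + eps, y) - Cost(z - eps, y)) / (2 * eps); // central finite difference
          Console.WriteLine($"analytic = {analytic}, numeric = {numeric}");   // both print approximately -0.332
     }
}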


With this, 150000 iterations and learningrate = 0.1, the backpropagation algorithm computes the values:

[Images of the computed weight and offset values of the net]


With Cost = 0.013846

That does not look much better than the 0.015 of the mean square deviation approach in Backpropagation. But if we run the test application with these parameters and compare it with the test of the mean square deviation approach, there is quite some improvement:

With the maximum log likelihood data, the 15 test flowers are recognized with a mean probability of 99.78 %. The mean square deviation approach gets 98.25 % for the same flowers. If we convert these probabilities to an uncertainty, which is 100 % minus the probability, the mean square deviation approach gets 1.75 % whereas the maximum log likelihood approach gets 0.22 %, roughly 8 times less. That's quite some difference :-)


C# Demo Projects Backpropagation with maximum log likelihood
  • BackpropagationLog.zip
  • BackpropagationLogTest.zip
  • Iris_Data.zip