Friday, October 7, 2016

How to avoid loss = nan while training deep neural network using Caffe

The following problem occurs in Caffe when the loss value becomes very large (infinity) and is then reported as nan:

I0917 15:45:07.232023 1936130816 sgd_solver.cpp:106] Iteration 9500, lr = 0.000575702
I0917 15:45:08.376780 1936130816 solver.cpp:228] Iteration 9600, loss = nan
I0917 15:45:08.376814 1936130816 solver.cpp:244]     Train net output #0: loss = nan (* 1 = nan loss)
I0917 15:45:08.376822 1936130816 sgd_solver.cpp:106] Iteration 9600, lr = 0.000573498
I0917 15:45:09.522541 1936130816 solver.cpp:228] Iteration 9700, loss = nan
I0917 15:45:09.522573 1936130816 solver.cpp:244]     Train net output #0: loss = nan (* 1 = nan loss)
I0917 15:45:09.522581 1936130816 sgd_solver.cpp:106] Iteration 9700, lr = 0.000571313
I0917 15:45:10.663610 1936130816 solver.cpp:228] Iteration 9800, loss = nan
I0917 15:45:10.663782 1936130816 solver.cpp:244]     Train net output #0: loss = nan (* 1 = nan loss)
I0917 15:45:10.663791 1936130816 sgd_solver.cpp:106] Iteration 9800, lr = 0.000569147
I0917 15:45:11.808089 1936130816 solver.cpp:228] Iteration 9900, loss = nan
I0917 15:45:11.808120 1936130816 solver.cpp:244]     Train net output #0: loss = nan (* 1 = nan loss)
I0917 15:45:11.808128 1936130816 sgd_solver.cpp:106] Iteration 9900, lr = 0.000567001

I was able to fix this by adjusting the learning rate (mostly by decreasing it). Values that usually work are 0.001 and 0.0001. (The learning rate can be configured in the solver.prototxt file.)
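For reference, here is a minimal solver.prototxt sketch with a reduced base_lr. The net path, snapshot prefix, and policy values are placeholders for illustration, not taken from the run above.

net: "train_test.prototxt"
base_lr: 0.0001        # reduced to avoid the loss diverging to nan
lr_policy: "inv"
gamma: 0.0001
power: 0.75
max_iter: 10000
snapshot_prefix: "lenet"
solver_mode: CPU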

The following thread contains useful information on other possible causes of this issue:
http://stackoverflow.com/questions/33962226/common-causes-of-nans-during-training
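One diagnostic worth calling out: Caffe's solver can log the magnitudes of each layer's activations and gradients at every iteration, which helps you spot where values start to blow up before the loss turns into nan. A one-line sketch for solver.prototxt:

debug_info: true    # logs per-layer data/diff statistics each iteration

With this set, a layer whose gradient magnitudes grow rapidly stands out in the training log, pointing to where to intervene (learning rate, initialization, or the layer itself).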

Sometimes you might notice that the loss value doesn't change at all, even though it never becomes nan. That issue can also be fixed by fine-tuning the learning rate.

E.g., base_lr: 0.00009

Sometimes memory issues can also occur after changing the learning rate:

caffe(1636,0x201105b9000) malloc: *** error for object 0x7fe8a0c2ab20: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
*** Aborted at 1474110894 (unix time) try "date -d @1474110894" if you are using GNU date ***
PC: @     0x7fff893d6286 __pthread_kill
*** SIGABRT (@0x7fff893d6286) received by PID 1636 (TID 0x201105b9000) stack trace: ***
    @     0x7fff8f8f9f1a _sigtramp
    @                0x0 (unknown)
    @     0x7fff8685db53 abort
    @     0x7fff89124e06 szone_error
    @     0x7fff8911b9dd szone_free_definite_size
    @     0x7fff91681c13 _dispatch_client_callout
    @     0x7fff9168488f _dispatch_root_queue_drain
    @     0x7fff91692fe4 _dispatch_worker_thread3
    @     0x7fff8af61637 _pthread_wqthread
    @     0x7fff8af5f40d start_wqthread

That can be fixed by adjusting (mostly decreasing) the batch_size in train_test.prototxt, for both the train and test input layers.

batch_size: the number of inputs to process at one time

E.g., 

name: "LeNet"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "train_lmdb"
    batch_size: 16
    backend: LMDB
  }
}
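If a smaller batch_size hurts gradient quality, Caffe's iter_size solver setting accumulates gradients over several forward/backward passes before updating the weights, so memory usage stays low while the effective batch size grows. A sketch, assuming the batch_size of 16 from the layer above:

# in solver.prototxt
iter_size: 4    # gradients accumulated over 4 passes; effective batch = 16 * 4 = 64

This keeps per-pass memory at the 16-input level while the weight updates behave as if the batch size were 64.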