ML paper notes, ongoing

Posted on Wed 02 May 2018 in Projects

Baydin, Pearlmutter, Radul, and Siskind

Automatic differentiation in machine learning: a survey
https://arxiv.org/abs/1502.05767

  • AD is not numeric differentiation: $\frac{\partial f(x)}{\partial x_i}\approx \frac{f(x+he_i)-f(x)}{h}$

    • subject to numerical instabilities.
    • thou shalt not add small numbers to big numbers
    • thou shalt not subtract numbers which are approximately equal
    • truncation errors (à la dissertation)
    • $\mathcal{O}(n)$ evaluations of $f$ for a gradient in $n$ dimensions is the main challenge of numerical approaches, since $n$ can be in the millions or billions.
  • AD is not symbolic differentiation, i.e. mechanical rule application à la Mathematica: $\frac{d}{dx}(f(x)+g(x))=\frac{d}{dx}f(x)+\frac{d}{dx}g(x)$

    • mechanistic, carried out by Mathematica/SymPy symbol-manipulation rules
    • symbolic derivatives can grow exponentially larger than the original expression ("expression swell"), e.g. under repeated application of the product and chain rules.
  • AD fwd and backprop

    • apply symbolic differentiation at the elementary operation level and keep intermediate numerical results in lockstep with the evaluation of the main function.
    • differentiate the analytic equation, then plug in numbers. repeat.
    • each intermediate variable $v_i$ gets an associated tangent $\dot v_i=\frac{\partial v_i}{\partial x_1}$
    • evaluate primals and tangents in lockstep; the final tangent gives the derivative, e.g. $\dot v_5=\frac{\partial y}{\partial x_1}$ (see the sketch after this list).
    • Generalizes to calculating Jacobians.
    • Backprop (reverse mode) uses adjoints $\bar v_i=\frac{\partial y}{\partial v_i}$, propagated from the output back to the inputs.
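
A minimal sketch of forward mode with dual numbers, to make the tangent bookkeeping concrete: each value carries a (primal, tangent) pair, and one forward pass per seeded input yields one column of the Jacobian. The function and evaluation point are illustrative; the finite-difference check at the end contrasts this with the numerical approximation above.

```python
import math

class Dual:
    """Dual number (primal, tangent) for forward-mode AD."""
    def __init__(self, val, dot=0.0):
        self.val = val   # primal value v_i
        self.dot = dot   # tangent dv_i/dx_seed

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def log(x): return Dual(math.log(x.val), x.dot / x.val)
def sin(x): return Dual(math.sin(x.val), x.dot * math.cos(x.val))

def f(x1, x2):
    # illustrative function; any composition of the primitives above works
    return log(x1) + x1 * x2 - sin(x2)

# seed x1's tangent with 1 to get dy/dx1 (one forward pass per input dimension)
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
print(y.val, y.dot)   # primal f(2, 5) and exact dy/dx1 = 1/x1 + x2 = 5.5

# compare with a forward-difference approximation (the numerical method above)
h = 1e-6
fd = ((math.log(2 + h) + (2 + h) * 5 - math.sin(5)) -
      (math.log(2) + 2 * 5 - math.sin(5))) / h
print(fd)             # close to y.dot, but only approximate
```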

Check out PyTorch for general-purpose AD with define-by-run / dynamic computational graphs: the model is all in Python, and execution constructs the computational graph on the fly, so it can change freely at each iteration (as opposed to TensorFlow's static graphs, which force control flow through constructs like tf.while_loop).
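
For instance, a tiny define-by-run sketch (the tensor shape and loop condition are arbitrary, just to show data-dependent control flow building the graph at run time):

```python
import torch

x = torch.randn(3, requires_grad=True)

# Ordinary Python control flow shapes the graph on this particular run;
# the next forward pass may loop a different number of times.
y = x
while y.norm() < 10:        # data-dependent loop, no tf.while_loop needed
    y = y * 2
loss = y.sum()

loss.backward()             # reverse-mode AD over the graph just recorded
print(x.grad)               # d(loss)/dx for this run's graph
```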

Ioffe and Szegedy; Google

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
https://arxiv.org/pdf/1502.03167.pdf

  • Training DNNs is complicated by the fact that each layer’s inputs change during training, as the parameters of the previous layers change. This slows training (req. lower learning rates, careful parameter initialization).
  • Therefore, it is notoriously hard to train models with saturating nonlinearities.
  • This change in the distribution of each layer's inputs is "internal covariate shift".
  • Fix by normalizing layer inputs.
  • Batch Normalization also acts as a regularizer, in some cases eliminating the need for Dropout.

Pros

  • the gradient of the loss over a mini-batch is an estimate of the gradient over the whole training set; the estimate improves as the batch size grows
  • computation over a batch is more efficient than m separate per-example passes, thanks to parallelism
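
A quick toy check of the first point (made-up linear-regression data, not from the paper): the mini-batch gradient approximates the full-training-set gradient, and the gap shrinks as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy linear regression: loss(w) = mean_i 0.5 * (x_i @ w - y_i)^2
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)

def grad(Xb, yb, w):
    # gradient of the mean squared error over the batch (Xb, yb)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                        # gradient over the whole training set
for m in (8, 64, 512):
    idx = rng.choice(len(y), size=m, replace=False)
    mini = grad(X[idx], y[idx], w)          # mini-batch estimate of the same gradient
    print(m, np.linalg.norm(mini - full))   # error shrinks roughly like 1/sqrt(m)
```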

How

  • normalization fixes the means and variances of layer inputs --> allows higher learning rates
  • Stochastic GD: approximate the full-training-set gradient with a subsample (mini-batch) of the training set.
  • Mini-batch size of one: a single example (e.g. one handwritten digit) sent through the entire model
  • whitening the layer inputs (mean = 0, variance = 1); see the sketch after this list
  • regularization in DL: Dropout vs. $L_2$ vs. Batch Normalization.
  • can still use BN for convolutional neural nets.
  • train horizontally across layers, ensure whitening.
  • also: (pg 313) http://www.deeplearningbook.org/contents/optimization.html
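
A minimal numpy sketch of the training-time transform described in the paper: per-feature mini-batch mean and variance, normalize, then scale and shift with learned $\gamma, \beta$. The shapes and epsilon are conventional choices of mine; the inference-time version (population statistics via running averages) is omitted.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass at training time.

    x:           (batch, features) activations entering a layer
    gamma, beta: learned scale and shift, shape (features,)
    """
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized: mean ~0, variance ~1
    return gamma * x_hat + beta             # y = gamma * x_hat + beta

# toy usage: 32 examples, 4 features, deliberately shifted and scaled
x = np.random.randn(32, 4) * 5.0 + 3.0
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))        # ~0 and ~1 per feature
```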

Gatys, Ecker, and Bethge

A Neural Algorithm of Artistic Style
https://arxiv.org/abs/1508.06576

*