arXiv:1810.02281v1 [cs.LG] 4 Oct 2018
A CONVERGENCE ANALYSIS OF GRADIENT DESCENT
FOR DEEP LINEAR NEURAL NETWORKS
Sanjeev Arora
Princeton University and Institute for Advanced Study
arora@cs.princeton.edu
Nadav Cohen
Institute for Advanced Study
cohennadav@ias.edu
Noah Golowich
Harvard University
ngolowich@college.harvard.edu
Wei Hu
Princeton University
huwei@cs.princeton.edu
ABSTRACT
We analyze the speed of convergence to global optimum for gradient descent training a
deep linear neural network (parameterized as $x \mapsto W_N \cdots W_1 x$) by minimizing
the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the
following hold: (i) the dimensions of hidden layers are at least the minimum of the input
and output dimensions; (ii) the weight matrices at initialization are approximately
balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient
solution. The assumptions on initialization (conditions (ii) and (iii)) are essential, in
the sense that violating any one of them may lead to convergence failure.
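
As a concrete illustration of the setting described above, the following NumPy sketch runs gradient descent on a depth-3 linear network for scalar regression (output dimension 1), minimizing the $\ell_2$ loss over approximately whitened data. This is not the authors' code: the layer widths, data generation, step size, initialization scale, and the use of small random initialization as a stand-in for approximate balancedness are all illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the abstract's setting: gradient descent on a deep linear
# network x -> W_N ... W_1 x with l2 loss over (approximately) whitened data.
rng = np.random.default_rng(0)

d_in, d_out, depth = 10, 1, 3                  # hidden widths >= min(d_in, d_out)
widths = [d_in] + [max(d_in, d_out)] * (depth - 1) + [d_out]
n = 200

# Standard Gaussian inputs approximate whitened data (identity covariance).
X = rng.standard_normal((n, d_in))
w_true = rng.standard_normal((d_in, d_out))
Y = X @ w_true

# Small random initialization is used here as an approximation of the
# "approximately balanced" condition W_{j+1}^T W_{j+1} ~= W_j W_j^T.
scale = 0.1
Ws = [scale * rng.standard_normal((widths[j + 1], widths[j])) for j in range(depth)]

def end_to_end(Ws):
    """End-to-end matrix W_N ... W_1 realized by the deep linear network."""
    W = Ws[0]
    for Wj in Ws[1:]:
        W = Wj @ W
    return W

def loss(Ws):
    W = end_to_end(Ws)
    return 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))

eta = 1e-2
for step in range(10000):
    W = end_to_end(Ws)
    # Gradient of the l2 loss with respect to the end-to-end matrix.
    dW = (X @ W.T - Y).T @ X / n
    # Chain rule through the matrix product gives each layer's gradient:
    # dL/dW_j = (W_N ... W_{j+1})^T dW (W_{j-1} ... W_1)^T.
    grads = []
    for j in range(depth):
        above = np.eye(widths[-1]) if j == depth - 1 else end_to_end(Ws[j + 1:])
        below = np.eye(widths[0]) if j == 0 else end_to_end(Ws[:j])
        grads.append(above.T @ dW @ below.T)
    Ws = [Wj - eta * g for Wj, g in zip(Ws, grads)]

print("final loss:", loss(Ws))
```

Under these (assumed) hyperparameters the loss should decrease toward zero; the paper's result concerns the rate of that decrease when conditions (i)-(iii) hold.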