

Softmax and Gradient Descent

Reference 1: http://dlsys.cs.washington.edu/schedule, Lecture 3

Refer to the network structure in the figure above: an input layer and an output layer.

What $W_{i,j}$ means: the $j$-th weight associated with output neuron $a_i$. It is clearest to read the indices from the perspective of the output neuron.

Each output neuron is constructed as follows:

$z_i = w_i \cdot x \;\rightarrow\; z_i = \sum_j w_{i,j} \cdot x_j$, which means every weight of $a_i$ is multiplied by the corresponding feature of the input $x$.

$W^T = \begin{bmatrix} - & - & - \\ - & w_i & - \\ - & - & - \end{bmatrix}$, $x = \begin{bmatrix} | \\ x_j \\ | \end{bmatrix}$
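To make the index convention concrete, here is a minimal numpy sketch (added for illustration; the 784/10 shapes and the random data are assumptions matching the MNIST code below) showing that the per-neuron sum and the matrix product give the same $z$:

```python
import numpy as np

n_in, n_out = 784, 10              # hypothetical sizes, matching the MNIST code below
x = np.random.rand(n_in)           # one input sample: entries x_j
W = np.random.rand(n_in, n_out)    # W[j, i] = w_{i,j}, so column i of W is w_i

# z_i = sum_j w_{i,j} * x_j, computed neuron by neuron
z_loop = np.array([np.sum(W[:, i] * x) for i in range(n_out)])

# the same values from a single matrix-vector product, as in the training code below
z_mat = np.dot(x, W)

print(np.allclose(z_loop, z_mat))  # True
```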

The cross-entropy loss used by logistic (softmax) regression is

$C = -\sum_i y_i \log a_i$, where $a_i = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$ and $y$ is the one-hot label.

The derivative of the loss function with respect to each weight $w_{i,j}$:

$\frac{\partial C}{\partial w_{i,j}} = \frac{\partial C}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_{i,j}} = (a_i - y_i)\cdot x_j$
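For the middle factor, the standard softmax-plus-cross-entropy step (using the loss written above, the softmax Jacobian $\partial a_k / \partial z_i = a_k(\delta_{ki} - a_i)$, and $\sum_k y_k = 1$ for one-hot labels) is:

$$\frac{\partial C}{\partial z_i} = \sum_k \frac{\partial C}{\partial a_k}\,\frac{\partial a_k}{\partial z_i} = \sum_k \Big(-\frac{y_k}{a_k}\Big)\, a_k(\delta_{ki} - a_i) = a_i \sum_k y_k - y_i = a_i - y_i, \qquad \frac{\partial z_i}{\partial w_{i,j}} = x_j$$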

The derivatives of the loss function with respect to every parameter form the gradient, i.e. the derivative of a scalar with respect to a matrix: $\frac{\partial L}{\partial W}$.

In the code below, the line W_grad = np.dot(batch_xs.T, y_grad) is exactly this gradient. Think about how the computation ends up as a product of two matrices. The layout of batch_xs is as follows:

$X = \begin{bmatrix} - & - & - \\ - & x_i & - \\ - & - & - \end{bmatrix}$, where $x_i$ is the row vector $(x_{i,1}, x_{i,2}, \dots, x_{i,n})$ (note: $x_{i,j}$ can also be written $x_j^{(i)}$, depending on how sample $i$ is denoted; the former matches the numpy array layout). The loss function is computed over all samples in the batch, so $Loss = \sum_i loss(f(x^{(i)}), y^{(i)})$, where $i$ runs over the batch_size samples. Combining this sum with the per-sample derivative above, the gradient of Loss is the sum of the gradients of loss1 + loss2 + loss3 + ..., i.e. an accumulation over the batch.
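Stacking the per-sample gradients makes the matrix form explicit. With $A = \mathrm{softmax}(XW)$ computed row by row and $Y$ the matrix of one-hot labels (notation mine, restating the formulas above):

$$\frac{\partial Loss}{\partial W} = \sum_i x^{(i)\,T}\big(a^{(i)} - y^{(i)}\big) = X^T (A - Y)$$

which is exactly the line W_grad = np.dot(batch_xs.T, y_grad) below.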

```python
import numpy as np
from tinyflow.datasets import get_mnist

def softmax(x):
    # subtract the row-wise max for numerical stability before exponentiating
    x = x - np.max(x, axis=1, keepdims=True)
    x = np.exp(x)
    x = x / np.sum(x, axis=1, keepdims=True)
    return x

# get the mnist dataset
mnist = get_mnist(flatten=True, onehot=True)

# W_grad below is summed over the batch of 100, so fold 1/100 into the step size
learning_rate = 0.5 / 100
W = np.zeros((784, 10))

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    # forward
    y = softmax(np.dot(batch_xs, W))
    # backward: dLoss/dz = a - y, dLoss/dW = X^T (A - Y)
    y_grad = y - batch_ys
    W_grad = np.dot(batch_xs.T, y_grad)
    # update
    W = W - learning_rate * W_grad
```
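To convince yourself that W_grad = np.dot(batch_xs.T, y_grad) really is $\partial Loss/\partial W$ for the summed cross-entropy, here is a small numerical gradient check on synthetic data (a sketch added for illustration; the shapes, the random data, and the loss helper are my own and do not come from the lecture code):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def loss(W, X, Y):
    # total (not averaged) cross-entropy over the batch, matching the sum above
    A = softmax(np.dot(X, W))
    return -np.sum(Y * np.log(A))

rng = np.random.RandomState(0)
X = rng.rand(5, 8)                        # 5 samples, 8 features
Y = np.eye(4)[rng.randint(0, 4, size=5)]  # one-hot labels, 4 classes
W = rng.randn(8, 4) * 0.1

# analytic gradient: X^T (A - Y)
analytic = np.dot(X.T, softmax(np.dot(X, W)) - Y)

# numerical gradient by central differences on a few entries
eps = 1e-5
for (j, k) in [(0, 0), (3, 2), (7, 1)]:
    Wp, Wm = W.copy(), W.copy()
    Wp[j, k] += eps
    Wm[j, k] -= eps
    numeric = (loss(Wp, X, Y) - loss(Wm, X, Y)) / (2 * eps)
    print(j, k, analytic[j, k], numeric)  # the two values should agree closely
```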

Refactored into the TensorFlow-style API, the code is as follows:

```python
import tinyflow as tf
from tinyflow.datasets import get_mnist

# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.nn.softmax(tf.matmul(x, W))

# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

# Update rule
learning_rate = 0.5
W_grad = tf.gradients(cross_entropy, [W])[0]
train_step = tf.assign(W, W - learning_rate * W_grad)

# Training Loop
sess = tf.Session()
sess.run(tf.initialize_all_variables())
mnist = get_mnist(flatten=True, onehot=True)
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})  # real execution happens here
```
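For comparison, in real TensorFlow 1.x the explicit gradient-and-assign update above is usually wrapped in an optimizer; a one-line sketch (assumed API, not verified against tinyflow):

```python
# TensorFlow 1.x style; whether tinyflow provides this wrapper is an assumption
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
```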

Notes:

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

  • The reduce_sum here takes reduction_indices as its second argument. Think of it this way: 0 means reducing over the first dimension, i.e. $\sum\limits_{i=0}^m a_{i,j}$, while 1 means reducing over the second dimension, $\sum\limits_{j=0}^n a_{i,j}$. Below is the example from the official TensorFlow documentation:
```
# 'x' is [[1, 1, 1],
#         [1, 1, 1]]
tf.reduce_sum(x) ==> 6
tf.reduce_sum(x, 0) ==> [2, 2, 2]
tf.reduce_sum(x, 1) ==> [3, 3]
tf.reduce_sum(x, 1, keep_dims=True) ==> [[3], [3]]
tf.reduce_sum(x, [0, 1]) ==> 6
```
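For readers following the numpy version above, the same reductions can be written with numpy's axis argument (a small sketch added for illustration; x is the same 2×3 array of ones):

```python
import numpy as np

x = np.ones((2, 3))
np.sum(x)                         # 6.0          -- reduce over all axes
np.sum(x, axis=0)                 # [2., 2., 2.]
np.sum(x, axis=1)                 # [3., 3.]
np.sum(x, axis=1, keepdims=True)  # [[3.], [3.]]
```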