TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

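I am building a model that uses recurrent layers (GRUs) to convert one string into another string. I have tried both a Dense layer and a TimeDistributed(Dense) layer as the last layer, but I don't understand the difference between the two when return_sequences=True is used, especially since they appear to have the same number of parameters.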

My simplified model is the following:

import keras

InputSize = 15   # size of each input vector (also the number of output classes)
MaxLen = 64      # number of time steps per sequence
HiddenSize = 16  # number of GRU units

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________ 
Layer (type)     Output Shape    Param # 
================================================================= 
input_1 (InputLayer)   (None, 64, 15)   0   
_________________________________________________________________ 
gru_1 (GRU)     (None, 64, 16)   1536  
_________________________________________________________________ 
time_distributed_1 (TimeDist (None, 64, 15)   255  
_________________________________________________________________ 
activation_1 (Activation) (None, 64, 15)   0   
=================================================================

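This makes sense to me, since my understanding of TimeDistributed is that it applies the same layer at every time step, so the Dense layer has 16 * 15 + 15 = 255 parameters (weights + biases).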
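The arithmetic behind both parameter counts in the summary can be checked with a few lines of plain Python. This is only a sketch: the GRU formula assumes the classic formulation with three gates and one bias vector per gate, which happens to reproduce the 1536 shown above. The variable names are illustrative.

```python
# Sizes taken from the model above.
input_size = 15   # InputSize
hidden_size = 16  # HiddenSize

# Dense: one weight per (input, output) pair plus one bias per output unit.
# Its input is the GRU output (16 features) and it has 15 units.
dense_params = hidden_size * input_size + input_size

# GRU (assumed: 3 gates, each with an input kernel, a recurrent kernel
# and a single bias vector).
gru_params = 3 * (input_size * hidden_size
                  + hidden_size * hidden_size
                  + hidden_size)

print(dense_params)  # 255
print(gru_params)    # 1536
```

Note that MaxLen (64) appears in neither formula, which is consistent with the same weights being reused at every time step.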

However, if I switch to a plain Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)  # Dense applied directly to the 3D GRU output
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________ 
Layer (type)     Output Shape    Param # 
================================================================= 
input_1 (InputLayer)   (None, 64, 15)   0   
_________________________________________________________________ 
gru_1 (GRU)     (None, 64, 16)   1536  
_________________________________________________________________ 
dense_1 (Dense)    (None, 64, 15)   255  
_________________________________________________________________ 
activation_1 (Activation) (None, 64, 15)   0   
=================================================================

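I imagine this is because Dense() only uses the last dimension of the input shape and effectively treats everything else as a batch-like dimension. But then I am no longer sure what the difference is between Dense and TimeDistributed(Dense).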

Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses only the last dimension to size itself:

def build(self, input_shape): 
    assert len(input_shape) >= 2 
    input_dim = input_shape[-1] 

    self.kernel = self.add_weight(shape=(input_dim, self.units),

It also uses K.dot (the Keras backend dot) to apply the weights:

def call(self, inputs): 
    output = K.dot(inputs, self.kernel)

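The documentation of the backend dot implies that it works fine on n-dimensional tensors. I wonder whether its exact behavior means that Dense() is in effect called at every time step. If so, the question still remains: what does TimeDistributed() achieve in this case?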
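The question about per-time-step behavior can be explored outside Keras with a small NumPy sketch. It is not Keras's implementation; it assumes that the backend dot contracts the last axis of a 3D input with a 2D kernel the same way np.dot does. The random tensors and their names are stand-ins for the GRU output and the Dense weights.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time_steps, hidden, units = 4, 64, 16, 15

x = rng.standard_normal((batch, time_steps, hidden))  # stand-in for the GRU output
kernel = rng.standard_normal((hidden, units))         # stand-in for Dense weights
bias = rng.standard_normal(units)

# Dense-style: a single dot over the last axis of the whole 3D tensor.
dense_out = np.dot(x, kernel) + bias

# TimeDistributed-style: apply the same 2D dense to each time step separately,
# then stack the results back along the time axis.
per_step = np.stack(
    [np.dot(x[:, t, :], kernel) + bias for t in range(time_steps)],
    axis=1,
)

print(dense_out.shape)
print(np.allclose(dense_out, per_step))
```

Under that assumption the two results agree, which matches the identical parameter counts and the near-identical training behavior reported for the two models.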

Asked 2017-06-18 by thon

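Let me add that the two models perform almost exactly the same during training. – thon

I have been wondering about this too. So you confirmed that Dense() and TimeDistributed(Dense()) perform the same in your case? I think a better API design would let the user set a parameter to choose between using the same Dense layer across time steps and using a separate Dense layer for each time step. – ymeng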

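TimeDistributedDense applies the same Dense layer to every time step while the GRU/LSTM cell is unrolled. The error function is therefore computed between the predicted label sequence and the actual label sequence. (This is normally the requirement for sequence-to-sequence labeling problems.)

However, with return_sequences=False, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for classification problems. If return_sequences=True, the Dense layer is applied at every time step, just like TimeDistributedDense.

So for your models the two are the same, but if you change your second model to return_sequences=False, the Dense layer will be applied only at the last cell. Try changing it and the model will throw an error, because Y will then have to be of size [Batch_size, InputSize]: it is no longer a sequence-to-sequence problem but a full-sequence-to-label problem.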

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, GRU
import numpy as np

InputSize = 15 
MaxLen = 64 
HiddenSize = 16 

OutputSize = 8 
n_samples = 1000 

model1 = Sequential() 
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize))) 
model1.add(TimeDistributed(Dense(OutputSize))) 
model1.add(Activation('softmax')) 
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop') 


model2 = Sequential() 
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize))) 
model2.add(Dense(OutputSize)) 
model2.add(Activation('softmax')) 
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop') 

model3 = Sequential() 
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize))) 
model3.add(Dense(OutputSize)) 
model3.add(Activation('softmax')) 
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop') 

# Random inputs and targets, used only to exercise the shapes
X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])  # one target per time step
Y2 = np.random.random([n_samples, OutputSize])          # one target per sequence

model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

# summary() prints the table itself and returns None
model1.summary()
model2.summary()
model3.summary()

In the example architectures above, model1 and model2 are sequence-to-sequence models, and model3 is a full-sequence-to-label model.
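The difference in target shapes between the sequence-to-sequence case and the sequence-to-label case can be sketched in NumPy. This is an illustration under assumptions, not Keras code: h_all is a hypothetical stand-in for what a GRU emits with return_sequences=True, and taking its last time step stands in for return_sequences=False.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, max_len, hidden_size, output_size = 8, 64, 16, 8

# Hypothetical GRU output with return_sequences=True: one vector per time step.
h_all = rng.standard_normal((n_samples, max_len, hidden_size))
# With return_sequences=False only the last time step is passed on.
h_last = h_all[:, -1, :]

kernel = rng.standard_normal((hidden_size, output_size))  # stand-in Dense weights

seq_out = h_all @ kernel     # one prediction per time step (like model1, model2)
label_out = h_last @ kernel  # one prediction per sequence (like model3)

print(seq_out.shape)
print(label_out.shape)
```

The first shape is what Y1 has to match and the second is what Y2 has to match, which is why feeding Y1 to a model built with return_sequences=False raises a shape error.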

Answered 2017-06-18 15:51:26 by mujjiga

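Thanks for your answer. I am not sure I follow, but as far as I can tell the output is a sequence in both cases. In both cases the recurrent layer has return_sequences=True, and the output shape in both cases is 3D and exactly the same (batch_size, 64, 15). So it seems to me that the Dense layer is also applied at every time step. – thon

I have updated my answer with a better explanation; I hope it helps. – mujjiga

Thanks. Just to avoid doubt, when you say "So for your models the two are the same, but if you change your second model to "return_sequences=True", the Dense layer will be applied only at the last cell", do you mean if I change return_sequences to False? Your answer seems to imply that when return_sequences is True, Dense() and TimeDistributed(Dense()) do exactly the same thing. Can you confirm that? It makes sense, but then why does Keras need TimeDistributed() at all? – thon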
