
TensorFlow - batching problem

I'm very new to TensorFlow, and I'm trying to train from my CSV file using batching.

Here is the code that reads the CSV file and creates the batches:

filename_queue = tf.train.string_input_producer(
    ['BCHARTS-BITSTAMPUSD.csv'], shuffle=False, name='filename_queue') 

reader = tf.TextLineReader() 
key, value = reader.read(filename_queue) 

# Default values, in case of empty columns. Also specifies the type of the 
# decoded result. 
record_defaults = [[0.], [0.], [0.], [0.], [0.],[0.],[0.],[0.]] 
xy = tf.decode_csv(value, record_defaults=record_defaults) 

# collect batches of csv in 
train_x_batch, train_y_batch = \ 
    tf.train.batch([xy[0:-1], xy[-1:]], batch_size=100) 

And here is the training loop:

# initialize 
sess = tf.Session() 
sess.run(tf.global_variables_initializer()) 

# Start populating the filename queue. 
coord = tf.train.Coordinator() 
threads = tf.train.start_queue_runners(sess=sess, coord=coord) 


# train my model 
for epoch in range(training_epochs): 
    avg_cost = 0 
    total_batch = int(2193/batch_size) 

    for i in range(total_batch):
        batch_xs, batch_ys = sess.run([train_x_batch, train_y_batch])
        feed_dict = {X: batch_xs, Y: batch_ys}
        c, _ = sess.run([cost, optimizer], feed_dict=feed_dict)
        avg_cost += c / total_batch

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost)) 

coord.request_stop() 
coord.join(threads) 

Here are my questions:

1.

My CSV file has 2193 records and my batch size is 100. So what I want is: each "epoch" starts from the "first record" and trains 21 batches of 100 records each, plus one last batch of 93 records. So 22 batches in total.

However, I found that every batch has size 100 - even the last one. Moreover, the second "epoch" does not start from the "first record".

2.

How can I get the number of records (2193 in this case)? Should I hard-code it? Or is there some smarter way to do it? I used tensor.get_shape().as_list(), but it doesn't work for batch_xs - it just returns an empty shape [].

Answer


We recently added a new API to TensorFlow called tf.contrib.data, which makes it easier to solve problems like this. (The "queue runner"-based APIs make it difficult to perform computations on exact epoch boundaries, because the epoch boundary information is lost.)

Here is an example of how you could rewrite your program using tf.contrib.data:

lines = tf.contrib.data.TextLineDataset("BCHARTS-BITSTAMPUSD.csv") 

def decode(line): 
    record_defaults = [[0.], [0.], [0.], [0.], [0.],[0.],[0.],[0.]] 
    xy = tf.decode_csv(line, record_defaults=record_defaults) 
    return xy[0:-1], xy[-1:] 

decoded = lines.map(decode) 

batched = decoded.batch(100) 

iterator = batched.make_initializable_iterator() 

train_x_batch, train_y_batch = iterator.get_next() 
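For intuition, the per-epoch batch arithmetic that Dataset.batch() implements can be sketched in plain Python (no TensorFlow needed); with 2193 records and a batch size of 100 it yields exactly the 22 batches described in the question, the last one holding 93 records:

```python
def batch_indices(num_records, batch_size):
    """Yield (start, end) slice bounds covering one epoch, in order."""
    for start in range(0, num_records, batch_size):
        yield start, min(start + batch_size, num_records)

sizes = [end - start for start, end in batch_indices(2193, 100)]
print(len(sizes), sizes[-1])  # 22 batches; the final batch has 93 records
```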

Then the training part becomes a little simpler:

# initialize 
sess = tf.Session() 
sess.run(tf.global_variables_initializer()) 

# train my model 
for epoch in range(training_epochs): 
    total_cost = 0.0 
    total_batch = 0 

    # Re-initialize the iterator for another epoch. 
    sess.run(iterator.initializer) 

    while True: 
        # NOTE: It is inefficient to make a separate sess.run() call to get each 
        # batch of input data and then feed it into a different sess.run() call. 
        # For better performance, define your training graph to take 
        # train_x_batch and train_y_batch directly as inputs. 
        try: 
            batch_xs, batch_ys = sess.run([train_x_batch, train_y_batch]) 
        except tf.errors.OutOfRangeError: 
            break 

        feed_dict = {X: batch_xs, Y: batch_ys} 
        c, _ = sess.run([cost, optimizer], feed_dict=feed_dict) 
        total_cost += c 
        total_batch += batch_xs.shape[0] 

    avg_cost = total_cost / total_batch 

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost)) 

For more details on how to use the new API, see the "Importing Data" programmer's guide.


So there is still no way to get the number of records (2193)? – BlakStar


The `total_batch` variable will contain 2193 (or the actual number of records) at the end of the `while` loop. – mrry
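If the record count is needed before training starts rather than at the end of the first epoch, one alternative (a minimal sketch, assuming the CSV file has no header row; the throwaway temp file below only stands in for BCHARTS-BITSTAMPUSD.csv) is to count the lines of the file up front:

```python
import os
import tempfile

def count_records(path):
    # Count data rows by scanning the file once (assumes no header line).
    with open(path) as f:
        return sum(1 for _ in f)

# Usage sketch with a throwaway file standing in for the real CSV:
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('1,2\n3,4\n5,6\n')
print(count_records(tmp.name))  # 3
os.remove(tmp.name)
```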


I ran it today... and it raised an error. It's because batch_xs has shape [7, 100], so it can't be fed to X, which has shape [?, 7]. I read the guide you linked and found that it's intentionally shaped [7, 100]. But I don't understand why batch_xs is shaped [7, 100] instead of [100, 7]... So should I change my training model? Or is there another way? – BlakStar
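A likely cause (an assumption based on the code above, not confirmed in the thread): decode returns xy[0:-1], a Python list of 7 scalar tensors, so each feature component is batched separately and sess.run collects 7 arrays of length 100 - shape (7, 100). Returning tf.stack(xy[0:-1]) from decode instead would batch whole [7]-feature records into shape (100, 7). The shape relationship can be sketched without TensorFlow:

```python
import numpy as np

# 100 fake records with 7 features each (stand-ins for the decoded CSV rows).
records = [[float(i + f) for f in range(7)] for i in range(100)]

# Batching the 7 feature components separately (what an unstacked decode
# produces) collects 7 arrays of length 100: shape (7, 100).
component_wise = np.array([[rec[f] for rec in records] for f in range(7)])

# Stacking the features of each record first (the tf.stack(xy[0:-1]) fix
# inside decode) batches whole records: shape (100, 7).
record_wise = np.array(records)

print(component_wise.shape, record_wise.shape)  # (7, 100) (100, 7)
assert (component_wise.T == record_wise).all()
```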