How can I get shuffled batches from TFRecords for a large dataset with limited memory?

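With the TensorFlow function tf.train.shuffle_batch, we get shuffled batches by reading the TFRecord into memory as a queue and shuffling within that queue (if I understand it correctly). I have a highly ordered TFRecords file (images with the same label are written together) and a very large dataset (about 2,550,000 images). I want to feed my VGG net batches with randomly mixed labels, but reading all the images into memory just to shuffle them is impossible, and ugly besides. Is there a way to solve this?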

I thought about shuffling the images first and then writing them to the TFRecord, but I can't figure out an efficient way to do that...
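To illustrate the concern, here is a toy, pure-Python model of a bounded shuffle buffer fed with sorted labels. It is only a sketch of the general idea, not TensorFlow's actual queue implementation; the function name `buffered_shuffle` and the sizes are made up for the illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a bounded shuffle buffer would emit them.

    Hypothetical helper: a simplified model, not tf.train.shuffle_batch itself.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Buffer is over capacity: emit one randomly chosen element.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

# Toy stand-in for a sorted TFRecord file: 3 classes, 1000 records each.
labels = [0] * 1000 + [1] * 1000 + [2] * 1000
shuffled = list(buffered_shuffle(labels, buffer_size=100))
first_batch = shuffled[:64]
print(sorted(set(first_batch)))  # only label 0 appears in the first batch
```

Because the buffer (100 items) is much smaller than one class (1000 items), the first batch can only contain records from the first class, however well the buffer itself is shuffled.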

My data is saved like this:

[image]

Here is my code for creating the TFRecords:

import os

import tensorflow as tf
from PIL import Image

dst = "/Users/cory/Desktop/3_key_frame"

# Each subdirectory of dst is one class; skip macOS metadata files.
classes = []
for myclass in os.listdir(dst):
    if myclass.find('.DS_Store') == -1:
        classes.append(myclass)

writer = tf.python_io.TFRecordWriter("train.tfrecords")
for index, name in enumerate(classes):
    class_path = dst + '/' + name
    for img_seq in os.listdir(class_path):
        if img_seq.find('DS_Store') == -1:
            seq_pos = class_path + '/' + img_seq
            if os.path.isdir(seq_pos):
                for img_name in os.listdir(seq_pos):
                    img_path = seq_pos + '/' + img_name
                    img = Image.open(img_path)
                    img = img.resize((64, 64))
                    img_raw = img.tobytes()
                    # One Example per image: integer class label plus raw pixel bytes.
                    example = tf.train.Example(features=tf.train.Features(feature={
                        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[index])),
                        'img_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
                    }))
                    writer.write(example.SerializeToString())
writer.close()

2017-08-12 Mcory

Assuming your data is stored like this:

/path/to/images/LABEL_1/image001.jpg 
/path/to/images/LABEL_1/image002.jpg 
... 
/path/to/images/LABEL_10/image001.jpg

Get all the file names into one flat list and shuffle them:

import glob
import random
# One directory level per label, then the image files.
filenames = glob.glob('/path/to/images/*/*.jpg')
random.shuffle(filenames)

Create a dictionary that maps label names to numeric labels:

class_to_index = {'LABEL_1':0, 'LABEL_2': 1} # more classes I assume...
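With many classes, the mapping could also be derived from the paths instead of typed by hand. The following is a minimal sketch, assuming the label is always the name of the image's parent directory; `build_class_to_index` is a hypothetical helper, and the indices it assigns follow lexicographic order, so they are arbitrary but stable between runs.

```python
import os

def build_class_to_index(filenames):
    """Map each parent-directory name (the label) to a stable integer index."""
    labels = sorted({os.path.basename(os.path.dirname(f)) for f in filenames})
    return {label: i for i, label in enumerate(labels)}

# Example paths following the layout assumed above.
example_files = [
    "/path/to/images/LABEL_1/image001.jpg",
    "/path/to/images/LABEL_1/image002.jpg",
    "/path/to/images/LABEL_10/image001.jpg",
    "/path/to/images/LABEL_2/image001.jpg",
]
class_to_index = build_class_to_index(example_files)
print(class_to_index)
```

Note that lexicographic sorting places LABEL_10 before LABEL_2; that is harmless as long as the same mapping is used for training and evaluation.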

Now you can loop over all the images and retrieve each label from its path:

import tensorflow as tf
from PIL import Image

writer = tf.python_io.TFRecordWriter("train.tfrecords")
for f in filenames:
    img = Image.open(f)
    img = img.resize((64, 64))
    img_raw = img.tobytes()
    # The parent directory name is the label, e.g. 'LABEL_1'.
    label = f.split('/')[-2]
    example = tf.train.Example(features=tf.train.Features(feature={
        # Int64List expects a list, so wrap the single index.
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[class_to_index[label]])),
        'img_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
    }))
    writer.write(example.SerializeToString())
writer.close()

Hope this helps :)

2017-10-11 17:20:04

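I assume you have a list of file names and/or a known directory structure for the labelled dataset. It may be worth iterating through the files on a per-class basis, taking N items from each class in turn. Essentially, you interleave the dataset so that ordering is no longer a problem. If I understand correctly, your main concern is that when you sample from the TFRecord, a subset of your data may consist entirely of one class rather than being a good representation of the whole?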

If the file were structured as:

0 0 0 0 1 1 1 1 2 2 2 2 0 0 0 0 1 1 1 1 2 2 2 2 ... etc

then it might be easier for shuffle_batch to produce better training samples.

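This is the solution I went with, since there does not appear to be an additional shuffle parameter that lets you keep an even distribution of class labels within a batch.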
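The interleaving described above could be sketched roughly as follows. This is only an illustration in plain Python: `interleave_by_class` and the toy file names are hypothetical, and in practice each yielded item would be written to the TFRecord in this order.

```python
from itertools import islice

def interleave_by_class(files_by_class, n):
    """Yield (label, item) pairs, taking up to n items from each class in turn."""
    iterators = {label: iter(items) for label, items in files_by_class.items()}
    while iterators:
        for label in list(iterators):
            chunk = list(islice(iterators[label], n))
            if not chunk:
                # This class is exhausted; stop visiting it.
                del iterators[label]
                continue
            for item in chunk:
                yield label, item

# Toy stand-in for per-class file lists (8 files in each of 3 classes).
files_by_class = {
    0: ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"],
    1: ["b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7"],
    2: ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"],
}
order = [label for label, _ in interleave_by_class(files_by_class, n=4)]
print(order)
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

With N = 4 this reproduces the 0 0 0 0 1 1 1 1 2 2 2 2 pattern, so a shuffle buffer a few multiples of N times the number of classes already sees every class.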

2017-10-28 15:28:05 awilliamson
