Iterating through a selection with a query using HDFStore

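I have a very large table in an HDFStore. I would like to select a subset of it with a query and then iterate over that subset chunk by chunk. I want the query to be applied before the selection is split into chunks, so that all chunks are the same size.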

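The documentation here seems to indicate that this is the default behavior, but it is not entirely clear. However, it seems to me that the chunking actually happens before the query, as this example shows: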

In [1]: pd.__version__ 
Out[1]: '0.13.0-299-gc9013b8' 

In [2]: df = pd.DataFrame({'number': np.arange(1,11)}) 

In [3]: df 
Out[3]: 
   number
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10

[10 rows x 1 columns] 


In [4]: with pd.get_store('test.h5') as store:
   ...:     store.append('df', df, data_columns=['number'])
   ...:

In [5]: evens = [2, 4, 6, 8, 10]

In [6]: with pd.get_store('test.h5') as store:
   ...:     for chunk in store.select('df', 'number=evens', chunksize=5):
   ...:         print(len(chunk))
   ...:
2
3

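I would expect a single chunk of size 5 if the query were applied before the result is split into chunks, but this example gives two chunks of lengths 2 and 3.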

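Is this the intended behavior, and if so, is there an efficient workaround that yields chunks of equal size without reading the whole table into memory?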

Source

2014-01-24 mcwitt

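I think that when I wrote that, the intent was for chunksize to apply to the results of the query. I think it changed during implementation. As it stands, the chunksize determines the sections of the table that the query is applied to, and you then iterate over those. The problem is that you don't know a priori how many rows you are going to get.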
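To make the difference concrete, here is a small pure-Python model of the two orderings. It is only an illustration: it does not touch HDFStore, and the function names are made up for this sketch. It does reproduce the chunk lengths from the question (2 and 3, versus the single chunk of 5 that was expected).

```python
def chunk_then_filter(rows, predicate, chunksize):
    """Model of the observed behavior: split the table, then filter each piece."""
    for start in range(0, len(rows), chunksize):
        yield [r for r in rows[start:start + chunksize] if predicate(r)]


def filter_then_chunk(rows, predicate, chunksize):
    """Model of the behavior the question expected: filter first, then split."""
    matches = [r for r in rows if predicate(r)]
    for start in range(0, len(matches), chunksize):
        yield matches[start:start + chunksize]


def is_even(n):
    return n % 2 == 0


rows = list(range(1, 11))  # stands in for the 'number' column, 1..10

observed_sizes = [len(c) for c in chunk_then_filter(rows, is_even, 5)]
expected_sizes = [len(c) for c in filter_then_chunk(rows, is_even, 5)]

print(observed_sizes)  # [2, 3]
print(expected_sizes)  # [5]
```

With the first ordering, the size of each chunk depends on how many matching rows happen to fall inside each section of the table, which is why it cannot be known in advance.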

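However, there is a way to do this. Here is a sketch. Use select_as_coordinates to actually execute your query; this returns an Int64Index of the row numbers (the coordinates). Then apply an iterator to that index, selecting the rows for each piece of it.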

Something like this (this makes a nice recipe; I think I will include it in the docs):

In [15]: def chunks(l, n):
   ....:     return [l[i:i+n] for i in range(0, len(l), n)]
   ....:

In [16]: with pd.get_store('test.h5') as store:
   ....:     coordinates = store.select_as_coordinates('df', 'number=evens')
   ....:     for c in chunks(coordinates, 2):
   ....:         print(store.select('df', where=c))
   ....:

   number
1       2
3       4

[2 rows x 1 columns]


   number
5       6
7       8

[2 rows x 1 columns]


   number
9      10

[1 rows x 1 columns]
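For anyone on a recent pandas: pd.get_store has since been removed, so the same recipe might look roughly like the sketch below, which uses pd.HDFStore as a context manager instead. It assumes PyTables is installed. The function name demo and its path argument are made up for this illustration, and the helper is written as a generator rather than building a list.

```python
import numpy as np
import pandas as pd


def chunks(seq, n):
    """Yield successive slices of length n from seq (the last one may be shorter)."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def demo(path):
    """Hypothetical end-to-end run; writes a small HDF5 file at `path`."""
    df = pd.DataFrame({"number": np.arange(1, 11)})
    evens = [2, 4, 6, 8, 10]  # referenced by name inside the query string below

    sizes = []
    with pd.HDFStore(path, mode="w") as store:
        store.append("df", df, data_columns=["number"])
        # Run the query once; only the matching row numbers come back.
        coordinates = store.select_as_coordinates("df", "number=evens")
        # Read the matching rows a fixed number at a time.
        for c in chunks(coordinates, 2):
            chunk = store.select("df", where=c)
            sizes.append(len(chunk))
    return sizes
```

Calling demo('test.h5') should give the same chunk sizes as the session above (2, 2 and 1). Only the coordinates are held in memory, not the table itself, although for a query that matches a very large number of rows the coordinate index can still be sizeable.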

This is in the dev docs.

Source

2014-01-24 21:22:39 Jeff

Now at: http://pandas.pydata.org/pandas-docs/dev/io.html#iterator – Jeff
