
I have several datasets in TSV format, each larger than 10 GB, and I need them in HDF5 format. I am working in Python. I have read that the pandas package can read these files and store them as HDF5 without using much memory, but I cannot manage it without my machine running out of memory. I also tried Spark, but I don't feel comfortable with it. So, is there any solution other than reading the whole file into memory? How can I read a TSV file and store it as HDF5 without running out of memory?


thnx man. Exactly what I needed. – boh

Answer

import pandas as pd 
import numpy as np 

# this uses Python 3.4 
# on Python 2.x, replace the import below with 'from StringIO import StringIO' 
import io 


# generate some 'large' tsv 
raw_data = pd.DataFrame(np.random.randn(10000, 5), columns='A B C D E'.split()) 
raw_tsv = raw_data.to_csv(sep='\t') 
# start reading the tsv in chunks, 50 rows per chunk (adjust this to what your machine can handle) 
# StringIO is used here only to provide an in-memory string buffer; 
# if you are reading from an actual file, just pass the file path instead 
file_reader = pd.read_csv(filepath_or_buffer=io.StringIO(raw_tsv), sep='\t', chunksize=50) 
# to peek at what's inside a chunk you could type:  list(file_reader)[0] 
# which gives exactly 50 rows (shown below) 
# don't do this in your real processing: file_reader is a lazy generator 
# and can only be consumed once 

    Unnamed: 0  A  B  C  D  E 
0   0 -1.2553 0.1386 0.6201 0.1014 -0.4067 
1   1 -1.0127 -0.8122 -0.0850 -0.1887 -0.9169 
2   2 0.5512 0.7816 0.0729 -1.1310 -0.8213 
3   3 0.1159 1.1608 -0.4519 -2.1344 0.1520 
4   4 -0.5375 -0.6034 0.7518 -0.8381 0.3100 
5   5 0.5895 0.5698 -0.9438 3.4536 0.5415 
6   6 -1.2809 0.5412 0.5298 -0.8242 1.8116 
7   7 0.7242 -1.6750 1.0408 -0.1195 0.6617 
8   8 -1.4313 -0.4498 -1.6069 -0.7309 -1.1688 
9   9 -0.3073 0.3158 0.6478 -0.6361 -0.7203 
..   ...  ...  ...  ...  ...  ... 
40   40 -0.3143 -1.9459 0.0877 -0.0310 -2.3967 
41   41 -0.8487 0.1104 1.2564 1.0890 0.6501 
42   42 1.6665 -0.0094 -0.0889 1.3877 0.7752 
43   43 0.9872 -1.5167 0.0059 0.4917 1.8728 
44   44 0.4096 -1.2913 1.7731 0.3443 1.0094 
45   45 -0.2633 1.8474 -1.0781 -1.4475 -0.2212 
46   46 -0.2872 -0.0600 0.0958 -0.2526 0.1531 
47   47 -0.7517 -0.1358 -0.5520 -1.0533 -1.0962 
48   48 0.8421 -0.8751 0.5380 0.7147 1.0812 
49   49 -0.8216 1.0702 0.8911 0.5189 -0.1725 

[50 rows x 6 columns] 

# set up your HDF5 store with the highest compression level (9) 
h5_file = pd.HDFStore('your_hdf5_file.h5', complevel=9, complib='blosc') 

h5_file 
Out[18]: 
<class 'pandas.io.pytables.HDFStore'> 
File path: your_hdf5_file.h5 
Empty 


# now, start processing 
for df_chunk in file_reader: 
    # use append so every chunk is added to the same table ('put' would overwrite it) 
    h5_file.append('big_data', df_chunk, complevel=9, complib='blosc') 

# after processing, close hdf5 file 
h5_file.close() 


# check your hdf5 file, 
pd.HDFStore('your_hdf5_file.h5') 
# now it has all 10,000 rows, and we did this chunk by chunk 

Out[21]: 
<class 'pandas.io.pytables.HDFStore'> 
File path: your_hdf5_file.h5 
/big_data   frame_table (typ->appendable,nrows->10000,ncols->6,indexers->[index]) 
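
For completeness, here is a minimal sketch of reading the stored table back in chunks as well, so the data never has to fit into memory in either direction. It assumes the file and key names used above; the chunk size of 1000 is an arbitrary choice.

import pandas as pd 

# re-open the store read-only 
store = pd.HDFStore('your_hdf5_file.h5', mode='r') 

# select() with chunksize returns an iterator of DataFrames instead of one big frame 
for chunk in store.select('big_data', chunksize=1000): 
    # process each 1000-row DataFrame here (aggregate, filter, write elsewhere, ...) 
    print(chunk.shape) 

store.close() 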