2014-04-22 75 views
15

I want to store a variable-length list of strings in an HDF5 dataset from Python. The code I am using is:

import h5py

h5File = h5py.File('xxx.h5', 'w')
strList = ['asas', 'asas', 'asas']
h5File.create_dataset('xxx', (len(strList), 1), 'S10', strList)
h5File.flush()
h5File.close()

I get an error stating "TypeError: No conversion path for dtype: dtype('<U3')", where < is the actual less-than symbol.
How can I solve this problem?

+0

For starters, you have a typo on 'create_dataset'. Can you give the exact code you are using, in particular where 'strList' comes from? – SlightlyCuban

+0

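Sorry for the typo. I am trying to serialize a pandas DataFrame to an HDF5 file, so I have to create a header containing all the column names. I extract the column names into a list and try to write that list to an HDF5 dataset. – gman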

+0

Apart from the typo, the code above reproduces exactly the same situation. – gman

Answers

14

You are reading in Unicode strings, but specifying your datatype as ASCII. According to the h5py wiki, h5py does not currently support this conversion.

You need to encode the strings in a format h5py can handle:

asciiList = [n.encode("ascii", "ignore") for n in strList] 
h5File.create_dataset('xxx', (len(asciiList),1),'S10', asciiList) 

Note: not everything that can be encoded in UTF-8 can be encoded in ASCII!
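To make the note above concrete, here is a minimal sketch using only the standard library. The sample strings are made up for illustration; the point is that `errors="ignore"` silently drops characters ASCII cannot represent, while the default strict mode raises instead.

```python
# Hypothetical sample data: one plain ASCII name and one containing U+00E9.
str_list = ['asas', 'caf\u00e9']

# errors="ignore" silently drops characters that ASCII cannot represent.
ascii_list = [s.encode('ascii', 'ignore') for s in str_list]
print(ascii_list)  # [b'asas', b'caf']

# The default (errors="strict") raises instead of dropping data.
try:
    'caf\u00e9'.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```

If losing characters is not acceptable, strict mode at least makes the problem visible before anything is written to the file.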

+0

Thank you, it works perfectly. – gman

+0

What is the correct way to re-extract these strings from the HDF5 file (in Python 3)? – DilithiumMatrix

+0

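@DilithiumMatrix ASCII is also valid UTF-8, so if you need the 'str' type you can use 'ascii.decode('utf-8')'. Note: my answer discards non-ASCII characters. If you saved them with 'encode('unicode_escape')' instead, you need 'decode('unicode_escape')' to convert them back. – SlightlyCuban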
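The two decoding routes mentioned in this comment can be sketched with plain Python bytes, without touching a file. The sample value is an assumption chosen to include one non-ASCII character.

```python
# Hypothetical value containing a non-ASCII character (U+00E9).
original = 'caf\u00e9'

# Plain ASCII bytes decode straight back to str.
plain = b'asas'.decode('utf-8')
print(plain)  # asas

# unicode_escape keeps non-ASCII characters as ASCII-only escape sequences.
escaped = original.encode('unicode_escape')
print(escaped)  # b'caf\\xe9'

# Decoding with the same codec restores the original text.
restored = escaped.decode('unicode_escape')
print(restored == original)  # True
```

Note that the escaped form is longer than the original text (7 bytes here for 4 characters), so a fixed-length type such as 'S10' needs enough room for the escaped bytes.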

1

In HDF5, data in VL format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the de facto standard for representing strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the “object” (“O”) dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an “O” dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored.

Example

In [27]: dt = h5py.special_dtype(vlen=str) 

In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt) 

In [29]: dset[0] = 'the change of water into water vapour' 

In [30]: dset[0] 
Out[30]: 'the change of water into water vapour'

3

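I was in a similar situation, wanting to store the column names of a DataFrame as a dataset in an HDF5 file. Assuming df.columns is what I want to store, I found that the following works: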

import h5py

h5File = h5py.File('my_file.h5', 'w')
h5File['col_names'] = df.columns.values.astype('S')
h5File.close()

This assumes that the column names are "simple" strings that can be encoded in ASCII.
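For completeness, here is a rough round-trip sketch of this approach, including reading the names back as Python 3 strings. It assumes h5py and NumPy are installed; the `col_names` array is a made-up stand-in for `df.columns.values`, and the file is written to a temporary directory rather than a real project path.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in for df.columns.values (an object array of str); the names are made up.
col_names = np.array(['price', 'volume', 'date'], dtype=object)

path = os.path.join(tempfile.mkdtemp(), 'my_file.h5')

# Write: astype('S') produces fixed-length byte strings sized to the longest name.
with h5py.File(path, 'w') as h5_file:
    h5_file['col_names'] = col_names.astype('S')

# Read: the dataset comes back as a NumPy array of bytes, so decode each entry.
with h5py.File(path, 'r') as h5_file:
    raw = h5_file['col_names'][:]
restored = [name.decode('ascii') for name in raw]
print(restored)  # ['price', 'volume', 'date']
```

Using the file as a context manager closes it even if the write fails, which avoids leaving a half-written file handle open.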
