With:
import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py has no problem seeing and interpreting the strings as Unicode, both when writing and when reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
DATASET "text" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { (3)/(3) }
DATA {
(0): "\37777777703\37777777670", "\37777777703\37777777646",
(2): "\37777777703\37777777645"
}
ATTRIBUTE "1" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
}
}
}
}
}
Note that in both cases (the dataset and the attribute) the datatype is marked as UTF8:
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
That matches what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
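They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.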
Let h5py (or another reader) worry about interpreting \37777777703\37777777670 as the proper Unicode character.
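To see why those escapes are nothing to worry about, here is a minimal sketch that turns them back into text. It assumes that each 11-digit octal escape in the h5dump output is a single byte printed as a sign-extended 32-bit value, so only the low 8 bits matter; that reading is my interpretation of the output above, not something taken from the h5dump documentation. The helper name decode_h5dump_escapes is made up for this illustration.

```python
import re

# Assumption: h5dump printed each non-ASCII byte as a sign-extended
# 32-bit value in octal (11 digits), e.g. \37777777703 == 0xFFFFFFC3.
ESCAPE = re.compile(r"\\([0-7]{11})")


def decode_h5dump_escapes(dumped: str) -> str:
    """Hypothetical helper: rebuild the UTF-8 text behind h5dump's octal escapes."""
    raw = bytearray()
    pos = 0
    for match in ESCAPE.finditer(dumped):
        # Copy the plain text that precedes this escape unchanged.
        raw += dumped[pos:match.start()].encode("utf-8")
        # Keep only the low byte of the sign-extended value.
        raw.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    raw += dumped[pos:].encode("utf-8")
    return raw.decode("utf-8")


print(decode_h5dump_escapes(r"\37777777703\37777777670"))
```

Under that assumption, \37777777703\37777777670 reduces to the bytes 0xC3 0xB8, which is exactly what 'ø'.encode('utf-8') produces in Python, so the file holds ordinary UTF-8 and only the dump's display is unusual.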
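Reading the characters with Python3 'h5py' looks fine. I did use 'h5dump' to look at the file your code produces. – hpaulj
'h5dump' also shows that the 'DATATYPE' of the string is 'CSET H5T_CSET_UTF8;' – hpaulj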