在排序数据上创建索引

我有一个带有一些排序数据的文本文件，用换行符分隔。例如：在排序数据上创建索引

... abc123 abc124 abd123 abd124 abd125 ...

现在我想创建一个数据集的索引，它应该（至少）支持：

getStringByIndex（N）：返回的第n项排序列表;
getIndexByString（s）：在所有项目中查找s，返回其索引（如果找不到则返回-1）;

我读过一些索引算法，如哈希和B树。具有儿童大小的额外领域的B型树应该做得很好。但是由于日期排序，我想知道是否有比通过插入所有项目来构建B-Tree更有效的解决方案？

2013-04-04 richselian

我可以创建一个数据结构吗？说一个具有数组和哈希集的数据结构。插入数组很容易，每次将其插入到数组中时，将该项插入哈希集。当你做getStringByIndex时，使用数组和getIndexByString，使用hashset？ – Calpis 2013-04-05 00:00:39

它更可能是“数据库”而不是内存结构。索引应存储在磁盘文件中。 – richselian 2013-04-05 00:16:15

数据集可能大到100GB，无法加载到内存中。否则简单的二分查找可以解决这个问题。 – richselian 2013-04-05 00:26:55

由于数据是经过排序的，您只需在内存中保留一小段稀疏的数据子集，即可快速有效地定位内容。例如，假设我们决定将每个第N个元素存储在内存中。为了有效地初始化您的API，您希望将这个稀疏列表编译到磁盘上的单独文件中，因此您不必通过100GB数据流来获取它。

对于这些术语中的每一个，都需要保存相对于文件开头的磁盘偏移量。然后，所有你所要做的就是加载稀疏列表/偏移对到内存中，你的两个请求的实现变得简单明了：

getStringByIndex(n): 
     Get floor(n/N)-th string/offset pair from list 
     Seek offset position in index 
     Read/Skip n mod N strings, then return the next one 

    getIndexByString(s): 
     Binary search over sparse list in memory 
      Locate lower and upper bound string/offset pairs 
     If a string/offset pair is in the i-th position in our sparse list, 
      then the string itself is the (N x i)-th string in our index. 
      We can use this information to compute the return value 
     If the string we want isn't in memory: 
      Seek lower-bound offset in index 
      Read strings until we: 
       a) Find a match 
       b) Reach the high-bound offset 
       c) Reach a string which is lexicographically greater than the one we are looking for 
     Else 
      Just return the index for the matching string in our sparse list

如果您索引中的字符串是固定宽度，可以做出更大的优化。

如果你实现它，你会希望小心你对这个算法的'N'选择。请记住，从磁盘位置读取10个字节的成本并不比从相同位置读取10,000个字节的成本低很多：这是磁盘搜寻的开销，进出I/O调用伤害最大。

来源

2013-04-05 04:03:02 torquestomp

非常感谢您的提示。我会尝试一个2级索引（稀疏子集的稀疏子集）。因为对于100GB和N = 4KB，我必须将25MB索引数据加载到1级内存中，而2级只需要6.25KB。 – richselian 2013-04-06 05:10:25

在排序数据上创建索引

回答

相关问题