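For domains with a large number of items, you should be able to get a highly accurate list of attributes using random sampling. Here is some C#-ish pseudocode: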
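    // Pseudocode: run this query and read the returned count.
    int domainCount = "select count(*) from Person";
    int avgSkipCount = domainCount / 2500;   // average gap between sampled items
    int processedCount = 0;
    string nextToken = null;
    var attributeNames = new HashSet<string>();
    var random = new Random();
    do
    {
        // Skip a random number of items. A count(*) query with a limit returns
        // a NextToken positioned just past the items it counted.
        int nextSkipCount = random.Next(0, avgSkipCount * 2);
        var countRequest = new SelectRequest
        {
            NextToken = nextToken,
            SelectExpression = "select count(*) from Person limit " + nextSkipCount
        };
        var countResponse = SimpleDb.Select(countRequest);
        nextToken = countResponse.NextToken;
        processedCount += countResponse.Count;
        // A null token means the end of the domain was reached; reusing it
        // would start the next query from the beginning again.
        if (nextToken == null) break;

        // Read the single item at the new position and record its attribute names.
        var getRequest = new SelectRequest
        {
            NextToken = nextToken,
            SelectExpression = "select * from Person limit 1"
        };
        var getResponse = SimpleDb.Select(getRequest);
        nextToken = getResponse.NextToken;
        processedCount++;
        attributeNames.UnionWith(getResponse.AttributeNames);
    } while (domainCount > processedCount);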
This relies on the fact that you can use the NextToken returned from a SELECT COUNT(*) query to skip over records in SimpleDB. Mocky wrote an excellent explanation of how to accomplish this. I have explained how to accomplish efficient paging like this with Simple Savant.
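The skip-and-sample loop can be tried out without touching SimpleDB. The following Python sketch is a simulation only: a plain list stands in for the Person domain, a list index stands in for NextToken, and `sample_attribute_names` is a hypothetical helper name, not part of any SimpleDB client.

```python
import random


def sample_attribute_names(items, sample_size=2500, rng=None):
    """Collect attribute names from roughly `sample_size` items by skipping a
    random number of items between reads.

    `items` is a list of dicts simulating a domain; this is an illustration,
    not a SimpleDB client.
    """
    rng = rng or random.Random()
    domain_count = len(items)                  # stands in for "select count(*)"
    avg_skip = max(1, domain_count // sample_size)
    position = 0                               # stands in for NextToken
    names = set()
    while position < domain_count:
        # Simulates "select count(*) ... limit n": move past n items.
        position += rng.randint(0, avg_skip * 2 - 1)
        if position >= domain_count:
            break                              # ran off the end of the domain
        # Simulates "select * ... limit 1": read one item at the new position.
        names.update(items[position].keys())
        position += 1
    return names


if __name__ == "__main__":
    # Simulated domain: every item has "name", one in five also has "email".
    domain = [
        {"name": "n", **({"email": "e"} if i % 5 == 0 else {})}
        for i in range(100_000)
    ]
    print(sorted(sample_attribute_names(domain, rng=random.Random(0))))
```

Against the real service, each skip and each read would be a separate Select call, so the request count is about twice the sample size.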
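This should give you 99% accuracy on most data sets, which should be good enough for most real-world applications. Statistical theory says that a sample size of 2500 gives you effectively the same accuracy for a data set of any size, so this approach scales even to millions of items.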
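One way to see why the domain size drops out: if items are sampled independently and uniformly (an idealizing assumption, since the skip loop only approximates that), the chance of never seeing an attribute depends only on how common the attribute is and on the sample size. A minimal sketch, with `miss_probability` as a hypothetical helper name:

```python
def miss_probability(attribute_frequency, sample_size):
    """Chance that none of `sample_size` independent uniform draws lands on an
    item carrying an attribute present on `attribute_frequency` of all items.

    Note that the total number of items does not appear in the formula.
    """
    return (1.0 - attribute_frequency) ** sample_size


if __name__ == "__main__":
    for freq in (0.01, 0.002, 0.001):
        chance = miss_probability(freq, 2500)
        print(f"attribute on {freq:.1%} of items: miss chance {chance:.2e}")
```

Under that assumption, an attribute present on 0.2% of items is missed by a 2500-item sample less than 1% of the time, whether the domain holds ten thousand items or ten million.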
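This is obviously not ideal, since it still requires a lot of queries, but if your data set has a relatively limited number of attribute variations, you should be able to get away with a much smaller sample size.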
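To put a rough number on "smaller", assume again that items are drawn independently and uniformly, so an attribute carried by a fraction p of items is missed by n samples with probability (1 - p)^n. Solving for n gives the sample size needed for a chosen miss probability. This is a back-of-the-envelope sketch, and `required_sample_size` is a hypothetical helper name:

```python
import math


def required_sample_size(rarest_frequency, max_miss_probability=0.01):
    """Smallest n with (1 - rarest_frequency) ** n <= max_miss_probability,
    assuming independent uniform sampling of items."""
    if not 0.0 < rarest_frequency < 1.0:
        raise ValueError("rarest_frequency must be strictly between 0 and 1")
    if not 0.0 < max_miss_probability < 1.0:
        raise ValueError("max_miss_probability must be strictly between 0 and 1")
    return math.ceil(
        math.log(max_miss_probability) / math.log(1.0 - rarest_frequency)
    )


if __name__ == "__main__":
    # If the rarest attribute you care about appears on 5% of items:
    print(required_sample_size(0.05))
```

If every attribute you care about shows up on at least 5% of items, about 90 sampled items already keep the miss chance for any one attribute under 1%, which is far fewer queries than a 2500-item sample needs.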