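Counting words in documents in parallel

Do you know how to make this loop faster? It counts how often each word occurs in a document.

_documentVectorSpace is a list of objects that hold additional information about the documents.

_documentVectorSpace[i].Terms is the array of words in the document.

_distinctTerms is a HashSet containing every unique word found across all documents.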

Parallel.For(0, _documentVectorSpace.Count, i => 
{ 
    int count = 0; 
    double[] vec = new double[_distinctTerms.Count]; 
    foreach (string term in _distinctTerms) 
    { 
     vec[count++] = Weight(_documentVectorSpace[i].Terms, term); 
    } 
    _documentVectorSpace[i].VectorSpace = vec; 
});

where Weight is defined as:

private float Weight(string[] document, string term) 
{ 
    return document.Where(s => s == term).Count(); 
}

Source

2013-10-02 Mateusz Puwałowski

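For string comparison, you should use 'String.Equals(s, term, StringComparison.OrdinalIgnoreCase)'. – MichaelS

Sounds like "information retrieval" homework :) – Alireza

You enumerate _documentVectorSpace[i].Terms once for every term. You should invert your loops, so that you start from _documentVectorSpace[i].Terms and look up each value in _distinctTerms.

Also, it is hard to tell from this example how efficiently you generate _documentVectorSpace. It may well be possible to skip much of the work this function has to do at run time by doing it there.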
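One way to read the "invert your loops" advice, as a minimal sketch rather than the answerer's own code: tally the document's words in a single pass, then fill the vector with one lookup per distinct term. The class name TermCounting and the method CountTerms are hypothetical names introduced for illustration. The sketch assumes _distinctTerms stays a HashSet<string> that is not modified while counting, so its enumeration order is stable, as the loop in the question already assumes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original question.
public static class TermCounting
{
    // Counts how often each distinct term occurs in one document.
    // The result follows the enumeration order of distinctTerms,
    // like the vec[count++] loop in the question.
    public static double[] CountTerms(string[] terms, HashSet<string> distinctTerms)
    {
        // Single pass over the document: tally every word once.
        var tally = new Dictionary<string, int>();
        foreach (string word in terms)
        {
            int seen;
            tally.TryGetValue(word, out seen); // seen is 0 for a new word
            tally[word] = seen + 1;
        }

        // Single pass over the distinct terms: one hash lookup each.
        var vec = new double[distinctTerms.Count];
        int position = 0;
        foreach (string term in distinctTerms)
        {
            int occurrences;
            tally.TryGetValue(term, out occurrences); // 0 if absent
            vec[position++] = occurrences;
        }
        return vec;
    }
}
```

Inside the Parallel.For body this would replace the foreach over _distinctTerms with a single call such as `_documentVectorSpace[i].VectorSpace = TermCounting.CountTerms(_documentVectorSpace[i].Terms, _distinctTerms);`. The work per document drops from roughly (document length × number of distinct terms) comparisons to about one hash operation per word plus one per distinct term.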

Source

2013-10-02 17:59:25 Guvante

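Scanning the document once for every word in _distinctTerms is very expensive, and you are not taking advantage of the HashSet's fast lookups. What you should do instead is scan the document once, identify each word that is in _distinctTerms, and update the vector. Something along these lines (untested code):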

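Parallel.For(0, _documentVectorSpace.Count, i => 
{ 
    double[] vec = new double[_distinctTerms.Count]; 
    // One sequential pass per document. The outer loop already runs in
    // parallel, and Interlocked.Increment has no overload for double.
    foreach (string term in _documentVectorSpace[i].Terms) 
    { 
        int index; 
        // _distinctTerms maps each term to its position in the vector.
        if (_distinctTerms.TryGetValue(term, out index)) 
        { 
            vec[index]++; 
        } 
    } 
    _documentVectorSpace[i].VectorSpace = vec; 
});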

Of course, _distinctTerms should now be a dictionary that maps each term to its index.
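The answer does not show how that dictionary is built. A minimal sketch, assuming the distinct terms are still available as some enumerable of strings (for example the original HashSet); the class name TermIndex and the method Build are hypothetical:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class TermIndex
{
    // Assigns each distinct term a fixed position in the vector,
    // in the order the terms are first encountered.
    public static Dictionary<string, int> Build(IEnumerable<string> distinctTerms)
    {
        var index = new Dictionary<string, int>();
        foreach (string term in distinctTerms)
        {
            if (!index.ContainsKey(term))
            {
                int next = index.Count; // next free position
                index.Add(term, next);
            }
        }
        return index;
    }
}
```

The dictionary should be built once, before the Parallel.For starts. A Dictionary supports multiple concurrent readers as long as nothing modifies it, so the parallel loop can share it without locking. If the case-insensitive matching suggested in the comments is wanted, passing StringComparer.OrdinalIgnoreCase to the Dictionary constructor is one option, provided the distinct terms were collected with the same comparer.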

Source

2013-10-02 18:04:36
