2014-10-28 104 views
2

Split, group and count string

I want to split, group, and count the occurrences of specific phrases in a large string in C#.

The following pseudocode should give some indication of what I am trying to achieve.

var my_string = "In the end this is not the end"; 
my_string.groupCount(2); 

==> 
    [0] : {Key: "In the", Count:1} 
    [1] : {Key: "the end", Count:2} 
    [2] : {Key: "end this", Count: 1} 
    [3] : {Key: "this is", Count: 1} 
    [4] : {Key: "is not", Count: 1} 
    [5] : {Key: "not the", Count: 1} 

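As you will notice, this is not as simple as splitting the string and counting each substring. The example groups every two consecutive words, but ideally it should be able to handle any group size.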

+0

Thanks, I missed that. – tribe84 2014-10-28 20:38:51

+0

@GrantWinney - No, the two questions are similar, but not the same. – tribe84 2014-10-28 20:39:39

+0

How do you want to split 'input'? – 2014-10-28 20:49:59

Answers

1

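Here is an outline of how you could approach this:

  • Use string's regular Split method to get the individual words
  • Make a dictionary for the counts
  • Go through all pairs of adjacent words, building a composite key for each pair and incrementing its count

Here is how you can implement this: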

// str is the input string
var counts = new Dictionary<string,int>(); 
var tokens = str.Split(' '); 
// Walk over every pair of adjacent words
for (var i = 0 ; i < tokens.Length-1 ; i++) { 
    var key = tokens[i]+" "+tokens[i+1]; 
    int c; 
    if (!counts.TryGetValue(key, out c)) { 
        c = 0; 
    } 
    counts[key] = c + 1; 
} 

Demo.

+0

What happens if the string is large? – gabba 2014-10-28 21:23:55

+0

@gabba The same thing that happens when the string is not very large :-) The task is linear in both time and memory. – dasblinkenlight 2014-10-28 21:26:38

+0

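When you split a 2 GB string into thousands of small strings, you more than double the memory consumption. We don't need to do that. We only need a single scan and a small dictionary. – gabba 2014-10-28 21:31:46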

0

Here is my implementation. I have updated it to move the work into a function and to let you specify an arbitrary group size.

public static Dictionary<string,int> groupCount(string str, int groupSize) 
{ 
    string[] tokens = str.Split(new char[] { ' ' }); 

    var dict = new Dictionary<string,int>(); 
    for (int i = 0; i < tokens.Length - (groupSize-1); i++) 
    { 
     string key = ""; 
     for (int j = 0; j < groupSize; j++) 
     { 
      key += tokens[i+j] + " "; 
     } 
     key = key.Substring(0, key.Length-1); 

     if (dict.ContainsKey(key)) { 
      dict[key]++; 
     } else { 
      dict[key] = 1; 
     } 
    } 

    return dict; 
} 

Usage is as follows:

string str = "In the end this is not the end"; 
int groupSize = 2; 
var dict = groupCount(str, groupSize); 

Console.WriteLine("Group Of {0}:", groupSize); 
foreach (string k in dict.Keys) { 
    Console.WriteLine("Key: \"{0}\", Count: {1}", k, dict[k]); 
} 

.NET Fiddle

+0

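I'll note that it is very similar to dasblinkenlight's take. It uses Split to get the individual words, a for loop to build the tokens, and a dictionary to keep the count of each token. – 2014-10-28 20:51:15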

0

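You can create a method that builds phrases from the given words. It is not very efficient (because of Skip), but the implementation is simple: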

private static IEnumerable<string> CreatePhrases(string[] words, int wordsCount) 
{ 
    for(int i = 0; i <= words.Length - wordsCount; i++) 
     yield return String.Join(" ", words.Skip(i).Take(wordsCount)); 
} 

The rest is simple: split your string into words, build the phrases, and count the occurrences of each phrase in the original string:

var my_string = "In the end this is not the end"; 
var words = my_string.Split(); 
var result = from p in CreatePhrases(words, 2) 
      group p by p into g 
      select new { g.Key, Count = g.Count()}; 

Result:

[ 
    Key: "In the", Count: 1, 
    Key: "the end", Count: 2, 
    Key: "end this", Count: 1, 
    Key: "this is", Count: 1, 
    Key: "is not", Count: 1, 
    Key: "not the", Count: 1 
] 
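
The Skip overhead mentioned above can be avoided when the words are already in an array. A minimal sketch, assuming wordsCount is at least 1: the String.Join(String, String[], Int32, Int32) overload joins a slice of the array directly. The names PhraseBuilder and CreatePhrasesFast are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseBuilder
{
    // Hypothetical variant of CreatePhrases: joins words[i .. i + wordsCount - 1]
    // with the array-slice overload of String.Join instead of Skip/Take.
    public static IEnumerable<string> CreatePhrasesFast(string[] words, int wordsCount)
    {
        for (int i = 0; i <= words.Length - wordsCount; i++)
            yield return String.Join(" ", words, i, wordsCount);
    }
}
```

It can be dropped into the query above in place of CreatePhrases; the grouping part stays the same.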

Creating consecutive groups of items (a more efficient approach that works with any IEnumerable):

public static IEnumerable<IEnumerable<T>> ToConsecutiveGroups<T>(
    this IEnumerable<T> source, int size) 
{ 
    // You can check arguments here    
    Queue<T> bucket = new Queue<T>(); 

    foreach(var item in source) 
    { 
     bucket.Enqueue(item); 
     if (bucket.Count == size) 
     { 
      yield return bucket.ToArray(); 
      bucket.Dequeue(); 
     } 
    } 
} 

And the whole computation can be done in one statement:

var my_string = "In the end this is not the end"; 
var result = my_string.Split() 
       .ToConsecutiveGroups(2) 
       .Select(words => String.Join(" ", words)) 
       .GroupBy(p => p) 
       .Select(g => new { g.Key, Count = g.Count()}); 

+1

Yeeaahh, at last a one-line expression, the best solution here – gabba 2014-10-28 21:29:04

+1

@gabba Yes, the best I can do at 1 AM :) Instead of just returning the counts, I did something different :) – 2014-10-28 21:40:36

+0

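Your solution would be even better if you wrote a SplitToConsecutiveGroups method that iterates through the source string and returns the word-group combinations – gabba 2014-10-29 09:07:03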

1

Here is another approach that uses an ILookup<string, string[]> to count the occurrences of each word array:

var my_string = "In the end this is not the end"; 
int step = 2; 
string[] words = my_string.Split(); 
var groupWords = new List<string[]>(); 
for (int i = 0; i + step <= words.Length; i++) 
{ 
    string[] group = new string[step]; 
    for (int ii = 0; ii < step; ii++) 
     group[ii] = words[i + ii]; 
    groupWords.Add(group); 
} 
var lookup = groupWords.ToLookup(w => string.Join(" ", w)); 

foreach(var kv in lookup) 
    Console.WriteLine("Key: \"{0}\", Count: {1}", kv.Key, kv.Count()); 

Output:

Key: "In the", Count: 1 
Key: "the end", Count: 2 
Key: "end this", Count: 1 
Key: "this is", Count: 1 
Key: "is not", Count: 1 
Key: "not the", Count: 1 

+1

Nice! A lookup works well here – gabba 2014-10-28 21:44:06

0

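Assuming you need to process a large string, I would not recommend splitting the whole string. You need to walk through it, remember where the last groupCount words start, and count the combinations in a dictionary: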

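var my_string = "In the end this is not the end"; 
var groupCount = 2; 

var groups = new Dictionary<string, int>(); 
// Start indexes of the most recent words (at most groupCount of them) 
var lastGroupCountWordIndexes = new Queue<int>(); 

// Run one step past the last character so the final word is closed too 
for (int i = 0; i <= my_string.Length; i++) 
{ 
    bool atBoundary = i == my_string.Length || my_string[i] == ' '; 
    bool prevIsSpace = i == 0 || my_string[i - 1] == ' '; 

    if (!atBoundary && prevIsSpace) 
    { 
        // A new word starts here 
        lastGroupCountWordIndexes.Enqueue(i); 
    } 
    else if (atBoundary && !prevIsSpace && lastGroupCountWordIndexes.Count == groupCount) 
    { 
        // A word just ended and the window is full: count the group 
        var firstWordInGroupIndex = lastGroupCountWordIndexes.Dequeue(); 
        var groupKey = my_string.Substring(firstWordInGroupIndex, i - firstWordInGroupIndex); 

        if (!groups.ContainsKey(groupKey)) 
        { 
            groups.Add(groupKey, 1); 
        } 
        else 
        { 
            groups[groupKey]++; 
        } 
    } 
} 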
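
For input that is too large to hold as a single string, the same kind of scan could read from a TextReader instead. This is only a sketch under assumptions: words are separated by whitespace, and the names StreamingPhraseCounter and CountGroups are hypothetical, not taken from any of the answers. Memory use is then bounded by the dictionary plus the last groupSize words.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StreamingPhraseCounter
{
    // Counts every run of groupSize consecutive words read from the reader.
    public static Dictionary<string, int> CountGroups(TextReader reader, int groupSize)
    {
        if (groupSize < 1) throw new ArgumentOutOfRangeException("groupSize");

        var counts = new Dictionary<string, int>();
        var window = new Queue<string>();   // the last groupSize words seen
        var word = new StringBuilder();     // the word currently being read

        while (true)
        {
            int ch = reader.Read();         // -1 signals end of input
            bool isSeparator = ch == -1 || char.IsWhiteSpace((char)ch);

            if (!isSeparator)
            {
                word.Append((char)ch);
            }
            else if (word.Length > 0)
            {
                // A word just ended: slide the window and count the group.
                window.Enqueue(word.ToString());
                word.Clear();
                if (window.Count > groupSize) window.Dequeue();
                if (window.Count == groupSize)
                {
                    string key = string.Join(" ", window);
                    int c;
                    counts.TryGetValue(key, out c);
                    counts[key] = c + 1;
                }
            }

            if (ch == -1) break;
        }
        return counts;
    }
}
```

For a quick check, a StringReader over the sample sentence can be passed as the reader; for a file, a StreamReader such as the one returned by File.OpenText should work the same way. Unlike the Substring version, runs of several spaces are collapsed into a single space in the keys.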