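Split, group and count a string

I am trying to split, group and count the occurrences of specific phrases in a large string in C#.

The pseudocode below should give some indication of what I am trying to achieve.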

var my_string = "In the end this is not the end"; 
my_string.groupCount(2); 

==> 
    [0] : {Key: "In the", Count:1} 
    [1] : {Key: "the end", Count:2} 
    [2] : {Key: "end this", Count: 1} 
    [3] : {Key: "this is", Count: 1} 
    [4] : {Key: "is not", Count: 1} 
    [5] : {Key: "not the", Count: 1}

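As you will notice, this is not as simple as splitting the string and counting each substring. The example groups every two words, but ideally it should be able to handle any group size.

Source

2014-10-28 tribe84

Thanks, I missed that. – tribe84 2014-10-28 20:38:51

@GrantWinney - No, the two questions are similar, but not the same. – tribe84 2014-10-28 20:39:39

How do you want to split 'input'? – 2014-10-28 20:49:59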

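Here is an outline of how you could approach this:

Use the regular string Split method to get the individual words
Make a dictionary of counts
Go through all adjacent pairs, building a composite key and incrementing its count

Here is how you can implement this: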

var counts = new Dictionary<string,int>(); 
var tokens = str.Split(' '); 
for (var i = 0 ; i < tokens.Length-1 ; i++) { 
    var key = tokens[i]+" "+tokens[i+1]; 
    int c; 
    if (!counts.TryGetValue(key, out c)) { 
     c = 0; 
    } 
    counts[key] = c + 1; 
}

Demo.

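Source

2014-10-28 20:47:38 dasblinkenlight

What happens if the string is very large? – gabba 2014-10-28 21:23:55

@gabba The same thing that happens with a string that is not very large :-) The task is linear in time and memory. – dasblinkenlight 2014-10-28 21:26:38

When you split a 2 GB string into thousands of small strings, you more than double the memory consumption. We don't need to do that. We only need a single scan and a small dictionary. – gabba 2014-10-28 21:31:46

Here is my implementation. I have updated it to move the work into a function and to let you specify an arbitrary group size.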

public static Dictionary<string,int> groupCount(string str, int groupSize) 
{ 
    string[] tokens = str.Split(new char[] { ' ' }); 

    var dict = new Dictionary<string,int>(); 
    for (int i = 0; i < tokens.Length - (groupSize-1); i++) 
    { 
     string key = ""; 
     for (int j = 0; j < groupSize; j++) 
     { 
      key += tokens[i+j] + " "; 
     } 
     key = key.Substring(0, key.Length-1); 

     if (dict.ContainsKey(key)) { 
      dict[key]++; 
     } else { 
      dict[key] = 1; 
     } 
    } 

    return dict; 
}

Usage:

string str = "In the end this is not the end"; 
int groupSize = 2; 
var dict = groupCount(str, groupSize); 

Console.WriteLine("Group Of {0}:", groupSize); 
foreach (string k in dict.Keys) { 
    Console.WriteLine("Key: \"{0}\", Count: {1}", k, dict[k]); 
}

.NET Fiddle

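Source

2014-10-28 20:50:05

I will note that it is very similar to dasblinkenlight's take. It uses Split to get the individual words, a for loop to build the tokens, and a dictionary to keep the count of each token. – 2014-10-28 20:51:15

You can create a method that builds phrases from the given words. It is not very efficient (because of the Skip calls), but it is a simple implementation: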

private static IEnumerable<string> CreatePhrases(string[] words, int wordsCount) 
{ 
    for(int i = 0; i <= words.Length - wordsCount; i++) 
     yield return String.Join(" ", words.Skip(i).Take(wordsCount)); 
}

The rest is simple: split your string into words, build the phrases, and count the occurrences of each phrase:

var my_string = "In the end this is not the end"; 
var words = my_string.Split(); 
var result = from p in CreatePhrases(words, 2) 
      group p by p into g 
      select new { g.Key, Count = g.Count()};

Result:

[ 
    Key: "In the", Count: 1, 
    Key: "the end", Count: 2, 
    Key: "end this", Count: 1, 
    Key: "this is", Count: 1, 
    Key: "is not", Count: 1, 
    Key: "not the", Count: 1 
]

A more efficient way to create consecutive groups of items (works with any IEnumerable):

public static IEnumerable<IEnumerable<T>> ToConsecutiveGroups<T>(
    this IEnumerable<T> source, int size) 
{ 
    // You can check arguments here    
    Queue<T> bucket = new Queue<T>(); 

    foreach(var item in source) 
    { 
     bucket.Enqueue(item); 
     if (bucket.Count == size) 
     { 
      yield return bucket.ToArray(); 
      bucket.Dequeue(); 
     } 
    } 
}

And the whole calculation can be done in one statement:

var my_string = "In the end this is not the end"; 
var result = my_string.Split() 
       .ToConsecutiveGroups(2) 
       .Select(words => String.Join(" ", words)) 
       .GroupBy(p => p) 
       .Select(g => new { g.Key, Count = g.Count()});
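
The "You can check arguments here" comment in ToConsecutiveGroups deserves one caveat: inside an iterator method (one that uses yield return), nothing in the body runs until the result is enumerated, so checks placed there throw late. A minimal sketch of one common way around that, assuming you want eager validation, is to split the method into a validating wrapper and a private iterator. The class name ConsecutiveGroupsExtensions and the helper name Iterate are illustrative and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class ConsecutiveGroupsExtensions
{
    // Public wrapper: not an iterator, so the checks run as soon as it is called.
    public static IEnumerable<IEnumerable<T>> ToConsecutiveGroups<T>(
        this IEnumerable<T> source, int size)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (size < 1) throw new ArgumentOutOfRangeException("size");
        return Iterate(source, size);
    }

    // Private iterator (hypothetical name): same sliding-window logic as above.
    private static IEnumerable<IEnumerable<T>> Iterate<T>(IEnumerable<T> source, int size)
    {
        var bucket = new Queue<T>(size);
        foreach (var item in source)
        {
            bucket.Enqueue(item);
            if (bucket.Count == size)
            {
                // Copy the current window, then slide forward by one item.
                yield return bucket.ToArray();
                bucket.Dequeue();
            }
        }
    }
}
```

With this shape, a call such as words.ToConsecutiveGroups(0) fails at the call site rather than somewhere inside a later foreach or GroupBy.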

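Source

2014-10-28 20:58:28

Yeeaahh, the expression in the last line is the best solution here – gabba 2014-10-28 21:29:04

@gabba Yes, the best I can do at 1 a.m. :) Instead of just returning the counts, I did it differently :) – 2014-10-28 21:40:36

Your solution would be even better if you wrote a SplitToConsecutiveGroups method that iterates through the source string and returns the combinations of word groups – gabba 2014-10-29 09:07:03

Here is another approach that uses an ILookup<string, string[]> to count the occurrences of each array: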

var my_string = "In the end this is not the end"; 
int step = 2; 
string[] words = my_string.Split(); 
var groupWords = new List<string[]>(); 
for (int i = 0; i + step <= words.Length; i++) 
{ 
    string[] group = new string[step]; 
    for (int ii = 0; ii < step; ii++) 
     group[ii] = words[i + ii]; 
    groupWords.Add(group); 
} 
var lookup = groupWords.ToLookup(w => string.Join(" ", w)); 

foreach(var kv in lookup) 
    Console.WriteLine("Key: \"{0}\", Count: {1}", kv.Key, kv.Count());

Output:

Key: "In the", Count: 1 
Key: "the end", Count: 2 
Key: "end this", Count: 1 
Key: "this is", Count: 1 
Key: "is not", Count: 1 
Key: "not the", Count: 1

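Source

2014-10-28 21:11:12

Nice! A lookup works well here – gabba 2014-10-28 21:44:06

Assuming you need to handle large strings, I would not recommend splitting the whole string. You need to walk through it, remember the start positions of the last groupCount words, and count the combinations in a dictionary (@dasblinkenlight):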

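var my_string = "In the end this is not the end";

var groupCount = 2;

var groups = new Dictionary<string, int>();
// Start indexes of the most recent words (never more than groupCount of them).
var lastGroupCountWordIndexes = new Queue<int>();

// Run one step past the end so the final word is closed as well.
for (int i = 0; i <= my_string.Length; i++)
{
    bool atBoundary = i == my_string.Length || my_string[i] == ' ';

    // A word starts at i if my_string[i] is not a space and follows a space (or the start).
    if (!atBoundary && (i == 0 || my_string[i - 1] == ' '))
    {
        lastGroupCountWordIndexes.Enqueue(i);
    }

    // A word ends just before i if i is a boundary and the previous character is not a space.
    bool wordEnded = atBoundary && i > 0 && my_string[i - 1] != ' ';

    if (wordEnded && lastGroupCountWordIndexes.Count == groupCount)
    {
        var firstWordInGroupIndex = lastGroupCountWordIndexes.Dequeue();

        // The key spans from the first word of the group to the end of the current word.
        var groupKey = my_string.Substring(firstWordInGroupIndex, i - firstWordInGroupIndex);

        if (!groups.ContainsKey(groupKey))
        {
            groups.Add(groupKey, 1);
        }
        else
        {
            groups[groupKey]++;
        }
    }
}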
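
As a rough sketch of the single-scan idea packaged for reuse (along the lines of the SplitToConsecutiveGroups suggestion in the comments), the scan can be written as a lazy iterator that yields each word group and a small method that counts them. The names PhraseScanner, EnumerateWordGroups and CountWordGroups are hypothetical, and the sketch assumes a non-null input whose words are separated by single spaces. Each yielded group is still a new substring, so this avoids the large token array but not the per-group allocation.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseScanner
{
    // Hypothetical helper: lazily yields every run of groupSize consecutive
    // words in text, scanning it once instead of calling Split.
    public static IEnumerable<string> EnumerateWordGroups(string text, int groupSize)
    {
        // Start indexes of the most recent words (at most groupSize of them).
        var starts = new Queue<int>();
        for (int i = 0; i <= text.Length; i++)
        {
            bool boundary = i == text.Length || text[i] == ' ';

            // A word starts here if this is a non-space following a space or the start.
            if (!boundary && (i == 0 || text[i - 1] == ' '))
                starts.Enqueue(i);

            // A word has just ended; emit a group once enough words are buffered.
            if (boundary && i > 0 && text[i - 1] != ' ' && starts.Count == groupSize)
            {
                int first = starts.Dequeue();
                yield return text.Substring(first, i - first);
            }
        }
    }

    // Counts how often each word group occurs.
    public static Dictionary<string, int> CountWordGroups(string text, int groupSize)
    {
        var counts = new Dictionary<string, int>();
        foreach (var group in EnumerateWordGroups(text, groupSize))
        {
            int c;
            counts.TryGetValue(group, out c); // c stays 0 when the key is missing
            counts[group] = c + 1;
        }
        return counts;
    }
}
```

Calling PhraseScanner.CountWordGroups(my_string, 2) on the sample sentence should give six keys, with "the end" counted twice.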

Source

2014-10-28 21:20:24 gabba
