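How can I speed up this code?

I have the following method, which reads a txt file and returns a dictionary. It takes about 7 minutes to read a ~5 MB file (67,000 lines, 70 characters per line). How can I speed up this code?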

public static Dictionary<string, string> FASTAFileReadIn(string file) 
{ 
    Dictionary<string, string> seq = new Dictionary<string, string>(); 

    Regex re; 
    Match m; 
    GroupCollection group; 
    string currentName = string.Empty; 

    try 
    { 
     using (StreamReader sr = new StreamReader(file)) 
     { 
      string line = string.Empty; 
      while ((line = sr.ReadLine()) != null) 
      { 
       if (line.StartsWith(">")) 
       {// Match Sequence 
        re = new Regex(@"^>(\S+)"); 
        m = re.Match(line); 
        if (m.Success) 
        { 
         group = m.Groups; 
         if (!seq.ContainsKey(group[1].Value)) 
         { 
          seq.Add(group[1].Value, string.Empty); 
          currentName = group[1].Value; 
         } 
        } 
       } 
       else if (Regex.Match(line.Trim(), @"\S+").Success && 
          currentName != string.Empty) 
       { 
        seq[currentName] += line.Trim(); 
       } 
      } 
     } 
    } 
    catch (IOException e) 
    { 
     Console.WriteLine("An IO exception has been thrown!"); 
     Console.WriteLine(e.ToString()); 
    } 
    finally { } 

    return seq; 
}

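Which parts of the code are the most time-consuming, and how can I speed them up?

Thanks

Source

2012-07-24 Mavershang

Related: http://stackoverflow.com/questions/3927/what-are-some-good-net-profilers – 2012-07-24 03:05:33

@Brian, thanks, that should save some time. :) – sarnold 2012-07-24 03:05:49

Don't create a new Regex every time. Create it once, and use the 'RegexOptions.Compiled' flag for extra performance. – Ryan 2012-07-24 03:06:55

Cache and compile the regular expressions, reorder the conditions, reduce the number of Trim() calls, and so on.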

public static Dictionary<string, string> FASTAFileReadIn(string file) { 
    var seq = new Dictionary<string, string>(); 

    Regex re = new Regex(@"^>(\S+)", RegexOptions.Compiled); 
    Regex nonWhitespace = new Regex(@"\S", RegexOptions.Compiled); 
    Match m; 
    string currentName = string.Empty; 

    try { 
     foreach(string line in File.ReadLines(file)) { 
      if(line.Length > 0 && line[0] == '>') { 
       m = re.Match(line); 

       if(m.Success) { 
        if(!seq.ContainsKey(m.Groups[1].Value)) { 
         seq.Add(m.Groups[1].Value, string.Empty); 
         currentName = m.Groups[1].Value; 
        } 
       } 
      } else if(currentName != string.Empty) { 
       if(nonWhitespace.IsMatch(line)) { 
        seq[currentName] += line.Trim(); 
       } 
      } 
     } 
    } catch(IOException e) { 
     Console.WriteLine("An IO exception has been thrown!"); 
     Console.WriteLine(e.ToString()); 
    } 

    return seq; 
}

However, that's just a naïve optimization. After reading up on the FASTA format, I wrote this:

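public static Dictionary<string, string> ReadFasta(string filename) { 
    var result = new Dictionary<string, string>(); 
    var current = new StringBuilder(); 
    string currentKey = null; 

    foreach(string line in File.ReadLines(filename)) { 
     // Skip blank lines so that line[0] cannot throw. 
     if(line.Length == 0) 
      continue; 

     if(line[0] == '>') { 
      // A new header: store the sequence collected for the previous one. 
      if(currentKey != null) { 
       result.Add(currentKey, current.ToString()); 
       current.Clear(); 
      } 

      // The key is the text between '>' and the first space. 
      // Searching from index 1 keeps a bare ">" line from throwing. 
      int i = line.IndexOf(' ', 1); 

      currentKey = i > -1 ? line.Substring(1, i - 1) : line.Substring(1); 
     } else if(currentKey != null) { 
      current.Append(line.TrimEnd()); 
     } 
    } 

    // Store the last sequence in the file. 
    if(currentKey != null) 
     result.Add(currentKey, current.ToString()); 

    return result; 
}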

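Let me know if it works; it should be much faster.

Source

2012-07-24 03:14:27 Ryan

Does 'string line in File.ReadAllLines()' build the entire (array? list?) from the file in one go, or does it build each "line" on demand? – sarnold 2012-07-24 03:16:59

@sarnold: Sorry, you're right. I meant 'ReadLines()', which creates an 'IEnumerable<string>'. (Although if the file is only 5 MB, reading it all at once might be beneficial to begin with...) – Ryan 2012-07-24 03:18:21

Yeah, at five megs it probably doesn't matter. But I've seen some _huge_ FASTA files.. – sarnold 2012-07-24 03:19:38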
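
To make the difference discussed in these comments concrete, here is a minimal sketch. The class and method names are made up for illustration; only `File.ReadLines` and `File.ReadAllLines` are real framework calls. Both methods return the same count, but the first streams the file line by line while the second loads every line into an array before the loop starts.

```csharp
using System;
using System.IO;

public static class LineCounting
{
    // File.ReadLines returns a lazy IEnumerable<string>:
    // each line is read from disk as the loop asks for it.
    public static int CountHeadersLazily(string path)
    {
        int count = 0;
        foreach (string line in File.ReadLines(path))
        {
            if (line.StartsWith(">", StringComparison.Ordinal))
                count++;
        }
        return count;
    }

    // File.ReadAllLines returns a string[]:
    // the whole file is in memory before the loop begins.
    public static int CountHeadersEagerly(string path)
    {
        int count = 0;
        string[] lines = File.ReadAllLines(path);
        foreach (string line in lines)
        {
            if (line.StartsWith(">", StringComparison.Ordinal))
                count++;
        }
        return count;
    }
}
```

For a 5 MB file either version should be fine; the lazy one mainly matters for very large FASTA files, where holding every line in memory at once could become a problem.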

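I'd hope the compiler would do this automatically, but the first thing I noticed is that you're recompiling the regular expression on every matching line: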

  while ((line = sr.ReadLine()) != null) 
      { 
       if (line.StartsWith(">")) 
       {// Match Sequence 
        re = new Regex(@"^>(\S+)");

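It would be even better if you could remove the regular expressions entirely; most languages provide some kind of split function that is often much faster than regular expressions...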
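
As a rough illustration of that idea in C#, the header name could be extracted with `String.Split` instead of a regex. This is only a sketch: `FastaHeader` and `ParseName` are hypothetical names, and the helper assumes the name is everything between '>' and the first whitespace character, which is what the pattern `^>(\S+)` captures.

```csharp
using System;

public static class FastaHeader
{
    // Hypothetical helper: returns the sequence name from a FASTA header line,
    // or null when the line is not a header or the name is empty.
    public static string ParseName(string line)
    {
        if (string.IsNullOrEmpty(line) || line[0] != '>')
            return null;

        // A null separator array makes Split break on any whitespace;
        // count = 2 stops splitting after the first piece.
        string[] parts = line.Substring(1).Split((char[])null, 2);
        return parts[0].Length > 0 ? parts[0] : null;
    }
}
```

Like the regex, this rejects a header such as "> name" where whitespace follows the '>' directly.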

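Source

2012-07-24 03:08:42 sarnold

Agreed, 're' should definitely be defined outside the loop. – matchdav 2012-07-24 03:11:33

I ran some benchmarks on this, and the best approach is to make them static and use 'RegexOptions.Compiled'. – 2012-07-24 03:22:42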
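
A minimal sketch of what "static and compiled" could look like, using the header pattern from the question. The class and method names are invented for illustration, and no benchmark figures are implied.

```csharp
using System.Text.RegularExpressions;

public static class FastaPatterns
{
    // Constructed once, when the class is first used, and reused for every line.
    private static readonly Regex HeaderName =
        new Regex(@"^>(\S+)", RegexOptions.Compiled);

    // Hypothetical helper: returns true and sets name when line is a FASTA header.
    public static bool TryGetName(string line, out string name)
    {
        Match m = HeaderName.Match(line);
        name = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }
}
```

Because the field is static and readonly, the loop body only calls `Match`; the cost of building the regex is paid a single time.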

You can improve read speed significantly by using a BufferedStream:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)) 
using (BufferedStream bs = new BufferedStream(fs)) 
using (StreamReader sr = new StreamReader(bs)) 
{ 
    // Use the StreamReader 
}

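The Regex recompilation that @sarnold mentioned is probably your biggest performance killer, though, if your processing time is around 5 minutes.

Source

2012-07-24 03:10:49

Ha, when I saw your answer, my first thought was, "Hey, I bet that's where 90% of the slowdown comes from." – sarnold 2012-07-24 03:15:41

Here's how I would write it. Without more information (i.e., the average length of a dictionary entry), I can't optimize the StringBuilder's capacity. You could also follow Eric J.'s suggestion and add a BufferedStream. Ideally, if you want maximum performance, you would drop regular expressions entirely, but they are much easier to write and maintain, so I understand why you'd want to use them.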

public static Dictionary<string, StringBuilder> FASTAFileReadIn(string file) 
{ 
    var seq = new Dictionary<string, StringBuilder>(); 
    var regName = new Regex("^>(\\S+)", RegexOptions.Compiled); 
    var regAppend = new Regex("\\S+", RegexOptions.Compiled); 

    Match tempMatch = null; 
    string currentName = string.Empty; 
    try 
    { 
     using (StreamReader sReader = new StreamReader(file)) 
     { 
      string line = string.Empty; 
      while ((line = sReader.ReadLine()) != null) 
      { 
       if ((tempMatch = regName.Match(line)).Success) 
       { 
        if (!seq.ContainsKey(tempMatch.Groups[1].Value)) 
        { 
         currentName = tempMatch.Groups[1].Value; 
         seq.Add(currentName, new StringBuilder()); 
        } 
       } 
       else if ((tempMatch = regAppend.Match(line)).Success && currentName != string.Empty) 
       { 
        seq[currentName].Append(tempMatch.Value); 
       } 
      } 
     } 
    } 
    catch (IOException e) 
    { 
     Console.WriteLine("An IO exception has been thrown!"); 
     Console.WriteLine(e.ToString()); 
    } 

    return seq; 
}

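As you can see, I changed your dictionary slightly to use the StringBuilder class, which is optimized for appending values. I also compile the regular expressions once up front, so that the same regex isn't compiled over and over. I also pulled your "append" case out into a compiled regex.

Let me know if this helps you performance-wise.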
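
Note that this version returns a Dictionary<string, StringBuilder> rather than the Dictionary<string, string> from the question. If the calling code still needs strings, one possible way to convert the result at the end is sketched below; `FastaResult` and `ToStrings` are hypothetical names, not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class FastaResult
{
    // Hypothetical helper: turns each StringBuilder into its final string,
    // keeping the same keys.
    public static Dictionary<string, string> ToStrings(
        Dictionary<string, StringBuilder> builders)
    {
        return builders.ToDictionary(
            pair => pair.Key,
            pair => pair.Value.ToString());
    }
}
```

Each sequence is then materialized as a string exactly once, after all of its lines have been appended.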

Source

2012-07-24 03:32:38
