2016-06-28

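Merge two text files and remove duplicates

I have two text files like the ones below (the large numbers such as 1466786391 are unique timestamps):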

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 49 packets received, 2% packet loss 
round-trip min/avg/max = 20.917/70.216/147.258 ms 
1466786342 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 

.... 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 50 packets received, 0% packet loss 
round-trip min/avg/max = 29.535/65.768/126.983 ms 
1466786391 

and this:

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 49 packets received, 2% packet loss 
round-trip min/avg/max = 20.917/70.216/147.258 ms 
1466786342 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 50 packets received, 0% packet loss 
round-trip min/avg/max = 29.535/65.768/126.983 ms 
1466786391 
PING 10.0.0.6 (10.0.0.6): 56 data byte 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 44 packets received, 12% packet loss 
round-trip min/avg/max = 30.238/62.772/102.959 ms 
1466786442 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 
.... 

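So the first file ends with a timestamp, and the second file contains the same block of data somewhere in the middle, followed by more data. Everything before that timestamp is exactly identical to the first file.

So I want the output to look like this: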

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

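That is, concatenate the two files and create a third one in which the duplicates from the second file (the text blocks that already exist in the first file) are removed. Here is my code: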

public static void UnionFiles() 
{ 

    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http"); 
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat"); 
    var union = Enumerable.Empty<string>(); 

    foreach (string filePath in Directory 
       .EnumerateFiles(folderPath, "*.txt") 
       .OrderBy(x => Path.GetFileNameWithoutExtension(x))) 
    { 
     union = union.Union(File.ReadAllLines(filePath)); 
    } 
    File.WriteAllLines(outputFilePath, union); 
} 

This is the wrong output I get (the file structure is destroyed):

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 49 packets received, 2% packet loss 
round-trip min/avg/max = 20.917/70.216/147.258 ms 
1466786342 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 50 packets received, 0% packet loss 
round-trip min/avg/max = 29.535/65.768/126.983 ms 
1466786391 
round-trip min/avg/max = 30.238/62.772/102.959 ms 
1466786442 
round-trip min/avg/max = 5.475/40.986/96.964 ms 
1466786492 
round-trip min/avg/max = 5.276/61.309/112.530 ms 

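Edit: this code was written to handle multiple files, but I would be happy even if it worked correctly for just two.

However, it does not remove the duplicate text blocks as intended. Instead it removes several useful lines and makes the output completely useless. I am stuck.

How can I achieve this? Thanks.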

+0

`union = union.Union(File.ReadAllLines(filePath));` Shouldn't this create a set union and thereby remove the duplicate blocks? –

+0

Yes, it should. I assume it is a format (UTF8?) or whitespace issue? – Ouarzy

+0

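You need to actually _parse_ the files and extract the individual blocks for comparison, as Ouarzy suggested. Everything else will lead to ugly, unmaintainable hacks. –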

Answers

3

I think you want to compare whole blocks rather than individual lines.
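To see why the line-based approach in the question damages the file: `Enumerable.Union` is a set operation over individual elements, here individual lines, so any line whose text has already occurred (the `--- 10.0.0.6 ping statistics ---` header, the `PING` line, even blank lines) is dropped. A minimal sketch, using shortened in-memory sample lines instead of real files; the class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineUnionDemo
{
    // Line-level union: each distinct line text is kept only once.
    public static List<string> MergeLines(IEnumerable<string> first, IEnumerable<string> second)
    {
        return first.Union(second).ToList();
    }

    public static void Main()
    {
        var first = new[]
        {
            "--- 10.0.0.6 ping statistics ---",
            "50 packets transmitted, 50 packets received, 0% packet loss",
            "1466786391"
        };
        var second = new[]
        {
            "--- 10.0.0.6 ping statistics ---", // same text as the header in the first block
            "50 packets transmitted, 44 packets received, 12% packet loss",
            "1466786442"
        };

        // The header of the second block is missing from the result,
        // so the second block is no longer intact.
        foreach (var line in MergeLines(first, second))
        {
            Console.WriteLine(line);
        }
    }
}
```

Six input lines produce five output lines: the repeated header survives only once, which matches the damaged output shown in the question.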

Something like this should work:

// Requires: using System.Collections.Generic; using System.IO; using System.Linq; using System.Text;
public static void UnionFiles()
{
    var firstFilePath = "log1.txt";
    var secondFilePath = "log2.txt";

    var firstLogBlocks = ReadFileAsLogBlocks(firstFilePath);
    var secondLogBlocks = ReadFileAsLogBlocks(secondFilePath);

    // Union compares whole blocks (via LogBlock.Equals/GetHashCode), not single lines.
    var cleanLogBlock = firstLogBlocks.Union(secondLogBlocks);

    var cleanLog = new StringBuilder();
    foreach (var block in cleanLogBlock)
    {
        cleanLog.Append(block);
    }

    File.WriteAllText("cleanLog.txt", cleanLog.ToString());
}

private static List<LogBlock> ReadFileAsLogBlocks(string filePath)
{
    var allLinesLog = File.ReadAllLines(filePath);

    var logBlocks = new List<LogBlock>();
    var currentBlock = new List<string>();

    // Assumes every block consists of exactly 5 non-empty lines.
    // An incomplete trailing block (fewer than 5 lines) is not added.
    var i = 0;
    foreach (var line in allLinesLog)
    {
        if (!string.IsNullOrEmpty(line))
        {
            currentBlock.Add(line);
            if (i == 4)
            {
                logBlocks.Add(new LogBlock(currentBlock.ToArray()));
                currentBlock.Clear();
                i = 0;
            }
            else
            {
                i++;
            }
        }
    }

    return logBlocks;
}

With LogBlock defined as follows:

public class LogBlock
{
    private readonly string[] _logs;

    public LogBlock(string[] logs)
    {
        _logs = logs;
    }

    public override string ToString()
    {
        var logBlock = new StringBuilder();
        foreach (var log in _logs)
        {
            logBlock.AppendLine(log);
        }

        return logBlock.ToString();
    }

    public override bool Equals(object obj)
    {
        return obj is LogBlock && Equals((LogBlock)obj);
    }

    private bool Equals(LogBlock other)
    {
        return _logs.SequenceEqual(other._logs);
    }

    public override int GetHashCode()
    {
        var hashCode = 0;
        foreach (var log in _logs)
        {
            hashCode += log.GetHashCode();
        }
        return hashCode;
    }
}

Be careful to override equality on LogBlock and to provide a consistent GetHashCode implementation, because Union uses both of them, as explained here
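As a small illustration of that point: `Union` (like `Distinct` and `HashSet<T>`) relies on the default equality comparer, which calls `GetHashCode` and `Equals`. Without overrides, two separately created objects with identical content count as different. The two classes below are hypothetical stand-ins, not the `LogBlock` class above:

```csharp
using System;
using System.Linq;

// No overrides: reference equality is used.
public class PlainBlock
{
    public readonly string Text;
    public PlainBlock(string text) { Text = text; }
}

// Value equality: Equals and GetHashCode agree with each other.
public class ValueBlock
{
    public readonly string Text;
    public ValueBlock(string text) { Text = text; }

    public override bool Equals(object obj)
    {
        var other = obj as ValueBlock;
        return other != null && Text == other.Text;
    }

    public override int GetHashCode()
    {
        return Text.GetHashCode();
    }
}

public static class EqualityDemo
{
    public static int CountPlain()
    {
        return new[] { new PlainBlock("1466786391") }
            .Union(new[] { new PlainBlock("1466786391") })
            .Count();
    }

    public static int CountValue()
    {
        return new[] { new ValueBlock("1466786391") }
            .Union(new[] { new ValueBlock("1466786391") })
            .Count();
    }

    public static void Main()
    {
        Console.WriteLine(CountPlain()); // 2: the duplicate is kept
        Console.WriteLine(CountValue()); // 1: the duplicate is removed
    }
}
```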

+0

No, I checked the MSDN sample application. When there are duplicates, it keeps one copy of them. –

+0

Thanks, I will test it now. Have you tested it? –

+1

Yes, but I am trying to improve it thanks to your comments, and I am still working on it. – Ouarzy

-2

There is a problem with how the unique records are concatenated. Could you try the code below?

public static void UnionFiles() 
{ 

    string folderPath =  Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http"); 
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat"); 
    var union =new List<string>(); 

    foreach (string filePath in Directory 
      .EnumerateFiles(folderPath, "*.txt") 
      .OrderBy(x => Path.GetFileNameWithoutExtension(x))) 
    { 
     var filter = File.ReadAllLines(filePath).Where(x => !union.Contains(x)).ToList(); 
    union.AddRange(filter); 

    } 
    File.WriteAllLines(outputFilePath, union); 
} 
+0

Same error: I am losing information. –

1

A solution using a regular expression, without hacks:

// FileContent1 and FileContent2 are strings holding the full text of the two log files.
var logBlockPattern = new Regex(
    @"(^---.*ping statistics ---$)\s+"
    + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
    + @"(^round-trip min/avg/max.+$)\s+"
    + @"(^\d+$)\s*"
    + @"(^PING.+$)?",
    RegexOptions.Multiline);

var logBlocks1 = logBlockPattern.Matches(FileContent1).Cast<Match>().ToList();
var logBlocks2 = logBlockPattern.Matches(FileContent2).Cast<Match>().ToList();

// Keep every block of file 1, plus the blocks of file 2 whose timestamp (group 4) is not in file 1.
var mergedLogBlocks = logBlocks1.Concat(logBlocks2.Where(lb2 =>
    logBlocks1.All(lb1 => lb1.Groups[4].Value != lb2.Groups[4].Value)));

var mergedLogContents = string.Join("\n\n", mergedLogBlocks);

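The Groups collection of a regex Match contains each line of a log block (because each line in the pattern is wrapped in parentheses ()), plus the full match at index 0. The group at index 4 is therefore the timestamp, which we can use to compare log blocks.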
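The group indexing can be checked in isolation with a small sketch. It applies the same pattern to one hard-coded block; the class name, the helper `FirstTimestamp` and the in-memory sample are assumptions made for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupsDemo
{
    // Same pattern as in the answer above.
    private static readonly Regex LogBlockPattern = new Regex(
        @"(^---.*ping statistics ---$)\s+"
        + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
        + @"(^round-trip min/avg/max.+$)\s+"
        + @"(^\d+$)\s*"
        + @"(^PING.+$)?",
        RegexOptions.Multiline);

    // Returns the timestamp line (group 4) of the first block, or null if nothing matches.
    public static string FirstTimestamp(string content)
    {
        var match = LogBlockPattern.Match(content);
        return match.Success ? match.Groups[4].Value : null;
    }

    public static void Main()
    {
        // Lines are joined with "\n": with RegexOptions.Multiline, $ only matches
        // before "\n", so content with "\r\n" endings would need normalizing first.
        var sample = string.Join("\n",
            "--- 10.0.0.6 ping statistics ---",
            "50 packets transmitted, 50 packets received, 0% packet loss",
            "round-trip min/avg/max = 29.535/65.768/126.983 ms",
            "1466786391",
            "PING 10.0.0.6 (10.0.0.6): 56 data bytes");

        Console.WriteLine(FirstTimestamp(sample)); // 1466786391
    }
}
```

Groups 1 to 3 hold the header, packet and round-trip lines in the same way, and group 5 holds the optional PING line.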

Working example: https://dotnetfiddle.net/kAkGll

+0

Thank you very much! A good solution. –
