Find duplicate columns and replace them with a count

I have a tab-delimited file that has duplicated header names:

[Column1] \t [Column2] \t [test] \t [test] \t [test] \t [test] \t [Column3] \t [Column4]

What I want to do is rename the duplicated [test] columns by appending an integer, so the header becomes something like:

[Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [test4] \t [Column3] \t [Column4]

So far I can isolate the first row and count the matches I find:

// Requires: using System.IO; using System.Text.RegularExpressions;
string destinationUnformmatedFileName = @"C:\New\20130816_Opportunities_unFormatted.txt";
string destinationFormattedFileName = @"C:\New\20130816_Opportunities_Formatted.txt";
var unformattedFileStream = File.Open(destinationUnformmatedFileName, FileMode.Open, FileAccess.Read); // Open (unformatted) file for reading
var formattedFileStream = File.Open(destinationFormattedFileName, FileMode.Create, FileAccess.Write); // Create (formatted) file for writing

StreamReader sr = new StreamReader(unformattedFileStream);
StreamWriter sw = new StreamWriter(formattedFileStream);

int rowCounter = 0;
string currentRow;
// Read each row in the unformatted file
while ((currentRow = sr.ReadLine()) != null)
{
    // First row: check for duplicate names
    if (rowCounter == 0)
    {
        // Write column names to an array
        string delimiter = "\t";
        string[] fieldNames = currentRow.Split(delimiter.ToCharArray());

        foreach (string fieldName in fieldNames)
        {
            // fieldName must be followed by a tab for it to be a duplicate
            // Original code - causing the issue
            //Regex rgx = new Regex("\\t(" + fieldName + ")\\t");
            // Edit - resolved the issue
            Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

            // Count how many occurrences of fieldName are in currentRow
            int count = rgx.Matches(currentRow).Count;
            //MessageBox.Show("Match Count = " + count.ToString());

            // If we have a duplicate field name
            if (count > 1)
            {
                string newFieldName = "\t" + fieldName + count.ToString() + "\t";
                //MessageBox.Show(newFieldName);
                currentRow = rgx.Replace(currentRow, newFieldName, 1);
            }
        }
    }
    rowCounter++;
}

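I think I'm on the right track, but I don't think the regex is working correctly?

Edit: I think I've figured out how to find the matches, using this pattern: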

Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

It's not a deal-breaker, but the only remaining problem is that it labels the columns like this:

[Column1] \t [Column2] \t [test4] \t [test3] \t [test2] \t [test] \t [Column3] \t [Column4]

instead of

[Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [test4] \t [Column3] \t [Column4]

Source

2013-08-26 Chris Hillman

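"I don't think the regex is working correctly" sounds like you aren't even sure whether there is a problem. What isn't working? Do you get an exception? Wrong results? No results? Also, you may want to use a verbatim string for your pattern to avoid double escaping: `@"\t("`. Secondly, you should run `fieldName` through `Regex.Escape()` before concatenating it into the pattern, because it might contain metacharacters.

Regarding your edit: if that fixes it, then the problem was that matches can never overlap. Because you required a `\t` both before and after the field name, the matches for adjacent fields would have had to overlap. This is a good workaround. Also, please post your solution as an answer (and accept it if you don't get a better one).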
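The overlap issue described in the comment above can be seen in a small sketch. The class and method names here (`OverlapDemo`, `CountConsuming`, `CountLookaround`) are made up for illustration and are not from the thread. It assumes the field name is passed through `Regex.Escape` and that verbatim strings are used for the pattern, as the comment suggests; the trailing lookahead is an addition that the original pattern does not have.

```csharp
using System;
using System.Text.RegularExpressions;

static class OverlapDemo
{
    // Pattern that consumes the tab on both sides of the field name.
    // Adjacent duplicates share a tab, so every other one is skipped.
    public static int CountConsuming(string row, string fieldName) =>
        Regex.Matches(row, @"\t" + Regex.Escape(fieldName) + @"\t").Count;

    // Pattern that only looks at the neighbouring tabs without consuming them,
    // so adjacent duplicates are all counted.
    public static int CountLookaround(string row, string fieldName) =>
        Regex.Matches(row, @"(?<=\t|^)" + Regex.Escape(fieldName) + @"(?=\t|$)").Count;

    static void Main()
    {
        string row = "[Column1]\t[test]\t[test]\t[test]\t[test]\t[Column3]";
        Console.WriteLine(CountConsuming(row, "[test]"));   // prints 2
        Console.WriteLine(CountLookaround(row, "[test]"));  // prints 4
    }
}
```

Without `Regex.Escape`, a name such as `[test]` would be read as a character class rather than as literal text.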

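Thanks m.buettner - I've posted the answer, but I have to wait 2 days before I can accept it. I feel bad now for wasting people's time; I should have waited a while and researched a bit more. Thanks for your help!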

I used the following pattern:

Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

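It solves the problem by using lookaround, which I read about here: http://www.regular-expressions.info/duplicatelines.html

I probably should have spent a few more minutes researching before posting.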
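As an alternative to the regex approach, here is a minimal sketch that splits the header and numbers duplicates from left to right, which yields test1..test4 rather than test4..test. This is not the method from the answer above. `HeaderRenamer` and `NumberDuplicates` are hypothetical names, and the sketch assumes that a numbered name such as `test1` does not already exist as a column.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class HeaderRenamer
{
    // Appends 1, 2, 3, ... to every field name that occurs more than once,
    // in left-to-right order. Unique names are left unchanged.
    public static string NumberDuplicates(string headerRow, char delimiter = '\t')
    {
        string[] fields = headerRow.Split(delimiter);

        // Names that occur more than once in the header.
        var duplicated = new HashSet<string>(
            fields.GroupBy(f => f).Where(g => g.Count() > 1).Select(g => g.Key));

        // How many times each duplicated name has been seen so far.
        var seen = new Dictionary<string, int>();
        for (int i = 0; i < fields.Length; i++)
        {
            string name = fields[i];
            if (!duplicated.Contains(name)) continue;

            seen.TryGetValue(name, out int n); // n stays 0 on the first occurrence
            seen[name] = ++n;
            fields[i] = name + n;
        }
        return string.Join(delimiter.ToString(), fields);
    }

    static void Main()
    {
        string header = "Column1\tColumn2\ttest\ttest\ttest\ttest\tColumn3\tColumn4";
        // Prints the header with test1, test2, test3, test4 in place of the four test columns.
        Console.WriteLine(NumberDuplicates(header));
    }
}
```

Because the names are compared as plain strings, no escaping of regex metacharacters is needed here.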

Source

2013-08-27 00:40:02

Test your regex in RegExr first. I think "\t" is a special character. Try "\\t". In your C# that would be "\\\\t".

Source

2013-08-27 00:27:45

He did that, and it doesn't matter anyway. The regex engine can handle a literal tab character as well as an escaped \t.
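The point in the comment above can be checked with a short sketch (`TabEscapeDemo` and its method names are illustrative only). A regular C# string `"\t"` hands the engine a real tab character, while `"\\t"` and `@"\t"` hand it the two-character escape; all three match a tab. The pattern `"\\\\t"` instead asks for a literal backslash followed by the letter t, so it does not match a tab.

```csharp
using System;
using System.Text.RegularExpressions;

static class TabEscapeDemo
{
    // True when all three ways of writing "tab" in a pattern match the input.
    public static bool AllTabFormsMatch(string input)
    {
        bool literalTab = Regex.IsMatch(input, "a\tb");  // C# compiler inserts a real tab
        bool escaped = Regex.IsMatch(input, "a\\tb");    // regex engine interprets \t
        bool verbatim = Regex.IsMatch(input, @"a\tb");   // same pattern as the line above
        return literalTab && escaped && verbatim;
    }

    // "\\\\t" reaches the engine as \\t: a literal backslash, then the letter t.
    public static bool DoubleEscapedMatches(string input) =>
        Regex.IsMatch(input, "a\\\\tb");

    static void Main()
    {
        Console.WriteLine(AllTabFormsMatch("a\tb"));      // prints True
        Console.WriteLine(DoubleEscapedMatches("a\tb"));  // prints False
    }
}
```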

Here is a solution that combines Regex and LINQ:

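// Requires: using System.Linq; using System.Text.RegularExpressions;
var input = @"[Column1] \t [Column2] \t [test] \t [test] \t [test] \t [foo] \t [test] \t [Column3] \t [foo] \t [Column4]";
// \s* lets the lookbehind skip the space that follows each literal \t in the sample input
Regex reg = new Regex(@"(?<=\\t\s*)[[](.+?)[]]");
string output = "";
int k = 0; // index just past the previously replaced match
foreach (var m in reg.Matches(input)
        .OfType<Match>()
        .Select((x, i) => new { x, i })                             // remember each match's position
        .GroupBy(g => g.x.Value)                                    // group identical column names
        .Where(g => g.Count() > 1)                                  // keep only duplicated names
        .SelectMany(x => x.Select((a, i) => new { a, i = i + 1 }))  // number each duplicate from 1
        .OrderBy(x => x.a.i))                                       // restore left-to-right order
{
    output += input.Substring(k, m.a.x.Index - k) + m.a.x.Result("[${1}" + m.i + "]");
    k = m.a.x.Index + m.a.x.Length;
}
output += input.Substring(k);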

Result: [Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [foo1] \t [test4] \t [Column3] \t [foo2] \t [Column4]

Source

2013-08-27 03:55:53
