2016-08-06

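Removing invalid links from all text files

How can I find a string in a text file (using a regular expression if necessary), modify it slightly, search the same file for the modified string, and, if there is no match, remove a specific tag from that file?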

Sample input:

<sec id="sec1"> 
<p>"You fig. 23 did?" I <a href rid="sec12">section 12</a> asked, surprised.</p> 
<p>"Cross sent it table 9 to me a few weeks ago." Stanton crossed over to my mother, taking her hand in his. "I <a href rid="sec2">section 2</a> couldn"t have argued for better terms."</p> 
<p>"There are always better terms, Richard!" my mom said sharply.</p> 
<p>"There are <xref ref-type="biblio" rid="ref2">[2]</xref> rewards for milestones such as anniversaries and the birth of children, and nothing in the way of penalties for Eva, aside from marit table 9al counseling. A dissolution would have a more than equit table 9able distribution of assets. I <a href rid="sec2">section 2</a> was tempted to ask if Cross had his in-house counsel review it table 9. I <a href rid="sec2">section 2</a> imagine they argued strenuously against it table 9."</p> 
<p>She settled for a moment, taking that in. Then she pushed to her feet, bristling. "But you knew they were eloping? You fig. 23 knew, and you didn"t say anything?"</p> 
<p>"Of course, I <a href rid="sec2">section 2</a> didn"t know." He pulled her into his arms, crooning softly like he would wit table 9h a child. "I <a href rid="sec2">section 2</a> assumed he was looking ahead. You fig. 23 know these things usually take a few months of negotiating. Although, in this case, there was nothing more I <a href rid="sec2">section 2</a> could"ve asked for."</p> 
<p>I <a href rid="sec2">section 2</a> stood. I <a href rid="sec2">section 2</a> had to hurry if I <a href rid="sec2">section 2</a> was going to get to work on time. Today of all days, I <a href rid="sec2">section 2</a> didn"t want to be late.</p> 
<p>"Where are you <xref ref-type="biblio" rid="ref14">[14]</xref> going?" My mother straightened away from Stanton. "We"re not done wit table 9h this discussion. You fig. 23 can"t just drop a bomb like that and leave!" 
<fig id="fig4"> 
<caption><p>I'm confused</p></caption> 
</fig> 
</p> 
<p>Turning to face her, I <a href rid="sec2">section 2</a> walked backward. "I"ve seriously got to get ready. Why don"t we get together for lunch and talk more then?"</p> 
<sec id="sec2"> 
<p>"You fig. 23 can"t be""</p> 
<p>I <a href rid="sec2">section 2</a> cut her <xref ref-type="biblio" rid="ref1">[1]</xref>, <xref ref-type="biblio" rid="ref3">[3]</xref> off. "Corinne Giroux."</p> 
<p>My mother"s eyes widened, then narrowed. One name. I <a href rid="sec5">section 5</a> didn"t have to say anything else.</p> 
<p>Gideon"s ex was a problem that needed no further explanation.</p> 
<p>It was the rare person who came to Manhattan and didn"t feel an instant familiarit table 9y. The skyline of the cit table 9y had been immortalized in too many movies and television shows to count, spreading the love affair wit table 9h New York from residents to the world.</p> 
<p>I <a href rid="sec2">section 2</a> was no exception.</p> 
<p>I <a href rid="sec4">section 4</a> adored the Art Deco elegance of the Chrysler Building. I <a href rid="sec2">section 2</a> could pinpoint my place on the island in relation to the posit table 9ion of the Empire State Building. I <a href rid="sec2">section 2</a> was awed by the breathtaking height of the Freedom Tower that now dominated downtown. But the Crossfire Building was in a class by it table 9self. I"d thought so before I <a href rid="sec2">section 2</a> had ever fallen in love wit table 9h the man whose vision had led to it table 9s creation.</p> 
<p>As Ra"l pulled the Benz up to <xref ref-type="biblio" rid="ref15">[15]</xref> the curb, I <a href rid="sec2">section 2</a> marveled at the distinctive sapphire blue glass that encased the obelisk shape of the Crossfire. My head tilted back, my gaze sliding up the shimmering height to the point at the top, the light-drenched space that housed Cross Industries. Pedestrians surged around me, the sidewalk teeming wit table 9h businessmen and -women heading to work wit table 9h briefcases and totes in one hand and steaming cups of coffee in the other.</p> 
<p>I <a href rid="sec1">section 1</a> felt Gideon before I <a href rid="sec1">section 1</a> saw him, my entire body humming wit table 9h awareness as he stepped out of the Bentley, which had pulled up behind the Benz. The air around me charged wit table 9h electricit table 9y, the crackling energy that always heralded the approach of a storm.</p> 
</sec> 
</sec> 

The code I have written so far is:

Imports System.IO 
Imports System.Text.RegularExpressions 
Public Class Form1 
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click 
     If FolderBrowserDialog1.ShowDialog = DialogResult.OK Then 
      TextBox1.Text = FolderBrowserDialog1.SelectedPath 
     End If 
    End Sub 

    Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click 
     Dim targetDirectory As String 
     targetDirectory = TextBox1.Text 
     Dim txtFilesArray As String() = Directory.GetFiles(targetDirectory, "*.txt") 
     For Each txtFile In txtFilesArray 
      Dim FileInfo As New FileInfo(txtFile) 
      Dim FileLocation As String = FileInfo.FullName 
      Dim input() As String = File.ReadAllLines(FileLocation) 
      Dim pattern As String = "(?<=rid="sec)(\d+)(?=">)" 
      Dim r As Regex = New Regex(pattern) 
      Dim m As Match = r.Match(input) 
      If (m.Success) Then 
       Dim x As String = " id=""sec" + pattern + """" 
       Dim r2 As Regex = New Regex(x) 
       Dim m2 As Match = r2.Match(input) 
       If (m2.Success) Then 
        Dim tgPat As String = "<a href rid=""sec + pattern +"">(\w+) (\d+)</a>" 
        Dim tgRep As String = "$1 $2" 
        Dim tgReg As New Regex(tgPat) 
        Dim result1 As String = tgReg.Replace(input, tgRep) 
       Else 
       End If 
      End If 
     Next 
    End Sub 
End Class 

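The code is clearly incomplete and buggy. Can anyone help? Basically, it should search the file for rid="sec[0-9]+" and then look for a matching id="sec[0-9]+" in a <sec id="sec[0-9]+"> tag; when it finds no match, it should remove that link. How can I do this?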


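Stack Overflow is not a code-writing service; it is for specific problems you have with your own code. The examples are too large to compare easily. If you can, post a smaller sample along with whatever you have tried so far. [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) – Slai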


I have edited the question. Could you see whether you can help me in any way? –


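I suggest that you first collect all the section ids that occur in the file, then go through it again and remove every link for which no matching section element was found. –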

Answers


A possibly more reliable alternative is to parse the XML, although the output will not preserve the few line breaks around the <caption> tag.

Dim sInput = IO.File.ReadAllText("input.txt") 
sInput = sInput.Replace("<a href ", "<a href="""" ") ' because " href " is not valid parsable XML 
Dim xInput = XElement.Parse(sInput) 

' this is where the magic happens 
Dim aTags = xInput...<a> ' all anchor tags 
Dim gRIDs = aTags.GroupBy(Function(x) x.@rid) ' group by the rid attribute
For Each g In gRIDs 
    If g.Count = 1 Then 
        g(0).ReplaceWith(g(0).Value) ' replaces the XElement <a href="" rid="sec12">section 12</a> with its Value, section 12
    End If 
Next 

Dim sOutput = xInput.ToString 
sOutput = sOutput.Replace("<a href="""" ", "<a href ") ' optional to change the href="" back to href 
sOutput = sOutput.Replace("  ", "") ' optional: removes the two-space indentation added by ToString (a single space here would strip every space in the text)
IO.File.WriteAllText("output.txt", sOutput) 

Update

Dim sInput = IO.File.ReadAllText("input.txt") 
Dim splitBy = "<a href rid=""" 
Dim aInput = Split(sInput, splitBy) 

Dim groups = Enumerable.Range(1, aInput.Length - 1).GroupBy(Function(i) Split(aInput(i), """", 2)(0)) ' group by string between '<a href rid="' and '"' 

For Each g In groups 
    If g.Count = 1 Then 
     aInput(g(0)) = Split(aInput(g(0)), ">", 2)(1).Replace("</a>", "") ' Example: 'sec12">section 12</a> asked..' to 'section 12 asked..' 
    Else 
     For Each i In g 
      aInput(i) = splitBy & aInput(i) ' Example: 'sec12">section 12</a> asked..' to '<a href rid="sec12">section 12</a> asked..' 
     Next 
    End If 
Next 

Dim sOutput = Join(aInput, "") 
IO.File.WriteAllText("output.txt", sOutput) 
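
Both snippets above read a single hard-coded input.txt, whereas the question loops over every *.txt file in a chosen folder. A minimal sketch of that outer loop follows. `ProcessFolder`, `BatchSketch` and the `cleanUp` parameter are hypothetical names, not part of the original answer, and the sketch assumes it is acceptable to overwrite each file in place, so try it on a copy of the folder first.

```vbnet
Imports System
Imports System.IO

Module BatchSketch
    ' Hypothetical helper (not from the original answer): runs a
    ' String -> String clean-up over every .txt file directly inside a folder.
    Sub ProcessFolder(folder As String, cleanUp As Func(Of String, String))
        For Each filePath As String In Directory.GetFiles(folder, "*.txt")
            Dim original As String = File.ReadAllText(filePath)
            Dim cleaned As String = cleanUp(original)
            ' Assumption: overwriting in place is acceptable.
            ' Rewrite the file only when something actually changed.
            If cleaned <> original Then File.WriteAllText(filePath, cleaned)
        Next
    End Sub

    Sub Main(args As String())
        If args.Length = 0 Then
            Console.WriteLine("usage: BatchSketch <folder>")
            Return
        End If
        ' Placeholder clean-up that changes nothing; substitute the body of
        ' either snippet above, wrapped as a Function(text As String) As String.
        ProcessFolder(args(0), Function(text) text)
    End Sub
End Module
```

Passing the clean-up step as a delegate keeps the file handling separate from the link-removal logic, so either approach can be swapped in without touching the loop.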

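Can't this be done without the XML parser, by treating the file as a plain text file and using basic string-manipulation techniques? I don't know how to use the XML features in VB.NET. –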
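
A plain-text variant of what the question describes (remove a link when no section with that id exists) can be sketched with two regular-expression passes and no XML parsing. This is only a sketch, not the answerer's method: `RemoveDeadLinks`, the module name and the sample string are hypothetical, and it assumes that every link looks exactly like `<a href rid="secN">…</a>` on a single line and that sections are declared as `<sec id="secN">`.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module DeadLinkSketch
    ' Hypothetical helper: unwraps every <a href rid="secN">...</a>
    ' whose target <sec id="secN"> does not occur anywhere in the text.
    Function RemoveDeadLinks(text As String) As String
        ' Pass 1: collect every section id that is actually declared.
        Dim knownIds As New HashSet(Of String)
        For Each m As Match In Regex.Matches(text, "<sec\s+id=""(sec\d+)""")
            knownIds.Add(m.Groups(1).Value)
        Next

        ' Pass 2: keep a link if its rid is declared, otherwise keep only its text.
        Dim linkPattern As String = "<a href rid=""(sec\d+)"">(.*?)</a>"
        Return Regex.Replace(text, linkPattern,
            Function(m As Match) If(knownIds.Contains(m.Groups(1).Value),
                                    m.Value,
                                    m.Groups(2).Value))
    End Function

    Sub Main()
        Dim sample As String =
            "<sec id=""sec1""><p>I <a href rid=""sec12"">section 12</a> asked, " &
            "see <a href rid=""sec1"">section 1</a>.</p></sec>"
        Console.WriteLine(RemoveDeadLinks(sample))
        ' Prints:
        ' <sec id="sec1"><p>I section 12 asked, see <a href rid="sec1">section 1</a>.</p></sec>
    End Sub
End Module
```

Unlike the two snippets in the answer, which unwrap a link when its rid value is used only once, this sketch checks each rid against the section ids that exist in the file, which is the behaviour the question asks for.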