2016-03-10 144 views
3

我与Excel单元格内,用逗号分割成其组成部分的一些英国的地址数据的工作。Excel中 - 字符串删除重复

我有一些VBA我已经从已经除去了一些精确复制的条目的网络中获得,但我留下大量已重复段一些顺序和一些非顺序数据。

附件是图像强调什么,我试图实现,但迄今为止这是不是我包括我使用的代码向您展示中,我一直在寻找的方向。任何人都可以进一步思考如何实现这一目标?

Function stringOfUniques(inputString As String, delimiter As String) 
Dim xVal As Variant 
Dim dict As Object 
Set dict = CreateObject("Scripting.Dictionary") 

For Each xVal In Split(inputString, delimiter) 
dict(xVal) = xVal 
Next xVal 

stringOfUniques = Join(dict.Keys(), ",") 
End Function 

这确实设法摆脱了一些他们,但有一个庞大的人口,我正在努力如此自动化这将是不可思议的。

Ideal Outcome

+0

一个RegExp与反向引用将是另一个可能的选择 – brettdj

回答

4

可能不是最优雅的答案,但确实的伎俩。 这里我使用Split命令在每个逗号处分割字符串。 从这个返回的结果是

bat ball banana 

代码:

Option Explicit 
Private Sub test() 
Dim Mystring As String 
Dim StrResult As String 

Mystring = "bat,ball,bat,ball,banana" 
StrResult = shed_duplicates(Mystring) 
End Sub 
Private Function shed_duplicates(ByRef Mystring As String) As String 
Dim MySplitz() As String 
Dim J As Integer 
Dim K As Integer 
Dim BooMatch As Boolean 
Dim StrTemp(10) As String ' assumes no more than 10 possible splits! 
Dim StrResult As String 


MySplitz = Split(Mystring, ",") 
    For J = 0 To UBound(MySplitz) 
    BooMatch = False 
    For K = 0 To UBound(StrTemp) 
     If MySplitz(J) = StrTemp(K) Then 
      BooMatch = True 
      Exit For 
     End If 
    Next K 
    If Not BooMatch Then 
     StrTemp(J) = MySplitz(J) 
    End If 
Next 
For J = 0 To UBound(StrTemp) 
    If Len(StrTemp(J)) > 0 Then ' ignore blank entries 
     StrResult = StrResult + StrTemp(J) + " " 
    End If 
Next J 
Debug.Print StrResult 
End Function 
3

你可能真的使用正则表达式替换:

^(\d*\s*([^,]*),.*)\2(,|$) 

替换模式是

$1$3 

请参阅regex demo。所述图案解释

  • ^ - 串的开始(或线的如果.MultiLine = True
  • (\d*\s*([^,]*),.*) - 第1组(稍后参考与从替换模式$1反向引用)匹配:
    • \d* - 0+位数字与
    • \s* - 0+空格字符
    • ([^,]*) - 第2组(以后,我们可以使用在\2图案反向引用来引用与该子模式捕捉到的值)相匹配比逗号
    • ,.*其他0+字符 - 逗号后具有比其他换行符0+字符
  • \2 - 由组2
  • (,|$)捕获的文本 - 第3组(稍后从替换模式参照与$3 - 还原逗号)匹配逗号或字符串的末尾(或行,如果.MultiLine = True)。

注意:你不需要.MultiLine = True如果你只是检查单个细胞含有一个地址。

下面是展示如何在VBA中使用的样本VBA子:

Sub test() 
    Dim regEx As Object 
    Set regEx = CreateObject("VBScript.RegExp") 
    With regEx 
     .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)" 
     .Global = True 
     .MultiLine = True ' Remove if individual addresses are matched 
    End With 
    s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _ 
     "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _ 
     "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _ 
     "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD" 
    MsgBox regEx.Replace(s, "$1$3") 
End Sub 

enter image description here

+0

伟大的工作!请注意,在正则表达式替换后删除两倍逗号的小调整是值得的。 – brettdj

+1

我从未见过正则表达式函数,示例解决方案! –

1

第一种解决方案是使用一个字典来获得独特的段列表。 它随后将被作为分裂段之前跳过第一地址数作为简单:

Function RemoveDuplicates1(text As String) As String 
    Static dict As Object 
    If dict Is Nothing Then 
    Set dict = CreateObject("Scripting.Dictionary") 
    dict.CompareMode = 1 ' set the case sensitivity to All 
    Else 
    dict.RemoveAll 
    End If 

    ' Get the position just after the address number 
    Dim c&, istart&, segment 
    For istart = 1 To Len(text) 
    c = Asc(Mid$(text, istart, 1)) 
    If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] 
    Next 

    ' Split the segments and add each one of them to the dictionary. No need to keep 
    ' a reference to each segment since the keys are returned by order of insertion. 
    For Each segment In Split(Mid$(text, istart), ",") 
    If Len(segment) Then dict(segment) = Empty 
    Next 

    ' Return the address number and the segments by joining the keys 
    RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",") 
End Function 

第二种解决方案将是,以提取所有的段,然后搜索如果它们中的每一个存在于先前的位置是:

Function RemoveDuplicates2(text As String) As String 
    Dim c&, segments$, segment$, length&, ifirst&, istart&, iend& 

    ' Get the position just after the address number 
    For ifirst = 1 To Len(text) 
    c = Asc(Mid$(text, ifirst, 1)) 
    If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] 
    Next 

    ' Get the segments without the address number and add a leading/trailing comma 
    segments = "," & Mid$(text, ifirst) & "," 
    istart = 1 

    ' iterate each segment 
    Do While istart < Len(segments) 

    ' Get the next segment position 
    iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF 
    If iend - istart Then 

     ' Get the segment 
     segment = Mid$(segments, istart, iend - istart + 2) 

     ' Rewrite the segment if not present at a previous position 
     If InStr(1, segments, segment, vbTextCompare) = istart Then 
     Mid$(segments, length + 1) = segment 
     length = length + Len(segment) - 1 
     End If 
    End If 

    istart = iend + 1 
    Loop 

    ' Return the address number and the segments 
    RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1) 

End Function 

和第三解决方案将是使用正则表达式来除去所有的重复链段:

Function RemoveDuplicates3(ByVal text As String) As String 

    Static re As Object 
    If re Is Nothing Then 
    Set re = CreateObject("VBScript.RegExp") 
    re.Global = True 
    re.IgnoreCase = True 
    ' Match any duplicated segment separated by a comma. 
    ' The first segment is compared without the first digits. 
    re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)" 
    End If 

    ' Remove each matching segment 
    Do While re.test(text) 
    text = re.Replace(text, "$1") 
    Loop 

    RemoveDuplicates3 = text 
End Function 

这些都是对于10000次迭代(越低越好),执行时间:

input text : "123 abc,,1 abc,abc 2,ABC,abc,a,c" 
output text : "123 abc,1 abc,abc 2,a,c" 

RemoveDuplicates1 (dictionary) : 718 ms 
RemoveDuplicates2 (text search) : 219 ms 
RemoveDuplicates3 (regex)  : 1469 ms 
+0

再一次感谢您的众多优雅的解决方案,每个似乎都做我需要的东西。不胜感激! –