2015-12-16 27 views
4

VBA, getElementsByClassName: double quotes are missing from the HTML source

I scrape some websites for fun, using VBA as my tool. I use XMLHTTP and HTMLDocument (because they are faster than InternetExplorer.Application).

Public Sub XMLhtmlDocumentHTMLSourceScraper() 

    Dim XMLHTTPReq As Object 
    Dim htmlDoc As HTMLDocument 

    Dim postURL As String 

    postURL = "http://foodffs.tumblr.com/archive/2015/11" 

     Set XMLHTTPReq = New MSXML2.XMLHTTP 

     With XMLHTTPReq 
      .Open "GET", postURL, False 
      .Send 
     End With 

     Set htmlDoc = New HTMLDocument 
     With htmlDoc 
      .body.innerHTML = XMLHTTPReq.responseText 
     End With 

     i = 0 

     Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass") 

     For Each vr In varTemp 
      ''''the next line is important to solve this issue *1 
      Cells(1, 1) = vr.outerHTML 
      Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date") 
      Cells(i + 1, 3) = varTemp2.Item(0).innerText 
      ''''the next line raises error 438'''' 
      Set varTemp2 = vr.getElementsByClassName("hover_inner") 
      Cells(i + 1, 4) = varTemp2.innerText 

      i = i + 1 

     Next vr 
End Sub 

To investigate, I wrote the outerHTML to Cells(1, 1) (the line marked *1), and it showed me the following:

<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank> 
<DIV class=hover_inner><SPAN class=post_date>............... 

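All the class attributes have lost their double quotes. Only the class of the first element still has them. I really don't know why this happens.

// OK, I could parse this with getElementsByTagName("span"), but I would prefer to select by class.....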


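http://stackoverflow.com/questions/7927905/internet-explorer-innerhtml-outputs-attributes-without-quotes I don't think HTML requires quotes around an attribute value when the value contains no spaces, and what you see in outerHTML reflects IE's own representation of the markup. This is probably not the source of the error you are getting. –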


What happens if you try `Set varTemp2 = vr.querySelectorAll("span.post_date")`? – barrowc


Thanks, everyone! @TimWilliams I see. So is getElementsByTagName("span") the only way I can get at the innerText? – Soborubang

Answers

4

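The getElementsByClassName method is not treated as a method of the element itself, only of the parent HTMLDocument. If you want to use it to locate elements inside a DIV element, you need to create a sub-HTMLDocument made up of the .outerHTML of that specific DIV element.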

Public Sub XMLhtmlDocumentHTMLSourceScraper() 

    Dim xmlHTTPReq As New MSXML2.XMLHTTP 
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument 
    Dim iDIV As Long, iSPN As Long, iEL As Long 
    Dim postURL As String, nr As Long, i As Long 

    postURL = "http://foodffs.tumblr.com/archive/2015/11" 

    With xmlHTTPReq 
     .Open "GET", postURL, False 
     .Send 
    End With 

    'Set htmlDOC = New HTMLDocument 
    With htmlDOC 
     .body.innerHTML = xmlHTTPReq.responseText 
    End With 

    i = 0 

    With htmlDOC 
     For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1 
      nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row 
      With .getElementsByClassName("post_glass post_micro_glass")(iDIV) 
       'method 1 - run through multiples in a collection 
       For iSPN = 0 To .getElementsByTagName("span").Length - 1 
        With .getElementsByTagName("span")(iSPN) 
         Select Case LCase(.className) 
          Case "post_date" 
           Sheet1.Cells(nr, 3) = .innerText 
          Case "post_notes" 
           Sheet1.Cells(nr, 4) = .innerText 
          Case Else 
           'do nothing 
         End Select 
        End With 
       Next iSPN 
       'method 2 - create a sub-HTML doc to facilitate getting els by classname 
       divSUBDOC.body.innerHTML = .outerHTML 'only the HTML from this DIV 
       With divSUBDOC 
        If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1 
         'use the first 
         Sheet1.Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText 
        End If 
       End With 
      End With 
     Next iDIV 
    End With 

End Sub 

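While the other .getElementsByXXXX methods can readily retrieve a collection from inside another element, the getElementsByClassName method needs to work on what it believes to be the whole HTMLDocument, even if you have only fooled it into thinking so.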
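
Method 1 above can be folded into a small reusable function. The following is only a sketch: the name FirstByTagAndClass and its parameters are hypothetical and not part of the original answer. It relies solely on getElementsByTagName and className, both of which the code above already calls on an element, and it assumes class names are separated by plain spaces.

```vba
' Hypothetical helper, not part of the original answer.
' Returns the first descendant of parentEl with the given tag name whose
' class attribute contains clsName as a whole word, or Nothing if none matches.
Public Function FirstByTagAndClass(ByVal parentEl As Object, _
                                   ByVal tagName As String, _
                                   ByVal clsName As String) As Object
    Dim els As Object
    Dim idx As Long

    Set els = parentEl.getElementsByTagName(tagName)
    For idx = 0 To els.Length - 1
        ' Pad both sides with spaces so that "post" does not match "post_date"
        If InStr(1, " " & els(idx).className & " ", _
                 " " & clsName & " ", vbTextCompare) > 0 Then
            Set FirstByTagAndClass = els(idx)
            Exit Function
        End If
    Next idx
    ' No match: the return value is still Nothing
End Function
```

Inside the loop over the DIV elements, a call such as `Set el = FirstByTagAndClass(divEl, "div", "hover_inner")` followed by an `If Not el Is Nothing Then` check would then stand in for the sub-document step, where divEl and el are placeholder Object variables.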


Thank you so much! I didn't know getElementsByClassName was special. I admire you! – Soborubang


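MDN says "You may also call getElementsByClassName() on any element; it will return only elements which are descendants of the specified root element with the given class names." I'm pretty sure I have used it that way in IE before... –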


https://developer.mozilla.org/zh-CN/docs/Web/API/Element/getElementsByClassName –

1

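Here is another approach. It is very similar to the original code but uses querySelectorAll to select the relevant span elements. One important point about this approach is that vr must be declared as a specific element type, and not as an IHTMLElement or a generic Object: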

Option Explicit 

Public Sub XMLhtmlDocumentHTMLSourceScraper() 

' Changed from generic Object to specific type - not 
' strictly necessary to do this 
Dim XMLHTTPReq As MSXML2.XMLHTTP60 
Dim htmlDoc As HTMLDocument 

' These declarations weren't included in the original code 
Dim i As Integer 
Dim varTemp As Object 
' IMPORTANT: vr must be declared as a specific element type and not 
' as an IHTMLElement or generic Object 
Dim vr As HTMLDivElement 
Dim varTemp2 As Object 

Dim postURL As String 

postURL = "http://foodffs.tumblr.com/archive/2015/11" 

' Changed from XMLHTTP to XMLHTTP60 as XMLHTTP is equivalent 
' to the older XMLHTTP30 
Set XMLHTTPReq = New MSXML2.XMLHTTP60 

With XMLHTTPReq 
    .Open "GET", postURL, False 
    .Send 
End With 

Set htmlDoc = New HTMLDocument 
With htmlDoc 
    .body.innerHTML = XMLHTTPReq.responseText 
End With 

i = 0 

Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass") 

For Each vr In varTemp 
    ''''the next line is important to solve this issue *1 
    Cells(1, 1) = vr.outerHTML 

    Set varTemp2 = vr.querySelectorAll("span.post_date") 
    Cells(i + 1, 3) = varTemp2.Item(0).innerText 

    Set varTemp2 = vr.getElementsByClassName("hover_inner") 
    ' incorporating correction from Jeeped's comment (#56349646) 
    Cells(i + 1, 4) = varTemp2.Item(0).innerText 

    i = i + 1 
Next vr 

End Sub 

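Notes:

  • XMLHTTP is equivalent to XMLHTTP30, as described here
  • The apparent need to declare a specific element type is explored in this question but, unlike the getElementsByClassName method, querySelectorAll does not exist in any version of IHTMLElement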
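
Every listing in this thread assigns responseText to the document without checking whether the request succeeded. The sketch below adds that check. The function name FetchHtml is hypothetical and not part of either answer, and it uses late binding through the "MSXML2.XMLHTTP.6.0" ProgID instead of the early-bound XMLHTTP60 type declared above.

```vba
' Hypothetical helper, not part of the original answers.
' Returns the response body for url, or an empty string if the request
' does not complete with HTTP status 200.
Public Function FetchHtml(ByVal url As String) As String
    Dim req As Object

    ' Late-bound, so no reference to Microsoft XML, v6.0 is needed
    Set req = CreateObject("MSXML2.XMLHTTP.6.0")
    With req
        .Open "GET", url, False   ' False = synchronous request
        .send
        If .Status = 200 Then FetchHtml = .responseText
    End With
End Function
```

The result can be assigned to htmlDoc.body.innerHTML after testing it for an empty string. A network failure still raises a run-time error from .send, so this covers only HTTP-level errors such as a 404.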