2014-02-07 53 views
2

Web scraping with VBA using XMLHTTP

I want to get some data from the web page http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures.

If I use the old InternetExplorer object (code below), I can walk through the HTML document. But I would like to use the XMLHTTP object instead (second code block).

Sub IEZagon() 
    'we define the essential variables 
    Dim ie As Object 
    Dim TDelement, TDelements 
    Dim AnhorLink, AnhorLinks 

    'add the "Microsoft Internet Controls" reference in your VBA Project indirectly 
    Set ie = CreateObject("InternetExplorer.Application") 
    With ie 
     .Visible = True 
     .navigate ("[URL]http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures[/URL]") 
     While ie.ReadyState <> 4 
      DoEvents 
     Wend 
     Set AnhorLinks = .document.getElementsbytagname("a") 
     Set TDelements = .document.getElementsbytagname("td") 
     For Each AnhorLink In AnhorLinks 
      Debug.Print AnhorLink.innertext 
     Next 
     For Each TDelement In TDelements 
      Debug.Print TDelement.innertext 
     Next 
    End With 
    Set ie = Nothing 
End Sub 

The code using the XMLHTTP object:

Sub FuturesScrap(ByVal URL As String) 
    Dim XMLHttpRequest As XMLHTTP 
    Dim HTMLDoc As New HTMLDocument 

    Set XMLHttpRequest = New MSXML2.XMLHTTP 
    XMLHttpRequest.Open "GET", URL, False 
    XMLHttpRequest.send 
    While XMLHttpRequest.readyState <> 4 
     DoEvents 
    Wend 

    Debug.Print XMLHttpRequest.responseText 
    HTMLDoc.body.innerHTML = XMLHttpRequest.responseText 

    With HTMLDoc.body 
     Set AnchorLinks = .getElementsByTagName("a") 
     Set TDelements = .getElementsByTagName("td") 

     For Each AnchorLink In AnchorLinks 
      Debug.Print AnchorLink.innerText
     Next 

     For Each TDelement In TDelements 
      Debug.Print TDelement.innerText 
     Next 
    End With 
End Sub 

I only get this basic HTML:

<html> 
<head> 
<title>Resource Not found</title> 
<link rel= 'stylesheet' type='text/css' href='/blueprint/css/errorpage.css'/> 
</head> 
<body> 
<table class="header"> 
<tr> 
<td class="CMTitle CMHFill"><span class="large">Resource Not found</span></td> 
</tr> 
</table> 
<div class="body"> 
<p style="font-weight:bold;">The requested resource does Not exist.</p> 
</div> 
<table class="footer"> 
<tr> 
<td class="CMHFill"> </td> 
</tr> 
</table> 
</body> 
</html> 

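I would like to walk through the tables and the corresponding data. Eventually I would also like to select the different time intervals, from Year to Month.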

I would really appreciate any help. Thanks!

+2

It looks like you are requesting an incorrect URL... –

+0

I am calling the right URL: – Figlio

+0

See @brettdj's answer [here](http://stackoverflow.com/questions/8798260/html-parsing-of-cricinfo-scorecards) –

Answers

3

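I can confirm that when I run your code (with or without the url tags) I get the same HTML as you. I found a useful post here. I have modified your code using the method found there, and it now seems to download the correct information.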

Sub test() 
    Call FuturesScrap1("http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures") 
End Sub 

I have included the calling sub because the url tags seem to cause an error for the MSXML request.

Sub FuturesScrap1(ByVal URL As String) 
    Dim HTMLDoc As New HTMLDocument 
    Dim oHttp As MSXML2.XMLHTTP 
    Dim sHTML As String 
    Dim AnchorLinks As Object 
    Dim TDelements As Object 
    Dim TDelement As Object 
    Dim AnchorLink As Object 

    On Error Resume Next 
    Set oHttp = New MSXML2.XMLHTTP 
    If Err.Number <> 0 Then 
     Set oHttp = CreateObject("MSXML.XMLHTTPRequest") 
     MsgBox "Error 0 has occurred while creating a MSXML.XMLHTTPRequest object"
    End If 
    On Error GoTo 0 
    If oHttp Is Nothing Then 
     MsgBox "For some reason I wasn't able to make a MSXML2.XMLHTTP object" 
     Exit Sub 
    End If 

    'Open the URL in browser object 
    oHttp.Open "GET", URL, False 
    oHttp.send 
    sHTML = oHttp.responseText 

    Debug.Print oHttp.responseText 

    HTMLDoc.body.innerHTML = oHttp.responseText 

    With HTMLDoc.body 
     Set AnchorLinks = .getElementsByTagName("a") 
     Set TDelements = .getElementsByTagName("td") 

     For Each AnchorLink In AnchorLinks 
      Debug.Print AnchorLink.innerText 
     Next 

     For Each TDelement In TDelements 
      Debug.Print TDelement.innerText 
     Next 
    End With 

End Sub 
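
The "Resource Not found" page in the question looks like an error response, so it may be worth checking the HTTP status before parsing anything. The sketch below is not part of the code above: `StripUrlTags` and `FetchHtml` are hypothetical helper names, and it assumes late binding to `MSXML2.XMLHTTP` so that no project reference is needed.

```vba
' Hypothetical helper: removes forum-style [URL]...[/URL] markers from a string.
Function StripUrlTags(ByVal sUrl As String) As String
    sUrl = Replace(sUrl, "[URL]", "", , , vbTextCompare)
    sUrl = Replace(sUrl, "[/URL]", "", , , vbTextCompare)
    StripUrlTags = Trim$(sUrl)
End Function

' Hypothetical helper: returns the response body, or an empty string
' when the server does not answer with HTTP status 200.
Function FetchHtml(ByVal sUrl As String) As String
    Dim oHttp As Object

    ' Late binding: assumes the MSXML2.XMLHTTP ProgID is registered.
    Set oHttp = CreateObject("MSXML2.XMLHTTP")
    oHttp.Open "GET", StripUrlTags(sUrl), False   ' False = synchronous request
    oHttp.send

    If oHttp.Status = 200 Then
        FetchHtml = oHttp.responseText
    Else
        ' Report the failure instead of parsing an error page.
        Debug.Print "Request failed: " & oHttp.Status & " " & oHttp.statusText
    End If
End Function
```

A caller could then skip the parsing step when `FetchHtml(URL)` returns an empty string. Note that a status of 200 only says the page was served; it does not mean the JavaScript-generated tables are in the response.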

Edit following the comments:

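I have not been able to find the table elements using the MSXML2 object; the source it downloads does not seem to contain them. In Firebug the td tags are present, so I think the table is generated by JavaScript code. I do not know whether MSXML2 can run JavaScript, so I have modified the sub to use Internet Explorer. It is not fast code, but it does find the td elements and it does allow the tabs to be clicked. I found that the td elements take some time to become available (presumably because IE needs to run the JavaScript), so I have put in a couple of wait steps before Excel downloads the data.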

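I have put in some code that downloads the contents of the td elements to the active worksheet, so be careful if you run it in a workbook that holds useful data.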

Sub FuturesScrap3(ByVal URL As String) 

    Dim HTMLDoc As New HTMLDocument 
    Dim AnchorLinks As Object 
    Dim tdElements As Object 
    Dim tdElement As Object 
    Dim AnchorLink As Object 
    Dim lRow As Long 
    Dim oElement As Object 

    Dim oIE As InternetExplorer 

    Set oIE = New InternetExplorer 

    oIE.navigate URL 
    oIE.Visible = True 

    Do Until (oIE.readyState = 4 And Not oIE.Busy) 
     DoEvents 
    Loop 

    'Wait for Javascript to run 
    Application.Wait (Now + TimeValue("0:01:00")) 

    HTMLDoc.body.innerHTML = oIE.document.body.innerHTML 

    With HTMLDoc.body 
     Set AnchorLinks = .getElementsByTagName("a") 
     Set tdElements = .getElementsByTagName("td")

     For Each AnchorLink In AnchorLinks 
      Debug.Print AnchorLink.innerText 
     Next AnchorLink 

    End With 

    lRow = 1 
    For Each tdElement In tdElements 
     Debug.Print tdElement.innerText 
     Cells(lRow, 1).Value = tdElement.innerText 
     lRow = lRow + 1 
    Next 

    'Clicking the Month tab 
    For Each oElement In oIE.document.all 
     If Trim(oElement.innerText) = "Month" Then 
      oElement.Focus 
      oElement.Click 
     End If 
    Next oElement 

    Do Until (oIE.readyState = 4 And Not oIE.Busy) 
     DoEvents 
    Loop 

    'Wait for Javascript to run 
    Application.Wait (Now + TimeValue("0:01:00")) 

    HTMLDoc.body.innerHTML = oIE.document.body.innerHTML 

    With HTMLDoc.body 
     Set AnchorLinks = .getElementsByTagName("a") 
     Set tdElements = .getElementsByTagName("td")

     For Each AnchorLink In AnchorLinks 
      Debug.Print AnchorLink.innerText 
     Next AnchorLink 
    End With 

    lRow = 1 
    For Each tdElement In tdElements 
     Debug.Print tdElement.innerText 
     Cells(lRow, 2).Value = tdElement.innerText 
     lRow = lRow + 1 
    Next tdElement 

End Sub
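
The fixed one-minute `Application.Wait` could possibly be replaced by polling the live document until the cells appear. This is only a sketch under assumptions: `WaitForCells` is a hypothetical helper name, and treating "at least one td exists" as the ready signal is a guess about this page, since one cell showing up does not guarantee the whole table has finished loading.

```vba
' Hypothetical helper: polls the live IE document until at least one <td>
' exists or the timeout expires. Returns True when cells were found.
Function WaitForCells(ByVal oIE As Object, ByVal lTimeoutSeconds As Long) As Boolean
    Dim dDeadline As Date

    dDeadline = DateAdd("s", lTimeoutSeconds, Now)
    Do While Now < dDeadline
        ' Only inspect the document once the browser reports it is loaded.
        If oIE.readyState = 4 And Not oIE.Busy Then
            If oIE.document.getElementsByTagName("td").Length > 0 Then
                WaitForCells = True
                Exit Function
            End If
        End If
        DoEvents   ' keep Excel responsive while waiting
    Loop
    ' Falls through with the default return value False on timeout.
End Function
```

With this helper, a call such as `If WaitForCells(oIE, 60) Then ...` would return as soon as the cells are there and give up after 60 seconds otherwise.
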
+0

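I wrote the same code on Saturday, but I still have a problem with this web page. With your code and with mine, I cannot list the six button (anchor) names, Year to Day. If I want to walk through the different tables according to the time window (year, quarter, etc.), I need to click one of these buttons. But that is not the last problem: in our code we cannot list the table data with the code: [code] For Each TDelement In TDelements Debug.Print TDelement.innerText Next [\code] – Figlio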

+1

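@Figlio I have modified the answer so that it gets the TD elements and allows the table to be changed, but it uses Internet Explorer rather than MSXML2, which may be necessary because of the JavaScript. –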

+0

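Thanks. It works with the IE object. I know, I wrote the same code as you, and I had the same problem of needing the Application.Wait method. If that is how it is, and XMLHTTP cannot be used, I will stick with IE. Thanks again! – Figlio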