2016-09-28

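Excel VBA: Extract image src attribute from a string as a string

I want to scrape my employer's website to extract images from their blog posts en masse. I have started building a scraper in Excel using VBA.

(We don't have access to the SQL database.)

I have set up a worksheet containing a list of post identifiers in column A and the post URLs in column B.

So far, my VBA script loops through the list of URLs in column B, extracts the HTML from a tag on the page by ID using getElementById, and pastes the resulting output as a string into column C.

I am now at the point where I am trying to figure out how to extract the src attribute from each image in the resulting output and paste it into the relevant columns. I can't for the life of me come up with a simple solution. I'm not very familiar with RegEx, and I'm struggling with Excel's built-in string functions.

The final step is to have the macro run through each image URL and save the image to disk with a filename in the format "{Event No.}-{Image Number}.JPG".

Any help is much appreciated.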

Worksheet setup

Sub Get_Image_SRC()

Dim sht As Worksheet
Dim LastRow As Long
Dim i As Long ' Long rather than Integer, so row counts above 32,767 do not overflow
Dim url As String
Dim IE As Object
Dim objElement As Object
Dim objCollection As Object
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement


Set sht = ThisWorkbook.Worksheets("Sheet1")
'Ctrl + Shift + End
LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
For i = 2 To LastRow
    ' Qualify Cells with sht so the macro does not depend on the active sheet
    url = sht.Cells(i, "C").Value
    MsgBox (url)
    IE.navigate url
    Application.StatusBar = url & " is loading..."
    ' Wait for the browser to leave the "complete" state, then to reach it again
    Do While IE.readyState = 4: DoEvents: Loop
    Do Until IE.readyState = 4: DoEvents: Loop
    Application.StatusBar = url & " Loaded"
    If sht.Cells(i, "B").Value = "WEBNEWS" Then
        sht.Cells(i, "D").Value = IE.document.getElementById("NewsDetail").outerHTML
    Else
        sht.Cells(i, "D").Value = IE.document.getElementById("ReviewContainer").outerHTML
    End If
Next i

' Close the browser and hand the status bar back to Excel
IE.Quit
Application.StatusBar = False
Set IE = Nothing
Set objElement = Nothing
Set objCollection = Nothing

End Sub

Example of the resulting HTML:

<div id=""NewsDetail""><div class=""NewsDetailTitle"">Video: Race Face Behind the Scenes Tour</div><div class=""NewsDetailImage""><img alt=""HeadlinesThumbnail.jpg"" src=""/ImageHandler/6190/515/1000/0/""></div> <div class=""NewsDetailBody"">Pinkbike posted this video a while ago, if you missed it, its' definitely worth a watch. 

Ken from Camp of Champions took a look at their New Westminster factory last year which gives a look at the production, people and culture of Race Face. The staff at Race Face are truly their greatest asset they had, best wishes to everyone! 

<p><center><object width=""500"" height=""281""><param name=""allowFullScreen"" value=""true""><param name=""AllowScriptAccess"" value=""always""><param name=""movie"" value=""http://www.pinkbike.com/v/188244""><embed width=""500"" height=""281"" src=""http://www.pinkbike.com/v/188244"" type=""application/x-shockwave-flash"" allowscriptaccess=""always"" allowfullscreen=""true""></object></center><p></p> 


</div><div class=""NewsDate"">Published Friday, 25 November 2011</div></div>" 

My current references

Answer

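For the regex approach, you should look at these two links:

These basically boil down to:

  • The regex to get the value of an img src attribute is src\s*=\s*"(.+?)"
  • Use the VBScript.RegExp library to run regular expressions in VBA

I have used late binding, but you can add a reference to the library instead if you prefer.

The full VBA then looks like this: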
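For comparison, here is a minimal sketch of the early-bound form. It assumes you have ticked "Microsoft VBScript Regular Expressions 5.5" under Tools > References in the VBA editor; the Sub name and the sample string are made up for illustration.

```vba
Option Explicit

' Hypothetical demo Sub: early-bound equivalent of CreateObject("VBScript.RegExp").
' Assumes the "Microsoft VBScript Regular Expressions 5.5" reference is ticked.
Sub EarlyBoundRegexSketch()

    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim objMatches As VBScript_RegExp_55.MatchCollection

    Set objRegex = New VBScript_RegExp_55.RegExp

    With objRegex
        .IgnoreCase = True
        .Global = True
        .Pattern = "src\s*=\s*""(.+?)"""
        ' Illustrative input only
        Set objMatches = .Execute("<img alt=""fred"" src=""picture1.png"" />")
    End With

    If objMatches.Count > 0 Then
        ' First capture group of the first match
        Debug.Print objMatches(0).SubMatches(0)
    End If

End Sub
```

Early binding gives you IntelliSense and compile-time checking; late binding avoids the need to set the reference on every machine that runs the workbook.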

Option Explicit

Sub Test()

Dim strHtml As String 

' sample html, note the three img tags
strHtml = "" 
strHtml = strHtml & "<div id=""foo"">" 
strHtml = strHtml & "<bar class=""baz"">" 
strHtml = strHtml & "<img alt=""fred"" src=""\\server\path\picture1.png"" />" 
strHtml = strHtml & "</bar>" 
strHtml = strHtml & "<bar class=""baz"">" 
strHtml = strHtml & "<img alt=""ned"" src=""\\server\path\picture2.png"" />" 
strHtml = strHtml & "</bar>" 
strHtml = strHtml & "<bar class=""baz"">" 
strHtml = strHtml & "<img alt=""teddy"" src=""\\server\path\picture3.png"" />" 
strHtml = strHtml & "</bar>" 
strHtml = strHtml & "</div>" 

Dim strSrc As String 
Dim objRegex As Object 
Dim objMatches As Object 
Dim lngMatchCount As Long, lngCounter As Long 

' create regex 
Set objRegex = CreateObject("VBScript.RegExp") 

' set pattern and execute 
With objRegex
    .IgnoreCase = True
    .Pattern = "src\s*=\s*""(.+?)"""
    .Global = True

    If .Test(strHtml) Then
        Set objMatches = .Execute(strHtml)
        lngMatchCount = objMatches.Count
        For lngCounter = 0 To lngMatchCount - 1
            strSrc = objMatches(lngCounter).SubMatches(0)
            ' you've successfully captured the img src value
            Debug.Print strSrc
        Next
    Else
        strSrc = "Not found"
    End If
End With

End Sub

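Note that I take the first item of the SubMatches collection to get the value of the src attribute. In this code, the difference between objMatches(0) and objMatches(0).SubMatches(0) is:

src="\\server\path\picture1.png"

versus:

\\server\path\picture1.png

You may want to wrap this up as a function and call it from the If..End If block of your code, where you work out the IE.document.getElementById("NewsDetail").outerHTML value.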
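A minimal sketch of such a wrapper follows. The names GetImageSources, WriteSourcesForRow and DemoGetImageSources are hypothetical, and so is the choice of column E as the first output column. The pattern is also anchored on <img, which is my assumption rather than part of the pattern above: the plain src pattern would also pick up the src of other tags, such as the <embed> in the sample HTML from the question.

```vba
Option Explicit

' Hypothetical helper: returns every img src value found in strHtml as a Collection of strings.
Function GetImageSources(ByVal strHtml As String) As Collection

    Dim colSources As New Collection
    Dim objRegex As Object
    Dim objMatch As Object

    Set objRegex = CreateObject("VBScript.RegExp")

    With objRegex
        .IgnoreCase = True
        .Global = True
        ' Assumption: only src attributes inside <img ...> tags are wanted
        .Pattern = "<img\b[^>]*?\bsrc\s*=\s*""([^""]*)"""
        For Each objMatch In .Execute(strHtml)
            colSources.Add objMatch.SubMatches(0)
        Next objMatch
    End With

    Set GetImageSources = colSources

End Function

' Hypothetical caller: reads the HTML stored in column D of the given row and
' writes one src per cell, starting at column E (an assumed layout).
Sub WriteSourcesForRow(ByVal sht As Worksheet, ByVal lngRow As Long)

    Dim varSrc As Variant
    Dim lngCol As Long

    lngCol = 5 ' column E
    For Each varSrc In GetImageSources(sht.Cells(lngRow, "D").Value)
        sht.Cells(lngRow, lngCol).Value = varSrc
        lngCol = lngCol + 1
    Next varSrc

End Sub

' Quick check in the Immediate window, using a shortened version of the question's HTML
Sub DemoGetImageSources()

    Dim strHtml As String
    Dim varSrc As Variant

    strHtml = "<div><img alt=""a"" src=""/ImageHandler/6190/515/1000/0/"">" & _
              "<embed src=""http://www.pinkbike.com/v/188244""></div>"

    ' Only the img src should be printed; the embed src is skipped by the <img anchor
    For Each varSrc In GetImageSources(strHtml)
        Debug.Print varSrc
    Next varSrc

End Sub
```

In the loop from the question, a call such as WriteSourcesForRow sht, i placed right after the If..End If block would fill the columns to the right of the stored HTML. Note that the src values on that site are relative (for example /ImageHandler/6190/515/1000/0/), so they would need the site's base URL prepended before they can be downloaded.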

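Thanks, Robin. This works great for pages with a single image. May I ask how you would modify this to get multiple images? – user2866975

@user2866975 - see my edit - basically you need to set the Global flag to True and then iterate over all the matches.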