2011-08-05 43 views
2

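PowerShell: screen scraping HTTP and returning specific lines as variables

I'm new to PowerShell and have reached the limit of my knowledge. I'm writing a script that scrapes backup data from an internal web page, extracts information from the scraped text to work with, and then displays it in Excel.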

$Yesterday = [DateTime]::Now.AddDays(-1) 
$datestr = $Yesterday.ToString("dd-MMM-yyyy") 
$WebClient = New-Object System.Net.WebClient 
$Results = $WebClient.DownloadString("http://fakeurl") 

This returns a large amount of HTML that contains the data I'm interested in, but all bunched together. I then do this:

[StringSplitOptions]$option = "None" 
[string[]]$separator = "</td>" 
$SPL = $Results.Split($separator, $option) 

This splits the data into a more readable format. Below is a small section of the part of $SPL that I'm interested in.

<tr><td headers="HOST_NAME" class="t13dataalt">server01 
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01 
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23 
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51 
<td headers="BACKUP_TYPE" class="t13dataalt">differential 
<td headers="SCHEDULE" class="t13dataalt">daily 
<td align="right" headers="SIZE_MB" class="t13dataalt">  2,091.18 
<td headers="IMAGES" class="t13dataalt">1 
<td headers="EXIT_STATUS" class="t13dataalt">0 
</tr><tr><td headers="HOST_NAME" class="t13data">server02 
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02 
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22 
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41 
<td headers="BACKUP_TYPE" class="t13data">differential 
<td headers="SCHEDULE" class="t13data">daily 
<td align="right" headers="SIZE_MB" class="t13data">  2,496.31 
<td headers="IMAGES" class="t13data">1 
<td headers="EXIT_STATUS" class="t13data">0 

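From this I need to extract the start and end times, work out the elapsed time, and also return the EXIT_STATUS of the most recent backup. I tried the following, but I think I may be barking up the wrong tree: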

$Position = select-string -inputobject $SPL -pattern $datestr 

$Position.matches results in:

PS C:\Scripts> $Position.matches 

Groups : {03-Aug-2011} 
Success : True 
Captures : {03-Aug-2011} 
Index : 12056 
Length : 11 
Value : 03-Aug-2011 

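My theory was to add the Length to the Index and use a substring to extract the time value that follows the date, but I'm not sure how to do that. I also think it is rather heavy-handed. There must be a simpler way to return the information I need as variables, without having to pinpoint a position and then strip away everything else?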
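A minimal sketch of the Index-plus-Length idea, assuming the time always follows the date after a single space and is in `HH:mm` form, as in the sample rows. The `$text` string and the `$datestr` value are hypothetical stand-ins, not real page output, and `$timeStart` and `$time` are illustrative names.

```powershell
# Hypothetical sample text standing in for the downloaded page
$text    = '<td headers="START_TIME" class="t13data">03-Aug-2011 21:22 '
$datestr = '03-Aug-2011'

# [regex]::Match exposes the same Index, Length and Value properties
# that Select-String reports in its Matches collection
$m = [regex]::Match($text, [regex]::Escape($datestr))
if ($m.Success) {
    # Skip past the date and the single space after it,
    # then take the 5-character HH:mm value (assumed layout)
    $timeStart = $m.Index + $m.Length + 1
    $time      = $text.Substring($timeStart, 5)
    $time
}
```

This only finds the first occurrence of the date, so it is fragile if the page layout changes; parsing each cell by its `headers` attribute is likely to be more robust.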


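OK, since I'm not sure how to add a section like this at the bottom of the page, I'll add it here.

This is my current script. It runs without any errors but does not return any results.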

# Get yesterday's date and convert it to the required search format 
    $Yesterday = [DateTime]::Now.AddDays(-1) 
    $datestr = $Yesterday.ToString("dd-MMM-yyyy") 

# Scrape the webpage 
    $url = "http://fake-url" 
    $WebClient = New-Object System.Net.WebClient 
    $Results = $WebClient.DownloadString($url) 

# Determine if the previous day is listed in the backups 
    $IsDateThere = $Results.Contains($datestr) 
     If ($IsDateThere){ 
      # split the data into rows 
      [StringSplitOptions]$option = "None" 
      [string[]]$separator = "</td>" 
      $SPL = $Results.Split($separator, $option) 

      #strip the data into a hash table 
      $SPL | 
       Foreach-Object { 
        where {$_ -match 'headers="(.*)" class.*>(.*)'} | 
         ForEach-Object { 
         @{ 
           $matches[1] = ($matches[2]).trim() 
          } 
         } 
       }   
     } 
     Else{ 
      Write-Host "Yesterday's date not found" 
     } 

Any ideas? I'm not sure what to do next to get the start time, end time, and exit code of the most recent backup as variables.

Answers

3

I would approach it like this:

$html = @" 
<tr><td headers="HOST_NAME" class="t13dataalt">server01 
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01 
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23 
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51 
<td headers="BACKUP_TYPE" class="t13dataalt">differential 
<td headers="SCHEDULE" class="t13dataalt">daily 
<td align="right" headers="SIZE_MB" class="t13dataalt">  2,091.18 
<td headers="IMAGES" class="t13dataalt">1 
<td headers="EXIT_STATUS" class="t13dataalt">0 
</tr><tr><td headers="HOST_NAME" class="t13data">server02 
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02 
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22 
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41 
<td headers="BACKUP_TYPE" class="t13data">differential 
<td headers="SCHEDULE" class="t13data">daily 
<td align="right" headers="SIZE_MB" class="t13data">  2,496.31 
<td headers="IMAGES" class="t13data">1 
<td headers="EXIT_STATUS" class="t13data">0 
"@ 

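# Split on either CRLF or LF so the here-string splits however it was entered
$html -split '\r?\n' | where {$_ -match 'start_time|end_time'} | 
    ForEach { 
     # The header name begins 9 characters after the start of: headers="
     $pos = $_.IndexOf("headers") 
     $begin = $pos+9 
     $end = $_.IndexOf('"', $begin) 

     new-object PSObject -Property @{ 
      Key = $_.SubString($begin, $end-$begin) 
      # Everything after the first ">" is the cell text, parsed as a date
      Value = Get-Date ($_.SubString($_.IndexOf(">")+1)) 
     } 
    } 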

Result:

Key        Value
---        -----
START_TIME 8/1/2011 9:23:00 PM
END_TIME   8/1/2011 9:51:00 PM
START_TIME 7/31/2011 9:22:00 PM
END_TIME   7/31/2011 9:41:00 PM
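
Once a start and end time are available, the elapsed time the question asks for can be worked out by subtracting one `DateTime` from the other. This is a minimal sketch, assuming the cell text is always in `dd-MMM-yyyy HH:mm` form as in the sample; `$startText`, `$endText`, and the other variable names are illustrative only.

```powershell
# Sample values copied from the first row of the question's data
$startText = '01-Aug-2011 21:23'
$endText   = '01-Aug-2011 21:51'

# Parse with an explicit format and the invariant culture so that "Aug"
# is recognised regardless of the machine's regional settings
$format  = 'dd-MMM-yyyy HH:mm'
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$start   = [DateTime]::ParseExact($startText, $format, $culture)
$end     = [DateTime]::ParseExact($endText, $format, $culture)

# Subtracting two DateTime values yields a TimeSpan
$elapsed = $end - $start
$elapsed.TotalMinutes
```

For the sample row this gives 28 minutes. `Get-Date` would also parse these strings on an English-language machine, but `ParseExact` avoids depending on the current culture.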
1

This isn't an original answer, just an alternative version of Doug's that uses a regex to capture all of the data:

$html -split "`n" | where {$_ -match 'headers="(.*)" class.*>(.*)'} | 
    % { 
     @{ 
       $matches[1] = ($matches[2]).trim() 
      } 
    } 
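
The pipeline above emits one small hash table per cell. To get the start time, end time, and exit status of the most recent backup as variables, the cells can be collected into one record per row first. The following is a rough sketch rather than a definitive implementation: it assumes HOST_NAME is always the first cell of a row and that dates are in `dd-MMM-yyyy HH:mm` form, it uses a shortened copy of the sample data, and names such as `$rows`, `$current`, and `$latest` are made up for illustration.

```powershell
# Shortened sample in the same shape as the question's data (four columns only)
$html = @'
<tr><td headers="HOST_NAME" class="t13dataalt">server01
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51
<td headers="EXIT_STATUS" class="t13dataalt">0
</tr><tr><td headers="HOST_NAME" class="t13data">server02
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41
<td headers="EXIT_STATUS" class="t13data">0
'@

$culture = [System.Globalization.CultureInfo]::InvariantCulture
$format  = 'dd-MMM-yyyy HH:mm'
$rows    = @()
$current = $null

foreach ($line in ($html -split '\r?\n')) {
    if ($line -match 'headers="([^"]+)" class[^>]*>(.*)') {
        $name  = $matches[1]
        $value = $matches[2].Trim()
        # Assumption: a HOST_NAME cell marks the start of a new row
        if ($name -eq 'HOST_NAME' -or $null -eq $current) {
            if ($null -ne $current) { $rows += New-Object PSObject -Property $current }
            $current = @{}
        }
        $current[$name] = $value
    }
}
# Flush the final row, which has no following HOST_NAME to trigger it
if ($null -ne $current) { $rows += New-Object PSObject -Property $current }

# Pick the row with the latest END_TIME and expose its fields as variables
$latest     = $rows |
    Sort-Object { [DateTime]::ParseExact($_.END_TIME, $format, $culture) } |
    Select-Object -Last 1
$startTime  = [DateTime]::ParseExact($latest.START_TIME, $format, $culture)
$endTime    = [DateTime]::ParseExact($latest.END_TIME, $format, $culture)
$exitStatus = $latest.EXIT_STATUS
```

With the sample data, `$latest` is the server01 row, and `$endTime - $startTime` gives the elapsed time as a `TimeSpan`. Against the real page, the same loop could be fed from `$SPL` instead of the split here-string.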

Edit: using the code from the question:

$Yesterday = [DateTime]::Now.AddDays(-1) 
$datestr = $Yesterday.ToString("dd-MMM-yyyy") 
$WebClient = New-Object System.Net.WebClient 
$Results = $WebClient.DownloadString("http://fakeurl") 

[StringSplitOptions]$option = "None" 
[string[]]$separator = "</td>" 
$SPL = $Results.Split($separator, $option) 

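# Where-Object must receive $SPL directly from the pipeline; nested inside
# ForEach-Object it gets no input, so nothing is returned
$SPL | 
    where {$_ -match 'headers="(.*)" class.*>(.*)'} | 
     % { 
      @{ 
        $matches[1] = ($matches[2]).trim() 
       } 
     } 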

Edit 2:

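# Pipe $SPL straight into Where-Object so each element is actually tested
$SPL | 
     where {$_ -match 'headers="(.*)" class.*>(.*)'} | 
      % { 
       # The cell holds a date followed by a time, so compare only the leading date
       if (($matches[2]).trim() -like "$datestr*") { "$($matches[1]) is yesterday's back up" } 
      } 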

Thanks for all the help. I'll test this today and let you know how I get on. Can I pass the $SPL variable to the hash table instead of the string above? @Doug Finke – jok5r


Yes, that should work (I'll change the above to show how I believe it will work) – Matt


How do I expand on your answer under my original question? – jok5r
