2015-04-04 81 views
3

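Extracting information via XPath in PHP

I'm trying to extract some information from the AEC website (e.g. http://apps.aec.gov.au/eSearch/LocalitySearchResults.aspx?filter=3977&filterby=Postcode). The XPath query I'm running is "//x:tbody/x:tr/x:td[4]/x:a". I tested it in XPath Checker (a Firefox extension), where it pulls out the relevant locality data.

I then use PHP to load the page, run the query, and iterate over the results.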

$ch = curl_init(); 
$timeout = 5; 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); 
$html = curl_exec($ch); 
curl_close($ch); 

# Create a DOM parser object 
$dom = new DOMDocument(); 
libxml_use_internal_errors(true); 


$dom->loadHTML($html); 

$xpath = new DOMXpath($dom); 

$elements = $xpath->query('//tbody/tr/td[4]/a'); 


foreach ($elements as $element) { 
    echo $element; 
} 

I then get:

Warning: Invalid argument supplied for foreach() in /home/givesh5/public_html/dig/electoratesearch.php on line 41 

It appears the query is returning some kind of boolean rather than a list of matches?

The relevant markup is as follows:

<table cellspacing="0" rules="all" border="1" id="ContentPlaceHolderBody_gridViewLocalities" style="border-collapse:collapse;"> 
     <tr class="headingLink"> 
      <th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolderBody$gridViewLocalities&#39;,&#39;Sort$StateAb&#39;)">State</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolderBody$gridViewLocalities&#39;,&#39;Sort$LocalityNm&#39;)">Locality/Suburb</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolderBody$gridViewLocalities&#39;,&#39;Sort$Postcode&#39;)">Postcode</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolderBody$gridViewLocalities&#39;,&#39;Sort$DivisionNm&#39;)">Electorate</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolderBody$gridViewLocalities&#39;,&#39;Sort$DivisionNmRedistributed&#39;)">Redistributed Electorate</a></th><th scope="col">Other Locality(s)</th> 
     </tr><tr> 
      <td>VIC</td><td>BOTANIC RIDGE</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Flinders&amp;filterby=Electorate&amp;divid=211">Flinders</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>CANNONS CREEK</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Flinders&amp;filterby=Electorate&amp;divid=211">Flinders</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>CRANBOURNE</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Holt&amp;filterby=Electorate&amp;divid=216">Holt</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>CRANBOURNE EAST</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Flinders&amp;filterby=Electorate&amp;divid=211">Flinders</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>CRANBOURNE EAST</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Holt&amp;filterby=Electorate&amp;divid=216">Holt</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>CRANBOURNE NORTH</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Holt&amp;filterby=Electorate&amp;divid=216">Holt</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>CRANBOURNE SOUTH</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Flinders&amp;filterby=Electorate&amp;divid=211">Flinders</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>CRANBOURNE WEST</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Holt&amp;filterby=Electorate&amp;divid=216">Holt</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>DEVON MEADOWS</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Flinders&amp;filterby=Electorate&amp;divid=211">Flinders</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>FIVEWAYS</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Flinders&amp;filterby=Electorate&amp;divid=211">Flinders</a></td><td></td><td><a href="LocalitySearchResults.aspx?filter=DEVON+MEADOWS&amp;filterby=LocalityorSuburb&amp;state=VIC">DEVON MEADOWS</a></td> 
     </tr><tr> 
      <td>VIC</td><td>JUNCTION VILLAGE</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Flinders&amp;filterby=Electorate&amp;divid=211">Flinders</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>SANDHURST</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Isaacs&amp;filterby=Electorate&amp;divid=219">Isaacs</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>SKYE</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Dunkley&amp;filterby=Electorate&amp;divid=210">Dunkley</a></td><td></td><td>&nbsp;</td> 
     </tr><tr> 
      <td>VIC</td><td>SKYE</td><td><a href="LocalitySearchResults.aspx?filter=3977&amp;filterby=Postcode">3977</a></td><td><a href="LocalitySearchResults.aspx?filter=Isaacs&amp;filterby=Electorate&amp;divid=219">Isaacs</a></td><td></td><td>&nbsp;</td> 
     </tr> 
    </table> 
+0

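'DOMXpath' returns false if *the expression is malformed or the contextnode is invalid* – adeneo 2015-04-04 09:19:33

+0

Could you please provide the relevant part of the markup you are parsing? XPaths derived from Firefox come from the live DOM, which can contain implied markup, so obtaining them that way is unreliable. Also, what exactly are you trying to fetch? – Gordon 2015-04-04 09:30:59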

+1

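Updated the OP with the markup, thanks. In this case I'm trying to fetch the link text for each locality. For the first two rows, for example, that would be "Flinders". – Edward 2015-04-04 09:35:16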

Answers

0

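There is no tbody in the HTML.
Browsers insert tbody elements where needed, but we are not using a browser. We are using DOMDocument, which does not insert tbody elements.

Instead, the tr elements are direct children of the table: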

$elements = $xpath->query('//table/tr/td[4]/a'); 

foreach ($elements as $element) { 
    echo $dom->saveHTML($element); 
} 
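
The difference is easy to check on a small inline table. The following is a minimal sketch, assuming the libxml-based DOMDocument discussed above; the `$html` string is made-up sample markup, not the actual AEC page.

```php
<?php
// Hypothetical sample markup: a table without tbody, like the page source.
$html = '<table>'
      . '<tr><th>State</th><th>Locality</th><th>Postcode</th><th>Electorate</th></tr>'
      . '<tr><td>VIC</td><td>SKYE</td><td><a href="#">3977</a></td>'
      . '<td><a href="#">Dunkley</a></td></tr>'
      . '</table>';

$dom = new DOMDocument();
$saved = libxml_use_internal_errors(true); // keep parser notices quiet
$dom->loadHTML($html);
libxml_use_internal_errors($saved);

$xpath = new DOMXPath($dom);

// A browser-derived path that expects an implied tbody.
$viaTbody = $xpath->query('//tbody/tr/td[4]/a');
// A path that matches the markup as it was actually parsed.
$viaTable = $xpath->query('//table/tr/td[4]/a');

printf("tbody path: %d match(es)\n", $viaTbody->length);
printf("table path: %d match(es)\n", $viaTable->length);
```

With this sample the tbody path should report no matches, while the table path should find the single "Dunkley" link.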
+0

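Shouldn't // match a selection partway through the document? In that sense, if table/tr/td is a unique selector, we could omit the preceding part of the path and still reach the same information via //table/tr/td[4]. Is that not correct? – Edward 2015-04-04 09:54:19

+0

@Edward - yes, that's correct. I had just copied the path from the console, but I tested it and '//table/tr/td[4]/a' works as well. What you had, '//tbody/tr/td[4]/a', doesn't work – adeneo 2015-04-04 10:07:33

+0

Probably because there is no tbody, duh. – adeneo 2015-04-04 10:08:55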
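
The point about `//` in the comments above can be illustrated with a short sketch. It assumes a made-up sample document in which the table sits inside a wrapper div. The absolute path is only an example of what a browser console might report, not the path from the original answer.

```php
<?php
// Hypothetical sample document; the wrapper div stands in for the page layout.
$html = '<html><body><div id="wrapper"><table>'
      . '<tr><td>VIC</td><td>SKYE</td><td>3977</td>'
      . '<td><a href="#">Isaacs</a></td></tr>'
      . '</table></div></body></html>';

$dom = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($saved);

$xpath = new DOMXPath($dom);

// Absolute path, spelled out from the document root.
$absolute = $xpath->query('/html/body/div/table/tr/td[4]/a');
// Abbreviated path: "//" lets the table sit at any depth in the document.
$abbreviated = $xpath->query('//table/tr/td[4]/a');

// Both expressions should select the very same node.
$sameNode = $absolute->item(0)->isSameNode($abbreviated->item(0));
var_dump($sameNode);
```

The abbreviated form is usually less brittle, because it keeps working if the layout around the table changes.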

1

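"It appears the query is returning some kind of boolean rather than a list of matches?"

Yes, it can return a boolean, and that boolean will be FALSE. It indicates that there was an error running the xpath query. The error can be caused by either of the two parameters passed to DOMXpath::query() (see the PHP Manual): the xpath expression or the context node.

In your case you pass only one parameter, so this would indicate that the xpath expression is wrong. However, the expression you use is not wrong and should not lead to a boolean FALSE. When I run into this kind of error I suspect other errors, so perhaps the xpath object was not fully initialized. But even when I simulated a missing or partial download, I could not reproduce the error. Perhaps this differs between PHP versions? I don't know.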
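
To make the two outcomes concrete, here is a minimal sketch that contrasts a valid expression without matches against a deliberately malformed one. The sample document and the expression `//td[` are made up for illustration. The `@` operator is used only to silence the warning PHP emits for the malformed expression.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>hello</p></body></html>');
$xpath = new DOMXPath($dom);

// Valid expression, no matching nodes: an empty DOMNodeList, safe to iterate.
$noMatch = $xpath->query('//tbody/tr/td[4]/a');

// Malformed expression (unclosed predicate): query() returns false.
$broken = @$xpath->query('//td[');

var_dump($noMatch instanceof DOMNodeList); // a list object, not a boolean
var_dump($noMatch->length);                // number of matched nodes
var_dump($broken);                         // the failure value

// Guard before iterating so a failed query never reaches foreach.
foreach ([$noMatch, $broken] as $result) {
    if (false === $result) {
        echo "query failed\n";
        continue;
    }
    foreach ($result as $node) {
        echo trim($node->textContent), "\n";
    }
}
```

An expression that merely matches nothing therefore does not trigger the foreach warning. Only a failed query does.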

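As for the actual XPath expression, what adeneo and Gordon have already written applies: the <tbody> element is inserted into the DOM by Firefox, and PHP's DOMDocument behaves differently here. You could emulate Firefox here (more work), or you can simply search for the actual table elements, and then it works. A working example: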

$url = 'http://apps.aec.gov.au/eSearch/LocalitySearchResults.aspx?filter=3977&filterby=Postcode'; 

# Create a DOMDocument to parse HTML 
$doc = new DOMDocument(); 
$saved = libxml_use_internal_errors(true); 
$result = $doc->loadHTMLFile($url); 
libxml_use_internal_errors($saved); 
if (false === $result) { 
    throw new UnexpectedValueException(sprintf('Failed to create DOMDocument from url %s', var_export($url, true))); 
} 

# Create a DOMXPath to get data from HTML document 
$xpath = new DOMXpath($doc); 

$expression = '//table/tr/td[4]/a'; 
$elements = $xpath->query($expression); 
if (false === $elements) { 
    throw new UnexpectedValueException(sprintf('The xpath expression %s failed', var_export($expression, true))); 
} 

foreach ($elements as $index => $element) { 
    printf("#%02d: %s\n", $index + 1, trim($element->textContent)); 
} 

This produces the following output:

#01: Flinders 
#02: Flinders 
#03: Holt 
#04: Flinders 
#05: Holt 
#06: Holt 
#07: Flinders 
#08: Holt 
#09: Flinders 
#10: Flinders 
#11: Flinders 
#12: Isaacs 
#13: Dunkley 
#14: Isaacs
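
If the same script has to handle markup both with and without tbody, one option is a union expression that covers both layouts. The following is only a sketch under that assumption: `fourthColumnLinks()` is a hypothetical helper, and the sample rows are made up rather than taken from the live page.

```php
<?php
// Hypothetical helper: returns the trimmed link texts of the fourth column,
// whether or not the rows are wrapped in a tbody element.
function fourthColumnLinks(string $html): array
{
    $doc = new DOMDocument();
    $saved = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_use_internal_errors($saved);

    $xpath = new DOMXPath($doc);
    // Union of both layouts: tr directly under table, or under table/tbody.
    $nodes = $xpath->query('//table/tr/td[4]/a | //table/tbody/tr/td[4]/a');
    if (false === $nodes) {
        throw new UnexpectedValueException('The xpath expression failed');
    }

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$row = '<tr><td>VIC</td><td>SKYE</td><td>3977</td>'
     . '<td><a href="#">Dunkley</a></td></tr>';

$plain   = fourthColumnLinks("<table>$row</table>");
$wrapped = fourthColumnLinks("<table><tbody>$row</tbody></table>");

print_r($plain);
print_r($wrapped);
```

Both calls should yield the same single entry, so the expression does not depend on whether the source spells out tbody.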