2017-05-22

PHP XPath: get href and text node

I have a table containing some headers like this:

<TR> 
<TH CLASS="ddtitle" scope="colgroup" ><A HREF="http://foo.com">Linked text</A></TH> 
</TR> 

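The table is thousands of lines long, so I can't share it in full, but here are the table's opening tags and one complete item. Sadly, the items are not each nested in their own container, and the comments are mine, so it's a pain to work out where one item starts and ends.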

<TABLE CLASS="datadisplaytable" SUMMARY="Layout table" width="100%"><CAPTION class="captiontext">Items Found</CAPTION> 
<!-- START of first item in the table --> 
<TR> 
<TH CLASS="ddtitle" scope="colgroup" ><A HREF="http://foo.com">Linked text</A></TH> 
</TR> 
<TR> 
<TD CLASS="dddefault"> 
<SPAN class="fieldlabeltext">Term: </SPAN>Fall 
<BR> 
<SPAN class="fieldlabeltext">Registration: </SPAN>Jan 1, 2018 to Aug 1, 2018 
<BR> 
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate 
<BR> 
<BR> 
Location 
<BR> 
Lecture Schedule Type 
<BR> 
     3.000 Credits 
<BR> 
<A HREF="foo">View Entry</A> 
<BR> 
<BR> 
<TABLE CLASS="datadisplaytable" SUMMARY="Meeting time table"><CAPTION class="captiontext">Scheduled Meeting Times</CAPTION> 
<TR> 
<TH CLASS="ddheader" scope="col" >Type</TH> 
<TH CLASS="ddheader" scope="col" >Time</TH> 
<TH CLASS="ddheader" scope="col" >Days</TH> 
<TH CLASS="ddheader" scope="col" >Where</TH> 
<TH CLASS="ddheader" scope="col" >Date Range</TH> 
<TH CLASS="ddheader" scope="col" >Schedule Type</TH> 
<TH CLASS="ddheader" scope="col" >Instructors</TH> 
</TR> 
<TR> 
<TD CLASS="dddefault">Lecture</TD> 
<TD CLASS="dddefault">9:20 am - 10:10 am</TD> 
<TD CLASS="dddefault">MWF</TD> 
<TD CLASS="dddefault">Some Building Room 101</TD> 
<TD CLASS="dddefault">Aug 1, 2018 - Dec 1, 2018</TD> 
<TD CLASS="dddefault">Lecture</TD> 
<TD CLASS="dddefault">Instructor Name (<ABBR title= "Primary">P</ABBR>)<A HREF="mailto:[email protected]" target="Instructur Name" ><IMG SRC="/wtlgifs/email.png" ALIGN="middle" ALT="E-mail" CLASS="headerImg" TITLE="E-mail" NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=16 WIDTH=16></A></TD> 
</TR> 
</TABLE> 
<BR> 
<BR> 
</TD> 
</TR> 
<!-- END first item in the table --> 

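I want to extract the item details, starting with the course name (the text content, "Linked text," inside th.ddtitle) and the course link (the A HREF inside th.ddtitle). Here is what I've tried to grab those two items: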

$dom = new DOMDocument(); 
$myHtml = file_get_contents(__DIR__.'/myfile.html'); 
$dom->loadHTML($myHtml); 
$xpath = new DOMXPath($dom); 
// first part changes an outer table with the same class, so I can get inner tables without the outer one 
$tables = $xpath->query("//table[@class='datadisplaytable']"); 
for($i=0; $i<1; $i++) { 
    $tables[$i]->setAttribute('class', 'masterTable'); 
} 
$html = $dom->saveHTML(); 
// now, the query I'm having trouble with: 
$textAndLink = $xpath->query("//th[@class='ddtitle']/*"); 
$i=1; 
foreach($textAndLink as $info) { 
    foreach($info->childNodes as $child) { 
     if($i%2 == 0) { 
      echo $child->getAttribute('href') . '<br>'; 
     } else { 
      echo $child->nodeValue . '<br>'; 
     } 
    } 
    $i++; 
} 

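I also tried print_r($child), and the only items displayed are text nodes, with no <a> tags. How can I get the anchor's "href" attribute and its text content? What I expect from the code above is a list like this: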

http://foo.com/<br> 
Linked text<br> 
http://foo.com/secondlink<br> 
Second linked text<br> 

And so on.

Can you share your complete HTML string and your expected output? –

Edited above to share more HTML and the exact expected output. – WebElaine

You only want to get 'http://foo.com', right? –

Answer

Try this code snippet:

<?php 

ini_set('display_errors', 1); 
$string = ' 
<TABLE CLASS="datadisplaytable" SUMMARY="Layout table" width="100%"><CAPTION class="captiontext">Items Found</CAPTION> 
<!-- START of first item in the table --> 
<TR> 
<TH CLASS="ddtitle" scope="colgroup" ><A HREF="http://foo.com">Linked text</A></TH> 
</TR> 
<TR> 
<TD CLASS="dddefault"> 
<SPAN class="fieldlabeltext">Term: </SPAN>Fall 
<BR> 
<SPAN class="fieldlabeltext">Registration: </SPAN>Jan 1, 2018 to Aug 1, 2018 
<BR> 
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate 
<BR> 
<BR> 
Location 
<BR> 
Lecture Schedule Type 
<BR> 
     3.000 Credits 
<BR> 
<A HREF="foo">View Entry</A> 
<BR> 
<BR> 
<TABLE CLASS="datadisplaytable" SUMMARY="Meeting time table"><CAPTION class="captiontext">Scheduled Meeting Times</CAPTION> 
<TR> 
<TH CLASS="ddheader" scope="col" >Type</TH> 
<TH CLASS="ddheader" scope="col" >Time</TH> 
<TH CLASS="ddheader" scope="col" >Days</TH> 
<TH CLASS="ddheader" scope="col" >Where</TH> 
<TH CLASS="ddheader" scope="col" >Date Range</TH> 
<TH CLASS="ddheader" scope="col" >Schedule Type</TH> 
<TH CLASS="ddheader" scope="col" >Instructors</TH> 
</TR> 
<TR> 
<TD CLASS="dddefault">Lecture</TD> 
<TD CLASS="dddefault">9:20 am - 10:10 am</TD> 
<TD CLASS="dddefault">MWF</TD> 
<TD CLASS="dddefault">Some Building Room 101</TD> 
<TD CLASS="dddefault">Aug 1, 2018 - Dec 1, 2018</TD> 
<TD CLASS="dddefault">Lecture</TD> 
<TD CLASS="dddefault">Instructor Name (<ABBR title= "Primary">P</ABBR>)<A HREF="mailto:[email protected]" target="Instructur Name" ><IMG SRC="/wtlgifs/email.png" ALIGN="middle" ALT="E-mail" CLASS="headerImg" TITLE="E-mail" NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=16 WIDTH=16></A></TD> 
</TR> 
</TABLE> 
<BR> 
<BR> 
</TD> 
</TR>'; 

$domDocument = new DOMDocument(); 
$domDocument->loadHTML($string); 

$domXPath = new DOMXPath($domDocument); 
$results = $domXPath->query('//tr/th[@class="ddtitle"]/a'); 
foreach($results as $result) 
{ 
    // $result is the <a> element itself: print its text, then its href, one per line.
    echo trim($result->textContent), PHP_EOL; 
    echo $result->getAttribute("href"), PHP_EOL; 
} 
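
As for why the loop in the question only ever saw text nodes: the query `//th[@class='ddtitle']/*` already selects the `<a>` elements, so iterating over `childNodes` steps down into the text inside each anchor, and a `DOMText` node has no `getAttribute()`. A minimal sketch of reading both values from the matched element instead, assuming a hypothetical two-item sample (the second link is made up to mirror the expected output in the question):

```php
<?php
// Hypothetical two-item sample; the real table is far longer.
$html = '<table class="datadisplaytable">
<tr><th class="ddtitle" scope="colgroup"><a href="http://foo.com/">Linked text</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="http://foo.com/secondlink">Second linked text</a></th></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$lines = [];
// "/*" matches the element children of th.ddtitle, so $node is already the <a>.
foreach ($xpath->query("//th[@class='ddtitle']/*") as $node) {
    // Read both values from the element, not from its child text node.
    $lines[] = $node->getAttribute('href');
    $lines[] = trim($node->textContent);
}

// Emit each value followed by <br>, href first and then the link text.
echo implode("<br>\n", $lines), "<br>\n";
```

This keeps the question's own query and only drops the inner `childNodes` loop and the `$i % 2` bookkeeping, which are no longer needed once both values come from the same node.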
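
It works if I hard-code $string, but if I pull in the full HTML file it throws an error: 'PHP Fatal error: Uncaught Error: Call to a member function getAttribute() on null (filename):46'. Line 46 is the last line, the one that tries to print_r the href attribute. Could it be because the actual URL has a query string? The actual URL, rather than http://foo.com, is – WebElaine

@WebElaine I've updated my post; it now prints only when both the string and the result are available. One possible cause is content being added to the HTML by external JS. –

That did the trick. The source HTML is a nightmare; I wish I could control it, but it comes from an external system, which is why I'm parsing it in the first place. Thank you very much for your help! – WebElaine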
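
The updated answer code is not shown in the thread, so the following is only a sketch of what such a guard might look like, not the answerer's actual fix. It assumes the "on null" error came from a title cell without an anchor, which is a guess. The helper name `extractTitleLinks`, its return shape, and the sample markup are all hypothetical. The sample also includes a link with a query string, since that was one of the suspected causes.

```php
<?php
/**
 * Returns [text, href] pairs for every th.ddtitle cell that contains a link.
 * Hypothetical helper; the name and return shape are illustrative only.
 */
function extractTitleLinks(string $html): array
{
    // Collect parser warnings instead of printing them; messy markup is expected.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $pairs = [];
    foreach ($xpath->query('//th[@class="ddtitle"]') as $cell) {
        // First anchor inside this title cell; item(0) returns null if there is none.
        $anchor = $xpath->query('.//a[@href]', $cell)->item(0);
        if (!$anchor instanceof DOMElement) {
            continue; // skip title cells without a link
        }
        $pairs[] = [trim($anchor->textContent), $anchor->getAttribute('href')];
    }
    return $pairs;
}

// Assumed sample: one linked title (with a query string) and one unlinked title.
$sample = '<table class="datadisplaytable">
<tr><th class="ddtitle"><a href="http://foo.com/?term=fall&amp;id=1">Linked text</a></th></tr>
<tr><th class="ddtitle">Title without a link</th></tr>
</table>';

foreach (extractTitleLinks($sample) as [$text, $href]) {
    echo $text, ' => ', $href, PHP_EOL;
}
```

In this sketch the query string comes through `getAttribute()` unchanged apart from entity decoding, which suggests the query string itself is unlikely to be the culprit. For the real page, passing `file_get_contents(__DIR__ . '/myfile.html')` to the helper would be the obvious next step, assuming the file sits next to the script.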