How to parse text and images from complex XML

I hope you can help me. The XML file looks like this:

<channel><item> 
<description> 
<div> <a href="http://image.com"> 
<span> 
<img src="http://image.com" /> 
</span> 
</a> 
Lorem Ipsum is simply dummy text of the printing etc... 
</div> 
</description> 
</item></channel>

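I can get the content of the description tag, but when I do, I get the whole structure with a lot of CSS in it, which I don't want. All I really need is the href link and the Lorem Ipsum text. I am trying SimpleXML, but I can't find a way to do it; it seems too complicated. Any ideas?

Edit: the code I use to parse the XML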

$file = new SimpleXMLElement($mydata);

foreach ($file->channel->item as $post) {
    echo $post->description;
}

2013-01-13 pano

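I also tried using `attributes()` to get the attributes, but I couldn't make it work. The description tag has no attributes, but it contains more tags such as div, a and img. I can't get the attributes of the 'a' and 'img' tags with SimpleXML alone. – pano

This is the final code that answers the question.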

$xml = simplexml_load_file('myfile.xml');

$descriptions = $xml->xpath('//item/description');

foreach ($descriptions as $description_node) {
    // Parse the HTML held in the description as a document of its own
    $description_dom = new DOMDocument();
    $description_dom->loadHTML((string)$description_node);

    $description_sxml = simplexml_import_dom($description_dom);

    $imgs = $description_sxml->xpath('//img');
    $text = $description_sxml->xpath('//div');

    // Print the src attribute of every image
    foreach ($imgs as $image) {
        echo (string)$image['src'];
    }
    // Print the direct text of every div
    foreach ($text as $t) {
        echo (string)$t;
    }
}

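This is IMSoP's code; I added $text = $description_sxml->xpath('//div'); to read the text inside the <div>.

In my case, some posts in the XML have several <div> and <span> tags, so to parse all of them I may have to add another ->xpath for <span>, and perhaps an if...else statement so that when there is nothing inside the <div>, the <span> content is used instead. Thanks for your replies.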
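
The div-then-span fallback described above could look roughly like the sketch below. The helper name `firstTextFrom` and the sample markup are made up for illustration and are not part of the original code; the sketch only looks at text sitting directly inside each tag.

```php
<?php
// Hypothetical helper (not from the original post): returns the direct text
// of the first non-empty <div>, falling back to the first non-empty <span>.
function firstTextFrom(SimpleXMLElement $root): ?string
{
    foreach (['//div', '//span'] as $query) {
        foreach ($root->xpath($query) as $node) {
            $text = trim((string) $node);
            if ($text !== '') {
                return $text;
            }
        }
    }
    return null; // neither tag held any direct text
}

// Made-up sample: the <div> holds only a link, so the <span> text is used.
$dom = new DOMDocument();
$dom->loadHTML('<div><a href="http://image.com"></a></div><span>Lorem Ipsum</span>');
$sxml = simplexml_import_dom($dom);

$fallbackText = firstTextFrom($sxml);
echo $fallbackText, "\n";
```

Trying every `<div>` before any `<span>` keeps the preference order explicit; if the real feed nests the tags differently, the list of queries is the part to adjust.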

2013-01-14 20:35:25 pano

For encoding problems when parsing XML this way, see also this [post](http://stackoverflow.com/questions/14336412/convert-parsed-text-with-php-to-utf-8) – pano
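
One commonly used workaround for this kind of garbled output is to declare the charset inside the HTML string itself, because `loadHTML()` does not take its input encoding from the `DOMDocument` constructor arguments. The sketch below assumes the description text is UTF-8 and that the PHP file is saved as UTF-8; the sample string is made up.

```php
<?php
// Assumption: $html is UTF-8 text, like the string cast of a <description> node.
$html = '<div><a href="http://image.com"></a>Καλημέρα κόσμε</div>';

// A meta tag in front of the fragment tells the HTML parser which charset to use.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($meta . $html);

// Read the text back from the parsed <div>.
$div = $dom->getElementsByTagName('div')->item(0);
$greekText = trim($div->textContent);
echo $greekText, "\n";
```

The linked post discusses the same problem; this is only one of the possible fixes.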

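That is going to be complicated. ~~You don't have XML, you have HTML. One difference is that in XML a tag can't contain both another tag and some text. That's why~~ I would use PHP's DOM (I haven't used it yet, but it is similar to plain JavaScript).

Here is what I hacked together (untested):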

// first create our document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML("your html here"); // there is also a loadHTMLFile

// tries to find an <a> element that has an href and returns that href
function getAHref($doc) {
    // loop over all <a> elements to find one with an href
    $aElements = $doc->getElementsByTagName("a");
    foreach ($aElements as $a) {
        // does this element have an href? then return it
        if ($a->hasAttribute("href")) {
            return $a->getAttribute("href");
        }
    }
    // nothing found: return false
    return false;
}

// tries to get the text in the node
// in your example the text isn't wrapped in anything, so this is going to be difficult
function getTextFromNode($doc) {
    // loop over all divs (assuming the text is always a direct child of a div)
    $divs = $doc->getElementsByTagName("div"); // do we know it's always in that div?
    foreach ($divs as $div) {
        // loop over the child nodes to find the text nodes
        foreach ($div->childNodes as $child) {
            // is this a text node?
            if ($child->nodeType === XML_TEXT_NODE) {
                // is there something in it? (new lines count as text nodes)
                if (trim($child->nodeValue) !== "") {
                    // *phew* got it
                    return $child->nodeValue;
                }
            }
        }
    }
    // nothing found: return false
    return false;
}
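
As an alternative sketch of the same two lookups, `DOMXPath` can express both loops as queries. This is not the code above, only a shorter variant under the same assumptions (the text is a direct child of a `<div>`); the sample markup is taken from the question.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<div><a href="http://image.com"><span><img src="http://image.com" /></span></a>'
    . ' Lorem Ipsum is simply dummy text </div>'
);

$xpath = new DOMXPath($doc);

// First <a> element that carries an href attribute.
$link = $xpath->query('//a[@href]')->item(0);
$href = $link ? $link->getAttribute('href') : false;

// First non-blank text node that is a direct child of a <div>.
$textNode = $xpath->query('//div/text()[normalize-space()]')->item(0);
$text = $textNode ? trim($textNode->nodeValue) : false;

echo $href, "\n", $text, "\n";
```

Both lookups fall back to `false` when nothing matches, mirroring the return values of the functions above.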

2013-01-13 01:52:29 Nemo64

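Thank you for your time. I tried your script on both the example above and the actual XML file, but I didn't get any results. Instead I get an error saying "Cannot redeclare getText()" on the last line. – pano

@pano The error message says what is wrong. PHP has a built-in function called getText, which I didn't know about. – Nemo64

“A tag can't contain both another tag and some text in XML”? ' bar bob' is perfectly valid XML. – IMSoP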

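This XML looks a lot like an RSS or Atom feed (or an extract from one). The description node is usually escaped, or placed inside a section marked <![CDATA[ ... ]]>, which indicates that its contents are to be treated as raw text even if they contain <, > or &.

Your sample doesn't show it, but if your echo is giving you the whole content, including the img tag and so on, then that is what is happening, and your problem is similar to "Trying to Parse Only the Images from an RSS Feed": you need to grab the whole description content and parse it as a document of its own.

If for some reason the HTML is not escaped, and really is included as a set of child nodes in the XML, then the link's URL can be accessed directly (assuming the structure is always consistent):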
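
To illustrate the CDATA case, here is a minimal sketch with a made-up feed fragment (the `$feed` string is an assumption, not the asker's real data). SimpleXML sees no child elements under the description, while the string cast returns the raw markup, which can then be parsed on its own.

```php
<?php
// Made-up feed fragment: the HTML sits inside CDATA, so it is text, not child nodes.
$feed = <<<XML
<rss><channel><item>
<description><![CDATA[<div><a href="http://image.com"><img src="http://image.com" /></a>Lorem Ipsum</div>]]></description>
</item></channel></rss>
XML;

$rss = new SimpleXMLElement($feed);
$description = $rss->channel->item->description;

// From the XML parser's point of view, the description has no child elements.
$childCount = count($description->children());

// The string cast returns the raw markup, which is parsed as a separate document.
$dom = new DOMDocument();
$dom->loadHTML((string) $description);
$href = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');

echo $childCount, "\n", $href, "\n";
```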

echo (string)$post->description->div->a['href'];

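As for the text, SimpleXML will concatenate all the text content of a particular element (but not the text inside its children) if you "cast it to string" with (string). echo casts to string automatically, but I guess you'll eventually want to do something other than echo.

In your example, the text you want is inside the first (and only) div, so this would display it: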

echo (string)$post->description->div;

However, you mention "a lot of CSS", which I guess you have left out of the sample for simplicity, so I'm not sure how consistent your real content is.
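
If the real content wraps parts of the text in nested tags, the string cast described above will skip those parts. A minimal sketch of the difference, using a made-up element and assuming the `dom` extension is available for `dom_import_simplexml()`:

```php
<?php
// Made-up element with text both directly inside it and inside a child tag.
$element = new SimpleXMLElement('<div>Lorem <span>Ipsum</span> dolor</div>');

// The string cast only covers text that sits directly inside the <div>.
$direct = (string) $element;

// The DOM textContent property also includes the text of nested elements.
$all = dom_import_simplexml($element)->textContent;

echo $direct, "\n", $all, "\n";
```

Which of the two is appropriate depends on whether the nested tags carry text you want to keep.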

2013-01-13 17:59:40 IMSoP

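Yes, there are a lot of style attributes in there. There are also many span tags in the text, and some divs, which I believe may cause some problems. I saw your other post, http://stackoverflow.com/questions/14246656/trying-to-parse-only-the-images-from-an-rss-feed?lq=1, and it seems to work for me, because it gets all the image links. As for the text, it does give me some results (after a little modification), but the problem is that I can't read them, because of wrong encoding or something else. I wrote `new DOMDocument('1.0', 'UTF-8')`, but that didn't work. I get things like ÎμμμÎÎÎÎ. – pano

It works for the images, and probably for the text too. Encoding is still a problem (the text is Greek). I think it is caused by the way I'm trying to get `$description_dom`. I'll post the final code. – pano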
