2016-01-16 20 views

I am using the =importXML function in Google Spreadsheets to pull information from different websites. I am trying to use XPath to get the text inside the HTML5 <article> tag.

This is the source data:

<div id="blog-post-body-ad" class="ad"> 
    </div> 

    <article class="blog-post-body"> 
     <p>Fox&#39;s <em>X-Men </em>drama <em>Hellfire </em>is making a change at the top.</p> 
<p>Writers Evan Katz and Manny Coto, who co-created the drama, are exiting, <em>The Hollywood Reporter </em>has learned. Also out are Patrick McKay and John D. Payne, who came up the the story for the drama alongside Katz and Coto and were set to pen the script. A search is under way for a new writer.</p> 
<p>The changes come as <em>Hellfire </em>is on a slower development track, insiders say. <em>Hellfire, </em>which previously was&nbsp;<a href="http://www.hollywoodreporter.com/live-feed/fox-nears-deal-x-men-813542">considered a live-action&nbsp;<em>X-Men</em></a>, follows a young special agent who learns that a power-hungry woman with extraordinary abilities is working with a clandestine society of millionaires &mdash; known as &quot;The Hellfire Club&quot; &mdash; to take over the world.</p> 
<p> 
    <div class="embedded-content" data-nid="832221" data-nodetype="blog" data-template="readmore"> 
     <script type="application/json"> 
     { 
      "nid": 832221, 
      "type": "blog", 
      "title": "Marvel Sets &#039;Legion&#039; Pilot With Noah Hawley at FX, Readying &#039;Hellfire&#039; for Fox", 
      "path": "http://www.hollywoodreporter.com/live-feed/marvel-legion-noah-hawley-fx-832221", 
      "relative-path": "/live-feed/marvel-legion-noah-hawley-fx-832221" 
     } 
     </script> 
    </div></p> 
<p>Sources say the <em>X-Men </em>drama is not likely to go to pilot this season as it remains on a slower track. The change comes as Katz and Coto are shifting their focus to Fox&#39;s <em><a href="http://www.hollywoodreporter.com/live-feed/fox-greenlights-prison-break-event-856203" target="_blank">24: Legacy</a>, </em>which received a formal pilot order Friday during Fox&#39;s time in front of the press at the Television Critics Association&#39;s winter press tour. The new take on 24 will feature an entirely new cast with a diverse lead as Fox has high hopes to reboot the franchise for a new era.</p> 
<p>The change at the top should not worry diehard fans of the <em>X-Men </em>franchise. Sources say Fox remains committed to <em>Hellfire </em>and wants to get it completely right as the <em>X-Men </em>franchise remains a valuable asset for the company. Should <em>Hellfire</em> go to series and the network renew Batman prequel <em>Gotham, </em>the network would have dramas from both comic book powerhouses DC Comics and Marvel &mdash; a first for a broadcast network and something insiders would love to see on their schedule.</p> 
<p>&nbsp;</p> 

     <footer class="blog-post-tags"> 
          <a href="/topic/tv-development" data-tracklabel="Story Well - Bottom Tags TV Development">TV Development</a> 
        </footer> 
    </article> 

    <div class="blog-post-footer-ad"> 

Using Google Chrome > Inspect > Copy XPath, I got:

//*[@id="page-content"]/div[1]/article 

I tried it, but Google Sheets gave me a parse error.

I also tried a solution from another Stack Overflow question, but it did not work for me:

=importXML(C2,"//article[contains(concat('', normalize-space(@class), ''), '')//div[@class='blog-post-body']]") 

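What I am trying to do is get all the text inside the <article> tag. A big plus would be to get the text in the middle of the article as well, with or without the <div class="embedded-content">.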

+0

What's wrong with `//article[@class='blog-post-body']`? In any case, you cannot use XPath to exclude descendants of the selected nodes. *XPath* selects; *XSLT* transforms. – kjhughes
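To illustrate the point about selection: an XPath that selects the <article> element always brings all of its descendants with it. One workaround is to select the text nodes instead, for example `//article[@class='blog-post-body']//text()[not(ancestor::div[@class='embedded-content'])]`, which is valid XPath 1.0, though whether IMPORTXML handles it for this page has not been verified here. Outside Sheets, the same filtering can be done in code. The following is a minimal sketch in Python using only the standard library; `SAMPLE` is a hypothetical, simplified, well-formed stand-in for the page markup, and `text_without` is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified, well-formed stand-in for the real page markup.
SAMPLE = """
<div id="page-content">
  <article class="blog-post-body">
    <p>First <em>paragraph</em>.</p>
    <div class="embedded-content"><script>{"nid": 832221}</script></div>
    <p>Second paragraph.</p>
  </article>
</div>
"""


def text_without(element, skip_class):
    """Collect all text under element, skipping subtrees whose class equals skip_class."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        if child.get("class") != skip_class:
            parts.append(text_without(child, skip_class))
        # The tail is text that follows the child, so it belongs to the parent.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


root = ET.fromstring(SAMPLE)
# Selecting the element is the XPath part; dropping descendants happens in code.
article = root.find(".//article[@class='blog-post-body']")
# Collapse runs of whitespace left over from the markup indentation.
article_text = " ".join(text_without(article, "embedded-content").split())
print(article_text)
```

The XPath only locates the article; the exclusion is done while walking the tree, which matches the "XPath selects" remark above.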

+0

Thanks for your help @klhughes! Using `=importXML(C2,"//article[@class='blog-post-body']")`, Google Sheets throws the error #N/A "Imported content is empty". – Peter

+1

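As far as I can see, you are doing it right, but 1) Google Sheets' `IMPORTXML` function is exceptionally buggy, and 2) XPath was designed for XML. If the HTML structure does not follow the rules of XML, it is unclear what an XPath expression should return. –

Answers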
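The XML-versus-HTML point can be checked on a small scale. The source data uses `&nbsp;`, which is an HTML entity that a strict XML parser does not know, and it places a `<div>` inside a `<p>`, which HTML parsers typically repair by closing the paragraph early. The sketch below, in Python with the standard library only, compares a strict XML parse with a lenient HTML parse of one fragment; `is_well_formed_xml` and `lenient_text` are hypothetical helper names, and nothing here reflects how IMPORTXML itself parses pages.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class _TextCollector(HTMLParser):
    """Lenient HTML parser that only accumulates text content."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_text(markup):
    """Return the text of an HTML fragment, decoding HTML entities."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


html_fragment = "<p>was&nbsp;considered</p>"    # as in the source data
xml_fragment = "<p>was&#160;considered</p>"     # numeric reference, valid XML

print(is_well_formed_xml(html_fragment))
print(is_well_formed_xml(xml_fragment))
print(repr(lenient_text(html_fragment)))
```

The HTML entity is rejected by the strict parser while the numeric reference is accepted, which is one concrete way a real-world page can fall outside what XPath tooling built for XML expects.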

1

This formula works for the article:

=concatenate(IMPORTXML("http://www.hollywoodreporter.com/live-feed/foxs-x-men-spinoff-showrunners-856338","//p[3] | //p[4] | //p[5] | //p[6] "))
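Note that `//p[3]` does not mean "the third paragraph on the page". It matches every <p> that is the third <p> child of its own parent, so the formula depends on where the other paragraphs sit in the page. The sketch below shows that behavior in Python with the standard library, whose `ElementTree` supports this positional predicate; the markup is a hypothetical miniature page, not the real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature page: paragraphs live under two different parents.
PAGE = """
<body>
  <div id="sidebar"><p>s1</p><p>s2</p></div>
  <article class="blog-post-body"><p>a1</p><p>a2</p><p>a3</p></article>
</body>
"""

root = ET.fromstring(PAGE)

# Positional predicate: the second <p> child of *each* parent, not of the page.
second = [p.text for p in root.findall(".//p[2]")]

# Only the article has a third <p> child.
third = [p.text for p in root.findall(".//p[3]")]

# Scoping the path to the article avoids counting paragraphs elsewhere.
scoped = [p.text for p in root.findall(".//article[@class='blog-post-body']/p")]

print(second)
print(third)
print(scoped)
```

If the page layout changes, positional paths such as `//p[3]` may start matching different paragraphs, so a path scoped to the article is usually less fragile where the importer supports it.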