Scraping a website with R

<div data-projects-path="/pt/projects" id="explore_results"> 
    <div class="results"> 
    <div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork"> 
     <meta content="2014-08-30" itemprop="dateCreated"> 
     <div class="image"> 
     <a href="/pt/ospassosdabia" target="" title="Os passos da Bia"> 
      <img alt="Project thumb bia" height="172" src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7229/project_thumb_Bia.png" width="220"> 
     </a> 
    <div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork"> 
     <meta content="2014-09-19" itemprop="dateCreated"> 
     <div class="image"> 
     <a href="/pt/livrepartida" target="" title="Livre Partida"> 
      <img alt="Project thumb logo colorido" height="172" src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7613/project_thumb_logo_colorido.jpg" width="220"> 
     </a>

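This is the HTML I want to scrape with R. I only need the "/pt/...." paths from it; in the code above, /pt/livrepartida and /pt/ospassosdabia are two examples.

When I scroll down the web page, more HTML like this is loaded, and with it more paths of the same form ("/pt/....").

I want to get all of these "/pt/...." paths from the site. How can I do this?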

Source

2014-10-09 Gabriel

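Could you post an example with several '/pt/..' entries? That would help with testing. – akrun 2014-10-09 15:14:08

Please look at my question again. The '/pt/...' entries are the ones in the code above. But each '/pt/..' entry has a deadline, and new '/pt/....' entries are added every day; I want to get them. – Gabriel 2014-10-09 16:07:54

When I use the code, I get: unname(xpathSApply(doc1, "//a/@href")) #[1] "/pt/ospassosdabia" "/pt/livrepartida" – akrun 2014-10-09 16:25:26

Try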

library(XML) 

# Sample data: the truncated HTML from the question, read line by line
lines <- readLines(textConnection('<div data-projects-path="/pt/projects" id="explore_results"> 
<div class="results"> 
<div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork"> 
<meta content="2014-08-30" itemprop="dateCreated"> 
<div class="image"> 
<a href="/pt/ospassosdabia" target="" title="Os passos da Bia"> 
<img alt="Project thumb bia" height="172" 
    src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7229/project_thumb_Bia.png" 
    width="220"> 
    </a>'))

# Parse the HTML and extract the href attribute of every <a> element
doc1 <- htmlParse(lines) 
unname(xpathSApply(doc1, "//a/@href")) 
#[1] "/pt/ospassosdabia" 

Source

2014-10-09 07:51:14 akrun
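
Note: the same XPath approach extends to HTML that contains several project links. The following is a minimal sketch, not part of the original answer. The `html_text` sample is a shortened, hypothetical stand-in for the page markup, and the `starts-with()` filter assumes that every wanted link begins with "/pt/".

```r
library(XML)

# Hypothetical sample: two project boxes plus one unrelated link
html_text <- '<div id="explore_results">
  <div class="project-box"><a href="/pt/ospassosdabia" title="Os passos da Bia">Bia</a></div>
  <div class="project-box"><a href="/pt/livrepartida" title="Livre Partida">Livre Partida</a></div>
  <a href="http://example.com/other">unrelated link</a>
</div>'

doc <- htmlParse(html_text, asText = TRUE)

# Keep only the href attributes whose value starts with "/pt/"
links <- as.character(unname(
  xpathSApply(doc, "//a[starts-with(@href, '/pt/')]/@href")
))
print(links)
```

Filtering inside the XPath expression keeps unrelated anchors out of the result, so no extra `grepl()` pass is needed afterwards.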

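It works. But there are other, similar blocks of HTML, and I want the same information from those nodes too. How can I do this? – Gabriel 2014-10-09 13:15:20

On the page [link](http://www.catarse.me/pt/explore#in_funding) there are many projects, and each project's page URL has the same structure, like [link](http://www.catarse.me/pt/musicasderadio) – Gabriel 2014-10-09 17:07:56

On the page http://www.catarse.me/pt/explore#in_funding there are many projects, and each project's page URL has the same structure, like this: http://www.catarse.me/pt/musicasderadio. I only want the '/pt/..' part associated with '', which in this example is 'pt/musicasderadio'. With your code I cannot get those '/pt/..' paths; for example, it does not return both '/pt/ospassosdabia' and '/pt/livrepartida'. – Gabriel 2014-10-09 17:14:07

You should provide better-formatted HTML than this truncated snippet. Fortunately, htmlParse can parse this kind of broken markup.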

library(XML) 

dd <- htmlParse(your_text,asText=TRUE)

Then get the href attributes:

xpathSApply(dd,'//a',xmlGetAttr,'href') 
[1] "/pt/ospassosdabia"

Source

2014-10-09 07:54:34 agstudy

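This works too. But there are other, similar blocks of HTML, and I want the same information from those nodes. When I scroll down the web page, the information I want appears in more blocks of similar code. How can I do this? – Gabriel 2014-10-09 14:01:20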
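
Note: content that only appears on scrolling is usually not in the HTML returned by the first request, so each additional batch has to be obtained separately and parsed in turn. The thread does not show how the site serves those batches, so the sketch below leaves that step out and only covers the combining step. `extract_project_links()` and `page_chunks` are hypothetical names, and the sample chunks are made up for illustration.

```r
library(XML)

# Return the project links found in one chunk of HTML text.
# Assumes each project sits in a <div class="project-box">, as in the question.
extract_project_links <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  hrefs <- xpathSApply(doc, "//div[@class='project-box']//a/@href")
  as.character(unname(hrefs))
}

# Hypothetical chunks: in real use, each element would be the HTML of one
# batch of results, however the site delivers it.
page_chunks <- c(
  '<div class="project-box"><div class="image">
     <a href="/pt/ospassosdabia" title="Os passos da Bia">Bia</a></div></div>',
  '<div class="project-box"><div class="image">
     <a href="/pt/livrepartida" title="Livre Partida">Livre Partida</a></div></div>'
)

# Parse every chunk and merge the links, dropping duplicates
all_links <- unique(unlist(lapply(page_chunks, extract_project_links)))
print(all_links)
```

Keeping the extraction in one small function means the same code applies to every batch, whatever mechanism is used to download it.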
