2017-08-03

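Reading PhantomJS-rendered HTML into R

Problem: using rvest, I cannot find the block of information I need in an HTML page that I rendered through PhantomJS. I have tried nearly every selector format I can think of, but I cannot get html_node to select the correct block.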

The HTML as rendered by PhantomJS:

<div class="page"> 

<div class="main-header">  
</script> 

    <div id="listing-703036966" class="shop-srp-listings__listing"> 
     <div class="card listing-row--search hide-fade"> 

      <div class="listing-row__main"> 
       <div class="listing-row__image"> 

        <div class="media-count shadowed"> 
         <a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--photo" data-goto-vdp="703036966" data-standard-link="md-thumb"> 
          25 Photos 
         </a> 

          <a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--video" data-goto-vdp="703036966" data-standard-link="md-thumb"> 
           1 Video 
          </a> 
        </div> 

        <a href="/vehicledetail/detail/703036966/overview/" target="_self" class="gray-bg listing-row__photo" data-goto-vdp="703036966" data-standard-link="md-thumb"> 
         <img alt="New 2018 BMW 750 i" src="https://www.cstatic-images.com/phototab/e/1/4/e2/f87fb57ec51cab4f57cbaeb9f9f.jpg" onload="window.performance.mark('serverSideFirstPhotoLoaded')"> 
        </a> 
        <div class="compare-srp"> 
         <div class="listing-row__save"> 
          <a id="703036966" class="switch-favorite unsaved saveVehicleHeart compare-switch-favorite" savedfeatureinstance="" vehicle="{&quot;listingId&quot;:703036966,&quot;mkId&quot;:20005,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;mdId&quot;:20536,&quot;mdNm&quot;:&quot;750&quot;,&quot;trimId&quot;:25905,&quot;trimName&quot;:&quot;i&quot;,&quot;modelYearId&quot;:35797618,&quot;modelYear&quot;:2018,&quot;stkTyp&quot;:&quot;New&quot;,&quot;state&quot;:&quot;NC&quot;,&quot;zipcode&quot;:&quot;27107&quot;}" cars-common-omniture-custom="" omniture-events=""> 
           <div class="save-icon-wrapper"> 
            <div class="cui-icon icon-heart-line"> 
             <svg width="16" height="16" class="icon-image"> 
              <use xlink:href="#cui-icon-heart-outline"></use> 
             </svg> 
            </div> 

            <div class="cui-icon icon-heart"> 
             <svg width="16" height="16" class="icon-image"> 
              <use xlink:href="#cui-icon-heart-fill"></use> 
             </svg> 
            </div> 
           </div> 

           <p class="saved-label">Save</p> 
          </a> 
         </div> 
         <div class="compare-button" data-compare-listing="703036966"> 
          <div class="compare-icon-wrapper"> 
           <div class="cui-icon icon-plus-sign"> 
            <svg width="16" height="16" class="icon-plus-sign"> 
             <use xlink:href="#cui-icon-plus-sign"></use> 
            </svg> 
           </div> 
           <div class="cui-icon icon-checkmark"> 
            <svg width="16" height="16" class="icon-checkmark"> 
             <use xlink:href="#cui-icon-checkmark"></use> 
            </svg> 
           </div> 
          </div> 
          <p class="compare-button__label compare">Compare</p> 
          <p class="compare-button__label added">Added</p> 
         </div> 
        </div> 
       </div> 

What I have done in R so far:

library(rvest) 
library(stringr) 
library(plyr) 
library(dplyr) 
library(ggvis) 
library(knitr) 
library(tidyverse) 

cars <- read_html("my file.html") %>% 
    html_nodes("div") %>% 
    html_text() 

However, when I inspect the cars vector, the block I need is missing entirely. That block is:

<a id="703036966" class="switch-favorite unsaved saveVehicleHeart compare-switch-favorite" savedfeatureinstance="" vehicle="{&quot;listingId&quot;:703036966,&quot;mkId&quot;:20005,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;mdId&quot;:20536,&quot;mdNm&quot;:&quot;750&quot;,&quot;trimId&quot;:25905,&quot;trimName&quot;:&quot;i&quot;,&quot;modelYearId&quot;:35797618,&quot;modelYear&quot;:2018,&quot;stkTyp&quot;:&quot;New&quot;,&quot;state&quot;:&quot;NC&quot;,&quot;zipcode&quot;:&quot;27107&quot;}" cars-common-omniture-custom="" omniture-events="">

But it never gets converted into a usable form, and every node type I have tried (div, p, span) loses it.

Any ideas?

Answers


You seem to be looking to parse the braced content of a single node, i.e. the string vehicle='{"listingId":703036966,...}', which comes from the <a> node with id 703036966 and class saveVehicleHeart.

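The data you want lives in the node's vehicle attribute rather than in text that a browser would render, so html_text() will get you nowhere. Instead, you can store the node's markup as a string and then parse the part of interest.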
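A minimal sketch of that point, assuming a made-up one-listing snippet in place of the rendered file (the `snippet` string and the variable names are hypothetical): html_text() returns only the node's text, while rvest's html_attr() reads an attribute value directly.

```r
library(rvest)

# Hypothetical, trimmed-down stand-in for the rendered page
snippet <- '<div><a class="saveVehicleHeart" vehicle="{&quot;listingId&quot;:703036966}"><p class="saved-label">Save</p></a></div>'

node <- html_node(read_html(snippet), ".saveVehicleHeart")

node_text <- html_text(node)             # text inside the node only
node_attr <- html_attr(node, "vehicle")  # attribute value, with entities decoded

print(node_text)
print(node_attr)
```

The attribute never shows up in node_text, which is why selecting div, p, or span and calling html_text() cannot surface it.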

1. Retrieve the node as a string. One of several possible CSS paths to the node is '.saveVehicleHeart'.

library(rvest) 
library(stringr) 
library(dplyr) 
library(tidyr) ## Provides separate(), which step 3 uses 
car_html <- read_html("my file.html") 
cars <- as.character(html_node(car_html, css = '.saveVehicleHeart')) 

2. Extract the content inside the braces "{}".

cars <- cars %>% 
str_match(., "\\{.*?\\}") %>% ## Extract everything between the first "{" and the subsequent "}" 
gsub("\\{|\\}", "", .) ## Remove the characters "{" and "}" 

3. Bonus: turn it into a tidy data frame. You didn't ask for this, but it may help.

df_cars <- cars %>% 
    cbind(read.table(text = ., sep = (','))) %>% 
    t() %>% 
    as_data_frame() %>% 
    .[-1,] %>% ## The first row contains the original unparsed string. We drop it. 
    separate(., V1, into = c("Variable", "Value"), sep = "\\:") 
df_cars 

# A tibble: 12 × 2
      Variable     Value
*        <chr>     <chr>
1    listingId 703036966
2         mkId     20005
3         mkNm       BMW
4         mdId     20536
5         mdNm       750
6       trimId     25905
7     trimName         i
8  modelYearId  35797618
9    modelYear      2018
10      stkTyp       New
11       state        NC
12     zipcode     27107
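
As an alternative to the regex route (this is not the method in the answer, just a sketch of a different technique): the vehicle attribute holds JSON, so it can be read with html_attr() and parsed with jsonlite::fromJSON(). The inline `snippet` is a shortened, hypothetical stand-in for the file; in practice you would start from read_html("my file.html"). It assumes the jsonlite package is installed.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in with a shortened vehicle attribute
snippet <- '<a class="saveVehicleHeart" vehicle="{&quot;listingId&quot;:703036966,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;modelYear&quot;:2018,&quot;state&quot;:&quot;NC&quot;}">Save</a>'

vehicle_json <- read_html(snippet) %>%
  html_node(".saveVehicleHeart") %>%
  html_attr("vehicle")              # JSON text, entities already decoded

vehicle <- fromJSON(vehicle_json)   # named list, one element per key
df_vehicle <- as.data.frame(vehicle, stringsAsFactors = FALSE)

print(df_vehicle)
```

This yields one row per listing with one column per key, and it keeps numeric values numeric instead of turning everything into character strings.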

By "complete HTML", do you mean what you posted, or a larger HTML file with multiple car listings? –


I figured it out: html_node versus html_nodes. Thanks again! Great answer – MDEWITT


Thanks. Glad to help. –
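
The html_node versus html_nodes distinction mentioned in the comments can be sketched as follows, assuming a hypothetical page with two listings (the `snippet` string is made up). html_node() returns only the first match, while html_nodes() returns every match. In rvest 1.0 and later the same functions are also available as html_element() and html_elements().

```r
library(rvest)

# Hypothetical page containing two listings
snippet <- '<div>
  <a class="saveVehicleHeart" vehicle="{&quot;listingId&quot;:1}">Save</a>
  <a class="saveVehicleHeart" vehicle="{&quot;listingId&quot;:2}">Save</a>
</div>'
page <- read_html(snippet)

# First matching node only
first_only <- page %>% html_node(".saveVehicleHeart") %>% html_attr("vehicle")

# Every matching node, one element per listing
all_listings <- page %>% html_nodes(".saveVehicleHeart") %>% html_attr("vehicle")

print(first_only)
print(all_listings)
```

On a full results page with many listings, the plural form is the one that returns one entry per car.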