2016-09-03 33 views
2

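I want to scrape data from a page that displays its charts with highcharts.js, and I've got as far as parsing all the pages needed to reach the following page. However, that last page, the one showing the dataset, uses highcharts.js to draw the graph, and it seems almost impossible to get at the raw data. Can I get the raw data out of highcharts.js?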

I'm using Python 3.5 with BeautifulSoup.

Is it still possible to parse it? If so, how do I scrape it?

+0

You'd need to write some "code" to do that - –

+0

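@JaromandaX OK, sorry, I forgot to mention my scraping environment, so I've edited the question. But honestly, I have no idea how to even start writing code to scrape a graph that uses highcharts.js... – Blaszard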

+0

That's a real problem - hopefully someone will come along and write it for you –

Answers

2

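The data is in a script tag. You can use bs4 and a regex to find that script tag. You could also extract the data with a regex, but I prefer to use js2xml, which parses the JavaScript into an XML tree: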

from bs4 import BeautifulSoup 
import requests 
import re 
import js2xml 

soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser") 
script = soup.find("script", text=re.compile("Highcharts.Chart")).text 
# script = soup.find("script", text=re.compile("precipchartcontainer")).text if you want precipitation data 
parsed = js2xml.parse(script) 
print(js2xml.pretty_print(parsed))  # print() call, since the question targets Python 3.5

This gives you:

<program> 
    <functioncall> 
    <function> 
     <identifier name="$"/> 
    </function> 
    <arguments> 
     <funcexpr> 
     <identifier/> 
     <parameters/> 
     <body> 
      <var name="chart"/> 
      <functioncall> 
      <function> 
       <dotaccessor> 
       <object> 
        <functioncall> 
        <function> 
         <identifier name="$"/> 
        </function> 
        <arguments> 
         <identifier name="document"/> 
        </arguments> 
        </functioncall> 
       </object> 
       <property> 
        <identifier name="ready"/> 
       </property> 
       </dotaccessor> 
      </function> 
      <arguments> 
       <funcexpr> 
       <identifier/> 
       <parameters/> 
       <body> 
        <assign operator="="> 
        <left> 
         <identifier name="chart"/> 
        </left> 
        <right> 
         <new> 
         <dotaccessor> 
          <object> 
          <identifier name="Highcharts"/> 
          </object> 
          <property> 
          <identifier name="Chart"/> 
          </property> 
         </dotaccessor> 
         <arguments> 
          <object> 
          <property name="chart"> 
           <object> 
           <property name="renderTo"> 
            <string>tempchartcontainer</string> 
           </property> 
           <property name="type"> 
            <string>spline</string> 
           </property> 
           </object> 
          </property> 
          <property name="credits"> 
           <object> 
           <property name="enabled"> 
            <boolean>false</boolean> 
           </property> 
           </object> 
          </property> 
          <property name="colors"> 
           <array> 
           <string>#FF8533</string> 
           <string>#4572A7</string> 
           </array> 
          </property> 
          <property name="title"> 
           <object> 
           <property name="text"> 
            <string>Average Temperature (°c) Graph for Brussels</string> 
           </property> 
           </object> 
          </property> 
          <property name="xAxis"> 
           <object> 
           <property name="categories"> 
            <array> 
            <string>January</string> 
            <string>February</string> 
            <string>March</string> 
            <string>April</string> 
            <string>May</string> 
            <string>June</string> 
            <string>July</string> 
            <string>August</string> 
            <string>September</string> 
            <string>October</string> 
            <string>November</string> 
            <string>December</string> 
            </array> 
           </property> 
           <property name="labels"> 
            <object> 
            <property name="rotation"> 
             <number value="270"/> 
            </property> 
            <property name="y"> 
             <number value="40"/> 
            </property> 
            </object> 
           </property> 
           </object> 
          </property> 
          <property name="yAxis"> 
           <object> 
           <property name="title"> 
            <object> 
            <property name="text"> 
             <string>Temperature (°c)</string> 
            </property> 
            </object> 
           </property> 
           </object> 
          </property> 
          <property name="tooltip"> 
           <object> 
           <property name="enabled"> 
            <boolean>true</boolean> 
           </property> 
           </object> 
          </property> 
          <property name="plotOptions"> 
           <object> 
           <property name="spline"> 
            <object> 
            <property name="dataLabels"> 
             <object> 
             <property name="enabled"> 
              <boolean>true</boolean> 
             </property> 
             </object> 
            </property> 
            <property name="enableMouseTracking"> 
             <boolean>false</boolean> 
            </property> 
            </object> 
           </property> 
           </object> 
          </property> 
          <property name="series"> 
           <array> 
           <object> 
            <property name="name"> 
            <string>Average High Temp (°c)</string> 
            </property> 
            <property name="color"> 
            <string>#FF8533</string> 
            </property> 
            <property name="data"> 
            <array> 
             <number value="6"/> 
             <number value="8"/> 
             <number value="11"/> 
             <number value="14"/> 
             <number value="19"/> 
             <number value="21"/> 
             <number value="23"/> 
             <number value="23"/> 
             <number value="19"/> 
             <number value="15"/> 
             <number value="9"/> 
             <number value="6"/> 
            </array> 
            </property> 
           </object> 
           <object> 
            <property name="name"> 
            <string>Average Low Temp (°c)</string> 
            </property> 
            <property name="color"> 
            <string>#4572A7</string> 
            </property> 
            <property name="data"> 
            <array> 
             <number value="2"/> 
             <number value="2"/> 
             <number value="4"/> 
             <number value="6"/> 
             <number value="10"/> 
             <number value="12"/> 
             <number value="14"/> 
             <number value="14"/> 
             <number value="11"/> 
             <number value="8"/> 
             <number value="5"/> 
             <number value="2"/> 
            </array> 
            </property> 
           </object> 
           </array> 
          </property> 
          </object> 
         </arguments> 
         </new> 
        </right> 
        </assign> 
       </body> 
       </funcexpr> 
      </arguments> 
      </functioncall> 
     </body> 
     </funcexpr> 
    </arguments> 
    </functioncall> 
</program> 

So, to get all the data:

In [28]: from bs4 import BeautifulSoup 
In [29]: import requests 
In [30]: import re  
In [31]: import js2xml  
In [32]: from itertools import repeat  
In [33]: from pprint import pprint as pp 
In [34]: soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser") 

In [35]: script = soup.find("script", text=re.compile("Highcharts.Chart")).text 

In [36]: parsed = js2xml.parse(script) 

In [37]: data = [d.xpath(".//array/number/@value") for d in parsed.xpath("//property[@name='data']")] 

In [38]: categories = parsed.xpath("//property[@name='categories']//string/text()") 

In [39]: output = list(zip(repeat(categories), data))  
In [40]: pp(output) 
[(['January', 
    'February', 
    'March', 
    'April', 
    'May', 
    'June', 
    'July', 
    'August', 
    'September', 
    'October', 
    'November', 
    'December'], 
    ['6', '8', '11', '14', '19', '21', '23', '23', '19', '15', '9', '6']), 
(['January', 
    'February', 
    'March', 
    'April', 
    'May', 
    'June', 
    'July', 
    'August', 
    'September', 
    'October', 
    'November', 
    'December'], 
    ['2', '2', '4', '6', '10', '12', '14', '14', '11', '8', '5', '2'])] 
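
Note that the XPath `@value` lookups return strings, and each tuple repeats the whole month list. If you want numbers keyed by month, something like the following rough sketch could be a next step. It starts from the `output` shown above, copied in by hand so it runs without the network; the labels `high` and `low` are hypothetical short names I picked for the two series in the order they appear, and `to_table` is just an illustrative helper.

```python
# Result of the session above, pasted in so the sketch runs offline.
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
output = [
    (months, ['6', '8', '11', '14', '19', '21', '23', '23', '19', '15', '9', '6']),
    (months, ['2', '2', '4', '6', '10', '12', '14', '14', '11', '8', '5', '2']),
]

# Hypothetical labels, assumed to match the order of the series in the chart.
labels = ["high", "low"]


def to_table(labels, output):
    """Map each label to a {month: value} dict, converting the strings to floats."""
    table = {}
    for label, (categories, values) in zip(labels, output):
        table[label] = {month: float(v) for month, v in zip(categories, values)}
    return table


table = to_table(labels, output)
print(table["high"]["July"], table["low"]["January"])
```

`float` is used rather than `int` on the assumption that other charts may carry decimal values.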

As I said, you could just use a regex, but I find js2xml more reliable, since stray whitespace and the like won't break it.
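
For comparison, here is roughly what the regex-only route could look like. This is a minimal sketch, not what the code above does: `SCRIPT` is a hypothetical, trimmed-down stand-in for the real script text, and `extract_series` is an illustrative helper. It assumes every `data` array is a flat list of plain numbers.

```python
import json
import re

# Hypothetical script text shaped like the chart config above (not the real page).
SCRIPT = """
chart = new Highcharts.Chart({
    xAxis: { categories: ['January', 'February', 'March'] },
    series: [{ name: 'Average High Temp', data: [6, 8, 11] },
             { name: 'Average Low Temp',  data: [2,2 ,  4] }]
});
"""


def extract_series(script):
    """Return one list of numbers per `data: [...]` array found in the script."""
    # Capture everything between the brackets that follow a `data:` key.
    bodies = re.findall(r"data\s*:\s*\[([^\]]*)\]", script)
    # A flat list of numbers is valid JSON once the brackets are put back.
    return [json.loads("[" + body + "]") for body in bodies]


print(extract_series(SCRIPT))
```

This copes with uneven spacing inside the arrays, but it would fall over on nested `[x, y]` point pairs, trailing commas, or any other key that happens to be called `data`, which is the sort of fragility that makes the parsed tree the safer option.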

+0

Oh, that's awesome. I didn't know about that module, so I hadn't tried it, but your code works like a charm. Thanks. – Blaszard

+0

No worries. As I said in the answer, you could use re, but js2xml does all the work for you. –

+0

I tried using the method above to get data from "https://www.99acres.com/do/pricetrends?building_id=0&loc_id=12400&prop_type=1&pref=S&bed_no=0&w=600&h=350/", but it doesn't seem to work – durjoy
