2013-08-06 90 views
0

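Difficulty parsing a web page

I have downloaded a web page and saved it in HTML format. I want to parse it and get the values of "fullname", "memberHeadline", "numberOfConnections", and so on. I tried using BeautifulSoup in Python, but it did not work. I also tried: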

>>> import json 
>>> encoded_data = json.loads(f) 

Traceback (most recent call last): 
    File "<pyshell#14>", line 1, in <module> 
    encoded_data = json.loads(f) 
    File "C:\Python27\lib\json\__init__.py", line 338, in loads 
    return _default_decoder.decode(s) 
    File "C:\Python27\lib\json\decoder.py", line 365, in decode 
    obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
    File "C:\Python27\lib\json\decoder.py", line 383, in raw_decode 
    raise ValueError("No JSON object could be decoded") 
    ValueError: No JSON object could be decoded 

I am not sure what format the file is in. The contents of the file are copied below.

<!DOCTYPE html> 
<!--[if lt IE 7]> 
    <html lang="en" class="ie ie6 lte9 lte8 lte7 os-win"> 
    <![endif]--> 
    <!--[if IE 7]> 
     <html lang="en" class="ie ie7 lte9 lte8 lte7 os-win"> 
     <![endif]--> 
     <!--[if IE 8]> 
      <html lang="en" class="ie ie8 lte9 lte8 os-win"> 
      <![endif]--> 
      <!--[if IE 9]> 
       <html lang="en" class="ie ie9 lte9 os-win"> 
       <![endif]--> 
       <!--[if gt IE 9]> 
        <html lang="en" class="os-win"> 
        <![endif]--> 
        <!--[if !IE]><!--> 
         <html lang="en" class="os-win"> 
         <!--<![endif]--> 

         <head> 
          <meta name="lnkd-track-json-lib" content="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=2jds9coeh4w78ed9wblscv68v-eo3jgzogk6v7maxgg86f4u27d&amp;fc=2"> 
          <meta name="lnkd-track-lib" content="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=eo3jgzogk6v7maxgg86f4u27d&amp;fc=2"> 
          <meta name="treeID" content="yGlqHfV7FxMQvJqjACsAAA=="> 
          <meta name="appName" content="profile"> 
          <meta name="lnkd-track-error" content="/lite/ua/error?csrfToken=ajax%3A1584468784299534813&amp;goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1"> 
          <script src="http://static.licdn.com:80/scds/common/u/lib/fizzy/fz-1.3.3-min.js" type="text/javascript"></script> 
          <script type="text/javascript"> 
           fs.config({ 
            "failureRedirect": "http://www.linkedin.com/nhome/", 
            "uniEscape": true, 
            "xhrHeaders": { 
             "X-FS-Origin-Request": "/profile/view?id=131506997&authType=NAME_SEARCH&authToken=9ikF&locale=en_US&srchid=123452511375704499972&srchindex=1&srchtotal=63&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A123452511375704499972%2CVSRPtargetId%3A131506997%2CVSRPcmpt%3Aprimary", 
             "X-FS-Page-Id": "nprofile-view" 
            } 
           }); 
          </script> 
          <script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=8swqmpmjehppqzovz8zzfvv9g-aef5jooigi7oiyblwlouo8z90-7tqheyb1qchwa8dejl8nvz7zd-10q339fub5b718xk0pv9lzhpl&amp;fc=2"></script> 
          <meta http-equiv="content-type" content="text/html; charset=UTF-8"> 
          <meta http-equiv="X-UA-Compatible" content="IE=9"> 
          <meta name="pageImpressionID" content="3b47eb21-1db4-42c1-9d00-43b2918c4099"> 
          <meta name="pageKey" content="nprofile_v2_view_fs"> 
          <meta name="analyticsURL" content="/analytics/noauthtracker"> 
          <link rel="openid.server" href="https://www.linkedin.com/uas/openid/authorize"> 
          <link rel="apple-touch-icon-precomposed" href="/img/icon/apple-touch-icon.png"> 
          <link rel="stylesheet" type="text/css" href="http://s.c.lnkd.licdn.com/scds/concat/common/css?h=3bifs78lai5i0ndyj1ew7316e-c8kkvmvykvq2ncgxoqb13d2by-cphg8n6ehozk6lgpbb36za2ap-2it1to3q1pt5evainys9ta07p-4uu2pkz5u0jch61r2nhpyyrn8-7poavrvxlvh0irzkbnoyoginp-4om4nn3a2z730xs82d78xj3be-3t9ar1pajet97hzt9uou74qbb-ct4kfyj4tquup0bvqhttvymms-58ujm6g9r0a3ok6mpq7cs25gn-9zbbsrdszts09by60it4vuo3q-8ti9u6z5f55pestwbmte40d9-5730os2tf3iaiql5c8fukzd2u-cxff0g818hf3ks7dzixd4lqcq-6ramlbadr9lh7v5r7vuc6t4ld-e5frmcn40t833k1adjvwkoyjq&amp;fc=2"> 
          <script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dfoaudjrk6rbf82f45bz5crwi-62og8s54488owngg0s7escdit-c8ha6zrgpgcni7poa5ctye7il-3ufb745s29q1ovtbq6htt6rwh-51dv6schthjydhvcv6rxvospp-e9rsfv7b5gx0bk0tln31dx3sq-2r5gveucqe4lsolc3n0oljsn1-8v2hz0euzy8m1tk5d6tfrn6j-d4jr8g8vadmx9i5wz2z6ck3pb-1wfd86vm2f60y6uu5isrw94q-ddqoqtn6xcqi8i0y3tsrdl533-81wr8bey9cjjn6rhvbv530cap-6iw3fvg61uute6gxy89acxi5d-5sxnyeselbctwpry658s2lkew-4c6mz6u5rinti47gswwanj74j-5eec9p1vamr86uabn13sngx92-6oxtrh5eu6olunu39xzqgp10i-cdr0psywot2inbx54hmajga3p-anaxa6l712w7m4gp8089vyb5m-3h7320kwlnqtbngzm67z2annq-80vg9koywz84zoon9sjflbru0-8cwe3ciy81r59l0q3usztbt2r-1na957r12xyfe317uma5pn9mc-57sur4cj634ll9tk38imgvc6g-8v6o0480wy5u6j7f3sh92hzxo-9puf8y7tgjvse2oqtgkdb4wcj-c9pibx8dlmicbwjh48g12z6bl-12xp8e6pputw80p9fcpzyy9m0-3xjyji4eyuzpbppt3cssr1oko-34nej6plgotmo4hbnvjthteuu-4nw8tqsdbe61ig2l9faf3qdi9&amp;fc=2"></script> 
          <script type="text/javascript"> 
           LI.define('UrlPackage'); 
           LI.UrlPackage.containerCore = ["http://s.c.lnkd.licdn.com/scds/concat/common/js?h=d7z5zqt26qe7ht91f8494hqx5&fc=2"] 
           [0]; 
          </script> 
          <script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/common/u/js/scds-hashes.js"></script> 
          <script type="text/javascript"> 
           LI.JSContentBasePath = "http://s.c.lnkd.licdn.com/scds/concat/common/js?v=build-2000_1_28557-prod&fc=2"; 
           LI.CSSContentBasePath = "http://s.c.lnkd.licdn.com/scds/concat/common/css?v=build-2000_1_28557-prod&fc=2"; 
           LI.injectRelayHtmlUrl = "http://s.c.lnkd.licdn.com/scds/common/u/lib/inject/0.4.2/relay.html"; 
           LI.injectRelaySwfUrl = "http://s.c.lnkd.licdn.com/scds/common/u/lib/inject/0.4.2/relay.swf"; 
           LI.comboBaseUrl = "http://s.c.lnkd.licdn.com/scds/concat/common/css?v=build-2000_1_28557-prod&fc=2"; 
           LI.staticUrlHashEnabled = "true"; 
          </script> 
          <script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=c19zsujfl1pg46iqy33ubhqc5-eu97293ov7e2qciy9zn7fyg55-4n3akxbgwkyp3he1eeb136xxq-aq8gt7g4x1o11fxmypuv7vfkb-8kh6sn7nciobs2crbqunav09q-6m96aslgoubdqpnadnimrxsuk&amp;fc=2"></script> 
          <script type="text/javascript"> 
           document.cookie = 'lang="v=2&lang=en-us"; domain=linkedin.com; version=0; path=/;'; 
          </script> 
          <title>xyz abc | LinkedIn</title> 
          <link rel="stylesheet" type="text/css" href="http://s.c.lnkd.licdn.com/scds/concat/common/css?h=449dtfu96optpu75y189filyn-4lrmst05cep59hjxopm4xrj84-depvqaeschv5p2381431jub3f-ae142xp1b9qwrvanr8q32v931&amp;fc=2"> 
          <!--[if gte IE 9]> 
           <link rel="shortcut icon" type="image/ico" href="http://s.c.lnkd.licdn.com/scds/common/u/img/favicon_ie9.ico"> 
          <![endif]--> 
          <!--[if (!IE)|(lt IE 9)]> 
    <link rel="shortcut icon" type="image/ico" href="http://s.c.lnkd.licdn.com/scds/common/u/img/favicon_v3.ico"> 
    <![endif]--> 
           <script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=6rei9ktvfprzc38327x3gt0u3-c61ck8yq8xgf9ji3h55bmaux8-e7bh8bocljccs3kjl5f03uw8j&amp;fc=2"></script> 
           <script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=bqzk7a8ih1jk3hy30mnqy9jxb-39gmm0e77xjrikdpkxed7aywi-d228wbwzysn60azozcfg7gzoa&amp;fc=2"></script>"ind_lookup":"Financial Services","isShared":false,"logoId":"/p/3/000/10d/1c1/3af8941.png"},{"link_biz":"/company/axis\u002dmutual\u002dfund?trk=prof\u002dfollowing\u002dcompany\u002dlogo","universalName":"axis\u002dmutual\u002dfund","id":565614,"logo":"http://m.c.lnkd.licdn.com/media/p/2/000/036/30a/1bf5734.png","canonicalName":"Axis Mutual Fund","biz_follow":"/company/follow/submit?id=565614&csrfToken=ajax%3A1584468784299534813&goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1","ind_lookup":"Financial Services","isShared":false,"logoId":"/p/2/000/036/30a/1bf5734.png"},{"link_biz":"/company/uti\u002dmf?trk=prof\u002dfollowing\u002dcompany\u002dlogo","universalName":"uti\u002dmf","id":467803,"logo":"http://m.c.lnkd.licdn.com/media/p/3/000/02f/205/35eec6b.png","canonicalName":"UTI MF","biz_follow":"/company/follow/submit?id=467803&csrfToken=ajax%3A1584468784299534813&goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1","ind_lookup":"Financial Services","isShared":false,"logoId":"/p/3/000/02f/205/35eec6b.png"},{"link_biz":"/company/investment\u002ddata\u002dservices?trk=prof\u002dfollowing\u002dcompany\u002dlogo","universalName":"investment\u002ddata\u002dservices","id":92139,"logo":"http://m.c.lnkd.licdn.com/media/p/3/000/0ad/3d0/32f42f2.png","canonicalName":"Investment Data Services","biz_follow":"/company/follow/submit?id=92139&csrfToken=ajax%3A1584468784299534813&goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1","ind_lookup":"Financial 
Services","isShared":false,"logoId":"/p/3/000/0ad/3d0/32f42f2.png"}],"i18n_news":"News","lix_profile_showChannels":"control","i18n_unfollow":"Unfollow","isFollowing":true}},"BasicInfo":{"empty":{},"upsell":{"deferImg":true,"visible":true},"basic_info":{"showTopCardDetail":true,"visible":true,"phoneticname":"","i18n__Industry":"Industry","industry_pivot":"/search?search=&industry=43&sortCriteria=R&keepFacets=true&trk=prof\u002d0\u002dovw\u002dindustry","find_others_region":"Find other members in Mumbai Area, India","headline_highlight":"Executive at L&amp;T Mutual Fund","i18n__find_others_in_industry":"Find other members in this industry","i18n_Edit":"Edit","location_highlight":"<strong class=\ "highlight\">Mumbai Area, India</strong>","deferImg":true,"industry_highlight":"Financial Services","industryID":43,"memberHeadline":"Executive at L&T Mutual Fund","i18n__Location":"Location","memberID":131506997,"location_pivot":"/search?search=&sortCriteria=R&keepFacets=true&facet_G=in%3A7150&trk=prof\u002d0\u002dovw\u002dlocation","fmt_location":"Mumbai Area, India","fullname":"xyz abc"}}` 
+0

This is part of an HTML file. You need the whole file, and you should use BeautifulSoup to parse it. It is not a JSON file, which is why you get this error. What BeautifulSoup code did you try? – FakeRainBrigand

+0

You are dealing with a page that is generated by JavaScript. You need a library like Selenium. – zhangyangyu

Answers

1

This is an HTML file. You can tell because it begins with <!DOCTYPE html> (in fact, that means it is HTML5).
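As a rough illustration of that check, here is a minimal sketch, assuming Python 3 and file contents already read into a string. The helper name `looks_like_html` is hypothetical, and the test is only a heuristic.

```python
def looks_like_html(text):
    """Heuristic: True if text starts with an HTML doctype or an <html> tag."""
    # Leading whitespace is ignored; a byte-order mark is not handled here.
    head = text.lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


print(looks_like_html('<!DOCTYPE html>\n<html lang="en" class="os-win">'))
print(looks_like_html('{"fullname": "xyz abc"}'))
```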

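The reason you cannot load it as JSON is that it is not JSON. Getting an exception is expected, and your code should handle it, because being passed the wrong kind of file is something that commonly happens.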
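A minimal sketch of handling that exception, assuming Python 3. There, `json.JSONDecodeError` is a subclass of `ValueError`, so the same `except` clause also covers the `ValueError` that Python 2.7 raises in the traceback above. The helper name `load_json_or_none` is hypothetical.

```python
import json


def load_json_or_none(text):
    """Return the decoded JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        # Raised for non-JSON input such as an HTML page.
        # Note: a JSON "null" also decodes to None; use a sentinel if that matters.
        return None


html_text = "<!DOCTYPE html><html><head><title>xyz abc | LinkedIn</title></head></html>"
print(load_json_or_none(html_text))
print(load_json_or_none('{"fullname": "xyz abc"}'))
```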

You may want to use lxml (possibly with the BeautifulSoup backend) to parse this.
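In the file posted in the question, fields such as "fullname" and "memberHeadline" appear as JSON-like text rather than as HTML tags, so an HTML parser alone will not expose them as elements. The sketch below is not the lxml approach. It swaps in a standard-library technique (assuming Python 3): find each key with a regular expression, then let `json.JSONDecoder.raw_decode` parse the single value that follows it. The helper name `find_json_values` and the shortened `sample` string are illustrative assumptions.

```python
import json
import re


def find_json_values(text, key):
    """Return every value stored under "key" in JSON fragments inside text."""
    decoder = json.JSONDecoder()
    pattern = re.compile(r'"%s"\s*:\s*' % re.escape(key))
    values = []
    for match in pattern.finditer(text):
        try:
            # raw_decode parses exactly one JSON value starting at this index.
            value, _end = decoder.raw_decode(text, match.end())
        except ValueError:
            continue  # value is truncated or not valid JSON; skip it
        values.append(value)
    return values


# Shortened fragment shaped like the end of the file in the question.
sample = (
    '<script type="text/javascript"></script>'
    '"industryID":43,"memberHeadline":"Executive at L&T Mutual Fund",'
    '"memberID":131506997,"fmt_location":"Mumbai Area, India",'
    '"fullname":"xyz abc"}}'
)

for key in ("fullname", "memberHeadline", "memberID"):
    print(key, find_json_values(sample, key))
```

Because only the value after each key is decoded, this works even when the surrounding JSON object is truncated, as it is in the posted file.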

+0

lxml is definitely the way to go, although I would personally recommend its html module over the BeautifulSoup library for performance. –

+0

@SlaterTyranus Quite. Unless I am mistaken, though, BeautifulSoup is more robust. – Marcin

+0

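I have heard people say that, but I have never run into a case where it mattered. In my experience, if the HTML is large enough to be useful, browsers correct it when rendering anyway. –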

0

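The header is an HTML5 doctype (it functions much like XHTML). Unlike XHTML, you do not have to close some tags, but in this case it looks more like a "hidden iframe streaming" technique, where the page never finishes loading.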

As for parsing, I would suggest reading the answers on parsing bad HTML -> How to parse malformed HTML in python, using standard libraries
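As one illustration of what a standard-library approach can look like, here is a minimal sketch using Python 3's `html.parser` (the module was named `HTMLParser` in Python 2). It is not taken from the linked question. The class name `HeadInfoParser` and the shortened `snippet` are assumptions for the example. The parser is event-based and does not require tags to be closed, so it copes with a truncated document.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from lenient HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# <html> and <head> are never closed, as in the truncated file from the question.
snippet = (
    '<!DOCTYPE html><html lang="en" class="os-win"><head>'
    '<meta name="appName" content="profile">'
    '<meta name="pageKey" content="nprofile_v2_view_fs">'
    '<title>xyz abc | LinkedIn</title>'
)

parser = HeadInfoParser()
parser.feed(snippet)
parser.close()
print(parser.title)
print(parser.meta)
```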

Regards, Luan