2017-06-02 122 views

Beautiful Soup parsing problem: I can't parse this XML, which doesn't seem to have any class I can reference.

My code snippet:

sock = urllib2.urlopen(l) 
link = sock.read() 

soup = BeautifulSoup(link,"xml") 

FirstNameHome=soup.find('home_probable_pitcher','first_name') 

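I want to find the first names of the probable pitchers for both the home and away teams.

(There are only two instances, so I'm not sure whether I should use findAll.)

Here is the source, as shown by soup.prettify: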

LookupError: unknown encoding: <?xml version="1.0" encoding="UTF-8"?><!--Copyright 2017 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt--> 
<game id="2017/06/02/nyamlb-tormlb-1" venue="Rogers Centre" game_pk="490921" 
     time="7:07" 
     time_date="2017/06/02 7:07" 
     time_date_aw_lg="2017/06/02 7:07" 
     time_date_hm_lg="2017/06/02 7:07" 
     time_zone="ET" 
     ampm="PM" 
     first_pitch_et="" 
     away_time="7:07" 
     away_time_zone="ET" 
     away_ampm="PM" 
     home_time="7:07" 
     home_time_zone="ET" 
     home_ampm="PM" 
     game_type="R" 
     tiebreaker_sw="N" 
     original_date="2017/06/02" 
     time_zone_aw_lg="-4" 
     time_zone_hm_lg="-4" 
     time_aw_lg="7:07" 
     aw_lg_ampm="PM" 
     tz_aw_lg_gen="ET" 
     time_hm_lg="7:07" 
     hm_lg_ampm="PM" 
     tz_hm_lg_gen="ET" 
     venue_id="14" 
     scheduled_innings="9" 
     away_name_abbrev="NYY" 
     home_name_abbrev="TOR" 
     away_code="nya" 
     away_file_code="nyy" 
     away_team_id="147" 
     away_team_city="NY Yankees" 
     away_team_name="Yankees" 
     away_division="E" 
     away_league_id="103" 
     away_sport_code="mlb" 
     home_code="tor" 
     home_file_code="tor" 
     home_team_id="141" 
     home_team_city="Toronto" 
     home_team_name="Blue Jays" 
     home_division="E" 
     home_league_id="103" 
     home_sport_code="mlb" 
     day="FRI" 
     gameday_sw="P" 
     double_header_sw="N" 
     game_nbr="1" 
     tbd_flag="N" 
     venue_w_chan_loc="CAXX0504" 
     location="Toronto, Canada" 
     gameday_link="2017_06_02_nyamlb_tormlb_1" 
     away_win="30" 
     away_loss="20" 
     home_win="26" 
     home_loss="27" 
     game_data_directory="/components/game/mlb/year_2017/month_06/day_02/gid_2017_06_02_nyamlb_tormlb_1" 
     league="AA" 
     inning_state="" 
     note="" 
     status="Preview" 
     ind="S" 
     tv_station="SNET-1, MLBN (out-of-market only)"> 
    <home_probable_pitcher id="434538" first_name="Francisco" first="Francisco" last_name="Liriano" 
          last="Liriano" 
          name_display_roster="Liriano" 
          number="45" 
          throwinghand="LHP" 
          wins="2" 
          losses="2" 
          era="6.35" 
          s_wins="2" 
          s_losses="2" 
          s_era="6.35" 
          stats_season="2017" 
          stats_type="R"/> 
    <away_probable_pitcher id="501381" first_name="Michael" first="Michael" last_name="Pineda" 
          last="Pineda" 
          name_display_roster="Pineda" 
          number="35" 
          throwinghand="RHP" 
          wins="6" 
          losses="2" 
          era="3.32" 
          s_wins="6" 
          s_losses="2" 
          s_era="3.32" 
          stats_season="2017" 
          stats_type="R"/> 
    <game_media> 
     <media type="game" calendar_event_id="14-490921-2017-06-02" 
      start="2017-06-02T19:07:00-0400" 
      title="NYY @ TOR" 
      has_mlbtv="true" 
      free="NO" 
      enhanced="N" 
      media_state="media_off" 
      thumbnail="http://mediadownloads.mlb.com/mlbam/preview/nyator_490921_th_7_preview.jpg"/> 
    </game_media> 
</game> 
Please add an example URL (the 'l' object in your example)

http://gd2.mlb.com/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1/linescore.xml

Note that the output at this URL will change after June 3, 2017

Answers

If we write

# for Python 3 
# import urllib.request 

import urllib2 

from bs4 import BeautifulSoup 

l = 'http://gd2.mlb.com/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1/linescore.xml' 

sock = urllib2.urlopen(l) 
# for Python 3 
# sock = urllib.request.urlopen(l) 
link = sock.read() 

soup = BeautifulSoup(link, "xml") 

FirstNameHome = soup.find('home_probable_pitcher').attrs['first_name'] 
print(FirstNameHome) 

it prints

Edinson 
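
To cover both teams and the findAll question, here is a minimal sketch that runs offline. The SAMPLE_XML string is a hand-trimmed stand-in for the feed, and probable_pitcher_first_names is a hypothetical helper name, not part of Beautiful Soup. It also shows why the original call found nothing: the second positional argument of find() is attrs, and a plain string there is matched against the class attribute, which these elements do not have. The "xml" parser assumes lxml is installed.

```python
import re

from bs4 import BeautifulSoup

# Hand-trimmed sample shaped like linescore.xml (an assumption, not the live feed).
SAMPLE_XML = """
<game id="2017/06/03/arimlb-miamlb-1" status="Preview">
  <home_probable_pitcher id="450172" first_name="Edinson" last_name="Volquez"/>
  <away_probable_pitcher id="517414" first_name="Randall" last_name="Delgado"/>
</game>
"""


def probable_pitcher_first_names(xml_text):
    """Return {'home': ..., 'away': ...}; a value is None if the data is missing."""
    soup = BeautifulSoup(xml_text, "xml")
    names = {}
    for side in ("home", "away"):
        tag = soup.find(side + "_probable_pitcher")
        # find() returns None for a missing element; get() returns None
        # for a missing attribute, so neither case raises here.
        names[side] = tag.get("first_name") if tag is not None else None
    return names


soup = BeautifulSoup(SAMPLE_XML, "xml")

# The call from the question: 'first_name' is treated as a class filter,
# so no element matches and the result is None.
original_result = soup.find("home_probable_pitcher", "first_name")

# find_all works too: match both tag names with one regular expression.
all_names = [tag["first_name"]
             for tag in soup.find_all(re.compile(r"_probable_pitcher$"))]

print(probable_pitcher_first_names(SAMPLE_XML))
print(original_result)
print(all_names)
```

With only one element per team, two find() calls are enough; find_all is mainly useful when the number of matching elements is not known in advance.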

print(soup.prettify(encoding='utf-8')) 

<?xml version="1.0" encoding="utf-8"?> 
<!--Copyright 2017 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt--> 
<game ampm="PM" aw_lg_ampm="PM" away_ampm="PM" away_code="ari" away_division="W" away_file_code="ari" away_league_id="104" away_loss="22" away_name_abbrev="ARI" away_sport_code="mlb" away_team_city="Arizona" away_team_id="109" away_team_name="D-backs" away_time="1:10" away_time_zone="MST" away_win="34" day="SAT" double_header_sw="N" first_pitch_et="" game_data_directory="/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1" game_nbr="1" game_pk="490927" game_type="R" gameday_link="2017_06_03_arimlb_miamlb_1" gameday_sw="P" hm_lg_ampm="PM" home_ampm="PM" home_code="mia" home_division="E" home_file_code="mia" home_league_id="104" home_loss="31" home_name_abbrev="MIA" home_sport_code="mlb" home_team_city="Miami" home_team_id="146" home_team_name="Marlins" home_time="4:10" home_time_zone="ET" home_win="21" id="2017/06/03/arimlb-miamlb-1" ind="S" inning_state="" league="NN" location="Miami, FL" note="" original_date="2017/06/03" scheduled_innings="9" status="Preview" tbd_flag="N" tiebreaker_sw="N" time="4:10" time_aw_lg="4:10" time_date="2017/06/03 4:10" time_date_aw_lg="2017/06/03 4:10" time_date_hm_lg="2017/06/03 4:10" time_hm_lg="4:10" time_zone="ET" time_zone_aw_lg="-4" time_zone_hm_lg="-4" tv_station="FS-F, MLBN (out-of-market only)" tz_aw_lg_gen="ET" tz_hm_lg_gen="ET" venue="Marlins Park" venue_id="4169" venue_w_chan_loc="USFL0316"> 
<home_probable_pitcher era="4.44" first="Edinson" first_name="Edinson" id="450172" last="Volquez" last_name="Volquez" losses="7" name_display_roster="Volquez" number="36" s_era="4.44" s_losses="7" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/> 
<away_probable_pitcher era="3.47" first="Randall" first_name="Randall" id="517414" last="Delgado" last_name="Delgado" losses="0" name_display_roster="Delgado" number="48" s_era="3.47" s_losses="0" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/> 
<game_media> 
    <media calendar_event_id="14-490927-2017-06-03" enhanced="N" free="NO" has_mlbtv="true" media_state="media_off" start="2017-06-03T16:10:00-0400" thumbnail="http://mediadownloads.mlb.com/mlbam/preview/arimia_490927_th_7_preview.jpg" title="ARI @ MIA" type="game"/> 
</game_media> 
</game> 

EDIT

I can reproduce your error only when I pass the link object (or str(soup)) to the prettify method

soup.prettify(link) 

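Well, this is not what you need, because the arguments of prettify are encoding ('utf-8', for example) and formatter ('minimal' by default), not the raw content. The markup is therefore taken as an encoding name, which is why the LookupError message contains the whole document. So just write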

pretty = soup.prettify() 

which gives (in Python 2)

>>> type(pretty) 
<type 'unicode'> 

or specify the encoding

>>> pretty = soup.prettify(encoding='utf-8') 

which gives

>>> type(pretty) 
<type 'str'> 
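
The types above are from Python 2. As a rough sketch of the same two calls under Python 3, where the results are str and bytes instead, the following uses a small hand-written XML string (an assumption standing in for the downloaded feed) and again assumes lxml is available for the "xml" parser.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded markup (not the live feed).
markup = (
    '<game status="Preview">'
    '<home_probable_pitcher first_name="Edinson" last_name="Volquez"/>'
    '</game>'
)

soup = BeautifulSoup(markup, "xml")

# No arguments: prettify() returns text (str in Python 3).
pretty_text = soup.prettify()

# With an encoding name: prettify() returns encoded bytes.
pretty_bytes = soup.prettify(encoding="utf-8")

print(type(pretty_text))
print(type(pretty_bytes))
print(pretty_text)
```

The point is the same in both Python versions: the first argument of prettify is an encoding name, so the parsed document itself should never be passed in.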
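Thanks. I didn't pay much attention to the lookup error, because I only needed to know how to parse out the name. It looks like FirstNameHome = soup.find('home_probable_pitcher').attrs['first_name'] does it. I'll check again shortly.

@DannyW: let me know if it can be improved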