2014-11-21 30 views

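Python: reading the contents of a hidden HTML table

This web page has a "Show Study Locations" tab. When I click it, the page displays the entire list of locations and the URL changes to the one I use in the program below. When I run the program to print the entire list of locations, I get the result shown after the code: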

import urllib2
from bs4 import BeautifulSoup  # assumed: BeautifulSoup 4; the original post omits its imports

soup = BeautifulSoup(urllib2.urlopen('https://clinicaltrials.gov/ct2/show/study/NCT01718158?term=NCT01718158&rank=1&show_locs=Y#locn').read())

for row in soup('table')[5].findAll('tr'):
    tds = row('td')
    if len(tds) < 2:
        continue
    print tds[0].string, tds[1].string #, '\n'.join(filter(unicode.strip, tds[1].strings))

Local Institution None 
Local Institution None 
Local Institution None 
Local Institution None 
Local Institution None 

and so on; the rest of the information is left out. I think I am missing something here. My result should be:

United States, California 
Va Long Beach Healthcare System 
Long Beach, California, United States, 90822 
United States, Georgia 
Gastrointestinal Specialists Of Georgia Pc 
Marietta, Georgia, United States, 30060 
United States, New York 
Weill Cornell Medical College 

and so on. I want to print the entire list of locations.


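It looks like the content may be modified based on the user agent, or possibly populated by JavaScript. 'wget --no-check-certificate https://clinicaltrials.gov/ct2/show/study/NCT01718158?term=NCT01718158&rank=1&show_locs=Y' gives me a file without any locations in it; still looking. – Tom 2014-11-21 15:14:27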

Answer


The rows for the study locations have only one table cell, and your len(tds) < 2 check skips exactly those rows.

Perhaps you need to extract the data from all the cells, and skip only the rows that have no <td> cells at all:

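for row in soup('table')[5].findAll('tr'):
    tds = row('td')
    if not tds:
        # skip only rows without any <td> cells (for example, header rows)
        continue
    # join the text of every cell that holds a single string
    print u' '.join([cell.string for cell in tds if cell.string])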

This produces:

United States, California 
Va Long Beach Healthcare System 
Long Beach, California, United States, 90822 
United States, Georgia 
Gastrointestinal Specialists Of Georgia Pc 
Marietta, Georgia, United States, 30060 
# .... 
Local Institution 
Taipei, Taiwan, 100 
Local Institution 
Taoyuan, Taiwan, 333 
United Kingdom 
Local Institution 
London, Greater London, United Kingdom, SE5 9RS 
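
For readers who want to try the idea without fetching the page, here is a minimal, self-contained sketch of the same approach: keep every row that has at least one <td>, and drop only rows with none. It is not the code from the answer. It assumes Python 3 and uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages. The names RowTextParser, location_lines and SAMPLE_HTML are hypothetical, and the sample markup is a simplified guess at the table layout, not the real page.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell strings per <tr>
        self._cells = None   # cells of the row being parsed, or None
        self._text = None    # text fragments of the open <td>, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        # Only text inside an open <td> is recorded; <th> text is ignored.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # Collapse runs of whitespace inside the cell.
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None
        elif tag == "tr" and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None


def location_lines(html):
    """Return one line per row that has at least one non-empty <td>."""
    parser = RowTextParser()
    parser.feed(html)
    parser.close()
    lines = []
    for cells in parser.rows:
        text = " ".join(cell for cell in cells if cell)
        if text:  # rows without <td> text (e.g. header rows) are skipped
            lines.append(text)
    return lines


# Hypothetical, simplified stand-in for the locations table.
SAMPLE_HTML = """
<table>
  <tr><th>Locations</th></tr>
  <tr><td>United Kingdom</td></tr>
  <tr><td>Local Institution</td></tr>
  <tr><td>London, Greater London, United Kingdom, SE5 9RS</td></tr>
</table>
"""

for line in location_lines(SAMPLE_HTML):
    print(line)
```

The single-cell rows survive because the filter looks at whether a row has any cell text at all, rather than requiring two or more cells as in the question.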

Thanks a million, Martijn. Thank you so much. It works! – 2014-11-24 15:11:26