2012-05-10

I have been tearing my hair out over this. Basically, I cannot extract tag attribute information with Beautiful Soup from tags like:

<REUTERS LEWISSPLIT="TRAIN"> 

I cannot get the value of LEWISSPLIT and store it in a list.

I have the following code:

import re 

from BeautifulSoup import BeautifulSoup 

totstring="" 

with open('reut2-000.sgm', 'r') as inF: 
    for line in inF: 
        # keep only characters in the allowed set
        string = re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line) 
        totstring += string 

soup = BeautifulSoup(totstring) 

bodies = list() 
topics = list() 
tags = list() 

for a in soup.findAll("body"): 
    bodies.append(a) 


for b in soup.findAll("topics"): 
    topics.append(b) 

for item in soup.findAll('REUTERS'): 
    tags.append(item['TOPICS']) 



outputstring="" 

for x in range(0, len(bodies)): 
    # skip records with an empty <topics> tag
    if topics[x].text == "": 
        continue 
    outputstring += ("<TOPICS>" + topics[x].text + "</TOPICS>\n" 
                     + "<BODY>" + bodies[x].text + "</BODY>\n") 

outfile = open("output.sgm", "w") 
outfile.write(outputstring) 
outfile.close() 

print tags[0] 

This is for parsing some old Reuters SGML that looks a bit like this:

<!DOCTYPE lewis SYSTEM "lewis.dtd"> 
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1"> 
<DATE>26-FEB-1987 15:01:01.79</DATE> 
<TOPICS><D>cocoa</D></TOPICS> 
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES> 
<PEOPLE></PEOPLE> 
<ORGS></ORGS> 
<EXCHANGES></EXCHANGES> 
<COMPANIES></COMPANIES> 
<UNKNOWN> 
&#5;&#5;&#5;C T 
&#22;&#22;&#1;f0704&#31;reute 
u f BC-BAHIA-COCOA-REVIEW 02-26 0105</UNKNOWN> 
<TEXT>&#2; 
<TITLE>BAHIA COCOA REVIEW</TITLE> 
<DATELINE> SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in 
the Bahia cocoa zone, alleviating the drought since early 
January and improving prospects for the coming temporao, 
although normal humidity levels have not been restored, 
Comissaria Smith said in its weekly review. 
&#3;</BODY></TEXT> 
</REUTERS> 
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2"> 
<DATE>26-FEB-1987 15:02:20.00</DATE> 
<TOPICS></TOPICS> 
<PLACES><D>usa</D></PLACES> 
<PEOPLE></PEOPLE> 
<ORGS></ORGS> 
<EXCHANGES></EXCHANGES> 
<COMPANIES></COMPANIES> 
<UNKNOWN> 
&#5;&#5;&#5;F Y 
&#22;&#22;&#1;f0708&#31;reute 
d f BC-STANDARD-OIL-&lt;SRD>-TO 02-26 0082</UNKNOWN> 
<TEXT>&#2; 
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE> 
<DATELINE> CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America 
Inc said they plan to form a venture to manage the money market 
borrowing and investment activities of both companies. 
    BP North America is a subsidiary of British Petroleum Co 
Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil. 
    The venture will be called BP/Standard Financial Trading 
and will be operated by Standard Oil under the oversight of a 
joint management committee. 
&#3;</BODY></TEXT> 
</REUTERS> 
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5546" NEWID="3"> 
<DATE>26-FEB-1987 15:03:27.51</DATE> 
<TOPICS></TOPICS> 
<PLACES><D>usa</D></PLACES> 
<PEOPLE></PEOPLE> 
<ORGS></ORGS> 
<EXCHANGES></EXCHANGES> 
<COMPANIES></COMPANIES> 
<UNKNOWN> 
&#5;&#5;&#5;F A 
&#22;&#22;&#1;f0714&#31;reute 
d f BC-TEXAS-COMMERCE-BANCSH 02-26 0064</UNKNOWN> 
<TEXT>&#2; 
<TITLE>TEXAS COMMERCE BANCSHARES &lt;TCB> FILES PLAN</TITLE> 
<DATELINE> HOUSTON, Feb 26 - </DATELINE><BODY>Texas Commerce Bancshares Inc's Texas 
Commerce Bank-Houston said it filed an application with the 
Comptroller of the Currency in an effort to create the largest 
banking network in Harris County. 
    The bank said the network would link 31 banks having 
13.5 billion dlrs in assets and 7.5 billion dlrs in deposits. 

Reuter 
&#3;</BODY></TEXT> 
</REUTERS> 

I am interested in removing the special characters, extracting the contents of the body and topics tags, and building new XML out of them:

<topic>oil</topic> 
<body>asdsd</body> 
<topic>grain</topic> 
<body>asdsdds</body> 

I want to split this data based on the value of LEWISSPLIT. I have been able to do everything except split it by the lewissplit value.

That is because I cannot extract the value from the <REUTERS> tag. Running

for item in soup.findAll('REUTERS'): 
    tags.append(item['LEWISSPLIT']) 

print tags[0] 

and many other techniques I tried from this site and the official documentation, all I get is [].

Why exactly is it so hard to extract the value of the LEWISSPLIT attribute from the <REUTERS> tag?

Thank you very much for reading.

See also: Extracting tag information with beautifulsoup and python


What happens when you call soup.findAll('REUTERS')? What output do you get? Have you tried soup.findAll('reuters')? I noticed that when I parsed the xml you provided, BeautifulSoup converted all the tags to lowercase. –

Answer


Joel Cornett is right: "reuters", along with "lewissplit", should be lowercase. The correct syntax is:

for item in soup.findAll('reuters'): 
    tags.append(item['lewissplit'])
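For reference, the split-by-LEWISSPLIT step the question is ultimately after can be sketched with nothing but the standard library. Python 3's html.parser lowercases tag and attribute names exactly like BeautifulSoup does (attribute values keep their case), so the same lowercase-lookup rule applies. The ReutersParser class and the inline sample below are illustrative, not from the original script:

```python
from collections import defaultdict
from html.parser import HTMLParser


class ReutersParser(HTMLParser):
    """Collect the lewissplit attribute of each <reuters> record and
    group <body> text by that split value."""

    def __init__(self):
        super().__init__()
        self.splits = []                  # lewissplit value per record
        self.bodies = defaultdict(list)   # split value -> body texts
        self._in_body = False
        self._split = None

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased, values unchanged
        if tag == "reuters":
            self._split = dict(attrs).get("lewissplit")
            self.splits.append(self._split)
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body and data.strip():
            self.bodies[self._split].append(data.strip())


sample = """
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN">
<BODY>Showers continued throughout the week.</BODY>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TEST">
<BODY>Cocoa exports were steady.</BODY>
</REUTERS>
"""

parser = ReutersParser()
parser.feed(sample)
print(parser.splits)           # ['TRAIN', 'TEST']
print(parser.bodies["TRAIN"])  # ['Showers continued throughout the week.']
```

The grouping in `parser.bodies` is exactly the TRAIN/TEST partition the question wants to write out to separate files.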