2017-06-30 50 views

XML tags: extracting attributes with BeautifulSoup4

Sample input:

<subj code1="textA" code2="textB" code3="textC"> 
    <txt count="1"> 
     <txt id="123"> 
      This is my text. 
     </txt> 
    </txt> 
</subj> 

I am trying to extract information from this XML into a CSV file with BeautifulSoup. My desired output is:

code1,code2,code3,txt 
textA,textB,textC,This is my text. 

I have been playing with this sample code, which I found here. It works for extracting txt, but not for code1, code2, and code3 in the subj tag:

if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                 subj.code2.text,
                                 subj.code3.text,
                                 subj.txt.txt))

However, I cannot get it to recognize the attributes on subj that I want to extract. Any suggestions?

Answers


code1, code2, and code3 are not text; they are attributes of the subj element.

To access them, treat an element as a dictionary:

writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))

Demo:

In [1]: from bs4 import BeautifulSoup 

In [2]: data = """ 
    ...: <subj code1="textA" code2="textB" code3="textC"> 
    ...:  <txt count="1"> 
    ...:   <txt id="123"> 
    ...:    This is my text. 
    ...:   </txt> 
    ...:  </txt> 
    ...: </subj> 
    ...: """ 

In [3]: soup = BeautifulSoup(data, "xml") 
In [4]: for subj in soup('subj'): 
    ...:  print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)]) 
['textA', 'textB', 'textC', 'This is my text.'] 
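
Putting the pieces together, the CSV export from the question might look like the sketch below. The names `subj_rows` and `write_csv` and the in-memory `io.StringIO` buffer are illustrative choices, not part of the original code. `html.parser` is used here only so that the sketch runs without lxml installed; the `"xml"` parser from the demo should give the same rows for this input.

```python
import csv
import io

from bs4 import BeautifulSoup

DATA = """
<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>
"""


def subj_rows(markup, parser="html.parser"):
    """Yield one (code1, code2, code3, text) tuple per <subj> element."""
    soup = BeautifulSoup(markup, parser)
    for subj in soup.find_all("subj"):
        # Attributes are read with dictionary-style access;
        # the nested text is collected with get_text().
        yield (subj["code1"], subj["code2"], subj["code3"],
               subj.get_text(strip=True))


def write_csv(markup, fhandle):
    """Write a header row followed by one row per <subj> element."""
    writer = csv.writer(fhandle)
    writer.writerow(("code1", "code2", "code3", "txt"))
    writer.writerows(subj_rows(markup))


# An in-memory buffer stands in for the real output file.
buffer = io.StringIO()
write_csv(DATA, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open('sample.csv', 'w', newline='')` to `write_csv`; the `newline=''` argument is what the `csv` module documentation recommends for files handed to `csv.writer`.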

You can also use .get() to provide a default value in case an attribute is missing:

subj.get('code1', 'Default value for code1')
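
As a rough illustration of the difference, the sketch below uses a made-up input in which the second subj element has no code3 attribute. The sample data and the `"N/A"` placeholder are assumptions for the example, and `html.parser` is used so that it runs without lxml. Dictionary-style access (`subj['code3']`) would raise a `KeyError` on that second element, while `.get()` falls back to the default.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second <subj> has no code3 attribute.
DATA = """
<subj code1="a1" code2="b1" code3="c1"><txt>first</txt></subj>
<subj code1="a2" code2="b2"><txt>second</txt></subj>
"""

soup = BeautifulSoup(DATA, "html.parser")

rows = [
    (
        subj.get("code1", "N/A"),
        subj.get("code2", "N/A"),
        subj.get("code3", "N/A"),  # falls back to "N/A" when the attribute is absent
        subj.get_text(strip=True),
    )
    for subj in soup.find_all("subj")
]
print(rows)
```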