Extracting attributes with BeautifulSoup4

Sample input XML:

<subj code1="textA" code2="textB" code3="textC"> 
    <txt count="1"> 
     <txt id="123"> 
      This is my text. 
     </txt> 
    </txt> 
</subj>

I am trying to extract information from this XML into a CSV file with BeautifulSoup. My desired output is

code1,code2,code3,txt 
textA,textB,textC,This is my text.

I have been playing with this sample code, which I found here. It works for extracting txt, but not for code1, code2, and code3 in the subj tag.

if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                 subj.code2.text,
                                 subj.code3.text,
                                 subj.txt.txt))

However, I cannot get it to pick up the attributes of subj that I want to extract. Any suggestions?

Source

2017-06-30 owwoow14

code1, code2 and code3 are not text; they are attributes.

To access them, treat an element as a dictionary:

writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))

Demo:

In [1]: from bs4 import BeautifulSoup 

In [2]: data = """ 
    ...: <subj code1="textA" code2="textB" code3="textC"> 
    ...:  <txt count="1"> 
    ...:   <txt id="123"> 
    ...:    This is my text. 
    ...:   </txt> 
    ...:  </txt> 
    ...: </subj> 
    ...: """ 

In [3]: soup = BeautifulSoup(data, "xml") 
In [4]: for subj in soup('subj'): 
    ...:  print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)]) 
['textA', 'textB', 'textC', 'This is my text.']
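To illustrate why the dotted form in the question fails, here is a minimal sketch. It assumes a shortened version of the sample markup and uses the built-in "html.parser" so it runs without lxml; the variable names are only for illustration. Dotted access such as subj.code1 searches for a child tag named code1, while attributes are stored in the tag's .attrs dictionary.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample input
markup = '<subj code1="textA" code2="textB"><txt>hello</txt></subj>'
soup = BeautifulSoup(markup, "html.parser")
subj = soup.find("subj")

# Dotted access looks up a child *tag* with that name
child = subj.txt        # the <txt> child element
missing = subj.code1    # None, because there is no <code1> child tag

# Attributes are kept in a dictionary; subj["code1"] reads from it
attrs = dict(subj.attrs)

print(child.get_text(), missing, attrs)
```

Because subj.code1 is None, calling .text on it raises an AttributeError, which is consistent with the behaviour described in the question.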

You can also use .get() to provide a default value if an attribute is missing:

subj.get('code1', 'Default value for code1')
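Putting the pieces together, the following is one possible sketch of the whole XML-to-CSV step, not a definitive implementation. The helper names subj_rows and to_csv, the DATA constant, and the empty-string default are assumptions made for illustration. It parses with "html.parser" so that it runs without lxml (the "xml" parser from the demo should work the same way if lxml is installed), and it writes to an in-memory buffer rather than sample.csv.

```python
import csv
import io

from bs4 import BeautifulSoup

DATA = """
<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>
"""


def subj_rows(markup, parser="html.parser"):
    """Return one (code1, code2, code3, text) tuple per <subj> element."""
    soup = BeautifulSoup(markup, parser)
    rows = []
    for subj in soup.find_all("subj"):
        rows.append((
            subj.get("code1", ""),       # .get() avoids a KeyError if missing
            subj.get("code2", ""),
            subj.get("code3", ""),
            subj.get_text(strip=True),   # all nested text, whitespace stripped
        ))
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with the header from the question."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(("code1", "code2", "code3", "txt"))
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(subj_rows(DATA)), end="")
```

To write a file instead, the same csv.writer calls can be pointed at a handle opened with open('sample.csv', 'w', newline='').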

Source

2017-06-30 13:37:09 alecxe
