Parsing this particular XML

<instance id="activate.v.bnc.00024693" docsrc="BNC"> 
<answer instance="activate.v.bnc.00024693" senseid="38201"/> 
<context> 
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
</context> 
</instance>

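I want to extract all of the text inside it. This is what I have so far. stuff.text only prints the text before <head></head> (i.e. "Do you know ... step on to"), and I don't know how to extract the second half that follows </head> (i.e. "it . Used correctly ... easily cope with .").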

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        print(stuff.text)  # prints only the text before <head>

Source

2015-10-19 needhelp

If using BeautifulSoup is an option, this is trivial:

import bs4 
xtxt = '''  <instance id="activate.v.bnc.00024693" docsrc="BNC"> 
    <answer instance="activate.v.bnc.00024693" senseid="38201"/> 
    <context> 
    Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
    </context> 
    </instance>''' 
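# The parser is named explicitly here (an assumption; the original call omitted it)
soup = bs4.BeautifulSoup(xtxt, 'html.parser')
print(soup.find('context').text)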

This gives:

Do you know what it is , and where I can get one ? We suspect you had 
seen the Terrex Autospade , which is made by Wolf Tools . It is quite 
a hefty spade , with bicycle - type handlebars and a sprung lever at the 
rear , which you step on to activate it . Used correctly , you should n't
have to bend your back during general digging , although it wo n't lift 
out the soil and put in a barrow if you need to move it ! If gardening 
tends to give you backache , remember to take plenty of rest periods 
during the day , and never try to lift more than you can easily cope 
with .

If you prefer to use ElementTree, you should use itertext to collect all of the text:

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # itertext() yields the element's text plus the text and tail of every descendant
        print(''.join(stuff.itertext()))
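To see why stuff.text stops at <head>, here is a minimal self-contained sketch. The shortened XML string and the variable names are made up for illustration; only the ElementTree calls come from the standard library. ElementTree stores the text that follows a child's closing tag on that child's tail attribute, not on the parent:

```python
import xml.etree.ElementTree as et

# Shortened, hypothetical sample with the same shape as the XML in the question.
xml_text = (
    '<instance id="demo">'
    '<answer instance="demo" senseid="38201"/>'
    '<context>step on to <head>activate</head> it .</context>'
    '</instance>'
)

instance = et.fromstring(xml_text)
context = instance.find('context')
head = context.find('head')

before_head = context.text   # text before the first child element
head_text = head.text        # text inside <head>
after_head = head.tail       # text after </head>, stored on the child
full_text = ''.join(context.itertext())  # all three pieces, in document order

print(repr(before_head))
print(repr(head_text))
print(repr(after_head))
print(full_text)
```

So context.text + head.text + head.tail would also work for this one case, but itertext() handles any number of nested children.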

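If you are sure your XML file is well-formed, ElementTree is enough, and because it is part of the standard Python library you will have no external dependencies. But if the XML may be malformed, BeautifulSoup is good at repairing small errors.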
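As a rough illustration of the well-formedness point, ElementTree raises ParseError on malformed input instead of trying to repair it. The broken string below is a hypothetical example, not taken from the question's data:

```python
import xml.etree.ElementTree as et

# Hypothetical malformed input: the <head> element is never closed.
broken_xml = '<context>step on to <head>activate it .</context>'

try:
    et.fromstring(broken_xml)
    parsed_ok = True
except et.ParseError as err:
    parsed_ok = False
    print('ElementTree rejected the input: %s' % err)
```

If inputs like this are expected, that is the case where a more lenient parser such as BeautifulSoup may be the better fit.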

Source

2015-10-19 14:57:15

Thanks very much :) – needhelp

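You can also serialize the element. There are two options:

Keep the inner <head></head> tags.
Return only the text, without any tags.

When serializing with tags, the outer <context></context> tags can be removed manually: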

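# convert the element to a string and remove the outer <context></context> tags
# (encoding='unicode' returns str on Python 3; slicing is used because
# lstrip/rstrip remove sets of characters, not a literal prefix or suffix)
xml_str = et.tostring(stuff, encoding='unicode').strip()
print(xml_str[len('<context>'):-len('</context>')])
# read only the text, without any tags
print(et.tostring(stuff, method='text', encoding='unicode'))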
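A self-contained sketch of both options follows, assuming Python 3 and a shortened, made-up <context> element. It also shows one caveat worth knowing: str.lstrip and str.rstrip remove any of the listed characters rather than a literal prefix or suffix, so slicing off the known tag lengths is the safer way to drop the outer tags:

```python
import xml.etree.ElementTree as et

# Shortened, hypothetical element standing in for the question's <context>.
context = et.fromstring('<context>step on to <head>activate</head> it .</context>')

# Option 1: serialize with tags, then slice off the outer <context></context>.
with_tags = et.tostring(context, encoding='unicode')
inner = with_tags[len('<context>'):-len('</context>')]

# Option 2: serialize the text only, without any tags.
text_only = et.tostring(context, method='text', encoding='unicode')

# Caveat: lstrip treats its argument as a set of characters, so it also
# eats the word "next" here because n, e, x and t are all in that set.
over_stripped = '<context>next word</context>'.lstrip('<context>')

print(inner)
print(text_only)
print(repr(over_stripped))
```

The slice assumes the element has no attributes and no trailing tail text; with either of those present, the offsets would need adjusting.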

Source

2015-10-19 15:33:40
