2015-10-19
<instance id="activate.v.bnc.00024693" docsrc="BNC"> 
<answer instance="activate.v.bnc.00024693" senseid="38201"/> 
<context> 
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
</context> 
</instance> 

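I want to extract all of the text inside it. This is what I have so far. stuff.text only prints the text before <head></head> (i.e. "Do you know ... step on to"), but I don't know how to extract the second half after </head> (i.e. "it . Used ... easily cope with .").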

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        print(stuff.text)  # prints only the text before <head>

Answers

0

If using BeautifulSoup is an option, this is trivial:

import bs4 
xtxt = '''  <instance id="activate.v.bnc.00024693" docsrc="BNC"> 
    <answer instance="activate.v.bnc.00024693" senseid="38201"/> 
    <context> 
    Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
    </context> 
    </instance>''' 
soup = bs4.BeautifulSoup(xtxt, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
print(soup.find('context').text)

This gives:

Do you know what it is , and where I can get one ? We suspect you had 
seen the Terrex Autospade , which is made by Wolf Tools . It is quite 
a hefty spade , with bicycle - type handlebars and a sprung lever at the 
rear , which you step on to activate it . Used correctly , you should n't 
have to bend your back during general digging , although it wo n't lift 
out the soil and put in a barrow if you need to move it ! If gardening 
tends to give you backache , remember to take plenty of rest periods 
during the day , and never try to lift more than you can easily cope 
with . 

If you prefer to use ElementTree, you should use itertext to collect all of the text:

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # itertext() yields the text before, inside and after <head>
        print(''.join(stuff.itertext()))
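
To see why stuff.text stops at <head>: ElementTree keeps the text before the first child in the parent's .text, and the text that follows a child's closing tag in that child's .tail. Below is a minimal sketch on a shortened stand-in for the sample sentence (the xml_snippet string is made up for illustration and is not the real train.xml):

```python
import xml.etree.ElementTree as et

# Shortened stand-in for the <context> element in the question (illustrative only).
xml_snippet = "<context>step on to <head>activate</head> it .</context>"
context = et.fromstring(xml_snippet)
head = context.find("head")

before = context.text   # text before the first child element
inside = head.text      # text inside <head>
after = head.tail       # text after </head>, stored on the child, not the parent
full = "".join(context.itertext())  # walks text and tails in document order

print(repr(before))
print(repr(inside))
print(repr(after))
print(full)
```

So the "missing" second half is not lost; it sits on the <head> child, and itertext() simply stitches the pieces back together.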

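If you are sure your XML file is well-formed, ElementTree is enough, and because it is part of the Python standard library you won't have any external dependency. But if the XML may be malformed, BeautifulSoup is good at repairing minor errors.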
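
One way to find out which case applies is to let ElementTree try first, since it raises ParseError on malformed input. Here is a sketch using only the standard library (the helper name is_well_formed and both sample strings are hypothetical, chosen just for this illustration):

```python
import xml.etree.ElementTree as et


def is_well_formed(xml_text):
    """Return True if ElementTree can parse xml_text, False otherwise."""
    try:
        et.fromstring(xml_text)
        return True
    except et.ParseError:
        return False


# Hypothetical samples: the second one never closes its <head> tag.
good = "<context>step on to <head>activate</head> it .</context>"
bad = "<context>step on to <head>activate it .</context>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

If the check fails, that is the point where falling back to a more forgiving parser such as BeautifulSoup becomes worth the extra dependency.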

+0

Thanks very much :) – needhelp

0

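You can use element serialization. There are two options:

  • keep the inner <head></head> tags
  • return just the text, without any tags.

When serializing with tags, the outer <context></context> tags can be removed manually: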

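# convert the element to a string and remove the outer <context></context> tags
# (lstrip/rstrip strip sets of characters, not prefixes, so slice instead)
# encoding='unicode' makes tostring return str on Python 3
xml_str = et.tostring(stuff, encoding='unicode').strip()
print(xml_str[len('<context>'):-len('</context>')])
# read only the text, without any tags
print(et.tostring(stuff, method='text', encoding='unicode'))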
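
Both options can be tried on a small inline string. This sketch assumes Python 3 and uses a made-up, shortened snippet. For the first option it rebuilds the inner markup from the children instead of trimming the wrapper off the serialized string, which is a swapped-in approach rather than the one shown above:

```python
import xml.etree.ElementTree as et

# Shortened stand-in for the <context> element (illustrative only).
snippet = "<context>step on to <head>activate</head> it .</context>"
context = et.fromstring(snippet)

# Option 1: keep the inner <head> tags, drop the outer <context> wrapper.
# tostring() on a child also emits that child's tail text.
inner_xml = (context.text or "") + "".join(
    et.tostring(child, encoding="unicode") for child in context
)

# Option 2: text only, no tags at all.
text_only = et.tostring(context, method="text", encoding="unicode")

print(inner_xml)
print(text_only)
```

Rebuilding from the children avoids depending on the exact spelling of the wrapper tag, which matters if <context> ever carries attributes.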