2016-07-20 28 views
>>> from lxml import html 
>>> html.tostring(html.fromstring('<div>1</div><div>2</div>')) 
'<div><div>1</div><div>2</div></div>' # I don't want the outer <div>
>>> html.tostring(html.fromstring('I am pure text')) 
'<p>I am pure text</p>' # I don't need the extra <p>

How can I keep lxml from adding the outer <div> or <p> wrapper element?

Answers


By default, html.fromstring returns a single element, so lxml creates a parent <div> when the string contains multiple elements and wraps bare text in a <p>.

You can work with the individual fragments instead:

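from lxml import html

test_cases = ['<div>1</div><div>2</div>', 'I am pure text']
for test_case in test_cases:
    # fragments_fromstring returns a list of elements; the first item
    # may be a plain string if the input starts with text.
    fragments = html.fragments_fromstring(test_case)
    print(fragments)
    output = ''
    for fragment in fragments:
        if isinstance(fragment, str):
            # Leading text comes back as a str, so append it as-is.
            output += fragment
        else:
            # tostring returns bytes by default, so decode to str.
            output += html.tostring(fragment).decode('UTF-8')
    print(output)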

Output:

[<Element div at 0x3403ea8>, <Element div at 0x3489368>] 
<div>1</div><div>2</div> 
['I am pure text'] 
I am pure text