2015-11-06 99 views
1

我试图抓取看起来像这样的页面,每个页面有3个或更多的span标签。我们的目标是让恩类型的字典的名单:lxml xpath - 获取span标签内的所有文本

{'ctl02_lblAppearanceInfo1': 'Text', 
'ctl02_lblAppearanceInfo2': 'Text'} 

HTML:

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE.............. </span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE..........</span> 


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE..........</span> 


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE..........</span> 

我用

tree.xpath('//span[starts-with(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl")]') 

成功,因为它返回一个元素对象通过ID和文本属性,但如果我遇到这样的事情:

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> 
TEXT LINE 1 
<br>TEXT LINE 2 
<br>TEXT LINE 3 
<br>TEXT LINE 4</span> 

它只会返回回 “文本行1”

回答

2

使用text()

下面是代码:

from lxml import html 

HTML = """<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE 1.............. </span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE 2..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE 3..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE 4..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE 5..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE 6..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE 7..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE 8..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE 9..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> 
TEXT LINE 10............. 
<br>TEXT LINE 11............. 
<br>TEXT LINE 12............. 
<br>TEXT LINE 13.............</span> 
""" 

tree = html.fromstring(HTML) 
text_lines = tree.xpath('//span[contains(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl")]') 

results = dict() 

for i, text_line in enumerate(text_lines): 
    span_id = text_line.xpath('.//@id')[0] 
    span_text = [x.strip() for x in text_line.xpath('.//text()')] 
    results[i] = dict(id=span_id, texts=span_text) 

print results 

输出:

{ 
    0: { 
     'texts': ['TEXT HERE 1..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1' 
    }, 
    1: { 
     'texts': ['TEXT HERE 2..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2' 
    }, 
    2: { 
     'texts': ['TEXT HERE 3..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace' 
    }, 
    3: { 
     'texts': ['TEXT HERE 4..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1' 
    }, 
    4: { 
     'texts': ['TEXT HERE 5..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2' 
    }, 
    5: { 
     'texts': ['TEXT HERE 6..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace' 
    }, 
    6: { 
     'texts': ['TEXT HERE 7..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1' 
    }, 
    7: { 
     'texts': ['TEXT HERE 8..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2' 
    }, 
    8: { 
     'texts': ['TEXT HERE 9..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace' 
    }, 
    9: { 
     'texts': ['TEXT LINE 10.............', 'TEXT LINE 11.............', 'TEXT LINE 12.............', 'TEXT LINE 13.............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1' 
    } 
} 
相关问题