Paths to strings in HTML

How can I generate the paths to all text strings in an HTML document, preferably with BeautifulSoup? For example, I have this code:

<DIV class="art-info"><SPAN class="time"><SPAN class="time-date" content="2012-02-28T14:46CET" itemprop="datePublished"> 
      28. february 2012 
      </SPAN> 
      14:46 
      </SPAN></DIV><DIV> 
      Something,<P>something else</P>continuing. 
      </DIV>

I want to split the HTML code into paths to the text strings, like

str1 >>> <DIV class="art-info"><SPAN class="time"><SPAN class="time-date" content="2012-02-28T14:46CET" itemprop="datePublished">28. february 2012</SPAN></SPAN></DIV> 
str2 >>> <DIV class="art-info"><SPAN class="time">14:46</SPAN></DIV> 
str3 >>> <DIV>Something,continuing.</DIV> 
str4 >>> <DIV><P>something else</P></DIV>

or

str1 >>> <DIV><SPAN><SPAN>28. february 2012</SPAN></SPAN></DIV> 
str2 >>> <DIV><SPAN>14:46</SPAN></DIV> 
str3 >>> <DIV>Something,continuing.</DIV> 
str4 >>> <DIV><P>something else</P></DIV>

or

str1 >>> //div/span/span/28. february 
str2 >>> //div/span/14:46 
str3 >>> //div/Something,continuing. 
str4 >>> //div/p/something else

I have studied the BeautifulSoup documentation, but I can't figure out how to do it. Do you have any ideas?

Source

2013-03-28 Paul Chen

I'm not sure I understand what you are trying to do. Could you elaborate?

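I want to generate all the paths to the text strings in an HTML document. Put simply, for the first non-tag text found I would like to get something like /html/body/div/div/span/"string", then, for example, /html/body/div/div/span/h3/p/"text string" for the second non-tag text, and so on.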

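import re
from bs4 import BeautifulSoup

# The parser is named explicitly (assumes lxml is installed); lxml wraps
# fragments in <html><body>, consistent with the output below.
with open("input") as file:
    soup = BeautifulSoup(file, "lxml")

# Visit every text node that contains at least one non-whitespace character.
for t in soup.find_all(string=re.compile(r"\S")):
    # t.parents runs from the innermost tag up to the [document] root.
    path = "/".join(reversed([p.name for p in t.parents]))
    print(path + "/" + t.strip())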

Output

[document]/html/body/div/span/span/28. february 2012 
[document]/html/body/div/span/14:46 
[document]/html/body/div/Something, 
[document]/html/body/div/p/something else 
[document]/html/body/div/continuing.
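
The output above splits "Something," and "continuing." into two paths, while the question's third format joins them under one //div path. A minimal sketch of that variant follows, assuming the built-in html.parser (which adds no html/body wrapper and lowercases tag names). The names SAMPLE and text_paths are hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# SAMPLE is a hypothetical copy of the markup from the question.
SAMPLE = """<DIV class="art-info"><SPAN class="time"><SPAN class="time-date" content="2012-02-28T14:46CET" itemprop="datePublished">
      28. february 2012
      </SPAN>
      14:46
      </SPAN></DIV><DIV>
      Something,<P>something else</P>continuing.
      </DIV>"""


def text_paths(html):
    """Return one '//tag/.../text' path per tag that directly contains text."""
    soup = BeautifulSoup(html, "html.parser")

    # Group text pieces by their parent tag, keyed by id() so that two
    # distinct tags with identical markup are never merged. Dicts keep
    # insertion order, so paths follow the first text seen in each tag.
    grouped = {}
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # skip whitespace-only strings
        parent = node.parent
        grouped.setdefault(id(parent), (parent, []))[1].append(text)

    paths = []
    for parent, pieces in grouped.values():
        # Ancestors from the outermost tag inwards, without the soup root.
        names = [p.name for p in parent.parents if p is not soup]
        names.reverse()
        names.append(parent.name)
        paths.append("//" + "/".join(names) + "/" + "".join(pieces))
    return paths


for path in text_paths(SAMPLE):
    print(path)
```

Note that find_all(string=True) also returns comments and similar node types, so a real document may need extra filtering.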

Source

2013-03-29 00:27:47 perreal

Thanks, this is exactly what I had in mind.
