2013-03-28 65 views

Paths to strings in HTML

How can I generate the paths to all text strings in an HTML document, preferably using BeautifulSoup? For example, I have this code:

<DIV class="art-info"><SPAN class="time"><SPAN class="time-date" content="2012-02-28T14:46CET" itemprop="datePublished"> 
      28. february 2012 
      </SPAN> 
      14:46 
      </SPAN></DIV><DIV> 
      Something,<P>something else</P>continuing. 
      </DIV> 

I would like to split the HTML into paths to the text strings, in any of the following forms:

str1 >>> <DIV class="art-info"><SPAN class="time"><SPAN class="time-date" content="2012-02-28T14:46CET" itemprop="datePublished">28. february 2012</SPAN></SPAN></DIV> 
str2 >>> <DIV class="art-info"><SPAN class="time">14:46</SPAN></DIV> 
str3 >>> <DIV>Something,continuing.</DIV> 
str4 >>> <DIV><P>something else</P></DIV> 

str1 >>> <DIV><SPAN><SPAN>28. february 2012</SPAN></SPAN></DIV> 
str2 >>> <DIV><SPAN>14:46</SPAN></DIV> 
str3 >>> <DIV>Something,continuing.</DIV> 
str4 >>> <DIV><P>something else</P></DIV> 

str1 >>> //div/span/span/28. february 
str2 >>> //div/span/14:46 
str3 >>> //div/Something,continuing. 
str4 >>> //div/p/something else 

I have studied the BeautifulSoup documentation, but I cannot figure out how to do this. Do you have any ideas?


I am not sure I understand what you are trying to do. Could you elaborate? –


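I want to generate the paths to all text strings in an HTML document. Put simply, for the first non-tag text found I want something like /html/body/div/div/span/"string", then for the second non-tag text e.g. html/body/div/div/span/h3/p/"text string", and so on. –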

Answer

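import re
from bs4 import BeautifulSoup

# Parse the file. The "lxml" parser is an assumption (it must be installed);
# it wraps a bare fragment in <html><body>, which matches the output below.
with open("input") as f:
    soup = BeautifulSoup(f, "lxml")

# Visit every text node that contains at least one non-whitespace character.
for t in soup.find_all(string=re.compile(r"\S")):
    # t.parents runs from the innermost tag outwards, so reverse it.
    path = "/".join(reversed([p.name for p in t.parents]))
    print(path + "/" + t.strip())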

Output

[document]/html/body/div/span/span/28. february 2012 
[document]/html/body/div/span/14:46 
[document]/html/body/div/Something, 
[document]/html/body/div/p/something else 
[document]/html/body/div/continuing. 
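
The answer above relies on BeautifulSoup. Where installing it is not an option, roughly the same idea can be sketched with the standard library's html.parser: keep a stack of the currently open tags and record the stack whenever a non-blank text node appears. This is only a minimal sketch, not the answerer's method. The names TextPathParser, text_paths and VOID_TAGS are made up for illustration, and the handling of unclosed tags is deliberately simple. Note that the paths start at the first tag actually present in the source, so the [document]/html/body prefix seen in the output above does not appear.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextPathParser(HTMLParser):
    """Collect a "tag/tag/.../text" string for every non-blank text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open tags (lowercased)
        self.paths = []   # collected path strings, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append("/".join(self.stack + [text]))


def text_paths(html):
    """Hypothetical helper: return the path of every text string in html."""
    parser = TextPathParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.paths


sample = "<div><span>14:46</span> Something,<p>something else</p>continuing.</div>"
for path in text_paths(sample):
    print(path)
# div/span/14:46
# div/Something,
# div/p/something else
# div/continuing.
```

Because html.parser lowercases tag names, the upper-case DIV and SPAN tags from the question come out as div and span, just as they do in the BeautifulSoup output.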

Thanks, this is exactly what I had in mind. –