Python string replacement: turning keywords into URLs

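I want to replace certain keywords in a string with URLs, for example:

content.replace("Google", '<a href="http://www.google.com">Google</a>')

However, I only want to replace a keyword with a URL if it is not already wrapped in a URL.

The content is simple HTML: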

<p><b>This is an example!</b></p><p>I love <a href="http://www.google.com">Google</a></p><p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>

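It consists mostly of <a> and <img> tags.

The main question: how do I determine whether a keyword is already wrapped in an <a> or <img> tag?

There is a similar question for PHP, "find and replace keywords with urls ONLY if not already wrapped in a url", but the answer there is not an effective one.

Is there a better solution in Python? Code examples would be even better. Thanks!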

Source

2012-06-09 Susan Mayer

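Could you give an example of the kind of text you want to run this on? – Acorn

@Acorn An HTML page. For example: 'This is an example! I love Google' –

You can use the example I posted below to build a regular expression that matches <a> or <img> tags. – tabchas

As Chris-Top said, BeautifulSoup is the way to go: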

import re
from bs4 import BeautifulSoup  # Beautiful Soup 4 (the bs4 package)

html = """
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Dog'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# (search term, URL) pairs
keywords = [("dog", "http://en.wikipedia.org/wiki/Dog"),
            ("fox", "http://en.wikipedia.org/wiki/Fox")]

def insertLinks(string_value, string_href):
    pattern = re.compile(re.escape(string_value), re.IGNORECASE)
    # Visit every text node that contains the search term
    for t in soup.find_all(string=pattern):
        if t.parent.name != 'a':  # skip text that already sits directly inside a link
            a = soup.new_tag('a', href=string_href)
            a.string = string_value
            # Split around the first match only; later matches in the same node stay unlinked
            before, after = pattern.split(t, maxsplit=1)
            replacement_text = soup.new_string(before)
            t.replace_with(replacement_text)
            replacement_text.insert_after(a)
            a.insert_after(soup.new_string(after))

for term, href in keywords:
    insertLinks(term, href)

print(soup)

This produces:

<div> 
    <p>The quick brown <a href="http://en.wikipedia.org/wiki/Dog">fox</a> jumped over the lazy <a href="http://en.wikipedia.org/wiki/Dog">dog</a></p> 
    <p>The <a href="http://en.wikipedia.org/wiki/Dog">dog</a>, who was, in reality, not so lazy, gave chase to the <a href="http://en.wikipedia.org/wiki/Fox">fox</a>.</p> 
    <p>See image for reference:</p> 
    <img src="dog_chasing_fox.jpg" title="Dog chasing fox"/> 
</div>
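
The function above links only the first match in each text node, checks only the direct parent, and replaces the matched text with the lowercase search term. A possible variant that handles several matches per node, checks every ancestor, and keeps the original capitalisation is sketched below. It assumes Beautiful Soup 4 (the `bs4` package); the name `link_keywords`, its `skip_tags` parameter, and the sample input are illustrative assumptions, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

def link_keywords(html, keywords, skip_tags=("a", "script", "style")):
    """Wrap each whole-word keyword in a link unless it is inside a skipped tag."""
    soup = BeautifulSoup(html, "html.parser")
    for term, href in keywords:
        # The capturing group makes split() keep the matched text in its result.
        pattern = re.compile(r"(\b" + re.escape(term) + r"\b)", re.IGNORECASE)
        for node in soup.find_all(string=pattern):
            # Look at every ancestor, not just the direct parent.
            if node.find_parent(list(skip_tags)) is not None:
                continue
            for i, part in enumerate(pattern.split(node)):
                if i % 2:  # odd indexes hold the matched keyword text
                    a = soup.new_tag("a", href=href)
                    a.string = part  # keeps the capitalisation found in the text
                    node.insert_before(a)
                elif part:  # non-empty plain text between matches
                    node.insert_before(soup.new_string(part))
            node.extract()  # drop the original, now duplicated, text node
    return str(soup)

sample = '<p>Dog and dog. <a href="/x"><b>Dog</b></a> hotdog</p>'
result = link_keywords(sample, [("dog", "/wiki/Dog")])
print(result)
```

In this sample the two bare occurrences are linked, the one nested inside the existing link is left alone, and "hotdog" is not touched because of the word-boundary anchors.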

Source

2012-06-09 22:02:40

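Wow, this whole time I was trying to solve the problem with the HTMLParser library... I had been working on it for 3 hours... and then there is a library that already does it :( – tabchas

@Kevin P Thanks for taking the time to submit some working code :) – topless

You could try adding the regular expression mentioned in the previous post. First test your string against the regular expression to see whether it is already wrapped in a URL. This should be fairly simple: calling the re library's search() method should do it.

Here is a good tutorial, in case you need details on regular expressions and the search method: http://www.tutorialspoint.com/python/python_reg_expressions.htm

Once you have checked whether the string is already wrapped in a URL, you can call the replace function if it is not.

Here is a simple example I wrote: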

import re

x = '<a href="http://www.google.com">Google</a>'
y = 'Google'

def checkURL(string):
    # Treat the string as already linked if it contains an opening <a href...> tag
    if re.search(r'<a href.+', string):
        print("URL Wrapped Already")
        print(string)
    else:
        string = string.replace('Google', '<a href="http://www.google.com">Google</a>')
        print("URL Not Wrapped:")
        print(string)

checkURL(x)
checkURL(y)

I hope this answers your question!
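
The check above looks at the whole string, so on a page that mixes linked and unlinked keywords it is all or nothing. One way to apply the same idea per occurrence, still using only `re`, is to let the pattern match existing `<a>...</a>` elements and other tags first and return those matches unchanged. The sketch below is an illustration under that assumption: the function name `link_keyword` and the sample input are made up, and a regular expression like this stays fragile on real-world HTML (nested or malformed markup, comments, scripts).

```python
import re

def link_keyword(html, keyword, url):
    """Link bare occurrences of keyword, leaving existing links and tags alone."""
    pattern = re.compile(
        # Group 1: a whole existing link, or any other tag (protects attributes).
        # Group 2: the keyword as a whole word outside of those.
        r"(<a\b[^>]*>.*?</a>|<[^>]+>)|\b(" + re.escape(keyword) + r")\b",
        re.IGNORECASE | re.DOTALL,
    )

    def repl(match):
        if match.group(1):  # existing link or tag: keep it unchanged
            return match.group(1)
        return '<a href="%s">%s</a>' % (url, match.group(2))

    return pattern.sub(repl, html)

page = ('<p>I love <a href="http://www.google.com">Google</a> and Google</p>'
        '<p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>')
linked = link_keyword(page, "Google", "http://www.google.com")
print(linked)
```

Here only the second, bare "Google" gets wrapped; the existing link and the image inside a link pass through untouched.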

Source

2012-06-09 11:34:18 tabchas

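Huh? I don't think I follow you. I am not searching for a specific string. I just want to replace keywords with URLs if they are not already wrapped in a URL. –

Can you give an example of the text you would be working with? – tabchas

I use Beautiful Soup to parse my HTML, because parsing HTML with regular expressions can prove tricky. If you use Beautiful Soup, you can play with previous_sibling and previous_element to find out what you need.

You interact with it like this: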

soup.find_all('img')
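
For the specific question of whether a keyword is already wrapped in a link, a minimal sketch could walk up from each matching text node with `find_parent`. It assumes Beautiful Soup 4 (the `bs4` package); `is_linked` is a hypothetical helper name, and the last paragraph of the sample HTML is added here for illustration. Keywords that appear only in `<img>` attributes such as `src` are never returned by a text search, so they need no special handling.

```python
import re
from bs4 import BeautifulSoup

html = ('<p><b>This is an example!</b></p>'
        '<p>I love <a href="http://www.google.com">Google</a></p>'
        '<p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>'
        '<p>Google it</p>')  # extra paragraph, added for this illustration
soup = BeautifulSoup(html, "html.parser")

def is_linked(text_node):
    """Return True if the text node sits anywhere inside an <a> element."""
    return text_node.find_parent("a") is not None

# Pair every text node that mentions the keyword with its link status.
results = [(str(node), is_linked(node))
           for node in soup.find_all(string=re.compile("Google"))]
print(results)

# previous_element is whatever was parsed immediately before the node;
# for text placed directly inside a link, that is the <a> tag itself.
first = soup.find(string="Google")
print(first.previous_element.name)
```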

Source

2012-06-09 21:02:42 topless
