Regular expression to extract all links and the corresponding link text


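I'm brand new to regular expressions, and I'm trying to solve the following two problems:

Write a regular expression that extracts all links and the corresponding link text from an HTML page. For example, if you want to parse: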
```
text1 <a href="http://example.com">hello, world</a> text2 
```

and get the result

http://example.com <tab> hello, world

Do the same thing, but also handle cases where < ...> is nested:

text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3

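So far I'm still stuck on the first problem, and I have tried several approaches. I think my best attempt so far is the regular expression (?<=a href=\")(.*)(?=</a>), which gives me: http://example.com">hello, world

That seems pretty good to me, but I don't know how I should approach the second part. Any help or insight would be greatly appreciated.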

Source

2016-12-15 Zach Ellis

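Regular expressions don't deal well with nesting. You should consider a real HTML parser.

http://stackoverflow.com/a/1732454/6779307

So how am I supposed to answer this question? Just say "plz, no regex for HTML parsing"?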

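If you can use an HTML parser such as BeautifulSoup, the problem comes down to locating the a element, using dictionary-like access for the href attribute, and calling get_text() to get the element's text: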

In [1]: from bs4 import BeautifulSoup 

In [2]: l = [ 
    """text1 <a href="http://example.com">hello, world</a> text2""", 
    """text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3""" 
] 

In [3]: for s in l:
   ...:     soup = BeautifulSoup(s, "html.parser")
   ...:     link = soup.a
   ...:     print(link["href"] + "\t" + link.get_text())
   ...:
http://example.com hello, world 
http://example.com hello, world
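
If installing BeautifulSoup is not an option, the same parser-based approach can be sketched with Python's standard-library html.parser module. This is only a minimal sketch: the names LinkExtractor and extract_links are made up for illustration, and it assumes that a elements are not nested inside each other.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears while a link is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


samples = [
    """text1 <a href="http://example.com">hello, world</a> text2""",
    """text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3""",
]

for sample in samples:
    for href, text in extract_links(sample):
        print(href + "\t" + text)
```

Because the parser understands quoted attribute values, the markup inside the onclick attribute is treated as part of the attribute rather than as a nested tag.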

Source

2016-12-15 20:32:29 alecxe

Since you mentioned regular expressions:

import re 

line1 = "text1 <a href=”http://example.com”>hello, world</a> text2" 
line2 = "text1 <a href=”http://example.com” onclick=”javascript:alert(‘<b>text2</b>’)”>hello, world</a> text3" 


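# Raw strings avoid invalid-escape warnings; "/" needs no escaping in Python.
# The greedy (.*) captures everything between "href=" and the last closing tag.
link1 = re.search(r"<. href=(.*)</.>", line1)
print(link1.group(1))
link2 = re.search(r"<. href=(.*)</.>", line2)
print(link2.group(1))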

Output

”http://example.com”>hello, world 
”http://example.com” onclick=”javascript:alert(‘<b>text2</b>’)”>hello, world

Source

2016-12-15 20:43:56

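With regular expressions, it is sometimes better to think about what you should not capture rather than what you should. This Perl regular expression should reliably capture simple links along with their text: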

#!perl 

use strict; 
use warnings; 

my $sample = q{text1 <a href="http://example.com">hello, world</a> text2}; 

my ($link, $link_text) = $sample =~ m{<a href="([^"]*)"[^>]*>(.*?)</a>}; 

print "$link \t $link_text\n"; 

1;

This will print:

http://example.com <tab> hello, world

To break down what it is doing:

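The first capture, ([^"]*), matches 0 or more characters in the href attribute that are not double quotes. Square brackets list a set of characters, and the leading caret tells the regular expression to look for any character that is not in that set.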

Likewise, I use [^>]*> to find the closing bracket of the a tag without having to worry about other attributes the tag may contain.
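
One caveat worth checking: [^>]*> stops at the first > it meets, even one inside a quoted attribute value, which is exactly the nested case from the second part of the question. The Python sketch below is a port of the pattern, not the original Perl code, and tries a quote-aware variant next to it. The names SIMPLE and QUOTE_AWARE are illustrative, and the variant assumes that href is the first attribute and that attribute values are properly quoted.

```python
import re

# Python port of the Perl pattern from this answer.
SIMPLE = re.compile(r'<a href="([^"]*)"[^>]*>(.*?)</a>')

# Variant that consumes whole quoted attribute values, so a ">" inside
# quotes does not end the tag early. Assumes href is the first attribute.
QUOTE_AWARE = re.compile(
    r'''<a\s+href="([^"]*)"(?:[^>"']|"[^"]*"|'[^']*')*>(.*?)</a>''',
    re.DOTALL,
)

nested = ('text1 <a href="http://example.com" '
          '''onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3''')

simple_url, simple_text = SIMPLE.search(nested).groups()
aware_url, aware_text = QUOTE_AWARE.search(nested).groups()

# The simple pattern ends the tag at the ">" of "<b>", so the captured
# text starts inside the onclick attribute.
print(simple_url + "\t" + simple_text)

# The quote-aware pattern skips the whole onclick value first.
print(aware_url + "\t" + aware_text)
```

This still falls short of a full HTML parser: unquoted attribute values containing >, comments, and malformed markup would need more work.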

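Finally, (.*?) is a non-greedy capture (indicated by the question mark) of 0 or more characters, which captures all of the text inside the link. Without the non-greedy indicator, it would match all of the text up to the last closing </a> tag in the document.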
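
The difference between the greedy and non-greedy forms is easiest to see on input with more than one link. Here is a small Python sketch of the same pattern; the two URLs are invented for the example, and re.findall is used so that every link on the page is returned.

```python
import re

# Hypothetical input containing two links.
html = ('<a href="http://a.example">first</a> and '
        '<a href="http://b.example">second</a>')

# Non-greedy: each match stops at the nearest closing </a>.
lazy = re.findall(r'<a href="([^"]*)"[^>]*>(.*?)</a>', html)

# Greedy: the single match runs on to the last </a> in the input.
greedy = re.findall(r'<a href="([^"]*)"[^>]*>(.*)</a>', html)

for url, text in lazy:
    print(url + "\t" + text)

print(len(greedy), "greedy match:", greedy[0][1])
```

The non-greedy version yields one (URL, text) pair per link, while the greedy version swallows the second link into the text of the first.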

Hopefully this will help you solve part 2 of your assignment. :)

Source

2016-12-16 21:02:15 interduo
