2011-08-11

Why does this happen? The regex works incorrectly: it ignores the nearest <a tag and starts the match at an earlier <a tag.

$url = 'urband.net'; 
$p = '%(.{0,5})<a\s+href=".*?'; 
$p .= $url; 
$p .= '.*?"\s*>(.*?)</a>(.{0,5})%imm'; 

$s = file_get_contents("http://boringmachines.blogspot.com/2006/12/bitbin-herb-recordings.html"); 
$out = preg_match_all($p, $s, $matches, PREG_SET_ORDER); 
print_r($matches); 

I get this array:

Array 
(
    [0] => Array 
     (
      [0] => /div><a href="http://photos1.blogger.com/x/blogger/1112/3281/1600/484028/aliasEPlined.jpg"><img style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 162px; CURSOR: hand; HEIGHT: 149px" height="124" alt="" src="http://photos1.blogger.com/x/blogger/1112/3281/320/925013/aliasEPlined.jpg" width="199" border="0" /></a><span style="font-size:85%;">Due to last weeks bad weather here in Glasgow, I was unable to connect to the web and keep up those regular <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=57230462">Herb Recordings </a>mp3's. Instead, I posted a <a href="http://boringmachines.blogspot.com/2006/11/bitbin-herb-recordings.html#links">video</a> of one of their earlier releases, BitBin. Thankfully, some good has came from thsoe storms, as Herb have kindly donated another mp3, in the form of "<em>May</em>" by BitBin.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;"><a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&amp;friendID=26396670">BitBin</a> is a London based artist and had his "Alias" ep released by Herb earlier this year. He influences are both broad, and for and electronic producer, quite unusual. The likes of Brian Eno, Bola and Warp Records, sit side by side with Brian Wilson, Captain Beefheart and dEUS. His bio may explain a few things, as BitBin claims he is all about "<em>glitching his way through any field of music and reality</em>"</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">"<em>May</em>" itself is an expansive and dark slice of electronica reminiscent of Bola and Gescom. For me, however, this is akin to the music Thom Yorke has been pushing Radiohead towards over the last few years. 
The beats echo those of "<em>Idioteque</em>", and believe, me that is no bad thing.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The "Alias" ep can be ordered<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=57230462"> here</a>, however, the cd release will feature 3 extra tracks, "<em>making it, one longer trip</em>". An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with 
      [1] => /div> 
      [2] => interview and podcast 
      [3] => with 
     ) 

) 

whereas it should give:

Array 
(
    [0] => Array 
     (
      [0] => . An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with 
      [1] => . An 
      [2] => interview and podcast 
      [3] => with 
     ) 

) 

I don't understand StackOverflow users. Whenever someone tries to parse HTML with a regex, the question always gets a -1. – Karolis

@Karolis, I don't think an HTML parser will do what I need. – Mediator

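@Karolis: I'm not the downvoter, but trying to parse HTML with regexes is evil. Regexes were not created for that, and it causes more trouble than it solves, and so on (try Google, you will find the same conclusion over and over). This is not about SO users, it is about common sense. Use DOM (or DOMXPath) instead. It is simpler, more stable, and gets the job done. That said: asking about regexes on HTML is not bad in itself, but the answers should point in the right direction. – Abel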

Answers

Try:

$url = 'urband\.net';
$p = '%(.{0,5})<a\s+href="[^"]*';
$p .= $url;
$p .= '[^"]*"\s*>(.*?)</a>(.{0,5})%imm';
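Why this helps, as far as I can tell: a lazy `.*?` only keeps the match as short as possible from a given starting point. The engine still returns the leftmost match it can complete, and `.` is free to run past the closing quote of the first href and through every tag that follows until it reaches urband.net. `[^"]*` cannot leave the attribute value, so the attempt at the earlier <a fails and the engine moves on. A minimal sketch on a shortened, made-up sample string (the URLs `a.jpg` and `/x` are placeholders, and the flags are reduced to `i` because the sample has no line breaks):

```php
<?php
// Hypothetical, shortened stand-in for the downloaded blog page.
$s = '</div><a href="http://photos1.blogger.com/a.jpg">img</a> text '
   . '<a href="http://profile.myspace.com/x">Herb</a>. An '
   . '<a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with';

// Question's pattern: .*? may cross the closing quote and any later tags.
$lazy = '%(.{0,5})<a\s+href=".*?urband\.net.*?"\s*>(.*?)</a>(.{0,5})%i';

// Answer's pattern: [^"]* has to stay inside the href attribute value.
$strict = '%(.{0,5})<a\s+href="[^"]*urband\.net[^"]*"\s*>(.*?)</a>(.{0,5})%i';

preg_match($lazy, $s, $lazyMatch);
preg_match($strict, $s, $strictMatch);

// The lazy version starts at the first <a in the string.
echo "lazy group 1:   '{$lazyMatch[1]}'\n";
// The strict version starts at the link that actually points to urband.net.
echo "strict group 1: '{$strictMatch[1]}'\n";
echo "strict match:   {$strictMatch[0]}\n";
```

With the lazy pattern the first group is the text in front of the very first link, which is the same symptom as in the question.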

Edit: tested with Perl:

$/ = undef; 

my $str = <DATA>; 
my $count = 0; 

while ($str =~ /(.{0,5})<a\s+href="[^"]*urband\.net[^"]*"\s*>(.*?)<\/a>(.{0,5})/sg) 
{ 
    print "Array\n"; 
    print "(\n"; 
    print " [$count] => Array\n"; 
    print "  (\n"; 
    print "   [0] => $&\n"; 
    print "   [1] => $1\n"; 
    print "   [2] => $2\n"; 
    print "   [3] => $3\n"; 
    print "  )\n"; 
    print "\n"; 
    print ")\n"; 
    ++$count; 
} 

Output:

Array 
(
    [0] => Array 
     (
      [0] => . An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with 
      [1] => . An 
      [2] => interview and podcast 
      [3] => with 
     ) 

)
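
For comparison, the DOM / DOMXPath route suggested in the comments could look roughly like the sketch below. It is not what the asker used: the sample markup is made up, and the `before` / `after` keys are my own guess at how to reproduce the "up to 5 characters around the link" part, by looking at the neighbouring text nodes.

```php
<?php
// Hypothetical sample markup; the real page would come from file_get_contents().
$html = '<div><a href="http://photos1.blogger.com/a.jpg">img</a> '
      . 'An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with</div>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];

// Select every <a> whose href attribute contains the wanted host name.
foreach ($xpath->query('//a[contains(@href, "urband.net")]') as $a) {
    $prev = $a->previousSibling;
    $next = $a->nextSibling;
    $links[] = [
        'href'   => $a->getAttribute('href'),
        'text'   => trim($a->textContent),
        // Up to 5 bytes of adjacent text, only when the neighbour is a text node.
        // substr() counts bytes; mb_substr() would be needed for multibyte text.
        'before' => $prev instanceof DOMText ? substr($prev->nodeValue, -5) : '',
        'after'  => $next instanceof DOMText ? substr($next->nodeValue, 0, 5) : '',
    ];
}

print_r($links);
```

The XPath `contains()` test only looks at the href value, so it cannot wander into an earlier link the way `.*?` does.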