2013-12-09 74 views
1

Excluding URLs that contain a double http: I am referring to this link, which extracts URLs containing a specific word from a web page

regex to print url from any webpage with specific word in url

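But some URLs, such as Pinterest and Facebook referral URLs, also contain the word I am interested in. I do not want those Facebook and Pinterest URLs because they are not direct URLs, so I want to exclude them. I have observed that such URLs contain at least two occurrences of http.

Like this: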

http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2Fpicture-gallery%2Fsensual-in-saree-gallery%2Fspecials%2F3774%2F7%2Findex.htm&media=http%3A%2F%2Fmedia.glamsham.com%2Fdownload%2Fpicturegallery%2Ffeatured%2Fbollywood-beauties-saree%2F722-sensual-in-saree.jpg&guid=gNh5ehWodCZW-0&description=Rani%20Mukerji%20in%20saree%20at%20Sensual%20in%20saree%20picture%20gallery%20picture%20%23%207%20%3A%20glamsham.com

So I want to exclude any URL that contains at least two occurrences of http.

+0

http://stackoverflow.com/questions/1188129/replace-urls-in-text-with-html-links/16509122#16509122 –

+0

'preg_match('/(http.*?)http/', 'https://foo.bar.baz/q=http://blah.com', $matches);' - matches anything between any two 'http'. – Damon

Answers

0

You can try something like this to skip those URIs:

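// select every link whose href contains the needle 
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]"); 
for ($i = 0; $i < $nodelist->length; $i++) { 
    $node = $nodelist->item($i); 
    $href = $node->getAttribute('href'); 
    // print the href only if no second http/https follows the leading http:// 
    if (!preg_match('~^http://.+?https?\b~i', $href)) { 
        echo "$href\n"; 
    } 
} 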

preg_match('~^http://.+?https?\b~i', $href) should match these to-be-excluded URIs.
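
To see what that pattern does and does not catch, here is a minimal sketch. The helper name containsSecondHttp, the example.com sharer link and the second pattern are assumptions for illustration only; the first pattern is the one quoted above. Because it is anchored on a leading http://, a wrapper URL that itself starts with https:// slips through, which the assumed variant tries to cover.

```php
<?php
// Pattern copied from the answer; it only fires when the outer URL starts with http://
const ANSWER_PATTERN = '~^http://.+?https?\b~i';
// Assumed variant (not from the answer): also accept an https:// outer URL.
const VARIANT_PATTERN = '~^https?://.+?https?\b~i';

// Hypothetical helper: true when a second http/https appears after the scheme.
function containsSecondHttp(string $href, string $pattern): bool
{
    return preg_match($pattern, $href) === 1;
}

$samples = [
    // shortened form of the Pinterest referral URL from the question
    'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2F',
    // a direct URL
    'http://www.glamsham.com/picture-gallery/index.htm',
    // made-up wrapper URL served over https
    'https://www.facebook.com/sharer.php?u=http://example.com/',
];

foreach ($samples as $href) {
    $answer  = containsSecondHttp($href, ANSWER_PATTERN) ? 'skip' : 'keep';
    $variant = containsSecondHttp($href, VARIANT_PATTERN) ? 'skip' : 'keep';
    echo "answer=$answer variant=$variant $href\n";
}
```

With these samples the answer's pattern skips only the first URL, while the variant skips the first and the third. Neither pattern says anything useful about relative hrefs, since both require a scheme at the start.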

+0

http://stackoverflow.com/questions/1188129/replace-urls-in-text-with-html-links/16509122#16509122 –

+0

It is not working, please check. – Priya

+0

See the working demo: http://ideone.com/VrN5Jw – anubhava

0

I'd probably do the check while looping through them and drop the ones with a double http, for example:

$request_url = 'YOUR URL'; 
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $request_url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it 
$result = curl_exec($ch); 

$doc = new DOMDocument(); 
libxml_use_internal_errors(true); // collect warnings from malformed HTML instead of emitting them 
$doc->loadHTML($result); // loads your html 
$xpath = new DOMXPath($doc); 
$needle = 'blog'; 

// select every link whose href contains the needle 
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]"); 
$validUrls = array(); 
for ($i = 0; $i < $nodelist->length; $i++) { 
    $node = $nodelist->item($i); 
    $curUrl = $node->getAttribute('href'); 
    // keep hrefs with exactly one "http"; relative hrefs (zero "http") are skipped as well 
    if (substr_count($curUrl, 'http') === 1) { 
        $validUrls[] = $curUrl; 
    } 
} 

var_dump($validUrls); // all urls with only one "http" 
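
Note that the === 1 test also drops relative hrefs such as /blog/post-1, because they contain no "http" at all. A possible variant, sketched here under assumptions (the helper name isDirectUrl and the example.com links are made up for illustration), keeps anything with fewer than two occurrences instead:

```php
<?php
// Hypothetical helper: true when the href contains "http" fewer than two times.
// "https" counts as one occurrence, and URL-encoded "http%3A" is counted too.
function isDirectUrl(string $href): bool
{
    return substr_count(strtolower($href), 'http') < 2;
}

// Made-up hrefs standing in for the values read from the DOM.
$hrefs = [
    '/blog/post-1',
    'http://example.com/blog/post-2',
    'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fexample.com%2Fblog%2Fpost-2',
];

// array_values() reindexes the result after filtering.
$validUrls = array_values(array_filter($hrefs, 'isDirectUrl'));
print_r($validUrls);
```

This keeps the relative link and the direct link and drops the Pinterest wrapper. It is still only a heuristic: a direct URL whose path happens to contain "http" twice would be dropped too.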
+0

It does not work where relative URLs are given. – Priya