2010-02-17 313 views
146

How can I check whether a URL exists (i.e. does not return a 404) in PHP?

+4

fopen, the header() function, fsock – X10nD

+3

Possible duplicate of [How can one check to see if a remote file exists using PHP?](http://stackoverflow.com/questions/981954/how-can-one-check-to-see-if-a-remote-file-exists-using-php) –

Answers

229

Here:

$file = 'http://www.domain.com/somefile.jpg'; 
$file_headers = @get_headers($file); 
if(!$file_headers || $file_headers[0] == 'HTTP/1.1 404 Not Found') { 
    $exists = false; 
} 
else { 
    $exists = true; 
} 

From here, and right below the post above, there is a curl solution:

function url_exists($url) { 
    if (!$fp = curl_init($url)) return false; 
    return true; 
} 
+14

I'm afraid the cURL way won't work like this. Have a look at this: http://stackoverflow.com/questions/981954/how-can-one-check-to-see-if-a-remote-file-exists-using-php/982045#982045 –

+0

Shouldn't we close the file handle? – ekerner

+4

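Some websites return a different `$file_headers[0]` on their error page. For example, youtube.com's error page gives 'HTTP/1.0 404 Not Found' (the difference being 1.0 versus 1.1). What should be done in that case? –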
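
The version mismatch raised in the comment above can be sidestepped by comparing only the numeric status code rather than the whole status line. A minimal sketch, assuming a hypothetical helper named `statusCodeFromLine()` that is not part of the original answer:

```php
<?php
// Hypothetical helper (not from the original answer): extract the numeric
// status code from a status line such as "HTTP/1.0 404 Not Found".
function statusCodeFromLine(string $statusLine): ?int
{
    // Match "HTTP/<version> <3-digit code>", ignoring version and reason phrase.
    if (preg_match('#^HTTP/\S+\s+(\d{3})\b#i', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null; // not a status line
}

var_dump(statusCodeFromLine('HTTP/1.0 404 Not Found')); // int(404)
var_dump(statusCodeFromLine('HTTP/1.1 404 Not Found')); // int(404)
```

With something like this, the check in the answer could compare `statusCodeFromLine($file_headers[0])` against 404 instead of matching the full string.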

43
$headers = @get_headers($this->_value); 
if(!$headers || strpos($headers[0],'200')===false)return false; // also fail when the request itself failed

So any time you contact a website and get something other than 200 OK, this will work (it returns false).

+13

But what if it's a redirect? The domain is still valid, but it will be left out. –

+4

The above in one line: `return strpos(@get_headers($url)[0], '200') === false ? false : true`. Might be useful. – Dejv

+0

What is $this? :( – Andrew

3

Pretty fast:

function http_response($url){ 
    $resURL = curl_init(); 
    curl_setopt($resURL, CURLOPT_URL, $url); 
    curl_setopt($resURL, CURLOPT_BINARYTRANSFER, 1); 
    curl_setopt($resURL, CURLOPT_HEADERFUNCTION, 'curlHeaderCallback'); 
    curl_setopt($resURL, CURLOPT_FAILONERROR, 1); 
    curl_exec ($resURL); 
    $intReturnCode = curl_getinfo($resURL, CURLINFO_HTTP_CODE); 
    curl_close ($resURL); 
    if ($intReturnCode != 200 && $intReturnCode != 302 && $intReturnCode != 304) { return 0; } else return 1; 
} 

echo 'google:'; 
echo http_response('http://www.google.com'); 
echo '/ ogogle:'; 
echo http_response('http://www.ogogle.com'); 
+0

Too complicated :) http://stackoverflow.com/questions/981954/how-can-one-check-to-see-if-a-remote-file-exists-using-php/982045#982045 –

+0

I get this exception when the URL exists: Could not call the CURLOPT_HEADERFUNCTION – safiot

43

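When figuring out whether a URL exists from PHP, there are a few things to pay attention to:

  • Is the URL itself valid (a string, not empty, good syntax)? This is quick to check server side.
  • Waiting for a response might take time and block code execution.
  • Not all headers returned by get_headers() are well formed.
  • Use curl (if you can).
  • Avoid fetching the entire body/content; request only the headers.
  • Consider redirecting URLs:
    • Do you want the first code returned?
    • Or do you want to follow all redirects and return the last code?
    • You might end up with a 200, but the page could still redirect using meta tags or JavaScript. Figuring out what happens after that is tough.

Keep in mind that whatever method you use, it takes time to wait for a response.
All code might (and probably will) halt until you either know the result or the request has timed out.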

For example: the code below could take a long time to display the page if the URLs are invalid or unreachable:

<?php 
$urls = getUrls(); // some function getting say 10 or more external links 

foreach($urls as $k=>$url){ 
    // this could potentially take 0-30 seconds each 
    // (more or less depending on connection, target site, timeout settings...) 
    if(! isValidUrl($url)){ 
    unset($urls[$k]); 
    } 
} 

echo "yay all done! now show my site"; 
foreach($urls as $url){ 
    echo "<a href=\"{$url}\">{$url}</a><br/>"; 
} 
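
When cURL is not available, one way to bound the wait in a loop like the one above is to pass a stream context with a timeout to get_headers(). This is a rough sketch under assumptions: it needs PHP 7.1 or later for the context argument, the helper names `headRequestContextOptions()` and `fetchHeadersWithTimeout()` are hypothetical, and the 5-second default is an arbitrary choice.

```php
<?php
// Hypothetical helper: build stream-context options for a short HEAD request.
function headRequestContextOptions(float $timeoutSeconds = 5.0): array
{
    return [
        'http' => [
            'method'  => 'HEAD',          // ask for headers only, no body
            'timeout' => $timeoutSeconds, // read timeout in seconds
        ],
    ];
}

// Hypothetical helper: returns the raw header lines, or false on failure.
function fetchHeadersWithTimeout(string $url, float $timeoutSeconds = 5.0)
{
    $context = stream_context_create(headRequestContextOptions($timeoutSeconds));
    // The third (context) argument of get_headers() needs PHP 7.1 or later.
    return @get_headers($url, false, $context);
}
```

A call such as `fetchHeadersWithTimeout($url, 2.0)` could then stand in for a bare `@get_headers($url)`; the returned array still has to be inspected for the status line.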

The functions below could be helpful; you will probably want to modify them to suit your needs:

function isValidUrl($url){ 
     // first do some quick sanity checks: 
     if(!$url || !is_string($url)){ 
      return false; 
     } 
     // quick check url is roughly a valid http request: (http://blah/...) 
     if(! preg_match('/^http(s)?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(\/.*)?$/i', $url)){ 
      return false; 
     } 
     // the next bit could be slow: 
     if(getHttpResponseCode_using_curl($url) != 200){ 
//  if(getHttpResponseCode_using_getheaders($url) != 200){ // use this one if you cant use curl 
      return false; 
     } 
     // all good! 
     return true; 
    } 

    function getHttpResponseCode_using_curl($url, $followredirects = true){ 
     // returns int responsecode, or false (if url does not exist or connection timeout occurs) 
     // NOTE: could potentially take up to 0-30 seconds , blocking further code execution (more or less depending on connection, target site, and local timeout settings)) 
     // if $followredirects == false: return the FIRST known httpcode (ignore redirects) 
     // if $followredirects == true : return the LAST known httpcode (when redirected) 
     if(! $url || ! is_string($url)){ 
      return false; 
     } 
     $ch = @curl_init($url); 
     if($ch === false){ 
      return false; 
     } 
     @curl_setopt($ch, CURLOPT_HEADER   ,true); // we want headers 
     @curl_setopt($ch, CURLOPT_NOBODY   ,true); // dont need body 
     @curl_setopt($ch, CURLOPT_RETURNTRANSFER ,true); // catch output (do NOT print!) 
     if($followredirects){ 
      @curl_setopt($ch, CURLOPT_FOLLOWLOCATION ,true); 
      @curl_setopt($ch, CURLOPT_MAXREDIRS  ,10); // fairly random number, but could prevent unwanted endless redirects with followlocation=true 
     }else{ 
      @curl_setopt($ch, CURLOPT_FOLLOWLOCATION ,false); 
     } 
//  @curl_setopt($ch, CURLOPT_CONNECTTIMEOUT ,5); // fairly random number (seconds)... but could prevent waiting forever to get a result 
//  @curl_setopt($ch, CURLOPT_TIMEOUT  ,6); // fairly random number (seconds)... but could prevent waiting forever to get a result 
//  @curl_setopt($ch, CURLOPT_USERAGENT  ,"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"); // pretend we're a regular browser 
     @curl_exec($ch); 
     if(@curl_errno($ch)){ // should be 0 
      @curl_close($ch); 
      return false; 
     } 
     $code = @curl_getinfo($ch, CURLINFO_HTTP_CODE); // note: php.net documentation shows this returns a string, but really it returns an int 
     @curl_close($ch); 
     return $code; 
    } 

    function getHttpResponseCode_using_getheaders($url, $followredirects = true){ 
     // returns string responsecode, or false if no responsecode found in headers (or url does not exist) 
     // NOTE: could potentially take up to 0-30 seconds , blocking further code execution (more or less depending on connection, target site, and local timeout settings)) 
     // if $followredirects == false: return the FIRST known httpcode (ignore redirects) 
     // if $followredirects == true : return the LAST known httpcode (when redirected) 
     if(! $url || ! is_string($url)){ 
      return false; 
     } 
     $headers = @get_headers($url); 
     if($headers && is_array($headers)){ 
      if($followredirects){ 
       // we want the the last errorcode, reverse array so we start at the end: 
       $headers = array_reverse($headers); 
      } 
      foreach($headers as $hline){ 
       // search for things like "HTTP/1.1 200 OK" , "HTTP/1.0 200 OK" , "HTTP/1.1 301 PERMANENTLY MOVED" , "HTTP/1.1 400 Not Found" , etc. 
       // note that the exact syntax/version/output differs, so there is some string magic involved here 
       if(preg_match('/^HTTP\/\S+\s+([1-9][0-9][0-9])\s+.*/', $hline, $matches)){// "HTTP/*** ### ***" 
        $code = $matches[1]; 
        return $code; 
       } 
      } 
      // no HTTP/xxx found in headers: 
      return false; 
     } 
     // no headers : 
     return false; 
    } 
+10

+1 for the thoroughness of this answer, which is totally underrated for some reason! – sousdev

+0

getHttpResponseCode_using_curl() always returns 200 in my case. –

+2

If someone has the same problem, check the DNS name servers... use OpenDNS without followredirects http://stackoverflow.com/a/11072947/1829460 –

7
$url = 'http://google.com'; 
$not_url = 'stp://google.com'; 

if (@file_get_contents($url)): echo "Found '$url'!"; 
else: echo "Can't find '$url'."; 
endif; 
if (@file_get_contents($not_url)): echo "Found '$not_url'!"; 
else: echo "Can't find '$not_url'."; 
endif; 

// Found 'http://google.com'!Can't find 'stp://google.com'. 
+0

Brilliant solution! – kouton

+2

If allow-url-fopen is off, this won't work. - http://www.php.net/manual/en/filesystem.configuration.php#ini.allow-url-fopen –

+2

I'd suggest reading only the first byte... if (@file_get_contents($url, false, NULL, 0, 1)) –
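
On the allow_url_fopen point raised in the comments above: the setting can be checked at runtime before relying on file_get_contents() for a URL. A small sketch, assuming a hypothetical helper named `urlFopenEnabled()` that is not part of any answer here:

```php
<?php
// Hypothetical helper: true when PHP is allowed to open URLs with file functions.
function urlFopenEnabled(): bool
{
    // ini_get() returns the setting as a string (for example "1" or "").
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

if (!urlFopenEnabled()) {
    echo "allow_url_fopen is off; file_get_contents() cannot fetch URLs here.";
}
```

When the setting is off, a cURL-based check is the usual fallback.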

6
function URLIsValid($URL) 
{ 
    $exists = true; 
    $file_headers = @get_headers($URL); 
    if (!$file_headers) { return false; } // request failed: treat the URL as not existing
    $InvalidHeaders = array('404', '403', '500'); 
    foreach($InvalidHeaders as $HeaderVal) 
    { 
      if(strstr($file_headers[0], $HeaderVal)) 
      { 
        $exists = false; 
        break; 
      } 
    } 
    return $exists; 
} 
0

The simple way is curl (and it's faster too)

<?php 
$mylinks="http://site.com/page.html"; 
$handlerr = curl_init($mylinks); 
curl_setopt($handlerr, CURLOPT_RETURNTRANSFER, TRUE); 
$resp = curl_exec($handlerr); 
$ht = curl_getinfo($handlerr, CURLINFO_HTTP_CODE); 


if ($ht >= 200 && $ht < 400) // 2xx/3xx: the URL responded; 404 or 0 (no response) falls through
    { echo 'OK';} 
else { echo 'NO';} 

?> 
14

cURL can't be used on some servers; you can use this code instead

<?php 
$url = 'http://www.example.com'; 
$array = @get_headers($url); 
$string = $array ? $array[0] : ''; // empty when the request failed
if(strpos($string,"200") !== false) 
    { 
    echo 'url exists'; 
    } 
    else 
    { 
    echo 'url does not exist'; 
    } 
?> 
+0

It may not work for 302-303 redirects or, for example, 304 Not Modified – Zippp

2
function urlIsOk($url) 
{ 
    $headers = @get_headers($url); 
    if (!$headers) { return false; } // request failed
    $httpStatus = intval(substr($headers[0], 9, 3)); // assumes an "HTTP/1.x " prefix
    if ($httpStatus<400) 
    { 
     return true; 
    } 
    return false; 
} 
4

karim79's get_headers() solution didn't work for me, as I got crazy results with Pinterest.

get_headers(): SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed 

Array 
(
    [url] => https://www.pinterest.com/jonathan_parl/ 
    [exists] => 
) 

get_headers(): Failed to enable crypto 

Array 
(
    [url] => https://www.pinterest.com/jonathan_parl/ 
    [exists] => 
) 

get_headers(https://www.pinterest.com/jonathan_parl/): failed to open stream: operation failed 

Array 
(
    [url] => https://www.pinterest.com/jonathan_parl/ 
    [exists] => 
) 

Anyway, this developer demonstrates that cURL is much faster than get_headers():

http://php.net/manual/fr/function.get-headers.php#104723

Since many people asked karim79 to fix his cURL solution, here is the solution I built today.

/** 
* Send an HTTP request to the $url and check the header posted back. 
* 
* @param $url String url to which we must send the request. 
* @param $failCodeList Int array list of code for which the page is considered invalid. 
* 
* @return Boolean 
*/ 
public static function isUrlExists($url, array $failCodeList = array(404)){ 

    $exists = false; 

    if(!StringManager::stringStartWith($url, "http") and !StringManager::stringStartWith($url, "ftp")){ 

     $url = "https://" . $url; 
    } 

    if (preg_match(RegularExpression::URL, $url)){ 

     $handle = curl_init($url); 


     curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); 

     curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false); 

     curl_setopt($handle, CURLOPT_HEADER, true); 

     curl_setopt($handle, CURLOPT_NOBODY, true); 

     curl_setopt($handle, CURLOPT_USERAGENT, true); 


     $headers = curl_exec($handle); 

     curl_close($handle); 


     if (empty($failCodeList) or !is_array($failCodeList)){ 

      $failCodeList = array(404); 
     } 

     if (!empty($headers)){ 

      $exists = true; 

      $headers = explode(PHP_EOL, $headers); 

      foreach($failCodeList as $code){ 

       if (is_numeric($code) and strpos($headers[0], strval($code)) !== false){ 

        $exists = false; 

        break; 
       } 
      } 
     } 
    } 

    return $exists; 
} 

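Let me explain the cURL options:

CURLOPT_RETURNTRANSFER: return a string instead of displaying the calling page on the screen.

CURLOPT_SSL_VERIFYPEER: cURL won't check the certificate

CURLOPT_HEADER: include the header in the string

CURLOPT_NOBODY: don't include the body in the string

CURLOPT_USERAGENT: some sites need it to function properly (for example: https://plus.google.com)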


Additional note: in this function I use Diego Perini's regex to validate the URL before sending the request:

const URL = "%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu"; //@copyright Diego Perini 

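Additional note 2: I explode the header string and use headers[0] to be sure to validate only the return code and message (for example: 200, 404, 405, etc.).

Additional note 3: sometimes validating only the 404 code is not enough (see the unit test), so there is an optional $failCodeList parameter to supply the full list of codes to reject.

And, of course, here is the unit test (including all the popular social networks) to validate my code: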

public function testIsUrlExists(){ 

//invalid 
$this->assertFalse(ToolManager::isUrlExists("woot")); 

$this->assertFalse(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque4545646456")); 

$this->assertFalse(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque890800")); 

$this->assertFalse(ToolManager::isUrlExists("https://instagram.com/mariloubiz1232132/", array(404, 405))); 

$this->assertFalse(ToolManager::isUrlExists("https://www.pinterest.com/jonathan_parl1231/")); 

$this->assertFalse(ToolManager::isUrlExists("https://regex101.com/546465465456")); 

$this->assertFalse(ToolManager::isUrlExists("https://twitter.com/arcadefire4566546")); 

$this->assertFalse(ToolManager::isUrlExists("https://vimeo.com/**($%?%$", array(400, 405))); 

$this->assertFalse(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666456456456")); 


//valid 
$this->assertTrue(ToolManager::isUrlExists("www.google.ca")); 

$this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque")); 

$this->assertTrue(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque")); 

$this->assertTrue(ToolManager::isUrlExists("https://instagram.com/mariloubiz/")); 

$this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque")); 

$this->assertTrue(ToolManager::isUrlExists("https://www.pinterest.com/")); 

$this->assertTrue(ToolManager::isUrlExists("https://regex101.com")); 

$this->assertTrue(ToolManager::isUrlExists("https://twitter.com/arcadefire")); 

$this->assertTrue(ToolManager::isUrlExists("https://vimeo.com/")); 

$this->assertTrue(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666")); 
} 

Great success to all,

Jonathan Parent-Lévesque from Montreal

+0

Thank you, it works well. –

5

I use this function:

/** 
* @param $url 
* @param array $options 
* @return string 
* @throws Exception 
*/ 
function checkURL($url, array $options = array()) { 
    if (empty($url)) { 
     throw new Exception('URL is empty'); 
    } 

    // list of HTTP status codes 
    $httpStatusCodes = array(
     100 => 'Continue', 
     101 => 'Switching Protocols', 
     102 => 'Processing', 
     200 => 'OK', 
     201 => 'Created', 
     202 => 'Accepted', 
     203 => 'Non-Authoritative Information', 
     204 => 'No Content', 
     205 => 'Reset Content', 
     206 => 'Partial Content', 
     207 => 'Multi-Status', 
     208 => 'Already Reported', 
     226 => 'IM Used', 
     300 => 'Multiple Choices', 
     301 => 'Moved Permanently', 
     302 => 'Found', 
     303 => 'See Other', 
     304 => 'Not Modified', 
     305 => 'Use Proxy', 
     306 => 'Switch Proxy', 
     307 => 'Temporary Redirect', 
     308 => 'Permanent Redirect', 
     400 => 'Bad Request', 
     401 => 'Unauthorized', 
     402 => 'Payment Required', 
     403 => 'Forbidden', 
     404 => 'Not Found', 
     405 => 'Method Not Allowed', 
     406 => 'Not Acceptable', 
     407 => 'Proxy Authentication Required', 
     408 => 'Request Timeout', 
     409 => 'Conflict', 
     410 => 'Gone', 
     411 => 'Length Required', 
     412 => 'Precondition Failed', 
     413 => 'Payload Too Large', 
     414 => 'Request-URI Too Long', 
     415 => 'Unsupported Media Type', 
     416 => 'Requested Range Not Satisfiable', 
     417 => 'Expectation Failed', 
     418 => 'I\'m a teapot', 
     422 => 'Unprocessable Entity', 
     423 => 'Locked', 
     424 => 'Failed Dependency', 
     425 => 'Unordered Collection', 
     426 => 'Upgrade Required', 
     428 => 'Precondition Required', 
     429 => 'Too Many Requests', 
     431 => 'Request Header Fields Too Large', 
     449 => 'Retry With', 
     450 => 'Blocked by Windows Parental Controls', 
     500 => 'Internal Server Error', 
     501 => 'Not Implemented', 
     502 => 'Bad Gateway', 
     503 => 'Service Unavailable', 
     504 => 'Gateway Timeout', 
     505 => 'HTTP Version Not Supported', 
     506 => 'Variant Also Negotiates', 
     507 => 'Insufficient Storage', 
     508 => 'Loop Detected', 
     509 => 'Bandwidth Limit Exceeded', 
     510 => 'Not Extended', 
     511 => 'Network Authentication Required', 
     599 => 'Network Connect Timeout Error' 
    ); 

    $ch = curl_init($url); 
    curl_setopt($ch, CURLOPT_NOBODY, true); 
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 

    if (isset($options['timeout'])) { 
     $timeout = (int) $options['timeout']; 
     curl_setopt($ch, CURLOPT_TIMEOUT, $timeout); 
    } 

    curl_exec($ch); 
    $returnedStatusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE); 
    curl_close($ch); 

    if (array_key_exists($returnedStatusCode, $httpStatusCodes)) { 
     return "URL: '{$url}' - Error code: {$returnedStatusCode} - Definition: {$httpStatusCodes[$returnedStatusCode]}"; 
    } else { 
     return "'{$url}' does not exist"; 
    } 
} 
0

Another way to check whether a URL is valid can be:

<?php 

    if (isValidURL("http://www.gimepix.com")) { 
     echo "URL is valid..."; 
    } else { 
     echo "URL is not valid..."; 
    } 

    function isValidURL($url) { 
     $file_headers = @get_headers($url); 
     // Guard against a failed request before reading the status line.
     if ($file_headers && strpos($file_headers[0], "200 OK") > 0) { 
     return true; 
     } else { 
     return false; 
     } 
    } 
?> 
1

Here is a solution that reads only the first byte of the source, returning false if file_get_contents fails. This will also work for remote files such as images.

function urlExists($url) 
{ 
    if (@file_get_contents($url,false,NULL,0,1) !== false) // strict check: a first byte of "0" is still a success
    { 
     return true; 
    } 
    return false; 
} 
2

All of the above solutions + extra sugar. (Ultimate AIO solution)

/** 
* Check that given URL is valid and exists. 
* @param string $url URL to check 
* @return bool TRUE when valid | FALSE anyway 
*/ 
function urlExists ($url) { 
    // Remove all illegal characters from a url 
    $url = filter_var($url, FILTER_SANITIZE_URL); 

    // Validate URI 
    if (filter_var($url, FILTER_VALIDATE_URL) === FALSE 
     // check only for http/https schemes. 
     || !in_array(strtolower(parse_url($url, PHP_URL_SCHEME)), ['http','https'], true) 
    ) { 
     return false; 
    } 

    // Check that URL exists 
    $file_headers = @get_headers($url); 
    return !(!$file_headers || $file_headers[0] === 'HTTP/1.1 404 Not Found'); 
} 

Example:

var_dump (urlExists('http://stackoverflow.com/')); 
// Output: true; 
1

To check whether a URL is online or offline ---

function get_http_response_code($theURL) { 
    $headers = @get_headers($theURL); 
    return $headers ? substr($headers[0], 9, 3) : false; // false when the request failed
}