2012-10-24 127 views
2

I have 4 bots running at the same time; each bot is opened in a new tab. It is basically a crawler script that, after a while, stops scraping the desired vars. The flow:

2/ Get the target URL from the db.
3/ Fetch the content with CURL or file_get_contents.
4/ Set "$html" with simple_html_dom.
5/ Include an "engine" that scrapes and manipulates the content.
6/ Finally - check that the content is valid, optimize it, and store it in the db. After X links it refreshes the page and continues the crawl process.

The PHP script reports no errors.

Everything works like magic! But lately, after a few minutes (not the same amount of time each run), all the bots
stop (no error is shown), sometimes only 3 of them...
There is a script that refreshes the page at an interval of Y minutes; this keeps my
bots working if they get stuck, but it is not an answer to this question.

I checked the Apache error log and it does not point to anything strange.

Do you have any ideas?
The shrunken code (with comments):

ini_set('user_agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5'); 
error_reporting(E_ALL); 
include ("peulot/stations1.php");//with connection and vars 
include_once('simple_html_dom.php'); 

//DEFINE VALUES: 
/* 
here vars are declared and set 
*/ 

     echo " 
      <script language=javascript> 

      var int=self.setInterval(function(){ refresh2(); },".$protect."); 

      var counter; 

      function refresh2() { 
       geti(); 
       link = 'store_url_beta.php?limit_link=".$limit_link."&storage_much=".$dowhile."&jammed=".($jammed_count+=1)."&bot=".$sbot."&counter='; 
       link = link+counter; 
       window.location=link; 
       } 

      function changecolor(answer) 
        { 
       document.getElementById(answer).style.backgroundColor = \"#00FF00\"; 
        } 
      </script>";//this is the refresh if jammed 


//some functions: 
/* 
function utf8_encode_deep --> for encoding 
function hexbin --> for simhash fingerprint 
function Charikar_SimHash --> for simhash fingerprint 
function SimHashfingerprint --> for simhash fingerprint 
*/    

     while ($i<=$dowhile) 
      { 

      //final values after crawling: 
      $link_insert=""; 
      $p_ele_insert=""; 
      $title_insert=""; 
      $alt_insert=""; 
      $h_insert=""; 
      $charset=""; 
      $text=""; 
      $result_key=""; 
      $result_desc=""; 
      $note=""; 

      ///this connection is to check that there are links to crawl in data base... + grab the line for crawl. 
      $sql = "SELECT * FROM $table2 WHERE crawl='notyet' AND flag_avoid $regex $bot_action"; 
      $rs_result = mysql_query ($sql); 
      $idr = mysql_fetch_array($rs_result);       
      unset ($sql); 
      unset ($rs_result); 

       set_time_limit(0); 

       $qwe++; 

        $target_url = $idr['live_link'];//set the link we are about to crawl now. 
        $matches_relate = $idr['relate'];//to insert at last 
        $linkid = $idr['id'];//link id to mark it as crawled in the end 
        $crawl_status = $idr['crawl'];//saving this to check if we update storage table or insert new row 
        $bybot_status = $idr['by_bot'];//saving this to check if we update storage table or insert new row 

        $status ="UPDATE $table2 SET crawl='working', by_bot='".$bot."', flag_avoid='$stat' WHERE id='$linkid'"; 
        if(!mysql_query($status)) die('problem15');     

        $ch = curl_init(); 

        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5'); 
        curl_setopt($ch, CURLOPT_URL, $target_url); 
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt"); 
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt"); 
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 
        curl_setopt($ch, CURLOPT_HEADER, 0); 
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

        $str = curl_exec($ch); 
        curl_close($ch); 

        if (strlen($str)<100) 
          { 
          //do it with file get content 
          }    
     if (strlen($html)>500)//NOTE: $html is presumably built from $str in the code omitted above 
     { 

          require("engine.php");//GENERATE FATAL ERROR IF CRAWLER ENGINE AND PARSER NOT AVAILABLE 

         flush();//that will flush a result without any refresh 
         usleep(300);         

           //before inserting into table storage check if it was crawled before and then decide if to insert or update: 
           if ($crawl_status=="notyet"&&$bybot_status=="notstored") 
              { 
              //insert values 
              } 
              else 
              { 
              //update values 
              } 

         flush();//that will flush a result without any refresh 
         usleep(300); 


         if ($qwe>=$refresh) //for page refresh call 
          { 
          $secounter++;//counter for session 
          //optimize data       
          echo "<script type='text/javascript'>function refresh() { window.location='store_url_beta.php?limit_link=".$limit_link."&counter=".$i."&secounter=".$secounter."&storage_much=".$dowhile."&jammed=".$jammed."&bot=".$sbot."'; } refresh(); </script>";       
          } 
      }//end of if html is not empty. 
      else 
      {//mark a flag @4 and write title jammed! 

      //here - will update the table and note that its not possible to crawl 

         if ($qwe>=$refresh) 
          { 
          $secounter++;//counter for session 
          //optimize data       
          echo "<script type='text/javascript'>function refresh() { window.location='store_url_beta.php?limit_link=".$limit_link."&counter=".$i."&secounter=".$secounter."&storage_much=".$dowhile."&jammed=".$jammed."&bot=".$sbot."'; } refresh(); </script>";       

          } 
      }//end of else: couldn't grab anything 
      unset($html); 
     }//end of the while loop 
      mysql_close(); 
      echo "<script language=javascript> window.clearInterval(int); </script>"; 
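The `//do it with file get content` branch in the loop above is left empty in the post. A minimal sketch of such a fallback could look like the following (the function name `fetch_with_fgc` and the timeout value are mine, not from the original code; the user agent mirrors the one used in the cURL request):

```php
<?php
// Hypothetical fallback fetcher for when cURL returns too little data.
function fetch_with_fgc($url) {
    $context = stream_context_create(array(
        'http' => array(
            'method'          => 'GET',
            'follow_location' => 1,
            'timeout'         => 30,
            'header'          => "User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5\r\n",
        ),
    ));
    // @ suppresses the warning on failure; we normalize failure to ''.
    $body = @file_get_contents($url, false, $context);
    return ($body === false) ? '' : $body;
}

// Usage inside the loop, after curl_exec():
// if (strlen($str) < 100) { $str = fetch_with_fgc($target_url); }
```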

EDIT:
After constant testing and logging (following Jack's advice below) I found nothing! The only thing that happens when the bots stop is this in the Apache log:

[Thu Oct 25 01:01:33 2012] [error] [client 127.0.0.1] File does not exist: C:/wamp/www/favicon.ico 
zend_mm_heap corrupted 
[Thu Oct 25 01:01:51 2012] [notice] Parent: child process exited with status 1 -- Restarting. 
[Thu Oct 25 01:01:51 2012] [notice] Apache/2.2.22 (Win64) mod_ssl/2.2.22 OpenSSL/1.0.1c PHP/5.3.13 configured -- resuming normal operations 
[Thu Oct 25 01:01:51 2012] [notice] Server built: May 13 2012 19:41:17 
[Thu Oct 25 01:01:51 2012] [notice] Parent: Created child process 736 
[Thu Oct 25 01:01:51 2012] [warn] Init: Session Cache is not configured [hint: SSLSessionCache] 
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Child process is running 
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Acquired the start mutex. 
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Starting 200 worker threads. 
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Starting thread to listen on port 80. 
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Starting thread to listen on port 80. 
[Thu Oct 25 01:01:51 2012] [error] [client 127.0.0.1] File does not exist: C:/wamp/www/favicon.ico 

This line is the mystery and I really don't know what to do about it, please help me!
[Thu Oct 25 01:01:51 2012] [notice] Parent: child process exited with status 1 -- Restarting.

+0

What a mess. Please indent your code properly. – OptimusCrime

+3

That is a lot of code you expect people to go through to give you an answer. You should first try to debug it yourself and see if you can narrow down where the problem appears. –

+0

Haha, shrunk it now – shlomix

Answer

0

Finding these kinds of problems usually comes down to plain old logging.

You should have each worker write entries to its own log file before and after potentially long operations, including debug messages, line numbers, memory usage, whatever you need to know; let it jam a few times and analyze the logs.

If there is a pattern (i.e. the logs stop showing data at the same point), you can narrow down your search; if not, you are probably looking at a memory problem or some other fatal crash.

It also helps to trace back any recent changes to your setup, even ones that seem unrelated.
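As a rough illustration of the per-worker logging described above (the helper name `bot_log` and the log-file naming are mine, not from the answer):

```php
<?php
// Minimal per-worker logger: each bot appends to its own file with a
// timestamp, the calling line number, and current memory usage.
function bot_log($botId, $message, $line) {
    $entry = sprintf(
        "[%s] line %d | mem %.2f MB | %s\n",
        date('Y-m-d H:i:s'),
        $line,
        memory_get_usage(true) / 1048576,
        $message
    );
    file_put_contents("bot_{$botId}.log", $entry, FILE_APPEND);
}

// Usage around a potentially long operation:
// bot_log($bot, "before curl_exec for $target_url", __LINE__);
// $str = curl_exec($ch);
// bot_log($bot, 'after curl_exec, got ' . strlen($str) . ' bytes', __LINE__);
```

If the last entry across all bot logs is always written at the same spot, that operation is the place to dig.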

+0

Thanks for the quick response! I checked the Apache log again; it says a lot, but no errors. The only one is this: [error] [client 127.0.0.1] File does not exist: C:/wamp/www/favicon.ico – shlomix

+0

Recently added these settings to httpd: KeepAlive On, MaxKeepAliveRequests 2000, MaxRequestsPerChild 2000, KeepAliveTimeout 350, HostnameLookups Off – shlomix

+0

@user1769877 I don't think I made myself clear enough; *you* have to do the logging. –