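Scraping content from a website page

I have a problem and need help scraping content from a web page.

My plan: 1. get IP addresses from a free proxy list (hidemyass) 2. convert them to XML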

$html = file_get_contents('http://www.hidemyass.com/proxy-list/'); 

//$body = explode('<tbody>', $html); 
$body = $html; 


$xml = simplexml_load_string("<?xml version='1.0' encoding='utf-8'?><xml />"); 

$rows = array(); 
foreach (array_slice(explode('<td>', end($body)), 1) as $row) 
{ 
    preg_match('/span>([0-9])<\/span>/', $row, $ids); 
    preg_match('/span>([0-9])<\/span>/', $row, $dir); 
    preg_match('/span>([0-9])<\/span>/', $row, $due); 


    $node = $xml->addChild('train'); 

    $node->addChild('route', $ids[1]); 
    $node->addChild('direction', $dir[1]); 
    $node->addChild('due', $due[1]); 
} 

header('Content-Type: text/xml'); 
echo $xml->asXML();

But it still doesn't work...

Can you help me?

Thanks, JK

Source

2012-04-09 kimpuler

Don't use regular expressions to parse HTML: http://stackoverflow.com/a/1732454/118068. Use DOM instead. – 2012-04-09 18:35:43

Wow... thanks for the quick response, Marc. I'll study it. – kimpuler 2012-04-09 18:58:50

Just added a complete working version – Baba 2012-04-09 19:31:25
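The DOM approach recommended in the comment above could look roughly like the sketch below. It uses PHP's built-in DOMDocument and DOMXPath rather than a regular expression. The sample table, the column positions (IP in the second cell, port in the third) and the function name extractRows are assumptions made for illustration; the real page layout is not shown in this thread. It assumes PHP 7.1 or later.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = <<<HTML
<table>
  <tr><th>Updated</th><th>IP</th><th>Port</th></tr>
  <tr><td>1 min</td><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td>2 min</td><td>10.0.0.2</td><td>3128</td></tr>
</table>
HTML;

/**
 * Return [ip, port] pairs taken from the 2nd and 3rd <td> of every table row.
 */
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $cells = $xpath->query('td', $tr);
        if ($cells->length < 3) {
            continue; // header rows use <th>, so they have no <td> cells
        }
        $rows[] = [
            trim($cells->item(1)->textContent),
            trim($cells->item(2)->textContent),
        ];
    }
    return $rows;
}

foreach (extractRows($html) as [$ip, $port]) {
    echo "$ip : $port\n";
}
```

On a live page the string would come from file_get_contents() instead of the inline sample, and the cell indexes would need to be checked against the actual table.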

The simplest and ideal solution would be simple_html_dom; see: http://simplehtmldom.sourceforge.net/

Example

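include 'simple_html_dom.php';

$html = file_get_html('http://www.hidemyass.com/proxy-list/');
if (!$html) {
    die('Could not load the proxy list');
}

echo "<pre>";
foreach ($html->find('tr') as $element) {
    $ip = $element->find('td', 1);
    $port = $element->find('td', 2);
    if ($ip === null || $port === null) {
        continue; // row without enough <td> cells
    }
    $ip = getIP($ip);
    // var_dump($element->xmltext);
    echo " $ip : $port \n";
}

// Rebuild the visible IP address from an obfuscated <td> node.
function getIP($obj) {
    // Treat <div> like <span> so a single explode() splits on both tags.
    $text = str_replace("div", "span", $obj->xmltext);
    $text = explode("span", $text);

    $ip = array();

    foreach ($text as $value) {
        // Strip whitespace, tag brackets and separator dots around the fragment.
        $value = trim($value);
        $value = trim($value, "<");
        $value = trim($value, ">");
        $value = trim($value, ".");

        if (empty($value))
            continue;

        // Skip fragments that belong to hidden decoy elements.
        if (strpos($value, "display:none") !== false) {
            continue;
        }

        // Restore the brackets so strip_tags() can drop leftover attributes.
        if (strpos($value, ">") !== false) {
            $value = "<" . $value . ">";
        }

        $value = strip_tags($value);

        $value = trim($value, ".");

        if (empty($value))
            continue;

        $ip[] = $value;
    }

    return implode(".", $ip);
}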

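But this will not give you the IP address in the format you want, because HideMyAss protects against this kind of extraction.

A td containing an IP address looks like this: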

<td> 
<span><span class="52">201</span>.73 
    <div style="display: none">228</div> 
    <span class="" style="">.</span><span>17</span><span 
    style="display: none">248</span><span></span>.107</span> 
</td>

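As you can see, the decoy values are sometimes wrapped in <div style="display: none"> and sometimes in <span style="display: none">. At times they also use integer classes such as class=51, which also means none....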
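The obfuscation described above can also be undone on the DOM level by removing the hidden elements before reading the text. The following is only a sketch under assumptions: the function name visibleText and the $hiddenClasses parameter are made up for illustration, and the caller would have to work out from the page's stylesheet which numeric classes are hidden. It assumes PHP 7 or later.

```php
<?php
/**
 * Return the visible text of an HTML fragment, skipping elements hidden by an
 * inline "display: none" style or by one of the given class names.
 * $hiddenClasses is a hypothetical input supplied by the caller.
 */
function visibleText(string $fragment, array $hiddenClasses = []): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its root can be found again after parsing.
    $doc->loadHTML('<div id="root">' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $hidden = [];
    foreach ($xpath->query('//*[@style or @class]') as $el) {
        // Normalise the style so "display: none" and "display:none" both match.
        $style = preg_replace('/\s+/', '', strtolower($el->getAttribute('style')));
        $classes = preg_split('/\s+/', $el->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        if (strpos($style, 'display:none') !== false
            || array_intersect($classes, $hiddenClasses)) {
            $hidden[] = $el;
        }
    }
    // Remove after the loop so the node list is not modified while iterating.
    foreach ($hidden as $el) {
        if ($el->parentNode !== null) {
            $el->parentNode->removeChild($el);
        }
    }

    $root = $xpath->query('//*[@id="root"]')->item(0);
    return preg_replace('/\s+/', '', $root->textContent);
}

// Contents of the sample cell quoted above.
$cell = <<<'HTML'
<span><span class="52">201</span>.73
    <div style="display: none">228</div>
    <span class="" style="">.</span><span>17</span><span
    style="display: none">248</span><span></span>.107</span>
HTML;

echo visibleText($cell), "\n";
```

For the sample cell, and assuming class 52 is a visible one, this should leave 201.73.17.107. Passing something like ['51'] as the second argument would additionally drop elements carrying that class.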

I was able to get a crazy workaround going using the getIP function. I hope this helps.

Example output

IP address : Port 
200.135.197.120 : 8080 
96.46.7.194 : 80 
217.26.14.18 : 3128 
189.114.111.190 : 8080 
202.51.107.37 : 8080 
128.208.04.198 : 2124 
221.133.238.138 : 8080 
41.215.247.146 : 8080 
140.113.216.134 : 3128 
190.211.132.33 : 8080 
117.34.92.43 : 3128 
118.97.235.234 : 3128 
85.248.141.245 : 3128 
203.223.47.119 : 3128 
200.48.213.82 : 8080 
217.112.128.247 : 80 
114.134.76.27 : 8080 
78.45.134.10 : 3128 
77.78.197.15 : 8080 
189.44.226.66 : 3128 
124.195.124.166 : 8080 
190.39.128.219 : 8080 
222.42.45.51 : 3128 
195.138.76.136 : 3128 
115.249.252.235 : 8080 
222.124.152.18 : 8080 
190.255.39.147 : 3128 
189.22.138.162 : 8080 
217.146.208.162 : 8080 
203.143.18.1 : 8080 
210.57.215.130 : 80 
190.98.166.106 : 3128 
200.5.226.74 : 80 
187.6.254.19 : 3128 
177.36.242.57 : 8080 
41.133.101.242 : 8080 
201.87.208.66 : 8080 
41.67.20.91 : 8080 
118.192.1.168 : 3128 
41.75.201.146 : 3128 
61.166.144.69 : 8080 
200.238.98.234 : 3128 
110.52.11.220 : 80 
125.67.230.192 : 8080 
94.228.35.219 : 80 
64.85.181.45 : 8080 
222.169.15.234 : 8080 
113.106.194.220 : 80 
119.82.239.50 : 8080 
117.27.139.17 : 80
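Since the question asked for XML, lines like the ones above could be converted with SimpleXML along these lines. This is a minimal sketch: the function name proxiesToXml and the element names proxies, proxy, ip and port are invented for illustration and are not part of the original code. It assumes PHP 7 or later.

```php
<?php
/**
 * Build an XML document from "ip : port" lines.
 * The element names used here are made up for this sketch.
 */
function proxiesToXml(array $lines): string
{
    $xml = new SimpleXMLElement('<?xml version="1.0" encoding="utf-8"?><proxies/>');
    foreach ($lines as $line) {
        // Expect "a.b.c.d : port"; skip the header line and anything else.
        if (!preg_match('/^\s*(\d{1,3}(?:\.\d{1,3}){3})\s*:\s*(\d+)\s*$/', $line, $m)) {
            continue;
        }
        $node = $xml->addChild('proxy');
        $node->addChild('ip', $m[1]);
        $node->addChild('port', $m[2]);
    }
    return $xml->asXML();
}

echo proxiesToXml([
    'IP address : Port',
    '200.135.197.120 : 8080',
    '96.46.7.194 : 80',
]);
```

The regular expression here only splits plain-text lines; it is not applied to HTML. When serving the result from a script, a header('Content-Type: text/xml') call before the echo would match what the question's code does.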

Thanks

Source

2012-04-09 18:40:09 Baba

Thanks Baba.. I appreciate your help... – kimpuler 2012-04-09 18:59:17

You're welcome.. just updated with my findings – Baba 2012-04-09 19:01:57

Working workaround.. I will update soon – Baba 2012-04-09 19:14:49
