2015-03-19 53 views
2

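Basically, I want to extract all URLs from a web page, even the ones that are not clickable links. How can I extract every URL (link) on a web page as plain text?

For example, the page source might be: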

<html> 

<title>Random Website I am Crawling</title> 

<body> 

Click <a href="http://clicklink.com">here</a> for foobar 

Another site is http://foobar.com 

</body> 

</html> 

I want both URLs to be displayed:

http://clicklink.com and http://foobar.com 

I also don't want them output as clickable links.

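My current script grabs the URLs, but it also seems to grab a bunch of other junk, which leaves the links clickable and impossible to store in the database.

Here is my current code.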

<?php 

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                           PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)); 

$url="http://www.frozencpu.com/"; 
$data=file_get_contents($url); 
$data = strip_tags($data,"<a>"); 
$d = preg_split("/<\/a>/",$data); 
foreach ($d as $k=>$u){ 
    if(strpos($u, "<a href=") !== FALSE){ 
    //echo $u; 
    //echo "<BR>"; 
     $u = preg_replace("/.*<a\s+href=\"/sm","",$u); 
     $u = preg_replace("/\".*/","",$u); 
     //echo $u; 
     //echo "<BR>"; 
     $db->exec("INSERT INTO urls(url, crawled) VALUES('$u', '0')"); 
    } 
} 

?> 

Here is some example output:

http://www.facebook.com/pages/FrozenCPUcom/351841771499<BR>http://twitter.com/FrozenCPU<BR>/rss/frozencpu.rss<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?id=CR9RnD2g<BR> 

Everything seems fine up to here.

Then it just junks up big time 

&nbsp;&nbsp;<a href='http://www.frozencpu.com/advanced_search.html?id=CR9RnD2g' class=small>Advanced Search<BR>http://www.frozencpu.com/brands/shop_by_brand.html?id=CR9RnD2g<BR>http://www.frozencpu.com/shop_category.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g30/Liquid_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g57/EK_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g59/XSPC_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g60/LutroO_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g12/Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g40/Air_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g53/Apparel.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g34/Bay_Devices.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g54/Cabinet_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g2/Cables.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g32/Caffeine.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g1/Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g58/CaseLabs_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g45/Custom_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g43/Case_Parts-OEM.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g51/Connectors.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g48/CPU_Heatsinks.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g44/DIYMod_Parts.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g4/Electronics.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g36/Fans.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g47/Fan_Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g39/Gaming.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g6/Lighting.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g49/Phase_Change.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g11/Power_Supplies.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g55/Screws.html?id=CR9RnD2g<BR>http://w
ww.frozencpu.com/cat/l1/g35/SleevingHeatshrink.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g7/Sound_Dampening.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g52/Switches.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g8/Thermal_Interface.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g31/Travel_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g33/Ultra_Quiet.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g42/Window_Kits.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g50/Custom_Services.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?enable=1&id=CR9RnD2g<BR>http://www.frozencpu.com/products/2770/gc-01/Gift_Certificate.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/aboutus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/resource.html?id=CR9RnD2g<BR>http://www.frozencpu.com/career.html?id=CR9RnD2g<BR>http://www.frozencpu.com/clearance/list/p1/Clearance-Page1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>http://www.frozencpu.com/links.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>http://www.frozencpu.com/media.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?view_cart=Wish%2dList&wish_list=1&id=CR9RnD2g<BR>http://www.frozencpu.com/new_products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/powder_coating.html?id=CR9RnD2g<BR>http://www.frozencpu.com/press.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/stores.html?id=CR9RnD2g<BR> 


      <a href='http://www.facebook.com/pages/FrozenCPUcom/351841771499' target=<BR> 
      <a href='http://twitter.com/FrozenCPU' target=<BR> 
      <a href='/rss/frozencpu.rss' target=<BR>https://www.resellerratings.com 
<BR>https://www.securitymetrics.com/sitecertsummary.adp?s=67%2e228%2e74%2e232&amp;i=340380<BR>mailto:[email protected]?subject=WESTERN%20UNION<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g 

            The XSPC Raystorm RX240 V3 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU. This kit is designed to handle your CPU and can be expanded to handle more blocks as well. 

The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component. This block has a pure copper base and is a top o... 
            3 In Stock, Ships Today Till 6pm EST 
            $259.99 
           <BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g 

            The RayStorm Copper Twin D5 AX240 kit is the most powerful 240 kit XSPC have ever made. It includes a special Copper edition of our RayStorm block, our fantastic new AX240 radiator and two D5 Vario pumps in series. 

The RayStorm Copper has the same great performance as our award winning RayStorm block, but with an all metal design. The acetal top... 
            7 In Stock, Ships Today Till 6pm EST 
            $399.99 
           <BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g 

            PrimoChill once again provides a good lookin, easy solution to the unimaginable. Introducing, one hell of a crypto rack, The Hasher! 

Built out of rugged, 1in anodized extruded aluminum t-slot, the PrimoChill Hasher is tough but cool enough to keep out of the basement. It combines not only functionality but order to the chaos that other mining r... 
            5 In Stock, Ships Today Till 6pm EST 
            $129.99 
           <BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g 

            Small, lightweight, and true Plug N Play, the Add2Psu adapter allows you to add more power to your computer. No cutting wires or soldering, no compromising the integrity or function of your PC. 

Now there is a way to add more power to your PC. Finally a true plug and play way to manage additional power for those big video cards, bigger hard drive... 
            290 In Stock, Ships Today Till 6pm EST 
            $19.95 
           <BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g 

            The SkyWater 330L is a new liquld cooling system with a variable speed pump and Fans in desktop PC. The water cooling system is designed for the best thermal solution of CPU, the most important component of your PC. The SkyWater 330L provides a low noise at low speed fans , high performance at high speed fans and reliable liquid cooling system. 

... 
            4 In Stock, Ships Today Till 6pm EST 
            $129.99 
           <BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g 

            Combined GPU/RAM/VRM-cooler for graphics cards of the type nvidia GTX 980 with 4 GB RAM according to reference design. 
This cooler combines the features of a graphics chip cooler and RAM-coolers in an elegant and very flat watercooler. Additionally the voltage regulators are also cooled effectively. 

The kryographics for GTX 980 water block offe... 
            5 In Stock, Ships Today Till 6pm EST 
            $129.99 
           <BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g 

            Introducing the Lamptron CW611 Water Cooling fan controller! The first in a series of advanced control 5.25&#8243; bay devices that allow complete control over your entire PC cooling system. You can use this controller to be used with fans, liquid cooling pumps, as well as flow meters. The first in a new series of controllers this is sure to get ... 
            52 In Stock, Ships Today Till 6pm EST 
            $99.99 
           <BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g 

            The Noiseblocker NB-BlackSilentFan XM2 40mmx10mm Ultra Quiet Fan, manufactured by Noiseblocker, Germany's quietest fan manufacturer, the BlackSilentFan series features extraordinary life spans and near silent operation. Using the NB-Longlife advanced sleeve bearing and matched with the NB-EKA drive, the BlackSilentFan series runs more than double ... 
            20 In Stock, Ships Today Till 6pm EST 
            $12.95 
           <BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g 

            Staying true to the Phanteks’ Enthoo line, the Luxe features a sandblasted front and top panel. Ambient lighting run from top to front of the case on both sides. Even though smaller in size, the Enthoo Luxe boost many features from the award-winning Enthoo Primo. The Luxe comes pre-installed with a 200mm front fan and 2x PH-F140SP fans. Phanteks’ E... 
            In Stock, Ships Today Till 6pm EST 
            $159.99 
           <BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g 

            The MagiCool DIY Complete Liquid Cooling Kit comes with everything you need to set your system up on liquid. The CPU block is compatible with all current sockets giving you flexibility for now and for future upgrades as well. The radiator is a slim profile variant allowing for maximum case compatibility. 
Compression fittings are provided for dur... 
            5 In Stock, Ships Today Till 6pm EST 
            $124.99 
           <BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g 

            With the new NexXxoS GPX coolers Alphacool is again a step ahead! Optimum performance and quality in a new cooling design for a great price! 

A new sophisticated injection system means the GPU is actively cooled. All other chips are sufficiently cooled by the passive cooler which is also in contact with the watercooling block for extra efficiency... 
            3 In Stock, Ships Today Till 6pm EST 
            $94.99 
           <BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g 

            The new generation of cooling control from Alphacool: The Heatmaster II 

The new Alphacool Heatmaster II was developed in Germany over multiple years, and has continuously been improved considering the experiences from the first version. Hence we are now, after a development and testing period of almost 3 years, able to present the best Heatmaste... 
            4 In Stock, Ships Today Till 6pm EST 
            $84.99 
           <BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g 

            EK ZMT (Zero Maintainance Tubing) is a high quality, zero maintainance industrial grade EPDM rubber tubing in stylish matte black. 

This tubing is - just like Norprene - designed to withstand harsh conditions for a very long period of time, offering a truly exceptional lifespan even under UV, ozone and heat exposure for many years. 

Unlike most... 
            62 In Stock, Ships Today Till 6pm EST 
            $2.50 
           <BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g 

            The XSPC Raystorm DDC Photon EX360 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU. This kit is designed to handle your CPU and can be expanded to handle more blocks as well. 

The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component. This block has a pure copper base and is... 
            5 In Stock, Ships Today Till 6pm EST 
            $254.99 
           <BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g 

            A new generation of fans joins the Alphacool range. The Susurro, Spanish for Whisper. 

A fundamental review of known fan designs was used to manufacture the Susurro. The perfect harmony between the AlphaCool blue and deep blacks make a great impression. The transparent black fan is optimized to cause virtually no noise. 

But don’t be persuaded ... 
            2 In Stock, Ships Today Till 6pm EST 
            $14.99 
           <BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g 

            The best Alphacool reservoir mounts of all times! 

Many reservoir mounts were designed for the original tube reservoirs from the beginning of the PC water cooling sector. During the last years though, the reservoirs became larger, sized for more capacity and metal was integrated for the end caps. This resulted in heavier reservoirs, making the co... 
            1 In Stock, Ships Today Till 6pm EST 
            $10.99 
           <BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?gu=1&id=CR9RnD2g<BR>http://www.frozencpu.com/help/h25/Ordering_with_a_PO.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/problem.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h15/Legal.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h13.html?id=CR9RnD2g<BR>http://www.getfirefox.com<BR> 

Answers

2

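If you want all URLs, you can't just look for <a href=, especially since the href attribute won't always be the first thing inside the <a> tag. A tag like <a target=_blank href=http://google.com> would be missed.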
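To make the point above concrete, here is a minimal sketch. The `$html` sample and the `a.example` host are made up for illustration. It compares a literal `<a href=` check with a looser pattern that accepts `href` anywhere inside the tag, quoted or unquoted. The pattern is only an approximation (it would also accept an attribute such as `data-href`), not a full HTML parser.

```php
<?php
// Illustrative input: the second tag puts target before href and omits quotes.
$html = '<a href="http://a.example">A</a> <a target=_blank href=http://google.com>G</a>';

// Literal check: only counts tags that start exactly with `<a href=`.
$naive = substr_count($html, '<a href=');

// Looser pattern: href anywhere inside the <a ...> tag, with or without quotes.
preg_match_all('/<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)/i', $html, $m);
$hrefs = $m[1];

echo "literal check found: $naive\n";
echo "looser pattern found: " . implode(', ', $hrefs) . "\n";
```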

If you want to find all URLs regardless of context, you can simply ignore the tags and look for a general URL pattern, with something like this:

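// preg_match_all() returns the number of matches; the URLs themselves end up in $matches[0]
preg_match_all('/[a-z]+:\/\/[a-zA-Z0-9?+.=%:\/]+/', $content, $matches); 
$urls = $matches[0]; 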

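This probably needs a lot of polishing, but it should do the trick to get you started. Note, however, that it only matches full URLs. Links to relative pages, such as <a href="index.html">, obviously won't match.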
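If relative links matter, they have to be resolved against the URL of the page they were found on. The sketch below assumes a hypothetical helper, `resolve_url()`, that is not part of the original answer. It handles absolute, protocol-relative, root-relative, and simple path-relative links, and deliberately ignores `../` segments and query-only or fragment-only links.

```php
<?php
// Hypothetical helper: resolve $href against the page URL $base.
function resolve_url(string $base, string $href): string
{
    // Already absolute: starts with a scheme such as http:, https:, mailto:
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $href)) {
        return $href;
    }

    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host']
            . (isset($p['port']) ? ':' . $p['port'] : '');

    if (strpos($href, '//') === 0) {   // protocol-relative: //host/path
        return $p['scheme'] . ':' . $href;
    }
    if (strpos($href, '/') === 0) {    // root-relative: /path
        return $origin . $href;
    }

    // Path-relative: keep the base path up to and including its last slash.
    $path = isset($p['path']) ? $p['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

$rss = resolve_url('http://www.frozencpu.com/', '/rss/frozencpu.rss');
$idx = resolve_url('http://www.frozencpu.com/cat/list.html', 'index.html');
$abs = resolve_url('http://www.frozencpu.com/', 'http://foobar.com');
```

Here `/rss/frozencpu.rss` is the relative link that appears in the question's sample output; `cat/list.html` is an invented path used only to show the path-relative case.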

Since regular expressions are not a recommended solution for parsing HTML, I'm afraid you will have to look for a more suitable approach, such as DOMDocument(), to open the page and look for URLs properly.
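As a rough illustration of the DOMDocument route, the sketch below reads `href` attributes through DOMXPath and then scans only the text nodes for plain-text URLs. The function name `extract_urls()` and the text-node pattern are assumptions for this example, not something the answer prescribes. The pattern is intentionally simple and may keep trailing punctuation.

```php
<?php
// Sketch: collect URLs from <a href> attributes and from visible text.
function extract_urls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $urls = array();

    // 1. Every href on an <a> element, wherever it sits inside the tag.
    foreach ($xpath->query('//a/@href') as $attr) {
        $urls[] = trim($attr->nodeValue);
    }

    // 2. Plain-text URLs in text nodes, skipping script and style content.
    $textNodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    foreach ($textNodes as $text) {
        if (preg_match_all('~\bhttps?://[^\s<>"\']+~i', $text->nodeValue, $m)) {
            $urls = array_merge($urls, $m[0]);
        }
    }

    return array_values(array_unique($urls));
}

// The sample page from the question.
$html = '<html><title>Random Website I am Crawling</title><body>'
      . 'Click <a href="http://clicklink.com">here</a> for foobar '
      . 'Another site is http://foobar.com </body></html>';
$urls = extract_urls($html);
print_r($urls);
```

Because the values come back as plain strings, they can be stored directly without any surrounding markup.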

1

To match all types of URLs, the following code may help you:

<?php 

$content = '<html> 

<title>Random Website I am Crawling</title> 

<body> 

Click <a href="http://clicklink.com">here</a> for foobar 

Another site is http://foobar.com 

</body> 

</html>'; 

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME 
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass 
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP 
$regex .= "(\:[0-9]{2,5})?"; // Port 
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path 
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query 
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor 


$matches = array(); //create array 
$pattern = "/$regex/"; 

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0]))); 
echo "<br><br>"; 
echo implode("<br>", array_values(array_unique($matches[0]))); 


/* 
* With your code 
*/ 

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                           PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)); 
$url = "http://www.frozencpu.com/"; 
$data = file_get_contents($url); 
$matches = array(); 

preg_match_all($pattern, $data, $matches); 
$array = array_values(array_unique($matches[0])); 

// Prepare the INSERT once and bind each URL as a parameter, so a quote 
// inside a URL cannot break (or inject into) the SQL statement. 
$stmt = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')"); 
foreach ($array as $found) { 
    $stmt->execute(array($found)); 
} 

?> 

Here is the updated code. It seems to work, but it is very slow.

<?php 

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                           PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)); 

$url="http://proxylists.connectionincognito.com/"; 
$content=file_get_contents($url); 

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME 
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass 
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP 
$regex .= "(\:[0-9]{2,5})?"; // Port 
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path 
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query 
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor 


$matches = array(); //create array 
$pattern = "/$regex/"; 

preg_match_all($pattern, $content, $matches); 

$unique = array_unique($matches[0]); 

// Prepare both statements once, outside the loop, and bind the URL as a parameter. 
$select = $db->prepare("SELECT 1 FROM urls WHERE url = ?"); 
$insert = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')"); 

foreach ($unique as $url) { 
    // Insert only if the URL is not stored yet 
    $select->execute(array($url)); 
    $row = $select->fetch(PDO::FETCH_ASSOC); 
    $select->closeCursor(); 

    if (! $row) { 
        $insert->execute(array($url)); 
    } 
} 
?> 
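The "insert if none exists" step above costs one SELECT plus one INSERT per URL. A possible alternative, sketched here under assumptions, is to let a UNIQUE constraint reject duplicates and to wrap the batch in a single transaction. The example uses an in-memory SQLite database as a stand-in for the MySQL `crawler` database so that it runs anywhere. The original schema is not shown, so the UNIQUE constraint on `url` is an assumption; on MySQL the statement would be spelled `INSERT IGNORE INTO ...`.

```php
<?php
// In-memory SQLite stands in for the MySQL `crawler` database (assumption).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// A UNIQUE constraint lets the database reject duplicates by itself.
$db->exec("CREATE TABLE urls (url TEXT NOT NULL UNIQUE, crawled INTEGER NOT NULL DEFAULT 0)");

// Illustrative input containing one duplicate.
$found = array('http://clicklink.com', 'http://foobar.com', 'http://clicklink.com');

// SQLite spelling; the MySQL equivalent is INSERT IGNORE INTO ...
$insert = $db->prepare("INSERT OR IGNORE INTO urls(url, crawled) VALUES(?, 0)");

// One transaction for the whole batch instead of one commit per row.
$db->beginTransaction();
foreach ($found as $url) {
    $insert->execute(array($url));
}
$db->commit();

$stored = (int) $db->query("SELECT COUNT(*) FROM urls")->fetchColumn();
echo "rows stored: $stored\n";
```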

Reference:

http://php.net/manual/en/function.preg-match.php
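The post does not say which stage of the updated script is slow, so a reasonable first step is to time the download, the regex, and the database loop separately. The sketch below assumes a hypothetical `timed()` helper and times only a regex pass over generated sample text. The same wrapper could be put around `file_get_contents()` and the insert loop.

```php
<?php
// Hypothetical helper: run $fn and return array(result, elapsed seconds).
function timed(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();
    return array($result, microtime(true) - $start);
}

// Generated sample text standing in for a downloaded page.
$content = str_repeat(
    'Another site is http://foobar.com and <a href="http://clicklink.com">here</a> ',
    1000
);

// Time only the pattern-matching stage.
list($count, $seconds) = timed(function () use ($content) {
    return preg_match_all('~\bhttps?://[^\s<>"\']+~i', $content, $m);
});

printf("regex: %d matches in %.4f s\n", $count, $seconds);
```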

+0

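Thanks for your answer, it seems to be working great so far! Quick question: I'm getting very slow load times on these operations, around 15 seconds. – Nick 2015-03-19 05:04:37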

+0

Is that to be expected? I'll edit my post with the new code! – Nick 2015-03-19 05:04:57

+0

Maybe there is one thing making it this slow; it should be instant. – Nick 2015-03-19 05:05:09
