2014-02-16 18 views
0

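Extracting all links of a certain form

I have a page from which I want to pull all the links (e.g. http://www.stephenfry.com/). I want to put every link of the form http://www.stephenfry.com/WHATEVER into an array. All I have so far is the approach below: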

#!/usr/bin/perl -w 
use strict; 
use LWP::Simple; 
use HTML::Tree; 

# I ONLY WANT TO USE JUST THESE 

my $url = 'http://www.stephenfry.com/'; 

my $doc = get($url); 

my $adt = HTML::Tree->new(); 
$adt->parse($doc); 

my @juice = $adt->look_down(
    _tag => 'a', 
    href => 'REGEX?' 
); 

I don't know how to restrict the result to just these links.

I edited the title to describe the actual problem in more detail. –

What is your expected output? – toolic

It could be http://www.stephenfry.com/stuff or http://www.stephenfry.com/stuff/morestuff, i.e. any http://www.stephenfry.com/ link. – user3269763

Answers

1

You want to use the extract_links() method rather than look_down():

use strict; 
use warnings; 
use LWP::Simple; 
use HTML::Tree; 

my %seen; 
my $url = 'http://www.stephenfry.com/'; 
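my $doc = get($url);
die "Could not fetch $url\n" unless defined $doc;

my $adt = HTML::Tree->new();
$adt->parse($doc);
$adt->eof();    # signal end of input so the tree is finalized
my $links_array_ref = $adt->extract_links('a');

# keep only absolute links on this host (dots escaped, match anchored), without duplicates
my @links = grep { m{^http://www\.stephenfry\.com/} and !$seen{$_}++ }
    map { $_->[0] } @$links_array_ref;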

print "$_\n" for @links; 

Partial output:

http://www.stephenfry.com/ 
http://www.stephenfry.com/blog/ 
http://www.stephenfry.com/category/blessays/ 
http://www.stephenfry.com/category/features/ 
http://www.stephenfry.com/category/general/ 
... 
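For comparison, the look_down() call from the question can also be made to work, since look_down() accepts a compiled regex as the value to match an attribute against. The following is only a minimal sketch: the inline HTML is hypothetical sample markup standing in for the fetched page, and it uses HTML::TreeBuilder's new_from_content() so that it runs without network access.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the fetched page.
my $html = <<'HTML';
<html><body>
<a href="http://www.stephenfry.com/blog/">Blog</a>
<a href="http://www.stephenfry.com/blog/">Blog again</a>
<a href="http://example.com/">Elsewhere</a>
<a href="/relative/">Relative</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# A qr// value means "the attribute's value must match this regex".
my @anchors = $tree->look_down(
    _tag => 'a',
    href => qr{^http://www\.stephenfry\.com/},
);

# Pull out the href values and drop duplicates.
my %seen;
my @links = grep { !$seen{$_}++ } map { $_->attr('href') } @anchors;

print "$_\n" for @links;

$tree->delete;    # free the tree explicitly
```

Unlike extract_links(), look_down() returns the matching elements themselves, so the href values have to be read with attr().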

Using WWW::Mechanize can be simpler, and it does return more links:

use strict; 
use warnings; 
use WWW::Mechanize; 

my %seen; 
my $mech = WWW::Mechanize->new(); 
$mech->get('http://www.stephenfry.com/'); 
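# keep only absolute links on this host (dots escaped, match anchored), without duplicates
my @links = grep { m{^http://www\.stephenfry\.com/} and !$seen{$_}++ }
    map { $_->url } $mech->links();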

print $_, "\n" for @links; 

Partial output:

http://www.stephenfry.com/wp-content/themes/fry/images/favicon.png 
http://www.stephenfry.com/wp-content/themes/fry/style.css 
http://www.stephenfry.com/wordpress/xmlrpc.php 
http://www.stephenfry.com/feed/ 
http://www.stephenfry.com/comments/feed/ 
... 
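Both snippets filter on the link text as it appears in the page, so relative links such as /blog/ would be skipped. One possible alternative, sketched here under the assumption that the URI module is available (LWP depends on it), is to resolve each link against the page URL and compare hosts. The helper name on_host is hypothetical and not part of any library.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: true when $link, resolved against $base, points at $host.
sub on_host {
    my ($link, $base, $host) = @_;
    my $uri = URI->new_abs($link, $base);
    return 0 unless $uri->can('host');    # e.g. mailto: URIs have no host method
    return lc($uri->host // '') eq lc($host) ? 1 : 0;
}

my $base = 'http://www.stephenfry.com/';
my @candidates = (
    '/blog/',
    'http://www.stephenfry.com/feed/',
    'http://example.com/?u=http://www.stephenfry.com/',
    'mailto:someone@example.com',
);

# Keep only the candidates whose resolved host is the site itself.
my @kept = grep { on_host($_, $base, 'www.stephenfry.com') } @candidates;
print "$_\n" for @kept;
```

The same test could replace the regex inside either grep block above; whether relative links should count is a judgment call the question does not settle.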

Hope this helps!
