Extract all links of a certain form

I have a page (for example, http://www.stephenfry.com/) from which I want to pull all the links. I want to put every link of the form http://www.stephenfry.com/WHATEVER into an array. All I have so far is the following:

#!/usr/bin/perl -w 
use strict; 
use LWP::Simple; 
use HTML::Tree; 

# I ONLY WANT TO USE JUST THESE 

my $url = 'http://www.stephenfry.com/'; 

my $doc = get($url); 

my $adt = HTML::Tree->new(); 
$adt->parse($doc); 

my @juice = $adt->look_down(
    _tag => 'a', 
    href => 'REGEX?' 
);

I don't know how to get just those links.

Source

2014-02-16 user3269763

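I edited the title so that it describes the actual problem in more detail. –

What is your expected output? – toolic

It could be http://www.stephenfry.com/stuff or http://www.stephenfry.com/stuff/morestuff, that is, any http://www.stephenfry.com/ link. – user3269763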

You want to use the extract_links() method rather than look_down():

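use strict;
use warnings;
use LWP::Simple;
use HTML::Tree;

my %seen;
my $url = 'http://www.stephenfry.com/';
my $doc = get($url) or die "Could not fetch $url\n";

my $adt = HTML::Tree->new();
$adt->parse($doc);
$adt->eof();    # signal end of input so the tree is finalized

# Each entry is [ $link, $element, $attribute_name, $tag ]
my $links_array_ref = $adt->extract_links('a');

# Keep links on the target host (dots escaped), dropping duplicates
my @links = grep { /www\.stephenfry\.com/ and !$seen{$_}++ } map $_->[0],
    @$links_array_ref;

print "$_\n" for @links;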

Partial output:

http://www.stephenfry.com/ 
http://www.stephenfry.com/blog/ 
http://www.stephenfry.com/category/blessays/ 
http://www.stephenfry.com/category/features/ 
http://www.stephenfry.com/category/general/ 
...
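If you would rather stay with look_down(), it also accepts a compiled regex (qr//) as the value of an attribute criterion, which is what the 'REGEX?' placeholder in the question was reaching for. Here is a minimal sketch, assuming HTML::TreeBuilder (shipped in the same HTML-Tree distribution) and a made-up snippet of markup in place of the fetched page:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the fetched page.
my $html = <<'HTML';
<html><body>
<a href="http://www.stephenfry.com/blog/">Blog</a>
<a href="http://www.stephenfry.com/blog/">Blog again</a>
<a href="http://example.com/elsewhere">Elsewhere</a>
<a name="top">No href</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# A qr// value means "this attribute must match the regex";
# elements without an href never match.
my @anchors = $tree->look_down(
    _tag => 'a',
    href => qr{^http://www\.stephenfry\.com/},
);

# look_down() returns elements, so pull out the href and de-duplicate.
my %seen;
my @links = grep { !$seen{$_}++ } map { $_->attr('href') } @anchors;

print "$_\n" for @links;
$tree->delete;    # free the tree explicitly
```

Unlike extract_links(), this gives you the elements themselves, so you can also read the link text with as_text() if you need it.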

Using WWW::Mechanize can be simpler, and it does return more links:

use strict;
use warnings;
use WWW::Mechanize;

my %seen;
my $mech = WWW::Mechanize->new();    # dies on a failed request by default
$mech->get('http://www.stephenfry.com/');

# links() returns WWW::Mechanize::Link objects; url is the href as written
my @links = grep { /www\.stephenfry\.com/ and !$seen{$_}++ } map $_->url,
    $mech->links();

print $_, "\n" for @links;

Partial output:

http://www.stephenfry.com/wp-content/themes/fry/images/favicon.png 
http://www.stephenfry.com/wp-content/themes/fry/style.css 
http://www.stephenfry.com/wordpress/xmlrpc.php 
http://www.stephenfry.com/feed/ 
http://www.stephenfry.com/comments/feed/ 
...
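One caveat: both snippets match the href as it is written in the page, so a relative link such as /blog/ would be skipped. Here is a rough sketch of resolving each link against the page URL before filtering, assuming the URI module is available (LWP depends on it) and again using made-up markup:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

my $base = 'http://www.stephenfry.com/';

# Hypothetical sample markup mixing relative and absolute links.
my $html = <<'HTML';
<html><body>
<a href="/blog/">Relative</a>
<a href="http://www.stephenfry.com/blog/">Absolute duplicate</a>
<a href="category/features/">Relative, no leading slash</a>
<a href="http://example.com/">Other site</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Resolve every href against $base first, then filter and de-duplicate.
my %seen;
my @links =
    grep { m{^http://www\.stephenfry\.com/} and !$seen{$_}++ }
    map  { URI->new_abs( $_->[0], $base )->as_string }
    @{ $tree->extract_links('a') };

print "$_\n" for @links;
$tree->delete;
```

With WWW::Mechanize you get the same effect by mapping over $_->url_abs instead of $_->url.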

Hope this helps!

Source

2014-02-16 03:27:03 Kenosis
