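To do this with HTML::TreeBuilder, you'd read the file, modify the tree, and write it back out (to the same file or a different one). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.
A common idiom with HTML-Tree is to use a recursive function that modifies the tree: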
use strict;
use warnings;
use 5.008;
use File::Slurp 'read_file';
use HTML::TreeBuilder;

sub replace_keyword
{
    my $elt = shift;

    return if $elt->is_empty;

    $elt->normalize_content;      # Make sure text is contiguous

    my $content = $elt->content_array_ref;

    for (my $i = 0; $i < @$content; ++$i) {
        if (ref $content->[$i]) {
            # It's a child element, process it recursively:
            replace_keyword($content->[$i])
                unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
        } else {
            # It's text:
            if ($content->[$i] =~ /here/) { # your keyword or regexp here
                $elt->splice_content(
                    $i, 1, # Replace this text element with...
                    substr($content->[$i], 0, $-[0]), # the pre-match text
                    # A hyperlink with the keyword itself:
                    [ a => { href => 'http://example.com' },
                      substr($content->[$i], $-[0], $+[0] - $-[0]) ],
                    substr($content->[$i], $+[0]) # the post-match text
                );
            } # end if text contains keyword
        } # end else text
    } # end for $i in content index
} # end replace_keyword

my $content = read_file('foo.shtml');

# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
$html->eof; # Signal end of input so the tree is finalized after parse()

my $body = $html->look_down(qw(_tag body));
replace_keyword($body);

# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;

print STDOUT $content; # Replace STDOUT with a suitable filehandle
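
The output from as_HTML will be syntactically correct HTML, but not necessarily nicely formatted for people reading the source. If you want that, you can use HTML::PrettyPrinter to write out the file.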
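
The splice in replace_keyword relies on Perl's built-in @- and @+ arrays, which hold the start and end offsets of the last successful match. As a minimal standalone sketch of that three-way split, using only core Perl (the helper name split_on_match and the sample string are made up for illustration and are not part of HTML::TreeBuilder):

```perl
use strict;
use warnings;

# Hypothetical helper: split $text around the first match of $re and
# return (pre-match, match, post-match), or an empty list if no match.
sub split_on_match {
    my ($text, $re) = @_;
    return unless $text =~ $re;

    # $-[0] is the offset where the match starts, $+[0] where it ends.
    my ($start, $end) = ($-[0], $+[0]);

    return (
        substr($text, 0, $start),             # the pre-match text
        substr($text, $start, $end - $start), # the matched keyword
        substr($text, $end),                  # the post-match text
    );
}

my @parts = split_on_match('click here for more', qr/here/);
print join('|', @parts), "\n";   # prints "click |here| for more"
```

In replace_keyword the middle piece is wrapped in an <a> element rather than returned as plain text, and the post-match text is examined again on a later loop iteration, so repeated keywords in one text node should each be linked.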
cjm, 2010-10-11 00:17:45
Without seeing your code, it's hard to say where the problem is. – Ether 2010-10-10 15:30:54
Can you give a sample line of the HTML? – Ruel 2010-10-10 15:34:00
I've added an example. – snoofkin 2010-10-10 18:18:04