Splitting an HTML file using Perl
I am new to Perl, but I am trying to write a program that splits a single HTML file into multiple HTML files.
#!/usr/bin/env perl
use strict;
#use warnings;
my @file_names;
## Read the list of file names
open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (<$fh>) {
    chomp; #remove new line character from the end of the line
    push @file_names, $_;
}
my $counter = 0;
my ($file_name, $fn);
## Read the input file
open($fh, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
while (<$fh>) {
    ## If this is an opening class, open the next output file,
    ## and set $counter to 1.
    if (/ class="bch_ha"/) {
        $counter = 1;
        $file_name = shift(@file_names);
        open($fn, ">", "$file_name");
        #print "<html>\n<body>";
    }
    ## If this is a closing class, print the line and set $counter back to 0
    if (/\n<p sourcepage="(\d+)" class="bch_ha"/) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if (/ class="bcesu_tt"/) {
        $counter = 1;
        $file_name = shift(@file_names);
        open($fn, ">", "$file_name");
        #print "<html>\n<body>";
    }
    if (/\n<p sourcepage="(\d+)" class="bcekt_tt"/) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if (/ class="bcekt_tt"/) {
        $counter = 1;
        $file_name = shift(@file_names);
        open($fn, ">", "$file_name");
        #print "<html>\n<body>";
    }
    if (/\n<p sourcepage="(\d+)" class="bcepq_tt"/) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if (/ class="bcepq_tt"/) {
        $counter = 1;
        $file_name = shift(@file_names);
        open($fn, ">", "$file_name");
        #print "<html>\n<body>";
    }
    if (/\n<p sourcepage="(\d+)" class="bcecs_tt"/) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if (/ class="bcecs_tt"/) {
        $counter = 1;
        $file_name = shift(@file_names);
        open($fn, ">", "$file_name");
        #print "<html>\n<body>";
    }
    if (/\n<p sourcepage="(\d+)" class="bceex_tt"/) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if (/ class="bceex_tt"/) {
        $counter = 1;
        $file_name = shift(@file_names);
        open($fn, ">", "$file_name");
        #print "<html>\n<body>";
    }
    if (/\n<\/body>\n<\/html>/) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    ## Print into the corresponding file handle if $counter is 1
    print $fn $_ if $counter == 1;
}
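I need to add more options. The code should prompt for the delimiter to be entered manually, and the split files should go into a folder named chapterxx. Please help me with this.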
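One possible way to handle the two requested options is sketched below. It is only an illustration: the helper names `prompt`, `chapter_dir`, and `delimiter_regex` are made up for this example, and it assumes that "chapterxx" means the word "chapter" followed by a two-digit number. The delimiter typed by the user is escaped with `\Q...\E`, so characters such as `.` or `"` are matched literally.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Path qw(make_path);

# Ask a question on the terminal and return the trimmed answer.
# $in is a parameter so the function can read from any handle, not only STDIN.
sub prompt {
    my ($question, $in) = @_;
    $in //= \*STDIN;
    print STDERR $question;
    my $answer = <$in>;
    die "No input received\n" unless defined $answer;
    $answer =~ s/^\s+|\s+$//g;    # strip surrounding whitespace and the newline
    return $answer;
}

# Assumption: "chapterxx" means "chapter" plus a zero-padded two-digit number.
sub chapter_dir {
    my ($number) = @_;
    return sprintf('chapter%02d', $number);
}

# Build a pattern that matches the typed delimiter literally.
sub delimiter_regex {
    my ($delimiter) = @_;
    return qr/\Q$delimiter\E/;
}

# Runs only when an input file is given on the command line.
if (@ARGV) {
    my $delimiter = prompt('Delimiter (for example class="bch_ha"): ');
    my $chapter   = prompt('Chapter number: ');
    die "Chapter number must be numeric\n" unless $chapter =~ /^\d+$/;

    my $dir = chapter_dir($chapter);
    make_path($dir);              # no error if the folder already exists

    my $re = delimiter_regex($delimiter);
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my $hits = grep { $_ =~ $re } <$in>;   # count lines containing the delimiter
    close($in);
    print "Created $dir/; $hits line(s) match the delimiter\n";
}
```

The prompts are written to STDERR so they do not mix with anything the script prints to STDOUT.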
Yes, please find the sample HTML below.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8" />
</head>
<body>
<p sourcepage="27" class="bch_ha"></p>
<p sourcepage="26" class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p>
<p sourcepage="26" class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
<p sourcepage="26" class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX% </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
<p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b> </p>
<p sourcepage="26" class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
</body>
</html>
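I only need to split the HTML at each class="bch_ha", up to the next class="bch_ha", and write each piece as new HTML content to a file named reader_0.html. The file names should increment, for example reader_1.html.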
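The splitting rule described above could be sketched roughly as follows. This is a line-based illustration, not a real HTML parser: it assumes each `<p>` element sits on its own line, as in the sample. The sub names `split_on_marker` and `write_chunks` are hypothetical, and wrapping each piece in `<html><body>` is an assumption about the desired output.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Group lines into chunks; a new chunk starts at every line that contains
# the marker text. Lines before the first marker are skipped.
sub split_on_marker {
    my ($marker, @lines) = @_;
    my @chunks;
    for my $line (@lines) {
        if (index($line, $marker) >= 0) {
            push @chunks, [];                  # marker line opens a new chunk
        }
        next unless @chunks;                   # still in the preamble (<html>, <head>, ...)
        next if $line =~ m{^\s*</(?:body|html)>\s*$};   # drop the document's closing tags
        push @{ $chunks[-1] }, $line;
    }
    return @chunks;
}

# Write each chunk to reader_0.html, reader_1.html, ... inside $dir
# and return the list of paths that were written.
sub write_chunks {
    my ($dir, @chunks) = @_;
    my @written;
    for my $i (0 .. $#chunks) {
        my $path = "$dir/reader_$i.html";
        open(my $out, '>', $path) or die "Cannot write $path: $!";
        print {$out} "<html>\n<body>\n", @{ $chunks[$i] }, "</body>\n</html>\n";
        close($out) or die "Cannot close $path: $!";
        push @written, $path;
    }
    return @written;
}

# Runs only when an input file is given on the command line.
if (@ARGV) {
    my $input = shift @ARGV;
    open(my $in, '<', $input) or die "Cannot open $input: $!";
    my @lines = <$in>;
    close($in);
    my @files = write_chunks('.', split_on_marker('class="bch_ha"', @lines));
    print "$_\n" for @files;
}
```

With the sample above, the two lines carrying class="bch_ha" would start reader_0.html and reader_1.html respectively.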
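You must not comment out 'use warnings'. Those messages indicate that something in your code isn't quite right, and turning them off doesn't fix the problem! – Borodin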
This should be done with a proper HTML parser. Please show the original HTML so that we can help you. If it is online, then a link is fine – Borodin
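I can't share the HTML because it is confidential official material. I just need to split the HTML file into multiple files by class name, as you can see in the code above. But this should be dynamic: I need to create a directory named after the input file, and all the split HTML files need to be placed in that folder.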
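For the last point, a directory named after the input file, a minimal sketch using the core modules File::Basename, File::Path, and File::Spec might look like this. The helper names `output_dir_for` and `split_path` are invented for the example, and it assumes the directory should be created in the current working directory rather than next to the input file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Path qw(make_path);
use File::Spec;

# Directory name derived from the input file: "book/part1.html" -> "part1".
# The pattern strips the last extension, whatever it is.
sub output_dir_for {
    my ($input) = @_;
    my ($name) = fileparse($input, qr/\.[^.]*/);
    return $name;
}

# Path of the n-th split file inside that directory.
sub split_path {
    my ($dir, $n) = @_;
    return File::Spec->catfile($dir, "reader_$n.html");
}

# Runs only when an input file is given on the command line.
if (@ARGV) {
    my $dir = output_dir_for($ARGV[0]);
    make_path($dir);                      # no error if the directory already exists
    print split_path($dir, 0), "\n";      # where the first split file would go
}
```

Each output file handle would then be opened on `split_path($dir, $n)` instead of a name taken from a separate list file.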