2016-06-08 72 views

Split an HTML file using Perl

I am new to Perl, but I am trying to write a program that splits a single HTML file into multiple HTML files.

#!/usr/bin/env perl 

use strict; 
#use warnings; 

my @file_names; 

## Read the list of file names 
open(my $fh, "$ARGV[0]"); 
while (<$fh>) { 
    chomp; #remove new line character from the end of the line 
    push @file_names, $_; 
} 

my $counter = 0; 
my ($file_name, $fn); 

## Read the input file 
open($fh, "$ARGV[1]"); 
while (<$fh>) { 

    ## If this is an opening class, open the next output file, 
    ## and set $counter to 1. 

    if (/ class="bch_ha"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    ## If this is a closing class, print the line and set $counter back to 0 

    if (/\n<p sourcepage="(\d+)" class="bch_ha"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcesu_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bcekt_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcekt_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bcepq_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcepq_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bcecs_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcecs_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bceex_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bceex_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<\/body>\n<\/html>/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    ## Print into the corresponding file handle if $counter is 1 

    print $fn $_ if $counter == 1 
} 

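I need to add more options. The code should prompt for the delimiter to be entered manually, and the split files should go into a folder named chapterxx. Please help me with this.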

Yes, please find the sample HTML below.

<!DOCTYPE html> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<meta charset="UTF-8" /> 
</head> 
<body> 
<p sourcepage="27" `class="bch_ha"`></p> 
<p sourcepage="26"  class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p> 
<p sourcepage="26"  class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
<p sourcepage="26"  class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX% </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26"  class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
<p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26"  class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b> </p> 
<p sourcepage="26"  class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
</body> 
</html> 

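I only need to split the HTML from one class="bch_ha" up to the next class="bch_ha", writing each piece to a new HTML file named reader_0.html. The file names should increment, e.g. reader_1.html.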


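You must not comment out 'use warnings'. Those messages indicate that something in your code is not quite right, and turning them off does not fix the problem! – Borodin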


This should be done with a proper HTML parser. Please show the original HTML so that we can help you. If it is online, a link is fine. – Borodin


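I can't share the HTML because it is confidential official material. I just need to split the HTML file into multiple files by class name, as you can see in the code above. But this should be dynamic: I need to create a directory named after the input file, and all the split HTML files need to be moved into that folder. –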

Answers


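Maybe this example will give you an idea of how to complete your program.

The focus here is on how to split the file on a delimiter.

Note: only the HTML body is saved.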

#!/usr/bin/env perl 
# test.pl 

use strict; 
use warnings; 

my $file = './htmlInput.html'; # input file 
my $delim = 'class="bch_ha"'; # delimiter 
my $dir = 'chapter' . time; # folder with unix timestamp 

# mkdir returns 1 if success 
if (mkdir($dir, 0755)) { 
    print "INFO: Created folder $dir to collect files.\n"; 
} else { 
    die "Can't make folder $dir: $!\n"; 
} 

# reader_x.html, x = [0..] 
my $reader = 'reader_0.html'; 

my $fh2; 
my $cnt = 0; 
my $delim_first_time = 1; 
open(my $fh, "<", $file) or die "Can't open and read $file: $!"; # read file 
while (my $li = <$fh>) { 
    last if ($li =~ /<\/body>/); # quit the while loop 

    # \Q...\E makes the delimiter match literally, even if it contains regex metacharacters 
    if ($delim_first_time && $li =~ /\Q$delim\E/) { 
     open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write 
     $delim_first_time = 0; 
    } elsif ($li =~ /\Q$delim\E/) { 
     close($fh2); 
     $cnt++; 
     $reader =~ s/[0-9]+/$cnt/; # reader_0.html -> reader_1.html 
     open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write 
    } 
    print $fh2 $li if !$delim_first_time; 
} 
close($fh); 
close($fh2) if defined $fh2; # $fh2 is never opened if the delimiter does not occur 

# output: 
# [~]$ ./test.pl 
# INFO: Created folder chapter1465642603 to collect files. 
# [~]$ ls chapter1465642603 
# reader_0.html reader_1.html 
# [~]$ cat chapter1465642603/reader_0.html 
# <p sourcepage="27" `class="bch_ha"`></p> 
# <p sourcepage="26"  class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p> 
# <p sourcepage="26"  class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
# <p sourcepage="26"  class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX% </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26"  class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
# <p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# [~]$ 
# [~]$ cat chapter1465642603/reader_1.html 
# <p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26"  class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b> </p> 
# <p sourcepage="26"  class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# [~]$
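
The two extras the question asks for (typing the delimiter in by hand, and collecting the pieces in a folder named after the input file) could probably be added to the same loop along the following lines. This is only a sketch under a few assumptions: the helper names `output_dir_for`, `prompt_delimiter` and `split_html` are made up for illustration, the folder is created in the current directory, and the delimiter is compared as a literal string with `index` instead of a regex. It is still line-based, so the parser advice from the comments applies to it as well.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: folder name = input file name without path and extension,
# e.g. "book/chapter01.html" -> "chapter01".
sub output_dir_for {
    my ($input_file) = @_;
    my ($name) = fileparse($input_file, qr/\.[^.]*/);
    return $name;
}

# Hypothetical helper: ask for the delimiter on STDIN; an empty answer keeps the default.
sub prompt_delimiter {
    my ($default) = @_;
    print "Delimiter [$default]: ";
    my $answer = <STDIN>;
    $answer = '' unless defined $answer;
    chomp $answer;
    return length($answer) ? $answer : $default;
}

# Start a new reader_N.html in $dir at every line that contains $delim.
# Lines before the first delimiter and from </body> onwards are skipped.
# Returns the list of files written.
sub split_html {
    my ($input_file, $delim, $dir) = @_;
    -d $dir or mkdir $dir or die "Can't create folder $dir: $!\n";

    open(my $in, '<', $input_file) or die "Can't read $input_file: $!\n";
    my ($out, @written);
    while (my $line = <$in>) {
        last if $line =~ m{</body>}i;
        if (index($line, $delim) >= 0) {          # literal match, no regex involved
            close($out) if $out;
            my $path = "$dir/reader_" . scalar(@written) . ".html";
            open($out, '>', $path) or die "Can't write $path: $!\n";
            push @written, $path;
        }
        print {$out} $line if $out;
    }
    close($out) if $out;
    close($in);
    return @written;
}

# Runs only when an input file is given on the command line.
if (@ARGV) {
    my $input = $ARGV[0];
    my $delim = prompt_delimiter('class="bch_ha"');
    my @files = split_html($input, $delim, output_dir_for($input));
    print "Wrote $_\n" for @files;
} else {
    print STDERR "usage: $0 input.html\n";
}
```

Saved under any name, say `split_html.pl`, it would be run as `perl split_html.pl chapter01.html`; pressing Enter at the prompt keeps the default delimiter, and the pieces end up in `chapter01/reader_0.html`, `chapter01/reader_1.html`, and so on.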