2012-10-23

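Extracting table content with Perl

I am trying to use HTML::TableExtract to extract table content from an HTML file. My problem is that my HTML file is structured as follows: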

<!DOCTYPE html> 
<html> 
<body> 

    <h4>One row and three columns:</h4> 

    <table border="1"> 
     <tr> 
     <td> 
     <p> 100 </p></td> 
     <td> 
     <p> 200 </p></td> 
     <td> 
     <p> 300 </p></td> 
     </tr> 
     <tr> 
     <td> 
     <p> 400 </p></td> 
     <td> 
     <p> 500 </p></td> 
     <td> 
     <p> 600 </p></td> 
     </tr> 
    </table> 
</body> 
</html> 

Because of this structure, my output looks like this:

100| 

200| 

300| 

400| 

500| 

600| 

instead of what I want:

100|200|300| 
400|500|600| 

Can you help? Here is my Perl code:

use strict; 
use warnings; 
use HTML::TableExtract; 

my $te = HTML::TableExtract->new(); 
$te->parse_file('Table_One.html'); 

open (DATA2, ">TableOutput.txt") 
    or die "Can't open file"; 

foreach my $ts ($te->tables()) { 

    foreach my $row ($ts->rows()) { 

        my $Final = join('|', @$row); 
        print DATA2 "$Final"; 
    } 
} 
close (DATA2); 

Answers

sub trim(_) { my ($s) = @_; $s =~ s/^\s+//; $s =~ s/\s+\z//; $s } 

Or, in Perl 5.14+:

sub trim(_) { $_[0] =~ s/^\s+//r =~ s/\s+\z//r } 

Then use:

my $Final = join '|', map trim, @$row; 
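To see the effect end to end without the HTML file, here is a minimal sketch that applies trim to hard-coded rows. The @rows data is an assumption standing in for what $ts->rows() plausibly returns for the table in the question (each cell still carrying the newline and spaces around the <p> content). format_row is a hypothetical helper name, and the trailing '|' and newline are added to match the desired output.

```perl
use strict;
use warnings;

# Strip leading and trailing whitespace; the (_) prototype makes $_ the default argument.
sub trim(_) { my ($s) = @_; $s =~ s/^\s+//; $s =~ s/\s+\z//; $s }

# Hypothetical helper: trim each cell, join with '|', append the trailing '|'.
sub format_row {
    my ($row) = @_;
    return join('|', map { trim($_) } @$row) . '|';
}

# Assumed stand-in for the rows HTML::TableExtract would return.
my @rows = (
    ["\n     100 ", "\n     200 ", "\n     300 "],
    ["\n     400 ", "\n     500 ", "\n     600 "],
);

# Prints:
# 100|200|300|
# 400|500|600|
print format_row($_), "\n" for @rows;
```

In the real script the same expression would sit inside the loop over $ts->rows(), with the print going to the output filehandle.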

Why the parentheses? 'my ($s)' – Tim


@Tim N, it forces the list assignment operator. Otherwise, it would be the same as 'my $s = 1;'. – ikegami
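The difference described in this comment can be checked with a short sketch. The sub names first_arg and arg_count are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# List assignment: the parentheses put the left side in list context,
# so $s receives the first element of @_.
sub first_arg { my ($s) = @_; return $s }

# Scalar assignment: @_ is evaluated in scalar context,
# so $n receives the number of arguments instead.
sub arg_count { my $n = @_; return $n }

print first_arg(" 100 "), "\n";   # prints " 100 "
print arg_count(" 100 "), "\n";   # prints 1
```

With a single argument, the scalar form therefore yields 1 rather than the string passed in, which is why trim needs the parentheses.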


Try this:

use strict; 
use warnings; 
use HTML::TableExtract; 

my $te = HTML::TableExtract->new(); 
$te->parse_file('Table_One.html'); 

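# Lexical filehandle and three-argument open; report the reason on failure.
open(my $out, '>', 'TableOutput.txt') or die "Can't open file: $!";
foreach my $ts ($te->tables())
{
    foreach my $row ($ts->rows())
    {
        s/\s+//g for @$row;            # remove all whitespace, including newlines, from each cell
        my $Final = join('|', @$row);
        print $out "$Final|\n";        # trailing '|' and newline, one table row per line
    }
}
close($out);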

Great! Thanks – user1769222


Could you take a look at the edited question? – user1769222
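One design point about the substitution in this answer: it removes every whitespace character, not only the padding around a cell. For purely numeric cells that makes no difference, but a cell with internal spaces would be collapsed. A small sketch of the contrast, where strip_all and trim_ends are hypothetical helper names and the cell value is an invented example, not data from the question:

```perl
use strict;
use warnings;

# Removes every whitespace character, including those inside the cell.
sub strip_all { my ($s) = @_; $s =~ s/\s+//g; return $s }

# Removes only leading and trailing whitespace.
sub trim_ends { my ($s) = @_; $s =~ s/^\s+|\s+\z//g; return $s }

my $cell = "\n     New York ";   # hypothetical cell, not from the question

print strip_all($cell), "\n";    # prints NewYork
print trim_ends($cell), "\n";    # prints New York
```

If the table might hold multi-word cells, trimming only the ends is probably the safer choice.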


Using Mojo::DOM:

#!/usr/bin/env perl 

use strict; 
use warnings; 

use Mojo::DOM; 
my $dom = Mojo::DOM->new(<<'END'); 
<!DOCTYPE html> 
<html> 
<body> 

    <h4>One row and three columns:</h4> 

    <table border="1"> 
     <tr> 
     <td> 
     <p> 100 </p></td> 
     <td> 
     <p> 200 </p></td> 
     <td> 
     <p> 300 </p></td> 
     </tr> 
     <tr> 
     <td> 
     <p> 400 </p></td> 
     <td> 
     <p> 500 </p></td> 
     <td> 
     <p> 600 </p></td> 
     </tr> 
    </table> 
</body> 
</html> 
END 

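# pluck() is not available in recent Mojolicious releases; map() with a
# method name does the same job. The explicit trim is there because recent
# versions of text() may return the surrounding whitespace untouched.
my $rows = $dom->find('table tr');
$rows->each(sub {
    print $_->find('td p')
      ->map('text')
      ->map(sub { s/^\s+|\s+$//gr })
      ->join('|') . "|\n";
});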