How to use sed to put the content of each HTML tag on a single line

I am modifying an HTML file to make it easier to parse. I need each HTML element inside the body on its own line.

For example, my current HTML file:

<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"> 
    <head> 
    <meta content="text/html; charset=utf-8" http-equiv="Content-type" /> 
    <meta name="ncc:files" content="78" /> 
    </head> 
    <body> 
    <h1 class="title" id="h1"><a href="001.smil#txt4">ABOUT DAISY</a></h1> 
    <h1 class="section" id="h7"> 
     <a href="002.smil#txt10">Cover</a> 
    </h1> 
    <span class="page-normal" id="p13"> 
     <a href="002.smil#txt15">1</a> 
    </span> 
    <h1 class="section" id="h18"> 
     <a href="003.smil#txt21">Swadesaabhimaani, K. Kelappan, Muhammad Abdul Rahiman</a> 
    </h1> 
    <span class="page-normal" id="p24"> 
     <a href="003.smil#txt26">2</a> 
    </span> 
    <span class="page-normal" id="p33"> 
     <a href="003.smil#txt35">3</a> 
    </span> 
    <h1 class="section" id="h38"> 
     <a href="004.smil#txt41">Title</a> 
    </h1> 
    <span class="page-normal" id="p45"> 
     <a href="004.smil#txt47">4</a> 
    </span> 
    <h1 class="section" id="h50"> 
     <a href="005.smil#txt53">Publication</a> 
    </h1> 
    <span class="page-normal" id="p69"> 
     <a href="005.smil#txt71">5</a> 
    </span> 
    <h1 class="section" id="h74"> 
     <a href="006.smil#txt77">K. Ramakrishnapilla</a> 
    </h1> 
     </body> 
</html>

Required HTML after the <body> tag:

<h1 class="title" id="h1"><a href="001.smil#txt4">ABOUT DAISY</a></h1> 
<h1 class="section" id="h7"><a href="002.smil#txt10">Cover</a></h1> 
<span class="page-normal" id="p13"><a href="002.smil#txt15">1</a></span>

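That is, the content of each tag must end up on a single line, without being split across lines. Please advise how to do this with sed.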

Source

2016-01-19 Anes

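While it is possible to do this with 'sed' as a super-advanced challenge, you would be better off reviewing S.O. for using 'awk' to set a flag variable to indicate ''. BUT, please see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. Sooner or later you will run into problems manipulating xml(ish) data with 'sed' or 'awk'. You will need to learn a language that has xml support. Good luck. – shellter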
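
One way to read the awk suggestion above, as a minimal sketch rather than the commenter's actual method: a flag variable records whether the current line is inside an element, and lines are buffered until the closing tag arrives. It assumes that the only elements of interest are the <h1> and <span> elements from the question, that each opening and closing tag sits on its own line or the whole element is already on one line, and that whitespace at line ends is not significant. The function name join_with_awk is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads HTML on stdin and prints each <h1>/<span>
# element on a single line. Everything outside those elements is dropped.
join_with_awk() {
    awk '
        # Strip leading and trailing blanks from every line.
        { gsub(/^[ \t]+|[ \t]+$/, "") }
        # An opening <h1 ...> or <span ...> raises the flag and resets the buffer.
        /^<(h1|span)[ >]/ { inside = 1; buf = "" }
        # While the flag is set, append the current line to the buffer.
        inside { buf = buf $0 }
        # A closing tag at the end of the line prints the buffer and lowers the flag.
        inside && /<\/(h1|span)>$/ { print buf; inside = 0 }
    '
}
```

It would be called as join_with_awk < INFILE > OUTFILE. Buffered lines are glued together with no separator, so text content that itself wraps across lines would lose the space at the line break.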

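It can be done like this: first join all lines into one, e.g. tr -d '\n' < INFILE > OUTFILE (tr reads only standard input, so the input file has to be redirected).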

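Then decide where you want the line breaks and write a sed script for them. For example, if you want every container tag such as <p> and <h1> on its own line: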

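#sedscript.sed 
# GNU sed: \n in the replacement inserts a newline; [ >] also matches tags with attributes
s/<h1[ >]/\n&/g
s/<\/h1>/&\n/g
s/<p[ >]/\n&/g
s/<\/p>/&\n/g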

Then run it with sed -f sedscript.sed OUTFILE.

While this may suit your needs, it cannot handle malformed HTML (e.g. overlapping tags).
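
The steps above can be combined into one pipeline. This is a minimal sketch under a few assumptions: the elements to keep are the <h1> and <span> elements from the question, whitespace between tags is not significant, and the function name one_element_per_line is hypothetical. Newlines are turned into spaces rather than deleted so that wrapped text is not glued together, and the backslash-newline form of the replacement is used because it works in both GNU and BSD sed.

```shell
#!/bin/sh
# Hypothetical helper: reads HTML on stdin, writes each <h1>/<span>
# element of the body on its own line.
one_element_per_line() {
    # 1. Join the whole file into one line (newline -> space).
    tr '\n' ' ' |
    # 2. Drop whitespace between adjacent tags, then break the line
    #    after <body ...> and after every closing </h1> or </span>.
    sed -e 's/>[[:space:]]*</></g' \
        -e 's/<body[^>]*>/&\
/' \
        -e 's/<\/h1>/&\
/g' \
        -e 's/<\/span>/&\
/g' |
    # 3. Keep only the lines that start with one of the wanted elements.
    grep -E '^<(h1|span)[ >]'
}
```

It would be called as one_element_per_line < INFILE > OUTFILE. The same caveat applies: nested or overlapping <h1>/<span> elements are not handled.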

Source

2016-01-27 10:00:11
