Transforming HTML in Haskell

I want to convert valid HTML with shallow nesting into other HTML that follows a more restricted set of rules.

Only the following tags are supported in the resulting HTML:

<b></b>, <strong></strong>, <i></i>, <em></em>, <a href="URL"></a>, <code></code>, <pre></pre>

Nested tags are not allowed.

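For the remaining tags and their combinations, I have to define a rule for handling each one. So I need to convert, for example:

<p>text</p> into the plain string text followed by a line break,

<b>text <a href="url">link</a> text</b> into text link text,

<a href="url">text<code> code here</code></a> into <a href="url">text code here</a>, because <code> is nested inside <a>, and so on.

Example HTML (the line breaks are only for readability):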

<p>long paragraph <a href="url">link</a> </p> 
<p>another text <pre><code>my code block</code></pre> the rest of description</p> 
<p><code>inline monospaced text with <a href="url">link</a></code></p>

should be transformed into:

long paragraph <a href="url">link</a> 

another text <code>my code block</code> the rest of description 

<code>inline monospaced text with link</code>

Any suggestions on how to approach this?

2016-06-09 klappvisor

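After some investigation I found what I think is a fairly elegant solution. It is based on the tagsoup library, whose Text.HTML.TagSoup.Tree module helps parse HTML into a tree structure.

The module also provides the transformTree function, which makes the transformation trivial. The documentation for that function says:

This operation is based on the Uniplate transform function. Given a list of trees, it applies the function to every tree in a bottom-up manner.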
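
To make the bottom-up behaviour concrete, here is a minimal sketch that is separate from the solution itself. It assumes the tagsoup package is installed; unwrapSpans is a hypothetical name chosen for this illustration. Because children are rewritten before their parent, the inner <span> is already gone by the time the outer one is visited.

```haskell
import Text.HTML.TagSoup.Tree
  (TagTree (..), parseTree, renderTree, transformTree)

-- Hypothetical example rule: replace every <span> element by its children
-- and leave every other node untouched.
unwrapSpans :: [TagTree String] -> [TagTree String]
unwrapSpans = transformTree f
  where
    f (TagBranch "span" _ inner) = inner -- children were already transformed
    f x = [x]

main :: IO ()
main =
  -- prints: <p>a b c</p>
  putStrLn (renderTree (unwrapSpans (parseTree "<p><span>a <span>b</span></span> c</p>")))
```

The function passed to transformTree returns a list, so a single node can be deleted (empty list), kept (singleton), or replaced by several nodes.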

You can read more about Uniplate in its documentation.

Here is the code I am quite happy with:

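import Text.HTML.TagSoup
import Text.HTML.TagSoup.Tree

-- Rewrite a parsed HTML forest so that only the allowed, non-nested tags remain.
convert :: [TagTree String] -> [TagTree String]
convert = transformTree f
    where
     f (TagLeaf (TagOpen "br" _)) = [TagLeaf (TagText "\n")] -- line breaks
     f (TagLeaf (TagOpen _ _)) = [] -- drop all other tags without closing pairs
     f (TagBranch "a" attrs inner) = tagExtr "a" attrs inner -- keeps the attributes (e.g. href) of <a>
     f (TagBranch "p" _ inner) = inner ++ [TagLeaf (TagText "\n")] -- paragraph -> contents plus newline
     -- <pre><code>...</code></pre> collapses into a single <pre>;
     -- pass "code" instead of "pre" to get <code> as in the question's expected output
     f (TagBranch "pre" _ [TagBranch "code" _ inner]) = tagExtr "pre" [] inner
     f (TagBranch tag _ inner) = if tag `elem` allowedTags then tagExtr tag [] inner else inner
     f x = [x]

-- Keep the tag itself, but flatten everything inside it to plain text.
tagExtr :: String -> [Attribute String] -> [TagTree String] -> [TagTree String]
tagExtr tag attrs inner = [TagBranch tag attrs [extractFrom inner]]

-- Tags that survive the conversion (their attributes are dropped, except for <a>).
allowedTags :: [String]
allowedTags = ["b", "i", "a", "code", "pre", "em", "strong"]

-- Concatenate all text found in a forest into a single text node.
extractFrom :: [TagTree String] -> TagTree String
extractFrom x = TagLeaf $ TagText $ (innerText . flattenTree) x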
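
The answer does not show how convert is called, so the following is only a sketch of one way to run it end to end, assuming the input is a String and the tagsoup package is available. It repeats the rules in condensed form so that it compiles on its own; convertHtml, flat and allowed are hypothetical names introduced here, while parseTree and renderTree come from Text.HTML.TagSoup.Tree.

```haskell
import Text.HTML.TagSoup (Tag (..), innerText)
import Text.HTML.TagSoup.Tree
  (TagTree (..), flattenTree, parseTree, renderTree, transformTree)

-- Same rules as convert in the answer, repeated so this file compiles alone.
convert :: [TagTree String] -> [TagTree String]
convert = transformTree f
  where
    f (TagLeaf (TagOpen "br" _)) = [TagLeaf (TagText "\n")]
    f (TagLeaf (TagOpen _ _)) = []
    f (TagBranch "a" attrs inner) = flat "a" attrs inner
    f (TagBranch "p" _ inner) = inner ++ [TagLeaf (TagText "\n")]
    f (TagBranch "pre" _ [TagBranch "code" _ inner]) = flat "pre" [] inner
    f (TagBranch tag _ inner)
      | tag `elem` allowed = flat tag [] inner
      | otherwise = inner
    f x = [x]
    allowed = ["b", "i", "code", "pre", "em", "strong"]
    -- Keep the tag, but reduce its children to their concatenated text.
    flat tag attrs inner =
      [TagBranch tag attrs [TagLeaf (TagText (innerText (flattenTree inner)))]]

-- Hypothetical wrapper (not part of the answer): String in, String out.
convertHtml :: String -> String
convertHtml = renderTree . convert . parseTree

main :: IO ()
main =
  -- prints: long paragraph <a href="url">link</a>   (then a space and a newline)
  putStr (convertHtml "<p>long paragraph <a href=\"url\">link</a> </p>")
```

Note that renderTree escapes special characters in text nodes again when it serialises the result, so text that was decoded during parsing is re-encoded on output.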

2016-06-09 18:05:00 klappvisor

Thanks for showing us what you found. – ErikR
