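Remove HTML and CSS styles from a PDF created with iText

We create PDFs dynamically in our application using iText. Users enter the PDF content into the web application through a screen that has a rich text editor (RTE).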

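Here are the exact steps.

The user goes to the "add PDF content" page.
The page has a rich text editor where the PDF content can be entered.
Sometimes the user copies content from an existing Word document and pastes it into the RTE.
Once the user submits the content, the PDF is created.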

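We use the RTE because we have other pages where the content must be displayed with bold, italics, and so on.

However, we do not want any of this RTE markup in the generated PDF.

Before generating the PDF, we use a Java utility to strip the RTE markup from the content.

This works fine for content typed into the editor, but when content is copied from a Word document, the HTML and CSS styles that Word applies are not removed by our Java utility.

How can I generate the PDF without any HTML or CSS in it?

Here is the code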

Paragraph paragraph = new Paragraph(Util.removeHTML(content), font);

And the removeHTML method is as follows

public static String removeHTML(String htmlString) { 
    if (htmlString == null) 
     return ""; 
    htmlString.replace("\"", "'"); 
    htmlString = htmlString.replaceAll("\\<.*?>", ""); 
    htmlString = htmlString.replaceAll("&nbsp;", ""); 
    return htmlString; 
}

Below is the extra content that shows up in the PDF when I copy and paste from a Word document.

<w:LsdException Locked="false" Priority="10" SemiHidden="false" 
UnhideWhenUsed="false" QFormat="true" Name="Title" /> 
<w:LsdException Locked="false" Priority="11" SemiHidden="false" 
UnhideWhenUsed="false" QFormat="true" Name="Subtitle" /> 
<w:LsdException Locked="false" Priority="22" SemiHidden="false"

Please help!

Thanks.

Source

2011-04-20 ashishjmeshram

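Our application is similar: we have a rich text editor (TinyMCE), and our output is a PDF generated with iText. We want the HTML to be as clean as possible, ideally using only the HTML tags that iText's HTMLWorker supports. TinyMCE does a reasonable job of this, but there are still cases where end users submit HTML that is badly mangled and can break iText's ability to generate the PDF.

We use a combination of jSoup and jTidy + CSSParser to filter out unwanted CSS styles entered in the HTML "style" attribute. HTML entered into TinyMCE is scrubbed by this service, which cleans up any markup pasted from Word (in case the user did not use the Paste from Word button in TinyMCE) and gives us HTML that iTextPDF's HTMLWorker translates well.

I also found a table-width problem in iText's HTMLWorker parser (5.0.6): if the table width is given in the style attribute, HTMLWorker ignores it and sets the table width to 0, so the code below includes some logic to work around that. We use the following libraries: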

com.itextpdf:itextpdf:5.0.6     // used to generate PDFs 
org.jsoup:jsoup:1.5.2      // used for cleaning HTML, primary cleaner 
net.sf.jtidy:jtidy:r938      // used for cleaning HTML, secondary cleaner 
net.sourceforge.cssparser:cssparser:0.9.5 // used to parse out unwanted HTML "style" attribute values

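Below is some Groovy service code we built to scrub the HTML, keeping only the tags and style attributes that iText supports, and to fix the table issue. The code makes a few assumptions that are specific to our application. So far it works very well for us.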

import com.steadystate.css.parser.CSSOMParser 
import org.jsoup.Jsoup 
import org.jsoup.nodes.Document 
import org.jsoup.safety.Cleaner 
import org.jsoup.safety.Whitelist 
import org.jsoup.select.Elements 
import org.w3c.css.sac.InputSource 
import org.w3c.dom.css.CSSRule 
import org.w3c.dom.css.CSSRuleList 
import org.w3c.dom.css.CSSStyleDeclaration 
import org.w3c.dom.css.CSSStyleSheet 
import org.w3c.tidy.Tidy 

class HtmlCleanerService { 

    static transactional = true 

    def cleanHTML(def html) { 

     // clean with JSoup which should filter out most unwanted things and 
     // ensure good html syntax 
     html = soupClean(html); 

     // run through JTidy to remove repeated nested tags, clean anything JSoup left out 
     html = tidyClean(html); 

     return html; 
    } 

    def tidyClean(def html) { 
     Tidy tidy = new Tidy() 
     tidy.setAsciiChars(true) 
     tidy.setDropEmptyParas(true) 
     tidy.setDropProprietaryAttributes(true) 
     tidy.setPrintBodyOnly(true) 

     tidy.setEncloseText(true) 
     tidy.setJoinStyles(true) 
     tidy.setLogicalEmphasis(true) 
     tidy.setQuoteMarks(true) 
     tidy.setHideComments(true) 
     tidy.setWraplen(120) 

     // (makeClean || dropFontTags) = replaces presentational markup by style rules 
     tidy.setMakeClean(true)  // remove presentational clutter. 
     tidy.setDropFontTags(true) 

     // word2000 = drop style & class attributes and empty p, span elements 
     // draconian cleaning for Word2000 
     tidy.setWord2000(true)  
     tidy.setMakeBare(true)  // remove Microsoft cruft. 
     tidy.setRepeatedAttributes(org.w3c.tidy.Configuration.KEEP_FIRST) // keep first or last duplicate attribute 

     // TODO ? tidy.setForceOutput(true) 

     def reader = new StringReader(html); 
     def writer = new StringWriter(); 

     // hide output from stderr 
     tidy.setShowWarnings(false) 
     tidy.setErrout(new PrintWriter(new StringWriter())) 

     tidy.parse(reader, writer); // run tidy, providing an input and output stream 
     return writer.toString() 
    } 

    def soupClean(def html) { 

     // clean the html 
     Document dirty = Jsoup.parseBodyFragment(html); 
     Cleaner cleaner = new Cleaner(createWhitelist()); 
     Document clean = cleaner.clean(dirty); 

     // now hunt down all style attributes and ensure we only have those that render with iTextPDF 
     Elements styledNodes = clean.select("[style]"); // every element that has a style attribute 
     styledNodes.each { element -> 
      def style = element.attr("style"); 
      def tag = element.tagName().toLowerCase() 
      def newstyle = "" 
      CSSOMParser parser = new CSSOMParser(); 
      InputSource is = new InputSource(new StringReader(style)) 
      CSSStyleDeclaration styledeclaration = parser.parseStyleDeclaration(is) 
      boolean hasProps = false 
      for (int i=0; i < styledeclaration.getLength(); i++) { 
       def propname = styledeclaration.item(i) 
       def propval = styledeclaration.getPropertyValue(propname) 
       propval = propval ? propval.trim() : "" 

       if (["padding-left", "text-decoration", "text-align", "font-weight", "font-style"].contains(propname)) { 
        newstyle = newstyle + propname + ": " + propval + ";" 
        hasProps = true 
       } 

       // standardize table widths, itextPDF won't render tables if there is only width in the 
       // style attribute. Here we ensure the width is in its own attribute, and change the value so 
       // it is in percentage and no larger than 100% to avoid end users from creating really goofy 
       // tables that they can't edit properly because they have made the width too large. 
       // 
       // width of the display area in the editor is about 740px, so let's ensure everything 
       // is relative to that 
       // 
       // TODO could get into trouble with nested tables and widths within as we assume 
       // one table (e.g. could have nested tables both with widths of 500) 
       if (tag.equals("table") && propname.equals("width")) { 
        if (propval.endsWith("%")) { 
         // ensure it is <= 100% 
         propval = propval.replaceAll(~"[^0-9]", "") 
         propval = Math.min(100, propval.toInteger()) 
        } 
        else { 
         // else we have measurement in px or assumed px, clean up and 
         // get integer value, then calculate a percentage 
         // (multiply before dividing so the int cast does not truncate the ratio to 0) 
         propval = propval.replaceAll(~"[^0-9]", "") 
         propval = Math.min(100, (int) (propval.toInteger() * 100 / 740)) 
        } 
        element.attr("width", propval + "%") 
       } 
      } 
      if (hasProps) { 
       element.attr("style", newstyle) 
      } else { 
       element.removeAttr("style") 
      } 

     } 

     return clean.body().html(); 
    } 

    /** 
    * Returns a JSoup whitelist suitable for sane HTML output and iTextPDF 
    */ 
    def createWhitelist() { 
     Whitelist wl = new Whitelist(); 

     // iText supported tags 
     wl.addTags(
      "br", "div", "p", "pre", "span", "blockquote", "q", "hr", 
      "h1", "h2", "h3", "h4", "h5", "h6", 
      "u", "strike", "s", "strong", "sub", "sup", "em", "i", "b", 
      "ul", "ol", "li", 
      "table", "tbody", "td", "tfoot", "th", "thead", "tr" 
      ); 

     // iText attributes recognized which we care about 
     // padding-left (div/p/span indentation) 
     // text-align (for table right/left align) 
     // text-decoration (for span/div/p underline, strikethrough) 
     // font-weight (for span/div/p bolder etc) 
     // font-style (for span/div/p italic etc) 
     // width (for tables) 
     // colspan/rowspan (for tables) 

     ["span", "div", "p", "table", "ul", "ol", "pre", "td", "th"].each { tag -> 
      ["style", "padding-left", "text-decoration", "text-align", "font-weight", "font-style"].each { attr -> 
       wl.addAttributes(tag, attr) 
      } 
     } 

     ["td", "th"].each { tag -> 
      ["colspan", "rowspan", "width"].each { attr -> 
       wl.addAttributes(tag, attr) 
      } 
     } 
     wl.addAttributes("table", "width", "style", "cellpadding") 

     // img support 
     // wl.addAttributes("img", "align", "alt", "height", "src", "title", "width") 


     return wl 
    } 
}
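
The table-width rule described in the comments above (percentages capped at 100, pixel widths scaled against the editor's roughly 740px display area) can be sketched on its own in plain Java. This is only an illustration: the class name `TableWidth`, the method `toPercent`, and the choice to return null for values without digits (such as "auto") are assumptions, not part of the service above.

```java
final class TableWidth {
    // Width of the editor's display area in pixels, as stated in the answer.
    private static final int EDITOR_WIDTH_PX = 740;

    private TableWidth() {}

    /**
     * Converts a CSS width such as "500px", "370" or "150%" into a
     * percentage string no larger than "100%".
     * Returns null when the value contains no digits (hypothetical choice).
     */
    static String toPercent(String cssWidth) {
        String digits = cssWidth.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return null;
        }
        int value = Integer.parseInt(digits);
        // Multiply before dividing so integer division does not truncate to 0.
        int percent = cssWidth.trim().endsWith("%")
                ? value
                : value * 100 / EDITOR_WIDTH_PX;
        return Math.min(100, percent) + "%";
    }
}
```

For instance, `TableWidth.toPercent("370px")` yields "50%", and anything wider than the editor area is capped at "100%".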

Source

2011-05-11 18:45:29

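If you just want the text content of the HTML document, use an XML API (such as SAX or DOM) and emit only the text nodes of the document. If you know how to use the DOM, this is trivial with the DocumentTraversal API. If I had my IDE running, I would paste a sample...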
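
A minimal sketch of the DocumentTraversal idea, using only the JDK's built-in XML parser. It assumes the input is already well-formed XML (for example after a tidy pass); raw Word HTML with unclosed tags or undeclared entities such as `&nbsp;` would be rejected by the parser. The class name `TextNodeExtractor` and the method `textOf` are made up for illustration, and the cast to DocumentTraversal relies on the parser's Document implementing that interface, which the JDK's default implementation does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

final class TextNodeExtractor {
    private TextNodeExtractor() {}

    /** Returns the concatenated text nodes of a well-formed XML/XHTML string. */
    static String textOf(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Assumption: the Document implementation supports DOM traversal.
            DocumentTraversal traversal = (DocumentTraversal) doc;
            NodeIterator it = traversal.createNodeIterator(
                    doc.getDocumentElement(), NodeFilter.SHOW_TEXT, null, true);

            // Visit text nodes only; elements, attributes and comments are skipped.
            StringBuilder text = new StringBuilder();
            for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
                text.append(n.getNodeValue());
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```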

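Also, the removeHTML method shown is inefficient. Use Pattern.compile and cache the result in a static variable, then use the Matcher API to do the replacement into a StringBuffer (or a StringBuilder, if that is what it accepts). That way you are not creating a bunch of intermediate strings and throwing them away.

Source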
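
The suggestion above might look roughly like the following sketch. The class name `HtmlStripper` is hypothetical. Two details go beyond what the answer says and are assumptions of this sketch: the pattern also removes whole HTML comments (Word often wraps its markup in conditional comments), and it uses `[^>]*` together with DOTALL so that tags and comments that wrap across lines are matched, since `.` does not match line terminators by default in Java. The quote replacement is applied to the returned value because `String.replace` returns a new string rather than modifying the original.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlStripper {
    // Compiled once and reused for every call.
    // Alternatives: whole comments, any tag (may span lines), &nbsp; entities.
    private static final Pattern MARKUP =
            Pattern.compile("<!--.*?-->|<[^>]*>|&nbsp;", Pattern.DOTALL);

    private HtmlStripper() {}

    static String removeHtml(String html) {
        if (html == null) {
            return "";
        }
        Matcher m = MARKUP.matcher(html);
        StringBuffer out = new StringBuffer(html.length());
        // Copy the text between matches into one buffer, dropping each match.
        while (m.find()) {
            m.appendReplacement(out, "");
        }
        m.appendTail(out);
        // Replace double quotes with single quotes, as the original method intended.
        return out.toString().replace('"', '\'');
    }
}
```

A regular expression is still only a rough filter for HTML; a parser-based cleaner is more robust against malformed input.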

2011-04-20 05:08:00 les2

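Hi. Thanks for the reply. I am not getting the text content from an HTML document but from the database. When the user submits content from the RTE, it first goes into the database, and is then retrieved from the database and used to generate the PDF. – ashishjmeshram 2011-04-20 05:17:22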
