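Truncating HTML in Python

Is there a pure-Python tool that takes some HTML and truncates it as close as possible to a given length, while making sure the resulting snippet is well-formed? For example, given this HTML: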

<h1>This is a header</h1> 
<p>This is a paragraph</p>

it wouldn't produce:

<h1>This is a hea

but:

<h1>This is a header</h1>

or at least:

<h1>This is a hea</h1>

I can't find one that works; the only one I did find relies on pullparser, which is both obsolete and dead.

Source

2011-02-11 JasonFruit

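"It would produce:" ... given what parameter? A number of consecutive characters? A number of DOM elements, a depth in the hierarchy? – akira 2011-02-11 15:01:03

Either a number of content characters or a number of HTML characters. I'm not picky. – JasonFruit 2011-02-11 15:02:39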

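I don't think you need a full-fledged parser - you only need to tokenize the input string into one of:

text
open tags
close tags
self-closing tags
character entities

Once you have a stream of tokens like that, it's easy to use a stack to keep track of which tags still need closing. I actually ran into this problem a while ago and wrote a small library to do this: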

https://github.com/eentzel/htmltruncate.py

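It works well for me and handles most of the corner cases well, including arbitrarily nested markup, counting character entities as a single character, returning an error on malformed markup, etc.

It will produce: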

<h1>This is a hea</h1>

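on your example. This could perhaps be changed, but it's hard in the general case - what if you want to truncate to 10 characters, but the <h1> tag isn't closed for another, say, 300 characters?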
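
The tokenize-then-track-with-a-stack idea described in this answer can be sketched roughly as below. This is a minimal illustration for Python 3, not the code of htmltruncate.py: the function name truncate_html_chars and the regular expressions are assumptions made up for the example, and it assumes the input is reasonably well nested.

```python
import re

# One token is a tag, a character entity, or a single text character.
TOKEN_RE = re.compile(r"<[^>]*>|&#?\w+;|.", re.DOTALL)
TAG_RE = re.compile(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")
# Elements that never take a closing tag (same set as the HTML4 singlets).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "param"}


def truncate_html_chars(html, limit):
    """Keep at most `limit` content characters, then close any open tags."""
    out = []
    stack = []   # names of tags that are currently open
    count = 0    # content characters kept so far (an entity counts as one)
    for match in TOKEN_RE.finditer(html):
        token = match.group(0)
        tag = TAG_RE.fullmatch(token)
        if tag:
            closing, name, self_closing = tag.groups()
            name = name.lower()
            if closing:
                # Only pop when the close tag matches the innermost open tag.
                if stack and stack[-1] == name:
                    stack.pop()
            elif not self_closing and name not in VOID_TAGS:
                stack.append(name)
            out.append(token)
        elif token.startswith("<") and token.endswith(">"):
            # Comments, doctypes and the like pass through uncounted.
            out.append(token)
        else:
            if count >= limit:
                break
            out.append(token)
            count += 1
    # Close whatever is still open, innermost first.
    out.extend("</%s>" % name for name in reversed(stack))
    return "".join(out)
```

Calling truncate_html_chars on the question's example with a limit of 13 gives the "or at least" form, i.e. the header cut after "hea" with its closing tag added.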

Source

2011-03-07 19:05:43 eentzel

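My first thought would be to use an XML parser (perhaps Python's sax parser) and then count the text characters in each XML element. I would leave the markup characters out of the count, to make it more consistent as well as simpler, but either should be possible.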
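
A rough sketch of that SAX idea, assuming the input is well-formed XML with a single root element (real-world HTML often is not). The names truncate_xml_text, _Truncator and _LimitReached are made up for this illustration; only text characters are counted, and markup is ignored.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class _LimitReached(Exception):
    """Raised inside the handler to stop parsing early."""


class _Truncator(xml.sax.ContentHandler):
    def __init__(self, limit):
        super().__init__()
        self.limit = max(0, limit)
        self.count = 0    # text characters emitted so far
        self.out = []     # output fragments
        self.stack = []   # names of elements still open

    def startElement(self, name, attrs):
        if self.count >= self.limit:
            raise _LimitReached  # do not open new elements past the limit
        attr_text = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
        self.out.append("<%s%s>" % (name, attr_text))
        self.stack.append(name)

    def endElement(self, name):
        self.out.append("</%s>" % name)
        self.stack.pop()

    def characters(self, content):
        # The parser may deliver text in several chunks; keep what still fits.
        kept = content[: self.limit - self.count]
        self.out.append(escape(kept))
        self.count += len(kept)
        if len(kept) < len(content):
            raise _LimitReached


def truncate_xml_text(xml_text, limit):
    handler = _Truncator(limit)
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), handler)
    except _LimitReached:
        pass
    # Close whatever was still open when parsing stopped, innermost first.
    closers = "".join("</%s>" % name for name in reversed(handler.stack))
    return "".join(handler.out) + closers
```

Because the example HTML in the question has two top-level elements, it would need to be wrapped in a single root element (a div, say) before being passed to a function like this.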

Source

2011-02-11 15:13:01 Petriborg

As I commented on funktku's answer, has anyone *already done* it? – JasonFruit 2011-02-11 15:23:09

@JasonFruit Oh, I see what you mean - honestly, I don't know of one, simple as the problem seems. – Petriborg 2011-02-11 15:45:22

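I'd suggest parsing the HTML completely first and then truncating. A great HTML parser for Python is lxml. After parsing and truncating, you can serialize it back to HTML.

Source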

2011-02-11 15:14:24

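But has anyone *already done* it? I understand the problem, but it seems like a common one, and somebody must have a solution. – JasonFruit 2011-02-11 15:22:19

Take a look at HTML Tidy for cleaning up and reformatting the HTML.

Source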

2011-02-11 18:52:12

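Not the best option, and not really a Python thing. – JasonFruit 2011-02-12 01:26:21

There are some Python library bindings to Tidy; check them out. I used it to clean up MS Word HTML that some users had pasted into a CMS. – 2011-02-13 04:46:15

If you are using Django's library, you can simply do: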

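from django.utils import text, html


class class_name():

    def trim_string(self, stringf, limit, offset=0):
        # Plain slice; not HTML-aware.
        return stringf[offset:limit]

    def trim_html_words(self, html, limit, offset=0):
        # The `html` parameter shadows the imported module here; `offset` is unused.
        # truncate_html_words is the helper from the Django of that time; later
        # releases offer text.Truncator(html).words(limit, html=True) instead.
        return text.truncate_html_words(html, limit)

    def remove_html(self, htmls, tag, limit='all', offset=0):
        # Strips every tag; `tag`, `limit` and `offset` are unused.
        return html.strip_tags(htmls)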

Anyway, here is the truncate_html_words code from Django:

import re 

def truncate_html_words(s, num):
    """
    Truncates html to a certain number of words (not counting tags and comments).
    Closes opened tags if they were correctly closed in the given html.
    """
    length = int(num)
    if length <= 0:
        return ''
    html4_singlets = ('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input')
    # Set up regular expressions
    re_words = re.compile(r'&.*?;|<.*?>|([A-Za-z0-9][\w-]*)')
    re_tag = re.compile(r'<(/)?([^ ]+?)(?: (/)| .*?)?>')
    # Count non-HTML words and keep note of open tags
    pos = 0
    ellipsis_pos = 0
    words = 0
    open_tags = []
    while words <= length:
        m = re_words.search(s, pos)
        if not m:
            # Checked through whole string
            break
        pos = m.end(0)
        if m.group(1):
            # It's an actual non-HTML word
            words += 1
            if words == length:
                ellipsis_pos = pos
            continue
        # Check for tag
        tag = re_tag.match(m.group(0))
        if not tag or ellipsis_pos:
            # Don't worry about non tags or tags after our truncate point
            continue
        closing_tag, tagname, self_closing = tag.groups()
        tagname = tagname.lower()  # Element names are always case-insensitive
        if self_closing or tagname in html4_singlets:
            pass
        elif closing_tag:
            # Check for match in open tags list
            try:
                i = open_tags.index(tagname)
            except ValueError:
                pass
            else:
                # SGML: An end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags
                open_tags = open_tags[i+1:]
        else:
            # Add it to the start of the open tags list
            open_tags.insert(0, tagname)
    if words <= length:
        # Don't try to close tags if we don't need to truncate
        return s
    out = s[:ellipsis_pos] + ' ...'
    # Close any tags still open
    for tag in open_tags:
        out += '</%s>' % tag
    # Return string
    return out

Source

2011-02-13 16:57:12 vertazzar

This should help with your requirement. It is an easy-to-use HTML parser and bad-markup corrector:

http://www.crummy.com/software/BeautifulSoup/

Source

2011-02-13 17:14:11 DhruvPathak

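You can do this in one line with BeautifulSoup (assuming you want to truncate at a certain number of source characters, rather than at a number of content characters):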

from BeautifulSoup import BeautifulSoup 

def truncate_html(html, length): 
    return unicode(BeautifulSoup(html[:length]))

Source

2011-12-08 17:02:08 slacy

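I found the answer by slacy very helpful and would upvote it if I had the reputation - but there is one more thing to note. In my environment I have html5lib installed as well as BeautifulSoup4. BeautifulSoup used the html5lib parser, which caused my HTML snippet to be wrapped in html and body tags, which is not what I wanted.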

>>> truncate_html("<p>sdfsdaf</p>", 4) 
u'<html><head></head><body><p>s</p></body></html>'

To fix this, I told BeautifulSoup to use Python's built-in parser:

from bs4 import BeautifulSoup 
def truncate_html(html, length): 
    return unicode(BeautifulSoup(html[:length], "html.parser")) 

>>> truncate_html("<p>sdfsdaf</p>", 4) 
u'<p>s</p>'

Source

2012-02-09 05:13:20
