2010-07-20 51 views
3

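Split a large string into an array without breaking tags at the split points

I wrote a script that sends large chunks of text to Google for translation, but sometimes the text (which is HTML source) ends up being split in the middle of an HTML tag, and Google then returns broken code.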

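I already know how to split a string into an array, but is there a better way to do it that also guarantees each output string is no longer than 5000 characters and is never split inside a tag?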

UPDATE: Thanks to the answers, this is the code I ended up using in my project, and it works great:

function handleTextHtmlSplit($text, $maxSize) {
    // our collection array
    $niceHtml[] = '';

    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    // the current position of the index
    $currentPiece = 0;

    // start assembling a group until it gets to max size
    foreach ($pieces as $piece) {
        // make sure string length of this piece will not exceed max size when inserted
        if (strlen($niceHtml[$currentPiece] . $piece) > $maxSize) {
            // advance current piece
            // will put overflow into next group
            $currentPiece += 1;
            // create empty string as value for next piece in the index
            $niceHtml[$currentPiece] = '';
        }
        // insert piece into our master array
        $niceHtml[$currentPiece] .= $piece;
    }

    // return array of nicely handled html
    return $niceHtml;
}

Answers

3

Note: I haven't had a chance to test this (so there may be a minor bug or two), but it should give you an idea:

function get_groups_of_5000_or_less($input_string) {

    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $input_string,
        -1, PREG_SPLIT_DELIM_CAPTURE);

    $groups[] = '';
    $current_group = 0;

    // Compare against null explicitly: preg_split can return empty strings
    // (e.g. between adjacent tags), which would otherwise end the loop early
    while (($cur_piece = array_shift($pieces)) !== null) {
        $piecelen = strlen($cur_piece);

        if (strlen($groups[$current_group]) + $piecelen > 5000) {
            // Adding the next piece whole would go over the limit,
            // figure out what to do.
            if ($cur_piece[0] == '<') {
                // Tag goes over the limit, just put it into a new group
                $groups[++$current_group] = $cur_piece;
            } else {
                // Non-tag goes over the limit, split it and put the
                // remainder back on the list of un-grabbed pieces
                $grab_amount = 5000 - strlen($groups[$current_group]);
                $groups[$current_group] .= substr($cur_piece, 0, $grab_amount);
                $groups[++$current_group] = '';
                array_unshift($pieces, substr($cur_piece, $grab_amount));
            }
        } else {
            // Adding this piece doesn't go over the limit, so just add it
            $groups[$current_group] .= $cur_piece;
        }
    }
    return $groups;
}

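Also note that this can split in the middle of regular words. If you don't want that, modify the part that begins with "// Non-tag goes over the limit" so that it picks a better value for $grab_amount. I didn't bother coding that, since this is only an example of how to avoid splitting tags, not a drop-in solution.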
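One way to pick a better $grab_amount is sketched below. The helper name choose_grab_amount and its parameters are hypothetical and not part of the answer above. It assumes that cutting right after the last whitespace character that fits is acceptable, and it counts bytes with strlen(), so multibyte text would need the mb_* functions instead.

```php
<?php
// Hypothetical helper (not from the original answer): given a text piece and
// the number of characters still free in the current group, return how many
// characters to take so that the cut lands on a word boundary when possible.
function choose_grab_amount($piece, $available) {
    if ($available <= 0) {
        return 0;
    }
    if (strlen($piece) <= $available) {
        // The whole piece fits, nothing to split
        return strlen($piece);
    }
    if (ctype_space($piece[$available])) {
        // The character right after the cut is whitespace, so the cut
        // already falls at the end of a word
        return $available;
    }
    // Greedy match up to the last whitespace inside the part that fits
    $head = substr($piece, 0, $available);
    if (preg_match('/^.*\s/s', $head, $m)) {
        return strlen($m[0]);
    }
    // No whitespace at all: fall back to a hard cut so the caller
    // always makes progress
    return $available;
}
```

In the function above, the line that computes $grab_amount could then call choose_grab_amount($cur_piece, 5000 - strlen($groups[$current_group])). The hard-cut fallback is deliberate: returning 0 for a piece with no whitespace would put the same piece back on the list forever.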

+0

Wow Amber, thank you. That should really get my wheels turning. I'll give it a try. – james 2010-07-21 01:57:49

0

Why not strip the HTML tags from the string before sending it to Google? PHP has a strip_tags() function that can do this for you.

+0

Because I need to keep the HTML intact, since it will eventually be rendered on the page – james 2010-07-21 01:56:32

+0

Doesn't Google translate the tags as well? – 2010-07-21 07:44:24

+0

No, it ignores HTML tags and attributes other than 'alt', as far as my testing shows. It returns them untouched – james 2010-07-21 18:21:15

0

preg_split with a good regular expression will do it for you.
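As a small illustration of that suggestion, the sketch below runs preg_split with the same tag pattern used elsewhere in this thread on a made-up sample string. The PREG_SPLIT_DELIM_CAPTURE flag keeps the tags in the result, and adding PREG_SPLIT_NO_EMPTY drops the empty strings that appear between adjacent tags.

```php
<?php
// Sample input invented for this illustration
$html = '<p>Hello <b>world</b></p>';

// Tags are kept as separate items because the pattern has a capture group
$pieces = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
// $pieces is ['', '<p>', 'Hello ', '<b>', 'world', '</b>', '', '</p>', '']

// Same split, but without the empty strings
$compact = preg_split('/(<[^>]*>)/', $html, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
// $compact is ['<p>', 'Hello ', '<b>', 'world', '</b>', '</p>']

// Joining the pieces back together reproduces the input exactly
$rebuilt = implode('', $pieces);
```

Because no character is lost in the split, the pieces can be regrouped into chunks of any size and the chunks concatenated later without damaging the markup.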