2012-05-22 50 views
3
String original = "This is a sentence.Rajesh want to test the application for the word split."; 
List<String> matchList = new ArrayList<String>(); 
Pattern regex = Pattern.compile(".{1,10}(?:\\s|$)", Pattern.DOTALL); 
Matcher regexMatcher = regex.matcher(original); 
while (regexMatcher.find()) { 
    matchList.add(regexMatcher.group()); 
} 
System.out.println("Match List "+matchList); 

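Breaking a long string into lines with proper word wrapping

I need to parse the text into an array of lines, each no longer than 10 characters, and no word should be broken at the end of a line.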

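I used the logic above, but the problem is that it parses up to the nearest whitespace after 10 characters if there is a break at the end of the line.

For example, the actual sentence is "This is a sentence.Rajesh want to test the application for the word split." but after the logic runs it comes out as follows.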

Match List [This is a , nce.Rajesh , want to , test the , pplication , for the , word , split.]
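One way to see where the missing characters go is to collect the text that lies between consecutive matches. `find()` moves its start position forward whenever no match is possible at the current one, so anything it steps over never reaches the list. The following is only a diagnostic sketch; the class name `SkippedSpans` and the helper `skipped` are made up for illustration and are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkippedSpans {
    // Returns the stretches of text between consecutive matches,
    // i.e. the characters find() stepped over without matching.
    static List<String> skipped(String text, String regex) {
        List<String> gaps = new ArrayList<String>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        int previousEnd = 0;
        while (m.find()) {
            if (m.start() > previousEnd) {
                gaps.add(text.substring(previousEnd, m.start()));
            }
            previousEnd = m.end();
        }
        // Anything left after the last match was skipped as well.
        if (previousEnd < text.length()) {
            gaps.add(text.substring(previousEnd));
        }
        return gaps;
    }

    public static void main(String[] args) {
        String original =
            "This is a sentence.Rajesh want to test the application for the word split.";
        // Prints [sente, a]
        System.out.println("Skipped " + skipped(original, ".{1,10}(?:\\s|$)"));
    }
}
```

The two gaps are the start of "sentence.Rajesh" and the first letter of "application": both words are longer than 10 characters, so no match can begin at their first letter.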

+0

Presumably you want this in Groovy? Apart from the tag, you haven't mentioned Groovy ... –

+1

Do you mean that the 10th character should not be ? What if it is a space? – JHS

+1

What happens if a word is itself longer than 10 characters? Should it be split in the middle? For example, does "quickbrownfoxjumpsoverthelazydog" become `{"quickbrown", "foxjumpsov", "erthelazyd", "og"}`? – dasblinkenlight

Answers

1

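I avoid regex here because it doesn't pull its weight. This code word-wraps, and if a single word is longer than 10 characters it breaks the word. It also handles extra whitespace.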

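import static java.lang.Character.isWhitespace;

import java.util.ArrayList;
import java.util.List;

// Class name is arbitrary; the original snippet showed only the methods.
public class WordWrap {
    public static void main(String[] args) {
        final String original =
            "This is a sentence.Rajesh want to test the application for the word split.";
        final StringBuilder b = new StringBuilder(original.trim());
        final List<String> matchList = new ArrayList<String>();
        while (true) {
            // Drop whitespace left at the front by the previous cut.
            b.delete(0, indexOfFirstNonWsChar(b));
            if (b.length() == 0) break;
            final int splitAt = lastIndexOfWsBeforeIndex(b, 10);
            matchList.add(b.substring(0, splitAt).trim());
            b.delete(0, splitAt);
        }
        System.out.println("Match List " + matchList);
    }

    // Returns the index just past the last whitespace among the first i chars,
    // or i if there is none (the word is then broken at i). If the whole
    // sequence fits within i chars, its length is returned.
    static int lastIndexOfWsBeforeIndex(CharSequence s, int i) {
        if (s.length() <= i) return s.length();
        for (int j = i; j > 0; j--) if (isWhitespace(s.charAt(j - 1))) return j;
        return i;
    }

    // Returns the index of the first non-whitespace char, or s.length() if none.
    static int indexOfFirstNonWsChar(CharSequence s) {
        for (int i = 0; i < s.length(); i++) if (!isWhitespace(s.charAt(i))) return i;
        return s.length();
    }
}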

Prints:

Match List [This is a, sentence.R, ajesh, want to, test the, applicatio, n for the, word, split.] 
+0

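My requirement is that I need to limit the number of characters in one line to 100 or fewer, and if the word at the end of the 100 characters would be broken, that word needs to go on the next line – Raja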
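For the requirement in the comment above (a configurable limit such as 100, with a word that does not fit moved whole to the next line), a greedy word-by-word wrap is one possible approach. This is a minimal sketch rather than the answerer's code: the names `LineWrapper` and `wrap` are hypothetical, runs of whitespace are collapsed to a single space, and a word longer than the limit is assumed to stay whole on its own line.

```java
import java.util.ArrayList;
import java.util.List;

class LineWrapper {
    // Greedy wrap: append words to the current line while the result still
    // fits in maxWidth; otherwise start a new line with the word.
    // Assumption: a word longer than maxWidth is kept whole on its own line.
    static List<String> wrap(String text, int maxWidth) {
        List<String> lines = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue; // only happens for blank input
            // +1 accounts for the space that would join the word to the line.
            if (current.length() > 0
                    && current.length() + 1 + word.length() > maxWidth) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(word);
        }
        if (current.length() > 0) lines.add(current.toString());
        return lines;
    }

    public static void main(String[] args) {
        String original =
            "This is a sentence.Rajesh want to test the application for the word split.";
        for (String line : wrap(original, 20)) {
            System.out.println(line);
        }
    }
}
```

Calling `wrap(original, 100)` returns the sentence as a single line, since it is shorter than 100 characters.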

1

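This question was tagged Groovy at some point. Assuming a Groovy answer is still valid, and that you are not worried about preserving runs of multiple spaces (such as "  "):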

def splitIntoLines(text, maxLineSize) { 
    def words = text.split(/\s+/) 
    def lines = [''] 
    words.each { word ->
        def lastLine = (lines[-1] + ' ' + word).trim()
        if (lastLine.size() <= maxLineSize)
            // Change last line.
            lines[-1] = lastLine
        else
            // Add word as new line.
            lines << word
    }
    lines 
} 

// Tests... 
def original = "This is a sentence. Rajesh want to test the application for the word split." 

assert splitIntoLines(original, 10) == [ 
    "This is a", 
    "sentence.", 
    "Rajesh", 
    "want to", 
    "test the", 
    "application", 
    "for the", 
    "word", 
    "split." 
] 
assert splitIntoLines(original, 20) == [ 
    "This is a sentence.", 
    "Rajesh want to test", 
    "the application for", 
    "the word split." 
] 
assert splitIntoLines(original, original.size()) == [original] 
4

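OK, I managed to get the following working, with a maximum line length of 10, and it also correctly splits words that are longer than 10 characters!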

String original = "This is a sentence. Rajesh want to test the applications for the word split handling."; 
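List<String> matchList = new ArrayList<String>(); 
// The fallback group uses .{1,10} rather than .{0,10}: the latter can also match the empty string at the end of the input.
Pattern regex = Pattern.compile("(.{1,10}(?:\\s|$))|(.{1,10})", Pattern.DOTALL); 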
Matcher regexMatcher = regex.matcher(original); 
while (regexMatcher.find()) { 
    matchList.add(regexMatcher.group()); 
} 
System.out.println("Match List "+matchList); 

This is the result:

This is a 
sentence. 
Rajesh 
want to 
test the 
applicatio
ns for the 
word split 
handling.
+0

If you want to include newlines, then: "(.{1,10}(?:\\s\\n|$))|(.{0,10})" – Rafe

+0

This is a nice use of regex! But it is hard to add a '-' between the parts of a broken word... – Valen

+0

Sorry, I don't follow? – Rafe
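On the point about adding a '-' between the parts of a broken word: with two capture groups, the fallback group only matches when no whitespace or end of input can be reached within the limit, which means the match sits inside a longer word. That makes it possible to append the hyphen only there. The sketch below is an assumption-laden variant of the regex approach, not code from the thread: the names `HyphenatingSplitter` and `split` are invented, the fallback takes one character fewer so the hyphen still fits within the limit, and `width` is assumed to be at least 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenatingSplitter {
    // Assumes width >= 2 so the fallback group can hold at least one char.
    static List<String> split(String text, int width) {
        // Group 1: up to `width` chars followed by whitespace or end of input.
        // Group 2: fallback, width-1 chars cut from inside a longer word,
        //          leaving room for the trailing '-'.
        Pattern p = Pattern.compile(
            "(.{1," + width + "}(?:\\s|$))|(.{1," + (width - 1) + "})",
            Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> lines = new ArrayList<String>();
        while (m.find()) {
            if (m.group(2) != null) {
                lines.add(m.group(2) + "-");
            } else {
                String line = m.group(1).trim();
                if (!line.isEmpty()) lines.add(line); // skip whitespace-only matches
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String original =
            "This is a sentence. Rajesh want to test the applications for the word split handling.";
        System.out.println("Match List " + split(original, 10));
    }
}
```

With the sentence from the answer and a width of 10, "applications" comes out as "applicati-" followed by "ons for".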
