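Breaking a paragraph into sentences - a special case

I am new to programming in Java. I want to split the paragraphs in a file into sentences and write them to a different file. There should also be a mechanism to identify which sentence came from which paragraph. The code I have used so far is shown below. But this code breaks: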

Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.

into

Former Secretary of Finance Dr. 
P.B. 
Jayasundera is being questioned by the police Financial Crime Investigation Division.

How can I correct this? Thanks in advance.

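import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStream;

class trial4 {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files, even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             OutputStream out = new FileOutputStream("output10.txt")) {
            String s;
            while ((s = br.readLine()) != null) {
                // Splits after every '.', '!' or '?' that is followed by a space,
                // which is why "Dr." and "P.B." are treated as sentence ends
                String[] token = s.split("(?<=[.!?])\\s* ");
                for (String sentence : token) {
                    if (sentence.isEmpty()) {
                        continue; // blank input lines produce no output
                    }
                    out.write(sentence.getBytes());
                    out.write('\n'); // one sentence per output line
                }
            }
        }
    }
}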

I have gone through all the similar questions posted on StackOverflow, but those answers did not help me solve this problem.

Source

2015-11-08 sugz

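This will be reasonably hard to do unless you can formalize some notion of "this period marks an abbreviation" vs. "this period marks the end of a sentence".

How about using a negative lookbehind in combination with a replace? Simply put: replace every end-of-sentence mark that has nothing "special" before it with that mark followed by a newline.

A list of "known abbreviations" will be necessary. There is no guarantee on how long those can be, nor on how short the word at the end of a sentence might be. (See? "be" is already pretty short!)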

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;

class trial4 {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files, even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             PrintStream out = new PrintStream(new FileOutputStream("output10.txt"))) {
            String s = br.readLine();
            while (s != null) {
                out.print(                           //No println: the replacement appends the newline
                    s.replaceAll("(?i)"              //Make the match case-insensitive
                        + "(?<!"                     //Negative lookbehind
                        + "(\\W\\w)|"                //Single non-word followed by word character (P.B.)
                        + "(\\W\\d{1,2})|"           //One or two digits (dates!)
                        + "(\\W(dr|mr|mrs|ms))"      //List of known abbreviations
                        + ")"                        //End of lookbehind
                        + "([!?\\.])"                //Match end of sentence
                        , "$5"                       //Replace with the end-of-sentence mark found
                        + System.lineSeparator()));  //Add newline after it
                s = br.readLine();
            }
        }
    }
}
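
The question also asks for a way to tell which paragraph each sentence came from, which the code above does not cover. The following is only a minimal sketch of one way to do it, assuming each paragraph is available as one string (for example, one line of input.txt). The class name ParagraphSentences, the method names and the tab-separated output format are made up for illustration; the regular expression is the same heuristic as above, so it shares its limits (for instance, a sentence that really ends in a single letter or in "Mr." is not split).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class ParagraphSentences {
    // '.', '!' or '?' counts as a sentence end unless it directly follows a single
    // word character, a 1-2 digit number, or a listed abbreviation
    // (each of them preceded by a non-word character).
    private static final Pattern SENTENCE_END = Pattern.compile(
            "(?i)(?<!\\W\\w|\\W\\d{1,2}|\\W(?:dr|mr|mrs|ms))([!?.])");

    // Splits one paragraph into trimmed sentences.
    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        // Mark every accepted sentence end with '\n', then cut at the marks.
        String marked = SENTENCE_END.matcher(paragraph).replaceAll("$1\n");
        for (String piece : marked.split("\n")) {
            String sentence = piece.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Returns lines of the form: paragraphNo <TAB> sentenceNo <TAB> sentence (1-based).
    static List<String> tagSentences(List<String> paragraphs) {
        List<String> tagged = new ArrayList<>();
        for (int p = 0; p < paragraphs.size(); p++) {
            List<String> sentences = splitSentences(paragraphs.get(p));
            for (int s = 0; s < sentences.size(); s++) {
                tagged.add((p + 1) + "\t" + (s + 1) + "\t" + sentences.get(s));
            }
        }
        return tagged;
    }
}
```

Each tagged line can then be written to the output file with out.println, so the paragraph number travels with the sentence.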

Source

2015-11-08 10:38:27 Jan

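It works perfectly! Thank you very much! :) – sugz

Yes! :) I have one more question. What if the paragraphs are in an Excel sheet? Suppose one cell contains one paragraph. After splitting, the sentences can go into a text file or an Excel sheet (either way is fine). How can that be achieved? – sugz

Hi, I am sorry to bother you again. But when I give values like 3.2, it now splits them into different sentences. I did not have this problem before. – sugz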
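
One possible way to keep decimals such as 3.2 together is to accept a sentence end only when the punctuation mark is followed by whitespace or by the end of the line. This is a sketch under that assumption (text where sentences run together without a space would no longer be split), not a tested fix from the answer's author; the class name DecimalSafeSplitter and the method name are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical variant of the answer's regex, not part of the original answer.
class DecimalSafeSplitter {
    // Same negative lookbehind as before, but the mark must also be followed
    // by whitespace or the end of the input, so "3.2" is never a sentence end.
    private static final Pattern SENTENCE_END = Pattern.compile(
            "(?i)(?<!\\W\\w|\\W\\d{1,2}|\\W(?:dr|mr|mrs|ms))([!?.])(?:\\s+|$)");

    // Returns the text with a '\n' after each accepted sentence end;
    // the whitespace that followed the mark is dropped.
    static String oneSentencePerLine(String line) {
        return SENTENCE_END.matcher(line).replaceAll("$1\n");
    }
}
```

For example, oneSentencePerLine("The value 123.25 is fine. Version 3.2 works.") yields two lines, with both numbers left intact.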

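As the comment says, "this will be reasonably hard": there are no formal rules for breaking text into sentences. Have a look at BreakIterator, in particular the SentenceInstance. You could also roll your own BreakIterator, since it does the same job as breaking with a regular expression, only at a more abstract level. Or try to find a third-party solution such as http://deeplearning4j.org/sentenceiterator.html, which can be trained to tokenize your input.

Example with BreakIterator: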

import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorExample {
    public static void main(String[] args) {
        String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";

        BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
        bilus.setText(str);

        int last = bilus.first();
        int count = 0;

        // Walk the boundaries; each (first, last) pair delimits one sentence
        while (BreakIterator.DONE != last) {
            int first = last;
            last = bilus.next();

            if (BreakIterator.DONE != last) {
                String sentence = str.substring(first, last);
                System.out.println("Sentence:" + sentence);
                count++;
            }
        }
        System.out.println(count + " sentences found.");
    }
}
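
Because BreakIterator has no built-in list of abbreviations, it may still break after "Dr." or "P.B.". A lightweight alternative to writing a full custom BreakIterator is to post-process its output: keep appending fragments until the collected text no longer ends in an abbreviation. The sketch below assumes a tiny hard-coded rule (the titles dr, mr, mrs, ms, plus runs of single-letter initials); the class name MergingSentenceSplitter and that rule are illustrative only, and a sentence that really ends in such a token will be merged with the next one.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical post-processing step on top of BreakIterator.
class MergingSentenceSplitter {
    // Matches text whose last token is a listed title ("Dr.") or a run of
    // initials ("P.B."); such text is assumed not to end a sentence.
    private static final Pattern ENDS_WITH_ABBREVIATION = Pattern.compile(
            "(?i)(?:^|\\s)(?:(?:dr|mr|mrs|ms)\\.|(?:[a-z]\\.)+)\\s*$");

    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pending.append(text, start, end);
            // Only emit once the collected text does not end in an abbreviation.
            if (!ENDS_WITH_ABBREVIATION.matcher(pending).find()) {
                addIfNotBlank(sentences, pending.toString());
                pending.setLength(0);
            }
        }
        addIfNotBlank(sentences, pending.toString()); // leftover text, if any
        return sentences;
    }

    private static void addIfNotBlank(List<String> sentences, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            sentences.add(trimmed);
        }
    }
}
```

With this rule, the example sentence from the question comes back as a single sentence wherever BreakIterator places its boundaries around "Dr." and "P.B.".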

Source

2015-11-08 10:39:28 Willmore
