2017-10-15 7 views
-1

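I have a huge text file. I want to remove all line breaks, and I also want the paragraph breaks removed so that each paragraph is appended to the previous one. How should I do this in Java? I used replaceAll(), but I am stuck on appending a paragraph to the previous one. How can I remove all line breaks and paragraph breaks from a given text file using Java?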

Please view this image for the file screenshot

// `word` (Text) and `one` (IntWritable) are fields of the enclosing Mapper class.
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    // Strip tabs and newlines once and reuse the result.
    String cleaned = value.toString().replaceAll("[\\t\\n]+", "");
    System.out.println(cleaned);

    // Copy the whitespace-separated tokens into an array.
    StringTokenizer itr = new StringTokenizer(cleaned);
    String[] tokens = new String[itr.countTokens()];
    for (int l = 0; l < tokens.length; l++) {
        tokens[l] = itr.nextToken();
    }

    // Emit each token followed by up to four of the tokens after it.
    for (int i = 0; i < tokens.length; i++) {
        sb.append(tokens[i]);
        // The bounds check prevents reading past the end of the array.
        for (int j = i + 1; j < i + 5 && j < tokens.length; j++) {
            sb.append(" ").append(tokens[j]);
        }
        word.set(sb.toString());
        context.write(word, one);
        sb.setLength(0);
    }
}

Input:

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by 
William Shakespeare 
sn 
This eBook is for the use of anyone anywhere at no cost and with 
almost no restrictions whatsoever. You may copy it, give it away or 
re-use it under the terms of the Project Gutenberg License included 
with this eBook or online at www.gutenberg.org 

** This is a COPYRIGHTED Project Gutenberg eBook, Details Below ** 
**  Please follow the copyright guidelines in this file.  ** 

Title: The Complete Works of William Shakespeare 

Author: William Shakespeare 

Posting Date: September 1, 2011 [EBook #100] 
Release Date: January, 1994 

Language: English 


*** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE *** 




Produced by World Library, Inc., from their Library of the Future 

This is the 100th Etext file presented by Project Gutenberg, and 
is presented in cooperation with World Library, Inc., from their 
Library of the Future and Shakespeare CDROMS. Project Gutenberg 
often releases Etexts that are NOT placed in the Public Domain!! 

Shakespeare 

*This Etext has certain copyright implications you should read!* 

Expected output:

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by 
William Shakespeare sn This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included 
with this eBook or online at www.gutenberg.org ** This is a COPYRIGHTED Project Gutenberg eBook, Details Below Please follow the copyright guidelines in this file.Title: The Complete Works of William Shakespeare Author: William Shakespeare Posting Date: September 1, 2011 [EBook #100] 
Release Date: January, 1994 Language: English START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE Produced by World Library, Inc., from their Library of the Future This is the 100th Etext file presented by Project Gutenberg, and is presented in cooperation with World Library, Inc., from their Library of the Future and Shakespeare CDROMS. Project Gutenberg often releases Etexts that are NOT placed in the Public Domain!! Shakespeare *This Etext has certain copyright implications you should read!* 
+2

Please post an example. Also, don't post text/code as images/links ([more info](https://meta.stackoverflow.com/a/285557)). Use the [edit] option to correct your post. – Pshemo

+0

@Pshemo I need all line breaks and punctuation removed, and each paragraph appended to the previous one, so that the result is a single paragraph. –

+0

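This is still not clear. You say "all line breaks", but that would give a single line, which is not the case here, since your expected output has four lines. How do you decide which line separators should stay? You also wrote that all punctuation should be removed, but we can see `,` before `by William Shakespeare`, not to mention `Release Date: January, 1994`. – Pshemo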

Answers

0

If you only want the words, you can search for them with \w and join them together.

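import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String input = "hello, how are you today how was school today, what did you have for food? this star needs to be removed ****";
        final String regex = "\\w+"; // one or more word characters
        final Matcher m = Pattern.compile(regex).matcher(input);

        // Append every match followed by a single space.
        final StringBuilder output = new StringBuilder();
        while (m.find()) {
            output.append(m.group()).append(' ');
        }
        System.out.println(output);
    }
}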

Result:

hello how are you today how was school today what did you have for food this star needs to be removed 
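
The same idea carries over to multi-line text, because `\w` never matches a line break: joining the matches with a single space collapses line breaks and paragraph breaks as well. Below is a minimal sketch under that assumption. The class name `JoinWords`, the method `joinWords`, and the sample string are illustrative and do not come from the question. Note that this approach also drops all punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JoinWords {
    // One or more word characters (letters, digits, underscore).
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns the words of `text` joined by single spaces.
    // Line breaks, paragraph breaks and punctuation are all dropped.
    static String joinWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Hypothetical sample: two short paragraphs separated by a blank line.
        String sample = "Title: The Complete Works\n\nAuthor: William Shakespeare\n";
        System.out.println(joinWords(sample));
    }
}
```

Collecting the matches in a list and using `String.join` avoids the trailing space that appending `" "` after every match leaves behind.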
0

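Use the string-literal escapes (\t, \n) so that the pattern contains real tabs and line breaks. Don't forget the carriage return (\r, on Windows).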

String text = value.toString() 
    .replaceAll("(\r?\n){2}", "§") // Two line breaks will become a real line break. 
    .replaceAll("[\t\r\n]+", " ") // White space will become a real space. 
    .replace("§", "\n"); // The real line breaks. 

Instead of § one might use some esoteric character such as \uFEFF.

This would turn the input

Good Morning,

How are you?
I am fine.

into the expected output

Good Morning, 
How are you? I am fine. 
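
A similar result can be sketched without a placeholder character by splitting on blank lines first and then collapsing the whitespace inside each paragraph. This is an alternative to the snippet above, not the answer's own code; the names `ParagraphJoiner` and `joinParagraphLines` are hypothetical. Unlike the snippet above, it also collapses runs of ordinary spaces and treats any run of two or more line breaks as a single paragraph break.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphJoiner {
    // Joins the lines of each paragraph with a single space.
    // Paragraphs (separated by one or more blank lines) end up on separate lines.
    static String joinParagraphLines(String text) {
        List<String> paragraphs = new ArrayList<>();
        // Two or more consecutive line breaks (\n or \r\n) mark a paragraph break.
        for (String paragraph : text.split("(?:\\r?\\n){2,}")) {
            // Collapse every remaining run of whitespace, including single line breaks.
            String oneLine = paragraph.replaceAll("\\s+", " ").trim();
            if (!oneLine.isEmpty()) {
                paragraphs.add(oneLine);
            }
        }
        return String.join("\n", paragraphs);
    }

    public static void main(String[] args) {
        String input = "Good Morning,\n\nHow are you?\nI am fine.";
        System.out.println(joinParagraphLines(input));
    }
}
```

If everything should end up in one single paragraph, as the asker's comment suggests, `text.replaceAll("\\s+", " ").trim()` on its own is enough.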