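I have a huge text file. I want to remove all line breaks, and I also want paragraph breaks removed so that each paragraph is appended to the previous paragraph. How should I do this in Java? I used replaceAll() in Java, but I am stuck on appending a paragraph to the previous one. How can I remove all line breaks and paragraph breaks from a given text file using Java?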
Please view this image for the file screenshot
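public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    // Strip tabs and newlines from the incoming value, then split it on whitespace.
    String cleaned = value.toString().replaceAll("[\\t\\n]+", "");
    System.out.println(cleaned);
    StringTokenizer itr = new StringTokenizer(cleaned);
    // The array is twice the token count, so its second half stays null.
    String[] tokens = new String[itr.countTokens() * 2];
    for (int l = 0; l < tokens.length; l++) {
        if (itr.hasMoreTokens()) {
            tokens[l] = itr.nextToken();
        }
    }
    for (int i = 0; i < tokens.length; i++) {
        // Compare string contents with equals(); != only compares references.
        if (tokens[i] != null && !" ".equals(tokens[i])) {
            sb.append(tokens[i]);
            // Append up to four following tokens without running past the array.
            for (int j = i + 1; j < i + 5 && j < tokens.length; j++) {
                if (tokens[j] != null) {
                    sb.append(" ");
                    sb.append(tokens[j]);
                }
            }
        }
        // word and one are presumably fields of the enclosing Mapper class (not shown).
        word.set(sb.toString());
        context.write(word, one);
        //System.out.println(sb.toString());
        sb.setLength(0);
    }
}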
Input:
The Project Gutenberg EBook of The Complete Works of William Shakespeare, by
William Shakespeare
sn
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
** Please follow the copyright guidelines in this file. **
Title: The Complete Works of William Shakespeare
Author: William Shakespeare
Posting Date: September 1, 2011 [EBook #100]
Release Date: January, 1994
Language: English
*** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***
Produced by World Library, Inc., from their Library of the Future
This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS. Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!
Shakespeare
*This Etext has certain copyright implications you should read!*
Expected output:
The Project Gutenberg EBook of The Complete Works of William Shakespeare, by
William Shakespeare sn This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org ** This is a COPYRIGHTED Project Gutenberg eBook, Details Below Please follow the copyright guidelines in this file.Title: The Complete Works of William Shakespeare Author: William Shakespeare Posting Date: September 1, 2011 [EBook #100]
Release Date: January, 1994 Language: English START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE Produced by World Library, Inc., from their Library of the Future This is the 100th Etext file presented by Project Gutenberg, and is presented in cooperation with World Library, Inc., from their Library of the Future and Shakespeare CDROMS. Project Gutenberg often releases Etexts that are NOT placed in the Public Domain!! Shakespeare *This Etext has certain copyright implications you should read!*
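Please post an example. Also, don't post text/code as images/links ([more info](https://meta.stackoverflow.com/a/285557)). Use the [edit] option to correct your post. – Pshemo
@Pshemo I need all line breaks removed, punctuation removed, and each paragraph appended to the preceding paragraph, so that the result is one single paragraph. –
This is still not very clear. You claim "all line breaks", but that would mean we get a single line, which is not the case here because your expected output has four lines. How do you recognize which line separators should stay? You also wrote that all punctuation should be removed, but we can see ',' before 'by William Shakespeare', not to mention 'Release Date: January, 1994'. – Pshemo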
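One possible reading of the request is "join every line and paragraph into a single line, separated by single spaces". The following is a minimal sketch under that assumption, not a definitive answer; it leaves punctuation untouched, and the class name LineJoiner and method joinLines are hypothetical names chosen for illustration. Note that replacing a line break with the empty string, as the replaceAll call in the question does, fuses the last word of one line with the first word of the next, so the sketch inserts a space instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LineJoiner {

    /**
     * Reads the source line by line and joins all non-blank lines into one
     * line, with a single space between them. Blank lines (paragraph breaks)
     * are skipped, so each paragraph is appended to the previous one.
     */
    static String joinLines(Reader source) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() strips the terminator (\n, \r, or \r\n) for us.
            while ((line = reader.readLine()) != null) {
                // Collapse tabs and repeated spaces inside the line as well.
                String cleaned = line.trim().replaceAll("\\s+", " ");
                if (cleaned.isEmpty()) {
                    continue; // paragraph break: nothing to append
                }
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(cleaned);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String sample = "Title: The Complete Works\r\nof William Shakespeare\n\n"
                + "\tAuthor: William Shakespeare\n";
        // prints: Title: The Complete Works of William Shakespeare Author: William Shakespeare
        System.out.println(joinLines(new StringReader(sample)));
    }
}
```

For a real file, the reader could come from Files.newBufferedReader(Paths.get(...)); for a truly huge file it may be better to write each cleaned line straight to an output stream rather than collect everything in one StringBuilder. Also, if the mapper in the question runs with Hadoop's default TextInputFormat, each map() call receives a single line with its terminator already removed, so a pattern looking for \n inside value would find nothing to replace; joining across lines would then have to happen outside that per-line call.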