Keeping track of punctuation and spacing when editing a file in Java

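I'm writing a program that removes consecutive duplicate words from a text file and then replaces that file with the de-duplicated version. I know my current code doesn't handle the case where a duplicated word sits at the end of one line and at the start of the next, because I read each line into an ArrayList, find the duplicates, and remove them. Having written it, I'm not sure this is a "good" approach, because now I don't know how to write the result back out. I'm not sure how to keep track of the punctuation at the start and end of sentences, the correct spacing, and where the line breaks were in the original text file. Is there a way to handle these things (spacing, punctuation, etc.) with what I have so far, or do I need to redesign? Another thing I thought I could do is return an array of the indices of the words I need to remove, but I'm not sure that is any better. Anyway, here is my code (thanks in advance!).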

/** Removes consecutive duplicate words from text files.
It accepts only one argument, that argument being a text file
or a directory. It finds all text files in the directory and
its subdirectories and removes duplicate words from those files
as well. It replaces the original file. */

import java.io.*; 
import java.util.*; 

public class RemoveDuplicates { 

    public static void main(String[] args) { 


     if (args.length != 1) { 
      System.out.println("Program accepts one command-line argument. Exiting!"); 
      System.exit(1); 
     } 
     File f = new File(args[0]); 
     if (!f.exists()) { 
      System.out.println("Does not exist!"); 
     } 

     else if (f.isDirectory()) { 
      System.out.println("is directory"); 

     } 
     else if (f.isFile()) { 
      System.out.println("is file"); 
      String fileName = f.toString(); 
      RemoveDuplicates dup = new RemoveDuplicates(f); 
      dup.showTextFile(); 
      List<String> noDuplicates = dup.doDeleteDuplicates(); 
      showTextFile(noDuplicates); 
      //writeOutputFile(fileName, noDuplicates); 
     } 
     else { 
      System.out.println("Shouldn't happen"); 
     } 
    } 

    /** Reads in each line of the passed in .txt file into the lineOfWords array. */ 
    public RemoveDuplicates(File fin) { 
     lineOfWords = new ArrayList<String>(); 
     try { 
      BufferedReader in = new BufferedReader(new FileReader(fin)); 
      for (String s = null; (s = in.readLine()) != null;) { 
       lineOfWords.add(s); 
      } 
     } 
     catch (IOException e) { 
      e.printStackTrace(); 
     } 
    } 

    public void showTextFile() { 
     for (String s : lineOfWords) { 
      System.out.println(s); 
     } 
    } 

    public static void showTextFile(List<String> list) { 
     for (String s : list) { 
      System.out.print(s); 
     } 
    } 

    public List<String> doDeleteDuplicates() { 
     List<String> noDup = new ArrayList<String>(); // List to be returned without duplicates 
     // go through each line and split each word into end string array 
     for (String s : lineOfWords) { 
      String endString[] = s.split("[\\s+\\p{Punct}]"); 
      // add each word to the arraylist 
      for (String word : endString) { 
       noDup.add(word); 
      } 
     } 
     for (int i = 0; i < noDup.size() - 1; i++) { 
      if (noDup.get(i).toUpperCase().equals(noDup.get(i + 1).toUpperCase())) { 
       System.out.println("Removing: " + noDup.get(i+1)); 
       noDup.remove(i + 1); 
       i--; 
      } 
     } 
     return noDup; 
    } 

    public static void writeOutputFile(String fileName, List<String> newData) { 
     try { 
      PrintWriter outputFile = new PrintWriter(new BufferedWriter(new FileWriter(fileName))); 
      for (String str : newData) { 
       outputFile.print(str + " "); 
      } 
      outputFile.close(); 
     } 
     catch (IOException e) { 
      e.printStackTrace(); 
     } 
    } 

    private List<String> lineOfWords; 
}

The example.txt file:

Hello hello this is a test test in order 
order to see if it deletes duplicates Duplicates words.

Source

2010-08-03 Crystal

How about this? In this case, I'm assuming the comparison is case-insensitive.

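    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Group 1 is a run of word characters; " \\1" requires one space and then the same text again.
    Pattern p = Pattern.compile("(\\w+) \\1");
    String line = "Hello hello this is a test test in order\norder to see if it deletes duplicates Duplicates words.";

    // Match against an upper-cased copy so the comparison ignores case.
    // For this ASCII text the indices still line up with the original string.
    Matcher m = p.matcher(line.toUpperCase());

    StringBuilder sb = new StringBuilder(1000);
    int idx = 0;

    while (m.find()) {
        // Copy everything up to the end of the first occurrence, then skip the repeat.
        sb.append(line.substring(idx, m.end(1)));
        idx = m.end();
    }
    // Copy whatever follows the last match.
    sb.append(line.substring(idx));

    System.out.println(sb.toString());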

Here is the output:

Hello this a test in order 
order to see if it deletes duplicates words.
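
Two things are visible in the output above: the "is" after "this" was dropped (the pattern has no word boundaries, so the "is" inside "this" counts as the first occurrence), and the "order"/"order" pair split across the line break was kept (the pattern only accepts a single space between the two words). What follows is a minimal sketch of one possible variant, not the answerer's code. It assumes that any whitespace may separate the repeats, that the first occurrence is the one to keep, and that the file is UTF-8. The class name `DuplicateWordCleaner` and its methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
class DuplicateWordCleaner {

    // \b(\w+)\b      a whole word, captured as group 1
    // (?:\s+\1\b)+   one or more repeats of it, each preceded by whitespace
    //                (spaces, tabs or line breaks)
    private static final Pattern REPEATED_WORD = Pattern.compile(
            "\\b(\\w+)\\b(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    /** Keeps the first occurrence of each run of repeated words. */
    static String removeConsecutiveDuplicates(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    /** Reads the whole file as one string, removes the repeats and overwrites the file. */
    static void rewriteFile(Path file) throws IOException {
        // Assumption: the file is UTF-8 encoded.
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Files.write(file, removeConsecutiveDuplicates(text).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String text = "Hello hello this is a test test in order\n"
                + "order to see if it deletes duplicates Duplicates words.";
        System.out.println(removeConsecutiveDuplicates(text));
    }
}
```

Because the text is never split into words, all punctuation and spacing outside the matches is left untouched. The whitespace in front of a removed word is removed with it, so a duplicate that straddles a line break joins the two lines; if the line break has to survive, the replacement would need to keep that separator instead.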

Source

2010-08-03 16:46:44 limc

Could you explain your code a bit more, starting with the sb.append part? I'm not sure how it works. Thanks. – Crystal 2010-08-04 04:34:53

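The "1" in m.end(1) refers to the group in the regular expression (the part enclosed in parentheses). m.end(1) returns the end index of that matched group (the offset just after its last character), whereas m.end() returns the end index of the entire string matched by the supplied pattern ("(\\w+) \\1"). Basically, I skip whatever lies between m.end(1) and m.end(), because it is a repeat of the string between m.start(1) and m.end(1). I don't use m.start(1) here because I didn't see a need for it. Hope this helps. – limc 2010-08-04 14:10:40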
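
To make the indices in that explanation concrete, here is a small sketch that reports m.start(1), m.end(1) and m.end() for the first match of the same pattern. The class name `GroupIndexDemo`, the method `firstMatchIndices` and the sample string are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class GroupIndexDemo {

    /** Returns {m.start(1), m.end(1), m.end()} for the first match, or null if there is none. */
    static int[] firstMatchIndices(String text) {
        Matcher m = Pattern.compile("(\\w+) \\1").matcher(text);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(1), m.end(1), m.end() };
    }

    public static void main(String[] args) {
        String line = "a test test in order";
        int[] idx = firstMatchIndices(line);   // {2, 6, 11}

        // Group 1 is the first occurrence of the word.
        System.out.println("group 1: \"" + line.substring(idx[0], idx[1]) + "\"");      // "test"
        // The part between m.end(1) and m.end() is the space plus the repeat.
        System.out.println("skipped: \"" + line.substring(idx[1], idx[2]) + "\"");      // " test"
        // This is what sb.append(line.substring(idx, m.end(1))) copies when idx is 0.
        System.out.println("kept prefix: \"" + line.substring(0, idx[1]) + "\"");       // "a test"
    }
}
```

The end indices are exclusive, which is why they can be passed straight to substring and why setting idx to m.end() resumes copying right after the repeated word.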
