2017-02-17 40 views
1

Removing comments from large files using Java

I have .sh, .txt, .sql, .pkb and other files that are larger than 10 MB, which means more than 100,000 lines each.

I want to remove the comments from these files and then work with the uncommented content. I wrote the code below to do that.

/** 
* Removes all the commented part from the file content as well as returns a 
* file structure which have just lines with declaration syntax for eg. 
* Create Package packageName <- Stores all declaration lines as separate 
* string in an array 
* 
* @param file 
* @return file content 
* @throws IOException 
*/ 
private static String[] filterContent(File file) throws IOException { 

    String withoutComment = ""; 
    String declare = ""; 
    String[] content; 
    List<String> readLines = FileUtils.readLines(file); 

    int size = readLines.size(); 
    System.out.println(file.getName() + " Files number of lines "+ size + " at "+new Date()); 
    String[] declareLines = new String[size]; 
    int startComment = 0; 
    int endComment = 0; 
    Boolean check = false; 
    int j = 0; 
    int i=0; 
    // Reading content line by line 
    for (String line:readLines) { 
     // If line contains */ that means comment is ending in this line, 
     // making a note of the line number 
     if (line.toString().contains("*/")) { 
      endComment = i; 
      // Removing the content before */ from the line 
      int indexOf = line.indexOf("*/"); 
      line = line.replace(line.substring(0, indexOf + 2), ""); 
     } 

     // If startComment is assigned fresh value and end comment hasn't, 
     // that means the current line is part of the comment 
     // Ignoring the line in this case and moving on to the next one 
     if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check) 
      continue; 

     // If line contains /* that means comment is starting in this line, 
     // making a note of the line number 
     if (line.contains("/*")) { 
      startComment = i; 
      // Removing the content after /* from the line 
      int indexOf = line.indexOf("/*"); 
      line = line.replace(line.substring(indexOf), ""); 
      if (i == 0) 
       check = true; // means comment in the very first line 
     } 

     // If line contains -- that means single line comment is present in 
     // this line, 
     // removing the content after -- 
     if (line.contains("--")) { 
      int indexOf = line.indexOf("--"); 
      line = line.replace(line.substring(indexOf), ""); 
     } 
     // If line contains # that means a shell-style single line comment 
     // is present in this line, 
     // removing the content after # 
     if (line.contains("#")) { 
      int indexOf = line.indexOf("#"); 
      line = line.replace(line.substring(indexOf), ""); 
     } 

     // At this point, all commented part is removed from the line, hence 
     // appending it to the final content 
     if (!line.isEmpty()) 
      withoutComment = withoutComment + line + " \n"; 
     // If line contains CREATE its a declaration line, holding it 
     // separately in the array 
     if (line.toUpperCase().contains(("CREATE"))) { 
      // If next line does not contains Create and the current line is 
      // the not the last line, 
      // then considering two consecutive lines as declaration line, 
      if (i < size - 1 && !readLines.get(i + 1).toString().toUpperCase().contains(("CREATE"))) { 
       declare = line + " " + readLines.get(i + 1).toString() + "\n"; 
      } else if (i < size) {// If the line is last line, including 
            // that line alone. 
       declare = line + "\n"; 
      } 

      declareLines[j] = declare.toUpperCase(); 
      j++; 
     } 
     i++; 
    } 
    System.out.println("Read lines "+ new Date()); 
    List<String> list = new ArrayList<String>(Arrays.asList(declareLines)); 
    list.removeAll(Collections.singleton(null)); 

    content = list.toArray(new String[list.size() + 1]); 

    withoutComment = withoutComment.toUpperCase(); 
    content[j] = withoutComment; 
    System.out.println("Returning uncommented content "+ new Date()); 
    return content; 
} 


public static void main(String[] args) throws IOException { 
     String[] content = filterContent(new File("abc.txt")); 
} 

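The problem with this code is that it is far too slow when the file is large. For a 10 MB file, removing the comments takes more than 6 hours. (The code runs on a server accessed over SSH.)

I can have files of up to 100 MB, and for those it would take days to remove the comments. How can I remove the comments faster?

Update: this question is not a duplicate, because my problem is not solved just by changing the way the lines are read. It is the string operations that make the process slow, and I need a way to make the comment removal itself faster.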

+0

1. Don't load the whole file into memory. 2. Why do you want to do this? – Axel

+0

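First, don't put it into a List; read the file with an InputStream and analyse the lines directly. You can easily find out whether a line contains '/*' or '/* ... */', remove it, and write a new file without the comments. Reading a file of more than 100 MB should not take that long... – AxelH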

+0

[How to read a large text file line by line using Java?](http://stackoverflow.com/questions/5868369/how-to-read-a-large-text-file-line-by-line-using-java) – AxelH

Answers

0

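I found that the biggest problem in my code was the use of Strings. The method used to read the lines makes little difference, but using a StringBuilder instead of a String to store the uncommented lines changed the performance dramatically. With StringBuilder, the same code now takes a few seconds to remove comments that used to take hours.

Here is the code. For better performance, I also replaced the List with a BufferedReader.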

/** 
    * Removes all the commented part from the file content as well as returns a 
    * file structure which have just lines with declaration syntax for eg. 
    * Create Package packageName <- Stores all declaration lines as separate 
    * string in an array 
    * 
    * @param file 
    * @return file content 
    * @throws IOException 
    */ 
    private static List<String> filterContent(File file) throws IOException { 

     StringBuilder withoutComment = new StringBuilder(); 
//  String declare = ""; 
//  String[] content; 
//  List<String> readLines = FileUtils.readLines(file); 
// 
//  int size = readLines.size(); 
     System.out.println(file.getName() + " at " + new Date()); 
     List<String> declareLines = new ArrayList<String>(); 
     // String line = null; 
     int startComment = 0; 
     int endComment = 0; 
     Boolean check = false; 
     Boolean isLineDeclaration = false; 

     int j = 0; 
     int i = 0; 

     InputStream in = new FileInputStream(file); 
     BufferedReader reader = new BufferedReader(new InputStreamReader(in)); 
     String line; 
     // Reading content line by line 
     while ((line = reader.readLine()) != null) { 
      // for (int i = 0; i < size; i++) { 
      // line = readLines.get(i).toString();// storing current line data 
      // If line contains */ that means comment is ending in this line, 
      // making a note of the line number 
      if (line.toString().contains("*/")) { 
       endComment = i; 
       // Removing the content before */ from the line 
       int indexOf = line.indexOf("*/"); 
       line = line.replace(line.substring(0, indexOf + 2), ""); 
      } 

      // If startComment is assigned fresh value and end comment hasn't, 
      // that means the current line is part of the comment 
      // Ignoring the line in this case and moving on to the next one 
      if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check) 
       continue; 

      // If line contains /* that means comment is starting in this line, 
      // making a note of the line number 
      if (line.contains("/*")) { 
       startComment = i; 
       // Removing the content after /* from the line 
       int indexOf = line.indexOf("/*"); 
       line = line.replace(line.substring(indexOf), ""); 
       if (i == 0) 
        check = true; // means comment in the very first line 
      } 

      // If line contains -- that means single line comment is present in 
      // this line, 
      // removing the content after -- 
      if (line.contains("--")) { 
       int indexOf = line.indexOf("--"); 
       line = line.replace(line.substring(indexOf), ""); 
      } 
      // If line contains # that means a shell-style single line comment 
      // is present in this line, 
      // removing the content after # 
      if (line.contains("#")) { 
       int indexOf = line.indexOf("#"); 
       line = line.replace(line.substring(indexOf), ""); 
      } 

      // At this point, all commented part is removed from the line, hence 
      // appending it to the final content 
      if (!line.isEmpty()) 
       withoutComment.append(line).append(" \n"); 
      // If line contains CREATE its a declaration line, holding it 
      // separately in the array 
      if (line.toUpperCase().contains(("CREATE"))) { 
       // Storing the declaration line; if the following line is not 
       // itself a declaration, it is appended to this entry in the 
       // else branch below 
       declareLines.add(line.toUpperCase()); 

       isLineDeclaration = true; 
       j++; 
      } else if (isLineDeclaration && !line.toUpperCase().contains(("CREATE"))) { 
       // The previous line was a declaration and the current one is 
       // not, so the two consecutive lines are treated as a single 
       // declaration line 
       declareLines.set(j - 1, declareLines.get(j - 1) + " " + line.toUpperCase()); 
       isLineDeclaration = false; 
      } 
      i++; 
     } 

     reader.close(); 
     System.out.println("Read lines " + new Date()); 
//  List<String> list = new ArrayList<String>(Arrays.asList(declareLines)); 
     declareLines.removeAll(Collections.singleton(null)); 

//  content = list.toArray(new String[list.size() + 1]); 

//  withoutComment = withoutComment..toUpperCase(); 
     declareLines.add(withoutComment.toString().toUpperCase()); 
     System.out.println("Returning uncommented content " + new Date()); 
     return declareLines; 
    } 
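
To illustrate why switching to StringBuilder matters so much, here is a minimal sketch. The class and method names are hypothetical and the sample lines are made up; only the two accumulation styles are taken from the code above. Concatenating onto a `String` in a loop copies everything accumulated so far on each iteration, so the total work grows roughly quadratically with the number of lines, while `StringBuilder.append` writes into one growable buffer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original answer.
class ConcatDemo {

    // Mirrors the original approach: every concatenation allocates a new
    // String and copies all the text accumulated so far.
    static String joinWithString(List<String> lines) {
        String result = "";
        for (String line : lines) {
            result = result + line + " \n";
        }
        return result;
    }

    // Mirrors the fixed approach: appends into a single growable buffer.
    static String joinWithBuilder(List<String> lines) {
        StringBuilder result = new StringBuilder();
        for (String line : lines) {
            result.append(line).append(" \n");
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Made-up input, only to have something to time.
        List<String> lines = new ArrayList<>();
        for (int n = 0; n < 10_000; n++) {
            lines.add("SELECT " + n + " FROM dual;");
        }
        long t0 = System.nanoTime();
        String a = joinWithString(lines);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(lines);
        long t2 = System.nanoTime();
        System.out.println("String concatenation: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder:        " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("Same output:          " + a.equals(b));
    }
}
```

The exact timings depend on the machine and JVM, but the gap widens quickly as the number of lines grows, which is consistent with the hours-versus-seconds difference reported above.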
0

You could create multiple threads to do the work (this requires splitting your lines correctly).

+0

The file may even have 500,000 lines. Wouldn't creating hundreds of threads overload the thread stack? –
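
One common way to avoid creating hundreds of threads is a fixed-size thread pool, where the number of worker threads is bounded no matter how many tasks are queued. The sketch below assumes the work is split per file (or per some other unit that never cuts through a comment block); `PooledProcessing`, `process` and `processAll` are hypothetical names, and `process` is only a stand-in for the real comment removal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not code from this thread.
class PooledProcessing {

    // Stand-in for the real per-file work: cuts the text at the first "--".
    static String process(String text) {
        int idx = text.indexOf("--");
        return idx >= 0 ? text.substring(0, idx) : text;
    }

    // Runs process() on every input using a bounded number of threads and
    // returns the results in the same order as the inputs.
    static List<String> processAll(List<String> inputs) {
        int workers = Math.max(1, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String input : inputs) {
                tasks.add(() -> process(input));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and keeps task order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape, 500,000 lines or hundreds of files still run on only as many threads as there are processors; the remaining tasks wait in the pool's queue.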

0

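Some ideas to make this code faster:

Read the file with an InputStream and analyse the lines directly, writing each resulting String to the uncommented file. This avoids going through the content more than once (once to build the List<String> readLines, and once more for your iteration).

Design-wise, you could use a map of comment syntaxes instead of this redundant code.

Once that is done, it should be faster. Of course, multithreading may be a solution, but it requires some checks to make sure you do not split the file in the middle of a comment block. So improve the code first, and then you can think about that.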
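
The two ideas above (streaming the file and describing the comment syntax in a table) could be combined roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the names `CommentStripper`, `strip` and `filter` are hypothetical, the marker set (`--`, `#`, `/* ... */`) is taken from the question's code, and markers that appear inside string literals are not treated specially.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and marker choices are assumptions.
class CommentStripper {

    // Markers that comment out the rest of the line.
    private static final List<String> LINE_MARKERS = Arrays.asList("--", "#");

    // Opening marker -> closing marker for block comments.
    private static final Map<String, String> BLOCK_MARKERS = new LinkedHashMap<>();
    static {
        BLOCK_MARKERS.put("/*", "*/");
    }

    // Closing marker being waited for, or null when outside a block comment.
    private String pendingClose = null;

    // Returns the line without comments. Open block comments are remembered
    // between calls, so lines must be passed in file order.
    String strip(String line) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < line.length()) {
            if (pendingClose != null) {
                int end = line.indexOf(pendingClose, pos);
                if (end < 0) {
                    break; // the block comment continues on the next line
                }
                pos = end + pendingClose.length();
                pendingClose = null;
                continue;
            }
            // Find the earliest comment marker at or after pos.
            int best = -1;
            int openLength = 0;
            String close = null; // stays null for a line comment
            for (String marker : LINE_MARKERS) {
                int idx = line.indexOf(marker, pos);
                if (idx >= 0 && (best < 0 || idx < best)) {
                    best = idx;
                    close = null;
                }
            }
            for (Map.Entry<String, String> e : BLOCK_MARKERS.entrySet()) {
                int idx = line.indexOf(e.getKey(), pos);
                if (idx >= 0 && (best < 0 || idx < best)) {
                    best = idx;
                    openLength = e.getKey().length();
                    close = e.getValue();
                }
            }
            if (best < 0) {
                out.append(line, pos, line.length()); // no more comments
                break;
            }
            out.append(line, pos, best);
            if (close == null) {
                break; // line comment: drop the rest of the line
            }
            pendingClose = close;
            pos = best + openLength;
        }
        return out.toString();
    }

    // Streams from in to out one line at a time, so the whole file is never
    // held in memory. Lines that end up blank are skipped.
    static void filter(Reader in, Writer out) throws IOException {
        CommentStripper stripper = new CommentStripper();
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            String cleaned = stripper.strip(line);
            if (!cleaned.trim().isEmpty()) {
                writer.write(cleaned);
                writer.newLine();
            }
        }
        writer.flush();
    }
}
```

For real files, `filter` could be called with `Files.newBufferedReader(...)` and `Files.newBufferedWriter(...)` inside a try-with-resources block. Because `#` is a comment marker in shell scripts but not necessarily in SQL, the marker table would probably need to be chosen per file type.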