2016-10-24

Optimize CSV parsing to be faster

I'm working on a "program" that reads data from 2 large CSV files (line by line), compares array elements from the two files, and writes the data I need into a third file whenever a match is found. The only problem is that it is very slow: it reads 1-2 lines per second, which is extremely slow considering I have millions of records. Any ideas on how to make it faster? Here is my code:

 import java.io.FileInputStream; 
 import java.io.FileWriter; 
 import java.io.IOException; 
 import java.util.Scanner; 

 public class ReadWriteCsv { 

public static void main(String[] args) throws IOException { 

    FileInputStream inputStream = null; 
    FileInputStream inputStream2 = null; 
    Scanner sc = null; 
    Scanner sc2 = null; 
    String csvSeparator = ","; 
    String line; 
    String line2; 
    String path = "D:/test1.csv"; 
    String path2 = "D:/test2.csv"; 
    String path3 = "D:/newResults.csv"; 
    String[] columns; 
    String[] columns2; 
    boolean matchFound = false; 
    int count = 0; 
    StringBuilder builder = new StringBuilder(); 

    FileWriter writer = new FileWriter(path3); 

    try { 
     // open the first input file (the second is reopened inside the loop) 
     inputStream = new FileInputStream(path); 

     // creating scanners for files 
     sc = new Scanner(inputStream, "UTF-8"); 

     // while there is another line available do: 
     while (sc.hasNextLine()) { 
      count++; 
      // storing the current line in the temporary variable "line" 
      line = sc.nextLine(); 
      System.out.println("Number of lines read so far: " + count); 
      // defines the columns[] as the line being split by "," 
      columns = line.split(","); 
      inputStream2 = new FileInputStream(path2); 
      sc2 = new Scanner(inputStream2, "UTF-8"); 

      // checks if there is a line available in File2 and goes in the 
      // while loop, reading file2 
      while (!matchFound && sc2.hasNextLine()) { 
       line2 = sc2.nextLine(); 
       columns2 = line2.split(","); 

       if (columns[3].equals(columns2[1])) { 
        matchFound = true; 
        builder.append(columns[3]).append(csvSeparator); 
        builder.append(columns[1]).append(csvSeparator); 
        builder.append(columns2[2]).append(csvSeparator); 
        builder.append(columns2[3]).append("\n"); 
        String result = builder.toString(); 
        writer.write(result); 
       } 

      } 
      builder.setLength(0); 
      sc2.close(); 
      matchFound = false; 
     } 

     if (sc.ioException() != null) { 
      throw sc.ioException(); 

     } 

    } finally { 
     // close the scanner, input stream and writer 
     if (sc != null) { 
      sc.close(); 
     } 
     if (inputStream != null) { 
      inputStream.close(); 
     } 
     writer.close(); 
    } 
 } 
} 

It looks like you are re-reading the entire second file for every single line of the first one. *Of course* that will be slow for large files. – azurefrog


Can you fit both files into memory? If so, just read them once and load the data into an in-memory data structure (array, list, etc.). IO operations are very expensive compared to in-memory operations. – Yuri


@azurefrog How would I do that? New to programming, sorry. – Noobinator

Answers


Use an existing CSV library instead of rolling your own. It will be far more robust than what you have now.

However, your problem is not CSV parsing speed. Your algorithm is O(n^2): for every line in the first file, you scan the entire second file. That kind of algorithm blows up very quickly as the amount of data grows, and with millions of rows you will run into trouble. You need a better algorithm.

The other problem is that you re-parse the second file on every scan. At the very least you should read it into an ArrayList or similar structure at the start of the program, so you only load and parse it once.
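A minimal sketch of that idea: load the second file once into a HashMap keyed on the column being matched (column 1, as in the question's code), then make a single pass over the first file with O(1) lookups. The class and method names (`CsvJoin`, `loadLookup`, `join`) are illustrative, and it keeps the question's naive `split(",")`, so it assumes no quoted fields containing commas.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvJoin {

    // Read the second file once and index its rows by column 1.
    // Assumes plain comma-separated lines with no quoted fields,
    // the same assumption the original split(",") code makes.
    static Map<String, String[]> loadLookup(File file) throws IOException {
        Map<String, String[]> lookup = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] columns = line.split(",");
                lookup.put(columns[1], columns);
            }
        }
        return lookup;
    }

    // Single pass over the first file: each lookup is O(1) on average,
    // so the whole join is O(n + m) instead of O(n * m).
    static void join(File first, File second, File out) throws IOException {
        Map<String, String[]> lookup = loadLookup(second);
        try (BufferedReader reader = new BufferedReader(new FileReader(first));
             BufferedWriter writer = new BufferedWriter(new FileWriter(out))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] columns = line.split(",");
                String[] match = lookup.get(columns[3]);
                if (match != null) {
                    writer.write(columns[3] + "," + columns[1] + ","
                            + match[2] + "," + match[3]);
                    writer.newLine();
                }
            }
        }
    }
}
```

This trades memory for time: the second file must fit in the heap, which Yuri's comment already assumes. If it does not, sorting both files on the join key and doing a merge pass is the usual alternative.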


Use the univocity-parsers CSV parser; it shouldn't take more than a few seconds to process two files with 1 million rows each:

public void diff(File leftInput, File rightInput) { 
    CsvParserSettings settings = new CsvParserSettings(); //many config options here, check the tutorial 

    CsvParser leftParser = new CsvParser(settings); 
    CsvParser rightParser = new CsvParser(settings); 

    leftParser.beginParsing(leftInput); 
    rightParser.beginParsing(rightInput); 

    String[] left; 
    String[] right; 

    int row = 0; 
    while ((left = leftParser.parseNext()) != null && (right = rightParser.parseNext()) != null) { 
     row++; 
     if (!Arrays.equals(left, right)) { 
      System.out.println(row + ":\t" + Arrays.toString(left) + " != " + Arrays.toString(right)); 
     } 
    } 

    leftParser.stopParsing(); 
    rightParser.stopParsing(); 
} 

Disclosure: I am the author of this library. It is open source and free (Apache 2.0 license).