2016-07-26 78 views
2

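Splitting a .json file into multiple files on a Mac

I'm on a Mac and have a very large .json file containing more than 100k objects.

I'd like to split it into many smaller files (ideally 50-100).

Source file

The original .json file is a single array of objects and looks something like this: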

[{ 
    "id": 1, 
    "item_a": "this1", 
    "item_b": "that1" 
}, { 
    "id": 2, 
    "item_a": "this2", 
    "item_b": "that2" 
}, { 
    "id": 3, 
    "item_a": "this3", 
    "item_b": "that3" 
}, { 
    "id": 4, 
    "item_a": "this4", 
    "item_b": "that4" 
}, { 
    "id": 5, 
    "item_a": "this5", 
    "item_b": "that5" 
}] 

Desired output

If this were split into three files, I'd like the output to look like this:

File 1:

[{ 
    "id": 1, 
    "item_a": "this1", 
    "item_b": "that1" 
}, { 
    "id": 2, 
    "item_a": "this2", 
    "item_b": "that2" 
}] 

File 2:

[{ 
    "id": 3, 
    "item_a": "this3", 
    "item_b": "that3" 
}, { 
    "id": 4, 
    "item_a": "this4", 
    "item_b": "that4" 
}] 

File 3:

[{ 
    "id": 5, 
    "item_a": "this5", 
    "item_b": "that5" 
}] 

Any ideas would be greatly appreciated. Thanks!

Answers

3

Perl to the rescue:

#!/usr/bin/perl 
use warnings; 
use strict; 

use JSON; 

my $file_count = 5; # You probably want 50 - 100 here. 

my $json_text = do { 
    local $/; 
    open my $IN, '<', '1.json' or die $!; 
    <$IN> 
}; 
my $arr = decode_json($json_text); 
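my $size = int(@$arr / $file_count); # base number of objects per file
my $rest = @$arr % $file_count;      # this many files get one extra object

my $i = 1;
while (@$arr) {
    open my $OUT, '>', "file$i.json" or die $!;
    # Only the last $rest files take one extra object, so sizes differ by at most one.
    my $take = $size + ($i++ > $file_count - $rest ? 1 : 0);
    my @chunk = splice @$arr, 0, $take;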
    print {$OUT} encode_json(\@chunk); 
    close $OUT or die $!; 
} 
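For readers who prefer Python, the arithmetic behind this answer (a base chunk size plus a remainder spread over some of the files) can be sketched as follows. This is only an illustration under stated assumptions, not part of the original answer: the function names `split_evenly` and `write_chunks` and the `file` prefix are made up for the example, and the whole array is assumed to fit in memory.

```python
import json


def split_evenly(items, file_count):
    """Split items into file_count chunks whose sizes differ by at most one."""
    size, rest = divmod(len(items), file_count)
    chunks = []
    start = 0
    for i in range(file_count):
        # The last `rest` chunks receive one extra item each.
        length = size + (1 if i >= file_count - rest else 0)
        chunks.append(items[start:start + length])
        start += length
    return chunks


def write_chunks(chunks, prefix="file"):
    """Write each chunk as its own JSON array: file1.json, file2.json, ..."""
    for i, chunk in enumerate(chunks, start=1):
        with open(f"{prefix}{i}.json", "w") as out:
            json.dump(chunk, out)
```

To use it, load the source array with `json.load`, pass it to `split_evenly` together with the number of files you want (50 to 100 in the question), and hand the result to `write_chunks`.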
3

@choroba's answer is very efficient and flexible. Here is a bash solution using jq:

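#!/bin/bash
# Split the array in data.json into File0.json, File1.json, ... with two objects each.
i=0
file=0
# jq -c emits one compact object per line; read -r keeps spaces and backslashes intact.
while IFS= read -r f
do
    if [ "$i" -eq 2 ]; then
        # Two objects are buffered: slurp them into one array.
        jq --slurp '.' /tmp/0.json /tmp/1.json > "File$file.json"
        rm /tmp/0.json /tmp/1.json # cleanup
        ((file = file + 1))
        i=0
    fi
    printf '%s\n' "$f" > "/tmp/$i.json"
    ((i = i + 1))
done < <(jq -c -M '.[]' data.json)
# Flush the one or two objects still buffered after the loop.
if [ "$i" -eq 1 ]; then
    jq --slurp '.' /tmp/0.json > "File$file.json"
    rm /tmp/0.json # cleanup
elif [ "$i" -eq 2 ]; then
    jq --slurp '.' /tmp/0.json /tmp/1.json > "File$file.json"
    rm /tmp/0.json /tmp/1.json # cleanup
fi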
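The script above fixes the number of objects per file (two) rather than the number of files. A minimal Python sketch of that same idea is shown below. The names `chunk_by_size` and `split_file`, the `File` prefix and the `per_file` parameter are illustrative assumptions, not something from the answer.

```python
import json


def chunk_by_size(items, per_file):
    """Yield consecutive slices holding at most per_file items each."""
    for start in range(0, len(items), per_file):
        yield items[start:start + per_file]


def split_file(src_path, per_file, prefix="File"):
    """Write File0.json, File1.json, ... and return the paths written."""
    with open(src_path) as src:
        items = json.load(src)
    paths = []
    for n, chunk in enumerate(chunk_by_size(items, per_file)):
        path = f"{prefix}{n}.json"
        with open(path, "w") as out:
            json.dump(chunk, out, indent=4)
        paths.append(path)
    return paths
```

With roughly 100k objects, a `per_file` of 1000 to 2000 would land in the 50 to 100 file range asked for in the question.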
1
$ cat tst.awk 
/{/ && (++numOpens % 2) { 
    if (++numOuts > 1) { 
     # print "}]" > out 
     print out, "}]" 
     close(out) 
    } 
    out = "out" numOuts 
    $0 = "[{" 
} 
{ 
    # print > out 
    print out, $0 
} 

$ awk -f tst.awk file 
out1 [{ 
out1  "id": 1, 
out1  "item_a": "this1", 
out1  "item_b": "that1" 
out1 }, { 
out1  "id": 2, 
out1  "item_a": "this2", 
out1  "item_b": "that2" 
out1 }] 
out2 [{ 
out2  "id": 3, 
out2  "item_a": "this3", 
out2  "item_b": "that3" 
out2 }, { 
out2  "id": 4, 
out2  "item_a": "this4", 
out2  "item_b": "that4" 
out2 }] 
out3 [{ 
out3  "id": 5, 
out3  "item_a": "this5", 
out3  "item_b": "that5" 
out3 }] 

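Once you are happy with the test output, remove the "print out, $0" line and uncomment "# print > out". The "print out, "}]"" inside the if needs the same change to "print "}]" > out", otherwise the closing "}]" of each file goes to the terminal instead of the file.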

+0

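Thanks, Ed. I think this is very close. It prints correctly in my terminal when testing, but when I remove 'print out, $0' and uncomment '# print $0 > out', the ends of out1 and out2 are printed in the terminal and not included in the files. The '}]' is cut off and only shows up in the terminal. Any idea how to fix this? Thanks! – Brandon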

+0

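You must have made a copy/paste mistake or uncommented the wrong line. The script I posted does not do what you describe. If you edit your question to show the script you are running, we can help you debug it. –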

+0

This will fail if any key or value contains a "{" character. –
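To illustrate the concern in the last comment: a line-based approach counts every "{" it sees, whereas a real JSON parser only treats braces as structure when they appear outside strings. The sketch below is a hypothetical example (the `SAMPLE` data and the `split_text` helper are invented for illustration) showing that parsing first keeps values containing braces intact.

```python
import json

# Sample data invented for this example: two values contain brace characters.
SAMPLE = """[
    {"id": 1, "item_a": "this { has a brace", "item_b": "that1"},
    {"id": 2, "item_a": "this2", "item_b": "that2"},
    {"id": 3, "item_a": "}, {", "item_b": "that3"}
]"""


def split_text(json_text, per_file):
    """Parse the array, then re-serialise it in slices of per_file objects."""
    items = json.loads(json_text)
    return [
        json.dumps(items[start:start + per_file])
        for start in range(0, len(items), per_file)
    ]


parts = split_text(SAMPLE, 2)
```

Each element of `parts` is a complete JSON array in its own right, so it can be written to a file unchanged, regardless of what characters the values contain.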