2017-10-21 34 views
0

Regex with multiple pipes on a JSON file

I have the following command to fetch a JSON document on UNIX:

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json 

which (obviously with different results each time) gives me output in the following format:

{ 
"kind": "...", 
"data": { 
"modhash": "", 
"whitelist_status": "...", 
"children": [ 
e1, 
e2, 
e3, 
... 
], 
"after": "...", 
"before": "..." 
} 
} 

where each element of the children array is an object structured as follows:

{ 
"kind": "...", 
"data": { 
... 
} 
} 

Here is a complete example of what the .json GET returns (the body is too long to post directly): https://pastebin.com/20p4kk3u

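I need to print the complete data object present in each element of the children array. I know I need to pipe at least twice, first to get to children [...] and then to data {...}. This is what I have so far: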

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)' 

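I'm new to regex, so I'm not sure how to handle the brackets or braces inside the elements I'm grepping. The line above prints nothing, and I don't know why. Any help is appreciated.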

+2

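Are you open to using third-party utilities? I usually use the jq binary to parse JSON data easily. For your requirement, you just pass the JSON data to jq, which has its own query language: cat /tmp/data | jq '.data.children | .[]' (here /tmp/data contains the full JSON). With such utilities you can get the job done with shorter queries and advanced features (raw output, queries, etc.). – akskap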
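A minimal sketch of the jq approach from the comment above, run on a small hypothetical sample instead of the real Reddit response. It assumes jq is installed; the sample document and the extra `.data` step (to get the data object of each child, as the question asks) are my additions, not part of the comment.

```shell
# Hypothetical sample with the same shape as the Reddit listing (not real output).
sample='{"kind":"Listing","data":{"children":[{"kind":"t3","data":{"id":"x1"}},{"kind":"t3","data":{"id":"x2"}}]}}'

if command -v jq >/dev/null 2>&1; then
  # .data.children[] iterates over the array; .data picks the inner object.
  # -c prints each result compactly on its own line.
  data_objects=$(printf '%s\n' "$sample" | jq -c '.data.children[].data')
  printf '%s\n' "$data_objects"
else
  echo "jq is not installed; skipping" >&2
fi
```

Because jq parses the JSON rather than matching text, nested brackets and braces inside each element should not affect the result.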

+0

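Well, getting the data is not the only goal; this time it happens to be in .json format, but I would like to know how to handle any file with regular expressions. –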

Answers

1

Code

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)' 

Some notes on regular expressions

* == zero or more times
+ == one or more times
? == zero or one time
\s == a whitespace character: space, tab, carriage return, newline, vertical tab, or form feed
\w == a word character: a letter from A to Z (upper or lower case), a digit from 0 to 9, or an underscore (_)
\d == a digit from 0 to 9
\r == carriage return
\n == newline character (line feed)
\ == escapes a special character so that it is read as a literal character
[...] == a character class. Example: [abc] matches a or b or c
(?=) == a positive lookahead, a type of zero-width assertion. It says that the match must be followed by whatever is within the parentheses, but that part is not included in the match.
\K == resets the start of the reported match to this position; whatever was matched before it is left out of the output.
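To make \K and the lookahead concrete, here is a small sketch on a made-up one-line string (the sample text and variable names are hypothetical). It assumes GNU grep built with PCRE support, since -P is not available everywhere. The braces are escaped here for clarity only.

```shell
# Hypothetical one-line input, not real Reddit output.
sample='"name" : [ {a} ]'

# Without \K, the reported match includes everything from "name" onwards.
with_prefix=$(printf '%s\n' "$sample" | grep -oP '"name"\s*:\s*\[\s*\{.+?\}')

# With \K, the prefix is still required but dropped from the output;
# the lookahead requires a closing ] without including it.
after_k=$(printf '%s\n' "$sample" | grep -oP '"name"\s*:\s*\[\s*\K\{.+?\}(?=\s*\])')

printf '%s\n' "$with_prefix"
printf '%s\n' "$after_k"
```

The first command prints the key, the colon, and the bracket together with the braces; the second prints only the braces and their content.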

Anyway, you can read more about regular expressions here: Regex Tutorial

Now I can try to explain the code

wget downloads the source.
tr removes all line feeds and carriage returns, so the whole output is on one line and can be handled by grep.
grep -o prints only the matching part of each line.
grep -P enables Perl-compatible regular expressions.
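The tr and grep -o steps can be tried in isolation. This is only a sketch on a hypothetical multi-line input with Windows-style line endings; nothing here is fetched from the network.

```shell
# Hypothetical input spread over several lines with \r\n line endings.
flattened=$(printf '{\r\n"a": 1,\r\n"b": 2\r\n}\r\n' | tr -d '\r\n')
printf '%s\n' "$flattened"

# -o prints each match on its own line instead of printing the whole input line.
numbers=$(printf '%s\n' "$flattened" | grep -o '[0-9]')
printf '%s\n' "$numbers"
```

After tr the document is a single line, which matters because grep matches line by line and a pattern cannot span a line break.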

So here
grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])'
we have said:
match the line starting from "children"
zero or more whitespace characters
:
zero or more whitespace characters
\[ escaped, so it is a literal character and not a special one
zero or more whitespace characters
\K force the reported match to start from here
( open submatch
{.+?} everything in braces (the braces are included because they come after the submatch start; see greedy vs. non-greedy in the regex tutorial to understand how .+? works)
) close submatch
(?=\s*\]) stop the submatch where zero or more whitespace characters followed by a literal ] are found, without including them in the match.
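Here is a sketch of that pattern applied to two hypothetical one-line samples (invented for illustration, not taken from Reddit). It assumes GNU grep with -P support, and the braces are escaped for clarity. The second sample shows one limitation worth knowing: because .+? is non-greedy and the lookahead accepts the first ] it meets, an array nested inside an element can end the match early.

```shell
pattern='"children"\s*:\s*\[\s*\K(\{.+?\})(?=\s*\])'

# Hypothetical samples: one element without a nested array, one with.
flat='{"data": {"children": [ {"kind": "t3", "data": {"id": "x1"}} ], "after": null}}'
nested='{"data": {"children": [ {"kind": "t3", "data": {"tags": [{"t": 1}], "id": "x2"}} ], "after": null}}'

flat_match=$(printf '%s\n' "$flat" | grep -oP "$pattern")
nested_match=$(printf '%s\n' "$nested" | grep -oP "$pattern")

printf '%s\n' "$flat_match"     # the whole element
printf '%s\n' "$nested_match"   # cut off at the inner "]"
```

A regex like this cannot count matching brackets, so it is only dependable when the elements contain no arrays of their own.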
+0

Thanks for the detailed explanation, very helpful. A follow-up question: if I used egrep instead of the Perl regex syntax, what would the difference be? –

+0

Take a look here: https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions –
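As a rough illustration of the difference: POSIX extended regular expressions (what egrep / grep -E use) have no \K, no lookahead, and no non-greedy quantifiers, so the "print only part of the match" trick has to be done another way. The sketch below uses a capture group in sed -E as one possible substitute; the sample line is hypothetical, and [^]]* (anything except ]) stands in for the non-greedy .+?.

```shell
# Hypothetical one-line input.
line='"children" : [ {a} ]'

# PCRE version: \K and a lookahead trim the reported match.
pcre=$(printf '%s\n' "$line" | grep -oP '"children"\s*:\s*\[\s*\K\{.+?\}(?=\s*\])')

# ERE version: match the whole line and keep only the captured group.
ere=$(printf '%s\n' "$line" |
  sed -nE 's/.*"children"[[:space:]]*:[[:space:]]*\[[[:space:]]*(\{[^]]*\}).*/\1/p')

printf '%s\n' "$pcre"
printf '%s\n' "$ere"
```

Both commands print the same text for this input; the ERE version uses [[:space:]] because \s is an extension rather than part of POSIX.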

1

If you want to get the children array, try this, though I'm not sure it is what you're looking for.

wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'
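A quick sketch of how that sed range behaves, using a hypothetical pretty-printed sample. The range /children/,/],/ selects whole lines, from the first line containing children to the next line containing "],", so it depends on the JSON being spread over several lines. If the response arrives as a single line, the range would print that entire line.

```shell
# Hypothetical pretty-printed document with the same outer shape as the listing.
pretty='{
  "data": {
    "modhash": "",
    "children": [
      "e1",
      "e2"
    ],
    "after": null
  }
}'

# Multi-line input: only the lines of the children array are printed.
range=$(printf '%s\n' "$pretty" | sed -n '/children/,/],/p')
printf '%s\n' "$range"

# Same document squeezed onto one line: the whole line is printed.
one_line=$(printf '%s\n' "$pretty" | tr -d '\n' | sed -n '/children/,/],/p')
printf '%s\n' "$one_line"
```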