2017-10-19 77 views

Merging the files in a folder using groupBy while copying from HDFS to S3

I have the following folders in HDFS:

hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/OM/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/Others/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/QA/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/QA/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/SA/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/SA/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/AE/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/BH/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/BH/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/IN/INT/20171001/2017100101 

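Each folder has nearly 50 files in it. My goal is to merge all the files within a folder into a single file while copying from HDFS to S3. The problem I am having is with the regex for the groupBy option. I tried this, which does not seem to work: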

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/ --groupBy '.*/(\w+)/(\w+)/(\w+)/.*' --outputCodec lzo 

The command itself runs, but the files in each folder are not merged into a single file, which leads me to believe the problem is with my regex.

Answer


I figured this out on my own. The correct regex is:

.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.* 

and the command to merge and copy is:

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/ --groupBy '.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*' --outputCodec lzo 
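One way to see why the two patterns behave differently is to check what their capture groups pick up for a sample path. The sketch below is only an illustration: it uses Python's `re` module rather than the Java regex engine that s3-dist-cp runs on (for simple constructs like `\w+` and greedy `.*` the two should behave alike), the `part-0000x` file names are made up, and it assumes the pattern is applied to the whole file path and that files sharing the same captured values end up in the same output file.

```python
import re

# Hypothetical file names; the question only lists the folders.
SAMPLE_PATHS = [
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/part-00000",
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/part-00001",
    "hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101/part-00000",
    "hdfs://x.x.x.x:8020/Air/SEARCH/AE/INT/20171001/2017100101/part-00000",
]

ORIGINAL = r".*/(\w+)/(\w+)/(\w+)/.*"
CORRECTED = r".*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*"


def group_key(pattern, path):
    """Return the captured values for a path, or None if it does not match."""
    match = re.fullmatch(pattern, path)
    return None if match is None else match.groups()


def group_paths(pattern, paths):
    """Bucket paths by their captured values (assumed grouping behaviour)."""
    groups = {}
    for path in paths:
        key = group_key(pattern, path)
        if key is not None:
            groups.setdefault(key, []).append(path)
    return groups


if __name__ == "__main__":
    for name, pattern in (("original", ORIGINAL), ("corrected", CORRECTED)):
        print(name)
        for key, members in group_paths(pattern, SAMPLE_PATHS).items():
            print("  ", "/".join(key), "<-", len(members), "file(s)")
```

Because the leading `.*` is greedy, the original pattern captures the last three directory levels (`DOM`, `20171001`, `2017100101`), so the `AE` and `IN` folders fall under the same key. Anchoring on `/Air/` makes the groups capture `BOOK`, `AE`, `DOM` instead, which gives one key per folder.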