2016-06-07

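Creating a table from a CSV that contains comma values enclosed in quotes

I am trying to create a table in Impala from a CSV that I uploaded to an HDFS directory. The CSV contains values with commas enclosed inside quotes.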

Example:

1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC." 
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC." 
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC." 
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC." 
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC." 

The Impala documentation says that this can be solved with the ESCAPED BY clause. Here is my current code:

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4; 

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
    network STRING 
,isp STRING 
,organization STRING 
,autonomous_system_number STRING 
,autonomous_system_organization STRING 
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' 

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'; 

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4; 

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4; 

I also tried using the ESCAPED BY '"' clause. In both cases, Impala treats the commas inside the quotes as delimiters and splits the value across two columns.
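To make the symptom concrete, here is a small Python sketch (not Impala code) that contrasts splitting on every comma with a quote-aware parse of the first sample row. It only illustrates the two behaviors; it makes no claim about how Impala is implemented internally.

```python
import csv

# First sample row from the CSV shown above.
line = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."'

# Splitting on every comma ignores the quotes, which matches the symptom
# described above: quoted values are broken across extra columns.
naive_fields = line.split(",")

# A quote-aware parser keeps each quoted value together as one field.
quoted_fields = next(csv.reader([line]))

print(len(naive_fields), naive_fields)
print(len(quoted_fields), quoted_fields)
```

The naive split produces more fields than the five columns declared in the table, while the quote-aware parse produces exactly five.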

Any ideas on how to fix the code so that this does not happen?

EDIT (June 9, 2016)

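So, I have gone through the following variations, based on the suggestions from @K S Nidhin and @JTUP. However, each variation returns the same result as the query written without the SERDEPROPERTIES operator, with the commas still causing values to land in the wrong columns: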

Variation 1

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4; 

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
    network STRING 
,isp STRING 
,organization STRING 
,autonomous_system_number STRING 
,autonomous_system_organization STRING 
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
WITH SERDEPROPERTIES ("quoteChar" = "'", "escapeChar" = "\\") 

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'; 

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4; 

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4; 

Variation 2

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4; 

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
    network STRING 
,isp STRING 
,organization STRING 
,autonomous_system_number STRING 
,autonomous_system_organization STRING 
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' 
WITH SERDEPROPERTIES ('quoteChar' = '"', 'escapeChar' = '\\') 

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'; 

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4; 

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4; 

Variation 3

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4; 

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
    network STRING 
,isp STRING 
,organization STRING 
,autonomous_system_number STRING 
,autonomous_system_organization STRING 
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' 
WITH SERDEPROPERTIES (
    "separatorChar" = "\,", 
    "quoteChar"  = "\"" 
) 

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'; 

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4; 

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4; 

Are there any other ideas, or other variations of the SERDEPROPERTIES operator to try?

EDIT (June 10, 2016)

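I was able to get a different variation of the query working in Hive, using the SERDE and SERDEPROPERTIES operators (based on the code provided in the Hive documentation), and the table is created correctly: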

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4; 

CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING 
,isp STRING 
,organization STRING 
,autonomous_system_number STRING 
,autonomous_system_organization STRING) 

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 

WITH SERDEPROPERTIES (
    'separatorChar' = ',', 
    'quoteChar'  = '"', 
    'escapeChar' = '\\' 
) 
STORED AS TEXTFILE; 

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4; 

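Since the SERDE operator is not available in Impala, this solution will not work there. I am fine with creating the table in Hive, but I still have not been able to find a workable solution in Impala.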
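One possible workaround, sketched here as an assumption and not tested against Impala, is to preprocess the file before uploading it to HDFS: strip the enclosing quotes and put a backslash in front of every comma that belongs to a value, so that a table declared with FIELDS TERMINATED BY ',' ESCAPED BY '\\' has a chance of reading it as intended. The helper name `escape_delimiters` is hypothetical, and the sketch assumes no value contains an embedded newline.

```python
import csv
import io


def escape_delimiters(src, dst, delimiter=",", escape="\\"):
    """Rewrite quoted CSV rows as unquoted rows with escaped delimiters.

    Hypothetical helper for illustration; it is not part of Impala or Hive.
    Assumes no field contains an embedded newline.
    """
    for row in csv.reader(src):
        cleaned = []
        for value in row:
            # Escape existing escape characters first, so the backslashes
            # added for delimiters below are not escaped a second time.
            value = value.replace(escape, escape + escape)
            value = value.replace(delimiter, escape + delimiter)
            cleaned.append(value)
        dst.write(delimiter.join(cleaned) + "\n")


# Demonstration on the first sample row, using in-memory text streams.
sample = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'
out = io.StringIO()
escape_delimiters(io.StringIO(sample), out)
converted = out.getvalue()
print(converted, end="")
```

With real files, the same function can be called on two handles opened with `newline=""`, which is what the Python `csv` module documentation recommends for CSV input and output.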

Try adding SerDe properties: WITH SERDEPROPERTIES ("quoteChar" = "'", "escapeChar" = "\\")

Answer

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4; 

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
    network STRING 
,isp STRING 
,organization STRING 
,autonomous_system_number STRING 
,autonomous_system_organization STRING 
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' 

WITH SERDEPROPERTIES (
    "separatorChar" = "\,", 
    "quoteChar"  = "\"" 
) 

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'; 

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4; 

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4; 

Adding WITH SERDEPROPERTIES should hopefully do the trick.

Just tried it. Unfortunately, Impala does not support "OPTIONALLY ENCLOSED BY". – nxl4

I made an edit; check whether it works. I have not done this since my last job, so I am not sure I put it in the right place. But SERDEPROPERTIES should help with the commas. – JT4U

See my edit to the OP above. – nxl4
