I am trying to create a table in Impala from a CSV that I uploaded to an HDFS directory. The CSV contains values with commas inside quotes.
Example:
1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
The Impala documentation says that this can be solved with the ESCAPED BY clause. Here is my current code:
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
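I also tried the ESCAPED BY '"' clause. In both cases, Impala treats the comma inside the quotes as a delimiter and splits the value across two columns.
Any ideas on how to fix the code so that this does not happen?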
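To make the symptom concrete, here is a minimal Python sketch (not Impala code) that contrasts a split on every comma, which is the behavior I am seeing, with a quote-aware CSV parse, which is the behavior I want. The helper names `split_naive` and `split_quoted` are made up for illustration.

```python
import csv

# First sample row from the CSV shown above.
SAMPLE = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."'


def split_naive(line):
    """Split on every comma, ignoring quotes (the unwanted behavior)."""
    return line.split(",")


def split_quoted(line):
    """Split with the csv module, which keeps commas inside double quotes."""
    return next(csv.reader([line]))


if __name__ == "__main__":
    print(len(split_naive(SAMPLE)), split_naive(SAMPLE))
    print(len(split_quoted(SAMPLE)), split_quoted(SAMPLE))
```

The naive split produces seven fragments for a row that should have five columns, which matches the shifted columns described above.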
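EDIT (June 9, 2015)
So, I have tried the following variations, based on suggestions from @K S Nidhin and @JTUP. However, each variation returns the same result as the query written without the SERDEPROPERTIES clause: the commas still cause values to end up in the wrong columns:
Variation 1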
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES ("quoteChar" = "'", "escapeChar" = "\\")
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 2
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES ('quoteChar' = '"', 'escapeChar' = '\\')
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 3
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
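Are there any other ideas, or other variations of the SERDEPROPERTIES clause to try?
EDIT (June 10, 2016)
I was able to get a different variation of the query working in Hive, using the SERDE and SERDEPROPERTIES clauses (based on code provided in the Hive documentation), and the table is created correctly: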
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
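Since the SERDE clause is not available in Impala, this solution will not work there. I am fine with creating the table in Hive, but I still cannot find a workable solution in Impala.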
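One possible workaround, offered as a sketch and not a confirmed fix, is to re-encode the file before uploading it so that no field needs quoting at all. The Python below parses the quoted CSV and writes tab-separated text. It assumes that no field contains a tab or a newline and refuses to continue if one does. The function name `to_tsv` is hypothetical.

```python
import csv
import io


def to_tsv(src, dst):
    """Read quoted, comma-separated rows from src and write tab-separated rows to dst.

    Assumption: no field contains a tab or a newline. A ValueError is raised
    if one does, because the output would then be ambiguous.
    """
    for row in csv.reader(src):  # default dialect: delimiter ',' and quotechar '"'
        for field in row:
            if "\t" in field or "\n" in field:
                raise ValueError("field contains a tab or newline: %r" % field)
        dst.write("\t".join(row) + "\n")


if __name__ == "__main__":
    sample = io.StringIO(
        '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'
    )
    out = io.StringIO()
    to_tsv(sample, out)
    print(out.getvalue(), end="")
```

With a file converted this way, the table definition would presumably declare FIELDS TERMINATED BY '\t' instead of ',', and the embedded commas would no longer matter. When reading from and writing to real files, open them with newline='' as the csv module documentation recommends.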
Try adding SerDe properties: WITH SERDEPROPERTIES ("quoteChar" = "'", "escapeChar" = "\\")