Pig casting / datatypes
I'm trying to dump a relation into an AVRO file, but I'm getting a strange error:
org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence
I don't use DataByteArray (bytearray) anywhere; see the description of the relation below.
sensitiveSet: {rank_ID: long,name: chararray,customerId: long,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray}
Even with explicit casts I get the same error:
sensitiveSet = foreach sensitiveSet generate (long) $0, (chararray) $1, (long) $2, (chararray) $3, (chararray) $4, (chararray) $5, (chararray) $6;
STORE sensitiveSet INTO 'testOut2222.avro'
USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check', 'schema', '{"type":"record","name":"xxxx","namespace":"","fields":[{"name":"rank_ID","type":"long"},{"name":"name","type":"string","store":"no","sensitive":"na"},{"name":"customerId","type":"string","store":"yes","sensitive":"yes"},{"name":"VIN","type":"string","store":"yes","sensitive":"yes"},{"name":"birth_date","type":"string","store":"yes","sensitive":"no"},{"name":"fuel_mileage","type":"string","store":"yes","sensitive":"no"},{"name":"fuel_consumption","type":"string","store":"yes","sensitive":"no"}]}');
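Separately from the cast error, comparing the `describe` output with the schema passed to AvroStorage shows that `customerId` is a `long` in the relation but declared as `"string"` in the Avro schema. This may or may not be related to the exception. The sketch below is only an illustration of how such a comparison could be done by hand: the class `SchemaTypeCheck`, its methods, and the Pig-to-Avro type table are assumptions written for this example, not part of Pig or AvroStorage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pig: lists fields whose Avro type
// differs from the type the Pig field would usually be stored as.
class SchemaTypeCheck {
    // Assumed primitive mapping from Pig type names to Avro type names.
    static final Map<String, String> PIG_TO_AVRO = new HashMap<>();
    static {
        PIG_TO_AVRO.put("int", "int");
        PIG_TO_AVRO.put("long", "long");
        PIG_TO_AVRO.put("float", "float");
        PIG_TO_AVRO.put("double", "double");
        PIG_TO_AVRO.put("chararray", "string");
        PIG_TO_AVRO.put("bytearray", "bytes");
        PIG_TO_AVRO.put("boolean", "boolean");
    }

    // Returns the names of the Pig fields whose declared Avro type does not match.
    static List<String> mismatches(Map<String, String> pigFields,
                                   Map<String, String> avroFields) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : pigFields.entrySet()) {
            String expected = PIG_TO_AVRO.get(e.getValue());
            String declared = avroFields.get(e.getKey());
            if (expected == null || !expected.equals(declared)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // Field types copied from the describe output of sensitiveSet.
    static Map<String, String> pigFields() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("rank_ID", "long");
        m.put("name", "chararray");
        m.put("customerId", "long");
        m.put("VIN", "chararray");
        m.put("birth_date", "chararray");
        m.put("fuel_mileage", "chararray");
        m.put("fuel_consumption", "chararray");
        return m;
    }

    // Field types copied from the JSON schema given to AvroStorage.
    static Map<String, String> avroFields() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("rank_ID", "long");
        m.put("name", "string");
        m.put("customerId", "string");
        m.put("VIN", "string");
        m.put("birth_date", "string");
        m.put("fuel_mileage", "string");
        m.put("fuel_consumption", "string");
        return m;
    }

    public static void main(String[] args) {
        System.out.println(mismatches(pigFields(), avroFields())); // prints [customerId]
    }
}
```

If the mismatch is unintended, either casting `customerId` to chararray before the STORE or declaring it as `"long"` in the Avro schema would make the two sides agree.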
EDIT:
I'm trying to define an output schema that should be a tuple containing two other tuples, i.e. stats:tuple(c:tuple(),d:tuple()).
The code below doesn't work as expected. It somehow produces this structure instead:
stats:tuple(b:tuple(c:tuple(),d:tuple()))
Below is the output produced by describe.
sourceData: {com.mortardata.pig.dataspliter_36: (stats: ((name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),(name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray)))}
Is it possible to create a structure like the one below? That would mean removing tuple b from the previous example.
grunt> describe sourceData;
sourceData: {t: (s: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),n: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray))}
The code below does not work as expected.
public Schema outputSchema(Schema input) {
    Schema sensTuple = new Schema();
    sensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));
    Schema nonSensTuple = new Schema();
    nonSensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));
    Schema parentTuple = new Schema();
    parentTuple.add(new Schema.FieldSchema(null, sensTuple, DataType.TUPLE));
    parentTuple.add(new Schema.FieldSchema(null, nonSensTuple, DataType.TUPLE));
    Schema outputSchema = new Schema();
    outputSchema.add(new Schema.FieldSchema("stats", parentTuple, DataType.TUPLE));
    return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), outputSchema, DataType.TUPLE));
}
The UDF's exec method returns:
public Tuple exec(Tuple tuple) throws IOException {
    Tuple parentTuple = mTupleFactory.newTuple();
    parentTuple.append(tuple1);
    parentTuple.append(tuple2);
    return parentTuple;
}
EDIT2 (fixed)
...
Schema outputSchema = new Schema();
outputSchema.add(new Schema.FieldSchema("stats", parentTuple, DataType.TUPLE));
return outputSchema;
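Now I'm returning the correct schema from the UDF, and all items are chararray, but when I try to store these items into an Avro file as type string, I get the same error: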
java.lang.Exception: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
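SOLVED: OK, the problem was that the data was not being converted to the proper type inside the UDF body, i.e. in the exec() method. It seems to work now!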
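For anyone hitting the same error, here is a minimal sketch of the kind of conversion meant above. To keep it runnable without Pig on the classpath, `FakeByteArray` is a hypothetical stand-in for `org.apache.pig.data.DataByteArray`, and `writeAsString` only imitates the cast that fails in the stack trace; neither is real Pig or Avro code. The point is that declaring chararray in outputSchema() does not change the objects placed in the tuple, so each value has to be turned into a String inside exec() before it is appended.

```java
import java.nio.charset.StandardCharsets;

class ChararrayCoercion {
    // Hypothetical stand-in for org.apache.pig.data.DataByteArray,
    // used only so this sketch runs without Pig.
    static final class FakeByteArray {
        private final byte[] bytes;

        FakeByteArray(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Imitates what happens when a value is written to an Avro "string"
    // field: the value is cast to CharSequence.
    static CharSequence writeAsString(Object value) {
        return (CharSequence) value;
    }

    // The conversion to apply in exec() before appending a field to the
    // output tuple: turn the raw value into a real String.
    static String toChararray(Object value) {
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        Object raw = new FakeByteArray("VIN123"); // sample value, made up

        try {
            writeAsString(raw); // raw byte-array object is not a CharSequence
        } catch (ClassCastException e) {
            System.out.println("cast failed for the raw value");
        }

        // After conversion the same cast succeeds.
        System.out.println(writeAsString(toChararray(raw)));
    }
}
```

In the actual UDF the equivalent step would be to append something like the String form of each field to tuple1 and tuple2 instead of the raw input object; as far as I know, the real DataByteArray also decodes its bytes as UTF-8 in toString().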
Have you tried with 'USING ... AvroStorage()' or '... AvroStorage('no_schema_check')'? – WattsInABox
As you can see, 'no_schema_check' is there. – heap
@heap: customerId is of type long, yet it is being saved as a string. What if it is not of type long?