We can use a Hive stateful UDF to auto-increment a value. The code looks like this:
package org.apache.hadoop.hive.contrib.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF {
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}
// End UDFRowSequence.java
Register the UDF:

CREATE TEMPORARY FUNCTION auto_increment_id AS
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Usage:

SELECT auto_increment_id() AS id, col1, col2 FROM table_name;
A similar question is answered here (How to implement auto increment in spark SQL).

What I need is something like this, but the question is whether it will scale to 200 million rows of data. In fact, I want to break a big file containing 200 million rows into smaller files of exactly 10K rows each. I want to add an auto-increment number to each row and then read in batches with the help of a filter like (id > 10,001 and id < 20,000). Will this work at that scale? Please advise. –
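A minimal HiveQL sketch of the batched reads described in the comment, assuming the UDF above is registered as `auto_increment_id`; `big_table`, `staging_table`, `col1`, and `col2` are illustrative names, not from the original. One caveat worth hedging: because the UDF keeps its counter per task, the ids are only globally unique and gap-free when the generating query runs in a single mapper/reducer, so materializing the ids once into a staging table is safer than calling the UDF in every batch query.

```sql
-- Register the UDF (class from above).
CREATE TEMPORARY FUNCTION auto_increment_id AS
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- Materialize the ids once so every batch query sees the same numbering.
CREATE TABLE staging_table AS
SELECT auto_increment_id() AS id, col1, col2
FROM big_table;

-- Read one 10K-row batch at a time by shifting the id window.
SELECT id, col1, col2
FROM staging_table
WHERE id > 10000 AND id <= 20000;
```

Each subsequent batch just moves the window (20000–30000, and so on); with 200 million rows that is about 20,000 windows.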