
I want to run a query against a Hive table (table name: 'sample') using PySpark SQL. Below is the simple PySpark code I run in the pyspark shell, using pyspark.sql.HiveContext to query the existing Hive table:

from pyspark.sql import HiveContext

# Build a HiveContext on top of the shell's existing SparkContext (sc),
# run the query, and pull the rows back to the driver
sqlContext = HiveContext(sc)
sqlContext.sql("SELECT * FROM sample").collect()

Below is the error I get:

15/12/04 11:15:20 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead. 
15/12/04 11:15:21 INFO HiveContext: Initializing execution hive, version 0.13.1 
15/12/04 11:15:21 INFO metastore: Trying to connect to metastore with URI thrift://maprecruit.server1:9083 
15/12/04 11:15:21 INFO metastore: Connected to metastore. 
15/12/04 11:15:23 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
15/12/04 11:15:23 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/12/04 11:15:23 INFO ParseDriver: Parsing command: SELECT * FROM sample 
15/12/04 11:15:24 INFO ParseDriver: Parse Completed 
15/12/04 11:15:24 INFO HiveContext: Initializing HiveMetastoreConnection version 0.13.1 using Spark classes. 
15/12/04 11:15:29 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe not found 
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe not found 
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) 
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337) 
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288) 
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281) 
    at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631) 
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189) 
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017) 
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:202) 
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:198) 
    at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:156) 
    at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:198) 
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:112) 
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:61) 
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:227) 
    at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:373) 
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:165) 
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:165) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:165) 
    at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:373) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:222) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) 
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
    at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
    at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:272) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:212) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:229) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:219) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:61) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:59) 
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) 
    at scala.collection.immutable.List.foldLeft(List.scala:84) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:59) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:51) 
    at scala.collection.immutable.List.foreach(List.scala:318) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:51) 
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:933) 
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:933) 
    at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931) 
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131) 
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) 
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:497) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:207) 
    at java.lang.Thread.run(Thread.java:745) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/usr/hdp/2.3.2.0-2950/spark/python/pyspark/sql/context.py", line 502, in sql 
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self) 
    File "/usr/hdp/2.3.2.0-2950/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ 
    File "/usr/hdp/2.3.2.0-2950/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o35.sql. 
: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe not found) 
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:290) 
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281) 
    at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631) 
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189) 
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017) 
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:202) 
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:198) 
    at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:156) 
    at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:198) 
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:112) 
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:61) 
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:227) 
    at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:373) 
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:165) 
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:165) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:165) 
    at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:373) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:222) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) 
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
    at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
    at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:272) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:212) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:229) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:219) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:61) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:59) 
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) 
    at scala.collection.immutable.List.foldLeft(List.scala:84) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:59) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:51) 
    at scala.collection.immutable.List.foreach(List.scala:318) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:51) 
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:933) 
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:933) 
    at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931) 
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131) 
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) 
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:497) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:207) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: MetaException(message:java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe not found) 
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:346) 
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288) 
    ... 67 more 

I know I have gone wrong somewhere in the configuration. Can anyone help me with the configuration part?

PS: I am using Hortonworks Ambari HDP-2.2.

Answer


Hive-specific SerDes (HBase, ORC, for example) generally do not work in Spark.
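
That said, the failure here is a plain ClassNotFoundException, so before switching approaches it is worth checking whether the Hive HBase handler and HBase client jars are on Spark's classpath at all. Below is a minimal standalone-script sketch; the /usr/hdp/current/... jar paths are assumptions based on a typical HDP layout and should be adjusted to match your install:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Jar locations below are assumptions for a typical HDP layout; adjust them
# to wherever hive-hbase-handler and the HBase client jars live on your nodes.
SERDE_JARS = ",".join([
    "/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar",
    "/usr/hdp/current/hbase-client/lib/hbase-client.jar",
    "/usr/hdp/current/hbase-client/lib/hbase-common.jar",
    "/usr/hdp/current/hbase-client/lib/hbase-protocol.jar",
])

conf = (SparkConf()
        .setAppName("hive-hbase-sample")
        # ship the jars to the executors; in client mode you may also need
        # them on the driver classpath, e.g. by launching with
        # `pyspark --jars ...` or setting spark.driver.extraClassPath
        .set("spark.jars", SERDE_JARS))

sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
sqlContext.sql("SELECT * FROM sample").show()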

Give pyspark-hbase a try.
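
If the table is backed by HBase, another option is to skip the Hive layer and read the HBase table directly, the way Spark's bundled hbase_inputformat.py example does. A minimal sketch, assuming the backing HBase table is also named 'sample', that the ZooKeeper quorum runs on maprecruit.server1, and that the HBase client jars plus the Spark examples jar (which provides the Python converters) are on the classpath:

# Read the backing HBase table directly, bypassing the Hive HBaseSerDe.
# The table name and ZooKeeper host below are assumptions; adjust to match
# your cluster. Run inside the pyspark shell, where sc already exists.
conf = {
    "hbase.zookeeper.quorum": "maprecruit.server1",  # assumed ZK quorum
    "hbase.mapreduce.inputtable": "sample",          # assumed HBase table name
}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf,
)
print(rdd.take(5))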