如何枚举HDFS目录中的文件?这是用于使用Scala枚举Apache Spark群集中的文件。我看到有sc.textfile()选项,但它会读取内容。我只想读取文件名。如何枚举HDFS目录中的文件
其实我试过listStatus。但没有奏效。获取下面的错误。 我使用的是Azure HDInsight Spark,blob存储文件夹“[email protected]/example/”包含.json文件。
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path("wasb://[email protected]/example/"))
status.foreach(x=> println(x.getPath)
=========
Error:
========
java.io.FileNotFoundException: Filewasb://[email protected]/example does not exist.
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.listStatus(NativeAzureFileSystem.java:2076)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:23)
at $iwC$$iwC$$iwC.<init>(<console>:28)
at $iwC$$iwC.<init>(<console>:30)
at $iwC.<init>(<console>:32)
at <init>(<console>:34)
at .<init>(<console>:38)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at com.cloudera.livy.repl.scalaRepl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:272)
at com.cloudera.livy.repl.scalaRepl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:272)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.Console$.withOut(Console.scala:126)
at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.executeLine(SparkInterpreter.scala:271)
at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.executeLines(SparkInterpreter.scala:246)
at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.execute(SparkInterpreter.scala:104)
at com.cloudera.livy.repl.Session.com$cloudera$livy$repl$Session$$executeCode(Session.scala:98)
at com.cloudera.livy.repl.Session$$anonfun$3.apply(Session.scala:73)
at com.cloudera.livy.repl.Session$$anonfun$3.apply(Session.scala:73)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
谢谢!
请阅读HDFS API文档,尝试一些和后你试过了什么! – eliasah
我不明白你为什么试过spark api。因为你必须看看hdfs api和scala语法来做到这一点。由于@eliasah建议请在SO –