Original title: Besides HDFS, what other DFS does Spark support (and is recommended)?
I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) on several gigantic clusters.
From time to time, I would like to pull all the data out of a cluster, process each doc, and write all of it into a different ES (Elasticsearch) cluster (yes, data migration too).
Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with Spark + elasticsearch-hadoop, because that would involve swapping the SparkContext out from under the RDD. So instead, I would like to write the RDDs out as object files and later read them back into RDDs under a different SparkContext.
However, here comes the problem: I then need a DFS (Distributed File System) to share these huge files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing the full Hadoop stack into mine. Is there any other DFS that Spark supports and that is recommended?
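For reference, the intermediate-file approach I had in mind looks roughly like this (a sketch only; the `shared-fs://` URI scheme and paths are placeholders for whatever shared filesystem ends up being used):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds esRDD to SparkContext

// Job 1: dump the ES data to files on a filesystem visible to all workers.
val sc1 = new SparkContext(
  new SparkConf().setAppName("Dump ES Data").set("es.nodes", "from.escluster.com"))
sc1.esRDD("some/lovelydata").saveAsObjectFile("shared-fs://dumps/lovelydata")
sc1.stop()

// Job 2 (possibly a separate application): read the files back with a new context.
val sc2 = new SparkContext(new SparkConf().setAppName("Load ES Data"))
val restored =
  sc2.objectFile[(String, collection.Map[String, AnyRef])]("shared-fs://dumps/lovelydata")
```

`saveAsObjectFile` / `objectFile` use Java serialization, so this works as long as both jobs agree on the element type; the round trip through the shared filesystem is exactly the part that needs the DFS asked about above.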
Update: thanks to @Daniel Darabos's answer below, I can now read data from one Elasticsearch cluster and write it into a different one using the following code:
```scala
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")          // cluster to read from
val sc = new SparkContext(conf)
val allDataRDD = sc.esRDD("some/lovelydata")
val cfg = Map("es.nodes" -> "to.escluster.com")     // per-write override: cluster to write to
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
```