I am trying to use Apache Spark to compare two different files based on a common field, get the values from both files, and write them to an output file.
I am using Spark SQL to join the two files (after registering each RDD as a table).
Is this the correct approach?
Is it possible to compare / join the files without Spark SQL?
Any suggestions would be appreciated.
Best How To:
If you use plain Spark (without Spark SQL), you can join two RDDs directly.
let a = RDD<Tuple2<K,T>>
let b = RDD<Tuple2<K,S>>
RDD<Tuple2<K,Tuple2<T,S>>> c = a.join(b)
This produces an RDD containing, for each key K, every pair of matching values from the two datasets (an inner join). There are also leftOuterJoin, rightOuterJoin, and fullOuterJoin methods on pair RDDs.
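To make the join semantics concrete, here is a minimal plain-Python sketch (no Spark required) of what an inner join over (key, value) pairs produces; the sample data is made up for illustration:

```python
from collections import defaultdict

def rdd_style_join(a, b):
    """Inner join of two lists of (key, value) pairs,
    mimicking the semantics of Spark's RDD.join."""
    by_key = defaultdict(list)
    for k, s in b:
        by_key[k].append(s)
    # Emit one (k, (t, s)) pair for every matching combination of values.
    return [(k, (t, s)) for k, t in a for s in by_key.get(k, [])]

a = [("id1", "fileA-row1"), ("id2", "fileA-row2")]
b = [("id1", "fileB-row1"), ("id3", "fileB-row3")]
print(rdd_style_join(a, b))  # [('id1', ('fileA-row1', 'fileB-row1'))]
```

Keys that appear in only one dataset ("id2", "id3") are dropped, which is exactly why the outer-join variants exist.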
So you have to map both datasets to produce two RDDs indexed by your common key, then join them. Here is the documentation I'm referencing.
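The whole pipeline (read two files, key each record by the common field, join, write the combined records) can be sketched in plain Python; the file contents, delimiter, and key column below are hypothetical stand-ins, and in real Spark each step would be an RDD transformation such as map and join:

```python
import csv
import io

# Hypothetical inputs: two comma-separated files sharing an id in column 0.
file_a = io.StringIO("id1,alice\nid2,bob\n")
file_b = io.StringIO("id1,100\nid3,300\n")

# Step 1: map each dataset to (key, value) pairs -- in Spark, roughly
# rdd.map(lambda line: (line.split(",")[0], line.split(",")[1])).
a = {row[0]: row[1] for row in csv.reader(file_a)}
b = {row[0]: row[1] for row in csv.reader(file_b)}

# Step 2: join on the common key (inner join, like a.join(b) on pair RDDs).
joined = [(k, (a[k], b[k])) for k in a if k in b]

# Step 3: format the combined records as output lines
# (in Spark this would end with saveAsTextFile or similar).
output = ["%s,%s,%s" % (k, x, y) for k, (x, y) in joined]
print(output)  # ['id1,alice,100']
```

Note this dict-based sketch assumes keys are unique within each file; Spark's join handles duplicate keys by emitting every matching combination.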