Running DSE 4.7
So say I have a 4 node DSE Cassandra/Spark cluster...
I have a Cassandra table with say 4,000,000 records in it.
On Spark running the following Spark SQL "select * from table where email = ? or mobile = ?"
Will Spark load all the data into RDD and then filter based on the where clause? Will each spark node have 1,000,000 records per node loaded into memory?
Best How To :
Will spark load all the data into RDD and then filter based on the where clause?
It depends on your database schema. If your query explicitly restricts scan to a single C* partition (and ours
where email = ? or mobile = ? definitely does not), Spark will load only part of the data.
In your case it will have to scan all the data.
Will each spark node have 1,000,000 records per node loaded into memory?
Again, it depends of your dataset size and amount of RAM on worker nodes. Spark RDDs are not always fully loaded into RAM, in your case it can be split into smaller parts (e.g. 100k rows), loaded into ram, filtered according to your query and saved after that, one-by-one.