I've got logs spread across a number of machines, and I'd like to collect / aggregate some information about them. Maybe first I want to count the number of lines that contain the string "Message", then later I'll add up the numbers in the fifth column of all of them.
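For concreteness, here's a rough sketch (plain Python) of the kind of per-machine work I mean; the log lines are an inline stand-in for reading a real local log file:

```python
# Sketch of the per-machine work: count lines containing "Message"
# and sum the fifth (whitespace-delimited) column.
# These sample lines are made up for illustration.
log_lines = [
    "2014-01-01 INFO  app 1 10 Message received",
    "2014-01-01 WARN  app 2 20 timeout",
    "2014-01-01 INFO  app 3 30 Message sent",
]

message_count = sum(1 for line in log_lines if "Message" in line)
column_sum = sum(int(line.split()[4]) for line in log_lines)

print(message_count)  # 2
print(column_sum)     # 60
```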
Ideally I'd like to have each individual machine perform whatever operation I tell it to on its own set of logs, then return the results somewhere centralized for aggregation. I (shakily) gather that this is analogous to the Reduce operation of the MapReduce paradigm.
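To show the shape I have in mind (plain Python, with the per-machine results faked in a dict; in reality each partial would come back over the network from the machine that computed it):

```python
# Conceptual sketch of the pattern: each machine computes a partial
# result over only its own logs (the "map" half), and a central node
# combines the partials (the "reduce" half). The machine names and
# counts below are hypothetical.
partial_counts = {
    "machine-a": 120,  # count of "Message" lines found locally on machine-a
    "machine-b": 45,
    "machine-c": 310,
}

# Central aggregation: just sum the partial results.
total = sum(partial_counts.values())
print(total)  # 475
```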
My problem seems to be with Map. My gut tells me that Hadoop isn't a good fit, because in order to distribute the work, each worker node needs a common view of all of the underlying data -- the function fulfilled by HDFS. I don't want to aggregate all the existing data only so that I can then distribute operations across it; I want each specific machine to analyze the data that it (and only it) has.
I can't tell if Apache Spark would allow me to do this. The impression I got from the quick-start guide is that I could have one master node push out a compiled arbitrary JAR, and that each worker would run it -- in this case over just the data identified by logic within that JAR -- and return its results to the master node for me to do with what I please. But this from their FAQ makes me hesitant:
> Do I need Hadoop to run Spark?
>
> No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So my question is: Is Apache Spark appropriate for having an existing set of machines analyze the data they already have and aggregate the results?
If it is, could you please reiterate at a high level how Spark would process and aggregate pre-distributed, independent sets of data?
If not, are there any similar frameworks that allow one to analyze existing sets of distributed data?