We want to do Big Data Analytics on our data stored in Amazon Redshift (currently in Terabytes, but will grow with time).
Currently, it seems that all our Analytics can be done through Redshift queries (and hence, no distributed processing might be required at our end) but we are not sure if that will remain to be the case in future.
In order to build a generic system that should be able to cater our future needs as well, we are looking to use Apache Spark for data analytics. I know that data can be read into Spark RDDs from HDFS, HBase and S3, but does it support data reading from Redshift directly? If not, we can look to transfer our data to S3 and then read it in Spark RDDs.
My question is if we should carry out our Data Analytics through Redshift's queries directly or should we look to go with the approach above and do analytics through Apache Spark (Problem here is that Data Locality optimization might not be available)?
In case we do analytics through Redshift queries directly, can anyone please suggest a good Workflow Scheduler to write our Analytics jobs with. Our requirement is to be able to execute jobs as a DAG (Job2 should execute only if Job1 succeeds, etc) and be able to schedule our workflows through the proposed Workflow Engine.
Oozie seems like a good fit for our requirements but it turns out that Oozie cannot be used without Hadoop. Does it make sense to set up Hadoop on our machines and then use Oozie Workflow Scheduler to schedule our Data Analysis jobs through Redshift queries?