I'm writing some unit tests for my Spark code in python. My code depends on spark-csv. In production I use
spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 to submit my python script.
I'm using pytest to run my tests, creating Spark in the test code with:
conf = SparkConf().setAppName('myapp').setMaster('local')
sc = SparkContext(conf=conf)
My question is, since
pytest isn't using
spark-submit to run my code, how can I provide my
spark-csv dependency to the Python process?
Answer:
You can sort out the problem by using the spark.driver.extraClassPath setting in your config file. Edit spark-defaults.conf
and add the property, listing both the spark-csv and commons-csv jars.
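A minimal sketch of the entry, assuming both jars were downloaded to a hypothetical /path/to/jars directory (adjust paths and versions to your setup; entries are colon-separated on Linux/macOS):

```
spark.driver.extraClassPath /path/to/jars/spark-csv_2.10-1.0.3.jar:/path/to/jars/commons-csv-1.1.jar
```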
After setting the above, you don't even need the --packages flag when running from the shell.
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load(BASE_DATA_PATH + '/ssi.csv')
Both jars are important, as spark-csv depends on the Apache commons-csv jar. The
spark-csv jar you can either build yourself or download from the Maven repository.
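An alternative, if you'd rather not edit spark-defaults.conf, is to set the PYSPARK_SUBMIT_ARGS environment variable in the test code before the SparkContext is created, so pytest's Python process launches Spark with the same --packages coordinates used in production. A sketch, assuming the same spark-csv version as in the question:

```python
import os

# Must run before the SparkContext (and its JVM) is created.
# The trailing 'pyspark-shell' token is required so PySpark's
# launcher treats the preceding options as submit arguments.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell'
)

# Then create the context exactly as in the question:
# from pyspark import SparkConf, SparkContext
# conf = SparkConf().setAppName('myapp').setMaster('local')
# sc = SparkContext(conf=conf)
```

With this approach the dependency lives next to the tests instead of in a machine-wide config file.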