How to connect pyspark from a Spark cluster to a DSE Cassandra cluster remotely
This article is meant to help those connecting from a remote Apache Spark cluster to DSE Cassandra using pyspark.
To do so, you need the byos.properties file and the byos <version> jar from the DSE Cassandra nodes. First, generate the byos.properties file on one of your analytics nodes:
dse client-tool configuration byos-export ~/byos.properties
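As a quick sanity check, you can peek at the exported file. It should contain Spark Cassandra connector settings; the exact property names below are what you would typically expect, so verify against your own output:

```shell
# Sanity check: the exported properties should include connector
# settings such as the Cassandra contact points.
grep connection.host ~/byos.properties
```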
Now, locate the byos jar file:
$ locate dse-byos
/usr/share/dse/clients/dse-byos_2.11-6.7.10.jar
/usr/share/dse/spark/client-lib/dse-byos_2.11-6.7.10.jar
Use scp to copy both files over to the remote Spark node where you will be running pyspark:
scp user@dsenode1.example.com:/usr/share/dse/clients/dse-byos_2.11-6.7.10.jar .
scp user@dsenode1.example.com:~/byos.properties .
From your remote Spark node, start pyspark with the following command:
pyspark --jars dse-byos_2.11-6.7.10.jar --properties-file byos.properties
Of course, you'll need to adjust the file names to match the Scala and DSE versions in your environment.
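Once the shell is up, you can verify connectivity by reading a Cassandra table through the Spark Cassandra connector data source. This is a sketch: the keyspace and table names (my_keyspace, my_table) are placeholders, so substitute your own:

```python
# Run inside the pyspark shell started above, where `spark`
# (a SparkSession) is already defined.
# `my_keyspace` and `my_table` are placeholders -- use your own names.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())

df.printSchema()   # confirms the schema was fetched from Cassandra
df.show(5)         # pulls a few rows to confirm the connection works
```

If this returns rows, the BYOS configuration is working and your Spark cluster can read from DSE Cassandra.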