Spark With Cassandra Python Setup
I am trying to use Spark to do some simple computations on Cassandra tables, but I am quite lost. I have been trying to follow: https://github.com/datastax/spark-cassandra-connector/blob
Solution 1:
- Copy the pyspark-cassandra connector jar into the spark-folder/jars directory.
The code below will connect to Cassandra:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.cassandra.connection.host', 'localhost') \
    .config('spark.cassandra.connection.port', '9042') \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .master('local[2]') \
    .getOrCreate()

# SparkSession already exposes the DataFrame reader, so a separate
# SQLContext is not needed.
ds = spark.read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table='emp', keyspace='demo') \
    .load()

ds.show(10)
Solution 2:
The Cassandra connector doesn't provide any Python modules. All functionality is provided through the Data Source API, and as long as the required jars are present, everything should work out of the box.
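One way to make the jars available without copying them by hand is to let Spark fetch them from Maven via the spark.jars.packages property. A minimal sketch, assuming a running Cassandra instance; the connector coordinates and version below are an example and must be matched to your Spark and Scala versions:

```python
from pyspark.sql import SparkSession

# Ask Spark to resolve and download the connector at startup.
# The Maven coordinates/version here are an assumption; pick the
# artifact matching your Spark/Scala build.
spark = (SparkSession.builder
         .appName("SparkCassandraApp")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
         .config("spark.cassandra.connection.host", "x.y.z.v")
         .getOrCreate())
```

This avoids touching the Spark installation directory, which is convenient when you cannot modify spark-folder/jars.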
How do I let Spark know where my Cassandra cluster is?
Use the spark.cassandra.connection.host property. You can, for example, pass it as an argument to spark-submit / pyspark:
pyspark ... --conf spark.cassandra.connection.host=x.y.z.v
or set in your configuration:
(SparkSession.builder
    .config("spark.cassandra.connection.host", "x.y.z.v"))
Configuration like table name or keyspace can be set directly on reader:
(spark.read
.format("org.apache.spark.sql.cassandra")
.options(table="kv", keyspace="test", cluster="cluster")
.load())
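Once loaded, the result is an ordinary DataFrame, so the usual transformations apply. A small sketch, assuming the kv table above has key and value columns (an assumption based on the snippet, not confirmed by the source):

```python
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="kv", keyspace="test")
      .load())

# Standard DataFrame operations work; simple filters on partition-key
# columns can be pushed down to Cassandra by the connector.
df.filter(df.key == "some-key").show()
```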
From there, you can follow the DataFrames documentation.
As a side note,
import com.datastax.spark.connector._
is Scala syntax and works in Python only accidentally.
Solution 3:
With username and password:
spark = SparkSession.builder \
.appName('SparkCassandraApp') \
.config('spark.cassandra.connection.host', 'localhost') \
.config('spark.cassandra.connection.port', '9042') \
.config("spark.cassandra.auth.username","cassandrauser")\
.config("spark.cassandra.auth.password","cassandrapwd")\
.master('local[2]') \
.getOrCreate()
df = spark.read.format("org.apache.spark.sql.cassandra")\
.options(table="tablename", keyspace="keyspacename").load()
df.show()
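For completeness, writing a DataFrame back to Cassandra goes through the same data source. A hedged sketch: the table and keyspace names are placeholders, and the target table must already exist with columns matching the DataFrame schema:

```python
# Append the DataFrame's rows to an existing Cassandra table.
df.write \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="tablename", keyspace="keyspacename") \
    .mode("append") \
    .save()
```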