Skip to content Skip to sidebar Skip to footer

Spark With Cassandra Python Setup

I am trying to use spark to do some simple computations on Cassandra tables, but I am quite lost. I am trying to follow: https://github.com/datastax/spark-cassandra-connector/blob

Solution 1:

  1. Copy pyspark-cassandra connector spark-folder/jars.
  2. Below code will connect to cassandra.

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext, SparkSession
    
    spark = SparkSession.builder \
      .appName('SparkCassandraApp') \
      .config('spark.cassandra.connection.host', 'localhost') \
      .config('spark.cassandra.connection.port', '9042') \
      .config('spark.cassandra.output.consistency.level','ONE') \
      .master('local[2]') \
      .getOrCreate()
    
    sqlContext = SQLContext(spark)
    ds = sqlContext \
      .read \
      .format('org.apache.spark.sql.cassandra') \
      .options(table='emp', keyspace='demo') \
      .load()
    
    ds.show(10) 
    

Solution 2:

Cassandra connector doesn't provide any Python modules. All functionality is provided with Data Source API and as long as required jars are present, everything should work out of the box.

How do I let Spark know where my Cassandra cluster is?

Use spark.cassandra.connection.host property. You can for exampel pass it as an argument for spark-submit / pyspark:

pyspark ... --conf spark.cassandra.connection.host=x.y.z.v

or set in your configuration:

(SparkSession.builder
    .config("cassandra.connection.host", "x.y.z.v"))

Configuration like table name or keyspace can be set directly on reader:

(spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv", keyspace="test", cluster="cluster")
    .load())

So you can follows Dataframes documentation.

As a side note

import com.datastax.spark.connector._

is a Scala syntax and is accepted in Python only accidentally.

Solution 3:

With username and password:

spark = SparkSession.builder \
  .appName('SparkCassandraApp') \
  .config('spark.cassandra.connection.host', 'localhost') \
  .config('spark.cassandra.connection.port', '9042') \
  .config("spark.cassandra.auth.username","cassandrauser")\
  .config("spark.cassandra.auth.password","cassandrapwd")\
  .master('local[2]') \
  .getOrCreate()

df = spark.read.format("org.apache.spark.sql.cassandra")\
   .options(table="tablename", keyspace="keyspacename").load()

df.show()

Post a Comment for "Spark With Cassandra Python Setup"