Monday, May 9, 2016

Spark Cassandra Connector and DataFrame

When you write a DataFrame to a Cassandra table, be careful with SaveMode.Overwrite. In spark-cassandra-connector 1.6.0-M2, this mode issues TRUNCATE $keyspace.$table before the write. See the code in CassandraSourceRelation.scala.
I observed something odd when using the following code to write a DataFrame to a Cassandra 2.1.8 cluster:
df.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Overwrite)
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .save()
After the scheduled Spark job finishes, the table appears empty in cqlsh when running select * from keyspace.table limit 10. The result is the same if I raise the read consistency level to QUORUM, or even ALL. Only after some time does the query start returning rows.
If I start the job manually from the command line, however, the query returns results most of the time.
The CQL documentation for TRUNCATE requires the consistency level to be set to ALL:
Note: The consistency level must be set to ALL prior to performing a TRUNCATE operation. All replicas must remove the data.
The connector's default output consistency level (spark.cassandra.output.consistency.level) is LOCAL_QUORUM, and I don't think the connector raises it before calling TRUNCATE $keyspace.$table. That might be the root cause.
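A possible workaround, sketched below on the assumption that all replicas are up (which TRUNCATE needs anyway): truncate the table yourself through CassandraConnector with the consistency level set to ALL on the statement, then write with SaveMode.Append so the connector does not issue its own TRUNCATE. Here sc is the SparkContext, and keyspace and table are the same values as above.

import com.datastax.driver.core.{ConsistencyLevel, SimpleStatement}
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SaveMode

// Truncate explicitly at CONSISTENCY ALL so every replica removes its data.
CassandraConnector(sc.getConf).withSessionDo { session =>
  val stmt = new SimpleStatement(s"TRUNCATE $keyspace.$table")
  stmt.setConsistencyLevel(ConsistencyLevel.ALL)
  session.execute(stmt)
}

// Append instead of Overwrite so the connector does not truncate again.
df.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .save()

This is only a sketch, not a verified fix; it trades the connector's built-in truncate for one whose consistency level you control.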
