
How To Resolve Pickle Error In Pyspark?

I am iterating through files to gather information about the values in their columns and rows in a dictionary. I have the following code, which works locally: def search_nulls(file

Solution 1:

The source of your problem is the following line:

null_cols[str(m)] = defaultdict(lambda: 0)

As you can read in the What can be pickled and unpickled? section of the pickle module documentation:

The following types can be pickled:

  • ...
  • functions defined at the top level of a module (using def, not lambda)
  • built-in functions defined at the top level of a module
  • ...

It should be clear that lambda: 0 doesn't meet the above criteria. To make it work you can, for example, replace the lambda expression with int:

null_cols[str(m)] = defaultdict(int)
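
To see why this matters even outside of Spark, here is a minimal sketch using plain pickle and no PySpark at all; it only illustrates the difference between the two factories:

import pickle
from collections import defaultdict

# A defaultdict whose default_factory is a lambda cannot be pickled:
# pickle serializes functions by reference (module + name), and an
# anonymous lambda has no importable name.
try:
    pickle.dumps(defaultdict(lambda: 0))
except (pickle.PicklingError, AttributeError) as exc:
    print("lambda factory fails:", exc)

# int is a built-in defined at the top level, so it pickles fine,
# and int() returns 0, so the default value stays the same.
payload = pickle.dumps(defaultdict(int))
restored = pickle.loads(payload)
restored["missing_key"] += 1
print(restored)  # defaultdict(<class 'int'>, {'missing_key': 1})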

How is it possible, then, that we can pass lambda expressions to higher-order functions in PySpark? The devil is in the details. PySpark uses different serializers depending on the context. To serialize closures, including lambda expressions and nested functions, it uses a custom cloudpickle module, which supports both. To handle data it uses the default Python pickling tools.
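
If you want to see the two serializers side by side, here is a minimal sketch; it assumes the standalone cloudpickle package is importable (PySpark bundles its own copy as pyspark.cloudpickle):

import pickle
import cloudpickle

add_one = lambda x: x + 1

# The standard pickler serializes functions by reference, so an
# anonymous lambda fails.
try:
    pickle.dumps(add_one)
except (pickle.PicklingError, AttributeError) as exc:
    print("pickle:", exc)

# cloudpickle serializes functions by value (code object, closure,
# defaults), so lambdas and nested functions round-trip fine.
payload = cloudpickle.dumps(add_one)
restored = pickle.loads(payload)  # the output is a regular pickle stream
print("cloudpickle:", restored(41))  # 42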


A few side notes:

  • I wouldn't use Python file objects to read data. They are not portable and won't work beyond the local file system. You can use SparkContext.wholeTextFiles instead (see the sketch after this list).
  • If you do, make sure you close the files; using a with statement is usually the best approach.
  • You can safely strip newline characters before you split each line.
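
Putting these notes together, here is a rough sketch of how the Spark side could look; the path pattern, the comma delimiter, and the "NULL" marker are assumptions made for illustration, not details taken from the original code:

from collections import defaultdict

from pyspark import SparkContext

sc = SparkContext(appName="null-counts")

def count_nulls(path_and_content):
    """Count empty/NULL cells per column index for a single file."""
    path, content = path_and_content
    null_cols = defaultdict(int)  # int instead of lambda: 0, so it pickles
    for line in content.splitlines():
        for i, value in enumerate(line.strip().split(",")):  # delimiter is an assumption
            if value in ("", "NULL"):  # NULL marker is an assumption
                null_cols[i] += 1
    return path, dict(null_cols)

# wholeTextFiles yields (file_path, file_content) pairs, so this works on
# any file system Spark can read, not just the local one.
results = sc.wholeTextFiles("data/*.csv").map(count_nulls).collect()
for path, counts in results:
    print(path, counts)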
