
How To Resolve Pickle Error In Pyspark?

I am iterating through files to gather information about the values in their columns and rows in a dictionary. I have the following code, which works locally: def search_nulls(file

Solution 1:

The source of your problem is the following line:

null_cols[str(m)] = defaultdict(lambda: 0)

As you can read in the What can be pickled and unpickled? section of the pickle module documentation:

The following types can be pickled:

  • ...
  • functions defined at the top level of a module (using def, not lambda)
  • built-in functions defined at the top level of a module
  • ...

It should be clear that lambda: 0 doesn't meet the above criteria. To make it work you can, for example, replace the lambda expression with int:

null_cols[str(m)] = defaultdict(int)
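
To see why this matters even outside of Spark, here is a minimal sketch using plain pickle and no PySpark at all; it only illustrates the difference between the two factories:

import pickle
from collections import defaultdict

# A defaultdict whose default_factory is a lambda cannot be pickled:
# pickle serializes functions by reference (module + name), and an
# anonymous lambda has no importable name.
try:
    pickle.dumps(defaultdict(lambda: 0))
except (pickle.PicklingError, AttributeError) as exc:
    print("lambda factory fails:", exc)

# int is a built-in defined at the top level, so it pickles fine,
# and int() returns 0, so the default value stays the same.
payload = pickle.dumps(defaultdict(int))
restored = pickle.loads(payload)
restored["missing_key"] += 1
print(restored)  # defaultdict(<class 'int'>, {'missing_key': 1})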

How is it possible, then, that we can pass lambda expressions to higher-order functions in PySpark? The devil is in the details. PySpark uses different serializers depending on the context. To serialize closures, including lambda expressions and nested functions, it uses a custom cloudpickle module, which supports both. To handle data it uses the default Python pickling tools.
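
If you want to see the two serializers side by side, here is a minimal sketch; it assumes the standalone cloudpickle package is importable (PySpark bundles its own copy as pyspark.cloudpickle):

import pickle
import cloudpickle

add_one = lambda x: x + 1

# The standard pickler serializes functions by reference, so an
# anonymous lambda fails.
try:
    pickle.dumps(add_one)
except (pickle.PicklingError, AttributeError) as exc:
    print("pickle:", exc)

# cloudpickle serializes functions by value (code object, closure,
# defaults), so lambdas and nested functions round-trip fine.
payload = cloudpickle.dumps(add_one)
restored = pickle.loads(payload)  # the output is a regular pickle stream
print("cloudpickle:", restored(41))  # 42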


A few side notes:

  • I wouldn't use Python file objects to read data. They are not portable and won't work beyond the local file system. You can use SparkContext.wholeTextFiles instead (see the sketch after this list).
  • If you do, make sure you close the files; using a with statement is usually the best approach.
  • You can safely strip newline characters before you split each line.
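
Putting these notes together, here is a rough sketch of how the Spark side could look; the path pattern, the comma delimiter, and the "NULL" marker are assumptions made for illustration, not details taken from the original code:

from collections import defaultdict

from pyspark import SparkContext

sc = SparkContext(appName="null-counts")

def count_nulls(path_and_content):
    """Count empty/NULL cells per column index for a single file."""
    path, content = path_and_content
    null_cols = defaultdict(int)  # int instead of lambda: 0, so it pickles
    for line in content.splitlines():
        for i, value in enumerate(line.strip().split(",")):  # delimiter is an assumption
            if value in ("", "NULL"):  # NULL marker is an assumption
                null_cols[i] += 1
    return path, dict(null_cols)

# wholeTextFiles yields (file_path, file_content) pairs, so this works on
# any file system Spark can read, not just the local one.
results = sc.wholeTextFiles("data/*.csv").map(count_nulls).collect()
for path, counts in results:
    print(path, counts)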
