How To Resolve Pickle Error In Pyspark?
I am iterating through files to gather information about the values in their columns and rows into a dictionary. I have the following code, which works locally:
def search_nulls(file
Solution 1:
The source of your problem is the following line:
null_cols[str(m)] = defaultdict(lambda: 0)
As you can read in the What can be pickled and unpickled? section of the pickle module documentation:
The following types can be pickled:
- ...
- functions defined at the top level of a module (using def, not lambda)
- built-in functions defined at the top level of a module
- ...
It should be clear that lambda: 0 doesn't meet the above criteria. To make it work you can, for example, replace the lambda expression with int:
null_cols[str(m)] = defaultdict(int)
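As a quick illustration outside Spark, a plain pickle round trip already shows the difference (a minimal sketch; the variable and key names below are just for demonstration):

import pickle
from collections import defaultdict

# A defaultdict backed by a lambda cannot be pickled,
# because the lambda is not defined at the top level of a module.
broken = defaultdict(lambda: 0)
try:
    pickle.dumps(broken)
except (pickle.PicklingError, AttributeError) as exc:
    print("pickling failed:", exc)

# int is a built-in available at the top level, so this pickles fine
# and still yields 0 for missing keys.
working = defaultdict(int)
restored = pickle.loads(pickle.dumps(working))
print(restored["missing_key"])  # 0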
How is it possible that we can pass lambda expressions to higher-order functions in PySpark? The devil is in the details. PySpark uses different serializers depending on the context. To serialize closures, including lambda expressions, it uses a custom cloudpickle, which supports lambda expressions and nested functions. To handle data it uses the default Python tools.
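For contrast, this is why passing a lambda straight to an RDD transformation works: the closure is serialized with cloudpickle rather than the default pickler (a minimal sketch assuming a local SparkContext):

from pyspark import SparkContext

sc = SparkContext("local[1]", "cloudpickle-demo")

# The lambda below travels to the executors as part of the closure,
# which PySpark serializes with cloudpickle, so no pickling error occurs.
print(sc.parallelize([1, 2, 3]).map(lambda x: x * 2).collect())  # [2, 4, 6]

sc.stop()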
A few side notes:
- I wouldn't use Python file objects to read data. They are not portable and won't work beyond the local file system. You can use SparkContext.wholeTextFiles instead (see the sketch after this list).
- If you do, make sure you close the connections. Using a with statement is usually the best approach.
- You can safely strip newline characters before you split each line.