Spark Java Error: Size Exceeds Integer.MAX_VALUE
Solution 1:
The Integer.MAX_VALUE restriction is on the size of a single block being stored, not on your dataset as a whole. 1.2M rows is not a big thing, so I'm not sure your problem is "the limits of Spark". More likely, some part of your job is creating something too big to be handled by any given executor.
I'm no Python coder, but when you "hashed the features of the records" you might be taking a very sparse set of records for a sample and creating a non-sparse (dense) array. That means a lot of memory for 16384 features, particularly when you do zip(line[1].indices, line[1].data). The only reason that doesn't run you out of memory right there is the huge amount of it you seem to have configured (50G).
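If that's what is happening, one way out is to keep each record sparse end to end. The sketch below is only my guess at the shape of your data (scipy-style rows exposing .indices/.data, hashed into 16384 buckets); it builds an MLlib SparseVector instead of expanding a dense array, so memory scales with the non-zero entries only:

```python
# Hedged sketch, not the asker's exact code: keep each hashed record sparse
# instead of expanding it into a dense 16384-slot array.
from pyspark.mllib.linalg import Vectors

NUM_FEATURES = 16384  # hashing space taken from the question

def to_sparse(line):
    # line[1] is assumed to be a scipy CSR-style row with .indices / .data;
    # only the non-zero entries are kept, so memory grows with the number of
    # active features per record, not with 16384.
    pairs = sorted(zip(line[1].indices, line[1].data))
    indices = [int(i) for i, _ in pairs]
    values = [float(v) for _, v in pairs]
    return Vectors.sparse(NUM_FEATURES, indices, values)

# usage: records.map(to_sparse) instead of materializing dense rows
```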
Another thing that might help is to increase the partitioning. If you can't make your rows use less memory, at least you can try having fewer rows in any given task. Any temporary files being created are likely to depend on this, so you'll be less likely to hit size limits.
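A minimal sketch of what I mean, with illustrative numbers (the RDD name and the 4x factor are assumptions, not from the post):

```python
# Spread the same rows over more partitions so each task (and any temporary
# file it writes) handles a smaller slice of the data.
current = records.getNumPartitions()               # check what you have now
repartitioned = records.repartition(current * 4)   # e.g. 4x more, smaller tasks

# For shuffles you can also raise spark.default.parallelism (RDD API) or
# spark.sql.shuffle.partitions (DataFrame API) in your Spark config.
```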
And, totally unrelated to the error but relevant for what you are trying to do:
16384 is indeed a big number of features. Even in the optimistic case where each one is just a boolean feature, you have a total of 2^16384 possible combinations to learn from, which is a huge number (try it here: https://defuse.ca/big-number-calculator.htm).
It is VERY, VERY likely that no algorithm will be able to learn a decision boundary from just 1.2M samples; you would probably need at least a few trillion trillion examples to make a dent in such a feature space. Machine learning has its limitations, so don't be surprised if you don't get better-than-random accuracy.
I would definitely recommend trying some sort of dimensionality reduction first!!
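For example, here is a hedged sketch of one option, MLlib's RDD-based PCA, projecting the 16384-dimensional vectors down to a few hundred components before training (the variable names and k=256 are placeholders, not values from the post):

```python
from pyspark.mllib.feature import PCA

# `vectors` is assumed to be an RDD of mllib Vectors like the sparse ones above
pca = PCA(k=256)                    # 256 components is an arbitrary starting point
model = pca.fit(vectors)
reduced = model.transform(vectors)  # RDD of 256-dimensional vectors

# Cheaper alternatives: hash into fewer buckets up front, e.g.
# HashingTF(numFeatures=1024), or ChiSqSelector if you have labels.
```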
Solution 2:
At some point it tries to store the features, and 1.2M * 16384 is greater than Integer.MAX_VALUE, so you are trying to store more elements than a single array in Spark can hold.
You're probably running into the limits of Apache Spark.
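The arithmetic behind that claim, as a quick sanity check:

```python
rows = 1200000
features = 16384
entries = rows * features      # 19,660,800,000 matrix entries
INT_MAX = 2147483647           # java.lang.Integer.MAX_VALUE
print(entries > INT_MAX)       # True -> too big for a single int-indexed array
```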
Solution 3:
Note that increasing the number of partitions may cause "Active Tasks" to show a negative number in the Spark UI, which probably means that the number of partitions is too high.