Python Memory Error Encountered When Replacing NaN Values In Large Pandas Dataframe
I have a very large pandas DataFrame, called result_full, with ~300,000 columns and ~17,520 rows. When I attempt to replace all of the strings 'NaN' with numpy.nan, I get a memory error.
Solution 1:
One possible issue is that you are running on a 32-bit machine: a 32-bit Python process can only address roughly 2 GB of memory, so a DataFrame of this size will quickly exhaust it. If possible, move to a 64-bit machine (and a 64-bit Python build) to avoid these problems in the future.
In the meantime, there is a workaround. Write the DataFrame out to CSV with df.to_csv(). Once that's done, if you look at the pandas documentation for pd.read_csv(), you will notice this parameter:
na_values : scalar, str, list-like, or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
So read_csv will recognize the string 'NaN' as np.nan while loading the file, and your problem should be solved without doing the replacement in memory.
Likewise, if you are already creating this DataFrame from a CSV, you can pass this parameter directly to avoid the memory problem in the first place. Hope it helps. Cheers!
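Putting it together, here is a minimal sketch of that round trip (the tiny stand-in DataFrame and the 'temp.csv' path are only illustrative; result_full in the question is of course far larger):

import pandas as pd

# Tiny stand-in for the real result_full from the question, which is far larger.
result_full = pd.DataFrame({'a': ['1.0', 'NaN', '2.5'],
                            'b': ['NaN', '3.0', '4.0']})

# Write the DataFrame out to disk; 'temp.csv' is just a placeholder path.
result_full.to_csv('temp.csv', index=False)

# Read it back. 'NaN' is already in read_csv's default na_values list, so it is
# parsed straight into np.nan; passing it explicitly here is only for clarity.
result_full = pd.read_csv('temp.csv', na_values=['NaN'])

print(result_full.dtypes)  # both columns now come back as float64 with real NaNs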