Python - Finding Unicode/ascii Problems
Solution 1:
The answer is actually quite simple: as soon as you read data from your file, convert it to unicode using the file's encoding, and handle the UnicodeDecodeError exception:
try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError as e:
    # the exception message names the offending byte and its position
    print "The error is there!", e
This will save you a lot of trouble: you won't have to worry about multibyte character encodings, and external libraries (including xlwt) will just do The Right Thing if they need to write the data.
Python 3 separates text from bytes and makes this decoding step explicit, so it's a good idea to get into the habit now.
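For readers already on Python 3, the same idea might look roughly like the sketch below. It is an illustration rather than part of the original answer: the helper name decode_or_report is hypothetical, and it assumes the data arrives as bytes. It uses the start, end, object and reason attributes that UnicodeDecodeError carries to say where decoding failed.

```python
def decode_or_report(raw, encoding="utf-8"):
    """Return (text, None) on success, or (None, message) describing the bad bytes.

    Hypothetical helper for illustration; assumes `raw` is a bytes object.
    """
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as e:
        # e.object is the input; e.start/e.end delimit the undecodable slice.
        bad = e.object[e.start:e.end]
        message = "cannot decode %r at byte offset %d (%s)" % (bad, e.start, e.reason)
        return None, message
```

Returning the message instead of printing it keeps the helper easy to call from a loop over many rows or files.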
Solution 2:
The csv module (in Python 2) doesn't support unicode input or null characters. You might be able to replace them by doing something like this, though (replace 'utf-8' with the encoding your CSV data is actually in):
import codecs
import csv
class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')
    def close(self):
        self.f.close()
    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?')  # Can't handle null characters!
            yield y
f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
If you want to find the positions of the characters which can't be handled by the csv module, you could do something like this:
import codecs
lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
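The position-finding loop above can also be packaged as a small generator, sketched here for Python 3. The function name find_problem_chars is hypothetical, and unlike the loop above it counts lines from 1 (columns still start at 0). It accepts any iterable of already-decoded text lines.

```python
def find_problem_chars(lines):
    """Yield (line_number, column, code_point) for each non-ASCII or NUL character.

    Hypothetical helper: `lines` is any iterable of decoded text lines.
    Line numbers are 1-based, columns are 0-based.
    """
    for lineno, line in enumerate(lines, 1):
        for col, ch in enumerate(line):
            if ch == "\0" or ord(ch) > 127:
                yield lineno, col, ord(ch)
```

It could be fed a file opened with open(PATH, encoding="utf-8"), or a plain list of strings when experimenting.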
Alternatively, you could use this CSV opener I wrote, which can handle Unicode characters:
import codecs
def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0:
            StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line:
                if c==Qualifier and c==cB41==cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c==Qualifier:
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode:
                    # Not in qual mode and delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line:
                cB42 = cB41; cB41 = c
                if c in Delims:
                    # Delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
            yield L
for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...
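On Python 3 the csv module reads text rather than bytes, so a hand-written parser may not be needed there. The sketch below is an assumption-laden illustration, not the author's code: the function name read_csv_rows is hypothetical, and it replaces null characters before parsing because some versions of the csv module reject them.

```python
import csv

def read_csv_rows(path, encoding="utf-8", errors="replace"):
    """Yield rows from a CSV file, decoding it with the given encoding.

    Hypothetical helper; undecodable bytes are handled according to `errors`.
    """
    # newline="" leaves line-ending handling to the csv module.
    with open(path, newline="", encoding=encoding, errors=errors) as f:
        # Swap NUL characters for '?' before the csv module sees them.
        cleaned = (line.replace("\0", "?") for line in f)
        for row in csv.reader(cleaned):
            yield row
```

With errors="replace", bytes that cannot be decoded show up as U+FFFD in the output, which makes the damaged cells easy to spot afterwards.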
Solution 3:
You can refer to code snippets in the question below to get a csv reader with unicode encoding support:
Solution 4:
PLEASE give the full traceback that you got along with the error message. When we know where you are getting the error (reading CSV file, "doing work on that data set", or in writing an XLS file using xlwt), then we can give a focused answer.
It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?
To find where the problems (not necessarily errors) are, try a little script like this (untested):
import sys, glob
for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex+1, filepath)
                print repr(line)
It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
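To collect such samples under Python 3, the scan can be done on raw bytes so that no encoding has to be guessed up front. This is a sketch under that assumption, and the function name find_non_ascii_lines is hypothetical:

```python
def find_non_ascii_lines(filepath):
    """Return (line_number, raw_line) pairs for lines containing bytes >= 0x80.

    Hypothetical helper; the file is read in binary mode, so nothing is decoded.
    """
    hits = []
    with open(filepath, "rb") as f:
        for linex, line in enumerate(f, 1):
            # Iterating over a bytes object yields integers in Python 3.
            if any(b >= 0x80 for b in line):
                hits.append((linex, line))
    return hits
```

Printing repr() of each returned line shows the offending bytes as escapes such as \xe9, which is usually enough to guess the encoding.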
I'm curious about using "csv.reader to pull in info from a very long sheet" -- what kind of "sheet"? Do you mean that you are saving an XLS file as CSV, then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text which you can give straight to xlwt, avoiding any encode/decode problems.
Have you worked through the tutorial from the python-excel.org site?