
Python3 Qt Unicode File Name Problems

Similar to "QDir and QDirIterator ignore files with non-ASCII filenames" and "UnicodeEncodeError: 'latin-1' codec can't encode character". With regard to the second link above, I added

Solution 1:

You're right, the 123c filename is just wrong. The evidence shows that the name Python reports for the file on disk contains U+DCB4, a lone surrogate that is not a valid character on its own. When Python tries to print that character, it rightly complains that it can't. When Qt processes the character in test4 it can't handle it either, but instead of throwing an error it converts it to the Unicode REPLACEMENT CHARACTER U+FFFD. Obviously the new filename no longer matches what's on disk.

Python can also put the replacement character in a string instead of throwing an error if you do the conversion yourself and specify the proper error handling. The encode step needs 'surrogateescape', because a lone surrogate cannot be encoded with the default strict handler; the decode step then turns the invalid byte into U+FFFD:

filename = filename.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
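
A minimal sketch of how this might be used, assuming the name came from os.listdir() on Python 3. The helper name display_name is hypothetical and not part of Python or Qt. The idea is to keep the original string for file operations and show only the sanitized copy to the user, since the sanitized name no longer matches what is on disk.

```python
def display_name(filename):
    """Return a copy of filename that is safe to print or show in a UI."""
    # Turn surrogate-escaped characters back into the original bytes.
    raw = filename.encode('utf-8', 'surrogateescape')
    # Decode again, replacing anything that is not valid UTF-8 with U+FFFD.
    return raw.decode('utf-8', 'replace')


on_disk = '123c\udcb4.wav'      # name as listed by Python 3 in the question
shown = display_name(on_disk)   # for labels and logging only

print(ascii(shown))             # '123c\ufffd.wav'
```

Opening or renaming the file should still use on_disk, not shown.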

Solution 2:

Codes like "\udcb4" come from the surrogateescape error handler. It's a way for Python to preserve bytes that cannot be interpreted as valid UTF-8: each undecodable byte in the range 0x80 to 0xFF is mapped to a lone surrogate in the range U+DC80 to U+DCFF. When the string is encoded back to UTF-8 with the same handler, each such surrogate is turned back into its original byte, so "\udcb4" becomes 0xB4. Surrogate escape makes it possible to deal with any byte sequence in file names. But you need to be careful to use errors="surrogateescape" as documented in the Unicode HOWTO https://docs.python.org/3/howto/unicode.html
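
To illustrate the round trip, here is a small sketch using the byte name from the question. Treating the filesystem encoding as UTF-8 is an assumption.

```python
raw = b'123c\xb4.wav'   # bytes on disk; 0xB4 on its own is not valid UTF-8

# Decoding with surrogateescape maps the undecodable byte 0xB4 to U+DCB4.
name = raw.decode('utf-8', 'surrogateescape')
print(ascii(name))      # '123c\udcb4.wav'

# Encoding with the same handler restores the original bytes.
restored = name.encode('utf-8', 'surrogateescape')
print(restored == raw)  # True

# Without the handler, the lone surrogate cannot be encoded.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)        # False
```

On POSIX systems, os.fsdecode() and os.fsencode() apply the filesystem encoding with this error handler, so they are usually the more convenient way to convert file names.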

Solution 3:

Python2 vs Python3

python
Python 2.7.4 (default, Sep 26 2013, 03:20:56) 
>>> import os
>>> os.listdir('.')
['unicode.py', '123c\xb4.wav', '123b\xc3\x86.wav', '123a\xef\xbf\xbd.wav']
>>> os.path.exists(u'123c\xb4.wav')
False
>>> os.path.exists('123c\xb4.wav')
True

>>> n = '123c\xb4.wav'
>>> print(n)
123c�.wav
>>> n = u'123c\xb4.wav'
>>> print(n)
123c´.wav

That acute accent (´) on the last line above is what I've been looking for, as opposed to that �.

The same directory listed with Python3 shows a different set of filenames

python3
Python 3.3.1 (default, Sep 25 2013, 19:30:50) 
>>> import os
>>> os.listdir('.')
['unicode.py', '123c\udcb4.wav', '123bÆ.wav', '123a�.wav']
>>> os.path.exists('123c\udcb4.wav')
True

Is this a bug in Python3?
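
One way to check is to compare the two listings byte for byte. The sketch below copies the names from the two sessions above and assumes the filesystem encoding is UTF-8 with the surrogateescape error handler. If the comparison holds, the Python 3 names are a different view of the same bytes on disk rather than a different set of files.

```python
# Names as printed by Python 2 (byte strings) in the session above.
py2_names = [b'unicode.py', b'123c\xb4.wav', b'123b\xc3\x86.wav', b'123a\xef\xbf\xbd.wav']

# Names as printed by Python 3 (str); '\ufffd' is the character shown as � above.
py3_names = ['unicode.py', '123c\udcb4.wav', '123bÆ.wav', '123a\ufffd.wav']

# Assumption: UTF-8 filesystem encoding with the surrogateescape handler.
as_bytes = [name.encode('utf-8', 'surrogateescape') for name in py3_names]

print(as_bytes == py2_names)   # True
```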
