[Buildroot] [PATCH 08/19] support/check-uniq-files: decode as many strings as possible
Arnout Vandecappelle
arnout at mind.be
Fri Feb 8 20:42:23 UTC 2019
On 08/02/2019 18:25, Yann E. MORIN wrote:
> Arnout, All,
>
> On 2019-02-08 00:40 +0100, Arnout Vandecappelle spake thusly:
>> On 07/01/2019 23:05, Yann E. MORIN wrote:
>>> +# If possible, try to decode the binary string s with the user's locale.
>>> +# If s contains characters that can't be decoded with that locale, return
>>> +# the representation (in the user's locale) of the un-decoded string.
>>> +def str_decode(s):
>>> +    try:
>>> +        return s.decode()
>>> +    except UnicodeDecodeError:
>>> +        return repr(s)
>>
>> I think s.decode(errors='replace') is exactly what we want: it prints the
>> question mark character for things that can't be represented, just like ls does.
>
> In the case I used as example, i.e. œ (LATIN SMALL LIGATURE OE) as encoded
> in iso8859-15, i.e. \xbd (e.g. stored in a file named 'meh'), with python
> 2.7:
>
> >>> with open('meh', 'rb') as f:
> ...     lines = f.readlines()
> ...
> >>> lines
> ['\xbd\n']
> >>> lines[0].decode(errors='replace')
> u'\ufffd\n'
> >>> print('{}'.format(lines[0].decode(errors='replace')))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
Meh, Python 2 unicode handling always confuses the hell out of me...

So, to do it well, in Python 3 you need to do (after "import sys"):

    print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(), errors='replace'))

while in Python 2 the proper thing to do is:

    print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(),
          errors='replace').encode(sys.getfilesystemencoding(), errors='replace'))

(sys.getfilesystemencoding() makes sure we use the user's encoding, so stuff
that can be printed gets properly printed.)

I couldn't find a way to do the right thing in both Python 2 and Python 3...
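
Short of a single expression that behaves the same everywhere, an explicit
version check can at least wrap the two variants in one place. This is only a
minimal sketch, not what the patch does: the helper name "printable" is
hypothetical, and the fallback to 'utf-8' when sys.getfilesystemencoding()
returns None (which Python 2 may do on some Unix systems) is my assumption.
It also assumes the filesystem encoding matches what the terminal expects.

```python
import sys


def printable(raw, encoding=None):
    """Turn a byte string into something print() can emit on Python 2 and 3.

    'printable' is a hypothetical helper, not part of check-uniq-files.
    """
    # Assumption: fall back to UTF-8 if the user's encoding is unknown.
    encoding = encoding or sys.getfilesystemencoding() or 'utf-8'

    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode(encoding, errors='replace')

    if sys.version_info[0] < 3:
        # Python 2: hand print a byte string again; characters the encoding
        # cannot represent (such as U+FFFD in ASCII) are replaced by '?'.
        return text.encode(encoding, errors='replace')

    # Python 3: print() takes the text and encodes it for stdout itself.
    return text
```

Usage would then be print(printable(line)) for each raw line read from the
file, at the cost of carrying the version check around.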
Regards,
Arnout
> >>>
>
> The output with python3 is indeed what you believe will happen, but I
> don't think it is so nice:
>
> >>> lines
> [b'\xbd\n']
> >>> lines[0].decode(errors='replace')
> '�\n'
> >>> print('{}'.format(lines[0].decode(errors='replace')))
> �
>
> >>>
>
> And anyway, check-uniq-files should work with python 2.7, since it is part
> of the build tools, and python 2.7 is what we require.
>
> Regards,
> Yann E. MORIN.
>