Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

When I parse this XML with p = xml.parsers.expat.ParserCreate():

<name>Fortuna D&#252;sseldorf</name>

The character data passed to the parsing event handler includes u'\xfc'.

How can u'\xfc' be turned into u'ü'?


This is the main question in this post, the rest just shows further (ranting) thoughts about it

Isn't Python unicode broken, since u'\xfc' should yield u'ü' and nothing else? u'\xfc' is already a unicode string, so converting it to unicode again doesn't work! Converting it to ASCII doesn't work either.

The only thing that I found works is: (This cannot be intended, right?)

exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')

Replacing 8859 with utf-8 fails! What is the point of that?

Also, what is the point of the Python unicode HOWTO? It only gives examples of failures instead of showing the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.

Unicode is no magic - why do so many people here have issues?

The underlying problem of unicode conversion is dirt simple:

One bidirectional lookup table '\xFC' <-> u'ü'

unicode( 'Fortuna D\xfcsseldorf' )

What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?

Also, why did they make it not reversible?:

 >>> u'Fortuna Düsseldorf'.encode('utf-8')
 'Fortuna D\xc3\xbcsseldorf'
 >>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
 u'Fortuna D\xfcsseldorf'

1 Answer

You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.

In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worrying about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.

As such ü has been replaced by \xfc, because U+00FC is the Unicode codepoint for LATIN SMALL LETTER U WITH DIAERESIS.
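To check that the parser really delivers the character itself, here is a minimal sketch. It is written for Python 3 rather than the Python 2 used in the question, and it collects the handler's chunks into a list because expat may deliver character data in several pieces; the variable names are illustrative only.

```python
# Python 3 sketch; the XML snippet is the one from the question.
import unicodedata
import xml.parsers.expat

chunks = []  # expat may call the handler more than once per text node

parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = chunks.append
parser.Parse("<name>Fortuna D&#252;sseldorf</name>", True)

name = "".join(chunks)

# The escaped spellings and the literal character are the same one-character string.
assert "\xfc" == "\u00fc" == "ü"
assert name == "Fortuna D\xfcsseldorf"

# Index 9 is the character right after "Fortuna D".
print(unicodedata.name(name[9]))  # LATIN SMALL LETTER U WITH DIAERESIS
```

Nothing needs converting here: the escape sequence only appears when the string is shown through repr() in Python 2.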

If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:

>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf

If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:

>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
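The encoding step is also where the question's "not reversible" and "utf-8 fails" observations come from. The following is a rough Python 3 sketch (not the Python 2 session above), assuming the byte string in the question was Latin-1 encoded; it shows that each codec round-trips cleanly as long as the same codec is used in both directions.

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
text = "Fortuna D\xfcsseldorf"        # U+00FC is a single code point

utf8 = text.encode("utf-8")           # U+00FC becomes two bytes: 0xC3 0xBC
latin1 = text.encode("iso-8859-1")    # U+00FC becomes one byte: 0xFC

print(utf8)    # b'Fortuna D\xc3\xbcsseldorf'
print(latin1)  # b'Fortuna D\xfcsseldorf'

# Decoding with the matching codec restores the original text.
assert utf8.decode("utf-8") == text
assert latin1.decode("iso-8859-1") == text

# A lone 0xFC byte is not valid UTF-8, so mixing the codecs raises an error.
try:
    latin1.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

So the conversion is reversible; the decoded result merely prints with an escape when echoed in a Python 2 interpreter.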

The alternative is for you to upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
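As a small Python 3 illustration of that difference (a sketch only; the sample strings are made up for the example), repr() keeps the printable ü while ascii() escapes it, and control characters are escaped by both:

```python
# Python 3 sketch comparing repr() and ascii() on the same string.
text = "Fortuna D\xfcsseldorf"

shown = repr(text)     # keeps the printable non-ASCII character as-is
escaped = ascii(text)  # escapes every non-ASCII character, like Python 2's repr()

print(escaped)  # 'Fortuna D\xfcsseldorf'

assert "\xfc" in shown              # the real character is still in the repr
assert "\\xfc" in escaped           # here it became the four characters \xfc
assert len(escaped) == len(shown) + 3

# Control characters have no printable glyph, so repr() escapes them too.
assert repr("tab\there") == "'tab\\there'"
```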

