When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The string passed to the character data event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
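For context, u'\xfc' and u'ü' denote the same one-character string: the escaped form is only how Python 2 displays the repr of a unicode object. The following is a minimal sketch rather than a definitive answer. The handler name on_char_data and the list chunks are hypothetical, and the code avoids non-ASCII source characters so that it should run on both Python 2 and Python 3.

```python
import xml.parsers.expat

chunks = []  # hypothetical collector for the character data

def on_char_data(data):
    # expat may deliver the text in several pieces, so collect them all
    chunks.append(data)

p = xml.parsers.expat.ParserCreate()
p.CharacterDataHandler = on_char_data

# Feed the document as UTF-8 bytes; \u00fc is the code point of u-umlaut
p.Parse(u'<name>Fortuna D\u00fcsseldorf</name>'.encode('utf-8'), True)

text = u''.join(chunks)

# \xfc and \u00fc are two spellings of the same character
same = (text == u'Fortuna D\xfcsseldorf')
length = len(text)
```

Under this reading no conversion is needed: print(text) shows the umlaut as long as the terminal encoding can represent it, while echoing the value at the interactive prompt shows the escaped repr.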
This is the main question in this post; the rest just shows further (ranting) thoughts about it.
Isn't Python unicode broken, since u'\xfc' should yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII doesn't work either.
The only thing I found that works is this (it cannot be intended, right?):
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.
Unicode is no magic. Why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
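One way to look at it: the lookup table described above does exist, but only for one particular encoding, Latin-1 (ISO 8859-1), where the byte 0xFC maps to the code point U+00FC. In Python 2, unicode() called without an encoding assumes ASCII, which has no entry for 0xFC. The sketch below names the codec explicitly and uses a bytes literal so that it should behave the same on Python 2 and 3. The variable names are illustrative only.

```python
raw = b'Fortuna D\xfcsseldorf'   # byte string; 0xFC is u-umlaut in Latin-1

text = raw.decode('latin-1')     # bytes -> unicode through the Latin-1 table
back = text.encode('latin-1')    # unicode -> bytes, the reverse lookup

# A lone 0xFC byte is not valid UTF-8, so that codec rejects the same input
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

This would also account for the exec experiment above succeeding with an 8859 codec and failing with utf-8: the byte string is Latin-1 data, not UTF-8 data.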
Also, why did they make it not reversible?
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
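Read with the escapes in mind, the session above is a lossless round trip: the last line is the repr of the original string, not a different value. A small check, written as a sketch that should run on Python 2 and 3 (the variable names are illustrative):

```python
original = u'Fortuna D\u00fcsseldorf'

encoded = original.encode('utf-8')   # u-umlaut becomes the two bytes 0xC3 0xBC
decoded = encoded.decode('utf-8')    # back to the one code point U+00FC

round_trip_ok = (decoded == original)
```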