Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have the following code:

from xml.etree import ElementTree

file_path = 'some_file_path'

document = ElementTree.parse(file_path, ElementTree.XMLParser(encoding='utf-8'))

If my XML looks like the following it gives me the error: "xml.etree.ElementTree.ParseError: not well-formed"

<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>

In sublime or Notepad++ I see highlighted characters such as ACK, DC4, or STX which seem to be the culprit (one of them appears as a "-" in the above xml in the second "text" node). If I remove these characters it works. What are these and how can I fix this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
269 views
Welcome To Ask or Share your Answers For Others

1 Answer

Running your code as follows, and it's working fine:

from xml.etree import ElementTree
from StringIO import StringIO 


xml_content = """<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""

print("parsing xml document")
# using StringIO to simulate reading from file  
document = ElementTree.parse(StringIO(xml_content), ElementTree.XMLParser(encoding='utf-8')) 

for elem in document.iter():
  print(elem.tag) 

And the output is as expected:

parsing xml document
pages
page
textbox
textline
text
text
text

So, the issue is how you are copying and pasting your file from notepad++, maybe it's adding some special characters so try with another editor.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...