I run a Stanford CoreNLP Server with the following command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
I try to parse the sentence "Who was Darth Vader’s son?". Note that the apostrophe after Vader is not an ASCII character.
The online demo successfully parses the sentence:
The server I run on localhost fails:
I also tried to perform the query using Python.
import requests
url = 'http://localhost:9000/'
sentence = 'Who was Darth Vader’s son?'
properties = '{"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}'
r = requests.post(url, params={'properties': properties}, data=sentence.encode('utf8'))
tree = r.json()
The last command raises an exception:
ValueError: Invalid control character at: line 1 column 1172 (char 1171)
However, I noticed occurrences of the character \x00 in the text (i.e. r.text). If I remove them, the JSON parsing succeeds:
import json
tree = json.loads(r.text.replace('\x00', ''))
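To see how many NUL characters are involved, the workaround can be wrapped in a small helper that also counts them. This is only a sketch: loads_without_nul is a hypothetical name, and the payload below is made up for illustration, not actual server output.

```python
import json

def loads_without_nul(text):
    """Remove NUL characters (U+0000), then parse the remainder as JSON.

    Returns the parsed object and the number of NULs that were removed.
    """
    removed = text.count('\x00')
    return json.loads(text.replace('\x00', '')), removed

# Made-up payload standing in for r.text.
tree, removed = loads_without_nul('{"word": "Vader\x00s"}')
```

With the real response, the call would be loads_without_nul(r.text).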
Finally, r.encoding is ISO-8859-1, even though I did not use the option -strict to run the server. Note that it does not change anything if I manually replace it with UTF-8.
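Since r.text depends on r.encoding, one way to take the reported encoding out of the picture is to look at the raw bytes (r.content) directly. The following is a minimal sketch; inspect_body is a hypothetical helper, and sample is a stand-in for r.content rather than real server output.

```python
def inspect_body(raw: bytes) -> dict:
    """Summarise how a raw response body decodes under two charsets."""
    return {
        # NUL bytes present before any decoding takes place
        "nul_bytes": raw.count(b"\x00"),
        # Undecodable sequences become U+FFFD instead of raising
        "as_utf8": raw.decode("utf-8", errors="replace"),
        # ISO-8859-1 maps every byte to one character, so it never fails
        "as_latin1": raw.decode("iso-8859-1"),
    }

# Stand-in for r.content: the UTF-8 bytes of a word with the non-ASCII apostrophe.
sample = "Vader’s".encode("utf-8")
report = inspect_body(sample)
```

If nul_bytes is non-zero for the real response, the NULs are already in the bytes the server sent and are not introduced by decoding on the client.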
If I run the same code replacing url = 'http://localhost:9000/' with url = 'http://corenlp.run/', then everything succeeds. The call r.json() returns a dict, r.encoding is indeed UTF-8, and there is no \x00 character in the text.
What is wrong with the CoreNLP server I run?