I'm doing some web scraping with Node.js. I'd like to use XPath, since I can generate it semi-automatically with several kinds of GUI tools. The problem is that I cannot find a way to do this effectively.

- `jsdom` is extremely slow. It takes a minute or so to parse a 500 KiB file, with full CPU load and a heavy memory footprint.
- Popular libraries for HTML parsing (e.g. `cheerio`) neither support XPath nor expose a W3C-compliant DOM.
- Effective HTML parsing is, obviously, implemented in WebKit, so using `phantom` or `casper` would be an option, but those have to be run in a special way, not just with `node <script>`. I cannot accept the risk implied by this change. For example, it's much more difficult to find out how to run `node-inspector` with `phantom`.
- `Spooky` is an option, but it's buggy enough that it didn't run at all on my machine.

What's the right way to parse an HTML page with XPath then?