Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I want to extract attributes from an XML file using Pig Latin.

This is a sample of the XML file:

<CATALOG>
<BOOK>
<TITLE test="test1">Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
</CATALOG>

I used this script but it didn't work:

REGISTER ./piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

A = LOAD './books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);

B = FOREACH A GENERATE XPath(x, 'BOOK/TITLE/@test'), XPath(x, 'BOOK/PRICE');
dump B;

The output was:

(,24.90)

I hope someone can help me with this. Thanks.

1 Answer

There are two bugs in piggybank's XPath class:

  1. The ignoreNamespace logic breaks searching for XML attributes: https://issues.apache.org/jira/browse/PIG-4751

  2. The ignoreNamespace parameter defaults to true and cannot be overridden: https://issues.apache.org/jira/browse/PIG-4752

Here is my workaround using XPathAll, passing false as the last argument (ignoreNamespace) so that the attribute lookup is not broken:

XPathAll(x, 'BOOK/TITLE/@test', true, false).$0 as (test:chararray)

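If you still need to ignore namespaces, match on local-name() instead. Note the double quotes inside the XPath expression, which avoid clashing with Pig's single-quoted string:

XPathAll(x, '//*[local-name()="BOOK"]//*[local-name()="TITLE"]/@test', true, false).$0 as (test:chararray)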
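
For context, here is a sketch of how the first workaround might fit into the script from the question. It assumes that your piggybank.jar ships org.apache.pig.piggybank.evaluation.xml.XPathAll (older builds may not), and the field names test and price are just illustrative. I have not run it against every Pig version, so treat it as a starting point rather than a verified script.

```pig
-- Sketch only: assumes piggybank.jar includes the XPathAll UDF.
REGISTER ./piggybank.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

-- Each <BOOK>...</BOOK> element is loaded as one chararray record.
A = LOAD './books.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('BOOK')
    AS (x:chararray);

-- XPathAll returns a tuple of matches; .$0 takes the first match.
-- Third argument: cache flag. Fourth argument: ignoreNamespace,
-- set to false here to sidestep the attribute bug (PIG-4751).
B = FOREACH A GENERATE
        XPathAll(x, 'BOOK/TITLE/@test', true, false).$0 AS (test:chararray),
        XPathAll(x, 'BOOK/PRICE', true, false).$0 AS (price:chararray);

DUMP B;
```

Both columns go through XPathAll here only for consistency; the plain XPath UDF already returned the PRICE element text in the question's output, so it should be fine to keep that call as it was.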
