Welcome to the ShenZhenJia Knowledge Sharing Community for programmers and developers - Open, Learning and Share
I'm using Spark 3.0.x in Java, and we have a transformation requirement that goes as follows:

A sample (hypothetical) Dataframe:

+==============+===========+
| UserCategory | Location  |
+==============+===========+
| Premium      | Mumbai    |
+--------------+-----------+
| Premium      | Bangalore |
+--------------+-----------+
| Gold         | Kolkata   |
+--------------+-----------+
| Premium      | Moscow    |
+--------------+-----------+
| Silver       | Tokyo     |
+--------------+-----------+
| Gold         | Sydney    |
+--------------+-----------+
| Bronze       | Dubai     |
+--------------+-----------+
| ...          | ...       |
+--------------+-----------+

Let's say we have a UserCategory column on which we need to perform a groupBy(). Then, on each row of the grouped data, we need to apply a certain function. The function depends on both columns of the dataframe: e.g., we might generate a CouponCode based on how many users there are in a specific UserCategory as well as each user's Location. The point is, there has to be a groupBy() before the row-by-row rule comes in. The expected output might be something like the following:

+==============+===========+============+
| UserCategory | Location  | CouponCode |
+==============+===========+============+
| Premium      | Mumbai    | IN01P      |
+--------------+-----------+------------+
| Premium      | Bangalore | IN02P      |
+--------------+-----------+------------+
| Premium      | Moscow    | RU03P      |
+--------------+-----------+------------+
| Gold         | Kolkata   | IN01G      |
+--------------+-----------+------------+
| Gold         | Sydney    | AU01G      |
+--------------+-----------+------------+
| Silver       | Tokyo     | JP01S      |
+--------------+-----------+------------+
| Bronze       | London    | UK01B      |
+--------------+-----------+------------+
| ...          | ...       | ...        |
+--------------+-----------+------------+

Now, if we were using PySpark, I could have leveraged a pandas UDF to achieve this functionality. But our stack is built on Java, so that's not a possibility. I was wondering what could be a way to achieve the same in Java.
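For what it's worth, the closest Java analogue to a pandas UDF seems to be Dataset.groupByKey() followed by flatMapGroups(), which hands you each group's rows as an iterator so you can number them within the group. A minimal sketch under that assumption; the toCountryCode() lookup is a hypothetical helper (not part of Spark), and the within-group row order is whatever Spark delivers unless you sort first:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapGroupsFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CouponCodes {

    // Hypothetical lookup from a location to a country code; in a real job
    // this would probably be a join against a reference table instead.
    static String toCountryCode(String location) {
        switch (location) {
            case "Mumbai": case "Bangalore": case "Kolkata": return "IN";
            case "Moscow": return "RU";
            default: return "XX";
        }
    }

    static Dataset<Row> withCouponCodes(Dataset<Row> df) {
        // Output schema = input schema + the generated CouponCode column.
        StructType outSchema = df.schema().add("CouponCode", DataTypes.StringType);

        return df
            // Group rows by UserCategory, keeping the full Row as the value.
            .groupByKey(
                (MapFunction<Row, String>) r -> r.getString(r.fieldIndex("UserCategory")),
                Encoders.STRING())
            // Walk each group's rows, assigning a running sequence number.
            .flatMapGroups(
                (FlatMapGroupsFunction<String, Row, Row>) (category, rows) -> {
                    List<Row> out = new ArrayList<>();
                    int seq = 0;
                    while (rows.hasNext()) {
                        Row r = rows.next();
                        seq++;
                        String location = r.getString(r.fieldIndex("Location"));
                        // e.g. "IN" + "01" + "P" -> "IN01P"
                        String code = String.format("%s%02d%s",
                                toCountryCode(location), seq, category.substring(0, 1));
                        out.add(RowFactory.create(category, location, code));
                    }
                    return out.iterator();
                },
                RowEncoder.apply(outSchema));
    }
}
```

Unlike a pandas UDF, flatMapGroups() deserializes rows out of Tungsten, so it costs more than a pure SQL-expression pipeline, but it gives you arbitrary per-group Java logic.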

I was looking at the Javadoc, and it seems a groupBy() on a dataframe returns a RelationalGroupedDataset. I then browsed through the methods available on the RelationalGroupedDataset class, and the only one that looks promising is the apply() function. Unfortunately, it doesn't have any documentation, and the method signature looks a bit intimidating:

public static RelationalGroupedDataset apply(Dataset<Row> df,
                                             scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs,
                                             RelationalGroupedDataset.GroupType groupType)

So, the question is: is that the way to go? If so, a dummy example of the apply() method would be much appreciated. Otherwise, if you could point me towards a different approach that achieves the same, that would also be fine.
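As far as I can tell, that apply() overload is a factory taking Catalyst Expression objects, i.e. an internal construction hook rather than a public per-group transform. If the per-group rule reduces to numbering rows within each UserCategory, a window function may sidestep it entirely. A sketch under that assumption; it presumes a countryCode column was already derived from Location upstream (hypothetical here), and orders within each group by Location just to make the numbering deterministic:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class CouponCodesWindow {

    static Dataset<Row> withCouponCodes(Dataset<Row> df) {
        // Number rows within each UserCategory; ordering by Location is an
        // arbitrary but deterministic choice for this sketch.
        WindowSpec byCategory = Window.partitionBy("UserCategory").orderBy("Location");

        // Zero-pad the per-group sequence number to two digits: 1 -> "01".
        Column seq = lpad(row_number().over(byCategory).cast("string"), 2, "0");

        // countryCode is assumed to exist already (e.g. from a lookup join);
        // the coupon is countryCode + sequence + first letter of the category.
        return df.withColumn("CouponCode",
                concat(col("countryCode"), seq, substring(col("UserCategory"), 1, 1)));
    }
}
```

Staying in SQL expressions like this keeps everything in Catalyst/Tungsten, so it should outperform a groupByKey()-based rewrite when the logic fits.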



1 Answer

Waiting for an expert to reply.
