I have a sample input dataframe as below, but the number of value columns (the ones starting with m) can vary, up to n.
customer_id|month_id|m1 |m2 |m3 .......m_n
1001 | 01 |10 |20
1002 | 01 |20 |30
1003 | 01 |30 |40
1001 | 02 |40 |50
1002 | 02 |50 |60
1003 | 02 |60 |70
1001 | 03 |70 |80
1002 | 03 |80 |90
1003 | 03 |90 |100
Now I have to create new columns holding the cumulative sum of each value column over the months for each customer, so I have used a window function. Since I will have n value columns, instead of calling withColumn in a for loop I need to build a query or list dynamically and pass it to select or selectExpr to calculate the new columns.
For example:
from pyspark.sql import Window, functions as F
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
df = df.select("*", F.sum(F.col("m1")).over(rownum_window).alias("n1"))
But I want to build the expressions dynamically and then pass them to the dataframe select. How can I do that?
Something like: expr = ['F.sum(col("m1")).over(rownum_window).alias("n1")', 'F.sum(col("m2")).over(rownum_window).alias("n2")', 'F.sum(col("m3")).over(rownum_window).alias("n3")', .......]
df = df.select("*", expr)
Or is there any other way to build the select expression for the dataframe?
Expected output:
customer_id|month_id|m1 |m2 |n1 |n2
1001 | 01 |10 |20 |10 |20
1002 | 01 |20 |30 |20 |30
1003 | 01 |30 |40 |30 |40
1001 | 02 |40 |50 |50 |70
1002 | 02 |50 |60 |70 |90
1003 | 02 |60 |70 |90 |110
1001 | 03 |70 |80 |120 |150
1002 | 03 |80 |90 |150 |180
1003 | 03 |90 |100 |180 |210
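One possible way to get the output above is to generate the SQL window expressions as plain strings and unpack the list into selectExpr. This is only a sketch under some assumptions: the helper names cumulative_sum_exprs and add_cumulative_sums are made up for illustration, the partition column is assumed to be customer_id (the column shown in the sample data), every column other than customer_id and month_id is treated as a value column, and the column names are assumed to be simple identifiers that need no backtick quoting.

```python
def cumulative_sum_exprs(columns, partition_col="customer_id", order_col="month_id"):
    """Build one SQL window expression per value column (m1 -> n1, m2 -> n2, ...)."""
    # Same frame as rangeBetween(Window.unboundedPreceding, 0) in the DataFrame API.
    window = (
        f"PARTITION BY {partition_col} ORDER BY {order_col} "
        "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"
    )
    # The output name swaps the leading "m" for "n" (an assumed naming rule).
    return [f"sum({c}) OVER ({window}) AS n{c[1:]}" for c in columns]


def add_cumulative_sums(df, key_cols=("customer_id", "month_id")):
    """Append the cumulative-sum columns to df (expected to be a Spark DataFrame)."""
    # Filter by key columns rather than by the "m" prefix,
    # because month_id also starts with "m".
    value_cols = [c for c in df.columns if c not in key_cols]
    return df.selectExpr("*", *cumulative_sum_exprs(value_cols))
```

With the dataframe from the question this would be called as df = add_cumulative_sums(df). If Column objects are preferred over SQL strings, the same idea should work with select, for example df.select("*", *[F.sum(c).over(rownum_window).alias("n" + c[1:]) for c in value_cols]), since select accepts unpacked Column expressions.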