Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a sample input dataframe as below, but the number of value columns (the columns whose names start with m) can be arbitrary (n columns).

customer_id|month_id|m1  |m2 |m3 .......m_n
1001       |  01    |10  |20    
1002       |  01    |20  |30    
1003       |  01    |30  |40
1001       |  02    |40  |50    
1002       |  02    |50  |60    
1003       |  02    |60  |70
1001       |  03    |70  |80    
1002       |  03    |80  |90    
1003       |  03    |90  |100

Now I have to create new columns holding the cumulative sum of each value column over the months for each customer, so I have used a window function. Since there can be n value columns, I would rather not call withColumn in a for loop; instead I need to build a query or list dynamically and pass it to selectExpr to calculate the new columns.

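For example:

from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Partition by customer_id (the sample dataframe has no partner_id column)
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
df = df.select("*", F.sum(F.col("m1")).over(rownum_window).alias("n1"))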

But I want to prepare the expressions dynamically and then pass them to the dataframe select. How can I do that?

Something like:

expr = ['F.sum(col("m1")).over(rownum_window).alias("n1")', 'F.sum(col("m2")).over(rownum_window).alias("n2")', 'F.sum(col("m3")).over(rownum_window).alias("n3")', .......]
df = df.select("*", expr)

Or is there any other way to build the select expression for a dataframe select?

Output:

customer_id|month_id|m1     |m2    |n1   |n2  
1001       |  01    |10     |20    |10   |20  
1002       |  01    |20     |30    |20   |30  
1003       |  01    |30     |40    |30   |40  
1001       |  02    |40     |50    |50   |70  
1002       |  02    |50     |60    |70   |90
1003       |  02    |60     |70    |90   |110  
1001       |  03    |70     |80    |120  |150
1002       |  03    |80     |90    |150  |180
1003       |  03    |90     |100   |180  |210


1 Answer

With a slight modification to @Lamanus's suggestion, the code below might help solve your problem:

# pyspark --driver-memory 1G --executor-memory 2G --executor-cores 1 --num-executors 1
from pyspark.sql import Row
from pyspark.sql.functions import *
from pyspark.sql.window import Window

drow = Row("customer_id","month_id","m1","m2","m3","m4")
data = [
    drow("1001","01","10","20","10","20"), drow("1002","01","20","30","20","30"), drow("1003","01","30","40","30","40"),
    drow("1001","02","40","50","40","50"), drow("1002","02","50","60","50","60"), drow("1003","02","60","70","60","70"),
    drow("1001","03","70","80","70","80"), drow("1002","03","80","90","80","90"), drow("1003","03","90","100","90","100"),
]

df = spark.createDataFrame(data)
df.show()
'''
+-----------+--------+---+---+---+---+
|customer_id|month_id| m1| m2| m3| m4|
+-----------+--------+---+---+---+---+
|       1001|      01| 10| 20| 10| 20|
|       1002|      01| 20| 30| 20| 30|
|       1003|      01| 30| 40| 30| 40|
|       1001|      02| 40| 50| 40| 50|
|       1002|      02| 50| 60| 50| 60|
|       1003|      02| 60| 70| 60| 70|
|       1001|      03| 70| 80| 70| 80|
|       1002|      03| 80| 90| 80| 90|
|       1003|      03| 90|100| 90|100|
+-----------+--------+---+---+---+---+
'''


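a = ["m1","m2"]  # columns to sum
b = ["m3","m4"]  # columns to average
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
# Build the list from a and b as Column objects (not strings), so it works for any number of columns.
# sum, avg and col are the pyspark.sql.functions versions brought in by the star import above.
expr = (["*"]
        + [sum(col(c)).over(rownum_window).alias("sum" + str(i)) for i, c in enumerate(a, 1)]
        + [avg(col(c)).over(rownum_window).alias("avg" + str(i)) for i, c in enumerate(b, 1)])
df.select(expr).show()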

'''
+-----------+--------+---+---+---+---+-----+-----+----+----+
|customer_id|month_id| m1| m2| m3| m4| sum1| sum2|avg1|avg2|
+-----------+--------+---+---+---+---+-----+-----+----+----+
|       1003|      01| 30| 40| 30| 40| 30.0| 40.0|30.0|40.0|
|       1003|      02| 60| 70| 60| 70| 90.0|110.0|45.0|55.0|
|       1003|      03| 90|100| 90|100|180.0|210.0|60.0|70.0|
|       1002|      01| 20| 30| 20| 30| 20.0| 30.0|20.0|30.0|
|       1002|      02| 50| 60| 50| 60| 70.0| 90.0|35.0|45.0|
|       1002|      03| 80| 90| 80| 90|150.0|180.0|50.0|60.0|
|       1001|      01| 10| 20| 10| 20| 10.0| 20.0|10.0|20.0|
|       1001|      02| 40| 50| 40| 50| 50.0| 70.0|25.0|35.0|
|       1001|      03| 70| 80| 70| 80|120.0|150.0|40.0|50.0|
+-----------+--------+---+---+---+---+-----+-----+----+----+
'''
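
If you do not want to type the column lists by hand, the value columns can probably be picked up from df.columns instead. Below is a minimal sketch of that idea, not a tested production solution. It assumes the value columns are named exactly m1, m2, ... (which also keeps month_id out of the match) and that the new columns should be called n1, n2, ... as in the question. The helper names cumulative_aliases and with_cumulative_sums are made up for this example.

```python
import re

# Assumption: value columns are named m1, m2, ...; this pattern also skips month_id.
VALUE_COL = re.compile(r"^m(\d+)$")


def cumulative_aliases(columns):
    """Pair each value column (m1, m2, ...) with its output name (n1, n2, ...)."""
    pairs = []
    for name in columns:
        match = VALUE_COL.match(name)
        if match:
            pairs.append((name, "n" + match.group(1)))
    return pairs


def with_cumulative_sums(df, partition_col="customer_id", order_col="month_id"):
    """Return df plus one running-sum column per detected value column."""
    # Imported inside the function so the naming helper above can be
    # exercised on its own, without Spark being installed.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    window = (
        Window.partitionBy(partition_col)
        .orderBy(order_col)
        .rangeBetween(Window.unboundedPreceding, Window.currentRow)
    )
    # One Column expression per value column; no string expressions needed.
    exprs = [
        F.sum(F.col(src)).over(window).alias(dst)
        for src, dst in cumulative_aliases(df.columns)
    ]
    return df.select("*", *exprs)
```

With the dataframe built above, with_cumulative_sums(df).show() should add n1 to n4 next to m1 to m4. Because the sample values are strings, the sums come back as doubles, just like sum1 and sum2 in the output above; cast the columns first if you need integers.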
