I've seen a couple of questions like this, but not a satisfactory answer for my situation. Here is a sample DataFrame:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|283924|  1.5|   0|
|283924|  1.5|   1|
|982384|  3.0|   0|
|982384|  3.0|   1|
|892383|  2.0|   0|
|892383|  2.5|   1|
+------+-----+----+
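
For reproducibility, here is a minimal sketch that builds this sample DataFrame (assuming an active SparkSession named `spark`):

# `spark` is assumed to be an existing SparkSession
df = spark.createDataFrame(
    [(283924, 1.5, 0), (283924, 1.5, 1),
     (982384, 3.0, 0), (982384, 3.0, 1),
     (892383, 2.0, 0), (892383, 2.5, 1)],
    ["id", "value", "type"],
)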

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.

In this case:

  • Rows 1 and 2 are duplicates (again, we are ignoring the "type" column)
  • Rows 3 and 4 are duplicates, so only rows 5 and 6 should remain.

The output would be:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|892383|  2.5|   1|
|892383|  2.0|   0|
+------+-----+----+

I've tried:

df.dropDuplicates(subset = ['id', 'value'], keep = False)

But the "keep" argument isn't available in PySpark's dropDuplicates (as it is in pandas.DataFrame.drop_duplicates).
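
For comparison, a quick sketch of the pandas behaviour being referenced, using the same data:

import pandas as pd

pdf = pd.DataFrame({
    "id": [283924, 283924, 982384, 982384, 892383, 892383],
    "value": [1.5, 1.5, 3.0, 3.0, 2.0, 2.5],
    "type": [0, 1, 0, 1, 0, 1],
})

# keep=False drops every row that is duplicated on the subset columns
print(pdf.drop_duplicates(subset=["id", "value"], keep=False))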

How else could I do this?



1 Answer

You can do that using window functions:

from pyspark.sql import Window, functions as F

# Count how many rows share each (id, value) pair, keep only the rows
# whose pair occurs exactly once, then drop the helper column
df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
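
If you'd rather avoid window functions, an equivalent sketch is to count rows per (id, value) pair with an aggregation and join back, keeping only the pairs that occur once:

from pyspark.sql import functions as F

# Count occurrences of each (id, value) pair
counts = df.groupBy("id", "value").agg(F.count("*").alias("cnt"))

# Keep only the pairs that appear exactly once, then join back
# to recover the remaining columns (here, "type")
unique_keys = counts.where("cnt = 1").drop("cnt")
df.join(unique_keys, on=["id", "value"], how="inner").show()

Both versions return only the 892383 rows from the sample above.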
