Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I want to find all of the rows which have duplicates in the columns of city, round_latitude, and round_longitude. So, if two rows share the same values in each of those columns, it would be returned.

I'm not exactly sure what's going on here: I'm certain that there are duplicates in the dataset. No error is returned when running In[38]; the column names come back, but there are no rows. What am I doing wrong here? How can I fix this?

If it helps, I'm also working off of some of the code in this guide (an HTML page).

# In[29]:

import pandas as pd

def dl_by_loc(path):
    endname = "USA_downloads.csv"
    with open(path + endname, "r") as f:
        data = pd.read_csv(f)
        data.columns = ["date","city","coords","doi","latitude","longitude","round_latitude","round_longitude"]
        # count rows per (round_latitude, round_longitude, city) combination
        data = data.groupby(["round_latitude","round_longitude","city"]).count()
        data = data.rename(columns = {"date":"downloads"})
        return data["downloads"]


# In[30]:

downloads_by_coords = dl_by_loc(path)
len(downloads_by_coords)


# In[31]:

downloads_by_coords = downloads_by_coords.reset_index()
downloads_by_coords.columns = ["round_latitude","round_longitude","city","downloads"]


# In[32]:

downloads_by_coords.head()


# In[38]:

by_coords = downloads_by_coords.reset_index()
coord_dupes = by_coords[by_coords.duplicated(subset=["round_latitude","round_longitude","city"])]
coord_dupes

Here are a few lines from the data, as requested:

2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0


1 Answer

dl_by_loc(path) returns a Series with a MultiIndex:

round_latitude  round_longitude  city        
30.0            -95.0            Houston         1
40.0            -75.0            Philadelphia    3
Name: downloads, dtype: int64
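To see this concretely, here is a minimal sketch that reproduces that grouped Series; the four rows are assumed stand-ins for the sample lines in the question (three Philadelphia downloads, one Houston):

```python
import pandas as pd

# Hypothetical mini-dataset mirroring the sample rows in the question
data = pd.DataFrame({
    "date": ["t1", "t2", "t3", "t4"],
    "city": ["Philadelphia", "Philadelphia", "Philadelphia", "Houston"],
    "round_latitude": [40.0, 40.0, 40.0, 30.0],
    "round_longitude": [-75.0, -75.0, -75.0, -95.0],
})

# Same aggregation as dl_by_loc: each (lat, lon, city) key appears
# exactly once in the result's MultiIndex after grouping
counts = (data.groupby(["round_latitude", "round_longitude", "city"])
              .count()
              .rename(columns={"date": "downloads"})["downloads"])
print(counts)
```

Because each key combination occurs only once in `counts`, `duplicated()` on the reset frame has nothing to flag.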

If you take a look at the definition of that function, it groups the DataFrame by the round_latitude, round_longitude and city columns and counts the occurrences. Later on, you convert the result to a DataFrame by calling reset_index(). The downloads column now shows how many times each (lat, lon, city) combination occurred in the original DataFrame. Because groupby already aggregated those rows, no combination appears more than once in the result, so duplicated() finds nothing. To detect the combinations that occurred more than once in the original data, filter on the count instead:

by_coords[by_coords['downloads']>1]
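As a self-contained sketch of that filter (the two-row frame below is an assumed stand-in for the reset_index() output shown earlier):

```python
import pandas as pd

# Assumed aggregated frame, shaped like downloads_by_coords.reset_index()
by_coords = pd.DataFrame({
    "round_latitude": [30.0, 40.0],
    "round_longitude": [-95.0, -75.0],
    "city": ["Houston", "Philadelphia"],
    "downloads": [1, 3],
})

# Rows with downloads > 1 are exactly the key combinations that were
# duplicated in the original, pre-groupby data
dupes = by_coords[by_coords["downloads"] > 1]
print(dupes)
```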

Your method would still work on the original, un-aggregated DataFrame. Note that dropping duplicates or grouping on float columns carries some risk: values that print the same may differ by tiny amounts. pandas generally handles this, but to be safe, if you only need one-decimal precision, multiply by 10 and convert to integer before grouping.
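A short sketch of that precision trick, using made-up coordinate values that differ only by floating-point noise:

```python
import pandas as pd

# Three coordinates that are "equal" at one-decimal precision but not bit-for-bit
coords = pd.Series([40.0000000001, 40.0, 39.9999999999])

# Scale by 10 and round to integer so one-decimal values compare exactly
keys = (coords * 10).round().astype(int)
print(keys.tolist())  # all three now map to the same integer key
```

Grouping on `keys` (or on several such integer columns) avoids float-equality surprises entirely.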

