Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I'm practising my importing and cleaning skills and have hit a snag. I've been importing a table from Wikipedia (the URL is in the code below). The import works and I have been able to drop the NaNs. However, certain observations are written with a year appended, for example 13.7 (2016). Because of that they are read in as strings, and even if they were parsed as numbers they would carry the wrong value.

I want to get rid of the years in parentheses but preserve the data value itself.

At present here is my code:

import numpy as np
import pandas as pd

# Declare the markers that should be treated as missing values
missing_values = ['?', np.nan]
# Read every table on the page into a list of DataFrames
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate', na_values=missing_values)
# Select the table of interest and drop the columns that are not needed
df = dfs[3]
df = df.drop(columns=['Year', 'Undetermined', 'Sources and notes'])

print(df)
# pd.to_numeric(df['Guns per 100 inhabitants'])

Any help appreciated!



1 Answer

Bit of a workaround, but you could clean it up by splitting the string by a space and then taking the first entry.

df['Guns per 100 inhabitants (clean)'] = np.array([float(s.split(' ')[0]) for s in df['Guns per 100 inhabitants'])

I tried it out with your example and there are still some errors: one entry is formatted '6.2-19.4', and some entries are already floats rather than strings, so s.split(' ') raises an AttributeError. Still, I think this solves the year-in-parentheses issue.
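
One possible way around the remaining errors is to cast every entry to a string first and pull out the first number with a regular expression. This is only a sketch: the helper name leading_number and the sample values are made up for illustration, and keeping the lower bound of a range such as '6.2-19.4' is an assumption you may not want.

```python
import numpy as np
import pandas as pd


def leading_number(series: pd.Series) -> pd.Series:
    """Return the first numeric token of each entry as a float (NaN if none)."""
    # Cast to str so entries that are already floats do not break the .str methods.
    text = series.astype(str)
    # Capture the first run of digits with an optional decimal part.
    # For a range like '6.2-19.4' this keeps only the lower bound (an assumption).
    first = text.str.extract(r'(\d+(?:\.\d+)?)', expand=False)
    # Entries without any digits (e.g. '?' or NaN) become NaN.
    return pd.to_numeric(first, errors='coerce')


# Hypothetical sample mimicking the kinds of entries described above.
sample = pd.Series(['13.7 (2016)', 4.5, '6.2-19.4', np.nan, '?'])
print(leading_number(sample).tolist())
```

On the real table this would be applied as df['Guns per 100 inhabitants (clean)'] = leading_number(df['Guns per 100 inhabitants']). For the sample above it should give 13.7, 4.5 and 6.2, with NaN for the last two entries.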

