I have a DataFrame similar to this (this is just an example; the original DataFrame is in Spanish and too cumbersome to paste an excerpt of here):
date        city1   city2     ID    company
01-10-2020  Mexico  Mexico    1234  ColaCola
03-01-2020  Mexico  Baja      567   Cola cola
02-09-2020  Mexico  Culiacan  8900  Cola Cola Inc.
03-04-2020  Mexico  Tulum     2344  Cola Cola Inc
06-07-2020  Mexico  Ver       3459  Cola cola inc
I need to bring all those variations of the company's name under a single spelling:
date        city1   city2     ID    company
01-10-2020  Mexico  Mexico    1234  Cola Cola
03-01-2020  Mexico  Baja      567   Cola Cola
02-09-2020  Mexico  Culiacan  8900  Cola Cola
03-04-2020  Mexico  Tulum     2344  Cola Cola
06-07-2020  Mexico  Ver       3459  Cola Cola
I tried using:
df['company'] = df['company'].replace({'ColaCola': 'Cola Cola', 'Cola cola': 'Cola Cola'})
and so on. The problem is that there are many variations of the company's name (the original list is much longer): upper/lower case, spaces, typos, periods... you name it! Doing it manually would take me hours, so I need a better way. Then I came across fuzzywuzzy, but I can't get past the examples. I don't really get it.
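Part of the variation listed above (case, periods, repeated spaces) can probably be removed before any fuzzy matching. A minimal sketch of that idea, using only the standard library; the helper name `normalize_name` is made up for illustration, and the sample list just repeats the example rows:

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)       # remove periods, commas, etc.
    return re.sub(r"\s+", " ", name).strip()  # collapse and trim whitespace

# The five spellings from the example table
variants = ["ColaCola", "Cola cola", "Cola Cola Inc.", "Cola Cola Inc", "Cola cola inc"]
normalized = [normalize_name(v) for v in variants]
```

On the real data this would presumably be applied with df['company'].map(normalize_name). It shrinks the five spellings to three, but it cannot merge "colacola" with "cola cola" or drop the "inc" suffix, which is where fuzzy matching would come in.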
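With fuzzywuzzy, I think something like this could work:
from fuzzywuzzy import fuzz

for idx, name in df['company'].items():
    # compare each company name against the target spelling
    if fuzz.partial_ratio("Cola Cola", name) >= 90:
        df.at[idx, 'company'] = "Cola Cola"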
or something like this. Excuse me, I have never been able to use loops properly. Please help me.
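To make the intent concrete without depending on fuzzywuzzy, here is a minimal sketch of the same thresholding idea using the standard library's `difflib.SequenceMatcher` as a stand-in (a swapped-in technique, not fuzzywuzzy itself). The helper names, the 0.8 threshold, and the extra "Pepsi" row are assumptions chosen for illustration, not values from the real data:

```python
from difflib import SequenceMatcher

CANONICAL = "Cola Cola"  # the spelling to keep

def _clean(s: str) -> str:
    """Lowercase and drop spaces and periods before comparing."""
    return s.lower().replace(" ", "").replace(".", "")

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the cleaned forms of a and b."""
    return SequenceMatcher(None, _clean(a), _clean(b)).ratio()

def canonicalize(name: str, threshold: float = 0.8) -> str:
    """Return CANONICAL when name is close enough to it, else name unchanged."""
    return CANONICAL if similarity(name, CANONICAL) >= threshold else name

# Example spellings plus one hypothetical unrelated company
companies = ["ColaCola", "Cola cola", "Cola Cola Inc.",
             "Cola Cola Inc", "Cola cola inc", "Pepsi"]
cleaned = [canonicalize(c) for c in companies]
```

With a DataFrame, the same function could be applied as df['company'] = df['company'].map(canonicalize). The threshold would need tuning against the real names: set too low it merges different companies, set too high it misses typos.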
Question from: https://stackoverflow.com/questions/65863710/normalize-strings-of-a-column-with-fuzzy