Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


I have a df similar to this (this is just an example; the original df is in Spanish, and it is cumbersome to copy and paste an excerpt here):

date          city1  city2      ID    company 
01-10-2020   Mexico  Mexico    1234   ColaCola
03-01-2020   Mexico  Baja      567    Cola cola
02-09-2020   Mexico  Culiacan  8900   Cola Cola Inc.
03-04-2020   Mexico  Tulum     2344   Cola Cola Inc
06-07-2020   Mexico  Ver       3459   Cola cola inc

So, I need all those variations of the company's name normalized to a single one:

    date          city1  city2      ID    company 
    01-10-2020   Mexico  Mexico    1234   Cola Cola
    03-01-2020   Mexico  Baja      567    Cola Cola
    02-09-2020   Mexico  Culiacan  8900   Cola Cola 
    03-04-2020   Mexico  Tulum     2344   Cola Cola 
    06-07-2020   Mexico  Ver       3459   Cola Cola 

I tried using:

df['company'].str.replace({'ColaCola': 'Cola Cola', 'Cola cola':'Cola Cola'})

and so on. The problem is that there are a lot of variations of the company's name (the original list is far longer): upper/lower case, spaces, typos, periods... you name it! Doing it manually would take me hours, so I need a better way. Then I came across fuzzywuzzy, but I can't get past the examples. I don't really get it.

I think something like this could work:

for row in df.company:
      fuzz.partial_ratio("Cola Cola": "str.row")
    if fuzz.partial_ratio >= 90:
    "str.row" = "Cola Cola"

or something like this. Excuse me, I have never been able to use loops correctly. Please help me.

Question from: https://stackoverflow.com/questions/65863710/normalize-strings-of-a-column-with-fuzzy


1 Answer

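import re

obj = df['company']

# Inspect the distinct spellings in `company` and how often each occurs
obj.value_counts().sort_index()

# Match the part all variants share: "cola", optional whitespace, "cola" (case-insensitive)
cond = obj.str.contains(r'cola\s*cola', flags=re.IGNORECASE, na=False)
df.loc[cond, 'NAME_new'] = 'Cola Cola'
...
# For each other company, find the part its variants share (and that is unique to it) and rename it the same way

print(df)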

0        date   city1     city2    ID         company   NAME_new
1  01-10-2020  Mexico    Mexico  1234        ColaCola  Cola Cola
2  03-01-2020  Mexico      Baja   567       Cola cola  Cola Cola
3  02-09-2020  Mexico  Culiacan  8900  Cola Cola Inc.  Cola Cola
4  03-04-2020  Mexico     Tulum  2344   Cola Cola Inc  Cola Cola
5  06-07-2020  Mexico       Ver  3459   Cola cola inc  Cola Cola
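
A regex like the one above handles case and spacing variants, but not typos. The loop idea from the question could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy (whose `fuzz.partial_ratio(a, b)` takes two strings and returns a score from 0 to 100), and the helper names `simplify` and `normalize_company`, the 0.9 cutoff, and the sample names "Colla Cola" and "Fizzy Pop" are hypothetical, not part of the original data.

```python
import re
from difflib import SequenceMatcher

CANONICAL = "Cola Cola"  # target spelling from the question
THRESHOLD = 0.9          # assumed cutoff, mirroring the question's "90"


def simplify(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def normalize_company(name: str, canonical: str = CANONICAL,
                      threshold: float = THRESHOLD) -> str:
    """Return `canonical` if `name` is a close match, else `name` unchanged."""
    target = simplify(canonical)
    candidate = simplify(name)
    # Score the whole string (tolerates typos) and also only its leading
    # part (so a suffix such as "Inc." does not drag the score down).
    full = SequenceMatcher(None, target, candidate).ratio()
    head = SequenceMatcher(None, target, candidate[:len(target)]).ratio()
    return canonical if max(full, head) >= threshold else name


companies = ["ColaCola", "Cola cola", "Cola Cola Inc.", "Cola Cola Inc",
             "Cola cola inc", "Colla Cola", "Fizzy Pop"]
cleaned = [normalize_company(c) for c in companies]
print(cleaned)
```

On a DataFrame, the same function could be applied with `df['company'].map(normalize_company)`, assuming the column has no missing values. The threshold is a judgment call: check the result against `value_counts()` before trusting it, since a low cutoff can merge distinct companies.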
