Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am having trouble with a for loop inside a function. I am calculating cosine distances for a list of word vectors. For each vector, I calculate the cosine distance and then append it as a new column to a pandas DataFrame. The complication is that there are several models, so I am comparing a word vector from model 1 with the vector for the same word in every other model.

Some words are not present in all models. In that case, I catch the KeyError so the loop can move on without throwing an error, and I also add a 0 value to the pandas DataFrame. This is causing duplicated index entries, and I am stuck on how to move forward from here. The code is as follows:

from scipy.spatial.distance import cosine
import pandas as pd

def cosines(model1, model2, model3, model4, model5, model6, model7, words):
    df = pd.DataFrame()

    model = [model2, model3, model4, model5, model6, model7]

    for i in model:
        for j in words:
            try:
                cos = 1 - cosine(model1.wv[j], i.wv[j])
                print(f'cosine for model1 vs {i.name}: {cos}')
                tempdf = pd.DataFrame([cos], columns=[f'{j}'], index=[f'{i.name}'])
                #print(tempdf)
                df = pd.concat([df, tempdf], axis=0)
            except KeyError:
                # The word is missing from at least one of the two models.
                print(f'word not present at {i.name}')
                ke_tempdf = pd.DataFrame([0], columns=[f'{j}'], index=[f'{i.name}'])
                df = pd.concat([df, ke_tempdf], axis=0)
    return df

The function runs; however, for each KeyError, instead of adding a 0 to the existing row, it creates a new, duplicated row with the value 0. With two words this doubles the number of rows in the DataFrame, and the ultimate aim is to use a list of many words. The resulting DataFrame is shown below:

        word1       word2
model1  0.000000    NaN
model1  NaN         0.761573
model2  0.000000    NaN
model2  NaN         0.000000
model3  0.000000    NaN
model3  NaN         0.000000
model4  0.245140    NaN
model4  NaN         0.680306
model5  0.090268    NaN
model5  NaN         0.662234
model6  0.000000    NaN
model6  NaN         0.709828

As you can see, for every word that isn't present, instead of adding a 0 to the existing model row (where the NaN is), it adds a new row with the number 0. Each model should appear once, for example: model1, 0, 0.76, and so on, instead of the duplicated rows. Any help is much appreciated, thank you!
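
The pattern can be reproduced without any models. The sketch below uses made-up similarity values and hypothetical model names (`modelA`, `modelB`) in place of real model output. It builds one single-cell frame per (model, word) pair and stacks them along `axis=0`, as the function above does. Under those assumptions, every cell becomes its own row, whether or not a KeyError occurred:

```python
import pandas as pd

# Hypothetical similarity values standing in for real model output.
scores = {
    "modelA": {"word1": 0.0, "word2": 0.76},
    "modelB": {"word1": 0.25, "word2": 0.0},
}

cells = []
for model_name, per_word in scores.items():
    for word, value in per_word.items():
        # One 1x1 frame per (model, word) pair, as in the loop above.
        cells.append(pd.DataFrame([value], columns=[word], index=[model_name]))

# Stacking along axis=0 adds one row per cell and fills the other column with NaN.
df = pd.concat(cells, axis=0)
print(df)
```

Here `df` has four rows for two models, and the index is no longer unique.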

question from:https://stackoverflow.com/questions/65941350/duplicated-rows-in-pandas-append-inside-for-loop


1 Answer

I can't quite test it without your model objects, but I think this would address your issue:

from scipy.spatial.distance import cosine
import pandas as pd

def cosines(model1, model2, model3, model4, model5, model6, model7, words):
    df = pd.DataFrame()

    model = [model2, model3, model4, model5, model6, model7]

    for i in model:
        cos_dict = {}
        for j in words:
            try:
                cos_dict[j] = 1 - cosine(model1.wv[j], i.wv[j])
                print(f'cosine for model1 vs {i.name}: {cos_dict[j]}')
            except KeyError:
                print(f'word not present at {i.name}')
                cos_dict[j] = 0
                
        # The dict holds scalar values, so pandas needs an explicit index to build a one-row frame.
        tempdf = pd.DataFrame(cos_dict, index=[i.name])
        
        df = pd.concat([df, tempdf])
            
    return df

It collects the values for all words of each model in a dictionary in the inner loop, and only appends them to the full DataFrame once per model, in the outer loop.
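
Since the real model objects aren't available, here is a self-contained sketch of the same idea that can be run as-is. `FakeModel` is a hypothetical stand-in that only mimics the `.name` and `.wv[word]` access used in the question, `cosine_similarity` is a plain-Python replacement for `1 - scipy.spatial.distance.cosine(...)`, and the signature is changed to take a list of models. Treat it as an illustration under those assumptions, not as the exact code for your models:

```python
import math
import pandas as pd

class FakeModel:
    """Hypothetical stand-in for a word-vector model with .name and .wv."""
    def __init__(self, name, vectors):
        self.name = name
        self.wv = vectors  # plain dict: a missing word raises KeyError

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosines(reference, others, words):
    rows = {}
    for model in others:
        row = {}
        for word in words:
            try:
                row[word] = cosine_similarity(reference.wv[word], model.wv[word])
            except KeyError:
                # Word missing from one of the two models: record 0 in the same row.
                row[word] = 0.0
        rows[model.name] = row
    # One row per model, one column per word.
    return pd.DataFrame.from_dict(rows, orient="index")

ref = FakeModel("model1", {"word1": [1.0, 0.0], "word2": [0.0, 1.0]})
m2 = FakeModel("model2", {"word1": [1.0, 0.0]})  # "word2" is missing here
m3 = FakeModel("model3", {"word1": [0.0, 1.0], "word2": [0.0, 2.0]})

result = cosines(ref, [m2, m3], ["word1", "word2"])
print(result)
```

Building the whole frame once from a dict of rows also avoids calling `pd.concat` repeatedly inside the loop.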

