Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Context: I have two pandas DataFrames that characterize a network, df_nodes and df_edges. They can be matched through a shared identifier, id.

df_nodes looks roughly like this:

    id:     att_1:   att_2:  att_3:
    id1     red       ...    ...
    id2     red       ...    ...
    id3     blue      ...    ...

df_edges characterizes a (weighted) directed network, but I am interested in the (weighted) undirected representation for now.

    id_from:  id_to:  weight:
    id1       id2     0.5
    id1       id3     0.2
    id2       id4     0.4

Two features are as follows:

  • The same node sometimes appears in the id_from column and at other times in id_to (in the example, this would be id2; in practice there are millions of edges).

  • More importantly, df_edges includes connections to nodes that are not in df_nodes, i.e. I don't have any attribute data for those.

Objective: I would like to create a nx.Graph() object that only includes edges between nodes for which I have attribute data, i.e. nodes that are in df_nodes. I then want to add (selected) attribute data from df_nodes, and compute statistics such as the average (standard deviation, ...) weighted degree for the group of nodes with some attribute value (e.g. where df_nodes['att_1'] == 'red').

Approach thus far: I am new to network analysis, so probably what I'm doing is misguided. I first create G

G = nx.from_pandas_edgelist(df_edges, 'id_from', 'id_to', 'weight', nx.Graph()) 

then tried adding the attribute of interest

nx.set_node_attributes(G, df_nodes[['id','att_1',]].set_index('id').to_dict('index'),'id')

I thought I could then use something like the following to filter out the nodes that meet an attribute value.

nodes_subset = [x for x, y in G.nodes(data=True) if y['att_1'] == 'red']

But (i) doing so raises a KeyError, presumably because many nodes don't even have att_1, and (ii) the approach seems very inefficient.

I'd be very grateful for any help on how to achieve the objective (and do so efficiently, given the size of the actual data)!

question from:https://stackoverflow.com/questions/65600720/computing-network-properties-for-subset-of-nodes


1 Answer

I expect that filtering a pandas DataFrame will be quicker than filtering a NetworkX graph. So I would try the following:

Create a dictionary of nodes in the attribute table:

nodes_with_attributes = {x:0 for x in df_nodes['id'].values}

(Lookups in a dictionary are much faster than finding an element in a list, at the cost of memory.)

Then filter the edges:

df_filtered_edges = df_edges[
    df_edges['id_from'].isin(nodes_with_attributes)
    & df_edges['id_to'].isin(nodes_with_attributes)
]
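As a quick check, here is a minimal sketch of that filter applied to the sample data from the question. The attribute values and the names known_ids and keep are only placeholders for illustration; the id column is passed to isin directly, which should behave the same as passing the dictionary's keys.

```python
import pandas as pd

# Sample data mirroring the question (att_1 values are placeholders).
df_nodes = pd.DataFrame({
    "id": ["id1", "id2", "id3"],
    "att_1": ["red", "red", "blue"],
})
df_edges = pd.DataFrame({
    "id_from": ["id1", "id1", "id2"],
    "id_to": ["id2", "id3", "id4"],
    "weight": [0.5, 0.2, 0.4],
})

# Keep an edge only if BOTH endpoints have attribute data in df_nodes.
known_ids = df_nodes["id"]
keep = df_edges["id_from"].isin(known_ids) & df_edges["id_to"].isin(known_ids)
df_filtered_edges = df_edges[keep]

print(df_filtered_edges)
```

On this sample the id2 to id4 edge is dropped, because id4 has no row in df_nodes, and the other two edges are kept.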

Then you can make the filtered graph directly from the filtered dataframe.
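The remaining steps could look roughly like the sketch below: build the undirected graph from the filtered edges, attach att_1, and summarize the weighted degree per attribute value. The sample frames and the names strength and stats are assumptions for illustration, and doing the group statistics in pandas rather than by looping over G.nodes is just one possible choice.

```python
import networkx as nx
import pandas as pd

# Placeholder data: node attributes and the already-filtered edges.
df_nodes = pd.DataFrame({
    "id": ["id1", "id2", "id3"],
    "att_1": ["red", "red", "blue"],
})
df_filtered_edges = pd.DataFrame({
    "id_from": ["id1", "id1"],
    "id_to": ["id2", "id3"],
    "weight": [0.5, 0.2],
})

# Undirected weighted graph from the filtered edge list.
G = nx.from_pandas_edgelist(
    df_filtered_edges,
    source="id_from",
    target="id_to",
    edge_attr="weight",
    create_using=nx.Graph(),
)

# Nodes that have attributes but no surviving edges are added as isolates.
G.add_nodes_from(df_nodes["id"])

# Attach att_1 to every node; all nodes now come from df_nodes,
# so a later lookup of 'att_1' will not hit a missing key.
nx.set_node_attributes(G, df_nodes.set_index("id")["att_1"].to_dict(), name="att_1")

# Weighted degree per node, then group statistics by attribute value.
strength = pd.Series(dict(G.degree(weight="weight")), name="weighted_degree")
stats = (
    df_nodes.set_index("id")
    .join(strength)
    .groupby("att_1")["weighted_degree"]
    .agg(["mean", "std"])
)
print(stats)
```

One caveat: if the directed data contain both a to b and b to a, nx.Graph keeps a single edge whose weight is whichever row was read last, so you may want to aggregate such pairs in pandas first.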

