Context: I have two pandas dataframes that characterize a network, df_nodes and df_edges. They can be matched through a shared identifier, id.
df_nodes looks roughly like this:
id: att_1: att_2: att_3:
id1 red ... ...
id2 red ... ...
id3 blue ... ...
df_edges characterizes a (weighted) directed network, but I am interested in the (weighted) undirected representation for now.
id_from: id_to: weight:
id1 id2 0.5
id1 id3 0.2
id2 id4 0.4
Two features are worth noting:
(i) The same node sometimes appears in the id_from column and at other times in id_to (in the example, this would be id2; in practice there are millions of edges).
(ii) More importantly, df_edges includes connections to nodes that are not in df_nodes, i.e. I don't have any attribute data for those.
Objective: I would like to create a nx.Graph() object that only includes edges between nodes for which I have attribute data, i.e. nodes that are in df_nodes. I then want to add (selected) attribute data from df_nodes, and compute statistics such as the average (standard deviation, ...) weighted degree for the group of nodes with some attribute value (e.g. where df_nodes['att_1'] == 'red').
Approach thus far: I am new to network analysis, so probably what I'm doing is misguided.
I first create G
G = nx.from_pandas_edgelist(df_edges, 'id_from', 'id_to', 'weight', nx.Graph())
I then tried adding the attribute of interest:
nx.set_node_attributes(G, df_nodes[['id','att_1',]].set_index('id').to_dict('index'),'id')
I thought I could then use something like the following to filter out the nodes that meet an attribute value.
nodes_subset = [x for x, y in G.nodes(data=True) if y['att_1'] == 'red']
But (i) doing so throws a KeyError, presumably because many nodes don't even have att_1, and (ii) the approach seems very inefficient.
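A note on the KeyError in (i): when a name is passed together with a dict of dicts, set_node_attributes stores each inner dict under that name instead of unpacking it, so no node ends up with an att_1 key at all. The sketch below uses hand-built toy data mirroring the tables above (G_named, G_plain and attrs are illustrative names, not part of the original code) to show the difference, plus a filter based on dict.get that tolerates nodes lacking the attribute.

```python
import networkx as nx

edges = [("id1", "id2", 0.5), ("id1", "id3", 0.2), ("id2", "id4", 0.4)]
# Same shape as df_nodes[['id', 'att_1']].set_index('id').to_dict('index')
attrs = {"id1": {"att_1": "red"}, "id2": {"att_1": "red"}, "id3": {"att_1": "blue"}}

# With a name: the whole inner dict is stored under the key 'id'.
G_named = nx.Graph()
G_named.add_weighted_edges_from(edges)
nx.set_node_attributes(G_named, attrs, "id")
nested = dict(G_named.nodes["id1"])  # no 'att_1' key at the top level

# Without a name: the inner dicts are merged into each node's data.
G_plain = nx.Graph()
G_plain.add_weighted_edges_from(edges)
nx.set_node_attributes(G_plain, attrs)

# .get returns None for nodes such as id4 that carry no att_1.
nodes_subset = [n for n, d in G_plain.nodes(data=True) if d.get("att_1") == "red"]
```

Here id4 stays in the graph without an att_1 entry, which is why indexing with y['att_1'] would still fail on it even after dropping the name argument.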
I'd be very grateful for any help on how to achieve the objective (and do so efficiently, given the size of the actual data)!
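One way the objective could be approached is to filter df_edges in pandas before building the graph, so that nodes without attribute data never enter it. This is a minimal sketch on the toy data above, not a benchmarked solution: att_2 and att_3 are left out, and the names known, strength and stats are illustrative.

```python
import networkx as nx
import pandas as pd

# Toy data mirroring the tables above (att_2 and att_3 omitted).
df_nodes = pd.DataFrame({"id": ["id1", "id2", "id3"],
                         "att_1": ["red", "red", "blue"]})
df_edges = pd.DataFrame({"id_from": ["id1", "id1", "id2"],
                         "id_to": ["id2", "id3", "id4"],
                         "weight": [0.5, 0.2, 0.4]})

# Keep only edges whose two endpoints both appear in df_nodes.
known = (df_edges["id_from"].isin(df_nodes["id"])
         & df_edges["id_to"].isin(df_nodes["id"]))
G = nx.from_pandas_edgelist(df_edges[known], "id_from", "id_to",
                            edge_attr="weight", create_using=nx.Graph())

# Nodes from df_nodes that lost all their edges would otherwise be absent.
G.add_nodes_from(df_nodes["id"])

# Attach att_1 using a plain {node: value} dict plus the attribute name.
nx.set_node_attributes(G, df_nodes.set_index("id")["att_1"].to_dict(), name="att_1")

# Weighted degree per node, then summary statistics per attribute value.
strength = pd.Series(dict(G.degree(weight="weight")), name="weighted_degree")
stats = (df_nodes.set_index("id").join(strength)
         .groupby("att_1")["weighted_degree"].agg(["mean", "std"]))
```

Doing the grouping in pandas avoids looping over G.nodes(data=True) in Python. One caveat for the undirected view: if df_edges contains both a row for (a, b) and a row for (b, a), nx.Graph keeps only the weight of the later row, so the weights would need to be aggregated beforehand if a sum is wanted.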
question from:https://stackoverflow.com/questions/65600720/computing-network-properties-for-subset-of-nodes