Welcome to the ShenZhenJia Knowledge Sharing Community for programmers and developers: Open, Learning and Sharing
Say I have the following logical relationships:

# ANIMALS
cat = [hobbes, tigger, garfield]
dog = [lassie]
frog = [kermit, hypnotoad]

# HABITATS
tree = [cat, frog]
river = [dog, frog, turtle]
house = [cat, dog]

There will never be duplicates within a set. I want to match them into pairs of (animal, habitat):

(hobbes, tree)
(garfield, tree)
(tigger, tree)
(kermit, tree)
(hypnotoad, tree)
(lassie, river)
(kermit, river)
(hypnotoad, river)
(hobbes, house)
(garfield, house)
(tigger, house)
(lassie, house)

If this were SQL, it would be something like:

CREATE TABLE animal (animal_name, animal_type);
CREATE TABLE habitat_to_animal (habitat_type, animal_type);

SELECT
    animal.animal_name,
    habitat_to_animal.habitat_type
FROM animal
JOIN habitat_to_animal
    ON animal.animal_type = habitat_to_animal.animal_type;

The data is stored in NumPy arrays (the actual arrays are numpy.int32), like so:

# animals are sorted by animal_kind
animal_kind = ['cat', 'cat', 'cat', 'dog', 'frog', 'frog']
animal_name = ['hobbes', 'tigger', 'garfield', 'lassie', 'kermit', 'hypnotoad']

# habitats are sorted by habitat_type
habitat_type = ['tree', 'tree', 'river', 'river', 'river', 'house', 'house']
habitat_animal = ['cat', 'frog', 'dog', 'frog', 'turtle', 'cat', 'dog']

What is the fastest way to do this in numpy if arrays are very large (millions of entries each)?

EDIT: This must be NumPy only (no Pandas or other libraries). As far as data cardinality goes, imagine both sets are ~10 million "animals" and ~100 million "habitats", with ~200 million actual pairs in the output (most habitats will be empty, and some will have 5-10 animals). That means generating the full cross product and then filtering is a no-go, as is any Python loop; the data is far too big.

Real arrays are integer IDs and have nothing to do with animals or habitats, but it's easier to read this way =)

Question from: https://stackoverflow.com/questions/65928382/numpy-how-to-express-a-many-to-many-relationship


1 Answer

If you want SQL-like selection capabilities, I'm going to suggest you think about Pandas for handling the data. It can happily store NumPy arrays, and it gives you very flexible filtering options for slicing the data lots of different ways.

For what you're asking, I don't know if you even need to do much more than just make the DataFrame and iterate over it. Here's an example that tries to stick close to your SQL example:

import pandas as pd

animal_kind = ['cat', 'cat', 'cat', 'dog', 'frog', 'frog']
animal_name = ['hobbes', 'tigger', 'garfield', 'lassie', 'kermit', 'hypnotoad']

habitat_type = ['tree', 'tree', 'river', 'river', 'river', 'house', 'house', 'house']
habitat_animal = ['cat', 'frog', 'dog', 'frog', 'turtle', 'cat', 'fish', 'dog']

# name -> animal type lookup, indexed by name
df_animal = pd.DataFrame(data=animal_kind, columns=["animal"], index=animal_name)
habitat_to_animal = pd.DataFrame()
habitat_to_animal["animal"] = habitat_animal
habitat_to_animal["habitat"] = habitat_type

# for each (animal type, habitat) row, look up every animal name of that type
for i in range(len(habitat_to_animal)):
    animal, habitat = habitat_to_animal.loc[i]
    names = df_animal[df_animal.animal == animal].index
    for name in names:
        print(f"({name}, {habitat})")

This produces your list as requested. Be aware, though, that row iteration is generally the slow path in Pandas; there are 3-4 different ways to iterate over rows, and the fastest one depends on the data and what you're doing with it.
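That said, the row loop can be avoided entirely: the SQL JOIN in the question maps directly onto pandas.merge, which is vectorized. A sketch on the same toy data:

```python
import pandas as pd

# same toy data as above, as two plain DataFrames
animal = pd.DataFrame({
    "animal": ['cat', 'cat', 'cat', 'dog', 'frog', 'frog'],
    "name": ['hobbes', 'tigger', 'garfield', 'lassie', 'kermit', 'hypnotoad'],
})
habitat = pd.DataFrame({
    "animal": ['cat', 'frog', 'dog', 'frog', 'turtle', 'cat', 'fish', 'dog'],
    "habitat": ['tree', 'tree', 'river', 'river', 'river', 'house', 'house', 'house'],
})

# inner join on animal type, the direct analogue of the SQL SELECT ... JOIN;
# unmatched types ('turtle', 'fish') simply drop out
pairs = animal.merge(habitat, on="animal")[["name", "habitat"]]
```

With an inner merge, each animal name is paired with every habitat listing its type in one call, with no Python-level loop.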


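Since the question's EDIT asks for NumPy only with no loops, the same join can also be done fully vectorized with np.searchsorted, exploiting the fact that animal_kind is sorted by type. A minimal sketch, using the toy string arrays in place of the real int32 ID arrays:

```python
import numpy as np

# toy string arrays standing in for the real sorted int32 ID arrays
animal_kind = np.array(['cat', 'cat', 'cat', 'dog', 'frog', 'frog'])   # sorted by type
animal_name = np.array(['hobbes', 'tigger', 'garfield', 'lassie', 'kermit', 'hypnotoad'])
habitat_type = np.array(['tree', 'tree', 'river', 'river', 'river', 'house', 'house'])
habitat_animal = np.array(['cat', 'frog', 'dog', 'frog', 'turtle', 'cat', 'dog'])

# for each habitat row, the run [left, right) of matching animals in the sorted array
left = np.searchsorted(animal_kind, habitat_animal, side='left')
right = np.searchsorted(animal_kind, habitat_animal, side='right')
counts = right - left                      # matches per habitat row ('turtle' -> 0)

# repeat each habitat row once per match, and enumerate 0..count-1 within each run
hab_idx = np.repeat(np.arange(len(habitat_animal)), counts)
starts = np.cumsum(counts) - counts        # output offset where each run begins
ani_idx = np.repeat(left, counts) + np.arange(counts.sum()) - np.repeat(starts, counts)

pairs = list(zip(animal_name[ani_idx], habitat_type[hab_idx]))
```

The output size is exactly the number of matching pairs, so the cross product is never materialized, and there is no Python loop over rows.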