Let's assume we have data instances like this:
[
[15, 20, ("banana","apple","cucumber"), ...],
[91, 12, ("orange","banana"), ...],
...
]
I am wondering how I can encode the third element of these datapoints. For a feature that can take multiple values we could use sklearn's OneHotEncoder, but as far as I can tell, it cannot handle inputs of varying length.
Here is what I've tried out:
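from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = [[15, 20, ("banana","apple","cucumber")], [91, 12, ("orange","banana")]]
# One-hot encode only column 2 (the tuple); pass the numeric columns through unchanged
ct = ColumnTransformer(
    [
        ("genre_encoder", OneHotEncoder(), [2])
    ],
    remainder='passthrough'
)
print(ct.fit_transform(X))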
This outputs only
[[1.0 0.0 15 20]
[0.0 1.0 91 12]]
as expected, because each tuple as a whole is treated as a single categorical value of this feature, not as a collection of values.
We can't embed the values directly as separate columns (like [15, 12, "banana", "apple", "cucumber"]), because
- we don't know how many values of this feature each datapoint will have (two? three?)
- each position would be interpreted as its own feature, so if "banana" appeared in the first nominal slot of one datapoint and in the second nominal slot of another, the two occurrences would not count toward the same "pool of values" the feature can take
Example:
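from sklearn.preprocessing import OneHotEncoder
X = [["banana","apple","cucumber"], ["orange","banana", "cucumber"]]
enc = OneHotEncoder()
# Each column position is encoded as a separate feature
print(enc.fit_transform(X).toarray())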
[[1. 0. 1. 0. 1.]
[0. 1. 0. 1. 1.]]
This representation contains 5 slots instead of 4, because the first position is interpreted as taking either banana or orange, the second as apple or banana, and the last one only has the option cucumber.
(This would also not solve the problem of a varying number of feature values per datapoint. Padding the empty positions with None does not solve it either, because None then faces the same positional ambiguity.)
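To make the goal concrete, here is a minimal sketch of the encoding I am after, written in plain Python rather than with any sklearn API. The helper name multi_hot and the alphabetical column order are assumptions made purely for illustration; the point is that all values share one pool, regardless of position or of how many values a datapoint has.

```python
def multi_hot(rows):
    """Encode variable-length collections of labels as fixed-width 0/1 rows.

    Hypothetical helper for illustration only; not part of sklearn.
    """
    # Pool every label seen in any position into one shared, sorted vocabulary.
    vocabulary = sorted({label for row in rows for label in row})
    # One column per pooled label: 1 if the row contains it, wherever it appears.
    encoded = [[1 if label in row else 0 for label in vocabulary] for row in rows]
    return vocabulary, encoded


fruits = [("banana", "apple", "cucumber"), ("orange", "banana")]
vocabulary, encoded = multi_hot(fruits)
print(vocabulary)  # ['apple', 'banana', 'cucumber', 'orange']
print(encoded)     # [[1, 1, 1, 0], [0, 1, 0, 1]]
```

This gives 4 slots for the 4 distinct values, and "banana" maps to the same column in both datapoints. The resulting rows would still need to be combined with the numeric columns (15, 20 and 91, 12).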
Any idea how to encode such "multi-multi" features, which can take multiple values and consist of a varying number of elements? Thank you in advance!
Question from: https://stackoverflow.com/questions/65927997/multi-feature-one-hot-encoder-with-varying-amount-of-feature-instances