python - Multi-Feature One-Hot-Encoder with varying amount of feature instances

asked Oct 7, 2021 in Technique by 深蓝 (71.8m points)

Let's assume we have data instances like this:

[
    [15, 20, ("banana","apple","cucumber"), ...],
    [91, 12, ("orange","banana"), ...],
    ...
]

I am wondering how I can encode the third element of these datapoints. For multiple feature values we could use sklearn's OneHotEncoder, but as far as I can tell, it cannot handle inputs of varying length.

Here is what I've tried out:

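from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = [[15, 20, ("banana", "apple", "cucumber")], [91, 12, ("orange", "banana")]]

# One-hot encode only the third column (index 2); pass the numeric columns through.
ct = ColumnTransformer(
    [
        ("genre_encoder", OneHotEncoder(), [2])
    ],
    remainder='passthrough'
)
print(ct.fit_transform(X))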

This will only output

[[1.0 0.0 15 20]
 [0.0 1.0 91 12]]

as expected, because each whole tuple is treated as a single possible value of this feature.

We can't flatten the values directly into the row (like [15, 20, "banana", "apple", "cucumber"]), because

we don't know how many values of this feature each datapoint will have (two? three?), and
each position would be interpreted as its own feature: if banana appeared in the first nominal slot of one datapoint and in the second nominal slot of another, the two occurrences would not be counted in the same "pool of values" the feature can take.

Example:

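from sklearn.preprocessing import OneHotEncoder

X = [["banana", "apple", "cucumber"], ["orange", "banana", "cucumber"]]
enc = OneHotEncoder()  # each column position gets its own set of categories
print(enc.fit_transform(X).toarray())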

[[1. 0. 1. 0. 1.]
 [0. 1. 0. 1. 1.]]

This representation contains 5 slots instead of 4, because the first position is interpreted as taking the values banana or orange, the second as apple or banana, and the last one only has the option cucumber.

(This would also not solve the problem of having a different number of feature values per datapoint. Padding the missing ones with None does not help either, because None then suffers from the same positional ambiguity.)
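To make the target concrete, here is a minimal hand-rolled sketch of the shared-pool representation described above, where every value maps to one column regardless of its position in the tuple. It uses only plain Python; the names rows, vocabulary, index and multi_hot are illustrative assumptions, not part of any library.

```python
# Hypothetical sample data: only the nominal part of each datapoint.
rows = [("banana", "apple", "cucumber"), ("orange", "banana")]

# One shared, sorted pool of values across all rows and positions.
vocabulary = sorted({value for row in rows for value in row})
index = {value: i for i, value in enumerate(vocabulary)}


def multi_hot(row):
    """Return a 0/1 vector with a 1 for every value present in the row."""
    vector = [0] * len(vocabulary)
    for value in row:
        vector[index[value]] = 1
    return vector


encoded = [multi_hot(row) for row in rows]
print(vocabulary)  # ['apple', 'banana', 'cucumber', 'orange']
print(encoded)     # [[1, 1, 1, 0], [0, 1, 0, 1]]
```

This yields 4 slots for the 4 distinct values, and tuples of different lengths are handled without padding.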

Any idea how to encode those "multi-multi" features, which can take multiple values and consist of a varying number of elements? Thank you in advance!

question from:https://stackoverflow.com/questions/65927997/multi-feature-one-hot-encoder-with-varying-amount-of-feature-instances

1 Answer

深蓝 · Answer 1 · 2021-10-06T18:54:41+0000

I solved it for now by transforming it into a CountVectorizer problem, thanks to David Maspis' answer on the Data Science Stack Exchange.
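The answer does not include code, so the following is only a sketch of what a CountVectorizer-based approach could look like, not necessarily the exact solution referenced. It assumes each tuple is treated as an already tokenized "document" by passing a callable as the analyzer, and it uses binary=True so that each value is marked as present or absent. The variable names and the manual split into numeric and nominal parts are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

X = [[15, 20, ("banana", "apple", "cucumber")], [91, 12, ("orange", "banana")]]

# Split each row into its numeric part and its variable-length tuple.
numeric = np.array([row[:2] for row in X])
fruits = [row[2] for row in X]

# A callable analyzer receives each raw item (here a tuple) and returns its tokens,
# so the tuple elements are used as-is. binary=True records presence, not counts.
vectorizer = CountVectorizer(analyzer=lambda values: values, binary=True)
encoded = vectorizer.fit_transform(fruits).toarray()

# Columns follow the sorted vocabulary: apple, banana, cucumber, orange.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Put the numeric columns and the encoded columns back together.
features = np.hstack([numeric, encoded])
print(columns)
print(features)
```

With this setup all values share a single vocabulary, so banana lands in the same column no matter where it appears in a tuple, and tuples of different lengths need no padding.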
