Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Let's assume we have data instances like this:

[
    [15, 20, ("banana","apple","cucumber"), ...],
    [91, 12, ("orange","banana"), ...],
    ...
]

I am wondering how I can encode the third element of these datapoints. For categorical features we could use sklearn's OneHotEncoder, but as far as I could find out, it cannot handle a feature whose entries hold a varying number of values.

Here is what I've tried out:

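from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = [[15, 20, ("banana","apple","cucumber")], [91, 12, ("orange","banana")]]

# One-hot encode column 2 (the tuple); pass the numeric columns through unchanged
ct = ColumnTransformer(
    [
        ("genre_encoder", OneHotEncoder(), [2])
    ],
    remainder='passthrough'
)
print(ct.fit_transform(X))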

This will only output

[[1.0 0.0 15 20]
 [0.0 1.0 91 12]]

as expected, because each whole tuple is treated as a single category value of this feature, not as a collection of separate values.

We can't flatten the values directly into the datapoint (like [15, 20, "banana", "apple", "cucumber"]), because

  1. we don't know how many values of this feature each datapoint will have (two? three?)
  2. each position would be interpreted as its own feature, so if banana appeared in the first nominal slot of one datapoint and in the second nominal slot of another, the two occurrences would not count towards the same "pool of values" the feature can take

Example:

from sklearn.preprocessing import OneHotEncoder

X = [["banana","apple","cucumber"], ["orange","banana", "cucumber"]]
enc = OneHotEncoder()
print(enc.fit_transform(X).toarray())

[[1. 0. 1. 0. 1.]
 [0. 1. 0. 1. 1.]]

This representation contains 5 slots instead of 4, because each position is encoded as a separate feature: the first position is interpreted as banana or orange, the second as apple or banana, and the last one only has the option cucumber.

(This would also not solve the problem of having a varying number of feature values per datapoint. Padding the missing positions with None does not solve it either, because None then suffers from the same positional ambiguity.)

Any idea how to encode those "multi-multi" features, which can take multiple values at once and consist of a varying number of elements? Thank you in advance!

Question from: https://stackoverflow.com/questions/65927997/multi-feature-one-hot-encoder-with-varying-amount-of-feature-instances


1 Answer

I solved it for now by turning it into a CountVectorizer problem, thanks to David Maspis answer on the Data Science Stack Exchange.
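
The answer does not show code, so here is a minimal sketch of how the CountVectorizer idea might look for the example data from the question. It assumes each tuple can be treated as an already tokenized "document"; the callable analyzer, binary=True and the variable names (numeric, tags, combined) are illustrative choices, not necessarily what the referenced answer does.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

X = [[15, 20, ("banana", "apple", "cucumber")],
     [91, 12, ("orange", "banana")]]

# Separate the numeric columns from the multi-valued column.
numeric = np.array([row[:2] for row in X])
tags = [row[2] for row in X]

# Assumption: a callable analyzer that returns the tuple unchanged, so each
# element of the tuple becomes one token. binary=True yields 0/1 indicators
# instead of occurrence counts.
vec = CountVectorizer(analyzer=lambda values: values, binary=True)
encoded = vec.fit_transform(tags).toarray()

# Column order of the encoded block, taken from the fitted vocabulary.
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(feature_names)

# Put the numeric columns and the indicator columns side by side.
combined = np.hstack([numeric, encoded])
print(combined)
```

All values share one vocabulary, so banana lands in the same column regardless of its position in the tuple, and tuples of different lengths are handled without padding.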

