I have a Dataset like the one below:
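+-------------------------+--------------------------------------------------------------------+----------+
|word                     |features                                                            |prediction|
+-------------------------+--------------------------------------------------------------------+----------+
|simple sentence          |(2000,[1092,1980],[0.0,0.5753641449035617])                         |1         |
|simple important sentence|(2000,[537,1092,1980],[0.28768207245178085,0.0,0.28768207245178085])|0         |
|important sentence       |(2000,[537,1092],[0.5753641449035617,0.0])                          |0         |
+-------------------------+--------------------------------------------------------------------+----------+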
Here I have 2 clusters (0 and 1).
I want to select the words with the highest weight in each cluster (at least 2 words per cluster).
For example (for convenience, I have annotated each word with its feature index and weight, after the cluster number):
1 sentence(1092)(0.0) simple(1980)(0.5753641449035617)
0 important(537)(0.28768207245178085) sentence(1092)(0.0) simple(1980)(0.28768207245178085)
0 important(537)(0.5753641449035617) sentence(1092)(0.0)
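
The annotation above amounts to pairing each index of the sparse vector with its weight and looking up the word for that index. A minimal plain-Python sketch of that pairing follows. The `index_to_word` dictionary and the `annotate` helper are hypothetical names introduced only for illustration; how the index-to-word mapping is obtained depends on how the features were built, and a hashing-based vectorizer does not provide one directly.

```python
# Hypothetical lookup from feature index to word, copied from the example above.
index_to_word = {537: "important", 1092: "sentence", 1980: "simple"}


def annotate(indices, values, lookup):
    """Pair each sparse-vector index with its word and weight."""
    return [(lookup[i], i, w) for i, w in zip(indices, values)]


# Indices and values of the second row: (2000,[537,1092,1980],[...]).
row = annotate(
    [537, 1092, 1980],
    [0.28768207245178085, 0.0, 0.28768207245178085],
    index_to_word,
)

# Print each entry in the word(index)(weight) form used in the question.
for word, idx, weight in row:
    print(f"{word}({idx})({weight})")
```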
So, based on the above dataset, the highest-weighted words in cluster 1 are
"simple" (0.5753641449035617) and "sentence" (0.0),
and the highest-weighted words in cluster 0 are
"important" (0.5753641449035617) and "simple" (0.28768207245178085).
Based on the above, I expect the output to look like the following:
+----------+-----------------------------------------------+-------------------+-----------------------------------------+
|prediction|docname                                        |top_terms          |weight                                   |
+----------+-----------------------------------------------+-------------------+-----------------------------------------+
|1         |[simple sentence]                              |[simple, sentence] |[0.5753641449035617, 0.0]                |
|0         |[simple important sentence, important sentence]|[important, simple]|[0.5753641449035617, 0.28768207245178085]|
+----------+-----------------------------------------------+-------------------+-----------------------------------------+
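
The selection rule described in this question (per cluster, keep each word's highest weight, then take the two highest-weighted words) can be sketched in plain Python as below. This only illustrates the logic on the three example rows and is not Spark code; `rows` and `top_terms_per_cluster` are hypothetical names, and the words are assumed to be already paired with their weights.

```python
from collections import defaultdict

# (prediction, [(word, weight), ...]) for the three example rows.
rows = [
    (1, [("sentence", 0.0), ("simple", 0.5753641449035617)]),
    (0, [("important", 0.28768207245178085), ("sentence", 0.0),
         ("simple", 0.28768207245178085)]),
    (0, [("important", 0.5753641449035617), ("sentence", 0.0)]),
]


def top_terms_per_cluster(rows, n=2):
    """Return the n highest-weighted words for each cluster."""
    best = defaultdict(dict)  # cluster -> {word: highest weight seen}
    for cluster, terms in rows:
        for word, weight in terms:
            previous = best[cluster].get(word, weight)
            best[cluster][word] = max(previous, weight)
    # Sort each cluster's words by weight, highest first, and keep n of them.
    return {
        cluster: sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for cluster, weights in best.items()
    }


result = top_terms_per_cluster(rows)
for cluster, terms in result.items():
    print(cluster, [w for w, _ in terms], [v for _, v in terms])
```

In Spark the same shape would presumably be reached by turning each (word, weight) pair into its own row, grouping by `prediction`, and aggregating, but that part is not shown here.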
Please help me figure out how to do this.
Question from: https://stackoverflow.com/questions/65915468/how-to-perform-group-by-and-aggregate-operation-on-spark