I have a DataFrame in Spark. Each row represents a person, and I want to find the possible connections among them. The rule is that, for each possible pair, a link exists if both people have the same prop1 (String) and the absolute difference of their prop2 (Int) values is less than 5. I am trying to understand the best way to accomplish this using DataFrames.
I am trying to retrieve indexed RDDs:
val idusers = people.select("ID")
  .rdd
  .map(r => r(0).asInstanceOf[Int])
  .zipWithIndex
// select must include every column that the map reads by position
val prop1users = people.select("ID", "prop1")
  .rdd
  .map(r => (r(0).asInstanceOf[Int], r(1).asInstanceOf[String]))
val prop2users = people.select("ID", "prop2")
  .rdd
  .map(r => (r(0).asInstanceOf[Int], r(1).asInstanceOf[Int]))
Then I started removing duplicates like this:
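// join matches rows by key, so it would only pair each ID with itself;
// cartesian builds every pair, and the filter drops the self-pairs
val links = idusers
  .cartesian(idusers)
  .filter { case (u1, u2) => u1._1 != u2._1 }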
But then I got stuck on checking prop1. Is there a way to accomplish all of these steps using only DataFrames?
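To make the rule concrete, here is a minimal sketch of how it could be expressed as a DataFrame self-join. It assumes the columns are literally named ID, prop1 and prop2, and that ID is unique per person; the object name LinksSketch, the helper findLinks, the output column names src and dst, and the sample rows are all hypothetical. The condition l.ID < r.ID is one possible way to drop self-pairs and mirrored duplicates, not necessarily the only one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{abs, col}

object LinksSketch {
  // Hypothetical helper: returns one row (src, dst) per linked pair.
  // Assumes columns ID (Int, unique), prop1 (String), prop2 (Int).
  def findLinks(people: DataFrame): DataFrame = {
    val l = people.as("l")
    val r = people.as("r")
    l.join(
        r,
        col("l.prop1") === col("r.prop1") &&             // same prop1
          col("l.ID") < col("r.ID") &&                   // no self-pairs, each pair once
          abs(col("l.prop2") - col("r.prop2")) < 5)      // |prop2 difference| < 5
      .select(col("l.ID").as("src"), col("r.ID").as("dst"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("links-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, only for illustration
    val people = Seq(
      (1, "a", 10),
      (2, "a", 12),
      (3, "a", 30),
      (4, "b", 11)
    ).toDF("ID", "prop1", "prop2")

    // Only persons 1 and 2 share prop1 and have prop2 values within 5
    findLinks(people).show()

    spark.stop()
  }
}
```

Because the join condition contains an equality on prop1, Spark can plan this as an equi-join with the remaining predicates applied as extra filters, which avoids comparing people whose prop1 differs.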