I have a dataset (as an RDD
) that I divide into 4 RDDs by using different filter
operators.
val RSet = datasetRdd.
flatMap(x => RSetForAttr(x, alLevel, hieDict)).
map(x => (x, 1)).
reduceByKey((x, y) => x + y)
val Rp:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rp"))
val Rc:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rc"))
val RpSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RpSv"))
val RcSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RcSv"))
I sent Rp
and RpSV
to the following function calculateEntropy
:
def calculateEntropy(Rx: RDD[(String, Int)], RxSv: RDD[(String, Int)]): Map[Int, Map[String, Double]] = {
RxSv.foreach{item => {
val string = item._1.split(",")
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
.
.
}
}
I have two questions:
1- When I loop operation on RxSv
as:
RxSv.foreach{item=> { ... }}
it collects all items of the partitions, but I want to only a partition where i am in. If you said that user map
function but I don't change anything on RDD
.
So when I run the code on a cluster with 4 workers and a driver the dataset is divided into 4 partitions and each worker runs the code. But for example i use foreach loop as i specified in the code. Driver collects all data from workers.
2- I have encountered with a problem on this code
val t = Rx.filter(x => x._1.split(",")(2).equals(abc(2)))
The error :
org.apache.spark.SparkException: This RDD lacks a SparkContext.
It could happen in the following cases:
(1) RDD
transformations
and actions
are NOT invoked by the driver, but inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x)
is invalid because the values transformation
and count
action
cannot be performed inside of the rdd1.map
transformation
. For more information, see SPARK-5063.
(2) When a Spark Streaming
job recovers from checkpoint, this exception will be hit if a reference to an RDD
not defined by the streaming job is used in DStream
operations. For more information, See SPARK-13758.