scala - How to perform one operation on each executor once in spark

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a Weka model stored in S3 that is around 400 MB in size. I also have a set of records on which I want to run the model and perform prediction.

To perform the prediction, this is what I have tried:

1. Download and load the model on the driver as a static object, broadcast it to all executors, and perform a map operation on the prediction RDD. Result: not working, because in Weka the model object needs to be modified while performing prediction, and a broadcast requires a read-only copy.
2. Download and load the model on the driver as a static object and send it to the executors in each map operation. Result: working, but not efficient, because I am passing a 400 MB object in each map operation.
3. Download the model on the driver, load it on each executor, and cache it there. I don't know how to do that.

Does anyone have an idea how I can load the model on each executor once and cache it, so that I don't load it again for other records?

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:29:32+0000

You have two options:

1. Create a singleton object with a lazy val representing the data:

    object WekaModel {
        lazy val data = {
            // initialize data here. This will only happen once per JVM process
        }
    }

Then you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes its own instance of the data. No serialization or broadcast is performed for `data`.

    elementsRDD.map { element =>
        // use WekaModel.data here
    }
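
To make the "once per JVM" behaviour concrete, here is a minimal sketch in plain Scala, with no Spark dependency so it can be run locally. `FakeModel`, `loadCount` and `LazyValDemo` are hypothetical names introduced only for illustration; a real job would download and deserialize the Weka model inside the `lazy val` initializer.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the real Weka classifier; only the shape matters here.
final class FakeModel {
  def predict(record: String): Int = record.length
}

object WekaModel {
  // Counts how often the initializer body runs, purely for demonstration.
  val loadCount = new AtomicInteger(0)

  // Evaluated on first access, then cached for the lifetime of this JVM.
  lazy val data: FakeModel = {
    loadCount.incrementAndGet()
    new FakeModel // assumption: a real job would download from S3 and deserialize here
  }
}

object LazyValDemo {
  // Returns the predictions and the number of times the model was loaded.
  def run(): (Seq[Int], Int) = {
    val records = Seq("a", "bb", "ccc")
    // Same shape as the closure passed to elementsRDD.map.
    val predictions = records.map(r => WekaModel.data.predict(r))
    (predictions, WekaModel.loadCount.get)
  }
}
```

Scala initializes a `lazy val` in a thread-safe way, which matters because an executor may run several tasks as threads in the same JVM. If prediction mutates the model, as the question suggests for Weka, those tasks would share one mutable instance, so some form of synchronization (or one task per executor) may be needed.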

Advantages

More efficient, since the data is initialized only once per JVM instance. This approach is a good choice when, for example, you need to initialize a database connection pool.

Disadvantages

Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.
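
One possible way to soften both disadvantages is to replace the `lazy val` with an explicit holder that takes a runtime parameter on first use and can drop its reference later. This is only a sketch under assumptions: `ModelHolder`, `LoadedModel`, `getOrLoad` and `release` are hypothetical names, and the path strings are placeholders rather than real locations.

```scala
object ModelHolder {
  // Hypothetical stand-in for a loaded model; it only records where it came from.
  final class LoadedModel(val source: String)

  private var cached: Option[LoadedModel] = None
  private var loads = 0

  // Loads on the first call using a runtime parameter; later calls reuse the
  // cached instance and ignore `path`.
  def getOrLoad(path: String): LoadedModel = synchronized {
    cached.getOrElse {
      loads += 1
      val model = new LoadedModel(path) // assumption: real code would read and deserialize `path`
      cached = Some(model)
      model
    }
  }

  // Drops the reference so the model can be garbage-collected.
  def release(): Unit = synchronized { cached = None }

  // Number of times a model was actually loaded in this JVM.
  def loadCount: Int = synchronized(loads)
}
```

The trade-off is that every caller has to pass the parameter, and nothing in Spark calls `release()` for you, so deciding when to release remains the job's responsibility.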

2. Use the `mapPartitions` (or `foreachPartition`) method on the RDD instead of just `map`.

This allows you to initialize whatever you need for the entire partition.

    elementsRDD.mapPartitions { elements =>
        // Here WekaModel is assumed to be a class, not the singleton from option 1.
        val model = new WekaModel()

        // elements is an Iterator; the result must also be an Iterator.
        elements.map { element =>
            // use model and element. There is a single instance of model per partition.
        }
    }

Advantages

Provides more flexibility in the initialization and deinitialization of objects.

Disadvantages

Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.
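
The per-partition cost can be illustrated without a cluster. The sketch below uses plain Scala collections as stand-ins for partitions; `PartitionDemo`, `Model` and `scorePartition` are hypothetical names. The function has the `Iterator[T] => Iterator[U]` shape that the RDD per-partition mapping method takes, so the same function could in principle be passed to it.

```scala
import java.util.concurrent.atomic.AtomicInteger

object PartitionDemo {
  // Counts constructed models, purely for demonstration.
  val modelsCreated = new AtomicInteger(0)

  // Hypothetical stand-in for a model that is expensive to construct.
  final class Model {
    modelsCreated.incrementAndGet()
    def predict(record: String): Int = record.length
  }

  // Builds one model, then scores every record of the partition with it.
  def scorePartition(elements: Iterator[String]): Iterator[Int] = {
    val model = new Model         // built once for the whole partition
    elements.map(model.predict)   // lazy: records are scored as the iterator is consumed
  }

  // Returns the scored partitions and the number of models that were built.
  def run(): (List[List[Int]], Int) = {
    // Two in-memory "partitions" standing in for an RDD with two partitions.
    val partitions = List(List("a", "bb"), List("ccc", "dddd", "eeeee"))
    val scored = partitions.map(p => scorePartition(p.iterator).toList)
    (scored, modelsCreated.get)
  }
}
```

Two partitions produce two models, regardless of how many records each holds. With a 400 MB model and many partitions per executor, that repeated loading is the cost to weigh against the singleton approach.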
