Welcome to the ShenZhenJia Knowledge Sharing Community for programmers and developers: Open, Learning and Share

My Spark application uses RDDs of NumPy arrays.
At the moment I'm reading my data from AWS S3, where it's represented as a plain text file in which each line is a vector and the elements are separated by spaces, for example:

1 2 3
5.1 3.6 2.1
3 0.24 1.333

I'm using NumPy's loadtxt() function to create a NumPy array from it.
However, this method seems very slow, and my app spends too much time (I think) converting the dataset to NumPy arrays.

Can you suggest a better way to do this? For example, should I store my dataset as a binary file? Should I create the RDD in another way?

Some code for how I create my RDD:

data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readData)

The readData function:

 def readData(iterator):
     return [np.loadtxt(iterator, dtype=np.float64)]


1 Answer

It would be a little bit more idiomatic and slightly faster to simply map with numpy.fromstring as follows:

import numpy as np

path = ...
initial_num_of_partitions = ...

data = (sc.textFile(path, initial_num_of_partitions)
   .map(lambda s: np.fromstring(s, dtype=np.float64, sep=" ")))

But ignoring that, there is nothing particularly wrong with your approach. As far as I can tell, with a basic configuration, it is roughly twice as slow as simply reading the data and only slightly slower than creating dummy NumPy arrays.
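The comparison can also be sanity-checked locally, without a cluster: both parsing strategies should produce the same rows (sample vectors taken from the question):

```python
import numpy as np
from io import StringIO

lines = ["1 2 3", "5.1 3.6 2.1", "3 0.24 1.333"]

# What the map above does: parse one line at a time with fromstring
fromstring_rows = [np.fromstring(s, dtype=np.float64, sep=" ") for s in lines]

# What the original readData does: parse a whole partition with loadtxt
loadtxt_rows = np.loadtxt(StringIO("\n".join(lines)), dtype=np.float64)

# Both approaches yield the same 3x3 matrix of float64 values
assert np.allclose(np.vstack(fromstring_rows), loadtxt_rows)
```

Note that newer NumPy versions deprecate np.fromstring for text input; `np.array(s.split(), dtype=np.float64)` is an equivalent replacement per line.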

So it looks like the problem is somewhere else: cluster misconfiguration, the cost of fetching data from S3, or even unrealistic expectations.
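If parsing does turn out to be the cost, the binary-file idea from the question is one way to remove it entirely: NumPy's .npy format is read back without any text parsing. A minimal local sketch (the file path is illustrative; distributing such files to Spark, e.g. via sc.binaryFiles, is left out):

```python
import os
import tempfile
import numpy as np

# Sample matrix from the question
data = np.array([[1.0, 2.0, 3.0],
                 [5.1, 3.6, 2.1],
                 [3.0, 0.24, 1.333]], dtype=np.float64)

# Write once as binary .npy instead of space-separated text
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, data)

# Reading back involves no text parsing at all
loaded = np.load(path)
assert np.array_equal(loaded, data)
```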

