I recently found dask module that aims to be an easy-to-use python parallel processing module. Big selling point for me is that it works with pandas.
After reading a bit on its manual page, I can't find a way to do this trivially parallelizable task:
ts.apply(func) # for pandas series
df.apply(func, axis = 1) # for pandas DF row apply
At the moment, to achieve this in dask, AFAIK,
ddf.assign(A=lambda df: df.apply(func, axis=1)).compute() # dask DataFrame
which is ugly syntax and is actually slower than outright
df.apply(func, axis = 1) # for pandas DF row apply
Any suggestion?
Edit: Thanks @MRocklin for the map function. It seems to be slower than plain pandas apply. Is this related to pandas GIL releasing issue or am I doing it wrong?
import dask.dataframe as dd
s = pd.Series([10000]*120)
ds = dd.from_pandas(s, npartitions = 3)
def slow_func(k):
A = np.random.normal(size = k) # k = 10000
s = 0
for a in A:
if a > 0:
s += 1
else:
s -= 1
return s
s.apply(slow_func) # 0.43 sec
ds.map(slow_func).compute() # 2.04 sec
See Question&Answers more detail:os