Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I have a matrix of data (55K x 8.5K) with counts. Most of the entries are zeros, but a few hold arbitrary counts. For example:

 a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

I want to binarize the cell values.

I did the following:

df_preference = df_recommender.applymap(lambda x: np.where(x > 0, 1, 0))

The code works, but it takes a long time to run.

Why is that?

Is there a faster way?

Thanks

Edit:

I get an error when calling df.to_pickle:

df_preference.to_pickle('df_preference.pickle')

I get this:

---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-16-3fa90d19520a> in <module>()
      1 # Pickling the data to the disk
      2 
----> 3 df_preference.to_pickle('df_preference.pickle')

\\dwdfhome01\Anaconda\lib\site-packages\pandas\core\generic.pyc in to_pickle(self, path)
   1170         """
   1171         from pandas.io.pickle import to_pickle
-> 1172         return to_pickle(self, path)
   1173 
   1174     def to_clipboard(self, excel=None, sep=None, **kwargs):

\\dwdfhome01\Anaconda\lib\site-packages\pandas\io\pickle.pyc in to_pickle(obj, path)
     13     """
     14     with open(path, 'wb') as f:
---> 15         pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
     16 
     17 

SystemError: error return without exception set


1 Answer

UPDATE:

Read this topic and this issue regarding your error.

Try saving your DataFrame as HDF5 instead - it's much more convenient.
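For example, a minimal sketch (the file name and key are made up, and writing HDF5 from pandas requires the PyTables package):

```python
import numpy as np
import pandas as pd

# A small stand-in for the binarized DataFrame.
df_preference = pd.DataFrame(np.random.randint(0, 10, size=(100, 3)),
                             columns=list("abc")).gt(0).astype(np.int8)

# Write to HDF5 (needs PyTables: pip install tables).
df_preference.to_hdf("df_preference.h5", key="preference", mode="w")

# Read it back.
restored = pd.read_hdf("df_preference.h5", "preference")
```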

You may also want to read this comparison...

OLD answer:

try this:

In [110]: (df>0).astype(np.int8)
Out[110]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

.applymap() is one of the slowest methods, because it visits every cell individually (essentially running nested Python loops).

df > 0 is a vectorized operation, so it runs much faster.

.apply() is faster than .applymap() because it works on whole columns, but it is still much slower than df > 0.
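To confirm the two approaches produce identical results, a small sketch using the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [4, 1, 2, 2, 2],
                   "b": [3, 2, 1, 0, 0],
                   "c": [3, 1, 0, 1, 4]})

slow = df.applymap(lambda x: np.where(x > 0, 1, 0))  # cell-by-cell Python loop
fast = (df > 0).astype(np.int8)                      # vectorized comparison

# Both yield the same 0/1 matrix; only the speed differs.
assert (slow.values == fast.values).all()
```

Note that on recent pandas versions .applymap() emits a deprecation warning in favor of DataFrame.map, which has the same per-cell behavior.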

UPDATE2: time comparison on a smaller DF (1000 x 1000), as applymap() will take ages on (55K x 9K) DF:

In [5]: df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 1000)))

In [6]: %timeit df.applymap(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 3.75 s per loop

In [7]: %timeit df.apply(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 256 ms per loop

In [8]: %timeit (df>0).astype(np.int8)
100 loops, best of 3: 2.95 ms per loop
