Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I believe the following function is a working solution for pandas DataFrame rolling argmin/max:

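import numpy as np
import pandas as pd


def data_frame_rolling_arg_func(df, window_size, func):
    # func is 'min' or 'max' and selects np.argmin or np.argmax.
    ws = window_size
    wm1 = window_size - 1
    # Positions inside each window plus the window's first row number give
    # absolute row numbers, which are then mapped to index labels.
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'))[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))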

It is inspired by a partial solution for rolling idxmax on pandas Series.

Explanations:

  • Apply the numpy argmin/max function to the rolling window.
  • Only keep the non-NaN values.
  • Convert the values to int.
  • Realign the values to original row numbers.
  • Use applymap to replace the row numbers by the index values.
  • Combine with the original DataFrame filled with NaN in order to add the first rows with expected NaN values.
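The steps above can be sketched one at a time on a tiny frame. This is only an illustration, not the function from the question: the column name `x` and the data are made up, `raw=True` is passed so that NumPy receives plain arrays, `Series.map` stands in for `applymap`, and `reindex` stands in for `combine_first` when restoring the leading NaN rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({"x": [1.0, 5.0, 2.0, 4.0, 3.0]}, index=list("abcde"))
window_size = 3
wm1 = window_size - 1

# Step 1: position of the maximum inside each window (0 .. window_size - 1).
relative = df.rolling(window_size).apply(np.argmax, raw=True)

# Steps 2 and 3: drop the leading NaN rows and convert to int.
relative = relative.iloc[wm1:].astype(int)

# Step 4: add the row number at which each window starts.
starts = np.arange(len(df) - wm1).reshape(-1, 1)
absolute = relative + starts

# Step 5: translate row numbers into index labels, column by column.
labels = absolute.apply(lambda col: col.map(lambda i: df.index[i]))

# Step 6: reindex on the original index so the first rows come back as NaN.
result = labels.reindex(df.index)
print(result)
```

For this input, rows `a` and `b` are NaN and rows `c`, `d`, `e` hold the labels `b`, `b`, `d`.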

In [1]: index = map(chr, range(ord('a'), ord('a') + 10))

In [2]: df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)

In [3]: df
Out[3]: 
    0   1   2
a  -4  15   0
b   0  -6   4
c   7   8 -18
d  11  12 -16
e   6   3  -6
f  -1   4  -9
g   6 -10  -7
h   8  11 -25
i  -2 -10  -8
j   0  10  -7

In [4]: data_frame_rolling_arg_func(df, 3, 'max')
Out[4]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    a    b
d    d    d    b
e    d    d    e
f    d    d    e
g    e    f    e
h    h    h    g
i    h    h    g
j    h    h    j

In [5]: data_frame_rolling_arg_func(df, 3, 'min')
Out[5]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    b    c
d    b    b    c
e    e    e    c
f    f    e    d
g    f    g    f
h    f    g    h
i    i    g    h
j    i    i    h

My questions are:

  • Can you find any mistakes?
  • Is there a better solution? That is: more performant and/or more elegant.

And for pandas maintainers out there: it would be nice if the already great pandas library included rolling idxmax and idxmin.
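One way to look for mistakes is to compare against a slow but obvious reference. The sketch below is an assumption about how such a check could look, not part of the original function: `rolling_idx_reference` is a hypothetical helper that calls `idxmin`/`idxmax` on every full window, and the data is column 0 of the example frame above. It assumes the input contains no NaN values.

```python
import numpy as np
import pandas as pd


def rolling_idx_reference(df, window_size, func):
    """Slow reference: call idxmin/idxmax on every full window."""
    rows = []
    for end in range(1, len(df) + 1):
        if end < window_size:
            # Not enough rows yet: mirror the leading NaN rows.
            rows.append([np.nan] * df.shape[1])
        else:
            window = df.iloc[end - window_size:end]
            rows.append(getattr(window, f"idx{func}")().tolist())
    return pd.DataFrame(rows, index=df.index, columns=df.columns)


# Column 0 of the example frame shown earlier.
df = pd.DataFrame({0: [-4, 0, 7, 11, 6, -1, 6, 8, -2, 0]},
                  index=list("abcdefghij"))

ref_max = rolling_idx_reference(df, 3, "max")
ref_min = rolling_idx_reference(df, 3, "min")
print(ref_max[0].tolist())
print(ref_min[0].tolist())
```

For this column the reference agrees with the outputs shown above, including the tie in the window ending at `g`, where both approaches pick the first occurrence.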



1 Answer

The NaN issue I mentioned in a comment to the OP can be solved in the following manner:

import numpy as np
import pandas as pd


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws, min_periods=0).apply(getattr(np, f'arg{func}'),
                                                raw=True)[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))


def main():
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    # Cast column 0 to float so it can hold NaN, then assign without chained indexing.
    df[0] = df[0].astype(float)
    df.iloc[3:6, 0] = np.nan
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()

Result:

$ python demo.py 
      0   1   2
a   3.0   0   7
b   1.0   3  11
c   1.0  15  -6
d   NaN   2 -16
e   NaN   0  24
f   NaN   0  14
g   2.0   0   4
h  -1.0 -11  16
i  17.0   0  -2
j   3.0  -5  -8
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    b    a    c
d    d    d    d
e    d    e    d
f    d    e    d
g    e    e    g
h    f    h    g
i    h    h    i
j    h    h    j
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    c    b
d    d    c    b
e    d    c    e
f    d    d    e
g    e    e    e
h    f    f    h
i    i    g    h
j    i    i    h

The handling of NaN values is a little subtle. I want my rolling idxmin/max function to cooperate well with the regular DataFrame rolling min/max functions. By default, these produce NaN as soon as the input window contains a NaN value, and so does the rolling apply function. For apply, that is a problem, because a NaN result cannot be converted into an index label. That is a pity: the NaN in the output appears only because there is a NaN in the input, so the index of that input NaN is exactly what I would like my rolling idxmin/max function to return. Fortunately, this is what I get with the following combination of parameters:

  • min_periods=0 for the pandas rolling function. The apply function will then get a chance to produce its own value regardless of how many NaN values are found in the input window.
  • raw=True for the apply function. This makes the applied function receive a numpy array instead of a pandas Series. np.argmin/max then returns the position of the first NaN value in the window, which is exactly what we want. Note that without raw=True, i.e. in the pandas Series case, np.argmin/max seems to ignore the NaN values, which is NOT what we want. A nice side effect of raw=True is that it should improve performance too! More about that later.
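The difference between the two cases can be seen on a single window. This is a minimal sketch with made-up values; `Series.argmin` is called directly here as a stand-in for what np.argmin appears to do when it is handed a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical window contents with two NaN values.
window = np.array([1.0, np.nan, 0.5, np.nan])

# On a plain NumPy array, argmin and argmax both report the position
# of the first NaN.
first_nan_min = int(np.argmin(window))
first_nan_max = int(np.argmax(window))

# Series.argmin skips NaN by default (skipna=True), so it points at 0.5.
skipping_min = int(pd.Series(window).argmin())

print(first_nan_min, first_nan_max, skipping_min)  # 1 1 2
```

With raw=True the rolling apply sees the first form, so a window containing NaN yields the position of that NaN rather than the position of the smallest remaining value.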
