performance - Split Python sequence (time series/array) into subsequences with overlap

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I need to extract all subsequences of a time series/array of a given window. For example:

>>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> window = 3
>>> subsequences(ts, window)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

Naive methods that iterate over the sequence are of course expensive, for example:

def subsequences(ts, window):
    res = []
    for i in range(ts.size - window + 1):
        subts = ts[i:i+window]
        subts.reset_index(drop=True, inplace=True)
        subts.name = None
        res.append(subts)
    return pd.DataFrame(res)

I found a better way by copying the sequence, shifting it by a different value until the window is covered, and splitting the different sequences with reshape. Performance is around 100x better, because the for loop iterates over the window size, and not the sequence size:

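def subsequences(ts, window):
    values = ts.to_numpy()  # work on the underlying array; Series has no reshape
    res = []
    for i in range(window):
        # Start at offset i and keep only as many elements as fill whole rows.
        n_rows = (values.size - i) // window
        res.append(values[i:i + n_rows * window].reshape((n_rows, window)))
    # Rows come out grouped by offset, not ordered by start position.
    return pd.DataFrame(np.concatenate(res, axis=0))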

I've seen that pandas includes several rolling functions in the pandas.stats.moment module, and I guess what they do is somehow similar to the subsequencing problem. Is there anything in that module, or anywhere else in pandas, that would make this more efficient?

Thank you!

UPDATE (SOLUTION):

Based on @elyase's answer, there is a slightly simpler implementation for this specific case. Let me write it down here and explain what it's doing:

def subsequences(ts, window):
    shape = (ts.size - window + 1, window)
    strides = ts.strides * 2
    return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)

Given the 1-D numpy array, we first compute the shape of the resulting array. There will be one row starting at each position of the array, except for the last few positions, where there aren't enough elements left to complete the window.

In the first example in this question, the last number we start at is 7, because starting at 8 we can't create a window of three elements. So the number of rows is the size minus the window plus one. The number of columns is simply the window.
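As a quick check of that row count, here is a minimal sketch. The helper name window_shape is made up for illustration and is not part of NumPy or pandas:

```python
import numpy as np

def window_shape(n, window):
    # Hypothetical helper: one row per valid start position, one column per window slot.
    return (n - window + 1, window)

ts = np.arange(10)                 # the ten-element example: 0..9
shape = window_shape(ts.size, 3)   # (rows, columns) of the result
last_start = shape[0] - 1          # index of the last position a full window can start at
```

For the ten-element series with a window of 3, this gives 8 rows, and the last full window starts at index 7.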

Next, the tricky part is telling how to fill the resulting array, with the shape we just defined.

To do so, we take the first element of the original array as the first element of the result. Then we need to specify two values (as a tuple of two integers passed to the strides parameter). These values specify the steps we need to take in the original array (the 1-D one) to fill the new one (the 2-D one).

Consider a different example, where we want to reproduce the np.reshape function, turning a 9-element 1-D array into a 3x3 array. The first element fills the first position, and the one to its right is the next element of the 1-D array, so we move 1 step. Then, the tricky part: to fill the first element of the second row, we need to move 3 steps, from the 0 to the 3, see:

>>> original = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> new = np.array([[0, 1, 2],
...                 [3, 4, 5],
...                 [6, 7, 8]])

So, to reshape, we take 1 step to move right and 3 steps to move down; in NumPy's (row, column) order that is (3, 1). Our case, where the rows overlap, is actually simpler. When we move down to start a new row, we start at the next position in the 1-D array, so 1 step. When we move right, we again take the next element of the 1-D array, so 1 step as well. The steps are therefore (1, 1).
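To make the "steps" idea concrete, here is a rough sketch that fills the 2-D array by index arithmetic alone. The helper fill_by_steps is hypothetical, written only for this illustration, and unlike as_strided it copies the data because it uses fancy indexing:

```python
import numpy as np

def fill_by_steps(src, shape, row_step, col_step):
    # Hypothetical helper: out[r, c] = src[r * row_step + c * col_step]
    rows, cols = shape
    idx = (np.arange(rows)[:, None] * row_step
           + np.arange(cols)[None, :] * col_step)
    return src[idx]  # fancy indexing, so this is a copy, not a view

original = np.arange(9)

# Plain reshape: 3 steps to reach the next row, 1 step to reach the next column.
reshaped = fill_by_steps(original, (3, 3), row_step=3, col_step=1)

# Overlapping windows: 1 step to the next row, 1 step to the next column.
windows = fill_by_steps(original, (7, 3), row_step=1, col_step=1)
```

The only difference between the reshape and the overlapping windows is the step taken when moving down a row.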

There is one last thing to note. The strides argument does not take the "steps" we used, but the corresponding offsets in bytes. To get them, we can use the strides attribute of numpy arrays. It returns a tuple with the strides (steps in bytes), with one element for each dimension. In our case we get a 1-element tuple and we want it twice, hence the * 2 (which repeats the tuple rather than multiplying its value).
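A small sketch of what those byte strides look like, assuming an int64 array (8 bytes per element); with another dtype the numbers change but the pattern is the same:

```python
import numpy as np

ts = np.arange(10, dtype=np.int64)

one_d = ts.strides       # one entry per dimension, in bytes: (8,)
two_d = ts.strides * 2   # tuple repetition, not arithmetic: (8, 8)
step = ts.itemsize       # bytes per element, i.e. "1 step" expressed in bytes
```

So a step of (1, 1) in elements becomes (8, 8) in bytes for this dtype.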

The np.lib.stride_tricks.as_strided function performs the filling using the described method without copying the data, which makes it quite efficient.
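A short sketch of the "no copy" point, under the assumption that NumPy 1.20 or newer is installed (sliding_window_view was added in that release and is shown here only as an alternative, not as the method used above):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

ts = np.arange(10)
window = 3

view = as_strided(ts, shape=(ts.size - window + 1, window), strides=ts.strides * 2)
shares = np.shares_memory(ts, view)  # the windows point at the original buffer

# Rows overlap in memory, so writing through the view would change several rows
# at once. Take a copy first if the result needs to be modified.
safe = view.copy()

# NumPy 1.20+ offers a checked, read-only equivalent for this use case.
same = sliding_window_view(ts, window)
```

Since as_strided does no bounds checking, a wrong shape or strides value can read outside the array, which is one reason to prefer sliding_window_view where it is available.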

Finally, note that the function posted here assumes a 1-D input array (which is different from a 2-D array with a single row or column). Check the shape attribute of the input array: you should get something like (N,) and not (N, 1). This function would fail on the latter. Note that the method posted by @elyase handles two-dimensional input arrays (that's why this version is slightly simpler).
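One way to guard against the wrong input shape, as a sketch only; check_1d is a hypothetical helper and not something the function above calls:

```python
import numpy as np

col = np.arange(10).reshape(-1, 1)  # shape (10, 1): a 2-D column, not accepted
flat = col.ravel()                  # shape (10,): the 1-D form the function expects

def check_1d(a):
    # Hypothetical guard: reject anything that is not strictly 1-D.
    a = np.asarray(a)
    if a.ndim != 1:
        raise ValueError(f"expected a 1-D array, got shape {a.shape}")
    return a
```

Calling check_1d(flat) returns the array unchanged, while check_1d(col) raises a ValueError.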

1 Answer

深蓝 · Answer 1 · 2021-10-17T02:56:40+0000

This is 34x faster than your fast version on my machine:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

>>> rolling_window(ts.values, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

Credit goes to Erik Rigtorp.
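Because rolling_window builds the shape and strides from the last axis, it should also work on 2-D input, windowing each row independently. A small sketch, with the function repeated so the snippet runs on its own:

```python
import numpy as np

def rolling_window(a, window):
    # Keep the leading axes, then split the last axis into (n_windows, window).
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    # Reuse the last-axis stride for the new window axis.
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(10).reshape(2, 5)  # two rows of five values
w = rolling_window(a, 3)         # three windows of length 3 per row
```

Here w has shape (2, 3, 3), and w[1] holds the windows over the second row, from [5, 6, 7] to [7, 8, 9].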
