Most efficient way to shift MultiIndex time series


Question

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?

Right now, I do something like:

cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))

For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)

I'm converting this from R code and the equivalent data.table call:

dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]

only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
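
In pandas terms that helper is just a one-step lead; a toy sketch, purely for illustration:

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
s.shift(-1)   # 2.0, 3.0, NaN -- the same as R's c(x[-1], NA)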

Is there a way I can get the pandas version back in line speed-wise? I tested this on 0.8.1.

Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:

import numpy as np
import pandas as pd

ids = np.arange(48000)
# random series lengths: mean ~15 months, clipped to at least 1
lens = np.maximum(np.round(15 + 9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
# month counter that restarts at 0 within each pool
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))

That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.

Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.

library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])

Answer

I would suggest you reshape the data and do a single shift versus the groupby approach:

result = df.unstack(0).shift(1).stack()

Unstacking pivots each pool into its own set of columns, so a single vectorized shift lags every pool at once; stacking back puts the pool level innermost, which switches the order of the index levels, so you'd want to swap and reorder:

result = result.swaplevel(0, 1).sortlevel(0)
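
Side note for readers on a newer pandas: sortlevel was removed in later releases, so the reordering step would now be spelled with sort_index (assuming the same result object):

result = result.swaplevel(0, 1).sort_index(level=0)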

You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):

In [17]: result.ix[1]
Out[17]: 
              a         b         c         d         e
month                                                  
1      0.752511  0.600825  0.328796  0.852869  0.306379
2      0.251120  0.871167  0.977606  0.509303  0.809407
3      0.198327  0.587066  0.778885  0.565666  0.172045
4      0.298184  0.853896  0.164485  0.169562  0.923817
5      0.703668  0.852304  0.030534  0.415467  0.663602
6      0.851866  0.629567  0.918303  0.205008  0.970033
7      0.758121  0.066677  0.433014  0.005454  0.338596
8      0.561382  0.968078  0.586736  0.817569  0.842106
9      0.246986  0.829720  0.522371  0.854840  0.887886
10     0.709550  0.591733  0.919168  0.568988  0.849380
11     0.997787  0.084709  0.664845  0.808106  0.872628
12     0.008661  0.449826  0.841896  0.307360  0.092581
13     0.727409  0.791167  0.518371  0.691875  0.095718
14     0.928342  0.247725  0.754204  0.468484  0.663773
15     0.934902  0.692837  0.367644  0.061359  0.381885
16     0.828492  0.026166  0.050765  0.524551  0.296122
17     0.589907  0.775721  0.061765  0.033213  0.793401
18     0.532189  0.678184  0.747391  0.199283  0.349949

In [18]: df.ix[1]
Out[18]: 
              a         b         c         d         e
month                                                  
0      0.752511  0.600825  0.328796  0.852869  0.306379
1      0.251120  0.871167  0.977606  0.509303  0.809407
2      0.198327  0.587066  0.778885  0.565666  0.172045
3      0.298184  0.853896  0.164485  0.169562  0.923817
4      0.703668  0.852304  0.030534  0.415467  0.663602
5      0.851866  0.629567  0.918303  0.205008  0.970033
6      0.758121  0.066677  0.433014  0.005454  0.338596
7      0.561382  0.968078  0.586736  0.817569  0.842106
8      0.246986  0.829720  0.522371  0.854840  0.887886
9      0.709550  0.591733  0.919168  0.568988  0.849380
10     0.997787  0.084709  0.664845  0.808106  0.872628
11     0.008661  0.449826  0.841896  0.307360  0.092581
12     0.727409  0.791167  0.518371  0.691875  0.095718
13     0.928342  0.247725  0.754204  0.468484  0.663773
14     0.934902  0.692837  0.367644  0.061359  0.381885
15     0.828492  0.026166  0.050765  0.524551  0.296122
16     0.589907  0.775721  0.061765  0.033213  0.793401
17     0.532189  0.678184  0.747391  0.199283  0.349949
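
(The .ix indexer used in the session above was also removed in later pandas versions; the same cross-sections would now be written as result.loc[1] and df.loc[1].)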

Perf isn't too bad with this method (it might be a touch slower in 0.9.0):

In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s
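
For completeness: pandas releases well after the 0.8.x used here added a cythonized shift on GroupBy objects, so on a current version the original groupby route should be fast without the Python-level lambda. A sketch, assuming the df built in the question:

df_shift = df.groupby(level=0).shift(-1)   # per-pool shift, same result as the transform(lambda ...) version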