Pandas apply with running total in second column


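Question

I am converting a function to pandas. It loops over a collection and updates each value based on a condition and a running total. The function looks like this: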

def calculate_value():
    cumulative_amount = 0

    for row in rows:
        if row['amount'] < 0:
            return 0

        amount = 0

        if row['kind'] == 'A':
            amount = row['amount'] * row['input_amount']
        elif row['kind'] == 'B':
            amount = row['input_amount'] - cumulative_amount
        elif row['kind'] == 'C':
            amount = row['amount']

        cumulative_amount += amount
        row['result'] = amount

        if row['kind'] == 'B':
            break

    return rows

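Basically, it loops over all rows and adds a result value to each one. That result may depend on the cumulative running total. Also, once we reach a certain row (row['kind'] == 'B'), we should break and stop processing further rows.

When converting this to pandas, it seems I should use apply. So far I have the following code, which almost works, but when I try to get cumulative_amount with shift(-1), it always comes back as nan.

What is the best way to do this in pandas?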

def calculate_value(row: Series):
    if row['amount'] < 0 or row.shift(-1)['kind'] == 'B':
        row['cumulative_amount'] = 0
        row['result'] = 0
        return row

    amount = 0

    if np.isnan(row.shift(-1)['cumulative_amount']):
        cumulative_amount = 0
    else:
        cumulative_amount = row.shift(-1)['cumulative_amount']

    if row['kind'] == 'A':
        amount = row['amount'] * row['input_amount']
    elif row['kind'] == 'B':
        amount = row['input_amount'] - cumulative_amount
    elif row['kind'] == 'C':
        amount = row['amount']

    row['cumulative_amount'] = amount + cumulative_amount
    row['result'] = amount
    return row

df['cumulative_amount'] = 0
new_df = df.apply(lambda x: calculate_value(x), axis=1)

An example of the input and the desired output:

df = pd.DataFrame({
    'kind': {1: 'C', 2: 'E', 3: 'A', 4: 'A', 5: 'B', 6: 'C'},
    'amount': {1: -800, 2: 100, 3: 0.5, 4: 0.5, 5: 0, 6: 200},
    'input_amount': {1: 800, 2: 800, 3: 800, 4: 800, 5: 800, 6: 800}
})

   amount  input_amount kind  cumulative_amount  result
1  -800.0           800    C                0.0     0.0
2   100.0           800    E                0.0     0.0
3     0.5           800    A              400.0   400.0
4     0.5           800    A              800.0   400.0
5     0.0           800    B              800.0     0.0
6   200.0           800    C              800.0     0.0

Recommended answer

If I understand correctly, only the result for kind 'B' depends on other rows. So you can do everything else first:

df['result'] = 0.

a = (df.kind == 'A') & (df.amount >= 0) 
c = (df.kind == 'C') & (df.amount >= 0)

df.loc[a, 'result'] = df.loc[a, 'amount'] * df.loc[a, 'input_amount']
df.loc[c, 'result'] = df.loc[c, 'amount']

Take the cumulative sum:

df['cumulative_amount'] = df.result.cumsum()

Correct the value of 'result' for every occurrence of kind 'B', using the 'cumulative_amount' of the row just before it:

df.loc[(df.kind == 'B'), 'result'] = df.loc[(df.kind == 'B'), 'input_amount'].values - df.loc[(df.kind.shift(-1) == 'B'), 'cumulative_amount'].values
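
The mask df.kind.shift(-1) == 'B' works because shifting a column moves values between rows, so it marks the row just before each 'B'. Inside apply(axis=1), by contrast, each row is a Series indexed by column names, so row.shift(-1) moves values between columns and leaves NaN in the last one, which is likely why the attempt in the question always saw nan. A minimal sketch on a made-up three-row frame (the frame and variable names are illustrative, not from the question):

```python
import pandas as pd

# Tiny illustrative frame; 'cumulative_amount' is the last column,
# as it is in the question after df['cumulative_amount'] = 0.
df = pd.DataFrame({
    'kind': ['A', 'A', 'B'],
    'amount': [0.5, 0.5, 0.0],
    'cumulative_amount': [400.0, 800.0, 800.0],
})

# Column-wise shift: each row sees the 'kind' of the NEXT row,
# so this mask is True on the row just before a 'B'.
before_b = df['kind'].shift(-1) == 'B'

# Row-wise shift: a single row is a Series indexed by column names,
# so shift(-1) slides values across columns instead of across rows.
row = df.iloc[0]
shifted_row = row.shift(-1)

print(before_b.tolist())
print(shifted_row)
```

In the shifted row, 'kind' now holds the old 'amount' value and 'cumulative_amount' holds NaN, so a row-wise function never gets access to the previous row's running total.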

Correct the values of 'result' and 'cumulative_amount' after the first occurrence of 'B':

df.loc[(df.kind == 'B').cumsum().shift() > 0, 'result'] = 0
# (df.kind == 'B').cumsum().shift() is a running count of the number of B's encountered prior to the row index, 
# so you want to 'stop' once this number is no longer zero
# You could of course do this more simply by figuring out which position in the index has the first B, 
# then using .loc or .iloc (.ix has been removed from pandas), but it's actually longer to type out.

df['cumulative_amount'] = df.result.cumsum() # Once more, because we've changed the value of results below B.
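
For convenience, the steps in this answer can be collected into one function. This is only a sketch under a few assumptions: the function name calculate_values and the helper names are made up here, rows with a negative amount get a result of 0 (as in the desired output), and, like the steps above, any row of kind 'B' counts as the stopping point regardless of its amount. It computes the running total before each row as cumsum minus the row's own result, rather than pairing two .values arrays, so it does not rely on those arrays having equal length.

```python
import pandas as pd


def calculate_values(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: vectorised running-total calculation."""
    out = df.copy()
    valid = out['amount'] >= 0
    is_a = (out['kind'] == 'A') & valid
    is_c = (out['kind'] == 'C') & valid
    is_b = out['kind'] == 'B'

    # Rows that do not depend on the running total.
    out['result'] = 0.0
    out.loc[is_a, 'result'] = out.loc[is_a, 'amount'] * out.loc[is_a, 'input_amount']
    out.loc[is_c, 'result'] = out.loc[is_c, 'amount']

    # Rows strictly after the first 'B' are never processed.
    after_first_b = is_b.cumsum().shift(fill_value=0) > 0
    out.loc[after_first_b, 'result'] = 0.0

    # Running total of everything BEFORE each row (exclusive cumsum).
    prior_total = out['result'].cumsum() - out['result']

    # Only the first 'B' gets input_amount minus the total so far.
    first_b = is_b & ~after_first_b
    out.loc[first_b, 'result'] = out.loc[first_b, 'input_amount'] - prior_total[first_b]

    out['cumulative_amount'] = out['result'].cumsum()
    return out


# The example input from the question.
df = pd.DataFrame({
    'kind': {1: 'C', 2: 'E', 3: 'A', 4: 'A', 5: 'B', 6: 'C'},
    'amount': {1: -800, 2: 100, 3: 0.5, 4: 0.5, 5: 0, 6: 200},
    'input_amount': {1: 800, 2: 800, 3: 800, 4: 800, 5: 800, 6: 800},
})
out = calculate_values(df)
print(out)
```

On the example input this reproduces the 'result' and 'cumulative_amount' columns of the desired output in the question.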
