pandas中的groupby和移除配对记录[英] groupby and remove pair records in pandas

本文是小编为大家收集整理的关于pandas中的groupby和移除配对记录的处理/解决方法,可以参考本文帮助大家快速定位并解决问题,中文翻译不准确的可切换到English标签页查看源文。

问题描述

我有这样的数据框架,

col1    col2    col3    col4
a1      b1      c1      +
a1      b1      c1      +
a1      b2      c2      +
a1      b2      c2      -
a1      b2      c2      +

如果在col1,col2和col3中有两个具有相同值的记录以及col4中的相反符号,则应从DataFrame中删除它们.

输出:

col1    col2    col3    col4
a1      b1      c1      +
a1      b1      c1      +
a1      b2      c2      +

到目前为止,我尝试了pandas duplicated和groupby,但没有成功找到对.如何做?

推荐答案

我认为需要其他推荐答案

这是我的尝试:

df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
  .groupby(['col1','col2','col3','count']).cumcount())\
  .groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]

输出:

  col1 col2 col3 col4
0   a1   b1   c1    +
1   a1   b1   c1    +
4   a1   b2   c2    +

在更健壮的测试集上:

df = pd.DataFrame(
    [['a1', 'b1', 'c1', '+'], ['a1', 'b1', 'c1', '+'], ['a1', 'b2', 'c2', '+'], ['a1', 'b2', 'c2', '-'], ['a1', 'b2', 'c2', '+'],
     ['a1','b3','c3','+'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+']], 
    columns=['col1', 'col2', 'col3', 'col4']
)

输入数据框:

   col1 col2 col3 col4
0    a1   b1   c1    +
1    a1   b1   c1    +
2    a1   b2   c2    +
3    a1   b2   c2    -
4    a1   b2   c2    +
5    a1   b3   c3    +
6    a1   b3   c3    -
7    a1   b3   c3    -
8    a1   b3   c3    -
9    a1   b3   c3    +
10   a1   b3   c3    +
11   a1   b3   c3    +
12   a1   b3   c3    +

df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
  .groupby(['col1','col2','col3','count']).cumcount())\
  .groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]

输出:

   col1 col2 col3 col4
0    a1   b1   c1    +
1    a1   b1   c1    +
4    a1   b2   c2    +
11   a1   b3   c3    +
12   a1   b3   c3    +

其他推荐答案

考虑到评论说:"如果有两个具有相同值的记录在Col1,Col2和Col3中以及Col4中的相反符号,则应将它们从DataFrame中删除",那么:

1)识别和删除重复:df.drop_duplicates()

2)将它们按三个第一列分组:df.groupby(['col1', 'col2', 'col3'])

3)仅保留尺寸1的组(否则,这意味着我们同时具有"+"和" - "):.filter(lambda group: len(group) == 1)

全部:

df.drop_duplicates().groupby(['col1', 'col2', 'col3']).filter(lambda g: len(g) == 1)

本文地址:https://www.itbaoku.cn/post/1727757.html

问题描述

I have a dataframe like this,

col1    col2    col3    col4
a1      b1      c1      +
a1      b1      c1      +
a1      b2      c2      +
a1      b2      c2      -
a1      b2      c2      +

If there two records with identical values in col1,col2 and col3 and opposite sign in col4, they should be removed from dataframe.

Output:

col1    col2    col3    col4
a1      b1      c1      +
a1      b1      c1      +
a1      b2      c2      +

So far I tried pandas duplicated and groupby but didn't succeeded with finding pairs. How to do this ?

推荐答案

I think need cumcount for count groups define all 4 columns and then groupby again with helper Series define +- groups and compare with set:

s = df.groupby(['col1','col2','col3', 'col4']).cumcount()
df = df[~df.groupby(['col1','col2','col3', s])['col4']
           .transform(lambda x: set(x) == set(['+','-']))]
print (df)
  col1 col2 col3 col4
0   a1   b1   c1    +
1   a1   b1   c1    +
6   a1   b2   c2    +

For better understanding create new column:

df['help'] = df.groupby(['col1','col2','col3', 'col4']).cumcount()
print (df)
  col1 col2 col3 col4  help
0   a1   b1   c1    +     0
1   a1   b1   c1    +     1
2   a1   b2   c2    +     0
3   a1   b2   c2    -     0
4   a1   b2   c2    +     1

df = df[~df.groupby(['col1','col2','col3', 'help'])['col4']
           .transform(lambda x: set(x) == set(['+','-']))]
print (df)
  col1 col2 col3 col4  help
0   a1   b1   c1    +     0
1   a1   b1   c1    +     1
4   a1   b2   c2    +     1

其他推荐答案

Here's my attempt:

df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
  .groupby(['col1','col2','col3','count']).cumcount())\
  .groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]

Output:

  col1 col2 col3 col4
0   a1   b1   c1    +
1   a1   b1   c1    +
4   a1   b2   c2    +

On a more robust test set:

df = pd.DataFrame(
    [['a1', 'b1', 'c1', '+'], ['a1', 'b1', 'c1', '+'], ['a1', 'b2', 'c2', '+'], ['a1', 'b2', 'c2', '-'], ['a1', 'b2', 'c2', '+'],
     ['a1','b3','c3','+'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+']], 
    columns=['col1', 'col2', 'col3', 'col4']
)

Input dataframe:

   col1 col2 col3 col4
0    a1   b1   c1    +
1    a1   b1   c1    +
2    a1   b2   c2    +
3    a1   b2   c2    -
4    a1   b2   c2    +
5    a1   b3   c3    +
6    a1   b3   c3    -
7    a1   b3   c3    -
8    a1   b3   c3    -
9    a1   b3   c3    +
10   a1   b3   c3    +
11   a1   b3   c3    +
12   a1   b3   c3    +

df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
  .groupby(['col1','col2','col3','count']).cumcount())\
  .groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]

Output:

   col1 col2 col3 col4
0    a1   b1   c1    +
1    a1   b1   c1    +
4    a1   b2   c2    +
11   a1   b3   c3    +
12   a1   b3   c3    +

其他推荐答案

Considering the comment saying that " If there two records with identical values in col1,col2 and col3 and opposite sign in col4, they should be removed from dataframe", then:

1) Identify and drop duplicates: df.drop_duplicates()

2) Group them by the three first columns: df.groupby(['col1', 'col2', 'col3'])

3) Only keep groups that are of size 1 (otherwise, it means we have both "+" and "-"): .filter(lambda group: len(group) == 1)

All in one:

df.drop_duplicates().groupby(['col1', 'col2', 'col3']).filter(lambda g: len(g) == 1)