本文是小编为大家收集整理的关于pandas中的groupby和移除配对记录的处理/解决方法,可以参考本文帮助大家快速定位并解决问题,中文翻译不准确的可切换到English标签页查看源文。
问题描述
我有这样的数据框架,
col1 col2 col3 col4 a1 b1 c1 + a1 b1 c1 + a1 b2 c2 + a1 b2 c2 - a1 b2 c2 +
如果在col1,col2和col3中有两个具有相同值的记录以及col4中的相反符号,则应从DataFrame中删除它们.
输出:
col1 col2 col3 col4 a1 b1 c1 + a1 b1 c1 + a1 b2 c2 +
到目前为止,我尝试了pandas duplicated和groupby,但没有成功找到对.如何做?
推荐答案
问题描述
I have a dataframe like this,
col1 col2 col3 col4 a1 b1 c1 + a1 b1 c1 + a1 b2 c2 + a1 b2 c2 - a1 b2 c2 +
If there two records with identical values in col1,col2 and col3 and opposite sign in col4, they should be removed from dataframe.
Output:
col1 col2 col3 col4 a1 b1 c1 + a1 b1 c1 + a1 b2 c2 +
So far I tried pandas duplicated and groupby but didn't succeeded with finding pairs. How to do this ?
推荐答案
I think need cumcount for count groups define all 4 columns and then groupby again with helper Series define +- groups and compare with set:
s = df.groupby(['col1','col2','col3', 'col4']).cumcount() df = df[~df.groupby(['col1','col2','col3', s])['col4'] .transform(lambda x: set(x) == set(['+','-']))] print (df) col1 col2 col3 col4 0 a1 b1 c1 + 1 a1 b1 c1 + 6 a1 b2 c2 +
For better understanding create new column:
df['help'] = df.groupby(['col1','col2','col3', 'col4']).cumcount() print (df) col1 col2 col3 col4 help 0 a1 b1 c1 + 0 1 a1 b1 c1 + 1 2 a1 b2 c2 + 0 3 a1 b2 c2 - 0 4 a1 b2 c2 + 1 df = df[~df.groupby(['col1','col2','col3', 'help'])['col4'] .transform(lambda x: set(x) == set(['+','-']))] print (df) col1 col2 col3 col4 help 0 a1 b1 c1 + 0 1 a1 b1 c1 + 1 4 a1 b2 c2 + 1
其他推荐答案
Here's my attempt:
df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\ .groupby(['col1','col2','col3','count']).cumcount())\ .groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]
Output:
col1 col2 col3 col4 0 a1 b1 c1 + 1 a1 b1 c1 + 4 a1 b2 c2 +
On a more robust test set:
df = pd.DataFrame( [['a1', 'b1', 'c1', '+'], ['a1', 'b1', 'c1', '+'], ['a1', 'b2', 'c2', '+'], ['a1', 'b2', 'c2', '-'], ['a1', 'b2', 'c2', '+'], ['a1','b3','c3','+'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+']], columns=['col1', 'col2', 'col3', 'col4'] )
Input dataframe:
col1 col2 col3 col4 0 a1 b1 c1 + 1 a1 b1 c1 + 2 a1 b2 c2 + 3 a1 b2 c2 - 4 a1 b2 c2 + 5 a1 b3 c3 + 6 a1 b3 c3 - 7 a1 b3 c3 - 8 a1 b3 c3 - 9 a1 b3 c3 + 10 a1 b3 c3 + 11 a1 b3 c3 + 12 a1 b3 c3 + df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\ .groupby(['col1','col2','col3','count']).cumcount())\ .groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]
Output:
col1 col2 col3 col4 0 a1 b1 c1 + 1 a1 b1 c1 + 4 a1 b2 c2 + 11 a1 b3 c3 + 12 a1 b3 c3 +
其他推荐答案
Considering the comment saying that " If there two records with identical values in col1,col2 and col3 and opposite sign in col4, they should be removed from dataframe", then:
1) Identify and drop duplicates: df.drop_duplicates()
2) Group them by the three first columns: df.groupby(['col1', 'col2', 'col3'])
3) Only keep groups that are of size 1 (otherwise, it means we have both "+" and "-"): .filter(lambda group: len(group) == 1)
All in one:
df.drop_duplicates().groupby(['col1', 'col2', 'col3']).filter(lambda g: len(g) == 1)