pandas idxmax: return all rows in case of ties
Question
I am working with a dataframe in which each row is weighted by its probability. I want to select the row with the highest probability, and I am using pandas idxmax() to do so. However, when there are ties, it returns only the first of the tied rows. In my case, I want to get all the rows that tie.
Furthermore, I am doing this as part of a research project in which I process millions of dataframes like the one below, so speed matters.
Example:
My data looks like this:
import pandas as pd

data = [['chr1', 100, 200, 0.2], ['ch1', 300, 500, 0.3], ['chr1', 300, 500, 0.3], ['chr1', 600, 800, 0.3]]
From this list, I create a pandas dataframe as follows:
weighted = pd.DataFrame.from_records(data,columns=['chrom','start','end','probability'])
Which looks like this:
  chrom  start  end  probability
0  chr1    100  200          0.2
1   ch1    300  500          0.3
2  chr1    300  500          0.3
3  chr1    600  800          0.3
I then select the row at argmax(probability) using:
selected = weighted.loc[weighted['probability'].idxmax()]
Which of course returns:
chrom           ch1
start           300
end             500
probability     0.3
Name: 1, dtype: object
Is there a (fast) way to get all the values when there are ties?
Thanks!
Recommended answer
Well, this might be the solution you are looking for:
weighted.loc[weighted['probability'] == weighted['probability'].max()].T
#                1     2     3
# chrom        ch1  chr1  chr1
# start        300   300   600
# end          500   500   800
# probability  0.3   0.3   0.3
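For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea, using the sample data from the question. The variable names `is_max` and `tied` are only illustrative and do not come from the original answer.

```python
import pandas as pd

# Sample data from the question (the 'ch1' entry is kept as posted).
data = [['chr1', 100, 200, 0.2], ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3], ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

# Boolean mask: True for every row whose probability equals the column maximum.
is_max = weighted['probability'] == weighted['probability'].max()

# Indexing .loc with the mask keeps all matching rows, not just the first one.
tied = weighted.loc[is_max]
print(tied.index.tolist())  # [1, 2, 3]
```

By contrast, `weighted['probability'].idxmax()` on the same frame yields only the label `1`, which is the behavior the question describes.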
Other recommended answer
The bottleneck lies in calculating the Boolean indexer. You can bypass the overhead associated with pd.Series objects by performing calculations with the underlying NumPy array:
df2 = df[df['probability'].values == df['probability'].values.max()]
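If this selection is applied to many dataframes, it may be convenient to wrap it in a small helper. The sketch below is one possible way to do that; the function name `rows_at_max` is hypothetical, and it uses `Series.to_numpy()` (available in pandas 0.24 and later) in place of `.values`, which is an assumption about the pandas version in use.

```python
import pandas as pd

def rows_at_max(df, column):
    """Return every row of df whose value in `column` equals the column maximum."""
    # Work on the underlying ndarray to avoid per-call Series overhead.
    values = df[column].to_numpy()
    # Comparing the array with its own maximum gives a boolean mask over rows.
    return df[values == values.max()]

# Sample data from the question.
data = [['chr1', 100, 200, 0.2], ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3], ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

result = rows_at_max(weighted, 'probability')
print(result.index.tolist())  # [1, 2, 3]
```

One difference to keep in mind: the NumPy `max()` propagates NaN, whereas `Series.max()` skips it by default. If the column can contain NaN, the comparison above would match no rows, and `numpy.nanmax` would be the closer equivalent.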
Performance benchmarking with the Pandas equivalent:
# tested on Pandas v0.19.2, Python 3.6.0
df = pd.concat([df]*100000, ignore_index=True)

%timeit df['probability'].eq(df['probability'].max())
# 3.78 ms per loop

%timeit df['probability'].values == df['probability'].values.max()
# 416 µs per loop
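The `%timeit` lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain Python script with the standard `timeit` module. The names `big`, `mask_pandas` and `mask_numpy` are illustrative, and the timings will depend on the machine and pandas version, so no figures are claimed here.

```python
import timeit

import pandas as pd

# Sample data from the question, repeated to 400,000 rows as in the benchmark.
data = [['chr1', 100, 200, 0.2], ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3], ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])
big = pd.concat([weighted] * 100000, ignore_index=True)

def mask_pandas():
    # Series-level comparison.
    return big['probability'].eq(big['probability'].max())

def mask_numpy():
    # Same comparison on the underlying ndarray.
    values = big['probability'].to_numpy()
    return values == values.max()

# Sanity check: both approaches must flag exactly the same rows.
assert (mask_pandas().to_numpy() == mask_numpy()).all()

for fn in (mask_pandas, mask_numpy):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.3f} ms per call")
```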