本文是小编为大家收集整理的关于使用pandas对长数据集进行宽幅处理的处理/解决方法,可以参考本文帮助大家快速定位并解决问题,中文翻译不准确的可切换到English标签页查看源文。
问题描述
有很多类似标题的问题,但我无法解决我的数据集遇到的问题.
数据集:
ID Country Type Region Gender IA01_Raw IA01_Class1 IA01_Class2 IA02_Raw IA02_Class1 IA02_Class2 QA_Include QA_Comments SC1 France A Europe Male 4 8 1 J 4 1 yes N/A SC2 France A Europe Female 2 7 2 Q 6 4 yes N/A SC3 France B Europe Male 3 7 2 K 8 2 yes N/A SC4 France A Europe Male 4 8 2 A 2 1 yes N/A SC5 France B Europe Male 1 7 1 F 1 3 yes N/A ID6 France A Europe Male 2 8 1 R 3 7 yes N/A ID7 France B Europe Male 2 8 1 Q 4 6 yes N/A UC8 France B Europe Male 4 8 2 P 4 2 yes N/A
所需输出:
ID Country Type Region Gender IA Raw Class1 Class2 QA_Include QA_Comments SC1 France A Europe Male 01 K 8 1 yes N/A SC1 France A Europe Male 01 L 8 1 yes N/A SC1 France A Europe Male 01 P 8 1 yes N/A SC1 France A Europe Male 02 Q 8 1 yes N/A SC1 France A Europe Male 02 R 8 1 yes N/A SC1 France A Europe Male 02 T 8 1 yes N/A SC1 France A Europe Male 03 G 8 1 yes N/A SC1 France A Europe Male 03 R 8 1 yes N/A SC1 France A Europe Male 03 G 8 1 yes N/A SC1 France A Europe Male 04 K 8 1 yes N/A SC1 France A Europe Male 04 A 8 1 yes N/A SC1 France A Europe Male 04 P 8 1 yes N/A SC1 France A Europe Male 05 R 8 1 yes N/A ....<小时>
在数据集中,我的列名称为 IA[X]_NAME 其中 X = 1..9 和 NAME = Raw, Class1 和 Class2.
我要做的只是转置这些列,使其看起来像必需输出中显示的表格,即 IA 将显示 X 值,就像这样 raw 和 classes 将显示它们的透视值.
所以为了实现它,我将列切片为:
idVars = list(excel_df_final.columns[0:40]) + list(excel_df_final.columns[472:527]) #These contain columns like ID, Country, Type etc valueVars = excel_df_final.columns[41:472].tolist() #All the IA_ columns
我不知道这一步是否必要,但这给了我完美的列切片,但是当我把它放在 melt 中时,它不能正常工作.我几乎尝试了其他问题中可用的所有方法.
pd.melt(excel_df_final, id_vars=idVars,value_vars=valueVars)
我也试过这个:
excel_df_final.set_index(idVars)[41:472].unstack()
但是没有用,这里是从宽到长的实现,它也没有用:
pd.wide_to_long(excel_df_final, stubnames = ['IA', 'Raw', 'Class1', 'Class2'], i=idVars, j=valueVars)
我得到的从宽到长的错误是:
<块引用>ValueError: 操作数不能与形状一起广播 (95,)(431,)
因为我的数据集实际上有 526 列,所以我将它们分成两个列表,其中一个包含 95 个列名,它们将是 i,其余 431 个是我需要在如示例数据集中所示的行.
推荐答案
这将帮助您入门.本质是使用set_index,列转换为MultiIndex,然后stack.可能存在更好的解决方案,但我会这样做,因为它是您输出的一个简单步骤.
# Set the index with columns that we don't want to "transpose" df2 = df.set_index([ 'ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include', 'QA_Comments']) # Convert headers to MultiIndex -- this is so we can melt IA values df2.columns = pd.MultiIndex.from_tuples(map(tuple, df2.columns.str.split('_'))) # Call stack to replicate data, then reset the index out = df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)
out ID Country Type Region Gender QA_Include QA_Comments IA Class1 Class2 Raw 0 SC1 France A Europe Male yes NaN IA01 8 1 4 1 SC1 France A Europe Male yes NaN IA02 4 1 J 2 SC2 France A Europe Female yes NaN IA01 7 2 2 3 SC2 France A Europe Female yes NaN IA02 6 4 Q 4 SC3 France B Europe Male yes NaN IA01 7 2 3 5 SC3 France B Europe Male yes NaN IA02 8 2 K 6 SC4 France A Europe Male yes NaN IA01 8 2 4 7 SC4 France A Europe Male yes NaN IA02 2 1 A 8 SC5 France B Europe Male yes NaN IA01 7 1 1 9 SC5 France B Europe Male yes NaN IA02 1 3 F 10 ID6 France A Europe Male yes NaN IA01 8 1 2 11 ID6 France A Europe Male yes NaN IA02 3 7 R 12 ID7 France B Europe Male yes NaN IA01 8 1 2 13 ID7 France B Europe Male yes NaN IA02 4 6 Q 14 UC8 France B Europe Male yes NaN IA01 8 2 4 15 UC8 France B Europe Male yes NaN IA02 4 2 P
问题描述
There are a lot of questions out there with similar titles but I'm unable to solve the issues that I'm having with my dataset.
Dataset:
ID Country Type Region Gender IA01_Raw IA01_Class1 IA01_Class2 IA02_Raw IA02_Class1 IA02_Class2 QA_Include QA_Comments SC1 France A Europe Male 4 8 1 J 4 1 yes N/A SC2 France A Europe Female 2 7 2 Q 6 4 yes N/A SC3 France B Europe Male 3 7 2 K 8 2 yes N/A SC4 France A Europe Male 4 8 2 A 2 1 yes N/A SC5 France B Europe Male 1 7 1 F 1 3 yes N/A ID6 France A Europe Male 2 8 1 R 3 7 yes N/A ID7 France B Europe Male 2 8 1 Q 4 6 yes N/A UC8 France B Europe Male 4 8 2 P 4 2 yes N/A
Required output:
ID Country Type Region Gender IA Raw Class1 Class2 QA_Include QA_Comments SC1 France A Europe Male 01 K 8 1 yes N/A SC1 France A Europe Male 01 L 8 1 yes N/A SC1 France A Europe Male 01 P 8 1 yes N/A SC1 France A Europe Male 02 Q 8 1 yes N/A SC1 France A Europe Male 02 R 8 1 yes N/A SC1 France A Europe Male 02 T 8 1 yes N/A SC1 France A Europe Male 03 G 8 1 yes N/A SC1 France A Europe Male 03 R 8 1 yes N/A SC1 France A Europe Male 03 G 8 1 yes N/A SC1 France A Europe Male 04 K 8 1 yes N/A SC1 France A Europe Male 04 A 8 1 yes N/A SC1 France A Europe Male 04 P 8 1 yes N/A SC1 France A Europe Male 05 R 8 1 yes N/A ....
In the Dataset I've columns which are names as IA[X]_NAME where X = 1..9 and NAME = Raw, Class1 and Class2.
What I am trying to do is to just transpose these columns so that it looks like the table shown in Required output i.e. IA will show X value and just like this raw and classes will show their perspective values.
So in order to achieve it I sliced the columns as:
idVars = list(excel_df_final.columns[0:40]) + list(excel_df_final.columns[472:527]) #These contain columns like ID, Country, Type etc valueVars = excel_df_final.columns[41:472].tolist() #All the IA_ columns
I don't know if this step was necessary but this gave me the perfect slices of columns but when I put it in melt it is not working properly. I have tried almost every method that is available in other questions.
pd.melt(excel_df_final, id_vars=idVars,value_vars=valueVars)
I've also tried this:
excel_df_final.set_index(idVars)[41:472].unstack()
but didn't work and here is Wide to long implementation which also didn't work:
pd.wide_to_long(excel_df_final, stubnames = ['IA', 'Raw', 'Class1', 'Class2'], i=idVars, j=valueVars)
The error I got for wide to long is:
ValueError: operands could not be broadcast together with shapes (95,) (431,)
As my dataset has 526 columns in real, so that is why I've divided them into two lists one contains 95 column names which will be the i and the rest 431 are the one that I need to show in the row as shown in the sample data set.
推荐答案
This will get you started. The essence is using set_index, column conversion to MultiIndex, then stack. Better solutions exist, possibly, but I would do it this way because it is an easy step to your output.
# Set the index with columns that we don't want to "transpose" df2 = df.set_index([ 'ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include', 'QA_Comments']) # Convert headers to MultiIndex -- this is so we can melt IA values df2.columns = pd.MultiIndex.from_tuples(map(tuple, df2.columns.str.split('_'))) # Call stack to replicate data, then reset the index out = df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)
out ID Country Type Region Gender QA_Include QA_Comments IA Class1 Class2 Raw 0 SC1 France A Europe Male yes NaN IA01 8 1 4 1 SC1 France A Europe Male yes NaN IA02 4 1 J 2 SC2 France A Europe Female yes NaN IA01 7 2 2 3 SC2 France A Europe Female yes NaN IA02 6 4 Q 4 SC3 France B Europe Male yes NaN IA01 7 2 3 5 SC3 France B Europe Male yes NaN IA02 8 2 K 6 SC4 France A Europe Male yes NaN IA01 8 2 4 7 SC4 France A Europe Male yes NaN IA02 2 1 A 8 SC5 France B Europe Male yes NaN IA01 7 1 1 9 SC5 France B Europe Male yes NaN IA02 1 3 F 10 ID6 France A Europe Male yes NaN IA01 8 1 2 11 ID6 France A Europe Male yes NaN IA02 3 7 R 12 ID7 France B Europe Male yes NaN IA01 8 1 2 13 ID7 France B Europe Male yes NaN IA02 4 6 Q 14 UC8 France B Europe Male yes NaN IA01 8 2 4 15 UC8 France B Europe Male yes NaN IA02 4 2 P