Replacing Category Data (pandas)


Problem description

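I have some large files with several category columns. "Category" is a generous word, too, because these are basically descriptions/partial sentences.

Here is the number of unique values in each category: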

Category 1 = 15 
Category 2 = 94
Category 3 = 294
Category 4 = 401

Location 1 = 30
Location 2 = 60 

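Then there are even users with repeated data (first name, last name, ID, etc.).

I am considering the following solution to make the file size smaller:

1) Create a file that matches each category to a unique integer

2) Create a mapping (is there a way to do this by reading another file? For example, would I create a .csv, load it as another dataframe, and then match against it? Or do I have to literally type it out at first?)

3) Basically do a join (vlookup) and then del the old column with the long object names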

pd.merge(df1, categories, on = 'Category1', how = 'left') 
del df1['Category1']

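What do people normally do in this situation? These files are very large: 60 columns, and most of the data is long, repetitive categories and timestamps. There is literally no numerical data at all. That is fine for me, but sharing the files has been nearly impossible for more than a few months because of shared drive space allocations.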

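Recommended answer

To benefit from the Categorical dtype when saving to CSV, you may want to follow this process:

  1. Extract your category definitions into separate DataFrames/files
  2. Convert your Categorical data to int codes
  3. Save the converted DataFrame to CSV along with the definition DataFrames

When you need to use them again:

  1. Restore the DataFrames from the CSV files
  2. Map the DataFrame with int codes to the category definitions
  3. Convert the mapped columns to Categorical

To illustrate the process:

Make a sample DataFrame: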

import numpy as np
import pandas as pd

# 100,000 rows with two repetitive string columns
df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
# Integer division: np.tile needs an int repeat count
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null object
locations     100000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None

Note the size, 2.3+ MB: this is roughly the size of the CSV file. Now convert the data to Categorical:

df['categories'] = df['categories'].astype('category')
df['locations'] = df['locations'].astype('category')
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null category
locations     100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None

Note that memory usage dropped to 976.6 KB. However, if you save it to CSV right away:

df.to_csv('test1.csv')

...you will see this in the file:

index,categories,locations
0,Category1,Location1
1,Category2,Location2
2,Category3,Location1
3,Category4,Location2

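This means the Categorical data was converted to strings for saving to CSV. So let's get rid of the labels in the Categorical data after saving the definitions: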

categories_details = pd.DataFrame(df.categories.drop_duplicates(), columns=['categories'])
print(categories_details)

      categories
index           
0      Category1
1      Category2
2      Category3
3      Category4

locations_details = pd.DataFrame(df.locations.drop_duplicates(), columns=['locations'])
print(locations_details)

       locations
index           
0      Location1
1      Location2
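One caveat about the definition tables above: the index left behind by drop_duplicates() is the row where each label first appears. It equals the category code in this example only because the sample data introduces the labels in sorted order. A minimal sketch, assuming the column is already Categorical, that builds the table from .cat.categories instead so that the index is the code by construction (the frame `demo` and the helper `definitions_from_categorical` are hypothetical names, not part of the original answer):

```python
import pandas as pd

def definitions_from_categorical(series: pd.Series) -> pd.DataFrame:
    """Return a lookup table whose index is the category code."""
    details = pd.DataFrame({series.name: list(series.cat.categories)})
    details.index.name = 'index'  # same index name the walkthrough uses
    return details

# Labels deliberately appear out of sorted order.
demo = pd.DataFrame({'categories': ['Category3', 'Category1', 'Category3', 'Category2']})
demo['categories'] = demo['categories'].astype('category')

details = definitions_from_categorical(demo['categories'])
codes = demo['categories'].cat.codes

# Looking each code up in the table reproduces the original labels.
restored = codes.map(details['categories'])
print(restored.tolist())
```

With drop_duplicates() on this `demo` frame, 'Category3' would be stored under index 0 even though its code is 2, so the later map step would restore the wrong labels.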

Now convert Categorical to int dtype:

for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].cat.codes
print(df.head())

       categories  locations
index                       
0               0          0
1               1          1
2               2          0
3               3          1
4               0          0

print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null int8
locations     100000 non-null int8
dtypes: int8(2)
memory usage: 976.6 KB
None

Save the converted data to CSV and note that the file now contains only numbers, without labels. The file size will also reflect this change.

df.to_csv('test2.csv')

index,categories,locations
0,0,0
1,1,1
2,2,0
3,3,1
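To put a number on the size difference, here is a rough sketch that measures the CSV text in memory rather than on disk. It rebuilds data shaped like the example; the names `csv_size_bytes`, `labels`, and `coded` are assumptions made for this illustration:

```python
import io

import numpy as np
import pandas as pd

def csv_size_bytes(frame: pd.DataFrame) -> int:
    """Size of the CSV text that to_csv would write, measured in memory."""
    buffer = io.StringIO()
    frame.to_csv(buffer)
    return len(buffer.getvalue().encode('utf-8'))

n = 100000
labels = pd.DataFrame({
    'categories': np.tile(['Category1', 'Category2', 'Category3', 'Category4'], n // 4),
    'locations': np.tile(['Location1', 'Location2'], n // 2),
})
labels.index.name = 'index'

# Same data with every column replaced by its integer codes.
coded = labels.copy()
for col in coded.columns:
    coded[col] = coded[col].astype('category').cat.codes

size_labels = csv_size_bytes(labels)
size_coded = csv_size_bytes(coded)
print(size_labels, size_coded)
```

The longer the labels and the fewer distinct values there are, the larger the saving should be, since each label is replaced by a few digits.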

Save the definitions as well:

categories_details.to_csv('categories_details.csv')
locations_details.to_csv('locations_details.csv')

When you need to restore the files, load them from the CSV files:

df2 = pd.read_csv('test2.csv', index_col='index')
print(df2.head())

       categories  locations
index                       
0               0          0
1               1          1
2               2          0
3               3          1
4               0          0

print(df2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null int64
locations     100000 non-null int64
dtypes: int64(2)
memory usage: 2.3 MB
None
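The codes come back as int64, which is why memory usage is 2.3 MB again at this point. If that matters while the data is still in coded form, one option is to pass a narrower dtype to read_csv. A small sketch, assuming the codes fit in int8 (at most 127 distinct categories); columns like the question's Category 3 and Category 4, with 294 and 401 unique values, would need int16. The inline CSV text stands in for test2.csv:

```python
import io

import pandas as pd

# Stand-in for test2.csv so the sketch is self-contained.
csv_text = "index,categories,locations\n0,0,0\n1,1,1\n2,2,0\n3,3,1\n"

small = pd.read_csv(
    io.StringIO(csv_text),
    index_col='index',
    dtype={'categories': 'int8', 'locations': 'int8'},  # assumed to be wide enough
)
print(small.dtypes)
```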

categories_details2 = pd.read_csv('categories_details.csv', index_col='index')
print(categories_details2.head())

      categories
index           
0      Category1
1      Category2
2      Category3
3      Category4

print(categories_details2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
categories    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
None

locations_details2 = pd.read_csv('locations_details.csv', index_col='index')
print(locations_details2.head())

       locations
index           
0      Location1
1      Location2

print(locations_details2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
locations    2 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
None

Now use map to replace the int-coded data with the category descriptions, then convert them to Categorical:

df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category')
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category')
print(df2.head())

      categories  locations
index                      
0      Category1  Location1
1      Category2  Location2
2      Category3  Location1
3      Category4  Location2
4      Category1  Location1
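As an alternative to map followed by astype('category'), pandas can rebuild a Categorical directly from integer codes with pd.Categorical.from_codes. This is not what the answer uses; it is a sketch of a swapped-in technique, with hypothetical sample values, that also shows how the code -1 (which .cat.codes uses for missing values) comes back as NaN:

```python
import pandas as pd

# Codes as they would be read back from the coded CSV; -1 marks a missing value.
codes = pd.Series([0, 1, 2, 3, 0, -1], name='categories')
# Category labels in code order, as stored in the definitions file.
definitions = ['Category1', 'Category2', 'Category3', 'Category4']

restored = pd.Series(
    pd.Categorical.from_codes(codes.to_numpy(), categories=definitions),
    index=codes.index,
    name=codes.name,
)
print(restored)
```

The result is already Categorical, so no intermediate column of full strings is created.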

print(df2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null category
locations     100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None

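Note that memory usage is back to what it was when we first converted the data to Categorical. If you need to repeat this process many times, it should not be hard to automate.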
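One way the automation could look is a pair of helpers that handle every categorical column in a frame. This is only a sketch under stated assumptions: the function names `save_coded` and `load_coded`, the `<column>_details.csv` naming scheme, and the `code` index name are all invented here, and the definitions are taken from .cat.categories rather than drop_duplicates():

```python
import os
import tempfile

import pandas as pd

def save_coded(frame, data_path, definitions_dir):
    """Write int codes to data_path plus one definitions CSV per categorical column."""
    coded = frame.copy()
    for col in frame.select_dtypes(include=['category']).columns:
        details = pd.DataFrame({col: list(frame[col].cat.categories)})
        details.index.name = 'code'
        details.to_csv(os.path.join(definitions_dir, col + '_details.csv'))
        coded[col] = frame[col].cat.codes
    coded.to_csv(data_path)

def load_coded(data_path, definitions_dir, columns, index_col):
    """Read the coded CSV and turn the listed columns back into Categorical."""
    frame = pd.read_csv(data_path, index_col=index_col)
    for col in columns:
        details = pd.read_csv(
            os.path.join(definitions_dir, col + '_details.csv'), index_col='code'
        )
        frame[col] = pd.Categorical.from_codes(
            frame[col].to_numpy(), categories=details[col].tolist()
        )
    return frame

original = pd.DataFrame({
    'categories': ['Category1', 'Category2', 'Category3', 'Category4', 'Category1'],
    'locations': ['Location1', 'Location2', 'Location1', 'Location2', 'Location1'],
}).astype('category')
original.index.name = 'index'

# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, 'data.csv')
    save_coded(original, data_path, workdir)
    round_trip = load_coded(data_path, workdir, ['categories', 'locations'], 'index')

print(round_trip.dtypes)
```

Labels that read_csv would interpret as missing values (the literal string NA, for example) would need extra care, such as keep_default_na=False when loading the definitions.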

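Other recommended answer

pandas has a Categorical data type that does exactly this. It basically maps the categories to integers behind the scenes.

Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.

The documentation is here.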
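The two internal arrays can be inspected through the .cat accessor. A short illustration with made-up values:

```python
import pandas as pd

s = pd.Series(['BC', 'Alberta', 'BC', 'BC'], dtype='category')

categories = list(s.cat.categories)  # the unique labels, stored once
codes = s.cat.codes.tolist()         # one small integer per row

# Each code is a position in the categories array.
rebuilt = [categories[code] for code in codes]
print(categories, codes, rebuilt)
```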

Other recommended answer

Here is a way to save a dataframe with categorical columns in a single .csv:

Example:
------      -------
Fatcol      Thincol: unique strings once, then numbers
------      -------
"Alberta"   "Alberta"
"BC"        "BC"
"BC"        2   -- string 2
"Alberta"   1   -- string 1
"BC"        2
...

The "Thincol" on the right can be saved as is in a .csv file,
and expanded to the "Fatcol" on the left after reading it in;
this can halve the size of big .csv s with repeated strings.

Functions
---------
fatcol( col: Thincol ) -> Fatcol, list[ unique str ]
thincol( col: Fatcol ) -> Thincol, dict( unique str -> int ), list[ unique str ]

Here "Fatcol" and "Thincol" are type names for iterators, e.g. lists:
    Fatcol: list of strings
    Thincol: list of strings or ints or NaN s
If a `col` is a `pandas.Series`, its `.values` are used.

This brought a 700m .csv down to 248m, but write_csv runs at ~1 MB/sec on my iMac.
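The answer does not include the implementation, so the following is only a guess at what `thincol` and `fatcol` might look like for plain Python lists, written to match the signatures and the Alberta/BC example above (repeats are replaced by the 1-based number of the string's first appearance). NaN handling and pandas.Series input are left out:

```python
def thincol(col):
    """Replace repeats of a string with the 1-based number of its first appearance."""
    numbers = {}  # unique str -> int
    uniques = []  # unique strings in order of first appearance
    thin = []
    for value in col:
        if value in numbers:
            thin.append(numbers[value])
        else:
            uniques.append(value)
            numbers[value] = len(uniques)
            thin.append(value)
    return thin, numbers, uniques

def fatcol(col):
    """Expand a thin column back into the full list of strings."""
    uniques = []
    fat = []
    for value in col:
        if isinstance(value, int):
            fat.append(uniques[value - 1])  # numbers are 1-based
        else:
            uniques.append(value)
            fat.append(value)
    return fat, uniques

fat = ["Alberta", "BC", "BC", "Alberta", "BC"]
thin, numbers, uniques = thincol(fat)
expanded, _ = fatcol(thin)
print(thin)
print(expanded)
```

After a real round trip through a .csv, the numbers come back as text, so a working version would also need a rule for telling the number 2 apart from a label that happens to be "2".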

Article URL: https://www.itbaoku.cn/post/1727801.html