Pandas datetime formatting

Question

Is it possible to get a pd.to_datetime result that keeps its trailing zeros? The zeros seem to be stripped.

import pandas as pd
print(pd.to_datetime("2000-07-26 14:21:00.00000",
                     format="%Y-%m-%d %H:%M:%S.%f"))

The result is

2000-07-26 14:21:00

The desired result is

2000-07-26 14:21:00.00000

I know the two values mean the same thing, but it would be nice to have the consistency.

Recommended answer

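Some testing shows that when datetime data is parsed with format="%H:%M:%S.%f", %f can resolve nanoseconds, provided the ninth digit after the decimal point is non-zero. When a single string is converted, a variable number of trailing zeros, from none to five, is appended depending on the position of the least significant digit after the decimal point, assuming that digit is also the last one. The table below comes from the test data: position is the place of the least significant non-zero digit, which is also the last digit, and zeros is the number of trailing zeros added by formatting: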

    position zeros
       9      0
       8      1
       7      2
       6      0
       5      1
       4      2
       3      3
       2      4
       1      5

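When a whole column is converted with "%H:%M:%S.%f", all of its elements end up with the same number of digits after the decimal point. This is done by adding or removing trailing zeros, even if that raises or lowers the resolution of the raw data. I suppose the reasons are consistency and pleasing aesthetics. It normally does not introduce undue error, because in numerical computation trailing zeros usually do not change the immediate result, although they can affect the estimate of its error and how it should be presented (trailing zeros, significant figures rules).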
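The fixed number of fractional digits asked for in the question can only be produced at output time, because a parsed Timestamp stores an instant rather than the original text. A minimal sketch, assuming a recent pandas in which Timestamp.strftime renders %f as six digits like the standard library; the helper name format_fixed_fraction is hypothetical, and it truncates rather than rounds:

```python
import pandas as pd


def format_fixed_fraction(ts, digits=5):
    """Render ts with exactly `digits` fractional-second digits (1 to 6)."""
    if not 1 <= digits <= 6:
        raise ValueError("digits must be between 1 and 6")
    text = ts.strftime("%Y-%m-%d %H:%M:%S.%f")  # %f always yields six digits
    return text[: len(text) - (6 - digits)]     # drop the unwanted tail


ts = pd.to_datetime("2000-07-26 14:21:00.00000",
                    format="%Y-%m-%d %H:%M:%S.%f")
formatted = format_fixed_fraction(ts, digits=5)
print(formatted)
```

The result is a plain string, so it suits reports or export, not further datetime arithmetic.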

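Below are observations from applying the "%H:%M:%S.%f" format with pandas.to_datetime to single strings and to a pandas.Series (a DataFrame column), and from applying pandas.DataFrame.convert_objects(convert_dates='coerce') to DataFrames that have a column convertible to datetime.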

On a single string, pandas with "%H:%M:%S.%f" preserves a non-zero digit in the ninth place after the decimal point, and adds a date if none is supplied:

import pandas as pd
pd.to_datetime ("10:00:00.000000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000001')

pd.to_datetime ("2015-09-17 10:00:00.000000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[15]: Timestamp('2015-09-17 10:00:00.000000001')
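The same observation can be checked on the parsed object instead of its printed form. A small sketch, assuming a pandas version whose %f directive accepts up to nine digits, as in the session above:

```python
import pandas as pd

ts = pd.to_datetime("10:00:00.000000001", format="%H:%M:%S.%f")

# The ninth fractional digit is kept in the nanosecond field.
nanos = ts.nanosecond
# With no date in the input, the parser falls back to 1900-01-01.
default_date = ts.strftime("%Y-%m-%d")
print(nanos, default_date)
```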

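For shorter fractions, in tests where the last non-zero digit is also the final digit, it appends up to five trailing zeros after that digit, which raises the apparent resolution of the raw data. The exception is a last non-zero digit in the sixth place after the decimal point, where nothing is appended: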

pd.to_datetime ("10:00:00.00000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000010')

pd.to_datetime ("2015-09-17 10:00:00.00000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[16]: Timestamp('2015-09-17 10:00:00.000000010')

pd.to_datetime ("10:00:00.0000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000100')

pd.to_datetime ("2015-09-17 10:00:00.0000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[17]: Timestamp('2015-09-17 10:00:00.000000100')

pd.to_datetime ("10:00:00.000001",format="%H:%M:%S.%f")
Out[33]: Timestamp('1900-01-01 10:00:00.000001')

pd.to_datetime ("2015-09-17 10:00:00.000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[18]: Timestamp('2015-09-17 10:00:00.000001')

pd.to_datetime ("10:00:00.00001",format="%H:%M:%S.%f")
Out[6]: Timestamp('1900-01-01 10:00:00.000010')

pd.to_datetime ("2015-09-17 10:00:00.00001",format="%Y-%m-%d %H:%M:%S.%f")
Out[19]: Timestamp('2015-09-17 10:00:00.000010')

pd.to_datetime ("10:00:00.0001",format="%H:%M:%S.%f")
Out[9]: Timestamp('1900-01-01 10:00:00.000100')

pd.to_datetime ("2015-09-17 10:00:00.0001",format="%Y-%m-%d %H:%M:%S.%f")
Out[21]: Timestamp('2015-09-17 10:00:00.000100')

pd.to_datetime ("10:00:00.001",format="%H:%M:%S.%f")
Out[10]: Timestamp('1900-01-01 10:00:00.001000')

pd.to_datetime ("2015-09-17 10:00:00.001",format="%Y-%m-%d %H:%M:%S.%f")
Out[22]: Timestamp('2015-09-17 10:00:00.001000')

pd.to_datetime ("10:00:00.01",format="%H:%M:%S.%f")
Out[12]: Timestamp('1900-01-01 10:00:00.010000')

pd.to_datetime ("2015-09-17 10:00:00.01",format="%Y-%m-%d %H:%M:%S.%f")
Out[24]: Timestamp('2015-09-17 10:00:00.010000')

pd.to_datetime ("10:00:00.1",format="%H:%M:%S.%f")
Out[13]: Timestamp('1900-01-01 10:00:00.100000')

pd.to_datetime ("2015-09-17 10:00:00.1",format="%Y-%m-%d %H:%M:%S.%f")
Out[26]: Timestamp('2015-09-17 10:00:00.100000')
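The appended zeros belong to the printed representation, not to the stored value. A short sketch, assuming a recent pandas where Timestamp.isoformat accepts a timespec argument, that compares two spellings of the same instant and then picks the printed precision explicitly:

```python
import pandas as pd

fmt = "%H:%M:%S.%f"
short = pd.to_datetime("10:00:00.1", format=fmt)
padded = pd.to_datetime("10:00:00.100000", format=fmt)

same_instant = short == padded   # trailing zeros do not change the value
micros = short.microsecond       # ".1" seconds is stored as 100000 microseconds

# timespec chooses how many fractional digits isoformat prints.
as_millis = short.isoformat(sep=" ", timespec="milliseconds")
print(same_instant, micros, as_millis)
```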

Let's see how it works with a DataFrame:

!type test.csv # here type is Windows substitute for Linux cat command
date,mesg
10:00:00.000000001,one
10:00:00.00000001,two
10:00:00.0000001,three
10:00:00.000001,four
10:00:00.00001,five
10:00:00.0001,six
10:00:00.001,seven
10:00:00.01,eight
10:00:00.1,nine
10:00:00.000000001,ten
10:00:00.000000002,eleven
10:00:00.000000003,twelve

df = pd.read_csv('test.csv')
df
Out[30]: 
                  date    mesg
0   10:00:00.000000001     one
1    10:00:00.00000001     two
2     10:00:00.0000001   three
3      10:00:00.000001    four
4       10:00:00.00001    five
5        10:00:00.0001     six
6         10:00:00.001   seven
7          10:00:00.01   eight
8           10:00:00.1    nine
9   10:00:00.000000001     ten
10  10:00:00.000000002  eleven
11  10:00:00.000000003  twelve

df.dtypes
Out[31]: 
date    object
mesg    object
dtype: object

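Datetime conversion of the DataFrame with convert_objects, which takes no format option, gives microsecond resolution even where the raw data has lower or higher resolution than that, and adds today's date: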

df2 = df.convert_objects(convert_dates='coerce')
df2
Out[32]: 
                     date    mesg
0  2015-09-17 10:00:00.000000     one
1  2015-09-17 10:00:00.000000     two
2  2015-09-17 10:00:00.000000   three
3  2015-09-17 10:00:00.000001    four
4  2015-09-17 10:00:00.000010    five
5  2015-09-17 10:00:00.000100     six
6  2015-09-17 10:00:00.001000   seven
7  2015-09-17 10:00:00.010000   eight
8  2015-09-17 10:00:00.100000    nine
9  2015-09-17 10:00:00.000000     ten
10 2015-09-17 10:00:00.000000  eleven
11 2015-09-17 10:00:00.000000  twelve

df2.dtypes
Out[33]: 
date    datetime64[ns]
mesg            object
dtype: object
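DataFrame.convert_objects was later deprecated and removed from pandas, so the conversion above will not run on current releases. A rough modern equivalent, assuming a recent pandas, is to convert the column explicitly with pandas.to_datetime. The inline CSV text is a shortened stand-in for test.csv. With an explicit time-only format the date defaults to 1900-01-01 rather than today's date:

```python
import io
import pandas as pd

# Shortened stand-in for test.csv (an assumption for this sketch).
csv_text = """date,mesg
10:00:00.000000001,one
10:00:00.000001,four
10:00:00.1,nine
"""
df = pd.read_csv(io.StringIO(csv_text))

converted = df.copy()
# An explicit format keeps the nanosecond digit; bad strings become NaT.
converted["date"] = pd.to_datetime(converted["date"],
                                   format="%H:%M:%S.%f", errors="coerce")
print(converted.dtypes)
```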

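Once datetime conversion has been done without an explicit format (i.e. with DataFrame.convert_objects), the higher resolution that the raw data had cannot be recovered by applying the "%H:%M:%S.%f" format specifier afterwards: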

df2['date'] = pd.to_datetime(df2['date'],format="%H:%M:%S.%f")
df2
Out[34]: 
                         date    mesg
0  2015-09-17 10:00:00.000000     one
1  2015-09-17 10:00:00.000000     two
2  2015-09-17 10:00:00.000000   three
3  2015-09-17 10:00:00.000001    four
4  2015-09-17 10:00:00.000010    five
5  2015-09-17 10:00:00.000100     six
6  2015-09-17 10:00:00.001000   seven
7  2015-09-17 10:00:00.010000   eight
8  2015-09-17 10:00:00.100000    nine
9  2015-09-17 10:00:00.000000     ten
10 2015-09-17 10:00:00.000000  eleven
11 2015-09-17 10:00:00.000000  twelve
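Why the dropped digits cannot come back can be seen on a single value. A sketch that assumes discarding sub-microsecond digits behaves like Timestamp.floor("us"); this illustrates the effect and is not a claim about how convert_objects worked internally:

```python
import pandas as pd

original = pd.Timestamp("2015-09-17 10:00:00.000000001")
truncated = original.floor("us")   # keep microseconds, drop nanoseconds

# Re-parsing the truncated value's text cannot restore the dropped digit.
reparsed = pd.to_datetime(str(truncated), format="%Y-%m-%d %H:%M:%S")
print(original.nanosecond, truncated.nanosecond, reparsed.nanosecond)
```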

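Applying the same format specifier to the raw column with pandas.to_datetime gives nanosecond resolution if at least one element has a non-zero digit in the ninth place (as the pandas.to_datetime documentation describes). It also raises raw data with less than nanosecond resolution to that level and adds 1900-01-01 as the date: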

df3 = df.copy(deep=True)
df3['date'] = pd.to_datetime(df3['date'],format="%H:%M:%S.%f",errors="coerce")
df3
Out[35]:
                            date    mesg
0  1900-01-01 10:00:00.000000001     one
1  1900-01-01 10:00:00.000000010     two
2  1900-01-01 10:00:00.000000100   three
3  1900-01-01 10:00:00.000001000    four
4  1900-01-01 10:00:00.000010000    five
5  1900-01-01 10:00:00.000100000     six
6  1900-01-01 10:00:00.001000000   seven
7  1900-01-01 10:00:00.010000000   eight
8  1900-01-01 10:00:00.100000000    nine
9  1900-01-01 10:00:00.000000001     ten
10 1900-01-01 10:00:00.000000002  eleven
11 1900-01-01 10:00:00.000000003  twelve

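Converting a DataFrame column with "%H:%M:%S.%f" appends zeros after the value that has the least significant non-zero digit after the decimal point (this applies across the whole column, with the number of zeros depending on position, as in the zeros table above). It then aligns the resolution of all other values with that one, even if this raises or lowers the resolution of some raw data: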

df4 = pd.read_csv('test2.csv')
df4
Out[36]: 
                  date    mesg
0   10:00:00.000000000     one
1    10:00:00.00000000     two
2     10:00:00.0000000   three
3      10:00:00.000000    four
4       10:00:00.00000    five
5        10:00:00.0001     six
6          10:00:00.00   seven
7           10:00:00.0   eight
8            10:00:00.    nine
9   10:00:00.000000000     ten
10  10:00:00.000000000  eleven
11   10:00:00.00000000  twelve

df4['date'] = pd.to_datetime(df4['date'],format="%H:%M:%S.%f",errors="coerce")
df4
Out[37]: 
                         date    mesg
0  1900-01-01 10:00:00.000000     one
1  1900-01-01 10:00:00.000000     two
2  1900-01-01 10:00:00.000000   three
3  1900-01-01 10:00:00.000000    four
4  1900-01-01 10:00:00.000000    five
5  1900-01-01 10:00:00.000100     six
6  1900-01-01 10:00:00.000000   seven
7  1900-01-01 10:00:00.000000   eight
8                         NaT    nine # nothing after decimal point in raw data
9  1900-01-01 10:00:00.000000     ten
10 1900-01-01 10:00:00.000000  eleven
11 1900-01-01 10:00:00.000000  twelve
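The NaT in row 8 comes from coercion: a string with a decimal point but no digits after it does not match %f, and the parse failure is turned into NaT instead of raising. A minimal sketch, assuming a recent pandas where this behaviour is requested with errors="coerce":

```python
import pandas as pd

raw = pd.Series(["10:00:00.0001", "10:00:00.", "10:00:00.0"])
parsed = pd.to_datetime(raw, format="%H:%M:%S.%f", errors="coerce")

# True marks the strings that did not match the format.
failed = parsed.isna().tolist()
print(parsed)
```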

The same thing happened with the same DataFrame when the date column also contained dates:

df25
Out[38]: 
                             date    mesg
0   2015-09-10 10:00:00.000000000     one
1    2015-09-11 10:00:00.00000000     two
2     2015-09-12 10:00:00.0000000   three
3      2015-09-13 10:00:00.000000    four
4       2015-09-14 10:00:00.00000    five
5        2015-09-15 10:00:00.0001     six
6          2015-09-16 10:00:00.00   seven
7           2015-09-17 10:00:00.0   eight
8            2015-09-18 10:00:00.    nine
9   2015-09-19 10:00:00.000000000     ten
10  2015-09-20 10:00:00.000000000  eleven
11   2015-09-21 10:00:00.00000000  twelve

df25['date'] = pd.to_datetime(df25['date'],format="%Y-%m-%d %H:%M:%S.%f",errors="coerce")
df25
Out[39]: 
                         date    mesg
0  2015-09-10 10:00:00.000000     one
1  2015-09-11 10:00:00.000000     two
2  2015-09-12 10:00:00.000000   three
3  2015-09-13 10:00:00.000000    four
4  2015-09-14 10:00:00.000000    five
5  2015-09-15 10:00:00.000100     six
6  2015-09-16 10:00:00.000000   seven
7  2015-09-17 10:00:00.000000   eight
8                         NaT    nine # nothing after decimal point in raw data
9  2015-09-19 10:00:00.000000     ten
10 2015-09-20 10:00:00.000000  eleven
11 2015-09-21 10:00:00.000000  twelve

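When no raw value has a non-zero digit after the decimal point, converting a DataFrame column with "%H:%M:%S.%f" displays every value uniformly with no fractional digits at all, as in the output below, even though this raises or lowers the apparent resolution of some raw data: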

df5 = pd.read_csv('test3.csv')
df5
Out[40]: 
                  date    mesg
0         10:00:00.000     one
1           10:00:00.0     two
2         10:00:00.000   three
3         10:00:00.000    four
4          10:00:00.00    five
5         10:00:00.000     six
6          10:00:00.00   seven
7           10:00:00.0   eight
8           10:00:00.0    nine
9   10:00:00.000000000     ten
10        10:00:00.000  eleven
11        10:00:00.000  twelve

df5['date'] = pd.to_datetime(df5['date'],format="%H:%M:%S.%f",errors="coerce")
df5
Out[41]: 
                  date    mesg
0  1900-01-01 10:00:00     one
1  1900-01-01 10:00:00     two
2  1900-01-01 10:00:00   three
3  1900-01-01 10:00:00    four
4  1900-01-01 10:00:00    five
5  1900-01-01 10:00:00     six
6  1900-01-01 10:00:00   seven
7  1900-01-01 10:00:00   eight
8  1900-01-01 10:00:00    nine
9  1900-01-01 10:00:00     ten
10 1900-01-01 10:00:00  eleven
11 1900-01-01 10:00:00  twelve
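When the display drops the fraction like this, a fixed width can still be produced as text. A sketch assuming Series.dt.strftime, which returns strings rather than datetimes and renders %f as six digits; the three sample values are made up for illustration:

```python
import pandas as pd

raw = pd.Series(["10:00:00.000", "10:00:00.0", "10:00:00.000000000"])
parsed = pd.to_datetime(raw, format="%H:%M:%S.%f")

fixed_width = parsed.dt.strftime("%H:%M:%S.%f")  # six fractional digits each
three_digits = fixed_width.str[:-3]              # truncate to milliseconds
print(three_digits.tolist())
```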

The same thing happened in this test with the same DataFrame when the date column also contained dates:

df45
Out[42]: 
                             date    mesg
0         2015-09-10 10:00:00.000     one
1           2015-09-11 10:00:00.0     two
2         2015-09-12 10:00:00.000   three
3         2015-09-13 10:00:00.000    four
4          2015-09-14 10:00:00.00    five
5         2015-09-15 10:00:00.000     six
6          2015-09-16 10:00:00.00   seven
7           2015-09-17 10:00:00.0   eight
8           2015-09-18 10:00:00.0    nine
9   2015-09-19 10:00:00.000000000     ten
10        2015-09-20 10:00:00.000  eleven
11        2015-09-21 10:00:00.000  twelve

df45['date'] = pd.to_datetime(df45['date'],format="%Y-%m-%d %H:%M:%S.%f",errors="coerce")
df45
Out[43]: 
                  date    mesg
0  2015-09-10 10:00:00     one
1  2015-09-11 10:00:00     two
2  2015-09-12 10:00:00   three
3  2015-09-13 10:00:00    four
4  2015-09-14 10:00:00    five
5  2015-09-15 10:00:00     six
6  2015-09-16 10:00:00   seven
7  2015-09-17 10:00:00   eight
8  2015-09-18 10:00:00    nine
9  2015-09-19 10:00:00     ten
10 2015-09-20 10:00:00  eleven
11 2015-09-21 10:00:00  twelve

Article URL: https://www.itbaoku.cn/post/1728187.html