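Speeding up timestamp operations

The following conversion (ms -> datetime -> convert time zone) takes a long time to run (4 minutes), probably because I am working with a large DataFrame: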

for column in ['A', 'B', 'C', 'D', 'E']: 
    # Data comes in unix time (ms) so I need to convert it to datetime 
    df[column] = pd.to_datetime(df[column], unit='ms') 

    # Get times in EST 
    df[column] = df[column].apply(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))

Is there any way to speed it up? Am I already using the Pandas data structures and methods in the most efficient way?

Source

2014-09-04 Amelio Vazquez-Reina

These are available as DatetimeIndex methods, which will be much faster:

df[column] = pd.DatetimeIndex(df[column]).tz_localize('UTC').tz_convert('US/Eastern')

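Note: from 0.15.0 you will have access to these through the Series dt accessor (each step goes through .dt, because a bare Series.tz_convert acts on the index rather than the values):

df[column] = df[column].dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

Source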
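As a minimal sketch of the vectorized route described above, assuming a recent pandas version and made-up sample timestamps (the frame and its values are hypothetical), the whole conversion can be written without apply. Here to_datetime is given utc=True so the values are parsed as UTC directly, which stands in for the separate tz_localize step:

```python
import pandas as pd

# Hypothetical sample data: Unix timestamps in milliseconds.
df = pd.DataFrame({
    "A": [1409804450000, 1409804451000],
    "B": [1409833200000, 1409833201000],
})

for column in ["A", "B"]:
    # Parse the integers as UTC-aware datetimes in one vectorized call,
    # then convert the whole column to US/Eastern at once.
    utc = pd.to_datetime(df[column], unit="ms", utc=True)
    df[column] = utc.dt.tz_convert("US/Eastern")

print(df.dtypes)
print(df.head())
```

Both calls operate on the whole column, so no Python-level function runs per row, which is where the apply version spends its time.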

2014-09-04 04:20:50

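Thanks, although the first option does not seem to work on **0.14.1**. I get an error in 'tz_localize' saying 'TypeError: index is not a valid DatetimeIndex or PeriodIndex'. – 2014-09-04 12:53:38

Andy, I don't have 'tz_localize' under my '.dt' namespace. Is it implemented? That would be super useful. – TomAugspurger 2014-09-04 14:39:44

@user815423426 That's odd. It sounds as if the column was not converted to dates correctly, so the DatetimeIndex fails... did your to_datetime line work, and did you check the resulting dtype? It may be a bit finicky. – 2014-09-04 17:25:38

I would try this in Bash using the date command. date proved faster than gawk for routine conversions. Python may struggle with this.

To speed things up further, you could export column A to one temporary file, column B to another, and so on. (You can even do this in Python.) Then run the 5 columns in parallel.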
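For what it's worth, the dtype check suggested in the comment above could look roughly like this. It is only a sketch with a made-up Series (the values are hypothetical), using the is_datetime64_any_dtype helper from pandas.api.types:

```python
import pandas as pd

# Hypothetical column of Unix timestamps in milliseconds.
raw = pd.Series([1409804450000, 1409804451000])
converted = pd.to_datetime(raw, unit="ms")

# The raw column is integer-typed; after to_datetime it should be datetime64.
print(raw.dtype, converted.dtype)

raw_is_datetime = pd.api.types.is_datetime64_any_dtype(raw)
converted_is_datetime = pd.api.types.is_datetime64_any_dtype(converted)
print(raw_is_datetime, converted_is_datetime)
```

If the check comes back False for the column being passed to DatetimeIndex, the to_datetime step did not take effect on that column.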

for column in ['A', 'B']:
    # Write one value per line to a per-column file (thefileA, thefileB, ...)
    with open('thefile' + column, 'w') as f:
        for value in df[column]:
            print(value, file=f)

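Then the bash script:

#!/usr/bin/env bash
# Read one epoch timestamp per line from thefileA into the array a
readarray -t a < thefileA
for i in "${a[@]}"; do
    # BSD/macOS form; expects epoch seconds
    date -r "$i"
done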

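You will want a master bash script that first runs the Python part with python pythonscript.py. Then, from the master script, launch each bash script in the background with ./FILEA.sh &. This runs each column separately, and each one is assigned to a node automatically. I am not 100% sure the bash loop after readarray uses the correct syntax. If you are on Linux, use date -d @item.

Source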

2014-09-04 04:20:44 PhysicalChemist
