Faster Pandas: Modin, swifter, pandarallel, and others

Faster Pandas: some tips and tools for speeding up pandas processing.

Smaller

One prerequisite for speeding things up is memory optimization.

Reference:

Pandas memory optimization tricks
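
The referenced tricks are not reproduced here, so the following is only a minimal sketch of one common approach: downcasting numeric columns and converting low-cardinality string columns to `category`. The helper name `shrink`, the sample columns, and the 50% uniqueness threshold are illustrative assumptions, not taken from the reference.

```python
import numpy as np
import pandas as pd


def shrink(df):
    """Return a copy of df with smaller dtypes where the values allow it."""
    out = df.copy()
    for col in out.columns:
        kind = out[col].dtype.kind
        if kind == "i":
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif kind == "u":
            out[col] = pd.to_numeric(out[col], downcast="unsigned")
        elif kind == "f":
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_string_dtype(out[col]):
            # Assumed rule of thumb: categories pay off when values repeat a lot.
            if out[col].nunique() < 0.5 * len(out):
                out[col] = out[col].astype("category")
    return out


# Hypothetical sample data, only to show the effect.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "user_id": np.arange(n, dtype="int64"),
    "score": rng.random(n),
    "city": rng.choice(["Beijing", "Shanghai", "Shenzhen"], n),
})

small = shrink(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```

Downcasting floats to 32 bits loses precision, so check that this is acceptable for the column before applying it blindly.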

Read in chunks

df_reader = pd.read_csv(file, low_memory=False, lineterminator="\n", usecols=columns_needed, iterator=True, chunksize=360000)
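
With `chunksize` set, `pd.read_csv` returns an iterator of DataFrames rather than one DataFrame, so each chunk has to be processed and combined explicitly. A minimal sketch, assuming a small in-memory CSV with made-up columns (`city`, `amount`, `note`) in place of the real file:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_text = "city,amount,note\n" + "".join(
    f"{'A' if i % 2 else 'B'},{i},x\n" for i in range(10)
)

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["city", "amount"],  # load only the columns that are needed
    chunksize=4,                 # rows per chunk; tiny here for illustration
)

rows = 0
totals = {}
for chunk in reader:
    # Each chunk is an ordinary DataFrame holding at most `chunksize` rows.
    rows += len(chunk)
    for city, amount in chunk.groupby("city")["amount"].sum().items():
        totals[city] = totals.get(city, 0) + int(amount)

print(rows, totals)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the file size.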

Parallel Processing

Modin

Introduction to Modin

Want pandas to run faster? Then use Modin (article by 机器之心 on Zhihu): https://zhuanlan.zhihu.com/p/62398921

Official Modin resources

https://github.com/modin-project/modin

https://modin.readthedocs.io/en/latest/index.html

Using Modin
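
Modin is designed as a drop-in replacement, so in the simplest case only the import line changes. A minimal sketch, assuming Modin and one of its execution engines are installed; the fallback to plain pandas and the sample data are additions for illustration only:

```python
try:
    import modin.pandas as pd  # same API surface as pandas, parallel execution
    BACKEND = "modin"
except ImportError:
    import pandas as pd  # fallback so the sketch still runs without Modin
    BACKEND = "pandas"

# Hypothetical data; real gains only show up on much larger frames.
df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
totals = df.groupby("group")["value"].sum()

print(BACKEND, dict(totals.items()))
```

On small frames the engine start-up cost can outweigh any speedup, so it is worth timing both backends on the real workload.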

swifter
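
swifter adds a `.swifter` accessor to pandas objects and decides how to run an `apply`, trying a vectorized call first. A minimal sketch, assuming the `swifter` package is installed; the function `square_plus_one` and the fallback branch are illustrative:

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False


def square_plus_one(v):
    # Works on scalars and on whole Series, so it can be vectorized.
    return v * v + 1


df = pd.DataFrame({"x": range(1000)})

if HAVE_SWIFTER:
    result = df["x"].swifter.apply(square_plus_one)
else:
    result = df["x"].apply(square_plus_one)  # plain pandas fallback

print(result.head())
```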

pandarallel
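
pandarallel exposes parallel counterparts of pandas methods, such as `parallel_apply`, after a one-time `initialize` call. A minimal sketch, assuming the `pandarallel` package is installed; the worker count, the function `row_total`, and the fallback branch are illustrative choices:

```python
import pandas as pd

try:
    from pandarallel import pandarallel

    # Start the worker pool once per session; 2 workers is an arbitrary choice.
    pandarallel.initialize(nb_workers=2, progress_bar=False)
    HAVE_PANDARALLEL = True
except ImportError:
    HAVE_PANDARALLEL = False


def row_total(row):
    return row["a"] + row["b"]


df = pd.DataFrame({"a": range(100), "b": range(100, 200)})

if HAVE_PANDARALLEL:
    row_totals = df.parallel_apply(row_total, axis=1)
else:
    row_totals = df.apply(row_total, axis=1)  # plain pandas fallback

print(row_totals.head())
```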

Multiprocessing

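import multiprocessing

from tqdm import tqdm


def a2g2a(data):
    # `function` is a placeholder for the per-item work, defined elsewhere at module level.
    result = function(data)
    return result


if __name__ == "__main__":
    # `data_adds` and `index_begin` are assumed to be defined earlier in the script.
    pbar_data = tqdm(data_adds[index_begin:])
    num_threads = 6  # number of worker processes (not threads), despite the name
    pool = multiprocessing.Pool(num_threads)
    # The bar wraps the input, so it advances as items are handed to the pool,
    # not as results complete.
    list_a2g2a = pool.map(a2g2a, pbar_data)
    pool.close()
    pool.join()
    print("num_threads:{}, messages length:{}".format(num_threads, len(list_a2g2a)))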
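
If the progress bar should reflect finished work rather than submitted work, one option is to wrap the result iterator of `pool.imap` instead of the input. A minimal sketch; the helper `parallel_map`, the example function `slow_square`, and the tqdm fallback are illustrative assumptions:

```python
import multiprocessing

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch runs without tqdm: no progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def parallel_map(func, items, workers=2):
    """Apply func to every item in worker processes and return results in order."""
    items = list(items)
    with multiprocessing.Pool(workers) as pool:
        # imap yields results lazily and in input order, so the bar advances
        # each time the next result becomes available.
        return list(tqdm(pool.imap(func, items), total=len(items)))


def slow_square(x):
    # Stand-in for real per-item work; must live at module level to be picklable.
    return x * x


if __name__ == "__main__":
    print(parallel_map(slow_square, range(10), workers=4))
```

When result order does not matter, `pool.imap_unordered` can keep the bar moving more smoothly, since a slow item no longer holds back the results behind it.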