How to quickly process a million strings to remove quotes and join them together

I am trying to remove redundant quotes from a large set of strings (a list of strings). Each original string looks like this:

"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""

I want to remove the opening and closing quotes, plus the extra quote around each string value, so the result looks like this:

"str_value1","str_value2","str_value3",1,"str_value4"

and then join every string in the list with newlines.
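For a single line, the transformation described above could be sketched as follows. The helper name `unquote_line` is hypothetical, and the sketch assumes every line starts and ends with exactly one extra quote character.

```python
def unquote_line(line):
    """Strip the wrapping quotes and collapse doubled quotes (hypothetical helper)."""
    # Assumes the first and last characters are the extra wrapping quotes.
    return line[1:-1].replace('""', '"')


raw = '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""'
cleaned = unquote_line(raw)
print(cleaned)

# Joining several cleaned lines with newlines, as the question asks.
joined = "\n".join(unquote_line(line) for line in [raw, raw])
```

Passing a generator to `join` avoids writing back into the original list at all, which may be enough when the cleaned list itself is not needed afterwards.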

I have tried the following code,

for line in str_lines[1:]:
    strip_start_end_quotes = line[1:-1]
    splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
    str_lines[str_lines.index(line)] = splited_line_rem_quotes

for_pandas_new_headers_str = '\n'.join(str_lines)

but it is far too slow (it runs for ages) when the list contains more than 1 million lines. What is the most time-efficient way to do this?

I also tried to parallelize the task with multiprocessing:

import multiprocessing


def preprocess_data_str_line(data_str_lines):
    """

    :param data_str_lines:
    :return:
    """
    for line in data_str_lines:
        strip_start_end_quotes = line[1:-1]
        splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
        data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes

    return data_str_lines


def multi_process_prepcocess_data_str(data_str_lines):
    """

    :param data_str_lines:
    :return:
    """
    # how_many_core() and slice_list() are helpers not shown in the question
    # if cpu load < 25% and 4GB of ram free use 3 cores
    # if cpu load < 70% and 4GB of ram free use 2 cores
    cores_to_use = how_many_core()

    data_str_blocks = slice_list(data_str_lines, cores_to_use)

    for block in data_str_blocks:
        # spawn processes for each data string block assigned to every cpu core
        p = multiprocessing.Process(target=preprocess_data_str_line, args=(block,))
        p.start()

but I don't know how to collect the results back into a list so that I can join that list with newlines.

So, ideally, I am thinking of combining multiprocessing with a fast function that preprocesses each line, to speed up the whole process.
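One possible way to get results back from worker processes is `multiprocessing.Pool.map`, which returns each worker's return value in input order. A `multiprocessing.Process` target, by contrast, discards its return value and mutates only its own copy of the data. The sketch below is an illustration, not the asker's code: `clean_line`, `clean_block`, `split_into_blocks` and `clean_in_parallel` are hypothetical names, and the fixed `processes=2` stands in for whatever the question's core-counting helper would return.

```python
import multiprocessing


def clean_line(line):
    # Strip the wrapping quotes and collapse doubled quotes.
    return line[1:-1].replace('""', '"')


def clean_block(block):
    # Return a new list: a child process works on its own copy of the data,
    # so in-place changes would never reach the parent process.
    return [clean_line(line) for line in block]


def split_into_blocks(lines, n_blocks):
    # Contiguous blocks, so concatenating the results preserves line order.
    size = max(1, -(-len(lines) // n_blocks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def clean_in_parallel(lines, processes=2):
    blocks = split_into_blocks(lines, processes)
    with multiprocessing.Pool(processes=processes) as pool:
        # map() returns the per-block results in the order of the inputs.
        cleaned_blocks = pool.map(clean_block, blocks)
    # Flatten the list of blocks back into one list of lines.
    return [line for block in cleaned_blocks for line in block]


if __name__ == "__main__":
    sample = ['"""a"",1,""b"""', '"""c"",2,""d"""'] * 4
    print("\n".join(clean_in_parallel(sample)))
```

Because the per-line work is so cheap, the cost of pickling the strings to and from the workers may outweigh the gain, so it is worth timing this against a plain single-process loop before adopting it.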

Asked 2017-08-02 by daiyue

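I suspect that a lot of the processing time is spent in data_str_lines.index(line): to find the position of the n-th element, it has to scan the n-1 elements before it to locate the original line. So instead of doing about 1 million steps, you end up doing on the order of 500 billion comparisons. Instead, keep track of the current index and update the list as you go, like this: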

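for idx, line in enumerate(data_str_lines):
    # Do whatever you need to do with `line` to create a `new_line`,
    # e.g. strip the outer quotes and collapse the doubled ones
    new_line = line[1:-1].replace('""', '"')
    # Update the list in place using the tracked index
    data_str_lines[idx] = new_line

for_pandas = '\n'.join(data_str_lines)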
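To see the difference on a given machine, the two loops can be timed side by side. This is only a rough sketch: the function names and the synthetic 5,000-line sample are made up for illustration, and the absolute timings will vary.

```python
import timeit


def clean_with_index(lines):
    # Approach from the question: list.index() rescans the list every iteration.
    lines = list(lines)
    for line in lines:
        lines[lines.index(line)] = line[1:-1].replace('""', '"')
    return lines


def clean_with_enumerate(lines):
    # Approach from this answer: enumerate() supplies the index directly.
    lines = list(lines)
    for idx, line in enumerate(lines):
        lines[idx] = line[1:-1].replace('""', '"')
    return lines


# Synthetic, unique sample lines (an assumption made for this illustration).
sample = ['"""v{0}"",{0}"'.format(i) for i in range(5000)]
assert clean_with_index(sample) == clean_with_enumerate(sample)

for func in (clean_with_index, clean_with_enumerate):
    seconds = timeit.timeit(lambda: func(sample), number=3)
    print(func.__name__, round(seconds, 4))
```

The gap should widen as the list grows, since the `index` version does work proportional to the square of the list length while the `enumerate` version stays linear.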

Answered 2017-08-02 16:41:27

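If I only want to iterate over a sublist of `data_str_lines`, e.g. `data_str_lines[1:]`, I find that `idx` is 0 for the first string in the sublist rather than 1 as in the original list, so I have to use `idx + 1`. Is there a direct way to get the original index? – daiyue

@daiyue Use `enumerate(data_str_lines[1:], start=1)`

Slicing creates a copy, so if you really want to avoid the memory overhead you can use `enumerate(itertools.islice(data_str_lines, 1, None), start=1)`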
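Put together, the suggestion from these comments might look like the sketch below. The three-element sample list, including its `'header'` first entry, is made up for illustration.

```python
import itertools

# Hypothetical sample: a header row followed by two quoted data lines.
data_str_lines = ['header', '"""a"",1,""b"""', '"""c"",2,""d"""']

# islice() skips the first element without copying the list, and start=1
# keeps idx aligned with positions in the original list.
for idx, line in enumerate(itertools.islice(data_str_lines, 1, None), start=1):
    data_str_lines[idx] = line[1:-1].replace('""', '"')

print(data_str_lines)
```

Replacing elements while iterating is safe here because the length of the list never changes.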
