Splitting and checking the length can still be faster than a regex:
In [24]: s = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"
In [25]: %%timeit
   ....: spl = s.split("|",2)
   ....: if len(spl) > 2:
   ....:     pass
   ....:
1000000 loops, best of 3: 413 ns per loop
In [26]: r = re.compile(r'(?<=\|)[^|]*')
In [27]: timeit r.search(s)
1000000 loops, best of 3: 452 ns per loop
In [28]: s = "78 Indonesia Pamela Reid [email protected] 147.3.67.193"
In [29]: timeit r.search(s)
1000000 loops, best of 3: 1.66 µs per loop
In [30]: %%timeit
   ....: spl = s.split("|",2)
   ....: if len(spl) > 2:
   ....:     pass
   ....:
1000000 loops, best of 3: 342 ns per loop
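As a quick sanity check that the two approaches pick out the same field, here is a minimal sketch. The helper names `field_regex` and `field_split` and the shortened sample strings are made up for illustration; only the regex and the split-and-check logic come from the timings above.

```python
import re

r = re.compile(r'(?<=\|)[^|]*')

def field_regex(s):
    # First run of non-pipe characters that follows a pipe, or None.
    m = r.search(s)
    return m.group() if m else None

def field_split(s):
    # Split at most twice; require at least two pipes on the line.
    spl = s.split("|", 2)
    return spl[1] if len(spl) > 2 else None

with_pipes = "78|Indonesia|Pamela|Reid|147.3.67.193"
without_pipes = "78 Indonesia Pamela Reid 147.3.67.193"
one_pipe = "78|Indonesia"

print(field_regex(with_pipes), field_split(with_pipes))
print(field_regex(without_pipes), field_split(without_pipes))
print(field_regex(one_pipe), field_split(one_pipe))
```

Note that the two differ on a line with exactly one pipe: the regex still matches, while the `len(spl) > 2` check rejects the line.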
Applied to a file, the check looks like this:
for line in f:
    spl = line.split("|",2)
    if len(spl) > 2:
        print(spl[1])
        ....
You can shave a little off the timings for both matching and non-matching lines by creating a local reference to str.split:
_spl = str.split  # local name avoids the attribute lookup on every iteration
for line in f:
    spl = _spl(line,"|",2)
    if len(spl) > 2:
        .....
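The loop above can be wrapped into a small generator so it is easy to try on a few lines. This is only a sketch: the function name `second_fields` and the sample lines are hypothetical, not part of the original code.

```python
def second_fields(lines):
    """Yield the second pipe-delimited field of every line with at least two pipes."""
    _spl = str.split  # local reference, looked up once instead of per line
    for line in lines:
        spl = _spl(line, "|", 2)
        if len(spl) > 2:
            yield spl[1]

sample = [
    "78|Indonesia|Pamela|Reid",
    "no pipes here",
    "1|only one pipe",
]
fields = list(second_fields(sample))
print(fields)
```

Because it accepts any iterable of strings, the same function can be passed an open file object directly.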
Since there is always the same number of pipes on every line:
import csv

def main(argv):
    seen = set()  # only use if you actually need a set of all names
    with open("test.txt", 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            v = row[1]
            if v:
                filename = "bby_" + v + ".dat"
                # csv.reader yields a list, so rejoin the fields before writing
                with open(filename, 'a') as existingFile:
                    existingFile.write("|".join(row) + "\n")
                seen.add(v)
            else:
                print("Empty")
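The if/else seems redundant, as you are appending to the file regardless. If you want to keep a set of the row[1] values, fine; otherwise, unless you actually want a set of all the names, I would remove it from the code.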
Applying the same logic to splitting:
def main(argv):
    seen = set()
    with open("test.txt", 'r') as infile:
        _spl = str.split
        for row in infile:
            v = _spl(row,"|",2)[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)
                existingFile.close()
                seen.add(v)
            else:
                print("Empty")
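What will create a lot of overhead is constantly opening and writing to files, but unless you can store all the lines in memory there is no simple way around that.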
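If the data does fit in memory, one way to avoid reopening files is to group the rows by name first and then open each output file exactly once. The following is a rough sketch under that assumption; the function name `write_grouped`, the `out_dir` parameter, and the tiny demo file are all made up for illustration, and it assumes every line contains at least one pipe.

```python
import os
import tempfile
from collections import defaultdict

def write_grouped(in_path, out_dir):
    # Collect every row under its second field; this holds the whole file in memory.
    groups = defaultdict(list)
    with open(in_path) as infile:
        for row in infile:
            v = row.split("|", 2)[1]
            if v:
                groups[v].append(row)
    # Each output file is now opened once, not once per row.
    for v, rows in groups.items():
        with open(os.path.join(out_dir, "bby_" + v + ".dat"), "a") as out:
            out.writelines(rows)
    return groups

# Small demo on a throwaway directory.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "test.txt")
with open(src, "w") as f:
    f.write("1|alice|x\n2|bob|y\n3|alice|z\n4||w\n")

groups = write_grouped(src, tmp)
with open(os.path.join(tmp, "bby_alice.dat")) as f:
    alice_rows = f.read()
print(sorted(groups), repr(alice_rows))
```

The trade-off is memory: for a file that does not fit in RAM, the per-row append shown earlier remains the simpler option.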
As far as reading goes, on a file with ten million lines, just splitting twice beats the csv reader:
In [15]: with open("in.txt") as f:
   ....:     print(sum(1 for _ in f))
....:
10000000
In [16]: paste
def main(argv):
    with open(argv, 'r') as infile:
        for row in infile:
            v = row.split("|", 2)[1]
            if v:
                pass
## -- End pasted text --
In [17]: paste
def main_r(argv):
    with open(argv, 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            if row[1]:
                pass
## -- End pasted text --
In [18]: timeit main("in.txt")
1 loops, best of 3: 3.85 s per loop
In [19]: timeit main_r("in.txt")
1 loops, best of 3: 6.62 s per loop
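To reproduce this kind of comparison outside IPython, something like the following could be used. It is only a sketch: the generated file is far smaller than the ten-million-line file timed above, the function names `count_split` and `count_csv` are hypothetical, and the timings will vary by machine and Python version.

```python
import csv
import os
import tempfile
import timeit

def count_split(path):
    # Count rows whose second field is non-empty, splitting at most twice.
    n = 0
    with open(path) as infile:
        for row in infile:
            if row.split("|", 2)[1]:
                n += 1
    return n

def count_csv(path):
    # Same count, but letting csv.reader parse every field.
    n = 0
    with open(path, newline="") as infile:
        for row in csv.reader(infile, delimiter="|"):
            if row[1]:
                n += 1
    return n

# Generate a small synthetic input file (an assumption, not the original data).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(10000):
        f.write("%d|name%d|Pamela|Reid\n" % (i, i % 50))

split_count = count_split(path)
csv_count = count_csv(path)
t_split = timeit.timeit(lambda: count_split(path), number=5)
t_csv = timeit.timeit(lambda: count_csv(path), number=5)
print("split: %.3fs  csv.reader: %.3fs" % (t_split, t_csv))
os.remove(path)
```

Both functions should report the same count, so any difference in time comes from the parsing strategy and not from the work being done.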
You can do line-by-line profiling, right? I have no knowledge of it at all – The6thSense
If there is a pipe, will there always be more than one, or can a pipe appear anywhere? –
@PadraicCunningham I don't understand what you are saying – v1shnu