2016-03-05

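Separating joined words at capital letters

Using Python, I have to write a script that basically "cleans" a data text file. So far I have removed all the unwanted characters or replaced them with acceptable ones (for example, a dash - can be replaced with a space). Now I have reached the point where I have to separate words that are joined together. Here is a snippet of the first 15 lines of the text file: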

AccessibleComputing Computer accessibility 
AfghanistanHistory History of Afghanistan 
AfghanistanGeography Geography of Afghanistan 
AfghanistanPeople Demographics of Afghanistan 
AfghanistanCommunications Communications in Afghanistan 
AfghanistanMilitary Afghan Armed Forces 
AfghanistanTransportations Transport in Afghanistan 
AfghanistanTransnationalIssues Foreign relations of Afghanistan 
AssistiveTechnology Assistive technology 
AmoeboidTaxa Amoeba 
AsWeMayThink As We May Think 
AlbaniaHistory History of Albania 
AlbaniaPeople Demographics of Albania 
AlbaniaEconomy Economy of Albania 
AlbaniaGovernment Politics of Albania 

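What I want to do is separate the joined words at the points where a capital letter appears. For example, I want the first line to look like this: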

Accessible Computing Computer accessibility 

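The script must accept a file as input and write the result to an output file. This is what I have so far, and it does not work at all! (I am not sure whether I am even on the right track.)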

import re 

input_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt",'r') 
output_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt",'w') 

for line in input_file: 
    if line.contains('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'): 
     newline = line. 

output_file.write(newline) 

input_file.close() 
output_file.close() 

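What I want to do is insert a space before every capital letter that is joined to the previous word. I saw this topic earlier, but I could not figure out the file input :( – lsch91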

Answers


I suggest splitting the words with the following regular expression:

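import re

input_file = 'input.txt'
output_file = 'output.txt'

# A capital letter followed by lowercase letters, or else any run of
# non-whitespace characters. Compiled once, outside the loop.
pattern = re.compile(r'[A-Z][a-z]+|\S+')

with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
    for line in f_in:
        # findall drops the whitespace, so rejoin the pieces with single spaces.
        # Write '\n' rather than os.linesep: text mode already translates it.
        f_out.write(' '.join(pattern.findall(line)) + '\n')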

Assuming input.txt contains the text you pasted in your post, output.txt will contain:

Accessible Computing Computer accessibility 
Afghanistan History History of Afghanistan 
Afghanistan Geography Geography of Afghanistan 
Afghanistan People Demographics of Afghanistan 
Afghanistan Communications Communications in Afghanistan 
Afghanistan Military Afghan Armed Forces 
Afghanistan Transportations Transport in Afghanistan 
Afghanistan Transnational Issues Foreign relations of Afghanistan 
Assistive Technology Assistive technology 
Amoeboid Taxa Amoeba 
As We May Think As We May Think 
Albania History History of Albania 
Albania People Demographics of Albania 
Albania Economy Economy of Albania 
Albania Government Politics of Albania 
... 
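
To see how the pattern treats individual tokens, here is a small sketch that applies it to single lines instead of a file. The helper name `split_joined` is made up for illustration; the regular expression is the one used above.

```python
import re

# Same pattern as in the answer: a capital followed by lowercase letters,
# otherwise any run of non-whitespace characters.
PATTERN = re.compile(r'[A-Z][a-z]+|\S+')

def split_joined(line):
    """Return the line with joined capitalized words separated by spaces."""
    return ' '.join(PATTERN.findall(line))

print(split_joined('AfghanistanTransnationalIssues Foreign relations of Afghanistan'))
print(split_joined('AsWeMayThink As We May Think'))
# Tokens that do not start with "capital + lowercase" fall through to \S+
# and are kept whole, so these two are left unchanged.
print(split_joined('HTMLParser iPhone'))
```

Because the pieces are rejoined with single spaces, any runs of whitespace in the input are collapsed and the trailing newline is dropped, which is why the file version has to add the line ending back itself.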

This works! Thank you very much! – lsch91


You can do it like this:

# The replacement is a raw string so that \g is not read as a string escape.
line = re.sub(r'(?P<end>[a-z])(?P<start>[A-Z])', r'\g<end> \g<start>', line)

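This inserts a space wherever a lowercase letter is immediately followed by an uppercase letter (assuming you only have English characters).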
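
As a quick check of that substitution, here is a minimal sketch wrapped in a function. The names `BOUNDARY` and `add_spaces` are hypothetical; the pattern and replacement are the ones shown above.

```python
import re

# A lowercase ASCII letter directly followed by an uppercase ASCII letter.
BOUNDARY = re.compile(r'(?P<end>[a-z])(?P<start>[A-Z])')

def add_spaces(line):
    """Insert a space at every lowercase-to-uppercase boundary."""
    return BOUNDARY.sub(r'\g<end> \g<start>', line)

print(add_spaces('AmoeboidTaxa Amoeba'))
# repr() makes it visible that the newline at the end of the line is kept.
print(repr(add_spaces('AsWeMayThink As We May Think\n')))
```

Since only the matched letter pairs are rewritten, the rest of the line, including its spacing and trailing newline, passes through unchanged and can be written straight to the output file.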


There is also Unicode in the file (it is 300,000 lines long) – lsch91
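
For the Unicode case raised in this comment, one possible sketch is to drop the `[a-z]` and `[A-Z]` classes and test each character with `str.islower()` and `str.isupper()`, which also cover non-ASCII letters in Python 3. This is a different technique from the regex above, not a drop-in fix for it, and the function name `add_spaces_unicode` is made up for illustration. It assumes the file is opened with the right encoding (for example `open(path, encoding='utf-8')`, which is itself an assumption about the data) and that accented letters are stored as single precomposed characters.

```python
def add_spaces_unicode(line):
    """Insert a space wherever a lowercase letter is followed by an uppercase one."""
    out = []
    prev = ''  # ''.islower() is False, so the first character never gets a space
    for ch in line:
        if prev.islower() and ch.isupper():
            out.append(' ')
        out.append(ch)
        prev = ch
    return ''.join(out)

print(add_spaces_unicode('ÉcoleNormale'))
print(add_spaces_unicode('ÖsterreichÜbersicht'))
```

The function works on one line at a time, so it can be used inside the same line-by-line loop and never needs the whole 300,000-line file in memory.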


This is not the best way, but it is simple.

from string import ascii_uppercase  # string.uppercase exists only in Python 2

s = 'AccessibleComputing Computer accessibility'

>>> ' '.join(''.join(' ' + c if n and c in ascii_uppercase else c
...                  for n, c in enumerate(word))
...          for word in s.split())
'Accessible Computing Computer accessibility'

By the way, this is how you should do your file reading and writing:

f_in = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt"
f_out = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt"

def func(line):
    processed_line = ... # your line processing function
    return processed_line

# Both files are closed automatically when the with blocks exit.
with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        for line in fin:
            fout.write(func(line))
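
To show the pieces fitting together, here is a minimal end-to-end sketch. The body of `func` is only one possible choice (a lowercase-to-uppercase regex substitution), and `clean_file` and the temporary file names are hypothetical, used so that the example runs without the original data file.

```python
import os
import re
import tempfile

def func(line):
    # One possible processing step: add a space between a lowercase letter
    # and the uppercase letter that directly follows it.
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', line)

def clean_file(path_in, path_out):
    """Read path_in line by line and write the processed lines to path_out."""
    with open(path_in, 'r') as fin, open(path_out, 'w') as fout:
        for line in fin:
            fout.write(func(line))

# Demo on throwaway files so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w') as f:
        f.write('AlbaniaHistory History of Albania\n'
                'AlbaniaPeople Demographics of Albania\n')
    clean_file(src, dst)
    with open(dst) as f:
        result = f.read()

print(result)
```

Iterating over the file object directly reads one line at a time, which matters more as the input grows than it does for a 15-line sample.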

Thanks! I will try this and let you know how it goes – lsch91


You're welcome, glad it helped. – Saleem
