2013-10-25 27 views
1

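Removing nested newlines in a delimited file?

I have a caret-delimited file. The only carets in the file are delimiters; there are none in the text. Several of the fields are free-text fields and contain embedded newlines, which makes the file very hard to parse. I need to keep the newline at the end of each record, but remove the ones inside the text fields.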

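This is open-source maritime piracy data from the Global Integrated Shipping Information System. Below are three records, preceded by the header row. In the first record the ship name is NORMANNIA, in the second it is Unkown, and in the third it is KOTA BINTANG.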

ship_name^ship_flag^tonnage^date^time^imo_num^ship_type^ship_released_on^time_zone^incident_position^coastal_state^area^lat^lon^incident_details^crew_ship_cargo_conseq^incident_location^ship_status_when_attacked^num_involved_in_attack^crew_conseq^weapons_used_by_attackers^ship_parts_raided^lives_lost^crew_wounded^crew_missing^crew_hostage_kidnapped^assaulted^ransom^master_crew_action_taken^reported_to_coastal_authority^reported_to_which_coastal_authority^reporting_state^reporting_intl_org^coastal_state_action_taken 
NORMANNIA^Liberia^24987^2009-09-19^22:30^9142980^Bulk carrier^^^Off Pulau Mangkai,^^South China Sea^3° 04.00' N^105° 16.00' E^Eight pirates armed with long knives and crowbars boarded the ship underway. They broke into 2/O cabin, tied up his hands and threatened him with a long knife at his throat. Pirates forced the 2/O to call the Master. While the pirates were waiting next to the Master’s door, they seized C/E and tied up his hands. The pirates rushed inside the Master’s cabin once it was opened. They threatened him with long knives and crowbars and demanded money. Master’s hands were tied up and they forced him to the aft station. The pirates jumped into a long wooden skiff with ship’s cash and crew personal belongings and escaped. C/E and 2/O managed to free themselves and raised the alarm^Pirates tied up the hands of Master, C/E and 2/O. The pirates stole ship’s cash and master’s, C/E & 2/O cash and personal belongings^In international waters^Steaming^5-10 persons^Threat of violence against the crew^Knives^^^^^^^^SSAS activated and reported to owners^^Liberian Authority^^ICC-IMB Piracy Reporting Centre Kuala Lumpur^- 
Unkown^Marshall Islands^19846^2013-08-28^23:30^^General cargo ship^^^Cam Pha Port^Viet Nam^South China Sea^20° 59.92' N^107° 19.00' E^While at anchor, six robbers boarded the vessel through the anchor chain and cut opened the padlock of the door to the forecastle store. They removed the turnbuckle and lashing of the forecastle store's rope hatch. The robbers escaped upon hearing the alarm activated when they were sighted by the 2nd officer during the turn-over of duty watch keepers.^"There was no injury to the crew however, the padlock of the door to the forecastle store and the rope hatch were cut-opened. 

Two centre shackles and one end shackle were stolen"^In port area^At anchor^5-10 persons^^None/not stated^Main deck^^^^^^^-^^^Viet Nam^"ReCAAP ISC via ReCAAP Focal Point (Vietnam) 

ReCAAP ISC via Focal Point (Singapore)"^- 
KOTA BINTANG^Singapore^8441^2002-05-12^15:55^8021311^Bulk carrier^^UTC^^^South China Sea^^^Seven robbers armed with long knives boarded the ship, while underway. They broke open accommodation door, held hostage a crew member and forced the Master to open his cabin door. They then tied up the Master and crew member, forced them back onto poop deck from where the robbers jumped overboard and escaped in an unlit boat^Master and cadet assaulted; Cash, crew belongings and ship's cash stolen^In territorial waters^Steaming^5-10 persons^Actual violence against the crew^Knives^^^^^^2^^-^^Yes. SAR, Djakarta and Indonesian Naval Headquarters informed^^ICC-IMB PRC Kuala Lumpur^- 

You'll notice that the first and third records are fine and easy to parse. The second record, "Unkown", has some nested newlines.

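How should I go about removing the nested newlines (but not the ones at the end of each record) in a Python script, or by some simpler means if there is one, so that I can import this data into SAS?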

Answers

1

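I solved the problem by counting the delimiters encountered and manually switching to a new record once I reached the number of delimiters that a single record contains. I then removed all the newline characters and wrote the data back out to a new file. In essence, the result is the original file with the newlines removed from the fields, but with one newline at the end of each record. Here is the code: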

# Each record has 34 fields, hence 33 caret delimiters
carets_per_record = 33

records = []    # completed records, one string each
pending = ''    # physical lines accumulated for the record being built

with open("events.csv", "r") as f:
    for line in f:
        # Drop the line break; continuation lines are joined with a space
        line = line.rstrip('\n')
        pending = line if not pending else pending + " " + line

        # A record is complete once it holds the expected number of carets
        if pending.count('^') == carets_per_record:
            records.append(pending)
            pending = ''

# Write the records to the new file, one per line
with open("new_events.csv", "w") as g:
    for record in records:
        g.write(record + '\n')
1

Load the data into a string a, then do:

import re 
newa=re.sub('\n','',a) 

After this there will be no newlines in newa. Alternatively,

newa=re.sub('\n(?!$)','',a) 

leaves the newline at the end of the line but strips the rest.

+1

Wouldn't this also remove the newlines at the end of each record? – Clay

+0

I tried your second example, and it also removed the newlines at the end of the lines, not only the embedded ones. – Clay
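The behaviour Clay reports can be reproduced. Without the re.MULTILINE flag, `$` in Python matches only at the very end of the string (or just before a final trailing newline), so the lookahead `(?!$)` protects only the last newline in the data. The sketch below uses a made-up three-field sample, not the real file. The alternative pattern is an assumption based on the sample record in the question, where every field containing a newline is wrapped in double quotes and contains no quote characters of its own.

```python
import re

# Made-up sample: two records, the second with a quoted multi-line field
a = 'A^x^1\nB^"first\n\nsecond"^2\n'

# Without re.MULTILINE, $ matches only at the end of the string, so every
# newline except the final one is removed, record breaks included.
stripped = re.sub('\n(?!$)', '', a)

# Alternative (assumes multi-line fields are double-quoted and contain no
# embedded quote characters): collapse newlines only inside quoted fields.
def collapse(match):
    return re.sub(r'\s*\n\s*', ' ', match.group(0))

cleaned = re.sub(r'"[^"]*"', collapse, a)

print(repr(stripped))
print(repr(cleaned))
```

If some quoted fields do contain quote characters, this pattern will mis-pair the quotes, and a real parser is the safer choice.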

2

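I see you've tagged this as regex, but I would suggest using the built-in CSV library to parse it. The CSV library will parse the file correctly and preserve the newlines where they belong.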

Python CSV examples: http://docs.python.org/2/library/csv.html
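As a rough illustration of this suggestion, the sketch below reads caret-delimited data with csv.reader and collapses the whitespace inside each field. The function name read_clean_records and the sample text are made up for the example, and it assumes that fields with embedded newlines are double-quoted, as in the second record of the question.

```python
import csv
import io

def read_clean_records(stream):
    """Yield each record as a list of fields with embedded newlines collapsed."""
    # The reader keeps newlines that occur inside double-quoted fields
    # as part of the field instead of starting a new record.
    reader = csv.reader(stream, delimiter='^', quotechar='"')
    for row in reader:
        if not row:  # skip completely blank lines between records
            continue
        # Note: this collapses every run of whitespace, not only newlines
        yield [' '.join(field.split()) for field in row]

# Made-up sample in the same shape as the real file
sample = io.StringIO(
    'name^details^state\n'
    'Unkown^"padlock cut\n\nshackles stolen"^Viet Nam\n',
    newline='')
rows = list(read_clean_records(sample))
print(rows)
```

For the real file, the stream would be something like open("events.csv", newline=""), which is how the csv documentation recommends opening files so the reader sees the original line endings.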

+0

I agree, the csv library is easy to use and seems a good fit for your problem. – Vorsprung

+0

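Well, what I really need is a csv file with no newlines inside the fields, so that I can import it into SAS. The regex approach to removing those newlines actually seems to involve fewer steps. Once the data has been parsed, how would I re-export it to csv and end up with a well-formed csv file? Some of the internal text fields also have embedded quotes, while others don't. – Clay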

+0

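@Clay: Perhaps you could upload a sample file to a Gist, and we can show you how to parse it as CSV with the csv module and then re-output it correctly. What you really need is field quoting, which your comment doesn't seem to include. – VooDooNOFX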
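A minimal sketch of the re-export step discussed in these comments, using csv.writer with field quoting. The rows are hypothetical stand-ins for already-cleaned records, and the choice of QUOTE_MINIMAL and a '\n' line terminator is an assumption rather than something the thread settles.

```python
import csv
import io

# Hypothetical cleaned rows, e.g. the result of a csv.reader pass
rows = [
    ['name', 'details', 'state'],
    ['Unkown', 'crew said "no injury" but padlock cut', 'Viet Nam'],
]

out = io.StringIO(newline='')
# QUOTE_MINIMAL quotes only the fields that need it (those containing the
# delimiter, the quote character or a line break); embedded quotes are doubled.
writer = csv.writer(out, delimiter='^', quotechar='"',
                    quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
writer.writerows(rows)
text = out.getvalue()
print(text)
```

To write a real file, the StringIO would be replaced by something like open("new_events.csv", "w", newline=""). Whether the importing program accepts doubled quotes inside quoted fields should be checked before relying on this output format.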