2017-09-18 36 views
1

Extracting a section that contains specific words between two identical titles

My text file contains paragraphs like this:

summary 

A result oriented and dedicated professional with three years’ experience in Software Development. A proactive individual with a logical approach to challenges, performs effectively even within a highly pressurised working environment. 

summary 

Oct 28th, 2010 – Till date Cognizant Technology Solutions  


Project #1 

Title   Wealth Passport – R7.3 
Client     Northern Trust 
Operating System Windows XP 
Technologies  J2EE, JSP, Struts, Oracle, PL/SQL 
Team Size  3 
Role   Team Member 
Period     22nd Aug’ 2013 - Till Date  
Project Description 
Wealth Passport R7.3 release aims at enhancements in four projects SGY, PMM, WPA and WPX. This primarily involves analysing existing issues in the four applications and enhancements to some of the functionalities. 
Role and Responsibilities 
Handled dockets in SGY and PMM applications. 
Done root cause analysis to existing issues in a short span of time. 
Designed and developed enhancements in PMM application. 
Preparing Unit Test cases for the developed Java modules and executing them. 


Project #2 
Title   PFS Development – WP Filecabinet and R7.2 
Client     Northern Trust 
Operating System Windows XP 
Technologies  J2EE, JSP, Struts, Weblogic Portal, Oracle, PL/SQL, UNIX, Hibernate, Spring, DOJO 
Team Size  1 
Role   Team Member – JavaEE Developer 
Period     18th June’ 2013 – 21st Aug’ 2013 
Project Description 
PFS Development project is to provide the development services for PFS capital projects: Wealth Passport, Private Passport 6.0 and Private Passport 7.0 
Wealth Passport Filecabinet provides functionality for users to store their files on our system. This enables users to create folders, upload files and view the uploaded files. Batch upload/delete option is also available. Deleted files will be moved to Waste Bucket, from where users can restore should they wish. This project aims at improving the performance of Filecabinet which was mandated by increasing customer base and files handled by the system. 

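Now I want to extract only the summary section that contains words like "Project" and "Team Size", without extracting the other summary section. I tried the code below, but it extracts the contents of both summary sections: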

import re 
import os 
with open('9.txt', encoding='latin-1') as infile, open('d.txt', 'w', encoding='latin-1') as outfile:
    copy = False
    for line in infile:
        if line.strip() == 'summary':
            re.compile('\r\nproject*\r\n')
            copy = True
        elif line.strip() == "summary":
            copy = False
        elif copy:
            outfile.write(line)
        #fh = open("d.txt",'r')
        contents = fh.read()
        len(contents)

I expect a text file, d.txt, that contains the following:

summary 

    Oct 28th, 2010 – Till date Cognizant Technology Solutions  


    Project #1 

    Title   Wealth Passport – R7.3 
    Client     Northern Trust 
    Operating System Windows XP 
    Technologies  J2EE, JSP, Struts, Oracle, PL/SQL 
    Team Size  3 
    Role   Team Member 
    Period     22nd Aug’ 2013 - Till Date  
    Project Description 
    Wealth Passport R7.3 release aims at enhancements in four projects SGY, PMM, WPA and WPX. This primarily involves analysing existing issues in the four applications and enhancements to some of the functionalities. 
    Role and Responsibilities 
    Handled dockets in SGY and PMM applications. 
    Done root cause analysis to existing issues in a short span of time. 
    Designed and developed enhancements in PMM application. 
    Preparing Unit Test cases for the developed Java modules and executing them. 


    Project #2 
    Title   PFS Development – WP Filecabinet and R7.2 
    Client     Northern Trust 
    Operating System Windows XP 
    Technologies  J2EE, JSP, Struts, Weblogic Portal, Oracle, PL/SQL, UNIX, Hibernate, Spring, DOJO 
    Team Size  1 
    Role   Team Member – JavaEE Developer 
    Period     18th June’ 2013 – 21st Aug’ 2013 
    Project Description 
    PFS Development project is to provide the development services for PFS capital projects: Wealth Passport, Private Passport 6.0 and Private Passport 7.0 
    Wealth Passport Filecabinet provides functionality for users to store their files on our system. This enables users to create folders, upload files and view the uploaded files. Batch upload/delete option is also available. Deleted files will be moved to Waste Bucket, from where users can restore should they wish. This project aims at improving the performance of Filecabinet which was mandated by increasing customer base and files handled by the system. 
+0

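Do you have control over the format of the text files? If so, storing them in a defined format such as 'json', 'txt' or 'csv' (to name a few) would make them much easier to parse.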

+0

What is the expected output in 'd.txt'?

+0

The summary section that contains the word "Project".

Answers

0

To extract every summary section that contains all of the words you are interested in:

split_on = 'summary\n\n' 
must_contain = ['Project', 'Team Size'] 

with open('9.txt') as f_input, open('d.txt', 'w') as f_output:
    for part in f_input.read().split(split_on):
        if all(text in part for text in must_contain):
            f_output.write(split_on + part)
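If the title lines in the real files carry stray spaces, or are not always followed by exactly one blank line, splitting on the literal string 'summary\n\n' may miss them. A minimal sketch of a more tolerant variant follows. The function name filter_sections and the inline sample text are made up for illustration, and it assumes a section runs from one title line to the next title line or to the end of the text:

```python
import re


def filter_sections(text, title="summary", must_contain=("Project", "Team Size")):
    """Return the sections of `text` that mention every word in `must_contain`.

    A section starts at a line holding only `title` (surrounding spaces or
    tabs are ignored) and runs until the next such line or the end of text.
    """
    # Match a whole line that contains only the title, allowing stray spaces.
    marker = re.compile(r"^[ \t]*" + re.escape(title) + r"[ \t]*$", re.MULTILINE)
    starts = [m.start() for m in marker.finditer(text)]
    ends = starts[1:] + [len(text)]
    sections = [text[s:e] for s, e in zip(starts, ends)]
    # Keep only the sections that contain all of the required words.
    return [sec for sec in sections if all(word in sec for word in must_contain)]


# Hypothetical miniature of the file from the question.
sample = (
    "summary \n\nA result oriented professional.\n\n"
    "summary \n\nProject #1\nTeam Size  3\n"
)
kept = filter_sections(sample)
```

Here kept holds only the second section. To write the result to a file, something like outfile.write("".join(kept)) should do; replacing all() with any() would keep sections that mention at least one of the words instead of every one.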
+0

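I have many files laid out in arbitrary ways, and a set of specific words to check for and extract sections by. Not all of the files look like the one above.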

+0

I want to extract any section of an arbitrary input text file that mentions specific words such as {Project, Team, etc.}

+0

I have updated the script to keep every section that contains the required list of words.

0

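The second conditional branch here will never run, because its condition is identical to the first. This means copy is always True after the first instance of summary.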

if line.strip() == 'summary': 
    re.compile('\r\nproject*\r\n') 
    copy = True 
elif line.strip() == "summary": 
    copy =False 

What I suggest is a single branch that picks up the "summary" tag (I assume these are meant to mark the start and end of a block) and toggles copy.

To toggle a boolean, you can simply set it to its own negation:

a = True 
a = not a 
# a is now False 

For example:

if line.strip() == 'summary': 
    copy = not copy 
elif copy: 
    outfile.write(line) 
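To see what the toggle actually keeps, here is a small self-contained sketch that runs the same logic over an in-memory list instead of a file. The helper name lines_between_markers and the demo lines are invented for illustration:

```python
def lines_between_markers(lines, marker="summary"):
    """Collect the lines that fall between alternating marker lines."""
    copy = False
    kept = []
    for line in lines:
        if line.strip() == marker:
            copy = not copy  # flip the flag on every marker line
        elif copy:
            kept.append(line)
    return kept


# Hypothetical input: two marker lines with text before, between and after.
demo = ["intro\n", "summary\n", "first block\n", "summary\n", "second block\n"]
result = lines_between_markers(demo)
```

Only the lines between the first and second marker end up in result. That suits files where "summary" opens and closes a block. If every "summary" line starts a new section, as the sample in the question seems to, the text after the second marker would not be copied, so the toggle alone may not produce the expected d.txt.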