Working with .txt files in Python

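I have downloaded the following dictionary from Project Gutenberg: http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB, so avoid clicking the link if you are on a slow connection).

In the file, the keywords I am looking for are in upper case, for example HALLUCINATION. After each keyword there are some lines dedicated to pronunciation, which are of no use to me.

What I want to extract is the definition, indicated by "Defn:", and then print those lines. I have come up with this rather ugly 'solution':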

def lookup(search):
    find = search.upper()                # transforms our search parameter to all upper-case letters
    output = []                          # empty dummy list
    infile = open('webster.txt', 'r')    # opening the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0):  # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print(output)    # uncertain about how to proceed
                        break

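Now this of course only prints the first line that comes right after "Defn:". I am new to handling .txt files in Python and therefore have no idea how to proceed. I did read the lines into a tuple and noticed that there are special newline characters.

So I suppose I want to somehow tell Python to keep reading until it runs out of newline characters, but without counting the last line it has to read.

Could someone please point me to useful functions that I could use to solve this problem (a minimal example would be appreciated)?

Example of the desired output: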

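lookup("hallucinate")

out: To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.

lookup("hallucination")

out: The perception of objects which have no reality, or of\r\nsensations which have no corresponding external cause, arising from\r\ndisorder or the nervous system, as in delirium tremens; delusion.\r\nHallucinations are always evidence of cerebral derangement and are common phenomena of insanity. W. A. Hammond.

From the text: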

HALLUCINATE 
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of 
hallucinari, alucinari, to wander in mind, talk idly, dream.] 

Defn: To wander; to go astray; to err; to blunder; -- used of mental 
processes. [R.] Byron. 

HALLUCINATION 
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.] 

1. The act of hallucinating; a wandering of the mind; error; mistake; 
a blunder. 
This must have been the hallucination of the transcriber. Addison. 

2. (Med.) 

Defn: The perception of objects which have no reality, or of 
sensations which have no corresponding external cause, arising from 
disorder or the nervous system, as in delirium tremens; delusion. 
Hallucinations are always evidence of cerebral derangement and are 
common phenomena of insanity. W. A. Hammond. 

HALLUCINATOR 
Hal*lu"ci*na`tor, n. Etym: [L.]


2014-10-20 Spaced

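Why not use 'urllib' to access the file? – Beginner 2014-10-20 17:12:23

@Beginner, I did not know about that function; I have only been coding in Python for 3 weeks :-) But thanks for mentioning it, I will have to google it. However, accessing the file is not my problem, 'reading' it is. – Spaced 2014-10-20 17:13:37

@Beginner: Did the OP ask about fetching the file? No.. – RickyA 2014-10-20 17:13:44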

Here is a function that returns the first definition:

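def lookup(word):
    word_upper = word.upper()
    found_word = False    # True once the headword line has been seen
    found_def = False     # True once the "Defn:" line has been seen
    defn = ''
    with open('dict.txt', 'r') as f:
        for line in f:
            stripped = line.strip()
            if not found_word and stripped == word_upper:
                found_word = True
            elif found_word and not found_def and stripped.startswith("Defn:"):
                found_def = True
                defn = stripped[6:]    # skip the six characters of "Defn: "
            elif found_def and stripped != '':
                defn += ' ' + stripped
            elif found_def and stripped == '':
                return defn
    # a definition that runs to the very end of the file is still returned
    return defn if found_def else False

print(lookup('hallucination'))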

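Explanation: there are four different cases to consider.

We have not found the word yet. We compare the current line with the word we are looking for, in upper case. If they are equal, we have found the word.
We have found the word, but not yet the start of the definition. So we look for a line that starts with Defn:. When we find one, we add that line to the definition (excluding the six characters of "Defn: ").
We have already found the start of the definition. In this case, we simply append the line to the definition.
We have found the start of the definition and the current line is empty. The definition is complete and we return it.

If we find nothing, we return False.

Note: there are some entries, such as CRANE, that have multiple definitions. The code above cannot handle that; it only returns the first definition. However, given the format of the file, writing a perfect solution is not easy.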
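For entries with several definitions, one possible extension is to keep collecting "Defn:" paragraphs until the next headword appears. The following is only a rough sketch, not part of the answer above: `lookup_all` and `is_headword` are hypothetical names, the sample entries are made up, and the rule "a headword is a line that contains letters but no lower-case characters" is an assumption about the file format that may misfire on some lines.

```python
def is_headword(line):
    # Assumption: a headword line contains letters and no lower-case characters.
    return any(c.isalpha() for c in line) and line == line.upper()


def lookup_all(word, lines):
    """Return a list with every 'Defn:' paragraph found under the headword `word`."""
    target = word.upper()
    definitions = []
    in_entry = False   # True while we are inside an entry headed by `target`
    current = None     # pieces of the definition being collected, or None
    for raw in lines:
        line = raw.strip()
        if current is not None:
            if line:
                current.append(line)          # definition continues on this line
                continue
            definitions.append(' '.join(current))  # blank line ends the paragraph
            current = None
        elif is_headword(line):
            in_entry = (line == target)       # entering or leaving our entry
        elif in_entry and line.startswith('Defn:'):
            rest = line[len('Defn:'):].strip()
            current = [rest] if rest else []
    if current is not None:                   # file ended inside a definition
        definitions.append(' '.join(current))
    return definitions


# Made-up sample in the same layout as the dictionary file.
sample = (
    "FOO\r\nFoo, n.\r\n\r\nDefn: First\r\nmeaning.\r\n\r\n"
    "2. (Med.)\r\n\r\nDefn: Second meaning.\r\n\r\n"
    "BAR\r\nBar, n.\r\n\r\nDefn: Other.\r\n"
).splitlines()

for definition in lookup_all('foo', sample):
    print(definition)
```

Because the function only iterates over lines, an open file object can be passed instead of the sample list.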


2014-10-20 17:43:38

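From here I learned a simple way to handle memory-mapped files and use them as if they were strings. Then you can use something like this to get the first definition of a term.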

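import mmap

def lookup(search):
    term = search.upper().encode('ascii')    # mmap searches work on bytes
    with open('webster.txt', 'rb') as f:
        s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # a term sits on its own line, right after an empty line
        index = s.find(b'\r\n\r\n' + term + b'\r\n')
        if index == -1:
            return None
        definition = s.find(b'Defn:', index) + len(b'Defn:') + 1
        endline = s.find(b'\r\n\r\n', definition)
        # assumes the file is UTF-8 encoded
        return s[definition:endline].decode('utf-8')

print(lookup('hallucination'))
print(lookup('hallucinate'))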

Assumptions:

There is at least one definition per term.
If there is more than one, only the first is returned.


2014-10-20 17:42:51 dreyescat

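I will have to read up a lot to understand it, but it looks like a nice approach. Is there a way to make the lookup "unique"? Meaning that it finds the exact word; for example, lookup("vaccination") returned the definition of antivaccination. – Spaced 2014-10-20 17:53:38

Assuming all terms come right after a double \r\n, we can find the specific term. See my edit. – dreyescat 2014-10-20 18:00:08

This will also find partial matches – 2014-10-20 18:26:27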
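One way to avoid the partial matches discussed in these comments is to anchor the search to a whole line with a regular expression. This is a minimal sketch, assuming the file has already been read into a single string with its \r\n line endings preserved; `find_headword` and `lookup_exact` are hypothetical names and the sample text is made up. It still relies on the assumption that the term has at least one "Defn:" of its own, otherwise the next term's definition is returned.

```python
import re

# A blank line, with either \r\n or \n line endings.
BLANK_LINE = re.compile(r'\r?\n[ \t]*\r?\n')


def find_headword(text, word):
    """Return the offset of the line consisting solely of `word`, or -1."""
    # With re.MULTILINE, ^ and $ match at line boundaries; \r? tolerates CRLF.
    pattern = re.compile(r'^' + re.escape(word.upper()) + r'\r?$', re.MULTILINE)
    match = pattern.search(text)
    return match.start() if match else -1


def lookup_exact(text, word):
    """Return the first 'Defn:' paragraph after the exact headword, or None."""
    start = find_headword(text, word)
    if start == -1:
        return None
    defn = text.find('Defn:', start)
    if defn == -1:
        return None
    body_start = defn + len('Defn:')
    end_match = BLANK_LINE.search(text, body_start)
    end = end_match.start() if end_match else len(text)
    # Collapse line breaks and repeated spaces into single spaces.
    return ' '.join(text[body_start:end].split())


# Made-up sample in the same layout as the dictionary file.
sample = (
    "ANTIVACCINATION\r\nAn`ti*vac`ci*na\"tion, n.\r\n\r\n"
    "Defn: Opposition to vaccination.\r\n\r\n"
    "VACCINATION\r\nVac`ci*na\"tion, n.\r\n\r\n"
    "Defn: The act, art, or practice\r\nof vaccinating.\r\n\r\n"
)

print(lookup_exact(sample, 'vaccination'))
```

Compiling a new pattern per lookup keeps the sketch short; for many lookups it may be worth building an index of headword offsets once instead.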

You can split the text into paragraphs and, using the index of the search word, find the first Defn paragraph after it:

import re

def find_def(path, word):
    # newline='' keeps the file's \r\n line endings intact
    with open(path, newline='') as f:
        lines = f.read()
    try:
        start = lines.index("{}\r\n".format(word))  # find where our search word is
    except ValueError:
        return "Cannot find search term"
    # split into paragraphs; maxsplit=10 because no definition spans more paragraphs than that
    paras = re.split(r"\s+\r\n", lines[start:], 10)
    for para in paras:
        if para.startswith("Defn:"):  # if para starts with Defn: we have what we need
            return para  # return the para

print(find_def("in.txt", "HALLUCINATION"))

Using the whole file returns:

In [5]: print find_def("gutt.txt","VACCINATOR") 
Defn: One who, or that which, vaccinates. 

In [6]: print find_def("gutt.txt","HALLUCINATION") 
Defn: The perception of objects which have no reality, or of 
sensations which have no corresponding external cause, arising from 
disorder or the nervous system, as in delirium tremens; delusion. 
Hallucinations are always evidence of cerebral derangement and are 
common phenomena of insanity. W. A. Hammond.

A slightly shorter version:

import re

def find_def(path, word):
    # newline='' keeps the file's \r\n line endings intact
    with open(path, newline='') as f:
        lines = f.read()
    try:
        start = lines.index("{}\r\n".format(word))
    except ValueError:
        return "Cannot find search term"
    defn = lines[start:].index("Defn:")  # offset of the first Defn: after the word
    return re.split(r"\s+\r\n", lines[start + defn:], 1)[0]


2014-10-20 18:05:17
