How to split an irregularly cased string to get the words? - Python

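Not all of my words are delimited by capital letters. The word list will contain words such as 'USA', and I don't know how to handle them. 'USA' should be treated as one word; it must not be split.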

myList=[u'USA',u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery', 
     u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building', 
     u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote', 
     u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'horseRidingDiscipline']

How can I edit the elements in the list?
I would like to get the modified list shown below:

updatemyList=[u'USA',u'Chancellor', u'current Rank', u'geoloc Department', u'population Urban', u'apparent Magnitude', u'Train', u'artery', 
      u'education', u'right Child', u'fuel', u'Synagogue', u'Abbey', u'Research Project', u'language Family', u'building', 
      u'Snooker Player', u'production Company', u'sibling', u'oclc', u'notable Student', u'total Cargo', u'Ambassador', u'copilote', 
      u'code Book', u'Voice Actor', u'Nuclear Power Station', u'Chess Player', u'runway Length', u'horse Riding Discipline']

so that the words are separated.

Source

2016-10-24 bob90937

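The word u'managerYearsEndYear' is missing from the second list. An oversight? – Ukimiku

Thanks, I will edit it – bob90937

Again, this has nothing to do with 'nltk' ;P Out of curiosity, are all the words in your list delimited by capital letters? What happens when you have 'USA'? Should the output be u'U S A' or u'USA'? – alvas

You can use re.sub: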

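import re

# Matches any character followed by a capital letter and one or more lowercase letters
first_cap_re = re.compile('(.)([A-Z][a-z]+)')
# Matches a lowercase letter or digit followed by a capital letter
all_cap_re = re.compile('([a-z0-9])([A-Z])')


def convert(word):
    # Insert a space between the two groups of each match
    s1 = first_cap_re.sub(r'\1 \2', word)
    return all_cap_re.sub(r'\1 \2', s1)


updated_words = [convert(word) for word in myList]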

Adapted from: Elegant Python function to convert CamelCase to snake_case?

Source

2016-10-24 09:11:05

You can use a regular expression to put a space in front of every capital letter that is not at the start of a word:

re.sub(r"(?!\b)(?=[A-Z])", " ", your_string)

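The bit in the first pair of parentheses means "not at the beginning of a word", and the bit in the second pair of parentheses means "followed by a capital letter". The regular expression matches the empty string wherever both conditions hold and replaces that empty string with a space, i.e. it inserts a space at those positions.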
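As a quick check of how this pattern behaves, here is a minimal sketch that applies it to a few words from the question. The wrapper name `insert_spaces` is only an illustrative helper, not part of the answer. Note that the pattern also splits inside a run of capitals, so an acronym comes out letter by letter.

```python
import re


def insert_spaces(your_string):
    # (?!\b)     -> not at a word boundary, so never at the start of the word
    # (?=[A-Z])  -> the next character is a capital letter
    return re.sub(r"(?!\b)(?=[A-Z])", " ", your_string)


for word in ["currentRank", "NuclearPowerStation", "oclc", "USA"]:
    print(insert_spaces(word))
# current Rank
# Nuclear Power Station
# oclc
# U S A
```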

Source

2016-10-24 09:12:32

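It works for some elements. However, when I pass 'USA' the result is 'U S A', which is not what I want – bob90937

Then you will have to specify how the words should be split correctly. What should happen with 'USAToday' and 'USAtoday', and how should the computer detect that? –

You can do this with a regular expression, but it is easier to understand with a small algorithm (ignoring edge cases such as abbreviations, e.g. NLTK):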

def split_camel_case(string):
    new_words = []
    current_word = ""
    for char in string:
        # A capital letter starts a new word, unless nothing has been collected yet
        if char.isupper() and current_word:
            new_words.append(current_word)
            current_word = ""
        current_word += char
    return " ".join(new_words + [current_word])


old_words = ["HelloWorld", "MontyPython"]
new_words = [split_camel_case(string) for string in old_words]
print(new_words)

Source

2016-10-24 09:19:10

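old_words = [u'Telecommunicationsfirms', u'KKKKKKKK', u'tattoo', u'EducationInstitution'] However, the result is [u'Telecommunicationfirms', u'K KKKKKKKK', u'tattoo', u'Education Institution'] – bob90937

@bob90937 Splitting 'Telecommunicationsfirms' into 'Telecommunications firms' is beyond the scope of the original question. –

The following code snippet separates the words the way you want: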

myList=[u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery', u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building', u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote', u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'managerYearsEndYear', 'horseRidingDiscipline'] 

updatemyList = [] 


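for word in myList:
    # Start with the first letter so that no space is inserted before it
    phrase = word[0]

    for letter in word[1:]:
        # Put a space in front of every later capital letter
        if letter.isupper():
            phrase += " "
        phrase += letter

    updatemyList.append(phrase)

print(updatemyList)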

Source

2016-10-24 09:25:12 Ukimiku

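It works for some elements. However, when the words are like old_words = [u'Telecommunicationsfirms', u'KKKKKKKK', u'tattoo', u'EducationInstitution'], the result is [u'Telecommunicationfirms', u'K KKKKKKKK', u' tattoo', u'Education Institution'] – bob90937

Quoting Roger Thomas above: "@bob90937 Splitting 'Telecommunicationsfirms' into 'Telecommunications firms' is beyond the scope of the original question" – Ukimiku

Could you simply check whether all the letters in a word are capitals and, if so, leave the word alone, i.e. count it as a single word?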
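A minimal sketch of that check, assuming that a word consisting only of capital letters is an acronym and should be kept whole. The function name `split_unless_all_caps` is hypothetical, and the splitting step simply reuses the "space before every later capital" idea from the answers above.

```python
def split_unless_all_caps(word):
    # Assumption: an all-capitals word such as "USA" is an acronym; keep it whole
    if word.isupper():
        return word
    # Otherwise put a space in front of every capital letter after the first character
    result = word[:1]
    for letter in word[1:]:
        if letter.isupper():
            result += " "
        result += letter
    return result


print([split_unless_all_caps(w) for w in ["USA", "currentRank", "KKKKKKKK", "oclc"]])
# ['USA', 'current Rank', 'KKKKKKKK', 'oclc']
```

This only covers words that are entirely uppercase. A mixed word such as 'USAToday' still has its acronym split letter by letter.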

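I have used similar code in the past. It looks a bit hard-coded, but it does the job (in my case I wanted to capture abbreviations up to 4 letters long):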

def CapsSumsAbbv(words):
    # Collect words that contain four consecutive characters
    # which are equal to their own uppercase form
    allcaps = []
    for word in words:
        for i in range(len(word) - 3):
            chunk = word[i:i + 4]
            if chunk == chunk.upper():
                try:
                    # Purely numeric strings are not abbreviations
                    int(word)
                except ValueError:
                    if word not in allcaps:
                        allcaps.append(word)
    return allcaps

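To extend this further, if there are entries such as u'USAMilitarySpending', the code above could be modified so that, when a word has more than two capital letters in a row but also contains lowercase letters, a space is added between the last and the second-to-last capital letter of the run, so that it becomes u'USA Military Spending'.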
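One way this extension could look is sketched below. It is not a modification of the function above but a separate, character-by-character loop, and the name `split_keep_acronyms` is hypothetical. It assumes the input is a single identifier with no spaces, and it treats a capital letter as the start of a new word when the previous character is not a capital or when the next character is lowercase.

```python
def split_keep_acronyms(word):
    pieces = []
    for i, char in enumerate(word):
        if i > 0 and char.isupper():
            prev_upper = word[i - 1].isupper()
            next_lower = i + 1 < len(word) and word[i + 1].islower()
            # New word after a lowercase letter ("tR" in "currentRank"), or at the
            # last capital of a run that is followed by lowercase ("AM" in "USAMilitary")
            if not prev_upper or next_lower:
                pieces.append(" ")
        pieces.append(char)
    return "".join(pieces)


print(split_keep_acronyms("USAMilitarySpending"))  # USA Military Spending
print(split_keep_acronyms("USA"))                  # USA
print(split_keep_acronyms("currentRank"))          # current Rank
```

The ambiguity raised in the comments remains: 'USAToday' becomes 'USA Today', but 'USAtoday' becomes 'US Atoday', because nothing in the casing marks where the acronym ends.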

Source

2016-10-24 10:10:50
