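Extract a substring from a list of file names in Python or R

My question is very similar to: How to get a Substring from list of file names. I am new to Python and would prefer a similar solution in Python (or R). I want to look in a directory, extract a particular substring from each applicable file name, and output the results as a vector (preferred), list, or array. For example, suppose I have a directory with the following file names: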

data_ABC_48P.txt 
data_DEF_48P.txt 
data_GHI_48P.txt 
other_96.txt 
another_98.txt

I would like to point at the directory and extract the following as a character vector (for use in R) or a list:

"ABC", "DEF", "GHI"

I tried the following:

from os import listdir 
from os.path import isfile, join 
files = [ f for f in listdir(path) if isfile(join(path,f)) ] 
import re 
m = re.search('data_(.+?)_48P', files)

but I got the following error:

TypeError: expected string or buffer

files is of type list:

In [10]: type(files) 
Out[10]: list

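Even though I ultimately want this character vector as input to R code, we are trying to transition all of our "scripting" to Python and use R only for data analysis, so a Python solution would be great. I am also using Ubuntu, so a command-line or bash script solution would work too. Thanks in advance!

Source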

2014-12-05 Ursus Frost

Use a list comprehension, like this:

[re.search(r'data_(.+?)_48P', i).group(1) for i in files if re.search(r'data_.+?_48P', i)]

You need to iterate over the contents of the list in order to grab the strings you want.
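The comprehension above runs the regular expression twice for every file name. A minimal sketch of a single-pass variant follows. It assumes a hard-coded `files` list standing in for the directory listing from the question, and the names `pattern`, `matches` and `substrings` are only illustrative:

```python
import re

# Stand-in for the directory listing from the question (assumed sample data).
files = ["data_ABC_48P.txt", "data_DEF_48P.txt", "data_GHI_48P.txt",
         "other_96.txt", "another_98.txt"]

# Compile once; the capture group holds the text between "data_" and "_48P".
pattern = re.compile(r"data_(.+?)_48P")

# Search each name once, then keep only the names that actually matched.
matches = (pattern.search(name) for name in files)
substrings = [m.group(1) for m in matches if m]

print(substrings)  # ['ABC', 'DEF', 'GHI']
```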

Source

2014-12-05 17:17:36

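re.search needs a string, not a list.

Use:

import re

m = []
for line in files:
    # re.search returns None for names such as other_96.txt, so check before calling group()
    match = re.search(r'data_(.+?)_48P', line)
    if match:
        m.append(match.group(1))

Source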

2014-12-05 17:15:36 vks

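@AvinashRaj thanks a lot!!!!!! – vks 2014-12-05 17:27:34

re.search() does not accept a list as an argument. You need to use a loop and pass each element, which must be a string, to the function. You can use positive look-around for the string you expect. Since re.search returns a match object (or None when nothing matches), you then need group() to get the string: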

>>> for i in files:
...     try:
...         print(re.search(r'(?<=data_).*(?=_48P)', i).group(0))
...     except AttributeError:
...         pass
...
ABC
DEF
GHI

Source
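One thing worth checking with the look-around pattern: `.*` is greedy, so it runs up to the last `_48P` in a name, while the `.+?` from the question stops at the first one. For the file names in the question both give the same result. The sketch below shows where they differ, using a made-up file name (`data_ABC_48P_48P.txt` is purely hypothetical):

```python
import re

# Hypothetical file name containing "_48P" twice, for illustration only.
name = "data_ABC_48P_48P.txt"

# Greedy: matches as much as possible before a trailing "_48P".
greedy = re.search(r"(?<=data_).*(?=_48P)", name).group(0)

# Lazy: matches as little as possible before the first "_48P".
lazy = re.search(r"(?<=data_).*?(?=_48P)", name).group(0)

print(greedy)  # ABC_48P
print(lazy)    # ABC
```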

2014-12-05 17:19:20 Kasramvd

from os import listdir
from os.path import isfile, join
import re

strings = []
for f in listdir(path):
    if isfile(join(path, f)):
        m = re.search('data_(.+?)_48P', f)
        if m:
            strings.append(m.group(1))

print(strings)

Output:

['ABC', 'DEF', 'GHI']

Source
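If it helps to reuse this from other scripts, the same loop can be wrapped in a function. This is only a sketch: it swaps os.listdir for pathlib (Python 3.4+), and the function name `extract_labels` and its `directory` parameter are invented for this example, not taken from the question:

```python
import re
from pathlib import Path


def extract_labels(directory):
    """Return the text between 'data_' and '_48P' for each matching file."""
    pattern = re.compile(r"data_(.+?)_48P")
    labels = []
    # sorted() gives a repeatable order; iterdir() order is otherwise arbitrary.
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories
        match = pattern.search(entry.name)
        if match:
            labels.append(match.group(1))
    return labels
```

Calling extract_labels(path) with the directory from the question should then return the list of substrings, which can be written out or passed on to R.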

2014-12-05 17:25:35 ISanych

In R:

list.files('~/desktop/test') 
# [1] "another_98.txt" "data_ABC_48P.txt" "data_DEF_48P.txt" "data_GHI_48P.txt" "other_96.txt" 

gsub('_', '', unlist(regmatches(l <- list.files('~/desktop/test'), 
           gregexpr('_(\\w+?)_', l, perl = TRUE)))) 
# [1] "ABC" "DEF" "GHI"

Another way:

l <- list.files('~/desktop/test', pattern = '_(\\w+?)_') 

sapply(strsplit(l, '[_]'), '[[', 2) 
# [1] "ABC" "DEF" "GHI"

Source

2014-12-05 17:49:18 rawr
