Convert user input strings to raw string literals to construct a regular expression

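I know there are some posts about converting strings to raw string literals, but none of them solve my problem.

My problem is:

Say, for example, I want to know whether the pattern "\section" occurs in the text "abcd\sectiondefghi". Of course, I can do this: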

import re

motif = r"\\section"  # regex: an escaped backslash followed by "section"
txt = r"abcd\sectiondefghi"
pattern = re.compile(motif)
print(pattern.findall(txt))

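This gives me what I want. However, every time I want to find a new pattern in a new text, I have to change the code, which is painful. So I want to write something more flexible, like this (test.py):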

import re
import sys

motif = sys.argv[1]  # the pattern, taken as a regex exactly as the shell delivers it
txt = sys.argv[2]    # the text to search
pattern = re.compile(motif)
print(pattern.findall(txt))

Then I want to run it in the terminal like this:

python test.py \\section abcd\sectiondefghi

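But this does not work (and I would hate to have to type \\\\section).

So, is there a way to convert my user input (from the terminal or from a file) into a Python raw string? Or is there a better way to compile a regex pattern from user input?

Thank you very much.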

2013-07-24 dbrg77

Use re.escape() to make sure the input text is treated as literal text in the regular expression:

pattern = re.compile(re.escape(motif))

Demo:

>>> import re
>>> motif = r"\section"
>>> txt = r"abcd\sectiondefghi"
>>> pattern = re.compile(re.escape(motif))
>>> print(pattern.findall(txt))
['\\section']

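re.escape() escapes all non-alphanumeric characters by adding a backslash in front of each one. (On Python 3.7 and later only characters that are special in a regex are escaped, so the '!' in the second example is left alone.)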

>>> re.escape(motif) 
'\\\\section' 
>>> re.escape('\n [hello world!]') 
'\\\n\\ \\[hello\\ world\\!\\]'
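Applied to the test.py from the question, this could look roughly like the sketch below. The helper name find_literal is made up for illustration and is not part of the re module; the sample values stand in for what the script would receive from sys.argv.

```python
import re


def find_literal(motif, txt):
    """Return every occurrence of motif in txt, treating motif as literal text.

    find_literal is a hypothetical helper name, not part of the re module.
    """
    # re.escape neutralises regex metacharacters such as the backslash
    pattern = re.compile(re.escape(motif))
    return pattern.findall(txt)


# Sample values (assumed); in test.py they would come from sys.argv[1:3]
print(find_literal(r"\section", r"abcd\sectiondefghi"))
```

Because the escaping happens inside the script, the user only has to get the shell quoting right and never has to double the backslashes for the regex engine.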

2013-07-24 09:42:19

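On the other hand, if you are looking for a literal string, re is the wrong tool. – Fredrik

@Fredrik: I assume this is going to be part of a larger pattern, and the OP just simplified. –

@MartijnPieters Thanks, re.escape really does help! – dbrg77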

One way to do this is to use an argument parser, such as optparse or argparse.

Your code would look something like this:

import re
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-s", "--string", dest="string",
                  help="The string to parse")
parser.add_option("-r", "--regexp", dest="regexp",
                  help="The regular expression")
parser.add_option("-a", "--action", dest="action", default='findall',
                  help="The action to perform with the regexp")

(options, args) = parser.parse_args()

# Look up the requested re function (findall by default) and call it
# with the escaped pattern, so the pattern is matched as literal text
print(getattr(re, options.action)(re.escape(options.regexp), options.string))

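An example of mine using it (this result assumes the pattern is passed to re unescaped; with the re.escape call in the code, (\S+) would be matched literally and the result would be []):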

> code.py -s "this is a string" -r "this is a (\S+)" 
['string']

Using your example (here re.escape is what makes \section match literally):

> code.py -s "abcd\sectiondefghi" -r "\section" 
['\\section'] 
# remember, this is a python list containing a string, the extra \ is okay.
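The same idea with argparse might look like the sketch below. The --literal switch is an assumption added here, not something from the answer above: it makes the choice between "treat the pattern as a regex" and "treat it as literal text" explicit, which is exactly where the two examples differ. The function names build_parser and run are likewise made up for illustration.

```python
import argparse
import re


def build_parser():
    # -s and -r mirror the optparse version; --literal is an added assumption
    parser = argparse.ArgumentParser(description="Search a string with a pattern")
    parser.add_argument("-s", "--string", required=True,
                        help="The string to search")
    parser.add_argument("-r", "--regexp", required=True,
                        help="The pattern to look for")
    parser.add_argument("-l", "--literal", action="store_true",
                        help="Treat the pattern as literal text (applies re.escape)")
    return parser


def run(argv):
    """Parse argv (a list of strings) and return the re.findall result."""
    options = build_parser().parse_args(argv)
    pattern = re.escape(options.regexp) if options.literal else options.regexp
    return re.findall(pattern, options.string)


# Explicit argument lists stand in for sys.argv[1:] so the sketch runs as-is
print(run(["-s", "this is a string", "-r", r"this is a (\S+)"]))
print(run(["-s", r"abcd\sectiondefghi", "-r", r"\section", "--literal"]))
```

In a real script the last two lines would be replaced by print(run(sys.argv[1:])).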

2013-07-24 09:47:05

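So just to be clear: is the thing you are searching for ("\section" in your example) supposed to be a regular expression or a literal string? If it is the latter, the re module is not really the right tool for the task; given a search string needle and a target string haystack, you can do: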

# is it in there 
needle in haystack 

# how many copies are there 
n = haystack.count(needle) 
# where is it 
ix = haystack.find(needle)

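All of these are more efficient than the regex-based versions.

re.escape is still useful if you need to insert literal fragments into a larger regex at runtime, but if you end up doing re.compile(re.escape(needle)), there is in most cases a better tool for the task.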
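As an illustration of the "literal fragment inside a larger regex" case, here is a minimal sketch. It assumes the text is LaTeX-like and that the goal is to capture the braced argument after a user-supplied command; the helper name braced_argument_pattern and the sample text are invented for the example.

```python
import re


def braced_argument_pattern(command):
    """Build a regex matching a literal command followed by {argument}.

    command is user-supplied literal text; only the braces and the capture
    group are hand-written regex. Hypothetical helper for illustration.
    """
    # Escape the user's fragment, then append the hand-written regex part
    return re.compile(re.escape(command) + r"\{([^}]*)\}")


pattern = braced_argument_pattern(r"\section")
print(pattern.findall(r"\section{Intro} text \section{Methods}"))
```

Here the escaped fragment is only one piece of the pattern, so plain string methods such as find or count would not be enough on their own.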

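EDIT: I am beginning to suspect that the real problem here is the shell's escaping rules, which have nothing to do with Python or raw strings. That is, if you type: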

python test.py \\section abcd\sectiondefghi

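into a Unix-style shell, the "\\section" part is converted to "\section" by the shell before Python ever sees it. The simplest way to fix that is to tell the shell to skip its unescaping, which you can do by putting the arguments inside single quotes: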

python test.py '\\section' 'abcd\sectiondefghi'

Compare and contrast:

$ python -c "import sys; print(','.join(sys.argv))" test.py \\section abcd\sectiondefghi
-c,test.py,\section,abcdsectiondefghi

$ python -c "import sys; print(','.join(sys.argv))" test.py '\\section' 'abcd\sectiondefghi'
-c,test.py,\\section,abcd\sectiondefghi

(I am explicitly printing a joined string here, to keep repr from adding even more confusion...)
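The same effect can be reproduced without a shell by using the standard library's shlex module, which splits a command line using POSIX-style quoting rules. This is only an approximation of what a real shell does (no variable expansion, globbing, and so on), but it handles backslashes and single quotes the same way for this case.

```python
import shlex

# What was typed at the prompt, held in raw strings so that Python itself
# leaves the backslashes untouched
unquoted = r"python test.py \\section abcd\sectiondefghi"
quoted = r"python test.py '\\section' 'abcd\sectiondefghi'"

# shlex.split uses POSIX mode by default: an unquoted backslash escapes the
# next character, while everything inside single quotes is kept verbatim
print(",".join(shlex.split(unquoted)))
print(",".join(shlex.split(quoted)))
```

The first line of output loses a backslash from each argument and the second keeps them all, mirroring the two shell sessions above.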

2013-07-24 11:09:43 Fredrik
