2015-05-06 26 views
3

Remove punctuation from text - Spark

This is a sample of my data:

case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25). 

I want to remove all punctuation except the dot (.), and also remove words with length <= 2. My expected output is:

case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time 
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 . 

This should be implemented in Scala. I have already tried:

replaceAll("""\\W\s""", "") 
replaceAll(""""[^a-zA-Z\.]""", "") 

But it does not work. Can anyone help me?

+0

'$25' contains a special character that you did not remove. – tuxdna

Answers

13

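Looking at the regex Javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct}, and that characters can be removed from a character class with the && syntax, as in [a-z&&[^def]]. From there it is easy to define a regex that removes all punctuation except the dot: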

s.replaceAll("""[\p{Punct}&&[^.]]""", "") 

Removing words of length <= 2 can be done like this:

s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "") 

Combining the two gives:

s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "") 

Note how I added \s* to remove the redundant whitespace.
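One thing to watch with the single-pass pattern, as far as I can tell: the trailing \s* applies to both alternatives, so "care, follow" becomes "carefollow", and because \b treats the apostrophe as a word boundary, "it's" disappears entirely instead of becoming "its". A minimal sketch of a two-pass alternative follows. The object and method names are made up for illustration, and collapsing runs of spaces afterwards is my own assumption about the desired output.

```scala
object TwoPassCleaner {
  // The combined single-pass pattern from the answer, kept for comparison.
  def singlePass(s: String): String =
    s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

  // Hypothetical alternative: strip punctuation first, then short words,
  // then collapse the runs of spaces/tabs that the removals leave behind.
  def twoPass(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")   // punctuation except '.'
      .replaceAll("""\b\p{IsLetter}{1,2}\b""", "") // words of 1 or 2 letters
      .replaceAll("""[ \t]{2,}""", " ")            // newlines are left alone
}
```

With this, TwoPassCleaner.twoPass("it's purse") keeps "its purse", whereas the single-pass version returns just "purse".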

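Also, as you can see, the regex above removes '$' entirely, because it counts as punctuation (it is one of the characters in Java's POSIX \p{Punct} class). If this is undesirable (as your expected output seems to indicate), be more precise about what you consider punctuation. For example, you might want to treat only the following characters as punctuation: ?.!:()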

s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "") 

Alternatively, you can simply add '$' to your list of "not punctuation" characters, alongside the dot:

s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "") 
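
If the list of characters to keep is likely to change, the character class can be built from a string instead of being hard-coded. This is only a sketch: PunctClass, exceptPattern, strip and the keep parameter are hypothetical names, and it assumes keep is non-empty and contains only punctuation characters (so that backslash-escaping each one is safe).

```scala
object PunctClass {
  // Builds [\p{Punct}&&[^...]] where "..." is the set of characters to keep.
  // Assumption: `keep` is non-empty and holds only punctuation characters.
  def exceptPattern(keep: String): String = {
    // Backslash-escape every kept character so it is literal inside the class.
    val escaped = keep.map(c => "\\" + c).mkString
    "[\\p{Punct}&&[^" + escaped + "]]"
  }

  // Removes every \p{Punct} character that is not listed in `keep`.
  def strip(text: String, keep: String): String =
    text.replaceAll(exceptPattern(keep), "")
}
```

For example, PunctClass.strip("amount paid ($25).", ".$") keeps both the dollar sign and the dot, while passing "." as keep drops the dollar sign as well.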
0

How about:

replaceAll("(\\(|\\)|'|/)", "") 

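Then you just need to add any further punctuation you want removed, separated by |, and make sure to escape characters like ( and ) with a double backslash.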
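
To avoid escaping each character by hand, java.util.regex.Pattern.quote can do it for you. A rough sketch, where AltBuilder and removeAll are made-up names and the token list is whatever you decide to strip:

```scala
import java.util.regex.Pattern

object AltBuilder {
  // Hypothetical helper: quotes each token so regex metacharacters such as
  // '(' and ')' need no manual escaping, then joins the tokens with '|'.
  def removeAll(s: String, tokens: Seq[String]): String = {
    val pattern = tokens.map(t => Pattern.quote(t)).mkString("|")
    s.replaceAll(pattern, "")
  }
}
```

For example, AltBuilder.removeAll("(especially it's purse)", Seq("(", ")", "'", "/")) strips the parentheses and the apostrophe.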

0

You could try filtering the string like this:

val example = "Hey there! It's me, myself and I." 
example.filterNot(x => x == ',' || x == '!' || x == 'm') 
res3: String = Hey there It's e yself and I. 
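
With more than a handful of characters, the chain of == comparisons gets unwieldy, and a Set reads better. A small sketch, where CharFilter and the contents of unwanted are only illustrative assumptions:

```scala
object CharFilter {
  // Hypothetical set of characters to drop; adjust it to your data.
  val unwanted: Set[Char] = Set(',', '!', '\'', '(', ')', '/', '"', '-')

  // Keeps every character that is not in `unwanted`.
  def strip(s: String): String = s.filterNot(unwanted.contains)
}
```

Note that this works on single characters only, so it cannot remove short words the way the regex approaches do.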
0

Try this; it should work:

val str = """ 
    |case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
    |xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25). 
    """.stripMargin('|') 

println(str) 
val pat = """[^\w\s\.\$]""" 
val pat2 = """\s\w{2}\s""" 
println(str.replaceAll(pat, "").replaceAll(pat2, "")) 

OUTPUT:

case time especially its purse read manual care follow care instructions make stays waterproof example inspect rubber seals doors especially batterymemory card door open time 
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dockchance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25.
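
The "dockchance" in the output above comes from pat2 consuming the whitespace on both sides of the short word, and \w{2} only matches words of exactly two characters. A possible variant, offered as a sketch rather than a definitive fix, uses lookarounds so that only the trailing spaces are eaten. ShortWords and dropShort are hypothetical names.

```scala
object ShortWords {
  // (?<!\S) and (?!\S) require whitespace (or the string edge) on both sides
  // without consuming it; only trailing spaces and tabs are removed, so
  // neighbouring words stay separated and newlines survive.
  private val shortWord = """(?<!\S)\w{1,2}(?!\S)[ \t]*"""

  def dropShort(s: String): String = s.replaceAll(shortWord, "")
}
```

A short word that is directly followed by a dot, as in "it.", is left in place by this pattern, since the dot is not whitespace.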