2011-12-09 83 views
4

Tokenizing a string

Suppose I have a long string that also contains newlines and tabs:

var x = "This is a long string.\n\t This is another one on next line."; 

How can we split this string into tokens using a regular expression?

I don't want to use .split(' '), because I want to learn JavaScript regular expressions.

A more complex string might look like this:

var y = "This @is a #long $string. Alright, lets split this."; 

Now I want to extract only the valid words from this string, without special characters or punctuation. That is, I want these:

var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"]; 

var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"]; 
+0

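What do you want to split it on? You say `.split(' ')`, but you also mention newlines and tabs. You seem to be looking for a [regular expression tutorial](http://www.regular-expressions.info/tutorial.html), which is not what Stack Overflow is for. – nnnnnn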

+0

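@nnnnnn: I am reading [this document](https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions) from MDN. At the same time, I am doing some experiments. This is my first attempt at breaking a sentence into words. – Nawaz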

Answers

7

Here is a jsfiddle example of what you asked for: http://jsfiddle.net/ayezutov/BjXw5/1/

Basically, the code is very simple:

var y = "This @is a #long $string. Alright, lets split this."; 
var regex = /[^\s]+/g; // one or more non-whitespace characters; the g flag finds every match instead of stopping at the first

var match = y.match(regex); 
for (var i = 0; i<match.length; i++) 
{ 
    document.write(match[i]); 
    document.write('<br>'); 
} 

UPDATE: You can simply extend the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/

var regex = /[^\s\.,!?]+/g; 

UPDATE 2: To get only word characters every time (\w matches letters, digits, and the underscore): http://jsfiddle.net/ayezutov/BjXw5/3/

var regex = /\w+/g; 
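
As a quick check of this last pattern against both strings from the question, here is a minimal sketch. The helper name `extractWords` is hypothetical and only used for illustration; the `|| []` guard is there because `match()` with a `g`-flagged regex returns `null` when nothing matches.

```javascript
var x = "This is a long string.\n\t This is another one on next line.";
var y = "This @is a #long $string. Alright, lets split this.";

// Hypothetical helper: returns every run of word characters in the text.
function extractWords(text) {
  // match() returns null (not an empty array) when there is no match at all
  return text.match(/\w+/g) || [];
}

var xwords = extractWords(x);
var ywords = extractWords(y);

console.log(xwords.join(", "));
// This, is, a, long, string, This, is, another, one, on, next, line
console.log(ywords.join(", "));
// This, is, a, long, string, Alright, lets, split, this
```

Note that \w also accepts digits and the underscore, so a token such as `foo_1` would come back as a single word.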
+1

Both of your examples give the wrong result. The result contains the special characters. – Nawaz

+0

Hey, I thought that was your intention :) If you want only letters in the output: http://jsfiddle.net/ayezutov/BjXw5/3/. `var regex = /\w+/g;` –

+0

+1. That's nice. It looks like this can be written in many different ways. – Nawaz

2

Use \s+ to tokenize the string.
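
A minimal sketch of what this looks like in practice, using the string `x` from the question. The pattern \s+ describes the separators, so it belongs in `split()`; calling `exec()` with it only reports the first run of whitespace. The variable names `firstGap` and `tokens` are illustrative.

```javascript
var x = "This is a long string.\n\t This is another one on next line.";

// exec() without the g flag returns only the first match: the first separator.
var firstGap = /\s+/.exec(x);
console.log(JSON.stringify(firstGap[0]), firstGap.index); // " " 4

// split() cuts the string at every run of whitespace (spaces, tabs, newlines).
var tokens = x.split(/\s+/);
console.log(tokens.length); // 12
console.log(tokens[4]);     // string.
```

Punctuation stays attached to the tokens ("string.", "line."), because only whitespace is treated as a separator here.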

+0

That doesn't seem to work. I did `var re = /\s+/; var words = re.exec(x);` What am I doing wrong? – Nawaz

+1

@Nawaz `var words = x.split(/\s+/);` – Kai

+1

@Nawaz Also try `var words = y.split(/[^A-Za-z0-9]+/);` to strip the punctuation as well. – Kai

1
var words = y.split(/[^A-Za-z0-9]+/); 
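
One thing to watch with this approach: when the string starts or ends with a separator, `split()` produces an empty string at that edge. The sketch below shows this for `y`, which ends with a period, and one possible way to drop the empty entries. The name `cleaned` is illustrative.

```javascript
var y = "This @is a #long $string. Alright, lets split this.";

var words = y.split(/[^A-Za-z0-9]+/);
console.log(words.length);            // 10
console.log(words[words.length - 1]); // "" (caused by the final period)

// Drop empty strings; every non-empty string is truthy, so Boolean works as the filter.
var cleaned = words.filter(Boolean);
console.log(cleaned.join(", "));
// This, is, a, long, string, Alright, lets, split, this
```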
2

exec can loop through the matches to strip out the non-word (\W) characters.

var A= [], str= "This @is a #long $string. Alright, let's split this.", 
rx=/\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g, words; 

while((words= rx.exec(str))!= null){ 
    A.push(words[1]); 
} 
A.join(', ') 

/* returned value: (String) 
This, is, a, long, string, Alright, let's, split, this 
*/ 
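
For comparison, here is a shorter sketch of the same idea using `String.prototype.matchAll`, assuming an ES2020-capable environment. It uses a simplified pattern that keeps only the word part and drops the trailing-delimiter check, so it can behave differently from the loop above on input containing digits or underscores.

```javascript
var str = "This @is a #long $string. Alright, let's split this.";

// matchAll() requires the g flag and yields one match array per word.
var found = Array.from(str.matchAll(/[a-zA-Z][a-zA-Z']*/g), function (m) {
  return m[0]; // m[0] is the full matched text
});

console.log(found.join(", "));
// This, is, a, long, string, Alright, let's, split, this
```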
0

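To extract pure word characters, we use the \w symbol. Whether it matches Unicode characters depends on the implementation; you can use this reference to check how your language or library behaves.

See Alexander Yezutov's answer (Update 2) for how to apply this in an expression.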
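
In JavaScript specifically, \w covers only the ASCII set [A-Za-z0-9_], so accented letters split a word. The sketch below contrasts it with the Unicode property escape \p{L}, which assumes an engine with ES2018 support and requires the `u` flag. The sample text is an assumption chosen for illustration.

```javascript
// "\u00e7" is the letter c with cedilla, written as an escape to avoid encoding issues.
var text = "Et en fran\u00e7ais";

var asciiWords = text.match(/\w+/g);
console.log(asciiWords.length); // 4: "Et", "en", "fran", "ais"

// \p{L} matches any Unicode letter; it only works together with the u flag.
var unicodeWords = text.match(/\p{L}+/gu);
console.log(unicodeWords.length); // 3: "Et", "en", "français"
```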

0

Here is a solution that uses regex groups to tokenise text into different types of tokens.

You can test the code here: https://jsfiddle.net/u3mvca6q/5/

/* 
Basic Regex explanation: 
/     Regex start 
(\w+)    First group, words: \w matches one ASCII word character (letter, digit or underscore), + means 1 or more of them 
|     or 
(,|!)    Second group, punctuation 
|     or 
(\s)    Third group, white spaces 
/     Regex end 
g     "global", enables looping over the string to capture one element at a time 

Regex result: 
result[0] : default group : any match 
result[1] : group1 : words 
result[2] : group2 : punctuation , ! 
result[3] : group3 : whitespace 
*/ 
var basicRegex = /(\w+)|(,|!)|(\s)/g; 

/* 
Advanced Regex explanation: 
[a-zA-Z\u0080-\u00FF] instead of \w  Supports some Unicode letters instead of ASCII letters only. Find Unicode ranges here https://apps.timwhitlock.info/js/regex 

(\.\.\.|\.|,|!|\?)      Identify ellipsis (...) and points as separate entities 

You can improve it by adding ranges for special punctuation and so on 
*/ 
var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g; 

var basicString = "Hello, this is a random message!"; 
var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end."; 

console.log("------------------"); 
var result = null; 
do { 
    result = basicRegex.exec(basicString) 
    console.log(result); 
} while(result != null) 

console.log("------------------"); 
var result = null; 
do { 
    result = advancedRegex.exec(advancedString) 
    console.log(result); 
} while(result != null) 

/* 
Output: 
Array [ "Hello",  "Hello",  undefined, undefined ] 
Array [ ",",   undefined,  ",",  undefined ] 
Array [ " ",   undefined,  undefined, " "  ] 
Array [ "this",   "this",   undefined, undefined ] 
Array [ " ",   undefined,  undefined, " "  ] 
Array [ "is",   "is",   undefined, undefined ] 
Array [ " ",   undefined,  undefined, " "  ] 
Array [ "a",   "a",   undefined, undefined ] 
Array [ " ",   undefined,  undefined, " "  ] 
Array [ "random",  "random",  undefined, undefined ] 
Array [ " ",   undefined,  undefined, " "  ] 
Array [ "message",  "message",  undefined, undefined ] 
Array [ "!",   undefined,  "!",  undefined ] 
null 
*/
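
Building on the group layout above, the raw `exec` results can be turned into typed tokens. This is only a sketch: the `tokenize` function, the `types` labels, and the `{ type, value }` token shape are assumptions made for illustration, not part of the original fiddle.

```javascript
// Hypothetical helper: maps each capture group of the basic regex to a token type.
function tokenize(text) {
  var regex = /(\w+)|(,|!)|(\s)/g;                  // same pattern as basicRegex
  var types = ["word", "punctuation", "whitespace"]; // labels for groups 1, 2, 3
  var tokens = [];
  var m;
  while ((m = regex.exec(text)) !== null) {
    // Exactly one group is defined per match; its position gives the token type.
    for (var i = 0; i < types.length; i++) {
      if (m[i + 1] !== undefined) {
        tokens.push({ type: types[i], value: m[0] });
        break;
      }
    }
  }
  return tokens;
}

var tokens = tokenize("Hello, this is a random message!");
var onlyWords = tokens
  .filter(function (t) { return t.type === "word"; })
  .map(function (t) { return t.value; });

console.log(tokens.length);        // 13
console.log(onlyWords.join(", ")); // Hello, this, is, a, random, message
```

Characters that match none of the three groups (for example "?" with this basic pattern) are skipped silently, so extend the punctuation group if they matter.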