2013-04-16 185 views
0

String to String[] of words

I need to split a String and get a String[] of its words. I tried this:

String[] plain = plainText.split(" ,;<>/[(!)*=]"); 

But in my case this does not work. After the split, the array plain still holds only one value, which is the entire contents of plainText. My string looks like this:

<table class="content" border="0" cellpadding="0" cellspacing="0" style="width:540px;" bgcolor="#ffffff"> 
      <tr> 
       <td align="left" valign="top"> 
        <font color="#666666" face="Arial, Verdana" size="1"> 
        eBay Inc.<br /> 
        2145 Hamilton Avenue<br /> 
        San Jose, California 95125<br /><br /> 

        Designated trademarks and brands are the property of their respective owners. eBay and the eBay logo are trademarks of eBay Inc. 
        <br /><br /> 

        <strong>&copy; 2013 eBay Inc. All Rights Reserved</strong><br /><br /> 


        eBay Inc. sent this e-mail to you at [email protected] because you opted in to the eBay Deals Daily Alert campaign by signing up at ebay.com/deals.<br /><br /> 


        Pricing: We compared the selling price for the featured Deals items on eBay to the List Price for the item. The List price is the price (excluding shipping and handling fees) the seller of the item has provided at which the same item, or one that is nearly identical to it, is being offered for sale or has been offered for sale in the recent past. The price may be the seller's own price elsewhere or another seller's price. The "% off" simply signifies the calculated percentage difference between seller-provided List Price and the seller's price for the eBay Deals item. If you have any questions related to the pricing and/or discount offered in eBay Deals, please contact the seller. All items subject to availability.<br /><br /> 

        If you wish to unsubscribe from eBay Deals email alerts, please <a href="http://dailydeal.ebay.com/unsubscribe.jsp?s=4IwA&i=883690252203">click here</a>. 
        Please note that you are only opting out of the eBay Deals email alerts. If you are an eBay customer and wish to change your other eBay Notification Preferences, please log in to My eBay by <a href="http://l.deals.ebay.com/u.d?R4GrxGghJ4SpZccF_r3SS=21801">clicking here</a>. Please note that it may take up to 10 days to process changes to your eBay Notification Preferences. <br /><br /> 

        Visit our <a href="http://l.deals.ebay.com/u.d?f4GrxGghJ4SpZccF_r3Sf=21811">Privacy Policy</a> and <a href="http://l.deals.ebay.com/u.d?KYGrxGghJ4SpZccF_r3SY=21821">User Agreement</a> if you have any questions.<br /><br /> 

        </font> 
       </td> 

This is part of an e-mail that I am analysing. So how can I turn this text into an array of words?

+2

Which words do you want to include? –

+1

What should the String array look like? What is the expected output? –

+0

Do you mean you want the text with the HTML tags ignored? – NewUser

Answers

3

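This regex is wrong: some of its characters are regex metacharacters (for example [, ( and *) and have to be escaped to be used as split delimiters, and the whole set of delimiter characters must be wrapped in []. As written, the pattern only matches the literal sequence " ,;<>/" followed by one character from (!)*=, which never occurs in your text: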

String[] plain = plainText.split("[ ,;<>/\\[\\(!\\)\\*=\\]]"); 

Read more about Java regex here.
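To make the difference concrete, here is a minimal sketch comparing the two patterns on one line of the sample e-mail. The class and method names are made up for illustration; the two regex strings are the ones from the question and from this answer.

```java
import java.util.Arrays;

class SplitDelimiterDemo {
    // Pattern from the question: a literal " ,;<>/" followed by ONE character
    // out of the class [(!)*=]. It is a sequence, not a set of alternatives.
    static String[] splitLiteralSequence(String text) {
        return text.split(" ,;<>/[(!)*=]");
    }

    // Pattern from the answer: every delimiter sits inside one character class,
    // so any single one of them splits the text.
    static String[] splitOnAnyDelimiter(String text) {
        return text.split("[ ,;<>/\\[\\(!\\)\\*=\\]]");
    }

    public static void main(String[] args) {
        String sample = "San Jose, California 95125";
        // No match anywhere, so the whole string comes back as one element.
        System.out.println(Arrays.toString(splitLiteralSequence(sample)));
        // Splits on every space and comma; ", " yields one empty entry.
        System.out.println(Arrays.toString(splitOnAnyDelimiter(sample)));
    }
}
```

Note that the character-class version splits on each delimiter separately, so two delimiters in a row (such as ", ") leave an empty string in the array.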

Edit: following up on CPerkins' comment, you can also use this expression:

String[] plain = plainText.split("[\\s^\\W]+"); 

What it does is split on all whitespace characters and all non-word characters, which I guess is more or less what you want.
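A small sketch of how that expression behaves, with a hypothetical helper class. Two details are worth knowing: inside a character class, ^ only negates when it is the first character, so here it is a literal caret, and since whitespace and the caret are themselves non-word characters the pattern acts like "\\W+". Also, when the input starts with a delimiter (as the sample does with "<"), split() returns an empty first element, so this sketch strips leading non-word characters first.

```java
import java.util.Arrays;

class NonWordSplitDemo {
    // Splits on runs of non-word characters and avoids the empty first
    // element that split() produces when the text begins with a delimiter.
    static String[] words(String text) {
        String trimmed = text.replaceFirst("^\\W+", "");
        if (trimmed.isEmpty()) {
            return new String[0]; // "".split(...) would return one empty string
        }
        return trimmed.split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "<strong>&copy; 2013 eBay Inc.</strong>";
        // Tag names and entity names still show up as words with this approach.
        System.out.println(Arrays.toString(words(sample)));
    }
}
```

Because the tags are not removed first, names such as strong and copy end up in the result, which is why this is only a direct answer to the splitting question.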

NB: the above is only a direct answer to your question; there are much better ways to read/parse HTML.

+0

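An improvement, but this will produce multiple empty array entries for whitespace, and newlines will be treated as text rather than being stripped. – CPerkins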

+0

@CPerkins Thanks, updated the answer with a shorter/cleaner regex. – maksimov

+0

No worries. Good answer. Nice and clean. – CPerkins

0

You can use the Scanner class. You can read the words with a

while(scanner.hasNext()){} 

type of construct.

Link: Scanner
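A minimal sketch of the Scanner approach, assuming the text is already in memory as a String. The class and method names are hypothetical, and the "\\W+" delimiter is my choice for illustration; by default Scanner splits on whitespace only, so punctuation would stay attached to the words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWordsDemo {
    // Reads tokens one at a time, treating any run of non-word
    // characters as the separator between tokens.
    static List<String> readWords(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("\\W+");
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(readWords("San Jose, California 95125"));
    }
}
```

If a String[] is needed, the list can be converted with words.toArray(new String[0]).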

0
String noTags = htmlString.replaceAll("\\<.*?\\>", ""); 
String clearTxt = noTags.replaceAll("[ \t\n.,!;\\(\\)]+", " "); 
String[] words = clearTxt.split(" "); 
+2

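If the text inside the tags (like URLs) can be ignored, I like this approach, but I switched noTags.replaceAll() to use "[^\\w]+" to throw out more of the non-alpha characters. If the tag text is needed, @maksimov's regex could be modified to work in two passes to clean it up. – n0741337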

+1

@rebeliagamer Much better, but your entries contain newline characters. n0741337's addition corrects that. – CPerkins
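Putting this answer and n0741337's comment together, a sketch could look like the following. The class and method names are hypothetical. Two details differ from the snippet in the answer and are my own assumptions: the tag pattern is "<[^>]*>" so that a tag spanning several lines is still removed, and tags are replaced with a space rather than an empty string so that words on either side of a tag are not glued together.

```java
import java.util.Arrays;

class HtmlWordsDemo {
    static String[] htmlToWords(String html) {
        // Pass 1: drop the tags, including their attributes and URLs.
        String noTags = html.replaceAll("<[^>]*>", " ");
        // Pass 2: collapse every run of non-word characters
        // (punctuation, newlines, tabs) into one space.
        String clearTxt = noTags.replaceAll("[^\\w]+", " ").trim();
        if (clearTxt.isEmpty()) {
            return new String[0];
        }
        return clearTxt.split(" ");
    }

    public static void main(String[] args) {
        String html = "<td align=\"left\">\n eBay Inc.<br />\n"
                + " San Jose, California 95125<br /><br />\n</td>";
        System.out.println(Arrays.toString(htmlToWords(html)));
    }
}
```

This still treats entity names such as &copy; as words (copy), and a regex is not a real HTML parser, so it is only suitable for simple fragments like the one in the question.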
