Changing a string that contains an unknown substring

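Update: I am using Jsoup to parse the text.
While parsing a website I ran into a problem: in the HTML text I get back, some links are broken by spaces inserted at random positions. For example: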

What a pretty flower! <a href="www.goo gle.com/...">here</a> and <a href="w ww.google.com...">here</a>

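As you may notice, the position of the space is completely random, but one thing is certain: it is always inside an href attribute. Of course, I could use the replace(" ", "") method, but there may be two or more links. How can I solve this?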

2014-02-21 Groosha

What is wrong with using `replace(" ", "")` on all the href values? Also, why try to fix the data from a site that returns garbage?

There are also regular expressions, which you could use to identify your links if you only want to use `replace`. Or [JSoup](http://jsoup.org/) (see [this question](http://stackoverflow.com/questions/9071568/parse-web-site-html-with-java)) – eebbesen

Yes, I am parsing with Jsoup, but changing a substring does not change the original string, right? – Groosha
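On the point raised in the comment above: Java strings are immutable, so `replace` never modifies the string it is called on. It returns a new string, and that result has to be stored somewhere (for a parsed document, that would mean writing it back to the link's attribute). A minimal sketch of this behaviour follows; the class name `ImmutableStringDemo` and the helper `stripSpaces` are made up for illustration and do not come from the thread.

```java
final class ImmutableStringDemo {

    // Hypothetical helper: returns a copy of the input with all spaces removed.
    static String stripSpaces(String s) {
        return s.replace(" ", "");
    }

    public static void main(String[] args) {
        String href = "www.goo gle.com";

        // The return value is discarded here, so href is left untouched.
        href.replace(" ", "");
        System.out.println(href);   // prints "www.goo gle.com"

        // Keeping the returned string is what actually gives the fixed value.
        String fixed = stripSpaces(href);
        System.out.println(fixed);  // prints "www.google.com"
    }
}
```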

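This is an old solution, but I would try using the old, retired Apache ECS to parse your HTML. Then, for the href links only, you can remove the spaces and recreate everything :-) If I remember correctly, there is a way to parse HTML into an ECS "DOM".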

http://svn.apache.org/repos/asf/jakarta/ecs/branches/ecs/src/java/org/apache/ecs/html2ecs/Html2Ecs.java

Another option is to use something like XPath to select your hrefs, but then you have to deal with malformed HTML (you could give Tidy a chance - http://infohound.net/tidy/).
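As a rough illustration of the XPath idea, here is a sketch that uses the XPath support built into the JDK rather than ECS or Tidy. It assumes the markup has already been cleaned up into well-formed XML, which real-world HTML often is not. The class name `XPathHrefDemo` and the method `cleanHrefs` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathHrefDemo {

    // Assumption: the input is well-formed XML (for example, already tidied HTML).
    // Removes spaces from every href attribute of an <a> element in the parsed
    // document and returns the cleaned attribute values.
    static List<String> cleanHrefs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select the href attribute nodes of all <a> elements.
            NodeList hrefs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> cleaned = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                Node attr = hrefs.item(i);
                // Write the repaired value back into the DOM, then record it.
                attr.setNodeValue(attr.getNodeValue().replace(" ", ""));
                cleaned.add(attr.getNodeValue());
            }
            return cleaned;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<p>What a pretty flower! <a href=\"www.goo gle.com/a\">here</a>"
                + " and <a href=\"w ww.google.com/b\">here</a></p>";
        System.out.println(cleanHrefs(xhtml));
    }
}
```

Because only the attribute nodes are touched, the spaces in the surrounding text are left alone.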

2014-02-21 19:04:58 Leo

I will give it a try, thnx – Groosha

You can use a regular expression to find and "clean up" the URLs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLRegex {

    // This pattern matches a sequence of one or more spaces.
    // It is compiled once here, so we don't have to do it in every iteration of the loop below.
    private static final Pattern SPACES_PATTERN = Pattern.compile("\\u0020+");

    // The regular expression below is very primitive and does not really check whether the URL is valid.
    // Moreover, only very simple URLs are matched. If a URL uses a different protocol, account credentials, ... it is not matched.
    // For more sophisticated regular expressions have a look at: http://stackoverflow.com/questions/161738/
    private static final Pattern PATTERN_A_HREF =
            Pattern.compile("https?://[A-Za-z0-9\\.\\-\\u0020\\?&\\=#/]+");

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        final String INPUT = "Hello World <a href=\"http://ww w.google.com\">Google</a> Second " +
                "Hello World <a href=\"http://www.wiki pedia.org\">Wikipedia</a> Test" +
                "<a href=\"https://www.example.o rg\">Example</a> Test Test";
        System.out.println(INPUT);

        Matcher m = PATTERN_A_HREF.matcher(INPUT);

        // Iterate through all matching strings:
        while (m.find()) {
            String urlThatMightContainSpaces = m.group(); // Get the current match
            Matcher spaceMatcher = SPACES_PATTERN.matcher(urlThatMightContainSpaces);
            System.out.println(spaceMatcher.replaceAll("")); // Print the URL with all spaces removed.
        }
    }
}
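The code above only prints the cleaned URLs. If the goal is to get the whole HTML string back with every href repaired, however many links it contains, one possible variation is to rewrite each match in place with `Matcher.appendReplacement`. This is only a sketch under some assumptions: href values are double-quoted, and a correct URL never contains a literal space, so removing all of them is safe. The names `HrefSpaceFixer` and `fixHrefs` are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefSpaceFixer {

    // Matches href="..." with double quotes; group 1 is the attribute value.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Returns a copy of the HTML in which spaces are removed from every
    // double-quoted href value; all other text is copied unchanged.
    static String fixHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group(1).replace(" ", "");
            // quoteReplacement keeps '$' and '\' in the URL from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement("href=\"" + cleaned + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "What a pretty flower! <a href=\"www.goo gle.com/...\">here</a>"
                + " and <a href=\"w ww.google.com...\">here</a>";
        System.out.println(fixHrefs(html));
    }
}
```

Unlike the pattern in the answer, this one does not require the URL to start with `http://` or `https://`, so it also covers links such as `www.goo gle.com/...` from the question. Single-quoted or unquoted attributes would need a broader pattern or a real HTML parser.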

2014-02-21 19:32:19 MrSnrub

Hmm.. looks promising – Groosha
