Java - Parse HTML - Get text

I'm trying to get text from a website. When you change the language, the site's URL has "/en" in it, but the page that contains the information I want doesn't.

http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92 

HTML tags (the text contains the description of the photo):
<div id="redx_gallery_pic_title"> text text </div>

The problem is that the site is in German and I want the text in English, but my script only gets the German version.

Any ideas how I can do this?

Java code:
...
URL oracle = new URL(x);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine = null;
StringBuffer theText = new StringBuffer();
// Read the whole page line by line into one string
while ((inputLine = in.readLine()) != null)
    theText.append(inputLine + "\n");
String html = theText.toString();
in.close();

// Pull out the text between the title div's opening tag and its closing tag
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");


2011-08-03 Bogdan S

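What programming language are you using? What API are you using to parse the HTML? Show the code you have so far for getting the HTML content. – BalusC

Programming language: Java –

I posted an answer, but in the future you should really mention it and tag the question accordingly. There are an enormous number of ways to parse a website's HTML, and you didn't tell us anything about yours. – BalusC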

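The site is internationalized, with German as the default. You need to tell the server which language you accept by specifying the desired ISO 639-1 language code in the Accept-Language request header.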

URLConnection connection = new URL(url).openConnection(); 
connection.setRequestProperty("Accept-Language", "en"); 
InputStream input = connection.getInputStream(); 
// ...
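
For completeness, here is a minimal sketch of the plain-JDK route from request to string. The class name LocalizedFetch, the helper acceptLanguage, and the 0.5 weight for the fallback language are only illustrative choices, not something the site requires. The sketch also assumes the page is served as UTF-8, which is not confirmed anywhere in this thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class LocalizedFetch {

    // Builds an Accept-Language value such as "en, de;q=0.5":
    // the preferred language first, then a lower-weighted fallback.
    static String acceptLanguage(String preferred, String fallback) {
        return preferred + ", " + fallback + ";q=0.5";
    }

    // Downloads the page body as text while asking the server for the given language.
    // Assumption: the response body is UTF-8 encoded.
    static String fetch(String url, String acceptLanguage) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("Accept-Language", acceptLanguage);
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Calling LocalizedFetch.fetch(url, LocalizedFetch.acceptLanguage("en", "de")) would then return the HTML as a string, which can be passed to the same substringsBetween call used in the question.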

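Unrelated to the concrete problem: may I suggest you have a look at Jsoup as an HTML parser? It's much more convenient thanks to its jQuery-like CSS selector syntax, and therefore much less bloated than your attempt so far: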

String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92"; 
Document document = Jsoup.connect(url).header("Accept-Language", "en").get(); 
String title = document.select("#redx_gallery_pic_title").text(); 
System.out.println(title); // Beech, glazing V3

That's all.


2011-08-03 19:23:46 BalusC

Thank you very much –

You're welcome. – BalusC

But what if I want to get the text in Romanian? If I put "ro" instead of "en", I don't get the special characters. –
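
The thread does not say what caused the missing Romanian characters. One common cause, and this is only an assumption here, is decoding the response bytes with a charset other than the one the server used (for example, the platform default picked up by new InputStreamReader(stream) without a charset argument). The sketch below reads the charset from a Content-Type header value and decodes the bytes with it. The class ResponseCharset and its methods are hypothetical names, and the UTF-8 fallback is a guess rather than something the site documents.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class ResponseCharset {

    // Extracts the charset from a Content-Type value such as
    // "text/html; charset=ISO-8859-2". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset parameter,
    // or names a charset this JVM does not know.
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw response bytes with the charset named in the header value.
    static String decode(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType));
    }
}
```

With URLConnection, the header value is available from connection.getContentType(), and the resulting charset can be passed as the second argument to InputStreamReader.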
