Scraping the content of a web page using Web Harvest

I want to scrape specific content from a web page, and I am using Web Harvest for that. It works fine for other websites, but it does not scrape any content for this URL.

My Java code is here:

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;
import java.io.FileNotFoundException;

public class App
{
    public static void main(String[] args)
    {
        try
        {
            ScraperConfiguration config = new ScraperConfiguration("twit88.xml");
            Scraper scraper = new Scraper(config, "c:/temp/");
            //scraper.getHttpClientManager().setHttpProxy("proxy-server", 8001);
            scraper.addVariableToContext("url", "http://freesearch.naukri.com/preview/preview?uname=63017692f2b266780bfd20476cd67466001a4a17005b4a5355041f121b502e18514b4e4e43121c4151005&sid=73682841&LT=1339495252");
            scraper.setDebug(true);
            scraper.execute();
            // takes variable created during execution
            Variable article = (Variable)scraper.getContext().getVar("article");
            // do something with articles...
            System.out.println(article.toString());
            //System.out.println("1234=====rtyu");
        }
        catch (FileNotFoundException e)
        {
            System.out.println(e.getMessage());
        }
    }
}

And my XML is here:

<?xml version="1.0" encoding="UTF-8"?> 

<config charset="UTF-8"> 
<!-- 
<var-def name="url">http://twit88.com/blog/2008/01/02/java-encrypt-and-send-a-large-file-securely/</var-def>
--> 

<!-- <file action="write" path="twit88/twit88${sys.date()}.xml" charset="UTF-8"> --> 

    <!-- 
    <template> 
     <![CDATA[ <twit88 date="${sys.datetime("dd.MM.yyyy")}"> ]]> 
    </template> 
    --> 
<var-def name="article"> 
    <xquery> 
     <xq-param name="doc"> 
      <html-to-xml outputtype="browser-compact" prunetags="yes"> 
       <http url="${url}"/> 
      </html-to-xml> 
     </xq-param> 

     <xq-expression><![CDATA[ 
     declare variable $doc as node() external;   
     let $title := data($doc//div[@class="bdrGry"]/div[@class="boxHD1"]/h1) 

     return 
      <article>     
       <title>{data($title)}</title> 
      </article>    
     ]]> 
     </xq-expression> 

    </xquery> 
    </var-def> 
    <!-- 
     <![CDATA[ </twit88> ]]> --> 
    <!-- </file> -->    

    </config>

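I want to scrape the first block of this URL (candidate name, current designation, company, and so on), but I cannot scrape it by referring to its class in the XML file. In the example above I only tried to scrape the candidate name as a first step.

But it is not working. Can anyone tell me what I am doing wrong?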

2012-06-12 kailash gaur

On another note, it is 'scrape' or 'scraping', not 'scrap', 'scrapped' or 'scrapping'. You might have picked up some hints from the code seen above.

Thanks for correcting me. But what is the solution for this?

For the spelling? An edit (which I have already done). In future, take more care when typing up questions.

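"..it is not scraping content for this URL."

From the Naukri.com Terms & Conditions:

"Naukri.com uses technical means to exclude robots etc. from crawling the website and scraping content. The user undertakes not to circumvent these methods."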
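
As an illustration of what excluding robots can look like from the scraper's side, here is a minimal Java sketch that checks a path against robots.txt rules before fetching it. It assumes the site publishes its rules as a robots.txt file, which the terms quoted above do not confirm. The class RobotsCheck, its methods, and the sample rules are hypothetical, and the sketch only handles "User-agent: *" groups with plain Disallow prefixes (no Allow lines, no wildcards).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Web Harvest.
class RobotsCheck {

    // Collects the Disallow prefixes that apply to every crawler ("User-agent: *").
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Strip trailing comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                if (!lastWasUserAgent) {
                    inWildcardGroup = false;
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // True if the path starts with any prefix disallowed for all crawlers.
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up rules for illustration; not the real file of any site.
        String sample = "User-agent: *\nDisallow: /preview/\n";
        System.out.println(isDisallowed(sample, "/preview/preview")); // prints true
        System.out.println(isDisallowed(sample, "/jobs"));            // prints false
    }
}
```

A check like this could run before scraper.execute() so that the scraper skips URLs the site has asked robots to stay away from. It does not detect other blocking techniques such as login walls or user-agent filtering.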

2012-06-12 11:19:15
