2010-10-11 86 views

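Batch downloading web pages with Qt

I want to write a program in Qt that downloads a large number of HTML pages from one site every day, about 5000. After downloading the pages, I need to extract some data from them using DOM queries with the WebKit module, and then store that data in a database.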

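Which approach is the best/correct/most efficient, especially for the download and parsing stages? How should I handle this many requests, and how do I build a "download manager"?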


Consider using an external binary such as 'wget' for the downloading. – 2010-10-11 21:33:45

Answers


This has already been answered, but here is a solution that does what you asked for, and does it with Qt.

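You can build a website crawler with Qt (specifically QNetworkAccessManager, QNetworkRequest and QNetworkReply). I am not sure this is the right way to handle this kind of task, but I found that using multiple threads maximizes efficiency and saves time. (Could someone please tell me if there is another way, or confirm whether this is good practice?)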

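The concept is that a list of work items is queued, and workers carry out the work: each worker fetches its information, processes it, and then the next item in the queue is started.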
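The queueing idea can be modeled without any Qt or networking. The sketch below is a standard-library-only stand-in, not the Qt code from this answer: `Dispatcher`, `processQueue` and `jobFinished` are hypothetical names chosen for illustration, and "starting" a job here only means handing its URL back to the caller. It shows the bookkeeping that keeps at most a fixed number of jobs in flight.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical model of the download manager: it tracks queued URLs and how
// many jobs are in flight, but performs no real downloads.
class Dispatcher {
public:
    explicit Dispatcher(std::size_t maxWorkers) : maxWorkers_(maxWorkers) {}

    // Add a URL to the list of pending work.
    void enqueue(const std::string& url) { pending_.push(url); }

    // Start queued jobs until the concurrency limit is reached.
    // Returns the URLs that were started by this call.
    std::vector<std::string> processQueue() {
        std::vector<std::string> started;
        while (active_ < maxWorkers_ && !pending_.empty()) {
            started.push_back(pending_.front());
            pending_.pop();
            ++active_;
        }
        return started;
    }

    // Call when one job finishes: free its slot, then start the next job if any.
    std::vector<std::string> jobFinished() {
        assert(active_ > 0 && "jobFinished called with no job in flight");
        --active_;
        return processQueue();
    }

    std::size_t active() const { return active_; }
    std::size_t pending() const { return pending_.size(); }
    bool idle() const { return active_ == 0 && pending_.empty(); }

private:
    std::size_t maxWorkers_;
    std::size_t active_ = 0;
    std::queue<std::string> pending_;
};
```

With a limit of 2 and three queued URLs, the first `processQueue()` call starts two jobs, and the third starts only once `jobFinished()` reports that a slot is free. In a Qt program the same counter logic would presumably sit in the slot that runs when a worker thread finishes.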

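The worker object class: the class should accept a URL, download the HTML data for that URL, and then process the information once it has been received.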

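Creating the queue and the queue manager: I created a QQueue<QString> urlList to control the number of items being processed concurrently and the list of tasks to complete.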

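    // First, declare these somewhere (e.g. as members of downloadNewArrivals)
    QQueue<checkNewArrivalWorker*> workQueue; // workers waiting to be started
    int maxWorkers = 10;                       // limit on concurrent workers
    int currentWorkers = 0;                    // workers currently running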


    // Then create the workers
    void downloadNewArrivals::createWorkers(QString url){
        checkNewArrivalWorker* worker = new checkNewArrivalWorker(url);
        workQueue.enqueue(worker);
    }

    // Make a function to control the number of workers,
    // and start the next workers after earlier ones have finished
    void downloadNewArrivals::processWorkQueue(){
        if (workQueue.isEmpty() && currentWorkers == 0){
            qDebug() << "Work Queue Empty";
        } else if (!workQueue.isEmpty()){
            // Start workers in separate threads, up to maxWorkers at a time
            while (currentWorkers < maxWorkers && !workQueue.isEmpty()){
                QThread* thread = new QThread;
                checkNewArrivalWorker* worker = workQueue.dequeue();
                worker->moveToThread(thread);
                connect(worker, SIGNAL(error(QString)), this, SLOT(errorString(QString)));
                connect(thread, SIGNAL(started()), worker, SLOT(process()));
                connect(worker, SIGNAL(finished()), thread, SLOT(quit()));
                connect(worker, SIGNAL(finished()), worker, SLOT(deleteLater()));
                connect(thread, SIGNAL(finished()), this, SLOT(reduceThreadCounterAndProcessNext()));
                connect(thread, SIGNAL(finished()), thread, SLOT(deleteLater()));
                thread->start();
                currentWorkers++;
            }
        }
    }

    // When a thread has finished, process the next worker
    void downloadNewArrivals::reduceThreadCounterAndProcessNext(){
        currentWorkers--; // this counter keeps the number of running workers below the maximum
        processWorkQueue();
    }


    // Now the worker
    // The important parts of the worker class..
    void checkNewArrivalWorker::getPages(QString url){
        QNetworkAccessManager *manager = new QNetworkAccessManager(this);
        // The request lives on the stack; get() works on its own copy
        QNetworkRequest getPageRequest = QNetworkRequest(QUrl(url));
        getPageRequest.setRawHeader("User-Agent", "Mozilla/5.0 (X11; U; Linux i686 (x86_64); "
                                    "en-US; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1");
        getPageRequest.setRawHeader("Accept-Charset", "utf-8");
        getPageRequest.setRawHeader("Connection", "keep-alive");
        connect(manager, SIGNAL(finished(QNetworkReply*)), this, SLOT(replyGetPagesFinished(QNetworkReply*)));
        connect(manager, SIGNAL(finished(QNetworkReply*)), manager, SLOT(deleteLater()));
        manager->get(getPageRequest);
    }

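    void checkNewArrivalWorker::replyGetPagesFinished(QNetworkReply *reply){
        if (reply->error() == QNetworkReply::NoError){
            // data holds your HTML to process as needed (assumes the page is UTF-8)
            QString data = QString::fromUtf8(reply->readAll());
        } else {
            emit error(reply->errorString()); // report the failure to the manager
        }
        reply->deleteLater();
        emit finished();
    }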

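After that you have your information. I just process the information directly from the QString, but I am sure you can work out how to use a DOM parser once you reach this stage.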
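As a rough illustration of processing the downloaded HTML as a plain string, here is a minimal sketch that pulls link targets out of a page. It is standard-library C++ rather than Qt, and it is a regular-expression stand-in, not a DOM parser: `extractLinks` is a hypothetical helper, it only recognizes double-quoted `href` attributes, and it will miss or misread unusual markup. For anything beyond simple pages a real DOM query is likely the safer choice.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect the value of every href="..." attribute found in an HTML string.
// Regex stand-in only: no DOM is built, and single-quoted or unquoted
// attribute values are not recognized.
std::vector<std::string> extractLinks(const std::string& html) {
    static const std::regex hrefPattern(R"rx(href\s*=\s*"([^"]*)")rx",
                                        std::regex::icase);
    std::vector<std::string> links;
    std::sregex_iterator it(html.begin(), html.end(), hrefPattern);
    const std::sregex_iterator end;
    for (; it != end; ++it) {
        links.push_back((*it)[1].str()); // capture group 1 is the attribute value
    }
    return links;
}
```

In the worker, a call like this would go where the reply data becomes available, after converting the data to a `std::string`; the extracted values could then be handed to whatever code writes to the database.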

I hope this example is enough to help you.