Answer
The algorithm we use is:
// If we are blocked by robots.txt, make sure it is obeyed.
// Our bot's user-agent string contains a link to an HTML page explaining this,
// as well as an email address that site owners can write to so that we never
// even consider their domain in the future.
// If we receive more than 5 consecutive responses with an HTTP status code of
// 500+ (or timeouts), then we assume the domain is either under heavy load and
// does not need us adding to it, or the URLs we are crawling are completely
// wrong and causing problems. Either way, we suspend crawling of this domain
// for 4 hours.
// There is a non-standard parameter in robots.txt that defines a minimum crawl
// delay. If it exists, obey it.
//
// See: http://www.searchtools.com/robots/robots-txt-elements.html
double politenessFromRobotsTxt = getRobotPolitness();

// Work-size politeness.
// Large popular domains are designed to handle load, so we can use a smaller
// delay on these sites than for smaller domains (thus small domains hosted on
// a mom-and-pop's family PC under the office desk are crawled slowly).
//
// The max delay here is 5 seconds:
//
// domainSize => range 0 -> 10
//
double workSizeTime = std::min(exp(2.52166863221 - 0.530185027289 * log(domainSize)), 5.0);
//
// You can find out how important we think your site is here:
// http://www.opensiteexplorer.org
// Look at the Domain Authority and divide by 10.
// Note: this is not exactly the number we use, but the two numbers are highly
// correlated, so it will usually give you a fair indication.

// Take into account the response time of the last request.
// If the server is under heavy load and taking a long time to respond,
// then we slow down our requests. Note: timeouts are handled above.
double responseTime = pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2);

// Use the slower of the two calculated times.
double result = std::max(workSizeTime, responseTime);

// Never go faster than the crawl-delay directive allows.
result = std::max(result, politenessFromRobotsTxt);

// Set a minimum delay: never hit a site more than once every 10th of a second.
result = std::max(result, 0.1);

// The maximum delay is one request every 2 minutes.
result = std::min(result, 120.0);
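The steps above can be assembled into one self-contained function. This is a minimal sketch: the numeric constants are taken verbatim from the snippet, but `computeCrawlDelay` and its parameters are hypothetical names standing in for the crawler's real inputs.

```cpp
#include <algorithm>
#include <cmath>

// domainSize:       roughly Domain Authority / 10, range 0 -> 10
// lastResponseTime: seconds the previous request to this domain took
// robotsDelay:      Crawl-delay from robots.txt, in seconds (0 if absent)
double computeCrawlDelay(double domainSize, double lastResponseTime,
                         double robotsDelay)
{
    // Large domains are built to handle load, so they get a smaller delay,
    // capped at 5 seconds.
    double workSizeTime = std::min(
        std::exp(2.52166863221 - 0.530185027289 * std::log(domainSize)), 5.0);

    // Slow down when the server answered slowly last time.
    double responseTime =
        std::pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2);

    double result = std::max(workSizeTime, responseTime);
    result = std::max(result, robotsDelay); // never faster than Crawl-delay
    result = std::max(result, 0.1);         // at most 10 requests per second
    result = std::min(result, 120.0);       // at least once every 2 minutes
    return result;
}
```

For example, a large domain (`domainSize = 10`) that responded quickly gets roughly a 3.7-second delay, while a robots.txt `Crawl-delay: 10` raises that to 10 seconds.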
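The Crawl-delay directive mentioned above is non-standard, so it has to be picked out of robots.txt by hand. A minimal sketch of extracting it (the function name is hypothetical, and a real parser would honour User-agent sections; this one ignores them for brevity):

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Returns the first Crawl-delay value found in robotsTxt, or 0.0 if none.
double crawlDelayFromRobotsTxt(const std::string& robotsTxt)
{
    std::istringstream in(robotsTxt);
    std::string line;
    while (std::getline(in, line)) {
        // Directive names are case-insensitive, so compare in lowercase.
        std::string lower(line);
        std::transform(lower.begin(), lower.end(), lower.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        const std::string key = "crawl-delay:";
        std::size_t pos = lower.find(key);
        if (pos != std::string::npos) {
            try {
                return std::stod(line.substr(pos + key.size()));
            } catch (...) {
                // Malformed value: ignore and keep scanning.
            }
        }
    }
    return 0.0;
}
```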