Project author: vishalzanzrukia

Project description:
This is an open source web crawler example based on Java technologies.
Language: Java
Repository: git://github.com/vishalzanzrukia/java-web-crawler.git
Created: 2017-06-11T12:43:50Z
Project community: https://github.com/vishalzanzrukia/java-web-crawler

License:


java-web-crawler

This is an open source web crawler example based on Java technologies, with the following features.

  • Automatic restart after one crawl cycle finishes
  • Configurable delay between two cycles
  • Ability to resume crawling from the saved state after a JVM or server crash or shutdown, continuing from where it left off
  • Configuration to run crawler processes for different domains
  • Configuration to set a different set of URL filters per domain
  • Configuration to set different parsers per domain
  • Configuration to enable or disable robots.txt rules
  • Configuration to set the maximum number of URL visits per second
  • Configuration to set the maximum crawl depth
  • Configuration to set the maximum number of bytes to download per page
  • Sitemap parsing support
  • Retry support for parsing
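
To make a few of the settings above concrete, here is a minimal sketch of how a per-domain URL filter, a maximum depth, and a maximum visit rate could be combined into a single check. It is not taken from the project: the class name CrawlPolicy, its methods, and the choice of "exclude" patterns keyed by host name are all assumptions made for illustration, and it uses only the Java standard library (Java 9 or later for the usage example).

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from the java-web-crawler project.
public class CrawlPolicy {
    private final int maxDepth;
    private final int maxVisitsPerSecond;
    // Host name -> patterns for URLs that should be skipped on that host (assumed layout).
    private final Map<String, List<Pattern>> excludeFiltersByDomain;

    public CrawlPolicy(int maxDepth, int maxVisitsPerSecond,
                       Map<String, List<Pattern>> excludeFiltersByDomain) {
        if (maxVisitsPerSecond <= 0) {
            throw new IllegalArgumentException("maxVisitsPerSecond must be positive");
        }
        this.maxDepth = maxDepth;
        this.maxVisitsPerSecond = maxVisitsPerSecond;
        this.excludeFiltersByDomain = excludeFiltersByDomain;
    }

    // Smallest pause between two requests that keeps the rate at or below the limit.
    public long minDelayMillis() {
        return (long) Math.ceil(1000.0 / maxVisitsPerSecond);
    }

    // Returns true only for URLs on a configured host, within the depth limit,
    // that match none of that host's exclude patterns.
    // URI.create throws IllegalArgumentException for a malformed URL.
    public boolean shouldVisit(String url, int depth) {
        if (depth > maxDepth) {
            return false;
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        List<Pattern> filters = excludeFiltersByDomain.get(host);
        if (filters == null) {
            return false; // host is not configured for crawling
        }
        for (Pattern filter : filters) {
            if (filter.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }
}
```

For example, new CrawlPolicy(2, 4, Map.of("example.com", List.of(Pattern.compile("\\.pdf$")))) would allow https://example.com/a at depth 1, reject any URL ending in .pdf or deeper than 2, and suggest waiting at least 250 ms between requests.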

Technology Stack

  • Spring Boot
  • Spring Integration
  • Redis
  • Jsoup
  • ActiveMQ
  • ElasticSearch

NOTE: This project is still in progress and is not ready for use yet.