Project author: vishalzanzrukia
Project description:
An open-source web crawler example based on Java technologies
Language: Java
Project URL: git://github.com/vishalzanzrukia/java-web-crawler.git
java-web-crawler
This is an open-source web crawler example based on Java technologies, with the following features:
- Automatic restart once a crawl cycle has finished
- Configurable delay between two cycles
- Ability to resume crawling from the state where it left off after a JVM or server crash or shutdown
- Configuration to run crawler processes for different domains
- Per-domain configuration of URL filters
- Per-domain configuration of parsers
- Configuration to enable or disable robots.txt rules
- Configurable maximum number of URL visits per second
- Configurable maximum depth to visit
- Configurable maximum number of bytes to download per page
- Sitemap parsing support
- Retry support with parsing
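
As a rough illustration of how the per-domain URL filters, maximum depth, and visits-per-second limit listed above might fit together, here is a minimal sketch in plain Java. It is not taken from this project: the class name `DomainCrawlPolicy`, its methods, and its parameters are all hypothetical, and the fixed-interval limiter is just one simple way to cap the visit rate.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical per-domain crawl settings; none of these names come from the project.
final class DomainCrawlPolicy {
    private final String domain;
    private final List<Pattern> excludeFilters = new ArrayList<>();
    private final int maxDepth;
    private final long minIntervalNanos; // minimum gap between two fetches
    private boolean hasFetched = false;
    private long lastFetchNanos;

    DomainCrawlPolicy(String domain, List<String> excludeRegexes,
                      int maxDepth, int maxVisitsPerSecond) {
        if (maxVisitsPerSecond <= 0) {
            throw new IllegalArgumentException("maxVisitsPerSecond must be positive");
        }
        this.domain = domain;
        for (String regex : excludeRegexes) {
            excludeFilters.add(Pattern.compile(regex));
        }
        this.maxDepth = maxDepth;
        this.minIntervalNanos = 1_000_000_000L / maxVisitsPerSecond;
    }

    // True if the URL is on this domain (or a subdomain), within the depth
    // limit, and matches none of the exclude filters.
    boolean shouldVisit(String url, int depth) {
        if (depth > maxDepth) {
            return false;
        }
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) {
            return false;
        }
        if (!host.equals(domain) && !host.endsWith("." + domain)) {
            return false;
        }
        for (Pattern filter : excludeFilters) {
            if (filter.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }

    // Fixed-interval rate limit: allows a fetch only if at least
    // 1/maxVisitsPerSecond seconds have passed since the last allowed one.
    synchronized boolean tryAcquire(long nowNanos) {
        if (hasFetched && nowNanos - lastFetchNanos < minIntervalNanos) {
            return false;
        }
        hasFetched = true;
        lastFetchNanos = nowNanos;
        return true;
    }
}
```

In this sketch a caller would create one policy per configured domain, check `shouldVisit(url, depth)` before queueing a link, and call `tryAcquire(System.nanoTime())` before each fetch. Taking the timestamp as a parameter keeps the limiter easy to test without sleeping.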
Technology Stack
- Spring Boot
- Spring Integration
- Redis
- Jsoup
- ActiveMQ
- Elasticsearch
NOTE: This project is still in progress and is not ready for use yet.