Project author: joshkunz

Project description:
[Reference Only] An asynchronous, multiprocessed, python based spider framework.

Language: Python
Repository: git://github.com/joshkunz/spider.py.git
Created: 2012-02-28T21:22:10Z
Project community: https://github.com/joshkunz/spider.py

License:


WARNING: This repository is no longer maintained and was never intended for
any kind of real-life usage. It was mainly written for me to learn more
about parallelism and multiplexed I/O; the code is meh and it likely
no longer works. WARNING

spider.py

An asynchronous, multiprocessed, python spider framework.

Getting Started

The spider is separated into two parts: the actual engine and the extractors.
The engine submits the requests and handles all of the processes and
connections. The extractors are functions that are registered to be called
after a page has been loaded and parsed.

The engine is represented by the Scour object:

    import spider
    scour = spider.Scour(seed_urls=[])

Extractors are registered using the scour object and the extractor decorator:

    @scour.extractor
    def do_something(process, page, response):
        pass

Or they can be registered by passing the function to scour.extractor:

    scour.extractor(lambda process, page, response: True)
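
For example, a plain named function can be registered the same way. Below is a
minimal sketch: log_status is a hypothetical extractor of my own, and it assumes
the Tornado response object is an HTTPResponse exposing code and effective_url;
process.log comes from the callback interface documented in the example further down.

    def log_status(process, page, response):
        # Hypothetical extractor: record the status code of each fetched page.
        process.log.info("fetched %s -> %d" % (response.effective_url, response.code))

    scour.extractor(log_status)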

After all of the extractors have been registered, the actual spider can be run:

    scour.run()

This will start up multiple processes and begin downloading the pages in
its queue. Extractors can add new URLs to the queue using process.get.
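
For instance, an extractor might only queue the absolute links it finds on a
page. This is a sketch: the filtering logic is my own addition; only process.get,
page.xpath, and the fact that page may be None come from the framework's docs.

    @scour.extractor
    def absolute_links_only(process, page, response):
        # Hypothetical extractor: queue only absolute http(s) links.
        if page is None:  # the page could not be parsed
            return
        for url in page.xpath("//a/@href"):
            if url.startswith("http://") or url.startswith("https://"):
                process.get(url)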

There’s also a lot of documentation in spider.py, and it’s not very long,
only ~300 lines.

Very Basic Example

More complete examples can be found in the /examples/ folder (see basic.py).

    import spider

    # Don't actually use google, your spider won't get far
    seeds = ["http://google.com"]
    scour = spider.Scour(seed_urls=seeds)

    @scour.extractor
    def churn_urls(process, page, response):
        """Put all of the urls on the page into the Queue.

        process: The process this callback is running in.
            process.log.{info,debug,warn, etc..} to write to the log file
            process.get(url) to add a url to the queue
        page: lxml.html representation of the page, or None if no page
            could be parsed
        response: Tornado response object
        """
        if page is None:  # nothing was parsed, so there is nothing to extract
            return
        urls = page.xpath("//a/@href")  # get a list of the urls on a page
        for url in urls:
            process.get(url)

    scour.run()
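
Multiple extractors can be registered on the same Scour object before calling
run(). As a sketch, a second, hypothetical extractor could log each page's
title alongside churn_urls above; it uses only page.xpath and process.log.info
from the documented callback interface.

    @scour.extractor
    def log_titles(process, page, response):
        # Hypothetical second extractor: record each page's <title> in the log.
        if page is None:
            return
        titles = page.xpath("//title/text()")
        if titles:
            process.log.info("title: %s" % titles[0].strip())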