Project author: gaojiuli

Project description: Web crawling framework based on asyncio.

Language: Python

Repository: git://github.com/gaojiuli/gain.git

Created: 2017-05-31T08:56:04Z

Project home: https://github.com/gaojiuli/gain

License: GNU General Public License v3.0

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

  • Python 3.5+

Installation

```shell
pip install gain
pip install uvloop  # Linux only
```
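
uvloop is an optional, Linux-only accelerator; the assumption here (based on it being a separate install step) is that gain runs on the default asyncio event loop when uvloop is absent. A quick sanity check of the install:

```python
# Quick post-install sanity check (not part of the project docs).
import gain

try:
    import uvloop  # optional Linux-only speedup
    print('uvloop available:', uvloop.__version__)
except ImportError:
    # Assumption: gain falls back to the default asyncio loop.
    print('uvloop not installed')
```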

Usage

  1. Write `spider.py`:

```python
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings avoid invalid \d escape warnings in the URL patterns.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
```
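
Each `Css` field is extracted into `self.results`, keyed by the attribute name, which is what the example's `save()` relies on. As a minimal sketch (a hypothetical `PostWithContent` item, reusing the selectors above), both fields could be persisted:

```python
from gain import Css, Item
import aiofiles


# Hypothetical variant of Post that saves both extracted fields.
class PostWithContent(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # self.results maps field names to extracted text, as above.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'] + '\n')
            await f.write(self.results['content'] + '\n')
```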

Or use `XPathParser`:

```python
from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'


MySpider.run()
```

You can add a proxy setting to the spider via the `proxy` class attribute, as shown above.

  2. Run `python spider.py`.

  3. Result: each matched page is parsed and handed to `save()`; in the first example, the post titles accumulate in `scrapinghub.txt`.
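
A quick way to inspect the output of the first example (a hypothetical check, not part of the framework):

```python
# Print the titles collected by the first example's save() method.
with open('scrapinghub.txt') as f:
    print(f.read())
```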

Example

The examples are in the /example/ directory.

Contribution

  • Submit a pull request.
  • Open an issue.