项目作者: seerocode

项目描述 :
This web scraper is built in Python and being used to scrape article links from The Daily News search website.
高级语言: Python
项目地址: git://github.com/seerocode/daily-news-web-scraper.git
创建时间: 2017-04-06T07:42:12Z
项目社区:https://github.com/seerocode/daily-news-web-scraper

开源协议:

下载


daily-news-web-scraper

My first web scraper (using Python 3, please).

Use

Built for research project to compile data on gun violence in the Bronx and surrounding areas + contributing factors. This web scraper is for personal use and is a learning process. Please fork and use at your own discretion.

Search Terms

Bronx, Gun

Notes

This web scraper is built in Python 3 and being used to scrape article links from The Daily News search website. The scraper is needed so my research advisor and I do not have to manually go through articles in the TDN and figure out which one we should use.

To-Dos (in no certain order)

Whatever has a :construction: emoji is what I am currently working on as of last commit. Also, some of these to-dos are extensions built on top of this scraper.

  • [X] Fix up style of code as it’s all over the place (flesh out methods, remove redundant code, etc.) | EDIT: Completed 5/13/17
  • [X] Extend search to multiple pages (this is half working as pagination goes deeper than 13 pages but it only captures the first 13) |
    EDIT: 4/27/17 - Completed
  • Build web interface to type in any search URL from Daily News and return a list of articles from that search
  • [X] Iterate on interface to allow for custom search terms to be applied to the Daily News search from the interface and return full list of articles for that search | EDIT: 5/23/17 - Completed
  • :construction: Connect newspaper API to interface with web and display the articles in a readable format
  • Connect list to database to allow user removal of a document from web interface if it does not fit the specified criteria (e.g. article is vague, doesn’t relate to gun violence, dupe, etc.)
  • Allow user to add article to database
  • Highlight key words in extracted article text
  • [X] Extract titles of articles, dates, and author (from HTML tags) | EDIT: 4/27/17 Completed with the newspaper API
  • :construction: Connect Google Sheets to scraper
  • :construction: Display the article content on web