I am working on a medium-sized side project that involves mining Reddit data. It fetches a list of all posts on selected subreddits and copies the complete information to a Google spreadsheet for further analysis (more on the project later).
Reddit, unlike most websites, allows web scraping as long as crawler scripts limit how often they send requests to the Reddit servers (see rules). You don’t even need a developer account or an API key to perform scraping on Reddit.
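To make the rate limit concrete, here is a minimal sketch of a throttle helper in JavaScript, the language of Google Apps Script. The function name `msToWait` and the 2000 ms interval are assumptions for illustration only; check Reddit's rules for the actual limit.

```javascript
// Assumed interval for illustration only; not taken from Reddit's rules.
var MIN_INTERVAL_MS = 2000;

// Hypothetical helper: how long to pause before the next request so that
// consecutive requests are at least minIntervalMs apart.
// lastRequestAt and now are timestamps in milliseconds.
function msToWait(lastRequestAt, now, minIntervalMs) {
  if (lastRequestAt === null) {
    return 0; // first request, nothing to wait for
  }
  var elapsed = now - lastRequestAt;
  return Math.max(0, minIntervalMs - elapsed);
}
```

In Apps Script, the returned value could be passed to `Utilities.sleep()` before each request.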
There are popular tools like wget, SiteSucker (Mac) or HTTrack Website Copier (Windows) that can download entire websites for offline use, yet they are mostly useless for scraping Reddit data because the site doesn’t use page numbers and the content of its pages is constantly changing. A post may be listed on the first page of a subreddit, yet it could find itself on the third page a second later as other posts are voted to the top.
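Since there are no page numbers, Reddit's JSON listings are walked with an `after` cursor instead: each response carries the token for the next batch. The sketch below shows that loop under some assumptions. The names `buildListingUrl` and `fetchAllPosts` are made up for this example, the choice of the `new.json` listing is an assumption, and `fetchPage` is a hypothetical function you supply that returns the parsed JSON for a URL.

```javascript
// Build the URL of a subreddit's JSON listing. "after" is the cursor
// returned by the previous response (null for the first batch).
function buildListingUrl(subreddit, after) {
  var url = "https://www.reddit.com/r/" + encodeURIComponent(subreddit) +
            "/new.json?limit=100";
  if (after) {
    url += "&after=" + encodeURIComponent(after);
  }
  return url;
}

// Follow the cursor until Reddit stops returning one.
// fetchPage(url) is a hypothetical function that returns the parsed JSON.
function fetchAllPosts(subreddit, fetchPage) {
  var posts = [];
  var after = null;
  do {
    var listing = fetchPage(buildListingUrl(subreddit, after)).data;
    listing.children.forEach(function (child) {
      posts.push(child.data); // each child wraps one post in its "data" field
    });
    after = listing.after; // null or missing when there are no more batches
  } while (after);
  return posts;
}
```

In Apps Script, `fetchPage` could be something like `function (url) { return JSON.parse(UrlFetchApp.fetch(url).getContentText()); }`.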
- Open the Google Sheet and choose File – Make a copy to save a copy of the sheet in your Google Drive.
- Go to Tools – Script editor and copy-paste the Reddit Scraper Script. You can change “LifeProTips” to any other subreddit name.
- While in the script editor, choose Run – Run and authorize the script.
That’s it. The script will run in the background, automatically pulling posts from Reddit into the Google spreadsheet. And it stops automatically once all the posts* of that subreddit have been fetched.
[*] All subreddits on Reddit show a maximum of 1000 posts – you can’t go beyond that number even while manually browsing a subreddit.
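For a rough idea of how fetched posts could end up as spreadsheet rows, here is a small sketch. It is not the actual Reddit Scraper Script: the function names `postToRow` and `postsToRows` and the choice of columns are assumptions, while `created_utc`, `title`, `permalink`, `score` and `num_comments` are fields found in Reddit's post JSON.

```javascript
// Convert one Reddit post object into a spreadsheet row (assumed columns).
function postToRow(post) {
  return [
    new Date(post.created_utc * 1000).toISOString(), // epoch seconds to ISO date
    post.title,
    "https://www.reddit.com" + post.permalink,       // permalink is site-relative
    post.score,
    post.num_comments
  ];
}

// Convert a list of posts into a 2D array, one inner array per row.
function postsToRows(posts) {
  return posts.map(postToRow);
}
```

In Apps Script, a 2D array like this could be written to the sheet in one call with `sheet.getRange(row, column, rows.length, rows[0].length).setValues(rows)`.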