If you’re a developer, chances are you’ve written a web scraper at some point in your career. You did it for a personal project, as a learning exercise, or because someone asked you to build one. My first freelance job, years and years ago, was a scraper on Freelancer.com.
There is often confusion around the legality of scraping, but it’s not illegal to scrape public data. This was reinforced by hiQ Labs v. LinkedIn, in which the US Ninth Circuit Court of Appeals held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act.
Legality aside (because we’ve already established it’s not illegal), let’s talk about the ethics. As a scraper, you’re not obligated to abide by any of the guidelines below, but following them is the right thing to do.
Furthermore, scraping is a two-way street. In my opinion, it’s unethical for sites to intentionally hamper people’s ability to scrape their public data. Website owners need to accept that scraping can and will happen.
- If there is a publicly available API, you should use that instead. The only exception to this rule is when the provided API is lacking in features or places unfair limits on the data you can obtain.
- Adhere to the robots.txt file. Respect what it allows and disallows.
- Take only what you need. If you only need the text on a recipe website, but not the imagery, just take that. Scraping data en masse and then throwing a lot of it away is wasteful and in poor taste.
- Don’t DDoS the site you’re scraping. Rate limit your scraper and cap the number of connections (even if it means it’ll take longer). Hitting a site too hard could get you blacklisted, as it might look like a DDoS attack.
- Provide attribution. If you’re using data from a website, don’t pass it off as your own.
- Identify yourself to make it easier for site owners to reach out if there are any problems, and always be willing to talk and negotiate.
- Scrape outside of peak hours. This goes hand-in-hand with not DDoSing the site you’re scraping. Scraping late at night or in the early hours of the morning (where possible) means you won’t degrade the experience for others.
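Several of these guidelines translate fairly directly into code. What follows is only a minimal sketch in Python using the standard library, not a definitive implementation: it checks robots.txt before fetching, waits between requests, and identifies itself through a User-Agent header. The bot name, contact address and the helper names (`load_robots`, `RateLimiter`, `polite_fetch`) are all made up for illustration.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identity: use a name and contact details site owners can reach.
USER_AGENT = "example-recipe-bot/0.1 (contact: you@example.com)"


def load_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforces a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def polite_fetch(url, robots, limiter, opener=urllib.request.urlopen):
    """Fetch url only if robots.txt allows it; return None when disallowed."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    limiter.wait()  # slow down before every request that is actually sent
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with opener(request, timeout=10) as response:
        return response.read()
```

In practice you would download the site’s robots.txt once, pass its text to `load_robots`, create a single `RateLimiter` (a couple of seconds between requests is an arbitrary but gentle starting point), and route every request through `polite_fetch`. Scheduling the run for off-peak hours is better left to something like cron than to the scraper itself.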
The important thing to note here is that the entire internet is built on scraping. Google built its business on scraping, news aggregators are built on scraping, and ticketing and booking platforms (especially when they were starting out) used scraping. Have you noticed that when you paste a link into a Facebook or LinkedIn post, it scrapes the thumbnail image as well as the title? That’s scraping.
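To make the link preview example concrete, here is a rough sketch of how such a preview could be pulled out of a page. It assumes the page exposes the commonly used Open Graph `og:title` and `og:image` meta tags and falls back to the `<title>` element. It is not how Facebook or LinkedIn actually implement it, and the names `LinkPreviewParser` and `extract_preview` are hypothetical.

```python
from html.parser import HTMLParser


class LinkPreviewParser(HTMLParser):
    """Collects og:title / og:image meta tags and the <title> text."""

    def __init__(self):
        super().__init__()
        self.preview = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            prop = attributes.get("property") or ""
            content = attributes.get("content")
            if prop in ("og:title", "og:image") and content:
                # Keep the first value seen; drop the "og:" prefix.
                self.preview.setdefault(prop[3:], content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_preview(html):
    """Return a dict with 'title' and/or 'image' keys found in the HTML."""
    parser = LinkPreviewParser()
    parser.feed(html)
    parser.close()
    preview = dict(parser.preview)
    if "title" not in preview:
        fallback = "".join(parser.title_parts).strip()
        if fallback:
            preview["title"] = fallback
    return preview
```

Given the HTML of a page as a string, `extract_preview` returns whatever title and thumbnail URL it could find, which is all a basic link preview needs.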