I Like Kill Nerds

The blog of Australian Front End / Aurelia Javascript Developer & brewing aficionado Dwayne Charrington // Aurelia.io Core Team member.

The Ethics of Web Scraping

Opinion · September 2, 2020

Not that kind of scraping.

If you’re a developer, chances are you’ve written a web scraper at some point in your career. Maybe it was for a personal project or a learning exercise, or maybe someone asked you to build one. My first freelance job was a scraper on Freelancer.com years and years ago.

There is often confusion around the legality of scraping, but scraping public data is generally not illegal. This was reinforced by hiQ Labs v. LinkedIn, in which a US appeals court held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act.

Legality aside (because we already established it’s not illegal), let’s talk about the ethics. As a scraper, you’re not obligated to abide by any of these ethical guidelines, but following them is the right thing to do.

Furthermore, when it comes to scraping, it is a two-way street. In my opinion, it’s unethical for sites to intentionally hamper people’s ability to scrape their public data. Website owners need to accept that scraping can and will happen.

  • If there is a publicly available API, you should use that instead. The only exception to this rule is when the provided API is lacking in features or places unfair limits on the data you can obtain.
  • Adhere to the robots.txt file. Respect what it allows and disallows.
  • Take only what you need. If you only need the text on a recipe website, but not the imagery, just take the text. Scraping data en masse and then throwing a lot of it away is wasteful and in poor taste.
  • Don’t DDoS the site you’re scraping. Rate limit your scraper and cap the number of connections (even if it means it’ll take longer). Hitting a site too hard could get you blacklisted, as it might look like a DDoS attack.
  • Provide attribution. If you’re using data from a website, don’t pass it off as your own.
  • Identify yourself to make it easier for site owners to reach out if there are any problems and always be willing to talk and negotiate.
  • Scrape outside of peak hours. This goes hand-in-hand with not DDoS attacking the site you’re scraping. Scraping late at night and in the early hours of the morning (where possible) means you won’t degrade the experience for others.
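
To make a few of these guidelines concrete, here is a minimal Python sketch (standard library only) that covers three of them: respecting robots.txt, rate limiting, and identifying yourself. It is one possible way to do it, not a definitive implementation. The `PoliteScraper` class, the `http_get` helper, the bot name, the contact address and the two second default delay are all assumptions made up for illustration.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical identity: name your bot and give site owners a way to reach you.
USER_AGENT = "example-recipe-bot/1.0 (+mailto:you@example.com)"


def http_get(url, user_agent):
    """Fetch a URL with an identifying User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


class PoliteScraper:
    """Checks robots.txt rules and spaces requests out before fetching."""

    def __init__(self, robots_txt, user_agent=USER_AGENT, min_delay=2.0,
                 fetch=http_get, clock=time.monotonic, sleep=time.sleep):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        # Honour a Crawl-delay directive if it asks for more than our default.
        crawl_delay = self.rules.crawl_delay(user_agent)
        self.min_delay = max(min_delay, crawl_delay or 0)
        self.fetch = fetch
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def get(self, url):
        """Return the page body, or None if robots.txt disallows the URL."""
        if not self.rules.can_fetch(self.user_agent, url):
            return None
        if self.last_request is not None:
            # Wait out whatever is left of the minimum delay.
            wait = self.min_delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        self.last_request = self.clock()
        return self.fetch(url, self.user_agent)
```

In practice you would first download the site’s robots.txt (for example with `http_get`), pass its text to `PoliteScraper`, and then call `get()` for each page. The `fetch`, `clock` and `sleep` parameters only exist so the behaviour can be tested without touching a real site or actually waiting.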

The important thing to note here is that the entire internet is built on scraping. Google built its business on scraping, news aggregators are built on scraping, and ticketing and booking platforms (especially when they were starting out) used scraping. Have you noticed that when you paste a link into a Facebook or LinkedIn post, it scrapes the thumbnail image as well as the title? That’s scraping.

Dwayne


Copyright © 2023 · Dwayne Charrington