- October 9th, 2015
Top 3 Cost Drivers For Every Web Scraping Project
Continuing from my previous posts on Web Data Extraction, I’d like to explore what I see as the main cost drivers for every Web Data Extraction project. These drivers are relevant both in a Data-as-a-Service application and in a Platform play, although a few additional factors come into the picture when purchasing a platform. Each of these adds a multiple to the cost of delivering a project, so be measured when weighing what you really “need” against your available budget.
Volume– How much data is expected? Is this millions of rows? Hundreds of thousands? Tens of thousands?
The volume of a web scraping project will be a huge physical cost driver. The higher the volume of data, the more careful a provider will need to be about how they go about harvesting. They may need to bring in third-party services, like a Proxy Provider, or leverage a more expensive IP set. Depending on your Frequency, they may also need to bring more infrastructure online to support your extraction. You would face both of these challenges yourself if you were running this project as a platform in your enterprise. Why? Simple. The larger your footprint on the source, the more attention you draw. The more attention you draw, the more likely the source is to block you. Once you are blocked, you can’t access the data you need without going about it in a whole new way. There are other implications that come with this as well, but that is an entirely different post!
Finally, the business of web harvesting can be murky on occasion, but this is an aspect of it that should not be. It is pretty easy to estimate the amount of data that the project hopes to produce. If you have a high quantity of sources, take a sample of them and explore the amount of data that might exist on each site. Then, multiply the average per site across your full site list. Or, if you have just one or two high-volume sources, it is more than likely that they will display the number of records that exist. If not, just take the number of records on a page and multiply that by the number of pages, or apply whatever similar logic fits the page structure.
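To make the two estimation approaches above concrete, here is a minimal sketch in Python. The function names and every number in it are hypothetical and chosen only for illustration. It also assumes the sampled sites are reasonably representative of the full list, which may not hold for your sources.

```python
from statistics import mean


def estimate_from_sample(sample_counts, total_sites):
    """Extrapolate a total record count from per-site counts in a sample.

    Assumes the sampled sites are representative of the full site list.
    """
    return round(mean(sample_counts) * total_sites)


def estimate_from_pagination(records_per_page, page_count):
    """Estimate records on one source from its page size and page count."""
    return records_per_page * page_count


# Hypothetical numbers: four sampled sites out of a list of 200.
sample_counts = [1200, 800, 1500, 500]
total_estimate = estimate_from_sample(sample_counts, total_sites=200)

# Hypothetical single source: 25 records per page across 400 pages.
paged_estimate = estimate_from_pagination(records_per_page=25, page_count=400)

print(total_estimate)
print(paged_estimate)
```

A rough number like this is usually enough to tell you whether you are talking about tens of thousands of rows or millions, which is what a provider needs to know to price the work.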
Frequency– Will this be an every minute, hourly, daily, weekly, or monthly harvest?
Yet another large cost driver is the frequency with which sources need to be scraped. This creates a multiple when applied to either Volume or Quantity, or both! Much like volume, how frequently you extract from a source increases the likelihood of detection and blocking by that source. Plus, the more frequent an extraction, the more hardware it will usually take. So the key is finding a sweet spot where you get the data you need without leaving a large footprint or having servers up the wazoo.
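As a rough illustration of how frequency acts as a multiplier, the sketch below counts page requests per month for the same harvest at different schedules. The 30-day month, the four-week month, and the figure of 400 pages per run are simplifying assumptions, not numbers from a real project.

```python
# Approximate runs per month for each schedule (assumes a 30-day month
# and four weekly runs per month).
RUNS_PER_MONTH = {
    "monthly": 1,
    "weekly": 4,
    "daily": 30,
    "hourly": 30 * 24,
    "every_minute": 30 * 24 * 60,
}


def monthly_page_requests(pages_per_run, frequency, site_count=1):
    """Page requests per month = pages per run x runs per month x sites."""
    return pages_per_run * RUNS_PER_MONTH[frequency] * site_count


# Hypothetical harvest: 400 pages per run against a single site.
for frequency in RUNS_PER_MONTH:
    print(frequency, monthly_page_requests(400, frequency))
```

Moving from weekly to daily multiplies the footprint on the source by about seven or eight, and moving from daily to hourly multiplies it by another 24, which is why the schedule is worth questioning before anything else.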
Quantity– How many sites are being targeted? Is it just one? Tens? Hundreds?
This is a no-brainer! The more sources that are being targeted, the more it is going to cost. Why? Well, depending on the types of sources and the structure of the data, there is a very high likelihood that each source will need some extra level of customization. When talking about customization you have to think labor, and not just labor but skilled labor. I had a customer come to me not too long ago who wanted to extract data from 17,000 sources and only had $10K to spend… which basically means he was willing to spend about $0.59 a source. Assuming only light customization is needed, say 15 minutes per source, the provider would spend 4,250 hours, or a little over 2 person-years, of effort to bring those 17,000 sources online… and this does not even address the low-quality data you should expect out of a system like this. So that shows you the gap between how much he valued the data and what it would actually cost to get it.
Long story really short, we had to develop a new way to attack the same problem that did not take 15 years to pay back his original investment.
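The arithmetic behind the 17,000-source example can be checked with a few lines of Python. The sources, budget, and minutes per source come from the story above. The 2,080-hour person-year (40 hours a week for 52 weeks) is my assumption, and the implied hourly rate is simply derived from the other figures.

```python
SOURCES = 17_000
BUDGET_USD = 10_000
MINUTES_PER_SOURCE = 15
HOURS_PER_PERSON_YEAR = 2_080  # assumption: 40 hours/week x 52 weeks

# What the customer was willing to pay per source.
budget_per_source = BUDGET_USD / SOURCES

# Labor needed at 15 minutes of customization per source.
total_hours = SOURCES * MINUTES_PER_SOURCE / 60
person_years = total_hours / HOURS_PER_PERSON_YEAR

# What the budget would pay per hour of that labor.
implied_hourly_rate = BUDGET_USD / total_hours

print(f"Budget per source:   ${budget_per_source:.2f}")
print(f"Total effort:        {total_hours:,.0f} hours")
print(f"Person-years:        {person_years:.2f}")
print(f"Implied hourly rate: ${implied_hourly_rate:.2f}")
```

Seen this way, the budget pays well under $3 for each hour of skilled labor, which is why the problem had to be attacked differently.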
To keep costs down, no matter if you are building a prototype or an MVP, consider how you can do more with less. Can you do this with a small amount of data to start with? If so, then think about hiring an individual off of Upwork, or the like, for a one-time extraction… even some smaller DaaS shops will pick up one-time work if the price is right. Once you have what you need and have proved out your product, move on to a more robust solution. The same goes for Frequency and Quantity. If your use case calls for daily, try to get by with weekly or even less frequently until you prove out your product. Likewise, if you want to target 17,000 sites, maybe you can bite off a smaller number until the value prop is proven. Think, is there a 20/80? Or maybe a 5/95? Or even broaden your horizons, maybe this data exists in a smaller field of sources…