You Have a Product Idea That Includes Extracted Web Data… Now What?
You have struggled with the pain of a problem ripe to be solved, or you have identified an opportunity that needs to be seized and have now conceptualized the surefire way to solve it. There is just one problem: that idea will require you to utilize data from the Web in combination with your smarts to make it valuable.
Let’s run down the list of terms that you’ve Googled up until this point: Web Scraping, Screen Scraping, Web Crawling, Web Extraction, and if you really want to church it up, Web Harvesting. They all really mean the same thing, which is targeting a source of data on the web, capturing it by some means, and returning it to you in a form or fashion that is acceptable. Sure, the tools and services out there each go about it in different ways that span the spectrum from brute force to intelligent precision extraction, but at the end of the day, all that should matter is that you are receiving high-quality, accurate data on the frequency and schedule you expect, and, if it is mission critical, that the company/programmer/source does not go *poof* in the night.
I have found these “Top 3” considerations to be critical when assessing product ideas and their risks before engaging a vendor, bringing in your team, or buying software.
Creating Your Extraction Profile
1) # of Sources – I will just get in front of this at the get-go. If you are planning to extract data from only one source, then your project will be in the high-risk category. Some of the riskiest choices are aggregators like Expedia, Kayak, Shopzilla, and Pronto, and moving sources like Zillow, Trulia, Autotrader, Cars.com, Yelp, and Craigslist. All these sources make their money based upon their data, usually pay-for-use and advertising $’s (sometimes they’ll even license their data set), so they spend a lot of money to acquire and ultimately protect that data from people of our ilk. Therefore, it is a risky proposition to build a viable long-term product based on data from these sources. (Read: proceed with caution.)
On the flip side, if you need hundreds or thousands of sources before your product is viable, then you might just be on to something. The more sources = the higher barrier to entry for new competition… of course, that means a longer runway before your product hits its MVP, maybe. My suggestion here is to survey the list of sources and stack rank them based on ROI (think: the # of records available, or even the quality of content) and come up with your 80/20, or in this case the 20/80. Can you come up with 20% of the sources that will give you 80% of the data functionality you need?
As you move up the scale from one source to hundreds, hand-building the extraction for each source stops being viable for a programming team and becomes a candidate for automation. This decision should not be made in a vacuum, though; it should account for the frequency, the number of sources, their build times, break rates, and fix times, all of which I will cover in a future post.
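To make the 20/80 survey a little more concrete, here is a minimal sketch in Python. It assumes you have already estimated a record count for each candidate source; the source names, the counts, and the function name pick_top_sources are all made up for illustration. It ranks by record count only, so if quality of content matters more to your product, you would swap in your own score.

```python
from typing import Dict, List, Tuple


def pick_top_sources(
    record_counts: Dict[str, int], target_share: float = 0.80
) -> Tuple[List[str], float]:
    """Rank sources by record count and return the shortest ranked list
    that covers at least target_share of all records, plus the share covered."""
    total = sum(record_counts.values())
    if total <= 0:
        return [], 0.0

    # Stack rank: biggest contributors first.
    ranked = sorted(record_counts.items(), key=lambda kv: kv[1], reverse=True)

    chosen: List[str] = []
    covered = 0
    for name, count in ranked:
        chosen.append(name)
        covered += count
        if covered / total >= target_share:
            break
    return chosen, covered / total


# Hypothetical survey results (estimated records per source).
survey = {
    "source_a": 50_000,
    "source_b": 25_000,
    "source_c": 10_000,
    "source_d": 8_000,
    "source_e": 4_000,
    "source_f": 2_000,
    "source_g": 1_000,
}

chosen, share = pick_top_sources(survey)
print(f"{len(chosen)} of {len(survey)} sources cover {share:.0%} of records: {chosen}")
```

With these made-up numbers, three of the seven sources get you past the 80% mark. Real surveys will rarely be that tidy, but the exercise tells you which sources to build first.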
2) Frequency – I will shamelessly proclaim that I have helped build every Web Data project variation that has ever existed, and a key component to understand before moving forward is how frequently you need the data in order to make your product idea valuable. While it might seem obvious, I will state it anyway: extracting data from 100 sources once a month can take drastically different resources than extracting from one source in near real time, on demand. Resources can be several things: people, hardware, software, third-party support services, bandwidth, etc. I recommend mapping out exactly what you need to make your product viable and then building your tiers of value:frequency:costs from there.
3) Amount of Data – Finally, some serious thought should be given to the amount of data you will be extracting from every source. You should have a good idea upfront of what to expect from each source. Not only will this go a long way toward projecting storage and DB configurations, it will also allow you to establish a baseline, troubleshoot if your data rates start falling below a set threshold, and even start calculating how much you can expand with your current processing power.
Bonus: When thinking about the Frequency and the Amount of Data you plan on extracting, note that as the frequency increases and the amount of data you need to download continues to expand, your footprint will only grow that much larger, and with that you risk being identified and shut down by the source.
Finally, before moving forward with attempting to engage a vendor or purchase software, knowing your Extraction Profile will only save you time and money. If you don’t need to build a clone of the internet, don’t. Start small, prove your product, and expand organically.
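The baseline-and-threshold idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the record counts are invented, the function name find_low_volume_sources is hypothetical, and the 70% cutoff is a placeholder you would tune per source.

```python
from statistics import mean
from typing import Dict, List


def find_low_volume_sources(
    history: Dict[str, List[int]],
    latest: Dict[str, int],
    min_ratio: float = 0.7,  # assumed cutoff: flag anything under 70% of baseline
) -> Dict[str, float]:
    """Compare each source's latest record count to its historical average
    and return the sources that fell below min_ratio of that baseline."""
    flagged: Dict[str, float] = {}
    for source, counts in history.items():
        if not counts:
            continue  # no baseline yet for this source
        baseline = mean(counts)
        current = latest.get(source, 0)  # a missing run counts as zero records
        if baseline > 0 and current < min_ratio * baseline:
            flagged[source] = current / baseline
    return flagged


# Hypothetical record counts from previous runs and from the latest run.
history = {
    "source_a": [1000, 1040, 960],
    "source_b": [500, 520, 480],
}
latest = {"source_a": 990, "source_b": 200}

flagged = find_low_volume_sources(history, latest)
for source, ratio in flagged.items():
    print(f"{source} delivered only {ratio:.0%} of its baseline")
```

Here source_b is flagged because its latest run returned well under its usual volume, which is often the first sign that a source changed its layout or started blocking you.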
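For a rough feel of how frequency and amount of data multiply into footprint, here is a back-of-the-envelope sketch in Python. The class name ExtractionPlan and the page counts are hypothetical, and it assumes one request per page spread evenly over the day. What a given source will actually tolerate is something only the source knows.

```python
from dataclasses import dataclass


@dataclass
class ExtractionPlan:
    """Hypothetical plan for one source: how much you pull and how often."""
    pages_per_run: int
    runs_per_day: float

    def requests_per_day(self) -> float:
        # Assumes one request per page.
        return self.pages_per_run * self.runs_per_day

    def avg_seconds_between_requests(self) -> float:
        # Assumes requests are spread evenly across a 24-hour day.
        per_day = self.requests_per_day()
        return float("inf") if per_day == 0 else 86_400 / per_day


# Same amount of data, two very different frequencies (made-up numbers).
monthly = ExtractionPlan(pages_per_run=2_000, runs_per_day=1 / 30)
hourly = ExtractionPlan(pages_per_run=2_000, runs_per_day=24)

for label, plan in (("monthly", monthly), ("hourly", hourly)):
    print(
        f"{label}: {plan.requests_per_day():,.0f} requests/day, "
        f"one every {plan.avg_seconds_between_requests():,.1f} seconds on average"
    )
```

The same 2,000 pages go from a few dozen requests a day to tens of thousands once you move from monthly to hourly, and that is the kind of jump that gets noticed.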
Series Note: This is the first in a series of publications on web data extraction. The goal of this series is to help shed some light on the full life cycle of web data extraction. If you have any specific questions, feel free to contact me at email@example.com or visit my website, Frigginyeah.