• September 8th, 2015
  • admin

Programmer, Data as a Service, Software, or Licensing… what is right for my Web Data based product?

Continuing from my previous post, “You Have a Product Idea That Includes Extracted Web Data… Now What?” (http://tinyurl.com/pr9x5vg), I want to explore the four main ways you can acquire the data necessary to fuel your product.

Much like every situation in business, not all solutions for a given problem are the right ones. What you need today for a product prototype may be drastically different in terms of time and dollars invested than what you may need when your product goes live (or if you’re lucky, not until your product is realizing some returns on your investment). It could even be that your idea cannot stand up for itself unless you go “all-in”, which makes what you want to do even riskier.

Below I have listed the four main ways that you can go about obtaining the web data necessary to fuel your product. How you move forward depends on many factors, including where you are with your product, the total dollars you have to invest, the amount of control and predictability you want, and in some cases, the desire to focus on your core product while allowing others to do the commodity work of web harvesting.

Programmer ($ – $$$) – C#, Perl, Python, PHP, or what have you, used to build scripts that access source content. Your primary costs here are wages and whatever you need to keep your infrastructure up and humming along.

Pros 

  • You control your sources.
  • Extracting from new sources can be as easy as available keyboard time.
  • The build and maintenance process is more transparent. (This does not mean easier.)
  • Everything is in-house.

Cons 

  • Will not leverage the Machine Learning (ML) algorithms built into some software tools, which can increase the resilience of an extraction.
  • When a script breaks, more than likely hours will be spent debugging it.
  • There is a high likelihood that fixing a broken script will take about as much time as the original build.
  • Enterprise features such as scheduling, data cleaning, data transport, throttling, anonymization (proxy services), etc. will need to be cobbled together, where possible, to create a true data pipe from the web.
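
To give a feel for what "cobbled together" means in the last point, here is a minimal Python sketch of just two of those features: throttling and retrying. The names `Throttle` and `fetch_with_retry` are made up for illustration and do not come from any particular tool; a real pipeline would also need scheduling, proxies, cleaning, and transport.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class Throttle:
    """Enforce a minimum gap (in seconds) between successive requests."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_with_retry(fetch: Callable[[], T], throttle: Throttle,
                     attempts: int = 3) -> T:
    """Call `fetch` at a polite pace, retrying up to `attempts` times."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        throttle.wait()
        try:
            return fetch()
        except Exception as exc:  # a real script would catch narrower errors
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Here `fetch` is any zero-argument function that performs one request, so the same wrapper works whichever HTTP library your programmer prefers. Even this small piece is code that someone in-house has to own and maintain.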

_

DaaS ($$ – $$$$) – No muss, no fuss. Tell them what you want, when you want it, and in what structure; then go back to building your core product and wait for the data to start coming in. Sounds great, no? Well, it’s not always that rosy (it is web data after all), but it sure beats doing it yourself when you can afford a reputable DaaS provider. Providers include “No Names”, BPOs, and software companies that offer up their service teams. As alluded to, this one can be all over the map, not just in regard to pricing but also in regard to reputable providers and data quality. Be mindful that you will truly get what you pay for when going the DaaS route.

Pros 

  • Turn on the data spigot and just sit back with your feet up.
  • Someone else has the problem of building and maintaining access.
  • Higher-end services will normally offer a consolidated data schema (e.g. all data from all sources arrives in one format with consistent data types).
  • Staff only what you need to build your product, and you can forget about experts in web technologies, C#, JavaScript, Ajax, PHP, etc…
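
As a rough illustration of what a consolidated data schema buys you, the Python sketch below maps two differently shaped sources into one record type. Everything here is hypothetical: the `Listing` fields, the two source layouts, and the function names are invented for the example and are not any provider's actual format.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Listing:
    """The single, consolidated shape every source is mapped into."""
    source: str
    title: str
    price_usd: float
    listed_on: date


def from_source_a(raw: dict) -> Listing:
    # Hypothetical source A: price as "$1,299.00", date as "09/08/2015".
    return Listing(
        source="a",
        title=raw["name"].strip(),
        price_usd=float(raw["price"].replace("$", "").replace(",", "")),
        listed_on=datetime.strptime(raw["date"], "%m/%d/%Y").date(),
    )


def from_source_b(raw: dict) -> Listing:
    # Hypothetical source B: price in integer cents, ISO 8601 date.
    return Listing(
        source="b",
        title=raw["title"].strip(),
        price_usd=raw["price_cents"] / 100,
        listed_on=date.fromisoformat(raw["listed"]),
    )
```

With a good DaaS provider this mapping is their problem: your product code only ever sees the `Listing`-style shape, no matter how many sources sit behind it.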

Cons 

  • You are not their only customer, so if “it” hits the proverbial fan then it may take some time to get the spigot flowing again.
  • In a lot of cases your service is only as good as your SLA, so build in a good one!
  • Hard to find a real, true proactive service when problems arise.
  • No real transparency into issues, what caused them, or how they are rectified. Let’s face it, this is still a dark art in some respects, and when things break it is usually easier to fix them than to figure out why they broke. So more often than not, you will be left with few answers, and that is definitely frustrating.

_

Software ($$ – $$$$) – Don’t want to deal with building a fragile, cobbled-together infrastructure based solely on a programmer’s discretion that may or may not outlast their tenure, (breathe) but want to empower your team with a tool that potentially allows the power of markup, automation, ML, enterprise scheduling, data typing, error handling, throttling, reusability, proxy services, and survivability? Then purchasing a software tool may just be the route you need to go—but don’t expect the lightweight tools like Visual Web Ripper ($) to have the same suite of tools as a #Kapow or #Connotate.

Pros 

  • No matter where you fall in the spend spectrum, you are more than likely buying a tool that has had hundreds of thousands of hours of developer seat time.
  • Will bring to bear advanced functions like: API Connectivity, Enterprise Scheduling, Data Transformations, Error Notification, Logins, CAPTCHA Breaking Connectors, Throttling (speed up or slow down connection requests), and Anonymization Proxy Services (Think: Hide me!).
  • Ease of Use, which requires either a lower programming skill set or none at all. But be warned, this is not always a good thing!
  • Repeatability
  • Vendor Support: a lot of vendors in the space provide some level of no-cost support either by way of their support desk, discussion boards, or even Webinars on how to use their tools in different use-cases or against different web technologies. #Import.IO is not only one of the youngest platforms out there, but they are also one of the best in the business when it comes to supporting their user base.

Cons 

  • How much of the web do they purport to cover? Can they verify that claim? But more importantly, can they demonstrate the coverage you need for your use-case? I can’t stress this enough: if you are buying an enterprise piece of software that will cost you $$$$ then run them through their paces. See the sites, functionality, or target source technologies in action. Ask to speak to current references who have similarities to what you want to do. Don’t take a content aggregator reference when you are looking to build an on-demand application.
  • Why is this a con when buying an enterprise platform? Because it’s a risk. As much as I’d like to sit here and tell you that web data harvesting is a science, I can’t. Sure, it’s based on extracting programming code, but the sheer plethora of ways information is bolted together moves this from the science realm and plants it firmly into one of Art.
  • Support for un-harvestable, one-off sources lives in the world of “it’s on the road map”…a road map that, odds have it, will be completely new, and conspicuously different, from the last time you saw it. On that same point, be sure to understand what their release schedule looks like. No harm in asking for a snapshot of the previous year or two highlighting major and minor releases with dates.

_

Licensing ($$$$) – Drinking right from the source may be just the data manna your product needs to solidify its future. This is the most costly course, but it is also the one that has the biggest rubber stamp of approval on it. That might not be all that big of a deal when you are targeting hundreds of sources, but if you are relying on one or two streams of data, this is the only way to build a viable long-term product.

Pros 

  • If you have a limited number of sources, this is the option that is least likely to have the top of your funnel shut down. Huge!
  • There could potentially be partnership opportunities with the source(s), providing a little more street cred to your product idea.

Cons 

  • Costs! Depending on the source, and where their data comes from, you could be in for a rude awakening when you hear their price tag.
  • Quality: API data may not be the same quality as what they present on the web. This could be due to variations in structure, the data elements, or even the cleanliness of the data.
  • Terms of Use: There will be restrictions on how, where, and with whom you can use the data…sometimes even more restrictive than what they present on their site.
  • Throttling: They will control the spigot and deliver on a frequency that is best for them.
  • Potentially competitive situation: after your conversation, they might respond to your request with “we already have something like this in the works” or something to that tune. Worse yet, they could leapfrog you and go further up the value chain, completely negating any value you were going to provide in the first place.
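
On the Quality point above, one cheap check before signing a license is to compare a handful of records from the API sample against what the same items show on the site. The Python sketch below does a simple field-by-field comparison; the function name and the sample records are invented for illustration.

```python
def compare_records(web: dict, api: dict) -> dict:
    """Report fields missing from either side and fields whose values differ."""
    web_keys, api_keys = set(web), set(api)
    return {
        "missing_in_api": sorted(web_keys - api_keys),
        "missing_in_web": sorted(api_keys - web_keys),
        "different": sorted(
            key for key in web_keys & api_keys if web[key] != api[key]
        ),
    }


# Made-up sample: the same product as seen on the web page and via the API.
web_record = {"title": "Widget", "price": "19.99", "rating": "4.5"}
api_record = {"title": "widget", "price": "19.99", "sku": "W-1"}
report = compare_records(web_record, api_record)
```

In this made-up sample the API drops `rating`, adds `sku`, and differs in the casing of `title`, which are the kinds of structure, element, and cleanliness gaps worth raising with the source before you commit.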

Series Note: This is the second post in a series of publications on web data extraction. The goal of this series is to help shed some light on the full life cycle of web data extraction. If you have any specific questions, feel free to contact me at tom@frigginyeah.com or visit my website, https://www.frigginyeah.com.