Any Data from Any Website GUARANTEED!

Web data mining goes under a variety of names (the most common being "web scraping") and usually involves pulling and organising data from third-party websites for a variety of purposes, including but not limited to:

  • Content Development - replicating, integrating or leveraging data contained in other sites to develop your own content.
  • Marketing Intelligence - monitoring market and competitor activity across the internet to assist in making informed decisions.
  • Information Arbitrage - taking advantage of the time value of data by accessing information from a variety of sources faster than competitors, to make better time-dependent decisions.

The Communication Evolution team has over 10 years' experience in this space, crawling hundreds of websites across a variety of industries. Our experience spans the end-to-end process of Data Gathering > Information Processing > Decision Automation.

The Challenge of Existing Web Scraping Options

Common reasons why customers turn to our web scraping services:

  • Realise the time cost: Desktop software takes a significant amount of time to learn, each crawl takes time to set up, and the output needs constant manual checking. If you have a crawler developed (particularly offshore), the time it takes to provide a detailed specification, manage its development and finally test the output is always more than originally anticipated. Most desktop software and developers also lack the expertise to crawl all the types of content we are able to.
  • Get blocked: Without the benefit of running a crawl across many IP addresses, the risk of getting blocked increases significantly. Few software and service providers clearly identify this serious risk because they don't offer an end-to-end service. One solution that might be suggested is using a proxy service. However, this significantly slows down the crawl and increases the cost of running it.
  • Inadequate infrastructure: To minimise the risk of being blocked by a webmaster, you should ideally not crawl more than about 5,000 pages of content a day. If you need to crawl more, you'll need to run multiple servers in parallel and distribute traffic via geographically separate gateways. Few customers, or even hosting providers, can support this type of infrastructure.
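
To put the 5,000 pages a day guideline into numbers, here is a rough planning sketch in Python. It assumes the figure is applied as a budget per gateway (IP address), which is our reading for illustration only. The function names are hypothetical and not part of any tool described on this page.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
# Guideline quoted above; treating it as a per-gateway budget is an assumption.
PAGES_PER_DAY_PER_GATEWAY = 5_000


def min_delay_seconds(pages_per_day: int = PAGES_PER_DAY_PER_GATEWAY) -> float:
    """Average gap between requests that keeps one gateway within its daily budget."""
    return SECONDS_PER_DAY / pages_per_day


def gateways_needed(total_pages_per_day: int,
                    per_gateway_budget: int = PAGES_PER_DAY_PER_GATEWAY) -> int:
    """Smallest number of parallel gateways that keeps each one within its budget."""
    return math.ceil(total_pages_per_day / per_gateway_budget)


if __name__ == "__main__":
    print(f"{min_delay_seconds():.2f} seconds between requests per gateway")
    print(f"{gateways_needed(120_000)} gateways for 120,000 pages a day")
```

Under these assumptions a single gateway averages one request roughly every 17 seconds, and a crawl of 120,000 pages a day would need about 24 gateways running in parallel.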

Web Scraping is Much More Than Software

People often don't appreciate that a large part of Google's popularity comes from more than just its algorithm for selecting the most relevant sites to display in response to a query. Just as important is the infrastructure that supports its ability to crawl large amounts of data reliably and in a timely fashion. That's why Google shows, in the top right-hand side of its search results, how long it took to serve them.

Our fully managed service removes frustration, reduces risk and delivers consistently because it's built on two robust foundations:

  • Deep expertise in web crawling technology
  • Enterprise grade infrastructure

Deep Expertise in Web Crawling

Deep expertise in web crawling means:

  • We can extract any type of web-based content that you need, including text served through Ajax and Flash
  • We ensure consistent quality of the extracted content with extensive quality checking mechanisms
  • Our proprietary processes and algorithms address web scraper blocking mechanisms such as:
    • IP address blocking
    • Excess traffic monitoring
    • Tools that verify a real person is accessing the site, such as CAPTCHA
    • Carefully crafted JavaScript

Enterprise Grade Infrastructure

Our enterprise-grade infrastructure allows us to easily run web scraping projects:

  • On over 100 servers in parallel
  • Across hundreds of IP addresses
  • On a world-class high-speed network

Unlike desktop software or less experienced developers, our specialist managed service can deliver a block-proof method of extracting any information you need from any website, as quickly as you need it, GUARANTEED.

Do You Also Need to Push Data?

Beyond extracting information, we can also develop automated bots that enter details into third-party sites. These are useful for:

  • Automated data entry e.g. content syndication across a variety of networks
  • Aggregated booking engines e.g. consolidating pricing and availability data for specific dates across hotel, car and/or flight booking sites and processing the subsequent booking
  • Aggregated quoting engines e.g. compiling insurance quotes and facilitating the follow-up purchase
  • Automated buying e.g. monitoring of auction sites and facilitating automated purchases based on pre-defined parameters
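
As an illustration of what "pre-defined parameters" for automated buying might look like, the Python sketch below checks a listing against a simple rule. The fields (keyword, maximum price, minimum seller rating) and the names BuyRule, Listing and should_buy are hypothetical examples chosen for this sketch; they do not describe our actual bots or any particular auction site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuyRule:
    """Hypothetical pre-defined parameters for one automated purchase."""
    keyword: str
    max_price: float
    min_seller_rating: float


@dataclass(frozen=True)
class Listing:
    """Hypothetical shape of one item extracted from an auction site."""
    title: str
    price: float
    seller_rating: float


def should_buy(listing: Listing, rule: BuyRule) -> bool:
    """Return True only when the listing satisfies every parameter in the rule."""
    return (
        rule.keyword.lower() in listing.title.lower()
        and listing.price <= rule.max_price
        and listing.seller_rating >= rule.min_seller_rating
    )


if __name__ == "__main__":
    rule = BuyRule(keyword="camera", max_price=250.0, min_seller_rating=4.5)
    listing = Listing(title="Vintage Camera Body", price=199.0, seller_rating=4.8)
    print(should_buy(listing, rule))
```

In practice the parameters would be agreed with you up front, and a monitoring bot would only proceed to a purchase when every condition holds.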

If you want to talk to us in more detail about your specific requirements please contact us and we'll talk them over with you.

After an extensive review of web crawling capabilities against competing services from the US and Europe we chose CE, not just because of their technical superiority but their understanding of our core business drivers and attention to detail.

CIO, publicly listed Financial Services firm
