Big Data has dominated mainstream headlines for the last decade or so, and it remains one of the most talked-about technologies today.
Set aside, for a moment, the attention-grabbing headlines and the talk of a new dot-com collapse for businesses that don't adopt Big Data. Think instead about the other implications of adopting Big Data for your business. Surely a massive undertaking such as a whole network of servers dedicated to crunching your data isn't going to be cheap, right? Don't be so sure: web scraping services, for example, can fetch much of this data for you in a fraction of the time.
Like most questions related to technology, this one has no simple answer. After all, both enterprises and SMEs have boarded the Big Data train, and if the researchers are to be believed, the ramifications are going to be massive.
Big Data costs can largely be broken down into four categories: human resources, infrastructure, maintenance and miscellaneous expenses.
The upfront cost: infrastructure
Big Data infrastructure refers to the hardware and software used to host and operate big data platforms. For example, your business needs tools to collect the data, storage to hold it, a software system to process it, and a network of computers to move it around.
The largest infrastructure expense businesses incur when dealing with Big Data is usually the cost of the analytical database. The cost of running Big Data platforms such as Hadoop and Spark scales with the amount of storage and computing power the business uses.
Most of the time, the platform needs additional tools to actually perform Big Data analytics. One piece of software you're bound to run across before you settle on a stack is Hadoop.
A single Hadoop cluster can be composed of anywhere from a single node to a practically unlimited number. The minimum recommended number of nodes is three, since Hadoop achieves fault tolerance by replicating data across nodes, and its file system (HDFS) keeps three copies of each block by default.
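Replication has a direct effect on how much data a cluster can actually hold. The sketch below is a rough illustration, assuming HDFS's default replication factor of three; the function name and the 6TB-per-node figure in the example are only illustrative, and real clusters also lose some space to the operating system and temporary files.

```python
def usable_capacity_tb(nodes: int, disk_tb_per_node: float, replication: int = 3) -> float:
    """Rough usable capacity of a cluster once every block is stored `replication` times."""
    if nodes < 1 or replication < 1:
        raise ValueError("nodes and replication must be at least 1")
    # A block cannot have more copies than there are nodes to hold them.
    copies = min(replication, nodes)
    raw_tb = nodes * disk_tb_per_node
    return raw_tb / copies


# Three nodes with 6TB of disk each, using three copies of every block
three_node_usable = usable_capacity_tb(nodes=3, disk_tb_per_node=6)
print(f"Usable capacity: {three_node_usable:.1f} TB out of {3 * 6} TB raw")
```

In other words, a minimal three-node cluster stores roughly one node's worth of unique data, which is worth keeping in mind when reading per-TB hardware prices.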
Each of these nodes is recommended to be at least a mid-range Intel server, which costs between $4,000 and $6,000 with 3TB to 6TB of disk space. A good rule of thumb is to assume it will cost $1,000 to $2,000 per TB. A petabyte Hadoop cluster will thus cost about $1 million, since it needs about 200 nodes.
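The arithmetic behind that petabyte figure can be sketched in a few lines. This is only a back-of-the-envelope estimate: the defaults of 5TB and $5,000 per node are assumed midpoints of the ranges quoted above, 1PB is treated as 1,000TB, and the function name is made up for illustration. It counts raw disk space only and ignores replication, networking gear and power.

```python
import math


def estimate_cluster(target_tb: float,
                     disk_tb_per_node: float = 5.0,   # assumed midpoint of 3TB to 6TB
                     cost_per_node: float = 5000.0):  # assumed midpoint of $4,000 to $6,000
    """Return (node count, hardware cost, cost per TB) for a raw storage target."""
    nodes = math.ceil(target_tb / disk_tb_per_node)
    hardware_cost = nodes * cost_per_node
    return nodes, hardware_cost, hardware_cost / target_tb


nodes, cost, per_tb = estimate_cluster(target_tb=1000)  # 1PB treated as 1,000TB
print(f"{nodes} nodes, ${cost:,.0f} in hardware, ${per_tb:,.0f} per TB")
```

Swapping in the low or high end of the ranges shows how sensitive the total is to the hardware you pick.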
The cost of management and maintenance
The upfront cost of acquiring a Big Data management system accounts for only a small portion of the total cost of Big Data. The larger share goes to the management and maintenance of such a system.
Additional costs begin to pile onto the original amount as soon as the business needs to scale its operations. What was once a 6TB cluster may need to be scaled out to upwards of 200 petabytes of storage space spread across thousands of nodes. This presents an even bigger problem than simply paying for the additional storage space and processing power, however.
More infrastructure means more people are needed to manage it, which brings us to the most variable cost implication of adopting a Big Data platform.
The cost of human capital
The tech industry is one of the fastest-evolving sectors in the world. As a direct result, data science has quickly become one of the most popular career options for new university graduates. However, because it's a relatively new field, there aren't yet enough experienced professionals to make hiring as affordable as it is in more established fields.
A full-time Hadoop expert will cost anywhere between $70,000 and $150,000 a year, while outsourced work typically costs $81 to $100 an hour. The cost of development varies greatly depending on the experience of the developer, their location and the size of the project.
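One way to compare the two options is to ask how many outsourced hours a year would cost as much as a full-time salary. The sketch below uses only the figures quoted above; the function name is illustrative, and the comparison is deliberately simplistic, since it assumes salary is the only cost of an employee and ignores benefits, recruiting and overhead.

```python
def break_even_hours(annual_salary: float, hourly_rate: float) -> float:
    """Outsourced hours per year at which contracting costs as much as a salary."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return annual_salary / hourly_rate


# Cheapest hire against the most expensive contractor, and the reverse
low = break_even_hours(annual_salary=70_000, hourly_rate=100)
high = break_even_hours(annual_salary=150_000, hourly_rate=81)
print(f"Break-even lies between {low:.0f} and {high:.0f} outsourced hours a year")
```

If your project needs fewer hours than the lower bound, outsourcing is likely the cheaper route under these assumptions; well above the upper bound, a full-time hire starts to look better.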
Miscellaneous costs of big data
The above factors give a good picture of what Big Data costs, but the costs don't stop there. The items in this section depend mostly on an individual business's requirements.
Legacy technology and migration costs: Many experts believe Hadoop is falling out of favor with new Big Data enterprises. However, companies that already have Hadoop as an important part of their data pipelines will have a hard time migrating to newer solutions.
Alternatively, businesses that rely on legacy technology might find themselves facing new business requirements that call for new software. If a business wants the real-time processing capabilities offered by Spark but doesn't have the expertise to adopt it, it may have to settle for a more familiar SQL-based tool such as Hive, which is geared toward batch queries rather than real-time processing.
Networking costs: For the average business, the cost of transferring data over the internet seems so low that it's negligible. However, for a company moving terabytes to petabytes of data at a time, the cost of bandwidth adds up quickly.
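To see how quickly bandwidth adds up, it helps to multiply the volume moved by a per-gigabyte price. The sketch below does just that. The $0.05 per GB rate in the example is a made-up placeholder, not a quote from any provider, and the function name is illustrative; plug in the rate from your own contract or cloud price list.

```python
def transfer_cost(data_tb: float, price_per_gb: float) -> float:
    """Cost of moving `data_tb` terabytes at a flat per-GB price (1TB = 1,000GB)."""
    if data_tb < 0 or price_per_gb < 0:
        raise ValueError("data_tb and price_per_gb must be non-negative")
    return data_tb * 1000 * price_per_gb


HYPOTHETICAL_RATE = 0.05  # placeholder price in dollars per GB, for illustration only

for volume_tb in (1, 50, 1000):
    cost = transfer_cost(volume_tb, HYPOTHETICAL_RATE)
    print(f"{volume_tb:>5} TB -> ${cost:,.2f}")
```

A flat rate is itself a simplification, since many providers charge tiered prices or treat inbound and outbound traffic differently.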
Proxy providers: If web-scraped data is part of your big data mix, the monthly proxy provider bill can add up fast. A web scraping proxy provider comparison tool can help you choose the right provider for your needs and budget.
Data preservation costs: The final cost that most people fail to account for is the price of taking regular snapshots of your data, or of whichever backup method you prefer.
Alleviating the big cost of big data
A big data platform is expensive for the average business to manage, but this cost can be greatly reduced in a number of ways. The most important of these is leveraging open-source and managed big data platforms.
One of the consequences of the popularity of cloud-based software is a proliferation of managed services, both proprietary and open source, as alternatives to licensed software run on premises.
One managed platform that has drawn many users away from traditional systems is Google BigQuery. It's meant to be an alternative to more expensive data silos and systems like Hadoop.
All the software is hosted and runs on Google's hardware, and all the maintenance work is abstracted away from the end user. The most noticeable difference between BigQuery and offerings from vendors like MapR and Cloudera is the pay-as-you-go business model. In the long run, you end up paying only for the amount of storage and computing power you've consumed.
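Whether pay-as-you-go actually works out cheaper depends on how much you use. The sketch below compares cumulative spending on an owned cluster (an upfront purchase plus monthly upkeep) with a metered service billed monthly. The $1 million upfront figure reuses the petabyte estimate from earlier in this article; the monthly upkeep and usage fees are hypothetical placeholders, not real prices from Google or anyone else, and the function name is made up for illustration.

```python
from typing import Optional


def first_month_owned_is_cheaper(months: int,
                                 upfront: float,
                                 monthly_upkeep: float,
                                 monthly_usage_fee: float) -> Optional[int]:
    """Return the first month in which owning has cost less in total than metered usage.

    Returns None if the metered service stays cheaper (or equal) for the whole period.
    """
    for month in range(1, months + 1):
        owned_total = upfront + monthly_upkeep * month
        metered_total = monthly_usage_fee * month
        if owned_total < metered_total:
            return month
    return None


# Hypothetical figures: heavy usage versus light usage over three years
heavy = first_month_owned_is_cheaper(36, upfront=1_000_000,
                                     monthly_upkeep=20_000, monthly_usage_fee=60_000)
light = first_month_owned_is_cheaper(36, upfront=1_000_000,
                                     monthly_upkeep=20_000, monthly_usage_fee=10_000)
print(f"Heavy usage: owning wins from month {heavy}")
print(f"Light usage: owning wins from month {light}")
```

With these made-up numbers, a business with light or unpredictable usage never reaches the point where owning pays off within three years, which is the case for the pay-as-you-go model.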