How to Build a Data Platform – Alex Antonison’s Blog

When building a data platform, it’s tempting, and admittedly a lot of fun, to dive straight into the latest and greatest technologies, running proof-of-concepts and benchmarking tools to find the top performer. But over the course of my career, I’ve seen this approach fall short time and time again.

At the end of the day, the role of a Data team is to deliver value to the business through data products. While it’s easy to get caught up in building a platform that aims to “do it all,” this often results in projects that take significantly more time and budget than the business is willing to support. Focus should always start with delivering business value, not chasing perfection in infrastructure.

Business Questions

The first step you should take is to answer the following business questions:

Clear Business Problem Statement: Without a clear problem statement, a data platform can quickly spiral into trying to solve every possible problem. A well-defined problem sharpens the project’s focus and helps identify practical requirements, such as whether data needs to be updated hourly, daily, or in real time. As with any successful project, clearly understanding the business problem is essential before any work begins.
Team Skillsets: Is the team made up of experienced Data Platform Engineers familiar with running open-source big data solutions, or is it a smaller team, perhaps composed of Data Analysts with some engineering knowledge? If the team lacks deep engineering experience or bandwidth, it’s often wiser to lean toward user-friendly managed services. And when bringing in external consultants, be sure that any recommended solution is maintainable by the existing team. Otherwise, you risk becoming dependent on specialized hires just to keep the platform running.
Existing Technology: What tools are currently in use across the company? Is there already a Cloud Data Warehouse in place? Are other teams using an orchestrator? Before introducing new solutions, it’s important to first assess the existing tooling. Other teams may have already invested time and effort into setting up tools, managing deployments, and negotiating pricing; leveraging this existing work can save significant time and resources.
Timeline: What is the timeline for implementing a solution? If the deadline is tight, it may be necessary to start with a more expensive, narrowly focused managed service to meet immediate needs, while keeping a longer-term plan in place for building a more scalable and cost-efficient solution. Even with a longer timeline, it’s still important to start small and gather iterative feedback from stakeholders to ensure the solution evolves in the right direction.
Data Infrastructure Budget: Establishing a realistic budget is essential, though often challenging. The necessary funding will be directly influenced by factors like team capabilities, project timelines, and how much existing technology can be reused. To operate within budget constraints while delivering initial value, you may need to strategically narrow the project’s scope, for example by addressing only a subset of key markets at first. This focused strategy ensures the platform delivers tangible value early, within budget, while building a foundation for future scaling.

Data Questions

Next, once the key business questions are sufficiently answered (and documented), you can move on to the 5Vs + S of data. This topic has been thoroughly covered elsewhere, so I will just highlight each one below:

Volume: How much data is needed?
Velocity: How frequently does the data need to be processed?
Variety: What kind of data are you working with? Structured? Unstructured?
Veracity: How trustworthy is the data source? Do you need to build data quality checks into the ingestion pipeline to catch bad data coming in?
Value: How valuable is the data source? It is important to weigh the value of a dataset against how much it costs to ingest and manage.
(Extra) Sensitive: When you are dealing with sensitive data, it is crucial to ensure that it is sufficiently protected both in transit and at rest.
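
To make the Veracity question a little more concrete, here is a minimal sketch of what data quality checks in an ingestion pipeline might look like. Everything in it is an assumption for illustration: the "orders" feed, the field names (order_id, amount, currency), and the rules themselves are hypothetical, and a real platform would more likely rely on a dedicated data quality tool than on hand-rolled checks.

```python
from dataclasses import dataclass
from typing import Any, Callable

Record = dict[str, Any]


@dataclass
class Check:
    """A named rule that a single record either passes or fails."""
    name: str
    passes: Callable[[Record], bool]


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Hypothetical rules for an illustrative "orders" feed.
CHECKS = [
    Check("order_id present", lambda r: bool(r.get("order_id"))),
    Check(
        "amount is a non-negative number",
        lambda r: _is_number(r.get("amount")) and r["amount"] >= 0,
    ),
    Check(
        "currency is a 3-letter code",
        lambda r: isinstance(r.get("currency"), str)
        and len(r["currency"]) == 3
        and r["currency"].isalpha(),
    ),
]


def validate(records, checks=CHECKS):
    """Split records into (valid, rejected).

    Each rejected entry is a (record, failed_check_names) pair so that
    bad rows can be quarantined and reported instead of silently dropped.
    """
    valid, rejected = [], []
    for record in records:
        failures = [c.name for c in checks if not c.passes(record)]
        if failures:
            rejected.append((record, failures))
        else:
            valid.append(record)
    return valid, rejected


# Example batch: one clean row and one row that breaks every rule.
batch = [
    {"order_id": "A1", "amount": 19.99, "currency": "USD"},
    {"order_id": "", "amount": -5, "currency": "usd!"},
]
valid, rejected = validate(batch)
```

Keeping the rejected rows together with the names of the checks they failed is a deliberate choice in this sketch: it lets you measure how trustworthy a source actually is over time, which feeds back into the Value question of whether the dataset is worth its ingestion and maintenance cost.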

POC Time!

Congratulations! With this information in hand, you can now start deciding which tools are worth evaluating in POCs for your data platform.

Other data topics to consider