How to Build a Data Platform

Author

Alex Antonison

Published

April 6, 2025

When building a data platform, it’s tempting and admittedly, a lot of fun, to dive straight into the latest and greatest technologies, running proof-of-concepts and benchmarking tools to find the top performer. But over the course of my career, I’ve seen this approach fall short time and time again.

At the end of the day, the role of a Data team is to deliver value to the business through data products. While it’s easy to get caught up in building a platform that aims to “do it all,” this often results in projects that take significantly more time and budget than the business is willing to support. Focus should always start with delivering business value, not chasing perfection in infrastructure.

Business Questions

The first step you should take is to answer the following business questions:

  1. Clear Business Problem Statement: Without a clear problem statement, a data platform can quickly spiral into trying to solve every possible problem. A well-defined problem sharpens the project’s focus and helps identify practical requirements, such as whether data needs to be updated hourly, daily, or in real time. As with any successful project, clearly understanding the business problem is essential before any work begins.
  2. Team skillsets: Is the team made up of experienced Data Platform Engineers familiar with running open-source big data solutions, or is it a smaller team, perhaps composed of Data Analysts with some engineering knowledge? If the team lacks deep engineering experience or bandwidth, it’s often wiser to lean toward user-friendly managed services. And when bringing in external consultants, be sure that any recommended solution is maintainable by the existing team. Otherwise, you risk becoming dependent on specialized hires just to keep the platform running.
  3. Existing Technology: What tools are currently in use across the company? Is there already a Cloud Data Warehouse in place? Are other teams using an orchestrator? Before introducing new solutions, it’s important to first assess the existing tooling. Other teams may have already invested time and effort into setting up tools, managing deployments, and negotiating pricing; leveraging this existing work can save significant time and resources.
  4. Timeline: What is the timeline for implementing a solution? If the deadline is tight, it may be necessary to start with a more expensive, narrowly focused managed service to meet immediate needs, while keeping a longer-term plan in place for building a more scalable and cost-efficient solution. Even with a longer timeline, it’s still important to start small and gather iterative feedback from stakeholders to ensure the solution evolves in the right direction.
  5. Data Infrastructure Budget: Establishing a realistic budget is essential, though often challenging. The necessary funding will be directly influenced by factors like team capabilities, project timelines, and leveraging existing technologies. To operate within budget constraints while delivering initial value, you may need to strategically narrow the project’s scope, such as addressing a subset of key markets initially. This focused strategy ensures the platform delivers tangible value early, within budget, while building a foundation for future scaling.

Data Questions

Next, once the key business questions are sufficiently answered (and documented), you can move onto the 5Vs + S of data. This topic has been thoroughly covered so I will just highlight them below:

  1. Volume: How much data is needed?
  2. Velocity: How frequent does the data need to be processed?
  3. Variety: What kind of data are you working with? Structured? Unstructured?
  4. Veracity: How trustworthy is the data source? Do you need to build data quality checks into the ingestion pipeline to catch bad data coming in?
  5. Value: How valuable is the data source? It is important to balance the value of a dataset versus how much it costs to ingest and manage.
  6. (Extra) Sensitive: When you are dealing with sensitive data, it is crucial to ensure that it is sufficiently protected at motion and at rest.

POC Time!

Congratulations! With this information in hand, you can now start considering what tools will make sense to start doing POCs for a data platform.

Other data topics to consider

  1. Manage only data that matters: Storing data in low-cost storage is perfectly acceptable, but active data management should be reserved for data that directly supports specific business problems. Resist the urge to “manage everything just in case”, this often leads to scope creep, missed deadlines, and budget overruns. Instead, prioritize managing high-impact data and keep the rest in accessible storage. Raw data can still be made available to analysts and stakeholders to help uncover what’s truly valuable over time.
  2. Data Development Workflow: Creating a streamlined data development workflow is essential. When the feedback loop for testing and iterating on ideas is too slow, it becomes a major bottleneck, dramatically reducing productivity and slowing innovation.
  3. Data Testing: Just like in software engineering, testing is essential, and in data work, it’s all about trust. Data teams must ensure the reliability of the data they deliver. Without proper testing, it’s nearly impossible to catch issues introduced by pipeline changes that could break downstream data products. Data tests not only catch technical failures but also surface incorrect assumptions that might otherwise go unnoticed. When thoughtfully designed, they can flag issues early, well before they impact stakeholders or end users.
  4. Development Standards: In the absence of development standards, whether it be naming conventions, code formatting, or rules around documentation, even simple data platforms can become challenging to maintain. Fixing these problems down the road can also take a significant amount of work and requires rigorous testing to ensure nothing is broken during the refactor.
  5. Dataset Release Cycle: Whenever a dataset is released, the absence of clear release standards can make it difficult to iterate, improve, or even safely remove columns, since such changes may break downstream dashboards or integrations. To avoid this, it’s important to establish clear standards from the start for labeling datasets as alpha, beta, or production-grade. These classifications set expectations for stability and help teams manage change more effectively.
  6. Data Governance: Data should only be shared with the appropriate users across the organization. It’s far easier to establish the right access controls from the start, rather than trying to restrict access later, which can disrupt business processes and potentially lead to compliance or regulatory issues.
  7. Dataset Discoverability: Ensuring datasets are well-documented and effectively communicated across the organization is essential. In the early stages, this might simply mean using your team’s existing documentation tools. However, as the data platform scales, adopting a dedicated data catalog can help maintain discoverability and ensure that datasets remain accessible and understandable to all stakeholders.

If you have gotten this far, I am sure you have noticed that I have not mentioned any specific technologies throughout this post largely because the tools themselves do not matter. What is important is figuring out how to break down business problems into solvable parts and identifying the right tool based on the above constraints.

Happy building!