Datafold raises $20 million Series A to help data teams deliver reliable products faster

Published: 09-11-2021 17:19:51 | By: Bob Koigi

Datafold, a data reliability platform that automates the most tedious parts of data engineering workflows, has announced its successful $20 million Series A funding round.

Backed by NEA and Amplify Partners, Datafold seeks to expand its proactive approach to data reliability to help companies unlock more growth using high-quality data.

Gleb Mezhanskiy, founder and CEO, Datafold: “Poor data quality is the primary challenge for companies to become data-driven and a constant source of stress and overwork for data professionals. Top tech companies pour millions of dollars into creating internal data reliability tools and processes, while the vast majority of data teams have to rely on tedious manual testing or risk shipping incorrect data to their stakeholders. We founded Datafold to enable every team that leverages data to make better decisions with tools that help them move fast with high confidence.”

Processing data at scale is more affordable than ever thanks to the modern data stack, but data teams are grappling with an explosion of data pipelines and BI assets — and the resulting lack of understanding, trust, and reliability of data. Data quality has become the number one impediment to leveraging analytics and expanding AI/ML for data-driven companies. This problem is exacerbated by the lack of adequate tools for data testing, monitoring, and observability, along with chronic understaffing of data teams.

As evidenced by top unicorn customers including Thumbtack, Patreon, Faire, and Dutchie, Datafold’s proactive approach to data quality is fundamental to building and maintaining the highest quality data for data-driven organizations. Datafold is on a mission to ensure that no data engineer faces sleepless nights worrying about a hotfix that broke the data or cost their company millions.

Contrarian Approach to Data Reliability

While data quality and observability tooling has been evolving for years, alternative solutions have focused primarily on detecting data anomalies in production. That has certainly been an improvement over no observability at all, but after-the-fact detection of issues has limited value given its reactive nature. The damage is likely already done by the time you learn about broken data, with executives making decisions based on wrong dashboard numbers or ML models retrained with bias.

Datafold’s contrarian approach stems from a different question: How can data practitioners prevent data bugs in the first place? By introducing automated data testing into the change management workflow and integrating it with CI/CD pipelines and code repositories, data creators can catch most issues before they get into production. This is also when data developers have the most time and attention to fix those bugs.

Mezhanskiy: “What started with Data Diff as a tool to prevent breaking changes from merging into production has evolved into a proactive philosophy for data reliability engineering. The Datafold platform is designed to be an end-to-end solution to the biggest bottleneck for data teams delivering high-quality data products. When teams can develop quickly and with confidence, they are free to create truly revolutionary data insights.”

Datafold is built on the premise of integrating into the daily workflows of data professionals while shifting reliability “to the left,” catching issues as early in the process as possible. Because data pipelines can vary as much as data team roles, the Datafold platform proactively mitigates issues across a variety of tools. For example, column-level lineage gives visibility into dependencies. This aids root-cause analysis during incidents and can be used to map out potential problems from changing data models or pipeline refactoring.

Peter Sonsini, general partner, NEA and incoming board member, Datafold: “Data-driven organizations are becoming the norm across all verticals — every company is a data company now. The proactive approach to reliability is standard best practice in software but is still nascent in the data space. I’m excited to be a part of Datafold’s future disruptions in the market.”

Long gone are the days when moving fast and breaking things was an option for modern, data-driven companies. Every aspect of business decision-making comes back to high-quality, reliable data. In order to move fast with confidence in the data, the best teams are going beyond best-effort code reviews and 2 a.m. hotfixes, focusing on comprehensive engineering solutions.

Sarah Catanzaro, partner, Amplify Partners: “Data quality is critical to make the right decisions and products. However, improving data quality is tedious and challenging. In contrast, Datafold enables data teams to build reliable data products fast and well. We invested in the company because with their platform, data teams can maximize their impact by iterating quickly without compromising quality.”

End-to-End Data Reliability Platform

As the process of developing data products spans multiple workflows and is often shared by multiple teams, Datafold covers each step.

Change management is one of the slowest and most error-prone workflows for data teams. Datafold’s flagship feature, Data Diff, clearly shows data practitioners how a change in the data processing code will impact the resulting data and downstream products, such as BI dashboards and ML models. Such information is very hard to obtain manually: gathering it typically takes hours or even days of checking to avoid breaking changes in production.

When integrated into the CI/CD process, Data Diff automates the data QA process to ensure that every proposed change (pull request) is tested before it gets shipped to production. This saves hundreds of hours that would otherwise be spent on manual testing, creates a standardized testing process across all code changes, and increases the productivity of data teams. It also facilitates data democratization: every organization’s desire to have people outside the specially trained and always shorthanded data teams build data products themselves.
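To make the idea of diffing data between production and a proposed change concrete, here is a minimal sketch in Python. It is not Datafold’s implementation or API: the function names `diff_tables` and `is_safe_to_merge`, the sample tables, and the pass/fail rule are all hypothetical, and a real tool would compare warehouse tables rather than in-memory lists.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def diff_tables(before: List[Row], after: List[Row], key: str) -> Dict[str, Any]:
    """Compare two versions of a table, matching rows on the `key` column."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}

    added = sorted(new.keys() - old.keys())      # keys only in the new version
    removed = sorted(old.keys() - new.keys())    # keys only in the old version

    # For keys present in both versions, count mismatching values per column.
    changed_columns: Dict[str, int] = {}
    for k in old.keys() & new.keys():
        for col in old[k].keys() | new[k].keys():
            if old[k].get(col) != new[k].get(col):
                changed_columns[col] = changed_columns.get(col, 0) + 1

    return {"added": added, "removed": removed, "changed_columns": changed_columns}


def is_safe_to_merge(report: Dict[str, Any]) -> bool:
    """Hypothetical CI rule: pass only if the change leaves the data untouched."""
    return not (report["added"] or report["removed"] or report["changed_columns"])


# Hypothetical sample data: the table as built in production vs. on a branch.
prod = [
    {"id": 1, "amount": 10.0, "status": "paid"},
    {"id": 2, "amount": 25.0, "status": "paid"},
    {"id": 3, "amount": 40.0, "status": "refunded"},
]
branch = [
    {"id": 1, "amount": 10.0, "status": "paid"},
    {"id": 2, "amount": 27.5, "status": "paid"},
    {"id": 4, "amount": 5.0, "status": "paid"},
]

report = diff_tables(prod, branch, key="id")
print(report)
print("safe to merge:", is_safe_to_merge(report))
```

In this sketch a pull request that changes any row would be flagged for review. In practice a team would more likely surface the report to a reviewer than block every non-empty diff, since many changes alter data on purpose.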

Dave Wallace, staff data engineer, Dutchie: “Datafold’s Data Diff is the missing piece of the puzzle for data quality assurance. When I first heard about Datafold, all I could think was ‘Finally!’ It’s an unspoken problem that we all know about and no one wants to talk about.”

Another fundamental problem that Datafold solves is helping data teams and data users understand the dependencies in data. Simple questions such as “Where does this number come from?” or “Will anything anywhere break if we rename that column?” have been difficult, if not impossible, to answer, given that it’s not uncommon for analytical warehouses to contain tens of thousands of tables and over a million columns, all intricately connected.

Using its own SQL compiler, Datafold analyzes every query ever executed in the data warehouse to produce a dependency graph that shows how data is produced and consumed, covering even correlated subqueries, CASE WHEN statements, and other complex queries. In addition, numerous Datafold clients use these features when onboarding new data practitioners, letting them explore the data more quickly and easily, without additional resources or complicated knowledge transfer.
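As an illustration of how a dependency graph can answer “Will anything break if we rename that column?”, the sketch below walks a small column-level lineage graph in Python. The graph is written by hand and every table and column name in it is made up. Deriving such edges from SQL queries, which the article attributes to Datafold’s compiler, is not shown here.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each column maps to the columns computed directly from it.
LINEAGE: Dict[str, List[str]] = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": [
        "marts.revenue.daily_total",
        "marts.customers.lifetime_value",
    ],
    "marts.revenue.daily_total": ["dashboard.revenue_overview.total"],
    "raw.orders.status": ["staging.orders.is_paid"],
}


def downstream(column: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return every column that depends, directly or indirectly, on `column`."""
    seen = set()
    queue = deque([column])
    while queue:  # breadth-first traversal of the dependency graph
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream("raw.orders.amount", LINEAGE)
for name in impacted:
    print(name)
```

Reversing the edges and running the same traversal would answer the other question, “Where does this number come from?”, by listing a column’s upstream sources.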

www.datafold.com