Data Analysis in Early-Stage Startups

This is intended to be a short post. The topic is data as it relates to early-stage startups. There are two types of data sets that people want. One, they want to know about financings, valuations, and acquisition prices. Two, they want to know about metrics, everything from daily and monthly active users to a range of other emerging engagement metrics. People want this data so that we can all make more sense of what is happening in the early stages. Rather than get caught up in the hype, we can trust the data.

As nice as it would be to have these, the cold reality is that these data sets are nearly impossible to get. “If” every startup honestly contributed their financing-related data to CrunchBase, from start to end, we’d have some rich data, but that ain’t happening. There’s little incentive for founders and investors to disclose this data, and for startups that are currently early-stage, we won’t know the financing particulars for a long while, if ever. And, “if” every startup properly collected their own usage and engagement data, we’d be able to better decipher which metrics are for vanity and which are for value. As it stands, only a handful of people know the metrics at growing early-stage startups, and they have little to no incentive to share them.

Therefore, we don’t have good data, and whatever is there is far from clean. And, double therefore, making inferences from the data is a dangerous exercise in extrapolation. This is why I *never* try to cite data to back up any arguments I’m making. Data can always be manipulated or misused, reshaped to advance any argument. Perceptive readers will cut through that b.s., and it reduces credibility all around, not to mention trust in the source and respect for the reader’s time.

Brendan Baker has a great way of explaining this, saying that any communication or analysis around early-stage data should include the following language:

Here’s what we found, here’s what I think it means, and here are the limitations.


Let me go one step further on Brendan’s suggestion and say that this disclaimer should be appended to any data collection and analysis of early-stage companies. This clearly presents the realities to the audience, protects the author from some inevitable doubts, respects the reader’s time, and hopefully creates a good enough atmosphere for discussion around what should be an important topic. The next decade is going to slap us in the face with all sorts of data, so we must start establishing these ground rules now. Please comment with more, and thanks in advance.