Thursday, 10 May 2018

02:00 PM

Scientific debt [Variance Explained] 02:00 PM, Thursday, 10 May 2018 05:20 PM, Thursday, 09 January 2020

A very useful concept in software engineering is technical debt.

Technical debt occurs when engineers choose a quick but suboptimal solution to a problem, or don’t spend time to build sustainable infrastructure. Maybe they’re using an approach that doesn’t scale well as the team and codebase expand (such as hardcoding “magic numbers”), or using a tool for reasons of convenience rather than appropriateness (“we’ll write the DevOps infrastructure in PHP since that’s what our team already knows”). Either way, it’s something that seems like it’s working at first but causes real challenges in the long term, in the form of delayed feature launches and hard-to-fix bugs.

In my new job, I’ve been thinking about the role of data science within a business, and discussing this with other professionals in the field. On a panel earlier this year, I realized that data scientists have a rough equivalent to this concept: “scientific debt.”

Scientific debt is when a team takes shortcuts in data analysis, experimental practices, and monitoring that could have long-term negative consequences. When you hear a statement like:

  • “We don’t have enough time to run a randomized test, let’s launch it”
  • “To a first approximation this effect is probably linear”
  • “This could be a confounding factor, but we’ll look into that later”
  • “It’s directionally accurate at least”

you’re hearing a little scientific debt being “borrowed”.

Example: WidgetCorp

Most engineers have a sense of what it’s like for a company to struggle with technical debt. What would a company struggling with scientific debt look like?

Imagine a small startup WidgetCorp is developing a B2B product, and deciding on their sales strategy. One year they decide to start focusing their sales efforts on larger corporate clients. They notice that as they take on this new strategy, their monthly revenue increases. They’re encouraged by this, and in the following years hire half a dozen salespeople with experience working with large clients, and spend marketing and design effort building that as part of their brand.

Years later, the strategy doesn’t seem to be paying off: their revenue is struggling and the early successes aren’t repeating themselves. They hire an analyst who looks at their sales data, and finds that in fact, it had never been the case that they’d had a higher return-on-investment selling to large companies. In that early year, their revenue had been rising because of a seasonal effect (the demand for widgets goes up in the fall and winter), which was compounded by some random noise and anecdotes (e.g. “SmallCompany.com was a waste of our time, but we just closed a huge deal with Megabiz!”).

WidgetCorp took on too much scientific debt.

Some ways this might have happened:

They made irreversible decisions based on flawed analyses. It’s reasonable to take a quick look at metrics and be happy that they’re going in the right direction. But once the company made product, sales and marketing changes, it became difficult to reverse them. Before making a major shift in business, it’s worth making sure that the data supports it: that they’ve accounted for seasonal effects and applied proper statistical tests.
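As a rough illustration of what "accounting for seasonal effects" can mean, the sketch below compares a naive before/after reading of revenue with a year-over-year comparison. Everything in it is hypothetical: the monthly figures are invented, the strategy change is assumed to start in July, and the true effect of the change is set to zero. A year-over-year comparison is only one simple way to handle seasonality, and it is not a substitute for a proper statistical test.

```python
# Hypothetical monthly revenue (in $k) for two years; all numbers are invented.
# Both years share the same seasonal shape: demand rises in fall and winter.
SEASONAL = [120, 110, 100, 95, 90, 90, 95, 100, 115, 130, 145, 150]  # Jan..Dec

year1 = list(SEASONAL)  # the year before the strategy change
year2 = list(SEASONAL)  # strategy changed in July, with zero true effect


def mean(xs):
    return sum(xs) / len(xs)


def naive_lift(year, start_month):
    """Compare the months after the change with the months before it, same year."""
    before, after = year[:start_month], year[start_month:]
    return mean(after) - mean(before)


def year_over_year_lift(prev_year, this_year, start_month):
    """Compare the post-change months with the same calendar months a year earlier."""
    return mean(this_year[start_month:]) - mean(prev_year[start_month:])


START = 6  # list index of July
naive = naive_lift(year2, START)
yoy = year_over_year_lift(year1, year2, START)
print(f"naive before/after lift: {naive:+.1f}k")
print(f"year-over-year lift:     {yoy:+.1f}k")
```

With this made-up data the naive comparison reports a healthy lift even though nothing changed, while the year-over-year comparison reports none, which is the trap WidgetCorp fell into.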

Lack of monitoring. Early on, there may not have been enough data to tell whether larger clients were a better investment. But as more data was collected, it would be worth continually testing this assumption, in the form of a dashboard or a quarterly report. If this isn’t tracked, no one will notice that the hypothesis was falsified even once they have the data.
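One way such a recurring check could look is sketched below: a small function that is re-run every quarter on the deals closed so far and reports whether the "large clients pay off better" assumption is supported, contradicted, or still undecided. The per-deal ROI figures, the function names, and the three verdict labels are all hypothetical, and the interval uses a crude normal approximation that is only a placeholder for whatever test fits the real data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def roi_gap(large, small, confidence=0.95):
    """Difference in mean ROI (large minus small) with an approximate interval."""
    diff = mean(large) - mean(small)
    se = sqrt(stdev(large) ** 2 / len(large) + stdev(small) ** 2 / len(small))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


def verdict(large, small):
    """Summarize whether 'large clients have higher ROI' still holds."""
    _, low, high = roi_gap(large, small)
    if low > 0:
        return "supported"
    if high < 0:
        return "contradicted"
    return "inconclusive"


# Hypothetical per-deal ROI values; early on there are only a few deals.
early_large = [1.6, 0.6, 1.1]
early_small = [1.0, 1.3, 0.8]

# Later, with more deals, the same check can give a clear answer.
later_large = [0.9, 0.8, 1.0, 0.7, 0.9, 0.8, 1.0, 0.9]
later_small = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 1.1, 1.2]

print("early:", verdict(early_large, early_small))
print("later:", verdict(later_large, later_small))
```

The point of the sketch is the habit rather than the statistics: the same question gets asked again each time more data arrives, so a falsified assumption has a chance of being noticed.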

Lack of data infrastructure. Maybe early in the company the leads were locked in a sales CRM, while accounting data was stored in Excel spreadsheets that were emailed around. Even if there were a dedicated analyst within the company, they may not have had easy access to the relevant data (linking together sales success and company size). Even if it were theoretically possible to combine the datasets with some effort, schlep blindness might have made everyone avoid the analysis entirely. This is an area where technical debt and scientific debt often appear together, since it takes engineering effort to make scientific problems easy to solve.
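To make the "linking together" step concrete, here is a minimal sketch that joins a CRM export with an accounting export on company name and summarizes the result by client size. The two CSV snippets, their column names, the 1,000-employee threshold, and the revenue figures are all assumptions made up for illustration; a real pipeline would more likely live in a data warehouse than in a script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical exports; column names and values are invented for illustration.
CRM_CSV = """company,employees,deal_closed
Megabiz,5000,yes
SmallCompany.com,12,no
TinyShop,8,yes
BigCo,2000,no
"""

ACCOUNTING_CSV = """company,revenue
Megabiz,90000
TinyShop,15000
"""


def load(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def join_sales_and_revenue(crm_rows, accounting_rows, large_threshold=1000):
    """Attach revenue and a size segment to every lead in the CRM export."""
    revenue_by_company = {r["company"]: float(r["revenue"]) for r in accounting_rows}
    joined = []
    for row in crm_rows:
        joined.append({
            "company": row["company"],
            "segment": "large" if int(row["employees"]) >= large_threshold else "small",
            "closed": row["deal_closed"] == "yes",
            # Leads with no accounting record are treated as zero revenue.
            "revenue": revenue_by_company.get(row["company"], 0.0),
        })
    return joined


def summarize(joined):
    """Count leads, closed deals, and revenue per size segment."""
    summary = defaultdict(lambda: {"leads": 0, "closed": 0, "revenue": 0.0})
    for row in joined:
        stats = summary[row["segment"]]
        stats["leads"] += 1
        stats["closed"] += int(row["closed"])
        stats["revenue"] += row["revenue"]
    return dict(summary)


summary = summarize(join_sales_and_revenue(load(CRM_CSV), load(ACCOUNTING_CSV)))
for segment, stats in sorted(summary.items()):
    print(segment, stats)
```

Once the two sources share a key and sit in one place, the question "do large clients actually pay off?" becomes a few lines of analysis instead of a project.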

Spreading inaccurate lore. Suppose that the WidgetCorp CEO had given a series of company-wide talks and public blog posts with the message “The future of WidgetCorp is serving big companies!” Product teams got into the habit of prioritizing features in this direction, and every failure got blamed on “I guess we weren’t focused enough on big clients”. This kind of “cultural inertia” can be very difficult to reverse, even if the executive team is willing to publicly admit their mistake (which isn’t guaranteed!).

Just about every experienced data scientist has at least a few of these stories, even from otherwise successful companies. They are to scientific debt what the Daily WTF is to technical debt.

Is scientific debt always bad?

Not at all!

I often take shortcuts in my own analyses. Running a randomized experiment for a feature launch is sometimes too expensive, especially if the number of users is fairly small or the change pretty uncontroversial (you wouldn’t A/B test a typo fix). And while correlation doesn’t imply causation, it’s usually better than nothing when making business decisions.
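A back-of-the-envelope sample size calculation shows why a small user base can make a randomized test impractical. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 10% baseline conversion rate, the candidate lifts, and the conventional 5% significance level and 80% power are assumptions chosen purely for illustration.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline, treated, alpha=0.05, power=0.80):
    """Approximate users needed in each arm of a two-sided, two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + treated * (1 - treated)
    return ceil((z_alpha + z_power) ** 2 * variance / (baseline - treated) ** 2)


# Hypothetical conversion rates: detecting 10% -> 11% versus 10% -> 15%.
small_change = users_per_arm(0.10, 0.11)
large_change = users_per_arm(0.10, 0.15)
print(f"users per arm to detect 10% -> 11%: {small_change}")
print(f"users per arm to detect 10% -> 15%: {large_change}")
```

Under these assumptions, detecting a one-point lift takes on the order of fifteen thousand users per arm, while a five-point lift takes several hundred. A product with a few thousand users simply cannot run the first test in a reasonable time, which is when a shortcut becomes a defensible choice.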

The comparison to technical debt is useful here: a small engineering team’s first goal is typically to build a minimum viable product quickly, rather than overengineer a system that they think will be robust in the distant future. (The equivalent in scientific debt is typically called overthinking, e.g. “Yes I suppose we could control for weather when we examine what sales deals succeed, but I’m pretty sure you’re overthinking this”). And the comparison to financial debt is meaningful too: companies typically take on debt (or, similarly, give up equity) while they’re growing. Just like you can’t build a company without borrowing money, you can’t build a company while being certain every decision is thoroughly supported by data.

What’s important in both technical and scientific debt is to keep the long-term cost in mind.

Wrong decisions are expensive, and not paying attention to data is a risk. We can do a cost-benefit analysis of whether the risk is worth it, but we shouldn’t write it off as “data scientists always find something to complain about”.

Why even call it “debt”?

To a data scientist or analyst, this post might sound pretty obvious. Of course there are downsides to ignoring statistical rigor, so why bother giving it a “buzzword-y” name? Because it puts the concept in terms executives and managers can easily understand.

Again, let’s go back to technical debt. There are lots of reasons individual engineers may want to write “clean code”: they appreciate its elegance, they want to impress their peers, or they’re perfectionists procrastinating on other work. These reasons don’t generally matter to non-technical employees, who care about product features and reliability. The framing of technical debt helps emphasize what the company loses by not investing in architecture: the idea that even if a product looks like it’s working, the flaws have a long-term cost in actual dollars and time.


Engineer: It bothers me that different internal projects use different naming conventions.

CTO: Sorry it annoys you, but code is code, I don’t see why you should waste time on this.


Engineer: Our inconsistent naming conventions are technical debt: they make it harder for new developers to learn the system.

CTO: I’ve been looking for ways to reduce our onboarding time! Great idea, let me know what you need to fix it.


Similarly, scientists, especially from an academic background, often have a particular interest in discovering truths about reality. So the idea of “I’d like to analyze whether X is a confounding factor here” can sound like an indulgence rather than an immediate business need. Statisticians in particular are often excited by finding flaws in mathematical methods. So when a data scientist says something like “We can’t use that method, Jones et al 2012 proved that it is asymptotically inconsistent,” non-technical colleagues might assume they’re overthinking it or even showing off. Framing it in terms of what we’re actually risking helps communicate why it’s worth spending time on.

How can we manage scientific debt well?

  • Let data scientists “pay interest” on it. Just as not every engineering project will lead to a new feature, not every analysis will lead to an exciting discovery or novel algorithm. Some time needs to be spent confirming or invalidating existing assumptions. Jacqueline Nolis has a great article about prioritizing data science work, where she describes this quadrant as “providing proof”.

  • Build data engineering processes: As described earlier, one reason a company might fall into scientific debt is that analysts may not have easy access to the data they need. It could be locked away in a platform that hasn’t been ingested, or in Google Sheets that are edited by hand. Ingesting relevant data into a data warehouse or a data lake makes it more likely data scientists can make relevant discoveries.

  • Revisit old analyses: One common reason early-stage companies go into scientific debt is that they don’t yet have enough data to draw robust conclusions. Even if you don’t have enough data yet, that doesn’t mean you should forget about the problem. Sometimes I put time on my calendar to run an analysis once I expect enough data to be available, even if it’s a few months away. This can also help confirm an important analysis is still relevant: just like you’d keep track of a KPI over time, you want to keep track of whether a conclusion remains true.

  • Have data expertise spread throughout the company. Just as someone who can’t program may not recognize technical debt, someone who doesn’t have experience analyzing and understanding data may not recognize scientific debt. This is yet another reason to democratize data science within your company, as we do at DataCamp.
