Balancing Innovation and Technical Debt

Let’s explore the delicate balance between innovation and technical debt. 

We will look into actionable strategies for managing debt effectively while optimising our infrastructure for resilience and agility.

Balancing acts and trade-offs

Not long ago I had this conversation with a close acquaintance. He is setting up a new startup to fill a market gap he has found, and rushing to get there before the gap closes. It’s a common starting point for many entrepreneurs. You have an idea you need to implement, and until it is implemented and (hopefully) sold, there is no revenue, all while someone else can close the gap before you do. Time-to-market is key.

While there’s no revenue, you acquire debt. But if you are reasonably careful to keep it under control, you pay the Financial Debt off with a different kind of debt: Technical Debt. You choose to make a trade-off here, a trade-off that all too often is made without awareness. This trade-off between debts requires careful thinking too: just as financial debt is an obvious risk, so is technical debt.

Let’s define these debts. Technical debt is the accumulated cost of shortcuts or deferred maintenance in software development and IT infrastructure. Financial debt is the borrowing of funds to finance business operations or investments. They share a common thread: the trade-off between short-term gains and long-term sustainability.

Financial debt can provide immediate capital for growth, but it can also drag the business into financial inflexibility and burdensome interest rates. Likewise, technical debt expedites product development and reduces time-to-market, at the expense of increased maintenance, reduced scalability, and decreased agility. It is an often overlooked aspect of a technological investment, and caring for it promptly can have a huge impact on the lifespan of the business. Just as an enterprise must manage its financial leverage to maintain solvency and liquidity, it must also manage its technical debt to ensure the reliability, scalability, and maintainability of its systems and software.

The Economics of Technical Debt

Consider the example of a rapidly growing e-commerce platform. Appeal attracts demand, demand requires resources, and resources mean increased vulnerability: the growing volume of user data and resources attracts threats that aim to disrupt services, steal sensitive data, or cause reputational harm. In this environment, the platform’s success is determined by its ability to strike a delicate balance between serving legitimate customers and thwarting malicious actors, both of which keep growing in number.

Early on, the platform prioritised rapid development and deployment of new features; however, in their haste to innovate, the technical team accumulated debt by taking shortcuts and deferring critical maintenance tasks. The result is a platform that is increasingly fragile and inflexible, leaving it vulnerable to disruptive attacks and to more agile competitors. Meanwhile, under pressure to maximise profitability and shareholder value, the platform’s financial team reasonably kept allocating capital to marketing campaigns, product launches, and strategic acquisitions; however, they neglected to allocate sufficient resources to cybersecurity initiatives, viewing them as discretionary expenses rather than critical investments in risk mitigation and resilience.

Technical currencies

If we’re talking about debt, and drawing a parallel with financial terms, let’s complete the parallel. By establishing the concept of currencies, we can build quantifiable metrics of value that reflect the health and resilience of digital assets. Code coverage, for instance, measures the proportion of the codebase exercised by automated tests, providing insights into the potential presence of untested or under-tested code paths. Along these lines, tests and documentation are the two assets that pay off the most technical debt.

See for example how coverage for MongooseIM has been continuously trending higher.
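
As a minimal sketch of the coverage metric described above, the following Python snippet computes line coverage as the share of executable lines that a test run touched. The function names and the sample line numbers are hypothetical, and real coverage tools measure considerably more (branches, for example) than this illustration does.

```python
def line_coverage(executed_lines: set[int], executable_lines: set[int]) -> float:
    """Return the fraction of executable lines that the tests executed."""
    if not executable_lines:
        # Nothing to cover: treated as fully covered (a convention chosen here).
        return 1.0
    covered = executed_lines & executable_lines
    return len(covered) / len(executable_lines)


def untested_lines(executed_lines: set[int], executable_lines: set[int]) -> list[int]:
    """Return the executable lines no test reached, in ascending order."""
    return sorted(executable_lines - executed_lines)


# Hypothetical module with 8 executable lines, 6 of them hit by the test suite.
executable = {1, 2, 3, 5, 8, 9, 10, 12}
executed = {1, 2, 3, 5, 8, 9}

print(f"{line_coverage(executed, executable):.0%}")  # 75%
print(untested_lines(executed, executable))          # [10, 12]
```

The second function is the more useful one in practice: the percentage is the “currency”, but the list of unreached lines tells you where the debt actually sits.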

Similarly, Continuous Integration and Continuous Deployment (CI/CD) pipelines automate the process of integrating code changes, running automated tests, verifying engineering work, and deploying applications to diverse environments, enabling teams to deliver software updates frequently and with confidence. By streamlining the development workflow and reducing manual intervention, CI/CD pipelines enhance productivity, accelerate time-to-market, and minimise the risk of human error. Humans have bad days and sleepless nights; well-developed automation doesn’t.

Additionally, valuations of code quality that are diligently tracked in the organisation’s ticketing system provide valuable insights into the evolution of software assets and the effectiveness of ongoing efforts to address technical debt and improve code maintainability. These valuations enable organisations to prioritise repayment efforts and allocate resources effectively.

Repaying Technical Debt

The longer any debt remains unpaid, the greater its impact on the organisation — (technical) debt accrues “interest” over time. But, much like in finances, a debt is paid with available capital, and choosing a payment strategy can make a difference in whether capital is wasted or successfully (re)invested:

  1. Priorities and Plans: Identify and prioritise areas of technical debt based on their impact on the system’s performance, stability, and maintainability. Develop a plan that outlines the steps needed to address each aspect of technical debt systematically.
  2. Refactoring: Allocate time and resources to refactor code and systems to improve their structure, readability, and maintainability. Break down large, complex components into smaller, more manageable units, and eliminate duplicate or unnecessary code. See for example how we battled technical debt in MongooseIM.
  3. Automated Testing: Invest in automated testing frameworks and practices to increase test coverage and identify regression issues early in the development process. Implement continuous integration and continuous deployment (CI/CD) pipelines to automate the testing and deployment of code changes. Establishing this pipeline is always our first step in any new project we join, and we’ve become familiar with diverse CI technologies like GitHub Actions, CircleCI, GitLab CI, or Jenkins.
  4. Documentation: Enhance documentation efforts to improve understanding and reduce ambiguity in the codebase. Document design decisions, architectural patterns, and coding conventions to facilitate collaboration and knowledge sharing among team members. Choose technologies that facilitate and enhance documentation work.

Repayment assets

Repayment assets are resources or strategies that can be leveraged to make debt repayment financially viable. Here are some key repayment assets to consider:

  1. Training and Education: Provide training and education opportunities for developers to enhance their skills and knowledge in areas such as software design principles, coding best practices, and emerging technologies. Encourage continuous learning and professional development to empower developers to make informed decisions and implement effective solutions.
  2. Technical Debt Reviews: Conduct regular technical debt reviews to assess the current state of the codebase, identify areas of concern, and track progress in addressing technical debt over time. Use metrics and KPIs to measure the impact of technical debt reduction efforts and inform decision-making.
  3. Collaboration and Communication: Foster a culture of collaboration and communication among development teams, stakeholders, and other relevant parties. Encourage open discussions about technical debt, its implications, and potential strategies for repayment, and involve stakeholders in decision-making processes.
  4. Incremental Improvement: Break down technical debt repayment efforts into smaller, manageable tasks and tackle them incrementally. Focus on making gradual improvements over time rather than attempting to address all technical debt issues at once, prioritising high-impact and low-effort tasks to maximise efficiency and effectiveness.
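
The idea of tackling high-impact, low-effort items first can be sketched as a simple ranking. This is only an illustration under stated assumptions: the `DebtItem` type, the 1 to 5 scales, the impact-to-effort ratio, and the backlog entries are all hypothetical, and a real team would calibrate its own scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    name: str
    impact: int  # hypothetical scale: 1 (minor) to 5 (severe)
    effort: int  # hypothetical scale: 1 (hours) to 5 (months)

    def __post_init__(self):
        if self.impact < 1 or self.effort < 1:
            raise ValueError("impact and effort must be at least 1")


def prioritise(items):
    """Order items by impact per unit of effort, highest first.

    Ties are broken in favour of the cheaper item.
    """
    return sorted(items, key=lambda item: (-item.impact / item.effort, item.effort))


backlog = [
    DebtItem("flaky integration tests", impact=4, effort=2),    # ratio 2.0
    DebtItem("rewrite billing module", impact=5, effort=5),     # ratio 1.0
    DebtItem("document deployment steps", impact=3, effort=1),  # ratio 3.0
]

for item in prioritise(backlog):
    print(item.name)
# document deployment steps
# flaky integration tests
# rewrite billing module
```

A ratio is the crudest possible scoring rule; its value is mostly that it forces the impact and effort estimates to be written down and reviewed.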

Don’t acquire more debt than you have to

While debt is a quintessential aspect of entrepreneurship, acquiring it unwisely is obviously shooting oneself in the foot. You’ll have to make many decisions and choose between many trade-offs, so you had better be well-informed before putting your finger on the red buttons.

Your service will require infrastructure

Whether you choose one vendor over another or decide to go self-hosted, use containerised technologies, so that moving to better infrastructure in the future remains possible. Containers also provide a consistent environment for development, testing and production. Choose technologies that are good citizens in containerised environments.

Your service will require hardware resources

Whether you choose one hardware architecture or another, or any amount of memory, use runtimes that can efficiently use and adapt to any given hardware, so that future moves to better hardware pay off. For example, Erlang’s concurrency model is famous for automatically taking advantage of any number of cores, and with technologies like Elixir’s Nx you can take advantage of specialised hardware such as GPUs and TPUs for your machine learning tasks.

Your service will require agility

The market will push your offerings to their limits, in a never-ending stream of requests for new functionality and changes to your service. Your code will need to change, and respond to changes. From Elixir’s metaprogramming and language extensibility to Gleam’s strong type safety, prioritise tools that help your developers change things safely and powerfully.

Your service will require resiliency

There are two philosophies in the culture of error handling: either whole classes of errors are mathematically proven not to happen (Haskell’s approach), or it is assumed that errors can’t always be avoided and we need to learn to handle them (Erlang’s approach). Wise technologies take one starting point as an a priori foundation and deal with the other end a posteriori. Choose your point on the scale wisely, and be wary of technologies that don’t take a safe stance. Errors can happen: electricity goes down, cables are cut, and attackers attack. Programmers have bad days and sleepless nights, or get sick. Take a stance before errors bite your service.
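
To make the second philosophy concrete, here is a small Python sketch of the “assume failures happen and recover from them” stance: a task is simply restarted when it fails, up to a limit. It is loosely inspired by the idea behind Erlang supervision but is not how OTP supervisors are implemented; `supervise`, `make_flaky`, and the restart limit are hypothetical names and values chosen for illustration.

```python
def supervise(task, max_restarts=3):
    """Run `task`, restarting it after a failure, at most `max_restarts` times.

    Returns the task's result together with the list of failures that were
    absorbed along the way. Raises RuntimeError once the restart budget is spent.
    """
    failures = []
    for _attempt in range(max_restarts + 1):
        try:
            return task(), failures
        except Exception as exc:  # deliberately broad: any crash triggers a restart
            failures.append(exc)
    raise RuntimeError(f"task failed {len(failures)} times; giving up") from failures[-1]


def make_flaky(fail_times):
    """Build a task that fails `fail_times` times and then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ConnectionError("cable cut")
        return "ok"

    return task


result, failures = supervise(make_flaky(2))
print(result, len(failures))  # ok 2
```

The point of the sketch is the division of labour: the task contains no defensive code at all, and the recovery policy lives in one place where it can be reasoned about and changed.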

Your service will require availability

No fancy unique idea will sell if it can’t be bought, and no service will be used if it is not there to begin with. Unavailability takes a heavy toll on your revenue, so prioritise availability. Choose technologies that can handle not just failure, but even upgrades (!), without downtime. And to have real availability, you always need at least two computers, in case one dies: choose technologies that make many independent computers cooperate easily and take over one another’s work transparently.

A Case Study: A Balancing Act in Traffic Management

A chat system, like many web services, handles a practically unbounded number of independent users. It is a heavily network-based application that needs to respond to requests that are independent of each other in a timely and fair manner. It is an embarrassingly parallel problem, since messages can be processed independently of each other, but it is also a challenge of soft real-time properties, where messages should be processed soon enough for a human to have a good user experience. It also faces the challenge of bad actors, which makes blacklisting and throttling of requests necessary.

MongooseIM is one such system. It is written in Erlang, and in its architecture, every user is handled by one actor.

It is containerised, and it uses all available resources efficiently and smoothly, adapting to any change of hardware, from small embedded systems to massive mainframes. Its architecture relies heavily on the Publish-Subscribe pattern: because Erlang is a functional language in which functions are first-class citizens, handler functions are installed to react to all sorts of events. We do this extensively because we never know what new functionality we will need to implement in the future.

One important event is a new session starting. Mechanisms for blacklisting are plentiful, whether they’re based on specific identifiers, IP regions, or even modern AI-based behaviour analysis. We can’t predict the future, so we simply publish the “session opened” event and leave it to our future selves to install the right handler when it is needed.
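
The pattern of publishing an event now and installing handlers later can be sketched in a few lines of Python. This is an illustration of the general Publish-Subscribe idea only, not MongooseIM’s actual hook mechanism; `EventBus`, the `"session_opened"` event name, the payload fields, and the blocked address are all hypothetical.

```python
from collections import defaultdict


class EventBus:
    """A tiny publish-subscribe registry: events map to lists of handler functions."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event, handler):
        """Install `handler` (any callable taking the payload) for `event`."""
        self._handlers[event].append(handler)

    def publish(self, event, payload):
        """Call every handler installed for `event`; return their results in order."""
        return [handler(payload) for handler in self._handlers.get(event, [])]


bus = EventBus()
session = {"user": "alice", "ip": "203.0.113.7"}

# Day one: the event is published, but nobody has anything to say about it yet.
print(bus.publish("session_opened", session))  # []

# Later: a blacklist requirement appears, and a handler is installed for it.
BLOCKED_IPS = {"203.0.113.7"}


def allow_unless_blocked(payload):
    """Return True when the session may proceed."""
    return payload["ip"] not in BLOCKED_IPS


bus.subscribe("session_opened", allow_unless_blocked)
verdicts = bus.publish("session_opened", session)
print(all(verdicts))  # False
```

The code that opens sessions never changes between “day one” and “later”; only the set of installed handlers does, which is what keeps the cost of future requirements low.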

Another important event is that of a simple message being sent. What if bad actors have successfully opened sessions and start flooding the system, consuming CPU and database resources unnecessarily? Again, changing requirements might dictate that the system handle some users with preferential treatment. One default option is to limit all message processing to some reasonable rate, for which we use a traffic shaping mechanism called the Token Bucket algorithm, implemented in our library Opuntia – named that way because if you touch it too fast, it stings you.
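
For readers unfamiliar with the Token Bucket algorithm, here is a minimal Python sketch of the general technique. It is not Opuntia’s API or implementation; the class name, parameters, and the injectable clock are assumptions made for the example. The bucket refills at a steady rate up to a fixed capacity, and each message spends one token, so short bursts pass while sustained floods are held back.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so the example is deterministic
        self._tokens = float(capacity)   # start full, so an initial burst is allowed
        self._last = clock()

    def _refill(self):
        """Add the tokens earned since the last check, capped at capacity."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_consume(self, cost=1.0):
        """Spend `cost` tokens if available; return whether the message may proceed."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# A fake clock makes the behaviour reproducible.
now = [0.0]
bucket = TokenBucket(rate=2, capacity=5, clock=lambda: now[0])

burst = [bucket.try_consume() for _ in range(6)]
print(burst)  # [True, True, True, True, True, False]

now[0] += 1.0  # one second later, two tokens have been earned back
later = [bucket.try_consume() for _ in range(3)]
print(later)  # [True, True, False]
```

This sketch only reports whether a message fits within the budget. A traffic shaper that slows senders down, as described above, would delay the message until enough tokens are available instead of rejecting it.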

You can read more about how scalable MongooseIM is in this article, where we pushed it to its limit. And while we continuously load-test our server, we haven’t done another round of limit-pushing since then. Stay tuned for a future blog post when we do just that!

Lessons Learned

Technical Debt has an inherent value akin to Financial Debt. Choosing the right tool for the job means acquiring the right Technical Debt when needed, leveraging strategies, partnerships, and solutions that prioritise resilience, agility, and long-term sustainability.
