Everyone wants to build bug-free software from the get-go, but in practice, that’s nearly impossible. The complexity of software systems, combined with the diverse hardware and software environments in which they operate, makes it challenging to eliminate all potential defects. So, when a bug eventually appears, the only solution is to address it promptly and efficiently. In this article, I present a bug alchemy to help you master the art of software remediation.
Bug triage
When a bug surfaces, it’s important to assess its priority to better manage fixes. But how do you effectively evaluate “priority”? While organizations and projects have their own bug triage frameworks, one commonly used and effective approach is to weigh a bug’s severity against its frequency.
Assessing bug severity
Bug severity refers to the impact or seriousness of the bug on the software's functionality or user experience. Common categories of bug severity are:
Critical/High: These are bugs that cause a complete failure of critical functions, result in data loss or pose severe security risks.
Example:
Bug description | Impact |
The login page of a web application is vulnerable to SQL injection. | Allows unauthorized access, compromising sensitive user information. |
Major/Medium: These are bugs that affect important features and functionality but don’t cause a complete system failure.
Example:
Bug description | Impact |
In an e-commerce website, pagination on item review pages fails to load. | Intermittent issues affect UX and frustrate customers, potentially leading to a drop in sales. |
Minor/Low: These are bugs that have a minimal impact on functionality or can be easily worked around by users.
Example:
Bug description | Impact |
A mobile app displays a minor formatting issue in the footer on some devices. | Cosmetic and does not hinder critical functionality, UX or business operations. |
Assessing bug frequency
Once you’ve assigned severity, you then need to ascertain how frequently the failing case occurs. A bug in Facebook might occur only occasionally, but because the user base is so large, each occurrence can still affect millions of users. This is why you need to consider your user base when assessing frequency.
Assessing bug priority based on severity and frequency
Bug priority refers to the order in which bugs should be addressed and fixed. Common categories of bug priority are:
P0/Red: Bugs that need to be fixed immediately as they severely impact users or business operations.
P1/Amber: Important bugs that do not require immediate attention. They may be fixed in the next regular development cycle.
P2/Yellow: Bugs that have minimal impact or can be deferred for future releases without affecting the core functionality.
Below is a custom framework to prioritize bugs based on severity and frequency.
Severity / Frequency | High | Medium | Low |
Always | 1 | 1 | 2 |
Sometimes | 1 | 2 | 3 |
Rarely | 2 | 3 | 3 |
Levels: 1 - high priority, 2 - medium priority, 3 - low priority
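The matrix above can be sketched as a small lookup table in code. This is a minimal illustration rather than a definitive implementation: the names `Severity`, `Frequency`, `PRIORITY_MATRIX` and `bug_priority` are assumptions made for this example, and the values are copied from the table.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


class Frequency(Enum):
    ALWAYS = "Always"
    SOMETIMES = "Sometimes"
    RARELY = "Rarely"


# Rows are frequency, columns are severity, values are priority levels
# (1 - high, 2 - medium, 3 - low), as in the table above.
PRIORITY_MATRIX = {
    Frequency.ALWAYS: {Severity.HIGH: 1, Severity.MEDIUM: 1, Severity.LOW: 2},
    Frequency.SOMETIMES: {Severity.HIGH: 1, Severity.MEDIUM: 2, Severity.LOW: 3},
    Frequency.RARELY: {Severity.HIGH: 2, Severity.MEDIUM: 3, Severity.LOW: 3},
}


def bug_priority(severity: Severity, frequency: Frequency) -> int:
    """Look up the priority level for a severity/frequency pair."""
    return PRIORITY_MATRIX[frequency][severity]
```

For instance, `bug_priority(Severity.MEDIUM, Frequency.SOMETIMES)` returns 2. Keeping the matrix as data rather than nested conditionals makes it easy to add further severity or frequency levels later.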
Based on the needs of your organization, project and capabilities, you might incorporate further levels of severity (Critical/Major/Minor/Cosmetic) and/or frequency (Always/Frequently/Sometimes/Rarely) and proceed accordingly.
Bug court
By the time all bugs are triaged, the P0/Red bugs will have been picked and fixed. P1s need a different approach, however. One method is a “bug court”, where development, testing and product management stakeholders meet to advocate for the various bugs in the backlog.
QAs provide insights into the technical aspects of a bug and its potential implications for the overall system. BAs and POs, meanwhile, can evaluate its impact on business and UX. By conducting bug courts on a regular cadence, teams can prioritize and schedule bugs to be resolved in upcoming releases.
Bug war
A bug war is set up when a substantial backlog of bugs accumulates, and the business prioritizes bug fixes before introducing new features. In a bug war, developers go into a hackathon-style mode, selecting bugs based on priority and resolving them through Desk Checks. As developers become available, they pick up bugs in the priority list.
Gamifying bug resolution with bug-fix-bounties and leaderboards can foster healthy competition and motivate developers. Adding weightings for bug complexity makes the competition fairer, because some bugs are inevitably much more challenging to fix than others. This is illustrated in the table below:
JIRA ID | Bug Summary | Component / Microservice to fix | Bug Priority | Weightage / Story points | Developer |
PROJECT1-123 | Performance degradation in orders page | Component A | 1 | 2 | Robin |
PROJECT1-124 | Performance degradation in orders page | Component B | 1 | 1 | Daisy |
PROJECT1-125 | Performance degradation in orders page | Component A | 1 | 2 | Violet |
PROJECT1-126 | Field validation not happening for certain inputs | Component C | 2 | 3 | Daisy |
In the example above, Daisy has the highest total weighting (1 + 3 = 4, against 2 each for Robin and Violet), so she becomes the Bug War winner and receives the well-deserved bug-fix-bounty for her exceptional contributions.
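The leaderboard tally can be sketched in a few lines. This is only an illustration under assumptions: the tuple layout and the names `RESOLVED_BUGS` and `leaderboard` are made up for this example, and the rows are copied from the table above.

```python
from collections import defaultdict

# (JIRA ID, bug priority, weighting, developer), copied from the table above.
RESOLVED_BUGS = [
    ("PROJECT1-123", 1, 2, "Robin"),
    ("PROJECT1-124", 1, 1, "Daisy"),
    ("PROJECT1-125", 1, 2, "Violet"),
    ("PROJECT1-126", 2, 3, "Daisy"),
]


def leaderboard(resolved_bugs):
    """Sum the weightings per developer and rank them, highest total first."""
    totals = defaultdict(int)
    for _jira_id, _priority, weighting, developer in resolved_bugs:
        totals[developer] += weighting
    # Ties are broken alphabetically so the ordering is stable.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


print(leaderboard(RESOLVED_BUGS))
# [('Daisy', 4), ('Robin', 2), ('Violet', 2)]
```

Ranking by summed weighting rather than by bug count is what keeps the competition fair: Daisy fixed the same number of bugs as nobody else (two), but it is the 3-point bug that puts her ahead.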
Bugs in production
Get it sorted out
Despite your best efforts, critical production issues might sneak in from time to time. When that happens, stay calm. Then, prioritize. If the issue needs an immediate fix, assign a pair of skilled developers to work on it. In a DevOps 'you build it, you run it' setting, this may involve the same team responsible for development also taking responsibility for operational aspects. Once the fix is implemented, the QA in the team can conduct a thorough and rigorous examination. And remember to test not only the fix but also its impact on the surrounding codebase to avoid introducing new issues. After deploying the fix, perform a sanity check in the production environment to ensure everything functions as expected.
RCA and the path forward
Mistakes are meant for learning, not repeating, they say. Root Cause Analysis (RCA) enables precisely that. RCA is the process of investigating an issue to identify its underlying causes and understand why it wasn’t discovered in earlier phases of development and testing. The key steps in conducting a root cause analysis and remediating the issue are:
Identify the root cause: Analyze the data. Look for patterns, dependencies and potential factors contributing to the problem.
Assess testing gaps: Review the testing processes at each phase (analysis, kickoff, implementation, DevBox, testing and regression) to understand why the issue wasn’t caught earlier.
Process improvement: Based on the above findings, identify areas for process improvement, such as enhancing testing strategies, introducing additional test cases or improving collaboration.
Automation: As part of the solution, consider introducing automated tests to cover the specific case that led to the production issue.
Documentation: Document the RCA findings, recommended actions and preventive measures to share with the team and stakeholders.
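As an illustration of the automation step above, here is a minimal sketch of a regression test that pins down inputs that slipped past validation. Everything in it is hypothetical: the `is_valid_quantity` validator, its 1 to 99 range and the list of failing inputs are assumptions invented for this example, not details from a real incident.

```python
import unittest


def is_valid_quantity(raw: str) -> bool:
    """Hypothetical validator: accept whole numbers from 1 to 99 only."""
    text = raw.strip()
    # isascii() guards against non-ASCII digit characters such as
    # superscripts, which isdigit() accepts but int() rejects.
    return text.isascii() and text.isdigit() and 1 <= int(text) <= 99


class QuantityValidationRegressionTest(unittest.TestCase):
    def test_rejects_inputs_that_slipped_through(self):
        # Hypothetical inputs identified during the RCA.
        for bad in ["", "0", "-1", "1.5", "100", "abc"]:
            with self.subTest(value=bad):
                self.assertFalse(is_valid_quantity(bad))

    def test_still_accepts_valid_inputs(self):
        for good in ["1", " 42 ", "99"]:
            with self.subTest(value=good):
                self.assertTrue(is_valid_quantity(good))
```

Saved in a test module, this can be run with `python -m unittest`. The point is that each input found during the RCA becomes a permanent, named test case, so the same defect cannot quietly return in a later release.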
I’ve outlined best practices and processes here, but it’s important to remember that promptly addressing defects depends on a well-coordinated effort between developers, testers and product managers. This is particularly critical in a DevOps setting where teams have end-to-end responsibility. Effective collaboration and communication are essential to ensure defects are addressed and that software in production operates smoothly. A culture of transparency and collaboration can’t be built just by following a step-by-step process, yet both matter: transparency is fundamental to the success of bug courts, production issues often need multiple developers working together in real time, and RCAs often find that problems begin with communication between team members. So, use the bug alchemy framework to build a remediation practice, but remember to communicate, adapt and collaborate as a team.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.