Handling Flaky NX Tasks
| Status | Date | Author |
|---|---|---|
| Accepted | 2025-03-21 | Claude |
Context
Several tasks in our NX monorepo have been identified as flaky, causing intermittent failures in CI/CD pipelines and local development. The affected tasks include:
- commons:install
- tools:build:vite:production
- chainpyon:install
- gleam:install
- risk:build
These flaky tasks waste development time, reduce confidence in our CI processes, and produce spurious failures that make genuine regressions harder to spot.
Decision
We've implemented several strategies to make these tasks more robust:
- Pinned Python Versions: Instead of allowing a range of Python versions (`>=3.9,<3.11`), we now use a narrower constraint (`~3.10.0`, which Poetry resolves to any 3.10.x release) to keep environments consistent.
- Enhanced Install Commands: Replaced the direct Poetry executor with custom commands that include:
  - Explicit lock file updates
  - Retry logic with timeouts
  - Verbose logging
  - Clean installation options
- Split Complex Tasks: Separated complex tasks (such as the Vite build) into smaller, more manageable steps:
  - Actual execution separated from wrapper tasks
  - Dedicated retry commands
  - Optional linting during builds
- Resource Optimization: Added explicit resource limits where needed:
  - Increased Node heap size for Vite builds (`--max-old-space-size=8192`)
  - Added timeouts to prevent hanging processes
- Sequential Dependency Resolution: Ensured complex dependencies are properly ordered:
  - Fixed dependency chains for Python projects
  - Added explicit dependency tasks
  - Created dedicated commands for cleaning and rebuilding
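The "retry logic with timeouts" idea above can be sketched as a small shell helper. This is a minimal illustration under stated assumptions, not the code used in the repository: the function name `retry_with_timeout` is hypothetical, and it relies on the `timeout` utility from GNU coreutils, which may need to be installed separately on macOS.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Usage: retry_with_timeout ATTEMPTS TIMEOUT_SECS DELAY_SECS COMMAND [ARGS...]
retry_with_timeout() {
  local attempts="$1" limit="$2" delay="$3"
  shift 3
  local n=1 status
  while true; do
    status=0
    # `timeout` (GNU coreutils) kills the command after $limit seconds
    # and exits with status 124 when that happens.
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempt(s), last exit status $status" >&2
      return "$status"
    fi
    echo "attempt $n failed with status $status, retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

With this helper sourced, a call such as `retry_with_timeout 2 600 5 poetry install --no-interaction --verbose` would allow two attempts of at most ten minutes each, pausing five seconds between them.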
Implementation Details
Python Package Tasks
For Python projects (commons, chainpyon, gleam, risk), we've:
- Updated Python version constraints in `pyproject.toml` files from `>=3.9,<3.11` to `~3.10.0`
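- Created explicit, resilient installation commands (`parallel` is set to `false` because `nx:run-commands` otherwise runs the listed commands concurrently, and the lock step must finish before the install starts):
    "install": {
      "executor": "nx:run-commands",
      "dependsOn": ["lock"],
      "options": {
        "commands": [
          "poetry lock --no-update",
          "poetry install --no-interaction --verbose || (sleep 5 && poetry install --no-interaction --verbose)"
        ],
        "cwd": "{projectRoot}",
        "parallel": false
      }
    }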
- Added clean installation options for complete environment resets (the steps must run in order, so `parallel` is `false`; `.` is used instead of `source` so the activation also works when the commands run under a POSIX `sh`):
    "install:clean": {
      "executor": "nx:run-commands",
      "options": {
        "commands": [
          "rm -rf .venv || true",
          "poetry env remove --all || true",
          "poetry lock --no-update",
          "python -m venv .venv",
          ". .venv/bin/activate && pip install -U pip && poetry install --no-interaction --verbose"
        ],
        "cwd": "{projectRoot}",
        "parallel": false
      }
    }
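Because `~3.10.0` only admits 3.10.x interpreters, an install can fail simply because the active Python is a different minor version. A pre-flight check along the following lines could surface that early. It is a sketch only: the function name `python_matches_310` is hypothetical, and the pattern match is deliberately simpler than Poetry's own constraint resolution (for example, it does not treat pre-releases specially).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of the repository.
# Usage: python_matches_310 VERSION
# Returns 0 when VERSION looks like a 3.10.x release, 1 otherwise.
python_matches_310() {
  case "$1" in
    3.10.*) return 0 ;;  # e.g. 3.10.0, 3.10.13
    *) return 1 ;;       # anything else, including 3.9.x and 3.11.x
  esac
}

# Example use before an install (assumes `python3` is on PATH):
#   version=$(python3 -c 'import platform; print(platform.python_version())')
#   python_matches_310 "$version" || echo "Python $version does not satisfy ~3.10.0" >&2
```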
Tools Build Task
For the tools build task, we've:
- Split the Vite build into separate steps:
  - A wrapper task with resource limits
  - The actual build task
  - A retry mechanism
- Simplified the Vite configuration:
  - Made linting plugins optional (only in development mode)
  - Enhanced error reporting without failing builds
  - Increased Node memory limits
    "build:vite": {
      "executor": "nx:run-commands",
      "options": {
        "commands": ["NODE_OPTIONS='--max-old-space-size=8192' nx run tools:build:vite:actual"],
        "cwd": "{projectRoot}",
        "parallel": false
      }
    }
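One caveat with the wrapper above is that assigning `NODE_OPTIONS` inline replaces any value a developer or CI job has already exported. If that matters, the value can be merged instead of overwritten. The helper below is an illustrative sketch with a hypothetical name (`node_options_with_heap`); it only recognizes the hyphenated spelling of the flag used in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Usage: node_options_with_heap CURRENT_OPTIONS HEAP_MB
# Prints a NODE_OPTIONS value that keeps CURRENT_OPTIONS and adds a heap
# limit only when none is already present.
node_options_with_heap() {
  local current="$1" heap_mb="$2"
  case "$current" in
    *--max-old-space-size=*)
      # A limit is already set: respect it.
      printf '%s\n' "$current" ;;
    "")
      printf '%s\n' "--max-old-space-size=${heap_mb}" ;;
    *)
      printf '%s\n' "$current --max-old-space-size=${heap_mb}" ;;
  esac
}
```

A wrapper command could then be written as `NODE_OPTIONS="$(node_options_with_heap "${NODE_OPTIONS:-}" 8192)" nx run tools:build:vite:actual`.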
Risk Build Task
For the risk build task, which depends on multiple local packages:
- Added explicit dependency ordering:
    "build": {
      "executor": "nx:run-commands",
      "dependsOn": [
        {
          "projects": ["commons", "gleam", "datahub"],
          "target": "install"
        },
        "install"
      ],
      "options": {
        "commands": [
          "poetry build -vvv"
        ],
        "cwd": "{projectRoot}"
      }
    }
- Created a dedicated sequential dependency installation command:
    "install:deps": {
      "executor": "nx:run-commands",
      "options": {
        "commands": [
          "poetry install --no-interaction --verbose -v",
          "cd ../commons && poetry install --no-interaction",
          "cd ../gleam && poetry install --no-interaction --no-extras",
          "cd ../datahub && poetry install --no-interaction"
        ],
        "cwd": "{projectRoot}",
        "parallel": false
      }
    }
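When the same sequence has to run outside Nx, for instance from a plain CI script, the sequential behaviour can be approximated in shell. The following is a minimal sketch under stated assumptions: the function name `install_in_order` is hypothetical, the command is passed as a single string and run through `sh -c`, and the loop stops at the first directory where the command fails.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Usage: install_in_order 'COMMAND STRING' DIR [DIR...]
# Runs the command in each directory, one at a time, in the order given.
install_in_order() {
  local cmd="$1"
  shift
  local dir
  for dir in "$@"; do
    echo "installing in $dir" >&2
    # The subshell keeps the directory change from leaking into later iterations.
    if ! ( cd "$dir" && sh -c "$cmd" ); then
      echo "install failed in $dir, stopping" >&2
      return 1
    fi
  done
}
```

From the risk project root this might be invoked as `install_in_order 'poetry install --no-interaction' . ../commons ../gleam ../datahub`, although per-project flags such as `--no-extras` would need separate calls.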
Consequences
Positive
- More reliable builds and installations
- Reduced developer frustration
- More predictable CI/CD pipeline execution
- Better failure diagnostics with verbose logging
- Safety nets with retry mechanisms
Negative
- Slightly more complex configuration
- Some tasks may take longer to execute (but fail less often)
- Additional targets to maintain
Usage Guidelines
For flaky Python tasks:
- Use `nx run <project>:install:clean` when you encounter persistent issues
- Try `nx run <project>:install:retry` when a normal install fails
For flaky build tasks:
- Use `nx run tools:build:vite:retry` for more resilient builds
- Consider adding `NODE_OPTIONS='--max-old-space-size=8192'` when running complex builds
For risk:build and similar tasks with complex dependencies:
- Run `nx run risk:install:deps` to ensure all dependencies are properly installed
- Use `nx run risk:build:clean` for a clean rebuild
Future Considerations
- Monitor these improvements to verify their effectiveness
- Consider implementing similar patterns for other flaky tasks
- Explore containerization to further reduce environment-based flakiness
- Add automated cache cleaning when builds repeatedly fail
- Consider implementing a global "retry" wrapper for any NX command