Handling Flaky NX Tasks

Status: Accepted
Date: 2025-03-21
Author: Claude

Context

Several tasks in our NX monorepo have been identified as flaky, causing intermittent failures in CI/CD pipelines and local development. The affected tasks include:

  • commons:install
  • tools:build:vite:production
  • chainpyon:install
  • gleam:install
  • risk:build

These flaky tasks waste development time, reduce confidence in our CI processes, and produce spurious failures that can mask genuine regressions.

Decision

We've implemented several strategies to make these tasks more robust:

  1. Pinned Python Versions: Instead of allowing a range of Python versions (>=3.9,<3.11), we're now using a more specific version constraint (~3.10.0) to ensure consistency across environments.

  2. Enhanced Install Commands: Replaced the direct Poetry executor with custom commands that include:
      • Explicit lock file updates
      • Retry logic with timeouts
      • Verbose logging
      • Clean installation options

  3. Split Complex Tasks: Separated complex tasks (such as the Vite build) into smaller, more manageable steps:
      • Separate actual execution from wrapper tasks
      • Dedicated retry commands
      • Optional linting during builds

  4. Resource Optimization: Added explicit memory limits where needed:
      • Increased Node heap size for Vite builds (--max-old-space-size=8192)
      • Added timeouts to prevent hanging processes

  5. Sequential Dependency Resolution: Ensured complex dependencies are properly ordered:
      • Fixed dependency chains for Python projects
      • Added explicit dependency tasks
      • Created dedicated commands for cleaning and rebuilding
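The "retry logic with timeouts" strategy can be sketched as a small shell helper. This is a minimal illustration, not the exact command used in our targets: the function name retry_with_timeout and its arguments are hypothetical, and it assumes the GNU coreutils timeout command is available (it is not installed by default on macOS).

```shell
# Hypothetical helper (not part of the repo): retry an external command,
# bounding each attempt with a timeout.
# Usage: retry_with_timeout <attempts> <timeout_seconds> <delay_seconds> <command...>
# Note: `timeout` runs external programs only, not shell functions.
retry_with_timeout() {
  local attempts="$1" limit="$2" delay="$3" attempt=1 status=1
  shift 3
  while [ "$attempt" -le "$attempts" ]; do
    # `timeout` stops the command once the limit is exceeded (exit status 124)
    timeout "$limit" "$@" && return 0
    status=$?
    echo "attempt $attempt/$attempts failed (exit $status)" >&2
    attempt=$((attempt + 1))
    # Pause only when another attempt is coming
    [ "$attempt" -le "$attempts" ] && sleep "$delay"
  done
  # Report the exit status of the last failed attempt
  return "$status"
}
```

For example, retry_with_timeout 3 600 5 poetry install --no-interaction would allow three attempts of ten minutes each. A timed-out attempt exits with status 124, which makes hangs easy to tell apart from ordinary failures in CI logs.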

Implementation Details

Python Package Tasks

For Python projects (commons, chainpyon, gleam, risk), we've:

  1. Updated Python version constraints in pyproject.toml files from >=3.9,<3.11 to ~3.10.0
  2. Created explicit, resilient installation commands that run in order and retry once after a short pause:
"install": {
  "executor": "nx:run-commands",
  "dependsOn": ["lock"],
  "options": {
    "commands": [
      "poetry lock --no-update",
      "poetry install --no-interaction --verbose || (sleep 5 && poetry install --no-interaction --verbose)"
    ],
    "cwd": "{projectRoot}",
    "parallel": false
  }
}
  3. Added clean installation options for complete environment resets:
"install:clean": {
  "executor": "nx:run-commands",
  "options": {
    "commands": [
      "rm -rf .venv || true",
      "poetry env remove --all || true",
      "poetry lock --no-update",
      "python -m venv .venv",
      ". .venv/bin/activate && pip install -U pip && poetry install --no-interaction --verbose"
    ],
    "cwd": "{projectRoot}",
    "parallel": false
  }
}
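A version mismatch is easier to diagnose when it is caught before Poetry runs. The following is a sketch of a preflight check for the ~3.10.0 constraint, which Poetry interprets as >=3.10.0,<3.11.0. The function name is hypothetical and is not one of our targets.

```shell
# Hypothetical preflight check: succeed only when the given version string
# (for example "3.10.14") falls inside the ~3.10.0 range, i.e. any 3.10.x.
satisfies_tilde_3_10() {
  case "$1" in
    3.10|3.10.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

It could be called with the interpreter's own version, for instance satisfies_tilde_3_10 "$(python3 -c 'import platform; print(platform.python_version())')" || echo "Python 3.10.x required" >&2.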

Tools Build Task

For the tools build task, we've:

  1. Split the Vite build into separate steps:
      • A wrapper task with resource limits
      • The actual build task
      • A retry mechanism

  2. Simplified the Vite configuration:
      • Made linting plugins optional (only in development mode)
      • Enhanced error reporting without failing builds
      • Increased Node memory limits
"build:vite": {
  "executor": "nx:run-commands",
  "options": {
    "commands": ["NODE_OPTIONS='--max-old-space-size=8192' nx run tools:build:vite:actual"],
    "cwd": "{projectRoot}",
    "parallel": false
  }
}
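The wrapper above overwrites NODE_OPTIONS for the nested build. When other Node flags may already be set in the environment, a helper along these lines could add the heap limit without discarding them. This is a sketch only: node_options_with_heap is a hypothetical name, and the wrapper task does not currently use it.

```shell
# Hypothetical helper: print a NODE_OPTIONS value that includes a heap limit,
# keeping any options that are already present.
# Usage: node_options_with_heap <heap_mb> [current NODE_OPTIONS value]
node_options_with_heap() {
  local heap_mb="$1"
  local current="${2-}"
  case " $current " in
    # A heap limit is already set (either spelling): leave the value untouched
    *" --max-old-space-size="*|*" --max_old_space_size="*)
      printf '%s\n' "$current" ;;
    # Nothing set yet: the heap limit is the only option
    "  ")
      printf '%s\n' "--max-old-space-size=$heap_mb" ;;
    # Other options exist: append the heap limit to them
    *)
      printf '%s\n' "$current --max-old-space-size=$heap_mb" ;;
  esac
}
```

A call such as NODE_OPTIONS="$(node_options_with_heap 8192 "${NODE_OPTIONS-}")" nx run tools:build:vite:actual would then keep an existing flag like --enable-source-maps intact.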

Risk Build Task

For the risk build task, which depends on multiple local packages:

  1. Added explicit dependency ordering:
"build": {
  "executor": "nx:run-commands",
  "dependsOn": [
    {
      "projects": ["commons", "gleam", "datahub"],
      "target": "install"
    },
    "install"
  ],
  "options": {
    "commands": [
      "poetry build -vvv"
    ],
    "cwd": "{projectRoot}"
  }
}
  2. Created a dedicated sequential dependency installation command:
"install:deps": {
  "executor": "nx:run-commands",
  "options": {
    "commands": [
      "poetry install --no-interaction --verbose -v",
      "cd ../commons && poetry install --no-interaction",
      "cd ../gleam && poetry install --no-interaction --no-extras",
      "cd ../datahub && poetry install --no-interaction"
    ],
    "cwd": "{projectRoot}",
    "parallel": false
  }
}
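With "parallel": false the commands run one after another. As we understand it, the sequence stops at the first failure, which is roughly what the fail-fast loop below does. The helper name run_in_order is hypothetical, and unlike the target above it applies the same command to every directory, so per-project flags such as --no-extras would need separate calls.

```shell
# Hypothetical helper: run one command in each directory, in the order given,
# stopping at the first failure.
# Usage: run_in_order "<command>" <dir>...
run_in_order() {
  local cmd="$1" dir
  shift
  for dir in "$@"; do
    echo "==> $dir: $cmd" >&2
    # The subshell keeps the caller's working directory unchanged
    (cd "$dir" && sh -c "$cmd") || return $?
  done
}
```

From the risk project root, run_in_order "poetry install --no-interaction" . ../commons ../gleam ../datahub would mirror the ordering used by install:deps.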

Consequences

Positive

  • More reliable builds and installations
  • Reduced developer frustration
  • More predictable CI/CD pipeline execution
  • Better failure diagnostics with verbose logging
  • Safety nets with retry mechanisms

Negative

  • Slightly more complex configuration
  • Some tasks may take longer to execute (but fail less often)
  • Additional targets to maintain

Usage Guidelines

For flaky Python tasks:

  • Use nx run <project>:install:clean when you encounter persistent issues
  • Try nx run <project>:install:retry when a normal install fails

For flaky build tasks:

  • Use nx run tools:build:vite:retry for more resilient builds
  • Consider adding NODE_OPTIONS='--max-old-space-size=8192' when running complex builds

For risk:build and similar tasks with complex dependencies:

  • Run nx run risk:install:deps to ensure all dependencies are properly installed
  • Use nx run risk:build:clean for a clean rebuild

Future Considerations

  1. Monitor these improvements to verify their effectiveness
  2. Consider implementing similar patterns for other flaky tasks
  3. Explore containerization to further reduce environment-based flakiness
  4. Add automated cache cleaning when builds repeatedly fail
  5. Consider implementing a global "retry" wrapper for any NX command
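Items 4 and 5 could be combined into a single wrapper. The sketch below is one possible shape, not an agreed design: retry_then_clean is a hypothetical name, and the cleanup command is left to the caller (nx reset, which clears the Nx cache and stops the daemon, is one candidate).

```shell
# Hypothetical wrapper: retry a command, then run a cleanup command and try
# one final time if every attempt failed.
# Usage: retry_then_clean <attempts> "<cleanup command>" <command...>
retry_then_clean() {
  local attempts="$1" cleanup="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $attempt/$attempts failed" >&2
    attempt=$((attempt + 1))
  done
  echo "running cleanup: $cleanup" >&2
  # Give up if the cleanup itself fails
  sh -c "$cleanup" || return $?
  # Final attempt against the cleaned state; its exit status is returned
  "$@"
}
```

An invocation might look like retry_then_clean 2 "nx reset" nx run risk:build. Because the cleanup only runs after repeated failures, healthy builds keep the benefit of a warm cache.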