VM Recommendation System

Intelligent recommendation engine - checker-based architecture, analysis types, data sources, and automated optimization suggestions

Introduction

The VM recommendation system analyzes VM health data, system metrics, and configuration to generate optimization suggestions. It uses 10 specialized checkers grouped into five categories: Storage, Resource Optimization, Security, Maintenance, and Performance.

System Architecture

Architecture

Layers:

  • API: VMRecommendationResolver in backend/app/graphql/resolvers/VMRecommendationResolver.ts
  • Service: VMRecommendationService in backend/app/services/VMRecommendationService.ts
  • Checkers: 10 analyzers implementing RecommendationChecker interface in backend/app/services/recommendations/
  • Data: Context from VMHealthSnapshot, SystemMetrics, ProcessSnapshot, PortUsage, Machine
  • Storage: PostgreSQL via Prisma

Key decisions:

  • Modular checker design (extend BaseRecommendationChecker, register with service)
  • Two-tier caching (recommendation + context caches)
  • SHA-256 hash-based change detection (prevents duplicate writes)
  • Per-checker performance tracking

Recommendation Types and Categories

Recommendation Types

  • DISK_SPACE_LOW, HIGH_CPU_APP, HIGH_RAM_APP
  • PORT_BLOCKED, OVER_PROVISIONED, UNDER_PROVISIONED
  • OS_UPDATE_AVAILABLE, APP_UPDATE_AVAILABLE
  • DEFENDER_DISABLED, DEFENDER_THREAT, OTHER

Categories

  • Storage: Disk space monitoring
  • Resource Optimization: CPU/RAM/disk allocation analysis
  • Security: Defender status, threat detection, firewall
  • Maintenance: OS/app updates
  • Performance: I/O bottlenecks, process optimization

VMRecommendationService

File: backend/app/services/VMRecommendationService.ts

The VMRecommendationService class is the central orchestrator: it manages checker execution, caching, and performance monitoring.

Key methods:

  • generateRecommendations(): Build context → execute checkers → aggregate → detect changes → persist
  • getRecommendations(): Retrieve with optional refresh, staleness check (>24h)
  • Safe wrappers: *Safe() versions prevent error leakage
  • buildContext(): Fetch health snapshots, metrics, processes, ports, machine config
  • buildContextWithCaching(): Context building with cache layer

Configuration

Parameter               Default  Description
cacheTTLMinutes         15       Recommendation cache TTL (minutes)
maxCacheSize            100      Max cache entries
enableContextCaching    true     Controls BOTH caches
contextCacheTTLMinutes  5        Context cache TTL (minutes)
maxRetries              3        Retry attempts
retryDelayMs            1000     Linear backoff base delay (ms)

Performance Metrics

totalGenerations, averageGenerationTime, cacheHitRate, cacheHits/Misses, contextBuildTime, checkerTimes, errorCount, lastError

BaseRecommendationChecker

File: backend/app/services/recommendations/BaseRecommendationChecker.ts

Abstract base class for all checkers with common interface and utilities.

Methods (abstract unless noted):

  • analyze(context): Returns RecommendationResult[]
  • getName(): Checker identifier
  • getCategory(): Category string
  • isApplicable(context): Optional override, not abstract; returns true by default

Utilities:

  • parseAndCalculateDaysSince(): Date parsing
  • extractDiskSpaceData(): Disk usage extraction
  • looksLikeDiskUsageData(): Data validation

Interfaces:

RecommendationContext: { vmId, latestSnapshot, historicalMetrics, recentProcessSnapshots, portUsage, machineConfig }
RecommendationResult: { type, text, actionText, data? }

Checkers

DiskSpaceChecker

File: backend/app/services/recommendations/DiskSpaceChecker.ts

Monitors disk usage and generates a critical alert above 95% usage or a warning above 85%.

Logic: Parse diskSpaceInfo JSON → calculate (used/total) × 100 → compare thresholds → generate recommendations

Env: ENABLE_DISK_SPACE_CHECKER, DISK_SPACE_CRITICAL_THRESHOLD (95), DISK_SPACE_WARNING_THRESHOLD (85)
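
The threshold comparison can be sketched as follows. This is a minimal illustration, not the checker's actual code: the DiskUsage shape and the classifyDiskUsage name are assumptions, and only the default thresholds (95 and 85) and the (used/total) × 100 formula come from the description above.

```typescript
// Hypothetical shape: the real diskSpaceInfo JSON may use different field names.
interface DiskUsage {
  usedGB: number;
  totalGB: number;
}

type DiskSeverity = "critical" | "warning" | "ok";

// Computes (used / total) × 100 and compares it against the two thresholds.
function classifyDiskUsage(
  disk: DiskUsage,
  criticalPct = 95, // DISK_SPACE_CRITICAL_THRESHOLD default
  warningPct = 85, // DISK_SPACE_WARNING_THRESHOLD default
): { percent: number; severity: DiskSeverity } {
  // Guard against malformed data instead of dividing by zero.
  if (disk.totalGB <= 0) return { percent: 0, severity: "ok" };

  const percent = (disk.usedGB / disk.totalGB) * 100;
  let severity: DiskSeverity = "ok";
  if (percent > criticalPct) severity = "critical";
  else if (percent > warningPct) severity = "warning";
  return { percent, severity };
}
```

The comparisons are strict (>), so a disk at exactly 85% is not flagged in this sketch.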

UnderProvisionedChecker

File: backend/app/services/recommendations/UnderProvisionedChecker.ts

Detects resource constraints via time-based analysis:

  • CPU: >85% for >2h cumulative → recommend +50% (1.5x)
  • Memory: >90% for >1h OR swap >10% OR available <512MB → recommend +50% to +150% based on severity
  • Disk: >90% usage → recommend +30% (1.3x)

Uses time-weighted analysis, not data point counts.

Env: ENABLE_UNDER_PROVISIONED_CHECKER
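
One way to read "time-weighted analysis" is to sum the elapsed time between consecutive samples whose value exceeds the threshold, rather than counting samples. The sketch below follows that reading for the CPU rule. The MetricSample shape, the function names, and the choice to attribute each interval to its starting sample are assumptions; the 85%, 2h, and 1.5x values come from the rules above.

```typescript
// Hypothetical sample shape: real SystemMetrics rows carry more fields.
interface MetricSample {
  timestamp: Date;
  cpuPercent: number;
}

// Sums the time between consecutive samples where the earlier sample is
// above the threshold. Ten samples one minute apart count as 9 minutes,
// not as "ten data points".
function cumulativeMsAbove(samples: MetricSample[], thresholdPct: number): number {
  const sorted = [...samples].sort(
    (a, b) => a.timestamp.getTime() - b.timestamp.getTime(),
  );
  let totalMs = 0;
  for (let i = 0; i < sorted.length - 1; i++) {
    if (sorted[i].cpuPercent > thresholdPct) {
      totalMs += sorted[i + 1].timestamp.getTime() - sorted[i].timestamp.getTime();
    }
  }
  return totalMs;
}

// Returns the suggested core count (+50%), or null when the rule does not fire.
function recommendCpuCores(currentCores: number, samples: MetricSample[]): number | null {
  const TWO_HOURS_MS = 2 * 60 * 60 * 1000;
  if (cumulativeMsAbove(samples, 85) <= TWO_HOURS_MS) return null;
  return Math.ceil(currentCores * 1.5);
}
```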

Other Resource Checkers

  • OverProvisionedChecker (ENABLE_OVER_PROVISIONED_CHECKER): Detects low utilization, recommends reducing allocation
  • ResourceOptimizationChecker (ENABLE_RESOURCE_OPTIMIZATION_CHECKER): Identifies imbalanced ratios, inefficient layouts
  • DiskIOBottleneckChecker (ENABLE_DISK_IO_BOTTLENECK_CHECKER): Detects high latency, queue depth issues

Security Checkers

DefenderThreatChecker

File: backend/app/services/recommendations/DefenderThreatChecker.ts

Analyzes Windows Defender threats:

  • Active threats (severity ≥4) → critical
  • Quarantined → review needed
  • Recent (<7 days) → increased vigilance

Parse defenderStatus.recent_threats → filter by status/severity → generate recommendations

Env: ENABLE_DEFENDER_THREAT_CHECKER
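
The three rules can be expressed as a small classifier. This is a sketch under assumptions: the DefenderThreat field names (status, severity, detectedAt), the finding labels, and the order in which the rules are tried are all illustrative and not confirmed by the source.

```typescript
// Hypothetical entry shape for defenderStatus.recent_threats.
interface DefenderThreat {
  name: string;
  status: "active" | "quarantined" | "removed";
  severity: number;
  detectedAt: string; // ISO 8601 timestamp
}

type ThreatFinding = "critical" | "review" | "vigilance";

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Applies the rules in order of urgency; returns null when none match.
function classifyThreat(threat: DefenderThreat, now: Date = new Date()): ThreatFinding | null {
  if (threat.status === "active" && threat.severity >= 4) return "critical";
  if (threat.status === "quarantined") return "review";

  // An unparseable date yields NaN, which fails the comparison below.
  const ageMs = now.getTime() - new Date(threat.detectedAt).getTime();
  if (ageMs < SEVEN_DAYS_MS) return "vigilance";
  return null;
}
```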

Other Security Checkers

  • DefenderDisabledChecker (ENABLE_DEFENDER_DISABLED_CHECKER): Checks defenderStatus.enabled field
  • PortConflictChecker (ENABLE_PORT_BLOCKED_CHECKER): Analyzes BlockedConnection records, identifies blocked legitimate traffic

Maintenance Checkers

OsUpdateChecker

File: backend/app/services/recommendations/OsUpdateChecker.ts

Monitors Windows Update status:

  • Update check >7 days → stale
  • Pending updates by severity (Critical/Important/Security/Optional)
  • Reboot pending duration
  • Auto-updates disabled

Severity: Critical (critical updates OR reboot >7d), High (important OR reboot 3-7d), Medium (security/optional OR reboot <3d)

Parse windowsUpdateInfo → count by classification → check reboot → build flags → generate recommendations

Env: ENABLE_OS_UPDATE_CHECKER
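
The severity rules above map onto a short decision function. The UpdateStatus shape, the field names, and the "none" result are assumptions made for this sketch; the thresholds follow the Severity line.

```typescript
// Hypothetical summary derived from windowsUpdateInfo.
interface UpdateStatus {
  criticalCount: number;
  importantCount: number;
  securityCount: number;
  optionalCount: number;
  rebootPendingDays: number | null; // null when no reboot is pending
}

type UpdateSeverity = "critical" | "high" | "medium" | "none";

// Checks the most severe rule first so a VM gets its highest matching level.
function classifyUpdateSeverity(status: UpdateStatus): UpdateSeverity {
  const reboot = status.rebootPendingDays;
  if (status.criticalCount > 0 || (reboot !== null && reboot > 7)) return "critical";
  if (status.importantCount > 0 || (reboot !== null && reboot >= 3)) return "high";
  if (status.securityCount > 0 || status.optionalCount > 0 || reboot !== null) return "medium";
  return "none";
}
```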

AppUpdateChecker

File: backend/app/services/recommendations/AppUpdateChecker.ts

Compares applicationInventory versions against latest releases, recommends updates for critical apps (browsers, etc.)

Env: ENABLE_APP_UPDATE_CHECKER

Data Sources

  • VMHealthSnapshot: Latest or specific snapshot with diskSpaceInfo, defenderStatus, windowsUpdateInfo, applicationInventory
  • SystemMetrics: 7-day window (configurable), 1000 rows max, includes CPU, memory, swap, disk I/O, network stats
  • ProcessSnapshot: Last 60min, process-level CPU/memory usage
  • PortUsage: Current port usage and connection state (100 rows max)
  • Machine: Current allocation (cpuCores, ramGB, diskSizeGB) + department

Context caching: buildContextWithCaching() wraps buildContext() with a cache (key: context:${vmId}:${snapshotId||'latest'}). On a cache hit, all 5 context queries are skipped, which makes generation 50-70% faster.

Generation Flow

generateRecommendations() pipeline:

  1. Disposal check: Verify service not disposed
  2. Cache check: Return if cached (key: recommendations:${vmId}:${snapshotId||'latest'})
  3. Build context: Fetch data with caching
  4. Execute checkers: Run applicable checkers with retry logic (linear backoff: retryDelayMs × attempt)
  5. Aggregate results: Collect all recommendations
  6. Hash comparison: Generate SHA-256 hash, compare with previous
  7. Database persist: Save only if changed
  8. Update cache: Store results (TTL: cacheTTLMinutes)
  9. Log performance: Total time, context build time, per-checker times
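
Steps 6 and 7 (hash comparison, then persist only on change) can be sketched as below. The source only says a SHA-256 hash is compared with the previous one; the canonicalization shown here (serialize each result, sort, hash) and the names hashRecommendations and hasChanged are assumptions. createHash comes from Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

interface RecommendationResult {
  type: string;
  text: string;
  actionText: string;
  data?: unknown;
}

// Produces an order-independent SHA-256 digest of a result set.
// Note: JSON.stringify is sensitive to key order inside `data`, so a real
// implementation may need a stable serializer.
function hashRecommendations(results: RecommendationResult[]): string {
  const canonical = results
    .map((r) => JSON.stringify([r.type, r.text, r.actionText, r.data ?? null]))
    .sort();
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}

// True when the new results differ from what was last persisted.
function hasChanged(previousHash: string | null, results: RecommendationResult[]): boolean {
  return previousHash !== hashRecommendations(results);
}
```

A caller would skip the database write whenever hasChanged returns false, which is how duplicate writes are avoided.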

Retrieval Flow

getRecommendations():

  1. If refresh=true: Call generateRecommendations()
  2. Validate machine exists
  3. Find latest snapshot
  4. Build filter from RecommendationFilterInput (types, createdAfter/Before, limit)
  5. Query database with filters
  6. Staleness check: If >24h and no filter, regenerate
  7. Return recommendations
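
The staleness check in step 6 amounts to a single predicate. A minimal sketch, assuming the age is measured from the newest stored recommendation's createdAt (the source does not say which timestamp is used) and using a hypothetical function name:

```typescript
const STALENESS_LIMIT_MS = 24 * 60 * 60 * 1000; // 24h

// Regenerate only for unfiltered requests whose newest stored
// recommendation is older than 24 hours.
function shouldRegenerate(
  newestCreatedAt: Date,
  hasFilter: boolean,
  now: Date = new Date(),
): boolean {
  if (hasFilter) return false;
  return now.getTime() - newestCreatedAt.getTime() > STALENESS_LIMIT_MS;
}
```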

Caching Strategy

Two-tier caching (both tiers are controlled by the single enableContextCaching flag):

Recommendation cache:

  • Key: recommendations:${vmId}:${snapshotId||'latest'}
  • TTL: cacheTTLMinutes (default: 15min)
  • Max size: maxCacheSize (default: 100)
  • Impact: a cache hit cuts response time by 90-95% (2-5ms vs 150-250ms)

Context cache:

  • Key: context:${vmId}:${snapshotId||'latest'}
  • TTL: contextCacheTTLMinutes (default: 5min)
  • Max size: 50 entries
  • Impact: 50-70% faster generation (eliminates 5 DB queries)

Invalidation: TTL-based expiration, manual refresh via refresh: true, staleness check (>24h)
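
Both caches share the same mechanics: a keyed entry, a TTL, and a size cap. The sketch below shows one possible shape. The TtlCache class, the cacheKey helper, and the oldest-first eviction policy are assumptions (the source does not state how entries are evicted when the cap is reached).

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

class TtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly maxSize: number;

  constructor(ttlMs: number, maxSize: number) {
    this.ttlMs = ttlMs;
    this.maxSize = maxSize;
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Stores a value, evicting the oldest inserted entry when the cache is full.
  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key); // re-inserting moves the key to the end
    if (this.entries.size >= this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Builds keys in the format documented above.
function cacheKey(prefix: "recommendations" | "context", vmId: string, snapshotId?: string): string {
  return `${prefix}:${vmId}:${snapshotId || "latest"}`;
}
```

With the documented defaults, the recommendation cache would be new TtlCache(15 * 60 * 1000, 100) and the context cache new TtlCache(5 * 60 * 1000, 50).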

Error Handling

Safe wrappers: *Safe() methods catch exceptions, log details, update metrics, return { success, recommendations?, error? }

Checker isolation: A failing checker does not stop the run; the failure is logged with context and the remaining checkers still execute

Retry logic: runCheckerWithRetry() with linear backoff (retryDelayMs × attempt), max maxRetries (default: 3)
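
Linear backoff means the wait grows by a fixed step: retryDelayMs × 1, then retryDelayMs × 2, and so on. The sketch below is not the service's runCheckerWithRetry(); it is a generic stand-in named runWithRetry, and it assumes maxRetries counts total attempts (the source does not say whether the first attempt is included). The injectable sleep parameter is only there to keep the example testable.

```typescript
// Default sleep based on setTimeout.
const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task` up to `maxRetries` times, waiting retryDelayMs × attempt
// between failed attempts. Rethrows the last error if every attempt fails.
async function runWithRetry<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  retryDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No wait after the final attempt.
      if (attempt < maxRetries) await sleep(retryDelayMs * attempt);
    }
  }
  throw lastError;
}
```

With the defaults, a task that keeps failing waits 1000ms and then 2000ms before the third and final attempt.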

AppError integration: Standardized errors with codes (INTERNAL_ERROR, NOT_FOUND, DATABASE_ERROR)

Configuration Options

Environment Variables

Cache:

  • RECOMMENDATION_CACHE_TTL_MINUTES (15)
  • RECOMMENDATION_MAX_CACHE_SIZE (100)
  • RECOMMENDATION_CONTEXT_CACHE_TTL_MINUTES (5)
  • RECOMMENDATION_CONTEXT_CACHING (true) - controls BOTH caches

Performance:

  • RECOMMENDATION_PERFORMANCE_MONITORING (true)
  • RECOMMENDATION_PERFORMANCE_THRESHOLD (5000ms)
  • RECOMMENDATION_MAX_RETRIES (3)
  • RECOMMENDATION_RETRY_DELAY_MS (1000)

Data windows:

  • RECOMMENDATION_METRICS_WINDOW_DAYS (7)
  • RECOMMENDATION_METRICS_MAX_ROWS (1000)
  • RECOMMENDATION_PORT_USAGE_MAX_ROWS (100)

Checker toggles (all default to enabled unless set to 'false'):

  • ENABLE_DISK_SPACE_CHECKER
  • ENABLE_RESOURCE_OPTIMIZATION_CHECKER
  • ENABLE_OVER_PROVISIONED_CHECKER
  • ENABLE_UNDER_PROVISIONED_CHECKER
  • ENABLE_DISK_IO_BOTTLENECK_CHECKER
  • ENABLE_DEFENDER_DISABLED_CHECKER
  • ENABLE_DEFENDER_THREAT_CHECKER
  • ENABLE_PORT_BLOCKED_CHECKER
  • ENABLE_OS_UPDATE_CHECKER
  • ENABLE_APP_UPDATE_CHECKER

Checker thresholds:

  • DISK_SPACE_CRITICAL_THRESHOLD (95)
  • DISK_SPACE_WARNING_THRESHOLD (85)
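
Reading these variables might look like the sketch below. The helper names (isCheckerEnabled, readNumber, loadRecommendationConfig) are hypothetical, and falling back to the default on an unparseable number is an assumption. The "enabled unless set to 'false'" rule for checker toggles comes from the list above; applying the same rule to RECOMMENDATION_CONTEXT_CACHING is also an assumption. The environment is passed in as a plain object, so process.env can be supplied by the caller.

```typescript
type Env = Record<string, string | undefined>;

// Checkers are enabled unless the flag is exactly 'false'.
function isCheckerEnabled(env: Env, flag: string): boolean {
  return env[flag] !== "false";
}

// Parses a numeric variable, falling back to the default when it is
// unset, empty, or not a finite number.
function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function loadRecommendationConfig(env: Env) {
  return {
    cacheTTLMinutes: readNumber(env, "RECOMMENDATION_CACHE_TTL_MINUTES", 15),
    maxCacheSize: readNumber(env, "RECOMMENDATION_MAX_CACHE_SIZE", 100),
    contextCacheTTLMinutes: readNumber(env, "RECOMMENDATION_CONTEXT_CACHE_TTL_MINUTES", 5),
    enableContextCaching: env["RECOMMENDATION_CONTEXT_CACHING"] !== "false",
    maxRetries: readNumber(env, "RECOMMENDATION_MAX_RETRIES", 3),
    retryDelayMs: readNumber(env, "RECOMMENDATION_RETRY_DELAY_MS", 1000),
  };
}
```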

GraphQL API

Query: getVMRecommendations

type Query {
  getVMRecommendations(
    vmId: ID!
    refresh: Boolean
    filter: RecommendationFilterInput
  ): [VMRecommendationType!]!
}

Authorization: @Authorized(['USER']), plus an ownership check (user.id === machine.userId OR user.role === 'ADMIN')

Returns: { id, machineId, snapshotId, type, text, actionText, data, createdAt }

RecommendationFilterInput

input RecommendationFilterInput {
  types: [RecommendationType!]
  createdAfter: DateTime
  createdBefore: DateTime
  limit: Int  # default: 20, max: 100
}
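
On the server side, a filter like this is typically turned into query arguments. The sketch below builds a plain object in the style of a Prisma findMany call (where plus take) without importing Prisma. The buildQuery name, the inclusive gte/lte bounds, and the lower clamp of 1 are assumptions; the column names type, createdAt, and machineId are taken from the return shape above, and the limit default of 20 and maximum of 100 come from the schema comment.

```typescript
interface RecommendationFilterInput {
  types?: string[];
  createdAfter?: Date;
  createdBefore?: Date;
  limit?: number;
}

interface RecommendationQuery {
  where: {
    machineId: string;
    type?: { in: string[] };
    createdAt?: { gte?: Date; lte?: Date };
  };
  take: number;
}

function buildQuery(machineId: string, filter: RecommendationFilterInput = {}): RecommendationQuery {
  const where: RecommendationQuery["where"] = { machineId };

  if (filter.types && filter.types.length > 0) {
    where.type = { in: filter.types };
  }

  if (filter.createdAfter || filter.createdBefore) {
    const createdAt: { gte?: Date; lte?: Date } = {};
    if (filter.createdAfter) createdAt.gte = filter.createdAfter;
    if (filter.createdBefore) createdAt.lte = filter.createdBefore;
    where.createdAt = createdAt;
  }

  // Default 20, capped at 100, never below 1.
  const requested = filter.limit ?? 20;
  const take = Math.min(Math.max(Math.trunc(requested), 1), 100);

  return { where, take };
}
```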

Best Practices

Creating New Checkers

  1. Extend BaseRecommendationChecker
  2. Implement analyze(), getName(), getCategory()
  3. Use utility methods (parseAndCalculateDaysSince(), extractDiskSpaceData())
  4. Return detailed data in RecommendationResult.data
  5. Add env config (ENABLE_MY_CHECKER, MY_CHECKER_THRESHOLD)
  6. Register in VMRecommendationService constructor

Error Handling

  • Never throw in analyze() - return empty array
  • Log with context (checker name, VM ID)
  • Validate data structures before access
  • Check for null/undefined
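
Putting the checklist and the error-handling rules together, a new checker might look like the sketch below. Everything here is illustrative: the base class is a trimmed stand-in (synchronous for brevity; the real analyze() signature may differ), MyChecker is a made-up checker, and uptimeDays is a hypothetical snapshot field.

```typescript
interface RecommendationResult {
  type: string;
  text: string;
  actionText: string;
  data?: Record<string, unknown>;
}

// Trimmed stand-in for the real context type.
interface RecommendationContext {
  vmId: string;
  latestSnapshot: { uptimeDays?: unknown } | null;
}

// Stand-in for BaseRecommendationChecker with the documented methods.
abstract class BaseRecommendationChecker {
  abstract analyze(context: RecommendationContext): RecommendationResult[];
  abstract getName(): string;
  abstract getCategory(): string;
  isApplicable(_context: RecommendationContext): boolean {
    return true;
  }
}

class MyChecker extends BaseRecommendationChecker {
  private readonly thresholdDays: number;

  // The threshold would come from MY_CHECKER_THRESHOLD in a real setup.
  constructor(thresholdDays = 30) {
    super();
    this.thresholdDays = thresholdDays;
  }

  getName(): string {
    return "MyChecker";
  }

  getCategory(): string {
    return "Maintenance";
  }

  isApplicable(context: RecommendationContext): boolean {
    return context.latestSnapshot !== null;
  }

  analyze(context: RecommendationContext): RecommendationResult[] {
    try {
      // Validate the data shape before using it.
      const uptime = context.latestSnapshot?.uptimeDays;
      if (typeof uptime !== "number" || uptime <= this.thresholdDays) return [];

      return [
        {
          type: "OTHER",
          text: `VM has been running for ${uptime} days without a restart`,
          actionText: "Schedule a restart",
          data: { uptimeDays: uptime, thresholdDays: this.thresholdDays },
        },
      ];
    } catch (error) {
      // Never throw from analyze(): log with context and return no results.
      console.error(`[${this.getName()}] analysis failed for VM ${context.vmId}`, error);
      return [];
    }
  }
}
```

Registration (step 6) is not shown because the constructor of VMRecommendationService is not described in enough detail here.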

Frontend Integration

const { data, refetch } = useGetVMRecommendationsQuery({
  variables: { vmId, refresh: false, filter: { limit: 20 } },
  pollInterval: 60000
});

// Force refresh
const handleRefresh = () => refetch({ refresh: true });

Troubleshooting

No Recommendations

  • Check health snapshots exist
  • Verify checkers enabled (ENABLE_*_CHECKER)
  • Check InfiniService running in guest
  • Review service logs for errors

Stale Recommendations

  • Reduce RECOMMENDATION_CACHE_TTL_MINUTES
  • Use refresh: true
  • Check staleness trigger (24h)

Slow Generation

  • Enable context caching (RECOMMENDATION_CONTEXT_CACHING=true)
  • Reduce metrics window (RECOMMENDATION_METRICS_WINDOW_DAYS=3)
  • Reduce max rows (RECOMMENDATION_METRICS_MAX_ROWS=500)
  • Disable unused checkers
  • Check logs for slow checkers

Missing Data

  • Verify InfiniService running (systemctl status infinibay-agent)
  • Check health snapshot data structure
  • Review data collection errors in logs

Diagrams

System Architecture

graph TD
    A[VMRecommendationResolver] --> B[VMRecommendationService]
    B --> C[Checkers]
    B --> D[Context Builder]
    D --> E[PostgreSQL]
    B --> F[Caches]
    C --> B

Generation Flow

sequenceDiagram
    User->>Resolver: getVMRecommendations
    Resolver->>Service: getRecommendations
    Service->>Cache: Check cache
    alt Cache Miss
        Service->>DB: Build context
        Service->>Checkers: Execute
        Checkers-->>Service: Results
        Service->>DB: Save if changed
        Service->>Cache: Store
    end
    Service-->>User: Recommendations