VM Recommendation System

Intelligent recommendation engine - checker-based architecture, analysis types, data sources, and automated optimization suggestions

Introduction

The VM recommendation system analyzes VM health data, system metrics, and configuration to generate optimization suggestions. It uses 10 specialized checkers organized into five categories: Storage, Resource Optimization, Security, Maintenance, and Performance.

System Architecture

Architecture

Layers:

API: VMRecommendationResolver in backend/app/graphql/resolvers/VMRecommendationResolver.ts
Service: VMRecommendationService in backend/app/services/VMRecommendationService.ts
Checkers: 10 analyzers implementing RecommendationChecker interface in backend/app/services/recommendations/
Data: Context from VMHealthSnapshot, SystemMetrics, ProcessSnapshot, PortUsage, Machine
Storage: PostgreSQL via Prisma

Key decisions:

Modular checker design (extend BaseRecommendationChecker, register with service)
Two-tier caching (recommendation + context caches)
SHA-256 hash-based change detection (prevents duplicate writes)
Per-checker performance tracking

Recommendation Types and Categories

Recommendation Types

DISK_SPACE_LOW, HIGH_CPU_APP, HIGH_RAM_APP
PORT_BLOCKED, OVER_PROVISIONED, UNDER_PROVISIONED
OS_UPDATE_AVAILABLE, APP_UPDATE_AVAILABLE
DEFENDER_DISABLED, DEFENDER_THREAT, OTHER

VMRecommendationService

File: backend/app/services/VMRecommendationService.ts

The VMRecommendationService class is the central orchestrator. It manages checker execution, caching, and performance monitoring.

Key methods:

generateRecommendations(): Build context → execute checkers → aggregate → detect changes → persist
getRecommendations(): Retrieve with optional refresh, staleness check (>24h)
Safe wrappers: *Safe() versions prevent error leakage
buildContext(): Fetch health snapshots, metrics, processes, ports, machine config
buildContextWithCaching(): Context building with cache layer

Configuration

Parameter	Default	Description
`cacheTTLMinutes`	15	Recommendation cache TTL
`maxCacheSize`	100	Max cache entries
`enableContextCaching`	true	Controls BOTH caches
`contextCacheTTLMinutes`	5	Context cache TTL
`maxRetries`	3	Retry attempts
`retryDelayMs`	1000	Linear backoff base delay
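
The defaults in the table can be captured as a typed object. This is a minimal sketch, assuming the service accepts partial overrides merged over defaults; the names `RecommendationServiceConfig`, `DEFAULT_CONFIG`, and `resolveConfig` are illustrative and not taken from the codebase.

```typescript
// Hypothetical config shape; field names and defaults follow the table above.
interface RecommendationServiceConfig {
  cacheTTLMinutes: number;
  maxCacheSize: number;
  enableContextCaching: boolean; // controls BOTH caches
  contextCacheTTLMinutes: number;
  maxRetries: number;
  retryDelayMs: number;
}

const DEFAULT_CONFIG: RecommendationServiceConfig = {
  cacheTTLMinutes: 15,
  maxCacheSize: 100,
  enableContextCaching: true,
  contextCacheTTLMinutes: 5,
  maxRetries: 3,
  retryDelayMs: 1000,
};

// Merge caller overrides over the documented defaults.
function resolveConfig(
  overrides: Partial<RecommendationServiceConfig> = {},
): RecommendationServiceConfig {
  return { ...DEFAULT_CONFIG, ...overrides };
}
```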

Performance Metrics

totalGenerations, averageGenerationTime, cacheHitRate, cacheHits/Misses, contextBuildTime, checkerTimes, errorCount, lastError

BaseRecommendationChecker

File: backend/app/services/recommendations/BaseRecommendationChecker.ts

Abstract base class for all checkers with common interface and utilities.

Methods (analyze, getName, and getCategory are abstract):

analyze(context): Returns RecommendationResult[]
getName(): Checker identifier
getCategory(): Category string
isApplicable(context): Optional override; returns true by default

Utilities:

parseAndCalculateDaysSince(): Date parsing
extractDiskSpaceData(): Disk usage extraction
looksLikeDiskUsageData(): Data validation

Interfaces:

RecommendationContext: { vmId, latestSnapshot, historicalMetrics, recentProcessSnapshots, portUsage, machineConfig }
RecommendationResult: { type, text, actionText, data? }
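
The contract described above might look roughly like the following. This is a sketch under stated assumptions: the field types in `RecommendationContext` are guesses (only the field names are documented), `analyze()` is shown as synchronous although the real method may be async, and `NoopChecker` is a hypothetical subclass used only to show the shape.

```typescript
// Field types are assumptions; the documentation lists only the field names.
interface RecommendationContext {
  vmId: string;
  latestSnapshot: unknown;
  historicalMetrics: unknown[];
  recentProcessSnapshots: unknown[];
  portUsage: unknown[];
  machineConfig: unknown;
}

interface RecommendationResult {
  type: string;
  text: string;
  actionText: string;
  data?: Record<string, unknown>;
}

abstract class BaseRecommendationChecker {
  abstract analyze(context: RecommendationContext): RecommendationResult[];
  abstract getName(): string;
  abstract getCategory(): string;

  // Optional hook: checkers apply to every VM unless they override this.
  isApplicable(_context: RecommendationContext): boolean {
    return true;
  }
}

// Hypothetical checker that never produces recommendations.
class NoopChecker extends BaseRecommendationChecker {
  analyze(_context: RecommendationContext): RecommendationResult[] {
    return [];
  }
  getName(): string {
    return "NoopChecker";
  }
  getCategory(): string {
    return "Maintenance";
  }
}
```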

Checkers

DiskSpaceChecker

File: backend/app/services/recommendations/DiskSpaceChecker.ts

Monitors disk usage and generates a critical alert when usage exceeds 95% and a warning when it exceeds 85%.

Logic: Parse diskSpaceInfo JSON → calculate (used/total) × 100 → compare thresholds → generate recommendations

Env: ENABLE_DISK_SPACE_CHECKER, DISK_SPACE_CRITICAL_THRESHOLD (95), DISK_SPACE_WARNING_THRESHOLD (85)
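
The threshold comparison can be sketched as a pure function. The name `classifyDiskUsage` and the `DiskSeverity` type are illustrative; the real checker reads its inputs from the parsed diskSpaceInfo JSON, whose exact shape is not shown here. Both arguments are assumed to be in the same unit.

```typescript
type DiskSeverity = "critical" | "warning" | "ok";

// Classify one disk by usage percentage; defaults mirror the documented thresholds.
function classifyDiskUsage(
  used: number,
  total: number,
  criticalPct = 95,
  warningPct = 85,
): DiskSeverity {
  if (total <= 0) return "ok"; // usage cannot be computed; treat as no finding
  const usagePct = (used / total) * 100;
  if (usagePct > criticalPct) return "critical";
  if (usagePct > warningPct) return "warning";
  return "ok";
}
```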

UnderProvisionedChecker

File: backend/app/services/recommendations/UnderProvisionedChecker.ts

Detects resource constraints via time-based analysis:

CPU: >85% for >2h cumulative → recommend +50% (1.5x)
Memory: >90% for >1h OR swap >10% OR available <512MB → recommend +50% to +150% based on severity
Disk: >90% usage → recommend +30% (1.3x)

Uses time-weighted analysis, not data point counts.

Env: ENABLE_UNDER_PROVISIONED_CHECKER
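
To illustrate what time-weighted analysis means for the CPU rule, the sketch below sums the duration of the intervals that start with a sample above the threshold, instead of counting samples. The `MetricSample` shape and all function names are assumptions, and rounding the 1.5x suggestion up to a whole core is an illustrative choice.

```typescript
// Hypothetical sample shape; the real SystemMetrics rows carry more fields.
interface MetricSample {
  timestamp: Date;
  cpuUsagePercent: number;
}

// Total milliseconds covered by intervals whose starting sample exceeds the threshold.
function cumulativeMsAbove(samples: MetricSample[], thresholdPct: number): number {
  const sorted = [...samples].sort(
    (a, b) => a.timestamp.getTime() - b.timestamp.getTime(),
  );
  let totalMs = 0;
  for (let i = 0; i < sorted.length - 1; i++) {
    if (sorted[i].cpuUsagePercent > thresholdPct) {
      totalMs += sorted[i + 1].timestamp.getTime() - sorted[i].timestamp.getTime();
    }
  }
  return totalMs;
}

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// CPU rule from the text: >85% for more than 2 hours cumulative.
function isCpuUnderProvisioned(samples: MetricSample[]): boolean {
  return cumulativeMsAbove(samples, 85) > TWO_HOURS_MS;
}

// +50% (1.5x), rounded up to a whole core.
function recommendedCpuCores(currentCores: number): number {
  return Math.ceil(currentCores * 1.5);
}
```

Weighting by elapsed time keeps the result stable when samples arrive at irregular intervals, which a plain count of data points would not.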

Other Resource Checkers

OverProvisionedChecker (ENABLE_OVER_PROVISIONED_CHECKER): Detects low utilization, recommends reducing allocation
ResourceOptimizationChecker (ENABLE_RESOURCE_OPTIMIZATION_CHECKER): Identifies imbalanced ratios, inefficient layouts
DiskIOBottleneckChecker (ENABLE_DISK_IO_BOTTLENECK_CHECKER): Detects high latency, queue depth issues

Security Checkers

DefenderThreatChecker

File: backend/app/services/recommendations/DefenderThreatChecker.ts

Analyzes Windows Defender threats:

Active threats (severity ≥4) → critical
Quarantined → review needed
Recent (<7 days) → increased vigilance

Parse defenderStatus.recent_threats → filter by status/severity → generate recommendations

Env: ENABLE_DEFENDER_THREAT_CHECKER

Other Security Checkers

DefenderDisabledChecker (ENABLE_DEFENDER_DISABLED_CHECKER): Checks defenderStatus.enabled field
PortConflictChecker (ENABLE_PORT_BLOCKED_CHECKER): Analyzes BlockedConnection records, identifies blocked legitimate traffic

Maintenance Checkers

OsUpdateChecker

File: backend/app/services/recommendations/OsUpdateChecker.ts

Monitors Windows Update status:

Update check >7 days → stale
Pending updates by severity (Critical/Important/Security/Optional)
Reboot pending duration
Auto-updates disabled

Severity: Critical (critical updates OR reboot >7d), High (important OR reboot 3-7d), Medium (security/optional OR reboot <3d)

Parse windowsUpdateInfo → count by classification → check reboot → build flags → generate recommendations

Env: ENABLE_OS_UPDATE_CHECKER
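
The severity rules can be read as a first-match cascade. The sketch below assumes a summarized input (`PendingUpdateSummary`, a hypothetical shape derived from windowsUpdateInfo) and treats the 3-7 day range as inclusive at both ends; neither detail is confirmed by the source.

```typescript
// Hypothetical summary of windowsUpdateInfo; the real structure is not documented here.
interface PendingUpdateSummary {
  critical: number;
  important: number;
  security: number;
  optional: number;
  rebootPendingDays: number | null; // null when no reboot is pending
}

type UpdateSeverity = "critical" | "high" | "medium" | "none";

// First matching rule wins, from most to least severe.
function classifyUpdateSeverity(summary: PendingUpdateSummary): UpdateSeverity {
  const reboot = summary.rebootPendingDays;
  if (summary.critical > 0 || (reboot !== null && reboot > 7)) return "critical";
  if (summary.important > 0 || (reboot !== null && reboot >= 3)) return "high";
  if (summary.security > 0 || summary.optional > 0 || reboot !== null) return "medium";
  return "none";
}
```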

AppUpdateChecker

File: backend/app/services/recommendations/AppUpdateChecker.ts

Compares applicationInventory versions against latest releases, recommends updates for critical apps (browsers, etc.)

Env: ENABLE_APP_UPDATE_CHECKER

Data Sources

VMHealthSnapshot: Latest or specific snapshot with diskSpaceInfo, defenderStatus, windowsUpdateInfo, applicationInventory
SystemMetrics: 7-day window (configurable), 1000 rows max, includes CPU, memory, swap, disk I/O, network stats
ProcessSnapshot: Last 60min, process-level CPU/memory usage
PortUsage: Current port usage and connection state (100 rows max)
Machine: Current allocation (cpuCores, ramGB, diskSizeGB) + department

Context caching: buildContextWithCaching() wraps buildContext() with a cache (key: context:${vmId}:${snapshotId||'latest'}). On a cache hit, all 5 DB queries are skipped, which makes generation 50-70% faster.

Generation Flow

generateRecommendations() pipeline:

Disposal check: Verify service not disposed
Cache check: Return if cached (key: recommendations:${vmId}:${snapshotId||'latest'})
Build context: Fetch data with caching
Execute checkers: Run applicable checkers with retry logic (linear backoff: retryDelayMs × attempt)
Aggregate results: Collect all recommendations
Hash comparison: Generate SHA-256 hash, compare with previous
Database persist: Save only if changed
Update cache: Store results (TTL: cacheTTLMinutes)
Log performance: Total time, context build time, per-checker times
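
The hash comparison step can be sketched as follows, using Node's built-in `crypto` module. Which fields the service actually hashes, and whether it sorts results first, is not documented, so the canonical form below is an assumption; `hashRecommendations` and `hasChanged` are illustrative names.

```typescript
import { createHash } from "node:crypto";

interface RecommendationResult {
  type: string;
  text: string;
  actionText: string;
  data?: Record<string, unknown>;
}

// Hash a canonical form so that result ordering alone never counts as a change.
// Note: key order inside `data` still affects the hash in this sketch.
function hashRecommendations(results: RecommendationResult[]): string {
  const canonical = results
    .map((r) => JSON.stringify([r.type, r.text, r.actionText, r.data ?? null]))
    .sort();
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}

// Persist only when the new hash differs from the stored one.
function hasChanged(previousHash: string | null, results: RecommendationResult[]): boolean {
  return previousHash !== hashRecommendations(results);
}
```

Skipping the write when the hash matches is what prevents duplicate rows for unchanged recommendation sets.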

Retrieval Flow

getRecommendations():

If refresh=true: Call generateRecommendations()
Validate machine exists
Find latest snapshot
Build filter from RecommendationFilterInput (types, createdAfter/Before, limit)
Query database with filters
Staleness check: If >24h and no filter, regenerate
Return recommendations
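
The staleness rule in this flow reduces to a small predicate. This sketch assumes staleness is measured from the newest recommendation's createdAt timestamp, which the source does not state; the function names are illustrative.

```typescript
const STALE_AFTER_MS = 24 * 60 * 60 * 1000; // 24 hours

function isStale(createdAt: Date, now: Date): boolean {
  return now.getTime() - createdAt.getTime() > STALE_AFTER_MS;
}

// Regenerate only for unfiltered requests whose newest result is older than 24h.
function shouldRegenerate(
  newestCreatedAt: Date,
  hasFilter: boolean,
  now: Date = new Date(),
): boolean {
  return !hasFilter && isStale(newestCreatedAt, now);
}
```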

Caching Strategy

Two-tier caching (both tiers are controlled by the single enableContextCaching flag):

Recommendation cache:

Key: recommendations:${vmId}:${snapshotId||'latest'}
TTL: cacheTTLMinutes (default: 15min)
Max size: maxCacheSize (default: 100)
Impact: 90-95% reduction in response time on a cache hit (2-5ms vs 150-250ms)

Context cache:

Key: context:${vmId}:${snapshotId||'latest'}
TTL: contextCacheTTLMinutes (default: 5min)
Max size: 50 entries
Impact: 50-70% faster generation (eliminates 5 DB queries)

Invalidation: TTL-based expiration, manual refresh via refresh: true, staleness check (>24h)
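
A cache with TTL expiry and a size cap, as both tiers use, could be sketched like this. `TtlCache` and `cacheKey` are hypothetical names, and evicting the oldest inserted entry when the cache is full is an assumption; the source does not describe the eviction policy.

```typescript
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private maxSize: number;

  constructor(ttlMs: number, maxSize: number) {
    this.ttlMs = ttlMs;
    this.maxSize = maxSize;
  }

  // Returns undefined for missing or expired entries; expired entries are dropped.
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxSize) {
      // Map preserves insertion order, so the first key is the oldest entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Key format from the text: <prefix>:<vmId>:<snapshotId or 'latest'>
function cacheKey(
  prefix: "recommendations" | "context",
  vmId: string,
  snapshotId?: string,
): string {
  return `${prefix}:${vmId}:${snapshotId || "latest"}`;
}
```

With the documented defaults, the recommendation tier would be `new TtlCache(15 * 60 * 1000, 100)` and the context tier `new TtlCache(5 * 60 * 1000, 50)`.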

Error Handling

Safe wrappers: *Safe() methods catch exceptions, log details, update metrics, return { success, recommendations?, error? }

Checker isolation: A failure in one checker is logged with context and does not stop the remaining checkers

Retry logic: runCheckerWithRetry() with linear backoff (retryDelayMs × attempt), max maxRetries (default: 3)

AppError integration: Standardized errors with codes (INTERNAL_ERROR, NOT_FOUND, DATABASE_ERROR)
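
The retry behavior can be sketched as below. The name runCheckerWithRetry comes from the text, but its signature here is assumed, as is the reading of maxRetries as the total number of attempts. The `sleep` parameter is a hypothetical injection point that makes the delay testable.

```typescript
// Linear backoff: the delay grows by retryDelayMs with each attempt.
function linearBackoffMs(attempt: number, retryDelayMs: number): number {
  return retryDelayMs * attempt;
}

async function runCheckerWithRetry<T>(
  run: () => Promise<T>,
  maxRetries = 3,
  retryDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await run();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxRetries) {
        await sleep(linearBackoffMs(attempt, retryDelayMs));
      }
    }
  }
  throw lastError;
}
```

With the defaults, a checker that keeps failing waits 1000 ms and then 2000 ms before the final attempt. A caller would catch the final error per checker so that one failure does not abort the others.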

Configuration Options

Environment Variables

Cache:

RECOMMENDATION_CACHE_TTL_MINUTES (15)
RECOMMENDATION_MAX_CACHE_SIZE (100)
RECOMMENDATION_CONTEXT_CACHE_TTL_MINUTES (5)
RECOMMENDATION_CONTEXT_CACHING (true) - controls BOTH caches

Performance:

RECOMMENDATION_PERFORMANCE_MONITORING (true)
RECOMMENDATION_PERFORMANCE_THRESHOLD (5000ms)
RECOMMENDATION_MAX_RETRIES (3)
RECOMMENDATION_RETRY_DELAY_MS (1000)

Data windows:

RECOMMENDATION_METRICS_WINDOW_DAYS (7)
RECOMMENDATION_METRICS_MAX_ROWS (1000)
RECOMMENDATION_PORT_USAGE_MAX_ROWS (100)

Checker toggles (all default to enabled unless set to 'false'):

ENABLE_DISK_SPACE_CHECKER
ENABLE_RESOURCE_OPTIMIZATION_CHECKER
ENABLE_OVER_PROVISIONED_CHECKER
ENABLE_UNDER_PROVISIONED_CHECKER
ENABLE_DISK_IO_BOTTLENECK_CHECKER
ENABLE_DEFENDER_DISABLED_CHECKER
ENABLE_DEFENDER_THREAT_CHECKER
ENABLE_PORT_BLOCKED_CHECKER
ENABLE_OS_UPDATE_CHECKER
ENABLE_APP_UPDATE_CHECKER

Checker thresholds:

DISK_SPACE_CRITICAL_THRESHOLD (95)
DISK_SPACE_WARNING_THRESHOLD (85)
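
Reading these variables could look like the sketch below. The helpers take the environment as a parameter (for example `process.env`) so they stay testable. Both helper names are illustrative, and falling back to the default on a non-numeric value is an assumption about how invalid input is handled.

```typescript
type Env = Record<string, string | undefined>;

// Checkers are enabled unless the variable is exactly the string 'false'.
function isCheckerEnabled(env: Env, variable: string): boolean {
  return env[variable] !== "false";
}

// Parse a numeric setting, falling back to the documented default.
function readNumber(env: Env, variable: string, fallback: number): number {
  const raw = env[variable];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}
```

For example, `readNumber(process.env, "DISK_SPACE_CRITICAL_THRESHOLD", 95)` yields 95 when the variable is unset.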

GraphQL API

Query: getVMRecommendations

type Query {
  getVMRecommendations(
    vmId: ID!
    refresh: Boolean
    filter: RecommendationFilterInput
  ): [VMRecommendationType!]!
}

Authorization: @Authorized(['USER']) - ownership check (user.id === machine.userId OR user.role === 'ADMIN')

Returns: { id, machineId, snapshotId, type, text, actionText, data, createdAt }
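
The ownership rule from the authorization note can be expressed as a predicate. The `AuthUser` and `OwnedMachine` shapes and the function name are assumptions; only the comparison itself comes from the text.

```typescript
// Hypothetical minimal shapes for the resolver's user and machine objects.
interface AuthUser {
  id: string;
  role: string;
}

interface OwnedMachine {
  userId: string | null;
}

// Owners can read their own VM's recommendations; admins can read any.
function canViewRecommendations(user: AuthUser, machine: OwnedMachine): boolean {
  return user.id === machine.userId || user.role === "ADMIN";
}
```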

RecommendationFilterInput

input RecommendationFilterInput {
  types: [RecommendationType!]
  createdAfter: DateTime
  createdBefore: DateTime
  limit: Int  # default: 20, max: 100
}
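
A client operation against this schema, together with the limit rule from the filter input, might look like the following sketch. The operation name `GetVMRecommendations` and the helper `normalizeLimit` are illustrative, and clamping to a minimum of 1 is an assumption; the source gives only the default and the maximum.

```typescript
// Client-side operation; the selected fields match the documented return shape.
const GET_VM_RECOMMENDATIONS = `
  query GetVMRecommendations(
    $vmId: ID!
    $refresh: Boolean
    $filter: RecommendationFilterInput
  ) {
    getVMRecommendations(vmId: $vmId, refresh: $refresh, filter: $filter) {
      id
      machineId
      snapshotId
      type
      text
      actionText
      data
      createdAt
    }
  }
`;

// Apply the documented default (20) and maximum (100).
function normalizeLimit(limit?: number | null): number {
  if (limit == null) return 20;
  return Math.min(Math.max(Math.trunc(limit), 1), 100);
}
```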

Best Practices

Creating New Checkers

Extend BaseRecommendationChecker
Implement analyze(), getName(), getCategory()
Use utility methods (parseAndCalculateDaysSince(), extractDiskSpaceData())
Return detailed data in RecommendationResult.data
Add env config (ENABLE_MY_CHECKER, MY_CHECKER_THRESHOLD)
Register in VMRecommendationService constructor

Error Handling

Never throw in analyze() - return empty array
Log with context (checker name, VM ID)
Validate data structures before access
Check for null/undefined
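
These rules can be illustrated with a defensive analysis function. The sketch is written as a standalone function for brevity, where a real checker would put this logic in its analyze() method. The `MinimalContext` shape and the recommendation wording are hypothetical; the defenderStatus.enabled field and the DEFENDER_DISABLED type come from this document.

```typescript
interface RecommendationResult {
  type: string;
  text: string;
  actionText: string;
  data?: Record<string, unknown>;
}

// Hypothetical, reduced context: only the fields this example reads.
interface MinimalContext {
  vmId: string;
  latestSnapshot: { defenderStatus?: unknown } | null;
}

function analyzeDefenderDisabled(context: MinimalContext): RecommendationResult[] {
  try {
    // Validate the structure before reading from it.
    const status = context.latestSnapshot?.defenderStatus;
    if (typeof status !== "object" || status === null) return [];
    const enabled = (status as { enabled?: unknown }).enabled;
    if (enabled !== false) return [];
    return [
      {
        type: "DEFENDER_DISABLED",
        text: "Windows Defender is disabled on this VM.", // illustrative wording
        actionText: "Enable Windows Defender",
        data: { enabled },
      },
    ];
  } catch (err) {
    // Never throw: log with checker name and VM ID, then return an empty array.
    console.error(`[DefenderDisabledChecker] vmId=${context.vmId}`, err);
    return [];
  }
}
```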

Frontend Integration

const { data, refetch } = useGetVMRecommendationsQuery({
  variables: { vmId, refresh: false, filter: { limit: 20 } },
  pollInterval: 60000
});

// Force refresh
const handleRefresh = () => refetch({ refresh: true });

Troubleshooting

No Recommendations

Check health snapshots exist
Verify checkers enabled (ENABLE_*_CHECKER)
Check InfiniService running in guest
Review service logs for errors

Stale Recommendations

Reduce RECOMMENDATION_CACHE_TTL_MINUTES
Use refresh: true
Check staleness trigger (24h)

Slow Generation

Enable context caching (RECOMMENDATION_CONTEXT_CACHING=true)
Reduce metrics window (RECOMMENDATION_METRICS_WINDOW_DAYS=3)
Reduce max rows (RECOMMENDATION_METRICS_MAX_ROWS=500)
Disable unused checkers
Check logs for slow checkers

Missing Data

Verify InfiniService running (systemctl status infinibay-agent)
Check health snapshot data structure
Review data collection errors in logs

Diagrams

System Architecture

graph TD
    A[VMRecommendationResolver] --> B[VMRecommendationService]
    B --> C[Checkers]
    B --> D[Context Builder]
    D --> E[PostgreSQL]
    B --> F[Caches]
    C --> B

Generation Flow

sequenceDiagram
    User->>Resolver: getVMRecommendations
    Resolver->>Service: getRecommendations
    Service->>Cache: Check cache
    alt Cache Miss
        Service->>DB: Build context
        Service->>Checkers: Execute
        Checkers-->>Service: Results
        Service->>DB: Save if changed
        Service->>Cache: Store
    end
    Service-->>User: Recommendations
