VM Recommendation System
Intelligent recommendation engine - checker-based architecture, analysis types, data sources, and automated optimization suggestions
Introduction
The VM recommendation system analyzes VM health data, system metrics, and configuration to generate optimization suggestions. It uses 10 specialized checkers organized by category: Storage, Resource Optimization, Security, Maintenance, and Performance.
System Architecture
Architecture
Layers:
- API: VMRecommendationResolver in backend/app/graphql/resolvers/VMRecommendationResolver.ts
- Service: VMRecommendationService in backend/app/services/VMRecommendationService.ts
- Checkers: 10 analyzers implementing the RecommendationChecker interface in backend/app/services/recommendations/
- Data: Context from VMHealthSnapshot, SystemMetrics, ProcessSnapshot, PortUsage, Machine
- Storage: PostgreSQL via Prisma
Key decisions:
- Modular checker design (extend BaseRecommendationChecker, register with service)
- Two-tier caching (recommendation + context caches)
- SHA-256 hash-based change detection (prevents duplicate writes)
- Per-checker performance tracking
Recommendation Types and Categories
Recommendation Types
- DISK_SPACE_LOW, HIGH_CPU_APP, HIGH_RAM_APP
- PORT_BLOCKED, OVER_PROVISIONED, UNDER_PROVISIONED
- OS_UPDATE_AVAILABLE, APP_UPDATE_AVAILABLE
- DEFENDER_DISABLED, DEFENDER_THREAT, OTHER
Categories
- Storage: Disk space monitoring
- Resource Optimization: CPU/RAM/disk allocation analysis
- Security: Defender status, threat detection, firewall
- Maintenance: OS/app updates
- Performance: I/O bottlenecks, process optimization
VMRecommendationService
File: backend/app/services/VMRecommendationService.ts
The VMRecommendationService class is the central orchestrator: it manages checker execution, caching, and performance monitoring.
Key methods:
- generateRecommendations(): Build context → execute checkers → aggregate → detect changes → persist
- getRecommendations(): Retrieve with optional refresh, staleness check (>24h)
- Safe wrappers: *Safe() versions prevent error leakage
- buildContext(): Fetch health snapshots, metrics, processes, ports, machine config
- buildContextWithCaching(): Context building with cache layer
Configuration
| Parameter | Default | Description |
|---|---|---|
| cacheTTLMinutes | 15 | Recommendation cache TTL (minutes) |
| maxCacheSize | 100 | Max cache entries |
| enableContextCaching | true | Controls BOTH caches |
| contextCacheTTLMinutes | 5 | Context cache TTL (minutes) |
| maxRetries | 3 | Retry attempts |
| retryDelayMs | 1000 | Linear backoff base delay (ms) |
Performance Metrics
totalGenerations, averageGenerationTime, cacheHitRate, cacheHits/Misses, contextBuildTime, checkerTimes, errorCount, lastError
BaseRecommendationChecker
File: backend/app/services/recommendations/BaseRecommendationChecker.ts
Abstract base class for all checkers with common interface and utilities.
Abstract methods:
- analyze(context): Returns RecommendationResult[]
- getName(): Checker identifier
- getCategory(): Category string
- isApplicable(context): Optional override, defaults to true
Utilities:
- parseAndCalculateDaysSince(): Date parsing
- extractDiskSpaceData(): Disk usage extraction
- looksLikeDiskUsageData(): Data validation
Interfaces:
RecommendationContext: { vmId, latestSnapshot, historicalMetrics, recentProcessSnapshots, portUsage, machineConfig }
RecommendationResult: { type, text, actionText, data? }
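To make the contract above concrete, here is a minimal sketch of a checker built on it. The field types in RecommendationContext, the simplified SketchBaseChecker class, and the synchronous analyze() signature are assumptions for illustration; the real definitions live in BaseRecommendationChecker.ts and may differ.

```typescript
// Shapes follow the interfaces listed above; the field types are assumptions.
interface RecommendationResult {
  type: string;
  text: string;
  actionText: string;
  data?: Record<string, unknown>;
}

interface RecommendationContext {
  vmId: string;
  latestSnapshot: { defenderStatus?: { enabled?: boolean } } | null;
  historicalMetrics: unknown[];
  recentProcessSnapshots: unknown[];
  portUsage: unknown[];
  machineConfig: unknown;
}

// Simplified stand-in for BaseRecommendationChecker (not the real class).
abstract class SketchBaseChecker {
  abstract analyze(context: RecommendationContext): RecommendationResult[];
  abstract getName(): string;
  abstract getCategory(): string;

  // Optional hook: a checker applies to every VM unless it overrides this.
  isApplicable(_context: RecommendationContext): boolean {
    return true;
  }
}

// Hypothetical checker modelled on the defenderStatus.enabled check.
class SketchDefenderDisabledChecker extends SketchBaseChecker {
  override getName(): string {
    return "SketchDefenderDisabledChecker";
  }

  override getCategory(): string {
    return "Security";
  }

  // Skip VMs that have no health snapshot yet.
  override isApplicable(context: RecommendationContext): boolean {
    return context.latestSnapshot !== null;
  }

  override analyze(context: RecommendationContext): RecommendationResult[] {
    const enabled = context.latestSnapshot?.defenderStatus?.enabled;
    // Missing data is not evidence of a problem: return nothing rather than throw.
    if (enabled !== false) return [];
    return [
      {
        type: "DEFENDER_DISABLED",
        text: "Windows Defender is disabled on this VM.",
        actionText: "Enable real-time protection.",
        data: { vmId: context.vmId },
      },
    ];
  }
}
```

Returning an empty array for missing or unexpected data keeps one checker's input problems from affecting the others.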
Checkers
DiskSpaceChecker
File: backend/app/services/recommendations/DiskSpaceChecker.ts
Monitors disk usage and generates alerts when usage exceeds 95% (critical) or 85% (warning).
Logic: Parse diskSpaceInfo JSON → calculate (used/total) × 100 → compare thresholds → generate recommendations
Env: ENABLE_DISK_SPACE_CHECKER, DISK_SPACE_CRITICAL_THRESHOLD (95), DISK_SPACE_WARNING_THRESHOLD (85)
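The threshold logic above can be sketched as a small pure function. The DiskEntry shape (one used/total pair per drive) and the name classifyDiskUsage are assumptions; the actual diskSpaceInfo JSON layout is not specified here.

```typescript
type DiskSeverity = "critical" | "warning" | "ok";

// Assumed shape of one diskSpaceInfo entry; the real JSON layout may differ.
interface DiskEntry {
  used: number; // same unit as total (for example GB)
  total: number;
}

// Hypothetical helper: compute (used/total) × 100 and compare with the thresholds.
function classifyDiskUsage(
  entry: DiskEntry,
  criticalThreshold = 95, // DISK_SPACE_CRITICAL_THRESHOLD default
  warningThreshold = 85, // DISK_SPACE_WARNING_THRESHOLD default
): { usagePercent: number; severity: DiskSeverity } | null {
  // Guard against malformed data instead of dividing by zero.
  if (!Number.isFinite(entry.used) || !Number.isFinite(entry.total) || entry.total <= 0) {
    return null;
  }
  const usagePercent = (entry.used / entry.total) * 100;
  const severity: DiskSeverity =
    usagePercent > criticalThreshold
      ? "critical"
      : usagePercent > warningThreshold
        ? "warning"
        : "ok";
  return { usagePercent, severity };
}
```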
UnderProvisionedChecker
File: backend/app/services/recommendations/UnderProvisionedChecker.ts
Detects resource constraints via time-based analysis:
- CPU: >85% for >2h cumulative → recommend +50% (1.5x)
- Memory: >90% for >1h OR swap >10% OR available <512MB → recommend +50% to +150% based on severity
- Disk: >90% usage → recommend +30% (1.3x)
Uses time-weighted analysis, not data point counts.
Env: ENABLE_UNDER_PROVISIONED_CHECKER
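One way to read "time-weighted analysis, not data point counts" is to sum the wall-clock time between samples rather than counting samples above the threshold. The sketch below does that for the CPU rule. The MetricSample field names, the gap cap (maxGapMs), and rounding the suggested core count up are assumptions, not details taken from the checker.

```typescript
interface MetricSample {
  timestamp: number; // epoch milliseconds (assumed)
  cpuUsagePercent: number; // assumed field name
}

// Sum the time spent above `threshold`, attributing each interval to the
// sample that opens it. Intervals longer than maxGapMs are capped so that a
// collection outage is not counted as sustained load (an assumed policy).
function cumulativeMsAbove(
  samples: MetricSample[],
  threshold: number,
  maxGapMs: number = 5 * 60 * 1000,
): number {
  const sorted = [...samples].sort((a, b) => a.timestamp - b.timestamp);
  let total = 0;
  let previous: MetricSample | undefined;
  for (const sample of sorted) {
    if (previous !== undefined && previous.cpuUsagePercent > threshold) {
      total += Math.min(sample.timestamp - previous.timestamp, maxGapMs);
    }
    previous = sample;
  }
  return total;
}

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// CPU rule from the list above: >85% for more than 2h cumulative.
function isCpuUnderProvisioned(samples: MetricSample[]): boolean {
  return cumulativeMsAbove(samples, 85) > TWO_HOURS_MS;
}

// +50% (1.5x), rounded up to a whole core (rounding is an assumption).
function recommendedCpuCores(currentCores: number): number {
  return Math.ceil(currentCores * 1.5);
}
```

With this approach, two high readings three hours apart contribute at most one capped interval, whereas a point-count approach could not tell them apart from two adjacent readings.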
Other Resource Checkers
OverProvisionedChecker (ENABLE_OVER_PROVISIONED_CHECKER): Detects low utilization, recommends reducing allocation
ResourceOptimizationChecker (ENABLE_RESOURCE_OPTIMIZATION_CHECKER): Identifies imbalanced ratios, inefficient layouts
DiskIOBottleneckChecker (ENABLE_DISK_IO_BOTTLENECK_CHECKER): Detects high latency, queue depth issues
Security Checkers
DefenderThreatChecker
File: backend/app/services/recommendations/DefenderThreatChecker.ts
Analyzes Windows Defender threats:
- Active threats (severity ≥4) → critical
- Quarantined → review needed
- Recent (<7 days) → increased vigilance
Logic: Parse defenderStatus.recent_threats → filter by status/severity → generate recommendations
Env: ENABLE_DEFENDER_THREAT_CHECKER
Other Security Checkers
DefenderDisabledChecker (ENABLE_DEFENDER_DISABLED_CHECKER): Checks defenderStatus.enabled field
PortConflictChecker (ENABLE_PORT_BLOCKED_CHECKER): Analyzes BlockedConnection records, identifies blocked legitimate traffic
Maintenance Checkers
OsUpdateChecker
File: backend/app/services/recommendations/OsUpdateChecker.ts
Monitors Windows Update status:
- Update check >7 days → stale
- Pending updates by severity (Critical/Important/Security/Optional)
- Reboot pending duration
- Auto-updates disabled
Severity: Critical (critical updates OR reboot >7d), High (important OR reboot 3-7d), Medium (security/optional OR reboot <3d)
Logic: Parse windowsUpdateInfo → count by classification → check reboot → build flags → generate recommendations
Env: ENABLE_OS_UPDATE_CHECKER
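The severity rules above translate directly into a decision function. In this sketch the UpdateSummary shape (the counts already extracted from windowsUpdateInfo), the "none" result, and the treatment of exactly 3 and exactly 7 days as High are assumptions.

```typescript
type UpdateSeverity = "critical" | "high" | "medium" | "none";

// Assumed summary of windowsUpdateInfo after counting by classification.
interface UpdateSummary {
  criticalCount: number;
  importantCount: number;
  securityCount: number;
  optionalCount: number;
  rebootPendingDays: number | null; // null when no reboot is pending
}

// Hypothetical helper: checks the strongest condition first, so a VM with
// both critical and optional updates is reported once, as critical.
function classifyOsUpdateSeverity(summary: UpdateSummary): UpdateSeverity {
  const reboot = summary.rebootPendingDays;
  if (summary.criticalCount > 0 || (reboot !== null && reboot > 7)) {
    return "critical";
  }
  if (summary.importantCount > 0 || (reboot !== null && reboot >= 3)) {
    return "high";
  }
  if (summary.securityCount > 0 || summary.optionalCount > 0 || reboot !== null) {
    return "medium";
  }
  return "none";
}
```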
AppUpdateChecker
File: backend/app/services/recommendations/AppUpdateChecker.ts
Compares applicationInventory versions against the latest releases and recommends updates for critical apps (browsers, etc.).
Env: ENABLE_APP_UPDATE_CHECKER
Data Sources
- VMHealthSnapshot: Latest or specific snapshot with diskSpaceInfo, defenderStatus, windowsUpdateInfo, applicationInventory
- SystemMetrics: 7-day window (configurable), 1000 rows max, includes CPU, memory, swap, disk I/O, network stats
- ProcessSnapshot: Last 60min, process-level CPU/memory usage
- PortUsage: Current port usage and connection state (100 rows max)
- Machine: Current allocation (cpuCores, ramGB, diskSizeGB) + department
Context caching: buildContextWithCaching() wraps buildContext() with a cache (key: context:${vmId}:${snapshotId||'latest'}). A cache hit reduces the 5 DB queries to 0 and makes generation 50-70% faster.
Generation Flow
generateRecommendations() pipeline:
- Disposal check: Verify service not disposed
- Cache check: Return if cached (key: recommendations:${vmId}:${snapshotId||'latest'})
- Build context: Fetch data with caching
- Execute checkers: Run applicable checkers with retry logic (linear backoff: retryDelayMs × attempt)
- Aggregate results: Collect all recommendations
- Hash comparison: Generate SHA-256 hash, compare with previous
- Database persist: Save only if changed
- Update cache: Store results (TTL: cacheTTLMinutes)
- Log performance: Total time, context build time, per-checker times
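The hash comparison step can be sketched with Node's built-in crypto module. Exactly which fields the service hashes, and whether it sorts results first, is not stated here; the canonical form below (sorted per-result JSON) and the function names are assumptions.

```typescript
import { createHash } from "crypto";

interface RecommendationResult {
  type: string;
  text: string;
  actionText: string;
  data?: unknown;
}

// Hash a canonical form of the results so that a different checker execution
// order does not look like a change. Sorting is an assumed design choice.
function hashRecommendations(results: RecommendationResult[]): string {
  const canonical = results
    .map((r) => JSON.stringify([r.type, r.text, r.actionText, r.data ?? null]))
    .sort();
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}

// Persist only when the new hash differs from the previously stored one.
function hasChanged(previousHash: string | null, results: RecommendationResult[]): boolean {
  return previousHash !== hashRecommendations(results);
}
```

Note that JSON.stringify is sensitive to key order inside data, so checkers would need to build that object consistently for the comparison to be stable.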
Retrieval Flow
getRecommendations():
- If refresh=true: Call generateRecommendations()
- Validate machine exists
- Find latest snapshot
- Build filter from RecommendationFilterInput (types, createdAfter/createdBefore, limit)
- Query database with filters
- Staleness check: If >24h and no filter, regenerate
- Return recommendations
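The staleness rule in this flow is small enough to express as a single predicate. This is a sketch only: the function name, the use of the newest createdAt timestamp, and regenerating when nothing has been stored yet are assumptions.

```typescript
const STALE_AFTER_MS = 24 * 60 * 60 * 1000; // 24h

// Hypothetical helper: decide whether stored recommendations should be
// regenerated before returning them.
function shouldRegenerate(
  newestCreatedAt: Date | null,
  hasFilter: boolean,
  now: Date = new Date(),
): boolean {
  // Filtered queries return stored rows as-is; only unfiltered reads regenerate.
  if (hasFilter) return false;
  // Assumption: with nothing stored yet, generate a first set.
  if (newestCreatedAt === null) return true;
  return now.getTime() - newestCreatedAt.getTime() > STALE_AFTER_MS;
}
```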
Caching Strategy
Two-tier caching (both controlled by single enableContextCaching flag):
Recommendation cache:
- Key: recommendations:${vmId}:${snapshotId||'latest'}
- TTL: cacheTTLMinutes (default: 15min)
- Max size: maxCacheSize (default: 100)
- Impact: 90-95% reduction in response time on a hit (2-5ms vs 150-250ms)
Context cache:
- Key: context:${vmId}:${snapshotId||'latest'}
- TTL: contextCacheTTLMinutes (default: 5min)
- Max size: 50 entries
- Impact: 50-70% faster generation (eliminates 5 DB queries)
Invalidation: TTL-based expiration, manual refresh via refresh: true, staleness check (>24h)
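A bounded TTL cache of the kind described above can be sketched in a few lines. The TtlCache class, its oldest-first eviction policy, and the cacheKey helper are assumptions for illustration; the service's actual eviction strategy is not documented here.

```typescript
// Minimal TTL cache with a size bound (hypothetical, not the service's class).
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxSize: number;

  constructor(ttlMs: number, maxSize: number) {
    this.ttlMs = ttlMs;
    this.maxSize = maxSize;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // lazily drop expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key); // re-inserting moves the key to the newest position
    if (this.entries.size >= this.maxSize) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Key format from the lists above.
function cacheKey(prefix: "recommendations" | "context", vmId: string, snapshotId?: string): string {
  return `${prefix}:${vmId}:${snapshotId || "latest"}`;
}
```

With the documented defaults, the recommendation cache would be constructed as new TtlCache(15 * 60 * 1000, 100) and the context cache as new TtlCache(5 * 60 * 1000, 50).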
Error Handling
Safe wrappers: *Safe() methods catch exceptions, log details, update metrics, return { success, recommendations?, error? }
Checker isolation: An individual checker failure does not stop the run; it is logged with context
Retry logic: runCheckerWithRetry() uses linear backoff (retryDelayMs × attempt), bounded by maxRetries (default: 3)
AppError integration: Standardized errors with codes (INTERNAL_ERROR, NOT_FOUND, DATABASE_ERROR)
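The retry and isolation behaviour can be sketched as follows. This is not the body of runCheckerWithRetry(); the function names, the injectable sleep parameter, and the reading of maxRetries as the total number of attempts are assumptions.

```typescript
// Delay before the next try after failed attempt number `attempt` (1-based).
function linearBackoffMs(attempt: number, retryDelayMs: number): number {
  return retryDelayMs * attempt;
}

// Run `task`, retrying on failure with linear backoff. `sleep` is injectable
// so the delay can be replaced in tests.
async function runWithRetry<T>(
  task: () => Promise<T>,
  maxRetries: number = 3,
  retryDelayMs: number = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No point waiting after the final attempt.
      if (attempt < maxRetries) {
        await sleep(linearBackoffMs(attempt, retryDelayMs));
      }
    }
  }
  throw lastError;
}

// Checker isolation: a checker that still fails after its retries is logged
// and contributes no results, so the remaining checkers keep running.
async function runCheckerIsolated<T>(
  checkerName: string,
  vmId: string,
  task: () => Promise<T[]>,
): Promise<T[]> {
  try {
    return await runWithRetry(task);
  } catch (error) {
    console.error(`[${checkerName}] failed for VM ${vmId}`, error);
    return [];
  }
}
```

With the defaults, a checker that fails twice and then succeeds waits 1000 ms and then 2000 ms before returning its result.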
Configuration Options
Environment Variables
Cache:
- RECOMMENDATION_CACHE_TTL_MINUTES (15)
- RECOMMENDATION_MAX_CACHE_SIZE (100)
- RECOMMENDATION_CONTEXT_CACHE_TTL_MINUTES (5)
- RECOMMENDATION_CONTEXT_CACHING (true) - controls BOTH caches
Performance:
- RECOMMENDATION_PERFORMANCE_MONITORING (true)
- RECOMMENDATION_PERFORMANCE_THRESHOLD (5000ms)
- RECOMMENDATION_MAX_RETRIES (3)
- RECOMMENDATION_RETRY_DELAY_MS (1000)
Data windows:
- RECOMMENDATION_METRICS_WINDOW_DAYS (7)
- RECOMMENDATION_METRICS_MAX_ROWS (1000)
- RECOMMENDATION_PORT_USAGE_MAX_ROWS (100)
Checker toggles (all default to enabled unless set to 'false'):
- ENABLE_DISK_SPACE_CHECKER
- ENABLE_RESOURCE_OPTIMIZATION_CHECKER
- ENABLE_OVER_PROVISIONED_CHECKER
- ENABLE_UNDER_PROVISIONED_CHECKER
- ENABLE_DISK_IO_BOTTLENECK_CHECKER
- ENABLE_DEFENDER_DISABLED_CHECKER
- ENABLE_DEFENDER_THREAT_CHECKER
- ENABLE_PORT_BLOCKED_CHECKER
- ENABLE_OS_UPDATE_CHECKER
- ENABLE_APP_UPDATE_CHECKER
Checker thresholds:
- DISK_SPACE_CRITICAL_THRESHOLD (95)
- DISK_SPACE_WARNING_THRESHOLD (85)
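As an illustration of how these variables might be read with their defaults, the sketch below parses an environment-like object. The helper names, the mapping from variable names to the service's configuration parameters, and the fallback to the default on an unparsable number are assumptions; only the variable names and default values come from the lists above.

```typescript
type Env = Record<string, string | undefined>;

// Parse an integer variable, falling back to the default when it is unset
// or not a number (the fallback behaviour is an assumption).
function envInt(env: Env, key: string, fallback: number): number {
  const parsed = Number.parseInt(env[key] ?? "", 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Checker toggles are enabled unless the variable is exactly 'false'.
function isCheckerEnabled(env: Env, key: string): boolean {
  return env[key] !== "false";
}

function loadRecommendationConfig(env: Env) {
  return {
    cacheTTLMinutes: envInt(env, "RECOMMENDATION_CACHE_TTL_MINUTES", 15),
    maxCacheSize: envInt(env, "RECOMMENDATION_MAX_CACHE_SIZE", 100),
    contextCacheTTLMinutes: envInt(env, "RECOMMENDATION_CONTEXT_CACHE_TTL_MINUTES", 5),
    // Assumed to follow the same "anything but 'false'" rule as the toggles.
    enableContextCaching: env["RECOMMENDATION_CONTEXT_CACHING"] !== "false",
    maxRetries: envInt(env, "RECOMMENDATION_MAX_RETRIES", 3),
    retryDelayMs: envInt(env, "RECOMMENDATION_RETRY_DELAY_MS", 1000),
  };
}
```

In a Node process the functions would be called with process.env, for example loadRecommendationConfig(process.env).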
GraphQL API
Query: getVMRecommendations
type Query {
  getVMRecommendations(
    vmId: ID!
    refresh: Boolean
    filter: RecommendationFilterInput
  ): [VMRecommendationType!]!
}
Authorization: @Authorized(['USER']) - ownership check (user.id === machine.userId OR user.role === 'ADMIN')
Returns: { id, machineId, snapshotId, type, text, actionText, data, createdAt }
RecommendationFilterInput
input RecommendationFilterInput {
  types: [RecommendationType!]
  createdAfter: DateTime
  createdBefore: DateTime
  limit: Int # default: 20, max: 100
}
Best Practices
Creating New Checkers
- Extend BaseRecommendationChecker
- Implement analyze(), getName(), getCategory()
- Use utility methods (parseAndCalculateDaysSince(), extractDiskSpaceData())
- Return detailed data in RecommendationResult.data
- Add env config (ENABLE_MY_CHECKER, MY_CHECKER_THRESHOLD)
- Register in VMRecommendationService constructor
Error Handling
- Never throw in analyze() - return empty array
- Log with context (checker name, VM ID)
- Validate data structures before access
- Check for null/undefined
Frontend Integration
const { data, refetch } = useGetVMRecommendationsQuery({
  variables: { vmId, refresh: false, filter: { limit: 20 } },
  pollInterval: 60000 // poll every 60 seconds
});

// Force refresh
const handleRefresh = () => refetch({ refresh: true });
Troubleshooting
No Recommendations
- Check health snapshots exist
- Verify checkers enabled (ENABLE_*_CHECKER)
- Check InfiniService running in guest
- Review service logs for errors
Stale Recommendations
- Reduce RECOMMENDATION_CACHE_TTL_MINUTES
- Use refresh: true
- Check staleness trigger (24h)
Slow Generation
- Enable context caching (RECOMMENDATION_CONTEXT_CACHING=true)
- Reduce metrics window (RECOMMENDATION_METRICS_WINDOW_DAYS=3)
- Reduce max rows (RECOMMENDATION_METRICS_MAX_ROWS=500)
- Disable unused checkers
- Check logs for slow checkers
Missing Data
- Verify InfiniService running (systemctl status infinibay-agent)
- Check health snapshot data structure
- Review data collection errors in logs
Diagrams
System Architecture
graph TD
  A[VMRecommendationResolver] --> B[VMRecommendationService]
  B --> C[Checkers]
  B --> D[Context Builder]
  D --> E[PostgreSQL]
  B --> F[Caches]
  C --> B
Generation Flow
sequenceDiagram
  User->>Resolver: getVMRecommendations
  Resolver->>Service: getRecommendations
  Service->>Cache: Check cache
  alt Cache Miss
    Service->>DB: Build context
    Service->>Checkers: Execute
    Checkers-->>Service: Results
    Service->>DB: Save if changed
    Service->>Cache: Store
  end
  Service-->>User: Recommendations