6 changes: 3 additions & 3 deletions charts/README.md
@@ -44,7 +44,7 @@ helm install hyperfleet-api oci://REGISTRY/hyperfleet-api \
| ports.api | int | `8000` | API server port |
| ports.health | int | `8080` | Health check endpoint port |
| ports.metrics | int | `9090` | Prometheus metrics endpoint port |
| config | object | `{"adapters":{"required":{"cluster":[],"nodepool":[]}},"database":{"debug":false,"dialect":"postgres","host":"","name":"hyperfleet","pool":{"conn_max_idle_time":"1m","conn_max_lifetime":"5m","conn_retry_attempts":10,"conn_retry_interval":"3s","max_connections":50,"max_idle_connections":10,"request_timeout":"30s"},"port":5432,"ssl":{"mode":"disable","root_cert_file":""}},"existingConfigMap":"","health":{"db_ping_timeout":"2s","host":"0.0.0.0","port":8080,"shutdown_timeout":"20s","tls":{"enabled":false}},"logging":{"format":"json","level":"info","masking":{"enabled":true,"fields":["password","secret","token","api_key","access_token","refresh_token","client_secret"],"headers":["Authorization","X-API-Key","Cookie","X-Auth-Token","X-Forwarded-Authorization","X-HyperFleet-Identity"]},"otel":{"enabled":false},"output":"stdout"},"metrics":{"deletion_stuck_threshold":"30m","host":"0.0.0.0","label_metrics_inclusion_duration":"168h","port":9090,"tls":{"enabled":false}},"server":{"host":"0.0.0.0","hostname":"","identity_header":"","jwk":{"cert_file":"","cert_url":""},"jwt":{"audience":"","enabled":false,"identity_claim":"email","issuer_url":""},"port":8000,"timeouts":{"read":"5s","write":"30s"},"tls":{"cert_file":"","enabled":false,"key_file":""}}}` | Application configuration. All settings in this section generate the ConfigMap consumed by the API server. Set `config.existingConfigMap` to use a pre-existing ConfigMap instead. |
| config | object | `{"adapters":{"required":{"cluster":[],"nodepool":[]}},"database":{"debug":false,"dialect":"postgres","host":"","name":"hyperfleet","pool":{"conn_max_idle_time":"1m","conn_max_lifetime":"5m","conn_retry_attempts":10,"conn_retry_interval":"3s","max_connections":50,"max_idle_connections":10,"request_timeout":"30s"},"port":5432,"ssl":{"mode":"disable","root_cert_file":""}},"existingConfigMap":"","health":{"db_ping_timeout":"2s","host":"0.0.0.0","port":8080,"shutdown_timeout":"20s","tls":{"enabled":false}},"logging":{"format":"json","level":"info","masking":{"enabled":true,"fields":["password","secret","token","api_key","access_token","refresh_token","client_secret"],"headers":["Authorization","X-API-Key","Cookie","X-Auth-Token","X-Forwarded-Authorization","X-HyperFleet-Identity"]},"otel":{"enabled":false},"output":"stdout"},"metrics":{"host":"0.0.0.0","label_metrics_inclusion_duration":"168h","port":9090,"reconciliation_stuck_threshold":"10m","tls":{"enabled":false}},"server":{"host":"0.0.0.0","hostname":"","identity_header":"","jwk":{"cert_file":"","cert_url":""},"jwt":{"audience":"","enabled":false,"identity_claim":"email","issuer_url":""},"port":8000,"timeouts":{"read":"5s","write":"30s"},"tls":{"cert_file":"","enabled":false,"key_file":""}}}` | Application configuration. All settings in this section generate the ConfigMap consumed by the API server. Set `config.existingConfigMap` to use a pre-existing ConfigMap instead. |
| config.existingConfigMap | string | `""` | Use an existing ConfigMap instead of generating one. When set, all other `config.*` values are ignored. |
| config.server | object | `{"host":"0.0.0.0","hostname":"","identity_header":"","jwk":{"cert_file":"","cert_url":""},"jwt":{"audience":"","enabled":false,"identity_claim":"email","issuer_url":""},"port":8000,"timeouts":{"read":"5s","write":"30s"},"tls":{"cert_file":"","enabled":false,"key_file":""}}` | HTTP server settings |
| config.server.hostname | string | `""` | Public hostname advertised by the API (leave empty for auto-detect) |
@@ -93,13 +93,13 @@ helm install hyperfleet-api oci://REGISTRY/hyperfleet-api \
| config.logging.masking.enabled | bool | `true` | Enable log masking |
| config.logging.masking.headers | list | `["Authorization","X-API-Key","Cookie","X-Auth-Token","X-Forwarded-Authorization","X-HyperFleet-Identity"]` | HTTP headers whose values are redacted in logs |
| config.logging.masking.fields | list | `["password","secret","token","api_key","access_token","refresh_token","client_secret"]` | Field names whose values are redacted in logs |
| config.metrics | object | `{"deletion_stuck_threshold":"30m","host":"0.0.0.0","label_metrics_inclusion_duration":"168h","port":9090,"tls":{"enabled":false}}` | Prometheus metrics endpoint settings |
| config.metrics | object | `{"host":"0.0.0.0","label_metrics_inclusion_duration":"168h","port":9090,"reconciliation_stuck_threshold":"10m","tls":{"enabled":false}}` | Prometheus metrics endpoint settings |
| config.metrics.host | string | `"0.0.0.0"` | Listen address (must be `0.0.0.0` for in-cluster access) |
| config.metrics.port | int | `9090` | Listen port (must match `ports.metrics`) |
| config.metrics.tls | object | `{"enabled":false}` | TLS configuration for the metrics endpoint |
| config.metrics.tls.enabled | bool | `false` | Enable TLS on the metrics endpoint |
| config.metrics.label_metrics_inclusion_duration | string | `"168h"` | Duration window for label-based metric inclusion |
| config.metrics.deletion_stuck_threshold | string | `"30m"` | Threshold after which a deletion is considered stuck |
| config.metrics.reconciliation_stuck_threshold | string | `"10m"` | Threshold after which a pending reconciliation is considered stuck |
| config.health | object | `{"db_ping_timeout":"2s","host":"0.0.0.0","port":8080,"shutdown_timeout":"20s","tls":{"enabled":false}}` | Health check endpoint settings |
| config.health.host | string | `"0.0.0.0"` | Listen address (must be `0.0.0.0` for probe access) |
| config.health.port | int | `8080` | Listen port (must match `ports.health`) |
22 changes: 11 additions & 11 deletions charts/templates/prometheusrule.yaml
@@ -15,30 +15,30 @@ metadata:
{{- end }}
spec:
groups:
- name: hyperfleet-api-deletion
- name: hyperfleet-api-reconciliation
rules:
- alert: HyperFleetResourceDeletionStuckWarning
expr: max by (namespace, resource_type)(hyperfleet_api_resource_pending_deletion_stuck) > 0
- alert: HyperFleetResourceReconciliationStuckWarning
expr: max by (namespace, resource_type, is_delete)(hyperfleet_api_resource_pending_reconciliation_stuck) > 0
for: {{ .Values.monitoring.prometheusRule.rules.deletionStuck.for | default "5m" }}
labels:
severity: warning
annotations:
summary: "HyperFleet resources stuck in Pending Deletion state"
summary: "HyperFleet resources stuck pending reconciliation"
description: >-
{{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been in
Pending Deletion state for more than {{ .Values.config.metrics.deletion_stuck_threshold | default "30m" }}
{{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been
pending reconciliation for more than {{ .Values.config.metrics.reconciliation_stuck_threshold | default "10m" }}
(stuck threshold) + {{ .Values.monitoring.prometheusRule.rules.deletionStuck.for | default "5m" }} (alert delay).
runbook_url: {{ .Values.monitoring.prometheusRule.rules.deletionStuck.runbookUrl | default "" | quote }}
- alert: HyperFleetResourceDeletionStuckCritical
expr: max by (namespace, resource_type)(hyperfleet_api_resource_pending_deletion_stuck) > 0
- alert: HyperFleetResourceReconciliationStuckCritical
expr: max by (namespace, resource_type, is_delete)(hyperfleet_api_resource_pending_reconciliation_stuck) > 0
for: {{ .Values.monitoring.prometheusRule.rules.deletionTimeout.for | default "30m" }}
labels:
severity: critical
annotations:
summary: "HyperFleet resources timed out in Pending Deletion state"
summary: "HyperFleet resources timed out pending reconciliation"
description: >-
{{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been in
Pending Deletion state for more than {{ .Values.config.metrics.deletion_stuck_threshold | default "30m" }}
{{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been
pending reconciliation for more than {{ .Values.config.metrics.reconciliation_stuck_threshold | default "10m" }}
(stuck threshold) + {{ .Values.monitoring.prometheusRule.rules.deletionTimeout.for | default "30m" }} (alert delay). Immediate investigation required.
runbook_url: {{ .Values.monitoring.prometheusRule.rules.deletionTimeout.runbookUrl | default "" | quote }}
{{- end }}
4 changes: 2 additions & 2 deletions charts/values.schema.json
@@ -351,9 +351,9 @@
"type": "string",
"description": "Duration window for including label-based metrics (Go duration, e.g. 168h)"
},
"deletion_stuck_threshold": {
"reconciliation_stuck_threshold": {
"type": "string",
"description": "Duration after which a pending deletion is considered stuck (Go duration, e.g. 30m)"
"description": "Duration after which a pending reconciliation is considered stuck (Go duration, e.g. 10m)"
}
}
},
4 changes: 2 additions & 2 deletions charts/values.yaml
@@ -178,8 +178,8 @@ config:

# -- Duration window for label-based metric inclusion
label_metrics_inclusion_duration: 168h
# -- Threshold after which a deletion is considered stuck
deletion_stuck_threshold: 30m
# -- Threshold after which a pending reconciliation is considered stuck
reconciliation_stuck_threshold: 10m

# -- Health check endpoint settings
health:
6 changes: 3 additions & 3 deletions cmd/hyperfleet-api/servecmd/cmd.go
@@ -131,11 +131,11 @@ func runServe(cmd *cobra.Command, args []string) {
).Info("Logger initialized")

if sf := environments.Environment().Database.SessionFactory; sf != nil {
if err := metrics.RegisterCollector(
if err := metrics.RegisterReconciliationCollector(
sf.DirectDB(),
environments.Environment().Config.Metrics.DeletionStuckThreshold,
environments.Environment().Config.Metrics.ReconciliationStuckThreshold,
); err != nil {
logger.WithError(ctx, err).Error("Failed to register pending deletion collector")
logger.WithError(ctx, err).Error("Failed to register reconciliation collector")
}
}

2 changes: 1 addition & 1 deletion cmd/hyperfleet-api/server/metrics_middleware.go
@@ -113,7 +113,7 @@ func ResetMetricCollectors() {
requestCountMetric.Reset()
requestDurationMetric.Reset()
db_metrics.ResetMetrics()
metrics.ResetMetrics()
metrics.ResetReconciliationMetrics()
buildInfoMetric.Reset()
buildInfoMetric.With(prometheus.Labels{
metricsComponentLabel: metricsComponentValue,
4 changes: 2 additions & 2 deletions pkg/config/flags.go
@@ -89,8 +89,8 @@ func AddMetricsFlags(cmd *cobra.Command) {
cmd.Flags().String("metrics-tls-key-file", defaults.TLS.KeyFile, "Path to TLS key file for metrics")
cmd.Flags().Duration("metrics-label-metrics-inclusion-duration", defaults.LabelMetricsInclusionDuration,
"Duration for cluster telemetry label inclusion")
cmd.Flags().Duration("metrics-deletion-stuck-threshold", defaults.DeletionStuckThreshold,
"Duration after which a pending deletion resource is considered stuck")
cmd.Flags().Duration("metrics-reconciliation-stuck-threshold", defaults.ReconciliationStuckThreshold,
"Duration after which a pending reconciliation resource is considered stuck")
}

// AddHealthFlags adds health check configuration flags following standard naming
6 changes: 3 additions & 3 deletions pkg/config/loader.go
@@ -353,7 +353,7 @@ func (l *ConfigLoader) bindAllEnvVars() {
l.bindEnv("metrics.port")
l.bindEnv("metrics.tls.enabled")
l.bindEnv("metrics.label_metrics_inclusion_duration")
l.bindEnv("metrics.deletion_stuck_threshold")
l.bindEnv("metrics.reconciliation_stuck_threshold")

// Health config
l.bindEnv("health.host")
@@ -421,8 +421,8 @@ func (l *ConfigLoader) bindFlags(cmd *cobra.Command) {
l.bindPFlag("metrics.tls.key_file", cmd.Flags().Lookup("metrics-tls-key-file"))
l.bindPFlag("metrics.label_metrics_inclusion_duration",
cmd.Flags().Lookup("metrics-label-metrics-inclusion-duration"))
l.bindPFlag("metrics.deletion_stuck_threshold",
cmd.Flags().Lookup("metrics-deletion-stuck-threshold"))
l.bindPFlag("metrics.reconciliation_stuck_threshold",
cmd.Flags().Lookup("metrics-reconciliation-stuck-threshold"))

// Health flags: --health-* -> health.*
l.bindPFlag("health.host", cmd.Flags().Lookup("health-host"))
8 changes: 4 additions & 4 deletions pkg/config/metrics.go
@@ -14,7 +14,7 @@ type MetricsConfig struct {
TLS TLSConfig `mapstructure:"tls" json:"tls" validate:"required"`
Port int `mapstructure:"port" json:"port" validate:"required,min=1,max=65535"`
LabelMetricsInclusionDuration time.Duration `mapstructure:"label_metrics_inclusion_duration" json:"label_metrics_inclusion_duration" validate:"required"` //nolint:lll
DeletionStuckThreshold time.Duration `mapstructure:"deletion_stuck_threshold" json:"deletion_stuck_threshold" validate:"required"` //nolint:lll
ReconciliationStuckThreshold time.Duration `mapstructure:"reconciliation_stuck_threshold" json:"reconciliation_stuck_threshold" validate:"required"` //nolint:lll

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Keep the old stuck-threshold key as a deprecated alias for one release.

Line 17 renames the persisted config key, and `pkg/config/loader.go` and `pkg/config/flags.go` drop the old env and flag bindings in the same change. Existing `metrics.deletion_stuck_threshold` overrides will be ignored on upgrade, and the threshold will silently fall back to the new 10m default (the old default was 30m), which changes stuck classification and alert timing for running deployments. Keep the old key, env var, and flag as deprecated aliases, or fail fast when the legacy name is still set. As per the path instructions for `**/config/**`: configuration changes affect all deployments, so review them for backward compatibility.

Also applies to: 30-37

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/config/metrics.go` at line 17, the stuck-threshold config rename is
dropping support for the existing deletion_stuck_threshold key/env/flag, which
will break upgrades for users with persisted overrides. Update the relevant
config wiring in metrics.go, loader.go, and flags.go so
ReconciliationStuckThreshold still accepts the legacy name as a deprecated alias
for one release, or explicitly fail fast if the old name is still present. Make
sure the old env/flag binding and mapstructure/json compatibility are preserved
alongside the new metrics key.

Source: Path instructions

}

// NewMetricsConfig returns default MetricsConfig values
@@ -27,14 +27,14 @@ func NewMetricsConfig() *MetricsConfig {
Enabled: false,
},
LabelMetricsInclusionDuration: 168 * time.Hour, // 7 days
DeletionStuckThreshold: 30 * time.Minute,
ReconciliationStuckThreshold: 10 * time.Minute,
}
}

// Validate validates MetricsConfig fields that struct tags cannot enforce
func (m *MetricsConfig) Validate() error {
if m.DeletionStuckThreshold <= 0 {
return fmt.Errorf("DeletionStuckThreshold must be positive, got %v", m.DeletionStuckThreshold)
if m.ReconciliationStuckThreshold <= 0 {
return fmt.Errorf("ReconciliationStuckThreshold must be positive, got %v", m.ReconciliationStuckThreshold)
}
return nil
}
169 changes: 0 additions & 169 deletions pkg/metrics/deletion.go

This file was deleted.
