SLA feature #22

Open
opened 2026-01-14 02:23:18 +00:00 by scott · 0 comments

Some AI brainstorming below, need to refine.

Key Design Decisions

Time Period Granularity

I'd recommend weekly as the sweet spot:

  • Monthly is too coarse—a node could be down for days and still "pass"
  • Daily is operationally noisy and creates too many on-chain transactions
  • Weekly balances fairness to customers with practical enforcement

What Makes an SLA Enforceable On-Chain?

You need:

  1. Clear, measurable metrics (uptime percentage, response time)
  2. Objective measurement mechanism (oracle/prover network doing health checks)
  3. Deterministic violation calculation (no ambiguity in "did they violate?")
  4. Pre-committed stake (funds locked, automatically slashable)

Proposed Data Structure
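
The sketches below assume a handful of placeholder types so the snippets are self-contained. The concrete representations (IDs, balances, hashes, signatures) are hypothetical and would come from the chain's runtime or contract framework.

// Hypothetical supporting types, for illustration only
type NodeId = u64;
type AccountId = [u8; 32];
type CheckerId = u64;
type Balance = u128;       // token amount in smallest unit
type Timestamp = u64;      // unix seconds
type Hash = [u8; 32];
type Signature = [u8; 64];

/// Node capacity used to scale stake requirements (assumed shape)
struct Capacity {
    compute_units: u128,
}

/// One checker's signature over a period report (assumed shape)
struct CheckerAttestation {
    checker_id: CheckerId,
    signature: Signature,
}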

/// Core SLA commitment made by a node operator
struct SlaCommitment {
    // Identity
    node_id: NodeId,
    operator: AccountId,
    
    // The SLA tier they're committing to
    tier: SlaTier,
    
    // Financial backing
    staked_amount: Balance,
    
    // Time boundaries
    period_type: PeriodType,  // Weekly recommended
    effective_from: Timestamp,
    
    // Optional: auto-renewal or explicit periods
    auto_renew: bool,
}

/// Predefined SLA tiers (simpler than custom values)
enum SlaTier {
    /// 99% uptime, best-effort response
    Basic {
        uptime_percent: u8,  // 99
        compensation_percent: u8,  // 10% of period fees refunded per violation
    },
    /// 99.9% uptime, <500ms health check response  
    Standard {
        uptime_percent_thousandths: u16,  // 9990 = 99.9% (hundredths of a percent, same unit as PeriodReport)
        max_response_ms: u32,
        compensation_percent: u8,  // 25%
    },
    /// 99.99% uptime, <200ms response, priority support
    Premium {
        uptime_percent_thousandths: u16,  // 9999 = 99.99%
        max_response_ms: u32,
        compensation_percent: u8,  // 50%
    },
}

enum PeriodType {
    Weekly,
    Monthly,
}
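
Rough sketch of how period boundaries could be derived deterministically from effective_from (assumes fixed-length periods, with "monthly" simplified to 30 days; names are illustrative):

/// Returns the (start, end) of the period containing `now`, anchored at
/// `effective_from`. Assumes now >= effective_from.
fn period_bounds(effective_from: Timestamp, period_type: &PeriodType, now: Timestamp) -> (Timestamp, Timestamp) {
    let len = match period_type {
        PeriodType::Weekly => 7 * 24 * 60 * 60,    // 604_800 seconds
        PeriodType::Monthly => 30 * 24 * 60 * 60,  // simplification: fixed 30-day months
    };
    let index = (now - effective_from) / len;      // zero-based period number
    let start = effective_from + index * len;
    (start, start + len)
}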

/// Record of a single health check (stored off-chain, merkle root on-chain)
struct HealthCheckProof {
    node_id: NodeId,
    timestamp: Timestamp,
    checker_id: CheckerId,  // Which oracle/prover performed this
    result: CheckResult,
    signature: Signature,  // Checker signs their attestation
}

enum CheckResult {
    Healthy { response_time_ms: u32 },
    Unhealthy { reason: UnhealthyReason },
    Unreachable,
}

enum UnhealthyReason {
    Timeout,
    ConnectionRefused,
    InvalidResponse,
    TlsError,
}

/// Aggregated period summary (this goes on-chain)
struct PeriodReport {
    node_id: NodeId,
    period_start: Timestamp,
    period_end: Timestamp,
    
    // Aggregated metrics
    total_checks: u32,
    successful_checks: u32,
    failed_checks: u32,
    
    // Derived
    uptime_thousandths: u16,  // e.g., 9985 = 99.85%
    avg_response_ms: u32,
    max_response_ms: u32,
    
    // Proof that this summary is valid
    checks_merkle_root: Hash,  // Root of all individual HealthCheckProofs
    
    // Multi-sig from checker network
    attestations: Vec<CheckerAttestation>,
}
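
A minimal sketch of the consistency check the chain could run on a submitted report, assuming the hundredths-of-a-percent encoding used above (9985 = 99.85%). Whether rounding is floor or nearest needs to be pinned down in the spec so it stays deterministic; floor is used here:

fn uptime_from_counts(successful_checks: u32, total_checks: u32) -> u16 {
    if total_checks == 0 {
        return 0;
    }
    // Scale to hundredths of a percent; floor division keeps it deterministic
    (successful_checks as u64 * 10_000 / total_checks as u64) as u16
}

fn report_is_consistent(report: &PeriodReport) -> bool {
    report.successful_checks + report.failed_checks == report.total_checks
        && report.uptime_thousandths == uptime_from_counts(report.successful_checks, report.total_checks)
}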

/// Violation determination and compensation
struct SlaViolation {
    node_id: NodeId,
    period: PeriodReport,
    commitment: SlaCommitment,
    
    // What was violated
    violation_type: ViolationType,
    
    // Compensation calculation
    affected_customers: Vec<AffectedCustomer>,
    total_compensation: Balance,
}

enum ViolationType {
    UptimeBelowThreshold {
        required_thousandths: u16,
        actual_thousandths: u16,
    },
    ResponseTimeExceeded {
        max_allowed_ms: u32,
        actual_avg_ms: u32,
    },
    Both {
        uptime_violation: Box<ViolationType>,
        response_violation: Box<ViolationType>,
    },
}

struct AffectedCustomer {
    customer_id: AccountId,
    // Their usage during this period
    usage_start: Timestamp,
    usage_end: Option<Timestamp>,  // None = still active
    fees_paid_this_period: Balance,
    // What they're owed
    compensation_amount: Balance,
}
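
And a hedged sketch of the deterministic violation calculation itself (point 3 under "What Makes an SLA Enforceable On-Chain?"): compare the finished report against the committed tier, normalising Basic's whole-percent threshold onto the same hundredths-of-a-percent scale. Names and the averaging rule for response time are illustrative:

fn detect_violation(report: &PeriodReport, commitment: &SlaCommitment) -> Option<ViolationType> {
    // Normalise tier thresholds to (required uptime, optional max response time)
    let (required, max_ms) = match &commitment.tier {
        SlaTier::Basic { uptime_percent, .. } => (*uptime_percent as u16 * 100, None),
        SlaTier::Standard { uptime_percent_thousandths, max_response_ms, .. }
        | SlaTier::Premium { uptime_percent_thousandths, max_response_ms, .. } => {
            (*uptime_percent_thousandths, Some(*max_response_ms))
        }
    };

    let uptime_violation = (report.uptime_thousandths < required).then(|| {
        ViolationType::UptimeBelowThreshold {
            required_thousandths: required,
            actual_thousandths: report.uptime_thousandths,
        }
    });

    let response_violation = max_ms
        .filter(|max| report.avg_response_ms > *max)
        .map(|max| ViolationType::ResponseTimeExceeded {
            max_allowed_ms: max,
            actual_avg_ms: report.avg_response_ms,
        });

    match (uptime_violation, response_violation) {
        (Some(u), Some(r)) => Some(ViolationType::Both {
            uptime_violation: Box::new(u),
            response_violation: Box::new(r),
        }),
        (Some(v), None) | (None, Some(v)) => Some(v),
        (None, None) => None,
    }
}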

Compensation Logic

Here's how I'd structure the payout logic:

fn calculate_compensation(
    violation: &SlaViolation,
    customer: &AffectedCustomer,
    // Portion of the operator's stake reserved for this customer; passed in
    // here rather than read from global state to keep the sketch self-contained
    available_stake_per_customer: Balance,
) -> Balance {
    let base_compensation = match violation.commitment.tier {
        SlaTier::Basic { compensation_percent, .. } |
        SlaTier::Standard { compensation_percent, .. } |
        SlaTier::Premium { compensation_percent, .. } => {
            customer.fees_paid_this_period * compensation_percent as u128 / 100
        }
    };

    // Could add severity multiplier
    let severity = calculate_severity(&violation.violation_type);

    // Cap at stake available
    (base_compensation * severity).min(available_stake_per_customer)
}

fn calculate_severity(violation: &ViolationType) -> u128 {
    match violation {
        // Minor miss: 1x compensation
        // (subtraction is safe: a violation implies actual < required)
        ViolationType::UptimeBelowThreshold { required_thousandths, actual_thousandths }
            if required_thousandths - actual_thousandths < 10 => 1,  // <0.1% miss

        // Moderate miss: 2x
        ViolationType::UptimeBelowThreshold { required_thousandths, actual_thousandths }
            if required_thousandths - actual_thousandths < 50 => 2,  // <0.5% miss

        // Severe miss: 3x (capped at stake)
        _ => 3,
    }
}
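
For completeness, a sketch of how the per-customer payouts could roll up into SlaViolation.total_compensation. It splits the operator's stake evenly across affected customers, which is just one possible policy (proportional-to-fees is another), and uses the extra available_stake_per_customer parameter introduced above:

fn total_compensation(violation: &SlaViolation) -> Balance {
    let n = violation.affected_customers.len().max(1) as u128;
    let per_customer_stake = violation.commitment.staked_amount / n;
    violation
        .affected_customers
        .iter()
        .map(|c| calculate_compensation(violation, c, per_customer_stake))
        .sum()
}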

Key Architectural Points

1. Predefined Tiers vs. Custom Values

I strongly recommend predefined tiers because:

  • Easier for customers to compare nodes
  • Reduces smart contract complexity
  • Prevents gaming with weird edge-case values
  • Standard tiers can map to standard stake requirements

2. Stake Requirements

fn minimum_stake_for_tier(tier: &SlaTier, node_capacity: Capacity) -> Balance {
    let base = match tier {
        SlaTier::Basic { .. } => 100,      // 100 tokens
        SlaTier::Standard { .. } => 500,   // 500 tokens  
        SlaTier::Premium { .. } => 2000,   // 2000 tokens
    };
    
    // Scale with capacity (bigger nodes = more at stake)
    base * node_capacity.compute_units
}

3. Health Check Frequency

For weekly periods:

  • Basic tier: Check every 15 minutes (672 checks/week)
  • Standard tier: Check every 5 minutes (2,016 checks/week)
  • Premium tier: Check every 1 minute (10,080 checks/week)

This gives statistical significance for uptime calculations.
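
As a sanity check on those numbers, a trivial sketch of the interval-to-count mapping (the intervals themselves are policy, not consensus-critical):

fn check_interval_minutes(tier: &SlaTier) -> u32 {
    match tier {
        SlaTier::Basic { .. } => 15,
        SlaTier::Standard { .. } => 5,
        SlaTier::Premium { .. } => 1,
    }
}

fn expected_checks_per_week(tier: &SlaTier) -> u32 {
    7 * 24 * 60 / check_interval_minutes(tier)  // 672 / 2_016 / 10_080
}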

4. Grace Periods & Maintenance Windows

struct SlaCommitment {
    // ... other fields ...
    
    // Operator can declare maintenance windows (limited)
    maintenance_windows: Vec<MaintenanceWindow>,
    max_maintenance_hours_per_period: u8,  // e.g., 4 hours/week for Basic
}

struct MaintenanceWindow {
    start: Timestamp,
    end: Timestamp,
    announced_at: Timestamp,  // Must announce 24h+ in advance
}
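
A rough sketch of how a maintenance window could feed into the uptime calculation: a health check is excluded only if it falls inside a window announced at least 24 hours in advance. Enforcing max_maintenance_hours_per_period would be a separate check when windows are declared; it is omitted here:

fn is_excluded_by_maintenance(check_time: Timestamp, commitment: &SlaCommitment) -> bool {
    const DAY: u64 = 24 * 60 * 60;
    commitment.maintenance_windows.iter().any(|w| {
        let announced_in_time = w.announced_at + DAY <= w.start;
        announced_in_time && check_time >= w.start && check_time < w.end
    })
}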

Questions to Resolve

  1. Who runs the health checkers? Decentralized oracle network? Staked checkers? Both customer and random checkers?

  2. Dispute mechanism? What if a node operator disagrees with a violation? Appeal period?

  3. Partial periods? If a customer joins mid-week, how do you prorate?

  4. Cascading failures? If the checker network itself has issues, do you pause SLA enforcement?

Would you like me to dive deeper into any of these areas, or shall I write out the actual Rust/Solidity implementation for the core data types?
