System Design - Distributed Log Ingestion

Posted: 2025-07-20 | Categories: tech | Tags: hld, system-design

🧠 System Design Interview Summary: Log Ingestion & Query System

Interviewer: Vega
Topic: Design a system for log ingestion, storage, and querying across multi-tenant agents

✅ High-Level Architecture

Agents send logs (JSON) via HTTP to a rate-limited ingress service
Ingress service writes logs to Kafka (HA cluster)
Two main Kafka consumers:
- Object Store Consumer: Stores raw logs in GCS/S3
- Indexing Consumer: Pushes structured logs to Elasticsearch
Elasticsearch Cluster (with snapshots) holds searchable logs
Query Layer exposes APIs (or Kibana) to end users
Metadata DB stores user info, tenant configs, RBAC rules
Telemetry pipeline for usage and system health insights

💡 Key Design Decisions

🔹 Data Format

JSON for ingestion (readable, schema-tolerant)
Protobuf or compressed archives for long-term storage

🔹 Schema Evolution

Agent schemas versioned per tenant
Schema registry to ensure backward compatibility
Only expected fields are accepted/processed
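
A minimal sketch of the "only expected fields" rule, assuming a hypothetical in-memory registry keyed by (tenant, schema version). The `SCHEMAS` table, the tenant name, and the field names are illustrative assumptions, not part of the original design; a real schema registry would replace the dict lookup.

```python
from typing import Any

# Hypothetical registry: (tenant_id, schema_version) -> allowed field names.
SCHEMAS: dict[tuple[str, int], frozenset[str]] = {
    ("acme", 1): frozenset({"timestamp", "level", "service", "message"}),
    # v2 only adds a field, so v1 payloads remain valid (backward compatible).
    ("acme", 2): frozenset({"timestamp", "level", "service", "message", "trace_id"}),
}


def filter_log(tenant_id: str, version: int, log: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Keep only the fields the registered schema expects; report the rest."""
    allowed = SCHEMAS.get((tenant_id, version))
    if allowed is None:
        raise KeyError(f"no schema registered for {tenant_id!r} v{version}")
    accepted = {k: v for k, v in log.items() if k in allowed}
    dropped = sorted(k for k in log if k not in allowed)
    return accepted, dropped
```

Returning the dropped field names (instead of silently discarding them) gives the telemetry pipeline something to count per tenant.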

🔹 Indexing & Querying

Indexed fields include timestamp, log level, service name, etc.
Optional full-text search on message field
Search APIs backed by RBAC and tenant-scoped filters
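
A sketch of how the query layer might force a tenant-scoped filter onto every search, using the Elasticsearch query DSL (bool/filter/term/range/match). The field names (`tenant_id`, `level`, `service`, `timestamp`, `message`) and the helper `build_search_body` are assumptions for illustration; `tenant_id` is assumed to be mapped as a keyword field.

```python
from typing import Optional


def build_search_body(
    tenant_id: str,
    level: Optional[str] = None,
    service: Optional[str] = None,
    text: Optional[str] = None,
    start: Optional[str] = None,
    end: Optional[str] = None,
) -> dict:
    """Build a search body whose first filter is always the caller's tenant."""
    # The tenant filter comes from the authenticated session, never from user input.
    filters: list[dict] = [{"term": {"tenant_id": tenant_id}}]
    if level:
        filters.append({"term": {"level": level}})
    if service:
        filters.append({"term": {"service": service}})
    if start or end:
        bounds = {}
        if start:
            bounds["gte"] = start
        if end:
            bounds["lt"] = end
        filters.append({"range": {"timestamp": bounds}})

    query: dict = {"bool": {"filter": filters}}
    if text:
        # Optional full-text search on the message field.
        query["bool"]["must"] = [{"match": {"message": text}}]
    return {"query": query, "sort": [{"timestamp": "desc"}]}
```

Exact-match conditions go in `filter` context so they skip scoring and can be cached; only the free-text clause is scored.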

🔹 Elasticsearch Resilience

High availability setup with shard replication
Snapshots taken periodically for disaster recovery
Index Lifecycle Management (ILM) policies to manage TTL
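
One way the TTL could be expressed as an ILM policy body, sketched as a Python dict. The hot/delete phase layout follows the Elasticsearch ILM policy format; the helper name `ilm_policy` and the retention and rollover values are assumptions, not numbers from the interview.

```python
def ilm_policy(retention_days: int, rollover_max_age: str = "1d") -> dict:
    """Build an ILM policy body: roll over while hot, delete after the TTL."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "policy": {
            "phases": {
                "hot": {
                    # Start a fresh backing index once the current one is old enough.
                    "actions": {"rollover": {"max_age": rollover_max_age}}
                },
                "delete": {
                    # For rolled-over indices, min_age counts from the rollover time.
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }
```

The resulting dict is what would be sent as the body of a `PUT _ilm/policy/<name>` request; generating one policy per retention tier keeps per-tenant TTL a matter of attaching the right policy to the tenant's index template.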

🏢 Multi-Tenancy Strategy

Type	Description
Pseudo Multi-Tenancy	Shared Kafka/Elasticsearch infra, tenant isolation via schema/index prefixes
True Multi-Tenancy	Dedicated infra (Kafka, GCS, Elasticsearch) per tenant; tenants are promoted based on volume or SLA
RBAC	Controlled via metadata DB and AD group mapping
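
A small sketch of how an indexing consumer might pick its write target under the two modes above. `TenantConfig`, the cluster URLs, and the `logs-...` index naming are all hypothetical; in the design above this config would come from the metadata DB.

```python
from dataclasses import dataclass
from datetime import date

SHARED_CLUSTER = "http://shared-es:9200"  # hypothetical shared cluster URL


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    dedicated: bool = False            # True once promoted to dedicated infra
    cluster_url: str = SHARED_CLUSTER  # overridden for dedicated tenants


def index_target(cfg: TenantConfig, day: date) -> tuple[str, str]:
    """Return (cluster_url, index_name) for a tenant's logs on a given day."""
    if cfg.dedicated:
        # Dedicated cluster: the cluster itself is the isolation boundary.
        return cfg.cluster_url, f"logs-{day:%Y.%m.%d}"
    # Shared cluster: isolate by putting the tenant id in the index prefix.
    return SHARED_CLUSTER, f"logs-{cfg.tenant_id}-{day:%Y.%m.%d}"
```

Keeping the routing decision in one function means promoting a tenant is a config change rather than a code change.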

🛡️ Security, Rate Limiting & Abuse Protection

Ingress Layer (HTTP proxy) applies:
- Request validation
- API token auth
- Rate limiting per tenant
Circuit breaker patterns to prevent cascading failures
DDoS prevention via quotas & burst control
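
Per-tenant rate limiting with burst control is commonly done with a token bucket. The sketch below is one possible in-process version, not the interview's stated implementation: `TokenBucket`, `TenantLimiter`, and the rate/burst numbers are assumptions, and a multi-node ingress would need shared state (for example in a cache) instead of a local dict.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens/second up to `burst`; each request costs one token."""

    def __init__(self, rate: float, burst: float, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start full so a fresh tenant can burst
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add tokens earned since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, created lazily on first request."""

    def __init__(self, rate: float, burst: float, clock: Callable[[], float] = time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = self.buckets[tenant_id] = TokenBucket(self.rate, self.burst, self.clock)
        return bucket.allow()
```

When `allow` returns False, the ingress would typically answer with HTTP 429 so that one noisy tenant cannot starve the others.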

📊 Telemetry & Monitoring

Logs from agents and query APIs sent to a telemetry Kafka topic
Prometheus/Grafana stack monitors system health and usage patterns
Audit logs retained for admin usage

🔁 Resilience & Failure Handling

Kafka consumers commit offsets (checkpoints) so they can resume safely after a restart
Elasticsearch snapshot restore plans
Retry queues for failed ingestion
Dead Letter Queue (DLQ) for malformed logs
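
The split between the retry queue and the DLQ can be sketched as a routing function: malformed input can never succeed, so it goes straight to the DLQ, while transient failures are retried and then parked. `TransientError`, `route`, and the destination labels are hypothetical names for illustration; real destinations would be Kafka topics.

```python
import json
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a busy index)."""


def route(raw: str, handler: Callable[[dict], None], max_attempts: int = 3) -> tuple[str, int]:
    """Return (destination, attempts_used) for one raw log line."""
    try:
        record = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "dlq", 0
    if not isinstance(record, dict):
        return "dlq", 0  # valid JSON, but not a log object

    for attempt in range(1, max_attempts + 1):
        try:
            handler(record)
            return "ok", attempt
        except TransientError:
            continue  # a real consumer would back off before the next attempt
    # Still failing after max_attempts: hand off to the retry queue for later.
    return "retry_queue", max_attempts
```

Only `TransientError` is retried here; any other exception propagates, which keeps programming errors visible instead of burying them in a queue.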

📦 Object Storage Strategy

Long-term log archive in GCS/S3 with partitioning by tenant/date
Lifecycle rules to auto-transition to cold storage (e.g., after 30 days)
Optional encryption per tenant (KMS)
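
A sketch of tenant/date partitioning for archive object keys. The `tenant=.../date=.../hour=...` layout, the `.json.gz` suffix, and the helper `object_key` are assumptions; the source only specifies partitioning by tenant and date.

```python
from datetime import datetime, timezone


def object_key(tenant_id: str, ts: datetime, batch_id: str) -> str:
    """Build an object key partitioned by tenant, then UTC date and hour."""
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    ts = ts.astimezone(timezone.utc)  # normalize so partitions do not depend on locale
    return f"tenant={tenant_id}/date={ts:%Y-%m-%d}/hour={ts:%H}/{batch_id}.json.gz"
```

Putting the tenant first in the key lets lifecycle rules, per-tenant KMS keys, and access policies be scoped by prefix.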

📌 PIA Notes — Follow-Ups & Deep Dive Areas

🔸 Retention & Lifecycle

Per-tenant TTL?
ILM policy configurations in Elasticsearch?
GCS lifecycle rules for archival?

🔸 Schema Strategy

Schema registry integration
How to version agent schemas cleanly?
What to do if agent sends unexpected fields?

🔸 Query Layer UX

Native UI (like Kibana) or custom frontend?
Role-based query controls
Audit logs of queries

🔸 Telemetry & Alerts

Can users set their own alert rules?
Do we support lag/delay/error rate alerting?
Anomaly detection roadmap?

🔸 Multi-Tenancy Expansion

What triggers upgrade to true multi-tenancy?
Cost model for per-tenant provisioning
How to support hybrid mode?

🔸 Backpressure & Rate Limiting

Propagation of backpressure to agents?
Real-time quota metrics exposed to tenants?