How to Choose an OLAP Backend for High-Cardinality Telemetry (ClickHouse Playbook)
Decision framework for OLAP backends handling high-card telemetry—ingest patterns, sharding, cost trade-offs, and ClickHouse strategies for 2026.
Why high-cardinality telemetry breaks most OLAP assumptions — and what to do about it
High-cardinality telemetry (think: millions of unique device IDs, session identifiers, and custom tags) turns everyday analytics problems into engineering nightmares: ingest pipelines choke, queries go non-responsive, and cloud bills spike. If your platform teams are choosing an OLAP backend in 2026, you need a decision framework that weighs ingest architecture, sharding strategy, and cost trade-offs against real telemetry workloads. This ClickHouse playbook gives you that framework — actionable patterns, SQL/snippet examples, benchmarking guidance, and cost controls tuned to modern telemetry.
Executive summary (most important first)
ClickHouse remains a top choice for high-throughput telemetry in 2026 due to its ingestion performance, storage efficiency, and ecosystem growth (notably the large 2025 funding round that accelerated managed offerings). But it's not a one-size-fits-all win.
- Choose ClickHouse when you need sub-second analytics on high-volume streams, real-time ingestion, and control over storage tiers.
- Design ingest with Kafka (or cloud-native streams) plus ClickHouse's Kafka table engine and Materialized Views to absorb bursts without backpressure. The pipeline is at-least-once, so plan for idempotent inserts (or ReplacingMergeTree) to deduplicate.
- Shard thoughtfully: hash-shard on a high-cardinality key to spread write load, but mitigate cross-shard query costs with strategic pre-aggregations and localized primary keys.
- Optimize costs by using tiered storage (hot/cold), TTL-based downsampling, pre-aggregation via AggregatingMergeTree, and careful replication factors.
- Benchmark rigorously: measure inserts/sec, P50/P95/P99 query latency, storage bytes/event, and distinct-count accuracy for your telemetry patterns.
2026 Trends shaping OLAP choices for telemetry
Two trends are changing the calculus this year:
- Managed ClickHouse and ecosystem acceleration: Significant funding in 2025 drove commercial offerings (ClickHouse Cloud, Altinity.Cloud, and other managed players), reducing operational friction for teams that lack DBAs.
- Separation of compute and object storage: Serverless and storage-separated architectures let you push older telemetry to S3 while keeping recent data hot — essential when cardinality multiplies your overall footprint.
Start with a decision framework (questions to answer)
Before picking an OLAP backend, answer these in priority order:
- What is your sustained and peak ingest rate (events/sec) and retention window?
- How many unique values do your high-cardinality dimensions (device_id, user_id, tag combinations) produce per day and over the full retention window?
- What are the dominant query shapes: point lookups, high-cardinality GROUP BYs, distinct counts, top-k time-series?
- What are acceptable latencies for dashboards and ad-hoc queries (sub-second, seconds, minutes)?
- What's your operational capacity — do you need managed service vs. self-hosting?
- What's your cost target per million events or per TB-month?
Telemetry ingestion architectures that scale
Telemetry ingestion must absorb bursts and preserve ordering/idempotence. Below are proven patterns with ClickHouse.
1) Streaming-first: Kafka -> ClickHouse (recommended for high-card telemetry)
Pattern: Use Apache Kafka (or cloud equivalents: MSK, Confluent, Pub/Sub, Event Hubs) as the buffer and then the ClickHouse Kafka table engine + MATERIALIZED VIEW to write into MergeTree tables.
-- Kafka table to read raw JSON lines
CREATE TABLE telemetry_kafka (
ts DateTime,
device_id String,
user_id String,
metric Float64,
tags String
) ENGINE = Kafka SETTINGS
kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'telemetry',
kafka_group_name = 'ch-ingest',
kafka_format = 'JSONEachRow';
-- Local MergeTree table to store processed events
CREATE TABLE telemetry_local (
ts DateTime,
device_id String,
user_id String,
metric Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (device_id, ts);
-- Materialized view to move from Kafka into MergeTree
CREATE MATERIALIZED VIEW telemetry_mv TO telemetry_local AS
SELECT ts, device_id, user_id, metric FROM telemetry_kafka;
Why this works: Kafka decouples producers from ClickHouse consumers. Materialized views allow lightweight ETL, and the Kafka engine resumes from committed consumer-group offsets after restarts, so transient failures don't lose data. Increase kafka_num_consumers (and topic partition count) to parallelize ingestion within the consumer group.
2) Batch bulk loads (for lower-latency-tolerant pipelines)
Pattern: Buffer events for 1-5 seconds and write bulk inserts using ClickHouse's HTTP or native client. Lower per-row overhead and better compression, but increased latency.
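The buffering step can be sketched in Python. This is a minimal sketch: the BulkBuffer name and the send callback are ours; in practice send would POST the accumulated JSONEachRow payload to ClickHouse's HTTP interface (port 8123) as a single INSERT.

```python
import json
import time

class BulkBuffer:
    """Buffer events and flush one bulk insert when a size or age threshold is hit.

    `send` is any callable taking the newline-delimited JSON payload; in a real
    pipeline it would POST to ClickHouse as `INSERT ... FORMAT JSONEachRow`.
    """
    def __init__(self, send, max_rows=10_000, max_age_s=5.0):
        self.send = send
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.rows = []
        self.started = None  # monotonic time of first buffered row

    def add(self, event: dict):
        if self.started is None:
            self.started = time.monotonic()
        self.rows.append(json.dumps(event))
        if len(self.rows) >= self.max_rows or time.monotonic() - self.started >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.rows:
            self.send("\n".join(self.rows))  # one bulk payload, not per-row inserts
            self.rows = []
            self.started = None
```

The size threshold keeps per-insert overhead low and parts large (good for MergeTree merges); the age threshold caps end-to-end latency for quiet streams.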
3) Serverless/Direct API (small teams or managed services)
Pattern: For low ops overhead, use ClickHouse Cloud or other managed offerings with direct ingestion APIs. A good fit where engineering resources are limited, but be mindful of vendor egress costs and rate limits.
Sharding strategies for high-cardinality telemetry
Sharding controls how write and query load distribute across nodes. With extreme cardinality, naive sharding leads to hot partitions. These are the main approaches.
Hash-based sharding on cardinal key (default for writes)
Hash-shard on your highest-cardinality identifier (device_id, session_id) so writes distribute evenly. In ClickHouse that means creating a Distributed table with a sharding key:
CREATE TABLE telemetry_dist ON CLUSTER ch_cluster AS telemetry_local
ENGINE = Distributed(ch_cluster, analytics, telemetry_local, sipHash64(device_id));
Pros: reduces write hot spots. Cons: cross-shard GROUP BY queries need data movement; this raises query network costs and latency.
Time-based sharding (for query locality)
Partition or shard by time ranges (day/week) in addition to hashing. This improves time-range queries and older partition eviction, but can concentrate writes during peak times.
Hybrid sharding (hash + time)
Most production systems use a hybrid: hash to distribute write CPU and network, time partitions for retention and TTL. Implement by combining sipHash64(device_id) for Distributed engine plus MergeTree PARTITION BY toYYYYMM(ts).
Co-located primary keys and query-aware routing
If your queries often filter by a second index (e.g., tenant_id), consider routing writes so data for frequent queries stays on a subset of nodes. This adds routing complexity but reduces cross-node aggregations.
Storage, retention, and downsampling — cost levers
High-card telemetry inflates storage faster than you think. Use these levers to control cost.
- Tiered storage: Move cold partitions to S3/cheaper disks using ClickHouse storage policies. This keeps compute nodes small and cheap.
- TTL-driven downsampling: Keep raw events for a short window (7–30 days) and store pre-aggregated rollups for longer windows.
- Pre-aggregation/AggregatingMergeTree: Use Summing/AggregatingMergeTree to store rollups (per minute/hour) and avoid expensive GROUP BYs across billions of rows.
- Compression tuning: For hot telemetry, the default LZ4 is a good latency/ratio balance; for cold tiers, ZSTD (optionally at a higher level) cuts storage cost at modest CPU expense.
- Replication factor: Lower replication reduces cost but increases risk. Use 2x replication for many telemetry workloads instead of 3x to save ~33% storage cost, if your SLA allows.
Example: TTL + tiered storage
ALTER TABLE telemetry_local
MODIFY TTL ts + INTERVAL 30 DAY TO VOLUME 'cold';
-- storage_policy example in server config maps 'cold' to S3
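The comment above points at server-side configuration. A minimal sketch of what that could look like, assuming a config.d drop-in file; the bucket endpoint and the disk/policy names (s3_cold, tiered) are our hypothetical choices:

```xml
<!-- e.g. /etc/clickhouse-server/config.d/storage.xml (sketch; names are hypothetical) -->
<clickhouse>
  <storage_configuration>
    <disks>
      <s3_cold>
        <type>s3</type>
        <endpoint>https://s3.amazonaws.com/my-telemetry-bucket/ch/</endpoint>
        <!-- credentials via environment/IAM or access_key_id / secret_access_key -->
      </s3_cold>
    </disks>
    <policies>
      <tiered>
        <volumes>
          <hot><disk>default</disk></hot>
          <cold><disk>s3_cold</disk></cold>
        </volumes>
      </tiered>
    </policies>
  </storage_configuration>
</clickhouse>
```

Note that the table must be created (or altered) with SETTINGS storage_policy = 'tiered' so the TTL ... TO VOLUME 'cold' clause can find that volume.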
Pre-aggregation playbook
To avoid full-table scans and cross-shard GROUP BYs, pre-aggregate at ingest time into time buckets and common dimensions.
CREATE TABLE telemetry_rollup
(
bucket DateTime, -- truncated to minute/hour
device_id String,
metric_sum AggregateFunction(sum, Float64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(bucket)
ORDER BY (device_id, bucket);
-- populate via a materialized view that applies sumState(metric) grouped by bucket and device_id
Query rollups for dashboards; only fall back to raw events for debugging or narrow-scope analyses.
Benchmarking: what to measure and how
Benchmarks must mirror production telemetry shapes. Synthetic bench tools are fine — but emulate cardinality and query patterns, not just event rates.
Key metrics
- Ingest throughput (events/sec sustained and burst)
- Insert latency (tail of commit latency)
- Query latency P50/P95/P99 for each query shape
- Storage bytes per event post-compression
- Distinct-count accuracy for approximation functions (HLL), error rates
- Cost per million events/month (compute + storage + network)
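For the latency percentiles in the metrics above, a nearest-rank computation over collected samples is enough for benchmark reports. A sketch; the percentile and latency_report names are ours:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

def latency_report(samples_ms):
    """Summarize a benchmark run's latency samples (milliseconds) at P50/P95/P99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Nearest-rank avoids interpolation surprises at the tail: P99 is always a latency you actually observed, which matters when tails are spiky.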
Workload shapes to emulate
- High-cardinality GROUP BY over many dimensions
- Top-k queries (top devices by metric) across sliding windows
- Distinct counts per time bucket (HLL vs exact)
- Ad-hoc multi-join queries for debugging
How to run a meaningful benchmark
- Profile your real event size and cardinality — sample a week of production if possible.
- Generate a synthetic feed that matches event size and cardinal distributions (Zipf or Pareto for tags).
- Run sustained and burst ingest phases (e.g., 1M events/sec sustained with 5M events/sec bursts) while executing query sets.
- Measure end-to-end latency and resource utilization, then compute projected monthly costs.
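The synthetic-feed step above can be sketched with the standard library: random.paretovariate gives the heavy-tailed (Zipf-like) popularity skew typical of device traffic. The function name, field names, and defaults here are our illustrative choices, not a fixed schema:

```python
import json
import random
import time

def synthetic_events(n, num_devices=1_000_000, alpha=1.2, seed=42):
    """Yield n JSONEachRow-style event lines with Pareto-skewed device popularity.

    Smaller alpha => heavier tail (a few "hot" devices dominate), which is the
    distribution shape a realistic high-cardinality benchmark should exercise.
    """
    rng = random.Random(seed)  # seeded for reproducible benchmark runs
    for _ in range(n):
        # paretovariate(alpha) >= 1.0; fold the long tail back into the ID space
        device = int(rng.paretovariate(alpha)) % num_devices
        yield json.dumps({
            "ts": int(time.time()),
            "device_id": f"dev-{device}",
            "user_id": f"user-{rng.randrange(num_devices)}",
            "metric": rng.random(),
        })
```

Pipe the output into your Kafka producer (or straight into a bulk HTTP insert) at the sustained and burst rates you want to test.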
ClickHouse-specific knobs and engines
Use the right table engines and functions for your telemetry profile:
- MergeTree family: primary storage for high ingest. Choose ReplacingMergeTree for dedup, CollapsingMergeTree for event correction workflows.
- AggregatingMergeTree / SummingMergeTree: for rollups and pre-aggregations.
- Distributed engine: logical table that routes queries to shards; combine with sipHash64 shard key.
- Kafka table engine: robust streaming ingestion.
- Unique and HLL functions: uniqExact (accurate), uniqCombined / uniqHLL12 for approximate distinct counts.
Operational risks and mitigations
High-card telemetry magnifies a few risks. Plan mitigations now.
- Hot partitions/shards: Avoid sharding or partitioning on a low-cardinality key alone (like tenant_id). Use hybrid sharding and time partitions.
- Cross-shard aggregations: Pre-aggregate and use rollups where possible. Consider materialized pre-aggregates for common queries.
- Replication and availability: Use replicas across zones; test node failovers and rebalance costs (replication increases storage).
- Backup and restore: For terabyte-scale telemetry, use object storage snapshots and test restores end-to-end regularly.
Cost modeling — practical example
Estimate costs using these inputs: events/day, size/event (after compression), replication factor, compute node cost, S3 cost. Below is a simplified example for planning.
Example: 100M events/day, 100 bytes/event compressed => 10 GB/day raw. 30-day retention = 300 GB. With 2x replication => 600 GB. Add index/meta overhead ~20% => ~720 GB.
If compute nodes cost $1.50/hour each and you need 8 nodes for throughput, compute = ~8 * $1.50 * 24 * 30 = $8,640/month. Storage at $0.023/GB-month (S3 standard) for 720 GB = ~$16.56/month (real-world S3 request and egress charges add more). The main cost driver is compute and replication, not raw object storage, which is why tiering and lower replication can save big dollars.
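The arithmetic above can be captured in a small planning helper. A sketch: the function name is ours, and the parameter defaults mirror the worked example rather than recommended values.

```python
def monthly_cost(events_per_day, bytes_per_event, retention_days, replication,
                 overhead=0.20, nodes=8, node_usd_per_hour=1.50,
                 storage_usd_per_gb_month=0.023):
    """Back-of-envelope monthly cost split for a telemetry cluster.

    overhead: index/metadata fraction on top of compressed data.
    Returns stored GB plus compute and storage dollars per 30-day month.
    """
    gb_per_day = events_per_day * bytes_per_event / 1e9
    stored_gb = gb_per_day * retention_days * replication * (1 + overhead)
    compute_usd = nodes * node_usd_per_hour * 24 * 30
    storage_usd = stored_gb * storage_usd_per_gb_month
    return {"stored_gb": stored_gb,
            "compute_usd": compute_usd,
            "storage_usd": storage_usd}
```

Re-running it with replication=3 or a longer retention window makes the compute-versus-storage imbalance obvious before you commit to a cluster size.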
Key takeaway: for telemetry with high cardinality, compute (CPU & memory) typically dominates cost due to query patterns — choose sharding and pre-aggregation to reduce compute need.
When not to use ClickHouse
ClickHouse is powerful but consider alternatives if:
- You need complex transactional semantics or heavy OLTP-style updates.
- You prefer fully serverless consumption-based pricing and want to avoid cluster management — though managed ClickHouse in 2026 reduces this barrier.
- Your queries are dominated by full-scan machine-learning feature joins better handled by data warehouses like Snowflake or BigQuery with built-in ML pipelines.
Concrete migration playbook (select, prototype, scale)
- Select the candidate cluster size based on sustained inserts and sample queries; choose managed vs self-hosted.
- Prototype with a 2–4 node cluster and synthetic load that mirrors production cardinality. Implement Kafka ingestion and rollups.
- Benchmark using your query suite and cardinality stress tests; iterate on ORDER BY keys, partition sizes, and aggregation strategies.
- Harden with replication, backup, monitoring, and alerting; set TTLs and storage policies for cold data.
- Scale by adding shards; rehashing may be required so plan blue/green migration paths for large clusters.
Quick decision checklist (one-pager)
- Telemetry cardinality: >100k unique keys/day? Favor hash sharding + ClickHouse.
- Ingest rate: >100k events/sec? Use Kafka + ClickHouse, tune batch sizes.
- Query latency: sub-second dashboards? Pre-aggregate key metrics.
- Cost cap: prioritize tiered storage + 2x replication + aggressive rollups.
- Ops skill: limited? Choose managed ClickHouse Cloud or Altinity.Cloud (2026 offerings improved post-2025 funding).
Final recommendations — what to try first
- Run a 2-node managed ClickHouse cluster with Kafka ingestion feeding a continuous pipeline. Measure the baseline.
- Implement AggregatingMergeTree rollups for the top 20 dashboard queries and compare cost/latency before and after.
- Introduce TTL-based downsampling: keep raw for 14–30 days and rollup for 12+ months.
- Benchmark distinct-count approximations (uniqCombined / uniqHLL12) vs exact uniqExact to save CPU and memory.
Closing — why this matters in 2026
Telemetry cardinality keeps growing (device fleets, microservices, feature flags). The 2025 investments in ClickHouse and managed offerings have made it easier to adopt state-of-the-art OLAP without reinventing operations. But success depends on architecture: ingest buffering, sharding on the right key, pre-aggregation, and strict retention policies. Use the decision framework above to avoid costly rework.
"Design for write distribution and query locality first. If you control ingest and rollups, you control cost and latency." — Practical rule for high-card telemetry
Actionable takeaways (do this this week)
- Profile your telemetry for cardinality and bytes/event.
- Deploy a Kafka + ClickHouse prototype and measure P95 ingest latency and P95 dashboard query latency.
- Implement a minute-level AggregatingMergeTree rollup for the most-used dashboard.
- Set TTLs to move 30+ day partitions to S3 via storage policies.
Call to action
If you're evaluating OLAP backends for high-card telemetry, get our ClickHouse Migration Checklist and a pre-built benchmarking script tailored to telemetry (Kafka + ClickHouse). Visit appcreators.cloud/playbooks to download the checklist, or contact our platform team for a 1-hour free session to review your telemetry design and cost model.