Why ClickHouse for Logs?
When we built Purl, we evaluated every major database for log storage: Elasticsearch, PostgreSQL, TimescaleDB, and ClickHouse. ClickHouse won decisively for three reasons:
1. Columnar storage: Log queries typically touch 2-3 columns (timestamp, level, message). ClickHouse only reads the columns you query, skipping the rest.
2. Extreme compression: ClickHouse achieves 10-20x compression on log data using LZ4 and ZSTD codecs. 1TB of raw logs becomes 50-100GB on disk.
3. Vectorized query execution: Queries process data in batches of 8192 rows using SIMD instructions, achieving throughput that's orders of magnitude faster than row-based databases.
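Column pruning shows up in even the simplest query. In a sketch like the following (table and column names match the schema shown below), ClickHouse reads only the referenced columns from disk:

```sql
-- Only the referenced columns (timestamp, level, message) are read from disk;
-- source, host, and metadata are never touched.
SELECT timestamp, message
FROM logs
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 HOUR;
```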
Schema Design
Our log table uses ClickHouse's MergeTree engine:
CREATE TABLE logs (
timestamp DateTime64(3),
level LowCardinality(String),
message String,
source LowCardinality(String),
host LowCardinality(String),
metadata Map(String, String),
ingested_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (source, level, timestamp)
TTL timestamp + INTERVAL 90 DAY
SETTINGS index_granularity = 8192
Key design decisions:
- LowCardinality for level, source, host — these have few unique values, so dictionary encoding saves 90%+ space
- Partitioning by day — enables instant partition drops for retention cleanup instead of expensive DELETE queries
- ORDER BY (source, level, timestamp) — optimizes the most common query patterns
- TTL — automatic data expiration, no cron jobs needed
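The partition-drop point is worth making concrete: expiring a day of data is a cheap metadata operation rather than a row-by-row DELETE. With the daily partitioning above, it looks like this (the partition ID is illustrative):

```sql
-- Removes an entire day of logs almost instantly; no row-level deletes
ALTER TABLE logs DROP PARTITION 20240101;
```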
Query Optimization
Materialized Views for Patterns
We use materialized views to pre-aggregate log patterns in real-time:
CREATE MATERIALIZED VIEW log_patterns_mv
ENGINE = SummingMergeTree()
ORDER BY (source, pattern_hash, hour)
AS SELECT
source,
cityHash64(replaceRegexpAll(message, '[0-9]+', 'N')) AS pattern_hash,
toStartOfHour(timestamp) AS hour,
count() AS count,
any(message) AS sample
FROM logs
GROUP BY source, pattern_hash, hour
This gives us instant pattern detection — the "Patterns" tab in Purl's dashboard queries this materialized view, not the raw log table.
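Because SummingMergeTree merges rows asynchronously in the background, reads should still aggregate with sum() rather than trusting pre-merged counts. A sketch of the kind of query the Patterns tab might issue (the exact query and the source value are assumptions, not Purl's production query):

```sql
SELECT
    pattern_hash,
    any(sample) AS example_message,
    sum(count) AS occurrences  -- sum() because background merges may lag
FROM log_patterns_mv
WHERE source = 'api'
  AND hour >= toStartOfHour(now() - INTERVAL 24 HOUR)
GROUP BY pattern_hash
ORDER BY occurrences DESC
LIMIT 20;
```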
Projection for Time-Series Queries
For the dashboard's time-series charts, we use projections:
ALTER TABLE logs ADD PROJECTION logs_by_time (
SELECT
toStartOfMinute(timestamp) AS minute,
level,
count() AS count
GROUP BY minute, level
)
This allows the dashboard to render log volume charts without scanning the full table.
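One operational note: ADD PROJECTION only covers newly inserted parts, so on a table with existing data the projection must be backfilled before it helps. After that, aggregate queries matching its shape can be served from the projection automatically:

```sql
-- Backfill the projection over existing parts (new inserts maintain it automatically)
ALTER TABLE logs MATERIALIZE PROJECTION logs_by_time;

-- A per-minute volume query the optimizer can now answer from the projection
SELECT toStartOfMinute(timestamp) AS minute, level, count() AS count
FROM logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, level;
```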
Benchmarks
On a single node (4 vCPU, 8GB RAM, NVMe SSD):
- Ingest: 150,000 logs/second sustained
- Query (last 1 hour): 23ms average
- Query (last 24 hours, 50M rows): 180ms average
- Full-text search: 450ms for regex across 100M logs
- Storage: 1 billion logs in ~45GB (compressed)
For comparison, Elasticsearch on the same hardware handles ~20,000 logs/second with 3-5 second query times at the same scale.
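For reference, the full-text benchmark above refers to queries of this shape; ClickHouse's match() function evaluates RE2 regexes directly over the message column (the pattern here is illustrative):

```sql
SELECT count()
FROM logs
WHERE match(message, 'timeout|connection refused');
```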
Operational Simplicity
The biggest win isn't performance — it's operations. ClickHouse requires almost zero maintenance:
- No shard management
- No rebalancing
- No JVM tuning
- Automatic TTL-based cleanup
- Single binary deployment
This is why Purl can offer production-grade log management that starts with nothing more than docker compose up.