Exciting project and definitely something I'd like to explore using. I particularly like the look of the API ergonomics. A few questions:
- is the schema inferred from the data?
- can/does the schema evolve?
- are custom partitions supported?
- is there a roadmap for future features?
Sounds interesting, just some questions:
- tables are partitioned? By year/month?
- how do you handle too many small parquet files?
- are updates/deletes allowed/planned?
Great questions, thanks!

Partitioning: yes, Arc partitions by measurement > year > month > day > hour. This structure makes time-range queries very fast and simplifies retention policies (you can drop by hour/day instead of re-clustering).
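To make that layout concrete, here's a minimal sketch (my own illustration; the exact key format is an assumption, not Arc's actual code) of how the measurement > year > month > day > hour hierarchy could map onto object-store keys:

```python
from datetime import datetime, timezone

def partition_key(measurement: str, ts: datetime) -> str:
    """Hour-granularity partition prefix following the
    measurement > year > month > day > hour hierarchy.
    The exact key format is a guess, not Arc's actual layout."""
    return f"{measurement}/{ts.year:04d}/{ts.month:02d}/{ts.day:02d}/{ts.hour:02d}/"

# A time-range query only has to list the prefixes it covers, and
# dropping one hour of data for retention is a single prefix delete:
print(partition_key("cpu", datetime(2024, 6, 1, 10, tzinfo=timezone.utc)))
# -> cpu/2024/06/01/10/
```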
Small Parquet files: we batch writes by measurement before flushing, typically every 10K records or 60 seconds. That keeps file counts manageable while maintaining near-real-time visibility. Optional compaction jobs can later merge smaller Parquet files for long-term optimization.
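A minimal sketch of that dual flush trigger (count or age, whichever fires first); the class and threshold names are mine for illustration, not Arc internals:

```python
import time
from typing import Optional

class WriteBuffer:
    """Per-measurement buffer flushed at 10K records or 60 seconds,
    whichever comes first (illustrative thresholds from the comment above)."""

    def __init__(self, flush_rows: int = 10_000, flush_secs: float = 60.0):
        self.flush_rows = flush_rows
        self.flush_secs = flush_secs
        self.rows: list = []
        self.first_write: Optional[float] = None

    def append(self, row: dict) -> None:
        if self.first_write is None:
            self.first_write = time.monotonic()
        self.rows.append(row)
        # A real writer would also check the age trigger on a timer,
        # not only on new writes; this sketch checks on append.
        if self.should_flush():
            self.flush()

    def should_flush(self) -> bool:
        age = time.monotonic() - (self.first_write or time.monotonic())
        return len(self.rows) >= self.flush_rows or age >= self.flush_secs

    def flush(self) -> None:
        # Here the buffered rows would be encoded as a single Parquet file
        # and uploaded under the partition prefix: fewer, larger files.
        self.rows.clear()
        self.first_write = None
```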
Updates/deletes: today Arc is append-only (like most time-series systems). Updates/deletes are planned via “rewrite on retention”, meaning you’ll be able to apply corrections or retention windows by rewriting affected partitions.
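Since this feature is still planned, the following is only a conceptual sketch of what "rewrite on retention" could look like, using pyarrow (the column name `timestamp` and the file-swap step are assumptions): read the affected partition, filter out rows past the retention cutoff, and write the partition back.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

def rewrite_partition(path: str, cutoff_ns: int) -> None:
    """Apply a retention window by rewriting one partition file:
    keep only rows at or after the cutoff, then replace the file.
    (Conceptual sketch of a 'rewrite on retention' approach.)"""
    table = pq.read_table(path)
    kept = table.filter(pc.field("timestamp") >= cutoff_ns)
    pq.write_table(kept, path + ".tmp")
    # A real implementation would swap the object atomically in
    # S3/MinIO, e.g. upload the new file and then delete the old one.
```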
The current focus is on predictable write throughput and analytical query performance, but schema evolution and partial rewrites are definitely on the roadmap.
Arc Browser, Arc Prize, Arc Institute and now the Arc Warehouse
I'm afraid "Arc" has become too fashionable this decade, and using it might decrease brand visibility.
Did you consider confusion with the Arc browser and still go with the name, or were you calling this Arc first and decided to just stick with it?
Hey, good question!
I didn’t really worry about confusion, since this isn’t a browser; it’s a completely different animal.
The name actually came from “Ark”, as in something that stores and carries, but I decided to go with Arc to avoid sounding too biblical.
The deeper reason is that Arc isn’t just about ingestion; it’s designed to store data long-term for other databases like InfluxDB, Timescale, or Kafka using Parquet and S3-style backends that scale economically while still letting you query everything with SQL.
The browser is dead anyway
Didn't that browser get mothballed by its devs?
I'll try this right now. I'm looking to self-host DuckDB because MotherDuck is way too expensive.
Awesome, would love to hear what you think once you try it out!
If it’s not too much trouble, feel free to share feedback at ignacio [at] basekick [dot] net.
> Arc Core is designed with MinIO as the primary storage backend
Noticing that all the benchmarking is being done with MinIO, which I presume is also running alongside/locally, so there is no latency and it will be roughly as fast as whatever underlying disk it's operating from.
Are there any benchmarks for using actual S3 as the storage layer?
How does Arc decide what to keep hot and local? TTL based? Frequency of access based?
We're going to be evaluating ClickHouse with this sort of hot (local) / cold (S3) configuration soon (https://clickhouse.com/docs/guides/separation-storage-comput...) but would like to evaluate other platforms if they are relevant.
Hey there, great questions.
The benchmarks weren’t run on the same machine as MinIO, but on the same network, connected over a 1 Gbps switch, so there’s a bit of real network latency, though still close to local-disk performance.
We’ve also tried a true remote setup before (compute roughly 160 ms away from AWS S3). I plan to rerun that scenario soon and publish the updated results for transparency.
Regarding “hot vs. cold” data, Arc doesn’t maintain separate tiers in the traditional sense. All data lives in the S3-compatible storage (MinIO or AWS S3), and we rely on caching for repeated query patterns instead of a separate local tier.
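As a rough illustration of what "caching for repeated query patterns" can mean (purely hypothetical; Arc's actual cache keying and eviction aren't described here), a small LRU cache keyed on normalized SQL might look like:

```python
import hashlib
from collections import OrderedDict

class QueryCache:
    """Tiny LRU result cache keyed on normalized SQL text; a stand-in
    for 'caching for repeated query patterns' (hypothetical)."""

    def __init__(self, max_entries: int = 256):
        self.max_entries = max_entries
        self._cache: OrderedDict = OrderedDict()

    @staticmethod
    def _key(sql: str) -> str:
        # Collapse whitespace and lowercase so trivially different
        # spellings of the same query share one cache entry.
        normalized = " ".join(sql.split()).lower()
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_run(self, sql: str, run):
        k = self._key(sql)
        if k in self._cache:
            self._cache.move_to_end(k)   # mark as recently used
            return self._cache[k]
        result = run(sql)                # cache miss: hit S3-backed storage
        self._cache[k] = result
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least-recently-used
        return result
```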
In practice, Arc performs better than ClickHouse when using S3 as the primary storage layer. ClickHouse can scan faster in pure analytical workloads, but Arc tends to outperform it on time-range–based queries (typical in observability and IoT).
I’ll post the new benchmark numbers in the next few days; they should give a clearer picture of the trade-offs.