Benchmarking Data Report
Scale breakdown: CPU vs GPU across data sizes
The Theseus benchmarks are part of an ongoing effort and are regularly updated. This latest update (March 2025) includes refreshed results for Voltron Data's Theseus engine at the 10TB, 30TB, and 100TB TPC-H scale factors, as well as new benchmarks for NVIDIA's Spark-RAPIDS engine at the same scale factors. The SPACE chart now reflects these additions. All other content in the Benchmarking Report remains unchanged, primarily covering TPC-H benchmark comparisons between Theseus and independently validated Spark EMR results.
We first introduced this SPACE (scale performance and cost efficiency) chart on our methodology page.
But tl;dr: the x-axis is the cluster cost per hour, so from left to right we move from cheaper to more expensive. Total runtime is on the y-axis, so from top to bottom we move from slower to faster. Bottom line: down and to the left means your dollar goes further – those configurations are faster AND cheaper.
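The reading rule above can be expressed as simple Pareto dominance: one configuration beats another when it is both cheaper and faster. Here is a minimal sketch using made-up (cost per hour, runtime) points, not the report's measurements:

```python
# Sketch: deciding when one configuration "dominates" another on a
# SPACE-style chart (cost/hour on x, total runtime on y).
# The example points are illustrative placeholders.

def dominates(a, b):
    """Config a dominates b if it is no more expensive AND no slower,
    and strictly better on at least one axis."""
    cost_a, runtime_a = a
    cost_b, runtime_b = b
    return (cost_a <= cost_b and runtime_a <= runtime_b
            and (cost_a < cost_b or runtime_a < runtime_b))

# (cluster cost per hour in $, total runtime in minutes)
cheap_fast = (50.0, 5.0)
pricey_slow = (400.0, 60.0)

print(dominates(cheap_fast, pricey_slow))  # True: cheaper AND faster
print(dominates(pricey_slow, cheap_fast))  # False
```

A curve sitting entirely below and to the left of another on the SPACE chart dominates it at every point.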
Let's break this one down, because it packs a lot of information in a single plot. You will notice that we have four curves plotted here: one for every system and data size combination.
100TB: Your dollar always goes further
There are two points we are excited to call out in this plot: Theseus at 8 nodes handling 100TB of data, and Theseus at 10 nodes handling it even better.
First of all, the obvious: Theseus at 10 nodes processing 100TB sits under both Spark curves. While it is close to the 10TB line, that only means that for a cost and runtime similar to processing 10TB with Spark, you can process 100TB with Theseus. Furthermore, it is always cheaper than processing 30TB of data with Spark. Your dollar is simply worth more with Theseus.
This is what an accelerator-native system can do — completely change the data scale so 100TB workloads look more like 10TB, leveraging GPUs to their fullest with surrounding infrastructure that makes them shine.
Per query breakdown
To learn more about the performance of Theseus at 100TB, explore these plots to see how different queries perform differently and how performance is affected by workload.
10TB: GPUs can do more with less
To put these costs in context: in 2024, a Big Mac in the United States cost $5.69, and a half gallon of organic milk cost $4.88. A 10TB query run on Theseus costs $2.19 (on 2 nodes), while the cheapest run on Spark costs $77.99 (on 20 nodes).
- Cost wall:
With this plot, it is clear that Theseus always costs less than Spark. It is also easy to see Spark's diminishing returns: moving from 10 to 100 to 200 nodes, each step yields a smaller percentage improvement. Increasing spend and acquiring more nodes yields less and less performance gain.
- Spark vs Spark:
At the high end, getting speed is quickly not worth the cost. For Spark 10TB, the change in execution time from 100 nodes to 200 nodes is 32 minutes to 27 minutes, while the cost leaps from $128.90 to $213.70. Ultimately, this results in paying 66% more for 17% more speed.
- Theseus vs Theseus:
Scaling quickly reaches a limit, with barely noticeable cost increases for additional nodes: $2.19 to run all of the 10TB queries derived from the TPC-H on 2 nodes, $4.63 on 10 nodes. In this case, 2 nodes is adequate for the problem size.
- Technical wall:
Even with infinite money, scaling out cannot solve this problem for Spark. CPU performance is capped, and no amount of spend will jump over this wall.
- Theseus vs Spark:
For this test, running Theseus at all is better than Spark at all points. In terms of cost: 2 Theseus nodes at 10TB complete the test at $2.19, as opposed to 200 Spark nodes at $213.70. Furthermore, there is no time tradeoff for reduced cost: 2 Theseus nodes finish 10TB of TPC-H in 5 minutes, as opposed to 200 nodes of Spark managing it in 27 minutes. This is a cost savings of 99%, while still going 80% faster.
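The percentages above follow directly from the quoted totals. A quick sanity check of that arithmetic:

```python
# Reproduce the cost-savings and speedup figures quoted above,
# using the report's totals for the 10TB TPC-H-derived run.

theseus_cost, theseus_minutes = 2.19, 5     # 2 Theseus nodes
spark_cost, spark_minutes = 213.70, 27      # 200 Spark nodes

cost_savings = 1 - theseus_cost / spark_cost        # fraction of spend saved
time_reduction = 1 - theseus_minutes / spark_minutes

print(f"cost savings: {cost_savings:.0%}")    # 99%
print(f"faster by:    {time_reduction:.0%}")  # 81% (quoted as 80%)
```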
30TB: It can pay off to scale up
When do you buy more hardware to accommodate larger data sizes? At 10TB, 2 Theseus nodes might be enough. At 30TB, matters become more interesting. Scale is needed for either system to quickly process this data.
- Minimum node counts: For example, Theseus at 2 nodes takes 23.5 minutes to complete this workload, while Spark at 10 nodes yields a 12-hour execution time.
- Maximum node counts: If we push the systems to their maximum, we are in a more reasonable time-scale: 5 minutes for 10 nodes of Theseus, and 58 minutes for 200 nodes of Spark. The decision becomes even clearer when you compare costs: the Theseus run costs $10.10, while the Spark run costs $458.49.
| Node configuration | 30TB | Runtime | Total cost |
|---|---|---|---|
| Minimum node counts | Theseus @ 2 nodes | 23.5 minutes | $9.42 |
| Minimum node counts | Spark @ 10 nodes | 12 hours | $318.65 |
| Maximum node counts | Theseus @ 10 nodes | 5 minutes | $10.10 |
| Maximum node counts | Spark @ 200 nodes | 58 minutes | $458.49 |
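One way to compare these runs on equal footing is to normalize total cost by data processed. A quick sketch using the 30TB totals above:

```python
# Normalize the 30TB run totals to dollars per terabyte processed.
# Totals come from the table above; the per-TB figures are derived.

runs = {
    "Theseus @ 2 nodes":  9.42,
    "Theseus @ 10 nodes": 10.10,
    "Spark @ 10 nodes":   318.65,
    "Spark @ 200 nodes":  458.49,
}
TB = 30

for name, total_cost in runs.items():
    print(f"{name}: ${total_cost / TB:.2f}/TB")
# Theseus lands at roughly $0.31-$0.34/TB; Spark at $10.62-$15.28/TB.
```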
Query breakdown: If you are doing large joins, GPUs can help
One of the most common questions we hear is "When do GPUs make a difference for data processing?" The answer: basically across every query we tested, though the size of the difference varies by query.
In the heatmap, you can focus on different types of queries derived from the TPC-H based on their primary operations. There are plenty of ways to categorize types of queries — we are sharing one way that we find useful when comparing CPU and GPU systems.
- Heavy Joins: Tests the engine's ability to pull and analyze data from multiple tables to create a final dataset, which strains parallel processing and the ability to hold and move large volumes of data for a given system.
- Heavy Sorts: Like joins, sorts perform large data movement and distributed analysis, with the main difference being that algorithms perform more computation over a single table, as opposed to data compilation over multiple tables.
- Heavy Aggregations: Whereas sorts and joins emphasize data movement, stressing network and I/O systems, aggregation is intensely compute heavy, and relies almost exclusively on a system's ability to process large volumes of data to yield singular outputs.
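To make the categories above concrete, here is a toy heuristic that labels a query by counting its key operations. The operation lists and the majority-vote rule are illustrative simplifications, not the report's actual methodology:

```python
# Toy classifier mirroring the three categories above: count a query's
# key operations and label it by whichever group dominates.
# The operation sets are assumptions for illustration only.

JOIN_OPS = {"INNER JOIN", "LEFT JOIN", "OUTER JOIN"}
SORT_OPS = {"ORDER BY", "MIN", "MAX", "RANK"}
AGG_OPS = {"SUM", "COUNT", "AVG", "GROUP BY"}

def categorize(key_ops):
    joins = sum(op in JOIN_OPS for op in key_ops)
    sorts = sum(op in SORT_OPS for op in key_ops)
    aggs = sum(op in AGG_OPS for op in key_ops)
    label, _ = max(
        [("Heavy Join", joins), ("Heavy Sort", sorts),
         ("Heavy Aggregation", aggs)],
        key=lambda kv: kv[1],
    )
    return label

# Q1's key operations land squarely in Heavy Aggregation:
print(categorize(["SUM", "COUNT", "AVG", "WHERE", "GROUP BY", "ORDER BY"]))
# Heavy Aggregation
```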
Table: Query descriptions
The following table summarizes the 22 queries derived from the TPC-H. We include our query categories to help you interpret the heatmap. Where a join is followed by a number in parentheses, that number indicates how many joins the query contains.
| Query Number | Short Title | Query Description | Key Operations | Category |
|---|---|---|---|---|
| 1 | Pricing Summary Report Query | Reports the amount of business that was billed, shipped, and returned. | SUM, COUNT, AVG, WHERE, GROUP BY, ORDER BY | Heavy Aggregation |
| 2 | Minimum Cost Supplier Query | Finds the cheapest suppliers of a part in a specific region, considering discounts. | Subquery, INNER JOIN (7), WHERE, MIN, ORDER BY | Heavy Sort |
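For a sense of what the Heavy Aggregation shape looks like in practice, here is a miniature Q1-style rollup (GROUP BY plus SUM/COUNT/AVG and an ORDER BY) over a few made-up rows, not actual TPC-H data:

```python
# Miniature "pricing summary" style aggregation: the pattern the
# Heavy Aggregation category stresses. Rows are invented stand-ins
# for lineitem-style data.
from collections import defaultdict

rows = [
    {"returnflag": "A", "qty": 10, "price": 100.0},
    {"returnflag": "A", "qty": 5,  "price": 55.0},
    {"returnflag": "R", "qty": 2,  "price": 20.0},
]

groups = defaultdict(list)
for r in rows:                        # a WHERE clause would filter here
    groups[r["returnflag"]].append(r)  # GROUP BY returnflag

summary = {}
for flag in sorted(groups):           # ORDER BY returnflag
    g = groups[flag]
    summary[flag] = {
        "sum_qty": sum(r["qty"] for r in g),               # SUM
        "count": len(g),                                   # COUNT
        "avg_price": sum(r["price"] for r in g) / len(g),  # AVG
    }

print(summary["A"])  # {'sum_qty': 15, 'count': 2, 'avg_price': 77.5}
```

At benchmark scale, this same reduction runs over billions of rows, which is why the category leans almost entirely on raw compute throughput.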
Bottom line: Cost savings
If you want to just skip straight to the bottom line, in this plot we highlight the cost savings between Spark and Theseus. We invite you to explore the differences between each system across queries at different node counts.
Conclusion
Performance is of course important, but it is not enough. This is one of the reasons we advocate for and build open standards-based composable systems that give you performance, plus a system that is: modular, interoperable, customizable, and extensible.
- No coupled interfaces: Projects like Ibis, Substrait, and SQLGlot provide interface abstraction so that when you write Theseus code you are writing code that supports, and seamlessly ports to, many different query engines, databases, and data processing frameworks.
- No proprietary storage formats: Being built on top of Arrow means that Theseus reads data where it lies, whether that is in file formats, databases, data warehouses, or anything that supports Arrow and ADBC.
Leaders in the data analytics industry (Andy Pavlo, Bucky Moore, Jordan Tigani) are all saying the same thing: due to open standards, OLAP engines are being commoditized, and barring architectural differences, performance across engines will converge over time.
This is why we propose a fundamentally novel distribution strategy built for an accelerator-native future. For the biggest data challenges, building a composable data system means you can choose the right tool for the job.
When it comes to scale:
- For less than 2TBs: We believe DuckDB and Arrow-backed projects like DataFusion and Polars make a lot of sense. This covers probably the majority of datasets in the world, which can be run most efficiently on these state-of-the-art query systems.
- For up to 30TBs: Well-known data warehouses like Snowflake, Google BigQuery, Databricks, and distributed processing frameworks like Spark and Trino work wonders at this scale.
- For anything over 30TBs: This is where Theseus makes sense. Our minimum threshold to move forward requires 10TB queries (not datasets), but we prefer to operate when queries exceed 100TBs. This is an incredibly rare class of problem, but if you are facing it, you know how quickly costs balloon, how often SLAs are missed, and how tenuously the data pipeline is held together.
For a deeper dive on how we collected the data presented in this report, check our About page to learn more or read about our methodology.
