<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator>
  <link href="https://trino.io/blog/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://trino.io/blog/" rel="alternate" type="text/html" />
  <updated>2026-04-08T03:00:47+00:00</updated>
  <id>https://trino.io/blog/feed.xml</id>

  <title>Trino Blog</title>

  <subtitle>Trino is a high performance, distributed SQL query engine for big data.</subtitle>

  
    <entry>
      <title>Introducing the NUMBER data type</title>
      <link href="https://trino.io/blog/2026/03/25/number-data-type.html" rel="alternate" type="text/html" title="Introducing the NUMBER data type" />
      <published>2026-03-25T00:00:00+00:00</published>
      <updated>2026-03-25T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2026/03/25/number-data-type</id>
      <content type="html" xml:base="https://trino.io/blog/2026/03/25/number-data-type.html">&lt;p&gt;One of Trino’s core strengths is breaking down data silos—enabling data
engineers to query diverse data sources through a single SQL interface. However,
when those sources use high-precision numeric types beyond Trino’s 38-digit
DECIMAL limit, that promise breaks down. Users faced an unpalatable choice: skip
the columns entirely and lose access to critical data, or accept lossy rounding
that compromises data integrity.&lt;/p&gt;

&lt;p&gt;This challenge required a new approach: a dedicated data type for high-precision,
variable-scale decimals.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Adding a new built-in data type to Trino is exceptionally rare. The last time we
introduced a new type was the UUID type in May 2019—nearly seven years ago.
Types are fundamental building blocks that touch many parts of the system, from
the type registry, through coercion rules to connectors, functions, and the protocol.
They require careful design and long-term commitment.&lt;/p&gt;

&lt;p&gt;With Trino 480, we’re excited to introduce the NUMBER type—a high-precision
decimal type that breaks down these data silos and enables seamless access to
numeric data across diverse database systems. This addition is particularly
powerful for data engineers working with Oracle, PostgreSQL, MySQL, MariaDB, and
SingleStore, which support numeric precision beyond the traditional 38-digit
DECIMAL limit.&lt;/p&gt;

&lt;p&gt;Let’s explore why NUMBER matters, how it works, and how it will simplify your
data integration workflows.&lt;/p&gt;

&lt;h2 id=&quot;the-challenge-precision-beyond-38-digits&quot;&gt;The challenge: precision beyond 38 digits&lt;/h2&gt;

&lt;p&gt;Trino’s DECIMAL type has long supported exact numeric values with precision up
to 38 decimal digits, which covers the vast majority of use cases. However,
many database systems support higher precision:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Oracle NUMBER&lt;/strong&gt;: when declared as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER(p, s)&lt;/code&gt;, precision must be in [1, 38] and
scale in [-84, 127]. When declared as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt; without precision/scale, each value
can have different scale, and actual precision can reach 40 decimal digits. Oracle can
store values from 10^-130 to (but not including) 10^126.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PostgreSQL NUMERIC&lt;/strong&gt;: when declared as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMERIC(p, s)&lt;/code&gt;, precision can reach 1000 and
scale can range from -1000 to 1000. When declared without precision/scale constraints,
each value can have a different scale, with up to 131,072 digits before the
decimal point and 16,383 digits after it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;MySQL, MariaDB, SingleStore DECIMAL&lt;/strong&gt;: up to 65 digits of precision (scale 0-30)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before Trino 480, accessing these high-precision numeric columns required
choosing between two unsatisfying options:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Skip the columns entirely&lt;/strong&gt; and lose access to potentially critical data.
This was the default behavior.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Accept lossy conversions&lt;/strong&gt; - Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-mapping=ALLOW_OVERFLOW&lt;/code&gt; with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-default-scale=S&lt;/code&gt; to force values into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECIMAL(38, S)&lt;/code&gt;, losing precision
through rounding and failing for numbers greater than or equal to 10^(38-S).
For example, with scale 10, values ≥ 10^28 would fail.&lt;/li&gt;
&lt;/ol&gt;
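
&lt;p&gt;As a sketch of the second option, assuming a catalog configured with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-mapping=ALLOW_OVERFLOW&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-default-scale=10&lt;/code&gt;, and a
hypothetical table with an unconstrained NUMERIC column named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;precise_value&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- precise_value is forced into DECIMAL(38, 10): digits beyond scale 10
-- are rounded away, and values at or above 10^28 fail the query
SELECT precise_value
FROM postgresql.lab.measurements;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;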

&lt;p&gt;Neither option is ideal for data federation and warehousing scenarios where
preserving data fidelity is essential.&lt;/p&gt;

&lt;h2 id=&quot;enter-number-arbitrary-precision-decimals-in-trino&quot;&gt;Enter NUMBER: arbitrary-precision decimals in Trino&lt;/h2&gt;

&lt;p&gt;The NUMBER type solves this problem by supporting floating-point decimal numbers
of high precision and flexible scale. In practice, NUMBER supports values with
up to 200 digits of precision – far exceeding what most database workloads require.
Each value can have a different scale, allowing for values as small as 10^-16000
(or even smaller) and as large as 10^16000 (or even larger) within the same column.&lt;/p&gt;

&lt;p&gt;Here’s what NUMBER looks like in action:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- High-precision literal (50+ digits)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;3.1415926535897932384626433832795028841971693993751&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; 3.1415926535897932384626433832795028841971693993751
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Scientific notation with extreme precision&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;12345678901234567890123456789012345678901234567890e30&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; 1.234567890123456789012345678901234567890123456789E+79
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Verify the type&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;123.456&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; number
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;special-values&quot;&gt;Special values&lt;/h3&gt;

&lt;p&gt;NUMBER also supports special values similar to IEEE 754 floating-point types:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;Infinity&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;positive_infinity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;-Infinity&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;negative_infinity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;NaN&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;not_a_number&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; positive_infinity | negative_infinity | not_a_number
-------------------+-------------------+--------------
 +Infinity         | -Infinity         | NaN
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These special values follow the comparison and ordering semantics of DOUBLE.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt; compares as unequal to all values, including
itself, so any comparison with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt; returns false. When sorting, values are
ordered as follows: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Infinity&lt;/code&gt;, all finite values, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+Infinity&lt;/code&gt;, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt;.&lt;/p&gt;
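
&lt;p&gt;A short illustration of these semantics, following the DOUBLE behavior
described above:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Both comparisons return false: NaN is unequal to every value
SELECT
  NUMBER &apos;NaN&apos; = NUMBER &apos;NaN&apos; AS nan_equals_nan,
  NUMBER &apos;NaN&apos; &amp;lt; NUMBER &apos;1&apos; AS nan_less_than_one;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;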

&lt;p&gt;The special values are particularly useful for handling edge cases in source data.
For example, PostgreSQL’s NUMERIC type can represent &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Infinity&lt;/code&gt;, and
these values now map seamlessly to NUMBER when queried through the PostgreSQL
connector.&lt;/p&gt;

&lt;h2 id=&quot;seamless-connector-integration&quot;&gt;Seamless connector integration&lt;/h2&gt;

&lt;p&gt;The real power of NUMBER becomes apparent when querying external databases. Five
connectors now automatically map high-precision numeric types to NUMBER,
requiring &lt;strong&gt;no configuration changes&lt;/strong&gt;:&lt;/p&gt;

&lt;h3 id=&quot;oracle-connector&quot;&gt;Oracle connector&lt;/h3&gt;

&lt;p&gt;Oracle’s NUMBER type supports variable precision and scale. The Oracle connector
now maps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER(p, s)&lt;/code&gt; where p &amp;gt; 38 → Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt; without precision/scale → Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt; with extreme scale values → Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Query an Oracle table with high-precision columns&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;order_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unit_price&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;extended_price&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;oracle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;extended_price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;1000000000000000000000000&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;postgresql-connector&quot;&gt;PostgreSQL connector&lt;/h3&gt;

&lt;p&gt;PostgreSQL’s NUMERIC type supports very high precision and even “unconstrained”
precision. The connector automatically handles:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMERIC(p, s)&lt;/code&gt; where p &amp;gt; 38 → Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMERIC&lt;/code&gt; without precision/scale → Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Access PostgreSQL scientific data without precision loss&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;measurement_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precise_value&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;-- a NUMERIC column&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;mysql-mariadb-and-singlestore-connectors&quot;&gt;MySQL, MariaDB, and SingleStore connectors&lt;/h3&gt;

&lt;p&gt;These MySQL-compatible databases support DECIMAL precision up to 65 digits. The
connectors now map:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECIMAL(p, s)&lt;/code&gt; where p &amp;gt; 38 → Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NUMBER&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Join across different databases with high precision&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;account_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;balance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mysql_balance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;balance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;oracle_balance&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mysql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;banking&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accounts&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;oracle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;banking&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accounts&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;account_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;account_id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;balance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;balance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;0.01&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;backwards-compatibility-and-migration&quot;&gt;Backwards compatibility and migration&lt;/h2&gt;

&lt;p&gt;The NUMBER type integration is designed to be seamless and backward compatible:&lt;/p&gt;

&lt;h3 id=&quot;automatic-mapping&quot;&gt;Automatic mapping&lt;/h3&gt;

&lt;p&gt;If you previously relied on the default behavior (no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-mapping&lt;/code&gt;
configuration), your queries now automatically use NUMBER for high-precision
columns. No configuration changes needed.&lt;/p&gt;
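
&lt;p&gt;For example, given a hypothetical PostgreSQL table with an unconstrained
NUMERIC column, you can confirm the mapping directly:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- precise_value is declared as unconstrained NUMERIC in PostgreSQL;
-- typeof confirms it now surfaces as NUMBER
SELECT typeof(precise_value)
FROM postgresql.lab.measurements
LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;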

&lt;h3 id=&quot;legacy-configurations-still-work&quot;&gt;Legacy configurations still work&lt;/h3&gt;

&lt;p&gt;If you explicitly configured &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-mapping=ALLOW_OVERFLOW&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-mapping=STRICT&lt;/code&gt;, your existing configuration continues to work. The
NUMBER mapping is disabled when these options are set, ensuring no surprises.&lt;/p&gt;

&lt;p&gt;However, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-mapping&lt;/code&gt; configuration and related session properties
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal_mapping&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal_default_scale&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal_rounding_mode&lt;/code&gt;) are now
&lt;strong&gt;deprecated&lt;/strong&gt; and will be removed in a future Trino release. We recommend
migrating to NUMBER-based workflows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (with lossy conversion):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-properties highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# catalog/postgresql.properties
&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;connection-url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;jdbc:postgresql://host:5432/database&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;connection-user&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;connection-password&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;password&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;decimal-mapping&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ALLOW_OVERFLOW&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;decimal-default-scale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;decimal-rounding-mode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;HALF_UP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;After (lossless with NUMBER):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-properties highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# catalog/postgresql.properties
&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;connection-url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;jdbc:postgresql://host:5432/database&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;connection-user&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;connection-password&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;password&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# No decimal-mapping needed - NUMBER is used automatically!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For Oracle, if you previously used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oracle.number.rounding-mode&lt;/code&gt; to handle
high-precision NUMBER columns, you can now remove this configuration to enable
native NUMBER mapping.&lt;/p&gt;

&lt;h2 id=&quot;working-with-number&quot;&gt;Working with NUMBER&lt;/h2&gt;

&lt;h3 id=&quot;type-conversions&quot;&gt;Type conversions&lt;/h3&gt;

&lt;p&gt;NUMBER integrates naturally with Trino’s type system:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Convert from other numeric types&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;DECIMAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;123.45&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;from_decimal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12345&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;from_integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;123&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;45&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;from_double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; from_decimal | from_integer | from_double
--------------+--------------+-------------
 123.45       | 12345        | 123.45
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Convert NUMBER to other types&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;123.456&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to_bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;123.456&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;DOUBLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to_double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;123.456&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;DECIMAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to_decimal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; to_bigint | to_double | to_decimal
-----------+-----------+------------
 123       | 123.456   | 123.46
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;aggregate-functions&quot;&gt;Aggregate functions&lt;/h3&gt;

&lt;p&gt;Common aggregate functions work naturally with NUMBER:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Aggregate high-precision values&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;department&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;revenue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total_revenue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;avg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;revenue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;average_revenue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;revenue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_revenue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;revenue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_revenue&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;oracle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;department&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;creating-tables-with-number-columns&quot;&gt;Creating tables with NUMBER columns&lt;/h3&gt;

&lt;p&gt;The Oracle and PostgreSQL connectors support creating tables with NUMBER columns:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Create a PostgreSQL table with NUMBER column&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;measurements&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;precise_value&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Create an Oracle table with NUMBER column&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;oracle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scientific_data&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;experiment_id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;measurement&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NUMBER&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
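
&lt;p&gt;Once created, such tables accept NUMBER literals on write; continuing with the
hypothetical measurements table above:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Insert a value wider than 38 digits without rounding or overflow
INSERT INTO postgresql.schema.measurements (id, precise_value)
VALUES (1, NUMBER &apos;1234567890123456789012345678901234567890.123456789&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;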

&lt;h2 id=&quot;technical-characteristics-and-limitations&quot;&gt;Technical characteristics and limitations&lt;/h2&gt;

&lt;p&gt;While NUMBER provides high precision, it’s important to understand its
characteristics:&lt;/p&gt;

&lt;h3 id=&quot;precision-and-scale&quot;&gt;Precision and scale&lt;/h3&gt;

&lt;p&gt;Trino’s NUMBER type characteristics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Supported precision&lt;/strong&gt;: currently 200 decimal digits.
While we consider this an implementation detail that may change in future releases,
it is unlikely that maximum precision will be decreased.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scale range&lt;/strong&gt;: -16,384 to 16,383&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Variable scale&lt;/strong&gt;: each value can have a different scale, similar to
PostgreSQL NUMERIC and Oracle NUMBER&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Special values&lt;/strong&gt;: supports &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Infinity&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Infinity&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
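&lt;p&gt;As a quick illustration of variable scale and special values, here is a
sketch that assumes casting from VARCHAR literals is supported; the exact
literal syntax may differ:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Values in the same NUMBER column can carry different scales
SELECT CAST('123.45' AS NUMBER) AS two_decimals,
       CAST('0.1234567890123456789012345678901234567890' AS NUMBER) AS forty_decimals;

-- Special values are first-class
SELECT CAST('NaN' AS NUMBER) AS not_a_number,
       CAST('Infinity' AS NUMBER) AS positive_infinity;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;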

&lt;p&gt;Comparison of decimal numeric types across database systems:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Database&lt;/th&gt;
      &lt;th&gt;Max Precision&lt;/th&gt;
      &lt;th&gt;Scale Range&lt;/th&gt;
      &lt;th&gt;Variable Scale&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Oracle NUMBER(p, s)&lt;/td&gt;
      &lt;td&gt;38&lt;/td&gt;
      &lt;td&gt;-84 to 127&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Oracle NUMBER&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;Approximately -130 to 126&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;PostgreSQL NUMERIC(p, s)&lt;/td&gt;
      &lt;td&gt;1000&lt;/td&gt;
      &lt;td&gt;-1000 to 1000&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;PostgreSQL NUMERIC&lt;/td&gt;
      &lt;td&gt;131,072&lt;/td&gt;
      &lt;td&gt;-1000 to 1000&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MySQL/MariaDB/SingleStore DECIMAL&lt;/td&gt;
      &lt;td&gt;65&lt;/td&gt;
      &lt;td&gt;0 to 30&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Trino DECIMAL&lt;/td&gt;
      &lt;td&gt;38&lt;/td&gt;
      &lt;td&gt;0 to 38&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Trino NUMBER&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;200&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;-16,384 to 16,383&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;storage-and-representation&quot;&gt;Storage and representation&lt;/h3&gt;

&lt;p&gt;NUMBER uses a variable-width binary format optimized for flexibility:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;2-byte header encoding sign and scale&lt;/li&gt;
  &lt;li&gt;Variable-length magnitude in big-endian format&lt;/li&gt;
  &lt;li&gt;The binary format is considered unstable and may evolve in future releases to
enable optimizations and performance improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility allows Trino to improve NUMBER’s internal representation over
time without breaking connector compatibility.
The Trino SPI provides a stable API for connectors to read and write NUMBER values,
abstracting away the internal format.&lt;/p&gt;

&lt;h3 id=&quot;performance-considerations&quot;&gt;Performance considerations&lt;/h3&gt;

&lt;p&gt;NUMBER uses Java’s BigDecimal for arithmetic operations, which provides exact
precision at the cost of being slower than fixed-precision types like BIGINT,
DOUBLE or DECIMAL. For this reason, NUMBER is designed for scenarios where
precision is more important than computational speed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Best for&lt;/strong&gt;: reading and storing high-precision data from source systems,
data federation, reporting, data warehousing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Not optimal for&lt;/strong&gt;: computational heavy-lifting, complex mathematical
operations, high-performance analytics on numeric columns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your workload involves extensive numeric computation, consider whether DECIMAL
(for up to 38 digits), DOUBLE (for approximate arithmetic), or BIGINT (for
integer arithmetic) might be more appropriate.&lt;/p&gt;
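&lt;p&gt;If exact arithmetic within 38 digits is sufficient, a cast on the hot path
keeps the values exact while regaining DECIMAL performance. This sketch reuses
the hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;measurements&lt;/code&gt; table from above:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Exact arithmetic on a bounded-precision copy of the column
SELECT sum(CAST(precise_value AS DECIMAL(38, 10))) AS total
FROM postgresql.schema.measurements;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;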

&lt;h3 id=&quot;function-support&quot;&gt;Function support&lt;/h3&gt;

&lt;p&gt;NUMBER supports essential operations:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Arithmetic: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Aggregations: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;avg()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;min()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max()&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Mathematical functions: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;abs()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sign()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ceiling()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;floor()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;truncate()&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;round()&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Special value checks: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_nan()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_finite()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_infinite()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many advanced mathematical functions (trigonometric, logarithmic, etc.)
do not work with NUMBER directly and require explicit type conversions to DOUBLE or DECIMAL.&lt;/p&gt;
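&lt;p&gt;For example, computing a logarithm over a NUMBER column requires making the
cast explicit. This sketch reuses the hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scientific_data&lt;/code&gt; table from above:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Advanced math functions operate on DOUBLE, so cast explicitly
SELECT experiment_id,
       ln(CAST(measurement AS DOUBLE)) AS log_measurement
FROM oracle.schema.scientific_data;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;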

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next&lt;/h2&gt;

&lt;p&gt;The NUMBER type support will continue to evolve. Additional connectors are
planned for future releases:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt;: for Decimal256 type mapping&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Apache Ignite&lt;/strong&gt;: for high-precision numeric support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re also exploring performance optimizations and expanding function support
based on community feedback.&lt;/p&gt;

&lt;h2 id=&quot;getting-started&quot;&gt;Getting started&lt;/h2&gt;

&lt;p&gt;NUMBER support is available now in Trino 480. To start using it:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Upgrade to Trino 480&lt;/strong&gt; - NUMBER is available out of the box&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Remove deprecated configs&lt;/strong&gt; - If you used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal-mapping&lt;/code&gt; configurations,
consider removing them to enable automatic NUMBER mapping&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Query your data&lt;/strong&gt; - High-precision columns are now accessible without
configuration&lt;/li&gt;
&lt;/ol&gt;
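&lt;p&gt;After the upgrade, high-precision columns behave like any other column. This
sketch queries the hypothetical tables created earlier:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Read and aggregate high-precision values without extra configuration
SELECT id, precise_value
FROM postgresql.schema.measurements
WHERE precise_value IS NOT NULL;

SELECT max(measurement) AS largest_measurement
FROM oracle.schema.scientific_data;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;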

&lt;p&gt;For detailed documentation, refer to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/language/types.html&quot;&gt;NUMBER type reference&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/oracle.html&quot;&gt;Oracle connector documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/postgresql.html&quot;&gt;PostgreSQL connector documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/mysql.html&quot;&gt;MySQL connector documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/mariadb.html&quot;&gt;MariaDB connector documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/singlestore.html&quot;&gt;SingleStore connector documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Have questions or feedback? Join the discussion on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino community
Slack&lt;/a&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#dev&lt;/code&gt; channel, or open an issue on
&lt;a href=&quot;https://github.com/trinodb/trino/issues&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The NUMBER type represents a significant milestone in Trino’s evolution,
eliminating precision loss barriers and making high-precision numeric data from
diverse sources readily accessible for analytics and reporting. We’re excited to
see how the community uses this powerful new capability!&lt;/p&gt;


      
        <author>
          <name>Piotr Findeisen, Starburst Data</name>
        </author>
      

      <summary>One of Trino’s core strengths is breaking down data silos—enabling data engineers to query diverse data sources through a single SQL interface. However, when those sources use high-precision numeric types beyond Trino’s 38-digit DECIMAL limit, that promise breaks down. Users faced an impossible choice: skip the columns entirely and lose access to critical data, or accept lossy rounding that compromises data integrity. This challenge required a new approach: a dedicated data type for high-precision, variable-scale decimals.</summary>

      
      
    </entry>
  
    <entry>
      <title>Core Principles and Design Practices of OLAP Engines</title>
      <link href="https://trino.io/blog/2025/03/27/olap-principles-book.html" rel="alternate" type="text/html" title="Core Principles and Design Practices of OLAP Engines" />
      <published>2025-03-27T00:00:00+00:00</published>
      <updated>2025-03-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2025/03/27/olap-principles-book</id>
      <content type="html" xml:base="https://trino.io/blog/2025/03/27/olap-principles-book.html">&lt;p&gt;Yiteng Xu and Yingju Gao are proud to announce their new book “Core Principles
and Design Practices of OLAP Engines” from China Machine Press. This is great news
for the Trino community, since the book is based on the open source project
Trino, specifically Trino 350. It took the two authors more than four years
to finish writing it. All concepts and details are explained with a Trino flavor and
generalized to all OLAP engines. Walk through the chapters with us and you will
find that the two authors dive deep into the source code and bring you many
treasures.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;author-introduction&quot;&gt;Author introduction&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/medsmeds&quot;&gt;Yiteng (Ivan) Xu&lt;/a&gt; is a data security engineer
currently utilizing Trino, Spark, and Calcite for SQL analysis. His work
encompasses various scenarios, including data warehouse metrics, SQL
auto-rewriting, SQL purpose detection, and the development of a SQL-based
purpose-aware access control system.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/garyelephant&quot;&gt;Yingju (Gary) Gao&lt;/a&gt; is an Apache SeaTunnel PMC
member and the lead of the time series database team. He currently serves as the
technical lead for the observability-engine team, and is responsible for
building the ecosystem for observability data, including metrics, trace, log,
and event data, providing a high-performance, high-throughput data pipeline from
ingestion to consumption, storage, querying, and data warehousing. Additionally,
he oversees metrics stability, multi-tenant access, and user requirement
integration.&lt;/p&gt;

&lt;p&gt;Both authors are passionate about sharing their technical knowledge. They have
delved deep into source code and excel in technical writing, breaking down
complex underlying principles into a linear and comprehensible format for
readers. They firmly believe that sharing is a virtue and are committed to
continuing their technical contributions.&lt;/p&gt;

&lt;p&gt;So now it is time to get the book, or read on for a walk through of the content:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; target=&quot;_blank&quot; href=&quot;https://product.dangdang.com/11974653727.html&quot;&gt;
        Get the book from dangdang.com
    &lt;/a&gt;
    &lt;a class=&quot;btn btn-pink&quot; target=&quot;_blank&quot; href=&quot;https://item.m.jd.com/product/10136949561522.html&quot;&gt;
        Get the book from jd.com
    &lt;/a&gt;
&lt;/div&gt;

&lt;h2 id=&quot;walk-through&quot;&gt;Walk through&lt;/h2&gt;

&lt;p&gt;Let’s have a look at the different chapters in a high-level walk through.&lt;/p&gt;

&lt;h3 id=&quot;part-1-background-knowledge&quot;&gt;Part 1: Background knowledge&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chapter 1&lt;/strong&gt;: Introduces the concept of OLAP (Online Analytical Processing)
and provides a comparison of different engines like Trino, Impala, Doris, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 2&lt;/strong&gt;: Provides a comprehensive introduction to the Trino engine,
covering its principles, architecture, enterprise use cases, compilation, and
execution. It also compares Trino with the Presto project and introduces the
SQL statements that are referenced throughout the book.&lt;/p&gt;

&lt;h3 id=&quot;part-2-core-principles&quot;&gt;Part 2: Core principles&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chapter 3&lt;/strong&gt;: Offers an overview of the distributed SQL query process, serving
as a high-level introduction to the subsequent chapters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 4&lt;/strong&gt;: Begins with the generation of query execution plans, including
the transformation of SQL into abstract syntax trees, semantic analysis, and the
creation of initial logical plans. It then delves into the theoretical knowledge
of optimizers and the overall framework of the Trino optimizer.&lt;/p&gt;

&lt;h3 id=&quot;part-3-classic-sql&quot;&gt;Part 3: Classic SQL&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chapter 5&lt;/strong&gt;: Explains the generation and optimization of execution plans for
SQL statements involving only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TableScan&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Filter&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Project&lt;/code&gt; operations,
along with their scheduling and execution processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 6&lt;/strong&gt;: Focuses on SQL statements with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Limit&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sort&lt;/code&gt; operations,
detailing the generation and optimization of execution plans, as well as their
scheduling and execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 7&lt;/strong&gt;: Introduces the basic principles of aggregate queries. It then
covers the generation and optimization of execution plans for grouped and
non-grouped aggregate SQL statements, along with their scheduling and execution
processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 8&lt;/strong&gt;: Discusses SQL statements with count distinct and multiple
aggregate operations, explaining the generation and optimization of execution
plans, as well as their scheduling and execution. This includes the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Scatter-Gather&lt;/code&gt; model and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MarkDistinct&lt;/code&gt; optimization. Finally, a complex SQL
statement is used to tie together the concepts from Chapters 5 to 8.&lt;/p&gt;

&lt;h3 id=&quot;part-4-data-exchange-mechanism&quot;&gt;Part 4: Data exchange mechanism&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chapter 9&lt;/strong&gt;: Introduces the overall concept of data exchange mechanisms and
how data exchange is incorporated during the query optimization phase via the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddExchanges&lt;/code&gt; optimizer, along with the design principles for scheduling and
execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 10&lt;/strong&gt;: Explains how tasks establish connections during the query
scheduling phase and the mechanisms for upstream and downstream data flow during
execution. It also covers the principles of intra-task data exchange, RPC
interaction mechanisms, and analyzes backpressure, Limit semantics, and
out-of-order request handling.&lt;/p&gt;

&lt;h3 id=&quot;part-5-plugin-mechanisms-and-connectors&quot;&gt;Part 5: Plugin mechanisms and connectors&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chapter 11&lt;/strong&gt;: Begins with an introduction to Trino’s plugin system and SPI
mechanism, including plugin loading and JVM’s class loading principles. It then
dissects connectors, covering metadata modules, read modules, pushdown
optimization, and providing in-depth insights into connector design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 12&lt;/strong&gt;: Uses the example-http connector to help readers understand
connector design and implements a simple data source using Python’s Flask
framework.&lt;/p&gt;

&lt;h3 id=&quot;part-6-function-principles-and-development&quot;&gt;Part 6: Function principles and development&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chapter 13&lt;/strong&gt;: Provides an overview of Trino’s function system, including
function types, lifecycle, and several function development methods. It delves
into the data structures and annotations related to functions and explains the
function registration and parsing process during semantic analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chapter 14&lt;/strong&gt;: Focuses on how to write a UDF (user-defined function) in practice. It covers
annotation-based development methods for scalar functions, as well as low-level
development methods using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;codeGen&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;methodHandle&lt;/code&gt; APIs. For aggregate
functions, it introduces annotation-based development methods and low-level
methods where developers handle serialization and state on their own.&lt;/p&gt;

&lt;h3 id=&quot;why-trino&quot;&gt;Why Trino?&lt;/h3&gt;

&lt;p&gt;In 2020, one of the authors, Yiteng Xu, encountered a scenario at work where
data needed to be read from two Hive instances, each modified by different
internal teams. The company’s infrastructure team attempted a simple solution by
registering virtual tables and using MapReduce for federated queries. However,
this approach proved inadequate for the agile analysis needs of data analysts,
with complex queries taking nearly 12 hours to complete. One mistake in a SQL
query meant an entire day was wasted.&lt;/p&gt;

&lt;p&gt;Later, another team researched and adopted Presto (before Trino became
independent). By adapting the Hive engine at the connector level, they enabled
federated queries across the two Hive instances without data migration or
extensive code changes. Users only needed to be aware of a catalog prefix,
making the process incredibly convenient. The author later had the opportunity
to participate in the project and developed a strong interest in its source
code. The elegance of the open-source project, its plugin design, and the inner
workings of connectors and Airlift framework sparked a deep curiosity, leading
the author on a journey of source code exploration. As the PrestoSQL project was
more active and receptive to developer feedback, the author chose to continue
following the Trino project when it emerged in late 2020.&lt;/p&gt;

&lt;h2 id=&quot;get-your-copy&quot;&gt;Get your copy&lt;/h2&gt;

&lt;p&gt;Now it is time for you to get your copy of &lt;strong&gt;Core Principles and Design Practices of OLAP Engines&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; target=&quot;_blank&quot; href=&quot;https://product.dangdang.com/11974653727.html&quot;&gt;
        Get the book from dangdang.com
    &lt;/a&gt;
    &lt;a class=&quot;btn btn-pink&quot; target=&quot;_blank&quot; href=&quot;https://item.m.jd.com/product/10136949561522.html&quot;&gt;
        Get the book from jd.com
    &lt;/a&gt;
&lt;/div&gt;</content>

      
        <author>
          <name>Yiteng Xu, Yingju Gao, Manfred Moser</name>
        </author>
      

      <summary>Yiteng Xu and Yingju Gao are proud to announce their new book “Core Principles and Design Practices of OLAP Engines” from China Machine Press. This is great news for the Trino community, since the book is based on the open source project Trino, specifically Trino 350. It took the two authors more than four years to finish writing it. All concepts and details are explained with a Trino flavor and generalized to all OLAP engines. Walk through the chapters with us and you will find that the two authors dive deep into the source code and bring you many treasures.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/core-principles-olap-book.jpg" />
      
    </entry>
  
    <entry>
      <title>Twenty four</title>
      <link href="https://trino.io/blog/2025/03/03/java-24.html" rel="alternate" type="text/html" title="Twenty four" />
      <published>2025-03-03T00:00:00+00:00</published>
      <updated>2025-03-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2025/03/03/java-24</id>
      <content type="html" xml:base="https://trino.io/blog/2025/03/03/java-24.html">&lt;p&gt;Six months ago &lt;a href=&quot;/blog/2024/09/17/java-23.html&quot;&gt;we adopted Java 23 as a requirement&lt;/a&gt;, following our standard procedure to upgrade with each Java version as soon
as it becomes available. This allows us to take advantage of all the great
improvements each release brings. The upgrade to 23 was pretty easy, since the
changes from 22 to 23 were not that big. The story turns out to be a bit
different now with our upgrade to Java 24.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;java-24-features&quot;&gt;Java 24 features&lt;/h2&gt;

&lt;p&gt;We have been &lt;a href=&quot;https://github.com/trinodb/trino/issues/23498&quot;&gt;planning and working towards the
upgrade&lt;/a&gt; consistently since the
23 bump in September. Java 24 is set to be released in March 2025 and the list
of changes is quite significant:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;JEP 450 Compact Object Headers (Experimental)&lt;/li&gt;
  &lt;li&gt;JEP 472 Prepare to Restrict the Use of JNI&lt;/li&gt;
  &lt;li&gt;JEP 475 Late Barrier Expansion for G1&lt;/li&gt;
  &lt;li&gt;JEP 478 Key Derivation Function API (Preview)&lt;/li&gt;
  &lt;li&gt;JEP 483 Ahead-of-Time Class Loading &amp;amp; Linking&lt;/li&gt;
  &lt;li&gt;JEP 484 Class-File API&lt;/li&gt;
  &lt;li&gt;JEP 485 Stream Gatherers&lt;/li&gt;
  &lt;li&gt;JEP 486 Permanently Disable the Security Manager&lt;/li&gt;
  &lt;li&gt;JEP 487 Scoped Values (Fourth Preview)&lt;/li&gt;
  &lt;li&gt;JEP 488 Primitive Types in Patterns, instanceof, and switch (Second Preview)&lt;/li&gt;
  &lt;li&gt;JEP 489 Vector API (Ninth Incubator)&lt;/li&gt;
  &lt;li&gt;JEP 490 ZGC: Remove the Non-Generational Mode&lt;/li&gt;
  &lt;li&gt;JEP 491 Synchronize Virtual Threads without Pinning&lt;/li&gt;
  &lt;li&gt;JEP 492 Flexible Constructor Bodies (Third Preview)&lt;/li&gt;
  &lt;li&gt;JEP 494 Module Import Declarations (Second Preview)&lt;/li&gt;
  &lt;li&gt;JEP 495 Simple Source Files and Instance Main Methods (Fourth Preview)&lt;/li&gt;
  &lt;li&gt;JEP 496 Quantum-Resistant Module-Lattice-Based Key Encapsulation Mechanism&lt;/li&gt;
  &lt;li&gt;JEP 497 Quantum-Resistant Module-Lattice-Based Digital Signature Algorithm&lt;/li&gt;
  &lt;li&gt;JEP 498 Warn upon Use of Memory-Access Methods in sun.misc.Unsafe&lt;/li&gt;
  &lt;li&gt;JEP 499 Structured Concurrency (Fourth Preview)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond these JEPs, the list of smaller changes and enhancements is also quite
large. You can find more details
in the &lt;a href=&quot;https://jdk.java.net/24/release-notes&quot;&gt;release notes&lt;/a&gt; and each
individual JEP.&lt;/p&gt;

&lt;h2 id=&quot;trino-perspective&quot;&gt;Trino perspective&lt;/h2&gt;

&lt;p&gt;From a Trino perspective we want to specifically take advantage of performance
improvements to MemorySegment (mismatch, copy, fill), “JEP 491 Synchronize
Virtual Threads without Pinning” and “JEP 475 Late Barrier Expansion for G1”. On
the other hand &lt;a href=&quot;https://openjdk.org/jeps/486&quot;&gt;JEP 486 Permanently Disable the Security
Manager&lt;/a&gt; turned out to be the most impactful.&lt;/p&gt;

&lt;p&gt;Since Trino and its connectors have a large footprint of dependencies, there was
a high chance that some projects were not keeping up with the security manager
removal, even though it was first deprecated with Java 17 in 2021.&lt;/p&gt;

&lt;p&gt;At this stage the Kafka, Kudu, and Phoenix connectors are affected. The Kafka
project is planning to make a new compatible release available in time and we
will adopt that version.&lt;/p&gt;

&lt;p&gt;The Kudu and Phoenix connectors however will be removed, since it is not
possible to use them with Java 24 as requirement. Both connectors are not
heavily used in our community as we learned from our communication with numerous
users, integrators, and the results from our &lt;a href=&quot;/blog/2025/01/07/2024-and-beyond.html&quot;&gt;user survey&lt;/a&gt;. We are tracking progress for each removal in the
issues &lt;a href=&quot;https://github.com/trinodb/trino/issues/24419&quot;&gt;#24419 Phoenix connector&lt;/a&gt;
and &lt;a href=&quot;https://github.com/trinodb/trino/issues/24417&quot;&gt;#24417 Kudu connector&lt;/a&gt;. If
either of these communities ends up supporting Java 24, or a newer version as
required by Trino, in the future, we can potentially add the connectors back in
if community members contribute updated versions.&lt;/p&gt;

&lt;h2 id=&quot;release-plans&quot;&gt;Release plans&lt;/h2&gt;

&lt;p&gt;In terms of shipping the changes we follow our established pattern:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Clean up the codebase and get it ready; specifically, this includes the
removal of the Kudu and Phoenix connectors.&lt;/li&gt;
  &lt;li&gt;Cut a release that is completely ready to be used with Java 24, but does not
yet make it a hard requirement.&lt;/li&gt;
  &lt;li&gt;Allow for community testing and feedback using Java 24.&lt;/li&gt;
  &lt;li&gt;Introduce Java 24 as a hard requirement in another release.&lt;/li&gt;
  &lt;li&gt;Adopt Java 24 features and bring the benefits to our users with the following
releases.&lt;/li&gt;

&lt;p&gt;As you can see, there is a bunch of work waiting, so we better get back to it. As usual,
if you have questions or comments, chime in on the relevant issue or chat with
us on &lt;a href=&quot;/slack.html&quot;&gt;Trino Slack&lt;/a&gt; in the &lt;a href=&quot;https://trinodb.slack.com/messages/C07ABNN828M&quot;&gt;core-dev
channel&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Mateusz Gajewski</name>
        </author>
      

      <summary>Six months ago we adopted Java 23 as a requirement, following our standard procedure to upgrade with each Java version as soon as it becomes available. This allows us to take advantage of all the great improvements each release brings. The upgrade to 23 was pretty easy, since the changes from 22 to 23 were not that big. The story turns out to be a bit different now with our upgrade to Java 24.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/coffee-24.png" />
      
    </entry>
  
    <entry>
      <title>Out with the old file system</title>
      <link href="https://trino.io/blog/2025/02/10/old-file-system.html" rel="alternate" type="text/html" title="Out with the old file system" />
      <published>2025-02-10T00:00:00+00:00</published>
      <updated>2025-02-10T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2025/02/10/old-file-system</id>
      <content type="html" xml:base="https://trino.io/blog/2025/02/10/old-file-system.html">&lt;p&gt;What a long journey it has been! From the start Trino supported querying Hive
data and used libraries from the Hive and Hadoop ecosystem. With the release of
&lt;a href=&quot;/docs/current/release/release-470.html&quot;&gt;Trino 470&lt;/a&gt; we mark
another milestone to more features and better performance for data lake and
lakehouse querying with Trino. We deprecated the legacy file system support, and
will permanently remove them in an upcoming release.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;Trino always had a focus on performance and security. As a result we implemented
custom readers for file formats like Apache ORC and Apache Parquet many years
ago. We also have improved libraries for compression and decompression of files
from object storage, and implemented our own support for other table formats
with the Apache Iceberg, Delta Lake and Apache Hudi connectors.&lt;/p&gt;

&lt;p&gt;For the underlying object storage solutions and file systems, we originally
extended the libraries around the Hive system and added implementations for
Amazon S3, Azure Storage, Google Cloud Storage and others. Over time the
mismatch of the HDFS libraries and the cloud-centric usage with modern file
systems became more and more of a maintenance headache. It also represented an
unnecessary complexity overhead, resulted in performance problems, and forced us
to carry the Hadoop dependencies with all their baggage of old Java code and
security issues.&lt;/p&gt;

&lt;p&gt;In the end David Phillips, as our file system lead, decided in 2022 that it was
time to write our own file system support as needed for Trino. By summer of 2023
and with Trino 419 a &lt;a href=&quot;https://github.com/trinodb/trino/pull/17498&quot;&gt;first support for
S3&lt;/a&gt; became available for the
Iceberg and Delta Lake connectors. Over a year later in September 2024 and with
&lt;a href=&quot;/docs/current/release/release-458.html&quot;&gt;Trino 458&lt;/a&gt;, we declared
the old file system support on top of the Hadoop libraries legacy and advised
users to migrate.&lt;/p&gt;

&lt;p&gt;Since then you are required to declare what file system you want to enable in
each catalog with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fs.native-azure.enabled=true&lt;/code&gt;,&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fs.native-gcs.enabled=true&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fs.native-s3.enabled=true&lt;/code&gt;. If you are truly using HDFS, or if you insist on
using the old legacy support you can also use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fs.hadoop.enabled=true&lt;/code&gt;.&lt;/p&gt;
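&lt;p&gt;For example, a minimal catalog properties file for an Iceberg catalog using
the native S3 file system might look like the following sketch; the region
value is a placeholder for your own setup:&lt;/p&gt;

&lt;div class=&quot;language-properties highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;connector.name=iceberg
fs.native-s3.enabled=true
s3.region=us-east-1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;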

&lt;h2 id=&quot;trino-470&quot;&gt;Trino 470&lt;/h2&gt;

&lt;p&gt;With the recent &lt;a href=&quot;/docs/current/release/release-470.html&quot;&gt;Trino 470
release&lt;/a&gt; from February
2025, we took the next step. All catalog configuration properties for using the
old, legacy support for accessing Azure Storage, Google Cloud Storage, S3, and
S3-compatible file systems are now &lt;strong&gt;deprecated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These properties include all names starting with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.azure&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.cos&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.gcs&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.s3&lt;/code&gt;. As a result of this deprecation, Trino emits a
warning in the server log during startup for each of these properties.&lt;/p&gt;

&lt;p&gt;We also removed all documentation for the old properties, leaving only relevant
migration guides in place.&lt;/p&gt;

&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;/h2&gt;

&lt;p&gt;Within the next few weeks or months we will completely remove all these
properties and the underlying code. We therefore renew the call we have made in
numerous contributor calls, Trino Community Broadcast episodes, and at our Trino
Fest and Trino Summit events:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Stop using the old legacy file systems today.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you need help, have a look at the documentation for your connector, the file
system you use, and the migration guide for each file system:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/delta-lake.html&quot;&gt;Delta Lake connector&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/hive.html&quot;&gt;Hive connector&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/hudi.html&quot;&gt;Hudi connector&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/iceberg.html&quot;&gt;Iceberg connector&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/object-storage/file-system-azure.html&quot;&gt;Azure Storage file system support&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/object-storage/file-system-gcs.html&quot;&gt;Google Cloud Storage file system support&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/object-storage/file-system-s3.html&quot;&gt;S3 file system support&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new file systems are more stable and performant, and save you time and
money. Migrate today, and if you encounter any issues or find that features are
missing, ping us on &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt; and chime in on the
&lt;a href=&quot;https://github.com/trinodb/trino/issues/24878&quot;&gt;roadmap issue for the removal of the legacy file system
support&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, David Phillips, Mateusz Gajewski</name>
        </author>
      

      <summary>What a long journey it has been! From the start Trino supported querying Hive data and used libraries from the Hive and Hadoop ecosystem. With the release of Trino 470 we mark another milestone toward more features and better performance for data lake and lakehouse querying with Trino. We deprecated the legacy file system support, and will permanently remove it in an upcoming release.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/hadoop-trashcan.png" />
      
    </entry>
  
    <entry>
      <title>Trino in 2024 and beyond</title>
      <link href="https://trino.io/blog/2025/01/07/2024-and-beyond.html" rel="alternate" type="text/html" title="Trino in 2024 and beyond" />
      <published>2025-01-07T00:00:00+00:00</published>
      <updated>2025-01-07T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2025/01/07/2024-and-beyond</id>
      <content type="html" xml:base="https://trino.io/blog/2025/01/07/2024-and-beyond.html">&lt;p&gt;Wow, what an amazing year 2024 was for Trino! Martin Traverso presented the
achievements and progress of the project at the &lt;a href=&quot;/blog/2024/12/18/trino-summit-2024-quick-recap.html&quot;&gt;recent Trino Summit
2024&lt;/a&gt;. Let me dive
deeper into the content of his keynote and elaborate some more on our amazing
plans for the future.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;statistics&quot;&gt;Statistics&lt;/h2&gt;

&lt;p&gt;On the first slide of his presentation &lt;strong&gt;Enduring with persistence to reach the
summit&lt;/strong&gt;, Martin presented some amazing statistics for the year:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Over 30 releases packed with features and improvements - &lt;a href=&quot;/docs/current/release.html#releases-2024&quot;&gt;Trino releases 436-467&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;5,000+ additional commits to the 40,000+ total commits since project start&lt;/li&gt;
  &lt;li&gt;225+ unique contributors in 2024, 925+ total&lt;/li&gt;
  &lt;li&gt;10.5k+ stars on GitHub&lt;/li&gt;
  &lt;li&gt;13,500+ Slack members&lt;/li&gt;
  &lt;li&gt;Trino Community Broadcast episodes 54-67&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;improvements&quot;&gt;Improvements&lt;/h2&gt;

&lt;p&gt;Some of the major improvements in Trino are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Access controls with
&lt;a href=&quot;/docs/current/security/opa-access-control.html&quot;&gt;Open Policy Agent&lt;/a&gt; and
&lt;a href=&quot;/docs/current/security/ranger-access-control.html&quot;&gt;Apache Ranger&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Improved observability with &lt;a href=&quot;/docs/current/admin/event-listeners-openlineage.html&quot;&gt;OpenLineage&lt;/a&gt;, 
&lt;a href=&quot;/docs/current/admin/opentelemetry.html&quot;&gt;OpenTelemetry&lt;/a&gt;, OpenMetrics, and 
&lt;a href=&quot;/docs/current/admin/event-listeners-kafka.html&quot;&gt;Kafka&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Significant &lt;a href=&quot;/docs/current/client/client-protocol.html&quot;&gt;client protocol&lt;/a&gt; improvements&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/udf/python.html&quot;&gt;Python user-defined functions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;New connectors such as &lt;a href=&quot;/docs/current/connector/faker.html&quot;&gt;Faker&lt;/a&gt;,
&lt;a href=&quot;/docs/current/connector/snowflake.html&quot;&gt;Snowflake&lt;/a&gt;, or
&lt;a href=&quot;/docs/current/connector/vertica.html&quot;&gt;Vertica&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Numerous improvements on object storage connectors and integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course we also paid a lot of attention to bug fixes and shipped tremendous
performance improvements.&lt;/p&gt;

&lt;h2 id=&quot;slides-and-video&quot;&gt;Slides and video&lt;/h2&gt;

&lt;p&gt;If you want to find out all the details, have a look at the
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-keynote.pdf&quot;&gt;&lt;strong&gt;slides&lt;/strong&gt;&lt;/a&gt;
and the video recording:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=wmR6kzOCo-I&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/wmR6kzOCo-I/0.jpg&quot; alt=&quot;YouTube&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;other-projects&quot;&gt;Other projects&lt;/h2&gt;

&lt;p&gt;Martin also talked about the many improvements in other Trino projects such as
&lt;a href=&quot;https://trinodb.github.io/trino-gateway/&quot;&gt;Trino Gateway&lt;/a&gt;,
&lt;a href=&quot;https://github.com/trinodb/trino-python-client&quot;&gt;trino-python-client&lt;/a&gt;, the new
&lt;a href=&quot;https://github.com/trinodb/trino-js-client&quot;&gt;trino-js-client&lt;/a&gt;, and the new
&lt;a href=&quot;https://github.com/trinodb/trino-csharp-client&quot;&gt;trino-csharp-client&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;plans-for-2025&quot;&gt;Plans for 2025&lt;/h2&gt;

&lt;p&gt;For 2025, we have some pretty big plans in addition to our continued attention
to the software supply chain, performance improvements, and bug fixes.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Secrets management and dynamic catalogs&lt;/li&gt;
  &lt;li&gt;Client protocol improvements for all client drivers&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/22597&quot;&gt;Packaging improvements&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;More connectors such as DuckDB, LanceDB, HSQLDB, Loki, …&lt;/li&gt;
  &lt;li&gt;Continued and even increased work on performance improvements&lt;/li&gt;
  &lt;li&gt;Research and prototype towards a next generation optimizer&lt;/li&gt;
  &lt;li&gt;SQL language improvements such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PIVOT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ASOF&lt;/code&gt; joins, …&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, what actually happens with Trino in 2025 depends on you all. The
project lives and breathes only thanks to the efforts of all our contributors
and maintainers, and we look forward to working with you all.&lt;/p&gt;

&lt;h2 id=&quot;trino-survey&quot;&gt;Trino survey&lt;/h2&gt;

&lt;p&gt;Besides filing issues, sending pull requests, and discussing topics on Slack and
GitHub, we also have some specific questions and would really appreciate your
feedback. Answering should take less than a minute.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; target=&quot;_blank&quot; href=&quot;https://docs.google.com/forms/d/e/1FAIpQLSfrEIZ_5iyj17_hMJMdFhCIx9bQyHm6G-x6-CIq2VajURm6cQ/viewform?usp=sharing&quot;&gt;
        Help by answering the Trino survey
    &lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;With Trino being a huge collaborative effort, only one thing is certain:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;2025 will be an exciting year for Commander Bun Bun, Trino, and the Trino project.&lt;/p&gt;
&lt;/blockquote&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Wow, what an amazing year 2024 was for Trino! Martin Traverso presented the achievements and progress of the project at the recent Trino Summit 2024. Let me dive deeper into the content of his keynote and elaborate some more on our amazing plans for the future.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino Summit 2024 resources</title>
      <link href="https://trino.io/blog/2024/12/18/trino-summit-2024-quick-recap.html" rel="alternate" type="text/html" title="Trino Summit 2024 resources" />
      <published>2024-12-18T00:00:00+00:00</published>
      <updated>2024-12-18T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/12/18/trino-summit-2024-quick-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2024/12/18/trino-summit-2024-quick-recap.html">&lt;p&gt;What a view we had at the summit! Over 700 live attendees enjoyed the sessions
and learned more about Trino-related use cases and projects. Now it is time for
the additional 1,000 registrants, our 13,000+ Trino users on
&lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt;, and everyone else in the Trino community
and beyond to enjoy the presentations and recordings at their leisure.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;day-1-sessions&quot;&gt;Day 1 sessions&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enduring with persistence to reach the summit&lt;/strong&gt;
&lt;br /&gt;   Presented by Martin Traverso, co-creator of Trino and CTO at &lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/wmR6kzOCo-I&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-keynote.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Running Trino as exabyte-scale data warehouse&lt;/strong&gt;
&lt;br /&gt;   Presented by Alagappan Maruthappan from &lt;a href=&quot;/users.html#netflix&quot;&gt;Netflix&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/WuUS73QPuZE&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-netflix.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Data lake at Wise powered by Trino and Iceberg&lt;/strong&gt;
&lt;br /&gt;   Presented by Peter Kosztolanyi and Abdullah Alkhawatrah from &lt;a href=&quot;https://wise.com&quot;&gt;Wise&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/K5RmYtbeXAc&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Using Trino as a strangler fig&lt;/strong&gt;
&lt;br /&gt;   Presented by Trevor Kennedy from &lt;a href=&quot;https://www.fanduel.com/&quot;&gt;Fanduel&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/cVA5IPWdHRs&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-fanduel.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;A lakehouse that simply works&lt;/strong&gt;
&lt;br /&gt;   Presented by Vincenzo Cassaro from &lt;a href=&quot;https://prezi.com/&quot;&gt;Prezi&lt;/a&gt; 
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/6xdPRqpA8FA&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-prezi.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Empowering self-serve data analytics with a text-to-SQL assistant at LinkedIn&lt;/strong&gt; 
&lt;br /&gt;   Presented by Gaurav Ahlawat, Albert Chen, and Manas Bundele from
&lt;a href=&quot;/users.html#linkedin&quot;&gt;LinkedIn&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/rl4GLNEVkjo&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-linkedin-ai.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;How Trino and dbt unleashed many-to-many interoperability at Bazaar&lt;/strong&gt;
&lt;br /&gt;   Presented by Shahzad Siddiqi, Siddique Ahmad, and Usman Ghani from
  &lt;a href=&quot;/users.html#bazaar_technologies&quot;&gt;Bazaar&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/G9jafHdH8FY&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-bazaar.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Maximizing cost efficiency in data analytics with Trino and Iceberg&lt;/strong&gt;
&lt;br /&gt;   Presented by Gopi Bhagavathula from &lt;a href=&quot;https://www.branch.io/&quot;&gt;Branch&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/Yaz7fwvOPdY&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-branch.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Lessons and news from the AI world for Trino&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Manfred Moser, panel moderator and Trino maintainer at &lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Gunther Hagleitner, CEO and Co-founder at &lt;a href=&quot;https://waii.ai/&quot;&gt;Waii&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Rong Rong, Software Engineer at &lt;a href=&quot;https://character.ai/&quot;&gt;CharacterAI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;William Chang, Co-founder and CTO of &lt;a href=&quot;/users.html#canner&quot;&gt;Canner&lt;/a&gt; and
&lt;a href=&quot;/ecosystem/client-application.html#wren-ai&quot;&gt;WrenAI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Mustafa Sakalsiz, Founder and CEO at &lt;a href=&quot;/users.html#peaka&quot;&gt;Peaka&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Dain Sundstrom, Trino co-creator and CTO at &lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/gobl6PhIWeE&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;day-2-sessions&quot;&gt;Day 2 sessions&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Trino for observability at Intuit&lt;/strong&gt; 
&lt;br /&gt;   Presented by Ujjwal Sharma and Riya John from &lt;a href=&quot;https://www.intuit.com/&quot;&gt;Intuit&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/47dMrURt7us&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-intuit.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Hassle-free dynamic policy enforcement in Trino&lt;/strong&gt;
&lt;br /&gt;   Presented by Ramanathan Ramu and Pratham Desai from &lt;a href=&quot;/users.html#linkedin&quot;&gt;LinkedIn&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/GAudNEmbvsc&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-linkedin-policy.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Empowering HugoBank’s digital services through Trino&lt;/strong&gt;
&lt;br /&gt;   Presented by Mustafa Mirza and Razi Moosa from &lt;a href=&quot;https://www.hugobank.com.pk&quot;&gt;HugoBank&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/51JVd25behQ&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-hugobank.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Optimizing Trino on Kubernetes: Helm chart enhancements for resilience and security&lt;/strong&gt; 
&lt;br /&gt;   Presented by Sebastian Daberdaku from &lt;a href=&quot;https://cardoai.com&quot;&gt;CardoAI&lt;/a&gt; and
Jan Waś from &lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/MGuOf45cGwA&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-cardoai.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Virtual view hierarchies with Trino&lt;/strong&gt;
&lt;br /&gt;   Presented by Rob Dickinson from &lt;a href=&quot;https://graylog.org/&quot;&gt;Graylog&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/z8eh_3vBpvg&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-graylog.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Opening up the Trino Gateway&lt;/strong&gt;
&lt;br /&gt;   Presented by Manfred Moser and Will Morrison from &lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;, 
&lt;br /&gt;   Vishal Jadhav from &lt;a href=&quot;https://www.bloomberg.com/company/values/tech-at-bloomberg/&quot;&gt;Bloomberg&lt;/a&gt;, and Jaehoo Yoo from &lt;a href=&quot;/users.html#naver&quot;&gt;Naver&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/MiQEngRJk8g&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-trino-gateway.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Wvlet: A new flow-style query language for functional data modeling and interactive analysis&lt;/strong&gt;
&lt;br /&gt;   Presented by Taro L. Saito from &lt;a href=&quot;/users.html#treasuredata&quot;&gt;Treasure Data&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/ot7z7J6h9rM&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-wvlet.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Securing data pipelines at the storage layer&lt;/strong&gt;
&lt;br /&gt;   Presented by Andrew MacKay from &lt;a href=&quot;https://superna.io/&quot;&gt;Superna&lt;/a&gt;.
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/Lxr4Rzn27cw&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-superna.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Empowering pharmaceutical drug launches with Trino-powered sales data analytics&lt;/strong&gt;
&lt;br /&gt;   Presented by Harpreet Singh from &lt;a href=&quot;https://www.gilead.com/&quot;&gt;Gilead&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/ELsBGx1Sv3o&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Connecting to Trino with C# and ADO.net&lt;/strong&gt; 
&lt;br /&gt;   Presented by George Fischer from &lt;a href=&quot;https://www.microsoft.com&quot;&gt;Microsoft&lt;/a&gt;
&lt;br /&gt;   &lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://youtu.be/x2rF6IEjFK0&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2024/trino-summit-2024-csharp-client.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our thanks go out to all our speakers as well as our event sponsor:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/users.html#starburst&quot;&gt;
&lt;img src=&quot;/assets/images/logos/starburst.png&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See you at Trino Fest 2025, one of our &lt;a href=&quot;/community.html#events&quot;&gt;other events and
meetings&lt;/a&gt;, and on &lt;a href=&quot;/slack.html&quot;&gt;Trino
Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Monica, and Anna&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Monica Miller, Anna Schibli</name>
        </author>
      

      <summary>What a view we had at the summit! Over 700 live attendees enjoyed the sessions and learned more about Trino-related use cases and projects. Now it is time for the additional 1,000 registrants, our 13,000+ Trino users on Slack, and everyone else in the Trino community and beyond to enjoy the presentations and recordings at their leisure.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2024/recap-blog-banner.png" />
      
    </entry>
  
    <entry>
      <title>The long journey to Apache Ranger</title>
      <link href="https://trino.io/blog/2024/12/02/ranger.html" rel="alternate" type="text/html" title="The long journey to Apache Ranger" />
      <published>2024-12-02T00:00:00+00:00</published>
      <updated>2024-12-02T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/12/02/ranger</id>
      <content type="html" xml:base="https://trino.io/blog/2024/12/02/ranger.html">&lt;p&gt;&lt;a href=&quot;/ecosystem/add-on.html#apache-ranger&quot;&gt;Apache Ranger&lt;/a&gt; has
arrived! With the new &lt;a href=&quot;/docs/current/release/release-466.html&quot;&gt;Trino
466&lt;/a&gt; you all get another
jam-packed release of Trino awesomeness. One of the goodies is a new plugin for
access control for your data with Apache Ranger, and it has gone through a long
story to get here.&lt;/p&gt;

&lt;p&gt;Apache Ranger has a long history and wide adoption as an access control system
for data lakes using Hadoop and Hive. Since Trino brings fast analytics to this
space, and also supports modern data lakehouses and other data sources, Apache
Ranger is a natural fit for access control on a Trino-powered data platform.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;the-beginnings&quot;&gt;The beginnings&lt;/h2&gt;

&lt;p&gt;Apache Ranger has been in use with Trino for a long time - in fact there are
&lt;a href=&quot;https://github.com/trinodb/trino/pull/244&quot;&gt;early&lt;/a&gt;,
&lt;a href=&quot;https://github.com/trinodb/trino/pull/1069&quot;&gt;rudimentary&lt;/a&gt; pull requests from
2019 that implemented some support. And even before then, various hacks existed.
In 2020, a plugin for PrestoSQL was added to Apache Ranger. Aakash Nand blogged
about &lt;a href=&quot;https://towardsdatascience.com/integrating-trino-and-apache-ranger-b808f6b96ad8&quot;&gt;Integrating Trino and Apache
Ranger&lt;/a&gt;
in 2021 to adjust for the changes to Trino. Jeff Xu followed up with
&lt;a href=&quot;https://medium.com/@jeff.xu.z/integrating-trino-and-apache-ranger-in-a-kerberos-secured-enterprise-environment-997c95cd10e9&quot;&gt;Integrating Trino and Apache Ranger in a Kerberos-secured enterprise
environment&lt;/a&gt;
in 2022, quickly followed by the addition of Trino support to the Apache
Ranger repository.&lt;/p&gt;

&lt;h2 id=&quot;testing-and-container-images&quot;&gt;Testing and container images&lt;/h2&gt;

&lt;p&gt;However, that was only half of the needed support. The Trino project moves very
fast with nearly weekly releases, so the best approach is to have the supporting
plugin in Trino directly, so that every release includes the relevant updates. &lt;a href=&quot;https://github.com/dprophet&quot;&gt;Erik
Anderson&lt;/a&gt; created a more mature plugin that was in
production use with Trino for quite a while. His &lt;a href=&quot;https://github.com/trinodb/trino/pull/13297&quot;&gt;pull request from July
2022&lt;/a&gt; included great background
reasoning for having the plugin in Trino. One of the issues that Erik solved for
the Trino project is testing. Trino plugins require the availability of a
container image for testing the integration. Apache Ranger still did not ship a
container image in 2022, but thanks to Erik’s lobbying efforts this changed, and
an image became available over the following months.&lt;/p&gt;

&lt;h2 id=&quot;a-long-sprint&quot;&gt;A long sprint&lt;/h2&gt;

&lt;p&gt;Unfortunately, priorities shifted, and while Erik’s PR existed and was useful,
it never made it to merge. That changed when &lt;a href=&quot;https://github.com/mneethiraj&quot;&gt;Madhan
Neethiraj&lt;/a&gt; from the Apache Ranger project stepped
up and created a &lt;a href=&quot;https://github.com/trinodb/trino/pull/22675&quot;&gt;new PR&lt;/a&gt; in July 2024.&lt;/p&gt;

&lt;p&gt;We knew this could be another shot at it, and that it would require a lot of
work to get done, since we put a high focus on quality so that we can maintain
the Trino codebase for the long run. While monitoring all PRs, &lt;a href=&quot;https://github.com/mosabua&quot;&gt;I (Manfred
Moser)&lt;/a&gt; noticed it and jumped in to help.&lt;/p&gt;

&lt;p&gt;Erik and other interested users chimed in.
&lt;a href=&quot;https://github.com/lozbrown&quot;&gt;lozbrown&lt;/a&gt; and Manfred helped with documentation
and getting other developers interested. The heavy technical reviews and lots of
guidance came from &lt;a href=&quot;https://github.com/ksobolew&quot;&gt;Krzysztof Sobolewski&lt;/a&gt; and
&lt;a href=&quot;https://github.com/kokosing&quot;&gt;Grzegorz Kokosiński&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;During the whole process, Madhan had to react to comments, update the code, and
also regularly rebase his PR to adjust for the constantly changing Trino
codebase in the master branch. Starburst recognized Madhan’s effort and
&lt;a href=&quot;https://www.starburst.io/community/trino-champions/&quot;&gt;featured him as Starburst Trino
Champion&lt;/a&gt;. Interestingly,
the container image ended up not being used for testing, but it will still be
crucially important for many users deploying Apache Ranger on Kubernetes.
Nearly 400 comments and over four months later we all got to celebrate. The
Trino maintainer Grzegorz took on the responsibility and merged the PR. &lt;a href=&quot;https://github.com/ebyhr&quot;&gt;Yuya
Ebihara&lt;/a&gt; and &lt;a href=&quot;https://github.com/martint&quot;&gt;Martin
Traverso&lt;/a&gt; followed up with
&lt;a href=&quot;https://github.com/trinodb/trino/pull/24238&quot;&gt;minor&lt;/a&gt;
&lt;a href=&quot;https://github.com/trinodb/trino/pull/24252&quot;&gt;cleanups&lt;/a&gt;, and we finally shipped
the plugin as part of &lt;a href=&quot;/docs/current/release/release-466.html&quot;&gt;Trino
466&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;A huge congratulations and thank you goes out to everyone involved.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now it is your turn to have a look at the
&lt;a href=&quot;/docs/current/security/apache-ranger-access-control.html&quot;&gt;documentation&lt;/a&gt;,
learn more about Trino and Apache Ranger, and maybe even proceed to help us
improve the integration.&lt;/p&gt;
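&lt;p&gt;As a starting sketch, the plugin is enabled with an
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etc/access-control.properties&lt;/code&gt; file. The property names follow the linked
documentation; the service name and configuration file paths are placeholders to
adapt to your Ranger deployment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;access-control.name=ranger
ranger.service.name=trino
ranger.plugin.config.resource=etc/ranger-trino-security.xml,etc/ranger-trino-audit.xml
&lt;/code&gt;&lt;/pre&gt;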

&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;/h2&gt;

&lt;p&gt;Beyond our celebration, more tasks are waiting for all of us:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Test it out in your usage and migrate from any old or custom versions.&lt;/li&gt;
  &lt;li&gt;Help us improve the
&lt;a href=&quot;/docs/current/security/apache-ranger-access-control.html&quot;&gt;documentation&lt;/a&gt;
significantly to allow easier adoption.&lt;/li&gt;
  &lt;li&gt;Work with lozbrown on adding support to the &lt;a href=&quot;https://github.com/trinodb/charts&quot;&gt;Helm chart&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Check out the codebase and help us fix bugs and add features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And last, but not least - join us all to celebrate Trino at the upcoming &lt;a href=&quot;/blog/2024/11/22/trino-summit-2024-lineup.html&quot;&gt;Trino
Summit 2024 for two days of amazing sessions and interaction with your peers
from the Trino community&lt;/a&gt;
and the &lt;a href=&quot;/community.html#events&quot;&gt;Trino Contributor Call&lt;/a&gt; for
more open community chat and discussion.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Apache Ranger has arrived! With the new Trino 466 you all get another jam-packed release of Trino awesomeness. One of the goodies is a new plugin for access control for your data with Apache Ranger, and it has gone through a long story to get here. Apache Ranger has a long history and wide adoption as an access control system for data lakes using Hadoop and Hive. Since Trino brings fast analytics to this space, and also supports modern data lakehouses and other data sources, Apache Ranger is a natural fit for access control on a Trino-powered data platform.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/apache-ranger.png" />
      
    </entry>
  
    <entry>
      <title>The glorious lineup for Trino Summit 2024</title>
      <link href="https://trino.io/blog/2024/11/22/trino-summit-2024-lineup.html" rel="alternate" type="text/html" title="The glorious lineup for Trino Summit 2024" />
      <published>2024-11-22T00:00:00+00:00</published>
      <updated>2024-11-22T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/11/22/trino-summit-2024-lineup</id>
      <content type="html" xml:base="https://trino.io/blog/2024/11/22/trino-summit-2024-lineup.html">&lt;p&gt;We just wrapped up our mini training series &lt;a href=&quot;/blog/2024/11/21/sql-basecamps-view.html&quot;&gt;SQL basecamps before Trino
Summit&lt;/a&gt;, and now Trino Summit 2024
is less than three busy weeks away. It’s a good thing that we have also been
working hard on all the preparations for the summit. Everything is coming
together, and we are excited to share the full lineup for the free, virtual,
two-day event today.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;In &lt;a href=&quot;/blog/2024/10/17/trino-summit-2024-tease.html&quot;&gt;our first glimpse at the summit&lt;/a&gt; we were able to share a few sessions with
more details. Now have a look at the whole lineup with speakers from all these
and many other companies:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2024/summit-wall.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Make sure you register to get up-to-date information and more details for all
the sessions. Registration allows you to join us live and chat with the speakers
during the event. You will also get important session follow-up information,
including when recordings and slide decks become available, so you can review,
watch anything you missed, and share sessions with your peers.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; target=&quot;_blank&quot; href=&quot;https://www.starburst.io/info/trino-summit-2024/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;[…]mpaign=NORAM-FY25-Q4-CM-Trino-Summit-2024&amp;amp;utm_content=blog-3&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;

&lt;h2 id=&quot;keynote&quot;&gt;Keynote&lt;/h2&gt;

&lt;p&gt;In the keynote &lt;strong&gt;Enduring with persistence to reach the summit&lt;/strong&gt; Martin
Traverso, co-creator of Trino and CTO at
&lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;, covers the developments from
2024 in the Trino projects and the Trino community. Martin also reveals details
about new features, new projects, and plans for 2025.&lt;/p&gt;

&lt;h2 id=&quot;panel-discussion&quot;&gt;Panel discussion&lt;/h2&gt;

&lt;p&gt;The hype and reality of AI has swept through the industry. In the panel
discussion &lt;strong&gt;Lessons and news from the AI world for Trino&lt;/strong&gt;, Manfred Moser is
moderating experts from the community:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Gunther Hagleitner, CEO and Co-founder at &lt;a href=&quot;https://waii.ai/&quot;&gt;Waii&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Rong Rong, Software Engineer at &lt;a href=&quot;https://character.ai/&quot;&gt;CharacterAI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;William Chang, Co-founder and CTO of &lt;a href=&quot;/users.html#canner&quot;&gt;Canner&lt;/a&gt; and
&lt;a href=&quot;/ecosystem/client#wren-ai&quot;&gt;WrenAI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Mustafa Sakalsiz, Founder and CEO at &lt;a href=&quot;/users.html#peaka&quot;&gt;Peaka&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Dain Sundstrom, Trino co-creator and CTO at &lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All panelists have extensive experience with AI and Trino, and will share
their knowledge and different perspectives.&lt;/p&gt;

&lt;h2 id=&quot;sessions&quot;&gt;Sessions&lt;/h2&gt;

&lt;p&gt;The following sessions allow our speakers to really dig into the details of
their topic:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Optimizing Trino on Kubernetes: Helm chart enhancements for resilience and
security&lt;/strong&gt; presented by Sebastian Daberdaku from
&lt;a href=&quot;https://cardoai.com/&quot;&gt;CardoAI&lt;/a&gt; and Jan Waś from
&lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Trino for Observability at Intuit&lt;/strong&gt; presented by Ujjwal Sharma and Riya John
from &lt;a href=&quot;https://www.intuit.com/&quot;&gt;Intuit&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Opening up the Trino Gateway&lt;/strong&gt; presented by the Trino Gateway maintainers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data Lake at Wise powered by Trino and Iceberg&lt;/strong&gt; presented by Peter
Kosztolanyi and Abdallah Alkhawatrah from &lt;a href=&quot;https://wise.com&quot;&gt;Wise&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hassle-free dynamic policy enforcement in Trino&lt;/strong&gt; presented by Ramanathan
Ramu and Pratham Desai from &lt;a href=&quot;/users.html#linkedin&quot;&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Empowering self-serve data analytics with a text-to-SQL assistant at
LinkedIn&lt;/strong&gt; presented by Gaurav Ahlawat, Albert Chen, and Manas Bundele from
&lt;a href=&quot;/users.html#linkedin&quot;&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;A Lakehouse that simply works&lt;/strong&gt; presented by Vincenzo Cassaro from
  &lt;a href=&quot;https://prezi.com/&quot;&gt;Prezi&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Securing data pipelines at the storage layer&lt;/strong&gt; presented by Andrew MacKay
from &lt;a href=&quot;https://superna.io/&quot;&gt;Superna&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Maximizing cost efficiency in data analytics with Trino and Iceberg&lt;/strong&gt;
presented by Gopi Bhagavathula from &lt;a href=&quot;https://www.branch.io/&quot;&gt;Branch&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wvlet: A new flow-style query language for functional data modeling and
interactive analysis&lt;/strong&gt; presented by Taro L. Saito from &lt;a href=&quot;/users.html#treasuredata&quot;&gt;Treasure
Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Running Trino as exabyte-scale data warehouse&lt;/strong&gt; presented by Alagappan
Maruthappan from &lt;a href=&quot;/users.html#netflix&quot;&gt;Netflix&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;lightning-talks&quot;&gt;Lightning talks&lt;/h2&gt;

&lt;p&gt;Our lightning talks provide inspiration with some great examples of Trino
adoption and usage:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Using Trino as a strangler fig&lt;/strong&gt; presented by Trevor Kennedy from
&lt;a href=&quot;https://www.fanduel.com/&quot;&gt;Fanduel&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Virtual view hierarchies with Trino&lt;/strong&gt; presented by Rob Dickinson from
&lt;a href=&quot;https://graylog.org/&quot;&gt;Graylog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Empowering HugoBank’s digital services through Trino&lt;/strong&gt; presented by Mustafa
Mirza and Razi Moosa from &lt;a href=&quot;https://www.hugobank.com.pk&quot;&gt;HugoBank&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;How Trino and dbt unleashed many-to-many interoperability at Bazaar&lt;/strong&gt;
presented by Shahzad Siddiqi, Siddique Ahmad, and Usman Ghani from
&lt;a href=&quot;/users.html#bazaar_technologies&quot;&gt;Bazaar&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Connecting to Trino with C# and ADO.net&lt;/strong&gt; presented by George Fischer from
&lt;a href=&quot;https://www.microsoft.com&quot;&gt;Microsoft&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our special thanks go out to all our speakers as well as our event sponsor:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/users.html#starburst&quot;&gt;
&lt;img src=&quot;/assets/images/logos/starburst.png&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See you on the summit.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Monica, and Anna&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Monica Miller, Anna Schibli</name>
        </author>
      

      <summary>We just wrapped up our mini training series SQL basecamps before Trino Summit, and now Trino Summit 2024 is less than three busy weeks away. It’s a good thing that we have also been working hard on all the preparations for the summit. Everything is coming together, and we are excited to share the full lineup for the free, virtual, two-day event today.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2024/lineup-blog-banner.png" />
      
    </entry>
  
    <entry>
      <title>View the SQL basecamps before Trino Summit</title>
      <link href="https://trino.io/blog/2024/11/21/sql-basecamps-view.html" rel="alternate" type="text/html" title="View the SQL basecamps before Trino Summit" />
      <published>2024-11-21T00:00:00+00:00</published>
      <updated>2024-11-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/11/21/sql-basecamps-view</id>
      <content type="html" xml:base="https://trino.io/blog/2024/11/21/sql-basecamps-view.html">&lt;p&gt;Trino Summit is fast approaching, and we are busy with all the preparation.
Nevertheless, we thought we would bring you some more SQL and Trino-related training.
The two live classes from our &lt;a href=&quot;/blog/2024/10/07/sql-basecamps.html&quot;&gt;SQL basecamps before Trino Summit&lt;/a&gt; are now available for you all to enjoy, just in
case you missed them.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;In the two classes I teamed up with Dain Sundstrom and Martin Traverso, and
created interview-style training classes. Hopefully you learned something from
their insights and my guidance and questions.&lt;/p&gt;

&lt;p&gt;Check out the two session recordings and the supporting material:&lt;/p&gt;

&lt;h2 id=&quot;moving-supplies&quot;&gt;Moving supplies&lt;/h2&gt;

&lt;p&gt;In the first episode &lt;strong&gt;SQL basecamp 1 – Moving supplies&lt;/strong&gt; Dain and I discussed
the core concepts of a Trino-powered lakehouse, getting data in and maintaining
the lakehouse.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
  &lt;a class=&quot;btn btn-pink&quot; target=&quot;_blank&quot; href=&quot;https://trinodb.github.io/presentations/presentations/moving-supplies/index.html&quot;&gt;
    Look at the slides
  &lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/LyBSHiCd2A8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;getting-ready-to-summit&quot;&gt;Getting ready to summit&lt;/h2&gt;

&lt;p&gt;The second episode &lt;strong&gt;SQL Basecamp 2 – Getting ready to summit&lt;/strong&gt; builds on the
foundation established in episode 1. Martin and I discussed some further details
for lakehouse usage and then looked at structural data types and views.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
  &lt;a class=&quot;btn btn-pink&quot; target=&quot;_blank&quot; href=&quot;https://trinodb.github.io/presentations/presentations/getting-ready-to-summit/index.html&quot;&gt;
    Look at the slides
  &lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/32uGABdBCTQ&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;next-up-trino-summit&quot;&gt;Next up, Trino Summit&lt;/h2&gt;

&lt;p&gt;If you think those two sessions were great, how about two days’ worth of great
presentations at Trino Summit?&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://www.starburst.io/info/trino-summit-2024/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=NORAM-FY25-Q4-CM-Trino-Summit-2024&amp;amp;utm_content=sql-series-recap-blog&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Trino Summit is fast approaching, and we are busy with all the preparation. Nevertheless, we thought we would bring you some more SQL and Trino-related training. The two live classes from our SQL basecamps before Trino Summit are now available for you all to enjoy, just in case you missed them.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2024/sql-basecamps-2024.png" />
      
    </entry>
  
    <entry>
      <title>Trino and Javascript?! YES!</title>
      <link href="https://trino.io/blog/2024/11/18/javascript.html" rel="alternate" type="text/html" title="Trino and Javascript?! YES!" />
      <published>2024-11-18T00:00:00+00:00</published>
      <updated>2024-11-18T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/11/18/javascript</id>
      <content type="html" xml:base="https://trino.io/blog/2024/11/18/javascript.html">&lt;p&gt;Trino is written in Java. Trino contributors and maintainers are often veterans
in the Java ecosystem and community, and Trino is very modern when it comes to
Java. For example, Trino now requires the latest Java version and actively uses
new features.&lt;/p&gt;

&lt;p&gt;When it comes to JavaScript however, the story is a bit more complicated. Of
course, JavaScript is commonly used in the Trino ecosystem and codebase. Let’s
look at some of the specifics.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;client-driver-and-applications&quot;&gt;Client driver and applications&lt;/h2&gt;

&lt;p&gt;Client applications that allow users to submit queries to Trino and receive
the results are written in numerous languages. Trino has good support
for &lt;a href=&quot;/ecosystem/index.html#clients&quot;&gt;many of them&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks to the collaboration with &lt;a href=&quot;https://github.com/regadas&quot;&gt;Filipe Regadas&lt;/a&gt;
and the contribution of his JavaScript client driver to the Trino community, we
now have an official
&lt;a href=&quot;https://github.com/trinodb/trino-js-client&quot;&gt;trino-js-client&lt;/a&gt; project. After his
initial donation we have applied numerous improvements and recently cut our
first release.&lt;/p&gt;

&lt;p&gt;The client is already used in the &lt;a href=&quot;/ecosystem/client#vscode&quot;&gt;Visual Studio Code
support&lt;/a&gt;, the &lt;a href=&quot;/ecosystem/client#emacs&quot;&gt;Emacs
support&lt;/a&gt;, the example project discussed
in &lt;a href=&quot;/episodes/63.html&quot;&gt;Trino Community Broadcast episode 63&lt;/a&gt;,
and numerous other applications.&lt;/p&gt;

&lt;p&gt;And we have big plans as well:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Add support for more authentication methods supported in Trino&lt;/li&gt;
  &lt;li&gt;Improve documentation and example projects&lt;/li&gt;
  &lt;li&gt;Add support for the new spooling client protocol from Trino&lt;/li&gt;
  &lt;li&gt;Test with Trino Gateway and adjust as needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While this project is a great addition for many users of Trino and their custom
web applications, there are numerous other usages of JavaScript in the project.&lt;/p&gt;

&lt;h2 id=&quot;user-interfaces&quot;&gt;User interfaces&lt;/h2&gt;

&lt;p&gt;Web-based user interfaces are one important use of JavaScript. Trino includes
the &lt;a href=&quot;/docs/current/admin/web-interface.html&quot;&gt;Trino Web UI&lt;/a&gt; and
the ongoing effort to replace it with a more modern and feature-rich UI,
currently called the &lt;a href=&quot;/docs/current/admin/preview-web-interface.html&quot;&gt;Preview
UI&lt;/a&gt;. It was
inspired by the replacement of the legacy UI for &lt;a href=&quot;https://trinodb.github.io/trino-gateway/&quot;&gt;Trino
Gateway&lt;/a&gt; with a new UI based on
current tools and libraries.&lt;/p&gt;

&lt;p&gt;All three user interfaces require constant work: keeping libraries up to
date, fixing bugs, and adding new features.&lt;/p&gt;

&lt;h2 id=&quot;other-projects&quot;&gt;Other projects&lt;/h2&gt;

&lt;p&gt;Beyond the user interfaces we also provide a &lt;a href=&quot;https://github.com/trinodb/grafana-trino&quot;&gt;plugin for
Grafana&lt;/a&gt; that is mostly written in
JavaScript, and there might be more projects on the way.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;The skills and experience needed for all these JavaScript-based efforts are
distinct enough that developers can contribute to them without knowing much
about Trino and Java.&lt;/p&gt;

&lt;p&gt;If that is you, we want to hear from you. And if you are knowledgeable in
Trino, Java, and many other things, and also interested in helping with the
JavaScript work, we want to hear from you too. There is always more we want to
get done, and we need your help.&lt;/p&gt;

&lt;p&gt;So have a look at the codebase that interests you the most, chat with us on
&lt;a href=&quot;/slack.html&quot;&gt;Trino Slack&lt;/a&gt;, join an &lt;a href=&quot;/community.html#events&quot;&gt;upcoming Trino contributor
call&lt;/a&gt; and &lt;a href=&quot;/blog/2024/10/17/trino-summit-2024-tease.html&quot;&gt;Trino Summit&lt;/a&gt;, and let me know if you would be
interested in a regular Trino JavaScript call, perhaps monthly.&lt;/p&gt;

&lt;p&gt;And if you don’t want to code in Java or JavaScript? Well, you can help us write
&lt;a href=&quot;https://github.com/trinodb/trino/tree/master/docs&quot;&gt;documentation in Markdown&lt;/a&gt;,
work on the &lt;a href=&quot;https://github.com/trinodb/trino-python-client&quot;&gt;Python client&lt;/a&gt;, the
&lt;a href=&quot;https://github.com/trinodb/trino-go-client&quot;&gt;Go client&lt;/a&gt;, or maybe even
contribute a client we don’t even have yet.&lt;/p&gt;

&lt;p&gt;In all cases, we look forward to your help.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Trino is written in Java. Trino contributors and maintainers are often veterans in the Java ecosystem and community, and Trino is very modern when it comes to Java. For example, Trino now requires the latest Java version and actively uses new features. When it comes to JavaScript however, the story is a bit more complicated. Of course, JavaScript is commonly used in the Trino ecosystem and codebase. Let’s look at some of the specifics.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/javascript-small.png" />
      
    </entry>
  
    <entry>
      <title>A glimpse at the summit</title>
      <link href="https://trino.io/blog/2024/10/17/trino-summit-2024-tease.html" rel="alternate" type="text/html" title="A glimpse at the summit" />
      <published>2024-10-17T00:00:00+00:00</published>
      <updated>2024-10-17T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/10/17/trino-summit-2024-tease</id>
      <content type="html" xml:base="https://trino.io/blog/2024/10/17/trino-summit-2024-tease.html">&lt;p&gt;Our efforts around &lt;a href=&quot;/blog/2024/07/11/trino-summit-2024-call-for-speakers.html&quot;&gt;Trino Summit 2024&lt;/a&gt; are ramping up and the event
is creeping closer and closer. We are really looking forward to the two-day,
free, virtual event in December about all things Trino.&lt;/p&gt;

&lt;p&gt;While we are working hard to put together the &lt;a href=&quot;/blog/2024/10/07/sql-basecamps.html&quot;&gt;SQL basecamps before Trino Summit
training sessions&lt;/a&gt; and &lt;a href=&quot;/community.html#events&quot;&gt;other community
events&lt;/a&gt;, a number of your awesome peers
from the Trino community submitted session proposals, and we are excited to
share this glimpse of the agenda for Trino Summit 2024.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;first-batch-of-sessions&quot;&gt;First batch of sessions&lt;/h2&gt;

&lt;p&gt;Let’s see what has already settled on the agenda.&lt;/p&gt;

&lt;h3 id=&quot;running-trino-as-exabyte-scale-data-warehouse&quot;&gt;Running Trino as exabyte-scale data warehouse&lt;/h3&gt;

&lt;p&gt;Presented by Alagappan Maruthappan from &lt;a href=&quot;https://netflix.com&quot;&gt;Netflix&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Netflix operates over 15 Trino clusters, efficiently handling more than 10
million queries each month. As the initial creator of Apache Iceberg,
Netflix has over 1 million Iceberg tables and makes extensive use of the Trino
Iceberg connector. In this session we talk about the operational challenges faced,
internal efficiency improvements, and our experience with upgrading to the
latest Trino version.&lt;/p&gt;

&lt;h3 id=&quot;a-lakehouse-that-simply-works&quot;&gt;A Lakehouse that simply works&lt;/h3&gt;

&lt;p&gt;Presented by Vincenzo Cassaro from &lt;a href=&quot;https://prezi.com/&quot;&gt;Prezi&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the billions of tech and vendor proposals, it’s easy to lose track of what
truly matters. Vincenzo would like to show how a simple combination of established,
maintained, open source technologies can make a lakehouse that truly works for a
company with 150M users.&lt;/p&gt;

&lt;h3 id=&quot;how-trino-and-dbt-unleashed-many-to-many-interoperability-at-bazaar&quot;&gt;How Trino and dbt unleashed many-to-many interoperability at Bazaar&lt;/h3&gt;

&lt;p&gt;Presented by Shahzad Siddiqi, Siddique Ahmad, and Usman Ghani from
&lt;a href=&quot;/users.html#bazaar_technologies&quot;&gt;Bazaar&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Learn how Bazaar leveraged the combined power of Trino and dbt to scale their
data platform effectively. This talk delves into the strategies and technologies
used to enable many-to-many integration, fueling data-driven decision-making
across the organization.&lt;/p&gt;

&lt;h3 id=&quot;maximizing-cost-efficiency-in-data-analytics-with-trino-and-iceberg&quot;&gt;Maximizing cost efficiency in data analytics with Trino and Iceberg&lt;/h3&gt;

&lt;p&gt;Presented by Gopi Bhagavathula from &lt;a href=&quot;https://www.branch.io/&quot;&gt;Branch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At Branch, we realized that our existing architecture was not only expensive
but also becoming unsustainable as data volumes grew for one of our business
units, and we decided to adopt Trino and Apache Iceberg. Our journey of migrating
from Apache Druid to Trino and Iceberg taught us that the right combination of
tools can transform data analytics for one of our internal business units,
offering the perfect balance between cost savings, performance, and scalability.
Learn more about how we achieved 7-figure savings with a few “compromises”.&lt;/p&gt;

&lt;h3 id=&quot;using-trino-as-a-strangler-fig&quot;&gt;Using Trino as a strangler fig&lt;/h3&gt;

&lt;p&gt;Presented by Trevor Kennedy from &lt;a href=&quot;https://www.fanduel.com/&quot;&gt;Fanduel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This talk discusses how FanDuel uses Trino to migrate analysts from Redshift to
Delta Lake using Martin Fowler’s Strangler Fig pattern. Trino slowly took root
after initial trials, started replacing parts of the legacy system, and
eventually will become a complete replacement, leaving only a shadow of the
original system.&lt;/p&gt;

&lt;h3 id=&quot;enduring-with-persistence-to-reach-the-summit&quot;&gt;Enduring with persistence to reach the summit&lt;/h3&gt;

&lt;p&gt;Presented by Martin Traverso from &lt;a href=&quot;/users.html#starburst&quot;&gt;Starburst&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the keynote Martin presents the latest and greatest news from the Trino
project and the Trino community. With more contributors, more maintainers, and a
larger community we got a lot done since Trino Fest in June. Find out the
details from the co-creator of Trino.&lt;/p&gt;

&lt;p&gt;Surely, you don’t need any more convincing and you are ready to proceed to&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://www.starburst.io/info/trino-summit-2024/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=NORAM-FY25-Q4-CM-Trino-Summit-2024-IMC-Upgrade&amp;amp;utm_content=blog-2&quot;&gt;
        Register to attend!
    &lt;/a&gt;
&lt;/div&gt;

&lt;h2 id=&quot;continued-call-for-speakers&quot;&gt;Continued call for speakers&lt;/h2&gt;

&lt;p&gt;Now that you have registered and seen what others have submitted and gotten
accepted, we are sure you are thinking:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Well, that’s interesting, but I could submit a talk like that, or even better!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We agree and know you are up to it, so go ahead and submit a proposal:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://sessionize.com/trino-summit-2024&quot;&gt;
        Submit a talk!
    &lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;And if necessary, check the &lt;a href=&quot;/blog/2024/07/11/trino-summit-2024-call-for-speakers.html&quot;&gt;original announcement&lt;/a&gt; for more tips and ideas.&lt;/p&gt;

&lt;h2 id=&quot;sponsor-trino-summit&quot;&gt;Sponsor Trino Summit&lt;/h2&gt;

&lt;p&gt;To make the event a smashing hit, we are also looking for more sponsors.
Starburst, as the organizing sponsor of the event, is excited to collaborate
with other organizations from the Trino community. If you are
interested in sponsoring, email
&lt;a href=&quot;mailto:events@starburstdata.com?subject=Sponsor%20Trino%20Summit&quot;&gt;events@starburstdata.com&lt;/a&gt;
for information.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Monica Miller, Anna Schibli</name>
        </author>
      

      <summary>Our efforts around Trino Summit 2024 are ramping up and the event is creeping closer and closer. We are really looking forward to the two-day, free, virtual event in December about all things Trino. While we are working hard to put together the SQL basecamps before Trino Summit training sessions and other community events, a number of your awesome peers from the Trino community submitted session proposals, and we are excited to share that glimpse on the agenda for Trino Summit 2024.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2024/lineup-blog-banner.png" />
      
    </entry>
  
    <entry>
      <title>A Kubernetes operator for Trino?</title>
      <link href="https://trino.io/blog/2024/10/10/operator.html" rel="alternate" type="text/html" title="A Kubernetes operator for Trino?" />
      <published>2024-10-10T00:00:00+00:00</published>
      <updated>2024-10-10T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/10/10/operator</id>
      <content type="html" xml:base="https://trino.io/blog/2024/10/10/operator.html">&lt;p&gt;Trino is deployed everywhere – on-premises, in private data centers, in the cloud
with hosting providers, on bare metal servers, on virtual machines, and with
containers. Among all these deployment options, a Kubernetes-based platform
running containers has emerged as the most widely used approach.&lt;/p&gt;

&lt;p&gt;The Trino project caters for this usage with our &lt;a href=&quot;/docs/current/installation/containers.html&quot;&gt;container
images&lt;/a&gt; for every
release and our &lt;a href=&quot;https://github.com/trinodb/charts&quot;&gt;Helm chart&lt;/a&gt;. However,
we keep hearing from people who want to use a Kubernetes operator…&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;existing-operators&quot;&gt;Existing operators&lt;/h2&gt;

&lt;p&gt;We know that various companies have Kubernetes operators developed internally,
and we also know that open source ones exist, for example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/stackabletech/trino-operator&quot;&gt;trino-operator&lt;/a&gt; from
Stackable with integration in
&lt;a href=&quot;https://github.com/stackabletech/trino-lb&quot;&gt;trino-lb&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://charmhub.io/trino-k8s&quot;&gt;Charmed Trino K8s Operator&lt;/a&gt; from Canonical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideally these separate efforts can combine their work and create a great
operator in the Trino project that is closely aligned with Trino itself, and
also suitable for future integration with Trino Gateway. In fact, Trino
Gateway is a good example where different parties came together and considerably
innovated together. Hopefully we can achieve the same with the operator. It can
still be expandable and modular to suit specific needs on different
platforms and for different users.&lt;/p&gt;

&lt;p&gt;We also know that this is &lt;a href=&quot;https://github.com/trinodb/trino/issues/396&quot;&gt;a long-standing community wish from the
issue&lt;/a&gt; and various discussions with
users.&lt;/p&gt;

&lt;h2 id=&quot;discussing-next-steps&quot;&gt;Discussing next steps&lt;/h2&gt;

&lt;p&gt;However, there are some complications, such as the choice of programming
language and the commitment to help within the Trino project as subproject
maintainers. We kicked off some of these discussions in the past at Trino
contributor meetings, and hope that now is a good time to continue.&lt;/p&gt;

&lt;p&gt;To that end we are arranging a community meeting:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Virtual video call&lt;/li&gt;
  &lt;li&gt;30th of October 2024&lt;/li&gt;
  &lt;li&gt;8:00 PDT / 11:00 EDT / 15:00 GMT / 16:00 CET&lt;/li&gt;
  &lt;li&gt;Invite available from Manfred on Trino Slack or via email:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;mailto:manfred@starburst.io?subject=trino-k8s-operator&quot;&gt;
        Tell Manfred you want to join
    &lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;We will also post connection details on the #kubernetes channel and we are
collecting related discussion points on
&lt;a href=&quot;https://github.com/trinodb/trino/wiki/Contributor-meetings#trino-kubernetes-operator-discussion-30-oct-2024&quot;&gt;our contributor meeting page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Looking forward to a great discussion.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso</name>
        </author>
      

      <summary>Trino is deployed everywhere – on-premises, in private data centers, in the cloud with hosting providers, on bare metal servers, on virtual machines, and with containers. Among all these deployment options, a Kubernetes-based platform running containers has emerged as the most widely used approach. The Trino project caters for this usage with our container images for every release and our Helm chart. However, we keep hearing from people who want to use a Kubernetes operator…</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/kubernetes.png" />
      
    </entry>
  
    <entry>
      <title>SQL basecamps before Trino Summit</title>
      <link href="https://trino.io/blog/2024/10/07/sql-basecamps.html" rel="alternate" type="text/html" title="SQL basecamps before Trino Summit" />
      <published>2024-10-07T00:00:00+00:00</published>
      <updated>2024-10-07T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/10/07/sql-basecamps</id>
      <content type="html" xml:base="https://trino.io/blog/2024/10/07/sql-basecamps.html">&lt;p&gt;Later in December your knowledge of our Trino SQL query engine will certainly
peak again at &lt;a href=&quot;/blog/2024/07/11/trino-summit-2024-call-for-speakers.html&quot;&gt;Trino Summit 2024&lt;/a&gt;. To reach those heights and
absorb all there is to learn at Trino Summit, you need to get ready.&lt;/p&gt;

&lt;p&gt;That is why I teamed up with our &lt;a href=&quot;/development/roles#benevolent-dictators-for-life-&quot;&gt;Trino creators and
BDFLs&lt;/a&gt; –
Martin Traverso, Dain Sundstrom, and David Phillips. We aim to be your coaches
and trainers to get you ready and get to the summit without the need for oxygen
masks and sherpas. Join us for the &lt;strong&gt;“SQL basecamps before Trino Summit”&lt;/strong&gt;,
where we expand on our &lt;a href=&quot;https://www.youtube.com/watch?v=SnvSBYhRZLg&amp;amp;list=PLFnr63che7wYzZoo5yyEF5R1QrOH6VRq3&quot;&gt;past SQL training
series&lt;/a&gt;
with two new episodes.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://www.starburst.io/info/sql-basecamps-before-trino-summit/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=NORAM-FY25-Q4-SQL-Basecamps-Before-Trino-Summit&amp;amp;utm_content=blog-1&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Both planned sessions provide a high-level overview and some practical tips and
tricks over the course of an hour. Each session concludes with an open
question-and-answer segment with the speakers.&lt;/p&gt;

&lt;h2 id=&quot;moving-supplies&quot;&gt;Moving supplies&lt;/h2&gt;

&lt;p&gt;In the first episode &lt;strong&gt;SQL basecamp 1 – Moving supplies&lt;/strong&gt; David and Dain will
help me provide an overview of the wide range of possibilities when it comes to
moving data to Trino and moving data with Trino.&lt;/p&gt;

&lt;p&gt;We specifically look at the strengths of Trino for running your data lakehouse
and migrating to it from legacy data lakes or other systems. SQL skills
discussed include tips for creating schemas and tables, adding and updating
data, and inspecting metadata. We talk about table procedures for data
management and also cover some operational aspects. For example, we talk about
the right configuration in your catalogs for your object storage, specifically
the new file system support in Trino.&lt;/p&gt;

&lt;h2 id=&quot;getting-ready-to-summit&quot;&gt;Getting ready to summit&lt;/h2&gt;

&lt;p&gt;The second episode &lt;strong&gt;SQL Basecamp 2 – Getting ready to summit&lt;/strong&gt; builds on the
foundation established in episode 1. Data has moved into the lakehouse, powered
by Trino, and more data is added and changed as part of normal operation. In
this episode Martin and I look at maintaining the data in a healthy state
and explore some tips and tricks for querying data. For example, we look at data
management with procedures, analyzing data with window functions, and examine
more complex structural data.&lt;/p&gt;

&lt;h2 id=&quot;what-do-want-to-learn&quot;&gt;What do you want to learn?&lt;/h2&gt;

&lt;p&gt;So there you have it - enough reason to register. And if not, we can do better:
both sessions are aimed at all of you out there using Trino, and we are ready to
discuss your questions during class. More importantly though, I would also love
to hear your suggestions for these and other topics about SQL and Trino. We can
adjust this series, figure out a session for Trino Summit, or bring another SQL
training series to you next year.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;mailto:manfred@starburst.io?subject=SQL%20basecamp%20idea&quot;&gt;
        Submit an idea to Manfred
    &lt;/a&gt;
&lt;/div&gt;

&lt;h2 id=&quot;trino-summit-needs-you&quot;&gt;Trino Summit needs you!&lt;/h2&gt;

&lt;p&gt;Now with all that in mind, what are you waiting for? Get ready to learn more
about SQL with Trino in the series and at Trino Summit.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://www.starburst.io/info/sql-basecamps-before-trino-summit/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=NORAM-FY25-Q4-SQL-Basecamps-Before-Trino-Summit&amp;amp;utm_content=blog-1&quot;&gt;
        I am convinced - register now
    &lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;And of course, we are also interested in your 
&lt;a href=&quot;https://sessionize.com/trino-summit-2024&quot;&gt;speaker proposals&lt;/a&gt; and 
&lt;a href=&quot;mailto:events@starburstdata.com?subject=Sponsor%20Trino%20Summit%202024&quot;&gt;sponsorships&lt;/a&gt;
for Trino Summit to make it an awesome event for everyone again.&lt;/p&gt;

&lt;p&gt;See you soon,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Later in December your knowledge of our Trino SQL query engine will certainly peak again at Trino Summit 2024. To reach those heights and absorb all there is to learn at Trino Summit, you need to get ready. That is why I teamed up with our Trino creators and BDFLs – Martin Traverso, Dain Sundstrom, and David Phillips. We aim to be your coaches and trainers to get you ready and get to the summit without the need for oxygen masks and sherpas. Join us for the “SQL basecamps before Trino Summit”, where we expand on our past SQL training series with two new episodes. Register now</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2024/sql-basecamps-2024.png" />
      
    </entry>
  
    <entry>
      <title>23 is a go, keeping pace with Java</title>
      <link href="https://trino.io/blog/2024/09/17/java-23.html" rel="alternate" type="text/html" title="23 is a go, keeping pace with Java" />
      <published>2024-09-17T00:00:00+00:00</published>
      <updated>2024-09-17T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/09/17/java-23</id>
      <content type="html" xml:base="https://trino.io/blog/2024/09/17/java-23.html">&lt;p&gt;Only about ten Trino releases or six months ago, we released &lt;a href=&quot;https://trino.io/docs/current/release/release-447.html&quot;&gt;Trino
447&lt;/a&gt; with the requirement to
use Java 22. In recent releases we started to take more and more advantage of
features that are only available with that upgrade. We made some big steps in
terms of performance and talked about some of those performance
enhancements around aircompressor in the recent &lt;a href=&quot;https://trino.io/episodes/65.html&quot;&gt;Trino Community Broadcast
65&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Java community runs its release processes on a very predictable schedule -
March and September mean new Java releases. This time it’s Java 23, and
Trino will not be left behind. We are upgrading to &lt;a href=&quot;https://github.com/trinodb/trino/issues/21316&quot;&gt;use and require Java
23&lt;/a&gt; soon!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;background-and-motivation&quot;&gt;Background and motivation&lt;/h2&gt;

&lt;p&gt;While the new features and improvements in Java 23 are not as impactful as in
Java 22, we still need to keep pace to take advantage of the improvements and
avoid any problems in the future. Here are the Java Enhancement Proposals that
are &lt;a href=&quot;https://openjdk.org/projects/jdk/23/&quot;&gt;included with Java 23&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/455&quot;&gt;JEP 455: Primitive Types in Patterns, instanceof, and switch (Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/466&quot;&gt;JEP 466: Class-File API (Second Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/467&quot;&gt;JEP 467: Markdown Documentation Comments&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/469&quot;&gt;JEP 469: Vector API (Eighth Incubator)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/473&quot;&gt;JEP 473: Stream Gatherers (Second Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/471&quot;&gt;JEP 471: Deprecate the Memory-Access Methods in sun.misc.Unsafe for Removal&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/474&quot;&gt;JEP 474: ZGC: Generational Mode by Default&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/476&quot;&gt;JEP 476: Module Import Declarations (Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/477&quot;&gt;JEP 477: Implicitly Declared Classes and Instance Main Methods (Third Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/480&quot;&gt;JEP 480: Structured Concurrency (Third Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/481&quot;&gt;JEP 481: Scoped Values (Third Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/482&quot;&gt;JEP 482: Flexible Constructor Bodies (Second Preview)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to learn more you can check out the &lt;a href=&quot;https://www.youtube.com/watch?v=ymuv5aUzWu0&quot;&gt;short summary
video&lt;/a&gt; or the &lt;a href=&quot;https://www.youtube.com/watch?v=QG9xKpgwOI4&quot;&gt;three hour long
launch stream&lt;/a&gt;. The &lt;a href=&quot;https://www.oracle.com/news/announcement/oracle-releases-java-23-2024-09-17/&quot;&gt;Oracle press
release&lt;/a&gt;
as well as the &lt;a href=&quot;https://blogs.oracle.com/java/post/the-arrival-of-java-23&quot;&gt;community
announcement&lt;/a&gt; also
bring you a wealth of further information.&lt;/p&gt;

&lt;p&gt;Overall our reasoning is unchanged from the &lt;a href=&quot;/blog/2023/11/03/java-21.html&quot;&gt;upgrade to 21&lt;/a&gt; and the &lt;a href=&quot;/blog/2024/03/13/java-22.html&quot;&gt;upgrade to 22&lt;/a&gt;.
So what are we specifically doing now?&lt;/p&gt;

&lt;h2 id=&quot;current-status-and-plans&quot;&gt;Current status and plans&lt;/h2&gt;

&lt;p&gt;Early access binaries have been in use in our continuous integration builds for
months. Java 23 launched today and the various JDK distribution binary packages
will become available shortly. We are executing on the same blueprint as last
time:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Wait for &lt;a href=&quot;https://adoptium.net/temurin/releases/&quot;&gt;Eclipse Temurin&lt;/a&gt; binaries.&lt;/li&gt;
  &lt;li&gt;Ensure everything works with Java 23.&lt;/li&gt;
  &lt;li&gt;Change the container image to use Java 23.&lt;/li&gt;
  &lt;li&gt;Cut a release and get community feedback from testing with the container.&lt;/li&gt;
  &lt;li&gt;Adjust to any feedback and available improvements for a few releases.&lt;/li&gt;
  &lt;li&gt;Switch the requirement for build and runtime to Java 23.&lt;/li&gt;
  &lt;li&gt;Cut another release and celebrate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Timing on all the work depends on obstacles we find on the way and how we
progress with removing them. We use the &lt;a href=&quot;https://github.com/trinodb/trino/issues/21316&quot;&gt;Java 23 tracking
issue&lt;/a&gt; and the linked issues and
pull requests to manage progress, discuss next steps, and work with the
community.&lt;/p&gt;

&lt;p&gt;Feel free to chime in there, find us on the &lt;a href=&quot;https://trinodb.slack.com/messages/C07ABNN828M&quot;&gt;#core-dev
channel&lt;/a&gt; on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino
community Slack&lt;/a&gt; or join us for a &lt;a href=&quot;https://github.com/trinodb/trino/wiki/Contributor-meetings&quot;&gt;contributor
call&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Mateusz Gajewski</name>
        </author>
      

      <summary>Only about ten Trino releases or six months ago, we released Trino 447 with the requirement to use Java 22. In recent releases we started to take more and more advantage of features that are only available with that upgrade. We made some big steps in terms of performance and talked about some of those performance enhancements around aircompressor in the recent Trino Community Broadcast 65. The Java community runs its release processes on a very predictable schedule - March and September mean new Java releases. This time it’s Java 23, and Trino will not be left behind. We are upgrading to use and require Java 23 soon!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/java-duke-23.png" />
      
    </entry>
  
    <entry>
      <title>Announcing Trino Summit 2024</title>
      <link href="https://trino.io/blog/2024/07/11/trino-summit-2024-call-for-speakers.html" rel="alternate" type="text/html" title="Announcing Trino Summit 2024" />
      <published>2024-07-11T00:00:00+00:00</published>
      <updated>2024-07-11T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/07/11/trino-summit-2024-call-for-speakers</id>
      <content type="html" xml:base="https://trino.io/blog/2024/07/11/trino-summit-2024-call-for-speakers.html">&lt;p&gt;Fresh off the heels of &lt;a href=&quot;/blog/2024/06/24/trino-fest-recap.html&quot;&gt;Trino Fest 2024&lt;/a&gt;, where Commander Bun Bun was busy meeting the Trino community in-person,
we’re already looking forward to another, bigger event to round out the year in
Trino. For those who’ve been here a while, you know that can only mean one
thing: Trino Summit 2024. Much like last year, it will be a two-day, fully
virtual event, hosting a wide range of talks covering all things Trino on the
11th and 12th of December. Read on for more info, or if you’re already
convinced…&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://www.starburst.io/info/trino-summit-2024/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;[…]Y25-Q4-CM-Trino-Summit-2024-IMC-Upgrade&amp;amp;utm_content=CFS-Blog&quot;&gt;
        Register to attend!
    &lt;/a&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;join-us-online&quot;&gt;Join us online&lt;/h2&gt;

&lt;p&gt;Trino Summit is an event that brings together engineers, analysts, data
scientists, and anyone else interested in using or contributing to Trino. As the
biggest Trino event of the year, we’re excited to bring together professionals
from the big data and analytics community, so they can share experiences and
insights, make connections, and learn from each other.&lt;/p&gt;

&lt;p&gt;The event will be broadcast live, and speakers will be addressing questions
asked in chat, so if you want the full experience, make sure to register and
attend while the talks are happening. Even if you can’t make it, registering
means you’ll be notified when we post videos of all talks to the Trino YouTube
channel after the event, &lt;a href=&quot;https://www.starburst.io/info/trino-summit-2024/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;[…]Y25-Q4-CM-Trino-Summit-2024-IMC-Upgrade&amp;amp;utm_content=CFS-Blog&quot;&gt;so don’t fret - sign up!&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;call-for-speakers&quot;&gt;Call for speakers&lt;/h2&gt;

&lt;p&gt;Interested in speaking? We want to hear from everyone in the Trino community
who has something to share. We are looking for full sessions (about 30 minutes)
and lightning talks (15 minutes). We welcome beginner to highly advanced
submissions for talks that are connected to Trino.&lt;/p&gt;

&lt;p&gt;A two-day event means we’ve got room for everything, so if you’re unsure about
whether to submit a talk, go ahead and do it! We’ll review all submissions, and
we’ll do our best to work with you to turn your talk into a smash hit. Some
possible topics include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Best practices and use cases&lt;/li&gt;
  &lt;li&gt;Data lake, lakehouse, and data federation architectures&lt;/li&gt;
  &lt;li&gt;Query federation and data migrations&lt;/li&gt;
  &lt;li&gt;Table formats, file formats, and metadata catalogs&lt;/li&gt;
  &lt;li&gt;Optimizations and performance improvements&lt;/li&gt;
  &lt;li&gt;Data engineering, including data cleaning, batch and streaming architectures,
and maintenance&lt;/li&gt;
  &lt;li&gt;Streaming and other data ingestion and pipelines&lt;/li&gt;
  &lt;li&gt;Data science workflows and analytics&lt;/li&gt;
  &lt;li&gt;SQL analytics, business intelligence, dashboarding and other visualizations&lt;/li&gt;
  &lt;li&gt;Data governance and security&lt;/li&gt;
  &lt;li&gt;Writing advanced SQL queries and pipelines&lt;/li&gt;
  &lt;li&gt;Help for Trino deployment on-premise and in the cloud&lt;/li&gt;
  &lt;li&gt;Developing custom connectors and other plugins&lt;/li&gt;
  &lt;li&gt;Contributing to Trino&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Want to speak?&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://sessionize.com/trino-summit-2024&quot;&gt;
        Submit a talk!
    &lt;/a&gt;
&lt;/div&gt;

&lt;h2 id=&quot;sponsor-trino-summit&quot;&gt;Sponsor Trino Summit&lt;/h2&gt;

&lt;p&gt;Starburst is the organizing sponsor of the event, but to make Trino Summit a
smashing success, they’re excited and interested in collaborating with other
organizations within the community. If you are interested in sponsoring, email
&lt;a href=&quot;mailto:events@starburstdata.com&quot;&gt;events@starburstdata.com&lt;/a&gt; for information.&lt;/p&gt;

&lt;p&gt;And regardless of whether you’re planning on attending, speaking, or sponsoring,
we look forward to seeing you soon!&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden, Manfred Moser, and Monica Miller</name>
        </author>
      

      <summary>Fresh off the heels of Trino Fest 2024, where Commander Bun Bun was busy meeting the Trino community in-person, we’re already looking forward to another, bigger event to round out the year in Trino. For those who’ve been here a while, you know that can only mean one thing: Trino Summit 2024. Much like last year, it will be a two-day, fully virtual event, hosting a wide range of talks covering all things Trino on the 11th and 12th of December. Read on for more info, or if you’re already convinced… Register to attend!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2024/summit-logo.png" />
      
    </entry>
  
    <entry>
      <title>Trino Fest 2024 recap</title>
      <link href="https://trino.io/blog/2024/06/24/trino-fest-recap.html" rel="alternate" type="text/html" title="Trino Fest 2024 recap" />
      <published>2024-06-24T00:00:00+00:00</published>
      <updated>2024-06-24T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/06/24/trino-fest-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2024/06/24/trino-fest-recap.html">&lt;p&gt;Trino Fest 2024 is successfully in the books! While over 100 enthusiastic
members of the community gathered in Boston, over 650 virtual attendees joined
us worldwide to learn from our expert speakers as they discussed topics such as
table formats, enhancements and optimizations, and use cases with Trino both
large and small. And now it is your chance to revisit the presentations or catch
up on everything you missed.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;impressions&quot;&gt;Impressions&lt;/h2&gt;

&lt;p&gt;Judging from early results from attendee and speaker feedback, everyone enjoyed
the event. When asked which sessions they liked, the audience gave answers like&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;They were all very insightful.&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;All of it, but especially the realtime demos to see speed difference on query
optimization.&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;and &lt;em&gt;All of them, nothing was missed!&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just like some attendees, our speakers travelled from Europe, Asia, and other
places, and enjoyed the event.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Thanks for organizing the awesome event and inviting me for the talk!&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Was great to finally meet you and we had a great time at Trino Fest!&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Thanks for a great event last week. It was a pleasure to meet you all.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many of us also &lt;a href=&quot;https://www.linkedin.com/posts/k-shreya-s_trinofest2024-bigdata-analytics-activity-7209236269774585857-p8-e?utm_source=share&amp;amp;utm_medium=member_desktop&quot;&gt;met Commander Bun Bun&lt;/a&gt;,
and &lt;a href=&quot;https://www.youtube.com/watch?v=4jPYpU9Jrrw&quot;&gt;we sent greetings to the remote audience as
well&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://trino.io/assets/blog/trino-fest-2024/cbb-manfred.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The keynote, the sessions, and all the talk in the hallways confirmed that Trino
continues to thrive and expand in usage. Large companies like &lt;a href=&quot;https://trino.io/users.html&quot;&gt;Apple, Microsoft,
LinkedIn, Amazon, and many other users&lt;/a&gt; openly talk
about shipping Trino as part of their products and using it internally as
well. Smaller companies either run Trino themselves or take advantage of
Trino-based products for all their data platform needs. Our sessions for Trino
Fest offered something to learn for everyone.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://trino.io/assets/blog/trino-fest-2024/hallway-chat.png&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;sponsors&quot;&gt;Sponsors&lt;/h2&gt;

&lt;p&gt;Bringing together the event was only possible thanks to the great Trino events
team around &lt;a href=&quot;https://www.linkedin.com/in/anna-schibli-418692172/&quot;&gt;Anna Schibli&lt;/a&gt;
at our main sponsor Starburst, and the assistance from all our other sponsors. A
heartfelt thank you from Commander Bun Bun and all of us goes out to you!&lt;/p&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.starburst.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/starburst-small.png&quot; title=&quot;Starburst, event host and organizer&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.onehouse.ai/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/onehouse-small.png&quot; title=&quot;Onehouse, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.startree.ai/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/startree-small.png&quot; title=&quot;Startree, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.alluxio.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/alluxio-small.png&quot; title=&quot;Alluxio, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://cloudinary.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/cloudinary-small.png&quot; title=&quot;Cloudinary, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.upsolver.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/upsolver-small.png&quot; title=&quot;Upsolver, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;sessions&quot;&gt;Sessions&lt;/h2&gt;

&lt;p&gt;Now, here is what you are really looking for: all the talks and speakers, with
short recaps, slide decks, video recordings, and the Q&amp;amp;A sessions that
followed. Enjoy!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s new in Trino this summer&lt;/strong&gt;
&lt;br /&gt;Presented by Martin Traverso from
&lt;a href=&quot;https://www.starburst.io&quot; target=&quot;_blank&quot;&gt;Starburst&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Martin recapped everything that’s happened in Trino over the last six months,
taking a look at the biggest new features and how Trino development is going
better than ever. He also gave a sneak peek at what we can expect soon in Trino.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=mk3n0_tAdZY&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/keynote.pdf&quot; target=&quot;_blank&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Reducing query cost and query runtimes of Trino powered analytics platforms&lt;/strong&gt;
&lt;br /&gt;Presented by Jonas Irgens Kylling from
&lt;a href=&quot;https://dune.com/&quot; target=&quot;_blank&quot;&gt;Dune&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Jonas gave a detailed talk about how Dune has improved their performance of
Trino with a few key tweaks. That includes leveraging caching with Alluxio,
advanced cluster management, and storing, sampling, and filtering query results.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=11yhPXIXiBY&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/dune.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Enhancing Trino’s query performance and data management with Hudi: innovations and future&lt;/strong&gt;
&lt;br /&gt;Presented by Ethan Guo from
&lt;a href=&quot;https://www.onehouse.ai/&quot; target=&quot;_blank&quot;&gt;Onehouse&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ethan gave a look into development on Hudi and Trino’s Hudi connector,
explaining multi-modal indexing and how it can improve query performance. He
also gave an overview of the roadmap and future of the connector.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=JMzS2BbeK0E&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/onehouse.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Trino Engineering @ Microsoft&lt;/strong&gt;
&lt;br /&gt;Presented by George Fisher and Ishan Patwa from
&lt;a href=&quot;https://www.microsoft.com/&quot; target=&quot;_blank&quot;&gt;Microsoft&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;George and Ishan gave a deep dive into what’s been going on with Microsoft’s
deployment and management of Trino. This included clients and integrations,
result caching, a sharded SQL connector, deep debugging and monitoring, and
seamless security integration with Azure.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=t7ndqYUhKSA&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Enhancing data governance in Trino with the OpenLineage integration&lt;/strong&gt;
&lt;br /&gt;Presented by Alok Kumar Prusty from
&lt;a href=&quot;https://www.apple.com/&quot; target=&quot;_blank&quot;&gt;Apple&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alok’s lightning talk is all about how Apple deployed OpenLineage, an open
framework for data lineage collection and analysis, and built a Trino plugin to
publish OpenLineage compliant events that can be viewed and monitored.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=A7hj1M7IYj8&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Best practices and insights when migrating to Apache Iceberg for data engineers&lt;/strong&gt;
&lt;br /&gt;Presented by Amit Gilad from
&lt;a href=&quot;https://cloudinary.com/&quot; target=&quot;_blank&quot;&gt;Cloudinary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Amit shared how Cloudinary expanded their data lake to use Apache Iceberg. He
demonstrated how moving from Snowflake to an open table format allowed them to
reduce storage costs and leverage different query and processing engines to run
more powerful analytics at scale.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=dKQ2zShNlyQ&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/cloudinary.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Trino query intelligence: insights, recommendations, and predictions&lt;/strong&gt;
&lt;br /&gt;Presented by Marton Bod from &lt;a href=&quot;https://www.apple.com/&quot; target=&quot;_blank&quot;&gt;Apple&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Marton’s lightning talk explored how Apple has monitored and stored metadata for
every Trino query execution, then used that data for real-time cluster
dashboarding, self-service troubleshooting, and automatic generation of
recommendations for users.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=K3iSXOJNaSQ&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;The open source journey of the Trino Delta Lake Connector&lt;/strong&gt;
&lt;br /&gt;Presented by Marius Grama from
&lt;a href=&quot;https://www.starburst.io&quot; target=&quot;_blank&quot;&gt;Starburst&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Marius went into a deep dive on all the work and collaboration that’s gone into
making the Delta Lake connector in Trino a robust, first-class connector. Casual
discussions, engineers working together, GitHub issues filed by the community,
and innovative contributions have all come together, and Marius’ talk shows why
an open source community is so powerful.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=mPfRYdvDcMo&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/delta-lake.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Tiny Trino; new perspectives in small data&lt;/strong&gt;
&lt;br /&gt;Presented by Ben Jeter and Thomas Zugibe from
&lt;a href=&quot;https://www.executivehomes.com/&quot; target=&quot;_blank&quot;&gt;Executive Homes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ben and Tommy explore how Executive Homes uses Trino’s robust suite of
integrations to handle data at a small scale. Instead of petabytes, how about a
handful of gigabytes in several different systems? It’s something that Trino is
well-equipped to handle thanks to how well-supported it is in the data
ecosystem, and they explain why.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=ZcY9LJDdB6Y&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/executive-homes.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Bridging the divide: running Trino SQL on a vector data lake powered by Lance&lt;/strong&gt;
&lt;br /&gt;Presented by Lei Xu from &lt;a href=&quot;https://lancedb.com/&quot; target=&quot;_blank&quot;&gt;LanceDB&lt;/a&gt;
and Noah Shpak from &lt;a href=&quot;https://character.ai/&quot; target=&quot;_blank&quot;&gt;Character.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Lei and Noah give an overview of LanceDB, how it works, and what makes it a
great database for multimodal AI. Then they dive into a Trino connector for
Lance, and explore how Trino slots into Character.AI’s workload to blend
analytics with training and generating new models.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=jmOsVbGfon0&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/lance-characterai.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;How FourKites runs a scalable and cost-effective log analytics solution to
handle petabytes of logs&lt;/strong&gt;
&lt;br /&gt;Presented by Arpit Garg from
&lt;a href=&quot;https://www.fourkites.com/&quot; target=&quot;_blank&quot;&gt;FourKites&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With nearly a petabyte of logs being managed at FourKites, it shouldn’t be a
huge surprise that they’ve turned to Trino to understand and analyze
them. Arpit discusses how they’ve scaled log ingestion, strategically used S3
with Parquet to minimize storage costs, transformed and extracted those logs at
scale, and leveraged Trino to search and explore the datasets with Superset as a
frontend for visualization.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=xdCZBQJt-0g&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/fourkites.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Observing Trino&lt;/strong&gt;
&lt;br /&gt;Presented by Matt Stephenson from
&lt;a href=&quot;https://www.starburst.io&quot; target=&quot;_blank&quot;&gt;Starburst&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Starburst has built a comprehensive observability platform around Trino to
better serve its users and customers. Matt explored all the components of it,
including how to integrate with Jaeger, Prometheus, and ELK.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=v7p72Ggcc5I&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/observing-trino.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Accelerate Performance at Scale: Best Practices for Trino with Amazon S3&lt;/strong&gt;
&lt;br /&gt;Presented by Dai Ozaki from &lt;a href=&quot;https://aws.amazon.com/&quot; target=&quot;_blank&quot;&gt;AWS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Dai’s talk explores best practices to get the most out of using Trino in
conjunction with Amazon S3. He discusses partitioning, scaling workloads,
reducing latency, and resolving common bottlenecks, providing valuable insights
for anyone trying to manage and deploy Trino with S3.
&lt;br /&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=cjUUcHlUKxQ&quot; target=&quot;_blank&quot;&gt;Video recording&lt;/a&gt;
| &lt;a href=&quot;https://trino.io/assets/blog/trino-fest-2024/aws-s3.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next&lt;/h2&gt;

&lt;p&gt;While you are busy catching up, we are still working hard on a recap of the
Trino Contributor Congregation. We also had a lot of great conversations that
led to follow-up action items such as more pull requests to review, new
contributors to onboard, and more projects to work on.&lt;/p&gt;

&lt;p&gt;Make sure to &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;join the community on Slack&lt;/a&gt; to learn
more in the coming weeks.&lt;/p&gt;

&lt;p&gt;Oh, and one last thing…&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://www.starburst.io/info/trino-summit-2024/?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=NORAM-FY25-Q4-CM-Trino-Summit-2024-IMC-Upgrade&amp;amp;utm_content=Trino-Fest-Blog-Recap&quot;&gt;
        Trino Summit 2024 registration is open
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;See you soon,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Cole, and Monica&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Cole Bowden, Monica Miller</name>
        </author>
      

      <summary>Trino Fest 2024 is successfully in the books! While over 100 enthusiastic members of the community gathered in Boston, over 650 virtual attendees joined us worldwide to learn from our expert speakers as they discussed topics such as table formats, enhancements and optimizations, and use cases with Trino both large and small. And now it is your chance to revisit the presentations or catch up on everything you missed.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2024/trino-fest-talk.jpg" />
      
    </entry>
  
    <entry>
      <title>One busy week to go before Trino Fest 2024</title>
      <link href="https://trino.io/blog/2024/06/06/trino-fest-last-call.html" rel="alternate" type="text/html" title="One busy week to go before Trino Fest 2024" />
      <published>2024-06-06T00:00:00+00:00</published>
      <updated>2024-06-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/06/06/trino-fest-last-call</id>
      <content type="html" xml:base="https://trino.io/blog/2024/06/06/trino-fest-last-call.html">&lt;p&gt;This week has surely started off with a big bang and another boom in the data
platform world. Snowflake &lt;a href=&quot;https://www.snowflake.com/blog/introducing-polaris-catalog/&quot;&gt;introduced the open source Polaris
catalog&lt;/a&gt; as an
implementation of the Iceberg REST catalog specification. And Databricks, the
main driver of the Delta Lake table format, &lt;a href=&quot;https://www.databricks.com/blog/databricks-tabular&quot;&gt;announced their acquisition of
Tabular&lt;/a&gt;, a main driver in
the Apache Iceberg community.&lt;/p&gt;

&lt;p&gt;Interestingly enough, Trino is in the middle of all this with great support for
Delta Lake, Hudi, Iceberg, and also the Iceberg REST catalog. And if all that
interoperability with Trino is not enough reason to join us next week at Trino
Fest 2024, I have some more ideas for you to consider.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;reasons-to-attend-trino-fest&quot;&gt;Reasons to attend Trino Fest&lt;/h2&gt;

&lt;p&gt;Trino Fest is happening next week on the 13th of June, and following are all the
reasons I can think of why you should tune in.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The event is free for all attendees. It is available as an in-person event in
Boston and for virtual attendance across the rest of the world.&lt;/li&gt;
  &lt;li&gt;You can learn about real world experience with Trino, Delta Lake, Iceberg,
Hudi, and many &lt;a href=&quot;https://trino.io/ecosystem/index.html&quot;&gt;other data sources, clients, and add-ons&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Many Trino friends, users, and contributors from around the world and
companies like Amazon, Apple, Bloomberg, character.ai, Dune, LanceDB,
Microsoft, Onehouse, and Starburst are going to attend and present.&lt;/li&gt;
  &lt;li&gt;Monica Miller and Manfred Moser will guide you through the event with the help
of the awesome Starburst Trino events team.&lt;/li&gt;
  &lt;li&gt;In-person attendees might just meet our mascot, Commander Bun Bun.&lt;/li&gt;
  &lt;li&gt;On the following day, the &lt;a href=&quot;https://github.com/trinodb/trino/wiki/Contributor-meetings#trino-contributor-congregation-14-june-2024&quot;&gt;Trino Contributor
Congregation&lt;/a&gt;
will dive super deep into technical details and collaborative efforts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convinced yet, or still wondering? In either case, go and &lt;a href=&quot;http://www.starburst.io/info/trino-fest-2024?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=Global-FY25-Q2-EV-Trino-Fest-2024&amp;amp;utm_content=Blog-3&quot;&gt;have a look at the
detailed agenda and then register to attend&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;http://www.starburst.io/info/trino-fest-2024?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=Global-FY25-Q2-EV-Trino-Fest-2024&amp;amp;utm_content=Blog-3&quot;&gt;
        Register now!
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;And last, but not least, thank you to our sponsors for making this event happen…&lt;/p&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.starburst.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/starburst-small.png&quot; title=&quot;Starburst, event host and organizer&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.onehouse.ai/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/onehouse-small.png&quot; title=&quot;Onehouse, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.startree.ai/&quot; target=&quot;_blank&quot;&gt;
&lt;img src=&quot;https://trino.io/assets/images/logos/startree-small.png&quot; title=&quot;StarTree, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.alluxio.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/alluxio-small.png&quot; title=&quot;Alluxio, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://cloudinary.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/cloudinary-small.png&quot; title=&quot;Cloudinary, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.upsolver.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/upsolver-small.png&quot; title=&quot;Upsolver, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

<summary>This week has surely started off with a big bang and another boom in the data platform world. Snowflake introduced the open source Polaris catalog as an implementation of the Iceberg REST catalog specification. And Databricks, the main driver of the Delta Lake table format, announced their acquisition of Tabular, a main driver in the Apache Iceberg community. Interestingly enough, Trino is in the middle of all this with great support for Delta Lake, Hudi, Iceberg, and also the Iceberg REST catalog. And if all that interoperability with Trino is not enough reason to join us next week at Trino Fest 2024, I have some more ideas for you to consider.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2024/announcement-banner.png" />
      
    </entry>
  
    <entry>
      <title>Big names round out the Trino Fest 2024 lineup</title>
      <link href="https://trino.io/blog/2024/05/08/trino-fest-lineup-finalized.html" rel="alternate" type="text/html" title="Big names round out the Trino Fest 2024 lineup" />
      <published>2024-05-08T00:00:00+00:00</published>
      <updated>2024-05-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/05/08/trino-fest-lineup-finalized</id>
      <content type="html" xml:base="https://trino.io/blog/2024/05/08/trino-fest-lineup-finalized.html">&lt;p&gt;We gave
&lt;a href=&quot;/blog/2024/04/15/trino-fest-2024-approaches.html&quot;&gt;a sneak peek of the Trino Fest lineup a month ago&lt;/a&gt;,
and we’re excited to now bring you the full lineup for the event. We’ve got some
major names being added, including Amazon, Microsoft, and another talk from
Apple. With FourKites and a joint talk with LanceDB and CharacterAI also added
to the schedule, we’re excited to present the
&lt;a href=&quot;https://www.starburst.io/info/trino-fest-2024/#agenda&quot;&gt;full lineup for Trino Fest 2024&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Trino Fest is barely a month away on the 13th of June, and whether you want to
attend live in Boston or tune in virtually, this is a reminder that you
should &lt;a href=&quot;http://www.starburst.io/info/trino-fest-2024?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=Global-FY25-Q2-EV-Trino-Fest-2024&amp;amp;utm_content=Blog-3&quot;&gt;register to attend!&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;trino-fest-the-contributor-congregation-and-logistics&quot;&gt;Trino Fest, the contributor congregation, and logistics&lt;/h2&gt;

&lt;p&gt;In case you missed
&lt;a href=&quot;/blog/2024/02/20/announcing-trino-fest-2024.html&quot;&gt;our announcement of Trino Fest&lt;/a&gt;,
it’s a hybrid event taking place from 9am-5pm Eastern Time on June 13th. It’ll
feature talks from a wide range of Trino users and contributors, with topics
ranging from use cases, migrations, cluster management and administration,
to lakehouse integrations and more. If you want to join us in-person, we’ll be at
the Hyatt Regency Boston. There will also be a meeting for Trino contributors
the day after the event at the Starburst office in Boston from 9am-1pm, and if
you’d be interested in attending that, please reach out to me (Cole Bowden)
or Manfred Moser on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you still haven’t booked a hotel, we also have a discounted rate at the Hyatt
for the event to make life easy - whether that’s waking up and heading
downstairs for the start of the event, or being able to quickly duck back to
your room for a 30-minute meeting without missing too much. One link will take
you to a booking for just the night before the event, while the other allows
you to optionally book an extra night prior or include the night after Trino
Fest so you can stick around for the contributor congregation or explore Boston.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.hyatt.com/en-US/group-booking/BOSTO/G-STA4&quot;&gt;
        Book your hotel for June 12-13
    &lt;/a&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.hyatt.com/en-US/group-booking/BOSTO/G-STA3&quot;&gt;
        Book your hotel for June 11-14
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;h2 id=&quot;and-dont-forget-those-additional-speakers&quot;&gt;And don’t forget those additional speakers&lt;/h2&gt;

&lt;p&gt;George Fisher, Ishan Patwa, and Oleg Savin will be diving deep into how Trino is
leveraged at Microsoft. While we’ve previously had LinkedIn at Trino events,
this is the first time the Trino community is getting to hear about the scale of
Trino within Microsoft proper, and with their plans to cover clients,
integrations, result caching, a sharded connector, visualization for monitoring,
and AKS deployment with Azure, there will be a lot to learn.&lt;/p&gt;

&lt;p&gt;Alok Kumar Prusty and Amogh Margoor from Apple will be joining the lineup to
discuss Trino query intelligence. With the mountain of query metadata, the team
at Apple has been able to better understand Trino usage and use that knowledge
to create impactful improvements for their Trino users. With dashboarding,
self-service troubleshooting, and automatic recommendations for query
optimization, Alok and Amogh will detail how a world-class engineering team can
take an awesome tool like Trino and make it even better for the end users.&lt;/p&gt;

&lt;p&gt;Also relatively new to the Trino community is discussing AI workloads. Lei Xu
from &lt;a href=&quot;https://lancedb.com/&quot;&gt;LanceDB&lt;/a&gt; and Noah Shpak from
&lt;a href=&quot;https://character.ai/&quot;&gt;character.ai&lt;/a&gt; will be highlighting exactly that,
using Trino as an analytics engine on top of a LanceDB-powered vector data lake.
With AI data so often sitting in a silo, analyzing it with a traditional SQL
workload can be expensive or complicated… but Lei and Noah will be
demonstrating how character.ai’s LanceDB/Trino pairing maintains the power of
both systems while making it easy.&lt;/p&gt;

&lt;p&gt;Dai Ozaki from Amazon will be diving into how to optimize Trino with S3. Given
how many people are using Trino with S3 already, hearing directly from Dai, an
engineer at Amazon, regarding best practices and optimizations should prove
beneficial for a massive chunk of the Trino community. Dai plans on talking
about how Trino and S3 interact, and how that knowledge can be used to get the
most out of your stack and avoid common bottlenecks.&lt;/p&gt;

&lt;p&gt;And last but not least, Arpit Garg from &lt;a href=&quot;https://www.fourkites.com/&quot;&gt;FourKites&lt;/a&gt;
will be discussing utilizing Trino to handle nearly a petabyte of logs.
FourKites is able to ingest massive amounts of logs, use S3 and
Parquet to keep storage costs low, transform and extract logs at scale, and then
use Trino as the engine to query those logs and reference them in context with
other data sets and data stores. Arpit will also touch on using Superset as a
frontend for Trino.&lt;/p&gt;

&lt;p&gt;And keep in mind - all of that is in addition to the talks we’ve already
announced!
&lt;a href=&quot;http://www.starburst.io/info/trino-fest-2024?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=Global-FY25-Q2-EV-Trino-Fest-2024&amp;amp;utm_content=Blog-3&quot;&gt;Register to attend&lt;/a&gt;,
&lt;a href=&quot;https://www.hyatt.com/en-US/group-booking/BOSTO/G-STA3&quot;&gt;book your hotel&lt;/a&gt;, and
the Trino community is looking forward to seeing you there!&lt;/p&gt;

&lt;p&gt;Thank you to our sponsors for making this event happen…&lt;/p&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.starburst.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/starburst-small.png&quot; title=&quot;Starburst, event host and organizer&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.onehouse.ai/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/onehouse-small.png&quot; title=&quot;Onehouse, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.startree.ai/&quot; target=&quot;_blank&quot;&gt;
&lt;img src=&quot;https://trino.io/assets/images/logos/startree-small.png&quot; title=&quot;StarTree, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.alluxio.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/alluxio-small.png&quot; title=&quot;Alluxio, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://cloudinary.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/cloudinary-small.png&quot; title=&quot;Cloudinary, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.upsolver.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/upsolver-small.png&quot; title=&quot;Upsolver, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

<summary>We gave a sneak peek of the Trino Fest lineup a month ago, and we’re excited to now bring you the full lineup for the event. We’ve got some major names being added, including Amazon, Microsoft, and another talk from Apple. With FourKites and a joint talk with LanceDB and CharacterAI also added to the schedule, we’re excited to present the full lineup for Trino Fest 2024. Trino Fest is barely a month away on the 13th of June, and whether you want to attend live in Boston or tune in virtually, this is a reminder that you should register to attend!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2024/announcement-banner.png" />
      
    </entry>
  
    <entry>
      <title>A sneak peek of Trino Fest 2024</title>
      <link href="https://trino.io/blog/2024/04/15/trino-fest-2024-approaches.html" rel="alternate" type="text/html" title="A sneak peek of Trino Fest 2024" />
      <published>2024-04-15T00:00:00+00:00</published>
      <updated>2024-04-15T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/04/15/trino-fest-2024-approaches</id>
      <content type="html" xml:base="https://trino.io/blog/2024/04/15/trino-fest-2024-approaches.html">&lt;p&gt;Trino Fest is drawing ever closer. Commander Bun Bun has been hard at work
behind the scenes arranging the schedule and making sure that Trino’s trip to
Boston is going to be a great one. In case you missed it,
&lt;a href=&quot;/blog/2024/02/20/announcing-trino-fest-2024.html&quot;&gt;we announced Trino Fest&lt;/a&gt;
a couple months ago, and if you &lt;em&gt;have&lt;/em&gt; missed it, make sure to go register to
attend! All our speakers will be in person in downtown Boston on the 13th of
June, with plenty of opportunities for networking and a happy hour event at the
end of the day. But if you can’t make the trip to enjoy the lovely New England
summer, we’ll also be live-streaming the event, and you can register to join us
virtually.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;http://www.starburst.io/info/trino-fest-2024?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=Global-FY25-Q2-EV-Trino-Fest-2024&amp;amp;utm_content=Blog-2&quot;&gt;
        Register to attend!
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Still on the fence, though? Read on for a preview of our speaker lineup and
brief summaries of their talks. Keep in mind this also isn’t the full lineup,
and we’ll follow up soon with the last few talks that round out the schedule.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;a-brief-word-from-our-sponsors&quot;&gt;A brief word from our sponsors…&lt;/h2&gt;

&lt;p&gt;Thank you to our sponsors for making this event happen…&lt;/p&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.starburst.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/starburst-small.png&quot; title=&quot;Starburst, event host and organizer&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.onehouse.ai/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/onehouse-small.png&quot; title=&quot;Onehouse, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.startree.ai/&quot; target=&quot;_blank&quot;&gt;
&lt;img src=&quot;https://trino.io/assets/images/logos/startree-small.png&quot; title=&quot;StarTree, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.alluxio.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/alluxio-small.png&quot; title=&quot;Alluxio, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://cloudinary.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/cloudinary-small.png&quot; title=&quot;Cloudinary, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.upsolver.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/upsolver-small.png&quot; title=&quot;Upsolver, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;And now onto what you’re waiting for: a preview of most of the talks coming to
Trino Fest this year!&lt;/p&gt;

&lt;h2 id=&quot;lakehouses&quot;&gt;Lakehouses&lt;/h2&gt;

&lt;p&gt;It’s no secret that using Trino as part of your lakehouse has become one of its
major use cases in the past few years. We’re excited to say that at Trino Fest,
we’ll have representation for each of the modern big three table formats:
Iceberg, Delta Lake, and Hudi.&lt;/p&gt;

&lt;h3 id=&quot;iceberg&quot;&gt;Iceberg&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg&lt;/a&gt; will be covered twice: Amogh
Jahagirdar from &lt;a href=&quot;https://tabular.io/&quot;&gt;Tabular&lt;/a&gt; will be diving into the world of
Iceberg views and how they can be leveraged to coordinate across different query
languages and dialects. Amit Gilad from &lt;a href=&quot;https://cloudinary.com/&quot;&gt;Cloudinary&lt;/a&gt;
will be covering the story of migrating out of Snowflake to the wonderful world
of open table formats and Iceberg.&lt;/p&gt;

&lt;h3 id=&quot;delta-lake&quot;&gt;Delta Lake&lt;/h3&gt;

&lt;p&gt;Marius Grama, a Trino contributor at &lt;a href=&quot;https://www.starburst.io/&quot;&gt;Starburst&lt;/a&gt;,
will be going into detail on the history, development, and improvements to the
&lt;a href=&quot;https://delta.io/&quot;&gt;Delta Lake&lt;/a&gt; connector. With
&lt;a href=&quot;/blog/2024/04/11/time-travel-delta-lake.html&quot;&gt;time travel for the Delta Lake connector&lt;/a&gt;
landing in Trino 445, it’s one of the most exciting areas for development in
open source Trino, and there are some interesting stories that Marius is excited
to share with the community.&lt;/p&gt;

&lt;h3 id=&quot;hudi&quot;&gt;Hudi&lt;/h3&gt;

&lt;p&gt;Rounding out data lakes, Ethan Guo from &lt;a href=&quot;https://www.onehouse.ai/&quot;&gt;Onehouse&lt;/a&gt;
will be diving into Trino’s &lt;a href=&quot;https://hudi.apache.org/&quot;&gt;Hudi&lt;/a&gt; connector, giving
an update on what’s landed lately to improve performance and functionality.
He’ll also give a preview of what’s coming soon. The features are flying in, and
if you’re a current or prospective user of Hudi with Trino, you won’t want to
miss out.&lt;/p&gt;

&lt;h2 id=&quot;data-takes&quot;&gt;Data takes&lt;/h2&gt;

&lt;p&gt;Of course, there’s more to Trino than querying data lakes, and there’s a wide
variety of talks to discuss the other activities going on within the Trino
community.&lt;/p&gt;

&lt;h3 id=&quot;small-scale&quot;&gt;Small scale&lt;/h3&gt;

&lt;p&gt;Ben Jeter at &lt;a href=&quot;https://www.executivehomes.com/&quot;&gt;Executive Homes&lt;/a&gt;, who gave
&lt;a href=&quot;/blog/2023/07/25/trino-fest-2023-datto.html&quot;&gt;a talk at Trino Fest last year&lt;/a&gt;
while at &lt;a href=&quot;https://www.datto.com/&quot;&gt;Datto&lt;/a&gt;, is back to discuss running Trino at a
more moderate scale than we’re used to hearing about in the Trino space.
Forget petabytes and exabytes, and welcome a tiny cluster querying thousands,
not millions, of records that still derives huge value from Trino. It’s a great
playbook for smaller startups and enterprises who still need robust, flexible,
performant analytics.&lt;/p&gt;

&lt;h3 id=&quot;maximizing-performance&quot;&gt;Maximizing performance&lt;/h3&gt;

&lt;p&gt;Jonas Kylling from &lt;a href=&quot;https://dune.com/about&quot;&gt;Dune&lt;/a&gt; will be detailing how they’ve
managed to optimize Trino and squeeze out every ounce of performance to reduce
query costs and runtimes. That includes leveraging the new Alluxio-based file
system caching, emulating various cluster sizes to avoid expensive idle cluster
time, and storing, sampling, and filtering query results to avoid re-executing
queries.&lt;/p&gt;

&lt;h3 id=&quot;query-intelligence&quot;&gt;Query intelligence&lt;/h3&gt;

&lt;p&gt;Marton Bod and Vinitha Gankidi from Apple bring insights on query intelligence.
They’ll demonstrate how Apple has come to understand when their clusters are most
utilized and who’s using them, enabling slicing and dicing along different
dimensions. A query intelligence dataset can power real-time cluster
dashboarding, self-service troubleshooting, and automatic generation of
recommendations for users, all of which can make Trino better than
ever.&lt;/p&gt;

&lt;h2 id=&quot;and-more&quot;&gt;And more!&lt;/h2&gt;

&lt;p&gt;Of course, Trino’s own Martin Traverso will be giving a keynote on the latest
and greatest in the project, covering everything big that’s landed since Trino
Summit, as well as a glimpse at the roadmap for the project in the coming few
months. Several other big talks are falling into place that we can’t announce
just yet, so stay tuned for more info as the event draws nearer.&lt;/p&gt;

&lt;h2 id=&quot;trino-contributor-congregation&quot;&gt;Trino contributor congregation&lt;/h2&gt;

&lt;p&gt;The day after Trino Fest, we’ll also be hosting an in-person meetup for
Trino contributors and engineers to catch up, discuss the Trino roadmap, and
engage directly with the maintainers. It’s a great opportunity to put
faces and voices to those GitHub handles, align on the big ideas or tricky PRs
that have been moving slowly, and find more ways to get involved in Trino
development. If you’re interested in attending, message Manfred Moser or Cole
Bowden on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino Slack&lt;/a&gt;, and we’ll get you added to
the attendee list and share more details.&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>Trino Fest is drawing ever closer. Commander Bun Bun has been hard at work behind the scenes arranging the schedule and making sure that Trino’s trip to Boston is going to be a great one. In case you missed it, we announced Trino Fest a couple months ago, and if you have missed it, make sure to go register to attend! All our speakers will be in person in downtown Boston on the 13th of June, with plenty of opportunities for networking and a happy hour event at the end of the day. But if you can’t make the trip to enjoy the lovely New England summer, we’ll also be live-streaming the event, and you can register to join us virtually. Register to attend! Still on the fence, though? Read on for a preview of our speaker lineup and brief summaries of their talks. Keep in mind this also isn’t the full lineup, and we’ll follow up soon with the last few talks that round out the schedule.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2024/announcement-banner.png" />
      
    </entry>
  
    <entry>
      <title>Time travel in Delta Lake connector</title>
      <link href="https://trino.io/blog/2024/04/11/time-travel-delta-lake.html" rel="alternate" type="text/html" title="Time travel in Delta Lake connector" />
      <published>2024-04-11T00:00:00+00:00</published>
      <updated>2024-04-11T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/04/11/time-travel-delta-lake</id>
      <content type="html" xml:base="https://trino.io/blog/2024/04/11/time-travel-delta-lake.html">&lt;p&gt;Exciting news - time travel capability has finally arrived in the Delta Lake
connector! After introducing support for time travel in the Iceberg connector
back in 2022, we’re thrilled to announce that the Delta Lake connector now joins
the ranks as the second connector offering this feature.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;background-and-motivation&quot;&gt;Background and motivation&lt;/h2&gt;

&lt;p&gt;Time travel as a feature has a number of practical use cases:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Data recovery and rollback&lt;/strong&gt;: In the event of data corruption or erroneous
 updates, time travel allows users to roll back to a previous version of the
 data, restoring it to a known good state.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Auditing and compliance&lt;/strong&gt;: Time travel enables auditors and compliance
 teams to analyze data changes over time, ensuring regulatory compliance and
 providing transparency into data operations.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Historical analysis&lt;/strong&gt;: Data analysts and data scientists can perform
 historical analysis by querying data at different points in time, uncovering
 trends, patterns, and anomalies that may not be apparent in current data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;time-travel-sql-example&quot;&gt;Time travel SQL example&lt;/h2&gt;

&lt;p&gt;Start by creating a catalog &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example&lt;/code&gt; with the &lt;a href=&quot;https://trino.io/docs/current/connector/delta-lake.html&quot;&gt;Delta Lake
connector&lt;/a&gt;, create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;demo&lt;/code&gt;
schema, and make them the current catalog and schema with the
&lt;a href=&quot;https://trino.io/docs/current/sql/use.html&quot;&gt;USE&lt;/a&gt; statement.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;USE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;demo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s create a Delta Lake table, add some data, modify the table, and add some
more data using the following SQL statements:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column_mapping_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;name&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;Alice&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;Bob&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;Mallory&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ALTER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DROP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;COLUMN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Use the following statement to look at all data in the table:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; id
----
  1
  2
  3
  4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$history&lt;/code&gt; metadata table offers a record of past operations:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;operation&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;users$history&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; version |             timestamp              |  operation
---------+------------------------------------+--------------
       0 | 2024-04-10 17:49:18.528 Asia/Tokyo | CREATE TABLE
       1 | 2024-04-10 17:49:18.755 Asia/Tokyo | WRITE
       2 | 2024-04-10 17:49:18.929 Asia/Tokyo | DROP COLUMNS
       3 | 2024-04-10 17:49:19.137 Asia/Tokyo | WRITE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can specify the version using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR VERSION AS OF&lt;/code&gt;. For example, to time
travel to version 1, which includes a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WRITE&lt;/code&gt; operation, the query would look
like this:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FOR&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VERSION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OF&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, time travel rolls back not only the data but also the table definition:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;----+---------&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Alice&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bob&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Mallory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;technical-details&quot;&gt;Technical details&lt;/h2&gt;

&lt;p&gt;Delta Lake manages transaction logs in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_delta_log&lt;/code&gt; directory under
the table’s location.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Last checkpoint&lt;/strong&gt;: The optional &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_last_checkpoint&lt;/code&gt; file records the
version of the last checkpoint.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Delta log entries&lt;/strong&gt;: Each JSON file contains an atomic set of actions, for
example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000000000000000.json&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Checkpoints&lt;/strong&gt;: Each Parquet checkpoint file contains the complete replay of all actions
up to and including the checkpointed table version, for example
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000000000000010.checkpoint.parquet&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More details are available in the &lt;a href=&quot;https://github.com/delta-io/delta/blob/master/PROTOCOL.md&quot;&gt;Delta Lake protocol
documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Following is an example of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_delta_log&lt;/code&gt; directory:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;00000000000000000000.json
00000000000000000001.json
00000000000000000002.json
00000000000000000003.json
00000000000000000003.checkpoint.parquet
00000000000000000004.json
00000000000000000005.json
...
_last_checkpoint
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When the specified version is older than the last checkpoint, such as version 2,
the connector reads the transaction log files starting from the first
transaction log file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000000000000000.json&lt;/code&gt;) up to the specified version
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000000000000002.json&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;When the specified version is equal to the last checkpoint, in our example
version 3, the connector reads only the checkpoint file for that version
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000000000000003.checkpoint.parquet&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;When the specified version is newer than the last checkpoint, so version 4, the
connector reads the checkpoint file for the last checkpoint version
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000000000000003.checkpoint.parquet&lt;/code&gt;) and the transaction log file for the
specified version (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000000000000004.json&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The actual logic is more complex when the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_last_checkpoint&lt;/code&gt; file is absent, because the
connector cannot determine the checkpoints without listing the file names in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_delta_log&lt;/code&gt; directory.&lt;/p&gt;
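
&lt;p&gt;The version selection logic above can be sketched in Python. This is a
simplified illustration, not the connector’s actual code: it assumes a
single-part checkpoint exists exactly at the version recorded in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_last_checkpoint&lt;/code&gt; file:&lt;/p&gt;

```python
def files_to_read(version, last_checkpoint):
    """Return the _delta_log file names consulted for a table version.

    Simplified sketch: assumes a single-part checkpoint exists at the
    version recorded in the _last_checkpoint file.
    """
    def name(v):
        # Delta log file names are zero-padded to 20 digits.
        return f"{v:020d}"

    if last_checkpoint is None or version in range(last_checkpoint):
        # The requested version precedes the checkpoint (or there is no
        # checkpoint): replay the JSON log entries from version 0.
        return [f"{name(v)}.json" for v in range(version + 1)]
    # Start from the checkpoint, then apply any newer JSON log entries.
    files = [f"{name(last_checkpoint)}.checkpoint.parquet"]
    files += [f"{name(v)}.json" for v in range(last_checkpoint + 1, version + 1)]
    return files
```

&lt;p&gt;With the last checkpoint at version 3, requesting version 2 replays the JSON
log entries from version 0, version 3 reads only the checkpoint file, and
version 4 reads the checkpoint file plus the newer log entry, matching the three
cases described above.&lt;/p&gt;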

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Time travel in the Trino &lt;a href=&quot;https://trino.io/docs/current/connector/delta-lake.html&quot;&gt;Delta Lake
connector&lt;/a&gt; opens up new
possibilities for data exploration and analysis, empowering users to delve into
the past and derive insights from historical data. By seamlessly integrating
with Delta Lake’s versioning and transaction logs, Trino provides a powerful
tool for querying data as it appeared at different points in time. Whether it’s
auditing, historical analysis, or data recovery, time travel adds a valuable
dimension to data-driven decision-making, making it an indispensable feature for
modern data platforms.&lt;/p&gt;

&lt;h2 id=&quot;bonus&quot;&gt;Bonus&lt;/h2&gt;

&lt;p&gt;Join us for &lt;a href=&quot;/blog/2024/02/20/announcing-trino-fest-2024.html&quot;&gt;Trino Fest 2024&lt;/a&gt; where &lt;a href=&quot;https://github.com/findinpath&quot;&gt;Marius Grama&lt;/a&gt; presents &lt;em&gt;“The open
source journey of the Trino Delta Lake connector”&lt;/em&gt; and shares more tips and
tricks.&lt;/p&gt;</content>

      
        <author>
          <name>Yuya Ebihara</name>
        </author>
      

      <summary>Exciting news - time travel capability has finally arrived in the Delta Lake connector! After introducing support for time travel in the Iceberg connector back in 2022, we’re thrilled to announce that the Delta Lake connector now joins the ranks as the second connector offering this feature.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/trino-delta.png" />
      
    </entry>
  
    <entry>
      <title>Blazing ahead with 22</title>
      <link href="https://trino.io/blog/2024/03/13/java-22.html" rel="alternate" type="text/html" title="Blazing ahead with 22" />
      <published>2024-03-13T00:00:00+00:00</published>
      <updated>2024-03-13T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/03/13/java-22</id>
      <content type="html" xml:base="https://trino.io/blog/2024/03/13/java-22.html">&lt;p&gt;It was not that long ago that we &lt;a href=&quot;/blog/2023/11/03/java-21.html&quot;&gt;first announced support for Java 21&lt;/a&gt;, and subsequently made it a build and runtime
requirement with &lt;a href=&quot;https://trino.io/docs/current/release/release-436.html&quot;&gt;Trino 436&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since then, the codebase has received some significant improvements in
readability, and we have also seen better performance. However, innovation in
Trino and Java is not standing still; on the contrary, it’s accelerating. On the
Java community side, Java 22 is about to be released, and we think it is time to
drive innovation in Trino even further. Trino is going to use and require
Java 22 soon!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;background-and-motivation&quot;&gt;Background and motivation&lt;/h2&gt;

&lt;p&gt;The planned move to use and require Java 22 for building and running Trino is
driven by several goals:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Take advantage of performance and runtime improvements of the new JVM version.&lt;/li&gt;
  &lt;li&gt;Use the newly available language features to further improve readability and
maintenance aspects of the codebase.&lt;/li&gt;
  &lt;li&gt;Enable the use of further performance improvements for Trino under the umbrella
of &lt;a href=&quot;https://github.com/trinodb/trino/issues/14237&quot;&gt;Project Hummingbird&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Attract and motivate more contributors by offering the opportunity to work
on a cutting-edge, complex application with a modern Java stack and its latest
language features and APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Speaking of APIs and new features, let’s look at the JDK Enhancement
Proposals (JEPs) that we are actively tracking. Specifically, we plan to
experiment with them and adopt any non-preview JEPs where we see benefits. We
also plan to report any issues and problems we encounter back upstream to the
Java community:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Region Pinning for G1 (&lt;a href=&quot;https://openjdk.org/jeps/423&quot;&gt;JEP 423&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Foreign Function &amp;amp; Memory API (&lt;a href=&quot;https://openjdk.org/jeps/454&quot;&gt;JEP 454&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Unnamed Variables and Patterns (&lt;a href=&quot;https://openjdk.org/jeps/456&quot;&gt;JEP 456&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Class File API in preview (&lt;a href=&quot;https://openjdk.org/jeps/457&quot;&gt;JEP 457&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;String Templates in second preview (&lt;a href=&quot;https://openjdk.org/jeps/459&quot;&gt;JEP 459&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Vector API in 7th incubator (&lt;a href=&quot;https://openjdk.org/jeps/460&quot;&gt;JEP 460&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Structured Concurrency in second preview (&lt;a href=&quot;https://openjdk.org/jeps/462&quot;&gt;JEP 462&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Scoped Values in second preview (&lt;a href=&quot;https://openjdk.org/jeps/464&quot;&gt;JEP 464&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many of these APIs allow us to further modernize the feature set of Trino and
adapt it to current hardware and compute realities. Specifically, we can
continue our commitment to the Java ecosystem and avoid many of the
complexities and pitfalls of JNI, the traditional, now legacy approach to
integrating with native code and specific hardware features.&lt;/p&gt;

&lt;p&gt;Another aspect some of you might wonder about is the move from a Java LTS
version to a Java STS release – from “long term support” to “short term
support”. So far, Trino has required Java 8, Java 11, Java 17, and then Java 21.
Since all of them are LTS releases, some of you might have concluded that we
have a policy of only using Java LTS versions. That is not the case; it is only
a coincidence.&lt;/p&gt;

&lt;p&gt;We have always strived to use up-to-date source code, dependencies, runtime
environments, and so forth. The benefits, including better performance,
available and included bug fixes, reduced need for backports, fewer security
issues, and support for modern language features, development environments, and
tooling, have always far outweighed the effort of staying up to date.&lt;/p&gt;

&lt;p&gt;We are now finally at the long-planned point where we can move quickly enough
as a project to use the latest tools, dependencies, and Java releases while
keeping up our frequent release cadence. And that is exactly what we are doing
for the benefit of everyone contributing to and using Trino. Java 22 now. Then
later this year we can move to Java 23, and next year to 24 and 25.&lt;/p&gt;

&lt;p&gt;So what are we specifically doing now?&lt;/p&gt;

&lt;h2 id=&quot;current-status-and-plans&quot;&gt;Current status and plans&lt;/h2&gt;

&lt;p&gt;Java 22 is scheduled to ship in March 2024. The various JDK distribution
binary packages will become available shortly after the official release.&lt;/p&gt;

&lt;p&gt;Early access (EA) source and binaries are already available, and our continuous
integration builds already use an EA build successfully.&lt;/p&gt;

&lt;p&gt;Overall the transition is going well. Our plan is to follow the same approach as
our switch to Java 21:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Ensure everything works with Java 22.&lt;/li&gt;
  &lt;li&gt;Change the container image to use Java 22.&lt;/li&gt;
  &lt;li&gt;Cut a release and get community feedback from testing with the container.&lt;/li&gt;
  &lt;li&gt;Adjust to any feedback and available improvements for a few releases.&lt;/li&gt;
  &lt;li&gt;Switch the requirement for build and runtime to Java 22.&lt;/li&gt;
  &lt;li&gt;Cut another release and celebrate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then the real fun starts all over. We can update code, libraries, and start
working with the new APIs. Timing on all the work depends on obstacles we find
on the way and how we progress with removing them.&lt;/p&gt;

&lt;p&gt;We use the &lt;a href=&quot;https://github.com/trinodb/trino/issues/20980&quot;&gt;Java 22 tracking
issue&lt;/a&gt; and the linked issues and
pull requests to manage progress, discuss next steps, and work with the
community.&lt;/p&gt;

&lt;p&gt;Feel free to chime in there or find us on the &lt;a href=&quot;https://trinodb.slack.com/archives/CP1MUNEUX&quot;&gt;#dev
channel&lt;/a&gt; on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino community
Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Join us in this exciting next step for Trino.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Update from 8 May 2024:&lt;/strong&gt;
The release of &lt;a href=&quot;https://trino.io/docs/current/release/release-447.html&quot;&gt;Trino 447&lt;/a&gt;
includes the switch to Java 22 as a requirement for running Trino.&lt;/p&gt;
&lt;/blockquote&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso, Dain Sundstrom, David Phillips</name>
        </author>
      

      <summary>It was not that long ago that we first announced support for Java 21, and subsequently made it a build and runtime requirement with Trino 436. Since then, the codebase received some significant improvements in readability, and we have also seen better performance. However, innovation in Trino and Java is not holding still, on the contrary - it’s accelerating. On the Java community side, Java 22 is just about to be released, and we think it is time to drive innovation in Trino even further. Trino is going to use and require Java 22 soon!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/java-duke-22.png" />
      
    </entry>
  
    <entry>
      <title>A cache refresh for Trino</title>
      <link href="https://trino.io/blog/2024/03/08/cache-refresh.html" rel="alternate" type="text/html" title="A cache refresh for Trino" />
      <published>2024-03-08T00:00:00+00:00</published>
      <updated>2024-03-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/03/08/cache-refresh</id>
      <content type="html" xml:base="https://trino.io/blog/2024/03/08/cache-refresh.html">&lt;p&gt;Thinking about our recent work on caching in Trino reminds me of the famous
saying, &lt;a href=&quot;https://www.karlton.org/2017/12/naming-things-hard/&quot;&gt;“There are only two hard things in computer science: cache invalidation
and naming things&lt;/a&gt;.” Well,
in the Trino community we know all about caching and naming. With the recent
&lt;a href=&quot;https://trino.io/docs/current/release/release-439.html&quot;&gt;Trino 439 release&lt;/a&gt;, caching
from object storage file systems got a refresh. Catalogs using the Delta Lake,
Hive, Iceberg, and soon Hudi connectors now get to access performance benefits
from the new Alluxio-powered file system caching.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;in-the-past&quot;&gt;In the past&lt;/h2&gt;

&lt;p&gt;So how did we get here? A long, long time ago, Qubole open-sourced a &lt;a href=&quot;https://github.com/qubole/rubix&quot;&gt;lightweight
data caching framework called
RubiX&lt;/a&gt;. The library was integrated into the
Trino Hive connector, and it enabled &lt;a href=&quot;https://trino.io/docs/438/connector/hive-caching.html&quot;&gt;Hive connector storage
caching&lt;/a&gt;. But over time, any
open source project without active maintenance becomes stale. And like a stale
cache, a stale open source project can cause issues, or become outdated and
unsuitable for modern use. Though RubiX had once served Trino well, it was time
to remove the dust, and RubiX had to go.&lt;/p&gt;

&lt;h2 id=&quot;making-progress&quot;&gt;Making progress&lt;/h2&gt;

&lt;p&gt;Catching back up to 2024, Trino now includes powerful connectors for the modern
lakehouse formats Delta Lake, Hudi, and Iceberg:&lt;/p&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://trino.io/docs/current/connector/delta-lake.html&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/delta-lake.png&quot; title=&quot;Delta Lake connector&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://trino.io/docs/current/connector/hudi.html&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/apache-hudi.png&quot; title=&quot;Hudi connector&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://trino.io/docs/current/connector/iceberg.html&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/apache-iceberg.png&quot; title=&quot;Iceberg connector&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Hive is still around, just like HDFS, but we consider them both close to legacy
status. Yet all four connectors could benefit from caching. Good news came at
Trino Summit 2022 when Hope Wang and Beinan Wang from
&lt;a href=&quot;https://trino.io/ecosystem/add-on.html#alluxio&quot;&gt;Alluxio&lt;/a&gt; presented about their
integration with Trino and the Hive connector - &lt;a href=&quot;/blog/2023/07/21/trino-fest-2023-alluxio-recap.html&quot;&gt;Trino optimization with
distributed caching on data lake&lt;/a&gt;. They mentioned plans to open
source their implementation and an initial pull request (PR) was created.&lt;/p&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;&lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;img src=&quot;https://trino.io/assets/images/logos/alluxio.png&quot; title=&quot;Alluxio&quot; /&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;collaboration&quot;&gt;Collaboration&lt;/h2&gt;

&lt;p&gt;The initial presentation and PR planted a seed in the community. The Trino
project had been moving fast in terms of deprecating the old dependencies from
the Hadoop and Hive ecosystem, so the initial Alluxio PR was no longer up to
date or compatible with the latest Trino version. Discussions with &lt;a href=&quot;https://github.com/electrum&quot;&gt;David
Phillips&lt;/a&gt; laid out the path to adjust to the new
file system support and get ready for reviews towards a merge.&lt;/p&gt;

&lt;p&gt;In the end it was &lt;a href=&quot;https://github.com/pluies&quot;&gt;Florent Delannoy&lt;/a&gt; who started
another &lt;a href=&quot;https://github.com/trinodb/trino/pull/18719&quot;&gt;PR for file system caching support, specifically for the Delta Lake
connector&lt;/a&gt;. His teammate &lt;a href=&quot;https://github.com/jkylling&quot;&gt;Jonas
Irgens Kylling&lt;/a&gt;, also a &lt;a href=&quot;/blog/2023/07/14/trino-fest-2023-dune.html&quot;&gt;presenter from Trino Fest
2023&lt;/a&gt;, took over the work on the
PR. The collaboration on it was an &lt;strong&gt;epic effort&lt;/strong&gt;. After many months,
over 300 comments directly on GitHub, and countless hours of coding, reviewing,
testing, and discussion on Slack and elsewhere, the work finally resulted in a
successful merge, and therefore inclusion in the next release.&lt;/p&gt;

&lt;p&gt;Special props for helping Florent and Jonas must go out to &lt;a href=&quot;https://github.com/electrum&quot;&gt;David
Phillips&lt;/a&gt;, &lt;a href=&quot;https://github.com/raunaqmorarka&quot;&gt;Raunaq
Morarka&lt;/a&gt;, &lt;a href=&quot;https://github.com/findepi&quot;&gt;Piotr
Findeisen&lt;/a&gt;, &lt;a href=&quot;https://github.com/wendigo&quot;&gt;Mateusz
Gajewski&lt;/a&gt;, &lt;a href=&quot;https://github.com/beinan&quot;&gt;Beinan Wang&lt;/a&gt;,
&lt;a href=&quot;https://github.com/amoghmargoor&quot;&gt;Amogh Margoor&lt;/a&gt;, &lt;a href=&quot;https://github.com/osscm&quot;&gt;Manish
Malhotra&lt;/a&gt;, and &lt;a href=&quot;https://github.com/marton-bod&quot;&gt;Marton
Bod&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;finishing&quot;&gt;Finishing&lt;/h2&gt;

&lt;p&gt;In parallel to the work on the initial PR for Delta Lake, yours truly ended up
working on the documentation, and pulled together an &lt;a href=&quot;https://github.com/trinodb/trino/issues/20550&quot;&gt;issue and conversations to
streamline the rollout&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/wendigo&quot;&gt;Mateusz Gajewski&lt;/a&gt; had already put together a PR to
remove the old RubiX integration. With the merge of the initial PR we
were off to the races. We merged the removal of RubiX and the addition of the
docs. Mateusz also added support for OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/osscm&quot;&gt;Manish Malhotra&lt;/a&gt; and &lt;a href=&quot;https://github.com/amoghmargoor&quot;&gt;Amogh
Margoor&lt;/a&gt; sent a PR for Iceberg support. They
were also about to add Hive support, when &lt;a href=&quot;https://github.com/raunaqmorarka&quot;&gt;Raunaq
Morarka&lt;/a&gt; beat them to it and submitted that PR.&lt;/p&gt;

&lt;p&gt;After some final cleanup, &lt;a href=&quot;https://github.com/colebow&quot;&gt;Cole Bowden&lt;/a&gt; and &lt;a href=&quot;https://github.com/martint&quot;&gt;Martin
Traverso&lt;/a&gt; got the release notes together and shipped
&lt;a href=&quot;https://trino.io/docs/current/release/release-439.html&quot;&gt;Trino 439&lt;/a&gt;! Now you can use
it, too.&lt;/p&gt;

&lt;h2 id=&quot;using-file-system-caching&quot;&gt;Using file system caching&lt;/h2&gt;

&lt;p&gt;There are only a few relatively simple steps to add file system caching to your
catalogs that use the Delta Lake, Hive, or Iceberg connectors:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Provision fast local file system storage on all your Trino cluster nodes. How
you do that depends on your cluster provisioning.&lt;/li&gt;
  &lt;li&gt;Enable file system caching and configure the cache location, for example at
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/trino-cache&lt;/code&gt; on the nodes, in your catalog properties files.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fs.cache.enabled=true
fs.cache.directories=/tmp/trino-cache
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After a cluster restart, file system caching is active for the configured
catalogs, and you can tweak it with &lt;a href=&quot;https://trino.io/docs/current/object-storage/file-system-cache.html&quot;&gt;further, optional configuration
properties&lt;/a&gt;.&lt;/p&gt;
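&lt;p&gt;Putting it together, a minimal Iceberg catalog properties file with caching
enabled could look like the following sketch. The metastore address is a
placeholder, and a real catalog likely carries additional settings:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;connector.name=iceberg
hive.metastore.uri=thrift://example.net:9083
fs.cache.enabled=true
fs.cache.directories=/tmp/trino-cache
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;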

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next&lt;/h2&gt;

&lt;p&gt;What a success! It took many members from the global Trino village to get this
feature added. Now our users across the globe can enjoy even more benefits of
using Trino, and also participate in our next steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Further improvements to the current implementation, maybe adding
worker-to-worker connections for exchanging cached files.&lt;/li&gt;
  &lt;li&gt;Preparation to add file system caching to the Hudi connector is in progress
with &lt;a href=&quot;https://github.com/codope&quot;&gt;Sagar Sumit&lt;/a&gt; and &lt;a href=&quot;https://github.com/yihua&quot;&gt;Y Ethan
Guo&lt;/a&gt;, with the implementation to follow.&lt;/li&gt;
  &lt;li&gt;Adjust to any learnings from production usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our thanks, and those from all current and future users, go out to everyone
involved in this effort. What are we going to do next?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;PS: If you want to share your use of Trino or connect with other Trino users,
&lt;a href=&quot;/blog/2024/02/20/announcing-trino-fest-2024.html&quot;&gt;join us for the free Trino Fest 2024&lt;/a&gt; as a speaker or attendee, live in Boston
or virtually from your home.&lt;/p&gt;

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Thinking about our recent work on caching in Trino reminds me of the famous saying, “There are only two hard things in computer science: cache invalidation and naming things.” Well, in the Trino community we know all about caching and naming. With the recent Trino 439 release, caching from object storage file systems got a refresh. Catalogs using the Delta Lake, Hive, Iceberg, and soon Hudi connectors now get to access performance benefits from the new Alluxio-powered file system caching.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-cache-refresh.png" />
      
    </entry>
  
    <entry>
      <title>Japanese edition of Trino: The Definitive Guide</title>
      <link href="https://trino.io/blog/2024/02/27/the-definitive-guide-2-jp.html" rel="alternate" type="text/html" title="Japanese edition of Trino: The Definitive Guide" />
      <published>2024-02-27T00:00:00+00:00</published>
      <updated>2024-02-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/02/27/the-definitive-guide-2-jp</id>
      <content type="html" xml:base="https://trino.io/blog/2024/02/27/the-definitive-guide-2-jp.html">&lt;p&gt;Do you know where the name ‘Trino’ comes from? It’s actually a shortened form of
‘neutrino’. These fast and lightweight subatomic particles have recently made
their way to Japan. You can now reserve your copy of the Japanese edition of
&lt;a href=&quot;https://trino.io/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt;!&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Today, we are happy to announce that the Japanese translation of the book
&lt;a href=&quot;https://trino.io/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt; is
available for communities all across Japan and far beyond. Preorder today
to get your copy from the first batch in mid-March. Hopefully it lowers the
barrier to Trino for native speakers. We invite you all to get your
own copy:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.hanmoto.com/bd/isbn/9784798071671&quot;&gt;
        分散SQLクエリエンジンTrino徹底ガイド 秀和システム
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Our thanks go out to Masanori Nishida and his teams at Shuwa System. I would also
like to thank my great team of translators and collaborators, &lt;a href=&quot;https://github.com/Lewuathe&quot;&gt;Kai
Sasaki&lt;/a&gt;, &lt;a href=&quot;https://github.com/aajisaka&quot;&gt;Akira
Ajisaka&lt;/a&gt;, &lt;a href=&quot;https://github.com/eurekaeru&quot;&gt;Kaname
Nishizuka&lt;/a&gt;, and &lt;a href=&quot;https://github.com/mikiT&quot;&gt;Miki
Takata&lt;/a&gt; for their help in making the book a reality.
We hope many readers can benefit from the translated edition.&lt;/p&gt;

&lt;p&gt;We look forward to chatting with many of our new readers and Trino users on the
&lt;a href=&quot;https://trinodb.slack.com/app_redirect?channel=general-jp&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;general-jp&lt;/code&gt;&lt;/a&gt;
channel in &lt;a href=&quot;/slack.html&quot;&gt;the Trino community Slack&lt;/a&gt;, other
channels, and direct messaging.&lt;/p&gt;

&lt;p&gt;Also, don’t forget to tell us about your usage of &lt;a href=&quot;/blog/2024/02/20/announcing-trino-fest-2024.html&quot;&gt;Trino in the upcoming Trino
Fest 2024 as a speaker. Or just register to attend the free event&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Yuya Ebihara&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Yuya Ebihara</name>
        </author>
      

      <summary>Do you know where the name ‘Trino’ comes from? It’s actually a shortened form of ‘neutrino’. These fast and lightweight subatomic particles have recently made their way to Japan. You can now reserve your copy of the Japanese edition of Trino: The Definitive Guide!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/ttdg2-jp-cover.jpg" />
      
    </entry>
  
    <entry>
      <title>Trino Fest goes to Boston in 2024</title>
      <link href="https://trino.io/blog/2024/02/20/announcing-trino-fest-2024.html" rel="alternate" type="text/html" title="Trino Fest goes to Boston in 2024" />
      <published>2024-02-20T00:00:00+00:00</published>
      <updated>2024-02-20T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/02/20/announcing-trino-fest-2024</id>
      <content type="html" xml:base="https://trino.io/blog/2024/02/20/announcing-trino-fest-2024.html">&lt;p&gt;After the resounding success of Trino Fest and Trino Summit in 2023, Commander
Bun Bun has exciting news to share: we’re taking our biggest events of the year
back to being in-person. They’ll be hybrid, to be more specific, so if you can’t
travel, don’t fret, you’ll still be able to watch and ask questions in chat.
But if you can travel, you won’t want to miss out! Everything you already know
and love about Trino Fest is moving to the East Coast for the lovely Boston
summer. The event is on the 13th of June in the Hyatt Regency Boston, where
we’ll have a full day of talks, time to network, and a happy hour at the end of
the day. You may even get to meet Commander Bun Bun, who’s ditching the hiking
gear in favor of training for the Olympics. Sound exciting?&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;http://www.starburst.io/info/trino-fest-2024?utm_medium=trino&amp;amp;utm_source=website&amp;amp;utm_campaign=Global-FY25-Q2-EV-Trino-Fest-2024&amp;amp;utm_content=Blog-1&quot;&gt;
        Register to attend!
    &lt;/a&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;join-us-in-person&quot;&gt;Join us in person&lt;/h2&gt;

&lt;p&gt;Our event will be hosted at the Hyatt Regency in Boston, where we are planning a
full day of festivities followed by a happy hour on the Hyatt Regency deck.
There is a
&lt;a href=&quot;https://www.hyatt.com/en-US/group-booking/BOSTO/G-STA4&quot;&gt;discounted room block&lt;/a&gt;
set aside for those interested in attending live and staying with us in Boston.
If you are looking to book hotel dates in addition to what is provided on the
room block, email &lt;a href=&quot;mailto:events@starburstdata.com&quot;&gt;events@starburstdata.com&lt;/a&gt;,
and they will help you coordinate your reservation.&lt;/p&gt;

&lt;p&gt;Regardless of whether you plan on attending in person or online, you do need to
register, so make sure to click the button above!&lt;/p&gt;

&lt;h2 id=&quot;call-for-speakers&quot;&gt;Call for speakers&lt;/h2&gt;

&lt;p&gt;Interested in speaking? We want to hear from everyone in the Trino community
who has something to share. If you aren’t sure whether it’s worth it to submit,
submit anyway! We’ll review all submissions, and we’ll do our best to work with
you to turn your talk into a smash hit. We are looking for both full sessions
(about 30 minutes) and lightning talks (10-15 minutes). We welcome intermediate
to advanced submissions for talks that are connected to Trino on any of the
following topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Best practices and use cases&lt;/li&gt;
  &lt;li&gt;Data migrations&lt;/li&gt;
  &lt;li&gt;Optimizations and performance improvements&lt;/li&gt;
  &lt;li&gt;Data governance&lt;/li&gt;
  &lt;li&gt;Data engineering, including batch and streaming architectures&lt;/li&gt;
  &lt;li&gt;Data science&lt;/li&gt;
  &lt;li&gt;SQL analytics and BI&lt;/li&gt;
  &lt;li&gt;Cloud data lake use cases&lt;/li&gt;
  &lt;li&gt;Data lake architecture&lt;/li&gt;
  &lt;li&gt;Query federation&lt;/li&gt;
  &lt;li&gt;Table formats&lt;/li&gt;
  &lt;li&gt;Data ingestion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Want to speak?&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://sessionize.com/trino-fest-2024&quot;&gt;
        Submit a talk!
    &lt;/a&gt;
&lt;/div&gt;

&lt;h2 id=&quot;-trino-contributor-congregation&quot;&gt;&lt;a name=&quot;tcc&quot;&gt;&lt;/a&gt; Trino contributor congregation&lt;/h2&gt;

&lt;p&gt;The day after Trino Fest, we’ll also be hosting an in-person meetup for
Trino contributors and engineers to catch up, discuss the Trino roadmap, and
engage directly with the maintainers in-person. It’s a great opportunity to put
faces and voices to those GitHub handles, align on the big ideas or tricky PRs
that have been moving slowly, and find more ways to get involved in Trino
development. If you’re interested in attending, message Manfred Moser or Cole
Bowden on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino Slack&lt;/a&gt;, and we’ll get you added to
the attendee list and share more details.&lt;/p&gt;

&lt;h2 id=&quot;sponsor-trino-fest&quot;&gt;Sponsor Trino Fest&lt;/h2&gt;

&lt;p&gt;Starburst is the organizing sponsor of the event, but to make Trino Fest a
smashing success, they’re excited and interested in collaborating with other
organizations within the community. If you are interested in sponsoring, email
&lt;a href=&quot;mailto:events@starburstdata.com&quot;&gt;events@starburstdata.com&lt;/a&gt; for information.&lt;/p&gt;

&lt;p&gt;And regardless of whether you’re planning on attending, speaking, or sponsoring,
we look forward to seeing you soon!&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>After the resounding success of Trino Fest and Trino Summit in 2023, Commander Bun Bun has exciting news to share: we’re taking our biggest events of the year back to being in-person. They’ll be hybrid, to be more specific, so if you can’t travel, don’t fret, you’ll still be able to watch and ask questions in chat. But if you can travel, you won’t want to miss out! Everything you already know and love about Trino Fest is moving to the East Coast for the lovely Boston summer. The event is on the 13th of June in the Hyatt Regency Boston, where we’ll have a full day of talks, time to network, and a happy hour at the end of the day. You may even get to meet Commander Bun Bun, who’s ditching the hiking gear in favor of training for the Olympics. Sound exciting? Register to attend!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2024/announcement-banner.png" />
      
    </entry>
  
    <entry>
      <title>Open Policy Agent for Trino arrived</title>
      <link href="https://trino.io/blog/2024/02/06/opa-arrived.html" rel="alternate" type="text/html" title="Open Policy Agent for Trino arrived" />
      <published>2024-02-06T00:00:00+00:00</published>
      <updated>2024-02-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/02/06/opa-arrived</id>
      <content type="html" xml:base="https://trino.io/blog/2024/02/06/opa-arrived.html">&lt;p&gt;Trino now ships with an access control integration using the popular and widely
used &lt;a href=&quot;https://www.openpolicyagent.org/&quot;&gt;Open Policy Agent (OPA)&lt;/a&gt; from the Cloud Native
Computing Foundation. The release of &lt;a href=&quot;https://trino.io/docs/current/release/release-438.html&quot;&gt;Trino
438&lt;/a&gt; marks an important
milestone of the effort towards this integration.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;collaboration-and-history&quot;&gt;Collaboration and history&lt;/h2&gt;

&lt;p&gt;Open Policy Agent was first released in 2016 and has gained more and more
popularity in the ecosystem of cloud native applications and beyond.&lt;/p&gt;

&lt;p&gt;Initial efforts for an integration with Trino started at Bloomberg, Stackable,
Raft, and other places separately and sometimes in parallel, with only partial
collaboration. You might have first heard about it in August 2022 in the &lt;a href=&quot;https://trino.io/episodes/39.html&quot;&gt;Trino
Community Broadcast episode 39&lt;/a&gt; with a team from
Raft as guests.&lt;/p&gt;

&lt;p&gt;Usage and experience with OPA grew. In the end, Pablo Arteaga from
&lt;a href=&quot;https://www.techatbloomberg.com/&quot;&gt;Bloomberg&lt;/a&gt; and Sebastian Bernauer and Sönke
Liebau from &lt;a href=&quot;https://stackable.tech/&quot;&gt;Stackable&lt;/a&gt; took the initiative to start a
pull request to Trino. Their persistence and collaboration carried them through many
review comments, update commits, and even a second PR. Along the way, they submitted
a talk and eventually presented at Trino Summit 2023 about the Open Policy Agent
access control with Trino and their motivation to move from Apache Ranger to OPA.&lt;/p&gt;

&lt;h2 id=&quot;opa-at-trino-summit-2023&quot;&gt;OPA at Trino Summit 2023&lt;/h2&gt;

&lt;p&gt;The presentation from Pablo and Sönke titled “Trino OPA authorizer - An open
source love story” received a lot of interest from the audience at the event and
on YouTube since then. They explained the architectural differences between using
Ranger and OPA. Sönke described the usage of OPA in the Stackable platform and
how it enables a single access control platform to apply across many systems.
They discussed their collaboration on the pull request, and Pablo showed a
migration path from Ranger and a full demo of OPA with Trino.&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/fbqqapQbAv0&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;They also made the &lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/opa-trino.pdf&quot;&gt;slide deck available for your
reference&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Edward Morgan and Bhaarat Sharma from &lt;a href=&quot;https://teamraft.com/&quot;&gt;Raft&lt;/a&gt; also
presented &lt;a href=&quot;https://www.youtube.com/watch?v=6KspMwCbOfI&quot;&gt;Avoiding pitfalls with query federation in data
lakehouses&lt;/a&gt; at Trino Summit, and
detailed their OPA usage in their Data Fabric platform. It combines Delta Lake,
Trino, Apache Kafka, and Open Policy Agent (OPA) into a robust lakehouse data
platform. They talked about access control in Trino overall and how important it
is for their customers, including the US Department of Defense. Their
presentation also included a demo of OPA with Trino.&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/6KspMwCbOfI&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;opa-on-the-way-to-trino&quot;&gt;OPA on the way to Trino&lt;/h2&gt;

&lt;p&gt;Pablo and Sebastian continued their efforts on the &lt;a href=&quot;https://github.com/trinodb/trino/pull/19532&quot;&gt;pull
request&lt;/a&gt; after Trino Summit. They
worked successfully with Dain on the code review and necessary changes, and
helped Manfred with the documentation.&lt;/p&gt;

&lt;p&gt;Finally, with the release of Trino 438, the &lt;a href=&quot;https://trino.io/docs/current/security/opa-access-control.html&quot;&gt;Open Policy Agent access
control&lt;/a&gt; is available
to all Trino users.&lt;/p&gt;

&lt;p&gt;The community is already taking notice with follow-up pull requests for further
improvements and blog posts such as &lt;a href=&quot;https://www.linkedin.com/pulse/enhancing-security-observability-trino-open-policy-agent-isa-inalcik-zhl9e/&quot;&gt;Enhancing Security and Observability in
Trino with Open Policy Agent and
OpenTelemetry&lt;/a&gt;
from Isa Inalcik.&lt;/p&gt;

&lt;h2 id=&quot;benefits-of-opa&quot;&gt;Benefits of OPA&lt;/h2&gt;

&lt;p&gt;The arrival of OPA support for Trino marks an important step. OPA is a mature
and widely used access control system. Its
&lt;a href=&quot;https://www.openpolicyagent.org/ecosystem/&quot;&gt;ecosystem&lt;/a&gt; includes many
integrations, user interfaces, development tools, and other resources.&lt;/p&gt;

&lt;p&gt;OPA is a very flexible authorization system, making it an ideal match for Trino.
Trino deployments are often part of a diverse data platform, spanning a variety
 of interconnected data sources, pipelines, client tools and applications.&lt;/p&gt;

&lt;p&gt;Trino users now have an alternative to the file-based access
control from the Trino project itself, to maintaining their own Ranger
integration, or to commercial offerings for access control.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next&lt;/h2&gt;

&lt;p&gt;We reached another milestone but we are not done yet. Specifically for OPA, we
are looking at the following next tasks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Get more features from various older, private forks converted into pull
requests to Trino so everyone can benefit.&lt;/li&gt;
  &lt;li&gt;Update the documentation with more practical advice and tips.&lt;/li&gt;
  &lt;li&gt;Provide further resources for running OPA with Trino, writing rego scripts,
and helping the community.&lt;/li&gt;
  &lt;li&gt;Implementation of row level filtering and column masking, based on the
&lt;a href=&quot;https://github.com/bloomberg/trino/pull/16&quot;&gt;draft&lt;/a&gt; from Pablo.&lt;/li&gt;
&lt;/ul&gt;
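&lt;p&gt;To give a flavor of the Rego side of this work, here is a minimal policy
sketch that denies everything except requests from a single user. The input
fields shown reflect the general shape of the requests Trino sends to OPA, but
consult the OPA access control documentation for the exact structure before
relying on it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;package trino

# Deny by default, then allow any operation for one user.
default allow = false

allow {
    input.context.identity.user == &quot;admin&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;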

&lt;p&gt;Special thanks go to everyone participating so far. Consider this an open
invitation to join the effort.&lt;/p&gt;

&lt;p&gt;Ping me on Slack directly or find us in #opa-dev.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Trino now ships with an access control integration using the popular and widely used Open Policy Agent (OPA) from the Cloud Native Computing Foundation. The release of Trino 438 marks an important milestone of the effort towards this integration.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/opa-small.png" />
      
    </entry>
  
    <entry>
      <title>Trino 2023 wrapped</title>
      <link href="https://trino.io/blog/2024/01/19/trino-2023-wrapped.html" rel="alternate" type="text/html" title="Trino 2023 wrapped" />
      <published>2024-01-19T00:00:00+00:00</published>
      <updated>2024-01-19T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2024/01/19/trino-2023-wrapped</id>
      <content type="html" xml:base="https://trino.io/blog/2024/01/19/trino-2023-wrapped.html">&lt;p&gt;If &lt;a href=&quot;https://www.newsroom.spotify.com/2023-wrapped/&quot;&gt;“Wrapped” is good enough for Spotify&lt;/a&gt;, 
it’s good enough for Trino, right? As we look forward to a bright 2024, we can
also take a moment to get sentimental, look back at everything we’ve
accomplished, and reflect on the progress we’ve made. Commander Bun Bun has been
hard at work, so if you haven’t been paying close attention to Trino or want an
idea of all that went down in 2023, we’re happy to present you with an end of
year recap. We’ll be exploring what’s gone on in the community, on development,
the events we’ve hosted, and discuss the cool new features and technologies you
can use when you’re running Trino.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/IRq3ZNR9Dgs&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;2023-by-the-numbers&quot;&gt;2023 by the numbers&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;64,288 views 👀 on YouTube&lt;/li&gt;
  &lt;li&gt;5,872 hours watched ⌚on YouTube&lt;/li&gt;
  &lt;li&gt;5,018 new commits 💻 in GitHub&lt;/li&gt;
  &lt;li&gt;2,985 new stargazers ⭐ in GitHub&lt;/li&gt;
  &lt;li&gt;2,494 pull requests merged ✅ in GitHub&lt;/li&gt;
  &lt;li&gt;1,227 issues 📝 created in GitHub&lt;/li&gt;
  &lt;li&gt;704 new subscribers 📺 in YouTube&lt;/li&gt;
  &lt;li&gt;45 videos 🎥 uploaded to YouTube&lt;/li&gt;
  &lt;li&gt;39 blog ✍️ posts&lt;/li&gt;
  &lt;li&gt;30 Trino 🚀 releases&lt;/li&gt;
  &lt;li&gt;10 Trino Community Broadcast ▶️ episodes&lt;/li&gt;
  &lt;li&gt;2 Trino ⛰️ Summits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re excited to say that Trino continued to grow in 2023:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;GitHub stars increased by nearly 50% total and by 8% more than last year&lt;/li&gt;
  &lt;li&gt;Commits increased by 7%&lt;/li&gt;
  &lt;li&gt;Slack usage picked up dramatically&lt;/li&gt;
  &lt;li&gt;YouTube viewership was up 7% despite a lack of Pokemon-themed musical content compared to 2022 (our bad)&lt;/li&gt;
  &lt;li&gt;30 releases kept new versions of Trino coming out more than every other week.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks in part to all that growth, it’s more important than ever to be on
&lt;a href=&quot;/slack.html&quot;&gt;our Slack&lt;/a&gt;. If you’re a Trino user or community member and aren’t
already on there, you’re missing out! Make sure to join up for community
announcements, release statuses, the shared expertise of the entire Trino
community, and event-specific channels for discussion when we’re hosting things 
like Trino Fest and Trino Summit. Speaking of those…&lt;/p&gt;

&lt;h2 id=&quot;trino-events&quot;&gt;Trino events&lt;/h2&gt;

&lt;p&gt;One of the best parts of being an open source community is that it’s easy to be
excited and connect with others about using such a cool piece of technology.
Whether that’s bringing Trino to new users who can take advantage of it, or
sharing what we’ve learned so other Trino users can make the most of it, events
are one of the best ways to distribute that knowledge. So what were we up to
this year?&lt;/p&gt;

&lt;h3 id=&quot;trino-fest-and-trino-summit&quot;&gt;Trino Fest and Trino Summit&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest&lt;/a&gt; and
&lt;a href=&quot;https://trino.io/blog/2023/12/18/trino-summit-recap.html&quot;&gt;Trino Summit&lt;/a&gt; are
becoming mainstays on the Trino calendar each year, and 2023 was no different.
Formerly “Cinco de Trino,” we ditched the Cinco de Mayo theme and went with the
simpler “Trino Fest” in June, opting to theme it around Commander Bun Bun’s Lake
House Summer Camp, with a focus on integrating Trino with lakehouse and data
lake architectures. Trino Summit only wrapped up a little over a month ago,
rounding out the year and highlighting some amazing developments that we’ll be
talking about later in this blog post.&lt;/p&gt;

&lt;p&gt;Trino Fest has historically been the smaller event, but it did some catching up
in 2023, as both Trino Fest and Trino Summit were made virtual and expanded to 2
days this year. With the events easier to attend than ever before, we reached a
combined total of about 1,200 live attendees, with thousands more views on demand.&lt;/p&gt;

&lt;p&gt;The lineups were packed with 34 talks across both events, featuring speakers
from huge Trino users like Salesforce, Stripe, Apple, and Lyft, as well as from
major Trino contributors like Starburst, Tabular, and Bloomberg. You can
view &lt;a href=&quot;https://www.youtube.com/playlist?list=PLFnr63che7wbBu_czq-SS9iVdQ4CIv2z1&quot;&gt;recordings of every Trino Fest talk&lt;/a&gt;
and &lt;a href=&quot;https://www.youtube.com/playlist?list=PLFnr63che7wYeJLUjUaEftCFfjymhgLcq&quot;&gt;every Trino Summit talk&lt;/a&gt;
on the Trino YouTube channel if you missed out.&lt;/p&gt;

&lt;h3 id=&quot;meetups-and-international-events&quot;&gt;Meetups and international events&lt;/h3&gt;

&lt;p&gt;One of the more exciting developments was a major event in Japan -
&lt;a href=&quot;https://trino.io/blog/2023/10/11/a-report-about-trino-conference-tokyo-2023.html&quot;&gt;Trino Conference Tokyo&lt;/a&gt;. 
A virtual event with four sessions, it brought Trino to a Japanese-speaking
audience and further pushed our favorite query engine across language borders.
On top of that,
&lt;a href=&quot;https://www.starburst.io/info/india-trino-meetup-miq/?utm_source=trino&amp;amp;utm_medium=slack&amp;amp;utm_campaign=APAC-FY24-Q4-CM-india-Meetup-at-MiQ-Digital&quot;&gt;Starburst co-hosted a Trino meetup in Bengaluru&lt;/a&gt;, 
and the community organized the first-ever Korean Trino meetup (pictured below).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/2023-review/trino-kr-meetup.png&quot; float=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And last but not least,
&lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino, the Definitive Guide, 2nd Edition&lt;/a&gt;
was translated into Mandarin and Polish.&lt;/p&gt;

&lt;h2 id=&quot;the-trino-gateway&quot;&gt;The Trino Gateway&lt;/h2&gt;

&lt;p&gt;One of the biggest announcements in the Trino community this year was
the &lt;a href=&quot;https://trino.io/blog/2023/09/28/trino-gateway.html&quot;&gt;launch of the Trino Gateway&lt;/a&gt;. A proxy and
load-balancer, it’s a crucial piece of Trino infrastructure for organizations
that need more than one Trino cluster to suit their needs.&lt;/p&gt;

&lt;p&gt;Why would you want more than one Trino cluster? Maybe you want one cluster with
fault-tolerant execution enabled for ETL workloads and another cluster for
speedy ad-hoc analytics. Perhaps you have analysts performing wildly
differently-sized queries, and high-volume compute-intensive queries are proving
to be bad neighbors for lightweight and low-latency queries that shouldn’t take
more than milliseconds. Historically, users would have to manually manage
swapping between clusters, establish a new connection, and try not to get a
headache in the process.&lt;/p&gt;

&lt;p&gt;Enter the Trino Gateway! By routing all of your Trino traffic automatically,
it’s never been easier to manage, maintain, and query multiple Trino clusters at
once. Load balancing ensures that no one cluster gets overworked, and it’s the
perfect way to stop large queries from getting in the way of the little guys.
Add in the fact that you can seamlessly shut down an individual cluster for
updates or maintenance while the Trino Gateway routes traffic elsewhere, and
it’s easy to see why this is such a game-changer. We’re super excited for it to
be out there in the world, and we hope it makes running Trino at the largest
scales simpler and faster than ever before.&lt;/p&gt;

&lt;p&gt;For more information on the Trino Gateway, check out:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/blog/2023/09/28/trino-gateway.html&quot;&gt;The announcement blog post&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino-gateway/blob/main/docs/quickstart.md&quot;&gt;The quickstart guide&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino-gateway/tree/main&quot;&gt;The main Trino Gateway repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;new-features&quot;&gt;New features&lt;/h2&gt;

&lt;p&gt;With more development on Trino than ever before, there were obviously a ton of
new things being added to it. Let’s go over some of the biggest adds in 2023.&lt;/p&gt;

&lt;h3 id=&quot;sql-routines&quot;&gt;SQL routines&lt;/h3&gt;

&lt;p&gt;Whether you want to refer to them as SQL routines or as user-defined functions,
they’re a big deal. Fresh off the presses and only a few months old, they do
exactly what you’d expect them to do: you, a user, can define and re-use your
own functions! Define and use them inline as part of a query to make that query
cleaner, easier, and simpler to understand. Or, if you’re really cooking, you
can run a query that defines the routine in the schema of the catalog. This
allows other Trino users to access the same routine time and time again as part
of their other queries. It’s a level of customization that we’ve never had
before in Trino, and no longer do you need to write your own Java plugins to
create and re-use functions that do exactly what you need them to do.&lt;/p&gt;
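&lt;p&gt;As a small taste, an inline routine can be declared in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITH&lt;/code&gt; clause and
used immediately in the same query, along these lines (the function name is
just an illustration):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WITH
  FUNCTION double_value(x bigint)
    RETURNS bigint
    RETURN x * 2
SELECT double_value(21); -- returns 42
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;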

&lt;p&gt;If you want to learn more about SQL routines, you can check
out &lt;a href=&quot;/docs/current/routines/introduction.html&quot;&gt;the introduction to SQL routines&lt;/a&gt;
in our documentation, as well as
&lt;a href=&quot;https://www.youtube.com/watch?v=1siAYR6BzzY&amp;amp;list=PLFnr63che7wYzZoo5yyEF5R1QrOH6VRq3&amp;amp;index=4&quot;&gt;a video from our SQL training series&lt;/a&gt;
and a few &lt;a href=&quot;/docs/current/routines/examples.html&quot;&gt;example routines&lt;/a&gt; which give a
good look at how they can be used.&lt;/p&gt;

&lt;h3 id=&quot;schema-evolution-and-dynamic-catalogs&quot;&gt;Schema evolution and dynamic catalogs&lt;/h3&gt;

&lt;p&gt;While we’re providing more power, customization, and flexibility to Trino users,
it’s also important to highlight just how much has been added this year to make
it easier to adjust things on the fly.&lt;/p&gt;

&lt;p&gt;Schema evolution in Hive was a big addition, allowing you to alter columns’ data
types, rename columns, and handle nested fields when dropping columns. Instead
of needing to modify the underlying database some other way and restart Trino,
Trino can handle the adjustments on the fly.&lt;/p&gt;

&lt;p&gt;But if you don’t use Hive and are feeling left out, we’ve experimentally taken
things one step further in 2023, adding dynamic catalogs to Trino. Rather than
adjusting your schema one column at a time, what about adding or dropping an
entire catalog in one go? You can do that now. Though it’s currently still
bleeding-edge and not ready for widespread use on your important production
data sources, we’re looking forward to improving it and making it resilient and
stable in 2024.&lt;/p&gt;
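&lt;p&gt;As a hedged sketch of what that looks like in practice, with placeholder
connection details for a PostgreSQL catalog:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE CATALOG example USING postgresql
WITH (
  &quot;connection-url&quot; = 'jdbc:postgresql://example.net:5432/database',
  &quot;connection-user&quot; = 'admin'
);

-- and when it is no longer needed
DROP CATALOG example;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;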

&lt;h3 id=&quot;project-hummingbird&quot;&gt;Project Hummingbird&lt;/h3&gt;

&lt;p&gt;Trino has always been about squeezing out every ounce of performance that you
can get. Check out our &lt;a href=&quot;/docs/current/release.html&quot;&gt;release notes&lt;/a&gt; and
you’ll see that every version includes at least a couple performance
improvements. Over time, these performance improvements add up to a substantial
gain, meaning that version-over-version, year-over-year, Trino is always getting
faster. Project Hummingbird was a concerted effort this year to take a look at
the core engine and make a number of architectural changes paired with small
improvements that would add up to something very substantial.
&lt;a href=&quot;https://github.com/trinodb/trino/issues/14237&quot;&gt;The GitHub issue tracking it&lt;/a&gt;
lists a ton of work that’s been accomplished already, with a lot of that work
done in 2023. Though stay tuned for more, because that’s only scratching the
surface…&lt;/p&gt;

&lt;h3 id=&quot;lakehouse-improvements&quot;&gt;Lakehouse improvements&lt;/h3&gt;

&lt;p&gt;Want to leverage the historical log of all actions taken on a table in Hudi? The
new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$timeline&lt;/code&gt; system table has you covered. How about in Delta Lake? We’ve got
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_changes&lt;/code&gt; function for that, and views were added there, too. Too many
metadata tables to list were added to Iceberg, along with the REST, JDBC, and
Nessie catalogs for metadata.&lt;/p&gt;
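
&lt;p&gt;For instance, reading change history looks roughly like this; the schema and
table names are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hudi: inspect the timeline of actions on a table
SELECT * FROM &quot;orders$timeline&quot;;

-- Delta Lake: read row-level changes since a given table version
SELECT * FROM TABLE(
  system.table_changes(schema_name =&gt; 'example_schema', table_name =&gt; 'orders', since_version =&gt; 0)
);
&lt;/code&gt;&lt;/pre&gt;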

&lt;h3 id=&quot;java-21&quot;&gt;Java 21!&lt;/h3&gt;

&lt;p&gt;Java 21 is required to run Trino versions 436 and later. With
&lt;a href=&quot;https://trino.io/blog/2023/11/03/java-21.html&quot;&gt;the upgrade from Java 17 to 21&lt;/a&gt;
comes a ton of improvements that will make development on Trino easier and
better than ever, which will in turn make it faster and smoother than ever.
Though not as huge a deal as our upgrade to Java 17 last year, expect to see
the benefits coming down the pipeline as the engineers working on Trino are able
to take advantage of the latest and greatest features in Java.&lt;/p&gt;

&lt;h2 id=&quot;trino-ecosystem-updates&quot;&gt;Trino ecosystem updates&lt;/h2&gt;

&lt;p&gt;There’s more to Trino than Trino itself! With community updates and other
technologies integrating with Trino, the number of ways you can access and use
Trino are always growing. And the number of people taking care of Trino is
growing, too.&lt;/p&gt;

&lt;h3 id=&quot;python-clients&quot;&gt;Python clients&lt;/h3&gt;

&lt;p&gt;Trino’s own &lt;a href=&quot;https://github.com/trinodb/trino-python-client&quot;&gt;Python client&lt;/a&gt; saw
heavy development in 2023. It was updated to support SQLAlchemy 2.0 and had type
support fully fleshed out, making it a robust, free, and open-source tool for
running your Trino queries.&lt;/p&gt;

&lt;p&gt;Elsewhere in the Python ecosystem, we heard from
both &lt;a href=&quot;https://youtu.be/aKhI1Phfn-o&quot;&gt;Fugue&lt;/a&gt;
and &lt;a href=&quot;https://youtu.be/JMUtPl-cMRc&quot;&gt;Ibis&lt;/a&gt; at Trino Fest, two different Python
clients that integrate Trino with Python in new ways. Fugue is a wrapper that
helps integrate with other Python tools and clients, and Ibis can help convert
your Python code into SQL queries, making it feasible to be a 100% Python-based
organization that still leverages the speed and power of a SQL query engine like
Trino. We had Phillip Cloud from Voltron Data on
for &lt;a href=&quot;/episodes/49&quot;&gt;an episode of the Trino Community Broadcast&lt;/a&gt; to talk about
Ibis in even more detail.&lt;/p&gt;

&lt;h3 id=&quot;and-other-clients-too&quot;&gt;And other clients, too!&lt;/h3&gt;

&lt;p&gt;Also on the Trino Community Broadcast repping new client support for Trino in
2023 were &lt;a href=&quot;/episodes/45&quot;&gt;Dolphin Scheduler&lt;/a&gt;, &lt;a href=&quot;/episodes/51&quot;&gt;PopSQL&lt;/a&gt;,
and &lt;a href=&quot;/episodes/53&quot;&gt;Coginiti&lt;/a&gt;. Dolphin Scheduler is a workflow orchestrator - and
scheduler! - that can be used to routinely run and coordinate Trino queries.
PopSQL is like Google Drive for SQL, providing a suite of collaborative tools
for editing and working on queries as a team, including synchronous query
editing, storing query history, and a robust commenting and feedback system.
Coginiti is a high-powered data workspace that connects to Trino among many
other things, supporting a host of powerful features that make it easier to
reuse code and snippets of queries, as well as featuring embedded variables to
minimize redundancy. If you want to learn more about any of these clients, click
in on the links above to check out the Trino Community Broadcast where we went
in-depth with them!&lt;/p&gt;

&lt;p&gt;Oh, and don’t forget
the &lt;a href=&quot;https://regadas.dev/trino-js-client/&quot;&gt;Trino Typescript client&lt;/a&gt;, for when
you want to work at the beautiful intersection of web development and accessing
tons of data.&lt;/p&gt;

&lt;h3 id=&quot;new-maintainers&quot;&gt;New maintainers&lt;/h3&gt;

&lt;p&gt;Trino saw three new maintainers added to its ranks this year:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mosabua&quot;&gt;Manfred Moser&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pettyjamesm&quot;&gt;James Petty&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/wendigo&quot;&gt;Mateusz Gajewski&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manfred even took the liberty of updating the website’s
&lt;a href=&quot;/development/roles&quot;&gt;roles page&lt;/a&gt; to list out all our maintainers. Thank you to
them for their dedication to making Trino the best it can be, and
congratulations to them on their shiny maintainer titles!&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/blog/2023/01/10/trino-2022-the-rabbit-reflects.html&quot;&gt;2022 had been the busiest year in Trino’s history&lt;/a&gt;,
but 2023 has managed to surpass it. If you’re interested in contributing to
Trino, make sure to check it out on &lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;GitHub&lt;/a&gt;.
Even if you’re not interested in contributing, give us a
&lt;a href=&quot;https://trino.io/star&quot;&gt;star&lt;/a&gt; on GitHub, anyway! It’s been a great year for
Commander Bun Bun, and we can’t wait to show you what 2024 has in store for
everyone’s favorite data rabbit.&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>If “Wrapped” is good enough for Spotify, it’s good enough for Trino, right? As we look forward to a bright 2024, we can also take a moment to get sentimental, look back at everything we’ve accomplished, and reflect on the progress we’ve made. Commander Bun Bun has been hard at work, so if you haven’t been paying close attention to Trino or want an idea of all that went down in 2023, we’re happy to present you with an end of year recap. We’ll be exploring what’s gone on in the community, on development, the events we’ve hosted, and discuss the cool new features and technologies you can use when you’re running Trino.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/2023-review/wrapped.png" />
      
    </entry>
  
    <entry>
      <title>Trino Summit 2023 recap</title>
      <link href="https://trino.io/blog/2023/12/18/trino-summit-recap.html" rel="alternate" type="text/html" title="Trino Summit 2023 recap" />
      <published>2023-12-18T00:00:00+00:00</published>
      <updated>2023-12-18T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/12/18/trino-summit-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/12/18/trino-summit-recap.html">&lt;p&gt;Two days of non-stop Trino action are done! Last week, Trino Summit 2023
took place virtually as another great community event. Presentations from Trino
experts across the globe showed different use cases and experiences with Trino.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;During the event, our lively audience of over 600 attendees asked questions
of the speakers and each other in the chat, and we had fun with Trino trivia questions.&lt;/p&gt;

&lt;p&gt;We talked about the &lt;a href=&quot;/blog/2023/11/09/routines.html&quot;&gt;SQL routine competition&lt;/a&gt; and announced Kevin Liu from Stripe and Jan Was from Starburst as the
winners. You can find their submissions in &lt;a href=&quot;https://trino.io/docs/current/routines/examples.html&quot; target=&quot;_blank&quot;&gt;the examples page for SQL
routines&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Starburst announced their &lt;a href=&quot;https://www.starburst.io/community/trino-champions/&quot; target=&quot;_blank&quot;&gt;Trino Champions
program&lt;/a&gt;.
Kevin and Jan are the first recipients of the award and will receive their swag
packs soon. Going forward, new champions will be crowned regularly, and
Starburst is &lt;a href=&quot;https://www.starburst.io/community/trino-champions/&quot; target=&quot;_blank&quot;&gt;looking for
nominations&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;sessions&quot;&gt;Sessions&lt;/h2&gt;

&lt;p&gt;If you missed out on the event, the following list of all the sessions provides
links to the recordings. Over time, we will follow up with blog posts about each
session with the presentation and further details.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=pXdZqpwgdxA&quot; target=&quot;_blank&quot;&gt;The mountains Trino climbed in 2023&lt;/a&gt;
presented by Martin Traverso from
&lt;a href=&quot;https://www.starburst.io&quot; target=&quot;_blank&quot;&gt;Starburst&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/mountains-trino-climbed.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=qZejzyxT2fo&quot; target=&quot;_blank&quot;&gt;Trino workload management&lt;/a&gt;
presented by Jinyang Li and Tingting Ma from
&lt;a href=&quot;https://www.airbnb.com&quot; target=&quot;_blank&quot;&gt;Airbnb&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=FaytoXxKXOQ&quot; target=&quot;_blank&quot;&gt;Secure exchange SQL: Building a privacy-preserving data clean room service over Trino&lt;/a&gt;
presented by Taro Saito from
&lt;a href=&quot;https://www.treasuredata.com/&quot; target=&quot;_blank&quot;&gt;Treasure Data&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=MYLepz-hIys&quot; target=&quot;_blank&quot;&gt;Powering Bazaar’s business operation using Trino&lt;/a&gt;
presented by Umair Abro from
&lt;a href=&quot;https://www.youtube.com/watch?v=MYLepz-hIys&quot; target=&quot;_blank&quot;&gt;Bazaar&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/powering-bazaar-business-operations.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=qUT-uaEE-Fk&quot; target=&quot;_blank&quot;&gt;Efficient Kappa architecture with Trino&lt;/a&gt;
presented by Sanghyun Lee from
&lt;a href=&quot;https://www.sktelecom.com&quot; target=&quot;_blank&quot;&gt;SK Telecom&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/efficient-kappa-architecture-sk-telecom.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=2qwBcKmQSn0&quot; target=&quot;_blank&quot;&gt;Many clusters and only one gateway&lt;/a&gt;
presented by Will Morrison (&lt;a href=&quot;https://www.starburst.io/&quot; target=&quot;_blank&quot;&gt;Starburst&lt;/a&gt;),
Andy Su (&lt;a href=&quot;https://www.techatbloomberg.com/&quot; target=&quot;_blank&quot;&gt;Bloomberg&lt;/a&gt;), and
Jaeho Yoo (&lt;a href=&quot;https://www.naver.com&quot; target=&quot;_blank&quot;&gt;Naver&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=dg16M6bFN2w&quot; target=&quot;_blank&quot;&gt;Trino upgrade at exabytes scale&lt;/a&gt;
presented by Ramanathan Ramu from
&lt;a href=&quot;https://www.linkedin.com/&quot; target=&quot;_blank&quot;&gt;LinkedIn&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=ooUGJ6BYt90&quot; target=&quot;_blank&quot;&gt;Powering data marts through Trino Iceberg connector at Zomato&lt;/a&gt;
presented by Shubham Gupta and Bhanu Mittal from
&lt;a href=&quot;https://www.zomato.com/&quot; target=&quot;_blank&quot;&gt;Zomato&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/powering-data-marts-at-zomato.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=RC8K6pIvAtI&quot; target=&quot;_blank&quot;&gt;Pinterest journey to achieving 2x efficiency improvement on Trino&lt;/a&gt;
presented by Carlos Benavides from
&lt;a href=&quot;https://www.pinterest.com/&quot; target=&quot;_blank&quot;&gt;Pinterest&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=6KspMwCbOfI&quot; target=&quot;_blank&quot;&gt;Avoiding pitfalls with query federation in data lakehouses&lt;/a&gt;
presented by Edward Morgan and
Bhaarat Sharma from &lt;a href=&quot;https://teamraft.com/&quot; target=&quot;_blank&quot;&gt;Raft&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=rmotnvBWXv4&quot; target=&quot;_blank&quot;&gt;Adopting Trino’s fault-tolerant execution mode at Quora&lt;/a&gt;
presented by Gabriel Fernandes de Oliveira and Yifan Pan from
&lt;a href=&quot;https://www.quora.com/&quot; target=&quot;_blank&quot;&gt;Quora&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/fte-mode-at-quora.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=fYCoI8kkdRQ&quot; target=&quot;_blank&quot;&gt;Inherent race condition in Guava Cache invalidation and how to escape it&lt;/a&gt;
presented by Piotr Findeisen from
&lt;a href=&quot;https://www.starburst.io/&quot; target=&quot;_blank&quot;&gt;Starburst&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/inherent-race-in-cache-invalidation.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=LynEiteEtPk&quot; target=&quot;_blank&quot;&gt;Unstructured data analysis using polymorphic table function in Trino&lt;/a&gt;
presented by YongHwan Lee from
&lt;a href=&quot;https://www.sktelecom.com&quot; target=&quot;_blank&quot;&gt;SK Telecom&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/polymorphic-table-function-sk-telecom.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=_wocf0NK6Kc&quot; target=&quot;_blank&quot;&gt;Transitioning to Trino: Evaluating Lyft’s query engine capabilities&lt;/a&gt;
presented by Charles Song from
&lt;a href=&quot;https://www.lyft.com/&quot; target=&quot;_blank&quot;&gt;Lyft&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/transition-to-trino-at-lyft.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=idk0GMxs8vE&quot; target=&quot;_blank&quot;&gt;Visualizing Trino with Apache Superset&lt;/a&gt;
presented by Evan Rusackas from
&lt;a href=&quot;https://preset.io/&quot; target=&quot;_blank&quot;&gt;Preset&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=fbqqapQbAv0&quot; target=&quot;_blank&quot;&gt;Trino OPA authorizer - An open source love story&lt;/a&gt;
presented by Sönke Liebau (&lt;a href=&quot;https://stackable.tech/&quot; target=&quot;_blank&quot;&gt;Stackable&lt;/a&gt;)
and Pablo Arteaga (&lt;a href=&quot;https://www.techatbloomberg.com/&quot; target=&quot;_blank&quot;&gt;Bloomberg&lt;/a&gt;).
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/opa-trino.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=RutbCY8i22Q&quot; target=&quot;_blank&quot;&gt;VAST database catalog&lt;/a&gt;
presented by Jason Russler from
&lt;a href=&quot;https://vastdata.com/&quot; target=&quot;_blank&quot;&gt;VAST&lt;/a&gt;.
&lt;a href=&quot;https://trino.io/assets/blog/trino-summit-2023/vast-connector.pdf&quot;&gt;(Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=ZJExdGeC4eA&quot; target=&quot;_blank&quot;&gt;Support for Parquet decryption and aggregate pushdown In Trino&lt;/a&gt;
presented by Amogh Margoor and Manish Malhotra from
&lt;a href=&quot;https://www.apple.com/&quot; target=&quot;_blank&quot;&gt;Apple&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;shout-outs&quot;&gt;Shout outs&lt;/h2&gt;

&lt;p&gt;Shout outs for all their work with the speakers and organizing the event go to
Anna Schibli, Mandy Darnell, and Monica Miller from the Trino Summit event team,
and everyone else at Starburst who helped make this event a success.&lt;/p&gt;

&lt;p&gt;Special thanks for making this Trino Software Foundation event a reality go out
to our hosting sponsor &lt;a href=&quot;https://starburst.io&quot; target=&quot;_blank&quot;&gt;Starburst&lt;/a&gt;, and
our other sponsors &lt;a href=&quot;https://www.alluxio.io/&quot; target=&quot;_blank&quot;&gt;Alluxio&lt;/a&gt;,
&lt;a href=&quot;https://www.coginiti.co&quot; target=&quot;_blank&quot;&gt;Coginiti&lt;/a&gt; and &lt;a href=&quot;https://www.montecarlodata.com/&quot; target=&quot;_blank&quot;&gt;Monte
Carlo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will see you all at future Trino Contributor Congregations, Trino Fest 2024,
Trino Summit 2024, and &lt;a href=&quot;https://trino.io/community.html#events&quot;&gt;other events related to Trino&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;sponsors&quot;&gt;Sponsors&lt;/h2&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.starburst.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/starburst.png&quot; title=&quot;Starburst, event host and organizer&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.alluxio.io/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/alluxio-small.png&quot; title=&quot;Alluxio, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.coginiti.co&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/coginiti-small.png&quot; title=&quot;Coginiti, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.montecarlodata.com/&quot; target=&quot;_blank&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/monte-carlo-small.png&quot; title=&quot;Monte Carlo, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;</content>

      
        <author>
          <name>Manfred Moser, Cole Bowden</name>
        </author>
      

      <summary>Two days of non-stop Trino action are done! Last week, Trino Summit 2023 took place virtually as another great community event. Presentations from Trino experts across the globe showed different use cases and experiences with Trino.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2023/summit-logo.png" />
      
    </entry>
  
    <entry>
      <title>Final reminder for Trino Summit 2023</title>
      <link href="https://trino.io/blog/2023/12/11/trino-summit-reminder.html" rel="alternate" type="text/html" title="Final reminder for Trino Summit 2023" />
      <published>2023-12-11T00:00:00+00:00</published>
      <updated>2023-12-11T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/12/11/trino-summit-reminder</id>
      <content type="html" xml:base="https://trino.io/blog/2023/12/11/trino-summit-reminder.html">&lt;p&gt;Are you ready? &lt;a href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=final-reg-blog&quot;&gt;Trino Summit
2023&lt;/a&gt;
is just two days away, and our lineup of speakers, sponsors, and activities is
truly amazing. Make sure to register and join us live.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Over the two days of the event we will enjoy sessions with our speakers from
numerous well-known and respected companies, including Airbnb, Apple, Bloomberg,
LinkedIn, Pinterest, SK Telecom, and others. Look at the &lt;a href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=final-reg-blog&quot;&gt;full lineup for
details&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just like &lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;last time at Trino Fest 2023&lt;/a&gt; we will have some fun Trino quiz
questions for you all to puzzle over, and are ready to reward your fast and
correct answers.&lt;/p&gt;

&lt;p&gt;Cole Bowden and I will guide you through the two days of the event as hosts. The
chat on the event platform as well as the Trino slack channel for the event will
allow you to talk to other community members and the presenters, ask questions,
and follow up for more answers and discussions.&lt;/p&gt;

&lt;p&gt;We will announce the winning entries for our SQL routine competition and look a
bit at the implementation. And if you are keen to write one, there is still
time to share your best SQL routine. You might be among the winners.&lt;/p&gt;

&lt;p&gt;So you see - Trino Summit 2023 will be great. The event is virtual and free, so
there really is no excuse for missing out:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=final-reg-blog&quot;&gt;
        Register for Trino Summit 2023
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Special thanks for their help with making this Trino Software Foundation event a
reality go out to our hosting sponsor &lt;a href=&quot;https://starburst.io&quot;&gt;Starburst&lt;/a&gt;, and our
other sponsors &lt;a href=&quot;https://www.alluxio.io/&quot;&gt;Alluxio&lt;/a&gt;,
&lt;a href=&quot;https://www.coginiti.co&quot;&gt;Coginiti&lt;/a&gt; and &lt;a href=&quot;https://www.montecarlodata.com/&quot;&gt;Monte
Carlo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We all look forward to seeing you in just two days. So exciting!&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Are you ready? Trino Summit 2023 is just two days away, and our lineup of speakers, sponsors, and activities is truly amazing. Make sure to register and join us live.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2023/summit-logo.png" />
      
    </entry>
  
    <entry>
      <title>Functions with SQL and Trino</title>
      <link href="https://trino.io/blog/2023/11/29/sql-training-4.html" rel="alternate" type="text/html" title="Functions with SQL and Trino" />
      <published>2023-11-29T00:00:00+00:00</published>
      <updated>2023-11-29T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/11/29/sql-training-4</id>
      <content type="html" xml:base="https://trino.io/blog/2023/11/29/sql-training-4.html">&lt;p&gt;In the fourth part of our training series &lt;a href=&quot;/blog/2023/09/27/training-series.html&quot;&gt;Learning SQL with Trino from the
experts&lt;/a&gt; Martin Traverso, Dain
Sundstrom and I took on the big topic of aggregation functions, and covered the
two new and exciting features of table functions and SQL routines.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The recording of the event allows you to watch it all as if you attended live,
jump to specific sections as desired, or pause while you follow along with the
demos:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/1siAYR6BzzY&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Following are a few specific timestamps for interesting topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=1siAYR6BzzY&amp;amp;t=582&quot;&gt;First simple aggregation example&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=1siAYR6BzzY&amp;amp;t=2384&quot;&gt;Table functions introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=1siAYR6BzzY&amp;amp;t=3093&quot;&gt;Query pass through table function&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=1siAYR6BzzY&amp;amp;t=3442&quot;&gt;SQL routine use cases&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=1siAYR6BzzY&amp;amp;t=4355&quot;&gt;Human readable days example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More timestamps for every part of the talk are in the description on
YouTube. Also make sure you take advantage of these additional resources:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/assets/blog/sql-training-series-starburst-2023.pdf&quot;&gt;General overview slide deck for the
series&lt;/a&gt;,
with links to resources like our &lt;a href=&quot;/slack.html&quot;&gt;community
chat&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Slide deck for &lt;a href=&quot;https://trinodb.github.io/presentations/presentations/sql-functions/index.html&quot;&gt;Functions with SQL and
Trino&lt;/a&gt;,
including files with all SQL statements, configurations and more ready to go&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this last episode of the series for 2023 we are ready to showcase Trino
with an &lt;a href=&quot;/blog/2023/11/22/trino-summit-2023-nears-lineup.html&quot;&gt;amazing lineup of speakers and sessions&lt;/a&gt; at the upcoming Trino Summit 2023.
Register now and catch all the presenters live for questions in the chat:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit2023/&quot;&gt;
        Register for Trino Summit 2023
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;See you at Trino Summit 2023, upcoming &lt;a href=&quot;/broadcast/index.html&quot;&gt;Trino Community Broadcast
episodes&lt;/a&gt;, and maybe even more SQL
training in 2024.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>In the fourth part of our training series Learning SQL with Trino from the experts Martin Traverso, Dain Sundstrom and I took on the big topic of aggregation functions, and covered the two new and exciting features of table functions and SQL routines.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/trino-sql.png" />
      
    </entry>
  
    <entry>
      <title>Trino Summit 2023 nears with an awesome lineup</title>
      <link href="https://trino.io/blog/2023/11/22/trino-summit-2023-nears-lineup.html" rel="alternate" type="text/html" title="Trino Summit 2023 nears with an awesome lineup" />
      <published>2023-11-22T00:00:00+00:00</published>
      <updated>2023-11-22T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/11/22/trino-summit-2023-nears-lineup</id>
      <content type="html" xml:base="https://trino.io/blog/2023/11/22/trino-summit-2023-nears-lineup.html">&lt;p&gt;As winter nears, the days may be getting shorter, but so is the wait until
Trino Summit 2023! It’ll be here before you know it on December 13th and 14th.
We’ve got a packed speaker lineup full of exciting talks, and we’re ready to
share some details with the Trino community today. Read on for a preview of some
talks, and if you’re interested in attending, make sure to…&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=blog-lineup-announcement&quot;&gt;
        Register!
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;So, who’s going to be talking at Trino Summit? Here’s a quick rundown of the
talks coming in from various companies.&lt;/p&gt;

&lt;h2 id=&quot;starburst-the-mountains-trino-climbed-in-2023&quot;&gt;Starburst: The mountains Trino climbed in 2023&lt;/h2&gt;

&lt;p&gt;As always, our keynote will come from Martin Traverso, Trino co-founder and
co-CTO at Starburst. He’ll be giving a project update on everything exciting
that’s happened in Trino since
&lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest&lt;/a&gt;, as well as a
sneak peek at the roadmap for features coming to Trino in 2024. It’s one of the
best ways to keep up with the ongoing developments in the Trino community, and
you won’t want to miss it.&lt;/p&gt;

&lt;h2 id=&quot;starburst-bloomberg-and-naver-many-clusters-and-only-one-gateway&quot;&gt;Starburst, Bloomberg, and Naver: Many clusters and only one gateway&lt;/h2&gt;

&lt;p&gt;A second talk, which is a collaboration among Starburst, Bloomberg, and Naver,
will be exploring the new &lt;a href=&quot;https://github.com/trinodb/trino-gateway&quot;&gt;Trino Gateway&lt;/a&gt;,
a proxy and load-balancer that has been in the works for a long while in the
Trino community. There’s no more need to worry about noisy neighbors or huge
queries bullying out the quick and small workloads - with multiple clusters and
the Trino Gateway on top, users interact with Trino like normal, but under the
hood, queries get routed to available clusters to ensure that the time it takes
to get your insights is shorter than ever before.&lt;/p&gt;

&lt;h2 id=&quot;airbnb-trino-workload-management&quot;&gt;Airbnb: Trino workload management&lt;/h2&gt;

&lt;p&gt;Trino is the main interactive compute engine for offline ad-hoc analytics at
Airbnb. Recently, they’ve redesigned their query workload processing on Trino
clusters, introducing query cost forecasting and workload-aware scheduling
systems. This helps them deliver a more stable and consistent analytics query
service to offline data users at Airbnb, with improved performance and speed.
And they’ll be explaining how they did it!&lt;/p&gt;

&lt;h2 id=&quot;pinterest-journey-to-achieving-2x-efficiency-improvement-on-trino&quot;&gt;Pinterest: Journey to achieving 2x efficiency improvement on Trino&lt;/h2&gt;

&lt;p&gt;Trino usage has been growing at Pinterest each year, which comes with growing
costs and increased demand on the existing Trino clusters. To help reduce costs
and serve their Trino users, the engineering team there has migrated to AWS
Graviton, taken advantage of Trino improvements, consolidated traffic, improved
job scheduling, and worked to optimize their data and metadata formats. The end
result has been a reduction in cost &lt;em&gt;and&lt;/em&gt; an increase in query throughput.
They’ll be sharing the details on the effort it took to make Trino faster and
cheaper at the same time.&lt;/p&gt;

&lt;h2 id=&quot;quora-adopting-trinos-fault-tolerant-execution-mode&quot;&gt;Quora: Adopting Trino’s fault-tolerant execution mode&lt;/h2&gt;

&lt;p&gt;Quora will be covering how they adopted Trino’s fault-tolerant execution mode
to run some of their heaviest ETL jobs. They separate Trino queries
from their main data pipelines in two clusters, one running the FTE mode for
memory-intensive and longer jobs and another without it for lighter, general
pipelines. This separation helped lower query failure rates, improved
the execution time of long queries due to the more flexible autoscaling in
FTE, and provided an alternative to run queries that would otherwise run out of
memory without scaling up the cluster.&lt;/p&gt;

&lt;h2 id=&quot;linkedin-trino-upgrades-at-exabyte-scale&quot;&gt;LinkedIn: Trino upgrades at exabyte scale&lt;/h2&gt;

&lt;p&gt;LinkedIn has been keeping up with Trino releases at an impressive rate, but
getting to that point has required a lot of time, effort, and work on
streamlining the update process. They’ll be discussing the challenges of
breaking changes, applying internal patches, and ensuring that there are no
meaningful performance regressions. They’ve automated much of this, including
implementing a post-commit integration test suite that ensures nothing has
broken, and creating an automated test framework that can validate the
performance of each new Trino release before it deploys to users.&lt;/p&gt;

&lt;h2 id=&quot;ea-migrating-120-million-hms-metadata-records-without-customer-impact&quot;&gt;EA: Migrating 120 million HMS metadata records without customer impact&lt;/h2&gt;

&lt;p&gt;Migrating production databases is a scary task no matter who you are. It’s
scarier when you’re talking about 600+ databases, 35,000+ tables, and over 120
million partitions, all of which you need to migrate while avoiding any customer
impact. EA managed to pull it off with the help of Trino, and they’ll be at
Trino Summit to share how they made it work and what they learned along the way.&lt;/p&gt;

&lt;h2 id=&quot;sk-telecom-efficient-kappa-architecture-with-trino&quot;&gt;SK Telecom: Efficient Kappa architecture with Trino&lt;/h2&gt;

&lt;p&gt;SK Telecom is bringing us two talks this year, as they’ve got a lot going on and
some unique Trino stories to share!&lt;/p&gt;

&lt;p&gt;The first talk will dive into Kappa architecture and the challenges
involved in getting it to run in real-time at the massive scale SK Telecom
needs. They started with Trino’s Kafka connector, but the limitations of that
architecture steered them towards a solution with Flink and Trino’s Iceberg
connector, which they’ll explain. They’ll also be sharing some tips and tricks
for tuning Flink and Iceberg to get the most out of your Trino deployments.&lt;/p&gt;

&lt;h2 id=&quot;sk-telecom-unstructured-data-analysis-using-polymorphic-table-functions-in-trino&quot;&gt;SK Telecom: Unstructured data analysis using polymorphic table functions in Trino&lt;/h2&gt;

&lt;p&gt;The second talk will discuss the challenges of dealing with unstructured data.
Pre-processing is essential for analyzing unstructured data, and it’s difficult
for ordinary users and analysts to process large amounts of unstructured
data at scale. With the power of a custom-built polymorphic table function,
they were able to invoke Python code within Trino to help structure that data
for analysis, solving the problem in a powerful and fascinating way. We’ll get
to hear about polymorphic table functions, how they work in Trino, and how
anyone else may be able to leverage them to solve problems.&lt;/p&gt;

&lt;h2 id=&quot;raft-avoiding-pitfalls-with-query-federation-in-data-lakehouses&quot;&gt;Raft: Avoiding pitfalls with query federation in data lakehouses&lt;/h2&gt;

&lt;p&gt;Raft has partnered with the US Department of Defense to build a data fabric
on top of Delta Lake, Trino, Apache Kafka, and Open Policy Agent (OPA).
This talk will discuss the challenges involved, provide solutions and
considerations for each, and end with a demo of Raft’s data fabric. The talk
will focus on a plugin for Trino, developed by Raft, that uses OPA as a policy
engine to provide fine-grained access control at query time based on a user’s
JWT passed along with the query.&lt;/p&gt;

&lt;h2 id=&quot;treasure-data-secure-exchange-sql&quot;&gt;Treasure Data: Secure exchange SQL&lt;/h2&gt;

&lt;p&gt;Secure Exchange SQL is a production data clean room service deployed at Treasure
Data, which leverages Trino and differential privacy technology to enable
cross-company data analysis while mitigating the risk of privacy breaches.
In their session, they’ll introduce the concept of differential privacy and
discuss the privacy protection methods that need to be implemented during SQL
processing. To minimize changes to Trino’s codebase, they employed approaches of
SQL rewriting and validation at the logical plan level. They’ll explain these
methods and provide some practical use cases of their data clean room.&lt;/p&gt;

&lt;h2 id=&quot;zomato-powering-data-marts-through-the-trino-iceberg-connector&quot;&gt;Zomato: Powering data marts through the Trino Iceberg connector&lt;/h2&gt;

&lt;p&gt;It’s a common theme in the Trino community - Zomato recently migrated from a
traditional data warehouse to a Trino-powered data lakehouse in conjunction with
Iceberg. They’ll be discussing how this has enabled their analytics to run
better than ever, including periodic updates to their data marts and tackling
the challenges involved in maintaining Iceberg tables.&lt;/p&gt;

&lt;h2 id=&quot;bazaar-powering-bazaars-business-operations-using-trino&quot;&gt;Bazaar: Powering Bazaar’s business operations using Trino&lt;/h2&gt;

&lt;p&gt;Bazaar’s talk will discuss how they leverage Trino’s capabilities to optimize
data analysis and support data-driven decision-making. The talk specifically
explores real-time data querying across multiple sources and
performance optimization, illustrating Trino’s role in Bazaar’s data-centric
strategies. This presentation provides in-depth insights for individuals
well-versed in Trino, shedding light on the platform’s transformative impact on
enhancing e-commerce operations.&lt;/p&gt;

&lt;h2 id=&quot;preset-visualizing-trino-with-superset&quot;&gt;Preset: Visualizing Trino with Superset&lt;/h2&gt;

&lt;p&gt;Preset will be diving into the “last mile” of the modern data stack and
showing you how to query and visualize data pulled from Trino with Apache Superset
and/or Preset. Specifically, they’ll discuss things like Trino’s federated query
support (a common wish for Superset users) and how Superset can support
near-real-time analytics for Trino users. They’ll also give a demo of connecting
to Trino, building SQL queries, designing charts and dashboards, and other ways
to gain insight and stay on top of your data.&lt;/p&gt;

&lt;h2 id=&quot;vast-the-vast-database-catalog&quot;&gt;VAST: The VAST database catalog&lt;/h2&gt;

&lt;p&gt;The VAST Database connector for Trino was open-sourced this year! They’ll be
discussing the architecture of VAST and the connector, the purpose and major use
cases for it, and demonstrating the workflows surrounding the VAST Database in the
Trino ecosystem.&lt;/p&gt;

&lt;h2 id=&quot;and-still-more-to-come&quot;&gt;And still more to come!&lt;/h2&gt;

&lt;p&gt;Believe it or not, the great lineup we’ve gone over here still isn’t every talk.
Stay tuned here or on the &lt;a href=&quot;https://trino.io/slack&quot;&gt;Trino Slack&lt;/a&gt; to hear about the
other speakers as they’re announced. And of course, if you want to catch all
these talks live, engage in chat, and have an opportunity to ask questions, make
sure to &lt;a href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=blog-lineup-announcement&quot;&gt;register to attend&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>As winter nears, the days may be getting shorter, but so is the wait until Trino Summit 2023! It’ll be here before you know it on December 13th and 14th. We’ve got a packed speaker lineup full of exciting talks, and we’re ready to share some details with the Trino community today. Read on for a preview of some talks, and if you’re interested in attending, make sure to… Register!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2023/lineup-blog-banner.png" />
      
    </entry>
  
    <entry>
      <title>Data management with SQL and Trino</title>
      <link href="https://trino.io/blog/2023/11/15/sql-training-3.html" rel="alternate" type="text/html" title="Data management with SQL and Trino" />
      <published>2023-11-15T00:00:00+00:00</published>
      <updated>2023-11-15T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/11/15/sql-training-3</id>
      <content type="html" xml:base="https://trino.io/blog/2023/11/15/sql-training-3.html">&lt;p&gt;In the third part of our training series &lt;a href=&quot;/blog/2023/09/27/training-series.html&quot;&gt;Learning SQL with Trino from the
experts&lt;/a&gt; David Phillips and I changed
gears from reading data and performing analytics with Trino. We looked at the
topic of write operations. We covered creating catalogs, schemas, and tables,
then inserting and updating data, and talked about related topics such as data
source and connector support.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The recording of the event allows you to watch it all as if you attended live,
jump to specific sections as desired, or pause while you follow along with the
demos:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/q2uyV7mBKVc&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;The full timestamps for every part of the talk are in the description on
YouTube.&lt;/p&gt;

&lt;p&gt;Also make sure you take advantage of these additional resources:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/assets/blog/sql-training-series-starburst-2023.pdf&quot;&gt;General overview slide deck for the
series&lt;/a&gt;,
with links to resources like our &lt;a href=&quot;/slack.html&quot;&gt;community
chat&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Slide deck for &lt;a href=&quot;https://trinodb.github.io/presentations/presentations/sql-data-mgt/index.html&quot;&gt;Data management with SQL and
Trino&lt;/a&gt;,
including a file with all SQL statements ready to go&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One more episode to go this year, and then we are going to celebrate our users
at Trino Summit 2023. Register now and catch us live for both events:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trino-training-series/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=Global-FY24-Trino-Training-Series&amp;amp;utm_content=1&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;See you next time. I am excited to show you more about &lt;a href=&quot;/blog/2023/11/09/routines.html&quot;&gt;SQL routines&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>In the third part of our training series Learning SQL with Trino from the experts David Phillips and I changed gears from reading data and performing analytics with Trino. We looked at the topic of write operations. We covered creating catalogs, schemas, and tables, then inserting and updating data, and talked about related topics such as data source and connector support.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/trino-sql.png" />
      
    </entry>
  
    <entry>
      <title>Share your best Trino SQL routine</title>
      <link href="https://trino.io/blog/2023/11/09/routines.html" rel="alternate" type="text/html" title="Share your best Trino SQL routine" />
      <published>2023-11-09T00:00:00+00:00</published>
      <updated>2023-11-09T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/11/09/routines</id>
      <content type="html" xml:base="https://trino.io/blog/2023/11/09/routines.html">&lt;p&gt;We want to see the best &lt;a href=&quot;/docs/current/routines.html&quot;&gt;SQL routines&lt;/a&gt;
you can write, feature them as &lt;a href=&quot;/docs/current/routines/examples.html&quot;&gt;examples in the
documentation&lt;/a&gt;, and send you
some goodies as a reward!&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;With the recent &lt;a href=&quot;/docs/current/release/release-431.html&quot;&gt;Trino 431
release&lt;/a&gt; we shipped a
feature that has been awaited by many Trino users for a long, long time. &lt;a href=&quot;/docs/current/routines.html&quot;&gt;SQL
routines&lt;/a&gt; are an easy way to define your
own procedural, custom functions. All users on your Trino instance can then use
such a function in their queries to simplify them.&lt;/p&gt;

&lt;p&gt;Writing a routine in SQL in your client tool is an alternative
to the old way of creating a custom plugin in Java,
compiling it, and deploying the binary in your cluster. The time it takes
to get a new function working has gone from hours to minutes and a few commands!&lt;/p&gt;

&lt;p&gt;Our documentation includes details for all the supported statements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BEGIN&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CASE&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECLARE&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FUNCTION&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IF&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ITERATE&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAVE&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LOOP&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REPEAT&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RETURN&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHILE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the memory connector and the Hive connector supporting routine storage, you
can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE FUNCTION&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DROP FUNCTION&lt;/code&gt;, so that everyone using the
cluster has access to your routines.&lt;/p&gt;
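
&lt;p&gt;As a minimal sketch of what such a catalog-stored routine can look like (the
catalog and schema names in &lt;code&gt;example.default&lt;/code&gt; are assumptions for
illustration, not from the release):&lt;/p&gt;

```sql
-- Hypothetical routine; assumes a catalog named "example" backed by a
-- connector with routine storage support, such as the memory connector.
CREATE FUNCTION example.default.uppercase_trimmed(input varchar)
RETURNS varchar
BEGIN
  RETURN upper(trim(input));
END;

-- Everyone on the cluster can then call the routine in their queries:
SELECT example.default.uppercase_trimmed('  trino ');
```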

&lt;p&gt;The unit tests and our &lt;a href=&quot;/docs/current/routines/examples.html&quot;&gt;examples
documentation&lt;/a&gt; contain a
number of routines that scratch the surface of what is possible. Now, we are
looking for you to help us improve the documentation and maybe even find some
bugs. So here is what we are asking from you:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Upgrade your Trino cluster, CLI, and other clients to 431 or newer. Support in
client tools may vary.&lt;/li&gt;
  &lt;li&gt;Learn from the documentation and write your own routines.&lt;/li&gt;
  &lt;li&gt;Send us your best SQL routine.
    &lt;ul&gt;
      &lt;li&gt;Create a pull request to add to the &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/docs/src/main/sphinx/routines/examples.md&quot;&gt;examples in the
documentation&lt;/a&gt;
with a new section, and request a review from &lt;a href=&quot;https://github.com/mosabua&quot;&gt;Manfred
(mosabua)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;Alternatively, &lt;a href=&quot;mailto:manfred@starburst.io&quot;&gt;email the details&lt;/a&gt; and submit a
&lt;a href=&quot;https://github.com/trinodb/cla&quot;&gt;CLA&lt;/a&gt; separately.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Explain the use case, what the routine does, and maybe also how it works.&lt;/li&gt;
  &lt;li&gt;Include the full statement for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE FUNCTION&lt;/code&gt; definition and an example
invocation.&lt;/li&gt;
  &lt;li&gt;Add any necessary tables or data so we can test the function.&lt;/li&gt;
  &lt;li&gt;Reach out to us on the &lt;a href=&quot;/slack.html&quot;&gt;Trino community Slack&lt;/a&gt;,
if you need any help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We plan to present submissions at &lt;a href=&quot;/blog/2023/09/14/trino-summit-2023-announcement.html&quot;&gt;Trino Summit 2023&lt;/a&gt;, write a blog post, add them to
the documentation, and &lt;a href=&quot;https://www.starburst.io/&quot;&gt;Starburst&lt;/a&gt; will send a cool
reward for the ten best entries.&lt;/p&gt;

&lt;p&gt;Also, if you have more great Trino usage to talk about and share, we would love
to see your &lt;a href=&quot;https://sessionize.com/trino-summit-2023/&quot;&gt;speaker proposal for Trino
Summit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We look forward to seeing many great submissions from you all.&lt;/p&gt;

&lt;p&gt;See you at Trino Summit 2023, and don’t forget to
&lt;a href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=blog-1&quot;&gt;register&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Martin, Dain, David, and Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso, Dain Sundstrom, David Phillips, Manfred Moser</name>
        </author>
      

      <summary>We want to see the best SQL routines you can write, feature them as examples in the documentation, and send you some goodies as a reward!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/trino-sql-routine.png" />
      
    </entry>
  
    <entry>
      <title>Trino is moving to Java 21</title>
      <link href="https://trino.io/blog/2023/11/03/java-21.html" rel="alternate" type="text/html" title="Trino is moving to Java 21" />
      <published>2023-11-03T00:00:00+00:00</published>
      <updated>2023-11-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/11/03/java-21</id>
      <content type="html" xml:base="https://trino.io/blog/2023/11/03/java-21.html">&lt;p&gt;We’re excited to announce that as of version 432, Trino can run with Java 21. In
fact, the Trino Docker image uses Java 21 now. We have done upgrades to newer
Java LTS versions successfully before when we upgraded to Java 11 and then &lt;a href=&quot;/blog/2022/07/14/trino-updates-to-java-17.html&quot;&gt;Java
17 with Trino 390&lt;/a&gt;. Each
time the improvements to the JVM runtime, the garbage collectors, the involved
libraries, and the dependencies resulted in performance gains that came nearly
for free.&lt;/p&gt;

&lt;p&gt;And each time we were able to take advantage of new language constructs and
standard libraries to improve the codebase for all contributors and maintainers
of the project.&lt;/p&gt;

&lt;p&gt;Now it is time to do it again.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;In September, &lt;a href=&quot;https://blogs.oracle.com/java/post/the-arrival-of-java-21&quot;&gt;Java 21 was
released&lt;/a&gt; as the
newest long-term support version. The &lt;a href=&quot;https://www.oracle.com/java/technologies/javase/21all-relnotes.html&quot;&gt;consolidated release
notes&lt;/a&gt; are
truly impressive when it comes to breadth and depth of improvements throughout
the runtime, the standard libraries, the included tools, and the overall system.&lt;/p&gt;

&lt;p&gt;Java 21 provides numerous great opportunities to improve Trino. Even without
many code changes, the performance benefits can have a significant impact on the
cost of running a Trino cluster.&lt;/p&gt;

&lt;p&gt;Taking it one step further, into the codebase and the libraries we use, we are
able to move our performance work to the next level. &lt;a href=&quot;https://github.com/trinodb/trino/issues/14237&quot;&gt;Project
Hummingbird&lt;/a&gt;, our performance
fine-tuning initiative, is buzzing already. &lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt; has again shipped some great improvements recently. Just
like with our Java 17 upgrade, &lt;a href=&quot;https://github.com/wendigo&quot;&gt;Mateusz Gajewski&lt;/a&gt;
has been of critical importance to pull all the necessary changes together.&lt;/p&gt;

&lt;p&gt;With the &lt;a href=&quot;https://trino.io/docs/current/release/release-432.html&quot;&gt;Trino 432
release&lt;/a&gt; we have now
made the next big step. The Trino Docker image was changed to use the &lt;a href=&quot;https://adoptium.net/temurin/releases/&quot;&gt;Eclipse
Temurin&lt;/a&gt; distribution of Java 21. We
have been running our test suites with Java 21 for quite some time and all looks
good. With this release, you are now able to easily test Trino with Java 21.
Just use the Docker container in your deployment or testing, with your own
pipeline or with the &lt;a href=&quot;https://github.com/trinodb/charts&quot;&gt;Trino Helm charts&lt;/a&gt;. The
new version 0.14.0 of the chart already uses the right JVM configuration and
Trino 432 by default.&lt;/p&gt;

&lt;p&gt;Our plan is to make Java 21 the required runtime and move towards adopting the
new language features and libraries. However, before we do that, we want your
input. Are you ready to move to Java 21 for Trino? Did you do some testing with
it already? Are there any issues you encountered? We want to know all about your
experience. Find us on the Trino community chat and ping us in the &lt;a href=&quot;https://trinodb.slack.com/archives/CP1MUNEUX&quot;&gt;#dev
channel&lt;/a&gt;. Or leave comments in our
&lt;a href=&quot;https://github.com/trinodb/trino/issues/17017&quot;&gt;Java 21 tracking issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We want to hear from you. Any input and feedback is welcome.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Update from 11 Jan 2024:&lt;/strong&gt;
The release of &lt;a href=&quot;https://trino.io/docs/current/release/release-436.html&quot;&gt;Trino 436&lt;/a&gt;
includes the switch to Java 21 as a requirement for running Trino.&lt;/p&gt;
&lt;/blockquote&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>We’re excited to announce that as of version 432, Trino can run with Java 21. In fact, the Trino Docker image uses Java 21 now. We have done upgrades to newer Java LTS versions successfully before when we upgraded to Java 11 and then Java 17 with Trino 390. Each time the improvements to the JVM runtime, the garbage collectors, the involved libraries, and the dependencies resulted in performance gains that came nearly for free. And each time we were able to take advantage of new language constructs and standard libraries to improve the codebase for all contributors and maintainers of the project. Now it is time to do it again.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/java-duke-21.png" />
      
    </entry>
  
    <entry>
      <title>Advanced analytics with SQL and Trino</title>
      <link href="https://trino.io/blog/2023/11/01/sql-training-2.html" rel="alternate" type="text/html" title="Advanced analytics with SQL and Trino" />
      <published>2023-11-01T00:00:00+00:00</published>
      <updated>2023-11-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/11/01/sql-training-2</id>
      <content type="html" xml:base="https://trino.io/blog/2023/11/01/sql-training-2.html">&lt;p&gt;In the second part of our training series &lt;a href=&quot;/blog/2023/09/27/training-series.html&quot;&gt;Learning SQL with Trino from the
experts&lt;/a&gt; Martin Traverso and I built
on top of the foundational knowledge from the &lt;a href=&quot;/blog/2023/10/18/sql-training-1.html&quot;&gt;first training session&lt;/a&gt;. We continued to learn more about data
types and working with them, including the important strings, numeric, temporal,
and JSON types.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The recording of the event allows you to watch it all as if you attended live,
jump to specific sections as desired, or pause while you follow along with the
demos:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/S-mfueDmXds&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Following are a couple of timestamps for specific
topics of interest:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=S-mfueDmXds&amp;amp;t=601s&quot;&gt;Temporal data types&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=S-mfueDmXds&amp;amp;t=1920s&quot;&gt;Strings&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=S-mfueDmXds&amp;amp;t=2442s&quot;&gt;Numeric types&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=S-mfueDmXds&amp;amp;t=2705s&quot;&gt;URL parsing and more&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=S-mfueDmXds&amp;amp;t=2850s&quot;&gt;JSON&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full timestamps for every part of the talk are in the description on
YouTube.&lt;/p&gt;

&lt;p&gt;Also make sure you take advantage of these additional resources:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/assets/blog/sql-training-series-starburst-2023.pdf&quot;&gt;General overview slide deck for the
series&lt;/a&gt;,
with links to resources like our &lt;a href=&quot;/slack.html&quot;&gt;community
chat&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Slide deck for &lt;a href=&quot;https://trinodb.github.io/presentations/presentations/sql-adv-analytics/index.html&quot;&gt;Advanced analytics with SQL and
Trino&lt;/a&gt;,
including a file with all SQL statements ready to go&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are halfway through the series, and there is lots more to cover. Don’t forget
to register for the next session, join us to ask specific questions, and learn
much more about SQL and Trino:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trino-training-series/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=Global-FY24-Trino-Training-Series&amp;amp;utm_content=1&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;See you next time,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>In the second part of our training series Learning SQL with Trino from the experts Martin Traverso and I built on top of the foundational knowledge from the first training session. We continued to learn more about data types and working with them, including the important strings, numeric, temporal, and JSON types.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/trino-sql.png" />
      
    </entry>
  
    <entry>
      <title>Getting started with Trino and SQL</title>
      <link href="https://trino.io/blog/2023/10/18/sql-training-1.html" rel="alternate" type="text/html" title="Getting started with Trino and SQL" />
      <published>2023-10-18T00:00:00+00:00</published>
      <updated>2023-10-18T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/10/18/sql-training-1</id>
      <content type="html" xml:base="https://trino.io/blog/2023/10/18/sql-training-1.html">&lt;p&gt;In our training series &lt;a href=&quot;/blog/2023/09/27/training-series.html&quot;&gt;Learning SQL with Trino from the experts&lt;/a&gt; Martin Traverso, Dain Sundstrom, David Phillips,
and I will run through the wide range of SQL support and features of Trino with
our audience. In the first episode, we covered the concepts of Trino and SQL, and
then started to learn some basic SQL. Now you can take advantage of the
recording and available resources to learn at your own pace.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The recording of the event allows you to watch it all as if you attended live,
jump to specific sections as desired, or pause while you follow along with the
demos:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/SnvSBYhRZLg&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Following are a couple of timestamps for specific
topics of interest:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SnvSBYhRZLg&amp;amp;t=380&quot;&gt;What is Trino?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SnvSBYhRZLg&amp;amp;t=1163&quot;&gt;Catalogs and connectors&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SnvSBYhRZLg&amp;amp;t=1658&quot;&gt;Clients&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SnvSBYhRZLg&amp;amp;t=3224&quot;&gt;SQL WHERE statement&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full timestamps for every part of the talk are in the description on
YouTube.&lt;/p&gt;

&lt;p&gt;Also make sure you take advantage of these additional resources:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/assets/blog/sql-training-series-starburst-2023.pdf&quot;&gt;General overview slide deck for the series&lt;/a&gt;, with links to resources like our &lt;a href=&quot;/slack.html&quot;&gt;community chat&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Slide deck for &lt;a href=&quot;https://trinodb.github.io/presentations/presentations/sql-trino/index.html&quot;&gt;SQL and Trino concepts&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Slide deck for &lt;a href=&quot;https://trinodb.github.io/presentations/presentations/sql-basics/index.html&quot;&gt;SQL basics with Trino&lt;/a&gt;, including a file with all SQL statements ready to go&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you know of the series and saw the first part of it, make sure you
register for the next ones, so you can ask specific questions and learn much
more about SQL and Trino:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trino-training-series/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=Global-FY24-Trino-Training-Series&amp;amp;utm_content=1&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;See you then,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>In our training series Learning SQL with Trino, the experts Martin Traverso, Dain Sundstrom, David Phillips, and I will run through the wide range of SQL support and features of Trino with our audience. In the first episode, we covered the concepts of Trino and SQL, and then started to learn some basic SQL. Now you can take advantage of the recording and available resources to learn at your own pace.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/trino-sql.png" />
      
    </entry>
  
    <entry>
      <title>A report from the Trino Conference Tokyo 2023</title>
      <link href="https://trino.io/blog/2023/10/11/a-report-about-trino-conference-tokyo-2023.html" rel="alternate" type="text/html" title="A report from the Trino Conference Tokyo 2023" />
      <published>2023-10-11T00:00:00+00:00</published>
      <updated>2023-10-11T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/10/11/a-report-about-trino-conference-tokyo-2023</id>
      <content type="html" xml:base="https://trino.io/blog/2023/10/11/a-report-about-trino-conference-tokyo-2023.html">&lt;p&gt;The Trino community in Japan held an online event on October 5th, 2023. This
article is a summary of the conference aiming to share the presentations and
provide an overview.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Watch a replay of the whole event, or jump to specific time stamps and topics of
interest:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/CTwk2rkatx8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;This year, there were 4 sessions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Trino, Starburst Galaxy, and Enterprise&lt;/li&gt;
  &lt;li&gt;Log infrastructure using Trino and Iceberg&lt;/li&gt;
  &lt;li&gt;Data infrastructure using Spark and Trino on bare metal k8s&lt;/li&gt;
  &lt;li&gt;Getting started with Trino and a transactional data lake with serverless Athena&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;trino-starburst-galaxy-and-enterprise&quot;&gt;Trino, Starburst Galaxy, and Enterprise&lt;/h1&gt;

&lt;p&gt;The first session was presented by Yuya Ebihara (me) from Starburst. I explained
the Trino changes from 2022 and 2023, as well as features of Starburst Galaxy
and Starburst Enterprise. The session introduced &lt;a href=&quot;https://prtimes.jp/main/html/rd/p/000000226.000025237.html&quot;&gt;a press release about the
partnership between Starburst and Dell Technologies in
Japan&lt;/a&gt;.&lt;/p&gt;

&lt;iframe src=&quot;https://docs.google.com/presentation/d/e/2PACX-1vRubtZB9peROzcGgaTQQYkLs-9jZEbWuRszNInKviuj1RdPwp5CrElssLwLYSUuVeGUfj58wv428UFw/embed&quot; frameborder=&quot;0&quot; width=&quot;595&quot; height=&quot;485&quot; allowfullscreen=&quot;true&quot; mozallowfullscreen=&quot;true&quot; webkitallowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;

&lt;h1 id=&quot;log-infrastructure-using-trino-and-iceberg&quot;&gt;Log infrastructure using Trino and Iceberg&lt;/h1&gt;

&lt;p&gt;The second session was presented by Tadahisa Kamijo from Sakura Internet. He
explained some requirements for new analytics environments, such as concurrent
read/write, schema evolution, record-level modification, restoring past
snapshots, and addressing performance issues with the Hive metastore. They
decided to use Trino and Iceberg to meet these requirements. Kamijo-san also
introduced the file layout in Iceberg and demonstrated how to debug Iceberg
files using their Java client.&lt;/p&gt;

&lt;iframe class=&quot;speakerdeck-iframe&quot; frameborder=&quot;0&quot; src=&quot;https://speakerdeck.com/player/4c9229c81e36494ca0c722b20bfdf20e&quot; title=&quot;TrinoとIcebergで ログ基盤の構築 / 2023-10-05 Trino Presto Meetup&quot; allowfullscreen=&quot;true&quot; style=&quot;border: 0px; background: padding-box padding-box rgba(0, 0, 0, 0.1); margin: 0px; padding: 0px; border-radius: 6px; box-shadow: rgba(0, 0, 0, 0.2) 0px 5px 40px; width: 100%; height: auto; aspect-ratio: 560 / 315;&quot; data-ratio=&quot;1.7777777777777777&quot;&gt;&lt;/iframe&gt;

&lt;h1 id=&quot;data-infrastructure-using-spark-an-trino-on-bare-metal-k8s&quot;&gt;Data infrastructure using Spark and Trino on bare metal k8s&lt;/h1&gt;

&lt;p&gt;The third session was presented by Yasukazu Nagatomi from MicroAd. They started
a migration from Impala to Trino to resolve the following issues: separating
compute and storage, refreshing and utilizing table and column statistics even
with large tables, and supporting schema evolution. Nagatomi-san shared a use
case of the Trino features fault-tolerant execution and spill-to-disk, the
first public use case of these features in Japan.&lt;/p&gt;

&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/NTzgv4IUvAPIvp&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt;
&lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/microad_engineer/trino-conference-tokyo-2023&quot; title=&quot;ベアメタルで実現するSpark＆Trino on K8sなデータ基盤&quot; target=&quot;_blank&quot;&gt;ベアメタルで実現するSpark＆Trino on K8sなデータ基盤&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;//www.slideshare.net/microad_engineer&quot; target=&quot;_blank&quot;&gt;MicroAd, Inc.(Engineer)&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;

&lt;h1 id=&quot;getting-started-trino-and-a-transactional-data-lake-with-serverless-athena&quot;&gt;Getting started with Trino and a transactional data lake with serverless Athena&lt;/h1&gt;

&lt;p&gt;The last session was presented by Sotaro Hikita from AWS. Athena is a serverless
service for ad hoc analytics built on a Trino and Presto foundation. It supports not
only S3 data but also various data sources via Federated Query. In Athena, Iceberg
supports both read and write operations, while Hudi and Delta Lake only support
read operations.&lt;/p&gt;

&lt;iframe class=&quot;speakerdeck-iframe&quot; frameborder=&quot;0&quot; src=&quot;https://speakerdeck.com/player/e1f3188001ca4919b227177f3934b626&quot; title=&quot;サーバレスなAmazon Athenaで始めるTrinoとTransactional Data Lake&quot; allowfullscreen=&quot;true&quot; style=&quot;border: 0px; background: padding-box padding-box rgba(0, 0, 0, 0.1); margin: 0px; padding: 0px; border-radius: 6px; box-shadow: rgba(0, 0, 0, 0.2) 0px 5px 40px; width: 100%; height: auto; aspect-ratio: 560 / 315;&quot; data-ratio=&quot;1.7777777777777777&quot;&gt;&lt;/iframe&gt;

&lt;h1 id=&quot;wrap-up&quot;&gt;Wrap up&lt;/h1&gt;

&lt;p&gt;We sincerely appreciate the participation of community members in Japan. Thank
you so much for watching the live event. We are planning to hold an in-person
event next year. See you next time!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Yuya&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Yuya Ebihara</name>
        </author>
      

      <summary>The Trino community in Japan held an online event on October 5th, 2023. This article is a summary of the conference aiming to share the presentations and provide an overview.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino Gateway has arrived</title>
      <link href="https://trino.io/blog/2023/09/28/trino-gateway.html" rel="alternate" type="text/html" title="Trino Gateway has arrived" />
      <published>2023-09-28T00:00:00+00:00</published>
      <updated>2023-09-28T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/09/28/trino-gateway</id>
      <content type="html" xml:base="https://trino.io/blog/2023/09/28/trino-gateway.html">&lt;p&gt;You started with one Trino cluster, and your users like the power of SQL and
&lt;a href=&quot;/ecosystem/index.html#data-sources&quot;&gt;querying all sorts of data sources&lt;/a&gt;.
Then you needed to upgrade and set up a testing cluster. That was a while
ago, and now you run a separate cluster configured for ETL workloads with
fault-tolerant execution, and some others with different configurations.&lt;/p&gt;

&lt;p&gt;With Trino Gateway, we now have an answer to your users’ request to provide one
URL for all the clusters. Trino Gateway has arrived!&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Today, we are happy to announce our &lt;a href=&quot;https://github.com/trinodb/trino-gateway/blob/main/docs/release-notes.md#trino-gateway-3-26-sep-2023&quot;&gt;first release of Trino
Gateway&lt;/a&gt;.
The release is the result of many, many months of effort to move the legacy
Presto Gateway to Trino, start a refactor of the project, and add numerous new
features.&lt;/p&gt;

&lt;p&gt;Many larger deployments across the Trino community rely on the gateway as a load
balancer, proxy server, and configurable routing gateway for multiple Trino
clusters. Users don’t need to worry about what catalog and data source is
available in what Trino cluster. Trino Gateway exposes one URL for them all.
Administrators can ensure routing is correct and use the REST API to configure
the necessary rules. This also allows seamless upgrades of clusters behind Trino
Gateway in a blue/green deployment mode.&lt;/p&gt;

&lt;p&gt;Up to now, many users had to maintain separate forks of the legacy Presto
Gateway. Some of these users created numerous improvements in isolation from each
other, sometimes even implementing the same feature multiple times. This first
release of Trino Gateway starts a strong collaboration among some of these users.
Bloomberg contributed the main bulk of the new features, including the
much-requested support for authentication and authorization on Trino Gateway
itself. Maintainers and contributors from Starburst pulled together the
stakeholders and managed the project, and collaborators from Naver, LinkedIn,
Dune, and others are already helping out and ready to move the project forward.&lt;/p&gt;

&lt;p&gt;There are exciting times ahead for the project, and we have big plans for
documentation, installation, and general modernizations of the app, so go and
have a look at the project, read the documentation and release notes, file an
issue, or submit a pull request:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://github.com/trinodb/trino-gateway&quot;&gt;
        Trino Gateway
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Interested in finding out more? Find us and other users and contributors on the
&lt;a href=&quot;https://trinodb.slack.com/app_redirect?channel=trino-gateway&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino-gateway&lt;/code&gt;&lt;/a&gt;
and
&lt;a href=&quot;https://trinodb.slack.com/app_redirect?channel=trino-gateway-dev&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino-gateway-dev&lt;/code&gt;&lt;/a&gt;
channels in &lt;a href=&quot;/slack.html&quot;&gt;the Trino community Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, don’t forget to tell us about your usage of Trino Gateway or Trino and
&lt;a href=&quot;https://sessionize.com/trino-summit-2023/&quot;&gt;submit a talk for Trino Summit
2023&lt;/a&gt;. And if you just want to learn
and listen to others, &lt;a href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=blog-1&quot;&gt;register as an
attendee&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Martin, and all the other Trino Gateway contributors&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso</name>
        </author>
      

      <summary>You started with one Trino cluster, and your users like the power of SQL and querying all sorts of data sources. Then you needed to upgrade and set up a testing cluster. That was a while ago, and now you run a separate cluster configured for ETL workloads with fault-tolerant execution, and some others with different configurations. With Trino Gateway, we now have an answer to your users’ request to provide one URL for all the clusters. Trino Gateway has arrived!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/trino-gateway-small.png" />
      
    </entry>
  
    <entry>
      <title>Learning SQL with Trino from the experts</title>
      <link href="https://trino.io/blog/2023/09/27/training-series.html" rel="alternate" type="text/html" title="Learning SQL with Trino from the experts" />
      <published>2023-09-27T00:00:00+00:00</published>
      <updated>2023-09-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/09/27/training-series</id>
      <content type="html" xml:base="https://trino.io/blog/2023/09/27/training-series.html">&lt;p&gt;Do you have a rough idea of what SQL is? Do you need to get data out of object
storage in the cloud and some relational database at the same time? You should
look at Trino and learn about SQL.&lt;/p&gt;

&lt;p&gt;Or do you know the ins and outs of joins and window functions, and are your
SQL queries counted in pages rather than lines? You may even be the SQL expert on
your team. You should &lt;em&gt;also&lt;/em&gt; look at Trino and SQL.&lt;/p&gt;

&lt;p&gt;Luckily for you all, we have the right SQL training for everyone in our upcoming
series with the founders of the Trino project and SQL experts Martin Traverso,
Dain Sundstrom, and David Phillips, and myself as host and co-trainer.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;In the SQL training series, we start with the basics of Trino. You will learn
that despite the fact that there is a leopard frog on the cover of &lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The
Definitive Guide&lt;/a&gt;, SQL does
not stand for Silly Quacking Leopardfrogs. Instead, SQL stands for Structured
Query Language, and you will learn about the benefits of connecting &lt;a href=&quot;/ecosystem/index.html#data-sources&quot;&gt;many
data sources&lt;/a&gt; to Trino and using
&lt;a href=&quot;/ecosystem/index.html#clients&quot;&gt;different clients&lt;/a&gt;, always with the
same powerful SQL. For the SQL pros, you learn about catalogs and
queries that span data sources.&lt;/p&gt;

&lt;p&gt;Then we’ll glance at the basic SQL foundations, since there are literally
hundreds of books, videos, and training courses around. All of them teach you
things like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; statements and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clauses, and unravel the confusion
around &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEFT OUTER JOIN&lt;/code&gt; and the like.&lt;/p&gt;

&lt;p&gt;And after this, we get to the interesting stuff. The following is a list of
some of the topics we will cover:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Trino concepts like cluster, data source, client, catalog, and more&lt;/li&gt;
  &lt;li&gt;Overview of all the SQL support with statements, data types, functions, and
connector support&lt;/li&gt;
  &lt;li&gt;Working with data types, including numerical and text values, dates and times,
JSON, …&lt;/li&gt;
  &lt;li&gt;Lots of scalar, aggregation, window functions&lt;/li&gt;
  &lt;li&gt;Object storage and other data sources&lt;/li&gt;
  &lt;li&gt;Creating schemas, tables, and views&lt;/li&gt;
  &lt;li&gt;Inserting, merging, moving and deleting data&lt;/li&gt;
  &lt;li&gt;Metadata in general and in hidden tables like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$properties&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Table procedures&lt;/li&gt;
  &lt;li&gt;Trino views, Trino materialized views and other views&lt;/li&gt;
  &lt;li&gt;Global and connector level table functions, including query pass-through&lt;/li&gt;
  &lt;li&gt;Support for SQL routines, also known as user-defined functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interested now? No matter how great your SQL knowledge or Trino expertise is,
you will learn something new in this series. So what are you waiting for?&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trino-training-series/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=Global-FY24-Trino-Training-Series&amp;amp;utm_content=1&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Join us in one or all of the sessions on the following dates:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;18th of October 2023: &lt;a href=&quot;/blog/2023/10/18/sql-training-1.html&quot;&gt;Getting started with Trino and SQL&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;1st of November 2023: &lt;a href=&quot;/blog/2023/11/01/sql-training-2.html&quot;&gt;Advanced analytics with SQL and Trino&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;15th of November 2023: &lt;a href=&quot;/blog/2023/11/15/sql-training-3.html&quot;&gt;Data management with SQL and Trino&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;29th of November 2023: &lt;a href=&quot;/blog/2023/11/29/sql-training-4.html&quot;&gt;Functions with SQL and Trino&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We look forward to seeing you in class.&lt;/p&gt;

&lt;p&gt;Martin, Dain, David, and Manfred&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Videos, slide decks, and other resources for all classes are now available:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Getting started with Trino and SQL: &lt;a href=&quot;/blog/2023/10/18/sql-training-1.html&quot;&gt;Blog post with resources and video&lt;/a&gt;, &lt;a href=&quot;https://www.youtube.com/watch?v=SnvSBYhRZLg&quot;&gt;Video on YouTube&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Advanced analytics with SQL and Trino: &lt;a href=&quot;/blog/2023/11/01/sql-training-2.html&quot;&gt;Blog post with resources and video&lt;/a&gt;, &lt;a href=&quot;https://www.youtube.com/watch?v=S-mfueDmXds&quot;&gt;Video on YouTube&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Data management with SQL and Trino: &lt;a href=&quot;/blog/2023/11/15/sql-training-3.html&quot;&gt;Blog post with resources and video&lt;/a&gt;, &lt;a href=&quot;https://www.youtube.com/watch?v=q2uyV7mBKVc&quot;&gt;Video on YouTube&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Functions with SQL and Trino: &lt;a href=&quot;/blog/2023/11/29/sql-training-4.html&quot;&gt;Blog post with resources and video&lt;/a&gt;, &lt;a href=&quot;https://www.youtube.com/watch?v=1siAYR6BzzY&quot;&gt;Video on YouTube&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Do you have a rough idea of what SQL is? Do you need to get data out of object storage in the cloud and some relational database at the same time? You should look at Trino and learn about SQL. Or do you know the ins and outs of joins, window functions, and your SQL queries are counted by the pages and not lines? You may even be the expert on SQL on your team. You should also look at Trino and SQL. Luckily for you all, we have the right SQL training for everyone in our upcoming series with the founders of the Trino project and SQL experts Martin Traverso, Dain Sundstrom, and David Phillips, and myself as host and co-trainer.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/trino-sql.png" />
      
    </entry>
  
    <entry>
      <title>Chinese edition of Trino: The Definitive Guide</title>
      <link href="https://trino.io/blog/2023/09/21/the-definitive-guide-2-cn.html" rel="alternate" type="text/html" title="Chinese edition of Trino: The Definitive Guide" />
      <published>2023-09-21T00:00:00+00:00</published>
      <updated>2023-09-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/09/21/the-definitive-guide-2-cn</id>
      <content type="html" xml:base="https://trino.io/blog/2023/09/21/the-definitive-guide-2-cn.html">&lt;p&gt;Trino, Trino, Trino everywhere. Just looking at our website stats and the users
in our community chat, we know that Trino is going places. We also know that one
of these places with a large user community is China. And now we have good news
for you. A translation of the second edition of the book to Chinese is now
available.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Today, we are happy to announce that a Chinese translation of the book &lt;a href=&quot;https://trino.io/trino-the-definitive-guide.html&quot;&gt;Trino:
The Definitive Guide&lt;/a&gt; is now
available for the communities all across China and far beyond, and hopefully
lowers the barrier to Trino for native speakers. We invite you all to get your
own copy:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://product.dangdang.com/11487789827.html&quot;&gt;
        Trino权威指南(原书第2版) 机械工业出版社
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Our thanks go out to the teams at O’Reilly and dangdang for making this happen.
We hope many readers will benefit from the translated edition.&lt;/p&gt;

&lt;p&gt;We look forward to chatting with many of our new readers and Trino users on the
&lt;a href=&quot;https://trinodb.slack.com/app_redirect?channel=general-cn&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;general-cn&lt;/code&gt;&lt;/a&gt; channel in &lt;a href=&quot;/slack.html&quot;&gt;the Trino community Slack&lt;/a&gt;,
other channels, and direct messaging.&lt;/p&gt;

&lt;p&gt;Also, don’t forget to tell us about your usage of Trino. You can contact us on
Slack to be a guest in &lt;a href=&quot;/broadcast/index.html&quot;&gt;Trino Community
Broadcast&lt;/a&gt; or &lt;a href=&quot;https://sessionize.com/trino-summit-2023/&quot;&gt;submit a talk for Trino
Summit 2023&lt;/a&gt;. And if you just want
to learn and listen to others, &lt;a href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=blog-1&quot;&gt;register as an
attendee&lt;/a&gt; for Trino Summit 2023.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Martin, and Matt&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso, Matt Fuller</name>
        </author>
      

      <summary>Trino, Trino, Trino everywhere. Just looking at our website stats and the users in our community chat, we know that Trino is going places. We also know that one of these places with a large user community is China. And now we have good news for you. A translation of the second edition of the book to Chinese is now available.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/ttdg2-cn-cover.png" />
      
    </entry>
  
    <entry>
      <title>Join us for Trino Summit 2023</title>
      <link href="https://trino.io/blog/2023/09/14/trino-summit-2023-announcement.html" rel="alternate" type="text/html" title="Join us for Trino Summit 2023" />
      <published>2023-09-14T00:00:00+00:00</published>
      <updated>2023-09-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/09/14/trino-summit-2023-announcement</id>
      <content type="html" xml:base="https://trino.io/blog/2023/09/14/trino-summit-2023-announcement.html">&lt;p&gt;The Trino community is buzzing. Commander Bun Bun is ready to invite you all to
join us for Trino Summit 2023. And “all” really means everyone in the community.
The event is free to attend, virtual, and full of news and shared knowledge from
your peers using Trino. Don’t hesitate to submit your talk and register to
attend now.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;We are pleased to announce the upcoming Trino Summit 2023. The summit is
scheduled as a virtual event on the &lt;strong&gt;13th and 14th of December 2023&lt;/strong&gt;, and
attendance is free!&lt;/p&gt;

&lt;p&gt;If you’d like to share your knowledge and information about Trino usage and give
a talk at this year’s Trino Summit, we’re putting out a call for speakers. We
are accepting submissions from now until the &lt;strong&gt;12th of November&lt;/strong&gt;, but we
recommend submitting as soon as possible, because we expect slots to fill up
fast.&lt;/p&gt;

&lt;p&gt;We’re looking for intermediate to advanced-level talks on a variety of themes.
If you have an interesting story about how you leverage Trino in your data
platform for analytics and other workloads, found a neat way to extend it with a
custom plugin or add-on, or swapped to Trino for a performance win, we’d love to
hear about it. We’re excited to expand our speaker lineup with talks from the
broader Trino community. Find more information about duration, technical
details, and more suggestions when you submit your talk.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit2023/?utm_source=trino&amp;amp;utm_medium=website&amp;amp;utm_campaign=NORAM-FY24-Q4-EV-Trino-Summit-2023&amp;amp;utm_content=blog-1&quot;&gt;
        Register to attend
    &lt;/a&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://sessionize.com/trino-summit-2023/&quot;&gt;
        Submit a talk
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;This Trino Software Foundation event is organized and sponsored by
&lt;a href=&quot;https://starburst.io&quot;&gt;Starburst&lt;/a&gt;, and we invite other sponsors to help make
this a successful event for the Trino community.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://starburst.io&quot;&gt;
  &lt;img src=&quot;/assets/images/logos/starburst-small.png&quot; title=&quot;Starburst&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If that interests you or your employer, &lt;a href=&quot;mailto:events@starburst.io&quot;&gt;contact the Trino events team for more
information&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And of course, we’re looking forward to reading your proposals and seeing you
then.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>The Trino community is buzzing. Commander Bun Bun is ready to invite you all to join us for Trino Summit 2023. And “all” really means everyone in the community. The event is free to attend, virtual, and full of news and shared knowledge from your peers using Trino. Don’t hesitate to submit your talk and register to attend now.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2023/summit-logo.png" />
      
    </entry>
  
    <entry>
      <title>FugueSQL: Interoperable Python and Trino for interactive workloads</title>
      <link href="https://trino.io/blog/2023/07/27/trino-fest-2023-fugue-recap.html" rel="alternate" type="text/html" title="FugueSQL: Interoperable Python and Trino for interactive workloads" />
      <published>2023-07-27T00:00:00+00:00</published>
      <updated>2023-07-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/27/trino-fest-2023-fugue-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/27/trino-fest-2023-fugue-recap.html">&lt;p&gt;Fugue may be an unfamiliar name to those in the Trino ecosystem. It’s another
Python tool, a programming model built to enhance interoperability between
Python and SQL. On the Python side of things, it’s a wrapper around common tools
like pandas and Polars that converts code into SQL for high-performance,
large-scale query execution. So why are we talking about it at Trino Fest?
Because Fugue recently launched an integration with Trino, enabling you to write
Python code that can be converted to SQL to run on a high-powered Trino backend.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/aKhI1Phfn-o&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Though Trino users are quite familiar with SQL, it does present some challenges.
Iterating on a SQL query and improving it can be difficult, and finding ways to
optimize or speed things up can be a challenge that requires sophisticated
external tools or working on hunches. Testing queries, especially incrementally,
has never been super easy, either. Compare that to Python, which does not have
those problems, but has issues of its own. Python, especially at scale, is not
very performant. So it’s natural to try to take the advantages of both, which is
what Fugue is aiming to do.&lt;/p&gt;

&lt;p&gt;After that brief intro to Fugue, the rest of the talk consists of technical
demos of the many things that you can do with Fugue. This includes
setting a query up, breaking it up into smaller parts, bringing it to pandas,
and demonstrating extensions that are built into Fugue. With all of these
intermediate steps, it becomes easier to unit test queries before sending them
into production, making sure that everything works as expected.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Kevin Kho and Cole Bowden</name>
        </author>
      

      <summary>Fugue may be an unfamiliar name to those in the Trino ecosystem. It’s another Python tool, a programming model built to enhance interoperability between Python and SQL. On the Python side of things, it’s a wrapper around common tools like pandas and Polars that convert code into SQL for high-performance, large-scale query execution. So why are we talking about it at Trino Fest? Because Fugue recently launched an integration with Trino, enabling you to write Python code that can be converted to SQL to run on a high-powered Trino backend.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Fugue.png" />
      
    </entry>
  
    <entry>
      <title>Starburst Galaxy: A romance of many architectures</title>
      <link href="https://trino.io/blog/2023/07/25/trino-fest-2023-datto.html" rel="alternate" type="text/html" title="Starburst Galaxy: A romance of many architectures" />
      <published>2023-07-25T00:00:00+00:00</published>
      <updated>2023-07-25T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/25/trino-fest-2023-datto</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/25/trino-fest-2023-datto.html">&lt;p&gt;Let’s cut straight to the chase with this lightning talk from Benjamin Jeter, a
data architect, platform manager, and data engineer at Datto. For those who are
not familiar with Datto, they are an American cybersecurity and data backup
company. They’re the leading global provider of security and cloud-based
software solutions purpose-built for Managed Service Providers (MSPs). In
Benjamin’s talk, he goes through some of the considerations and design goals of
a reference architecture pattern that they use and why they chose to use Trino
with Starburst Galaxy.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/K3AlAWB-Gmg&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Datto.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;But you might be wondering: what does Ben mean when he says “reference
architecture”? A reference architecture pattern is a pattern for making
arbitrary data available to end users in a reproducible and modular way. It’s an
opinionated representation of what best practices look like for a given class of
use cases. You can almost think of it as a conceptual tool for thinking
critically about specific patterns through a pragmatic balance of simplicity and
effectiveness. However, it is not something that will work for every use case,
nor is it necessarily the best solution.&lt;/p&gt;

&lt;p&gt;The main design goal that Benjamin had was to facilitate near real-time data
access while using only Trino. In addition, he wanted it to be simple, easy to
understand, flexible, and adaptable. Accomplishing this design goal requires
many steps, such as first having a daily batch transform that converts JSON
into Iceberg and serves as &lt;a href=&quot;https://www.investopedia.com/terms/t/tplus1.asp&quot;&gt;T-1
data&lt;/a&gt;. Then he created an
unpartitioned external table that is rebuilt every day as part of the daily
batch transform. Using the &lt;a href=&quot;https://docs.starburst.io/starburst-galaxy/sql/great-lakes.html&quot;&gt;Great Lakes
connectivity&lt;/a&gt;
with this table allows Datto to have scan-on-query semantics, which enables data
access about as close to real-time as you can get without a streaming solution like
Kafka or Kinesis. Benjamin shows how easy it is to design a use case with just a
couple lines of code using Trino with Starburst Galaxy.&lt;/p&gt;

&lt;p&gt;Interested? Check out the video where Benjamin shows the code and explains how
it works!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Benjamin Jeter, Ryan Duan</name>
        </author>
      

      <summary>Let’s cut straight to the chase with this lightning talk from Benjamin Jeter, a data architect, platform manager, and data engineer at Datto. For those that are not familiar with Datto, they are an American cybersecurity and data backup company. They’re the leading global provider of security and cloud-based software solutions purpose-built for Managed Service Providers (MSPs). In Benjamin’s talk, he goes through some of the considerations and design goals of a reference architecture pattern that they use and why they chose to use Trino with Starburst Galaxy.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Datto.png" />
      
    </entry>
  
    <entry>
      <title>Trino optimization with distributed caching on data lakes</title>
      <link href="https://trino.io/blog/2023/07/21/trino-fest-2023-alluxio-recap.html" rel="alternate" type="text/html" title="Trino optimization with distributed caching on data lakes" />
      <published>2023-07-21T00:00:00+00:00</published>
      <updated>2023-07-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/21/trino-fest-2023-alluxio-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/21/trino-fest-2023-alluxio-recap.html">&lt;p&gt;By 2025, there will be 100 zettabytes stored in the cloud. That’s
100,000,000,000,000,000,000,000 bytes - a huge, eye-popping number. But only
about 10% of that data is actually used on a regular basis. At Uber, for
example, only 1% of their disk space holds 50% of the data they access on
any given day. With so much data but such a small percentage being used, it
raises the question: how can we identify frequently-used data and make it more
accessible, efficient, and lower-cost to access?&lt;/p&gt;

&lt;p&gt;Once we have identified that “hot data,” the answer is data caching. By caching
that data in storage, you can reap a ton of benefits: performance gains, lower
costs, less network congestion, and reduced throttling on the storage layer.
Data caching sounds great, but why are we talking about it at a Trino event?
Because &lt;a href=&quot;https://github.com/trinodb/trino/pull/16375&quot;&gt;data caching with Alluxio is coming to Trino&lt;/a&gt;!&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/oK1A5U1WzFc&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Alluxio.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;So what are the key features of data caching? The first and foremost is that the
frequently-accessed data gets stored on local SSDs. In the case of Trino, this
means that the Trino worker nodes will store data to reduce latency and decrease
the number of loads from object storage. Even if a worker restarts, it still has
that data cached. Caching will work on all the data lake connectors, so whether
you’re using Iceberg, Hive, Hudi, or Delta Lake, it’ll be speeding your queries
up. The best part is that once it’s in Trino, all you need to do is enable it,
set three configuration properties, and let the performance improvement speak
for itself. There’s no other change to how queries run or execute, so there’s no
headache or migration needed.&lt;/p&gt;
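
&lt;p&gt;As a rough sketch of what enabling the cache could look like in a catalog
properties file (the exact property names here are illustrative, since the
feature was still making its way through the pull request at the time of the
talk):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Turn on file system caching for this catalog
fs.cache.enabled=true
# Local SSD directory on each worker that holds cached data
fs.cache.directories=/mnt/trino-cache
# Cap how much of the disk the cache may consume
fs.cache.max-disk-usage-percentages=80
&lt;/code&gt;&lt;/pre&gt;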

&lt;p&gt;Hope then gives deeper technical detail on exactly how data caching works. She
highlights a few existing examples of how large-scale companies, Uber and
Shopee, have utilized data caching to reap massive performance gains. Then the
talk is passed off to Beinan, who gives further technical detail,
exploring cache invalidation, how to maximize cache hit rate, cluster
elasticity, cache storage efficiency, and data consistency. He also explores
ongoing work on semantic caching, native/off-heap caching, and distributed
caching, all of which have interesting upsides and benefits.&lt;/p&gt;

&lt;p&gt;Give the full talk a listen if you’re interested, as both Hope and Beinan go
into a lot of great, technical detail that you won’t want to miss out on. And
don’t forget to keep an eye on Trino release notes to see when it’s live!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Hope Wang, Beinan Wang, and Cole Bowden</name>
        </author>
      

      <summary>By 2025, there will be 100 zettabytes stored in the cloud. That’s 100,000,000,000,000,000,000,000 bytes - a huge, eye-popping number. But only about 10% of that data is actually used on a regular basis. At Uber, for example, only 1% of their disk space holds 50% of the data they access on any given day. With so much data but such a small percentage being used, it raises the question: how can we identify frequently-used data and make it more accessible, efficient, and lower-cost to access? Once we have identified that “hot data,” the answer is data caching. By caching that data in storage, you can reap a ton of benefits: performance gains, lower costs, less network congestion, and reduced throttling on the storage layer. Data caching sounds great, but why are we talking about it at a Trino event? Because data caching with Alluxio is coming to Trino!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Alluxio.png" />
      
    </entry>
  
    <entry>
      <title>Inspecting Trino on ice</title>
      <link href="https://trino.io/blog/2023/07/19/trino-fest-2023-stripe.html" rel="alternate" type="text/html" title="Inspecting Trino on ice" />
      <published>2023-07-19T00:00:00+00:00</published>
      <updated>2023-07-19T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/19/trino-fest-2023-stripe</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/19/trino-fest-2023-stripe.html">&lt;p&gt;For those unfamiliar, Stripe is an online payment processor that facilitates
online payments for digital-native merchants. They use Trino to facilitate ad
hoc analytics, enable dashboarding, and provide an API for internal services and
data apps. In Kevin Liu’s session at &lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest 2023&lt;/a&gt;, he showcases the Trino Iceberg
connector and how it can replace more complex usage to access Iceberg metadata.
He also discusses how Trino is a core part of operations at Stripe.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/PSGuAMVc6-w&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Stripe.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Trino is the foundational infrastructure on which other data apps and services
are built. In Kevin’s words, “I call Trino the Swiss army knife in the data
ecosystem.”&lt;/p&gt;

&lt;p&gt;At Stripe, they use Iceberg tables extensively, replacing legacy Hive tables.
But Iceberg isn’t perfect: one problem with Iceberg is reading its metadata from
S3. To work with Iceberg metadata, Stripe developed an internal CLI tool. The
tool requires a privileged internal machine, which is only accessible to
developers, and it outputs results in JSON format, which is difficult to
process, read, and use for further analysis. However, Kevin found that the Trino
Iceberg connector can replace most of the functionality of the Iceberg CLI. The
connector brings Iceberg metadata information to Trino’s powerful analytical
engine and facilitates lightning fast debugging and analysis.&lt;/p&gt;
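
&lt;p&gt;As a concrete sketch of what this looks like, the Iceberg connector exposes
metadata through hidden &lt;code&gt;$&lt;/code&gt;-suffixed tables, so an inspection that
previously needed the internal CLI becomes a plain SQL query (the catalog,
schema, and table names below are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- List the snapshots recorded for an Iceberg table
SELECT snapshot_id, committed_at, operation
FROM iceberg.analytics.&quot;payments$snapshots&quot;;

-- Inspect the data files behind the current snapshot
SELECT file_path, record_count, file_size_in_bytes
FROM iceberg.analytics.&quot;payments$files&quot;;
&lt;/code&gt;&lt;/pre&gt;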

&lt;p&gt;Unfortunately, there was no way to grab all desired table property information
from the Trino Iceberg connector, because they were using an older version.
Thus, they use the Trino PostgreSQL connector to connect directly to the backend
database of the Hive Metastore, allowing them to inspect table metadata
directly. With the two connectors, they have all the information about the data
warehouse, powering their analysis and meta-analysis of the data and how it’s
used.&lt;/p&gt;

&lt;p&gt;They also use Trino to inspect Iceberg usage patterns. They log every Trino
query using the Trino event listener and store that in another PostgreSQL
database. This gives the full information of every query that has ever run
through Trino, and allows them to perform analysis using historical queries.
Combined with Trino’s built-in query metadata enrichment, this method enables a
multitude of auditing, debugging, and optimization use cases.&lt;/p&gt;

&lt;p&gt;In the future, they plan to use Trino to improve data quality by leveraging it
as a validation framework, to perform Iceberg table maintenance, and to optimize
tables based on historical read patterns.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Kevin Liu, Ryan Duan</name>
        </author>
      

      <summary>For those unfamiliar, Stripe is an online payment processor that facilitates online payments for digital-native merchants. They use Trino to facilitate ad hoc analytics, enable dashboarding, and provide an API for internal services and data apps to utilize Trino. In Kevin Liu’s session at Trino Fest 2023, he showcases the Trino Iceberg connector and how it can replace more complex usage to access Iceberg metadata. He also discusses how Trino is a core part of operations at Stripe.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Stripe.png" />
      
    </entry>
  
    <entry>
      <title>Data mesh implementation using Hive views</title>
      <link href="https://trino.io/blog/2023/07/17/trino-fest-2023-comcast-recap.html" rel="alternate" type="text/html" title="Data mesh implementation using Hive views" />
      <published>2023-07-17T00:00:00+00:00</published>
      <updated>2023-07-17T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/17/trino-fest-2023-comcast-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/17/trino-fest-2023-comcast-recap.html">&lt;p&gt;At Comcast, data is used in a data mesh ecosystem, with a vision where users can
discover data and request data through a self-service platform. With federation,
various tools, and the ability to create, read, and write data with different
platforms, it’s a full-blown data mesh. So how do you build that? With Trino, of
course, and with the power of Hive views. Tune into the 10-minute lightning talk
that Alejandro gave at Trino Fest to learn more about how Comcast pulled it off.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/ZgcVtPFkKHM&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;With various storage systems, like S3 and MinIO, and users who
want to be able to use a variety of data platforms, including Trino, but also
Databricks and Spark, Comcast needed something to sit between the data and those
platforms. The solution was the Hive CLI and Hive views, which could read from 
all their various forms of storage, and which could be read from all the
user-facing query engines and data platforms with no issues.&lt;/p&gt;

&lt;p&gt;By centralizing data, there was also the upside of easily integrating with
Privacera, which allowed for privacy policies to be implemented without much
issue. Users could request access to the data within the Hive views, and data
owners could approve or reject access as appropriate. Because of the
centralization, it was easy to go very fine-grained with data access rules,
allowing for access control as specific as column-level.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Alejandro Rojas, Cole Bowden</name>
        </author>
      

      <summary>At Comcast, data is used in a data mesh ecosystem, with a vision where users can discover data and request data through a self-service platform. With federation, various tools, and the ability to create, read, and write data with different platforms, it’s a full-blown data mesh. So how do you build that? With Trino, of course, and with the power of Hive views. Tune into the 10-minute lightning talk that Alejandro gave at Trino Fest to learn more about how Comcast pulled it off.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Comcast.png" />
      
    </entry>
  
    <entry>
      <title>DuneSQL - A query engine for blockchain data</title>
      <link href="https://trino.io/blog/2023/07/14/trino-fest-2023-dune.html" rel="alternate" type="text/html" title="DuneSQL - A query engine for blockchain data" />
      <published>2023-07-14T00:00:00+00:00</published>
      <updated>2023-07-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/14/trino-fest-2023-dune</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/14/trino-fest-2023-dune.html">&lt;p&gt;The need to make blockchain data easily accessible has risen over the recent
years due to the popularity of cryptocurrencies, NFTs, and other uses of
blockchains. Dune has made it their mission to make blockchain data more
accessible. Dune is a community data platform for querying public blockchain
data and building beautiful dashboards. They use their own query engine called
DuneSQL, built as an extension of Trino, to query blockchain data. In the session,
Miguel and Jonas from Dune talk about the challenges of querying blockchain
data, their transition to Trino, and how DuneSQL is operated. Watch the
recording of the session or keep reading for a recap.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/sCJncarnGdU&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Dune.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;The Dune community data platform is a serverless, open access, community-wide
collaboration portal. Dune experienced some difficulties with blockchain data,
such as processing and ingesting raw data, deserializing and decoding function
calls and arguments, and allowing the community to build abstractions. Their
engine, DuneSQL, is Trino with custom extensions that they created. It executes tens
of thousands of queries each day, which are saved and re-used.&lt;/p&gt;

&lt;p&gt;At first, Dune used PostgreSQL, where they sharded per blockchain and used
vertical scaling. However, they quickly ran into bottleneck issues on storage
size and IOPS (I/O operations per second). Thus, they switched to Apache Spark
with Databricks to allow horizontal scaling, process more blockchains, and
support their vast query volume. Unfortunately,
the result was not performant and not interactive enough. In the end, Miguel
says that, “Trino was our choice for performance reasons, for the good
environment and ecosystem, and to fully support our scheme and our datasets.”
Using Trino addressed the performance issues.&lt;/p&gt;

&lt;p&gt;Operating DuneSQL requires modifications and extensions of Trino to suit the
needs of the users and platform as a whole. DuneSQL needs to manage the whole
fleet and its capacity: they use over 4,000 CPUs per hour, make
more than 100 billion S3 requests per month, and operate over 10 clusters. To
handle the scheduling and load balancing of these massive operations, DuneSQL
uses query execution services and a
&lt;a href=&quot;https://github.com/lyft/presto-gateway&quot;&gt;gateway&lt;/a&gt;. Clusters have a fixed size to
provide predictable capacity and performance. The gateway fronts the clusters to
reduce the blast radius, so a failure in one cluster does not affect the others. Even
with all these adjustments, they still have work to do: they plan to reduce the
billions of S3 requests they make, improve data layout, and implement
sandboxed user defined functions.&lt;/p&gt;

&lt;p&gt;Interested in DuneSQL? Check out the video where Jonas goes over the
specificities and unique characteristics of DuneSQL.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Miguel Filipe, Jonas Irgens Kylling, Ryan Duan</name>
        </author>
      

      <summary>The need to make blockchain data easily accessible has risen over the recent years due to the popularity of cryptocurrencies, NFTs, and other uses of blockchains. Dune has made it their mission to make blockchain data more accessible. Dune is a community data platform for querying public blockchain data and building beautiful dashboards. They use their own query engine called DuneSQL, built as an extension of Trino, to query blockchain data. In the session, Miguel and Jonas from Dune talk about the challenges of querying blockchain data, their transition to Trino, and how DuneSQL is operated. Watch the recording of the session or keep reading for a recap.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Dune.png" />
      
    </entry>
  
    <entry>
      <title>Let it snow for Trino</title>
      <link href="https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap.html" rel="alternate" type="text/html" title="Let it snow for Trino" />
      <published>2023-07-12T00:00:00+00:00</published>
      <updated>2023-07-12T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap.html">&lt;p&gt;In this recap, we can skip right to the exciting part: through the joint efforts
of engineers at ForePaaS and Bloomberg, there is a Snowflake connector coming
to Trino! Though it hasn’t landed yet, it has been tested and run in production
at both companies, and a pull request is open and working its way towards
completion as this blog post goes up. In the talk, Yu and Erik talk about
difficulties in developing the connector, the motivations to make it happen, and
the new features that come as part of it for Trino users to take advantage of.
Sound interesting? Give the talk a listen, or read on for more details.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/kmpO_yM8OAs&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023LetItSnow.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those unfamiliar, Snowflake is a cloud-based data warehousing and analytics
platform. It offers a great combination of scale, flexibility, and performance,
with the downside of being proprietary, vendor-locked software: to use
Snowflake, you must go through Snowflake, Inc. ForePaaS and its
customers store data in Snowflake, but they also store data in many other 
formats and systems, and they rely on Trino to run their analytics. With no
Snowflake connector in Trino, this meant that while they could run analytics and
queries on most data, Trino had a blind spot. They needed to develop a Snowflake
connector in order to see and query 100% of their data. Bloomberg was in a
similar boat, having data in Snowflake, using Trino for analytics, and needing a
way to join those two together. With a shared need, ForePaaS and Bloomberg
joined forces and made the connector happen.&lt;/p&gt;

&lt;p&gt;The connector has been in use at both companies for some time, and it comes with
the full feature set one would expect from a Trino connector. With the connector,
you can query Snowflake directly from Trino, taking advantage of Trino’s
lightning-fast speeds and the underlying features of Snowflake with no issue.&lt;/p&gt;

&lt;p&gt;Curious to see more? For the rest of the talk, Erik Anderson at Bloomberg gives
a demo of the connector in action. Give the talk a watch, and you can check out
progress on how adding the connector to Trino is coming along on
&lt;a href=&quot;https://github.com/trinodb/trino/pull/17909&quot;&gt;the pull request contributing it&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Yu Teng, Erik Anderson, Cole Bowden</name>
        </author>
      

      <summary>In this recap, we can skip right to the exciting part: through the joint efforts of engineers at ForePaaS and Bloomberg, there is a Snowflake connector coming to Trino! Though it hasn’t landed yet, it has been tested and run in production at both companies, and a pull request is open and working its way towards completion as this blog post goes up. In the talk, Yu and Erik talk about difficulties in developing the connector, the motivations to make it happen, and the new features that come as part of it for Trino users to take advantage of. Sound interesting? Give the talk a listen, or read on for more details.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/ForePaaS%20and%20Bloomberg.png" />
      
    </entry>
  
    <entry>
      <title>Redis &amp; Trino - Real-time indexed SQL queries (new connector)</title>
      <link href="https://trino.io/blog/2023/07/10/trino-fest-2023-redis.html" rel="alternate" type="text/html" title="Redis &amp; Trino - Real-time indexed SQL queries (new connector)" />
      <published>2023-07-10T00:00:00+00:00</published>
      <updated>2023-07-10T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/10/trino-fest-2023-redis</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/10/trino-fest-2023-redis.html">&lt;p&gt;Ever since the pandemic, it has become clear that a digital-first
economy is increasingly necessary. As Redis’ Field CTO Allen Terleto
said during their talk from &lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest 2023&lt;/a&gt;, “In a digital first economy, data is the
lifeblood of the organization, which makes the databases the heart of enterprise
architectures”. Redis, a popular open source project, is a distributed in-memory
key–value database. It includes a cache, message broker, and optional
durability. In his talk, Allen demonstrates Redis’ new connector for Trino. It
can push down advanced queries and aggregations while leveraging Redis’ unique
in-memory secondary indexing. As a result, performance with the new connector is
much higher than with the existing one.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/JjBtZ26IHYk&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Redis is an open source, in-memory, NoSQL database that natively supports a
variety of data structures. Redis is designed for utmost performance and high
throughput use cases across different types of workloads. Redis is widely known
for being the fastest data store in the market with sub-millisecond performance,
its ease of use, and being a multi-model database. Redis is able to map
relational tables to a key-value database by adding a key-value pair as a hash
attribute for each column. However, how can you search for a certain key in a
way that scales well in high throughput databases? Redis has a unique way to
deal with this problem: secondary indexing and Redis Search.&lt;/p&gt;
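
&lt;p&gt;As a tiny illustration of that mapping (the key names and fields here are
made up), a relational row becomes a Redis hash with one field per column, and
Redis Search builds a secondary index over those fields:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# One relational row stored as a hash, one field per column
HSET user:1 name &quot;Ada&quot; city &quot;London&quot; age 36

# Secondary index over all user:* hashes via Redis Search
FT.CREATE idx:users ON HASH PREFIX 1 user: SCHEMA name TEXT city TAG age NUMERIC

# Indexed multi-field query instead of scanning every key
FT.SEARCH idx:users &quot;@city:{London} @age:[30 40]&quot;
&lt;/code&gt;&lt;/pre&gt;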

&lt;p&gt;Redis Search enables secondary indexing and full-text search, which allows Redis
to support many features such as multi-field queries, aggregations, exact phrase
matching, numeric filtering, geo-filtering, and vector similarity semantic
search on top of text queries. As Allen says, “Redis Search will be at the heart
of our new integration with Trino and game-changing better performance at scale
to the existing Redis Trino connector”. In addition, Redis supports a native
data model for JSON documents, allowing you to store, update, and retrieve JSON
values in a Redis database like other Redis data types. It also works with Redis
Search to let you index and query JSON documents.&lt;/p&gt;

&lt;p&gt;The syntax for Redis Search is a bit different from traditional SQL syntax, so
Redis is introducing a quicker and more reliable Redis-Trino connector that lets
you easily integrate with visualization frameworks and platforms that support
Trino. The connector is open source and publicly available on their
GitHub. In addition, it will be contributed directly to the Trino project.&lt;/p&gt;

&lt;p&gt;Want to see Redis in action? Check out the video where Julien does a demo on how
you can load data from some file system, relational database, or data warehouse
and query it without writing a single line of code.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Allen Terleto, Julien Ruaux, Ryan Duan</name>
        </author>
      

      <summary>Ever since the pandemic, it has become clear that a digital-first economy is increasingly necessary. As Redis’ Field CTO Allen Terleto said during their talk from Trino Fest 2023, “In a digital first economy, data is the lifeblood of the organization, which makes the databases the heart of enterprise architectures”. Redis, a popular open source project, is a distributed in-memory key–value database. It includes a cache, message broker, and optional durability. In his talk, Allen demonstrates Redis’ new connector for Trino. It can push down advanced queries and aggregations while leveraging Redis’ unique in-memory secondary indexing. As a result, performance with the new connector is much higher than with the existing one.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Redis.png" />
      
    </entry>
  
    <entry>
      <title>Skip rocks and files: Turbocharge Trino queries with Hudi’s multi-modal indexing subsystem</title>
      <link href="https://trino.io/blog/2023/07/07/trino-fest-2023-onehouse-recap.html" rel="alternate" type="text/html" title="Skip rocks and files: Turbocharge Trino queries with Hudi’s multi-modal indexing subsystem" />
      <published>2023-07-07T00:00:00+00:00</published>
      <updated>2023-07-07T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/07/trino-fest-2023-onehouse-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/07/trino-fest-2023-onehouse-recap.html">&lt;p&gt;Optimizing data access and query performance is crucial to building low-latency
applications and running analytics. Even with the modern data lakehouse designed
to be as efficient and performant as possible, there are a number of bottlenecks
that can slow things down and plenty of challenges to overcome. Nadine and Sagar
explored this at Trino Fest, introducing us to multi-modal indexing and the
metadata table in Hudi, how they work, and how leveraging them with Trino can
unlock queries faster than ever before.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/IiDOmAEOXUM&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Onehouse.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;When you’re building large-scale data-based applications, bottlenecks are
inevitable. Finding ways to address these bottlenecks and optimizing your
platform to avoid them can be hugely costly, so it pays to know your
requirements. In the same vein, if you know the types of services and features
you need to effectively scale, you can build with them in mind from the ground
up. Hudi has a couple of key features you might be interested in that aren’t
present in all lakehouses:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Write indexing, speeding up and optimizing inserts and upserts&lt;/li&gt;
  &lt;li&gt;Automated table services, which handle clustering, cleaning, compacting,
and metadata indexing without any need for manual orchestration or overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nadine also goes on a deep dive into exactly how the Hudi table format works,
but emphasizes that these extra features elevate it to being an entire platform,
not just a table format.&lt;/p&gt;

&lt;p&gt;From there, Nadine passes things off to Sagar, who explains the multi-modal
indexing subsystem in Hudi, which features a scalable metadata table, different
types of indexes, and an async indexer. All of these features minimize tradeoffs
while maximizing performance, helping you read and write data faster than ever.
And with Trino’s Hudi connector, the Trino coordinator is able to read the
feature-rich Hudi metadata and delegate work to workers more effectively,
leveraging that speed as the best-in-class query engine for running analytics on
your data stored in Hudi.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Nadine Farah, Sagar Sumit, Cole Bowden</name>
        </author>
      

      <summary>Optimizing data access and query performance is crucial to building low-latency applications and running analytics. Even with the modern data lakehouse designed to be as efficient and performant as possible, there are a number of bottlenecks that can slow things down and plenty of challenges to overcome. Nadine and Sagar explored this at Trino Fest, introducing us to multi-modal indexing and the metadata table in Hudi, how they work, and how leveraging them with Trino can unlock queries faster than ever before.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Onehouse.png" />
      
    </entry>
  
    <entry>
      <title>AWS Athena (Trino) in the cybersecurity space</title>
      <link href="https://trino.io/blog/2023/07/05/trino-fest-2023-arcticwolf.html" rel="alternate" type="text/html" title="AWS Athena (Trino) in the cybersecurity space" />
      <published>2023-07-05T00:00:00+00:00</published>
      <updated>2023-07-05T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/05/trino-fest-2023-arcticwolf</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/05/trino-fest-2023-arcticwolf.html">&lt;p&gt;Arctic Wolf Networks, a cybersecurity company that provides security monitoring
against cyber threats, is one of the companies that have recently switched to using
AWS Athena as a new and efficient service to query their data using Trino. AWS
Athena is a serverless, interactive analytics service built on open-source
frameworks that runs on Trino, supporting open table and file formats and
providing a simplified, flexible way to analyze petabytes of data where it
lives. Senior software developer Anas Shakra from Arctic Wolf Networks gave a
talk at &lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest 2023&lt;/a&gt;
detailing their switch to AWS Athena and how “queries that took hours with old
solution now take around a minute today”. Tune in to the talk, or read the
recap!&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/WCuJaW7zC8k&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;At Arctic Wolf, data access use cases fall under three categories: investigations,
compliance, and the customer self-serve platform. The process of preparing the
data follows an established pattern of starting with a datastore, performing an
operation to filter or transform the data, and then outputting the data in a
format like CSV or JSON, depending on the client’s needs. Arctic Wolf’s custom
legacy service was unable to match the growing service demand and had four main
problems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Optimized for breadth over depth&lt;/li&gt;
  &lt;li&gt;Struggles to handle growing service demand&lt;/li&gt;
  &lt;li&gt;Proprietary query language&lt;/li&gt;
  &lt;li&gt;Complicated design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This compelled Anas’ team to find a different and improved service: Trino as
provided by AWS Athena.&lt;/p&gt;

&lt;p&gt;They had four main objectives for the new service: defined access patterns,
performance at scale, user-friendliness, and deterministic pricing. AWS Athena
satisfied these objectives, while also providing numerous benefits such as using
a powerful query engine, being purpose-built for large datasets, using SQL
syntax, and having a clear pricing structure. However, with these benefits come
some drawbacks for Athena. These include being subject to quota limits, having
suboptimal file sizes for their system, and being unable to control access
sufficiently. Anas addresses this by using log queries that resolve these three
main impediments. As a next step, Anas is considering switching to a self-managed
Trino deployment for more control with the same performance gains.&lt;/p&gt;

&lt;p&gt;Want to learn more about log queries that they use? Check out Anas’ explanation
in the video!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Anas Shakra, Ryan Duan</name>
        </author>
      

      <summary>Arctic Wolf Networks, a cybersecurity company that provides security monitoring against cyber threats, is one of the companies that have recently switched to using AWS Athena as a new and efficient service to query their data using Trino. AWS Athena is a serverless, interactive analytics service built on open-source frameworks that runs on Trino, supporting open table and file formats and providing a simplified, flexible way to analyze petabytes of data where it lives. Senior software developer Anas Shakra from Arctic Wolf Networks gave a talk at Trino Fest 2023 detailing their switch to AWS Athena and how “queries that took hours with old solution now take around a minute today”. Tune in to the talk, or read the recap!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/ArcticWolf.png" />
      
    </entry>
  
    <entry>
      <title>Ibis: Because SQL is everywhere and so is Python</title>
      <link href="https://trino.io/blog/2023/07/03/trino-fest-2023-ibis.html" rel="alternate" type="text/html" title="Ibis: Because SQL is everywhere and so is Python" />
      <published>2023-07-03T00:00:00+00:00</published>
      <updated>2023-07-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/07/03/trino-fest-2023-ibis</id>
      <content type="html" xml:base="https://trino.io/blog/2023/07/03/trino-fest-2023-ibis.html">&lt;p&gt;The PyData stack has been described as “unreasonably effective,” empowering its
users to glean insights and analyze moderate amounts of data with a high level
of flexibility and excellent visualization. The large-scale, production data
stack using a query engine like Trino sits on the other side of the world,
capable of handling petabytes and exabytes, but perhaps not integrating as
seamlessly with the Python ecosystem as one would hope. SQL has been a means of
bridging this gap, but we’ve now got an exciting solution to bridge it even
better: Ibis.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/JMUtPl-cMRc&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Ibis.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A major problem with bridging the gap between Python and SQL engines has been
the lack of standardization in SQL. Though Trino prides itself on being
ANSI-compliant and many other SQL dialects strive to be similar, the reality is
that every SQL engine is different, and a complicated SQL query will error out
or return different results based on what engine you’re using. So if you want to
convert some Python code to SQL, the question is… which SQL? If you’re doing
your data analysis in Python because you prefer to use it, spending time
scratching your head and trying to work out a SQL conversion can be frustrating,
time-consuming, and painful. But SQL is everywhere, and for large, performant,
efficient queries, you may need a SQL engine like Trino.&lt;/p&gt;

&lt;p&gt;Enter Ibis, a lightweight Python library for “data wrangling.” It can easily
convert your Python code into SQL queries for 16 different engines, including
Trino. With Ibis, you can leverage the ease of writing Python code with the
power and performance of running queries in Trino, getting the best of both
worlds in both the Python and SQL ecosystems. Want to learn more? Check out
&lt;a href=&quot;https://ibis-project.org/&quot;&gt;the Ibis project website&lt;/a&gt;, give the talk a listen,
and tune in to the Trino Community Broadcast on July 6th, where we’ll be going
into even more detail about Ibis.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Phillip Cloud, Cole Bowden</name>
        </author>
      

      <summary>The PyData stack has been described as “unreasonably effective,” empowering its users to glean insights and analyze moderate amounts of data with a high level of flexibility and excellent visualization. The large-scale, production data stack using a query engine like Trino sits on the other side of the world, capable of handling petabytes and exabytes, but perhaps not integrating as seamlessly with the Python ecosystem as one would hope. SQL has been a means of bridging this gap, but we’ve now got an exciting solution to bridge it even better: Ibis.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Ibis.png" />
      
    </entry>
  
    <entry>
      <title>CDC patterns in Apache Iceberg</title>
      <link href="https://trino.io/blog/2023/06/30/trino-fest-2023-apacheiceberg.html" rel="alternate" type="text/html" title="CDC patterns in Apache Iceberg" />
      <published>2023-06-30T00:00:00+00:00</published>
      <updated>2023-06-30T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/06/30/trino-fest-2023-apacheiceberg</id>
      <content type="html" xml:base="https://trino.io/blog/2023/06/30/trino-fest-2023-apacheiceberg.html">&lt;p&gt;Have you ever wanted to keep your data in a table and have an efficient way to
interact with it? Iceberg, an open standard table format, is
exactly what you need. One of the great and unique features of the Iceberg
table format is its support for change data capture (CDC). Co-creator of
Apache Iceberg, Ryan Blue, presented at &lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest 2023&lt;/a&gt; this past week detailing the CDC support
and the trade-offs between different patterns that can be used for writing
CDC streams into Iceberg tables.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/GM7EvRc7_is&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Iceberg.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;To begin, what is CDC and why should you use it? CDC is the idea that when
relational or transactional tables are modified, you emit an update stream.
This enables you to keep copies in sync by capturing changes to tables as
they happen. As Ryan states, “[CDC] is very lightweight on the source
database … rather than being super careful with what we run on the database,
what we want to do is just make a copy of it very easily and maintain that
copy.” Ryan continues with an example of a bank using a transactional table
in Iceberg to provide some context on what’s going on.&lt;/p&gt;

&lt;p&gt;Although CDC has many advantages, there are also some problems that make it
difficult:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Lower latency means more work&lt;/li&gt;
  &lt;li&gt;Write amplification - the work necessary to balance the trade-offs between
efficiency at write time and efficiency at read time&lt;/li&gt;
  &lt;li&gt;Batch writes with double update and possible inconsistency&lt;/li&gt;
  &lt;li&gt;Read requirements with the different types of deletes in a table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these types of problems, the trade-offs between the different patterns
become all the more important, given the need for utmost efficiency. The first
trade-offs that Ryan discusses are the storage trade-offs between using direct
writes and a change log table, which he considers the most important and often
overlooked decision. The next trade-offs concern the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; pattern’s
choice of lazy merge (merge-on-read) or eager merge (copy-on-write). In
addition, the commit frequency trade-offs have different benefits depending on
whether you prefer faster or slower commits. The change log pattern and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; pattern both
have benefits you may want, so Ryan suggests a hybrid of the two that may give
you the best of both. With Iceberg, you have the choice: the different CDC
patterns are supported so you can adjust your usage to your specific needs.
Check out the video and review the slides for more details!&lt;/p&gt;

&lt;p&gt;Want to read more about CDC? Check out some of Ryan Blue’s blog posts:
&lt;a href=&quot;https://tabular.io/blog/hello-world-of-cdc/&quot;&gt;Hello, World of CDC!&lt;/a&gt; and &lt;a href=&quot;https://tabular.io/blog/cdc-data-gremlins/&quot;&gt;CDC
Data Gremlins&lt;/a&gt;!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Ryan Blue, Ryan Duan</name>
        </author>
      

      <summary>Have you ever wanted to keep your data in a table and have an efficient way to interact with it? Iceberg, an open standard table format, is exactly what you need. One of the great and unique features of the Iceberg table format is its support for change data capture (CDC). Co-creator of Apache Iceberg, Ryan Blue, presented at Trino Fest 2023 this past week detailing the CDC support and the trade-offs between different patterns that can be used for writing CDC streams into Iceberg tables.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/ApacheIceberg.png" />
      
    </entry>
  
    <entry>
      <title>Zero-cost reporting</title>
      <link href="https://trino.io/blog/2023/06/28/trino-fest-2023-starburst-recap.html" rel="alternate" type="text/html" title="Zero-cost reporting" />
      <published>2023-06-28T00:00:00+00:00</published>
      <updated>2023-06-28T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/06/28/trino-fest-2023-starburst-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/06/28/trino-fest-2023-starburst-recap.html">&lt;p&gt;Let’s say you have some data. Maybe it’s in a spreadsheet, a CSV file, a
relational database, or multiple terabytes of data in an S3 bucket. You need
to run SQL queries on this data, and you’d like to share those results with your
teammates, coworkers, and partner teams, but you want to do it in a way that
allows everyone to view those results on-demand, on the web, and with the latest
results without the need for any manual effort on your part.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/586qvEyuO_U&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;There are a lot of tools that might be able to do this for you, but whatever you
choose, you’ll need to spend time or money to set it up, and you don’t want to
spend a lot. With so many options, there’s the possibility of getting stuck in
analysis paralysis, and trying to find the best way forward may leave you
stymied. Jan Waś from Starburst has a suggestion: keep it simple with Trino,
plaintext files, Git, and GitHub Actions, and you can set it all up for free.&lt;/p&gt;

&lt;p&gt;To start, why put results into plaintext files? With Markdown, files are both
human-legible and machine-readable. By saving queries in normal files, it’s easy
to see and edit those queries. You can commit your queries and results to Git,
and then you can push them to a service like GitHub, where those files will be
even more readable thanks to the web UI. Then, once on GitHub, you can use the
power of GitHub Actions to re-run the queries, update your results on a
schedule, and keep things up to date for teammates to view via GitHub Pages.
Sound neat? Check out the talk to see how Jan does it!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Jan Waś, Cole Bowden</name>
        </author>
      

      <summary>Let’s say you have some data. Maybe it’s in a spreadsheet, a CSV file, a relational database, or multiple terabytes of data in an S3 bucket. You need to run SQL queries on this data, and you’d like to share those results with your teammates, coworkers, and partner teams, but you want to do it in a way that allows everyone to view those results on-demand, on the web, and with the latest results without the need for any manual effort on your part.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Starburst.png" />
      
    </entry>
  
    <entry>
      <title>Anomaly detection for Salesforce’s production data using Trino</title>
      <link href="https://trino.io/blog/2023/06/26/trino-fest-2023-salesforce.html" rel="alternate" type="text/html" title="Anomaly detection for Salesforce’s production data using Trino" />
      <published>2023-06-26T00:00:00+00:00</published>
      <updated>2023-06-26T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/06/26/trino-fest-2023-salesforce</id>
      <content type="html" xml:base="https://trino.io/blog/2023/06/26/trino-fest-2023-salesforce.html">&lt;p&gt;Rolling into our next presentation from &lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest 2023&lt;/a&gt;, we’re excited to bring you
Tuli Nivas and Geeta Shankar’s talk from the Performance Engineering Team at
Salesforce. They provide numerous reasons why they need Trino and
further explain how it is essential for anomaly detection in
their data. It’s an insightful talk about using a query engine to ensure data
quality and how switching to Trino has massively improved their performance.
You definitely don’t want to miss it.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/nFuqpb2GjVI&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Salesforce.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Salesforce provides customer relationship management software and applications
focused on sales, customer service, marketing automation, e-commerce, analytics,
and application development. They host hundreds of thousands of customers that
generate millions of transactions per day. For a company of this size, they
need a query engine that is fast and efficient. During the talk, Tuli made it
clear how much Salesforce relies on Trino, stating, “Trino has been a one-stop
shop for analytics.” Trino is the perfect solution for them, as Tuli mentions,
“Because of how well Trino scales and how efficiently it has been able to
process even the most gnarly looking queries.” It allows them to do everything
they need.&lt;/p&gt;

&lt;p&gt;In addition, Trino has helped Salesforce get more value from their production
logging data by accelerating their access to it, speeding up their decision
making. For years, they used Splunk for all their production data, but after
switching to Trino, they have had numerous improvements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Reducing their team’s analytics cost&lt;/li&gt;
  &lt;li&gt;Improving their cost-to-serve&lt;/li&gt;
  &lt;li&gt;Improving the time it takes to run the same query by 194%&lt;/li&gt;
  &lt;li&gt;Providing an SLA of 20-minute latency on all production logs&lt;/li&gt;
  &lt;li&gt;Retaining and accessing data for up to 2 years, compared to Splunk’s 30 days&lt;/li&gt;
  &lt;li&gt;Reducing the number of queries needed, which creates a smaller footprint&lt;/li&gt;
  &lt;li&gt;Creating tables and views for temporary data storage and analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this, they use specific heuristics to create an anomaly detection framework
with a very quick response time that they can monitor constantly. This also lets
them observe customer behavior efficiently and respond quickly to any urgent
changes. In the future, they plan to expand and ramp up their usage of Trino
throughout their teams.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Tuli Nivas, Geeta Shankar, Ryan Duan</name>
        </author>
      

      <summary>Rolling into our next presentation from Trino Fest 2023, we’re excited to bring you Tuli Nivas and Geeta Shankar’s talk from the Performance Engineering Team at Salesforce. They provide numerous reasons why they need Trino and further explain how it is essential for anomaly detection in their data. It’s an insightful talk about using a query engine to ensure data quality and how switching to Trino has massively improved their performance. You definitely don’t want to miss it.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Salesforce.png" />
      
    </entry>
  
    <entry>
      <title>Trino for lakehouses, data oceans, and beyond</title>
      <link href="https://trino.io/blog/2023/06/22/trino-fest-2023-keynote-recap.html" rel="alternate" type="text/html" title="Trino for lakehouses, data oceans, and beyond" />
      <published>2023-06-22T00:00:00+00:00</published>
      <updated>2023-06-22T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/06/22/trino-fest-2023-keynote-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/06/22/trino-fest-2023-keynote-recap.html">&lt;p&gt;&lt;a href=&quot;/blog/2023/06/20/trino-fest-2023-recap.html&quot;&gt;Trino Fest 2023&lt;/a&gt; got off to a
bang, as Trino co-creator and maintainer Martin Traverso gave an update on all
the amazing things that have happened to Trino since
&lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit last year&lt;/a&gt;. He
also provided some insight into what’s coming down the pipeline for Trino, with
a brief look at the project’s roadmap. You can watch the recording of the talk
if you want to see for yourself, or you can read on for the highlights.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/SJ1h-I7HoII&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-fest-2023/TrinoFest2023Keynote.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;It’s only been about 7 months since Trino Summit in 2022, but Trino moves
quickly. In the words of Martin, “the project is on fire” and “is as active as
it’s ever been,” leaving us a lot to catch up on since then:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;16 releases and 2,250 commits&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/episodes/47.html&quot;&gt;Two new maintainers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Several new table functions&lt;/li&gt;
  &lt;li&gt;Simplified configuration and improved performance for fault-tolerant execution&lt;/li&gt;
  &lt;li&gt;Better support for schema evolution and lakehouse migration&lt;/li&gt;
  &lt;li&gt;45 bullet points worth of performance improvements&lt;/li&gt;
  &lt;li&gt;Tracing with OpenTelemetry&lt;/li&gt;
  &lt;li&gt;An improved Python client and dbt Cloud support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And keep in mind that these are the highlights of the highlights! In the talk,
Martin goes into depth on all of the above, making it a worthwhile watch or
listen. There’s also a lot to look forward to, which you’ll hear more about as
they roll out in the coming months:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;SQL 2023, including enhancements to JSON functions and numeric literals&lt;/li&gt;
  &lt;li&gt;A new Snowflake connector and an improved Redis connector&lt;/li&gt;
  &lt;li&gt;Java 21&lt;/li&gt;
  &lt;li&gt;Project Hummingbird, the ongoing effort to incrementally make Trino faster
than ever before&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso, Cole Bowden</name>
        </author>
      

      <summary>Trino Fest 2023 got off to a bang, as Trino co-creator and maintainer Martin Traverso gave an update on all the amazing things that have happened to Trino since Trino Summit last year. He also provided some insight into what’s coming down the pipeline for Trino, with a brief look at the project’s roadmap. You can watch the recording of the talk if you want to see for yourself, or you can read on for the highlights.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/Keynote.png" />
      
    </entry>
  
    <entry>
      <title>Trino Fest 2023 recap</title>
      <link href="https://trino.io/blog/2023/06/20/trino-fest-2023-recap.html" rel="alternate" type="text/html" title="Trino Fest 2023 recap" />
      <published>2023-06-20T00:00:00+00:00</published>
      <updated>2023-06-20T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/06/20/trino-fest-2023-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2023/06/20/trino-fest-2023-recap.html">&lt;p&gt;Last week we held Trino Fest, and it kept us all so busy, we forgot to spend
time chilling by the lakehouse! Great demos, amazing announcements, new plugins,
and use cases reached our active audience. Thanks go to our event host and
organizer &lt;a href=&quot;https://www.starburst.io/&quot;&gt;Starburst&lt;/a&gt;, to our sponsors
&lt;a href=&quot;https://aws.amazon.com/&quot;&gt;AWS&lt;/a&gt; and &lt;a href=&quot;https://www.alluxio.io/&quot;&gt;Alluxio&lt;/a&gt;, to our
many well-prepared speakers, and to our great live audience. Now you get a
chance to catch up on anything you missed.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;container&quot;&gt;
  &lt;div class=&quot;row&quot;&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.starburst.io/&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/starburst-small.png&quot; title=&quot; Starburst, event host and organizer &quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://www.alluxio.io/&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/alluxio-small.png&quot; title=&quot;Alluxio, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
    &lt;div class=&quot;col-sm&quot;&gt;
      &lt;a href=&quot;https://aws.amazon.com/&quot;&gt;
        &lt;img src=&quot;https://trino.io/assets/images/logos/aws-small.png&quot; title=&quot;AWS, event sponsor&quot; /&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;In the weeks leading up to the event, we published numerous blog posts, and
racked up great interest in the Trino community and beyond. Over 1100
registrations blew away our numbers from last year. More importantly, during the
two half-days of the event, we had over 560 attendees watching live and
participating in the busy chat.&lt;/p&gt;

&lt;h2 id=&quot;sessions&quot;&gt;Sessions&lt;/h2&gt;

&lt;p&gt;If you could not attend every session, or if you missed the event entirely,
we’ve got great news for you! You still have a chance to learn
from the presentations and the experience and knowledge of our speakers.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/06/22/trino-fest-2023-keynote-recap.html&quot;&gt;Trino for lakehouses, data oceans, and beyond&lt;/a&gt;
presented by Martin Traverso, co-creator of Trino and CTO at
&lt;a href=&quot;https://www.starburst.io/&quot;&gt;Starburst&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/06/26/trino-fest-2023-salesforce.html&quot;&gt;Anomaly detection for Salesforce’s production data using
Trino&lt;/a&gt; presented by Geeta Shankar and Tuli Nivas
from &lt;a href=&quot;https://www.salesforce.com/&quot;&gt;Salesforce&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/06/28/trino-fest-2023-starburst-recap.html&quot;&gt;Zero-cost reporting&lt;/a&gt; presented by Jan Waś from
&lt;a href=&quot;https://www.starburst.io/&quot;&gt;Starburst&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/06/30/trino-fest-2023-apacheiceberg.html&quot;&gt;CDC patterns in Apache Iceberg&lt;/a&gt; presented by Ryan
Blue from &lt;a href=&quot;https://tabular.io/&quot;&gt;Tabular&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/03/trino-fest-2023-ibis.html&quot;&gt;Ibis: Because SQL is everywhere and so is Python&lt;/a&gt;
presented by Phillip Cloud from &lt;a href=&quot;https://voltrondata.com/&quot;&gt;Voltron Data&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/05/trino-fest-2023-arcticwolf.html&quot;&gt;AWS Athena (Trino) in the cybersecurity space&lt;/a&gt;
presented by Anas Shakra from &lt;a href=&quot;https://arcticwolf.com/&quot;&gt;Arctic Wolf&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/07/trino-fest-2023-onehouse-recap.html&quot;&gt;Skip rocks and files: Turbocharge Trino queries with Hudi’s multi-modal
indexing subsystem&lt;/a&gt;
presented by Nadine Farah and Sagar Sumit from &lt;a href=&quot;https://www.onehouse.ai/&quot;&gt;OneHouse&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/10/trino-fest-2023-redis.html&quot;&gt;Redis &amp;amp; Trino - Real-time indexed SQL queries (new
connector)&lt;/a&gt; presented by Allen Terleto and
Julien Ruaux from &lt;a href=&quot;https://redis.com/&quot;&gt;Redis&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/12/trino-fest-2023-let-it-snow-recap.html&quot;&gt;Let it SNOW for Trino&lt;/a&gt;
presented by Erik Anderson from &lt;a href=&quot;https://www.bloomberg.com/company/values/tech-at-bloomberg/open-source/projects/&quot;&gt;Bloomberg&lt;/a&gt;
and Yu Teng from &lt;a href=&quot;https://www.ovhcloud.com/en-ie/public-cloud/data-platform/&quot;&gt;ForePaaS&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/14/trino-fest-2023-dune.html&quot;&gt;DuneSQL, a query engine for blockchain data&lt;/a&gt; presented by Miguel Filipe and Jonas
Irgens Kylling from &lt;a href=&quot;https://dune.com/&quot;&gt;Dune&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/17/trino-fest-2023-comcast-recap.html&quot;&gt;Data Mesh implementation using Hive views&lt;/a&gt;
presented by Alejandro Rojas from &lt;a href=&quot;https://comcast.github.io/&quot;&gt;Comcast&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/19/trino-fest-2023-stripe.html&quot;&gt;Inspecting Trino on ice&lt;/a&gt; presented by Kevin Liu
from &lt;a href=&quot;https://stripe.com/&quot;&gt;Stripe&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/21/trino-fest-2023-alluxio-recap.html&quot;&gt;Trino optimization with distributed caching on Data Lake&lt;/a&gt;
presented by Hope Wang and Beinan Wang from &lt;a href=&quot;https://www.alluxio.io/&quot;&gt;Alluxio&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/25/trino-fest-2023-datto.html&quot;&gt;Starburst Galaxy: A romance of many architectures&lt;/a&gt; presented by Benjamin Jeter from
&lt;a href=&quot;https://www.datto.com/&quot;&gt;Datto&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2023/07/27/trino-fest-2023-fugue-recap.html&quot;&gt;FugueSQL, Interoperable Python and Trino for interactive workloads&lt;/a&gt;
presented by &lt;a href=&quot;https://www.linkedin.com/in/kvnkho/&quot;&gt;Kevin Kho&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;next-up&quot;&gt;Next up&lt;/h2&gt;

&lt;p&gt;This first recap shares all the video recordings with you, in case you can’t
wait. But stay tuned, because we’ll also be publishing individual recap blog
posts for each session, and they’ll include additional useful info:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Summary of the main lessons and takeaways from the session&lt;/li&gt;
  &lt;li&gt;Slide decks for you to browse on your own&lt;/li&gt;
  &lt;li&gt;Interesting and fun quotes from the speakers and audience&lt;/li&gt;
  &lt;li&gt;Notes and impressions from the audience and event hosts&lt;/li&gt;
  &lt;li&gt;Questions and answers during the event&lt;/li&gt;
  &lt;li&gt;Links to further documentation, tutorials, and other resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll be rolling out recap posts for a few talks each week, so keep an eye out
on our &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;community chat&lt;/a&gt; or the website for updates.&lt;/p&gt;

&lt;p&gt;At the same time, we are already marching ahead and planning towards our next
major event in autumn. Trino Summit 2023 - here we come!&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Cole Bowden</name>
        </author>
      

      <summary>Last week we held Trino Fest, and it kept us all so busy, we forgot to spend time chilling by the lakehouse! Great demos, amazing announcements, new plugins, and use cases reached our active audience. Thanks go to our event host and organizer Starburst, to our sponsors AWS and Alluxio, to our many well-prepared speakers, and to our great live audience. Now you get a chance to catch up on anything you missed.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/trino-fest.png" />
      
    </entry>
  
    <entry>
      <title>Trino Fest nears with an all-star lineup</title>
      <link href="https://trino.io/blog/2023/06/01/trino-fest-hype-speaker-lineup.html" rel="alternate" type="text/html" title="Trino Fest nears with an all-star lineup" />
      <published>2023-06-01T00:00:00+00:00</published>
      <updated>2023-06-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/06/01/trino-fest-hype-speaker-lineup</id>
      <content type="html" xml:base="https://trino.io/blog/2023/06/01/trino-fest-hype-speaker-lineup.html">&lt;p&gt;Trino Fest is just around the corner! We’re only two weeks away, and we’re
excited to share that we’ve got an incredible speaker lineup with a wide variety
of talks about all things Trino. If you’re out of the loop,
&lt;a href=&quot;/2023-04-05-announcing-trino-fest-2023.html&quot;&gt;we announced Trino Fest&lt;/a&gt; back in
April as a two-day, free, virtual event. If you want to attend, see talks live,
and engage with our speakers in Q&amp;amp;As at the end of each session, you’ll need to
register, so don’t delay, and…&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-orange&quot; href=&quot;https://www.starburst.io/info/trinofest/&quot;&gt;
        Register to attend!
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;With that said, we’re also excited to bring you a preview of our
speaker lineup. Read on if you’d like to learn more.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;new-connectors&quot;&gt;New connectors&lt;/h2&gt;

&lt;p&gt;We’ve got two talks, one from Bloomberg and ForePaaS and another from Redis,
about ongoing efforts to extend Trino’s functionality to query even more data
sources. Erik Anderson from Bloomberg and Yu Teng from ForePaaS will talk about
their shared need for a Snowflake connector and the collaboration to combine
their two connectors and get the result merged into Trino. Allen Terleto and Julien Ruaux
at Redis will be talking about a new, custom, and improved Redis connector for
Trino, showing how you can leverage the speed of both Redis and Trino to run
queries faster than ever while seamlessly integrating with data visualization
frameworks.&lt;/p&gt;

&lt;h2 id=&quot;the-python-ecosystem&quot;&gt;The Python ecosystem&lt;/h2&gt;

&lt;p&gt;We’ve got talks from &lt;a href=&quot;https://github.com/fugue-project/fugue&quot;&gt;Fugue&lt;/a&gt; and
&lt;a href=&quot;https://ibis-project.org/&quot;&gt;Ibis&lt;/a&gt;, two different tools that integrate Python
with SQL, and then run that SQL on underlying data sources. Both have recently
added Trino support, and they’re excited to share their use cases and introduce
the Trino community to the new, powerful ways you can leverage it. Trino has
always been a SQL query engine, but with Fugue and Ibis, writing Python code to
run queries with Trino is suddenly a reality, and analysts and data scientists
may not even need to know much SQL to get the insights they’re looking for.&lt;/p&gt;

&lt;h2 id=&quot;data-lakes&quot;&gt;Data lakes&lt;/h2&gt;

&lt;p&gt;Ryan Blue, the co-founder of Iceberg and founder of Tabular, will be exploring
how to best write CDC (change data capture) streams into Iceberg tables. A talk
from Kevin Liu at Stripe will explore how a data engineer can monitor queries
being run on Iceberg to catch performance outliers and understand usage rates. A
talk from Alluxio highlights caching optimizations with Trino and data lakes.
OneHouse is giving a talk about using Trino with Hudi, exploring how to get
query latency down, how multi-modal indexing works in Hudi, and how Trino can
utilize that indexing to execute queries at astonishing speeds. A lightning talk
from Comcast will explore Hive views, and DuneSQL will be discussing its use of
Trino with Delta Lake, rounding out coverage on all four of Trino’s lakehouse
connectors.&lt;/p&gt;

&lt;h2 id=&quot;and-more&quot;&gt;And more!&lt;/h2&gt;

&lt;p&gt;We’ll hear from customers of Trino’s main commercial vendors - Datto will be
discussing their use of Starburst Galaxy, and Arctic Wolf will give an overview
of how AWS Athena helps them provide data to customers. Jan Waś from Starburst
has a lightning talk on avoiding the costs of BI tools or expensive
visualization software by setting things up for free with GitHub Actions. And
Walmart has a talk on finding ways to cut costs with cloud storage, rounding out
our expansive lineup.&lt;/p&gt;

&lt;p&gt;Does any of that sound exciting?
&lt;a href=&quot;https://www.starburst.io/info/trinofest/&quot;&gt;Go sign up to attend Trino Fest 2023&lt;/a&gt;,
and we look forward to seeing you there!&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>Trino Fest is just around the corner! We’re only two weeks away, and we’re excited to share that we’ve got an incredible speaker lineup with a wide variety of talks about all things Trino. If you’re out of the loop, we announced Trino Fest back in April as a two-day, free, virtual event. If you want to attend, see talks live, engage with our speakers in Q&amp;amp;As at the end of each session, you’ll need to register, so don’t delay, and… Register to attend! With that said, we’re also excited to bring you a preview of our exciting speaker lineup. Read on if you’d like to learn more.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/trino-fest-featured-talks.png" />
      
    </entry>
  
    <entry>
      <title>Trino at Open Source Summit North America 2023</title>
      <link href="https://trino.io/blog/2023/05/15/oss-na.html" rel="alternate" type="text/html" title="Trino at Open Source Summit North America 2023" />
      <published>2023-05-15T00:00:00+00:00</published>
      <updated>2023-05-15T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/05/15/oss-na</id>
      <content type="html" xml:base="https://trino.io/blog/2023/05/15/oss-na.html">&lt;p&gt;Last week, I had the pleasure to attend &lt;a href=&quot;https://events.linuxfoundation.org/open-source-summit-north-america/&quot;&gt;Open Source Summit North America
2023&lt;/a&gt; in
Vancouver. A quick hop across the &lt;a href=&quot;https://en.wikipedia.org/wiki/Strait_of_Georgia&quot;&gt;Strait of
Georgia&lt;/a&gt; got me right into the
event and into the midst of my peers of open source developers, advocates, and
enthusiasts.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;A highlight of the event for me was catching up with many existing and new
friends from the open source communities. It was inspiring to learn details
about the success of open source projects, including
&lt;a href=&quot;https://opensearch.org/&quot;&gt;OpenSearch&lt;/a&gt;, &lt;a href=&quot;https://riscv.org/about/&quot;&gt;RISC-V&lt;/a&gt;, the
British Columbia government &lt;a href=&quot;https://developer.gov.bc.ca/&quot;&gt;DevHub project&lt;/a&gt;, NASA
&lt;a href=&quot;https://code.nasa.gov/&quot;&gt;open source&lt;/a&gt; and &lt;a href=&quot;https://data.nasa.gov/&quot;&gt;open data
projects&lt;/a&gt;, and many others.&lt;/p&gt;

&lt;p&gt;In my interview with John Furrier and Rob Strechay for &lt;a href=&quot;https://www.thecube.net/&quot;&gt;SiliconANGLE
theCUBE&lt;/a&gt;, I was able to share more information about
Trino, query engines, lakehouses, and &lt;a href=&quot;https://starburst.io&quot;&gt;Starburst&lt;/a&gt;. We also
talked about the benefits of using Trino for different use cases, how data
continues to be crucial, and how it has become even more important thanks to
the new wave of large language models.&lt;/p&gt;

&lt;div style=&quot;padding-bottom: 1rem&quot;&gt;
  &lt;a class=&quot;btn btn-orange&quot; style=&quot;display: inline-grid;&quot; href=&quot;https://siliconangle.com/2023/05/11/making-data-accessibility-faster-and-friendly-using-distributed-query-insights-ossummit/&quot; target=&quot;_blank&quot;&gt;Read more about the interview and watch the video&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;SiliconANGLE theCUBE features &lt;a href=&quot;https://www.thecube.net/events/linux-foundation/open-source-summit-na-2023&quot;&gt;more interview coverage from the
summit&lt;/a&gt;,
and The Linux Foundation &lt;a href=&quot;https://events.linuxfoundation.org/open-source-summit-north-america/&quot;&gt;makes keynote and session videos as well as
presentation decks available&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;My special thanks goes to Starburst for sending me to represent the Trino
community at the summit. I also really appreciate the help with organizing Trino
Fest. The speaker proposals are all in, and the free, virtual event promises
to be a great showcase of Trino, modern lakehouse platforms and tools from the
community of users, contributors and vendors, and our increased adoption for a
wide range of use cases.&lt;/p&gt;

&lt;div style=&quot;padding-bottom: 1rem&quot;&gt;
  &lt;a class=&quot;btn btn-pink&quot; style=&quot;display: inline-grid;&quot; href=&quot;https://www.starburst.io/info/trinofest/&quot; target=&quot;_blank&quot;&gt;Register for Trino Fest 2023&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;Join us in June for the event; you don’t want to miss the announcements
and demos.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Last week, I had the pleasure to attend Open Source Summit North America 2023 in Vancouver. A quick hop across the Strait of Georgia got me right into the event and into the midst of my peers of open source developers, advocates, and enthusiasts.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/manfred-open-source-summit.jpg" />
      
    </entry>
  
    <entry>
      <title>Refreshing at the lakehouse summer camp</title>
      <link href="https://trino.io/blog/2023/05/03/refresh-at-trino-fest.html" rel="alternate" type="text/html" title="Refreshing at the lakehouse summer camp" />
      <published>2023-05-03T00:00:00+00:00</published>
      <updated>2023-05-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/05/03/refresh-at-trino-fest</id>
      <content type="html" xml:base="https://trino.io/blog/2023/05/03/refresh-at-trino-fest.html">&lt;p&gt;Summer is just around the corner, and we are busy getting ready for &lt;a href=&quot;/blog/2023/04/05/announcing-trino-fest-2023.html&quot;&gt;Trino Fest
2023&lt;/a&gt;. Everything is
ramping up. Early birds are starting to register, and &lt;a href=&quot;https://www.starburst.io/info/trinofest&quot;&gt;so should
you&lt;/a&gt;. Our Trino Fest theme song is
available for your listening pleasure, and we are reviewing speaker submissions.
The festival promises to be another great event to learn about lakehouse use
cases, and we are also featuring some great presentations about
querying data with Trino. And of course, we are still looking for more
presenters, so don’t hesitate and &lt;a href=&quot;https://sessionize.com/trino-fest-2023&quot;&gt;submit your
proposal&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Before you dive into the technical details of our upcoming conference, lean back
and listen to our theme song. Hopefully you are feeling the summer vibe coming
your way already.&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/6oN-70jSbF8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Our event host &lt;a href=&quot;https://www.starburst.io/&quot;&gt;Starburst&lt;/a&gt; is again helping us ensure
that Trino Fest is a venue for Trino beginners and experts to meet, exchange
ideas, and learn from each other. One of the Starburst engineers, &lt;a href=&quot;https://github.com/nineinchnick&quot;&gt;Jan
Waś&lt;/a&gt;, is scheduled to present about his
amazingly low-effort setup to use Trino for data analysis and report generation.&lt;/p&gt;

&lt;p&gt;Getting closer to the theme of the event “Lakehouse summer camp”, we are
planning to have sessions about Iceberg, Delta Lake, and Hudi usage with Trino.
Learn about the latest developments from these projects and practical tips and
tricks from the user community.&lt;/p&gt;

&lt;p&gt;In the keynote, Martin Traverso will speak about the many new features that
arrived in Trino since &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit last year&lt;/a&gt;. This includes the new Apache Ignite
connector we talked about in the &lt;a href=&quot;https://trino.io/episodes/46.html&quot;&gt;Trino Community Broadcast episode
46&lt;/a&gt;. At Trino Fest we are going to share some
more exciting news about new connectors and integrations for Trino. Specifically,
on the client tooling side, you can expect some great demos and news from the
Python community.&lt;/p&gt;

&lt;p&gt;So what are you waiting for? It’s time to register for the event. And if you
think you also want to share your knowledge and usage of Trino, submit a speaker
proposal.&lt;/p&gt;

&lt;div style=&quot;padding-bottom: 1rem&quot;&gt;
  &lt;a class=&quot;btn btn-orange&quot; style=&quot;display: inline-grid;&quot; href=&quot;https://www.starburst.io/info/trinofest/&quot; target=&quot;_blank&quot;&gt;Register&lt;/a&gt;
  &lt;a class=&quot;btn btn-pink&quot; style=&quot;display: inline-grid;&quot; href=&quot;https://sessionize.com/trino-fest-2023&quot; target=&quot;_blank&quot;&gt;Submit a talk&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;In either case, as your hosts and guides through the two half days, we look
forward to having you at the event.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred and Cole&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Cole Bowden</name>
        </author>
      

      <summary>Summer is just around the corner, and we are busy getting ready for Trino Fest 2023. Everything is ramping up. Early birds are starting to register, and so should you. Our Trino Fest theme song is available for your listening pleasure, and we are reviewing speaker submissions. The festival is promising to be another great event to learn about lakehouse use cases with Trino, but we are also featuring some great presentations for querying data with Trino. And of course, we are still looking for more presenters, so don’t hesitate and submit your proposal.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/trino-fest.png" />
      
    </entry>
  
    <entry>
      <title>Just the right time date predicates with Iceberg</title>
      <link href="https://trino.io/blog/2023/04/11/date-predicates.html" rel="alternate" type="text/html" title="Just the right time date predicates with Iceberg" />
      <published>2023-04-11T00:00:00+00:00</published>
      <updated>2023-04-11T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/04/11/date-predicates</id>
      <content type="html" xml:base="https://trino.io/blog/2023/04/11/date-predicates.html">&lt;p&gt;In the data lake world, data partitioning is a technique that is critical to the
performance of read operations. To avoid accidentally scanning large amounts of
data, and to limit the number of partitions processed by a query, a query engine
must push down constant expressions when filtering partitions.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Partitions in an Iceberg table tend to be fairly large, containing up to tens or
even hundreds of data files. It is therefore crucial to skip
irrelevant partitions while scanning a table in order to maintain high
query performance. When a table is created in a data lake, its partitioning
scheme constitutes a de-facto index, speeding up queries against it by pruning
out irrelevant partitions from the scan operation.&lt;/p&gt;

&lt;p&gt;Date and time are natural and universal partitioning candidates. Common
partition patterns revolve around month, day, or hour. One exciting feature of the
Iceberg table format is its &lt;a href=&quot;https://trino.io/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg.html#partition-specification-evolution&quot;&gt;hidden
partitioning&lt;/a&gt;.
Iceberg uses handy
&lt;a href=&quot;https://trino.io/docs/current/connector/iceberg.html#partitioned-tables&quot;&gt;transforms&lt;/a&gt;
such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;year&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;month&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;day&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hour&lt;/code&gt; to deal with the complexities of mapping
a raw timestamp value to an actual partition value in a manner that is
transparent to the user.&lt;/p&gt;

&lt;p&gt;Let’s look at a typical example of an Iceberg table containing log events which
are partitioned by day:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;zone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;level&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;message&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partitioning&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARRAY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;day(event_time)&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When dealing with logs, it often happens that we want to know what happened
today or within the last few days:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CURRENT_DATE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CURRENT_DATE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;7&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DAY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;constant-folding&quot;&gt;Constant folding&lt;/h2&gt;

&lt;p&gt;Trino uses the &lt;em&gt;constant folding&lt;/em&gt; optimization technique to deal with these
types of queries: it internally rewrites the filter expression as a comparison
predicate against a constant that is evaluated once, before executing the query,
to avoid recalculating the same expression for each row scanned:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/date-predicates/constant_folding.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
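
&lt;p&gt;As a sketch, the seven-day filter from the earlier query is evaluated once at
planning time and replaced with a constant. The exact value depends on when the
query runs; a start date of 2022-01-27 is assumed here purely for illustration:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- original filter
WHERE event_time &amp;gt;= CURRENT_DATE - INTERVAL &apos;7&apos; DAY
-- after constant folding, assuming the query runs on 2022-01-27
WHERE event_time &amp;gt;= TIMESTAMP &apos;2022-01-20 00:00:00.000000 UTC&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;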

&lt;h2 id=&quot;predicate-pushdown&quot;&gt;Predicate pushdown&lt;/h2&gt;

&lt;p&gt;Another common query scenario for log data is to query for a specific date in
the past. A seasoned SQL user, being aware of the underlying data type of the
partitioning column, would likely specify the date to be queried explicitly as
two timestamp constant filter expressions:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TIMESTAMP&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2022-01-20 00:00:00.000000 UTC&apos;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TIMESTAMP&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2022-01-21 00:00:00.000000 UTC&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A different flavor of the above-mentioned query would be to use
the &lt;a href=&quot;/docs/current/functions/comparison.html#range-operator-between&quot;&gt;BETWEEN&lt;/a&gt;
range operator:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BETWEEN&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TIMESTAMP&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2022-01-20 00:00:00.000000 UTC&apos;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TIMESTAMP&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2022-01-20 23:59:59.999999 UTC&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Users can focus on writing queries that are concise and readable, and leave
the optimization grunt work to the query engine.&lt;/p&gt;

&lt;p&gt;A succinct way of querying the logs for a specific day would be to cast the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp&lt;/code&gt; field value to its corresponding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date&lt;/code&gt; value and compare it with
the day containing the relevant logs:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;DATE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2022-01-20&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case, Trino &lt;a href=&quot;https://github.com/trinodb/trino/commit/49be4c2a&quot;&gt;unwraps the initial temporal
filter&lt;/a&gt; into a filter that tests
whether the column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; is within the constant timestamp range
corresponding to the date used in the initial filter, which is equivalent to the
most efficient of the explicit filters shown above.&lt;/p&gt;
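
&lt;p&gt;Conceptually, the unwrapped predicate corresponds to the half-open range
from the first query in this post:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;event_time &amp;gt;= TIMESTAMP &apos;2022-01-20 00:00:00.000000 UTC&apos;
AND event_time &amp;lt; TIMESTAMP &apos;2022-01-21 00:00:00.000000 UTC&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;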

&lt;p&gt;A different approach of querying the log data for a specific date is to use the
&lt;a href=&quot;/docs/current/functions/datetime.html#truncation-function&quot;&gt;date_trunc&lt;/a&gt;
function:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;date_trunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;day&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;DATE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2022-01-20&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Trino again &lt;a href=&quot;https://github.com/trinodb/trino/commit/80c079f9&quot;&gt;replaces the initial temporal
filter&lt;/a&gt; with a filter that tests
whether the column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; is within the constant timestamp range
corresponding to the date used in the initial filter.&lt;/p&gt;

&lt;p&gt;A slightly different use case is querying the log data to see whether an exotic
error type was recorded at any point during a specific year, by making use of the
&lt;a href=&quot;/docs/current/functions/datetime.html#year&quot;&gt;year()&lt;/a&gt; function:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2023&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This time, Trino &lt;a href=&quot;https://github.com/trinodb/trino/commit/b8967a3c1550b6e64ad8d3e7979ea46fbfc51550&quot;&gt;rewrites the temporal
filter&lt;/a&gt;
applied on the column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BETWEEN&lt;/code&gt; filter covering the timestamp
range corresponding to the entire span of the specified year:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;event_time&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BETWEEN&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TIMESTAMP&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2023-01-01 00:00:00.000000 UTC&apos;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TIMESTAMP&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2023-12-31 23:59:59.999999 UTC&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Without predicate pushdown, Trino applies the filter to each tuple after
scanning the entire content of the table:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/date-predicates/filter_basic_data_flow.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The optimization techniques employed by Trino to speed up these types of
queries all involve replacing the provided filter with an equivalent filter
expression. Constant replacement optimizations compare the table column against
a constant or a constant range, so that the filter can be pushed down to
&lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Iceberg&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a consequence, partition pruning happens in the metadata layer of the
table instead of filtering on top of the data itself, dramatically reducing the
number of data files scanned:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/date-predicates/filter_push_down_data_flow.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As described in the &lt;a href=&quot;https://iceberg.apache.org/spec/&quot;&gt;Iceberg Table Spec&lt;/a&gt;, for
any snapshot of the table, Iceberg tracks each individual data file and the
partition to which it belongs. Iceberg uses a hierarchical index in its metadata
layer by storing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lower_bounds&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;upper_bounds&lt;/code&gt; for:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;each partition in the manifest list files&lt;/li&gt;
  &lt;li&gt;each data file in the manifest files&lt;/li&gt;
&lt;/ul&gt;
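
&lt;p&gt;With the Trino Iceberg connector, these bounds can be inspected directly
through the hidden &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$files&lt;/code&gt; metadata table. The following query is a
sketch; the exact set of columns depends on the connector version:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT file_path, record_count, lower_bounds, upper_bounds
FROM &quot;logs$files&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;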

&lt;p&gt;Desugaring seemingly variable filter expressions into comparison predicates
involving only columns and constants or constant ranges pays off. Not only does
it prune out partitions, but it can also skip portions of a data file (for
example, an Apache Parquet row group) or even the entire data file. For a
filter on a non-partition column, for instance, pruning and skipping occur when
the queried range does not overlap with the range of values recorded for the
file in the Iceberg metadata.&lt;/p&gt;

&lt;p&gt;To put things in perspective, the optimization techniques presented in this
article, which are already integrated in Trino, can reduce the execution time
of queries containing selective temporal filters from hours to seconds,
depending on the size of the table scanned.&lt;/p&gt;

&lt;p&gt;A reader keen to experiment and verify that the previously mentioned
optimization techniques are actually effective can use
&lt;a href=&quot;/docs/current/sql/explain.html&quot;&gt;EXPLAIN&lt;/a&gt; to examine the output
of the query planning stage. If the temporal predicate employed in the query is
pushed down, the scan operation should report fewer rows than the total number
of rows in the table.&lt;/p&gt;
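
&lt;p&gt;For example, examining the plan of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date_trunc&lt;/code&gt; query above should
reveal the rewritten range predicate in the scan node:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN
SELECT *
FROM logs
WHERE date_trunc(&apos;day&apos;, event_time) = DATE &apos;2022-01-20&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;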

&lt;p&gt;The queries in this post showcase just a small fraction of the many
techniques that can be employed when querying date and time columns. Trino
continuously strives to streamline its users’ workflows by returning query
results as fast as possible.&lt;/p&gt;</content>

      
        <author>
          <name>Marius Grama</name>
        </author>
      

      <summary>In the data lake world, data partitioning is a technique that is critical to the performance of read operations. In order to avoid scanning large amounts of data accidentally, and also to limit the number of partitions that are being processed by a query, a query engine must push down constant expressions when filtering partitions.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/date-predicates/christian-pfeifer-l6OraG-v0d8-unsplash.jpg" />
      
    </entry>
  
    <entry>
      <title>Trino and the BDFL model: a renewed focus</title>
      <link href="https://trino.io/blog/2023/04/06/trino-bdfl-focus.html" rel="alternate" type="text/html" title="Trino and the BDFL model: a renewed focus" />
      <published>2023-04-06T00:00:00+00:00</published>
      <updated>2023-04-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/04/06/trino-bdfl-focus</id>
      <content type="html" xml:base="https://trino.io/blog/2023/04/06/trino-bdfl-focus.html">&lt;p&gt;For those who are paying close attention, you may notice updates to a few pages
across the Trino website with a renewed focus on leadership roles in Trino. This
is part of an effort to re-focus and make the operating model more transparent
both for contributors and for end users. While this is not a functional change,
this does involve clarifying our roles following the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Benevolent_dictator_for_life&quot;&gt;BDFL (benevolent dictator for life)&lt;/a&gt;
model.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Trino has been a popular open source project used by many companies and
organizations since its inception in 2012. As a founder-led project, it has
consistently operated under a BDFL model, though not necessarily by name. The
model is used to describe the persons who can make the final decisions for the
direction and development of the project. Many successful open-source projects,
including Linux, Python, Scala, Ruby, and Rust, operate using a BDFL model.&lt;/p&gt;

&lt;h2 id=&quot;why-the-bdfl-model&quot;&gt;Why the BDFL model?&lt;/h2&gt;

&lt;p&gt;One of the key benefits of the BDFL model is that it allows for a clear
decision-making process. When a project has a large number of contributors, it
can be difficult to reach consensus on certain issues. The BDFL can step in and
make the final decision, which can be particularly helpful in situations where
time is of the essence. Additionally, having a BDFL can provide a sense of
stability and direction for the project.&lt;/p&gt;

&lt;p&gt;It’s important to emphasize that the use of the BDFL model is not a new
development in Trino’s history. We (Dain, David, and Martin) have acted in
this role since the beginning.&lt;/p&gt;

&lt;h2 id=&quot;why-now&quot;&gt;Why now?&lt;/h2&gt;

&lt;p&gt;Why is there a renewed focus on the BDFL model now? Trino has reached a level
of maturity and a community size that has made it increasingly important to have
clear leadership and decision-making processes. By making the BDFL model more
explicit, we can ensure that the project remains focused and continues to deliver
value to its users.&lt;/p&gt;

&lt;h2 id=&quot;more-info&quot;&gt;More info&lt;/h2&gt;

&lt;p&gt;You can check out the following pages for additional information:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/development/roles.html&quot;&gt;Roles&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/development/process.html&quot;&gt;Development process&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/individual-code-of-conduct.html&quot;&gt;Individual code of conduct&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content>

      
        <author>
          <name>Martin Traverso, Dain Sundstrom, David Phillips</name>
        </author>
      

      <summary>For those who are paying close attention, you may notice updates to a few pages across the Trino website with a renewed focus on leadership roles in Trino. This is part of an effort to re-focus and make the operating model more transparent both for contributors and for end users. While this is not a functional change, this does involve clarifying our roles following the BDFL (benevolent dictator for life) model.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/bdfl-blog/trino-logo.png" />
      
    </entry>
  
    <entry>
      <title>Polish edition of Trino: The Definitive Guide</title>
      <link href="https://trino.io/blog/2023/04/06/the-definitive-guide-2-pl.html" rel="alternate" type="text/html" title="Polish edition of Trino: The Definitive Guide" />
      <published>2023-04-06T00:00:00+00:00</published>
      <updated>2023-04-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/04/06/the-definitive-guide-2-pl</id>
      <content type="html" xml:base="https://trino.io/blog/2023/04/06/the-definitive-guide-2-pl.html">&lt;p&gt;At this stage Trino is used all around the globe as we know from the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;community
chat&lt;/a&gt; and &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;our speakers at Trino Summit 2022&lt;/a&gt;. One large community of Trino
contributors and maintainers, many employed by &lt;a href=&quot;http://starburst.io&quot;&gt;Starburst&lt;/a&gt;,
is located in Poland. Poland also has very active participation from developers
and users in the Java and Big Data communities.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Today, we are happy to announce that a translation of the book &lt;a href=&quot;https://trino.io/trino-the-definitive-guide.html&quot;&gt;Trino: The
Definitive Guide&lt;/a&gt; to Polish is
now available for the communities in Poland and beyond. We invite you all to get
your own copy:&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://ksiazki.promise.pl/produkt/trino-profesjonalny-przewodnik-sql-w-dowolnej-skali-w-dowolnym-magazynie-i-w-dowolnym-srodowisku/&quot;&gt;
        Trino Profesjonalny Przewodnik
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Our thanks for making this happen go out to the teams at O’Reilly and
&lt;a href=&quot;https://ksiazki.promise.pl/&quot;&gt;Promise&lt;/a&gt;. We hope many readers will benefit from
the translated edition.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Martin, and Matt&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso, Matt Fuller</name>
        </author>
      

      <summary>At this stage Trino is used all around the globe as we know from the community chat and our speakers at Trino Summit 2022. One large community of Trino contributors and maintainers, many employed by Starburst, is located in Poland. Poland also has a very active participation of developers and users in the Java and Big Data communities.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/ttdg2-pl-cover.png" />
      
    </entry>
  
    <entry>
      <title>Lakehouse summer camp at Trino Fest 2023</title>
      <link href="https://trino.io/blog/2023/04/05/announcing-trino-fest-2023.html" rel="alternate" type="text/html" title="Lakehouse summer camp at Trino Fest 2023" />
      <published>2023-04-05T00:00:00+00:00</published>
      <updated>2023-04-05T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/04/05/announcing-trino-fest-2023</id>
      <content type="html" xml:base="https://trino.io/blog/2023/04/05/announcing-trino-fest-2023.html">&lt;p&gt;Get ready to kick off your summer with Commander Bun Bun at Trino Fest 2023!
This year’s event is going virtual and will take place over two days, &lt;strong&gt;the 14th
and 15th of June&lt;/strong&gt;. The focus of the event will be on Trino as a data lakehouse
query engine, with discussions on how new features and the ecosystem around
Trino can support better data lakehouse management.&lt;/p&gt;

&lt;p&gt;Trino Fest 2023 is the new annual summer event dedicated to all things Trino.
Building on the success of last year’s &lt;a href=&quot;/blog/2022/05/17/cinco-de-trino-recap.html&quot;&gt;Cinco de
Trino&lt;/a&gt;, we’re excited to bring
the community together once again to explore the latest trends and innovations
in Trino and data lakehouse management. With a focus on education, community
collaboration, and inspiration, Trino Fest 2023 will be a valuable experience
for anyone interested in improving their data and analytics platform. We hope to
see you there as attendee, speaker, or sponsor! Read below to find out how to
sign up.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;call-for-speakers&quot;&gt;Call for speakers&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://sessionize.com/trino-fest-2023&quot;&gt;Call for speakers&lt;/a&gt; is now open, and we
invite you to submit a talk if you have an interesting perspective on Trino.
We’re particularly interested in talks related to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Data lake and lakehouse use cases, architectures and experiences&lt;/li&gt;
  &lt;li&gt;Apache Iceberg&lt;/li&gt;
  &lt;li&gt;Delta Lake&lt;/li&gt;
  &lt;li&gt;Hudi&lt;/li&gt;
  &lt;li&gt;Industry use cases for Trino&lt;/li&gt;
  &lt;li&gt;Query federation&lt;/li&gt;
  &lt;li&gt;Data governance with Trino&lt;/li&gt;
  &lt;li&gt;SQL with Trino&lt;/li&gt;
  &lt;li&gt;ETL/ELT/batch query processing&lt;/li&gt;
  &lt;li&gt;Other tools and integrations in the Trino ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The call for speakers closes on May 19th, so be sure to submit your talk soon!&lt;/p&gt;

&lt;h2 id=&quot;whats-new-this-year&quot;&gt;What’s new this year?&lt;/h2&gt;

&lt;p&gt;Aside from the new title, this year’s Trino Fest will differ from last
year’s short conference in a few ways. We’re featuring more talks from Trino
practitioners, the event will run over two shorter days to avoid the death march
of talks, and there will be more summer, lakehouse, and camping puns. Of course,
there will be continued use of the &lt;a href=&quot;https://www.youtube.com/watch?v=kfJ63DNbAuI&amp;amp;list=PLFnr63che7wYFsknFAqisURvfm96rW0Dr&amp;amp;index=4&quot;&gt;Trinoritaville song
&lt;/a&gt;.
Whether you’re just getting started with Trino or you’re a seasoned pro, there
will be something for everyone at Trino Fest.&lt;/p&gt;

&lt;h2 id=&quot;what-is-trino-fest-versus-trino-summit&quot;&gt;What is Trino Fest versus Trino Summit&lt;/h2&gt;

&lt;p&gt;Trino was &lt;a href=&quot;/blog/2020/10/20/intro-to-hive-connector.html&quot;&gt;built from the beginning to query Hive data&lt;/a&gt;, so Trino moving on to support a data
lakehouse is simply the evolution of its flagship use case. Trino Fest covers
the latest features and improvements to Trino that make it an even better choice
for data lakehouse management. You’ll hear from speakers who are using Trino in
innovative ways, and who can provide valuable insights and tips for managing
your own data lakehouse. Going with the chill summer theme, there will be plenty
of time to have fun and relax too!&lt;/p&gt;

&lt;h2 id=&quot;sponsor-trino-fest&quot;&gt;Sponsor Trino Fest&lt;/h2&gt;

&lt;p&gt;If you’re interested in sponsoring Trino Fest 2023, we’d love to hear from you!
Sponsoring the event is a great way to get your brand in front of a highly
engaged audience of Trino enthusiasts and data professionals. Your support will
help make the event a success, and in return, we’ll offer a range of benefits,
such as logo placement on our website, social media shoutouts, and more. To
learn more about sponsoring Trino Fest 2023, reach out to
&lt;a href=&quot;mailto:events@starburst.io&quot;&gt;events@starburst.io&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;see-you-there&quot;&gt;See you there&lt;/h2&gt;

&lt;p&gt;Mark your calendar to save &lt;strong&gt;the 14th
and 15th of June&lt;/strong&gt; for Trino Fest 2023: Lakehouse Summer Camp. Get ready
for a two-day event that will get you diving into the deep end of the data lake.
&lt;a href=&quot;https://www.starburst.io/info/trinofest&quot;&gt;Registration is open now&lt;/a&gt;, and &lt;a href=&quot;https://sessionize.com/trino-fest-2023&quot;&gt;the
call for speakers&lt;/a&gt; closes on April 28th,
so be sure to sign up and submit your talk soon!&lt;/p&gt;

&lt;p&gt;Happy querying!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Get ready to kick off your summer with Commander Bun Bun at Trino Fest 2023! This year’s event is going virtual and will take place over two days, the 14th and 15th of June. The focus of the event will be on Trino as a data lakehouse query engine, with discussions on how new features and the ecosystem around Trino can support better data lakehouse management. Trino Fest 2023 is the new annual summer event dedicated to all things Trino. Building on the success of last year’s Cinco de Trino, we’re excited to bring the community together once again to explore the latest trends and innovations in Trino and data lakehouse management. With a focus on education, community collaboration, and inspiration, Trino Fest 2023 will be a valuable experience for anyone interested in improving their data and analytics platform. We hope to see you there as attendee, speaker, or sponsor! Read below to find out how to sign up.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-fest-2023/trino-fest.png" />
      
    </entry>
  
    <entry>
      <title>The rabbit reflects on Trino in 2022</title>
      <link href="https://trino.io/blog/2023/01/10/trino-2022-the-rabbit-reflects.html" rel="alternate" type="text/html" title="The rabbit reflects on Trino in 2022" />
      <published>2023-01-10T00:00:00+00:00</published>
      <updated>2023-01-10T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/01/10/trino-2022-the-rabbit-reflects</id>
      <content type="html" xml:base="https://trino.io/blog/2023/01/10/trino-2022-the-rabbit-reflects.html">&lt;p&gt;It’s that time of the year where everyone gives excessively broad or niche
predictions about the finance market, venture capital, or even the data
industry. And we are now bombarded with &lt;a href=&quot;https://www.githubunwrapped.com/&quot;&gt;“year-in-review” 
summaries&lt;/a&gt; where we find out just how much
data is being collected to generate those summaries. End-of-year reflections are
always useful because you can find patterns of what’s going well and what’s
going poorly. It’s also good to pause and take stock of the things that did go
well, because without that, you’ll only be looking at the list of things that
you still have to do, and that isn’t healthy for anybody. In that spirit, let’s
reflect on what we’ve been able to accomplish as a community this year, as well
as what to look forward to in the next year!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;2022-by-the-numbers&quot;&gt;2022 by the numbers&lt;/h2&gt;

&lt;p&gt;Let’s take a look at the Trino project’s growth and what happened specifically
in the past year:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;1,031,842 unique visits 🙋 to the Trino site&lt;/li&gt;
  &lt;li&gt;116,231 unique blog post views 👩‍💻 on the Trino site&lt;/li&gt;
  &lt;li&gt;60,296 views 👀 on YouTube&lt;/li&gt;
  &lt;li&gt;5,982 hours watched ⌚ on YouTube&lt;/li&gt;
  &lt;li&gt;4,696 new commits 💻 in GitHub&lt;/li&gt;
  &lt;li&gt;2,775 new members 👋 in Slack&lt;/li&gt;
  &lt;li&gt;2,769 new stargazers ⭐ in GitHub&lt;/li&gt;
  &lt;li&gt;2,550 pull requests merged ✅ in GitHub&lt;/li&gt;
  &lt;li&gt;1,465 issues 📝 created in GitHub&lt;/li&gt;
  &lt;li&gt;1,322 new followers 🐦 on Twitter&lt;/li&gt;
  &lt;li&gt;1,068 pull requests closed ❌ in GitHub&lt;/li&gt;
  &lt;li&gt;702 new subscribers 📺 in YouTube&lt;/li&gt;
  &lt;li&gt;658 average weekly members 💬 in Slack&lt;/li&gt;
  &lt;li&gt;56 videos 🎥 uploaded to YouTube&lt;/li&gt;
  &lt;li&gt;37 Trino 🚀 releases&lt;/li&gt;
  &lt;li&gt;36 blog ✍️ posts&lt;/li&gt;
  &lt;li&gt;12 Trino Community Broadcast ▶️ episodes&lt;/li&gt;
  &lt;li&gt;12 Trino 🍕 meetups&lt;/li&gt;
  &lt;li&gt;2 Trino ⛰️ Summits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Trino website got an impressive number of unique visits, also referred to as
entrances. This metric filters out refreshes and through traffic to count the
number of times a visitor started a unique session. Blog posts saw a 47 percent
increase from last year. Slack membership grew 13 percent and average weekly
active members grew an exciting 25 percent. YouTube views have increased by 218
percent. We’ve more than doubled the number of hours watched, which makes sense,
as we’ve nearly doubled the number of subscribers since last year.&lt;/p&gt;

&lt;p&gt;The project’s velocity hasn’t slowed down either. The number of commits grew 
27.6 percent this year and the number of created issues grew by 20 percent. This
increase in demand for features also pushed up the number of merged pull
requests by nearly 29 percent!&lt;/p&gt;

&lt;p&gt;Why are we pointing out the number of closed pull requests that weren’t merged?
We are improving communication with contributors regarding when and why we
explicitly decide not to move forward with a pull request. Part of this has
included a new initiative to close out old and inactive pull requests. There
have been a good number of pull requests that have fallen through the cracks and
are missing communication from the pull request creator or reviewer. The DevRel
team of Brian Olsen, Cole Bowden, and Manfred Moser is actively working on
improving the workflow around pull requests and issues. Cole recently posted a 
&lt;a href=&quot;/blog/2023/01/09/cleaning-up-the-trino-backlog.html&quot;&gt;blog that dives deeper&lt;/a&gt;
into what this team is actively working on to improve the experience of 
contributing to the project.&lt;/p&gt;

&lt;h3 id=&quot;trino-is-trending&quot;&gt;Trino is trending&lt;/h3&gt;

&lt;p&gt;A lot of these metrics indicate the growing popularity of Trino, but they also
help drive further awareness of the project to others. One metric we pay close
attention to is the number of visitors we get through blog posts, as they grow
Trino’s visibility. This increases the number of contributors and users that
shape Trino to be the best analytics SQL query engine on the planet. One of our
most successful blog posts was &lt;a href=&quot;/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html&quot;&gt;Why leaving Facebook/Meta was the best thing we
could do for the Trino Community&lt;/a&gt;.
The day this blog post was released, it doubled the website traffic we received
and set the record for both blog post views and website views in a single day.
For reference, our previous record was set by the post announcing the project
rebranding.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/2022-review/web-views.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This post gained a lot of traction for two reasons. Posts related to Meta and
the inner workings of open source communities naturally perform well, as many
developers are interested in these topics - drama is exciting! But you can have
an interesting topic that doesn’t go viral if nobody sees it. The catalyst to
this success was actually when &lt;a href=&quot;https://news.ycombinator.com/item?id=32323746&quot;&gt;David Phillips posted this to Hacker
News&lt;/a&gt;. We hit the top ten of 
Hacker News and occupied the front page for about two days.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/2022-review/hacker-news.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So what is the takeaway here? We need your help! While it made sense for David
to do this post once, &lt;a href=&quot;https://news.ycombinator.com/newsguidelines.html&quot;&gt;Hacker News generally looks down upon repeated
self-promotion&lt;/a&gt;. Clearly 
&lt;a href=&quot;http://redd.it/zbe333&quot;&gt;there’s a lot of people interested&lt;/a&gt; in Trino, and Hacker
News and many other social media outlets are how we get the word out. If you
don’t think that sharing has much effect, we hope sharing this impact motivates
you to help us. We don’t want to keep Trino the hidden secret of Silicon Valley
much longer. We need your help to really get people continuously reading and
hearing about all things Trino. So share any time you see something cool going
on in our community!&lt;/p&gt;

&lt;h3 id=&quot;trino-touches-the-world&quot;&gt;Trino touches the world&lt;/h3&gt;

&lt;p&gt;Let’s take a look at the number of users who initiated at least one session
on the Trino site in 2022, broken down by the top 10 countries. This goes to
show the true global reach this project has attained in 10 years.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;123,326 USA 🇺🇸 users&lt;/li&gt;
  &lt;li&gt;33,540 Indian 🇮🇳 users&lt;/li&gt;
  &lt;li&gt;30,955 Chinese 🇨🇳 users&lt;/li&gt;
  &lt;li&gt;12,282 British 🇬🇧 users&lt;/li&gt;
  &lt;li&gt;11,638 German 🇩🇪 users&lt;/li&gt;
  &lt;li&gt;10,760 Canadian 🇨🇦 users&lt;/li&gt;
  &lt;li&gt;9,980 Brazilian 🇧🇷 users&lt;/li&gt;
  &lt;li&gt;9,098 Singaporean 🇸🇬 users&lt;/li&gt;
  &lt;li&gt;8,649 South Korean 🇰🇷 users&lt;/li&gt;
  &lt;li&gt;8,636 Japanese 🇯🇵 users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/2022-review/world.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Our reach currently favors the USA, but our aim is to grow Trino in all
countries that are starting to show interest. The new edition of “Trino: The
Definitive Guide” is being translated into Chinese, &lt;a href=&quot;https://simpligility.ca/2022/12/trino-guide-for-everyone-in-2023/&quot;&gt;Polish, and
Japanese&lt;/a&gt;. If
you want to translate the book to your local language, please reach out to
Manfred Moser.&lt;/p&gt;

&lt;h2 id=&quot;trino-celebrates-its-tenth-birthday&quot;&gt;Trino celebrates its tenth birthday&lt;/h2&gt;

&lt;p&gt;Of all the incredible things that happened, one that gave us cause to reflect
was Trino’s tenth birthday. Martin, Dain, and David &lt;a href=&quot;https://trino.io/development/vision.html&quot;&gt;cite
longevity&lt;/a&gt; of the project as one of the
core philosophies that govern decisions around Trino. We expect that Trino will
be used for at least the next 20 years. We build for the long term. This first
decade &lt;a href=&quot;/blog/2020/12/27/announcing-trino.html&quot;&gt;has been an adventurous
ride&lt;/a&gt;, and wow has it &lt;a href=&quot;/blog/2022/08/08/trino-tenth-birthday.html&quot;&gt;produced an
incredible system&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-tenth-birthday/how-it-started-going.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We wanted to do something special with the community to celebrate this
milestone, so Brian put together a birthday video chronicling the evolution of
Presto and now Trino. We had a premiere watch party on the day of the tenth
anniversary and got some folks’ reactions. If you haven’t watched the video
yet, take a look; you don’t want to miss it.&lt;/p&gt;

&lt;div class=&quot;youtube-video-container&quot; style=&quot;text-align: center;&quot;&gt;
 
&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/hPD95_-bZZw&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;
      
&lt;/div&gt;

&lt;h2 id=&quot;trino-summit&quot;&gt;Trino Summit&lt;/h2&gt;

&lt;p&gt;The next event in 2022 was the Trino Summit, which was the first in-person
summit we’ve had as Trino, with well over 750 attendees. We had a stellar lineup
of speakers from companies like Apple, Astronomer, Bloomberg, Comcast,
Goldman Sachs, Lyft, Quora, Shopify, Upsolver, and Zillow.&lt;/p&gt;

&lt;p&gt;This summit had a Pokémon theme, making the analogy that data sources are much
like Pokémon and Trino is much like a Pokémon trainer trying to access and
federate all the data, train it, and level it up. Check out the video for
a small summary, and if you missed this event, we have all 
&lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;the recordings and slides available&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;youtube-video-container&quot; style=&quot;text-align: center;&quot;&gt;
 
&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/R1Z0VnKrQ9w&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;
      
&lt;/div&gt;

&lt;p&gt;We want to thank &lt;a href=&quot;https://starburst.io/&quot;&gt;Starburst&lt;/a&gt; for hosting this event and
all the sponsors for making this year’s summit possible. As usual, a huge thanks
to the community for showing up, engaging with each other, and bringing your
stories and curiosity.&lt;/p&gt;

&lt;h3 id=&quot;cinco-de-trino&quot;&gt;Cinco de Trino&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2022/05/17/cinco-de-trino-recap.html&quot;&gt;Cinco de Trino&lt;/a&gt; was
our mini Trino Summit held in the first half of the year. It dove into using
Trino with complementary tools to build a data lakehouse. The virtual event was
held on Cinco de Mayo (5th of May), which gave it a Margaritaville, on-the-lake
vibe. We used this conference as a platform to &lt;a href=&quot;/blog/2022/05/05/tardigrade-launch.html&quot;&gt;launch the long-awaited Project
Tardigrade features&lt;/a&gt;
around the fault-tolerance mode for Trino.&lt;/p&gt;

&lt;h3 id=&quot;trino-contributor-congregation&quot;&gt;Trino Contributor Congregation&lt;/h3&gt;

&lt;p&gt;This year, we began what we are calling the Trino Contributor Congregation
(TCC), which brings together Trino contributors, maintainers, and developer
relations under the same roof. The congregation was created to counter the
siloed nature of Trino development that occurred during the pandemic. Many
community members felt like their work wasn’t being seen, and much of this was
due to a lack of communication, especially face-to-face communication, which
builds empathy and demands attention. The TCCs aim to increase connections and
collaboration between maintainers and contributors, create opportunities for
highly technical exchanges of ideas and plans for Trino, and let everyone learn
about usage scenarios and issues from each other. This is different from the
Trino Summit, since it gathers those who contribute code to keep the
conversations focused on developing features and removing blockers for
contributors.&lt;/p&gt;

&lt;p&gt;The first TCC happened just after Trino Summit in Palo Alto. This was
convenient for many, as a lot of folks were already in San Francisco to attend
the summit. Moving forward, we will continue holding in-person TCCs around Trino
Summit to minimize the travel required of anyone wanting to attend in
person.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/2022-review/tcc.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Along with the in-person TCC, we also had the first virtual TCC in December.
This included many people in Europe and Asia who weren’t able to travel to
San Francisco in November. We covered mostly similar topics, but with much more
interaction from those new voices.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/2022-review/virtual-tcc.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;During these discussions, the biggest topics were the timelines of existing
roadmap items and suggestions for other items that should get more attention.
We talked about upcoming connectors and plugins, and all the infrastructure
required to support them. A recurring theme was the need for better testing
infrastructure. The more information we can gather as a community, the quicker
we can resolve issues as new releases come out and increase adoption of newer
versions of Trino. We also discussed desired features around resource-intensive
and batch workloads, and the new polymorphic table function features.&lt;/p&gt;

&lt;p&gt;The biggest takeaway from these meetings was that everyone now had a better
basis to engage with each other. As we move forward, we will continue the
cadence of having these virtual TCCs to keep everyone on the same page, and have
in-person meetings when there is a larger conference. With that, let’s cover
some of the features we gained this year.&lt;/p&gt;

&lt;h2 id=&quot;features&quot;&gt;Features&lt;/h2&gt;

&lt;p&gt;Of course, one of the main deliverables of our project is Trino releases. In
2022, we improved our release process and cadence, shipping 37 releases packed
with features, and we’re about to dive into a high-level list of the most
exciting ones that made their way to you. For details, and to keep up, check out
the &lt;a href=&quot;https://trino.io/docs/current/release.html&quot;&gt;release notes&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;fault-tolerant-execution-mode&quot;&gt;Fault-tolerant execution mode&lt;/h3&gt;

&lt;p&gt;2022 was the year of resiliency for Trino. Users have long requested adding a 
&lt;a href=&quot;https://trino.io/docs/current/admin/fault-tolerant-execution.html&quot;&gt;fault-tolerant mechanism to 
Trino&lt;/a&gt; akin to
query engines like Apache Spark. Users wanted to take the queries they were
already running in Trino and scale them to larger data volumes and more
resource-intensive workloads. Experimental features were implemented in late 2021
for &lt;a href=&quot;https://github.com/trinodb/trino/pull/9361&quot;&gt;automatic query retries&lt;/a&gt; and
earlier this year for &lt;a href=&quot;https://github.com/trinodb/trino/pull/9818&quot;&gt;task-level
retries&lt;/a&gt;. The efforts for these
features were codenamed &lt;a href=&quot;https://trino.io/episodes/32.html&quot;&gt;Project Tardigrade&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Fault-tolerant execution relies on storing intermediate data between task
shuffles, persisting it in an exchange spool. The first supported spool was
AWS S3, but eventually Azure Blob Storage and Google Cloud Storage were
included. The Project Tardigrade engineers started &lt;a href=&quot;/blog/2022/02/16/tardigrade-project-update.html&quot;&gt;improving performance and
fixing bugs&lt;/a&gt; in
fault-tolerant execution as users tested the early implementation. Later, memory
efficiency for aggregations, faster data transfers, and dynamic filtering with
fault-tolerant query execution were added. The &lt;a href=&quot;/blog/2022/05/05/tardigrade-launch.html&quot;&gt;launch of fault-tolerant
execution&lt;/a&gt; happened at Cinco de
Trino. The first iterations only applied to queries run on object-storage
connectors such as Hive, Iceberg, and Delta Lake. Recently, support for MySQL,
PostgreSQL, and SQL Server was added. These contributions laid a foundation
for other JDBC connectors. A few companies, &lt;a href=&quot;https://trino.io/blog/2022/12/12/trino-summit-2022-lyft-recap.html&quot;&gt;most notably
Lyft&lt;/a&gt;, have
adopted this feature and are scaling it in production.&lt;/p&gt;

&lt;h3 id=&quot;sql-language-improvements&quot;&gt;SQL language improvements&lt;/h3&gt;

&lt;p&gt;Here are all the notable SQL features that made it to Trino this year:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/sql/merge.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; statement support&lt;/a&gt; is
 the most impactful SQL feature released this year. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; allows users to
 implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; functionality in one statement.
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; is not simply syntactic sugar; the implementation brings profound performance
 improvements. Many of your operations can be merged (pun intended) from 
 multiple tasks into a single scan over data. This functionality is absolutely
 critical for positioning Trino as a data lakehouse query engine. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; is 
 currently available in the Hive, Iceberg, Delta Lake, Kudu, and Raptor 
 connectors. We discussed this and did a demo with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; on the recent &lt;a href=&quot;https://trino.io/episodes/40.html&quot;&gt;Trino
 Community Broadcast with Iceberg&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Another massive update was the introduction of &lt;a href=&quot;/blog/2022/07/22/polymorphic-table-functions.html&quot;&gt;polymorphic table
 functions&lt;/a&gt; (
 &lt;a href=&quot;https://trino.io/docs/current/functions/table.html&quot;&gt;PTFs&lt;/a&gt;). Table functions
 were initially released with passthrough query functionality that we
 see in connectors like Pinot, Elasticsearch, MySQL, PostgreSQL,
 &lt;a href=&quot;https://github.com/trinodb/trino/pull/12325&quot;&gt;and other JDBC connectors&lt;/a&gt;.
 However, this is only one small instance of what can be achieved with PTFs and
 the &lt;a href=&quot;https://www.youtube.com/clip/UgkxQcokpdgPjiuMKMC5-3HwHvlbmZjxAvxe&quot;&gt;true power comes from the generalization of this
 feature&lt;/a&gt;. 
 Dain and David gave &lt;a href=&quot;https://www.youtube.com/clip/Ugkx62IKgPd_v9eGBaPUHP2hyaRkWSXh8w8h&quot;&gt;a simpler explanation of
 PTFs&lt;/a&gt;. To
 dive in deeper, watch &lt;a href=&quot;https://trino.io/episodes/38.html&quot;&gt;this episode of
 the Trino Community Broadcast&lt;/a&gt; where Kasia
 Findeisen and Martin discuss PTFs in greater detail.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/8&quot;&gt;Dynamic function resolution&lt;/a&gt; has
 been discussed for many years and finally arrived. This allows
 &lt;a href=&quot;https://youtu.be/mUq_h3oArp4?t=680&quot;&gt;connectors to provide functions at
 runtime&lt;/a&gt;. Unlike before, where you needed
 to statically register your functions ahead of time, you can now provide a
 plugin that contains these functions that are resolved at runtime. This enables
 features such as calling dynamically registered user-defined
 functions written in languages like JavaScript or Python. Martin and Dain go
 into great detail about how this works when &lt;a href=&quot;https://youtu.be/mUq_h3oArp4?t=1596&quot;&gt;answering this question at Trino
 Summit&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Trino gained support for JSON processing functions, which are part of the
 &lt;a href=&quot;https://en.wikipedia.org/wiki/SQL:2016&quot;&gt;ANSI SQL 2016&lt;/a&gt; specification. This
 resolves a large number of issues reported by the community over the years.
 This includes the
 &lt;a href=&quot;https://trino.io/docs/current/functions/json.html#json-array&quot;&gt;json_array&lt;/a&gt;,
 &lt;a href=&quot;https://trino.io/docs/current/functions/json.html#json-object&quot;&gt;json_object&lt;/a&gt;,
 &lt;a href=&quot;https://trino.io/docs/current/functions/json.html#json-exists&quot;&gt;json_exists&lt;/a&gt;,
 &lt;a href=&quot;https://trino.io/docs/current/functions/json.html#json-query&quot;&gt;json_query&lt;/a&gt;, and
 &lt;a href=&quot;https://trino.io/docs/current/functions/json.html#json-value&quot;&gt;json_value&lt;/a&gt;
 functions that were added to Trino this year.&lt;/li&gt;
  &lt;li&gt;The JSON format was added to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; statement to provide an anonymized
 query plan output, enabling offline analysis.&lt;/li&gt;
  &lt;li&gt;It became possible to comment on tables, columns of tables, and even views for
 various connectors. Support for setting comments on views was introduced very
 recently and includes support for Hive and Iceberg.&lt;/li&gt;
  &lt;li&gt;A ton of new functions were added, including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;to_base32&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;from_base32&lt;/code&gt;,
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trim_array&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trim&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;performance-improvements&quot;&gt;Performance improvements&lt;/h3&gt;

&lt;p&gt;Despite all the hype about vectorization being a silver bullet to make databases
go fast, the real speed comes from &lt;a href=&quot;https://www.youtube.com/clip/UgkxQwDYDS6evVJelNVjWAgrIhzg_Q-cAEyq&quot;&gt;better algorithms and better data structures
that lead to lower resource consumption&lt;/a&gt;.
Following is a list of some improvements that made their way into Trino this
year:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Trino now offers improved performance for a variety of operations, including
 complex join criteria pushdown to connectors, faster aggregations, faster
 joins, and better performance for large clusters. We have also implemented
 improvements specifically for aggregations with filters and for the Glue
 metastore. In addition, we now support dynamic filtering for various connectors
 and have faster query planning for the Hive, Delta Lake, Iceberg, MySQL,
 PostgreSQL, and SQL Server connectors.&lt;/li&gt;
  &lt;li&gt;Along with general performance optimizations, there have been a great number of
 query planning optimizations that lead to better performance for specific SQL
 operators. These include faster &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; queries, improved performance for
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIKE&lt;/code&gt; expressions and highly selective &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt; queries, and enhanced
 performance and reliability for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; operations. We also made
 performance improvements for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNION&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt; queries, as well
 as faster planning of queries with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN&lt;/code&gt; predicates.&lt;/li&gt;
  &lt;li&gt;There are also optimizations for the performance of specific SQL types, such
 as the string, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECIMAL&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MAP&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; types. We have also made aggregations over 
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECIMAL&lt;/code&gt; columns faster and improved the performance of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; type in
 aggregations.&lt;/li&gt;
  &lt;li&gt;A last set of improvements comes from reading open file formats like ORC and
 Parquet efficiently. We have improved the speed of reading and writing all
 data types from and to Parquet in general. There were also general performance
 improvements for ORC types, and Trino can now write Bloom filters in ORC files.
 We have also improved performance and efficiency for a wide range of ORC and
 Parquet-related operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These improvements in aggregate are at the core of what makes Trino fast. There
is no silver bullet you can plug in to speed things up. It takes time, effort,
and smart changes to improve the speed of various systems.&lt;/p&gt;

&lt;h3 id=&quot;runtime-improvements&quot;&gt;Runtime improvements&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/9876&quot;&gt;Trino upgraded to Java 17&lt;/a&gt;. This
upgrade improves the overall speed and lowers the memory footprint of Trino with
various performance fixes to the JVM and garbage collectors. Trino uses the G1
garbage collector, which can now more efficiently reclaim memory and reduce pause
times.&lt;/p&gt;

&lt;p&gt;Aside from having to perform the upgrades, we get a lot of these performance
enhancements for free. On top of performance, upgrading to Java 17 adds new Java
language features to improve the ability to write and maintain higher quality
code.&lt;/p&gt;

&lt;p&gt;To learn more, read &lt;a href=&quot;/blog/2022/07/14/trino-updates-to-java-17.html&quot;&gt;this blog 
post&lt;/a&gt; and watch episode 36
of &lt;a href=&quot;https://trino.io/episodes/36.html&quot;&gt;the Trino Community Broadcast&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Along with the Java upgrade, Trino now has a Docker image for ppc64le and added
CLI support for ARM64, which means Trino’s Docker image can run on AWS Graviton
processors and the image and CLI can run on the new MacBooks.&lt;/p&gt;

&lt;h3 id=&quot;security&quot;&gt;Security&lt;/h3&gt;

&lt;p&gt;Trino added the following improvements and features relevant for authentication,
authorization, and integration with other security systems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;There were a lot of updates to &lt;a href=&quot;https://trino.io/docs/current/security/oauth2.html&quot;&gt;OAuth 2.0
 authentication&lt;/a&gt;, like support for OAuth
 2.0 refresh tokens and allowing access token passthrough with refresh tokens
 enabled. We also added support for &lt;a href=&quot;https://trino.io/docs/current/security/oauth2.html#openid-connect-discovery&quot;&gt;automatic discovery of OpenID
 Connect&lt;/a&gt;
 metadata with OAuth 2.0 authentication, support for groups in OAuth 2.0 claims,
 and reduced latency for OAuth 2.0 authentication.&lt;/li&gt;
  &lt;li&gt;The Hive, Iceberg, and Delta Lake connectors gained support for AWS Security
 Token Service (STS) credentials for authentication with the Glue catalog, and
 now allow specifying an AWS role session name via the S3 security mapping
 config.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;object-storage-connectors-hive-iceberg-delta-lake-hudi&quot;&gt;Object storage connectors (Hive, Iceberg, Delta Lake, Hudi)&lt;/h3&gt;

&lt;p&gt;One of the most common uses for Trino is as a data lakehouse query engine.
This year we not only added two connectors to this category, but also delivered
a lot of performance improvements across the board through file reader and
writer improvements.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Earlier this year, we added the &lt;a href=&quot;https://trino.io/docs/current/connector/delta-lake.html&quot;&gt;Delta Lake
 connector&lt;/a&gt; to finally
 reach everyone using Trino in the Delta Lake community. Delta Lake is a table
 format that improves on the Hive table format in areas like better support for
 ACID transactions. After the initial release, we added read and write support
 on Google Cloud Storage, added support for Databricks 10.4 LTS, and improved
 overall performance of the connector. To learn more about the Delta Lake
 connector, watch the &lt;a href=&quot;https://trino.io/episodes/34.html&quot;&gt;Trino Community Broadcast on Delta 
 Lake&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/connector/hudi.html&quot;&gt;The Hudi connector&lt;/a&gt; is a
 more recent addition, but it’s just as exciting. Hudi was created at Uber with
 the goal of handling real-time ingestion into a data lake. This connector is the
 youngest of the three newest object storage connectors, so stay tuned to see
 more features land around this connector. See how Robinhood uses &lt;a href=&quot;https://trino.io/episodes/34.html&quot;&gt;Hudi and
 Trino in the Trino Community Broadcast&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;The Iceberg connector had a massive amount of improvements as well, bringing
 it to the same production-ready level as the Hive connector. Iceberg now has
 new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;expire_snapshots&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_orphan_files&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPTIMIZE&lt;/code&gt; procedures.
 Having these capabilities, along with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt;, is really the key to being an
 effective lakehouse query engine. This year, Iceberg added support for the Glue
 metastore, the Avro file format, file-based access control, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; and
 time travel syntax. Iceberg also received a lot of performance improvements and
 reduced latency when querying tables with many files.&lt;/li&gt;
  &lt;li&gt;Although it seems like Hive is gradually on its way out, there are many who
 still depend on the Hive connector to be performant. Hive received support for
 S3 Select pushdown for JSON data and IBM Cloud Object Storage,
 improved performance when querying partitioned Hive tables, and the
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flush_metadata_cache()&lt;/code&gt; procedure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;other-connectors&quot;&gt;Other connectors&lt;/h3&gt;

&lt;p&gt;A major feature of Trino is the availability of other connectors to query all
sorts of databases with SQL, all with the speed that Trino users are used to.
Here are some of the major improvements that landed for these connectors in 2022:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A new MariaDB connector.&lt;/li&gt;
  &lt;li&gt;Performance improvements with various pushdowns in the MongoDB, MySQL, Oracle,
 PostgreSQL, and SQL Server connectors.&lt;/li&gt;
  &lt;li&gt;Support for bulk data insertion in the SQL Server connector.&lt;/li&gt;
  &lt;li&gt;A query passthrough table function added to numerous connectors.&lt;/li&gt;
  &lt;li&gt;Expanded SQL features for various connectors by adding support for
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TRUNCATE TABLE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DROP&lt;/code&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SCHEMA&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, and others.&lt;/li&gt;
  &lt;li&gt;An updated Cassandra connector with support for the v5 and v6 protocols.&lt;/li&gt;
  &lt;li&gt;A collection of improvements to the Pinot and BigQuery connectors.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;bug-fixes&quot;&gt;Bug fixes&lt;/h3&gt;

&lt;p&gt;Any software includes issues and bugs, Trino included. Thanks to our community
we learned about many of them, and fixed even more. Continue to test new
releases and report issues. Check out &lt;a href=&quot;https://trino.io/docs/current/release.html#releases-2022&quot;&gt;all the release notes for
details&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;updates-in-the-trino-ecosystem&quot;&gt;Updates in the Trino ecosystem&lt;/h2&gt;

&lt;p&gt;Outside of the excitement within the main Trino project, there was a great deal
going on in the larger Trino community and ecosystem:&lt;/p&gt;

&lt;h3 id=&quot;trino-the-definitive-guide-second-edition&quot;&gt;Trino: The Definitive Guide second edition&lt;/h3&gt;

&lt;p&gt;Martin, Manfred, and Matt released the &lt;a href=&quot;/blog/2022/10/03/the-definitive-guide-2.html&quot;&gt;second edition of Trino: The Definitive
Guide&lt;/a&gt;. This update of the
book from O’Reilly fixed errata, expanded the deployment coverage to include newer
Kubernetes installation methods, and updated features for all the additions that
had been released since the first version of the book. Along with this, &lt;a href=&quot;https://simpligility.ca/2022/12/trino-guide-for-everyone-in-2023/&quot;&gt;efforts
are underway to translate this
book&lt;/a&gt; to
different languages. Huge thanks to everyone involved in this!&lt;/p&gt;

&lt;h3 id=&quot;starburst-provides-trino-in-the-cloud&quot;&gt;Starburst provides Trino in the cloud&lt;/h3&gt;

&lt;p&gt;As a major community supporter, &lt;a href=&quot;https://starburst.io/&quot;&gt;Starburst&lt;/a&gt; helped us
with events, marketing, developer relations, and partner cooperation. Starburst
also provided a large part of development and code contributions to Trino and
its related projects. Starburst acquired Varada and integrated the object
storage indexing technology, and they shipped many Starburst Enterprise releases
for self-managed deployments. On top of all that amazing work, Starburst
launched &lt;a href=&quot;https://www.starburst.io/platform/starburst-galaxy/&quot;&gt;Starburst Galaxy&lt;/a&gt;
as a powerful, multi-cloud SaaS offering of Trino. Security, cluster management,
a query editor, and many other features are included in this new platform.&lt;/p&gt;

&lt;h3 id=&quot;amazon-upgrades-athena&quot;&gt;Amazon upgrades Athena&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2022/12/01/athena.html&quot;&gt;Athena version three rolled out&lt;/a&gt;
and is now based on a recent Trino release. This is great news for Athena users
who were missing the many performance gains, expanded SQL support, and other
features from Trino, since the prior versions are based on old Presto releases.
As a result, the large Athena community and their feedback and knowledge have
become more integrated with the Trino community, and we are seeing positive
impact for Trino releases already.&lt;/p&gt;

&lt;h3 id=&quot;dbt-trino&quot;&gt;dbt-trino&lt;/h3&gt;

&lt;p&gt;dbt users rejoice! The &lt;a href=&quot;https://docs.getdbt.com/reference/warehouse-setups/trino-setup&quot;&gt;official dbt-Trino
integration&lt;/a&gt;
made it into dbt this year! This means that anyone who wanted to read or write
data to or from multiple data sources is now able to. If you want to dive into
it, &lt;a href=&quot;https://docs.starburst.io/blog/2022-11-30-dbt0-introduction.html&quot;&gt;check out this blog
post&lt;/a&gt; written
by the contributors of this integration.&lt;/p&gt;

&lt;h3 id=&quot;python-client-improvements&quot;&gt;Python client improvements&lt;/h3&gt;

&lt;p&gt;Development of the
&lt;a href=&quot;https://github.com/trinodb/trino-python-client&quot;&gt;trino-python-client&lt;/a&gt; doubled
this year. A major focus was on performance improvements in the SQLAlchemy
integration. There was also a wide range of bug fixes.&lt;/p&gt;

&lt;h3 id=&quot;airflow-integration&quot;&gt;Airflow integration&lt;/h3&gt;

&lt;p&gt;The long-awaited &lt;a href=&quot;https://airflow.apache.org/docs/apache-airflow-providers-trino/stable/index.html&quot;&gt;Trino/Airflow
integration&lt;/a&gt;
landed this year. This paired well with the new task-retry and fault-tolerant
execution features. To learn more about the full capabilities of pairing Trino’s
new fault-tolerant execution mode with Airflow, check out &lt;a href=&quot;https://www.youtube.com/watch?v=xKDN7RUJ5i4&quot;&gt;Philippe Gagnon’s
talk at this year’s Trino Summit&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;metabase-driver&quot;&gt;Metabase driver&lt;/h3&gt;

&lt;p&gt;A lot of folks in the community were asking for a &lt;a href=&quot;https://github.com/metabase/metabase/issues/17532&quot;&gt;Trino/Metabase
driver&lt;/a&gt; after Trino updated
its name. This was a large blocker for anyone who wanted to move to Trino and
used Metabase. Through a collaboration between the Metabase and Starburst engineers,
the &lt;a href=&quot;https://github.com/starburstdata/metabase-driver&quot;&gt;metabase-driver&lt;/a&gt; for
Trino was released, and we saw numerous users migrate to Trino.&lt;/p&gt;

&lt;h2 id=&quot;2023-roadmap&quot;&gt;2023 Roadmap&lt;/h2&gt;

&lt;p&gt;The upcoming roadmap was &lt;a href=&quot;https://youtu.be/mUq_h3oArp4?t=799&quot;&gt;covered in detail&lt;/a&gt;
by Martin at Trino Summit. To avoid extending this blog even further, we’ll
leave you with the featured project that covers many aspects of the Trino core
engine.&lt;/p&gt;

&lt;h3 id=&quot;project-hummingbird&quot;&gt;Project Hummingbird&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/14237&quot;&gt;Project Hummingbird&lt;/a&gt; aims to
improve Trino’s columnar and vectorized evaluation engine. Every year we report
on many incremental performance improvements. These improvements are typically
small in isolation but have a large aggregate impact. This incremental approach
is the real key to improving query engine performance, and there is always room
for further optimization. If you want to get involved with this exciting
project, or to learn about the latest innovations as they are being discussed,
join the #project-hummingbird channel in &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;the Trino Slack
workspace&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;2022 was by far the busiest year this bunny has ever had. Trino has consistently
continued growing as we’ve attracted more contributors. We believe this trend
will continue in 2023 as we begin to put more process in place around managing
pull requests. Remember to get the word out and share anything you genuinely
think is cool or important for others to hear! Looking forward to an even more
successful 2023 Trino nation!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen, Manfred Moser, Cole Bowden, Martin Traverso</name>
        </author>
      

      <summary>It’s that time of the year where everyone gives excessively broad or niche predictions about the finance market, venture capital, or even the data industry. And we are now bombarded with “year-in-review” summaries where we find out just how much data is being collected to generate those summaries. End-of-year reflections are always useful because you can find patterns of what’s going well and what’s going poorly. It’s also good to pause and take stock of the things that did go well, because without that, you’ll only be looking at the list of things that you still have to do, and that isn’t healthy for anybody. In that spirit, let’s reflect on what we’ve been able to accomplish as a community this year, as well as what to look forward to in the next year!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/2022-review/cbb-reflection.png" />
      
    </entry>
  
    <entry>
      <title>Cleaning up the Trino pull request backlog</title>
      <link href="https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog.html" rel="alternate" type="text/html" title="Cleaning up the Trino pull request backlog" />
      <published>2023-01-09T00:00:00+00:00</published>
      <updated>2023-01-09T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog</id>
<content type="html" xml:base="https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog.html">&lt;p&gt;At some point in the lifecycle of a successful open source project, the number
of incoming pull requests (PRs) starts to outpace the project’s ability to get
code merged. It happens for a huge variety of reasons, including
developers moving on to other projects before tying up every loose end,
reviewers who miss a request for review, and because some stagnant PRs were
never going to happen and should have been closed two years ago. The GitHub
notification system doesn’t do anyone any favors, either. Having too many open
PRs is a problem for a project, because they make it harder to tell what is
being worked on and what may as well be dead code walking.&lt;/p&gt;

&lt;p&gt;And when we cross 700 open pull requests in Trino, constantly adding a few more
to the pile every week, what do we do? We clean it up! Let’s talk about how
we’re doing it, why we’re doing it that way, and how we’re planning on
preventing this from happening again. The end result should be some process
improvements that make contributing to Trino a better, faster, and more painless
experience.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;spring-cleaning&quot;&gt;Spring cleaning&lt;/h2&gt;

&lt;p&gt;The “how” is an easy thing to talk about. The Trino developer relations team is
in the process of going through all open PRs, from oldest to newest, manually
taking a look at each one and checking in on how we may want to proceed. For PRs
where the author seems to have abandoned it and not responded to a review, we close
them down, encouraging the authors to open them right back up if they decide
they want to continue work. For everything else, though, we’ve been taking a
more measured approach, offering to help facilitate reviews or discussion for
these long-lasting bits of code that may still have a chance of making their way
into Trino.&lt;/p&gt;

&lt;p&gt;To anyone who’s managed a repository before, this may seem like more effort than
necessary. You can add a bot to close anything that’s been stale or inactive for
too long, and problem solved, right? Sure, that does solve the problem, but it
creates a couple of others.&lt;/p&gt;

&lt;p&gt;First, and perhaps most importantly: it’s not very human. Having a pull request
that you put time and effort into get shut down by a bot without having another
person swing by to say hello can be demoralizing, and it builds a negative
experience that might discourage future contributions to the project. We want
our contributors to like Trino and to enjoy the process of adding on to it, and
a GitHub bot slamming the door shut on their hard work isn’t going to help with
that. Having a bot do our work for us would also deprive us of a valuable
learning opportunity. Manually checking in on each pull request that slipped
through the cracks has allowed us to identify pain points in Trino code reviews
which we can try to mitigate moving forward, and it’s provided a ton of
valuable insights for deciding how best to improve the process.&lt;/p&gt;

&lt;p&gt;Second, and perhaps even more significant: there’s a lot of cool stuff we’d be
missing out on if we automatically closed everything. While going through the
backlog, we’ve found dozens of year-old pull requests that still have a lot of
value for Trino and only needed someone to take another look at them. For some,
the author may be missing, but the ideas are good and the PR can be handed off
to someone else to carry the torch and get it across the finish line. For
others, the author is still happy and ready to iterate on it, and all that’s
needed to get the ball rolling again is to ping a reviewer or two to take
another look. We’ve even found a couple PRs that were approved and ready to go,
and all it took was a simple click of the merge button. The effort-to-impact
ratio on that is off the charts - think of all the value we’d be missing out on
if we’d automatically closed those!&lt;/p&gt;

&lt;p&gt;The result of the effort so far has been excellent.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/backlog-blog/open-pull-requests-graph.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We’re not completely done with the cleanup effort, but as you can see, we’re
slowing down. Our oldest PRs are increasingly recent, still in development,
and worth having open. Going from a peak of 700+ open pull requests to around
300 is a massive improvement, and the goal is to end up in the vicinity of about
200 open pull requests in Trino at any point in time.&lt;/p&gt;

&lt;h2 id=&quot;keeping-things-pristine&quot;&gt;Keeping things pristine&lt;/h2&gt;

&lt;p&gt;But with the cleanup being so manual, the next challenge is stopping the pull
requests from steadily piling back up while we’re not paying attention to them.
The fix for that is simple - we’re going to keep paying attention. The Trino
developer relations team is planning on tracking and getting involved in two
categories of pull requests to keep the number of open PRs stable.&lt;/p&gt;

&lt;p&gt;The first category is pull requests that don’t get any immediate attention from
a reviewer. While Trino reviewers are overall excellent and quick to take a look
at incoming pull requests, about five percent slip through the cracks, where a
contributor submits something that receives no reviews or comments and lives on
in the pull request backlog. That’s not a good experience for the contributor,
and it’s not good for Trino, either, because that contribution could have a lot
of value. We plan on stopping this from happening by implementing workflows
which spring Trino developer relations into action when these situations arise.
If a pull request goes a few days without a comment, we’ll be the safety net to
ask questions, get engineers involved, and make sure that at least a few pairs
of eyes take a look at every incoming PR in a timely manner.&lt;/p&gt;

&lt;p&gt;The second category is pull requests that get some reviews, but eventually
stagnate or stop being actively worked on. This happens for a lot of reasons,
but in all cases, if a pull request goes a few weeks with no activity, the
developer relations team will be checking in. Our goal will be to figure out the
proper path forward, whether that’s flagging down some reviewers again,
communicating that the pull request should be closed, or anything else. The end
result should be that nothing slips through the cracks and ends up going months
without human contact. If an author vanishes or everyone gets too busy to look
at a pull request again, though, the final stop will ultimately be a stale bot
which closes pull requests that have gone a few months with no activity.&lt;/p&gt;

&lt;p&gt;With all these processes in place, contributors should never feel like their
efforts are going unnoticed. Submitted code should be reviewed quickly,
iterated on in a timely manner, and merged without much delay. In situations
where a pull request is &lt;em&gt;not&lt;/em&gt; going to be merged, the Trino developer relations
team should be able to chime in quickly to make that clear, saving contributors
from wasting time and effort on a false impression that their code will be
landed. And if you have any questions, concerns, or suggestions about all of
this, don’t hesitate to reach out to us directly on the Trino Slack using
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@devrel-team&lt;/code&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>At some point in the lifecycle of a successful open source project, it reaches a point where the number of incoming pull requests (PRs) outpace the project’s ability to get code merged. It happens for a huge variety of reasons, including developers moving on to other projects before tying up every loose end, reviewers who miss a request for review, and because some stagnant PRs were never going to happen and should have been closed two years ago. The GitHub notification system doesn’t do anyone any favors, either. Having too many open PRs is a problem for a project, because they make it harder to tell what is being worked on and what may as well be dead code walking. And when we cross 700 open pull requests in Trino, constantly adding a few more to the pile every week, what do we do? We clean it up! Let’s talk about how we’re doing it, why we’re doing it that way, and how we’re planning on preventing this from happening again. The end result should be some process improvements that make contributing to Trino a better, faster, and more painless experience.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/backlog-blog/so-many-pull-requests.png" />
      
    </entry>
  
    <entry>
      <title>Using Trino to analyze a product-led growth (PLG) user activation funnel</title>
      <link href="https://trino.io/blog/2022/12/23/trino-summit-2022-upsolver-recap.html" rel="alternate" type="text/html" title="Using Trino to analyze a product-led growth (PLG) user activation funnel" />
      <published>2022-12-23T00:00:00+00:00</published>
      <updated>2022-12-23T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/23/trino-summit-2022-upsolver-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/23/trino-summit-2022-upsolver-recap.html">&lt;p&gt;As the holiday season approaches, we have reached the end of our
&lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit 2022 recap posts&lt;/a&gt;.
With the last talk of the summit, Mei Long from Upsolver gave an insightful
overview of how they use data to inform product decisions.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/MCB_1furnAo&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@Upsolver.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;When talking about product-led growth (PLG), it helps to start by defining what
it even means. The core idea is simple: see how users engage with your product,
and make decisions based on how you can improve the product to better serve
those users. At Upsolver, the goal of PLG is to maximize user value. The issue
is that while this can be simple in some situations, when you’re delivering
complicated analytics tools, it’s not always immediately clear what features
would be the most valuable or useful. You need a lot of data to glean a lot of
insight, and you need to make sure your insights can lead to action. And of
course, you need to be absolutely certain that your data is high-quality,
accurate, and trustworthy, lest you end up accidentally giving a customer a
ten million dollar discount.&lt;/p&gt;

&lt;p&gt;Mei explores the initial pass at using analytics to drive PLG at Upsolver,
letting her intern use a tool called Amplitude that worked for a time and for
limited use cases. As Upsolver grew, the analytics requirements did, too, and
Amplitude wasn’t powerful enough for Upsolver’s use case, nor for the more
complicated queries and analysis that needed to be run.&lt;/p&gt;

&lt;p&gt;Want to guess what query engine they swapped to using? Trino. Mei dives into a
quick demo that shows how Upsolver ingests all of its streaming data and stores
it for Trino to query, driving down time-to-insight to make it quick and
efficient to ask questions and make decisions based on those answers. With Trino
at the ready, Upsolver has never been better-equipped to work towards PLG.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, please consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/23/trino-summit-2022-upsolver-recap.html&quot;&gt;https://trino.io/blog/2022/12/23/trino-summit-2022-upsolver-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/upsolver-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Mei Long, Cole Bowden</name>
        </author>
      

      <summary>As the holiday season approaches, we have reached the end of our Trino Summit 2022 recap posts. With the last talk of the summit, Mei Long from Upsolver gave an insightful overview of how they use data to inform product decisions.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/upsolver.jpg" />
      
    </entry>
  
    <entry>
      <title>Using Trino with Apache Airflow for (almost) all your data problems</title>
      <link href="https://trino.io/blog/2022/12/21/trino-summit-2022-astronomer-recap.html" rel="alternate" type="text/html" title="Using Trino with Apache Airflow for (almost) all your data problems" />
      <published>2022-12-21T00:00:00+00:00</published>
      <updated>2022-12-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/21/trino-summit-2022-astronomer-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/21/trino-summit-2022-astronomer-recap.html">&lt;p&gt;As we close in on the final talks from &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit 2022&lt;/a&gt;, this next talk dives into how to set up
Trino for batch processing. Trino has historically been well-known for
facilitating fast adhoc analytics queries as opposed to long-running, resource
intensive batch/ETL queries. This is due to the fact that Trino kills queries
that run out of resources in order to prioritize faster query execution. Earlier
this year, Trino added features to better support batch queries with a new 
&lt;a href=&quot;https://trino.io/blog/2022/05/05/tardigrade-launch.html&quot;&gt;fault-tolerant execution mode&lt;/a&gt;.
This mode backs up intermediate data during execution time, allowing Trino to
restart individual query tasks on failure rather than a query stage or the query
itself.&lt;/p&gt;
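&lt;p&gt;As a rough sketch of what enabling this mode looks like, it comes down to a
few configuration properties. The bucket path below is a placeholder; see the
Trino fault-tolerant execution documentation for the full set of options:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# config.properties: retry individual tasks instead of entire queries
retry-policy=TASK

# etc/exchange-manager.properties: where intermediate data is backed up
exchange-manager.name=filesystem
exchange.base-directories=s3://example-bucket/trino-exchange
&lt;/code&gt;&lt;/pre&gt;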

&lt;p&gt;Batch queries don’t typically involve human intervention and run asynchronously.
These tasks may depend on each other and have a complex workflow. This talk
describes how to orchestrate this complexity using Airflow’s new Trino
integration to run Trino batch queries to solve (almost) all your data problems.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/xKDN7RUJ5i4&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@Astronomer.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;In this talk, we’re going to hear from Philippe, a Trino contributor and
Solutions Architect at Astronomer, the company building a SaaS product around
Apache Airflow. Philippe describes a fictional trading scenario that initially
follows a traditional warehousing approach to storing data. This architecture
has data sources that are queried and submitted as raw data into a centralized
warehouse. Within the warehouse itself, the raw data is transformed into data
ready to be consumed.&lt;/p&gt;

&lt;p&gt;This model enforces centralization, in which one team runs the platform and
builds the integration between producers and consumers. This team focuses on the
aspects of the data platform, which further separates it from the business use
case. As source databases evolve, the central data team must keep up with these
changes. As the data consumers that rely on the data infrastructure grow, this
team commonly becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;Trino allows you to move the queries as close as possible to the federated data
sources, removing the labor-intensive process of moving data into stages
before ingesting it into a central warehouse. This doesn’t mean that data
movement is no longer a necessity, but the necessity shifts from an availability
concern to a performance and scalability concern.&lt;/p&gt;

&lt;p&gt;Without investing into more resources, your data professionals are able to work
closely with producers and stakeholders with a shared understanding of the
domain. This increases data literacy and data availability throughout your
organization.&lt;/p&gt;

&lt;p&gt;Trino is not only for fast adhoc analytics with a human in the loop, but now 
provides a fault-tolerant execution mode that enables it to run resource
intensive batch jobs. This, paired with the federation capabilities, make Trino
able to ingest any data that can be represented in a tabular format. Users can
implement user-defined functions and run transformations using SQL without
involving intermediate systems.&lt;/p&gt;

&lt;p&gt;Running Trino batch queries at scale requires building complex interdependencies
between different tasks and monitoring for any failures that occur. It also
demands reactive automation to handle failing instances. Apache Airflow is an
open-source platform for developing, scheduling, and monitoring batch-oriented
workflows on systems like Trino, making it a perfect complement for handling
these intensive queries at scale.&lt;/p&gt;

&lt;p&gt;Even before introducing fault-tolerant execution mode, &lt;a href=&quot;https://engineering.salesforce.com/how-to-etl-at-petabyte-scale-with-trino-5fe8ac134e36/&quot;&gt;Trino was already being
used to run batch queries at scale&lt;/a&gt;.
In these scenarios, Trino and a tool like Airflow already work well together
because these jobs will take time and likely nobody wants to wait around to run
the pipeline components in sequence. Fault-tolerant execution mode brings the
Trino and Airflow combination to the forefront because Trino is expected to be
adopted as a batch query engine as running ETL jobs on Trino becomes as easy as
with other tools in the space.&lt;/p&gt;

&lt;p&gt;Philippe dives into building out basic Airflow jobs to run over Trino and
introduces the concept of a directed acyclic graph (DAG). He then dives into
multiple useful features that help break down large jobs into manageable tasks,
and jobs that can adjust the schedule based on runtime execution. Sharded job 
creation splits large batch jobs into smaller tasks that can easily be retried.
Dynamic task mapping splits jobs into smaller tasks based on data observed at
runtime. Finally, a new feature called data-aware scheduling can schedule tasks
based on interdependencies between datasets.&lt;/p&gt;

&lt;p&gt;To get started with Trino in Apache Airflow, check out the
&lt;a href=&quot;https://airflow.apache.org/docs/apache-airflow-providers-trino/stable/index.html&quot;&gt;Airflow Trino provider documentation&lt;/a&gt;.&lt;/p&gt;
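&lt;p&gt;As a minimal sketch of what such a pipeline can look like on recent Airflow
and provider versions, the following DAG chains two statements with the
&lt;code&gt;TrinoOperator&lt;/code&gt; from the provider package. The catalog, schema,
table, and connection names are illustrative only:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import datetime

from airflow import DAG
from airflow.providers.trino.operators.trino import TrinoOperator

with DAG(
    dag_id="trino_batch_example",
    start_date=datetime(2022, 12, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each task submits a SQL statement to Trino via the configured connection
    create = TrinoOperator(
        task_id="create_table",
        trino_conn_id="trino_default",
        sql="CREATE TABLE IF NOT EXISTS iceberg.example.events (id bigint)",
    )
    load = TrinoOperator(
        task_id="load_data",
        trino_conn_id="trino_default",
        sql="INSERT INTO iceberg.example.events SELECT orderkey FROM tpch.tiny.orders",
    )
    # Run the load only after the table exists
    create &gt;&gt; load
&lt;/code&gt;&lt;/pre&gt;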

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, please consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/21/trino-summit-2022-astronomer-recap.html&quot;&gt;https://trino.io/blog/2022/12/21/trino-summit-2022-astronomer-recap.html&lt;/a&gt;. If you think Trino is awesome, 
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/astronomer-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Philippe Gagnon, Brian Olsen</name>
        </author>
      

      <summary>As we close in on the final talks from Trino Summit 2022, this next talk dives into how to set up Trino for batch processing. Trino has historically been well-known for facilitating fast adhoc analytics queries as opposed to long-running, resource intensive batch/ETL queries. This is due to the fact that Trino kills queries that run out of resources in order to prioritize faster query execution. Earlier this year, Trino added features to better support batch queries with a new fault-tolerant execution mode. This mode backs up intermediate data during execution time, allowing Trino to restart individual query tasks on failure rather than a query stage or the query itself. Batch queries don’t typically involve human intervention and run asynchronously. These tasks may depend on each other and have a complex workflow. This talk describes how to orchestrate this complexity using Airflow’s new Trino integration to run Trino batch queries to solve (almost) all your data problems.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/astronomer.jpg" />
      
    </entry>
  
    <entry>
      <title>Journey to Iceberg with Trino</title>
      <link href="https://trino.io/blog/2022/12/19/trino-summit-2022-sk-telecom-recap.html" rel="alternate" type="text/html" title="Journey to Iceberg with Trino" />
      <published>2022-12-19T00:00:00+00:00</published>
      <updated>2022-12-19T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/19/trino-summit-2022-sk-telecom-recap</id>
<content type="html" xml:base="https://trino.io/blog/2022/12/19/trino-summit-2022-sk-telecom-recap.html">&lt;p&gt;This post comes from &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;the second half of the Trino Summit 2022 sessions&lt;/a&gt;. Our friends JaeChang and Jennifer from
SK Telecom traveled across the globe from South Korea to join us in person! SK
Telecom recently had some issues scaling Trino on the Hive model, among other
issues that come with Hive. While some initial tweaking helped speed things up,
it ultimately never solved the problem. After switching to Iceberg, SK Telecom
ran initial performance tests with some very impressive results. In this talk,
Jennifer and JaeChang describe their journey to Iceberg with Trino.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/V9_aPLXATh8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@SK-Telecom.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;SK Telecom is a South Korean telecom company that has built and operated an
on-premise data platform based on open source software to determine
manufacturing yield since 2015. SK Telecom’s goal has always been to build an observable
federated data platform on open source software at scale.&lt;/p&gt;

&lt;p&gt;SK Telecom manages on-premise Hadoop clusters to store their data. Previously,
they used tools like
&lt;a href=&quot;https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html&quot;&gt;distcp&lt;/a&gt; to
make data available in one center. SK Telecom started using Presto in 2016 and
shifted to Trino in 2021. To run batch queries on their warehouse, Trino workers
are deployed on HDFS data nodes. There is also an adhoc Trino cluster deployed
to manage federated queries over multiple data silos from an array of disparate
data sources. This was one of the slow and brittle processes that Trino
replaced. They chose Trino because it simplifies querying novel big data systems
and combines that data with more commonplace systems for their users.&lt;/p&gt;

&lt;p&gt;As Trino adoption grew within the company, reaching up to 300 requests per
minute, they eventually faced challenges with scaling. Not only was the number
of requests growing, but the range of data being queried grew as well; users were
evaluating petabytes of data, with terabyte-sized query input processed across
hundreds of nodes. Many user queries were blocked while waiting for resources to
become available. In response, the data engineering team began investigating how
they could both scale and improve individual query performance.&lt;/p&gt;

&lt;p&gt;To find the root cause, SK Telecom’s data engineers investigated cluster
behavior beyond what was exposed in the web UI. They began collecting all the
query plan JSON files, coordinator and worker JMX stats, system metrics, and
Trino logs to build out their own metrics dashboard. The two main
causes were that input data was too large, and there were spikes in the number
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BlockedSplit&lt;/code&gt; operations leading to queries being blocked while waiting for
other tasks to complete. They initially aimed to address this by changing some
settings to increase thread counts and tuning the settings, but these changes
still didn’t achieve the desired results. The ultimate bottleneck was the Hive
metastore and the expensive list operations that caused many of the blocking
operations to finish slowly.&lt;/p&gt;

&lt;p&gt;At this point, the team reevaluated their needs to consider alternative
solutions. They needed a better indexing strategy on the data with a flexible
partitioning strategy. They also needed to remove the bottleneck on the metadata
for this data while still maintaining compatibility across multiple query
engines as Hive did.&lt;/p&gt;

&lt;p&gt;The team looked at the existing set of novel data lake connectors available in
Trino version 356, which at the time only included Iceberg. SK Telecom was 
immediately impressed by the metadata indexing in the Iceberg project. They 
particularly liked Iceberg’s snapshot isolation as data is created or modified.
They were able to speed up queries using data file pruning on partition and
column stats stored in the manifest file.&lt;/p&gt;

&lt;p&gt;After running a benchmark, the team found that Iceberg reduced the input data
size from hundreds of gigabytes down to under ten. They also investigated
adding a higher number of partitions to lower the input data size further, but
found that there’s a tradeoff where creating too many partitions
increases query planning time. Ultimately, they found a sweet spot where the
input data size was around six gigabytes and planning only took 70 milliseconds.&lt;/p&gt;
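&lt;p&gt;In Trino’s Iceberg connector, the partitioning that drives this kind of data
file pruning is declared as a table property. The table and column names below
are illustrative only:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE iceberg.example.events (
    event_time timestamp(6),
    device_id bigint,
    payload varchar
)
WITH (
    -- hidden partitioning: queries filtering on event_time prune data files
    partitioning = ARRAY['day(event_time)']
);
&lt;/code&gt;&lt;/pre&gt;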

&lt;p&gt;This summary is just the tip of the iceberg of all the information JaeChang and
Jennifer shared with us about how Iceberg helped SK Telecom with their Trino
scaling issues. Watch this incredible talk to learn more if you’re considering
taking the leap from Hive to Iceberg!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, please consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/19/trino-summit-2022-sk-telecom-recap.html&quot;&gt;https://trino.io/blog/2022/12/19/trino-summit-2022-sk-telecom-recap.html&lt;/a&gt;. If you think Trino is awesome, 
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/sk-telecom-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>JaeChang Song, Jennifer Oh, Brian Olsen</name>
        </author>
      

      <summary>This post comes from the second half of Trino Summit 2022 session. Our friends JaeChang and Jennifer from SK Telecom traveled across the globe from South Korea to join us in person! SK Telecom recently had some issues scaling Trino on the Hive model, among other issues that come with Hive. While some initial tweaking helped speed things up, it ultimately never solved the problem. After switching to Iceberg, SK Telecom ran initial performance tests with some very impressive results. In this talk, Jennifer and JaeChang describe their journey to Iceberg with Trino.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/sk-telecom.jpg" />
      
    </entry>
  
    <entry>
      <title>Trino at Quora: Speed, cost, reliability challenges, and tips</title>
      <link href="https://trino.io/blog/2022/12/16/trino-summit-2022-quora-recap.html" rel="alternate" type="text/html" title="Trino at Quora: Speed, cost, reliability challenges, and tips" />
      <published>2022-12-16T00:00:00+00:00</published>
      <updated>2022-12-16T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/16/trino-summit-2022-quora-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/16/trino-summit-2022-quora-recap.html">&lt;p&gt;As we near the end of the &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit 2022 recap series&lt;/a&gt;, it’s time to take a stop at Quora. At
Quora, being an engineer responsible for maintaining Trino comes with its fair
share of challenges. With concerns about cost, performance, and reliability,
Quora has taken several creative steps to ensure that they get the most out of
Trino. Other Trino users may be able to learn a few neat tips and tricks to
do the same by tuning in.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/Q03DzL_fm-I&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@Quora.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Trino at Quora is used in the big ways that we’re all familiar with. It receives
queries from a variety of clients and services, then executes those queries
on an S3 data lake and Hive metastore to return results at high speeds. With a
wide variety of clients, Quora gets the most out of Trino, using it for ad-hoc
analysis, but also for ETL, backfill jobs, A/B testing, and time series queries.
But as with any large system being used for so many things, this isn’t without a
few challenges.&lt;/p&gt;

&lt;p&gt;The first challenge is a universal one - how can Quora keep the costs of running
Trino to a minimum? One of the biggest strategies was to migrate to AWS Graviton
instances to run Trino clusters, as they have proven to be more cost-efficient
than other AMD and Intel-based EC2 instances at Quora. Graviton does have lower 
availability, though, so they sometimes must be complemented with some AMD/Intel
instances in order to avoid any downtime. Auto-scaling also led to great cost
savings, as the workloads varied based on time of day. By anticipating usage,
ramping up the number of machines during the busy workday and ramping back
down when fewer jobs are in progress, Quora was able to minimize
idle machines and cut back on unnecessary spending. Finally, and perhaps most
obviously, the team at Quora worked to make ETL queries more efficient. By using
partitions effectively and creating a tool to detect inefficient queries
scanning too many partition keys, the result is efficient queries that take less
time and use fewer resources, saving on cost.&lt;/p&gt;
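
&lt;p&gt;As a sketch of the kind of query such a tool encourages (the table and
partition column names here are hypothetical), filtering directly on the
partition key lets Trino prune partitions instead of scanning them all:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Scans a single partition rather than the whole table
SELECT count(*)
FROM hive.events.page_views
WHERE ds = DATE '2022-12-01';
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;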

&lt;p&gt;Up next - how could Quora maximize Trino’s performance? With data analysts
expecting quick runtimes and occasionally running into problems, fine-tuning
Trino to run as well as it possibly can isn’t always an easy task. One
particular major issue they found at Quora was that some worker nodes which ran
for 24 hours or more straight would utilize less CPU and run slowly, bogging
things down. The fix? Gracefully restart worker nodes that run for over a day,
and implement a detector to flag and restart any nodes which showed signs of
behaving slowly.&lt;/p&gt;

&lt;p&gt;The final big concern at Quora is reliability, as users expect Trino to be up
and running whenever they need it. In one instance, they found that overwriting
a specific configuration option caused a cluster to crash repeatedly and
slow down to a crawl. The issue was that they’d steadily been bumping the value
of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query.min-expire-age&lt;/code&gt; configuration property up and up and up from the
default value of 15 minutes, until eventually, unexpired query history was using
up too much memory and causing the cluster to falter. Lowering the value back
down to something more advisable saved the day in that situation. But wanting to
avoid similar situations from happening again, Quora built extensive monitoring
tools to track the health of their Trino clusters. They ensure that even when
user error does cause problems, those problems are flagged and alerts sent
out, bringing the data engineering team to the rescue.&lt;/p&gt;
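
&lt;p&gt;For reference, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query.min-expire-age&lt;/code&gt; is set in the coordinator’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.properties&lt;/code&gt;. A minimal sketch, keeping the value near its default:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# How long a completed query stays in coordinator memory before it
# can be expired; raising this retains more history but uses more memory.
query.min-expire-age=15m
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;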

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, please consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/16/trino-summit-2022-quora-recap.html&quot;&gt;https://trino.io/blog/2022/12/16/trino-summit-2022-quora-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/quora-social.jpg&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Yifan Pan, Cole Bowden</name>
        </author>
      

      <summary>As we near the end of the Trino Summit 2022 recap series, it’s time to take a stop at Quora. At Quora, being an engineer responsible for maintaining Trino comes with its fair share of challenges. With concerns about cost, performance, and reliability, Quora has taken several creative steps to ensure that they get the most out of Trino. Other Trino users may be able to learn a few neat tips and tricks to do the same by tuning in.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/quora.jpg" />
      
    </entry>
  
    <entry>
      <title>Federating them all on Starburst Galaxy</title>
      <link href="https://trino.io/blog/2022/12/14/trino-summit-2022-starburst-recap.html" rel="alternate" type="text/html" title="Federating them all on Starburst Galaxy" />
      <published>2022-12-14T00:00:00+00:00</published>
      <updated>2022-12-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/14/trino-summit-2022-starburst-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/14/trino-summit-2022-starburst-recap.html">&lt;p&gt;As the &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit 2022 recap post series&lt;/a&gt; continues on, I have been reading all the
wonderful posts by our awesome speakers, facilitated by the Trino developer
relations team. Because I have a perpetual fear of missing out, I convinced them
that I should get in on the fun. For this latest installment in the series, I
will be recapping my very own Trino Summit talk. Basically, I’m ripping off
Bo Burnham’s comedy bit where he &lt;a href=&quot;https://youtu.be/FZVMB8mrNO0?t=35&quot;&gt;reacts to his own reaction video&lt;/a&gt;,
blog style.&lt;/p&gt;

&lt;p&gt;In this session, I demonstrate building a data lakehouse architecture with
&lt;a href=&quot;https://www.starburst.io/platform/starburst-galaxy/&quot;&gt;Starburst Galaxy&lt;/a&gt;, the
fastest and easiest way to get up and running with Trino.
Before I dive into the recap, I want to thank the Trino community for showing
up. I am grateful that I was able to meet and learn from so many members of the
community in person.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/Zfmxwu0m98k&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;The premise of this example is that we have Pokémon Go data being ingested into
S3, which contains each Pokémon’s encounter information. This includes the
geo-location data of where each Pokémon spawned, and how long the Pokémon could
be found at that location. What we don’t have is any
information on that Pokémon’s abilities. That information is contained in the
Pokédex stored in MongoDB which I’ve cleverly nicknamed &lt;strong&gt;PokéMongoDB&lt;/strong&gt;. It
includes data about all the Pokémon including type, legendary status,
catch rate, and more. To create meaningful insights from our data, we need
to combine the incoming geo-location data with the static dimension CSV table
located in MongoDB.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/starburst-architecture.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To do this, I build out a reporting structure in the data lake using
Starburst Galaxy. The first step is to read the raw data stored in the land
layer, then clean and optimize that data into more performant ORC files in the
structure layer. Finally, I join the spawn data and Pokédex data together into a
single table that is cleaned and ready to be utilized by a data consumer.
Next I apply role-based access control capabilities within Starburst
Galaxy, which provides the proper data governance so that data consumers only
have read permissions to that final table. I then create some visualizations to
analyze which Pokémon are common to spawn in the San Francisco area.&lt;/p&gt;
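
&lt;p&gt;A rough sketch of the structure-layer step (catalog, schema, and table names
here are made up for illustration) uses CTAS to rewrite the raw landed data as
ORC:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Rewrite raw land-layer data into a more performant ORC table
CREATE TABLE lake.structure.spawns
WITH (format = 'ORC') AS
SELECT *
FROM lake.land.raw_spawns;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;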

&lt;p&gt;I walk through all the setup required to put this data lakehouse architecture
into action including creating my catalogs, cluster, schemas, and tables. After
incorporating open table formats, applying native security, and building
out a reporting structure, I have confidence that my data lakehouse is built
to last, and end up with some really cool final Pokémon graphs.&lt;/p&gt;

&lt;h2 id=&quot;helpful-links&quot;&gt;Helpful links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Sign up for &lt;a href=&quot;https://www.starburst.io/platform/starburst-galaxy/start/&quot;&gt;Starburst Galaxy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Read the &lt;a href=&quot;https://docs.starburst.io/starburst-galaxy/index.html&quot;&gt;docs&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Try a
&lt;a href=&quot;https://docs.starburst.io/starburst-galaxy/tutorials/index.html&quot;&gt;tutorial&lt;/a&gt; for yourself&lt;/li&gt;
  &lt;li&gt;Register for &lt;a href=&quot;https://www.starburst.io/datanova/?utm_source=event&amp;amp;utm_medium=datanova&amp;amp;utm_campaign=[…]Event-Datanova-social-promo&amp;amp;utm_content=trinosummitrecapblog&quot;&gt;Datanova&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/14/trino-summit-2022-starburst-recap.html&quot;&gt;https://trino.io/blog/2022/12/14/trino-summit-2022-starburst-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/starburst-social.jpg&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Monica Miller</name>
        </author>
      

<summary>As the Trino Summit 2022 recap post series continues on, I have been reading all the wonderful posts by our awesome speakers, facilitated by the Trino developer relations team. Because I have a perpetual fear of missing out, I convinced them that I should get in on the fun. For this latest installment in the series, I will be recapping my very own Trino Summit talk. Basically, I’m ripping off Bo Burnham’s comedy bit where he reacts to his own reaction video, blog style. In this session, I demonstrate building a data lakehouse architecture with Starburst Galaxy, the fastest and easiest way to get up and running with Trino. Before I dive into the recap, I want to thank the Trino community for showing up. I am grateful that I was able to meet and learn from so many members of the community in person.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/starburst.jpg" />
      
    </entry>
  
    <entry>
      <title>Trino for large scale ETL at Lyft</title>
      <link href="https://trino.io/blog/2022/12/12/trino-summit-2022-lyft-recap.html" rel="alternate" type="text/html" title="Trino for large scale ETL at Lyft" />
      <published>2022-12-12T00:00:00+00:00</published>
      <updated>2022-12-12T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/12/trino-summit-2022-lyft-recap</id>
<content type="html" xml:base="https://trino.io/blog/2022/12/12/trino-summit-2022-lyft-recap.html">&lt;p&gt;Buckle up for the next &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;post in the Trino Summit 2022 recap series&lt;/a&gt;. In this post, we’re covering the talk
given by Lyft engineers, Charles and Ritesh, on how they have scaled Trino as
adoption grew while using fewer nodes more effectively. They
also started moving to utilizing Trino more for ETL rather than just interactive
analytics. Get ready for a smooth ride as Lyft brings you large scale ETL with
Trino.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/FL3c1Ue7YWM&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@Lyft.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Lyft uses Trino to perform ETL jobs reading 10 petabytes of data per day and
writing 100 terabytes per day. They run 250,000 queries per day, with around
2,000 unique users. This requires approximately 750 EC2 instances scaling up or
down with an autoscaler. Over 90 percent of queries complete within one to
three minutes.&lt;/p&gt;

&lt;p&gt;In the last year, Lyft cut their number of Trino nodes in half, while increasing
their workloads. This is possible due to recent improvements in Trino and
upgrades in Java versions. Lyft is not using fault-tolerant execution, but has
started seeing interest in using Trino for ETL jobs due to the faster
turnaround. Some issues Lyft has faced include how resource-hungry Trino is,
as well as the coordinator being a single point of failure for queries
executing on a cluster.&lt;/p&gt;

&lt;p&gt;Lyft was one of the earliest companies to really push using Trino for ETL use
cases. They built custom best effort rollback code in Apache Airflow. If a query
fails, the operation reverts to the state before the operation began. Lyft runs
four Trino clusters split by the type of workload used on that cluster. The best
practices are careful usage around broadcast joins, query sharding, and scaling
writers for ETL loads.&lt;/p&gt;
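
&lt;p&gt;As an illustration of tuning those areas per query (standard Trino session
property names, shown as a sketch rather than Lyft’s exact setup), session
properties control join distribution and writer scaling:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Avoid broadcasting a large build side in ETL joins
SET SESSION join_distribution_type = 'PARTITIONED';
-- Allow Trino to scale out writers as output volume grows
SET SESSION scale_writers = true;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;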

&lt;p&gt;One final point Lyft made is that keeping up with the rapid release cycle
of Trino was a challenge. Lyft showcases their regression testing using their
query replay framework. This session is a smooth five out of five ride. Enjoy!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, please consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/12/trino-summit-2022-lyft-recap.html&quot;&gt;https://trino.io/blog/2022/12/12/trino-summit-2022-lyft-recap.html&lt;/a&gt;. If you think Trino is awesome, 
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/lyft-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Charles Song, Ritesh Varyani, Brian Olsen</name>
        </author>
      

<summary>Buckle up for the next post in the Trino Summit 2022 recap series. In this post, we’re covering the talk given by Lyft engineers, Charles and Ritesh, on how they have scaled Trino as adoption grew while using fewer nodes more effectively. They also started moving to utilizing Trino more for ETL rather than just interactive analytics. Get ready for a smooth ride as Lyft brings you large scale ETL with Trino.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/lyft.jpg" />
      
    </entry>
  
    <entry>
      <title>Rewriting History: Migrating petabytes of data to Apache Iceberg using Trino</title>
      <link href="https://trino.io/blog/2022/12/09/trino-summit-2022-shopify-recap.html" rel="alternate" type="text/html" title="Rewriting History: Migrating petabytes of data to Apache Iceberg using Trino" />
      <published>2022-12-09T00:00:00+00:00</published>
      <updated>2022-12-09T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/09/trino-summit-2022-shopify-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/09/trino-summit-2022-shopify-recap.html">&lt;p&gt;Rolling right along with another one of &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;our Trino Summit 2022 recap posts&lt;/a&gt;, we’re excited to bring you the engaging
talk from Marc Laforet at Shopify. He talked about the ordeal (or, if you look
at it in a positive light, the privilege) of migrating petabytes of data from
Hive to Iceberg table formats with the help of Trino. With details on why
Shopify chose to move to Iceberg, the various migration strategies that were
considered, and the ultimate process of moving all that data while the Trino
Iceberg connector was still in active development, it’s an insightful talk that
you don’t want to miss.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/nJBBw-xnLU8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Shopify@Trino.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;As with many other Trino users, it should come as no surprise that Shopify
has a lot of data to work with. First-party data comes in from a few different
sources, and there’s a mountain of modelled data to go along with it. In
Shopify’s case, one of the issues was that some data sets were built on top of
custom table formats. On top of that, the architecture wasn’t scaled with a
careful plan in mind, leading to limited interoperability of datasets among
various tools. With data scientists unable to unify data across different tools
and storages, it was time for a change.&lt;/p&gt;

&lt;p&gt;When you’ve got tons of data that isn’t currently in one place, what’s the fix?
Create a central lakehouse for all the data to be accessible from, a
single-service portal that could serve all users’ needs. The first question was
which table format to use, and if the title of the blog post didn’t already give
it away, they chose to go with Apache Iceberg. It was an easy, central vision
to work towards: all data in a centralized lakehouse stored in Iceberg, then
queryable by Trino.&lt;/p&gt;

&lt;p&gt;Having a plan and putting that plan into action are two different things,
though. When nothing is already in Iceberg, moving it all there is a migration
on the scale of thousands of tables and petabytes of data. In Marc’s words from
the talk, once Shopify committed to the migration and invested resources into
it, the realization was, “crap, now I have to build it.” Even worse, because the
old data was primarily in gzipped JSON format, it all needed to be rewritten…
and so it was.&lt;/p&gt;

&lt;p&gt;Then, enter Trino! With new Iceberg-based tables, Trino was identified as the
right tool for the job to process all that data. This wasn’t without snags, as
the migration happened while the Iceberg connector was still being aggressively
worked on and developed. There were a few different incidents where Shopify hit
a snag or an issue, and an update or bugfix to Trino’s Iceberg connector solved
those problems in a matter of days or weeks.&lt;/p&gt;

&lt;p&gt;The result of all of this? Some incredible benchmark results. Large tables saw a
96% reduction in planning time, a 96% reduction in cumulative user memory, and a
95% reduction in query execution time. That’s the difference between thousands
of terabytes of memory to under 100, and a query that would take an hour to run
only taking three minutes. For the absolute largest table at Shopify, some
queries saw a 99.9% reduction in execution time. Yes, that number is real.&lt;/p&gt;

&lt;p&gt;Moral of the story? If you find yourself using an old Hive table with outdated
file formats, lamenting the resources you need and the time it takes, the
decision is easy. Migrate to Iceberg with Trino. Shopify has shown us the way,
and the full talk has plenty of useful advice for how to best go about it.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/09/trino-summit-2022-shopify-recap.html&quot;&gt;https://trino.io/blog/2022/12/09/trino-summit-2022-shopify-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/shopify-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Marc Laforet, Cole Bowden</name>
        </author>
      

      <summary>Rolling right along with another one of our Trino Summit 2022 recap posts, we’re excited to bring you the engaging talk from Marc Laforet at Shopify. He talked about the ordeal (or, if you look at it in a positive light, the privilege) of migrating petabytes of data from Hive to Iceberg table formats with the help of Trino. With details on why Shopify chose to move to Iceberg, the various migration strategies that were considered, and the ultimate process of moving all that data while the Trino Iceberg connector was still in active development, it’s an insightful talk that you don’t want to miss.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/shopify.jpg" />
      
    </entry>
  
    <entry>
      <title>Elevating data fabric to data mesh: Solving data needs in hybrid data lakes</title>
      <link href="https://trino.io/blog/2022/12/07/trino-summit-2022-comcast-recap.html" rel="alternate" type="text/html" title="Elevating data fabric to data mesh: Solving data needs in hybrid data lakes" />
      <published>2022-12-07T00:00:00+00:00</published>
      <updated>2022-12-07T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/07/trino-summit-2022-comcast-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/07/trino-summit-2022-comcast-recap.html">&lt;p&gt;Tune in for the next &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;post in the Trino Summit 2022 recap series&lt;/a&gt;. In this post, we’re joining Saj from
Comcast to talk about their migration from a data fabric to a data mesh. Saj
shows you that there is more to the buzzword than meets the eye. He gives a
solid overview of why Comcast is taking data mesh to heart.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/sSWBi7bBotQ&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@Comcast.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Comcast engineer Sajuman Joseph walks us through how Comcast moved from its
initial use case of using Trino to power its data fabric architecture to
including more governance features, again by leveraging Trino. Data fabric enables
querying data across distributed data sets, but importantly, it allows Comcast
to transparently migrate data across on-prem and cloud storage without impacting
users.&lt;/p&gt;

&lt;p&gt;Despite offering query federation, data fabric still lacks the
higher-quality experience that data mesh aims to provide. Not only does having
access to the data matter, but also adding data quality checks and a dedicated
owner to ensure the data is correct and consumable. The ownership is split by
domains defined by Comcast. It is the responsibility of the owners to ensure
data quality, compliance, and security on the data they own. This data can be
exposed internally or externally as a data product. While many of the drivers
for this are done through company policy, there are technical means to make this
possible. This includes improving metadata on the data, access logs, global
data catalogs, and managing data access.&lt;/p&gt;

&lt;p&gt;Trino facilitates a single point of access and is the primary location where
policies are enforced. Comcast created an engine called the Enterprise Policy
Hub which syncs with all data stores and compute engines to enforce company
policy and update metadata on all data across Comcast. Trino, along with other
query engines, consults this engine to determine what information a user has
access to, who owns the data, and creates an audit trail of what queries are
run.&lt;/p&gt;

&lt;p&gt;There are still some open challenges Comcast is looking to overcome. Data
discovery is a large challenge for anyone looking to find a specific table and
who is responsible for updating it. Another interesting area Comcast is
researching is creating automated retention and minimization of data copies.
This talk was exciting and gives a pretty clear roadmap to some beneficial
changes many teams can make to improve the quality and governance of their data
sets.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social card and
link to &lt;a href=&quot;https://trino.io/blog/2022/12/07/trino-summit-2022-comcast-recap.html&quot;&gt;https://trino.io/blog/2022/12/07/trino-summit-2022-comcast-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/comcast-social.jpg&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Sajuman Joseph, Brian Olsen</name>
        </author>
      

<summary>Tune in for the next post in the Trino Summit 2022 recap series. In this post, we’re joining Saj from Comcast to talk about their migration from a data fabric to a data mesh. Saj shows you that there is more to the buzzword than meets the eye. He gives a solid overview of why Comcast is taking data mesh to heart.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/comcast.jpg" />
      
    </entry>
  
    <entry>
      <title>AI and poetry meet Trino</title>
      <link href="https://trino.io/blog/2022/12/06/poetry-ai.html" rel="alternate" type="text/html" title="AI and poetry meet Trino" />
      <published>2022-12-06T00:00:00+00:00</published>
      <updated>2022-12-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/06/poetry-ai</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/06/poetry-ai.html">&lt;p&gt;Sometimes when working in the depth of the query engine core, the planner, the
optimizer, or some other tricky problem or code, you run out of steam and need a
distraction. Your brain will quietly work on the problem in the background while
you get to read or do something interesting.&lt;/p&gt;

&lt;p&gt;During one break we found out more about GPT-3 and the recent improvements on
the OpenAI project. So we had to try.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;It’s pretty amazing that you can feed the AI a prompt just like we did:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Write a short rhyming poem explaining Trino in easy but accurate terms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And after a few trials, the best one produced by GPT-3 was this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Trino is a tech platform,&lt;br /&gt;
That makes data integration a breeze,&lt;br /&gt;
It’s designed to help your business grow,&lt;br /&gt;
And optimise your systems with ease.&lt;br /&gt;
&lt;br /&gt;
It links together different sources,&lt;br /&gt;
Both big and small,&lt;br /&gt;
Gathers data of different types,&lt;br /&gt;
Helping you make decisions for all.&lt;br /&gt;
&lt;br /&gt;
So Trino is an integration tool,&lt;br /&gt;
That connects data for you,&lt;br /&gt;
It’ll ease the burden of data sync,&lt;br /&gt;
Giving you more time to do what you do.&lt;br /&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cool, huh? Have a look at another prompt:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Write a short rhyming poem explaining Trino, the SQL query engine in complex
and accurate terms.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the compelling result:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Trino is a query engine that does more than just run,&lt;br /&gt;
It’s a SQL engine with performance that stuns.&lt;br /&gt;
It is secure, efficient, and kind,&lt;br /&gt;
Providing real-time analytics of any kind.&lt;br /&gt;
&lt;br /&gt;
It’s a distributed system that’s built to scale,&lt;br /&gt;
Processing queries that never fail.&lt;br /&gt;
Connecting to myriad data sources,&lt;br /&gt;
It can quickly analyze any course.&lt;br /&gt;
&lt;br /&gt;
It’s flexible, reliable, and fast,&lt;br /&gt;
And is designed to make your queries last.&lt;br /&gt;
It can handle large volumes of data with ease,&lt;br /&gt;
Making the most complex queries a breeze.&lt;br /&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are interested in learning more, you can look at the
&lt;a href=&quot;https://arstechnica.com/information-technology/2022/11/openai-conquers-rhyming-poetry-with-new-gpt-3-update/&quot;&gt;blog post on Ars Technica&lt;/a&gt;
or go and try it yourself &lt;a href=&quot;https://beta.openai.com/playground/&quot;&gt;on the playground&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Enjoy while we are heading back to &lt;a href=&quot;https://github.com/trinodb/trino/pulls&quot;&gt;working on Trino pull
requests&lt;/a&gt; and other code now.&lt;/p&gt;

&lt;p&gt;Martin and Marcos&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso, Marcos Traverso</name>
        </author>
      

      <summary>Sometimes when working in the depth of the query engine core, the planner, the optimizer, or some other tricky problem or code, you run out of steam and need a distraction. Your brain will quietly work on the problem in the background while you get to read or do something interesting. During one break we found out more about GPT-3 and the recent improvements on the OpenAI project. So we had to try.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/graphics/trino-openapi-header.png" />
      
    </entry>
  
    <entry>
      <title>Leveraging Trino to power data at Goldman Sachs</title>
      <link href="https://trino.io/blog/2022/12/05/trino-summit-2022-goldman-sachs-recap.html" rel="alternate" type="text/html" title="Leveraging Trino to power data at Goldman Sachs" />
      <published>2022-12-05T00:00:00+00:00</published>
      <updated>2022-12-05T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/05/trino-summit-2022-goldman-sachs-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/05/trino-summit-2022-goldman-sachs-recap.html">&lt;p&gt;Continuing with &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;the Trino Summit 2022 sessions posts&lt;/a&gt;, we’re diving into an insightful
lightning talk from &lt;a href=&quot;https://www.goldmansachs.com&quot;&gt;Goldman Sachs&lt;/a&gt;. They explore
how they use Trino to help ensure data quality across the board for all users
and customers. By using Trino to federate their various data sources, querying
everything in one place provides them with the flexibility they need. With that
flexibility, they can validate that all data is as it should be where that data
lives, settling any concerns that may exist about data integrity.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/g9fLA3tFG-Q&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Validating data quality can be a tricky and complicated process. Data resides
in many sources, with different rules and different processes for checking
quality. Goldman’s data ingestion team may not have a detailed understanding
of all data sets. Despite that, there is a need to autonomously verify and
validate all data to be confident in its quality and integrity. The solution to
this challenge? A queryable data quality platform powered by Trino.&lt;/p&gt;

&lt;p&gt;The underlying data quality platform’s logic handles the validation. Resting
on top of it is Trino, the scalable, fast solution to ensure that users can
query what they need. Even when the platform is profiling the data, enforcing
various quality rules, and validating the data in different ways, Trino is there
to provide access to everything contained within, proving that quality, speed,
and accessibility don’t need to be tradeoffs.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/05/trino-summit-2022-goldman-sachs-recap.html&quot;&gt;https://trino.io/blog/2022/12/05/trino-summit-2022-goldman-sachs-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/goldman-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Sumit Halder, Siddhant Chadha, Suman-Newton, Ramesh Bhanan, Cole Bowden</name>
        </author>
      

      <summary>Continuing with the Trino Summit 2022 sessions posts, we’re diving into an insightful lightning talk from Goldman Sachs. They explore how they use Trino to help ensure data quality across the board for all users and customers. By using Trino to federate their various data sources, querying everything in one place provides them with the flexibility they need. With that flexibility, they can validate that all data is as it should be where that data lives, settling any concerns that may exist about data integrity.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/goldman-sachs.png" />
      
    </entry>
  
    <entry>
      <title>Optimizing Trino using spot instances with Zillow</title>
      <link href="https://trino.io/blog/2022/12/01/trino-summit-2022-zillow-recap.html" rel="alternate" type="text/html" title="Optimizing Trino using spot instances with Zillow" />
      <published>2022-12-01T00:00:00+00:00</published>
      <updated>2022-12-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/01/trino-summit-2022-zillow-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/01/trino-summit-2022-zillow-recap.html">&lt;p&gt;In this installment of &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;the Trino Summit 2022 sessions posts&lt;/a&gt;, we jump into an exciting topic by folks
from &lt;a href=&quot;https://www.zillow.com&quot;&gt;Zillow&lt;/a&gt; about running Trino on spot instances.
Spot instances are cheap, ephemeral nodes that reduce overall compute costs;
they cost less because they are not guaranteed to remain
available.&lt;/p&gt;

&lt;p&gt;In this session, Zillow engineers talk about how they use Trino on spots to take
advantage of the cost savings while handling the transitory nature of spots.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/vz9reBUgQTE&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@Zillow.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Zillow’s BI platform team is tasked with enabling access to data and metrics
from their data lake in a self-service and performant manner. The platform must
handle generating up-to-date reports and metrics to unlock time-critical
opportunities. They also need to enable ad hoc analytics across multiple domains
within Zillow.&lt;/p&gt;

&lt;p&gt;There are close to 600 data pipelines and 65,000 queries running daily. The
average read covers 600 terabytes of data, and the average P95 time is around
20 seconds. They have six Trino clusters that service various workflows based on
load. These are all deployed on Amazon EKS with a range of eight to 60 workers
based on CPU utilization.&lt;/p&gt;

&lt;p&gt;When deploying Trino on EKS, Zillow uses worker groups, which enables them to
collocate nodes in AWS local zones. It also made it possible to choose spot 
instances, which are 90% cheaper than regular on-demand instances. A critical
aspect they needed to cover was to correctly tune the percentage of nodes that
were spot instances. They created pools of nodes that were entirely on-demand
for coordinators, since a coordinator going down brings down the entire cluster.
Other pools used for workers are tuned to an optimal blend of spot and
on-demand.&lt;/p&gt;

&lt;p&gt;Watch this session to learn how to properly optimize the number of spot
instances running for your Trino clusters, without losing reliability of your
service. Also learn some ways that Zillow is planning on using the
fault-tolerant execution mode.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, please consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/12/01/trino-summit-2022-zillow-recap.html&quot;&gt;https://trino.io/blog/2022/12/01/trino-summit-2022-zillow-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/zillow-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Santhosh Venkatraman, Rupesh Kumar Perugu, Brian Olsen</name>
        </author>
      

      <summary>In this installment of the Trino Summit 2022 sessions posts, we jump into an exciting topic by folks from Zillow about running Trino on spot instances. Spot instances are cheap and ephemeral nodes that lead to reduced overall compute costs. Spot instances are cheaper as they are not guaranteed to remain available. In this session, Zillow engineers talk about how they use Trino on spots to take advantage of the cost savings while handling the transitory nature of spots.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/zillow.jpg" />
      
    </entry>
  
    <entry>
      <title>Trino delivers for Amazon Athena</title>
      <link href="https://trino.io/blog/2022/12/01/athena.html" rel="alternate" type="text/html" title="Trino delivers for Amazon Athena" />
      <published>2022-12-01T00:00:00+00:00</published>
      <updated>2022-12-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/12/01/athena</id>
      <content type="html" xml:base="https://trino.io/blog/2022/12/01/athena.html">&lt;p&gt;Our community just keeps growing! Today, it is time to reach out and welcome
another large group of Trino users. The release of the new engine version for
&lt;a href=&quot;https://aws.amazon.com/athena&quot;&gt;Amazon Athena&lt;/a&gt; upgrades Athena from a rather
old version of Trino to a recent one. This update brings a ton of
improvements from the Trino project to the users of the popular cloud-based
query service.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;shared-history&quot;&gt;Shared history&lt;/h2&gt;

&lt;p&gt;Amazon Athena and Trino share a long history. From the beginning of Athena, the
query engine under the hood was Trino, then still called Presto. Athena created
a low-maintenance, powerful access mode to your data in S3 and beyond. It
combined the performance and features of Trino with the convenience of a cloud
service, which enabled new users and use cases. You could take advantage of
Trino without needing a team of experts to deploy and operate a Trino cluster
for your organization. In fact, we wrote about this in the first edition of
&lt;strong&gt;Trino: The Definitive Guide&lt;/strong&gt;. There is also a section in the &lt;a href=&quot;/blog/2022/10/03/the-definitive-guide-2.html&quot;&gt;new second
edition&lt;/a&gt; that you can get for
&lt;a href=&quot;https://www.starburst.io/info/oreilly-trino-guide/&quot;&gt;free from Starburst&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;time-flies&quot;&gt;Time flies&lt;/h2&gt;

&lt;p&gt;But since the initial release of Athena, time has not stood still. In fact, the
Trino project has accelerated tremendously in &lt;a href=&quot;/blog/2022/08/04/decade-innovation.html&quot;&gt;innovation, features, and
releases&lt;/a&gt;. Until now, Athena
users missed out on these improvements. However, with the update, Amazon Athena
users now get access to many of these great features. As &lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2022/10/amazon-athena-announces-upgraded-query-engine/&quot;&gt;AWS mentions in the
announcement&lt;/a&gt;,
“over 50 new SQL functions, 30 new features, and more than 90 query performance
improvements” are now available due to the upgrade to a new version of Trino. These
include &lt;a href=&quot;/blog/2021/05/19/row_pattern_matching.html&quot;&gt;Row pattern recognition with MATCH_RECOGNIZE&lt;/a&gt;, &lt;a href=&quot;/blog/2021/03/10/introducing-new-window-features.html&quot;&gt;new window features&lt;/a&gt;, support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TRUNCATE&lt;/code&gt; statements, and many others.&lt;/p&gt;

&lt;p&gt;Performance improvements in our core engine and all the Trino connectors show up
in every release note. The &lt;a href=&quot;https://aws.amazon.com/blogs/big-data/upgrade-to-athena-engine-version-3-to-increase-query-performance-and-access-more-analytics-features/&quot;&gt;improvements observed by the Athena team in their
benchmarks&lt;/a&gt;
show the resulting gains nicely. This is great evidence that our approach of
constantly working on small improvements wherever we find potential works well.
This approach is necessary since Trino is already at a very high performance
level; like an elite athlete, every small improvement matters.

&lt;p&gt;It is also important to note that these improvements are only in the Trino
version of the engine, since the &lt;a href=&quot;/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html&quot;&gt;Presto project does not include these
features&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;client-tools-and-collaboration&quot;&gt;Client tools and collaboration&lt;/h2&gt;

&lt;p&gt;Athena users also benefit from improvements for supporting client tools such as
Python clients, dbt, Metabase and others. Working with other communities is of
critical importance to the Trino project. The &lt;a href=&quot;https://trino.io/episodes/40.html&quot;&gt;innovations in our Iceberg
connector&lt;/a&gt; that are all now also available to
Athena users are a great example of how we can lead the way together. Working with
contributors from Amazon and other companies and projects has yielded some
amazing improvements. At the &lt;a href=&quot;https://trino.io/episodes/42.html&quot;&gt;Trino summit and contributor
congregation&lt;/a&gt;, we reconnected in person and
established even closer collaboration.&lt;/p&gt;

&lt;h2 id=&quot;looking-forward&quot;&gt;Looking forward&lt;/h2&gt;

&lt;p&gt;So, what is next for Trino and Athena users? First up, you should upgrade to the
new Trino engine in Athena, and avoid the legacy Presto engine.&lt;/p&gt;

&lt;p&gt;Second, check out some of the great presentations from &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit 2022&lt;/a&gt; and &lt;a href=&quot;https://trino.io/episodes/42.html&quot;&gt;hear about some of our
impressions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And last but not least, stay tuned for more goodness. Trino already shipped
further releases that included support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt;, table functions, and more
performance improvements. The Athena team is working hard on updating Trino for
your benefit regularly.&lt;/p&gt;

&lt;p&gt;Celebrating our &lt;a href=&quot;/blog/2022/09/12/tenth-birthday-celebration-recap.html&quot;&gt;first decade of the Trino project this last summer&lt;/a&gt; has shown a great trajectory for
the project and the community, and it looks like the next decade is going to be
even better!&lt;/p&gt;

&lt;p&gt;Sending a warm welcome from the Trino community to the Amazon Athena team and
users. Now you know that you were Trino users all along.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Martin and Manfred&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso</name>
        </author>
      

      <summary>Our community just keeps growing! Today, it is time to reach out and welcome another large group of Trino users. The release of the new engine version for Amazon Athena upgrades Athena to a recent version of Trino from a rather old version. This update brings a ton of improvements from the Trino project to the users of the popular cloud-based query service.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/trino-light.png" />
      
    </entry>
  
    <entry>
      <title>Enterprise-ready Trino at Bloomberg: One Giant Leap Toward Data Mesh!</title>
      <link href="https://trino.io/blog/2022/11/30/trino-summit-2022-bloomberg-recap.html" rel="alternate" type="text/html" title="Enterprise-ready Trino at Bloomberg: One Giant Leap Toward Data Mesh!" />
      <published>2022-11-30T00:00:00+00:00</published>
      <updated>2022-11-30T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/11/30/trino-summit-2022-bloomberg-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/11/30/trino-summit-2022-bloomberg-recap.html">&lt;p&gt;This post continues &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;a larger series of posts&lt;/a&gt; on the Trino Summit 2022 sessions.
Following the &lt;a href=&quot;/blog/2022/11/28/trino-summit-2022-apple-recap.html&quot;&gt;Trino at Apple talk&lt;/a&gt;, engineers from Bloomberg shared
the latest about their additions to Trino. Bloomberg uses Trino to federate huge
amounts of disparate financial data together. When you have many users with
different use cases and resource needs, you need something to ensure that the
huge workloads don’t bully the small ones. Enter the Trino Load Balancer, a
privacy-aware solution to help maintain high availability while still treating
data security as the first-class citizen that it should be.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/ePr-iVQ5ri4&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino-at-Bloomberg.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;Bloomberg collects data, creates experimental data, and ingests data from
vendors. Its data analysts then refine, clean, and structure that data using
whatever their preferred method is, generating even more diverse data. Internal
teams and clients then want to look at and query that generated data, too. Sound
like a data mesh? That’s because it is. Trino isn’t new at Bloomberg, and it’s
been in use to help federate all of those varying data sets into one unified
access point.&lt;/p&gt;

&lt;p&gt;When trying to deploy multiple Trino clusters for such a wide array of users who
demand high uptime, high throughput, and fast response times, the Trino
coordinator becomes a single point of failure. There’s the risk of
infrastructure outages, the need to shut things down for occasional upgrades,
and some users run high-throughput jobs for millions of rows while others are
expecting low-latency jobs for only hundreds. Keeping Trino up, running, and
meeting all users’ expectations is no small task.&lt;/p&gt;

&lt;p&gt;And that’s where the Trino Load Balancer comes in! As a fork of the open-source
presto-gateway, it helps to do exactly what it says on the tin for Trino:
balance workloads. By being aware of what’s running on each cluster and how many
resources are being used, it can direct traffic to the ideal clusters to meet
each user’s needs. And with a brief demo, we get a look at how data owners
can set policies that are respected within the load balancer, ensuring that
users can only access and query what they’re supposed to.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/11/30/trino-summit-2022-bloomberg-recap.html&quot;&gt;https://trino.io/blog/2022/11/30/trino-summit-2022-bloomberg-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/bloomberg-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Vishal Jadhav, Pablo Arteaga, Cole Bowden</name>
        </author>
      

      <summary>This post continues a larger series of posts on the Trino Summit 2022 sessions. Following the Trino at Apple talk, engineers from Bloomberg shared the latest about their additions to Trino. Bloomberg uses Trino to federate huge amounts of disparate financial data together. When you have many users with different use cases and resource needs, you need something to ensure that the huge workloads don’t bully the small ones. Enter the Trino Load Balancer, a privacy-aware solution to help maintain high availability while still treating data security as the first-class citizen that it should be.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/bloomberg.jpg" />
      
    </entry>
  
    <entry>
      <title>Trino at Apple</title>
      <link href="https://trino.io/blog/2022/11/28/trino-summit-2022-apple-recap.html" rel="alternate" type="text/html" title="Trino at Apple" />
      <published>2022-11-28T00:00:00+00:00</published>
      <updated>2022-11-28T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/11/28/trino-summit-2022-apple-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/11/28/trino-summit-2022-apple-recap.html">&lt;p&gt;This post continues &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;a larger series of posts&lt;/a&gt; on the Trino Summit 2022 sessions.
Following the &lt;a href=&quot;/blog/2022/11/22/trino-summit-2022-state-of-trino-keynote-recap.html&quot;&gt;Keynote: State of Trino session&lt;/a&gt;, engineers from Apple shared the
current usage of Trino at Apple. They discuss how they support Trino as a
service for multiple end-users, and the critical features that drew Apple to
Trino. They wrap up with some challenges they have faced and some development
they have planned to contribute to Trino.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/3afcRK6Yvio&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/Trino@Apple.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;Trino is deployed at scale in Apple, and it continues to see tremendous
adoption across multiple teams at Apple. &lt;em&gt;Yathi Peddyshetty, Software Engineer @ Apple&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The commonplace ad hoc and BI analytics use cases make up a lot of how Apple uses
Trino today. They also have increasing uses in federated querying and A/B
testing.&lt;/p&gt;

&lt;p&gt;To deploy Trino as a service, Apple has an in-house Kubernetes operator to
manage the Trino cluster lifecycles. They also created an orchestrator to
provision and simplify cluster creation and management. They make this a
self-service console that allows users to provision their own clusters per
request. Their custom orchestrator also takes care of autoscaling and other
technical complexities of maintaining a scalable Trino system.&lt;/p&gt;

&lt;p&gt;Apple primarily uses Iceberg, Hive, and Cassandra connectors. They have a heavy
focus on Apache Iceberg as their table format and have contributed a significant
amount of PRs to improve interoperability between Trino and Spark, and increased
coverage of Iceberg APIs. Other challenges Apple faces stem from the lack of
flexible routing of queries to achieve zero downtime, and the lack of pluggable
optimizer rules and operators.&lt;/p&gt;

&lt;p&gt;Apple has various features on their roadmap to eventually contribute to the
community. These include exposing remaining functionality in the Iceberg APIs,
supporting all partition transforms, predicate pushdowns, bucketed joins, simple
aggregate pushdowns, Iceberg native views in Trino, and more.&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, please consider sharing this on
Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social
card and link to &lt;a href=&quot;https://trino.io/blog/2022/11/28/trino-summit-2022-apple-recap.html&quot;&gt;https://trino.io/blog/2022/11/28/trino-summit-2022-apple-recap.html&lt;/a&gt;. If you think Trino is awesome, 
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/apple-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Vinitha Gankidi, Yathi Peddyshetty, Brian Olsen</name>
        </author>
      

      <summary>This post continues a larger series of posts on the Trino Summit 2022 sessions. Following the Keynote: State of Trino session, engineers from Apple shared the current usage of Trino at Apple. They discuss how they support Trino as a service for multiple end-users, and the critical features that drew Apple to Trino. They wrap up with some challenges they have faced and some development they have planned to contribute to Trino.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/apple.jpg" />
      
    </entry>
  
    <entry>
      <title>Trino Summit 2022 recap: The state of Trino</title>
      <link href="https://trino.io/blog/2022/11/22/trino-summit-2022-state-of-trino-keynote-recap.html" rel="alternate" type="text/html" title="Trino Summit 2022 recap: The state of Trino" />
      <published>2022-11-22T00:00:00+00:00</published>
      <updated>2022-11-22T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/11/22/trino-summit-2022-state-of-trino-keynote-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/11/22/trino-summit-2022-state-of-trino-keynote-recap.html">&lt;p&gt;To kick off the &lt;a href=&quot;/blog/2022/11/21/trino-summit-2022-recap.html&quot;&gt;Trino Summit 2022&lt;/a&gt;,
we heard from Trino co-creators Martin Traverso, Dain Sundstrom, and David
Phillips. Martin gave a talk on the state of Trino and project plans for 2023,
then opened the floor to questions from the community. You can watch a recording
of the talk, or read on if you’re only interested in the highlights.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/mUq_h3oArp4&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md&quot; target=&quot;_blank&quot; href=&quot;/assets/blog/trino-summit-2022/State-of-Trino-Nov-2022.pdf&quot;&gt;
  Check out the slides!
&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;recap&quot;&gt;Recap&lt;/h2&gt;

&lt;p&gt;So what &lt;em&gt;has&lt;/em&gt; happened in Trino over the last year?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/blog/2022/08/08/trino-tenth-birthday.html&quot;&gt;We celebrated Trino’s 10th birthday!&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;It was the busiest year in project history, with 600+ contributors, 4000+
commits, and near-weekly releases.&lt;/li&gt;
  &lt;li&gt;Tons of new features were added, including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt;, JSON functions, table
functions, fault-tolerant execution (look forward to a lot of talking about it
in later recaps!), upgrading to Java 17, and a slide so dense with other
goodies that it needed two columns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And what’s coming down the pipeline?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/14237&quot;&gt;Project Hummingbird&lt;/a&gt;, a large
set of core engine improvements.&lt;/li&gt;
  &lt;li&gt;Expanded table function support, including accepting tables as arguments.&lt;/li&gt;
  &lt;li&gt;Extra community support, so that contributors have an easier and better time
getting code merged into Trino.&lt;/li&gt;
  &lt;li&gt;New connectors, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE/DROP CATALOG&lt;/code&gt;, query tracing, and more!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There were also tons of great questions asked by live and online attendees
answered by Dain, David, and Martin, so if you want to hear more, take a listen
to the full talk!&lt;/p&gt;

&lt;h2 id=&quot;share-this-session&quot;&gt;Share this session&lt;/h2&gt;

&lt;p&gt;If you thought this talk was interesting, consider sharing this on Twitter,
Reddit, LinkedIn, HackerNews or anywhere on the web. Use the social card and
link to &lt;a href=&quot;https://trino.io/blog/2022/11/22/trino-summit-2022-state-of-trino-keynote-recap.html&quot;&gt;https://trino.io/blog/2022/11/22/trino-summit-2022-state-of-trino-keynote-recap.html&lt;/a&gt;. If you think Trino is awesome,
&lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;give us a 🌟 on GitHub &lt;i class=&quot;fab fa-github&quot;&gt;&lt;/i&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/keynote-social.png&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso, Dain Sundstrom, David Phillips, Cole Bowden</name>
        </author>
      

      <summary>To kick off the Trino Summit 2022, we heard from Trino co-creators Martin Traverso, Dain Sundstrom, and David Phillips. Martin gave a talk on the state of Trino and project plans for 2023, then opened the floor to questions from the community. You can watch a recording of the talk, or read on if you’re only interested in the highlights.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/keynote-header.jpeg" />
      
    </entry>
  
    <entry>
      <title>Trino Summit 2022 recap</title>
      <link href="https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html" rel="alternate" type="text/html" title="Trino Summit 2022 recap" />
      <published>2022-11-21T00:00:00+00:00</published>
      <updated>2022-11-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/11/21/trino-summit-2022-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html">&lt;p&gt;Trino Summit 2022 was, in a word, invigorating. I’m still coming off the high
from the amount of energy I gained from being at this summit, meeting many of
you face-to-face for the first time. Most surprisingly, I learned that Trino
contributor James Petty from AWS was actually not famous painter
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bob_Ross&quot;&gt;Bob Ross&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/james-petty.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you’ve ever planned a conference, you know that there are a lot of details to
iron out, and you can be left exhausted by the end. After this year’s Trino
Summit though, rather than being worn out, I felt like it ended too quickly and
I simply wanted more time to chat with everyone. A single day was just not
enough, and now all I can think about is the next summit. We not only got to
hear an incredible lineup of talks and discussions from first-time Trino Summit
speakers like Apple, Shopify, and Lyft, but also had many engaging discussions
outside the auditorium.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/swag.jpg&quot; /&gt;
&lt;img src=&quot;/assets/blog/trino-summit-2022/authors.jpg&quot; /&gt;
&lt;img src=&quot;/assets/blog/trino-summit-2022/talking-1.jpg&quot; /&gt;
&lt;img src=&quot;/assets/blog/trino-summit-2022/talking-2.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There were cross-community discussions between Delta Lake, Airflow, and Alluxio
about how to turbo-charge Trino integrations with these communities. There were
many companies talking about best practices and gotchas while migrating from
Hive to Iceberg or Delta Lake. Others wanted to learn how to use fault-tolerant
execution. I spoke with managers of companies like LinkedIn and Bloomberg who
wanted to help develop their engineers to get more involved with contributing to
Trino. We all finally got to see the faces of people we had been talking to for
the past two to three years. People were getting their free
copies of Trino: The Definitive Guide signed by Manfred, Matt, and Martin, and
brought home other swag. After a long day of talks, we wrapped Trino Summit up
with two happy hours on the roof of the Commonwealth Club, watching the sunset
over the San Francisco Bay Bridge.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/speech.jpg&quot; /&gt;
&lt;img src=&quot;/assets/blog/trino-summit-2022/happy-hour.jpg&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;session-summaries&quot;&gt;Session summaries&lt;/h2&gt;

&lt;p&gt;I would like to quickly summarize a few short takeaways from each talk at
the summit. I highly recommend watching the full videos on the Trino YouTube
channel, which are linked in the talk titles:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=mUq_h3oArp4&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Keynote: State of Trino&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/11/22/trino-summit-2022-state-of-trino-keynote-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Trino co-creator Martin covers recently developed features and community
statistics, and discusses roadmap features like Project Hummingbird.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Dain and David join Martin on the stage to answer audience questions.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=mUq_h3oArp4&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/keynote.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=3afcRK6Yvio&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Trino at Apple&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/11/28/trino-summit-2022-apple-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Apple has an in-house k8s operator to manage Trino cluster lifecycles, and an
orchestrator to provision and simplify cluster creation and management.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Apple focuses heavily on Apache Iceberg as their table format and has
contributed a significant number of PRs to improve interoperability between
Trino and Spark and to increase coverage of the Iceberg APIs.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=3afcRK6Yvio&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/apple.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=ePr-iVQ5ri4&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Enterprise-ready Trino at Bloomberg: One Giant Leap Toward Data Mesh!&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/11/30/trino-summit-2022-bloomberg-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Bloomberg uses Trino to centralize access to their massive number of
catalogs spread across many different departments.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To offer Trino-as-a-Service for varying workloads, they use a Trino Load
Balancer (a fork of the popular presto-gateway project from Lyft) that adds new
functionality. In talking with them after their presentation, the Bloomberg
team expressed interest in open sourcing this work to the community as a more
generalized solution than the gateway project.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=ePr-iVQ5ri4&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/bloomberg.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=vz9reBUgQTE&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Optimizing Trino using spot instances&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/01/trino-summit-2022-zillow-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;In an attempt to minimize costs, Zillow is measuring the efficacy of running
Trino ETL jobs on spot instances.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This currently runs the risk of costly retries on failure, but future
work will look at utilizing the new fault-tolerant execution mode to mitigate
those retries when spot instances are reclaimed.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=vz9reBUgQTE&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/zillow.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=g9fLA3tFG-Q&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Leveraging Trino to Power Data at Goldman Sachs&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/05/trino-summit-2022-goldman-sachs-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Goldman Sachs uses Trino to power their data quality service, taking advantage
of the fact that Trino centralizes all visibility across their platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=g9fLA3tFG-Q&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/goldman-sachs.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=sSWBi7bBotQ&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Elevating data fabric to data mesh: Solving data needs in hybrid datalakes&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/07/trino-summit-2022-comcast-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Comcast takes us through their Trino architecture journey, starting with
the history of their Data Fabric service and then discussing the data governance
and culture changes required to realize a Data Mesh with Trino.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=sSWBi7bBotQ&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/comcast.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=nJBBw-xnLU8&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Rewriting History: Migrating petabytes of data to Apache Iceberg using Trino&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/09/trino-summit-2022-shopify-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Shopify recently migrated many of its workloads to Trino. One of the first
hurdles was dealing with the many issues in the Hive table format, so they quickly
upgraded to the Iceberg table format.&lt;/li&gt;
  &lt;li&gt;They initially encountered numerous issues, but experienced incredibly fast
turnaround on fixes from the Trino project that resolved their issues during
the migration.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s also a benchmark showing how moving to a columnar file format and
the Iceberg table format drastically improves query performance.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=nJBBw-xnLU8&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/shopify.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=FL3c1Ue7YWM&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Trino for Large Scale ETL at Lyft&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/12/trino-summit-2022-lyft-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Lyft is using Trino to perform ETL jobs scanning 10PB of data per day, and
writing 100TB per day. They are not using fault-tolerant execution.&lt;/li&gt;
  &lt;li&gt;In the last year, Lyft cut their number of Trino nodes in half, while
increasing the volume of their workloads due to recent improvements in Trino
and upgrades in Java versions.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Keeping up with the rapid release cycle of Trino was a challenge, and
Lyft showcases their regression testing using a query replay framework.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=FL3c1Ue7YWM&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/lyft.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=Zfmxwu0m98k&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Federating them all on Starburst Galaxy&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/14/trino-summit-2022-starburst-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Running and scaling Trino is difficult. Starburst showcases Starburst Galaxy,
a SaaS data platform built around the Trino query engine.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The talk demos running federated queries over Pokémon data scattered
across MongoDB and Iceberg tables.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=Zfmxwu0m98k&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/starburst.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=Q03DzL_fm-I&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Trino at Quora: Speed, Cost, Reliability Challenges and Tips&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/16/trino-summit-2022-quora-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Quora uses a large number of Trino clusters for ad-hoc queries, ETL, time
series analysis, A/B testing, and data backfills.&lt;/li&gt;
  &lt;li&gt;Quora initially faced high costs on Trino due to inefficient use of
resources.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To address this, they migrated to Graviton instances, implemented
autoscaling, and optimized query efficiency.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=Q03DzL_fm-I&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/quora.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=V9_aPLXATh8&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Journey to Iceberg with SK Telecom&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/19/trino-summit-2022-sk-telecom-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The speakers travelled all the way from South Korea to join us in person.&lt;/li&gt;
  &lt;li&gt;SK Telecom had a multitude of performance issues that all stemmed from the
lack of flexibility in the Hive model and metastore.&lt;/li&gt;
  &lt;li&gt;They migrated to Iceberg to address the performance issues and gained the
added benefits of Iceberg’s table format for improving developer workflows.&lt;/li&gt;
  &lt;li&gt;Housekeeping operations like optimize were already addressed by the Iceberg
community and quickly added to Trino.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This reduced query processing time by 80%.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=V9_aPLXATh8&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/sk-telecom.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=xKDN7RUJ5i4&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Using Trino with Apache Airflow for (almost) all your data problems&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/21/trino-summit-2022-astronomer-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Airflow is a highly functional and widely adopted workflow management
platform for scheduling jobs on your data platform.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The Trino integration for Airflow recently landed, coinciding with the
GA arrival of fault-tolerant execution mode in Trino.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=xKDN7RUJ5i4&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/astronomer.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=MCB_1furnAo&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; How we use Trino to analyze our Product-led Growth (PLG) user activation funnel&lt;/a&gt;
(&lt;a href=&quot;/blog/2022/12/23/trino-summit-2022-upsolver-recap.html&quot;&gt;Read more&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Upsolver solves a lot of common data problems on their platform.&lt;/li&gt;
  &lt;li&gt;One such problem is measuring activation rates in a product-led growth
team. This requires acting on many sources of data.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Trino is a natural fit for the challenge of joining this data together.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=MCB_1furnAo&quot;&gt;&lt;img width=&quot;40%&quot; src=&quot;/assets/blog/trino-summit-2022/upsolver.jpg&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;federate-em-all&quot;&gt;Federate ‘em all&lt;/h2&gt;

&lt;p&gt;After a whole day of throwing Trino balls out to the crowd, we got to see a
nice metaphor for federated data by throwing them all in the air and yelling,
“Federate ‘em all!”&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/balls.jpg&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;trino-contributor-congregation&quot;&gt;Trino Contributor Congregation&lt;/h2&gt;

&lt;p&gt;The day after the summit, we invited a relatively small group of our
contributors to meet for the inaugural Trino Contributor Congregation (TCC).
This gathered many of our long-time and heavy Trino contributors. We had folks
from companies like Starburst, AWS, Apple, Bloomberg, Lyft, Comcast, LinkedIn,
Treasure Data, and others. Let’s dive into some of the topics we discussed.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/contributor-congregation.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We discussed feature proposals like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The Trino load balancer, which is an adaptation of the popular gateway project from Lyft.&lt;/li&gt;
  &lt;li&gt;A Ranger plugin to be maintained by the Trino community rather than relying on the Ranger project.&lt;/li&gt;
  &lt;li&gt;A Snowflake connector that was traditionally held back by the lack of infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We discussed the need for better shared testing datasets, beyond TPC-H and
TPC-DS, that are more representative of the real workloads many users run.&lt;/p&gt;

&lt;p&gt;We discussed the need for a clearer process for contributors to follow to
minimize the time to get features merged and avoid stale PRs. This is being
addressed by the backlog grooming performed by the developer relations team, and
by assigning maintainers to own various PRs. While there is never a promise to
merge a PR, improving the turnaround and communication on PRs is crucial to keep
contributors happy and improve the health of the project.&lt;/p&gt;

&lt;p&gt;While we were sad that not everyone could make the in-person TCC, we plan to
have virtual TCCs on a more frequent cadence and have the in-person TCCs
alongside larger in-person events. Getting these TCCs right is core to growing
the maintainership and continued success of the Trino project.&lt;/p&gt;

&lt;p&gt;We hope all of you who joined us in person and online enjoyed yourselves. We
all had such a blast! Stay tuned for updates on the next Trino Summit location!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-summit-2022/bun-bun-bye.jpg&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Trino Summit 2022 was in a word, invigorating. I’m still coming off the high from the amount of energy I gained from being at this summit, meeting many of you face-to-face for the first time. Most surprisingly, I learned that Trino contributor James Petty from AWS was actually not famous painter Bob Ross.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/stage.jpg" />
      
    </entry>
  
    <entry>
      <title>Top five reasons to attend Trino Summit 2022</title>
      <link href="https://trino.io/blog/2022/10/31/trino-summit-2022-teaser-3.html" rel="alternate" type="text/html" title="Top five reasons to attend Trino Summit 2022" />
      <published>2022-10-31T00:00:00+00:00</published>
      <updated>2022-10-31T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/10/31/trino-summit-2022-teaser-3</id>
      <content type="html" xml:base="https://trino.io/blog/2022/10/31/trino-summit-2022-teaser-3.html">&lt;p&gt;This blog post wraps up a series of 
&lt;a href=&quot;/blog/2022/09/22/trino-summit-2022-teaser.html&quot;&gt;previous posts&lt;/a&gt;
&lt;a href=&quot;/blog/2022/10/19/trino-summit-2022-teaser-2.html&quot;&gt;teasing Trino Summit 2022&lt;/a&gt;.
The conference is free and takes place in San Francisco, California on November
10th. Join us either in-person or virtually!&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;Let’s dive right into the five reasons you should attend Trino Summit 2022. If
you’re not into these lists, go ahead and 
&lt;a href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;register now&lt;/a&gt;!&lt;/p&gt;

&lt;h3 id=&quot;1-hear-speakers-from-industry-leading-companies-talk-about-their-trino-architecture-and-use-cases&quot;&gt;1. Hear speakers from industry-leading companies talk about their Trino architecture and use cases&lt;/h3&gt;

&lt;p&gt;This year’s summit features industry leaders with varying workloads and
use cases. There are also sessions on tips and tricks to scale and lower the
cost of running Trino in production. Users from the following companies speak
about their challenges and how they use Trino to help overcome them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Apple&lt;/li&gt;
  &lt;li&gt;Astronomer&lt;/li&gt;
  &lt;li&gt;Bloomberg&lt;/li&gt;
  &lt;li&gt;Comcast&lt;/li&gt;
  &lt;li&gt;Goldman Sachs&lt;/li&gt;
  &lt;li&gt;Lyft&lt;/li&gt;
  &lt;li&gt;Quora&lt;/li&gt;
  &lt;li&gt;Shopify&lt;/li&gt;
  &lt;li&gt;SK Telecom&lt;/li&gt;
  &lt;li&gt;Starburst&lt;/li&gt;
  &lt;li&gt;Upsolver&lt;/li&gt;
  &lt;li&gt;Zillow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To see more information about the talks and the agenda for the conference, check
out the &lt;a href=&quot;https://www.starburst.io/info/trinosummit#agenda&quot;&gt;Trino Summit 2022 agenda&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;2-meet-the-authors-of-the-trino-the-definitive-guide-and-get-that-trino-swag&quot;&gt;2. Meet the authors of the &lt;strong&gt;&lt;em&gt;Trino: The Definitive Guide&lt;/em&gt;&lt;/strong&gt; and get that Trino swag&lt;/h3&gt;

&lt;p&gt;This year, we are giving away autographed copies of the recently updated
&lt;a href=&quot;/blog/2022/10/03/the-definitive-guide-2.html&quot;&gt;&lt;strong&gt;Trino: The Definitive Guide&lt;/strong&gt;&lt;/a&gt; to attendees.
Already have a physical copy? Visit the Trino booth to get your book signed and
meet authors 
&lt;a href=&quot;https://twitter.com/simpligility&quot;&gt;Manfred Moser&lt;/a&gt;,
&lt;a href=&quot;https://twitter.com/mfullertweets&quot;&gt;Matt Fuller&lt;/a&gt;, and
&lt;a href=&quot;https://twitter.com/mtraverso&quot;&gt;Martin Traverso&lt;/a&gt; who literally wrote the book on
Trino.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;img align=&quot;center&quot; width=&quot;33%&quot; src=&quot;/assets/ttdg2-cover.png&quot; /&gt;&lt;br /&gt;
&lt;/p&gt;

&lt;p&gt;We will be giving away swag packs containing an autographed copy of Trino: The
Definitive Guide, a Trino Summit 2022 shirt, a Commander Bun Bun plushie, and
more to both virtual and in-person attendees! This will be done during our
sponsored giveaway breaks between sessions, where we challenge both in-person
and virtual attendees in a race against time to bag the swag!&lt;/p&gt;

&lt;h3 id=&quot;3-federate-em-all&quot;&gt;3. Federate ‘em all&lt;/h3&gt;

&lt;p&gt;This year’s summit will be a free event that federates both data and humans. The
theme comes from a popular franchise that many of you know: Pokémon. To
understand the connection, let’s break down what we mean by federate ‘em
all. In the same way the Pokémon protagonist, Ash Ketchum, catches and trains
heterogeneous creatures called Pokémon, Trino queries and filters heterogeneous
data sets from various data sources.&lt;/p&gt;

&lt;p&gt;If you’re not familiar with Pokémon, a losing strategy is to train just one or
two Pokémon, since different types of Pokémon are better suited to different tasks.
In the same way, centralizing all of your data to a single data warehouse or
data lake doesn’t make sense either. There are different use cases and 
different needs across the company. Rather than spending your time building
brittle one-size-fits-all architectures, Trino enables you to connect to
&lt;a href=&quot;https://trino.io/docs/current/connector.html&quot;&gt;multiple data sources&lt;/a&gt; using ANSI SQL.&lt;/p&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/o2MJvRKG14M&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px;
margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt;
&lt;/iframe&gt;

&lt;h3 id=&quot;4-experience-beautiful-san-francisco&quot;&gt;4. Experience beautiful San Francisco&lt;/h3&gt;

&lt;p&gt;For those attending in person, you will get to enjoy the beautiful San Francisco
area. The &lt;a href=&quot;https://www.starburst.io/info/trinosummit/#location&quot;&gt;Commonwealth Club&lt;/a&gt;
is located right on the San Francisco Bay. The building is beautiful, with a
large auditorium for the main event and plenty of floors and rooms for socializing.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;/assets/blog/trino-summit-2022/commonwealth-club.jpeg&quot; /&gt;&lt;br /&gt;
&lt;/p&gt;

&lt;p&gt;At the end of the summit, we will have a happy hour on the scenic roof deck that
looks out over the San Francisco Bay at the iconic Bay Bridge.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;img align=&quot;center&quot; width=&quot;75%&quot; src=&quot;/assets/blog/trino-summit-2022/san-francisco.png&quot; /&gt;&lt;br /&gt;
&lt;/p&gt;

&lt;p&gt;We know this only applies to our in-person attendees, but remember, if you join
us virtually, there are still plenty of ways to network and interact
throughout the conference. We will be taking questions from our virtual audience,
and there will also be a chat forum for discussion with attendees from across the
globe. Plus, unlike those of us attending in person, no travel is required and
pajamas are optional during the event!&lt;/p&gt;

&lt;h3 id=&quot;5-collaborate-with-some-of-the-best-minds-working-on-trino&quot;&gt;5. Collaborate with some of the best minds working on Trino&lt;/h3&gt;

&lt;p&gt;Trino is a relatively new paradigm compared to the rest of the data world. If you
just realized that you don’t have to move all your data into one location,
you’re on the right track. However, there’s still a lot to learn when it comes
to scaling out a query engine whose usage grows over time. To get this right,
you need a community to be successful. The creators, Martin, Dain, and David,
and many of the core contributors of Trino will be attending, along with a long
list of folks who run multiple clusters over hundreds of petabytes of
data.&lt;/p&gt;

&lt;p&gt;Tap into this incredibly passionate group of Trino enthusiasts to augment your
experience with this revolutionary query engine!&lt;/p&gt;

&lt;h2 id=&quot;register-for-the-summit&quot;&gt;Register for the summit&lt;/h2&gt;

&lt;p&gt;Make sure to register quickly for in-person attendance, as it is limited to
250 seats. Spots are running out quickly, so don’t wait!&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;h2 id=&quot;announcing-the-final-round-of-sessions-and-the-agenda&quot;&gt;Announcing the final round of sessions and the agenda!&lt;/h2&gt;

&lt;p&gt;Now for the final list of sessions to announce for this year’s Trino Summit!
This week is quite the reveal, as we are showcasing a talk on how engineers at
Apple use Trino for their analytics challenges! 🎉🤯&lt;/p&gt;

&lt;p&gt;We also have three more amazing guests who are heavy hitters in the data and
analytics tech scene.&lt;/p&gt;

&lt;h3 id=&quot;trino-at-apple&quot;&gt;Trino at Apple&lt;/h3&gt;

&lt;p&gt;In this talk, the audience will learn how Apple uses Trino to accelerate
analytics, the challenges we face deploying analytics at scale at Apple, and the
areas we would like to collaborate on with the community.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Vinitha Gankidi, Software engineer at Apple&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Yathindranath Peddyshetty, Software engineer at Apple&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;enterprise-ready-trino-at-bloomberg-one-giant-leap-toward-data-mesh&quot;&gt;Enterprise-ready Trino at Bloomberg: One Giant Leap Toward Data Mesh!&lt;/h3&gt;

&lt;p&gt;Enterprises like Bloomberg love Trino. It allows us to embrace the data mesh
with ease. Providing Trino as a service in a highly available, configurable, and
access-controlled manner has been a key enabler for us in this paradigm shift.
Join us to learn how we have leveraged open-source components to achieve these
goals at Bloomberg.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Pablo Arteaga, Software Engineer at Bloomberg&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Vishal Jadhav, Software Engineer at Bloomberg&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;leveraging-trino-to-power-data-quality-at-goldman-sachs&quot;&gt;Leveraging Trino to power data quality at Goldman Sachs&lt;/h3&gt;

&lt;p&gt;Data is at the core of today’s business processes. We are responsible for making
accurate, timely, and modeled data available to our analytics and application
teams. The sources of these datasets can be quite heterogeneous, like HDFS, S3,
Sybase, Snowflake, Elasticsearch, and more. Also, with an increase in data
volume, velocity, and variety, data quality assurance is extremely critical to
ensure the trustworthiness of data and make it usable by consumers with
confidence. We have leveraged Trino to make high-quality data centrally
accessible through an efficient, secure, governed, and unified way of performing
analytics.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Sumit Halder, Vice President at Goldman Sachs&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Ramesh Bhanan, Vice President at Goldman Sachs&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Siddhant Chadha, Associate at Goldman Sachs&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Suman Baliganahalli Narayan Murthy, Vice President at Goldman Sachs&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;optimizing-trino-using-spot-instances&quot;&gt;Optimizing Trino using spot instances&lt;/h3&gt;

&lt;p&gt;Trino is a critical tool used at Zillow for analytics on the data lake. In this
talk, we aim to give a general overview of how we leverage Trino and dive deeper
into the optimizations we have done for scaling Trino at Zillow using spot
instances.&lt;/p&gt;

&lt;p&gt;In this session, we will show how fault-tolerant execution mode enables more
cost-effective and resilient execution when running Trino on spot instances.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Rupesh Kumar Perugu, Senior Software Engineer at Zillow&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Santhosh Venkatraman, Software Engineer at Zillow&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That finalizes all of our sessions! To see them all, check out the
&lt;a href=&quot;https://www.starburst.io/info/trinosummit#agenda&quot;&gt;Trino Summit 2022 agenda&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Get excited, the conference is in less than two weeks, so don’t forget to
&lt;a href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;register&lt;/a&gt;, and as always, &lt;strong&gt;&lt;em&gt;Federate them
all&lt;/em&gt;&lt;/strong&gt;! It is really shaping up to be an educational and fun-filled event with
Trino experts and aficionados.&lt;/p&gt;

&lt;p&gt;A huge thanks to our sponsors: Starburst, Privacera, Monte Carlo, Immuta,
CubeJS, Delta Lake, Hightouch, Backblaze, Databricks, Alluxio, and Tabular!&lt;/p&gt;

&lt;p&gt;Well, that’s a wrap. We’ll see you all in T-minus ten days!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>This blog post wraps up a series of previous posts teasing Trino Summit 2022. The conference is free and takes place in San Francisco, California on November 10th. Join us either in-person or virtually! Register now</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/summit-logo.png" />
      
    </entry>
  
    <entry>
      <title>Trino Summit 2022: Federating humans and data</title>
      <link href="https://trino.io/blog/2022/10/19/trino-summit-2022-teaser-2.html" rel="alternate" type="text/html" title="Trino Summit 2022: Federating humans and data" />
      <published>2022-10-19T00:00:00+00:00</published>
      <updated>2022-10-19T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/10/19/trino-summit-2022-teaser-2</id>
      <content type="html" xml:base="https://trino.io/blog/2022/10/19/trino-summit-2022-teaser-2.html">&lt;p&gt;Trino has long been the de facto standard for querying large data sets over your
cloud or on-prem storage, also known as data lakes. This Trino Summit’s theme
will instead showcase Trino’s other claim to fame: query federation. Trino is a
query engine providing an access point that exposes ANSI SQL across 
&lt;a href=&quot;https://trino.io/docs/current/connector.html&quot;&gt;multiple data sources&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I urge you to join us either in person or virtually if you are a fan of Trino,
big data, open source, data engineering, Java, or all of the above! This conference
is free and takes place in San Francisco, California, on November 10th.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;register-for-the-summit&quot;&gt;Register for the summit&lt;/h2&gt;

&lt;p&gt;I can’t help but bring up the analogy of how Trino federates heterogeneous data
while this Trino Summit will federate many of us in the community from all
corners of the world. It really gives an appreciation for the international
reach of Trino and makes me look forward to more in-person events!&lt;/p&gt;

&lt;p&gt;Trino Summit will be held at the Commonwealth Club in San Francisco, California.
Make sure to register quickly for in-person attendance, as it is limited to
250 seats. Virtual registration is also picking up quickly, so register today!&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;h3 id=&quot;get-an-autographed-copy-of-trino-the-definitive-guide-2nd-ed&quot;&gt;Get an autographed copy of Trino: The Definitive Guide, 2nd ed.&lt;/h3&gt;

&lt;p&gt;Want to meet the authors who literally wrote the book on Trino? Visit 
&lt;a href=&quot;https://twitter.com/simpligility&quot;&gt;Manfred Moser&lt;/a&gt;,
&lt;a href=&quot;https://twitter.com/mfullertweets&quot;&gt;Matt Fuller&lt;/a&gt;, and
&lt;a href=&quot;https://twitter.com/mtraverso&quot;&gt;Martin Traverso&lt;/a&gt; at the Trino booth during the
conference. Bring your hard copy of &lt;a href=&quot;/blog/2022/10/03/the-definitive-guide-2.html&quot;&gt;&lt;strong&gt;Trino: The Definitive Guide&lt;/strong&gt;&lt;/a&gt; to get it signed by the authors!&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;img align=&quot;center&quot; width=&quot;50%&quot; height=&quot;100%&quot; src=&quot;/assets/ttdg2-cover.png&quot; /&gt;&lt;br /&gt;
&lt;/p&gt;

&lt;p&gt;Don’t have a book? We’ll be giving away autographed copies of the book
throughout the conference!&lt;/p&gt;

&lt;h3 id=&quot;trino-summit-2022-teaser&quot;&gt;Trino Summit 2022 teaser&lt;/h3&gt;

&lt;p&gt;Check out the teaser for this year’s Trino Summit and get ready to &lt;strong&gt;&lt;em&gt;Federate ‘em
all&lt;/em&gt;&lt;/strong&gt;!&lt;/p&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/o2MJvRKG14M&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px;
margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt;
&lt;/iframe&gt;

&lt;h2 id=&quot;announcing-the-second-round-of-sessions-and-speakers&quot;&gt;Announcing the second round of sessions and speakers&lt;/h2&gt;

&lt;p&gt;As mentioned in the &lt;a href=&quot;/blog/2022/09/22/trino-summit-2022-teaser.html&quot;&gt;previous summit teaser&lt;/a&gt;, we announced some of our exciting
lineup of speakers! The topics range from architectures like data mesh and data
lakehouse, to running Trino at scale with fault-tolerant execution, and of
course, query federation.&lt;/p&gt;

&lt;p&gt;We have a full roster planned, but check out the next round of fully confirmed
sessions. Stay tuned for one more blog post announcing the final sessions in
our agenda as they are confirmed!&lt;/p&gt;

&lt;h3 id=&quot;sk-telecoms-journey-to-iceberg&quot;&gt;SK Telecom’s journey to Iceberg&lt;/h3&gt;

&lt;p&gt;SK Group is one of South Korea’s largest conglomerates, covering industries
from manufacturing to telecommunications. SK Telecom runs an on-premise data
platform at petabyte scale using Trino as its query engine. We chose Trino for
its ability to connect to heterogeneous data sources and for the fast
performance that plays a key role in our data platform.&lt;/p&gt;

&lt;p&gt;As data volumes and user demand to analyze long-term data increased, the Trino
Hive connector faced several challenges. Queries with an input data size
exceeding a terabyte put a great burden on the cluster. This caused many jobs to
fail, which is problematic because Trino’s resource sharing architecture affects
multiple users when a heavy query occurs.&lt;/p&gt;

&lt;p&gt;To address this situation, we optimized the data structure, tuned queries, and
used resource groups to isolate queries, but none of this fixed the problem.
We investigated Apache Iceberg and realized it could address some of these
scaling issues we were facing. In this talk, we will share our journey.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;JaeChang Song, Data Engineer at SKTelecom and Trino/Iceberg Contributor&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Jennifer OH, Data Engineer at SKTelecom&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;elevating-data-fabric-to-data-mesh-solving-data-needs-in-hybrid-data-lakes&quot;&gt;Elevating Data Fabric to Data Mesh: solving data needs in hybrid data lakes&lt;/h3&gt;

&lt;p&gt;At Comcast, we have long had a complex hybrid data lake environment consisting
of data lakes on-prem and in multiple cloud environments. Comcast uses Trino to
bridge the data in these environments using an architecture we call Data Fabric.
Data Fabric is an abstraction layer that uses an internally built connector that
connects to multiple instances of Trino. This enables us to query across all
of these environments from a single Trino instance.&lt;/p&gt;

&lt;p&gt;In recent years, emerging architectures like Data Mesh have nicely complemented
the goals we have been building toward for years. While we have effectively
implemented some aspects of a Data Mesh, there are still core tenets that
cannot be addressed by Trino alone. This is the journey we are on at Comcast,
and we would like to share our experience so far, the challenges we overcame,
and the ones yet to be resolved. Data abstraction, availability, movement, and
governance are the topics we will touch upon in this session.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Sajumon Joseph, Sr Principal Architect&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Pavan Madhineni, Sr. Manager; Product Development Engineering&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;trino-at-quora-speed-cost-reliability-challenges-and-tips&quot;&gt;Trino at Quora: Speed, Cost, Reliability Challenges and Tips&lt;/h3&gt;

&lt;p&gt;Trino has become an essential part of Quora’s tech stack and a major component
of our A/B testing framework that powers our decision-making on the product.
Trino has brought a lot of advantages to us. However, at Quora’s scale, we face
cost, speed, and reliability challenges when operating Trino.&lt;/p&gt;

&lt;p&gt;In this session, we will talk about how we resolve these challenges. Some
approaches are: auto-scaling Trino clusters; experimenting with different
cluster and JVM configurations and instance types; building checkers to detect
slow workers and inefficient queries; and setting up extensive monitoring.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Yifan Pan, Software Engineer of Data Infrastructure Team at Quora; 
Administrator/Primary Owner of Trino infrastructure at Quora&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;how-we-use-trino-to-analyze-our-product-led-growth-plg-user-activation-funnel&quot;&gt;How we use Trino to analyze our Product-led Growth (PLG) user activation funnel&lt;/h3&gt;

&lt;p&gt;Being a PLG company, we must track and analyze every action our users perform
within the product to remove friction and maximize usage and satisfaction. To
understand how effectively and quickly users become educated and then active in
the product, we had to instrument the user journey from signup to the Aha moment
and beyond.&lt;/p&gt;

&lt;p&gt;There are many tools on the market that can be used to analyze user behavior,
but none met our needs. In this session you will learn how we built a data
architecture to collect, model, and enrich user behavior events and optimized
Trino query performance, accelerating our ability to understand and improve
user conversion rates.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Roy Hasson, Head of Product at Upsolver&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I hope you all are as excited as we are to finally federate the Trino community
face-to-face! This conference is shaping up to be educational, fun, and filled
with Trino experts and aficionados.&lt;/p&gt;

&lt;p&gt;Stay tuned for new developments in upcoming blog posts, don’t forget to
&lt;a href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;register&lt;/a&gt;, and always, &lt;strong&gt;&lt;em&gt;Federate them
all&lt;/em&gt;&lt;/strong&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Trino has long been the de facto standard for querying large data sets over your cloud or on-prem storage, also known as data lakes. This Trino Summit’s theme will instead showcase Trino’s other claim to fame: query federation. Trino is a query engine providing an access point that exposes ANSI SQL across multiple data sources. I urge you to join us either in-person or virtually if you are a fan of Trino, big data, open source, data engineering, Java, or all the above! This conference is free and takes place in San Francisco, California on November 10th.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/summit-logo.png" />
      
    </entry>
  
    <entry>
      <title>Release of the second edition of Trino: The Definitive Guide</title>
      <link href="https://trino.io/blog/2022/10/03/the-definitive-guide-2.html" rel="alternate" type="text/html" title="Release of the second edition of Trino: The Definitive Guide" />
      <published>2022-10-03T00:00:00+00:00</published>
      <updated>2022-10-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/10/03/the-definitive-guide-2</id>
      <content type="html" xml:base="https://trino.io/blog/2022/10/03/the-definitive-guide-2.html">&lt;p&gt;It was time for a refresh. A little while ago in April 2021, we announced
the &lt;a href=&quot;https://trino.io/blog/2021/04/21/the-definitive-guide.html&quot;&gt;Trino version of our definitive guide&lt;/a&gt;. But again, Trino as a project and community
has continued to innovate and grow. Numerous smaller and larger details changed,
and the examples and resources needed to be fixed.&lt;/p&gt;

&lt;p&gt;Today, we are happy to announce that after a few months of updates, testing, and
editing, the second edition of &lt;strong&gt;Trino: The Definitive Guide&lt;/strong&gt; is available.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;get-a-free-copy-from-starburst-now&quot;&gt;&lt;a href=&quot;https://www.starburst.io/info/oreilly-trino-guide/&quot;&gt;Get a free copy from Starburst now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;!--more--&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.oreilly.com/library/view/trino-the-definitive/9781098137229/&quot;&gt;new edition of the book from
O’Reilly&lt;/a&gt;
is available in digital formats as well as physical copies. You can find more
information about the book on &lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;our permanent page about
it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The book is now updated to Trino release 392 for all filenames, installation
methods, commands, names, and properties. We also addressed all problems that
our readers found and reported to us.&lt;/p&gt;

&lt;p&gt;We updated to Java 17 usage, added more SQL statements, and added info about
&lt;a href=&quot;https://trino.io/blog/2022/09/20/python-progress.html&quot;&gt;Python tools like dbt&lt;/a&gt; and clients like Metabase. We talk about the lakehouse architecture and new
connectors like Iceberg and Delta Lake.&lt;/p&gt;

&lt;p&gt;So what are you waiting for? Go get a copy, check out the &lt;a href=&quot;https://github.com/trinodb/trino-the-definitive-guide&quot;&gt;updated example code
repository&lt;/a&gt;, &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/README.md&quot;&gt;give us a
star&lt;/a&gt;, provide feedback,
and contact us on &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Martin, and Matt&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And one last tip, join us at &lt;a href=&quot;/blog/2022/09/22/trino-summit-2022-teaser.html&quot;&gt;Trino Summit 2022&lt;/a&gt; in San Francisco in November for a chat
and maybe even a signed hardcopy of the book.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso, Matt Fuller</name>
        </author>
      

      <summary>It was time for a refresh. A little while ago in April 2021, we announced the Trino version of our definitive guide. But again, Trino as a project and community has continued to innovate and grow. Numerous smaller and larger details changed, and the examples and resources needed to be fixed. Today, we are happy to announce that after a few months of updates, testing, and editing, the second edition of Trino: The Definitive Guide is available. Get a free copy from Starburst now!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/ttdg2-cover.png" />
      
    </entry>
  
    <entry>
      <title>Trino Summit 2022 will be legendary</title>
      <link href="https://trino.io/blog/2022/09/22/trino-summit-2022-teaser.html" rel="alternate" type="text/html" title="Trino Summit 2022 will be legendary" />
      <published>2022-09-22T00:00:00+00:00</published>
      <updated>2022-09-22T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/09/22/trino-summit-2022-teaser</id>
      <content type="html" xml:base="https://trino.io/blog/2022/09/22/trino-summit-2022-teaser.html">&lt;p&gt;Commander Bun Bun is back and this year we have an exciting lineup of speakers.
Topics range from architectures like data mesh and data lakehouse, to running
Trino at scale with fault-tolerant execution, and query federation. This 
conference is free and takes place on November 10th. The summit is a hybrid
event for in-person and virtual attendance. Find out more details below!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;register-for-the-summit&quot;&gt;Register for the summit&lt;/h2&gt;

&lt;p&gt;This year’s Trino Summit will be hosted at the Commonwealth Club in San 
Francisco, CA. In-person registration is limited to 250 seats so make sure you
register quickly before spots run out!&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;
        Register now
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;h3 id=&quot;trino-summit-2022-teaser&quot;&gt;Trino Summit 2022 teaser&lt;/h3&gt;

&lt;p&gt;Get ready to federate them all this year! Many times when folks think of Trino,
their first instinct is to consider the data lake use case where it replaces
Hive or other data lakehouse query engines. However, this summit will also drill
into the lesser discussed query federation use case. Federate ‘em all!&lt;/p&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/o2MJvRKG14M&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px;
margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt;
&lt;/iframe&gt;

&lt;h2 id=&quot;announcing-the-first-sessions-and-speakers&quot;&gt;Announcing the first sessions and speakers&lt;/h2&gt;

&lt;p&gt;We have a full roster planned, but here is a glance at a few fully confirmed
sessions. Stay tuned for future blog posts as we announce more sessions as they
are confirmed!&lt;/p&gt;

&lt;h3 id=&quot;state-of-trino-keynote&quot;&gt;State of Trino keynote&lt;/h3&gt;

&lt;p&gt;Hear the latest on the state of the open source Trino project. Trino
is the award-winning MPP SQL query engine. In this session, Trino creators
discuss the latest features that have landed in the last year, the roadmap for
the year ahead, and community growth highlights.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Martin Traverso, Co-Creator of Trino and CTO, Starburst&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Dain Sundstrom, Co-Creator of Trino and CTO, Starburst&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;David Phillips, Co-Creator of Trino and CTO, Starburst&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;trino-for-large-scale-etl-at-lyft&quot;&gt;Trino for large scale ETL at Lyft&lt;/h3&gt;

&lt;p&gt;At Lyft, we process petabytes of data daily through Trino for various use
cases. A single query can run for as long as four hours with terabytes of
memory reserved. There are quite a few challenges in operating Trino ETL at
such a scale: how to make all queries as performant as possible with low
failure rates; how to define clusters, routing groups, and resource groups for
volume that changes across the day; and how to keep our commitment to user SLOs
during unexpected spikes.&lt;/p&gt;

&lt;p&gt;We’ll share what we’ve done with configuration tuning, identification of large
queries and users, autoscaling, and fault-tolerance features to run Trino at
such a scale. We’ll also share our upcoming challenges and our plans to take
Trino adoption further across the company.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Charles Song, Senior Software Engineer at Lyft&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;rewriting-history-migrating-petabytes-of-data-to-apache-iceberg-using-trino&quot;&gt;Rewriting history: Migrating petabytes of data to Apache Iceberg using Trino&lt;/h3&gt;

&lt;p&gt;Dataset interoperability between data platform components continues to
be a difficult hurdle to overcome. This shortcoming often results in siloed
data and frustrated users. Although open table formats like Apache Iceberg aim
to break down these silos by providing a consistent and scalable table
abstraction, migrating your pre-existing data archive to a new format can still
be daunting. This talk will outline challenges we faced when rewriting petabytes
of Shopify’s data into the Iceberg table format using the Trino engine. In this
rapidly evolving landscape, I will highlight recent contributions to Trino’s
Iceberg integration that made our work possible, while also illustrating how we
designed our system to scale. Topics will include: what to consider when designing your
migration strategy, how we optimized Trino’s write performance and how to
recover from corrupt table states. Finally, I will compare the query performance
of old and migrated datasets using Shopify’s datasets as benchmarks.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Marc Laforet, Senior Data Engineer at Shopify&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;federating-them-all-on-starburst-galaxy&quot;&gt;Federating them all on Starburst Galaxy!&lt;/h3&gt;

&lt;p&gt;You’ve federated them all on Trino, but to beat the elite four at
Indigo Plateau, every data trainer needs help. In this talk, I will cover how
Starburst Galaxy is the fastest path to query federation and cover a demo that
trainers can follow later. We’ll also cover cool features like schema discovery
and fault-tolerance execution. The queries we’ll run will be with Pokémon data
so that you don’t have to witness yet another taxi cab or iris data set.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Monica Miller, Developer Advocate at Starburst&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;using-trino-with-apache-airflow-for-almost-all-your-data-problems&quot;&gt;Using Trino with Apache Airflow for (almost) all your data problems&lt;/h3&gt;

&lt;p&gt;Trino is incredibly effective at enabling users to extract insights
quickly and effectively from large amounts of data located in dispersed and
heterogeneous federated data systems. However, some business data problems are
more complex than interactive analytics use cases, and are best broken down into
a sequence of interdependent steps, a.k.a. a workflow. For these use cases,
dedicated software is often required in order to schedule and manage these
processes with a principled approach. In this session, we will look at how we
can leverage Apache Airflow to orchestrate Trino queries into complex workflows
that solve practical batch processing problems, all the while avoiding the use
of repetitive, redundant data movement.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Philippe Gagnon, Solutions Architect at Astronomer&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Stay tuned for new developments in upcoming blog posts, don’t forget to
&lt;a href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;register&lt;/a&gt;, and always, federate them
all!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen, Dain Sundstrom</name>
        </author>
      

      <summary>Commander Bun Bun is back and this year we have an exciting lineup of speakers. Topics range from architectures like data mesh and data lakehouse, to running Trino at scale with fault-tolerant execution, and query federation. This conference is free and takes place on November 10th. The summit is a hybrid event for in-person and virtual attendance. Find out more details below!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-summit-2022/summit-logo.png" />
      
    </entry>
  
    <entry>
      <title>Trino charms Python</title>
      <link href="https://trino.io/blog/2022/09/20/python-progress.html" rel="alternate" type="text/html" title="Trino charms Python" />
      <published>2022-09-20T00:00:00+00:00</published>
      <updated>2022-09-20T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/09/20/python-progress</id>
      <content type="html" xml:base="https://trino.io/blog/2022/09/20/python-progress.html">&lt;p&gt;Wow, have we ever come a long way with Python support for Trino. It feels like
ages ago that we talked about DB-API, trino-python-client, SQLAlchemy, Apache
Superset, and more in &lt;a href=&quot;https://trino.io/episodes/12.html&quot;&gt;Trino Community Broadcast episode
12&lt;/a&gt;. More recently we talked about dbt in
&lt;a href=&quot;https://trino.io/episodes/21.html&quot;&gt;episode 21&lt;/a&gt; and &lt;a href=&quot;https://trino.io/episodes/30.html&quot;&gt;episode
30&lt;/a&gt;, but there is so much more for Pythonistas,
Pythonians, Python programmers, and simply users of Python-powered tools.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;where-are-we-now&quot;&gt;Where are we now&lt;/h2&gt;

&lt;p&gt;Python usage shows up in nearly every Trino deployment these days, and we have
had some really great developments for you all in recent months:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.starburst.io&quot;&gt;Starburst&lt;/a&gt; has really ramped up the contributions to
the foundation of a lot of Python tools connecting to Trino. The
&lt;a href=&quot;https://github.com/trinodb/trino-python-client&quot;&gt;trino-python-client&lt;/a&gt; receives
improvements regularly and is definitely a first-class client at the same
level as the JDBC driver or the CLI.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.getdbt.com/&quot;&gt;dbt Labs&lt;/a&gt; and Starburst have worked hard on
launching and improving the &lt;a href=&quot;https://github.com/starburstdata/dbt-trino&quot;&gt;dbt-trino
project&lt;/a&gt; and enabling automated
data transformation flows.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://airflow.apache.org/&quot;&gt;Apache Airflow&lt;/a&gt; use cases abound and the
&lt;a href=&quot;/blog/2022/07/13/how-to-use-airflow-to-schedule-trino-jobs.html&quot;&gt;integration is improving&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://superset.apache.org/&quot;&gt;Apache Superset&lt;/a&gt; and
&lt;a href=&quot;https://preset.io/&quot;&gt;Preset&lt;/a&gt; continue to add features and treat Trino as a
major data source and integration, and we should probably have another Trino
Community Broadcast episode to see that all in action.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://airbyte.com/&quot;&gt;Airbyte&lt;/a&gt; was &lt;a href=&quot;/blog/2022/05/17/cinco-de-trino-recap.html&quot;&gt;demoed at Cinco de Trino&lt;/a&gt; and is &lt;a href=&quot;/blog/2022/05/24/an-opinionated-guide-to-consolidating-our-data.html&quot;&gt;widely used by companies such as
Lyft&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
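As a sketch of what the dbt-trino adapter mentioned above enables, a minimal profiles.yml entry might look like the following. The profile name, host, catalog, and schema here are hypothetical placeholders, not details from this post:

```yaml
# profiles.yml - hypothetical minimal dbt-trino profile
# (profile name, host, catalog, and schema are placeholders)
my_trino_profile:
  target: dev
  outputs:
    dev:
      type: trino
      method: none          # no authentication, e.g. for a local test server
      host: trino.example.com
      port: 8080
      user: dbt
      catalog: hive
      schema: analytics
```

With a profile like this in place, dbt compiles your models and runs them as SQL statements against the configured Trino catalog and schema.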

&lt;p&gt;And of course there are well-known usages such as notebooks everywhere, on your
workstation, in your company, and out in the cloud. But is there more? There
must be!&lt;/p&gt;

&lt;h2 id=&quot;what-else-could-we-do&quot;&gt;What else could we do&lt;/h2&gt;

&lt;p&gt;All of these developments are great for our users. I want to encourage you all
to try these tools and learn how amazing they are with Trino. At the same time,
it feels like there has got to be even more. The Python ecosystem is so large,
and there are probably dozens of use cases we have never heard about, have not
considered, or have only dreamed about in our wildest dreams.&lt;/p&gt;

&lt;p&gt;On the other hand, I am sure there are still problems with these tools and
integrations. What is an edge case for us might be a daily task for you. What
we consider hard and complicated might be just what you have to deal with
anyway. And in the spirit of constant improvement, we really want to fix these
things and make it all amazing. But we need your help.&lt;/p&gt;

&lt;h2 id=&quot;let-us-know-what-you-think&quot;&gt;Let us know what you think&lt;/h2&gt;

&lt;p&gt;This is now your opportunity to tell us what you need to make your Trino and
Python experience better.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://forms.gle/4bzMPZxby6E4xKm98&quot; target=&quot;_blank&quot;&gt;
        Help Trino and Python
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Trino, Python, and all the tools in the ecosystem go from strength to strength.
We want to supercharge the tooling to hero levels, and with your help and input
we can do it.&lt;/p&gt;

&lt;p&gt;Join us in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python-client&lt;/code&gt; channel on &lt;a href=&quot;https://trino.io/community.html&quot;&gt;Trino Slack&lt;/a&gt;,
and don’t forget to &lt;a href=&quot;https://forms.gle/4bzMPZxby6E4xKm98&quot;&gt;answer that survey&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks, and see you at the &lt;a href=&quot;/blog/2022/06/30/trino-summit-call-for-speakers.html&quot;&gt;Trino Summit 2022&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manfred, Brian, and Dain&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Brian Zhan, Dain Sundstrom</name>
        </author>
      

      <summary>Wow, have we ever come a long way with Python support for Trino. It feels like ages ago that we talked about DB-API, trino-python-client, SQLAlchemy, Apache Superset, and more in Trino Community Broadcast episode 12. More recently we talked about dbt in episode 21 and episode 30, but there is so much more for Pythonistas, Pythonians, Python programmers, and simply users of Python-powered tools.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/images/logos/python.png" />
      
    </entry>
  
    <entry>
      <title>Trino&apos;s tenth birthday celebration recap</title>
      <link href="https://trino.io/blog/2022/09/12/tenth-birthday-celebration-recap.html" rel="alternate" type="text/html" title="Trino&apos;s tenth birthday celebration recap" />
      <published>2022-09-12T00:00:00+00:00</published>
      <updated>2022-09-12T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/09/12/tenth-birthday-celebration-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/09/12/tenth-birthday-celebration-recap.html">&lt;p&gt;What an exciting month we had in August! August marked the ten-year birthday of
the Trino project. Don’t worry if you missed all the excitement, as we’ve
condensed it all in this post.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;blog-posts&quot;&gt;Blog posts&lt;/h2&gt;

&lt;p&gt;We felt it necessary to chronicle the larger events that happened in the last
decade of the project through the lens of where we are today.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html&quot;&gt;Why leaving Facebook/Meta was the best thing we could do for the Trino Community&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2022/08/04/decade-innovation.html&quot;&gt;A decade of query engine innovation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2022/08/08/trino-tenth-birthday.html&quot;&gt;Happy tenth birthday Trino!&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We shared these posts on Hacker News, and the Facebook/Meta and query innovation
posts both hit the front page. This resulted in one of the largest numbers of
page views on the Trino website in a single day - more than 25k views!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-tenth-birthday/hn-top.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;trino-ten-year-timeline-video&quot;&gt;Trino ten-year timeline video&lt;/h2&gt;

&lt;p&gt;Another way we celebrated was creating an epic ten-year montage video that
chronicles the incredible journey starting with the Presto project’s humble
beginnings, and how it evolved into the success that Trino is today:&lt;/p&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/hPD95_-bZZw&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px;
margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt;
&lt;/iframe&gt;

&lt;h2 id=&quot;birthday-celebration-with-the-creators-of-trino&quot;&gt;Birthday celebration with the creators of Trino&lt;/h2&gt;

&lt;p&gt;To cap things off last month, we hosted a meetup with the creators to reflect
on the last ten years, laugh and listen to some stories from the early days,
talk about the exciting features currently launching, and speculate on the next
ten years of Trino. Here are some highlights you missed:&lt;/p&gt;

&lt;h3 id=&quot;adding-dynamic-catalogs&quot;&gt;Adding dynamic catalogs&lt;/h3&gt;

&lt;p&gt;Dain discusses what dynamic catalogs could look like in Trino. Currently, to add
a catalog to Trino, you need to add a new catalog configuration file and then
restart Trino. With dynamic catalogs, you can add and remove catalogs at
runtime with no restart required. There is still no guarantee of exactly when
this feature will arrive, but some of the foundations are currently being
added. &lt;a href=&quot;https://www.youtube.com/clip/UgkxkYmwM6gmw9-GceMUb5IxqIKm0qNXt3fY&quot; target=&quot;_blank&quot;&gt;
&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; Dain dives into this a bit
more in this clip&lt;/a&gt;&lt;/p&gt;
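To illustrate the static workflow described above, a catalog is defined by a properties file. Here is a hypothetical minimal example for a PostgreSQL catalog; the file name, connection URL, and credentials are placeholders, not values from this post:

```properties
# etc/catalog/postgresql.properties - hypothetical example catalog
# (host, database name, and credentials are placeholders)
connector.name=postgresql
connection-url=jdbc:postgresql://db.example.com:5432/exampledb
connection-user=trino
connection-password=secret
```

Today a file like this must be present on every node before startup, and adding or removing one requires a restart; the dynamic catalog work aims to make such changes take effect at runtime.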

&lt;h3 id=&quot;vectorization-and-performance&quot;&gt;Vectorization and performance&lt;/h3&gt;

&lt;p&gt;As more marketing around vectorized databases has come up recently, many have
asked whether Trino will follow the trend. This question comes up at an
interesting time, as
&lt;a href=&quot;https://trino.io/episodes/36.html&quot;&gt;Trino now requires Java 17 to run&lt;/a&gt;. Java 17
comes with a lot of capabilities to vectorize, and while we are excited to start
looking into these capabilities, simply updating workloads to use vectorization
doesn’t pack the performance punch that many would expect it to. The answer is
more complex:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Do modern workloads benefit from vectorization? 
&lt;a href=&quot;https://www.youtube.com/clip/UgkxmPAur8thP_D-_GpCcg-sqprEAqwWdyck&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
See Martin’s answer to this&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Is there a benefit to vectorization over Java’s auto-vectorization?
&lt;a href=&quot;https://www.youtube.com/clip/Ugkx1AKbq0jQyZhOH4MKNf3LO4i9kZAmLqpJ&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
Sometimes, but Dain elaborates on when&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;If not vectorization, what type of performance improvements does Trino focus on?
&lt;a href=&quot;https://www.youtube.com/clip/UgkxQwDYDS6evVJelNVjWAgrIhzg_Q-cAEyq&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
Martin and Dain list some simple but impactful ones&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The debate around query-time optimization versus runtime adaptation.
&lt;a href=&quot;https://www.youtube.com/clip/Ugkxt5ryTBP-EPEEo_OOcW2PKvNiJkj5n8UR&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
Which should you optimize first?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;polymorphic-table-functions&quot;&gt;Polymorphic table functions&lt;/h3&gt;

&lt;p&gt;One feature that is top of mind for everyone in the Trino project is
&lt;a href=&quot;/blog/2022/07/22/polymorphic-table-functions.html&quot;&gt;polymorphic table functions&lt;/a&gt;,
or simply “table functions” as Dain prefers to call them.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;What is a table function?
&lt;a href=&quot;https://www.youtube.com/clip/Ugkx62IKgPd_v9eGBaPUHP2hyaRkWSXh8w8h&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
David and Dain discuss standard and polymorphic table functions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Could we rewrite the &lt;a href=&quot;https://trino.io/docs/current/connector/googlesheets&quot;&gt;Google Sheets connector&lt;/a&gt;
as a table function?
&lt;a href=&quot;https://www.youtube.com/clip/UgkxKIhplQHgEULQkSrjKs4M5w8oNdQMJaoL&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
David and Dain discuss how this would work&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Why table functions are so incredibly powerful.
&lt;a href=&quot;https://www.youtube.com/clip/UgkxQcokpdgPjiuMKMC5-3HwHvlbmZjxAvxe&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
Eric and Dain talk about why PTFs are a game changer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to learn more about polymorphic table functions, check out the
recent &lt;a href=&quot;https://trino.io/episodes/38.html&quot;&gt;Trino Community Broadcast episode&lt;/a&gt; that
covers the potential of these functions in much more detail.&lt;/p&gt;

&lt;h3 id=&quot;the-early-days-of-presto-and-trino&quot;&gt;The early days of Presto and Trino&lt;/h3&gt;

&lt;p&gt;We wanted to get some insight into what the early days of the project looked
like, and how Martin, Dain, David, and Eric began the daunting task of designing
and building a distributed query engine from scratch. Some of the discussions
were interesting while others were downright hilarious. Here are some steps you
can take to write your own query engine, at least if you want to do it the way
the Trino creators did it:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Look up a bunch of research papers to see how others are doing this 📑.
  &lt;a href=&quot;https://www.youtube.com/clip/UgkxGjPYZRx8rhtAndyho7AZgsM4e9wG9Jt4&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
  Video&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;Side note: Papers tend to be highly aspirational and skip important fundamentals.
&lt;a href=&quot;https://www.youtube.com/clip/Ugkx6Hqe5iglsTgrR9hVo9U3ITi8LSxxMu4U&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
Video&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Address the real challenges of making a query engine.
  &lt;a href=&quot;https://www.youtube.com/clip/Ugkx57PezuXyRWHrxxxoLaKni6jqFZ-StwY-&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
  Video&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Take your initial version and just throw it away 😂🗑🚮.
  &lt;a href=&quot;https://www.youtube.com/clip/UgkxJz7zve36QJZZDdtC3S29vI-Ak1jRifAH&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
  Video&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Expand outside the initial use cases by learning from other companies and
  building community 👥.
  &lt;a href=&quot;https://www.youtube.com/clip/UgkxQrBl0BzOrjvwDcEN4KAAyqehcRUc1tsf&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
  Video&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Cause a &lt;a href=&quot;https://en.wikipedia.org/wiki/Brownout_(software_engineering)&quot;&gt;brownout&lt;/a&gt;
  on the Facebook network 📉.
  &lt;a href=&quot;https://www.youtube.com/clip/Ugkx6SyQTFgwX_kdeH018VGt2pMUbldvuKtC&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
  Video&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Realize the system you replaced was actually faster in some cases, but
  for all the wrong reasons ❌🙅.
  &lt;a href=&quot;https://www.youtube.com/clip/UgkxTqBY2nMAALn-OkglE5DT9dHlBuC18qf8&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt;
  Video&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After much of the initial work was done, Presto was deployed at Facebook and
open sourced soon after. From there, the velocity of the project
picked up, and once the project became independent of Facebook, the features took
off even more. While everything may seem calculated in hindsight, it took a lot
of hard work to grow the community and adoption around Presto and now Trino.
The creators knew they were making a project that would be utilized outside the
walls of Facebook, but
&lt;a href=&quot;https://www.youtube.com/clip/Ugkxh2J-1bi1rUoBpuld_FAuXYZgz2bvqPPx&quot; target=&quot;_blank&quot;&gt;&lt;i class=&quot;fab fa-youtube&quot; style=&quot;color: red;&quot;&gt;&lt;/i&gt; they could never have
anticipated the sheer scale of adoption Trino would see&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We hope you enjoyed all the fun we had celebrating these first ten years of the
Trino project. We are thrilled to think of what the following decades will
bring. We’d like to leave you with closing thoughts from Dain:&lt;/p&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/6TFLKcF24HM?clip=Ugkx5bFnjvRX0USjk8vgRJdqLwZQo7Ffg0xm&amp;amp;clipt=ELfJ2gEY8o7eAQ&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px;
margin-bottom:5px; max-width: 100%;&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;
&lt;/iframe&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>What an exciting month we had in August! August marked the ten-year birthday of the Trino project. Don’t worry if you missed all the excitement as we’ve condensed it all in this post.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-tenth-birthday/creators.jpeg" />
      
    </entry>
  
    <entry>
      <title>Make your Trino data pipelines production ready with Great Expectations</title>
      <link href="https://trino.io/blog/2022/08/24/data-pipelines-production-ready-great-expectations.html" rel="alternate" type="text/html" title="Make your Trino data pipelines production ready with Great Expectations" />
      <published>2022-08-24T00:00:00+00:00</published>
      <updated>2022-08-24T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/08/24/data-pipelines-production-ready-great-expectations</id>
      <content type="html" xml:base="https://trino.io/blog/2022/08/24/data-pipelines-production-ready-great-expectations.html">&lt;p&gt;An important aspect of a good data pipeline is ensuring data quality. 
You need to verify that the data is what you’re expecting it to be at any given
state. &lt;a href=&quot;https://greatexpectations.io/&quot;&gt;Great Expectations&lt;/a&gt; is an open source
tool created in Python that allows you to write detailed tests called
&lt;a href=&quot;https://docs.greatexpectations.io/docs/terms/expectation/&quot;&gt;expectations&lt;/a&gt;
against your data. Users write these expectations to run validations against the
data as it enters your system. These expectations are expressed as methods in
Python, and stored in JSON and YAML files. One great advantage of expectations
is the human-readable documentation that results from these tests. As you roll
out different versions of the code, you get alerted to any unexpected changes
and have version-specific generated documentation for what changed. Let’s learn
how to write expectations on tables in Trino!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;the-need-for-data-quality&quot;&gt;The need for data quality&lt;/h2&gt;

&lt;p&gt;Managing data pipelines is not for the faint of heart. Nodes fail, you run
out of memory, bursty traffic causes abnormal behavior, and that’s just the tip
of the iceberg. Lots of Trino community members build sophisticated
data pipelines and data applications using Trino. Building data pipelines in
Trino became more common with the addition of a
&lt;a href=&quot;/blog/2022/05/05/tardigrade-launch.html&quot;&gt;fault-tolerant execution mode&lt;/a&gt; to
safeguard against failures when executing long-running and 
resource-intensive queries.&lt;/p&gt;

&lt;p&gt;Aside from all the infrastructure problems that concern data teams, another
category of problems has been a silent one for quite some time:
data quality. Faulty data comes in, which can either cause data pipelines to
fail, or possibly go unnoticed and cause inaccurate downstream reporting.
Knowledge is scattered among domain experts, technical experts, and the code and
data itself. Maintenance becomes time-consuming and expensive. Documentation
gets out of date and unreliable. This is why data quality checks with
libraries like Great Expectations are so important when writing ETL applications.&lt;/p&gt;
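&lt;p&gt;To make the idea concrete, here is a minimal sketch of expectation-style
checks written in plain Python with only the standard library. It illustrates
the concept of expectations producing structured pass/fail results; it does not
use Great Expectations’ actual API, and the column names and sample rows are
hypothetical.&lt;/p&gt;

```python
# Conceptual sketch of expectation-style data checks (standard library only).
# This mimics the *idea* of Great Expectations, not its real API; the
# "combat_power" column and sample rows are hypothetical.
import csv
import io

def expect_column_values_to_not_be_null(rows, column):
    """Return a structured result, in the spirit of an expectation."""
    failures = [i for i, row in enumerate(rows) if not row.get(column)]
    return {"success": not failures, "unexpected_rows": failures}

def expect_column_values_to_be_between(rows, column, low, high):
    failures = [i for i, row in enumerate(rows)
                if not low <= float(row[column]) <= high]
    return {"success": not failures, "unexpected_rows": failures}

raw = io.StringIO("name,combat_power\nPikachu,320\nSnorlax,\n")
rows = list(csv.DictReader(raw))

null_check = expect_column_values_to_not_be_null(rows, "combat_power")
# The empty combat_power value for Snorlax fails the expectation,
# so null_check is {"success": False, "unexpected_rows": [1]}.
```

&lt;p&gt;In Great Expectations itself, results like these are additionally rendered
into the human-readable documentation mentioned above.&lt;/p&gt;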

&lt;h2 id=&quot;improve-data-quality-in-trino-with-great-expectations&quot;&gt;Improve data quality in Trino with Great Expectations&lt;/h2&gt;

&lt;p&gt;As data quality moves to the forefront of the Trino community, the Great
Expectations and Trino communities have partnered to do some events together:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=pcqAOq3O3Ts&amp;amp;list=PLFnr63che7wZij92ynF_egatbsrH7by7T&amp;amp;index=3&quot;&gt;Trino meetup to discuss Great Expectations&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=4SieRmibb0U&quot;&gt;Great Expectations meetup to discuss Trino&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://superconductive.ai/&quot;&gt;Superconductive&lt;/a&gt; joined this year’s mini Trino 
Summit event 
&lt;a href=&quot;https://www.youtube.com/watch?v=kfJ63DNbAuI&amp;amp;list=PLFnr63che7wYDHjUsmp43THLmAlqPDHlM&quot;&gt;Cinco de Trino&lt;/a&gt;
to showcase using 
&lt;a href=&quot;https://www.youtube.com/watch?v=9HE6LawCHP8&amp;amp;list=PLFnr63che7wYDHjUsmp43THLmAlqPDHlM&amp;amp;index=7&quot;&gt;managed solutions for Great Expectations and Trino&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, we’re walking through a demo that showcases a scenario with Trino running
as the data lake query engine, with multiple phases of data transformations on
some Pokemon data sets. At each phase, we need to validate that the data matches
the expected schema, row counts, and various other criteria. We use Trino
with a Hive table over CSV files for ingest, and then move to Iceberg tables for
the structure and consume phases. This showcases one of Trino’s great strengths:
you can operate across any of the popular table formats.&lt;/p&gt;

&lt;h2 id=&quot;trino-and-great-expectations-demo&quot;&gt;Trino and Great Expectations demo&lt;/h2&gt;

&lt;p&gt;In this scenario, we’re going to ingest Pokemon pokedex data and Pokemon Go
spawn location data, which lands as raw CSV files in our data lake. We then use
Trino’s Hive catalog to read the data from the landing files, clean it up, and
optimize that raw data into more performant ORC files in the structure tables.&lt;/p&gt;
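&lt;p&gt;In the demo, this ingest-to-structure cleanup happens with Trino SQL. As a
rough, hypothetical illustration of the kind of transformation involved (the
column names and sample rows are made up), the same idea looks like this in
plain Python:&lt;/p&gt;

```python
# Rough illustration of the ingest-to-structure cleanup step. In the demo
# this is done with Trino SQL over Hive and Iceberg tables; here we mimic
# the same idea in plain Python with hypothetical columns.
import csv
import io

raw_landing = io.StringIO(
    "pokemon_id,name,latitude,longitude\n"
    "25,Pikachu,37.77,-122.41\n"
    ",MissingId,0,0\n"  # malformed row: no id
    "143,Snorlax,40.71,-74.00\n"
)

structured = []
for row in csv.DictReader(raw_landing):
    if not row["pokemon_id"]:  # drop rows that fail basic checks
        continue
    structured.append({
        "pokemon_id": int(row["pokemon_id"]),
        "name": row["name"],
        "latitude": float(row["latitude"]),
        "longitude": float(row["longitude"]),
    })
# `structured` now holds typed rows ready for the structure table.
```

&lt;p&gt;In Trino, this step is typically expressed as a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE TABLE ... AS SELECT&lt;/code&gt;
from the CSV-backed landing table into an ORC-backed structure table.&lt;/p&gt;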

&lt;p&gt;&lt;img src=&quot;/assets/blog/data-pipelines-production-ready-great-expectations/trino-ge-lakehouse.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The last step is to join and transform the spawn data and pokedex data into a
single table that is cleaned and ready to be utilized by a data analyst, data
scientist, or other data consumer. Every point in the pipeline where the data is
transformed introduces a liability: the data can go from good to bad when
infrastructure fails, or when newer versions of the pipeline roll out.
This is where adding Great Expectations is crucial.&lt;/p&gt;

&lt;p&gt;Now that you have a better understanding of the scenario, feel free to watch the
video, and try running it yourself!&lt;/p&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/h6UYOilESfQ&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; 
margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; 
&lt;/iframe&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://github.com/bitsondatadev/trino-datalake/blob/main/tutorials/expecting-greatness-from-trino.md&quot;&gt;Try this Trino demo yourself »&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;While data quality has always been a requirement, the standards for it increase
as the complexity of data lakes increases. It is a necessity that improves the
trust that data consumers have in the data. Dive into the 
&lt;a href=&quot;https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/database/trino/&quot;&gt;Great Expectations documentation&lt;/a&gt;
to learn more about the existing Trino support. If you run into any issues while
running the demo, reach out on &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt; and let us 
know!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen, Brian Zhan</name>
        </author>
      

      <summary>An important aspect of a good data pipeline is ensuring data quality. You need to verify that the data is what you’re expecting it to be at any given state. Great Expectations is an open source tool created in Python that allows you to write detailed tests called expectations against your data. Users write these expectations to run validations against the data as it enters your system. These expectations are expressed as methods in Python, and stored in JSON and YAML files. One great advantage of expectations is the human readable documentation that results from these tests. As you roll out different versions of the code, you get alerted to any unexpected changes and have version-specific generated documentation for what changed. Let’s learn how to write expectations on tables in Trino!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/data-pipelines-production-ready-great-expectations/trino-ge.png" />
      
    </entry>
  
    <entry>
      <title>Happy tenth birthday Trino!</title>
      <link href="https://trino.io/blog/2022/08/08/trino-tenth-birthday.html" rel="alternate" type="text/html" title="Happy tenth birthday Trino!" />
      <published>2022-08-08T00:00:00+00:00</published>
      <updated>2022-08-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/08/08/trino-tenth-birthday</id>
      <content type="html" xml:base="https://trino.io/blog/2022/08/08/trino-tenth-birthday.html">&lt;p&gt;It’s inspiring and mindblowing to reflect on the ten year journey that has
produced the community around Trino. Trino is the community-driven fork from
Presto, the distributed big data SQL query engine created at Facebook in 2012. We
are a community of engineers, scientists, analysts, and visionaries that work in
a fast paced world where the expectations on the time to insights from our
analytics and the scale of the data are ever-increasing. Sometimes words only do
so much justice to encompass a journey like this one, so we created a video to
let you experience it yourself! Enjoy!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;trinos-first-ten-years-video&quot;&gt;Trino’s first ten years video&lt;/h1&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/hPD95_-bZZw&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; 
margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; 
&lt;/iframe&gt;

&lt;p&gt;As you watch the video and think back to the five years Presto and Trino shared,
you begin to appreciate the organic development of the community, and the
excitement around the solution space that the project brought to big data. As a
baseline, Trino offers a faster and more interactive alternative for accessing
data stored in HDFS via Hive. But the project didn’t stop there. Development of
the SPI abstracted metadata and storage access to different
systems, making Trino a suitable engine to query an entire data ecosystem from
one location using ANSI SQL! Since the two projects split, Trino development has
skyrocketed beyond the original project, adding an array of features that
we’ve listed out in the &lt;a href=&quot;/blog/2022/08/04/decade-innovation.html&quot;&gt;evolution of the Trino architecture blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-tenth-birthday/trajectory.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To really celebrate this milestone, we wanted to offer some exciting ways for
you to learn more about Trino, and spin up Trino on your own system to play
around with it. We have a list of blogs, project stats, and ways to get involved
below. Starburst is also celebrating by offering free Trino birthday t-shirts
when you 
&lt;a href=&quot;https://www.starburst.io/sweepstakes/?utm_campaign=space-quest&quot;&gt;complete their Space Quest League mission&lt;/a&gt;.
Also don’t forget to attend 
&lt;a href=&quot;/blog/2022/06/30/trino-summit-call-for-speakers.html&quot;&gt;our annual Trino Summit in November&lt;/a&gt;!&lt;/p&gt;

&lt;h1 id=&quot;learn-more-about-trino&quot;&gt;Learn more about Trino&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://medium.com/p/a5a1088d3114&quot;&gt;Intro to Trino for the Trinewbie&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html&quot;&gt;Why leaving Facebook/Meta was the best thing we could do for the Trino Community&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2022/08/04/decade-innovation.html&quot;&gt;A decade of query engine innovation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/12/27/announcing-trino.html&quot;&gt;We’re rebranding PrestoSQL to Trino&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/01/01/2019-summary.html&quot;&gt;Summary of features in 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/01/08/2020-review.html&quot;&gt;Summary of features in 2020&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/12/31/trino-2021-a-year-of-growth.html&quot;&gt;Summary of features in 2021&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;getting-started-with-trino&quot;&gt;Getting started with Trino&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/blog/2020/10/20/intro-to-hive-connector.html&quot;&gt;A gentle introduction to the Hive connector&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg.html&quot;&gt;Trino on ice I: A gentle introduction to Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/blog/2019/07/04/cbo-introduction.html&quot;&gt;Introduction to the Trino cost-based optimizer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsondatadev/trino-getting-started&quot;&gt;Trino getting started repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;community-statistics&quot;&gt;Community statistics&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;28250+ commits 💻 in GitHub&lt;/li&gt;
  &lt;li&gt;5750+ stargazers ⭐ in GitHub&lt;/li&gt;
  &lt;li&gt;7350+ members 👋 in Slack&lt;/li&gt;
  &lt;li&gt;6950+ pull requests merged ✅ in GitHub&lt;/li&gt;
  &lt;li&gt;4000+ issues 📝 created in GitHub&lt;/li&gt;
  &lt;li&gt;3750+ followers 🐦 on Twitter&lt;/li&gt;
  &lt;li&gt;650+ average weekly members 💬 in Slack&lt;/li&gt;
  &lt;li&gt;1050+ subscribers 📺 in YouTube&lt;/li&gt;
  &lt;li&gt;38 Trino Community Broadcast ▶️ episodes&lt;/li&gt;
  &lt;li&gt;264 Presto + Trino 🚀 releases (not including PrestoDB releases since the 
fork)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;join-our-community&quot;&gt;Join our community&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;Join the &lt;a href=&quot;/slack.html&quot;&gt;Trino Slack workspace&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watch the &lt;a href=&quot;/broadcast/&quot;&gt;Trino Community Broadcast&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Subscribe to the &lt;a href=&quot;https://www.youtube.com/c/trinodb&quot;&gt;Trino YouTube channel&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Follow us on the &lt;a href=&quot;https://twitter.com/trinodb&quot;&gt;trinodb Twitter account&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Give us a star on the &lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;Trino GitHub repository&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Follow us on the &lt;a href=&quot;https://www.linkedin.com/company/trino-software-foundation&quot;&gt;Trino LinkedIn account&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;trino-summit-2022&quot;&gt;Trino Summit 2022&lt;/h1&gt;

&lt;p&gt;We hope you all join us in celebrating Trino’s birthday today. If you want to 
learn even more, 
&lt;a href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;sign up for our hybrid event, Trino Summit, on the 10th of November 2022&lt;/a&gt;.
If you have a talk you’d like to give around Trino, the 
&lt;a href=&quot;https://www.starburst.io/info/trinosummit/#sponsors&quot;&gt;call for speakers&lt;/a&gt; is open
until September 15th.&lt;/p&gt;

&lt;p&gt;Join our community. We look forward to having you!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen, Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang</name>
        </author>
      

      <summary>It’s inspiring and mindblowing to reflect on the ten year journey that has produced the community around Trino. Trino is the community-driven fork from Presto, the distributed big data SQL query engine created at Facebook in 2012. We are a community of engineers, scientists, analysts, and visionaries that work in a fast paced world where the expectations on the time to insights from our analytics and the scale of the data are ever-increasing. Sometimes words only do so much justice to encompass a journey like this one, so we created a video to let you experience it yourself! Enjoy!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/trino-tenth-birthday/how-it-started-going.png" />
      
    </entry>
  
    <entry>
      <title>A decade of query engine innovation</title>
      <link href="https://trino.io/blog/2022/08/04/decade-innovation.html" rel="alternate" type="text/html" title="A decade of query engine innovation" />
      <published>2022-08-04T00:00:00+00:00</published>
      <updated>2022-08-04T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/08/04/decade-innovation</id>
      <content type="html" xml:base="https://trino.io/blog/2022/08/04/decade-innovation.html">&lt;p&gt;It’s amazing how far we have come! Our massively-parallel processing SQL query
engine, Trino, has really grown up. We have moved beyond just querying object
stores using Hive, beyond just one company using the project, beyond usage in
Silicon Valley, beyond simple SQL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; statements, and definitely also
beyond our expectations. Let’s have a look at some of the great technical and
architectural changes the project underwent, and how we all benefit from the
&lt;a href=&quot;/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html&quot;&gt;commitment to quality, openness and collaboration&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;runtime-and-deployment&quot;&gt;Runtime and deployment&lt;/h2&gt;

&lt;p&gt;Starting with how you even install and run Trino, numerous changes came about
in the last decade. We moved from Java 7 to Java 8, then to Java 11, and &lt;a href=&quot;/blog/2022/07/14/trino-updates-to-java-17.html&quot;&gt;only
recently to the latest supported Java LTS release - Java 17&lt;/a&gt;. Each time we
benefited from the innovations in runtime performance as well as the
improved Java language features. With &lt;strong&gt;Java 17&lt;/strong&gt;, we are just getting started
on the next round of these improvements.&lt;/p&gt;

&lt;p&gt;When it comes to actually &lt;a href=&quot;https://trino.io/episodes/35.html&quot;&gt;running and deploying
Trino&lt;/a&gt;, the &lt;strong&gt;tarball&lt;/strong&gt; is still a good choice
for simple installation and as a base for other packages. Over time we added
&lt;strong&gt;RPM&lt;/strong&gt; archive support, which is being replaced more and more by Docker
&lt;strong&gt;containers&lt;/strong&gt;. The container images also enable modern deployment on Kubernetes
with &lt;a href=&quot;https://github.com/trinodb/charts&quot;&gt;our Helm chart&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And let us add one last note about deployments. Trino was always designed to
work on large servers. However, the actual growth over a decade in the real world
has been amazing to see. Machine sizes keep growing to hundreds of CPU cores and
closer to a terabyte of memory, and these truly large machines are now running
as clusters with many workers of that size. And more and more of these
deployments take advantage of our added support for the &lt;strong&gt;ARM processor
architecture&lt;/strong&gt; and the increasing availability of suitable servers from the
cloud providers.&lt;/p&gt;

&lt;h2 id=&quot;security&quot;&gt;Security&lt;/h2&gt;

&lt;p&gt;What is security, authentication, authorization? In the first releases of
Trino, none of this existed. Two years after launch we added the first
simple authentication and authorization support. Today, the days when Kerberos
was critical and you needed to use the Java KeyStore in most deployments are
long gone. The wide adoption of Trino led to improvements such as support for
&lt;a href=&quot;https://trino.io/docs/current/security/internal-communication.html&quot;&gt;automatic certificate creation and TLS for internal
communication&lt;/a&gt;,
&lt;a href=&quot;https://trino.io/docs/current/security/secrets.html&quot;&gt;secret injection from environment
variables&lt;/a&gt;, and the many
&lt;a href=&quot;https://trino.io/docs/current/security/authentication-types.html&quot;&gt;authentication
types&lt;/a&gt;
starting with LDAP and password file, to the modern OAuth2.0 and SSO systems.
Trino supports fine-grained access control and &lt;a href=&quot;https://trino.io/docs/current/language/sql-support.html#security-operations&quot;&gt;security management SQL commands
like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GRANT&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REVOKE&lt;/code&gt;&lt;/a&gt;.
You can secure connections from client tools, and use numerous methods to ensure
secured access to your data sources.&lt;/p&gt;
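&lt;p&gt;As a brief illustration, a coordinator’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.properties&lt;/code&gt;
enabling password authentication and secured internal communication might
contain properties like the following. Treat the exact names and values as a
sketch, and verify them against the security documentation for your Trino
version:&lt;/p&gt;

```properties
# Hypothetical config.properties fragment; verify property names against
# the security documentation for your Trino version.
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
# A shared secret enables automatic TLS for internal communication.
internal-communication.shared-secret=<long random secret>
internal-communication.https.required=true
```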

&lt;h2 id=&quot;client-tools-and-integrations&quot;&gt;Client tools and integrations&lt;/h2&gt;

&lt;p&gt;In the very beginning, all you could do was submit a query to the &lt;a href=&quot;https://trino.io/docs/current/develop/client-protocol.html&quot;&gt;client REST
API&lt;/a&gt;. Very quickly
we added the &lt;a href=&quot;https://trino.io/docs/current/installation/cli.html&quot;&gt;Trino CLI&lt;/a&gt;
and the &lt;a href=&quot;https://trino.io/docs/current/installation/jdbc.html&quot;&gt;JDBC driver&lt;/a&gt;. And
while it has continued to be widely used in the community, and gathered great
features such as command-completion and history, different output formats, and
much more, the Trino CLI is not the only tool anymore. The JDBC driver, the
&lt;a href=&quot;https://github.com/trinodb/trino-python-client&quot;&gt;Python client&lt;/a&gt;, the &lt;a href=&quot;https://github.com/trinodb/trino-go-client&quot;&gt;Go
client&lt;/a&gt;, and the ODBC driver from
&lt;a href=&quot;https://starburst.io/&quot;&gt;Starburst&lt;/a&gt;, all expanded the support for different
client tools. You can query Trino in your Java-based IDE, such as IntelliJ
IDEA, or database tool, such as &lt;a href=&quot;https://dbeaver.io/&quot;&gt;DBeaver&lt;/a&gt; or
&lt;a href=&quot;https://www.metabase.com/&quot;&gt;Metabase&lt;/a&gt;. You can take advantage of visualizations
in &lt;a href=&quot;https://superset.apache.org/&quot;&gt;Apache Superset&lt;/a&gt;, or automate with &lt;a href=&quot;https://airflow.apache.org/&quot;&gt;Apache
Airflow&lt;/a&gt;, &lt;a href=&quot;https://www.getdbt.com/&quot;&gt;dbt&lt;/a&gt;, or
&lt;a href=&quot;https://flink.apache.org/&quot;&gt;Apache Flink&lt;/a&gt;. And many commercial tools such as
&lt;a href=&quot;https://www.tableau.com/&quot;&gt;Tableau&lt;/a&gt;, &lt;a href=&quot;https://www.looker.com/&quot;&gt;Looker&lt;/a&gt;,
&lt;a href=&quot;https://powerbi.microsoft.com/&quot;&gt;PowerBI&lt;/a&gt;, or
&lt;a href=&quot;https://www.thoughtspot.com/&quot;&gt;ThoughtSpot&lt;/a&gt; also proudly support Trino users.&lt;/p&gt;

&lt;h2 id=&quot;sql&quot;&gt;SQL&lt;/h2&gt;

&lt;p&gt;All the client tools and integrations rely on the rich SQL support of Trino,
which has grown tremendously. Purely analytics-related support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; and
all its complexities was not enough. Trino gained support for data management to
create schema and tables, but also views and materialized views. And with that
&lt;a href=&quot;https://trino.io/docs/current/language/sql-support.html#write-operations&quot;&gt;write support we needed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;, and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;&lt;/a&gt;.
That’s all done and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; is next. But the core language features were not
able to satisfy the needs of our users. We added functions for a large variety
of topics ranging from simple string and &lt;a href=&quot;https://trino.io/docs/current/functions/datetime.html&quot;&gt;date
functions&lt;/a&gt; to &lt;a href=&quot;https://trino.io/docs/current/functions/json.html&quot;&gt;JSON
support&lt;/a&gt;, &lt;a href=&quot;https://trino.io/docs/current/functions/geospatial.html&quot;&gt;geospatial
functions&lt;/a&gt;, and many
others.&lt;/p&gt;

&lt;p&gt;From the core language perspective we added newer SQL functionality, such as
&lt;a href=&quot;/blog/2021/05/19/row_pattern_matching.html&quot;&gt;window functions and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; support&lt;/a&gt;. Currently we are on a journey to implement
&lt;a href=&quot;/blog/2022/07/22/polymorphic-table-functions.html&quot;&gt;support for table functions, including polymorphic table functions&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;connectors-and-data-sources&quot;&gt;Connectors and data sources&lt;/h2&gt;

&lt;p&gt;When it comes to the new SQL language features, there are two categories. There
are generic functions and statements that build on top of commonly used
functionality like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;. These typically work with any connector and therefore
any data sources. And then there are SQL language features that need support in
a connector. After all, inserting data into PostgreSQL and into an object
storage system are very different operations. Our community has been hard at
work, however, and numerous connectors have gone way beyond simple read-only access.&lt;/p&gt;

&lt;p&gt;Looking at the number of available connectors, innovation has been tremendous.
The original Hive connector, with support for HDFS and a Hive Metastore Service,
became a powerhouse of features. Support for object storage systems, including
Amazon S3 and compatible systems, Azure Data Lake Storage, and Google Cloud
Storage, was supplemented by support for AWS Glue as a metastore. We also
constantly added support for different file formats in these systems, and
improved performance for ORC, Parquet, Avro, and others.&lt;/p&gt;

&lt;p&gt;The initial idea to support other data sources led to connectors for over a
dozen other databases, including relational systems such as
&lt;a href=&quot;https://www.postgresql.org/&quot;&gt;PostgreSQL&lt;/a&gt;,
&lt;a href=&quot;https://www.oracle.com/database/&quot;&gt;Oracle&lt;/a&gt;, &lt;a href=&quot;https://www.microsoft.com/en-us/sql-server&quot;&gt;SQL
Server&lt;/a&gt;, and many others. We also
gained support for &lt;a href=&quot;https://www.elastic.co/elasticsearch/&quot;&gt;Elasticsearch&lt;/a&gt; and
&lt;a href=&quot;https://www.opensearch.org/&quot;&gt;OpenSearch&lt;/a&gt;, &lt;a href=&quot;https://www.mongodb.com/&quot;&gt;MongoDB&lt;/a&gt;,
&lt;a href=&quot;https://kafka.apache.org/&quot;&gt;Apache Kafka&lt;/a&gt;, and other systems that traditionally
are not available to query with SQL. Trino unlocks completely new use cases for
these systems.&lt;/p&gt;

&lt;p&gt;The wide range of supported systems includes traditional data lakes and data
warehouses. With the emerging new table formats and the related Trino
connectors, our project is a powerful tool to run your lakehouse system. &lt;a href=&quot;https://delta.io/&quot;&gt;Delta
Lake&lt;/a&gt; and &lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg&lt;/a&gt;
connectors are already capable of full read and write operations and include
numerous other features. An &lt;a href=&quot;https://hudi.apache.org/&quot;&gt;Apache Hudi&lt;/a&gt; connector is
in the works and coming soon.&lt;/p&gt;

&lt;p&gt;We also have robust and widely used connectors for real-time analytics systems
like &lt;a href=&quot;https://pinot.apache.org/&quot;&gt;Apache Pinot&lt;/a&gt;, &lt;a href=&quot;https://druid.apache.org/&quot;&gt;Apache
Druid&lt;/a&gt;, and &lt;a href=&quot;https://clickhouse.com/&quot;&gt;ClickHouse&lt;/a&gt;,
which the community constantly improves.&lt;/p&gt;

&lt;h2 id=&quot;query-processing-and-performance&quot;&gt;Query processing and performance&lt;/h2&gt;

&lt;p&gt;Last but not least, these queries also need to be processed. From the start,
high efficiency and low latency were core design goals, and with features like
native compilation the resulting performance surpassed other systems. Over the
years our query analyzer and planner were supplemented by more and more
sophisticated algorithms and features. Connectors learned to retrieve and manage
table statistics, the optimizer was created and morphed into a &lt;a href=&quot;/blog/2019/07/04/cbo-introduction.html&quot;&gt;cost-based
optimizer&lt;/a&gt;, and we added further
improvements that benefit query processing performance. We added dynamic
filtering, &lt;a href=&quot;/blog/2020/06/14/dynamic-partition-pruning.html&quot;&gt;dynamic partition pruning&lt;/a&gt;, predicate pushdown, join pushdown,
aggregate function pushdown, and numerous others. Each of these improvements was
finely tuned, and runs in production with huge workloads, providing us with more
data on what to improve next.&lt;/p&gt;
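
&lt;p&gt;You can observe many of these optimizations yourself with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt;. As a
sketch, assuming a catalog named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;postgresql&lt;/code&gt; with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tpch&lt;/code&gt; schema, the
plan reveals whether an aggregation is pushed down into the underlying database
instead of being computed by Trino:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- with aggregate pushdown, the plan contains no Trino-side
-- aggregation operator; the grouping runs in PostgreSQL
EXPLAIN
SELECT regionkey, count(*)
FROM postgresql.tpch.nation
GROUP BY regionkey;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;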

&lt;p&gt;One major recent addition is the &lt;a href=&quot;/blog/2022/05/05/tardigrade-launch.html&quot;&gt;fault-tolerant query
execution mode&lt;/a&gt;. With this feature enabled, query execution
can survive cluster node failures: parts of the execution are retried, and query
processing proceeds. Trino is moving on from being the best analytics engine to
being the best query engine for many more use cases!&lt;/p&gt;

&lt;h2 id=&quot;looking-forward&quot;&gt;Looking forward&lt;/h2&gt;

&lt;p&gt;As you can see, there is a lot to look back on and celebrate. But while we are
definitely proud of our successes working with the community, we see no time to rest.
There are many more improvements in the works. Just to tease you a bit, let
us mention that there will be more polymorphic table functions, new
lakehouse connectors and features, more client tools, and maybe even dynamic
configuration of the cluster.&lt;/p&gt;

&lt;p&gt;What would you like to add? Join us to celebrate and innovate towards your
favorite features. And who knows, we might see you at the &lt;a href=&quot;/blog/2022/06/30/trino-summit-call-for-speakers.html&quot;&gt;Trino Summit&lt;/a&gt; in November, or in a
future episode of the &lt;a href=&quot;/broadcast/index.html&quot;&gt;Trino Community Broadcast&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser, Martin Traverso, Dain Sundstrom, David Phillips</name>
        </author>
      

      <summary>It’s amazing how far we have come! Our massively-parallel processing SQL query engine, Trino, has really grown up. We have moved beyond just querying object stores using Hive, beyond just one company using the project, beyond usage in Silicon Valley, beyond simple SQL SELECT statements, and definitely also beyond our expectations. Let’s have a look at some of the great technical and architectural changes the project underwent, and how we all benefit from the commitment to quality, openness and collaboration.</summary>

      
      
    </entry>
  
    <entry>
      <title>Why leaving Facebook/Meta was the best thing we could do for the Trino Community</title>
      <link href="https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html" rel="alternate" type="text/html" title="Why leaving Facebook/Meta was the best thing we could do for the Trino Community" />
      <published>2022-08-02T00:00:00+00:00</published>
      <updated>2022-08-02T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino</id>
      <content type="html" xml:base="https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html">&lt;p&gt;It might surprise some that our departure from Facebook was one of the simplest 
decisions we’ve ever made. Many posts that discuss leaving a FAANG company focus
on leaving some grand sum of money or prestige of working at the company. For 
us, we were leaving the company where we had launched a project that we knew 
would quickly outgrow the walls of Facebook, and solve a much larger set of 
problems in the analytics domain. At the time we didn’t quite anticipate that 
Presto, a distributed SQL query engine for big data analytics, would be adopted 
around the globe by thousands of companies and an overwhelming number of 
industries. We appreciate Facebook for serving as the launchpad that inspired 
others to adopt Presto. Despite the harmonious beginnings, once the needs of the
community and Facebook no longer aligned, we had to leave, but we’ll get to that
part shortly.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/leaving-facebook-meta-best-for-trino/original-gang.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;people-make-up-communities-not-companies&quot;&gt;People make up communities, not companies&lt;/h2&gt;

&lt;p&gt;When we created Presto, it was clear to us that it needed to be open source.
Presto started in 2012, just before the Facebook IPO. The culture was very
conducive to starting an open source project. At that time, Facebook was working
on Open Compute, which ended up disrupting the hardware industry, and we wanted
to achieve a similar impact for the analytics industry with Presto. We lobbied for and
gained approval from the VP of Infrastructure, Jay Parikh, and released 
&lt;a href=&quot;https://web.archive.org/web/20220203224702/https://www.computerworld.com/article/2485668/facebook-goes-open-source-with-query-engine-for-big-data.html&quot;&gt;Presto as an open source project&lt;/a&gt;. It’s something that we wanted to
do from the beginning, because we had worked with open source projects and 
believed that the most successful projects are open source.&lt;/p&gt;

&lt;p&gt;Getting other people and companies involved makes for a healthier project. You
end up not just building something that satisfies your needs, but needs from
everyone else, and in turn, you benefit. We reached out personally to
people from companies like Airbnb, Dropbox, Netflix, and LinkedIn to get them
involved because we wanted to bootstrap a real community. Five people at
Facebook hacking away was not enough. We actually had these companies beta test
Presto, so that when we launched, the problems that they had found were fixed.&lt;/p&gt;

&lt;p&gt;To really grasp our philosophy behind open source, it’s important to
understand why that’s beneficial. In reality, when we say we’re getting more
companies involved, that’s true, but more importantly, we’re getting people
involved. Individuals in the tech space are interested in solving technology
problems. Companies are interested in solving problems that benefit their board,
investors, and their customers. It’s incredibly common to see an overlap in the
problems that engineers, analysts, and scientists are interested in solving with
the problems that companies need to solve, but it’s never guaranteed.&lt;/p&gt;

&lt;p&gt;Moreover, the interest of a company is very susceptible to change from company
growth, IPOs, acquisitions, directional pivots, and general political and
cultural changes. As people start to put their time and energy into a project,
their own identity starts to blend with the success of the project. This is much
less the case with corporations. Since corporations include many people, it
only takes a small set of people in the right position to decide that a project
is no longer aligned with the direction or goals of a company.&lt;/p&gt;

&lt;p&gt;Those of us in the Trino Software Foundation believe that 
&lt;a href=&quot;https://venturebeat.com/2021/08/27/who-owns-open-source-projects-people-or-companies/&quot;&gt;individuals that work on Trino actually make up the community&lt;/a&gt; and not the companies who so graciously allow their employees to
contribute. We view our community as visionaries that want to solve problems and
build systems that last for decades into the future. We don’t allow near-sighted
decisions that may affect the quality of the system, or that may diminish the
value of the application to the greater problem space. Most people do not want
to work on something for years, and then have the company change direction and
throw away all their work.&lt;/p&gt;

&lt;p&gt;To be clear, we’re not saying it’s a bad thing when a company moves in another
direction. That is the nature of business and having corporate involvement can
also be a healthy component of open source. To us, however, the core of what
makes a project long-lasting and beneficial for everyone using the product are
the people who are there building the system and interested in the problem
space. So what happened at Facebook that caused us to leave?&lt;/p&gt;

&lt;h2 id=&quot;why-we-left-facebook&quot;&gt;Why we left Facebook&lt;/h2&gt;

&lt;p&gt;As Presto became central to the infrastructure of prominent projects in Facebook,
it attracted the attention of engineers and managers at Facebook who wanted to 
work on this project. This is a strong sign of success, but some of these folks
did not have the same commitment to the open-source community. This was the
source of much of the conflict as engaging in open-source takes a lot of time
and effort, and we had a strict policy of “no one is special”. This means that
everyone’s code is reviewed, and even if you work for Facebook, you still
have to earn commit rights. Engineers at Facebook are strongly motivated to
create “memorable” works to advance in the company, and from that perspective
this extra work was just slowing things down. Feedback from these engineers ultimately
culminated in the managers making the decision to give automatic contributor
rights to any Facebook engineer working on Presto, so that these engineers could
move faster.&lt;/p&gt;

&lt;p&gt;You may think Facebook engineers or managers are the big bad wolf in this
scenario, but they really are not. Engineers at these highly competitive
companies must create memorable work, or they will not get the promotions they
deserve. And if you are a junior engineer and do not get promoted, you get
fired. Corporate leaders also have the right to change how they allocate
resources to work on open-source projects. There’s nothing inherently wrong with
any of this. The problem was changing the commitment we made to keep the
open-source community neutral. It was at that point we knew that we had to
create a fork of the project if we wanted to keep the community’s interest at
the forefront for the project to remain healthy.&lt;/p&gt;

&lt;p&gt;It was also at this point we made our single biggest mistake. We didn’t change
the name away from Presto. It was admittedly hard to walk away from a name we
all knew and loved. We believed that we had set up the project, so that the name
“Presto” was owned by the community and not Facebook. The truth is that once the
community walked out of the project, Facebook was the only one left in Presto
and they became the sole owner. But, the biggest reason this was absolutely the
wrong choice is much simpler; it made the people that stayed at Facebook really
angry. We expected Facebook to do what they really wanted: stop doing the extra
open-source work, fork internally, and leave the community alone. Instead, they
somehow found the motivation to do a lot of work to set up a competing project.
Finally, we spent two additional years continuing to build the Presto name
rather than building the new name and brand. In hindsight, all of this was just
dumb, and we were suffering from our own sunk cost fallacy. So we continued
under the Presto name with the distinguishing suffix of PrestoSQL versus the
original project’s PrestoDB.&lt;/p&gt;

&lt;h2 id=&quot;building-the-trino-community&quot;&gt;Building the Trino community&lt;/h2&gt;

&lt;p&gt;The new PrestoSQL project gave a new home to the existing Presto community. It
provided a project that focused on the open source community and not just the
needs of Facebook. It also gave us time to troubleshoot problems of people who
used Presto. This is what we had been doing internally at Facebook, but now we
applied our knowledge of the system to the community at large. This was one of the
reasons why leaving Facebook was so beneficial. As we worked closer with
everyone else, we started learning what areas of the project we should focus on
and it turns out that many of the things we were working on at Facebook were
simply not problems that all the other people in the community were facing. This
wasn’t the only benefit to us leaving Facebook, though.&lt;/p&gt;

&lt;p&gt;The hardest part about making a new project successful is user adoption. 
Building great software doesn’t organically build a community. Presto gained 
some of its initial popularity because Facebook used it. We never had to try 
very hard to develop the community initially as the Facebook brand did a great 
job at getting people’s attention. But this community was exclusive to Silicon 
Valley companies. Leaving Facebook acted as a forcing function for us to build 
the community in a classic grassroots way. We went out and started talking to 
people, getting people connected, doing more promotions and events. We were 
pretty motivated after we left. However, all of this is a lot of work for a few 
programmers, and while it’s great to see people respond to your work, it takes a
lot out of you. This created the conditions for members of the new project to
step up and become more involved.&lt;/p&gt;

&lt;p&gt;We saw the pattern repeat when
&lt;a href=&quot;/blog/2020/12/27/announcing-trino.html&quot;&gt;we were forced to rebrand and changed the name to Trino&lt;/a&gt;.
We doubled down again on developing the community, and again participation
accelerated. It’s because of this that we believe the Trino community is stronger
than ever before.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/leaving-facebook-meta-best-for-trino/stars.jpeg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Since the split, the Trino release cadence has accelerated far beyond the speed
we had when we were running Presto. Once brand confusion was settled with the
change to the Trino name, the community numbers skyrocketed and we saw 
&lt;a href=&quot;/blog/2021/12/31/trino-2021-a-year-of-growth.html&quot;&gt;unprecedented growth in metrics like GitHub stars, YouTube subscribers, and Slack members&lt;/a&gt;. 
We have many new community-driven features released in Trino that we will be
discussing in more detail in another blog post coming soon. To name a few, Trino now offers
&lt;a href=&quot;/blog/2022/05/05/tardigrade-launch.html&quot;&gt;fault-tolerant execution mode&lt;/a&gt;,
&lt;a href=&quot;https://github.com/trinodb/trino/issues/37&quot;&gt;revamped timestamp support&lt;/a&gt;, 
&lt;a href=&quot;/blog/2020/06/14/dynamic-partition-pruning.html&quot;&gt;dynamic partition pruning&lt;/a&gt;,
&lt;a href=&quot;/blog/2022/07/22/polymorphic-table-functions.html&quot;&gt;polymorphic table functions&lt;/a&gt;,
&lt;a href=&quot;/blog/2021/03/10/introducing-new-window-features.html&quot;&gt;advanced window functions&lt;/a&gt;, 
and much, much more!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/leaving-facebook-meta-best-for-trino/trajectory.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;These metrics help confirm our experience in previous open source projects and
with Trino. In the long run, individual-driven open source projects tend to lead
to healthier communities and healthier ecosystems than company-driven open
source projects do. We believe that, we practice that, and we are now reaping the
benefits of it as we close the pages of the first decade of this remarkable
project. We can’t begin to express how thankful we are to all of you who
believed in us and have helped grow Trino to what it is today. Also, we do
thank the Facebook leadership, especially Jay Parikh, who gave us the green
light to create and open source Presto from the beginning. We are looking
forward to the twentieth and thirtieth anniversaries as we continue to disrupt
the analytics industry and improve the lives of those who work in it.&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso, Dain Sundstrom, and David Phillips</name>
        </author>
      

      <summary>It might surprise some that our departure from Facebook was one of the simplest decisions we’ve ever made. Many posts that discuss leaving a FAANG company focus on leaving some grand sum of money or prestige of working at the company. For us, we were leaving the company where we had launched a project that we knew would quickly outgrow the walls of Facebook, and solve a much larger set of problems in the analytics domain. At the time we didn’t quite anticipate that Presto, a distributed SQL query engine for big data analytics, would be adopted around the globe by thousands of companies and an overwhelming number of industries. We appreciate Facebook for serving as the launchpad that inspired others to adopt Presto. Despite the harmonious beginnings, once the needs of the community and Facebook no longer aligned, we had to leave, but we’ll get to that part shortly.</summary>

      
      
    </entry>
  
    <entry>
      <title>Diving into polymorphic table functions with Trino</title>
      <link href="https://trino.io/blog/2022/07/22/polymorphic-table-functions.html" rel="alternate" type="text/html" title="Diving into polymorphic table functions with Trino" />
      <published>2022-07-22T00:00:00+00:00</published>
      <updated>2022-07-22T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/07/22/polymorphic-table-functions</id>
      <content type="html" xml:base="https://trino.io/blog/2022/07/22/polymorphic-table-functions.html">&lt;p&gt;In the Trino community, we know that being the coolest query engine is a tough
job. We boldly face the intricacies of the SQL standard to bring you the newest
and most powerful features. Today, we proudly announce that as of release 381,
Trino is on its way to full support for polymorphic table functions (PTFs).&lt;/p&gt;

&lt;p&gt;In this blog post, we explain the concept of table functions and
explore how they can be leveraged. We also look at what we have already
implemented, and take a sneak peek into the future.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h3 id=&quot;definition-time&quot;&gt;Definition time&lt;/h3&gt;

&lt;p&gt;There are several kinds of functions you can call in a SQL query: scalar
functions, aggregate functions, and window functions. They might process the
input row by row (scalar) or all at once (aggregate). One thing they have in
common is that they return scalar values. Table functions are different. They
return tables. In a query, they can appear in any place where a table reference
shows up, such as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FROM&lt;/code&gt; clause:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;my_table_function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;foo&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can also use table functions in joins:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;my_table_function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;bar&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;another_table_function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Polymorphic table functions (PTFs) are a subset of table functions where the
schema of the returned table is determined dynamically. The returned table
schema can depend on the arguments you pass to the function.&lt;/p&gt;

&lt;h3 id=&quot;ok-but-why-are-we-so-excited&quot;&gt;OK, but why are we so excited?&lt;/h3&gt;

&lt;p&gt;We are excited because this feature is a real game changer! Polymorphic table
functions make SQL extensible, provide a framework for processing data in
previously impossible ways, and can act as a bridge between the Trino engine and
external systems or resources you might need for processing your data.
Additionally, polymorphic table functions are standard SQL, and they are very
convenient to use.&lt;/p&gt;

&lt;h3 id=&quot;what-is-available-in-trino-today&quot;&gt;What is available in Trino today?&lt;/h3&gt;

&lt;p&gt;So far, we have added a framework for table functions that are executed by
the connector. Although this is not the full PTF feature yet, we couldn’t wait
to bring it to life. We added query pass-through table functions for JDBC-based
connectors and Elasticsearch. They mostly go by the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query&lt;/code&gt;, and they take
a single argument, the query text:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;postgresql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;system&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;s1&quot;&gt;&apos;SELECT
          name
        FROM
          tpch.nation
        WHERE
          nationkey = 0&apos;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And this will return:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;---------&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;ALGERIA&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Something you can’t tell from that example is that the entire query text you
pass in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query&lt;/code&gt; argument is handed to PostgreSQL for execution.
Whatever connector you’re using, the query argument you pass needs to be written
so that it works on the underlying database. On the flip side, and more
exciting: if you have a legacy query specific to a database with
non-standard SQL syntax that would be difficult to rewrite for Trino, you can now
pass that entire query down to the connector by wrapping it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query&lt;/code&gt;
function, skipping the need to migrate it.&lt;/p&gt;

&lt;p&gt;Besides PostgreSQL, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query&lt;/code&gt; table function has equivalent implementations
for Druid, MySQL, Oracle, Redshift, SQL Server, MariaDB, and SingleStore.
Elasticsearch has a similar function called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;raw_query&lt;/code&gt;. You can check out the
&lt;a href=&quot;https://trino.io/docs/current/connector.html&quot;&gt;Trino docs for each supported connector&lt;/a&gt;
for full details.&lt;/p&gt;

&lt;p&gt;But while we’re here, another cool example to showcase is using query
pass-through to take advantage of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MODEL&lt;/code&gt; clause in Oracle:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;SUBSTR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;country&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;country&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;SUBSTR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;product&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;product&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;oracle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;system&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;SELECT
        *
      FROM
        sales_view
      MODEL
        RETURN UPDATED ROWS
        MAIN
          simple_model
        PARTITION BY
          country
        MEASURES
          sales
        RULES
          (sales[&apos;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Bounce&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;, 2001] = 1000,
          sales[&apos;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Bounce&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;, 2002] = sales[&apos;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Bounce&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;, 2001] + sales[&apos;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Bounce&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;, 2000],
          sales[&apos;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Box&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;, 2002] = sales[&apos;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Box&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;, 2001])
      ORDER BY
        country&apos;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can pass an entire query through to leverage a feature that isn’t a part of
the SQL standard, and with that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MODEL&lt;/code&gt; clause, Oracle can do some fancy
multidimensional array processing for you right then and there, returning the
results as a table back into Trino. We don’t want to get too sidetracked delving
into the specifics of non-Trino tech, so if you want to learn more about what
you can do, check out the connectors you use, and see what cool possibilities
are out there!&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;Now that we’ve discussed what PTFs are, how they work in Trino, and what they do
today, it’s useful to look forward to what’s coming next. The next thing we’re
working on is adding the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query&lt;/code&gt; function to the BigQuery connector.&lt;/p&gt;

&lt;h3 id=&quot;big-ideas&quot;&gt;Big ideas&lt;/h3&gt;

&lt;p&gt;Beyond what’s currently planned, there’s a lot that polymorphic table functions
can do for us. One function that engineers and analysts commonly request
in Trino is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PIVOT&lt;/code&gt;. This capability dynamically groups the distinct
values of an input column and turns each distinct value into a column in the
output table. PTFs could enable a PIVOT-like transformation
on data, a feature that isn’t part of the standard SQL specification.&lt;/p&gt;
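
&lt;p&gt;To make the idea concrete, here is a small Python sketch (plain Python, not
Trino code) of what a PIVOT-like transformation computes: each distinct value of
one input column becomes its own column in the output. The column names below are
made up for illustration.&lt;/p&gt;

```python
from collections import defaultdict

def pivot(rows, key_col, pivot_col, value_col):
    """Group rows by key_col and spread pivot_col values into columns.

    rows is a list of dicts; the result maps each key to a dict with
    one entry per distinct pivot_col value.
    """
    result = defaultdict(dict)
    for row in rows:
        result[row[key_col]][row[pivot_col]] = row[value_col]
    return dict(result)

sales = [
    {"country": "PL", "year": 2001, "sales": 100},
    {"country": "PL", "year": 2002, "sales": 150},
    {"country": "US", "year": 2001, "sales": 300},
]
pivoted = pivot(sales, "country", "year", "sales")
# pivoted == {"PL": {2001: 100, 2002: 150}, "US": {2001: 300}}
```

&lt;p&gt;A SQL-level PIVOT would do the same grouping-and-spreading, only declaratively
inside the query.&lt;/p&gt;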

&lt;p&gt;Another exciting possibility is the ability to write scripts that transform or
generate tables in popular languages like Python, Scala, or JavaScript. These
could add even more capabilities that SQL is missing.&lt;/p&gt;

&lt;h3 id=&quot;looking-forward&quot;&gt;Looking forward&lt;/h3&gt;

&lt;p&gt;The journey to full PTF support in Trino has just begun. A dedicated operator
for table functions is the next big thing. Right now, Trino can handle PTFs, but
they must be pushed down to the connector and executed there. The Trino engine
does not yet know how to execute them. With an operator, the Trino engine will
be able to control and handle table function execution, and we will be able to
pass tables as arguments to table functions. This will unlock the full potential
of PTFs in Trino, and empower Trino to solve a new class of problems and expand
its potential for application in many new domains.&lt;/p&gt;

&lt;p&gt;If you have any questions or ideas for table functions that you would find
useful, reach out to us on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino Slack&lt;/a&gt;, and
we would love to hear your thoughts and feedback. We’ll also be doing a Trino
Community Broadcast on PTFs on July 28th @ 1pm EDT, so tune in then to have your
questions answered live!&lt;/p&gt;

&lt;p&gt;If you want to learn more about how to implement PTFs, we are working on another
blog post for you already.&lt;/p&gt;

&lt;p&gt;Happy querying!&lt;/p&gt;</content>

      
        <author>
          <name>Kasia Findeisen, Brian Olsen, and Cole Bowden</name>
        </author>
      

      <summary>In the Trino community, we know that being the coolest query engine is a tough job. We boldly face the intricacies of the SQL standard to bring you the newest and most powerful features. Today, we proudly announce that as of release 381, Trino is on its way to full support for polymorphic table functions (PTFs). In this blog post, we are explaining the concept of table functions and exploring how they can be leveraged. We also look at what we have already implemented, and take a sneak peek into the future.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino updates to Java 17</title>
      <link href="https://trino.io/blog/2022/07/14/trino-updates-to-java-17.html" rel="alternate" type="text/html" title="Trino updates to Java 17" />
      <published>2022-07-14T00:00:00+00:00</published>
      <updated>2022-07-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/07/14/trino-updates-to-java-17</id>
      <content type="html" xml:base="https://trino.io/blog/2022/07/14/trino-updates-to-java-17.html">&lt;p&gt;You’ve already read the title, and it’s exciting news - as of Trino version 390,
which releases today, Trino has officially been updated from Java 11 to Java 17.
This has a few implications, the most important of which is that if you aren’t
running the Docker image (which automatically comes with the correct version of
Java) and you’ve been running Trino on Java 16 or older, you’ll need to update
Java to run Trino versions 390 and later. It’s also worth mentioning that newer
versions of Java, such as Java 18 or 19, are not supported - they might work,
but they haven’t been tested or benchmarked - Java 17 is the new, recommended
version for Trino.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The reason this change is exciting is that using a new and better version of
Java will make Trino better, too! This initial change is an update to the
runtime version, or what the Trino engine uses while it runs. Because the Java
runtime performs slightly better on the whole with this update, you may see
some small, across-the-board performance improvements when switching from Java
11 to Java 17. So when you’ve got the time, we strongly recommend making the
upgrade!&lt;/p&gt;
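
&lt;p&gt;If you manage your own installation, it’s worth confirming which Java your
hosts run before upgrading. The sketch below parses output in the style of
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java -version&lt;/code&gt;; the version strings are just examples.&lt;/p&gt;

```python
import re

def java_major_version(version_output):
    """Extract the major version from 'java -version' style output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    # Pre-Java 9 strings look like "1.8.0_292"; the major is the second part.
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major

assert java_major_version('openjdk version "17.0.3" 2022-04-19') == 17
assert java_major_version('java version "1.8.0_292"') == 8
```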

&lt;p&gt;The plan is to update the build to Java 17 a few weeks from now, which will also
allow us to use Java 17 APIs and the changes to the language in Trino code. With
new language features, there are more tools in the development toolkit, and
it’ll allow us to write cleaner and better code moving forwards.&lt;/p&gt;

&lt;p&gt;This upgrade has been in the works for a while and been a long time coming, so
if you want to learn more about the specifics, one of the best places to check
that out is the Trino Community Broadcast. Updating to Java 17 was the focus of
&lt;a href=&quot;https://trino.io/episodes/36.html&quot;&gt;episode 36&lt;/a&gt;, and we also talked about it
previously in &lt;a href=&quot;https://trino.io/episodes/35.html&quot;&gt;episode 35&lt;/a&gt;. If you want to
check out the code changes that made this happen, you can view
&lt;a href=&quot;https://github.com/trinodb/trino/issues/9876&quot;&gt;the tracking issue on Github&lt;/a&gt; for
more information.&lt;/p&gt;

&lt;p&gt;And finally, we want to give a shoutout to &lt;a href=&quot;https://github.com/wendigo&quot;&gt;Mateusz Gajewski&lt;/a&gt;
for all the hard work in driving this change.&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>You’ve already read the title, and it’s exciting news - as of Trino version 390, which releases today, Trino has officially been updated from Java 11 to Java 17. This has a few implications, the most important of which is that if you aren’t running the Docker image (which automatically comes with the correct version of Java) and you’ve been running Trino on Java 16 or older, you’ll need to update Java to run Trino versions 390 and later. It’s also worth mentioning that newer versions of Java, such as Java 18 or 19, are not supported - they might work, but they haven’t been tested or benchmarked - Java 17 is the new, recommended version for Trino.</summary>

      
      
    </entry>
  
    <entry>
      <title>How to use Airflow with Trino</title>
      <link href="https://trino.io/blog/2022/07/13/how-to-use-airflow-to-schedule-trino-jobs.html" rel="alternate" type="text/html" title="How to use Airflow with Trino" />
      <published>2022-07-13T00:00:00+00:00</published>
      <updated>2022-07-13T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/07/13/how-to-use-airflow-to-schedule-trino-jobs</id>
      <content type="html" xml:base="https://trino.io/blog/2022/07/13/how-to-use-airflow-to-schedule-trino-jobs.html">&lt;p&gt;The recent addition of the &lt;a href=&quot;/docs/current/admin/fault-tolerant-execution.html&quot;&gt;fault-tolerant
execution&lt;/a&gt; architecture,
delivered to Trino by Project Tardigrade, makes the use of Trino for running
your ETL workloads an even more compelling alternative than ever before. We’ve
set up a demo environment for you to easily give it a try in &lt;a href=&quot;https://www.starburst.io/platform/starburst-galaxy/&quot;&gt;Starburst
Galaxy&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;With Project Tardigrade providing an out-of-the-box solution with advanced
resource-aware task scheduling and granular retries at the task/query level, we still
need a robust tool to schedule and manage workloads themselves. Apache
Airflow is a great choice for this purpose.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://airflow.apache.org/&quot;&gt;Apache Airflow&lt;/a&gt; is a widely used workflow engine that allows you to schedule and
run complex data pipelines. Airflow provides many plug-and-play operators and
hooks to integrate with many third-party services like Trino.&lt;/p&gt;

&lt;p&gt;To get started using Airflow to run data pipelines with Trino you need to
complete the following steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Install Apache Airflow 2.3+&lt;/li&gt;
  &lt;li&gt;Install the TrinoHook&lt;/li&gt;
  &lt;li&gt;Create a Trino connection in Airflow&lt;/li&gt;
  &lt;li&gt;Deploy a TrinoOperator&lt;/li&gt;
  &lt;li&gt;Deploy your DAGs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;installing-apache-airflow-in-docker&quot;&gt;Installing Apache Airflow in Docker&lt;/h2&gt;

&lt;p&gt;The best way to get you going, if you don’t already have an Airflow cluster
available, is to run Airflow in a container using docker compose. Just be
aware that this is not best practice for a production environment.&lt;/p&gt;

&lt;p&gt;Requirements for the host:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Docker&lt;/li&gt;
  &lt;li&gt;Docker Compose 1.28+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 1) Create a directory named airflow for all our configuration files.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ mkdir airflow
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Step 2) In the airflow directory, create three subdirectories called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dags&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plugins&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;logs&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ cd airflow
$ mkdir dags plugins logs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Step 3) Download the Airflow docker compose yaml file.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ curl -LfO &apos;https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Step 4) Create an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.env&lt;/code&gt; configuration file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ echo -e &quot;AIRFLOW_UID=$(id -u)&quot; &amp;gt; .env
$ echo &quot;AIRFLOW_GID=0&quot; &amp;gt;&amp;gt; .env 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Step 5) Start the Airflow containers&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;installing-the-trinohook&quot;&gt;Installing the TrinoHook&lt;/h2&gt;

&lt;p&gt;If running Airflow in docker, you need to install the TrinoHook in
all the docker containers using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apache/airflow:x.x.x&lt;/code&gt; image.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ docker ps 
CONTAINER ID   IMAGE                  PORTS                              NAMES
cffdfaeb757e   apache/airflow:2.3.0   0.0.0.0:8080-&amp;gt;8080/tcp             airflow_airflow-webserver_1
b0e72f479a66   apache/airflow:2.3.0   8080/tcp                           airflow_airflow-worker_1
4cdb11b3e5e3   apache/airflow:2.3.0   8080/tcp                           airflow_airflow-triggerer_1
41d3c3107ddb   apache/airflow:2.3.0   0.0.0.0:5555-&amp;gt;5555/tcp, 8080/tcp   airflow_flower_1
229a11e9cdd3   apache/airflow:2.3.0   8080/tcp                           airflow_airflow-scheduler_1
68160240857d   postgres:13            5432/tcp                           airflow_postgres_1
a96b98da85df   redis:latest           6379/tcp                           airflow_redis_1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To install the TrinoHook, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install apache-airflow-providers-trino&lt;/code&gt; in
each of the five Airflow containers. Run the following command, replacing the
container id for each container in your deployment.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ docker exec -it &amp;lt;container_id&amp;gt; pip install apache-airflow-providers-trino
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once you have done that you need to restart all five containers:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ docker container restart &amp;lt;container_id_1&amp;gt; ... &amp;lt;container_id_5&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
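
&lt;p&gt;If you repeat this setup often, the install-and-restart steps can be
scripted. A minimal sketch that only builds the docker commands (the container
ids are placeholders; you would pass each command to your shell or to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subprocess.run&lt;/code&gt;):&lt;/p&gt;

```python
def provider_install_commands(container_ids):
    """Build the docker commands to install the Trino provider and restart."""
    installs = [
        "docker exec -it {} pip install apache-airflow-providers-trino".format(cid)
        for cid in container_ids
    ]
    restart = "docker container restart " + " ".join(container_ids)
    return installs + [restart]

commands = provider_install_commands(["cffdfaeb757e", "b0e72f479a66"])
# one install command per container, then a single restart command
```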

&lt;h2 id=&quot;creating-a-trino-connection&quot;&gt;Creating a Trino connection&lt;/h2&gt;

&lt;p&gt;After you have installed the TrinoHook and restarted Airflow you can create a
connection to your Trino cluster through the Airflow web UI.  If you just
installed Airflow, then go to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:8080&lt;/code&gt; on your browser and login.
The default credentials unless changed are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;airflow&lt;/code&gt; for username and password.&lt;/p&gt;

&lt;p&gt;Go to &lt;strong&gt;Admin&lt;/strong&gt; &amp;gt; &lt;strong&gt;Connections&lt;/strong&gt;.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;75%&quot; src=&quot;/assets/blog/trino-airflow-blog/airflow-connections.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Click on the blue button to &lt;strong&gt;Add a new record&lt;/strong&gt;.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;75%&quot; src=&quot;/assets/blog/trino-airflow-blog/airflow-new-connection.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Trino&lt;/strong&gt; from the &lt;strong&gt;Connection Type&lt;/strong&gt; dropdown and provide the following information:&lt;/p&gt;

&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;Connection Id&lt;/td&gt;
   &lt;td&gt;Whatever you want to call your connection.&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
    Host
   &lt;/td&gt;
   &lt;td&gt;The hostname or host ip of your trino cluster, e.g., &lt;code&gt;localhost&lt;/code&gt;, &lt;code&gt;10.10.10.1&lt;/code&gt;, or &lt;code&gt;www.mytrino.com&lt;/code&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Schema&lt;/td&gt;
   &lt;td&gt;A schema in your Trino cluster.&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Login&lt;/td&gt;
   &lt;td&gt;The username of the user that Airflow uses to connect to Trino.  Best practice would be to create a service account like ‘airflow’. Just understand that this user access level is used to execute SQL statements in Trino.&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Password&lt;/td&gt;
   &lt;td&gt;The password of the user that Airflow uses to connect to Trino if authentication is enabled.&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Port&lt;/td&gt;
   &lt;td&gt;The port where the Trino Web UI can be accessed, e.g., &lt;code&gt;8080&lt;/code&gt;, &lt;code&gt;8443&lt;/code&gt;.&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Extra&lt;/td&gt;
   &lt;td&gt;Additional settings, like &lt;code&gt;protocol:https&lt;/code&gt; if using TLS, or &lt;code&gt;verify:false&lt;/code&gt; if you are using a self-signed certificate.&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;Be aware that the test button might not actually return any feedback for Trino connections.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;/assets/blog/trino-airflow-blog/airflow-add-connection.png&quot; /&gt;
&lt;/p&gt;
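
&lt;p&gt;For reference, Airflow can also express a connection as a single URI, which
is handy when you define connections through environment variables instead of
the web UI. The sketch below assembles one from the fields above; the values are
made up, and the extras handling is a simplification of what Airflow accepts.&lt;/p&gt;

```python
from urllib.parse import quote, urlencode

def trino_connection_uri(host, port, login, password="", schema="default",
                         extra=None):
    """Assemble an Airflow-style connection URI from the connection fields."""
    auth = quote(login)
    if password:
        auth = auth + ":" + quote(password)
    uri = "trino://{}@{}:{}/{}".format(auth, host, port, schema)
    if extra:
        # Extras such as protocol or verify become query parameters.
        uri = uri + "?" + urlencode(extra)
    return uri

uri = trino_connection_uri("localhost", 8080, "airflow", "airflow", "tiny",
                           {"protocol": "https", "verify": "false"})
```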

&lt;h2 id=&quot;deploying-a-trinooperator&quot;&gt;Deploying a TrinoOperator&lt;/h2&gt;

&lt;p&gt;At the time of writing this article there is no TrinoOperator, so you have to
write your own. You can find an implementation in the following section to get you
started. This operator allows you to execute any SQL statement that Trino supports,
such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET SESSION&lt;/code&gt;, and others. You can run multiple statements in a single task so that
they are part of a single Trino session.&lt;/p&gt;

&lt;p&gt;To create the TrinoOperator use your favorite text editor to create a file called
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino_operator.py&lt;/code&gt; with the following code in it and place it in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;airflow/plugins&lt;/code&gt; directory you created earlier. Airflow automatically compiles the code and you are ready to start
writing DAGs.&lt;/p&gt;

&lt;p&gt;For those new to Airflow, DAG (Directed Acyclic Graph) is a core Airflow
concept, a collection of tasks with dependencies and relationships that indicate
to Airflow how they should be executed. DAGs are written in Python.&lt;/p&gt;
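
&lt;p&gt;The “directed acyclic” part is what lets Airflow compute a valid run order
for the tasks. Stripped of the framework, the idea looks like the sketch below;
the task names are invented for illustration.&lt;/p&gt;

```python
def execution_order(dependencies):
    """Return a valid run order for tasks, given task: upstream-tasks pairs."""
    ordered = []
    remaining = {task: set(ups) for task, ups in dependencies.items()}
    while remaining:
        # Tasks whose upstream dependencies have all completed are ready.
        ready = sorted(t for t, ups in remaining.items() if not ups)
        if not ready:
            raise ValueError("cycle detected; not a DAG")
        for task in ready:
            ordered.append(task)
            del remaining[task]
        for ups in remaining.values():
            ups.difference_update(ready)
    return ordered

dag = {
    "create_table": set(),
    "insert_rows": {"create_table"},
    "run_report": {"insert_rows"},
}
order = execution_order(dag)
# order == ["create_table", "insert_rows", "run_report"]
```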

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airflow.models.baseoperator&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BaseOperator&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airflow.utils.decorators&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;apply_defaults&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airflow.providers.trino.hooks.trino&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TrinoHook&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logging&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Callable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Optional&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cur&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;cur&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fetchall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TrinoCustomHook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TrinoHook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;autocommit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Optional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Optional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Callable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;:sphinx-autoapi-skip:&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TrinoHook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;autocommit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;autocommit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;handler&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TrinoOperator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseOperator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;template_fields&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,)&lt;/span&gt;

    &lt;span class=&quot;nd&quot;&gt;@apply_defaults&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;task_instance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;task&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;logging&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Creating Trino connection&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;hook&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TrinoCustomHook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;sql_statements&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql_statements&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql_statements&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;strip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;logging&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Executing single sql statement&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;get_first&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;logging&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Executing multiple sql statements&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;autocommit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql_statements&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sql_statement&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sql_statements&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;extend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql_statement&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;strip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))))&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;logging&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Executing multiple sql statements&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;autocommit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;deploying-a-dag&quot;&gt;Deploying a DAG&lt;/h2&gt;

&lt;p&gt;Now that you have deployed the TrinoOperator, you can start writing DAGs for
your data pipelines. Let’s write and deploy a simple sample DAG. DAGs, just like
the TrinoOperator, are deployed into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;airflow/dags&lt;/code&gt;
directory you created earlier.&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my_first_trino_dag.py&lt;/code&gt; with the following code, and save it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;airflow/dags&lt;/code&gt; directory.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pendulum&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airflow&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DAG&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airflow.operators.python_operator&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PythonOperator&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;trino_operator&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TrinoOperator&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;## This method is called by task2 (below) to retrieve and print to the logs the return value of task1
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;print_command&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;task_instance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;task_instance&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Return Value: &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;task_instance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;xcom_pull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;task_ids&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;task_1&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;return_value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;default_args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;depends_on_past&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dag_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;my_first_trino_dag&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;schedule_interval&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;0 8 * * *&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;start_date&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pendulum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;datetime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2022&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tz&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;US/Central&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;catchup&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tags&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;## Task 1 runs a Trino select statement to count the number of records 
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## in the tpch.tiny.customer table
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;task1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TrinoOperator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;task_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;task_1&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;trino_connection&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;select count(1) from tpch.tiny.customer&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;## Task 2 is a Python Operator that runs the print_command method above 
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;task2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PythonOperator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;task_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;print_command&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;python_callable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;print_command&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;provide_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;dag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;## Task 3 demonstrates how you can use results from previous statements in new SQL statements
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;task3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TrinoOperator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;task_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;task_3&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;trino_connection&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;select { { task_instance.xcom_pull(task_ids=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;task_1&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;,key=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;return_value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;)[0] } }&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;## Task 4 demonstrates how you can run multiple statements in a single session.
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## Best practice is to run a single statement per task; however, statements that change session
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## settings must be run in a single task. The set time zone statements in this example will
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## not affect any future tasks, but the two now() functions would return timestamps for the
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## time zone set before they were run.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;task4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TrinoOperator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;task_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;task_4&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;trino_conn_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;trino_connection&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;set time zone &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;America/Chicago&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;; select now(); set time zone &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;UTC&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; ; select now()&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;## The following syntax determines the dependencies between all the DAG tasks.
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## Task 1 will have to complete successfully before any other tasks run.
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## Tasks 3 and 4 won&apos;t run until Task 2 completes.
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;## Tasks 3 and 4 can run in parallel if there are enough worker threads. 
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;task1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;task2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;task3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;task4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Just like the TrinoOperator, DAGs are picked up and compiled by Airflow
automatically. If Airflow fails to compile your DAG, it displays an error
message at the top of the main page where all the DAGs are listed. You can
refresh this page a few times until your DAG is either added to the list or you
see an error message. You can expand the message to see the source of the
error. Usually the information provided is enough to understand the issue.&lt;/p&gt;

&lt;p&gt;Once the DAG shows up in your list, you can trigger a manual run using the play
button on the right. I recommend switching to the Graph view, using the action
links on the right, to see how tasks change status as they run.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;75%&quot; src=&quot;/assets/blog/trino-airflow-blog/airflow-dag.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;You can see logs for each task by clicking on the corresponding box and selecting Log from the options at the top.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;60%&quot; src=&quot;/assets/blog/trino-airflow-blog/airflow-task.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Check out the logs for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print_command&lt;/code&gt; task to see the return value of the select statement from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;task_1&lt;/code&gt;.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;60%&quot; src=&quot;/assets/blog/trino-airflow-blog/airflow-logs.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As you can see, output from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print()&lt;/code&gt; commands can be found in these logs.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Apache Airflow has been around for many years now, and it is used by many large
companies in production environments. The open source project has an active
community, and I expect that in the near future we will have an official
TrinoHook with additional out-of-the-box functionality. While there might be a
slight learning curve for new users, I think it is worth it.&lt;/p&gt;

&lt;p&gt;On the Trino side, there are some exciting enhancements for &lt;a href=&quot;/docs/current/admin/fault-tolerant-execution.html&quot;&gt;fault-tolerant
execution&lt;/a&gt; on
the roadmap of Project Tardigrade that will make Trino and Airflow an even
better combination.&lt;/p&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note from Trino community&lt;/em&gt;: We welcome blog submissions from the community. If
you have blog ideas, send a message in the #dev chat. We will mail you
Trino swag as a token of appreciation for successful submissions. Enter the &lt;a href=&quot;https://join.slack.com/t/trinodb/shared_invite/zt-1aek3l6bn-ZMsvFZJqP1ULx5pU17WP1Q&quot;&gt;Trino
Slack&lt;/a&gt;
and join the conversation in the #project-tardigrade
&lt;a href=&quot;https://join.slack.com/share/enQtMzc3OTczMzkxNDU0OC1mNzEyOWUzNjUyMTgyNDU3ZGJlYTZjYTllYTI1ZmFhMDBlMzYwZWQzOGVkMjhhOGNlMmQ5MWIxM2RmNzZjNWY0&quot;&gt;channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://cutt.ly/airflow-reddit&quot;&gt;Discuss on Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://news.ycombinator.com/item?id=32100426&quot;&gt;Discuss On Hacker News&lt;/a&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Willie Valdez</name>
        </author>
      

      <summary>The recent addition of the fault-tolerant execution architecture, delivered to Trino by Project Tardigrade, makes the use of Trino for running your ETL workloads an even more compelling alternative than ever before. We’ve set up a demo environment for you to easily give it a try in Starburst Galaxy.</summary>

      
      
    </entry>
  
    <entry>
      <title>Announcing the 2022 Trino Summit</title>
      <link href="https://trino.io/blog/2022/06/30/trino-summit-call-for-speakers.html" rel="alternate" type="text/html" title="Announcing the 2022 Trino Summit" />
      <published>2022-06-30T00:00:00+00:00</published>
      <updated>2022-06-30T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/06/30/trino-summit-call-for-speakers</id>
      <content type="html" xml:base="https://trino.io/blog/2022/06/30/trino-summit-call-for-speakers.html">&lt;p&gt;We are pleased to announce the upcoming 2022 Trino Summit. The summit is
scheduled as a &lt;em&gt;hybrid&lt;/em&gt; event on the 10th of November 2022, and attendance is
free! You will be able to join us online, or you can make the trip to San
Francisco and meet us at the Commonwealth Club on the downtown waterfront.
Please be aware that spots at the live event are limited, so register soon if
you want to attend. Note that you need to register regardless of whether you’ll
be joining us in-person or online.&lt;/p&gt;

&lt;div class=&quot;card-deck spacer-30&quot;&gt;
    &lt;a class=&quot;btn btn-pink&quot; href=&quot;https://www.starburst.io/info/trinosummit/&quot;&gt;
        Register to attend
    &lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Starburst is the lead sponsor for the summit, but they welcome other sponsors to
help make this a successful event for the Trino community. If that interests you
or your employer, you should &lt;a href=&quot;mailto:events@starburst.io&quot;&gt;contact the Starburst team for more information.&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;If you’d like to share your knowledge and information about Trino usage and give
a talk at this year’s Trino Summit, we’re putting out a call for speakers. We
will be accepting submissions from now until September 15th, but we recommend
submitting soon, because slots are filling up fast.&lt;/p&gt;

&lt;p&gt;We’re looking for intermediate to advanced-level talks on a variety of themes.
If you have an interesting story about how you were able to leverage Trino,
found a neat way to extend it with a custom plugin, or swapped to Trino for a
performance win, we’d love to hear about it. We’re excited to expand our speaker
lineup with talks from the broader Trino community. If you’re interested, you
can check out the speaker registration page for more information.&lt;/p&gt;

&lt;p&gt;And of course, we’re looking forward to seeing you there, whether in-person or
online!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update from 15th September 2022:&lt;/em&gt; The call for speakers is closed. Thank you
for all your submissions.&lt;/p&gt;</content>

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>We are pleased to announce the upcoming 2022 Trino Summit. The summit is scheduled as a hybrid event on the 10th of November 2022, and attendance is free! You will be able to join us online, or you can make the trip to San Francisco and meet us at the Commonwealth Club on the downtown waterfront. Please be aware that spots at the live event are limited, so register soon if you want to attend. Please also be aware that you need to register regardless of whether you’ll be joining us in-person or online. Register to attend Starburst is the lead sponsor for the summit, but they welcome other sponsors to help make this a successful event for the Trino community. If that interests you or your employer, you should contact the Starburst team for more information.</summary>

      
      
    </entry>
  
    <entry>
      <title>Using Trino as a batch processing engine</title>
      <link href="https://trino.io/blog/2022/06/24/trino-meetup-extract-trino-load.html" rel="alternate" type="text/html" title="Using Trino as a batch processing engine" />
      <published>2022-06-24T00:00:00+00:00</published>
      <updated>2022-06-24T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/06/24/trino-meetup-extract-trino-load</id>
      <content type="html" xml:base="https://trino.io/blog/2022/06/24/trino-meetup-extract-trino-load.html">&lt;p&gt;This past week, &lt;a href=&quot;https://github.com/arhimondr&quot;&gt;Andrii Rosa&lt;/a&gt; hosted a virtual
Trino meetup on the topic of using Trino as a batch processing engine. You can
view the talk from the meetup embedded below. Andrii dives into the history of
Trino as an engine for batch ETL (extract, transform, load) processing, some
challenges related to that, as well as the new fault-tolerant execution
capabilities being added to Trino and how they improve it for batch ETL use
cases.&lt;/p&gt;

&lt;!--more--&gt;

&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;
&lt;iframe width=&quot;560&quot; height=&quot;400&quot; src=&quot;https://www.youtube.com/embed/2Ywqbz4T-Sw?t=1116&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;
&lt;/iframe&gt;
&lt;div class=&quot;spacer-30&quot;&gt;&lt;/div&gt;

&lt;p&gt;Andrii also gives an update on the work in progress on fault-tolerant
execution, where we are today, and what’s planned for the near future. The
meetup wraps up with an attendee Q&amp;amp;A. If you’d like to learn more, go check
out the talk!&lt;/p&gt;

      
        <author>
          <name>Cole Bowden</name>
        </author>
      

      <summary>This past week, Andrii Rosa hosted a virtual Trino meetup on the topic of using Trino as a batch processing engine. You can view the talk from the meetup embedded below. Andrii dives into the history of Trino as an engine for batch ETL (extract, transform, load) processing, some challenges related to that, as well as the new fault-tolerant execution capabilities being added to Trino and how they improve it for batch ETL use cases.</summary>

      
      
    </entry>
  
    <entry>
      <title>Building A Modern Data Stack for QazAI</title>
      <link href="https://trino.io/blog/2022/06/08/building-a-modern-data-stack-for-qaz-ai.html" rel="alternate" type="text/html" title="Building A Modern Data Stack for QazAI" />
      <published>2022-06-08T00:00:00+00:00</published>
      <updated>2022-06-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/06/08/building-a-modern-data-stack-for-qaz-ai</id>
      <content type="html" xml:base="https://trino.io/blog/2022/06/08/building-a-modern-data-stack-for-qaz-ai.html">&lt;p&gt;At QazAI, we build data lakes as a service for companies. In the original
architecture, we got raw data in S3, transformed the S3 data with Hive, and then
delivered the data to business units via our datamart built on ClickHouse (for
optimal delivery speeds). Over time, we were dragged down by the slower speeds
and high costs of running Hive, and started shopping for a faster and cheaper
open source engine to do our ETL data transformations.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;100%&quot; src=&quot;/assets/blog/qaz-ai-modern-data-stack/old-architecture.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;This diagram shows our existing stack. The big problem to solve was that the
Hadoop cluster was extremely inefficient. This led to slow queries and up to
10x higher costs.&lt;/p&gt;

&lt;p&gt;Like many others, I was initially drawn to Trino to run analytics over Hive
tables because of its speed, but found many other advantages as well. Key among
them are the following characteristics.&lt;/p&gt;

&lt;h2 id=&quot;speed&quot;&gt;Speed&lt;/h2&gt;

&lt;p&gt;Queries ran 10 to 100 times faster compared to our old stack. It was fantastic,
simply beyond our expectations.&lt;/p&gt;

&lt;h2 id=&quot;standard-sql&quot;&gt;Standard SQL&lt;/h2&gt;

&lt;p&gt;Trino speaks a standard SQL dialect that everyone already knew. Data analysts
loved getting to use a dialect they were already familiar with.&lt;/p&gt;

&lt;h2 id=&quot;federated-analytics&quot;&gt;Federated analytics&lt;/h2&gt;

&lt;p&gt;Trino can connect to other databases and run federated queries. After I had
connected all the available data sources, I showed the results to the data
analysts. They were simply amazed; some were shocked when a join between tables
from different databases completed successfully. To emphasize: this saved days
of work. You could join data from other data sources straight away, avoiding
the need to create a staging layer in the data warehouse.&lt;/p&gt;

&lt;h2 id=&quot;simplicity-of-setup&quot;&gt;Simplicity of setup&lt;/h2&gt;

&lt;p&gt;Trino just works out of the box. This is what makes it great. As open source
users, we’re used to going through complicated software setup processes. But
with Trino, there’s no need to deploy anything else. You simply install packages
from the open source repository, and things work. It’s magical. To top it off,
Trino feels like a commercial product, with detailed documentation and an active
Slack community that is willing to help you out with everything.&lt;/p&gt;

&lt;h2 id=&quot;exploring-trino-as-an-option-for-etl&quot;&gt;Exploring Trino as an option for ETL&lt;/h2&gt;

&lt;p&gt;A great number of connectors, standard SQL, high processing speed - all these
advantages raise an obvious question: ‘Why not use Trino for ETL processes as
well?’&lt;/p&gt;

&lt;p&gt;At QazAI, the key blocker to using Trino for ETL was that Trino doesn’t have
fault tolerance. As a result, our pipelines did not have reliable landing times,
and required a lot of manual monitoring.&lt;/p&gt;

&lt;p&gt;This is precisely what made Project Tardigrade so exciting for us. Proving
that Trino is indeed a true community-driven project, community members embarked
on the Tardigrade project to bring fault-tolerant execution to Trino. Its main
feature is the ability to split a query into smaller tasks and retry only the
failed ones. We’ve been running tests to explore this. Our ETL pipeline on
Trino, running on 5 bare metal nodes, is 20 times faster than the same ETL on a
stack consisting of Sqoop, HDFS, Hive, and custom Python scripts.&lt;/p&gt;
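
&lt;p&gt;For reference, enabling fault-tolerant execution takes only a few properties
(a minimal sketch; the S3 exchange location is a placeholder you would replace
with your own storage):&lt;/p&gt;

&lt;div class=&quot;language-properties highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# etc/config.properties
retry-policy=TASK

# etc/exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://my-exchange-spooling-bucket
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With task retries enabled, intermediate results are spooled to the exchange
storage so failed tasks can be restarted without rerunning the whole query.&lt;/p&gt;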

&lt;h2 id=&quot;testing-trino-for-etl&quot;&gt;Testing Trino for ETL&lt;/h2&gt;

&lt;p&gt;Let’s play a bit with the well-known DVD rental sample database.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;75%&quot; src=&quot;/assets/blog/qaz-ai-modern-data-stack/rentaldb-schema.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;For instance, we create the database shown above in PostgreSQL and work with the &lt;em&gt;rental&lt;/em&gt; table.&lt;/p&gt;

&lt;p&gt;First, we move the table from PostgreSQL to our warehouse in HDFS and Hive.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_rental&lt;/span&gt;  
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;PARQUET&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; 
	&lt;span class=&quot;n&quot;&gt;rental_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rental_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rental_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;inventory_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresqldvd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rental&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we perform the same operation, but this time with an Iceberg table on S3 that uses hidden partitioning.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_rental&lt;/span&gt;  
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partitioning&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ARRAY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;month(rental_date)&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;bucket(inventory_id, 10)&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;PARQUET&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; 
	&lt;span class=&quot;n&quot;&gt;rental_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rental_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;inventory_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresqldvd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rental&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
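
&lt;p&gt;To confirm that hidden partitioning took effect, you can inspect the Iceberg
&lt;code&gt;$partitions&lt;/code&gt; metadata table of the table we just created (a quick
sanity check, not part of the pipeline itself):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT partition, record_count, file_count
FROM iceberg2.ice.&quot;dvd_rental$partitions&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each row describes one partition, with its month and bucket values and the
number of records and files it holds.&lt;/p&gt;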

&lt;p&gt;Great. What if there is a need to enrich the data with the employees’ and
customers’ names? To do this, we create the dimension tables, move them to the
core layer, and then apply denormalization.&lt;/p&gt;

&lt;p&gt;Here we move the dimension tables.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_staff&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;PARQUET&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; 
	&lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;first_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;last_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;address_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;store_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;store_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;active&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;username&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;picture&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresqldvd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_customer&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;PARQUET&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; 
	&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;store_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;store_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;first_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;last_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;address_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;activebool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;create_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;active&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresqldvd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s join the staff and customer tables to the rental table.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_core_rental&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;PARQUET&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rental_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rental_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;inventory_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--cast(customer_id as integer) as customer_id,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_lastname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;staff_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--cast(staff_id as integer) as staff_id,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;staff_lastname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_rental&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_customer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_staff&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If data analysts need this table, we can easily move it to the data mart (the Clickhouse layer we use to deliver data to end users).&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clickhouse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rental_analysis_table&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rental_id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rental_date&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;inventory_id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;customer_name&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
	&lt;span class=&quot;n&quot;&gt;customer_lastname&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;staff_name&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;staff_lastname&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt;   
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;engine&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;MergeTree&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;order_by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ARRAY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;customer_name&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;customer_lastname&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A simple &lt;code&gt;INSERT ... SELECT&lt;/code&gt; query and nothing more.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clickhouse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rental_analysis_table&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dvd_core_rental&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, we can move the datamart to ClickHouse directly from PostgreSQL, without any intermediate data layers.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clickhouse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rental_analysis_table&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rental_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rental_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;inventory_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
	&lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_lastname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;staff_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
	&lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;staff_lastname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_update&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresqldvd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rental&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresqldvd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;postgresqldvd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rnt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;staff_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Great.&lt;/p&gt;

&lt;p&gt;One may object that this sample dataset is small, with only 16,000 rows,
while production ETL mostly runs over huge tables containing millions or
billions of rows. Let’s test that. We work with the &lt;em&gt;tpch&lt;/em&gt; catalog at
scale factor 3000.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;75%&quot; src=&quot;/assets/blog/qaz-ai-modern-data-stack/tpch-schema.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;For testing, we consider three tables: &lt;em&gt;lineitem&lt;/em&gt; (18 billion rows),
&lt;em&gt;orders&lt;/em&gt; (450 million rows) and &lt;em&gt;partsupp&lt;/em&gt; (2.4 billion rows).&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch_sf3000_orders&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;450&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;ORC&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tpch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sf3000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch_sf3000_lineitem&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;18&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;ORC&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tpch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sf3000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lineitem&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch_sf3000_partsupp&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;ORC&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tpch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sf3000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partsupp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then we join all three tables as shown in the ER diagram. Let’s make it
more challenging by turning off one of the workers mid-query, which would
normally cause the query to fail. To enable automatic retries of failed
queries, we set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;retry-policy=QUERY&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.properties&lt;/code&gt;.&lt;/p&gt;
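&lt;p&gt;As a sketch, the relevant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.properties&lt;/code&gt; fragment on the coordinator looks like this (only the fault-tolerance setting is shown; larger query results may additionally require configuring an exchange manager):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Rerun the whole query if any of its tasks fail, e.g. after a worker goes down
retry-policy=QUERY
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;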

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch_sf3000_lineitem_joined&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;ORC&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderkey&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partkey&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;suppkey&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linenumber&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quantity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extendedprice&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;discount&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;returnflag&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linestatus&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shipdate&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;commitdate&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receiptdate&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shipinstruct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shipmode&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;comment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;psupp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;availqty&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;psupp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;supplycost&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;ord&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shippriority&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;ord&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;totalprice&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch_sf3000_lineitem&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch_sf3000_partsupp&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;psupp&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;psupp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partkey&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;suppkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;psupp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;suppkey&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iceberg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch_sf3000_orders&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ord&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;litem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ord&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderkey&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The query completed in 4 hours. During processing, worker 22 was turned
off; the query was automatically started over and completed successfully. The
query joined all three tables (&lt;em&gt;the triple join&lt;/em&gt;): 18 billion rows x
2.4 billion rows x 450 million rows.&lt;/p&gt;
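&lt;p&gt;A quick way to sanity-check the result is, for example, a simple count over the joined table created above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Should come out to lineitem&apos;s 18 billion rows, since both right-hand
-- tables are left-joined on keys that are unique on their side
SELECT count(*) FROM iceberg2.ice.tpch_sf3000_lineitem_joined;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;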

&lt;p&gt;This experiment gave us the confidence to move forward with our plan to
rebuild our architecture around Trino, performing analytical and
transformational work on data directly in S3, which allows us to remove HDFS
and Hive from these processes.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;100%&quot; src=&quot;/assets/blog/qaz-ai-modern-data-stack/new-architecture.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As a result, we expect faster pipelines.&lt;/p&gt;

&lt;p&gt;A huge thanks to the Trino development team and the Trino community for an
excellent product that I enjoy using and that lets me go beyond conventional
usage patterns.&lt;/p&gt;

&lt;p&gt;If you are looking for help building your data warehouse, or if you’re
interested in joining us at QazAI, feel free to reach out to me at Baurzhan Kuspayev on the &lt;a href=&quot;https://join.slack.com/t/trinodb/shared_invite/zt-1aek3l6bn-ZMsvFZJqP1ULx5pU17WP1Q&quot;&gt;Trino Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note from Trino community&lt;/em&gt;: We welcome blog submissions from the community. If you have blog ideas, please send a message in the #dev channel on the &lt;a href=&quot;https://join.slack.com/t/trinodb/shared_invite/zt-1aek3l6bn-ZMsvFZJqP1ULx5pU17WP1Q&quot;&gt;Trino Slack&lt;/a&gt;. We will mail you Trino swag as a token of appreciation for successful submissions.&lt;/p&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://cutt.ly/qaz-ai-trino-reddit&quot;&gt;Discuss on Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://news.ycombinator.com/item?id=31672725&quot;&gt;Discuss On Hacker News&lt;/a&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Baurzhan Kuspayev</name>
        </author>
      

      <summary>At QazAI, we build data lakes as a service for companies. In the original architecture, we get raw data in S3, transform the S3 data with Hive, and then deliver the data to business units via our datamart built on ClickHouse (for optimal delivery speeds). Over time, we were dragged down by the slower speeds and high costs of running Hive, and started shopping for a faster and cheaper open source engine for our ETL data transformations.</summary>

      
      
    </entry>
  
    <entry>
      <title>An opinionated guide to consolidating our data</title>
      <link href="https://trino.io/blog/2022/05/24/an-opinionated-guide-to-consolidating-our-data.html" rel="alternate" type="text/html" title="An opinionated guide to consolidating our data" />
      <published>2022-05-24T00:00:00+00:00</published>
      <updated>2022-05-24T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/05/24/an-opinionated-guide-to-consolidating-our-data</id>
      <content type="html" xml:base="https://trino.io/blog/2022/05/24/an-opinionated-guide-to-consolidating-our-data.html">&lt;h2 id=&quot;maximizing-your-experience-with-zero-choices&quot;&gt;Maximizing your experience with zero choices.&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;I’m publishing this blog post in partnership with the Trino community to go
along a lightning talk I’m giving for their event, Cinco de Trino. This article
was originally published &lt;a href=&quot;https://abhi-vaidyanatha.medium.com/an-opinionated-guide-to-consolidating-your-data-b09386b2b9b5&quot;&gt;on Abhi’s Medium
site&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“My data is all over the place and attempting to analyze or query it is not
only time consuming and expensive, but also emotionally taxing.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;!--more--&gt;

&lt;p&gt;Maybe you haven’t heard those exact words before, but data consolidation is a
real problem. It is common for organizations to have correlated data stored in
various silos or APIs. Performing consistent operations across these various
data sources requires understanding both architecture and surgery, skills that
you may not have picked up as a data practitioner. If you’re part of the Trino
community and are reading this post, you’ve likely encountered poorly performing
queries due to unconsolidated data.&lt;/p&gt;

&lt;p&gt;In the past, the data engineering world was not graced with the same level of
love and &lt;a href=&quot;https://tailwindcss.com/&quot;&gt;tooling&lt;/a&gt; as other communities, so we were
expected to make do with whatever came our way. In order to perform the wildly
basic task of moving our data around, we were asked to tithe large sums of money
to the closed-source ELT overlords.&lt;/p&gt;

&lt;p&gt;So where does that leave us? Thankfully things have changed, so here’s how you
can move all your data to a central location for free (well, minus the
infrastructure costs) while making few architectural choices.&lt;/p&gt;

&lt;h2 id=&quot;the-tool&quot;&gt;The tool&lt;/h2&gt;
&lt;p&gt;You don’t have too many choices for FOSS ELT/ETL.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://airbyte.com/&quot;&gt;Airbyte&lt;/a&gt; has been recently making waves as the main
contender for open-source ELT. As of writing this article, it’s only been around
for about two years, during which it has established itself as one of the fastest
growing startups in existence. It requires three terminal commands to deploy and
is managed entirely through a UI, so it’s operable by many. It also supports
syncing your data incrementally, so you don’t need to resync existing data when
you want to sync new data. It is relatively new, so some of the polish that
comes with an established project is not there yet. Think of it like a
precocious child.&lt;/p&gt;

&lt;p&gt;You could use &lt;a href=&quot;https://meltano.com/&quot;&gt;Meltano&lt;/a&gt; to take advantage of the large
&lt;a href=&quot;https://www.singer.io/&quot;&gt;Singer&lt;/a&gt; connector ecosystem, but it’s more complicated
to set up and is more of a holistic ops platform, which may be excessive for
your use case.&lt;/p&gt;

&lt;p&gt;You could also use this esoteric project called KETL that is only available at
this sketchy SourceForge &lt;a href=&quot;https://sourceforge.net/projects/ketl/&quot;&gt;link&lt;/a&gt;. But
maybe don’t do that.&lt;/p&gt;

&lt;p&gt;For consolidating your data, use Airbyte. It’s straightforward to set up,
requires minimal configuration, and has tightly scoped responsibilities.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/640/1*zqLMo7P3o_HG7EJ2E1dbpg.png&quot; /&gt;
&lt;/p&gt;

&lt;h2 id=&quot;the-destination&quot;&gt;The destination&lt;/h2&gt;

&lt;p&gt;Let’s use a data lake. Its unstructured nature leaves more flexibility in
how the data is used later, and we’ll assume that our data has not been
processed or filtered yet.&lt;/p&gt;

&lt;p&gt;Data warehouses are more expensive, require more upkeep, and benefit from the
ETL paradigm as opposed to ELT. Airbyte is an ELT tool focused mostly on the EL
bit, which makes it easier to use with unstructured data lakes.&lt;/p&gt;

&lt;p&gt;Additionally, S3 supports query engines such as Trino, which will allow us to
query and analyze our data once it’s been consolidated. Trino also functions as a
powerful data lake transformation engine, so if you’re on the fence due to data
malleability, this might help bring you over.&lt;/p&gt;

&lt;p&gt;We could use Azure Blob Storage or GCS, but for this tutorial, I’ll be keeping
it simple with Amazon S3. If you’ve set up an S3 bucket and IAM, skip the next
paragraph.&lt;/p&gt;

&lt;p&gt;Create a S3 bucket with default settings and grab an access key from IAM. To do
this, head to the top right of the screen in the AWS Management Console where it
says your email provider and then click on &lt;strong&gt;Security Credentials&lt;/strong&gt;. Click
&lt;strong&gt;Create New Access Key&lt;/strong&gt; and save that information for later.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/1202/1*mYeldXLcvi7iPBDZ1GKEug.png&quot; /&gt;
&lt;/p&gt;

&lt;h2 id=&quot;the-deployment&quot;&gt;The deployment&lt;/h2&gt;

&lt;p&gt;Today, we’ll be deploying Airbyte locally on a workstation. Alternatively, you
can deploy it on your own infrastructure, but this requires managing networking
and security, which is unpalatable for a quick demonstration. If you want your
syncs to continue running in perpetuity, you’ll want to deploy Airbyte
externally to your machine. For a guide to deploying Airbyte on EC2 click
&lt;a href=&quot;https://docs.airbyte.com/deploying-airbyte/on-aws-ec2&quot;&gt;here&lt;/a&gt;. For a guide to
deploying Airbyte on Kubernetes, click
&lt;a href=&quot;https://docs.airbyte.com/deploying-airbyte/on-plural&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To begin, install &lt;a href=&quot;https://www.docker.com/products/docker-desktop/&quot;&gt;Docker&lt;/a&gt; and
docker-compose on your workstation.&lt;/p&gt;

&lt;p&gt;Then clone the repository and spin up Airbyte with docker-compose.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone git@github.com:airbytehq/airbyte.git
cd airbyte
docker-compose up
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once you see the following banner, you’re good to go.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/1148/1*7Fg7Vwi5vgkg94SYRuACLQ.png&quot; /&gt;
&lt;/p&gt;

&lt;h2 id=&quot;the-data-sources&quot;&gt;The data sources&lt;/h2&gt;

&lt;p&gt;Head over to localhost:8000 on your machine, complete the sign-up flow, and
you’ll be greeted with an onboarding workflow. We’re going to skip this workflow
to emulate a traditional usage of Airbyte. Click on the Sources tab in the left
sidebar and click on +New Source. This is where we’ll be setting up all of our
disparate data sources.&lt;/p&gt;

&lt;p&gt;Search for your data sources in the drop down and fill out the required
configuration. If you’re having trouble setting up a particular data source,
head to the &lt;a href=&quot;https://docs.airbyte.com/&quot;&gt;Airbyte docs&lt;/a&gt;. There’s a dedicated page
for every connector; for example, this is the &lt;a href=&quot;https://docs.airbyte.com/integrations/sources/google-analytics-v4&quot;&gt;setup
guide&lt;/a&gt; for
the Google Analytics source. If you’re just testing Airbyte out, use the PokeAPI
source, as it lets you sync dummy data with no authentication. If your required
data source doesn’t exist, you can request it
&lt;a href=&quot;https://airbyte.com/connector-requests&quot;&gt;here&lt;/a&gt; or build it yourself by heading
&lt;a href=&quot;https://docs.airbyte.com/connector-development/&quot;&gt;here&lt;/a&gt; (isn’t open-source
great?)&lt;/p&gt;

&lt;p&gt;Once you have all of your data sources set up, it will look something like this.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/1400/1*6_sNtdhFKkSnicyqe2Hhmg.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Now we just need to set up our connection to S3 and we are good to go.&lt;/p&gt;

&lt;h2 id=&quot;the-destination-again&quot;&gt;The destination (again)&lt;/h2&gt;

&lt;p&gt;Head over to the &lt;em&gt;Destinations&lt;/em&gt; tab in the left sidebar and follow the same
process for setting up our connection to S3. Click on &lt;em&gt;+New Destination&lt;/em&gt; and
search for S3. Then fill out the configuration for your bucket. We’ll now use
that access key that we generated earlier!&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/1400/1*24LRs9-dB7l35DgsXU6pqQ.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;For output format, I recommend using Parquet for analytics purposes. It’s a
&lt;a href=&quot;https://www.qubole.com/tech-blog/columnar-format-in-data-lakes-for-dummies/&quot;&gt;columnar storage
format&lt;/a&gt;,
which is optimized for reads. JSON, CSV, and Avro are supported, but will be
less performant on read.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/1400/1*tVw2sbTLYDlHpKB97M7cKg.png&quot; /&gt;
&lt;/p&gt;

&lt;h2 id=&quot;the-connection&quot;&gt;The connection&lt;/h2&gt;

&lt;p&gt;Finally, head over to the &lt;strong&gt;Connections&lt;/strong&gt; tab in the sidebar and click &lt;strong&gt;+New
Connection&lt;/strong&gt;. You will need to do this process for each data source that you
have set up. Select any existing source and click your S3 Destination that you
set up from the drop down. I failed to set up a connection with my GitHub
source, so I navigated to the Airbyte Troubleshooting Discourse and filed an
issue. Response times are really fast there, so I’ll likely be able to resolve
this within a day or two.&lt;/p&gt;

&lt;p&gt;You will then be greeted with the following connection setup page. For most
analytics jobs, syncing more frequently than every 24 hours is expensive and
overkill, so stick with the default. For sources that support it, click on the
sync mode in the streams table to use the &lt;strong&gt;Incremental / Append&lt;/strong&gt; sync mode.
This ensures that every time you sync, Airbyte will check for new data and only
pull in data that you haven’t synced before.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/1400/1*FZyFWtb3P4sqO77p-WZjAw.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Once you hit &lt;strong&gt;Set up connection&lt;/strong&gt;, Airbyte will run your first sync! You can
click into your connection to get access to the sync logs, replication settings,
and transformation settings if supported.&lt;/p&gt;

&lt;p&gt;Checking our S3 bucket, we can see that our data has successfully arrived! If
you’re just testing things out, you’re done.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;https://miro.medium.com/max/1400/1*qrEc7u2hiUUZv4TO5qOv6A.png&quot; /&gt;
&lt;/p&gt;

&lt;h2 id=&quot;the-analysis&quot;&gt;The analysis&lt;/h2&gt;

&lt;p&gt;Now that you’ve set up your data pipelines, if you want to run transformation
jobs, Trino supports that use case well: Lyft, Pinterest, and Shopify have all
done this to great success. There’s also a &lt;a href=&quot;https://github.com/starburstdata/dbt-trino&quot;&gt;dbt-trino
plugin&lt;/a&gt; maintained by the folks over at
Starburst. Alternatively, you could accomplish this using &lt;a href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/tutorial-s3-object-lambda-uppercase.html&quot;&gt;S3 Object
Lambda&lt;/a&gt;
if you want to stay within the AWS ecosystem where possible.&lt;/p&gt;
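
&lt;p&gt;As a sketch of what querying the synced files can look like (the schema,
table, column, and bucket names here are hypothetical examples), assuming
Airbyte wrote Parquet output for a GitHub stream, you could expose and query it
through the Hive connector:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Register the Parquet files Airbyte landed in S3 as an external table.
-- Table, column, and bucket names are placeholder examples.
CREATE TABLE hive.default.github_issues (
    id bigint,
    state varchar,
    created_at timestamp
)
WITH (
    external_location = 's3a://my-airbyte-bucket/github_issues/',
    format = 'PARQUET'
);

-- Query the synced data like any other Trino table.
SELECT state, count(*) AS issue_count
FROM hive.default.github_issues
GROUP BY state;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;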

&lt;p&gt;Once your data is in a queryable state, you can now use
&lt;a href=&quot;https://trino.io/docs/current/connector/hive-s3.html&quot;&gt;Trino&lt;/a&gt; or your favorite
query engine to your heart’s content! If you want to get started with querying
these heterogeneous data sources using Trino, here’s a &lt;a href=&quot;https://janakiev.com/blog/presto-trino-s3/&quot;&gt;getting-started
guide&lt;/a&gt; on how to do that. Finally,
join the &lt;a href=&quot;https://airbyte.com/community&quot;&gt;Airbyte&lt;/a&gt; and
&lt;a href=&quot;https://trino.io/community.html&quot;&gt;Trino&lt;/a&gt; communities to learn more about how
others are consolidating and querying their data.&lt;/p&gt;</content>

      
        <author>
          <name>Abhi Vaidyanatha</name>
        </author>
      

      <summary>Maximizing your experience with zero choices. I’m publishing this blog post in partnership with the Trino community to go along a lightning talk I’m giving for their event, Cinco de Trino. This article was originally published on Abhi’s Medium site “My data is all over the place and attempting to analyze or query it is not only time consuming and expensive, but also emotionally taxing.”</summary>

      
      
    </entry>
  
    <entry>
      <title>Cinco de Trino recap: Learn how to build an efficient data lake</title>
      <link href="https://trino.io/blog/2022/05/17/cinco-de-trino-recap.html" rel="alternate" type="text/html" title="Cinco de Trino recap: Learn how to build an efficient data lake" />
      <published>2022-05-17T00:00:00+00:00</published>
      <updated>2022-05-17T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/05/17/cinco-de-trino-recap</id>
      <content type="html" xml:base="https://trino.io/blog/2022/05/17/cinco-de-trino-recap.html">&lt;p&gt;When Trino (formerly PrestoSQL) arrived on the scene almost 10 years ago, it
immediately became known as the much faster alternative to the data warehouse
of big data, Apache Hive. The use cases that you, as the community, have built
have far exceeded anything we imagined in complexity. Together we’ve made
Trino not only the fastest way to interactively query large data sets, but also
a convenient way to run federated queries across data sources to make moving all
the data optional.&lt;/p&gt;

&lt;p&gt;At Cinco de Trino, we came full circle back to the next iteration of analytics 
architecture with the data lake.  This conference offers advice from industry 
thought leaders about how to use the best lakehouse tools with Trino to manage that
data complexity. Hear from speakers like Martin Traverso
(Trino), Dain Sundstrom (Trino), James Campbell (Great Expectations), Jeremy 
Cohen (DBT Labs), Ryan Blue (Iceberg), Denny Lee (Delta Lake), Vinoth Chandar 
(Hudi). You can watch the talks on-demand on the 
&lt;a href=&quot;https://www.youtube.com/playlist?list=PLFnr63che7wYDHjUsmp43THLmAlqPDHlM&quot;&gt;Cinco de Trino playlist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this post, I’d like to cover the key items from each talk you won’t want to 
miss.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h3 id=&quot;keynote-trino-as-a-data-lakehouse&quot;&gt;Keynote: Trino as a data lakehouse&lt;/h3&gt;

&lt;p&gt;Trino co-creator, Martin Traverso, covers where Trino fits into the data lake
and brings you a sneak peek of the future of Trino. Polymorphic table
functions and adaptive query planning are just some of the many exciting features
Martin walks us through.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/gwV3smFiGEg&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;project-tardigrade&quot;&gt;Project Tardigrade&lt;/h3&gt;

&lt;p&gt;If you have one takeaway from the conference, let it be this: there’s a new way
in town to get 60% cost savings on your Trino deployment. Cory Darby walks
through how utilizing the fault-tolerant execution architecture has enabled
BlueCat to auto-scale their Trino clusters and run on spot instances, which
yielded massive cost savings. Zebing Lin goes through how this happens behind
the scenes, and how you can run resource-intensive ETL jobs using failure 
recovery delivered by the team behind Project Tardigrade.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/MYBoeB_lQmo&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://trino.io/blog/2022/05/05/tardigrade-launch.html&quot;&gt;Learn more in the Project Tardigrade blog »&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://github.com/bitsondatadev/trino-getting-started/tree/main/kubernetes/tardigrade-eks&quot;&gt;Try Project Tardigrade Yourself »&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;starburst-galaxy-lab&quot;&gt;Starburst Galaxy lab&lt;/h3&gt;

&lt;p&gt;Starburst Galaxy enables you to get Trino up and running without spending
your time setting up, scaling, and maintaining the infrastructure.
Trino co-creator, Dain Sundstrom, walks you through a fun-filled lab that
demonstrates how to use the Trino-as-a-service solution, Starburst Galaxy, to
generate &lt;a href=&quot;https://db-engines.com/en/ranking&quot;&gt;database rankings&lt;/a&gt; by ingesting,
cleaning, and analyzing Twitter and Stack Overflow data.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/WQNqqkBd_Jo&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;engineering-data-reliability-with-great-expectations&quot;&gt;Engineering data reliability with Great Expectations&lt;/h3&gt;

&lt;p&gt;Let’s be honest: when we claim to have run “tests” for our data pipelines, we
usually mean we checked that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;input != NULL&lt;/code&gt;, or that the dashboard isn’t broken.
James Campbell showcases the Great Expectations connector for Trino, officially
launched as the new way to write expectations (data quality checks) for your
data pipelines.&lt;/p&gt;

&lt;p&gt;What excites us the most?&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The ability to take advantage of far more sophisticated data quality tests
than what any of us would write.&lt;/li&gt;
  &lt;li&gt;Having a really awesome UI to manage expectations.&lt;/li&gt;
  &lt;li&gt;The data source view that makes it easy to dynamically test your custom
data quality checks against backends.&lt;/li&gt;
&lt;/ol&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/9HE6LawCHP8&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;bring-your-data-into-your-data-lake-with-airbyte&quot;&gt;Bring your data into your data lake with Airbyte&lt;/h3&gt;

&lt;p&gt;The first step of doing any analytics is bringing your data into the data lake.
Ingestion engines are a game-changer for centralizing your data in the data lake.
Until recently, there was no open source software to choose from in this category.
In just 10 minutes, Abhi Vaidyanatha takes us through the journey of taking in 
data from various places into your choice of data lake.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/3E0jb4d2p0U&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://abhi-vaidyanatha.medium.com/an-opinionated-guide-to-consolidating-your-data-b09386b2b9b5&quot;&gt;Read Abhi’s article about Airbyte + Trino »&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;transforming-your-data-with-dbt&quot;&gt;Transforming your data with dbt&lt;/h3&gt;

&lt;p&gt;Ever had 300 lines of SQL in front of you, and wasted lots of time sifting
through it to find which part to edit to check for duplicate customers?&lt;/p&gt;

&lt;p&gt;Imagine having to update a decimal precision used frequently throughout that SQL
statement. What we &amp;lt;3 the most about dbt is that data engineering becomes much
more like software engineering, where you code in a much more modular way. Along
the way, you get many benefits. The ones we love the most? The data lineage graph
and automatic documentation: the stuff we always say is important, but never do.&lt;/p&gt;

&lt;p&gt;Even for dbt experts, there’s something new to learn. Jeremy Cohen goes through
new capabilities Trino brings to dbt, while showcasing cool features like
macros, a flexible alternative to SQL-defined functions.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/UYS75sjTziU&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://github.com/dbt-labs/trino-dbt-tpch-demo&quot;&gt;Check out Jeremy’s demo repo »&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;choosing-the-best-data-lakehouse-format-for-you&quot;&gt;Choosing the best data lakehouse format for you&lt;/h2&gt;

&lt;p&gt;Ever wonder about all the hype around the new table formats? Why is everyone
choosing Iceberg, Delta Lake, or Hudi over Hive? The founders of each of these
modern table formats showcase their projects and let you be the judge of which
format makes more sense for your architecture. Below are the highlights:&lt;/p&gt;

&lt;h3 id=&quot;iceberg&quot;&gt;Iceberg&lt;/h3&gt;

&lt;p&gt;Ryan Blue dives into important elements of your data lakehouse architecture that
affect daily operations and slow down developer efficiency. He then covers how
Iceberg is the solution he realized to solve those issues.&lt;/p&gt;

&lt;p&gt;The first special element of Iceberg is that it intentionally breaks
compatibility with the Hive format to bring you features like in-place table
partition and schema evolution. On the surface this may seem trivial, as we’ve
conditioned our minds to accept the limitations of Hive-like formats.&lt;/p&gt;

&lt;p&gt;The second special element is that Iceberg also builds a community-driven
specification that enables anyone to build a compatible implementation of the
Iceberg library.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/1oXmBbB77ak&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;delta-lake&quot;&gt;Delta Lake&lt;/h3&gt;

&lt;p&gt;90% of the time that our Trino data pipelines break, it’s because someone
committed a bad upstream change. With Delta Lake time travel (coming soon!), you
won’t need to spend a whole day pinpointing that bad change: just travel back in
time and identify which change that was. Denny Lee gives us a compelling 
argument for why users desire ACID guarantees in their data lakehouse and how
Delta Lake solves for that.&lt;/p&gt;

&lt;p&gt;Similar to Iceberg, Delta Lake offers optimistic concurrency, which allows
multiple writers to write to the same Delta Lake table while maintaining ACID
constraints on the data.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/TB9Dxv71LxQ&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;hudi-coming-soon-to-trino&quot;&gt;Hudi [Coming Soon to Trino]&lt;/h3&gt;

&lt;p&gt;The coolest part of the talk? Open up a world of new possibilities with near 
real-time analytics in Trino with Hudi. With Hudi, you get to serve real-time 
production systems, debug live issues, and more.&lt;/p&gt;

&lt;p&gt;Vinoth Chandar showcases the compelling use cases that drove innovation around
Hudi at Uber. He then covers how the architectures of data lakes and lakehouses
are starting to merge, and the implications this has for open versus proprietary
architectures.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/r-fF9uqzUdE&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;touch-talk-and-see-your-data-with-tableau&quot;&gt;Touch, talk, and see your data with Tableau&lt;/h3&gt;

&lt;p&gt;Tableau is our favorite data visualization tool, and in this session, Vlad 
Usatin of Tableau shares how to use Tableau to directly visualize your Trino 
data.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/b6kKqNIMvuM&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Thank you to all who attended or viewed; we hope to see you again at our
upcoming events later this year. Continue the conversation in our
&lt;a href=&quot;https://join.slack.com/t/trinodb/shared_invite/zt-18acr4bvr-0DtaCwiLOrv1zetGnV_w~w&quot;&gt;Trino Slack&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen, Brian Zhan</name>
        </author>
      

      <summary>When Trino (formerly PrestoSQL) arrived on the scene almost 10 years ago, it immediately became known as the much faster alternative to the data warehouse of big data, Apache Hive. The use cases that you, as the community, have built had far exceeded anything we had imagined in complexity. Together we’ve made Trino not only the fastest way to interactively query large data sets, but also a convenient way to run federated queries across data sources to make moving all the data optional. At Cinco de Trino, we came full circle back to the next iteration of analytics architecture with the data lake. This conference offers advice from industry thought leaders about how to use best lakehouse tools with Trino to manage that data complexity. Hear from industry thought leaders like Martin Traverso (Trino), Dain Sundstrom (Trino), James Campbell (Great Expectations), Jeremy Cohen (DBT Labs), Ryan Blue (Iceberg), Denny Lee (Delta Lake), Vinoth Chandar (Hudi). You can watch the talks on-demand on the Cinco de Trino playlist. In this post, I’d like to cover the key items from each talk you won’t want to miss.</summary>

      
      
    </entry>
  
    <entry>
      <title>Project Tardigrade delivers ETL at Trino speeds to early users</title>
      <link href="https://trino.io/blog/2022/05/05/tardigrade-launch.html" rel="alternate" type="text/html" title="Project Tardigrade delivers ETL at Trino speeds to early users" />
      <published>2022-05-05T00:00:00+00:00</published>
      <updated>2022-05-05T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/05/05/tardigrade-launch</id>
      <content type="html" xml:base="https://trino.io/blog/2022/05/05/tardigrade-launch.html">&lt;p&gt;After six months of challenging work on Project Tardigrade, we are ready to
launch. With this project, we improved the user experience of running
resource-intensive queries that are common in the Extract, Transform, Load
(ETL) and batch processing space. Getting to this point required some
significant and fascinating engineering. The latest Trino release includes
all the work from Project Tardigrade. Read on to learn how it all works, and
how to enable the fault-tolerant execution in Trino.&lt;/p&gt;

&lt;p align=&quot;center&quot; width=&quot;100%&quot;&gt;
    &lt;img width=&quot;50%&quot; src=&quot;/assets/blog/tardigrade-launch/tardigrade-logo.png&quot; /&gt;
&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;what-is-project-tardigrade&quot;&gt;What is Project Tardigrade?&lt;/h2&gt;

&lt;p&gt;What we love most about Trino is that you get fast query speeds, and you can
iterate fast with intuitive error messages, interactive experience, and query
federation.&lt;/p&gt;

&lt;p&gt;One big problem that has persisted for a long time is that configuring, tuning,
and managing Trino for long-running ETL workloads is very difficult. Following
are just some of the problems you have to deal with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Reliable landing times:&lt;/em&gt; Queries that run for hours can fail. Restarting
them from scratch wastes resources and makes it hard for you to meet
your completion time requirements.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Cost-efficient clusters:&lt;/em&gt; Trino queries that need terabytes of distributed
memory require extremely large clusters due to the lack of iterative
execution.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Concurrency:&lt;/em&gt; Multiple independent clients may submit their queries
concurrently. Due to the lack of available resources at a certain moment some
of these queries may need to be killed and restarted from zero after a
while. This makes the landing time even more unpredictable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://engineering.salesforce.com/how-to-etl-at-petabyte-scale-with-trino-5fe8ac134e36&quot;&gt;Structuring your workload&lt;/a&gt;
to avoid these problems can be done by a team of experts, but that expertise is
not accessible to most Trino users.&lt;/p&gt;

&lt;p&gt;The goal of Project Tardigrade is to provide an “out of the box” solution for the
problems mentioned above. We’ve designed a new
&lt;a href=&quot;https://github.com/trinodb/trino/wiki/Fault-Tolerant-Execution&quot;&gt;fault-tolerant execution architecture&lt;/a&gt;
that allows us to implement advanced resource-aware scheduling with granular
retries.&lt;/p&gt;

&lt;p&gt;Following are some of the benefits and results:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;When your long-running queries experience a failure, they don’t have to start
from scratch.&lt;/li&gt;
  &lt;li&gt;When queries require more memory than is currently available in the cluster,
they are still able to succeed.&lt;/li&gt;
  &lt;li&gt;When multiple queries are submitted concurrently they are able to share
resources in a fair way, and make steady progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trino does all the hard work of allocating, configuring, and maintaining query
processing behind the scenes. Instead of spending time tuning Trino clusters to
match your workload requirements, or reorganizing your workload to match your
Trino cluster capabilities, you can spend your time on analytics and delivering
business value. And most importantly, your heart won’t skip a beat when you
wake up in the morning wondering whether that query landed on time.&lt;/p&gt;

&lt;h2 id=&quot;what-did-we-test-so-far&quot;&gt;What did we test so far?&lt;/h2&gt;

&lt;p&gt;Since there’s no publicly available testing query set for ETL use cases, we
handcrafted more than a hundred ETL-like queries based on the
&lt;a href=&quot;https://github.com/trinodb/trino-verifier-queries/tree/main/src/main/resources/queries/tpch/etl&quot;&gt;TPC-H&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/trinodb/trino-verifier-queries/tree/main/src/main/resources/queries/tpcds/etl&quot;&gt;TPC-DS&lt;/a&gt;
datasets.&lt;/p&gt;

&lt;p&gt;To simulate real world settings, we deployed a cluster
&lt;a href=&quot;https://trino.io/docs/current/admin/fault-tolerant-execution.html&quot;&gt;configured for fault-tolerant execution&lt;/a&gt;
of 15 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m5.8xlarge&lt;/code&gt; nodes and repeatedly executed thousands of queries over
datasets of different sizes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10GB&lt;/code&gt; / &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1TB&lt;/code&gt; / &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10TB&lt;/code&gt;). The queries were
executed sequentially as well as with concurrency factors of 5, 10, and 20.
Failure recovery capabilities were tested by crashing a random node in a
cluster every couple of minutes while streaming a live workload.&lt;/p&gt;

&lt;p&gt;To validate new resource management capabilities we submitted all 22
&lt;a href=&quot;https://github.com/trinodb/trino-verifier-queries/tree/main/src/main/resources/queries/tpch/etl&quot;&gt;TPC-H&lt;/a&gt;
based queries simultaneously with fault-tolerant execution enabled and disabled.
With fault-tolerant execution disabled only two of them succeeded, while the 
remaining twenty queries failed with resource-related issues, such as
running out of memory. With fault tolerant execution enabled all of the
queries succeeded with no issues.&lt;/p&gt;

&lt;h2 id=&quot;how-do-i-enable-fault-tolerant-execution&quot;&gt;How do I enable fault-tolerant execution?&lt;/h2&gt;

&lt;p&gt;Fault-tolerant execution can only be enabled for an entire cluster.&lt;/p&gt;

&lt;p&gt;In general, we recommend running your long-running ETL queries and
short-running interactive workloads on different clusters.
This ensures that long-running ETL queries do not impact interactive workloads
and cause a bad user experience. Also note that any short-running,
interactive queries on a fault-tolerant cluster may experience higher latencies
due to the checkpoint mechanism.&lt;/p&gt;

&lt;h3 id=&quot;1-add-an-s3-bucket-for-checkpointing&quot;&gt;1. Add an S3 bucket for checkpointing&lt;/h3&gt;

&lt;p&gt;First you need to create an S3 bucket for spooling. We recommend configuring a
bucket lifecycle rule to automatically expire abandoned objects in the event of
a node crash. You can configure these rules using the
&lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-lifecycle-configuration.html&quot;&gt;s3api&lt;/a&gt;
command, as shown in the tutorial below.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
    &quot;Rules&quot;: [
        {
            &quot;Expiration&quot;: {
                &quot;Days&quot;: 1
            },
            &quot;ID&quot;: &quot;Expire&quot;,
            &quot;Filter&quot;: {},
            &quot;Status&quot;: &quot;Enabled&quot;,
            &quot;NoncurrentVersionExpiration&quot;: {
                &quot;NoncurrentDays&quot;: 1
            },
            &quot;AbortIncompleteMultipartUpload&quot;: {
                &quot;DaysAfterInitiation&quot;: 1
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
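
&lt;p&gt;As a sketch, assuming the JSON above is saved as
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lifecycle.json&lt;/code&gt; and your bucket is named
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my-exchange-bucket&lt;/code&gt; (a placeholder), you can apply the rule
with a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3api&lt;/code&gt; call:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;aws s3api put-bucket-lifecycle-configuration \
    --bucket my-exchange-bucket \
    --lifecycle-configuration file://lifecycle.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;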

&lt;h3 id=&quot;2-configure-the-trino-exchange-manager&quot;&gt;2. Configure the Trino exchange manager&lt;/h3&gt;

&lt;p&gt;Second, you need to configure the exchange manager. Add the file
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exchange-manager.properties&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etc&lt;/code&gt; folder of your Trino installation on
the coordinator and all workers with the following content:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exchange-manager.name=filesystem
exchange.base-directories=s3://&amp;lt;bucket-name&amp;gt;
exchange.s3.region=us-east-1
exchange.s3.aws-access-key=&amp;lt;access-key&amp;gt;
exchange.s3.aws-secret-key=&amp;lt;secret-key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;3-enable-task-level-retries&quot;&gt;3. Enable task level retries&lt;/h3&gt;

&lt;p&gt;Lastly, you need to configure and enable task level retries by adding the
following properties to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.properties&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;retry-policy=TASK
query.hash-partition-count=50
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note: the filesystem exchange implementation currently does not support more
than 50 partitions.&lt;/p&gt;

&lt;h3 id=&quot;4-optional-recommended-settings&quot;&gt;4. Optional recommended settings&lt;/h3&gt;

&lt;p&gt;It is also recommended to enable compression to reduce the amount of data spooled
on S3 (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exchange.compression-enabled=true&lt;/code&gt;) as well as reduce the low memory
killer delay to allow the resource manager to unblock nodes running short on memory
faster (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query.low-memory-killer.delay=0s&lt;/code&gt;). Additionally, we recommend enabling
automatic writer scaling to optimize output file size for tables created with
Trino (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scale-writers=true&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;To increase overall throughput and reduce resource-related task retries, we
recommend adjusting the concurrency settings based on the hardware
configuration you have chosen.&lt;/p&gt;

&lt;p&gt;Following are the settings for the hardware used in our testing (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;32&lt;/code&gt; vCPUs,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128GB&lt;/code&gt; memory and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10Gbit/s&lt;/code&gt; network):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;task.concurrency=8
task.writer-count=4
fault-tolerant-execution-target-task-input-size=4GB
fault-tolerant-execution-target-task-split-count=64
fault-tolerant-execution-task-memory=5GB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;By default Trino is configured to wait up to five minutes for a task to recover
before considering it lost and rescheduling it. This timeout
can be increased or reduced as necessary by adjusting the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query.remote-task.max-error-duration&lt;/code&gt; configuration property. For example:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query.remote-task.max-error-duration=1m&lt;/code&gt;&lt;/p&gt;

&lt;h2 id=&quot;deploying-on-aws-with-helm-and-kubernetes&quot;&gt;Deploying on AWS with Helm and Kubernetes&lt;/h2&gt;

&lt;p&gt;To test out Tardigrade features, you need at least a cluster with a dedicated
coordinator and two workers for a minimal level of parallelism and performance.
The quickest and easiest way to meet all of the specifications mentioned
above is by using the
&lt;a href=&quot;https://artifacthub.io/packages/helm/trino/trino&quot;&gt;Trino Helm chart&lt;/a&gt; with the
provided &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;values.yml&lt;/code&gt; and deploying a cluster to the AWS EKS cloud
service. If you are not familiar with deploying Trino on Kubernetes, we
recommend you take a look at the Trino Community Broadcast episodes covering
&lt;a href=&quot;https://trino.io/episodes/24.html&quot;&gt;local Trino on Kubernetes&lt;/a&gt; and
&lt;a href=&quot;https://trino.io/episodes/31.html&quot;&gt;deploying Trino on EKS&lt;/a&gt;.&lt;/p&gt;

&lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/4isawxYjDnE&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://github.com/bitsondatadev/trino-getting-started/tree/main/kubernetes/tardigrade-eks&quot;&gt;Try Project Tardigrade Yourself »&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;closing-notes&quot;&gt;Closing notes&lt;/h2&gt;

&lt;p&gt;Project Tardigrade has been a great success for us already. We learned a lot
and significantly improved Trino. Now we are ready to share this with you all,
and look forward to fixing anything you run into. We really want you to push
the limits, and let us know what you find.&lt;/p&gt;

&lt;p&gt;If running fast batch jobs on the fastest state-of-the-art query engine 
interests you, consider playing around with the tutorial above and giving us 
your feedback. You can reach us on the &lt;a href=&quot;https://bit.ly/3IFlNXy&quot;&gt;#project-tardigrade&lt;/a&gt; 
channel in our &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you would like to write about your experience and results, or become a
contributor, also let us know on the &lt;a href=&quot;https://bit.ly/3IFlNXy&quot;&gt;#project-tardigrade&lt;/a&gt;
channel. We are happy to send you Tardigrade swag as a thank you.&lt;/p&gt;

&lt;p&gt;Thanks for reading and learning with us today. Happy Querying!&lt;/p&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://www.reddit.com/r/dataengineering/comments/uj2aez/etl_at_trino_speeds_and_a_stepbystep_tutorial_on/&quot;&gt;Discuss on Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a class=&quot;btn btn-pink btn-md waves-effect waves-light&quot; href=&quot;https://news.ycombinator.com/item?id=31276058&quot;&gt;Discuss On Hacker News&lt;/a&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Andrii Rosa, Brian Olsen, Brian Zhan, Lukasz Osipiuk, Martin Traverso, Zebing Lin</name>
        </author>
      

      <summary>After six months of challenging work on Project Tardigrade, we are ready to launch. With the project we improved the user experience of running resource intensive queries that are common in the Extract, Transform, Load (ETL) and batch processing space. It required some significant and fascinating engineering to get us to the current status. The latest Trino release includes all the work from Project Tardigrade. Read on to learn how it all works, and how to enable the fault-tolerant execution in Trino.</summary>

      
      
    </entry>
  
    <entry>
      <title>Tardigrade Project Update</title>
      <link href="https://trino.io/blog/2022/02/16/tardigrade-project-update.html" rel="alternate" type="text/html" title="Tardigrade Project Update" />
      <published>2022-02-16T00:00:00+00:00</published>
      <updated>2022-02-16T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2022/02/16/tardigrade-project-update</id>
      <content type="html" xml:base="https://trino.io/blog/2022/02/16/tardigrade-project-update.html">&lt;p&gt;Over the last couple of months we’ve added support for full query retries, landed experimental support 
for task level retries and provided a proof of concept implementation of a distributed exchange plugin 
(description below). We are still working on improving scheduling algorithms as well as optimizing 
exchange plugin implementation to make the task level retries fully usable.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Here is a quick summary of our progress so far:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Added support for &lt;a href=&quot;https://github.com/trinodb/trino/pull/9361&quot;&gt;automatic query retries&lt;/a&gt;. This functionality 
is ready to use and can be enabled by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;retry_policy=QUERY&lt;/code&gt; session property. Now 
&lt;a href=&quot;https://github.com/trinodb/trino/pull/10507&quot;&gt;it is possible&lt;/a&gt; to enable automatic retries for queries that 
produce more than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;32MB&lt;/code&gt; of output. Dynamic filtering is now also 
&lt;a href=&quot;https://github.com/trinodb/trino/pull/10274&quot;&gt;fully supported&lt;/a&gt; with automatic query retries enabled.&lt;/li&gt;
  &lt;li&gt;Landed an &lt;a href=&quot;https://github.com/trinodb/trino/pull/9818&quot;&gt;initial set of changes&lt;/a&gt; to support task level retries. 
To enable them, a plugin implementing the 
&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeManager.java&quot;&gt;ExchangeManager&lt;/a&gt; 
interface must be installed.&lt;/li&gt;
  &lt;li&gt;Landed a &lt;a href=&quot;https://github.com/trinodb/trino/pull/10823&quot;&gt;proof of concept implementation&lt;/a&gt; of the 
&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeManager.java&quot;&gt;ExchangeManager&lt;/a&gt; 
interface. The implementation is fully functional; however, we are still &lt;a href=&quot;https://github.com/trinodb/trino/issues/11050&quot;&gt;working on optimizing the read path&lt;/a&gt;. 
For now, only S3-compatible file systems are supported.&lt;/li&gt;
  &lt;li&gt;Added support for automatic retries in &lt;a href=&quot;https://github.com/trinodb/trino/issues/10252&quot;&gt;Hive&lt;/a&gt; and &lt;a href=&quot;https://github.com/trinodb/trino/pull/10622&quot;&gt;Iceberg&lt;/a&gt;. 
Supporting automatic retries for &lt;a href=&quot;https://github.com/trinodb/trino/issues/10254&quot;&gt;JDBC based connectors&lt;/a&gt; is up for grabs.&lt;/li&gt;
  &lt;li&gt;Implemented &lt;a href=&quot;https://github.com/trinodb/trino/pull/10837&quot;&gt;weight based split assignment&lt;/a&gt; for balanced work distribution between fault tolerant tasks.&lt;/li&gt;
  &lt;li&gt;Working on an &lt;a href=&quot;https://github.com/trinodb/trino/pull/11023&quot;&gt;adaptive sizing strategy for intermediate tasks&lt;/a&gt; to minimize scheduling overhead 
while keeping the cost of a single task failure to a minimum.&lt;/li&gt;
  &lt;li&gt;Making progress on introducing an &lt;a href=&quot;https://github.com/trinodb/trino/pull/10432&quot;&gt;advanced memory aware scheduling&lt;/a&gt; that would allow us 
to better support memory intensive queries, improve resource utilization and ensure fair resource allocation between queries.&lt;/li&gt;
  &lt;li&gt;Started working on &lt;a href=&quot;https://github.com/trinodb/trino/issues/9935&quot;&gt;supporting dynamic filtering&lt;/a&gt; for queries with task level retries enabled.&lt;/li&gt;
  &lt;li&gt;Working on &lt;a href=&quot;https://github.com/trinodb/trino/issues/10734&quot;&gt;accommodating failed attempts&lt;/a&gt; in various internal statistics reported by 
the engine (e.g.: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueryInfo&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueryCompletedEvent&lt;/code&gt;). &lt;a href=&quot;https://github.com/trinodb/trino/issues/10754&quot;&gt;UI changes&lt;/a&gt; will come next.&lt;/li&gt;
&lt;/ul&gt;
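
&lt;p&gt;As a quick sketch, trying out automatic query retries only requires the
session property mentioned above; the queried table here is just an example
using the TPC-H connector:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SET SESSION retry_policy = &apos;QUERY&apos;;
-- subsequent queries in this session are automatically retried on failure
SELECT count(*) FROM tpch.sf1000.lineitem;
&lt;/code&gt;&lt;/pre&gt;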

&lt;p&gt;Over the next couple of weeks we are planning to focus on:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/11050&quot;&gt;Optimizing read path for the reference implementation of the exchange plugin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Landing &lt;a href=&quot;https://github.com/trinodb/trino/pull/10432&quot;&gt;memory aware scheduling for fault tolerant execution&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Landing &lt;a href=&quot;https://github.com/trinodb/trino/pull/11023&quot;&gt;adaptive sizing for intermediate tasks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/10734&quot;&gt;Accommodating failed attempts into query statistics reporting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Making progress on &lt;a href=&quot;https://github.com/trinodb/trino/issues/9935&quot;&gt;supporting dynamic filtering&lt;/a&gt; for queries with task level retries enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current state of development can be tracked by following this &lt;a href=&quot;https://github.com/trinodb/trino/issues/9101&quot;&gt;issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;</content>

      
        <author>
          <name>Andrii Rosa</name>
        </author>
      

      <summary>Over the last couple of months we’ve added support for full query retries, landed experimental support for task level retries and provided a proof of concept implementation of a distributed exchange plugin (description below). We are still working on improving scheduling algorithms as well as optimizing exchange plugin implementation to make the task level retries fully usable.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino 2021 Wrapped: A Year of Growth</title>
      <link href="https://trino.io/blog/2021/12/31/trino-2021-a-year-of-growth.html" rel="alternate" type="text/html" title="Trino 2021 Wrapped: A Year of Growth" />
      <published>2021-12-31T00:00:00+00:00</published>
      <updated>2021-12-31T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/12/31/trino-2021-a-year-of-growth</id>
      <content type="html" xml:base="https://trino.io/blog/2021/12/31/trino-2021-a-year-of-growth.html">&lt;p&gt;As we reflect on Trino’s journey in 2021, one thing stands out: growth has
accelerated even beyond previous years. Yes, this is what all these
year-in-retrospect blog posts say, but this one carries some special
significance. This week marked the one-year anniversary since the 
project &lt;a href=&quot;https://trino.io/blog/2020/12/27/announcing-trino.html&quot;&gt;dropped the Presto name and moved to the Trino name&lt;/a&gt;.
Immediately after the announcement, the &lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;Trino GitHub repository&lt;/a&gt;
started trending in number of stargazers. Up until this point, the PrestoSQL
GitHub repository had only amassed 1,600 stargazers in the two years since it 
had split from the PrestoDB repository. However, within four months after the 
renaming, the number of stargazers had doubled. GitHub stars, issues, pull 
requests and commits started growing at a new trajectory.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;a href=&quot;https://twitter.com/bitsondatadev/status/1344028682126565381&quot; target=&quot;_blank&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;/assets/blog/2021-review/trending.png&quot; /&gt;
 &lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;At the time of writing, we just hit 4,600 stargazers on GitHub. This means we 
have grown by over 3,000 stargazers in the last year, a 187% increase. While we 
are on the subject, let’s talk about the health of the Trino community.&lt;/p&gt;

&lt;h2 id=&quot;2021-by-the-numbers&quot;&gt;2021 by the numbers&lt;/h2&gt;

&lt;p&gt;Let’s take a look at the Trino project growth by the numbers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;3679 new commits 💻 in GitHub&lt;/li&gt;
  &lt;li&gt;3015 new stargazers ⭐ in GitHub&lt;/li&gt;
  &lt;li&gt;2450 new members 👋 in Slack&lt;/li&gt;
  &lt;li&gt;1979 pull requests merged ✅ in GitHub&lt;/li&gt;
  &lt;li&gt;1213 issues 📝 created in GitHub&lt;/li&gt;
  &lt;li&gt;988 new followers 🐦 on Twitter&lt;/li&gt;
  &lt;li&gt;525 average weekly members 💬 in Slack&lt;/li&gt;
  &lt;li&gt;491 new subscribers 📺 in YouTube&lt;/li&gt;
  &lt;li&gt;23 Trino Community Broadcast ▶️ episodes&lt;/li&gt;
  &lt;li&gt;17 Trino 🚀 releases&lt;/li&gt;
  &lt;li&gt;13 blog ✍️ posts&lt;/li&gt;
  &lt;li&gt;10 Trino 🍕 meetups&lt;/li&gt;
  &lt;li&gt;1 Trino ⛰️ Summit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Along with the growth we’ve seen in GitHub, we have seen a 47% growth of &lt;a href=&quot;https://twitter.com/trinodb&quot;&gt;the Trino Twitter&lt;/a&gt; 
followers this year. &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;The Trino Slack community&lt;/a&gt;,
where a large amount of troubleshooting and development discussions occur, saw a
75% growth, nearing 6,000 members. Finally, &lt;a href=&quot;https://www.youtube.com/c/TrinoDB&quot;&gt;the Trino YouTube channel&lt;/a&gt;
has seen an impressive 280% growth in subscribers.&lt;/p&gt;

&lt;p&gt;A lot of the increase on this channel was due to the &lt;a href=&quot;/broadcast/&quot;&gt;Trino Community Broadcast&lt;/a&gt;, 
which brought users and contributors from the community to cover 23 episodes
about the following topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;7 episodes on the Trino ecosystem (dbt, Amundsen, Debezium, Superset)&lt;/li&gt;
  &lt;li&gt;4 episodes on the Trino project (Renaming Trino, Intro to Trino, Trinewbies)&lt;/li&gt;
  &lt;li&gt;4 episodes on Trino connectors (Iceberg, Druid, Pinot)&lt;/li&gt;
  &lt;li&gt;4 episodes on Trino internals (Distributed Hash-Joins, Dynamic Filtering, Views)&lt;/li&gt;
  &lt;li&gt;2 episodes on Trino using Kubernetes (Trinetes series)&lt;/li&gt;
  &lt;li&gt;2 episodes on Trino users (LinkedIn, Resurface)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While stargazers, subscribers, episodes, and followers tell the story of the 
growing awareness of the Trino project with the new name, what about the actual
rate of development on the project?&lt;/p&gt;

&lt;p&gt;At the start of the year, there were 21,924 commits. This year, we pushed 3,679 
commits to the repository, sitting at over 25,600 now. Looking at the graph, this
keeps us pretty consistent with 2020’s throughput.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;75%&quot; src=&quot;/assets/blog/2021-review/commits.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;With the project’s trajectory displayed in numbers, let’s examine the top 
features that landed in Trino this year.&lt;/p&gt;

&lt;h2 id=&quot;features&quot;&gt;Features&lt;/h2&gt;

&lt;p&gt;Here’s a high-level list of the most exciting features that made their way into
Trino in 2021. For details and to keep up you can check out the &lt;a href=&quot;https://trino.io/docs/current/release.html&quot;&gt;release notes&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;sql-language-improvements&quot;&gt;SQL language improvements&lt;/h3&gt;

&lt;p&gt;SQL language support is crucial as the complexity of queries and the usage of
Trino increase. In 2021 we added numerous new language features and 
improvements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/blog/2021/05/19/row_pattern_matching.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt;&lt;/a&gt;,
a feature that allows for complex analysis across multiple rows. To learn more 
about this feature, watch &lt;a href=&quot;/episodes/23.html&quot;&gt;the Community Broadcast show&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/sql/select.html#window-clause&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt;&lt;/a&gt; clause.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/blog/2021/03/10/introducing-new-window-features.html#new%20features&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS&lt;/code&gt;&lt;/a&gt;
keywords for usage within a window function.&lt;/li&gt;
  &lt;li&gt;Time travel support and syntax, like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR VERSION AS OF&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR TIMESTAMP AS OF&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/sql/update.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;&lt;/a&gt; is supported.&lt;/li&gt;
  &lt;li&gt;Subquery expressions that return multiple columns. Example: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT x = (VALUES (1, &apos;a&apos;))&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Add support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER MATERIALIZED VIEW&lt;/code&gt; … &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RENAME TO&lt;/code&gt; …&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/functions/geospatial.html#from_geojson_geometry&quot;&gt;from_geojson_geometry/to_geojson_geometry&lt;/a&gt; functions.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/functions/ipaddress.html#ip-address-contains&quot;&gt;contains&lt;/a&gt; 
function for checking if a CIDR contains an IP address.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/functions/aggregate.html#listagg&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listagg&lt;/code&gt;&lt;/a&gt;
function that returns concatenated values separated by a specified separator.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/functions/string.html#soundex&quot;&gt;soundex&lt;/a&gt; function
that returns the phonetic representation of a string.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/functions/conversion.html#format_number&quot;&gt;format_number&lt;/a&gt; function.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/sql/set-time-zone.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET TIME ZONE&lt;/code&gt;&lt;/a&gt; to set the
 current time zone for the session.&lt;/li&gt;
  &lt;li&gt;Arbitrary queries in &lt;a href=&quot;https://trino.io/docs/current/sql/show-stats.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SHOW STATS&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT_CATALOG&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT_SCHEMA&lt;/code&gt; session functions.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TRUNCATE TABLE&lt;/code&gt; which allows for a more efficient delete.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DENY&lt;/code&gt; statement, which enables you to remove a user&apos;s or group&apos;s access via SQL.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN &amp;lt;catalog&amp;gt;&lt;/code&gt; clause to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE ROLE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DROP ROLE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GRANT ROLE&lt;/code&gt;, 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REVOKE ROLE&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET ROLE&lt;/code&gt; to specify the target catalog of the statement 
instead of using the current session catalog.&lt;/li&gt;
&lt;/ul&gt;
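
&lt;p&gt;A few of these additions in action. This is only a sketch: the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; table in the
time travel query is hypothetical, and time travel requires a connector that
supports it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- set the time zone for the current session
SET TIME ZONE &apos;America/Los_Angeles&apos;;

-- concatenate values with the new listagg function
SELECT listagg(name, &apos;, &apos;) WITHIN GROUP (ORDER BY name)
FROM tpch.tiny.region;

-- query a table as of a past point in time
SELECT * FROM orders FOR TIMESTAMP AS OF TIMESTAMP &apos;2021-12-01 00:00:00 UTC&apos;;
&lt;/code&gt;&lt;/pre&gt;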

&lt;h3 id=&quot;query-processing-improvements&quot;&gt;Query processing improvements&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Added support for automatic query retries (this feature is very experimental
with some limitations for now).&lt;/li&gt;
  &lt;li&gt;Transparent query retries.&lt;/li&gt;
  &lt;li&gt;Updated the behavior of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JSON&lt;/code&gt; cast to produce &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JSON&lt;/code&gt; objects instead
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JSON&lt;/code&gt; arrays.&lt;/li&gt;
  &lt;li&gt;Column and table lineage tracking in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueryCompletedEvent&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
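
&lt;p&gt;To illustrate the changed cast behavior, a named
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; now maps to a JSON
object keyed by field name (a sketch; the exact rendering of numeric values may
differ):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT CAST(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS JSON);
-- previously rendered as a JSON array; now produces an object such as {&quot;x&quot;:1,&quot;y&quot;:2.0}
&lt;/code&gt;&lt;/pre&gt;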

&lt;h2 id=&quot;performance-improvements&quot;&gt;Performance improvements&lt;/h2&gt;

&lt;p&gt;Improved performance for the following operations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Querying Parquet data for files containing column indexes.&lt;/li&gt;
  &lt;li&gt;Reading dictionary-encoded Parquet files.&lt;/li&gt;
  &lt;li&gt;Queries using &lt;a href=&quot;https://trino.io/docs/current/functions/window.html#rank&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rank()&lt;/code&gt;&lt;/a&gt; window function.&lt;/li&gt;
  &lt;li&gt;Queries using &lt;a href=&quot;https://trino.io/docs/current/functions/aggregate.html#sum&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum()&lt;/code&gt;&lt;/a&gt;
and &lt;a href=&quot;https://trino.io/docs/current/functions/aggregate.html#avg&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;avg()&lt;/code&gt;&lt;/a&gt; for 
decimal types.&lt;/li&gt;
  &lt;li&gt;Queries using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt; with single grouping column.&lt;/li&gt;
  &lt;li&gt;Aggregation on decimal values.&lt;/li&gt;
  &lt;li&gt;Evaluation of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; clause.&lt;/li&gt;
  &lt;li&gt;Computing the product of decimal values with precision larger than 19.&lt;/li&gt;
  &lt;li&gt;Queries that process row or array data.&lt;/li&gt;
  &lt;li&gt;Queries that contain a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISTINCT&lt;/code&gt; clause.&lt;/li&gt;
  &lt;li&gt;Reduced memory usage and improved performance of joins.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY LIMIT&lt;/code&gt; performance was improved when data was pre-sorted.&lt;/li&gt;
  &lt;li&gt;Node-local dynamic filtering.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;security&quot;&gt;Security&lt;/h2&gt;

&lt;p&gt;Added the following improvements and features relevant for authentication, 
authorization and integration with other security systems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Automatic configuration of TLS for 
&lt;a href=&quot;https://trino.io/docs/current/security/internal-communication.html&quot;&gt;secure internal communication&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Handling of Server Name Indication (SNI) for multiple TLS certificates.
This removes the need to provision per-worker TLS certificates.&lt;/li&gt;
  &lt;li&gt;Access control for materialized views.&lt;/li&gt;
  &lt;li&gt;OAuth2/OIDC &lt;a href=&quot;https://trino.io/docs/current/security/oauth2.html&quot;&gt;opaque access tokens&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Configuring HTTP proxy for OAuth2 authentication.&lt;/li&gt;
  &lt;li&gt;Configuring &lt;a href=&quot;https://trino.io/docs/current/security/authentication-types.html#multiple-password-authenticators&quot;&gt;multiple password authentication plugins&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Hiding inaccessible columns from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT *&lt;/code&gt; statement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;data-sources&quot;&gt;Data Sources&lt;/h2&gt;

&lt;h3 id=&quot;bigquery-connector&quot;&gt;BigQuery connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE TABLE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DROP TABLE&lt;/code&gt; support.&lt;/li&gt;
  &lt;li&gt;Added support for case insensitive name matching for BigQuery views.&lt;/li&gt;
  &lt;li&gt;Support reading &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bignumeric&lt;/code&gt; type whose precision is less than or equal to 
38.&lt;/li&gt;
  &lt;li&gt;Added support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE SCHEMA&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DROP SCHEMA&lt;/code&gt; statements.&lt;/li&gt;
  &lt;li&gt;Improved support for BigQuery datetime and timestamp types.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;cassandra-connector&quot;&gt;Cassandra connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Mapped Cassandra &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uuid&lt;/code&gt; type to Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uuid&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Added support for Cassandra &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tuple&lt;/code&gt; type.&lt;/li&gt;
  &lt;li&gt;Changed minimum number of speculative executions from two to one.&lt;/li&gt;
  &lt;li&gt;Support for reading user-defined types.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;clickhouse-connector&quot;&gt;ClickHouse connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Added &lt;a href=&quot;https://trino.io/docs/current/connector/clickhouse.html&quot;&gt;ClickHouse connector&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Improved performance of aggregation queries by computing aggregations within 
ClickHouse. Currently, the following aggregate functions are eligible for
pushdown: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;min&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;avg&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Added support for dropping columns.&lt;/li&gt;
  &lt;li&gt;Map ClickHouse &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UUID&lt;/code&gt; columns as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UUID&lt;/code&gt; type in Trino instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;hdfs-s3-azure-and-cloud-object-storage-systems&quot;&gt;HDFS, S3, Azure and cloud object storage systems&lt;/h3&gt;

&lt;p&gt;A core use case of Trino uses the Hive and Iceberg connectors to connect to
a data lake. These connectors differ from most, as Trino is the sole query engine
rather than a client calling another system. Here are some of the changes made
to these connectors:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Enabled Glue statistics to support better query planning when using AWS.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; support for ACID tables.&lt;/li&gt;
  &lt;li&gt;Many Hive view improvements.&lt;/li&gt;
  &lt;li&gt;Parquet column indexes.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_max_file_size&lt;/code&gt; configuration to control the file size of data written
by Trino.&lt;/li&gt;
  &lt;li&gt;Streaming uploads to S3 by default to improve performance and reduce disk usage.&lt;/li&gt;
  &lt;li&gt;Improved performance for tables with small files and partitioned tables.&lt;/li&gt;
  &lt;li&gt;Transparent redirection from a Hive catalog to Iceberg catalog if the table is
an Iceberg table.&lt;/li&gt;
  &lt;li&gt;Updated to Iceberg 0.11.0 behavior for transforms of dates and timestamps
before 1970.&lt;/li&gt;
  &lt;li&gt;Added procedure &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;system.flush_metadata_cache()&lt;/code&gt; to flush metadata caches.&lt;/li&gt;
  &lt;li&gt;Avoid generating splits for empty files.&lt;/li&gt;
  &lt;li&gt;Sped up Iceberg query performance when dynamic filtering can be leveraged.&lt;/li&gt;
  &lt;li&gt;Increased Iceberg performance when reading timestamps from Parquet files.&lt;/li&gt;
  &lt;li&gt;Improved Iceberg performance for queries on nested data through dereference
pushdown.&lt;/li&gt;
  &lt;li&gt;Added support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT OVERWRITE&lt;/code&gt; operations on S3-backed tables.&lt;/li&gt;
  &lt;li&gt;Made the Iceberg &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uuid&lt;/code&gt; type available.&lt;/li&gt;
  &lt;li&gt;Trino views made available in Iceberg.&lt;/li&gt;
&lt;/ul&gt;
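
&lt;p&gt;Two of these additions as a sketch, assuming a Hive catalog named
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive&lt;/code&gt; (the property
value shown is just an example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- cap the size of files written by Trino for this session
SET SESSION hive.target_max_file_size = &apos;1GB&apos;;

-- flush the metadata caches of the catalog
CALL hive.system.flush_metadata_cache();
&lt;/code&gt;&lt;/pre&gt;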

&lt;h3 id=&quot;elasticsearch-connector&quot;&gt;Elasticsearch connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Added support for reading fields as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json&lt;/code&gt; values.&lt;/li&gt;
  &lt;li&gt;Fixed failure when documents contain fields of unsupported types.&lt;/li&gt;
  &lt;li&gt;Added support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scaled_float&lt;/code&gt; type.&lt;/li&gt;
  &lt;li&gt;Added support for assuming an IAM role.&lt;/li&gt;
  &lt;li&gt;Added retry requests with backoff when Elasticsearch is overloaded.&lt;/li&gt;
  &lt;li&gt;Better support for Elastic Cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;mongodb-connector&quot;&gt;MongoDB connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Added &lt;a href=&quot;https://trino.io/docs/current/connector/mongodb.html#timestamp_objectid&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp_objectid()&lt;/code&gt;&lt;/a&gt;
function.&lt;/li&gt;
  &lt;li&gt;Enabled &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mongodb.socket-keep-alive&lt;/code&gt; config property by default.&lt;/li&gt;
  &lt;li&gt;Add support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json&lt;/code&gt; type.&lt;/li&gt;
  &lt;li&gt;Support reading MongoDB &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DBRef&lt;/code&gt; type.&lt;/li&gt;
  &lt;li&gt;Allow skipping creation of an index for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_schema&lt;/code&gt; collection, if it 
already exists.&lt;/li&gt;
  &lt;li&gt;Added support to redact the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mongodb.credentials&lt;/code&gt; in the server log.&lt;/li&gt;
  &lt;li&gt;Added support for dropping columns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;mysql-connector&quot;&gt;MySQL connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Added support for reading and writing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp&lt;/code&gt; values with precision higher
than three.&lt;/li&gt;
  &lt;li&gt;Added support for predicate pushdown on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp&lt;/code&gt; columns.&lt;/li&gt;
  &lt;li&gt;Exclude an internal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys&lt;/code&gt; schema from schema listings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;pinot-connector&quot;&gt;Pinot connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Updated Pinot connector to be compatible with versions &amp;gt;= 0.8.0 and dropped 
support for older versions.&lt;/li&gt;
  &lt;li&gt;Added support for pushdown of filters on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;varbinary&lt;/code&gt; columns to Pinot.&lt;/li&gt;
  &lt;li&gt;Fixed incorrect results for queries that contain aggregations and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN&lt;/code&gt; and 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT IN&lt;/code&gt; filters over varchar columns.&lt;/li&gt;
  &lt;li&gt;Fixed failure for queries with filters on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt; columns having 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+Infinity&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Infinity&lt;/code&gt; values.&lt;/li&gt;
  &lt;li&gt;Implemented aggregation pushdown.&lt;/li&gt;
  &lt;li&gt;Allowed HTTPS URLs in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pinot.controller-urls&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;phoenix-connector&quot;&gt;Phoenix connector&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Phoenix 5 support was added.&lt;/li&gt;
  &lt;li&gt;Reduced memory usage for some queries.&lt;/li&gt;
  &lt;li&gt;Improved performance by adding ability to parallelize queries within Trino.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;features-added-to-various-connectors&quot;&gt;Features added to various connectors&lt;/h3&gt;

&lt;p&gt;In addition to the above some more features were added that apply to connectors
that use common code. These features improve performance using:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-352.html#mysql-connector&quot;&gt;Statistical aggregate function pushdown &lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-353.html&quot;&gt;TopN pushdown and join pushdown&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-353.html&quot;&gt;Improved planning times by reducing number of connections opened&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-356.html&quot;&gt;Improved performance by improving metadata caching hit rate&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-357.html&quot;&gt;Rule based identifier mapping support&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-360.html&quot;&gt;DELETE, non-transactional inserts and write-batch-size &lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-361.html&quot;&gt;Metadata cache max size&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-365.html&quot;&gt;TRUNCATE TABLE&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-366.html&quot;&gt;Improved handling of the Gregorian-Julian switch for the date type&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Ensured correctness when pushing down predicates and TopN to remote systems 
that are case-insensitive or sort differently from Trino.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;runtime-improvements&quot;&gt;Runtime improvements&lt;/h2&gt;

&lt;p&gt;There are a lot of performance improvements to list from the &lt;a href=&quot;https://trino.io/docs/current/release.html&quot;&gt;release notes&lt;/a&gt;.
Here are a few examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Improved coordinator CPU utilization.&lt;/li&gt;
  &lt;li&gt;Improved query performance by reducing CPU overhead of repartitioning data 
across worker nodes.&lt;/li&gt;
  &lt;li&gt;Reduced graceful shutdown time for worker nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;everything-else&quot;&gt;Everything else&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Added an &lt;a href=&quot;https://trino.io/docs/current/admin/event-listeners-http.html&quot;&gt;HTTP event listener&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Added support for ARM64 in the &lt;a href=&quot;https://hub.docker.com/r/trinodb/trino&quot;&gt;Trino Docker image&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clear&lt;/code&gt; command to the Trino CLI to clear the screen.&lt;/li&gt;
  &lt;li&gt;Improved tab completion for the Trino CLI.&lt;/li&gt;
  &lt;li&gt;Added support for custom connector metrics.&lt;/li&gt;
  &lt;li&gt;Fixed many, many, many bugs!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;trino-summit&quot;&gt;Trino Summit&lt;/h2&gt;

&lt;p&gt;In 2021 we also enjoyed a successful inaugural Trino Summit, hosted by 
Starburst, with well over 500 attendees. Wonderful talks were given at this 
event by speakers from companies like Doordash, EA, LinkedIn, Netflix, 
Robinhood, Stream Native, and Tabular. If you missed this event, we have the 
&lt;a href=&quot;https://www.starburst.io/resources/trino-summit/&quot;&gt;recordings and slides available&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a teaser, the event started with Commander Bun Bun playing guitar to AC/DC’s
“Back In Black”.&lt;/p&gt;

&lt;iframe src=&quot;https://www.youtube.com/embed/c_qUp0SGeKE&quot; width=&quot;800&quot; height=&quot;500&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; 
margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; 
&lt;/iframe&gt;

&lt;h2 id=&quot;renaming-from-prestosql-to-trino&quot;&gt;Renaming from PrestoSQL to Trino&lt;/h2&gt;

&lt;p&gt;As mentioned above, we renamed the project this year. What followed was an 
outpouring of support and shock from the larger tech community. Community 
members immediately got to work. The project had to change the namespace 
practically overnight from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io.prestosql&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io.trino&lt;/code&gt;, and a 
&lt;a href=&quot;https://trino.io/blog/2021/01/04/migrating-from-prestosql-to-trino.html&quot;&gt;migration blog post&lt;/a&gt;
was published. Because the Linux Foundation moved hastily to enforce the
Presto trademark, users had to adapt quickly.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;a href=&quot;https://twitter.com/trinodb/status/1343330429684703232?s=20&quot; target=&quot;_blank&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;100%&quot; src=&quot;/assets/blog/2021-review/tweets.png&quot; /&gt;
 &lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This &lt;a href=&quot;https://stackoverflow.com/questions/67414714&quot;&gt;confused many in the community&lt;/a&gt;,
especially once the old PrestoSQL accounts were taken down by the
Linux Foundation. The &lt;a href=&quot;https://prestosql.io&quot;&gt;https://prestosql.io&lt;/a&gt; site had broken documentation links,
JDBC URLs had to change from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jdbc:presto&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jdbc:trino&lt;/code&gt;, header protocol
names had to be changed from the prefix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X-Presto-&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X-Trino-&lt;/code&gt;, and various other
user-impacting changes had to be made in a matter of weeks. Even the legacy 
Docker images were removed from the &lt;a href=&quot;https://hub.docker.com/r/prestosql/presto&quot;&gt;prestosql/presto Docker repository&lt;/a&gt;,
causing disruptions for many users, who immediately had to upgrade to the 
&lt;a href=&quot;https://hub.docker.com/r/trinodb/trino&quot;&gt;trinodb/trino Docker repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We reached out to multiple projects to update their compatibility with
Trino:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/dbeaver/dbeaver/pull/10925&quot;&gt;DBeaver&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pinterest/querybook/issues/509&quot;&gt;QueryBook&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/Homebrew/homebrew-core/pull/83185&quot;&gt;Homebrew&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/dbt-labs/dbt-presto/issues/39&quot;&gt;dbt&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/dungdm93/sqlalchemy-trino/issues/20&quot;&gt;sqlalchemy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/sqlpad/sqlpad/pull/974&quot;&gt;sqlpad&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/superset/pull/13105&quot;&gt;Apache Superset&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/getredash/redash/pull/5411&quot;&gt;Redash&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/akullpp/awesome-java/pull/917&quot;&gt;Awesome Java&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/MunGell/awesome-for-beginners/pull/933&quot;&gt;Awesome For Beginners&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/airflow/pull/15187&quot;&gt;Airflow&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/lyft/presto-gateway/issues/134&quot;&gt;trino-gateway&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/metabase/metabase/issues/17532&quot;&gt;Metabase&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;and so much more…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite the breaking changes, once the immediate hurdles were behind us, the 
community was not only excited and supportive about the brand change, but 
also particularly fond of the new mascot. Our adorable bunny was soon 
after &lt;a href=&quot;/episodes/10.html&quot;&gt;named Commander Bun Bun by the community&lt;/a&gt;.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;a href=&quot;https://twitter.com/jtannady/status/1346888143459545092&quot; target=&quot;_blank&quot;&gt;
   &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;/assets/blog/2021-review/cbb.png&quot; /&gt;
 &lt;/a&gt;
&lt;/p&gt;

&lt;h2 id=&quot;2022-roadmap-project-tardigrade&quot;&gt;2022 Roadmap: Project Tardigrade&lt;/h2&gt;

&lt;p&gt;One of the interesting developments that came out of Trino Summit was a feature
that Trino co-creator Martin talked about in &lt;a href=&quot;https://www.starburst.io/resources/trino-summit/?wchannelid=2ug6mgs5ao&amp;amp;wmediaid=o264qw85dj&quot;&gt;the State of Trino presentation&lt;/a&gt;.
He proposed adding granular fault-tolerance and performance-improving features 
to the core engine. While Trino has been proven to run batch analytics workloads
at scale, many have avoided long-running batch jobs for fear of query failures. 
The fault-tolerance feature is a first step toward first-class support in the 
Trino project for long-running batch queries at massive scale.&lt;/p&gt;

&lt;p&gt;The granular fault-tolerance is being thoughtfully crafted to maintain the 
speed advantage that Trino has over other query engines, while increasing the 
resiliency of queries. In other words, rather than restarting an entire query 
when it runs out of resources or fails for any other reason, only a subset of 
the query is retried. To support this, intermediate stage data is persisted to 
replicated RAM or SSD.&lt;/p&gt;

&lt;p&gt;&lt;a title=&quot;Schokraie E, Warnken U, Hotz-Wagenblatt A, Grohme MA, Hengherr S, et al. (2012), CC BY 2.5 &amp;lt;https://creativecommons.org/licenses/by/2.5&amp;gt;, via Wikimedia Commons&quot; href=&quot;https://commons.wikimedia.org/wiki/File:SEM_image_of_Milnesium_tardigradum_in_active_state_-_journal.pone.0045682.g001-2.png&quot;&gt;&lt;img width=&quot;512&quot; alt=&quot;SEM image of Milnesium tardigradum in active state - journal.pone.0045682.g001-2&quot; src=&quot;https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/SEM_image_of_Milnesium_tardigradum_in_active_state_-_journal.pone.0045682.g001-2.png/512px-SEM_image_of_Milnesium_tardigradum_in_active_state_-_journal.pone.0045682.g001-2.png&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project to introduce granular fault-tolerance into Trino is called
Project Tardigrade. It is a focus for many contributors now, and we will 
share details in the coming months. The project is named after the 
microscopic tardigrades, the world’s most indestructible creatures, akin 
to the resiliency we are adding to Trino’s queries. We look forward to telling 
you more as features unfold.&lt;/p&gt;

&lt;p&gt;Along with Project Tardigrade comes a series of changes focused on faster
performance in the query engine using columnar evaluation, adaptive planning,
and better scheduling for SIMD and GPU processors. We will also be working on
dynamically resolved functions, MERGE support, time travel queries in data lake
connectors, Java 17, improved caching mechanisms, and much, much more!&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In summary, living this first year under the banner of Trino was nothing short
of a wild endeavor. Any engineer knows that naming things is hard, and renaming
things is all the more difficult.&lt;/p&gt;

&lt;p&gt;As we head into 2022, we can be certain of one thing. Trino will be reaching 
into newer areas of development and breaking norms just as it did as Presto in 
previous eras. The addition of native fault-tolerance to a lightning-fast query
engine will bring Trino to a new level of adoption. Keep your eyes peeled for 
more about Project Tardigrade.&lt;/p&gt;

&lt;p&gt;Along with Project Tardigrade, we are looking forward to another year filled
with features, issues, and suggestions from our amazing and passionate community.
Thank you all for an incredible year. We can’t wait to see what you all bring in
2022!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen, Martin Traverso, Manfred Moser</name>
        </author>
      

      <summary>As we reflect on Trino’s journey in 2021, one thing stands out. Compared to previous years we have seen even further accelerated, tremendous growth. Yes, this is what all these year-in-retrospect blog posts say, but this has some special significance to it. This week marked the one-year anniversary since the project dropped the Presto name and moved to the Trino name. Immediately after the announcement, the Trino GitHub repository started trending in number of stargazers. Up until this point, the PrestoSQL GitHub repository had only amassed 1,600 stargazers in the two years since it had split from the PrestoDB repository. However, within four months after the renaming, the number of stargazers had doubled. GitHub stars, issues, pull requests and commits started growing at a new trajectory.</summary>

      
      
    </entry>
  
    <entry>
      <title>Log4Shell does not affect Trino</title>
      <link href="https://trino.io/blog/2021/12/13/log4shell-does-not-affect-trino.html" rel="alternate" type="text/html" title="Log4Shell does not affect Trino" />
      <published>2021-12-13T00:00:00+00:00</published>
      <updated>2021-12-13T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/12/13/log4shell-does-not-affect-trino</id>
      <content type="html" xml:base="https://trino.io/blog/2021/12/13/log4shell-does-not-affect-trino.html">&lt;p&gt;In the last few days we had a surge of folks in our community reaching out with
concerns over the &lt;a href=&quot;https://www.lunasec.io/docs/blog/log4j-zero-day/&quot;&gt;Log4Shell exploit&lt;/a&gt;
(&lt;a href=&quot;https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228&quot;&gt;CVE-2021-44228&lt;/a&gt;),
and we want to inform you that &lt;strong&gt;Trino is not affected&lt;/strong&gt;. Trino does not use log4j
in the core engine or runtime classes. Some connectors pull in the log4j 
dependency from client libraries, but it is either not used or is not a 
version affected by the Log4Shell vulnerability. Regular security reviews, 
including code and dependency analysis, are part of our standard development 
process. As we learn more, we will update the code to keep vulnerabilities out 
of the codebase.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;50%&quot; src=&quot;/assets/blog/log4shell/log4shell.jpeg&quot; /&gt;
&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;trino-connectors-with-the-log4j-dependency&quot;&gt;Trino connectors with the Log4j dependency&lt;/h2&gt;

&lt;p&gt;If you do a search in the Trino repository, you’ll notice that log4j shows up 
as a direct dependency in two of the connectors, Accumulo and Elasticsearch.&lt;/p&gt;

&lt;h3 id=&quot;accumulo&quot;&gt;Accumulo&lt;/h3&gt;

&lt;p&gt;The Accumulo connector depends on log4j 1.2.17, which, although not vulnerable
to Log4Shell, has other vulnerabilities. These vulnerabilities do not apply to 
how we’ve used the loggers in the connector code. To be clear, despite the small
use of this logger in the Accumulo connector, there is still no threat even if 
you are using it. We are &lt;a href=&quot;https://github.com/trinodb/trino/issues/8781&quot;&gt;working on removing&lt;/a&gt;
the uses of this log4j library in an upcoming release to avoid any confusion.&lt;/p&gt;

&lt;h3 id=&quot;elasticsearch&quot;&gt;Elasticsearch&lt;/h3&gt;

&lt;p&gt;The Elasticsearch connector did have an affected dependency 
&lt;a href=&quot;https://github.com/trinodb/trino/commit/2018a94253d48cfdce283538855ee65950f9be3d&quot;&gt;that was recently removed&lt;/a&gt;.
Log4j was not being used in the connector, so despite the presence of the 
dependency in the Elasticsearch connector, there was no direct use of the 
vulnerable library.&lt;/p&gt;

&lt;h2 id=&quot;avoiding-future-introduction-of-log4shell&quot;&gt;Avoiding future introduction of Log4Shell&lt;/h2&gt;

&lt;p&gt;We take security seriously on the Trino project, as it provides a single point 
of access to your data sources. We’re taking precautionary measures to prevent 
the vulnerability from creeping its way into future versions. In version
366, we’re removing that dependency and &lt;a href=&quot;https://github.com/trinodb/trino/commit/10ba96c63ed3875d9dcca335e49bc73f5c0a6a8c&quot;&gt;adding a dedicated rule&lt;/a&gt;
to the build process to ban log4j as a direct dependency.&lt;/p&gt;

&lt;h2 id=&quot;what-should-you-do&quot;&gt;What should you do?&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Rest assured that there is no vulnerability in your Trino cluster.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you’ve created your own plugin with one of the affected log4j libraries, 
you should upgrade as quickly as possible to 2.15.0 or higher.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In the coming weeks, upgrade to the 366 release at your convenience.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
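
&lt;p&gt;For the second point, if your plugin build uses Maven, one possible sketch is a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dependencyManagement&lt;/code&gt; entry like the following in your plugin’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pom.xml&lt;/code&gt; to pin the patched version. The coordinates shown are the standard
log4j 2 ones; use the latest patched version available to you:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;dependencyManagement&amp;gt;
  &amp;lt;dependencies&amp;gt;
    &amp;lt;dependency&amp;gt;
      &amp;lt;groupId&amp;gt;org.apache.logging.log4j&amp;lt;/groupId&amp;gt;
      &amp;lt;artifactId&amp;gt;log4j-core&amp;lt;/artifactId&amp;gt;
      &amp;lt;version&amp;gt;2.15.0&amp;lt;/version&amp;gt;
    &amp;lt;/dependency&amp;gt;
  &amp;lt;/dependencies&amp;gt;
&amp;lt;/dependencyManagement&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;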

&lt;p&gt;We know there can be a lot of concern when vulnerabilities come up. We wish you
all the best of luck while you work hard to mitigate the risk of exploits in 
your systems. If you have any questions, reach out on the &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Trino Slack&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>In the last few days we had a surge of folks in our community reaching out with concerns over the Log4Shell exploit (CVE-2021-44228), and we want to inform you that Trino is not affected. Trino does not use log4j in the core engine or runtime classes. There are some connectors that include the log4j dependency from client dependencies, but are either not used or are not versions affected by the Log4Shell vulnerability. Regular security reviews, including code and dependency analysis, are part of the regular development process. As we learn more we will update the code to keep vulnerabilities out of the code.</summary>

      
      
    </entry>
  
    <entry>
      <title>JVM challenges in production</title>
      <link href="https://trino.io/blog/2021/10/06/jvm-issues-at-comcast.html" rel="alternate" type="text/html" title="JVM challenges in production" />
      <published>2021-10-06T00:00:00+00:00</published>
      <updated>2021-10-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/10/06/jvm-issues-at-comcast</id>
      <content type="html" xml:base="https://trino.io/blog/2021/10/06/jvm-issues-at-comcast.html">&lt;p&gt;At Comcast, we have a large on-premise Trino cluster. It enables us to extract
insights from data no matter where it resides, and prepares the company for a
more cloud-centric future. Recently, however, we experienced and overcame
challenges related to the Java virtual machine (JVM). We wanted to share what
we encountered and learned in hopes that it might be useful for the Trino
community.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;jit-recompilation&quot;&gt;JIT recompilation&lt;/h2&gt;

&lt;p&gt;Some users complained that nightly reports were taking far too long to
complete. Queries that ran for six hours made very little progress.&lt;/p&gt;

&lt;p&gt;First, we looked at the queries involved in these nightly reports. We
noticed that all these queries involved two particular tables. In this post,
let’s call them table A and table B.&lt;/p&gt;

&lt;p&gt;Our initial suspicion was that there could be an issue with the table data in
HDFS. Thus, we tried to reproduce the performance problem by using queries that
performed simple scans against these tables.&lt;/p&gt;

&lt;p&gt;We tried a simple table scan with no filters, a range filter on a partitioned
column, and so on. We ran these queries multiple times, and execution times were
consistent. This ruled out a potential problem with HDFS.&lt;/p&gt;

&lt;p&gt;Next, we took a closer look at the portion of the slow-running queries
involving table A, and came up with the simplest possible query that could
demonstrate the problem. We discovered that the following query did not exhibit
the performance problem:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT
 count(a.c1)
FROM
 hive.schema1.A a, hive.schema2.B da
WHERE
 a.day_id = da.date_id
 AND a.day_id BETWEEN &apos;2021-03-22&apos; AND &apos;2021-04-21&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But adding a predicate, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a.c2 = &apos;4 (Success)&apos;&lt;/code&gt;, caused the performance problem
to appear:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT
 count(a.c1)
FROM
 hive.schema1.A a, hive.schema2.B da
WHERE
 a.day_id = da.date_id
 AND a.day_id BETWEEN &apos;2021-03-22&apos; AND &apos;2021-04-21&apos;
 AND a.c2 = &apos;4 (Success)&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We narrowed the problem down to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Scan/Filter/Project&lt;/code&gt; operator using the
output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; from Trino. For the query that performed as
expected, this stage had the following CPU stats:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CPU: 2.39h, Scheduled: 4.47h, Input: 17434967615 rows (357.47GB)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the version of the query with the additional predicate, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a.c2 = &apos;4 (Success)&apos;&lt;/code&gt;,
that exhibited the performance problem, the same stage has the following CPU
stats:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CPU: 3.73d, Scheduled: 48.01d, Input: 17052985227 rows (413.98GB)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This shows that for roughly the same amount of data, Trino used
significantly more CPU: 3.73 days versus 2.39 hours! Our next step was to
determine possible reasons.&lt;/p&gt;

&lt;p&gt;We generated a few &lt;a href=&quot;https://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html&quot;&gt;jstack&lt;/a&gt;
and Java flight recorder (JFR) profiles of the Trino Java process from
one of the worker nodes while the scan stage was running. After analyzing these
profiles, we found no obvious problem. Trino performed as expected.&lt;/p&gt;

&lt;p&gt;Next, we looked at the list of tasks in the web UI to see what the distribution
of CPU times for each stage was:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/jvm-issues-at-comcast/web_ui_before.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Some workers have tasks that only use up a few minutes of CPU time and others
have tasks that use up to 2 hours of CPU time! Different query runs showed this
happening on different workers, so it was not a problem with any one
individual worker.&lt;/p&gt;

&lt;p&gt;We discussed this with Starburst engineer, &lt;a href=&quot;https://github.com/findepi&quot;&gt;Piotr Findeisen&lt;/a&gt;,
and came to the conclusion that this could potentially be an issue with JVM
code deoptimization. After re-compiling a method a certain number of times,
the JVM refuses to do so any more and will run the method in interpreted
mode, which is much slower.&lt;/p&gt;

&lt;p&gt;The evidence for this is what we highlighted above: the CPU used by the
same tasks on different workers varies by a factor of approximately 30. This is
the typical difference between compiled and interpreted code, according to
Piotr’s experience at Starburst.&lt;/p&gt;

&lt;p&gt;The following JVM options were added to the Trino &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jvm.config&lt;/code&gt; file to help
with this issue:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-XX:PerMethodRecompilationCutoff=10000&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-XX:PerBytecodeRecompilationCutoff=10000&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These settings increase the recompilation cutoff limit. They have also been
included in the default &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jvm.config&lt;/code&gt; settings that ship with Trino since the
348 release.&lt;/p&gt;

&lt;p&gt;Since we have been running Trino in production since before that release, we
did not have these settings in our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jvm.config&lt;/code&gt;.&lt;/p&gt;
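
&lt;p&gt;As an illustrative sketch, and not a tuning recommendation, a worker’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jvm.config&lt;/code&gt; with these options added might look like this
(the heap size and GC flags here are placeholders for your own settings):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-server
-Xmx54G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;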

&lt;h3 id=&quot;initial-results&quot;&gt;Initial results&lt;/h3&gt;

&lt;p&gt;Execution time observed with the JVM options in place was 4 minutes and 51
seconds. The CPU stats for the scan/filter/project stage for this query now
look like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CPU: 3.22h, Scheduled: 7.21h, Input: 17631445897 rows (428.03GB)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The CPU used by individual tasks is much more uniform:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/jvm-issues-at-comcast/web_ui_after.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;code-cache&quot;&gt;Code cache&lt;/h2&gt;

&lt;p&gt;We noticed that the cluster’s overall CPU utilization decreased after the
cluster was up for a few days, and there would be a few workers where tasks
were running slowly.&lt;/p&gt;

&lt;p&gt;When looking at these workers with slow-running tasks, we found that system
load was very high:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@worker-node log]# uptime
 21:36:57 up 20 days, 20:39,  1 user,  load average: 149.92, 152.83, 144.82
[root@worker-node log]#
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We also noticed all these workers had messages like this in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;launcher.log&lt;/code&gt;
file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[219756.210s][warning][codecache] Try increasing the code heap size using -XX:ProfiledCodeHeapSize=
OpenJDK 64-Bit Server VM warning: CodeHeap &apos;profiled nmethods&apos; is full. Compiler has been disabled.
OpenJDK 64-Bit Server VM warning: Try increasing the code heap size using -XX:ProfiledCodeHeapSize=
CodeHeap &apos;non-profiled nmethods&apos;: size=258436Kb used=235661Kb max_used=257882Kb free=22774Kb
 bounds [0x00007f466f980000, 0x00007f467f5e1000, 0x00007f467f5e1000]
CodeHeap &apos;profiled nmethods&apos;: size=258432Kb used=207330Kb max_used=216383Kb free=51101Kb
 bounds [0x00007f465fd20000, 0x00007f466f980000, 0x00007f466f980000]
CodeHeap &apos;non-nmethods&apos;: size=7420Kb used=1881Kb max_used=3766Kb free=5538Kb
 bounds [0x00007f465f5e1000, 0x00007f465fab1000, 0x00007f465fd20000]
 total_blobs=64220 nmethods=62699 adapters=1432
 compilation: disabled (not enough contiguous free space left)
              stopped_count=4, restarted_count=3
 full_count=3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once the code cache is full, the JVM won’t compile any additional code until
space is freed.&lt;/p&gt;

&lt;p&gt;We were running with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-XX:ReservedCodeCacheSize&lt;/code&gt; JVM option set to 512M.
To see what’s taking up space in the code cache, we used jcmd:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jcmd &amp;lt;TRINO_PID&amp;gt; Compiler.CodeHeap_Analytics
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We ran this at various intervals so we could compare how the code cache changed
over time.&lt;/p&gt;

&lt;p&gt;30 of the top 48 non-profiled methods were &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PagesHashStrategy&lt;/code&gt; classes, which are
generated per query. These can’t be removed from the cache until the query
completes, so the amount of cache needed is relative to the
concurrency. We have a very busy cluster with significant concurrency at our
busiest times.&lt;/p&gt;

&lt;p&gt;Next, we set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-XX:ReservedCodeCacheSize&lt;/code&gt; to 2G to see if that would help. We
have not seen the code cache fill up since increasing the size to 2GB. We can
also monitor the size of the code cache over
time using JMX. One query that can be used if you have the JMX catalog enabled
on your cluster is:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT
    node,
    regexp_extract(usage, &apos;max=(-?\d*)&apos;, 1) as max,
    regexp_extract(usage, &apos;used=(-?\d*)&apos;, 1) AS used
FROM
  jmx.current.&quot;java.lang:name=codeheap &apos;non-profiled nmethods&apos;,type=memorypool&quot;
ORDER BY used DESC
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;off-heap-memory-usage&quot;&gt;Off heap memory usage&lt;/h2&gt;

&lt;p&gt;One final JVM issue we noticed in our production cluster was that off-heap
memory on some workers grew to be quite large. We allocate approximately 85%
of the physical memory on our workers for the JVM heap. Recently, we received
alerts from our monitoring systems that memory consumption on our workers got
dangerously close to the physical limit on the machines.&lt;/p&gt;

&lt;p&gt;We noticed some memory-related issues from the Alluxio client in the Trino
worker logs on machines generating these high memory alerts. Upon further
investigation, we noticed that Trino was running with the open source version
of the Alluxio client. Trino ships with version 2.4.0 of the Alluxio client. We
are an Alluxio customer and use it in our environment.&lt;/p&gt;

&lt;p&gt;After discussing with Alluxio, they suggested we upgrade to version 2.4.1 of
their Enterprise client which includes a fix for an off-heap memory leak bug.
After upgrading to the Alluxio Enterprise client, the off-heap memory usage
became a lot more stable.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;This post outlined some of the JVM issues we encountered while running Trino in
production. We only hit many of these issues in our production environment, and
they were difficult to replicate outside of it. Thus, we wanted to write up our 
experience in the hopes of helping other Trino users in the future!&lt;/p&gt;

      
        <author>
          <name>Sajumon Joseph, David Leach, Bryan Aller, Pavan Madhineni, Lavanya Ragothaman, Pratap Moturi, Pádraig O&apos;Sullivan (Starburst)</name>
        </author>
      

      <summary>At Comcast, we have a large on-premise Trino cluster. It enables us to extract insights from data no matter where it resides, and prepares the company for a more cloud-centric future. Recently, however, we experienced and overcame challenges related to the Java virtual machine (JVM). We wanted to share what we encountered and learned in hopes that it might be useful for the Trino community.</summary>

      
      
    </entry>
  
    <entry>
      <title>Announcing Trino Summit</title>
      <link href="https://trino.io/blog/2021/09/23/announcing_trino_summit.html" rel="alternate" type="text/html" title="Announcing Trino Summit" />
      <published>2021-09-23T00:00:00+00:00</published>
      <updated>2021-09-23T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/09/23/announcing_trino_summit</id>
      <content type="html" xml:base="https://trino.io/blog/2021/09/23/announcing_trino_summit.html">&lt;p&gt;Greetings Trino nation,&lt;/p&gt;

&lt;p&gt;Get ready for this year’s virtual Trino Summit event! This year’s summit feels a
little different as the name of the event has changed from Presto to Trino. So
this will be the first event of the project hosted &lt;a href=&quot;https://trino.io/blog/2020/12/27/announcing-trino.html&quot;&gt;under the new banner of Trino&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;This year’s Summit is hosted virtually by Starburst on October 21st and 22nd. We’d originally set the date for September 15th, but later realized that it conflicted with Yom Kippur. While we had originally set out to make this event a hybrid format, we had to make the difficult decision of moving the event to fully virtual in light of the growing health concerns around contracting and spreading the delta variant. If you haven’t registered yet, &lt;a href=&quot;http://starburst.io/trinosummit2021&quot;&gt;register here&lt;/a&gt;. If you planned on attending in person, we will still have your registration, and you will still be able to attend virtually.&lt;/p&gt;

&lt;p&gt;Get excited for our great lineup of speakers, panels, and presentations! We’re always on the lookout for speakers who are excited to share their Trino experiences.&lt;/p&gt;

&lt;p&gt;We look forward to seeing you there!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Greetings Trino nation, Get ready for this year’s virtual Trino Summit event! This year’s summit feels a little different as the name of the event has changed from Presto to Trino. So this will be the first event of the project hosted under the new banner of Trino.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino on ice IV: Deep dive into Iceberg internals</title>
      <link href="https://trino.io/blog/2021/08/12/deep-dive-into-iceberg-internals.html" rel="alternate" type="text/html" title="Trino on ice IV: Deep dive into Iceberg internals" />
      <published>2021-08-12T00:00:00+00:00</published>
      <updated>2021-08-12T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/08/12/deep-dive-into-iceberg-internals</id>
      <content type="html" xml:base="https://trino.io/blog/2021/08/12/deep-dive-into-iceberg-internals.html">&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;100%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/trino-iceberg.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Welcome to the Trino on ice series, covering the details around how the Iceberg
table format works with the Trino query engine. The examples build on each
previous post, so it’s recommended to read the posts sequentially and reference
them as needed later. Here are links to the posts in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/05/03/a-gentle-introduction-to-iceberg.html&quot;&gt;Trino on ice I: A gentle introduction to Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg.html&quot;&gt;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html&quot;&gt;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/08/12/deep-dive-into-iceberg-internals.html&quot;&gt;Trino on ice IV: Deep dive into Iceberg internals&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far, this series has covered some very interesting user-level concepts of the
Iceberg model, and how you can take advantage of them using the Trino query
engine. This blog post dives into some implementation details of Iceberg by
dissecting some of the files that result from various operations carried out
using Trino. Dissection requires some surgical instruments, namely Trino, Avro
tools, the MinIO client tool, and Iceberg’s core library. Dissecting these files
is useful not only to understand how Iceberg works, but also to aid in
troubleshooting, should you run into any issues while ingesting or querying your
Iceberg tables. I like to think of this type of debugging as a fun game of
Operation, where you’re looking to see what causes the red errors to fly by on
your screen.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-on-ice/operation.gif&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;understanding-iceberg-metadata&quot;&gt;Understanding Iceberg metadata&lt;/h2&gt;

&lt;p&gt;Iceberg can use any compatible metastore, but the Trino Iceberg connector
currently supports only the Hive metastore and AWS Glue, just like the Hive
connector. This is because there is already a vast amount of testing and support
for using the Hive metastore in Trino. Likewise, many Trino use cases that run
on data lakes already use the Hive connector, and therefore the Hive metastore.
This makes it convenient as the leading supported use case, since existing users
can easily migrate from Hive to Iceberg tables. The diagram of the Hive
connector architecture gives no indication of which connector is actually
executing, so it serves as a diagram for both Hive and Iceberg. The only
difference is the connector used: if you create a table in Hive, you can view
the same table in Iceberg.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-on-ice/iceberg-metadata.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To recap the steps taken in the first three blog posts: the first post created
an events table, and the first two posts each ran an insert statement. The first
insert contained three records, while the second insert contained a single
record.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-on-ice/iceberg-snapshot-files.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Up until this point, the state of the files in MinIO hasn’t really been shown,
except for some of the manifest list pointers from the snapshot in the third
blog post. Using the &lt;a href=&quot;https://docs.min.io/minio/baremetal/reference/minio-cli/minio-mc.html&quot;&gt;MinIO client tool&lt;/a&gt;,
you can list the files that Iceberg generated through all these operations and
then try to understand what purpose they serve.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% mc tree -f local/
local/
└─ iceberg
   └─ logging.db
      └─ events
         ├─ data
         │  ├─ event_time_day=2021-04-01
         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc
         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc
         │  └─ event_time_day=2021-04-02
         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc
         └─ metadata
            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json
            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json
            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro
            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro
            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro
            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro
            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are a lot of files here, but there are a couple of patterns that you
can observe.&lt;/p&gt;

&lt;p&gt;First, the top two directories are named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;metadata&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&amp;lt;bucket&amp;gt;/&amp;lt;database&amp;gt;/&amp;lt;table&amp;gt;/data/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&amp;lt;bucket&amp;gt;/&amp;lt;database&amp;gt;/&amp;lt;table&amp;gt;/metadata/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you might expect, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; contains the actual ORC files split by partition.
This is akin to what you would see in a Hive table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; directory. What is
really of interest here is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;metadata&lt;/code&gt; directory. There are specifically
three patterns of files you’ll find here.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&amp;lt;bucket&amp;gt;/&amp;lt;database&amp;gt;/&amp;lt;table&amp;gt;/metadata/&amp;lt;file-id&amp;gt;.avro&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&amp;lt;bucket&amp;gt;/&amp;lt;database&amp;gt;/&amp;lt;table&amp;gt;/metadata/snap-&amp;lt;snapshot-id&amp;gt;-&amp;lt;version&amp;gt;-&amp;lt;file-id&amp;gt;.avro&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&amp;lt;bucket&amp;gt;/&amp;lt;database&amp;gt;/&amp;lt;table&amp;gt;/metadata/&amp;lt;version&amp;gt;-&amp;lt;commit-UUID&amp;gt;.metadata.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Iceberg has a persistent tree structure that manages various snapshots of the
data that are created for every mutation of the data. This enables not only a
concurrency model that supports serializable isolation, but also cool features
like time travel across a linear progression of snapshots.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-on-ice/iceberg-metastore-files.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This tree structure contains two types of Avro files, manifest lists and
manifest files. Manifest list files contain pointers to various manifest files
and the manifest files themselves point to various data files. This post starts
out by covering these manifest files, and later covers the table metadata files
that are suffixed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.metadata.json&lt;/code&gt;.&lt;/p&gt;
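This tree can be pictured with a few plain data structures: a snapshot references one manifest list, each manifest-list entry references a manifest file, and each manifest entry references a data file. Here is a minimal Python sketch of that traversal, populated with the second snapshot from this post; the class and field names are illustrative only, not the Iceberg library's API:

```python
# Toy model of Iceberg's persistent tree: snapshot -> manifest list
# -> manifest files -> data files. All names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataFile:
    file_path: str
    record_count: int

@dataclass
class ManifestFile:
    manifest_path: str
    entries: List[DataFile] = field(default_factory=list)

@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: List[ManifestFile] = field(default_factory=list)

def all_data_files(snapshot: Snapshot) -> List[DataFile]:
    """Walk from a snapshot down to every data file it references."""
    return [df for mf in snapshot.manifest_list for df in mf.entries]

# The second snapshot: one manifest covering the two files created by the
# three-row insert (one row on 2021-04-01, two rows on 2021-04-02).
snap = Snapshot(
    snapshot_id=2720489016575682283,
    manifest_list=[
        ManifestFile(
            "metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro",
            [DataFile("data/event_time_day=2021-04-01/51eb1ea6.orc", 1),
             DataFile("data/event_time_day=2021-04-02/b012ec20.orc", 2)],
        )
    ],
)

print(len(all_data_files(snap)))                            # 2 data files
print(sum(df.record_count for df in all_data_files(snap)))  # 3 rows
```

A real reader does the same walk, except each hop is an Avro file fetched from object storage rather than an in-memory list.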

&lt;p&gt;&lt;a href=&quot;/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html&quot;&gt;The last blog covered&lt;/a&gt;
the command in Trino that shows the snapshot information that is stored in the
metastore. Here is that command and its output again for your review.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT manifest_list 
FROM iceberg.logging.&quot;events$snapshots&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;snapshots&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;You’ll notice that the query returns the Avro files prefixed with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snap-&lt;/code&gt;. These files are directly correlated with the snapshot
records stored in the metastore. As the diagram above shows, snapshots are
records in the metastore that contain the URL of the manifest list Avro
file. Avro files are binary files and not something you can just open up in a
text editor to read. Using the 
&lt;a href=&quot;https://downloads.apache.org/avro/avro-1.10.2/java/avro-tools-1.10.2.jar&quot;&gt;avro-tools.jar tool&lt;/a&gt;
distributed by the 
&lt;a href=&quot;https://avro.apache.org/docs/current/index.html&quot;&gt;Apache Avro project&lt;/a&gt;,
you can inspect the contents of these files to get a better understanding
of how they are used by Iceberg.&lt;/p&gt;

&lt;p&gt;The first snapshot was generated when the events table was created. Upon
inspecting this file, you notice that it is empty: the output is just a newline,
which the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jq&lt;/code&gt; JSON command-line utility strips when pretty-printing.
This snapshot represents the empty state of the table upon creation. To
investigate the snapshots you need to download the files to your local
filesystem. Let’s move them to the home directory:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result (is empty):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The second snapshot is a little more interesting and actually shows us the 
contents of a manifest list.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
   &quot;manifest_path&quot;:&quot;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&quot;,
   &quot;manifest_length&quot;:6114,
   &quot;partition_spec_id&quot;:0,
   &quot;added_snapshot_id&quot;:{
      &quot;long&quot;:2720489016575682000
   },
   &quot;added_data_files_count&quot;:{
      &quot;int&quot;:2
   },
   &quot;existing_data_files_count&quot;:{
      &quot;int&quot;:0
   },
   &quot;deleted_data_files_count&quot;:{
      &quot;int&quot;:0
   },
   &quot;partitions&quot;:{
      &quot;array&quot;:[
         {
            &quot;contains_null&quot;:false,
            &quot;lower_bound&quot;:{
               &quot;bytes&quot;:&quot;\u001eI\u0000\u0000&quot;
            },
            &quot;upper_bound&quot;:{
               &quot;bytes&quot;:&quot;\u001fI\u0000\u0000&quot;
            }
         }
      ]
   },
   &quot;added_rows_count&quot;:{
      &quot;long&quot;:3
   },
   &quot;existing_rows_count&quot;:{
      &quot;long&quot;:0
   },
   &quot;deleted_rows_count&quot;:{
      &quot;long&quot;:0
   }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To understand the values in each of these rows, you can refer to the
&lt;a href=&quot;https://iceberg.apache.org/spec/#manifest-lists&quot;&gt;manifest lists section of the Iceberg specification&lt;/a&gt;.
Instead of covering these exhaustively, let’s focus on a few key fields. Below
are the fields and their definitions according to the specification.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;manifest_path&lt;/code&gt; - Location of the manifest file.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partition_spec_id&lt;/code&gt; - ID of a partition spec used to write the manifest; must
be listed in table metadata partition-specs.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;added_snapshot_id&lt;/code&gt; - ID of the snapshot where the manifest file was added.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partitions&lt;/code&gt; - A list of field summaries for each partition field in the spec.
Each field in the list corresponds to a field in the manifest file’s partition
spec.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;added_rows_count&lt;/code&gt; - Number of rows in all files in the manifest that have
status ADDED, when null this is assumed to be non-zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As mentioned above, manifest lists hold references to various manifest files.
These manifest paths are the pointers in the persistent tree that tell any
client using Iceberg where to find all of the manifest files associated with a
particular snapshot. To traverse the tree, you look over the different manifest
paths to find all the manifest files associated with the particular snapshot
you want to traverse. Partition spec ids identify the partition specification
used to write the manifest, which is stored with the table metadata in the
metastore. Added snapshot ids tell you which snapshot is associated with the
manifest list. Partitions hold some high-level partition bound information to
make for faster querying: if a query is looking for a particular value, it only
traverses the manifest files where the query values fall within the range of the
file values. Finally, you get a few metrics, like the number of changed rows and
data files, one of which is the count of added rows. The first operation
inserted three rows and the second operation inserted one row, so using the row
counts you can easily determine which manifest file belongs to which
operation.&lt;/p&gt;
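The `lower_bound` and `upper_bound` byte strings in the output are Iceberg's single-value serialization of the partition value: an int is stored as four little-endian bytes, so `\u001eI\u0000\u0000` is `0x0000491e`, or 18718 days since the Unix epoch. A short sketch of how a reader could decode a bound and skip manifests whose range cannot match a query (illustrative code, not the connector's implementation):

```python
import struct

def decode_int_bound(raw: bytes) -> int:
    # Iceberg serializes int bounds as 4 little-endian bytes.
    return struct.unpack("<i", raw)[0]

def manifest_may_match(lower: bytes, upper: bytes, query_day: int) -> bool:
    # A manifest can be skipped when the queried partition value falls
    # outside its [lower_bound, upper_bound] range.
    return decode_int_bound(lower) <= query_day <= decode_int_bound(upper)

# Bounds from the manifest list above: "\u001eI\u0000\u0000" and "\u001fI\u0000\u0000".
lower = b"\x1eI\x00\x00"  # 18718 = 2021-04-01 as days since the epoch
upper = b"\x1fI\x00\x00"  # 18719 = 2021-04-02

print(decode_int_bound(lower))                  # 18718
print(manifest_may_match(lower, upper, 18719))  # True: scan this manifest
print(manifest_may_match(lower, upper, 18720))  # False: prune it
```
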

&lt;p&gt;The following command shows the final snapshot after both operations executed
and filters out only the fields pointed out above.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &apos;. | {manifest_path: .manifest_path, partition_spec_id: .partition_spec_id, added_snapshot_id: .added_snapshot_id, partitions: .partitions, added_rows_count: .added_rows_count }&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
   &quot;manifest_path&quot;:&quot;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&quot;,
   &quot;partition_spec_id&quot;:0,
   &quot;added_snapshot_id&quot;:{
      &quot;long&quot;:4564366177504223700
   },
   &quot;partitions&quot;:{
      &quot;array&quot;:[
         {
            &quot;contains_null&quot;:false,
            &quot;lower_bound&quot;:{
               &quot;bytes&quot;:&quot;\u001eI\u0000\u0000&quot;
            },
            &quot;upper_bound&quot;:{
               &quot;bytes&quot;:&quot;\u001eI\u0000\u0000&quot;
            }
         }
      ]
   },
   &quot;added_rows_count&quot;:{
      &quot;long&quot;:1
   }
}
{
   &quot;manifest_path&quot;:&quot;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&quot;,
   &quot;partition_spec_id&quot;:0,
   &quot;added_snapshot_id&quot;:{
      &quot;long&quot;:2720489016575682000
   },
   &quot;partitions&quot;:{
      &quot;array&quot;:[
         {
            &quot;contains_null&quot;:false,
            &quot;lower_bound&quot;:{
               &quot;bytes&quot;:&quot;\u001eI\u0000\u0000&quot;
            },
            &quot;upper_bound&quot;:{
               &quot;bytes&quot;:&quot;\u001fI\u0000\u0000&quot;
            }
         }
      ]
   },
   &quot;added_rows_count&quot;:{
      &quot;long&quot;:3
   }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the listing of the manifest list for the last snapshot, you notice that the
first operation, where three rows were inserted, is described by the manifest
file in the second JSON object. You can determine this from the snapshot id, as
well as the number of rows added in the operation. The first JSON object
contains the last operation, which inserted a single row. So operations are
listed in reverse commit order, with the most recent first.&lt;/p&gt;

&lt;p&gt;The next command performs the same kind of listing you ran on the manifest
list, except on the manifest files themselves, to expose their contents. To
begin with, run the command to show the contents of the manifest file associated
with the insertion of three rows.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avro_files/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
   &quot;status&quot;:1,
   &quot;snapshot_id&quot;:{
      &quot;long&quot;:2720489016575682000
   },
   &quot;data_file&quot;:{
      &quot;file_path&quot;:&quot;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&quot;,
      &quot;file_format&quot;:&quot;ORC&quot;,
      &quot;partition&quot;:{
         &quot;event_time_day&quot;:{
            &quot;int&quot;:18718
         }
      },
      &quot;record_count&quot;:1,
      &quot;file_size_in_bytes&quot;:870,
      &quot;block_size_in_bytes&quot;:67108864,
      &quot;column_sizes&quot;:null,
      &quot;value_counts&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:1
            },
            {
               &quot;key&quot;:2,
               &quot;value&quot;:1
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:1
            },
            {
               &quot;key&quot;:4,
               &quot;value&quot;:1
            }
         ]
      },
      &quot;null_value_counts&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:0
            },
            {
               &quot;key&quot;:2,
               &quot;value&quot;:0
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:0
            },
            {
               &quot;key&quot;:4,
               &quot;value&quot;:0
            }
         ]
      },
      &quot;nan_value_counts&quot;:null,
      &quot;lower_bounds&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:&quot;ERROR&quot;
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:&quot;Oh noes&quot;
            }
         ]
      },
      &quot;upper_bounds&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:&quot;ERROR&quot;
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:&quot;Oh noes&quot;
            }
         ]
      },
      &quot;key_metadata&quot;:null,
      &quot;split_offsets&quot;:null
   }
}
{
   &quot;status&quot;:1,
   &quot;snapshot_id&quot;:{
      &quot;long&quot;:2720489016575682000
   },
   &quot;data_file&quot;:{
      &quot;file_path&quot;:&quot;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&quot;,
      &quot;file_format&quot;:&quot;ORC&quot;,
      &quot;partition&quot;:{
         &quot;event_time_day&quot;:{
            &quot;int&quot;:18719
         }
      },
      &quot;record_count&quot;:2,
      &quot;file_size_in_bytes&quot;:1084,
      &quot;block_size_in_bytes&quot;:67108864,
      &quot;column_sizes&quot;:null,
      &quot;value_counts&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:2
            },
            {
               &quot;key&quot;:2,
               &quot;value&quot;:2
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:2
            },
            {
               &quot;key&quot;:4,
               &quot;value&quot;:2
            }
         ]
      },
      &quot;null_value_counts&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:0
            },
            {
               &quot;key&quot;:2,
               &quot;value&quot;:0
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:0
            },
            {
               &quot;key&quot;:4,
               &quot;value&quot;:0
            }
         ]
      },
      &quot;nan_value_counts&quot;:null,
      &quot;lower_bounds&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:&quot;ERROR&quot;
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:&quot;Double oh noes&quot;
            }
         ]
      },
      &quot;upper_bounds&quot;:{
         &quot;array&quot;:[
            {
               &quot;key&quot;:1,
               &quot;value&quot;:&quot;WARN&quot;
            },
            {
               &quot;key&quot;:3,
               &quot;value&quot;:&quot;Maybeh oh noes?&quot;
            }
         ]
      },
      &quot;key_metadata&quot;:null,
      &quot;split_offsets&quot;:null
   }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now this is a very big output, but in summary, there’s really not too much to
these files. As before, there is a 
&lt;a href=&quot;https://iceberg.apache.org/spec/#manifests&quot;&gt;Manifest section in the Iceberg spec&lt;/a&gt;
that details what each of these fields means. Here are the important fields:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_id&lt;/code&gt; - Snapshot id where the file was added, or deleted if status is
two. Inherited when null.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data_file&lt;/code&gt; - Field containing metadata about the data files pertaining to the
manifest file, such as file path, partition tuple, metrics, etc…&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data_file.file_path&lt;/code&gt; - Full URI for the file with FS scheme.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data_file.partition&lt;/code&gt; - Partition data tuple, schema based on the partition
spec.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data_file.record_count&lt;/code&gt; - Number of records in the data file.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data_file.*_count&lt;/code&gt; - Multiple fields that each contain a map from column id
to the number of values, nulls, or NaNs in the file. These can be used to
quickly filter out unnecessary get operations.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data_file.*_bounds&lt;/code&gt; - Multiple fields that contain a map from column id to
lower or upper bound in the column serialized as binary. Each value must be less
than or equal to all non-null, non-NaN values in the column for the file.&lt;/li&gt;
&lt;/ul&gt;
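One detail worth calling out in the output above: the `partition` value `18718` is how the `day` partition transform stores a date, as a count of whole days since the Unix epoch. You can confirm the mapping back to the partition directory names with a couple of lines of Python:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def day_partition_to_date(days: int) -> date:
    # Iceberg's day transform stores dates as whole days since 1970-01-01.
    return EPOCH + timedelta(days=days)

print(day_partition_to_date(18718))  # 2021-04-01, i.e. event_time_day=2021-04-01
print(day_partition_to_date(18719))  # 2021-04-02
```
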

&lt;p&gt;Each data file struct contains the partition and data file that it maps to. These
files are only scanned and returned if the criteria for the query are met when
checking all of the count, bounds, and other statistics recorded in the
manifest. Ideally, only files that contain data relevant to the query are
scanned at all. Having information like the record count may also help the
query planning process determine splits and other information. This
particular optimization hasn’t been completed yet, as planning typically happens
before traversal of the files. It is still under discussion and
&lt;a href=&quot;https://youtu.be/ifXpOn0NJWk?t=2132&quot;&gt;is discussed a bit by Iceberg creator Ryan Blue in a recent meetup&lt;/a&gt;.
If this is something you are interested in, stay posted on the Slack channel and
releases as the Trino Iceberg connector progresses in this area.&lt;/p&gt;

&lt;p&gt;As mentioned above, the last set of files that you find in the metadata
directory are suffixed with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.metadata.json&lt;/code&gt;. These files are a bit
strange at first glance, as they aren’t stored in the Avro format but in JSON.
This is because they are not part of the persistent tree structure.
These files are essentially a copy of the table metadata that is stored in the
metastore. You can find the fields for the table metadata listed
&lt;a href=&quot;https://iceberg.apache.org/spec/#table-metadata-fields&quot;&gt;in the Iceberg specification&lt;/a&gt;.
These tables are typically stored persistently in a metastore, much like the Hive
metastore, but that could easily be replaced by any datastore that supports
&lt;a href=&quot;https://iceberg.apache.org/spec/#metastore-tables&quot;&gt;an atomic swap (check-and-put) operation&lt;/a&gt;,
which Iceberg requires for its optimistic concurrency model.&lt;/p&gt;

&lt;p&gt;The naming of the table metadata includes a table version and UUID: 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;table-version&amp;gt;-&amp;lt;UUID&amp;gt;.metadata.json&lt;/code&gt;. To commit a new metadata version, which
just adds 1 to the current version number, the writer performs these steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It creates a new table metadata file using the current metadata.&lt;/li&gt;
  &lt;li&gt;It writes the new table metadata to a file following the naming with the next
version number.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It requests the metastore swap the table’s metadata pointer from the old
location to the new location.&lt;/p&gt;

    &lt;ol&gt;
      &lt;li&gt;If the swap succeeds, the commit succeeded. The new file is now the 
 current metadata.&lt;/li&gt;
      &lt;li&gt;If the swap fails, another writer has already created their own. The
 current writer goes back to step 1.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;
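The retry loop above can be sketched against a toy metastore that offers nothing but a check-and-put primitive. This is illustrative only; the real connector performs the swap through the Hive metastore, and the file names here are placeholders:

```python
class ToyMetastore:
    """Holds one metadata pointer per table. The swap succeeds only if the
    caller still holds the current value, i.e. check-and-put."""
    def __init__(self):
        self.pointers = {}

    def check_and_put(self, table, expected, new):
        if self.pointers.get(table) != expected:
            return False  # another writer committed first
        self.pointers[table] = new
        return True

def commit(metastore, table, write_new_metadata):
    while True:
        current = metastore.pointers.get(table)   # step 1: read current metadata
        new = write_new_metadata(current)         # step 2: write the next version
        if metastore.check_and_put(table, current, new):
            return new                            # step 3a: swap succeeded
        # step 3b: swap failed, loop back to step 1 and retry

ms = ToyMetastore()
ms.pointers["logging.events"] = "00000-aaaa.metadata.json"

def next_version(current):
    # Bump the leading table version; the UUID part is a placeholder.
    version = int(current.split("-")[0]) + 1
    return "%05d-aaaa.metadata.json" % version

print(commit(ms, "logging.events", next_version))  # 00001-aaaa.metadata.json
```

Losing the swap costs only a rewrite of the small metadata file, which is what makes optimistic concurrency cheap enough for this design.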

&lt;p&gt;If you want to see where this is stored in the Hive metastore, you can reference
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TABLE_PARAMS&lt;/code&gt; table. At the time of writing, this is the only method of
using the metastore that is supported by the Trino Iceberg connector.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT PARAM_KEY, PARAM_VALUE
FROM metastore.TABLE_PARAMS;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;PARAM_KEY                &lt;/th&gt;
      &lt;th&gt;PARAM_VALUE                                                                                     &lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;EXTERNAL                 &lt;/td&gt;
      &lt;td&gt;TRUE                                                                                            &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;metadata_location        &lt;/td&gt;
      &lt;td&gt;s3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;numFiles                 &lt;/td&gt;
      &lt;td&gt;2                                                                                               &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;previous_metadata_location&lt;/td&gt;
      &lt;td&gt;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;table_type               &lt;/td&gt;
      &lt;td&gt;iceberg                                                                                         &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;totalSize                &lt;/td&gt;
      &lt;td&gt;5323                                                                                            &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;transient_lastDdlTime    &lt;/td&gt;
      &lt;td&gt;1622865672                                                                                      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, the metastore shows that the current metadata location is the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&lt;/code&gt; file. Now you can
dive in to see the table metadata that is being used by the Iceberg connector.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% cat ~/Desktop/avro_files/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
   &quot;format-version&quot;:1,
   &quot;table-uuid&quot;:&quot;32e3c271-84a9-4be5-9342-2148c878227a&quot;,
   &quot;location&quot;:&quot;s3a://iceberg/logging.db/events&quot;,
   &quot;last-updated-ms&quot;:1622865686323,
   &quot;last-column-id&quot;:5,
   &quot;schema&quot;:{
      &quot;type&quot;:&quot;struct&quot;,
      &quot;fields&quot;:[
         {
            &quot;id&quot;:1,
            &quot;name&quot;:&quot;level&quot;,
            &quot;required&quot;:false,
            &quot;type&quot;:&quot;string&quot;
         },
         {
            &quot;id&quot;:2,
            &quot;name&quot;:&quot;event_time&quot;,
            &quot;required&quot;:false,
            &quot;type&quot;:&quot;timestamp&quot;
         },
         {
            &quot;id&quot;:3,
            &quot;name&quot;:&quot;message&quot;,
            &quot;required&quot;:false,
            &quot;type&quot;:&quot;string&quot;
         },
         {
            &quot;id&quot;:4,
            &quot;name&quot;:&quot;call_stack&quot;,
            &quot;required&quot;:false,
            &quot;type&quot;:{
               &quot;type&quot;:&quot;list&quot;,
               &quot;element-id&quot;:5,
               &quot;element&quot;:&quot;string&quot;,
               &quot;element-required&quot;:false
            }
         }
      ]
   },
   &quot;partition-spec&quot;:[
      {
         &quot;name&quot;:&quot;event_time_day&quot;,
         &quot;transform&quot;:&quot;day&quot;,
         &quot;source-id&quot;:2,
         &quot;field-id&quot;:1000
      }
   ],
   &quot;default-spec-id&quot;:0,
   &quot;partition-specs&quot;:[
      {
         &quot;spec-id&quot;:0,
         &quot;fields&quot;:[
            {
               &quot;name&quot;:&quot;event_time_day&quot;,
               &quot;transform&quot;:&quot;day&quot;,
               &quot;source-id&quot;:2,
               &quot;field-id&quot;:1000
            }
         ]
      }
   ],
   &quot;default-sort-order-id&quot;:0,
   &quot;sort-orders&quot;:[
      {
         &quot;order-id&quot;:0,
         &quot;fields&quot;:[
            
         ]
      }
   ],
   &quot;properties&quot;:{
      &quot;write.format.default&quot;:&quot;ORC&quot;
   },
   &quot;current-snapshot-id&quot;:4564366177504223943,
   &quot;snapshots&quot;:[
      {
         &quot;snapshot-id&quot;:6967685587675910019,
         &quot;timestamp-ms&quot;:1622865672882,
         &quot;summary&quot;:{
            &quot;operation&quot;:&quot;append&quot;,
            &quot;changed-partition-count&quot;:&quot;0&quot;,
            &quot;total-records&quot;:&quot;0&quot;,
            &quot;total-data-files&quot;:&quot;0&quot;,
            &quot;total-delete-files&quot;:&quot;0&quot;,
            &quot;total-position-deletes&quot;:&quot;0&quot;,
            &quot;total-equality-deletes&quot;:&quot;0&quot;
         },
         &quot;manifest-list&quot;:&quot;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&quot;
      },
      {
         &quot;snapshot-id&quot;:2720489016575682283,
         &quot;parent-snapshot-id&quot;:6967685587675910019,
         &quot;timestamp-ms&quot;:1622865680419,
         &quot;summary&quot;:{
            &quot;operation&quot;:&quot;append&quot;,
            &quot;added-data-files&quot;:&quot;2&quot;,
            &quot;added-records&quot;:&quot;3&quot;,
            &quot;added-files-size&quot;:&quot;1954&quot;,
            &quot;changed-partition-count&quot;:&quot;2&quot;,
            &quot;total-records&quot;:&quot;3&quot;,
            &quot;total-data-files&quot;:&quot;2&quot;,
            &quot;total-delete-files&quot;:&quot;0&quot;,
            &quot;total-position-deletes&quot;:&quot;0&quot;,
            &quot;total-equality-deletes&quot;:&quot;0&quot;
         },
         &quot;manifest-list&quot;:&quot;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&quot;
      },
      {
         &quot;snapshot-id&quot;:4564366177504223943,
         &quot;parent-snapshot-id&quot;:2720489016575682283,
         &quot;timestamp-ms&quot;:1622865686278,
         &quot;summary&quot;:{
            &quot;operation&quot;:&quot;append&quot;,
            &quot;added-data-files&quot;:&quot;1&quot;,
            &quot;added-records&quot;:&quot;1&quot;,
            &quot;added-files-size&quot;:&quot;746&quot;,
            &quot;changed-partition-count&quot;:&quot;1&quot;,
            &quot;total-records&quot;:&quot;4&quot;,
            &quot;total-data-files&quot;:&quot;3&quot;,
            &quot;total-delete-files&quot;:&quot;0&quot;,
            &quot;total-position-deletes&quot;:&quot;0&quot;,
            &quot;total-equality-deletes&quot;:&quot;0&quot;
         },
         &quot;manifest-list&quot;:&quot;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&quot;
      }
   ],
   &quot;snapshot-log&quot;:[
      {
         &quot;timestamp-ms&quot;:1622865672882,
         &quot;snapshot-id&quot;:6967685587675910019
      },
      {
         &quot;timestamp-ms&quot;:1622865680419,
         &quot;snapshot-id&quot;:2720489016575682283
      },
      {
         &quot;timestamp-ms&quot;:1622865686278,
         &quot;snapshot-id&quot;:4564366177504223943
      }
   ],
   &quot;metadata-log&quot;:[
      {
         &quot;timestamp-ms&quot;:1622865672894,
         &quot;metadata-file&quot;:&quot;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&quot;
      },
      {
         &quot;timestamp-ms&quot;:1622865680524,
         &quot;metadata-file&quot;:&quot;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&quot;
      }
   ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, these JSON files can quickly grow as you perform different
updates on your table. This file contains a pointer to all of the snapshots and
manifest list files, much like the output you found from looking at the
snapshots in the table. A really important piece to note is that the schema is
stored here. This is what Trino uses for validation on inserts and reads. As you
may expect, there is the root location of the table itself, as well as a unique
table identifier. The final part I’d like to note about this file is the
partition-spec and partition-specs fields. The partition-spec field holds the
current partition spec, while partition-specs is an array that holds all
partition specs that have ever existed for this table. As pointed out
earlier, you can have many different manifest files that use different partition
specs. That wraps up all of the metadata file types you can expect to see in
Iceberg!&lt;/p&gt;
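&lt;p&gt;If you want to poke at one of these metadata files yourself, a short script
can pull out the pieces discussed above. This is a hypothetical helper for
exploration, not part of Iceberg or Trino:&lt;/p&gt;

```python
import json

def summarize_metadata(text):
    # Summarize a metadata.json document: table location, column names,
    # current partition spec, and the current snapshot's manifest list.
    meta = json.loads(text)
    current_id = meta['current-snapshot-id']
    current = next(s for s in meta['snapshots']
                   if s['snapshot-id'] == current_id)
    return {
        'location': meta['location'],
        'columns': [f['name'] for f in meta['schema']['fields']],
        'partition-spec': [f['name'] for f in meta['partition-spec']],
        'current-manifest-list': current['manifest-list'],
        'snapshot-count': len(meta['snapshots']),
    }
```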

&lt;p&gt;This post wraps up the Trino on ice series. Hopefully these blog posts serve as
a helpful initial dialogue about what is expected to grow as a vital portion of
an open data lakehouse stack. What are you waiting for? Come join the fun and
help us implement some of the missing features or instead go ahead and try 
&lt;a href=&quot;https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio&quot;&gt;Trino on Ice(berg)&lt;/a&gt;
yourself!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Welcome to the Trino on ice series, covering the details around how the Iceberg table format works with the Trino query engine. The examples build on each previous post, so it’s recommended to read the posts sequentially and reference them as needed later. Here are links to the posts in this series: Trino on ice I: A gentle introduction to Iceberg Trino on ice II: In-place table evolution and cloud compatibility with Iceberg Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec Trino on ice IV: Deep dive into Iceberg internals So far, this series has covered some very interesting user level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting some files that result from various operations carried out using Trino. To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool and Iceberg’s core library. It’s useful to dissect how these files work, not only to help understand how Iceberg works, but also to aid in troubleshooting issues, should you have any issues during ingestion or querying of your Iceberg table. I like to think of this type of debugging much like a fun game of operation, and you’re looking to see what causes the red errors to fly by on your screen.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</title>
      <link href="https://trino.io/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html" rel="alternate" type="text/html" title="Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec" />
      <published>2021-07-30T00:00:00+00:00</published>
      <updated>2021-07-30T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/07/30/iceberg-concurrency-snapshots-spec</id>
      <content type="html" xml:base="https://trino.io/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html">&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;100%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/trino-iceberg.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Welcome to the Trino on ice series, covering the details around how the Iceberg
table format works with the Trino query engine. The examples build on each
previous post, so it’s recommended to read the posts sequentially and reference
them as needed later. Here are links to the posts in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/05/03/a-gentle-introduction-to-iceberg.html&quot;&gt;Trino on ice I: A gentle introduction to Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg.html&quot;&gt;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html&quot;&gt;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/08/12/deep-dive-into-iceberg-internals.html&quot;&gt;Trino on ice IV: Deep dive into Iceberg internals&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the last two blog posts, we’ve covered a lot of cool feature improvements of
Iceberg over the Hive model. I recommend you take a look at those if you haven’t
yet. We introduced concepts and issues that table formats address. This blog post
closes out the overview of Iceberg features by discussing the concurrency model
Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the
&lt;a href=&quot;https://iceberg.apache.org/spec/&quot;&gt;Iceberg Specification&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;concurrency-model&quot;&gt;Concurrency Model&lt;/h2&gt;

&lt;p&gt;One issue with the Hive model is that the metadata and the data files are
stored in distinct locations, by distinct services. Having your data and metadata
split up like this is a recipe for disaster when trying to apply updates to both
services atomically.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-on-ice/iceberg-metadata.png&quot; alt=&quot;Iceberg metadata diagram of runtime, and file storage&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A very common problem with Hive is that if a writing process failed during
insertion, many times you would find the data written to file storage, but the
metastore writes failed to occur. Or conversely, the metastore writes were
successful, but the data failed to finish writing to file storage due to a 
network or file IO failure. There’s a good 
&lt;a href=&quot;https://trino.io/episodes/5.html&quot;&gt;Trino Community Broadcast episode&lt;/a&gt; that talks
about a function in Trino that exists to resolve these issues by syncing the
metastore and file storage. You can watch 
&lt;a href=&quot;https://www.youtube.com/watch?v=OXyJFZSsX5w&amp;amp;t=2097s&quot;&gt;a simulation of this error&lt;/a&gt;
on that episode.&lt;/p&gt;

&lt;p&gt;Aside from having issues due to the split state in the system, there are many 
other issues that stem from the file system itself. In the case of HDFS, 
depending on the specific filesystem implementation you are using, you may have
&lt;a href=&quot;https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem&quot;&gt;different atomicity guarantees for various file systems and their operations&lt;/a&gt;,
such as creating, deleting, and renaming files and directories. HDFS isn’t the
only troublemaker here. Aside from Amazon S3’s 
&lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3-now-delivers-strong-read-after-write-consistency-automatically-for-all-applications/&quot;&gt;recent announcement of strong consistency in their S3 service,&lt;/a&gt;
most object storage systems only offer &lt;em&gt;eventual&lt;/em&gt; consistency and may not show
the latest files immediately after writes. Despite storage systems showing more
progress towards offering better performance and guarantees, these systems still
offer no reliable locking mechanism.&lt;/p&gt;

&lt;p&gt;Iceberg addresses all of these issues in a multitude of ways. One of the primary
ways Iceberg introduces transactional guarantees is by storing the metadata in
the same datastore as the data itself. This simplifies handling commit failures
down to rolling back on one system rather than trying to coordinate a rollback
across two systems like in Hive. Writers independently write their metadata and
attempt to perform their operations, needing no coordination with other writers.
The only time the writers coordinate is when they attempt to commit their
operations. To commit, they acquire a lock on the current snapshot record in a
database. This concurrency model, where writers eagerly do the work upfront, is
called &lt;strong&gt;&lt;em&gt;optimistic concurrency control&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Currently, in Trino, this method still uses the Hive metastore to perform the
lock-and-swap operation necessary to coordinate the final commits. Iceberg 
creator, &lt;a href=&quot;https://www.linkedin.com/in/rdblue/&quot;&gt;Ryan Blue&lt;/a&gt;, 
&lt;a href=&quot;https://youtu.be/-iIY2sOFBRc?t=1351&quot;&gt;covers this lock-and-swap mechanism&lt;/a&gt; and
how the metastore can be replaced with alternate locking methods. In the event
that &lt;a href=&quot;https://iceberg.apache.org/reliability/#concurrent-write-operations&quot;&gt;two writers attempt to commit at the same time&lt;/a&gt;,
the writer that first acquires the lock successfully commits by swapping its
snapshot as the current snapshot, while the second writer must reapply its
changes and try again. The second writer should have no problem with this, assuming
there are no conflicting changes between the two snapshots.&lt;/p&gt;
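&lt;p&gt;The commit-or-retry behavior can be sketched as follows. All names here are
hypothetical, and the pointer is an in-process object rather than a metastore
lock, but the control flow is the same:&lt;/p&gt;

```python
class TablePointer:
    # The single coordinated resource: the table's current snapshot ID.
    def __init__(self):
        self.current = 0

    def compare_and_swap(self, expected, new):
        # Succeeds only if no other writer committed in the meantime.
        if self.current != expected:
            return False
        self.current = new
        return True

def optimistic_commit(table, apply_changes):
    # Do all the work up front; coordinate only at commit time.
    retries = 0
    while True:
        base = table.current
        new_snapshot = apply_changes(base)   # write data and metadata files
        if table.compare_and_swap(base, new_snapshot):
            return new_snapshot, retries
        retries += 1                         # lost the race: rebase and retry
```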

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-on-ice/iceberg-files.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This works similarly to a git workflow where the main branch is the locked
resource, and two developers try to commit their changes at the same time. The
first developer’s changes may conflict with the second developer’s changes. The
second developer is then forced to rebase or merge the first developer’s code
with their changes before committing to the main branch again. The same logic
applies to merging data files. Currently, Iceberg clients use a
&lt;a href=&quot;https://iceberg.apache.org/reliability/#concurrent-write-operations&quot;&gt;copy-on-write mechanism&lt;/a&gt;
that makes a new file out of the merged data in the next snapshot. This enables
accurate time traveling and preserves previous split versions of the files. At
the time of writing, upserts via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE INTO&lt;/code&gt; syntax are not supported in Trino,
but 
&lt;a href=&quot;https://github.com/trinodb/trino/issues/7708&quot;&gt;this is in active development&lt;/a&gt;.
&lt;strong&gt;&lt;em&gt;UPDATE:&lt;/em&gt;&lt;/strong&gt; Since the original writing of this post, the 
&lt;a href=&quot;https://github.com/trinodb/trino/pull/7933&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; syntax exists as of version 393&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One of the great benefits of tracking each individual change that gets written
to Iceberg is that you are given a view of the data at every point in time. This
enables a really cool feature that I mentioned earlier called &lt;strong&gt;&lt;em&gt;time travel&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;snapshots-and-time-travel&quot;&gt;Snapshots and Time Travel&lt;/h2&gt;

&lt;p&gt;To showcase snapshots, it’s best to go over a few examples drawing from the
events table we 
&lt;a href=&quot;/blog/2021/05/03/a-gentle-introduction-to-iceberg.html&quot;&gt;created in the previous blog posts&lt;/a&gt;.
This time we’ll only be working with the Iceberg table, as this capability is
not available in Hive. Snapshots allow you to have an immutable set of your data
at a given time. They are automatically created on every append or removal of
data. One thing to note is that for now, they do not store the state of your
metadata.&lt;/p&gt;

&lt;p&gt;Say that you have created your events table and inserted the three initial rows
as we did previously. Let’s look at the data we get back and see how to check
the existing snapshots in Trino:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT level, message
FROM iceberg.logging.events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Double oh noes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WARN&lt;/td&gt;
      &lt;td&gt;Maybeh oh noes?&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;To query the snapshots, append the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$&lt;/code&gt; operator to the end of the
table name, followed by the name of the hidden table, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshots&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT snapshot_id, parent_id, operation
FROM iceberg.logging.&quot;events$snapshots&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;snapshot_id&lt;/th&gt;
      &lt;th&gt;parent_id&lt;/th&gt;
      &lt;th&gt;operation&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;7620328658793169607&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;append&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2115743741823353537&lt;/td&gt;
      &lt;td&gt;7620328658793169607&lt;/td&gt;
      &lt;td&gt;append&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Let’s take a look at the manifest list files that are associated with each 
snapshot ID. You can tell which file belongs to which snapshot based on the 
snapshot ID embedded in the filename:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT manifest_list
FROM iceberg.logging.&quot;events$snapshots&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;manifest_list&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
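&lt;p&gt;Because the snapshot ID is embedded in the filename, a small helper can map a
manifest list path back to its snapshot. This is a hypothetical convenience for
exploration, relying only on the snap-{snapshot-id}-{attempt}-{uuid}.avro naming
shown above:&lt;/p&gt;

```python
def snapshot_id_from_manifest_list(path):
    # Manifest list files are named snap-{snapshot-id}-{attempt}-{uuid}.avro,
    # so the snapshot ID is the second dash-separated field of the filename.
    name = path.rsplit('/', 1)[-1]   # drop the s3a://.../metadata/ prefix
    if not name.startswith('snap-'):
        raise ValueError('not a manifest list file: ' + name)
    return int(name.split('-')[1])
```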

&lt;p&gt;Now, let’s insert another row into the table:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;INSERT INTO iceberg.logging.events
VALUES
(
'INFO',
timestamp '2021-04-02 00:00:11.1122222',
'It is all good',
ARRAY ['Just updating you!']
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s check the snapshot table again:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT snapshot_id, parent_id, operation
FROM iceberg.logging.&quot;events$snapshots&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;snapshot_id&lt;/th&gt;
      &lt;th&gt;parent_id&lt;/th&gt;
      &lt;th&gt;operation&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;7620328658793169607&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;append&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2115743741823353537&lt;/td&gt;
      &lt;td&gt;7620328658793169607&lt;/td&gt;
      &lt;td&gt;append&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7030511368881343137&lt;/td&gt;
      &lt;td&gt;2115743741823353537&lt;/td&gt;
      &lt;td&gt;append&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Let’s also verify that our row was added:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT level, message
FROM iceberg.logging.events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;INFO&lt;/td&gt;
      &lt;td&gt;It is all good&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Double oh noes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WARN&lt;/td&gt;
      &lt;td&gt;Maybeh oh noes?&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Since Iceberg is already tracking the list of files added and removed at each
snapshot, it would make sense that you can travel back and forth between these
different views of the table, right? This concept is called time travel.
You specify which snapshot you would like to read from, and you see the view of
the data as of that snapshot. In Trino, you use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&lt;/code&gt;
operator, followed by the ID of the snapshot you wish to read from:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT level, message
FROM iceberg.logging.&quot;events@2115743741823353537&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Double oh noes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WARN&lt;/td&gt;
      &lt;td&gt;Maybeh oh noes?&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
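&lt;p&gt;Conceptually, traveling to a point in time rather than to a snapshot ID
amounts to scanning the table’s snapshot history for the latest snapshot
committed at or before that time. A minimal sketch of that lookup (hypothetical
helper, not Trino’s implementation):&lt;/p&gt;

```python
def snapshot_at(snapshot_log, ts_ms):
    # snapshot_log: entries with 'timestamp-ms' and 'snapshot-id' keys,
    # in commit order. Return the ID of the latest snapshot committed
    # at or before ts_ms, or None if the table did not exist yet.
    chosen = None
    for entry in snapshot_log:
        if entry['timestamp-ms'] > ts_ms:
            break
        chosen = entry['snapshot-id']
    return chosen
```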

&lt;p&gt;If you determine there is some issue with your data, you can always roll back to
a previous state permanently as well. In Trino, we have a procedure called
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rollback_to_snapshot&lt;/code&gt; to move the table state back to another snapshot:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CALL system.rollback_to_snapshot('logging', 'events', 2115743741823353537);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now that we have rolled back, observe what happens when we query the events
table with:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT level, message
FROM iceberg.logging.events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Double oh noes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WARN&lt;/td&gt;
      &lt;td&gt;Maybeh oh noes?&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Notice the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INFO&lt;/code&gt; row is still missing even though we query the table without
specifying a snapshot ID. Just because we rolled back doesn’t mean we’ve
lost the snapshot we rolled back from. In fact, we can roll forward, or as
I like to call it, 
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_to_the_Future&quot;&gt;back to the future&lt;/a&gt;! In
Trino, you use the same procedure call, but with a successor of the current
snapshot:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CALL system.rollback_to_snapshot('logging', 'events', 7030511368881343137);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And now we should be able to query the table again and see the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INFO&lt;/code&gt; row 
return:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT level, message
FROM iceberg.logging.events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;INFO&lt;/td&gt;
      &lt;td&gt;It is all good&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Double oh noes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WARN&lt;/td&gt;
      &lt;td&gt;Maybeh oh noes?&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As expected, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INFO&lt;/code&gt; row returns when you roll back to the future.&lt;/p&gt;

&lt;p&gt;Having snapshots not only provides you with a level of immutability that is
key when working with eventually consistent storage, but also gives you a rich
set of features to version and move between different versions of your data,
like a git repository.&lt;/p&gt;

&lt;h2 id=&quot;iceberg-specification&quot;&gt;Iceberg Specification&lt;/h2&gt;

&lt;p&gt;Perhaps saving the best for last, the benefit of using Iceberg is the community
that surrounds it, and the support you receive. It can be daunting to have to
choose a project that replaces something so core to your architecture. While
Hive has so many drawbacks, one of the things keeping many companies locked in
is the fear of the unknown. How do you know which table format to choose? Are
there unknown data corruption issues that I’m about to take on? What if this
doesn’t scale like it promises on the label? It is worth noting that 
&lt;a href=&quot;https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/&quot;&gt;alternative table formats are also emerging in this space&lt;/a&gt; 
and we encourage you to investigate these for your own use cases. When sitting
down with Iceberg creator, Ryan Blue, 
&lt;a href=&quot;https://www.twitch.tv/videos/989098630&quot;&gt;comparing Iceberg to other table formats&lt;/a&gt;, 
he claims the community’s greatest strength is its ability to look forward.
They intentionally broke compatibility with Hive to enable them to provide a
richer level of features. Unlike Hive, the Iceberg project explained their
thinking in a spec.&lt;/p&gt;

&lt;p&gt;The strongest argument I can see for Iceberg is that it has a 
&lt;a href=&quot;https://iceberg.apache.org/spec/&quot;&gt;specification&lt;/a&gt;. This is something that has
largely been missing from Hive and shows a real maturity in how the Iceberg
community has approached the issue. On the Trino project, we think standards are
important. We adhere to many standards ourselves, such as ANSI SQL syntax and
exposing clients through JDBC connections. By creating a standard around
this, you’re no longer tied to any particular technology, not even Iceberg
itself. You are adhering to a standard that will hopefully become the de facto
standard over a decade or two, much like Hive did. Having the standard in clear
writing invites multiple communities to the table and brings even more use 
cases. Doing so improves the standards and therefore the technologies that
implement them.&lt;/p&gt;

&lt;p&gt;The previous three blog posts of this series covered the features and massive
benefits of using this novel table format. The following post dives deeper
into how Iceberg achieves some of this functionality, with an
overview of some of the internals and metadata layouts. In the meantime, feel
free to try 
&lt;a href=&quot;https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio&quot;&gt;Trino on Ice(berg)&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Welcome to the Trino on ice series, covering the details around how the Iceberg table format works with the Trino query engine. The examples build on each previous post, so it’s recommended to read the posts sequentially and reference them as needed later. Here are links to the posts in this series: Trino on ice I: A gentle introduction to Iceberg Trino on ice II: In-place table evolution and cloud compatibility with Iceberg Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec Trino on ice IV: Deep dive into Iceberg internals In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This blog closes up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the Iceberg Specification.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</title>
      <link href="https://trino.io/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg.html" rel="alternate" type="text/html" title="Trino on ice II: In-place table evolution and cloud compatibility with Iceberg" />
      <published>2021-07-12T00:00:00+00:00</published>
      <updated>2021-07-12T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg</id>
      <content type="html" xml:base="https://trino.io/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg.html">&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;100%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/trino-iceberg.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Welcome to the Trino on ice series, covering the details around how the Iceberg
table format works with the Trino query engine. The examples build on each
previous post, so it’s recommended to read the posts sequentially and reference
them as needed later. Here are links to the posts in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/05/03/a-gentle-introduction-to-iceberg.html&quot;&gt;Trino on ice I: A gentle introduction to Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg.html&quot;&gt;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html&quot;&gt;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/08/12/deep-dive-into-iceberg-internals.html&quot;&gt;Trino on ice IV: Deep dive into Iceberg internals&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2021/05/03/a-gentle-introduction-to-iceberg.html&quot;&gt;The first post&lt;/a&gt; 
covered how Iceberg is a table format and not a file format. It demonstrated the
benefits of hidden partitioning in Iceberg in contrast to exposed partitioning 
in Hive. There really is no such thing as “exposed partitioning.” I just thought
that sounded better than not-hidden partitioning. If any of that wasn’t clear, I
recommend either that you stop reading now, or go back to the first post before 
starting this one. This post discusses evolution. No, the post isn’t covering 
Darwinian nor Pokémon evolution, but in-place table evolution!&lt;/p&gt;

&lt;!--more--&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;50%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/evolution.gif&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;You may find it a little odd that I am getting excited over tables evolving 
in-place, but as mentioned in the last post, if you have experience performing 
table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander 
evolved into Charmeleon upon discovering that Iceberg supports partition evolution 
and schema evolution. That is, until Charmeleon started treating Ash like a jerk
after the evolution from Charmander. Hopefully, you won’t face the same issue 
when your tables evolve.&lt;/p&gt;

&lt;p&gt;Another important aspect covered here is how Iceberg is developed with cloud
storage in mind. Hive and other data lake technologies were developed with file
systems as their primary storage layer. This is still a very common layer today,
but as more companies moved to object storage, table formats did not 
adapt to the needs of object stores. Let’s dive in!&lt;/p&gt;

&lt;h2 id=&quot;partition-specification-evolution&quot;&gt;Partition Specification evolution&lt;/h2&gt;

&lt;p&gt;In Iceberg, you can update the partition specification, shortened to 
partition spec, on a live table. You do not need to perform a table 
migration as you do in Hive. In Hive, partition specs don’t explicitly exist 
because they are tightly coupled with the creation of the Hive table. This means
that if you ever need to change the granularity of your data partitions at any 
point, you need to create an entirely new table and move all the data to the new 
partition granularity you desire. No pressure on choosing the right granularity
or anything!&lt;/p&gt;

&lt;p&gt;In Iceberg, you’re not required to choose the perfect partition specification 
upfront, and you can have multiple partition specs in the same table, and query
across the different-sized partition specs. How great is that! This means if 
you’re initially partitioning your data by month, and later you decide to move 
to a daily partitioning spec due to a growing ingest from all your new 
customers, you can do so with no migration, and query over the table with no 
issue.&lt;/p&gt;

&lt;p&gt;This is conveyed pretty succinctly in this graphic from the Iceberg 
documentation. Through the end of 2008, partitioning occurs at a monthly 
granularity, and starting in 2009, it moves to a daily granularity. When a query 
pulls data between December 14th, 2008 and January 13th, 2009, the entire month 
of December gets scanned due to the monthly partition, but for the dates in 
January, only the first 13 days are scanned to answer the query.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;75%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/partition-spec-evolution.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;At the time of writing, Trino is able to perform reads from tables that have 
multiple partition spec changes, but partition evolution write support does not 
yet exist. &lt;a href=&quot;https://github.com/trinodb/trino/issues/7580&quot;&gt;There are efforts to add this support in the near future&lt;/a&gt;.&lt;/p&gt;
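
&lt;p&gt;In the meantime, engines that already support writing partition spec changes can
evolve the spec in place. As a sketch, moving a hypothetical table from the monthly
to the daily granularity shown in the graphic could look like this in Spark SQL with
the Iceberg extensions enabled. The catalog, table, and column names are placeholders,
and the exact syntax may vary by Iceberg version:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Spark SQL with the Iceberg extensions enabled; names are placeholders
ALTER TABLE my_catalog.db.sample DROP PARTITION FIELD months(ts);

ALTER TABLE my_catalog.db.sample ADD PARTITION FIELD days(ts);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;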

&lt;h2 id=&quot;schema-evolution&quot;&gt;Schema evolution&lt;/h2&gt;

&lt;p&gt;Iceberg also handles schema evolution much more elegantly than Hive. In Hive, 
adding columns worked well enough, as data inserted before the schema change 
just reports null for that column. For formats that use column names, like ORC 
and Parquet, deletes are also straightforward for Hive, as it simply ignores 
fields that are no longer part of the table. For unstructured files like CSV 
that use the position of the column, deletes would still cause issues, as 
deleting one column shifts the rest of the columns. Renames pose an 
issue for all formats in Hive, as data written prior to the rename is not 
migrated to the new field name. This effectively works the same as if you deleted 
the old field and added a new column with the new name. This uneven support for
schema evolution across file types in Hive requires memorizing the format
underneath each table, and easily leads to user
errors if someone executes one of the unsupported operations on the wrong table.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
  &lt;tr&gt;
    &lt;th colspan=&quot;4&quot;&gt;Hive 2.2.0 schema evolution based on file type and operation.&lt;/th&gt;
  &lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;Add&lt;/td&gt;
    &lt;td&gt;Delete&lt;/td&gt;
    &lt;td&gt;Rename&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;CSV/TSV&lt;/td&gt;
    &lt;td&gt;✅&lt;/td&gt;
    &lt;td&gt;❌&lt;/td&gt;
    &lt;td&gt;❌&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;JSON&lt;/td&gt;
    &lt;td&gt;✅&lt;/td&gt;
    &lt;td&gt;✅&lt;/td&gt;
    &lt;td&gt;❌&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;ORC/Parquet/Avro&lt;/td&gt;
    &lt;td&gt;✅&lt;/td&gt;
    &lt;td&gt;✅&lt;/td&gt;
    &lt;td&gt;❌&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Currently in Iceberg, schemaless position-based data formats such as CSV and TSV
are not supported, though there are &lt;a href=&quot;https://github.com/apache/iceberg/issues/118&quot;&gt;some discussions on adding limited support 
for them&lt;/a&gt;. This would be useful from
a reading standpoint: loading data from CSV into an Iceberg-supported format
with all the guarantees that Iceberg offers.&lt;/p&gt;

&lt;p&gt;While JSON doesn’t rely on positional data, it does have an explicit dependency
on names. This means that if I remove a text column named 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;severity&lt;/code&gt; from a JSON table, and later add a new int column called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;severity&lt;/code&gt;, I 
encounter an error when deserializing the JSON files, as older rows still hold 
string values under that name. Even worse would be if the new
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;severity&lt;/code&gt; column you add has the same type as the original but a semantically 
different meaning. This results in old rows containing values that are 
unknowingly from a different domain, which can lead to wrong analytics. After 
all, someone who adds the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;severity&lt;/code&gt; column might not even be aware of the 
old &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;severity&lt;/code&gt; column if it was dropped quite some time ago.&lt;/p&gt;

&lt;p&gt;ORC, Parquet, and Avro do not suffer from these issues, as they are 
self-describing formats that keep a schema internal to the file itself, and each 
format tracks changes to the columns through IDs rather than names or positions. 
Iceberg uses these unique column IDs to keep track of the columns as changes are 
applied.&lt;/p&gt;

&lt;p&gt;In general, Iceberg can only allow this small set of file formats due to the 
&lt;a href=&quot;https://iceberg.apache.org/evolution/#correctness&quot;&gt;correctness guarantees&lt;/a&gt; it 
provides. In Trino, you can add, delete, or rename columns using the 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER TABLE&lt;/code&gt; command. Here’s an example that continues from the table created 
in the last post, which inserted three rows. The DDL statement looked like this.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6), 
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &apos;ORC&apos;,
  partitioning = ARRAY[&apos;day(event_time)&apos;]
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here is an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER TABLE&lt;/code&gt; sequence that adds a new column named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;severity&lt;/code&gt;, 
inserts data including into the new column, renames the column, and prints the 
data.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; 

INSERT INTO iceberg.logging.events VALUES 
(
  &apos;INFO&apos;, 
  timestamp 
  &apos;2021-04-01 19:59:59.999999&apos; AT TIME ZONE &apos;America/Los_Angeles&apos;, 
  &apos;es muy bueno&apos;, 
  ARRAY [&apos;It is all normal&apos;], 
  1
);

ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;

SELECT level, message, priority
FROM iceberg.logging.events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
      &lt;th&gt;priority&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Double oh noes&lt;/td&gt;
      &lt;td&gt;NULL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WARN&lt;/td&gt;
      &lt;td&gt;Maybeh oh noes?&lt;/td&gt;
      &lt;td&gt;NULL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
      &lt;td&gt;NULL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;INFO&lt;/td&gt;
      &lt;td&gt;es muy bueno&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ALTER TABLE iceberg.logging.events 
DROP COLUMN priority;

SHOW CREATE TABLE iceberg.logging.events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE TABLE iceberg.logging.events (
   level varchar,
   event_time timestamp(6),
   message varchar,
   call_stack array(varchar)
)
WITH (
   format = &apos;ORC&apos;,
   partitioning = ARRAY[&apos;day(event_time)&apos;]
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that neither the priority nor the severity column is present in the schema.
As noted in the table above, renames cause issues for all file formats in Hive. Yet
in Iceberg, performing all of these operations causes no issues with the table or the
underlying data.&lt;/p&gt;

&lt;h2 id=&quot;cloud-storage-compatibility&quot;&gt;Cloud storage compatibility&lt;/h2&gt;

&lt;p&gt;Not all developers consider or are aware of the performance implications of 
using Hive over a cloud object storage solution like S3 or Azure Blob storage. 
One thing to remember is that Hive was developed with the Hadoop Distributed 
File System (HDFS) in mind. HDFS is a filesystem that is particularly well suited
to listing files, because they are stored in a 
contiguous manner. When Hive stores data associated with a table, it assumes 
there is a contiguous layout underneath it and performs list operations that are
expensive on cloud storage systems.&lt;/p&gt;

&lt;p&gt;The common cloud storage systems are typically object stores that do not lay out
the files in a contiguous manner based on paths. Therefore, it becomes very 
expensive to list out all the files in a particular path. Yet, these list 
operations are executed for every partition that could be included in a query, 
even if only a single row in a single file out of thousands 
needs to be retrieved to answer the query. Even ignoring the performance costs
for a minute, object stores may also pose issues for Hive due to eventual 
consistency. Inserting and deleting can cause inconsistent results for readers, 
if the files you end up reading are out of date.&lt;/p&gt;

&lt;p&gt;Iceberg avoids all of these issues by tracking the data at the file level, 
rather than the partition level. By tracking the files, Iceberg only accesses 
the files containing data relevant to the query, as opposed to accessing files 
in the same partition looking for the few files that are relevant to the query. 
Further, this allows Iceberg to control for the inconsistency issue in 
cloud-based file systems by using a locking mechanism at the file level. See the
file layout below comparing the Hive layout to the Iceberg layout. As you can see in 
the next image, Iceberg makes no assumptions about the data being contiguous or 
not. It simply builds a persistent tree using the snapshot (S) location stored 
in the metadata, that points to the manifest list (ML), which points to 
manifests containing partitions (P). Finally, these manifest files contain the 
file (F) locations and stats that can quickly be used to prune data versus 
needing to do a list operation and scanning all the files.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;100%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/cloud-file-layout.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Referencing the picture above, if you were to run a query where the result set 
only contains rows from file F1, Hive would require a list operation and 
scanning the files, F2 and F3. In Iceberg, file metadata exists in the manifest 
file, P1, that would have a range on the predicate field that prunes out files 
F2 and F3, and only scans file F1. This example only shows a couple of files, 
but imagine storage that scales up to thousands of files! Listing becomes 
expensive on files that are not contiguously stored. Having this 
flexibility in the logical layout is essential to increasing query performance. 
This is especially true on cloud object stores.&lt;/p&gt;

&lt;p&gt;If you want to play around with Iceberg using Trino, check out the 
&lt;a href=&quot;https://trino.io/docs/current/connector/iceberg.html&quot;&gt;Trino Iceberg docs&lt;/a&gt;. 
To avoid issues like the eventual consistency issue, as well as other problems 
of trying to sync operations across systems, Iceberg provides optimistic 
concurrency support, which is covered in more detail in
&lt;a href=&quot;/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html&quot;&gt;the next post&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Welcome to the Trino on ice series, covering the details around how the Iceberg table format works with the Trino query engine. The examples build on each previous post, so it’s recommended to read the posts sequentially and reference them as needed later. Here are links to the posts in this series: Trino on ice I: A gentle introduction to Iceberg Trino on ice II: In-place table evolution and cloud compatibility with Iceberg Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec Trino on ice IV: Deep dive into Iceberg internals The first post covered how Iceberg is a table format and not a file format. It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian nor Pokémon evolution, but in-place table evolution!</summary>

      
      
    </entry>
  
    <entry>
      <title>Row pattern recognition with MATCH_RECOGNIZE</title>
      <link href="https://trino.io/blog/2021/05/19/row_pattern_matching.html" rel="alternate" type="text/html" title="Row pattern recognition with MATCH_RECOGNIZE" />
      <published>2021-05-19T00:00:00+00:00</published>
      <updated>2021-05-19T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/05/19/row_pattern_matching</id>
      <content type="html" xml:base="https://trino.io/blog/2021/05/19/row_pattern_matching.html">&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; syntax was introduced in the latest SQL specification
of 2016, SQL:2016. It is a super powerful tool for analyzing trends in your data. We are
proud to announce that Trino has supported this great feature since
&lt;a href=&quot;https://trino.io/docs/current/release/release-356.html&quot;&gt;version 356&lt;/a&gt;. With
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt;, you can define a pattern using the well-known regular
expression syntax, and match it to a set of rows. Upon finding a matching row
sequence, you can retrieve all kinds of detailed or summary information about
the match, and pass it on to be processed by the subsequent parts of your
query. This is a new level of what a pure SQL statement can do.&lt;/p&gt;

&lt;p&gt;This blog post gives you a taste of row pattern matching capabilities, and a
quick overview of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; syntax.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;a-regular-expression-and-a-table-a-fruitful-relationship&quot;&gt;A regular expression and a table: a fruitful relationship&lt;/h2&gt;

&lt;p&gt;The regex matching we all know is about searching for patterns in character
strings. But how does a regex match a sequence of rows? Certainly, a row of
data is a more complex structure than a character. And so, row pattern matching
is more expressive than regex matching in text. Unlike characters, which stay
fixed in their places in a string, rows aren’t assigned up-front to
pattern components. This is where the additional level of complexity comes
from: whether the row is an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; is conditional. It is revealed as
the pattern matching goes forward. It depends on the data in the row, but also
on the context of the current match and even on the match number. Also, the
same row can match different labels in different matches.&lt;/p&gt;

&lt;p&gt;Consider this simple example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PATTERN: A B+ C D?
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;First, let’s match it to the string &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;ABBCEE&quot;&lt;/code&gt;. There is exactly one way to
match it: the prefix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;ABBC&quot;&lt;/code&gt; is a match.&lt;/p&gt;

&lt;p&gt;Now, let’s see what it takes to match a pattern to rows of a table.
Consider the table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;numbers&lt;/code&gt; with a single column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;number&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/table-numbers.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;defining conditions&lt;/code&gt; to define how the rows of the table can be
mapped to pattern components &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;DEFINE:
    A &amp;lt;- true (matches every row)
    B &amp;lt;- number is greater than previous number
    C &amp;lt;- number is lower or equal to A
    D &amp;lt;- matches every row, but only in the first match;
         otherwise doesn&apos;t match any row
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the conditions can refer to other pattern components (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt;
 depends on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;), or the sequential match number (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;When searching for a match, the engine goes row by row, and assigns labels
according to the pattern. Every time the pattern shows the next component
(label) to be matched, the defining condition of that component is evaluated
for the current row in the context of the partial match.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/first-match.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After finding a match, you can step one row forward and search for another one.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/second-match.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So far, two matches were found in the same set of rows. Interestingly, a row
that was labeled as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; in the first match, became &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; in the second match.
Let’s try to find another match.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/third-match.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;time-to-get-more-technical&quot;&gt;Time to get more technical&lt;/h2&gt;

&lt;p&gt;…and use some real &lt;s&gt;life&lt;/s&gt; money examples.&lt;/p&gt;

&lt;p&gt;In the preceding examples, the pattern consisted of components &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt;. They were chosen this way to capture the analogy between pattern
matching in a string and pattern matching in a set of rows. According to the
SQL specification, row pattern components can be named with arbitrary
identifiers, as long as they are compliant with the SQL identifier semantics,
so you don’t need to limit yourself to single-letter names, and instead you can
use more verbose labels.&lt;/p&gt;

&lt;p&gt;Officially, the pattern components, or labels, are called the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary pattern
variables&lt;/code&gt;. They are the basic components of the row pattern. Consider the
following example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PATTERN( START DOWN+ UP+ )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are three primary pattern variables: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;START&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOWN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UP&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt; is
the “one or more” quantifier you know from the regex syntax. Intuitively, this
pattern should match a sequence of rows which are first “decreasing”, and then
“increasing”. You need to inform the engine how it should map rows to the
variables. In other words, you need to define what the “decreasing” and
“increasing” rows are:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;DEFINE DOWN AS price &amp;lt; PREV(price),
       UP AS price &amp;gt; PREV(price)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now it’s clear that “decreasing” and “increasing” is about the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt; values.
There is no defining condition for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;START&lt;/code&gt; variable, which informs the
engine that the match can start anywhere.&lt;/p&gt;
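
&lt;p&gt;Putting the two clauses together, a minimal query could look like the following
sketch. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stock_prices&lt;/code&gt; table
with its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trade_date&lt;/code&gt; columns is hypothetical,
and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEASURES&lt;/code&gt; clause, covered
below, reports values from each match:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT bottom_price, top_price
FROM stock_prices
  MATCH_RECOGNIZE (
    ORDER BY trade_date
    MEASURES
      LAST(DOWN.price) AS bottom_price,
      LAST(UP.price) AS top_price
    PATTERN ( START DOWN+ UP+ )
    DEFINE
      DOWN AS price &amp;lt; PREV(price),
      UP AS price &amp;gt; PREV(price)
  );
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;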

&lt;p&gt;The preceding example shows the two key clauses of row pattern recognition:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PATTERN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFINE&lt;/code&gt;. Let’s see what other keywords there are in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; clause.&lt;/p&gt;

&lt;h2 id=&quot;syntax-overview&quot;&gt;Syntax overview&lt;/h2&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; syntax is long and rich enough to capture everything that
a pattern matching tool needs, and all the options which let you easily toggle
your matching strategies.&lt;/p&gt;

&lt;p&gt;Technically, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; is part of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FROM&lt;/code&gt; clause:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT ...
    FROM some_table
        MATCH_RECOGNIZE (
          [ PARTITION BY column [, ...] ]
          [ ORDER BY column [, ...] ]
          [ MEASURES measure_definition [, ...] ]
          [ rows_per_match ]
          [ AFTER MATCH skip_to ]
          PATTERN ( row_pattern )
          [ SUBSET subset_definition [, ...] ]
          DEFINE variable_definition [, ...]
          )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; can be used in the query as one of the stages of processing
data. You can &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; from its results or even stream them into another
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PATTERN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFINE&lt;/code&gt; clauses are the heart of row pattern recognition.
They are also the only two required subclauses of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt;. They were
touched upon in the previous section.&lt;/p&gt;
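
&lt;p&gt;As a reminder, here is the pair used in this post’s examples, matching a
falling and then rising &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PATTERN (START DOWN+ UP+)
DEFINE
    DOWN AS price &amp;lt; PREV(price),
    UP AS price &amp;gt; PREV(price)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;START&lt;/code&gt; has no definition: a pattern variable without a defining
condition matches any row.&lt;/p&gt;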

&lt;p&gt;The pattern syntax is close to regular expression syntax. It also supports some
extensions specific to row pattern recognition. They are explained in
&lt;a href=&quot;#pattern-syntax&quot;&gt;Row pattern syntax&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PARTITION BY&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clauses are similar to those in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt;
syntax. They help you structure the input data. You can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PARTITION BY&lt;/code&gt; to
break up your data into independent chunks. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; is useful to establish 
the order of rows before searching for the pattern. Typically, you want to
analyze series of events over time, so ordering by date is a good choice.&lt;/p&gt;
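
&lt;p&gt;In the orders example used throughout this post, that amounts to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PARTITION BY customer_id
ORDER BY order_date
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;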

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/partition-by-order-by.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEASURES&lt;/code&gt; clause, you can specify what information you need about every
match that is found. For example, if you’re interested in the order date, the
lowest value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt;, and the sequential number of the match, this is how
to retrieve them:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;MEASURES order_date AS date,
         LAST(DOWN.price) AS bottom_price,
         MATCH_NUMBER() AS match_no
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bottom_price&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;match_no&lt;/code&gt; are exposed by the pattern recognition
clause as output columns.&lt;/p&gt;

&lt;p&gt;The expressions in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEASURES&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFINE&lt;/code&gt; clauses allow you to combine the
input data with the information about the matched pattern. They support many
extensions and special constructs to help you get the most out of your data, both
when defining the pattern, and retrieving useful information after a successful
match. The special keyword &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LAST&lt;/code&gt; is one example. For the full list of the
magic spells, check &lt;a href=&quot;#expressions&quot;&gt;Expressions for special tasks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; clause has two useful toggles. The first of them lets you
choose whether the output includes all rows of the match, or a single-row
summary. For all rows, specify &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL ROWS PER MATCH&lt;/code&gt;. For a single row, choose
the default &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ONE ROW PER MATCH&lt;/code&gt;. There are also sub-options available, enabling
different handling of empty matches and unmatched rows.&lt;/p&gt;
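
&lt;p&gt;Together, the variants of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rows_per_match&lt;/code&gt; option look like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ONE ROW PER MATCH
ALL ROWS PER MATCH
ALL ROWS PER MATCH SHOW EMPTY MATCHES
ALL ROWS PER MATCH OMIT EMPTY MATCHES
ALL ROWS PER MATCH WITH UNMATCHED ROWS
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;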

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/rows-per-match.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Another toggle is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AFTER MATCH SKIP&lt;/code&gt; clause. It allows you to specify where
the row pattern matching resumes after finding a match. The default option is
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AFTER MATCH SKIP PAST LAST ROW&lt;/code&gt;, but you can also skip to the next row or to a
specific position in the match based on the matched pattern variables.&lt;/p&gt;
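
&lt;p&gt;The available &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skip_to&lt;/code&gt; options are:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;AFTER MATCH SKIP PAST LAST ROW
AFTER MATCH SKIP TO NEXT ROW
AFTER MATCH SKIP TO FIRST pattern_variable
AFTER MATCH SKIP TO LAST pattern_variable
AFTER MATCH SKIP TO pattern_variable
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP TO pattern_variable&lt;/code&gt; is shorthand for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP TO LAST pattern_variable&lt;/code&gt;.&lt;/p&gt;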

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/after-match-skip.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SUBSET&lt;/code&gt; clause is where &lt;em&gt;union pattern variables&lt;/em&gt; are defined. They
are a concise way to refer to a group of primary pattern variables:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SUBSET U = (DOWN, UP)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The following expression returns the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt; from the last row
matched either to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOWN&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UP&lt;/code&gt; primary variable:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;LAST(U.price)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;-row-pattern-syntax&quot;&gt;&lt;a name=&quot;pattern-syntax&quot;&gt;&lt;/a&gt; Row pattern syntax&lt;/h2&gt;

&lt;p&gt;The basic element of a row pattern is the primary pattern variable. Other syntax
components include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concatenation&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;A B C
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Alternation&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;A | B | C
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Permutation&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PERMUTE(A, B, C)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Grouping&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;(A B C)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Partition start anchor&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;^
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Partition end anchor&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Empty pattern&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Exclusion syntax&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{- row_pattern -}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Exclusion syntax is useful in combination with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL ROWS PER MATCH&lt;/code&gt; option.
If you find some sections of the match uninteresting, you can wrap them in the
exclusion, and they are dropped from the output.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/exclusion.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantifiers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Row pattern syntax supports all kinds of quantifiers: the basic ones &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt;, and others, which let you specify the exact number of repetitions, or the
accepted range: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{n}&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{n, m}&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{n,}&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{,n}&lt;/code&gt;. Make sure you don’t confuse
those:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{n}&lt;/code&gt; is for exactly n repetitions,&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{n,}&lt;/code&gt; is equal to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{n, ∞}&lt;/code&gt;,&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{,n}&lt;/code&gt; is equal to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{0, n}&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quantifiers are greedy by default, meaning that they prefer a higher number of
repetitions over a lower one. If you want it the other way, you can change a
quantifier to reluctant by appending &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt; immediately after it. So, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(pattern)?&lt;/code&gt;
prefers a single match of the pattern, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(pattern)??&lt;/code&gt; would rather omit
the pattern altogether.&lt;/p&gt;
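
&lt;p&gt;For example, in the following pattern (with hypothetical variables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;,
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt;), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A+&lt;/code&gt; greedily matches one or more rows, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B*?&lt;/code&gt; reluctantly matches
zero or more rows, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C{2,4}&lt;/code&gt; matches between two and four rows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PATTERN (A+ B*? C{2,4})
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;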

&lt;h3 id=&quot;match-preference&quot;&gt;Match preference&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; produces at most one match starting from any
specific row. If multiple matches are possible, the winner is chosen based
on the order of preference. The greedy and reluctant quantifiers are one
example of preference. Other pattern components have their own rules:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;pattern alternation prefers the left-hand components to the right-hand ones.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;pattern permutation is equivalent to alternation of all permutations of its
components. If multiple matches are possible, the match is chosen based on the
lexicographical order established by the order of components in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PERMUTE&lt;/code&gt;
list. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PERMUTE(A, B, C)&lt;/code&gt;, the preference of options goes as follows:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A B C&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A C B&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B A C&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B C A&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C A B&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C B A&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;-expressions-for-special-tasks&quot;&gt;&lt;a name=&quot;expressions&quot;&gt;&lt;/a&gt; Expressions for special tasks&lt;/h2&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; clause provides special expression syntax, available in
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEASURES&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFINE&lt;/code&gt; clauses. Its purpose is to combine the input data
with the information about the match. The syntax includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Pattern variable references&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They allow referring to certain components of the match, for example
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOWN.price&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UP.order_date&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Logical navigation operations: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LAST&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FIRST&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They allow you to navigate over the rows of a match based on the pattern
variables assigned to them. For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LAST(DOWN.price, 3)&lt;/code&gt; navigates to the
last row labeled as “DOWN”, goes three occurrences of the “DOWN” label
backwards, and gets the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt; value from that row. The default offset is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt;:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LAST(DOWN.price)&lt;/code&gt; gets the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt; value from the last row labeled as “DOWN”.
If the logical navigation goes beyond the match bounds, the operation returns
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Physical navigation operations: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PREV&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NEXT&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They let you navigate over the rows of the partition by a specified offset.
Physical navigations use logical navigations as the starting point. For
example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NEXT(DOWN.price, 5)&lt;/code&gt; first navigates to the last row labeled as
“DOWN”. Starting from there, it goes five rows forward and gets the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt;
value from that row. In the preceding example, the logical navigation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LAST&lt;/code&gt; is
implicit, but you can specify the nested logical navigation explicitly, for
example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NEXT(FIRST(DOWN.price, 4), 5)&lt;/code&gt;. The default offset is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt;, which means
that by default &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PREV&lt;/code&gt; goes one row backward, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NEXT&lt;/code&gt; one row
forward.&lt;/p&gt;

&lt;p&gt;The physical navigation can retrieve values beyond the match bounds. It gives
you great flexibility. For example, the defining conditions of pattern
variables can peek at the values ahead. Also, when computing row pattern
measures, you can refer to the wider context of the match.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLASSIFIER&lt;/code&gt; function&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It returns the primary pattern variable associated with the row.&lt;/p&gt;
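
&lt;p&gt;Combined with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL ROWS PER MATCH&lt;/code&gt;, it makes a handy per-row measure showing how
each row was labeled (the output name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;matched_variable&lt;/code&gt; is arbitrary):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;MEASURES CLASSIFIER() AS matched_variable
ALL ROWS PER MATCH
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;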

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_NUMBER&lt;/code&gt; function&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It returns the sequential number of the match within the partition.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RUNNING&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FINAL&lt;/code&gt; keywords&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The expressions in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFINE&lt;/code&gt; clause are evaluated when the pattern matching
is in progress. At each step, the engine only knows a part of the match. This
is the &lt;em&gt;running semantics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The expressions of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEASURES&lt;/code&gt; clause are evaluated when the match is
complete. The engine can see the whole match from the position of the final
row. This is the &lt;em&gt;final semantics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;However, with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL ROWS PER MATCH&lt;/code&gt; option, when the match result is
processed row by row, you can choose either approach to compute the measures.
To do that, you can specify the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RUNNING&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FINAL&lt;/code&gt; keyword before the logical
navigation operation, for example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RUNNING LAST(DOWN.price)&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FINAL LAST(DOWN.price)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;running semantics&lt;/em&gt; is the default both in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFINE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEASURES&lt;/code&gt;
clauses. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FINAL&lt;/code&gt; only applies to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEASURES&lt;/code&gt; clause.&lt;/p&gt;
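
&lt;p&gt;For example, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL ROWS PER MATCH&lt;/code&gt;, the first measure below changes from
row to row as the match grows, while the second reports the same final value in
every output row of the match (the output names are arbitrary):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;MEASURES
    RUNNING LAST(DOWN.price) AS lowest_so_far,
    FINAL LAST(DOWN.price) AS bottom_price
ALL ROWS PER MATCH
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;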

&lt;p&gt;To sum up, here’s one complex measure expression combining different elements
of the special syntax:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/match-recognize/measure-example.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;trino-cli-show-off-time&quot;&gt;Trino CLI show-off time!&lt;/h2&gt;

&lt;p&gt;Now, let’s see the whole machinery come to life. This is the same example data
that we used before, and the same goal: detect a “V”-shape of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;price&lt;/code&gt;
values over time for different customers.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;trino&amp;gt; WITH orders(customer_id, order_date, price) AS (VALUES
    (&apos;cust_1&apos;, DATE &apos;2020-05-11&apos;, 100),
    (&apos;cust_1&apos;, DATE &apos;2020-05-12&apos;, 200),
    (&apos;cust_2&apos;, DATE &apos;2020-05-13&apos;,   8),
    (&apos;cust_1&apos;, DATE &apos;2020-05-14&apos;, 100),
    (&apos;cust_2&apos;, DATE &apos;2020-05-15&apos;,   4),
    (&apos;cust_1&apos;, DATE &apos;2020-05-16&apos;,  50),
    (&apos;cust_1&apos;, DATE &apos;2020-05-17&apos;, 100),
    (&apos;cust_2&apos;, DATE &apos;2020-05-18&apos;,   6))
SELECT customer_id, start_price, bottom_price, final_price, start_date, final_date
    FROM orders
        MATCH_RECOGNIZE (
            PARTITION BY customer_id
            ORDER BY order_date
            MEASURES
                START.price AS start_price,
                LAST(DOWN.price) AS bottom_price,
                LAST(UP.price) AS final_price,
                START.order_date AS start_date,
                LAST(UP.order_date) AS final_date
            ONE ROW PER MATCH
            AFTER MATCH SKIP PAST LAST ROW
            PATTERN (START DOWN+ UP+)
            DEFINE
                DOWN AS price &amp;lt; PREV(price),
                UP AS price &amp;gt; PREV(price)
            );

 customer_id | start_price | bottom_price | final_price | start_date | final_date
-------------+-------------+--------------+-------------+------------+------------
 cust_1      |         200 |           50 |         100 | 2020-05-12 | 2020-05-17
 cust_2      |           8 |            4 |           6 | 2020-05-13 | 2020-05-18
(2 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Two matches are detected, one for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cust_1&lt;/code&gt;, and one for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cust_2&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;empty-matches-explained&quot;&gt;Empty matches explained&lt;/h2&gt;

&lt;p&gt;An empty match is a legitimate result of row pattern recognition. Different
pattern constructs can result in an empty match. The empty pattern syntax &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;()&lt;/code&gt;
is the trivial one. An empty match can also result from quantification, such as
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A*&lt;/code&gt;, or alternation, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A | ()&lt;/code&gt;.&lt;/p&gt;
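
&lt;p&gt;For example, this pattern (a sketch reusing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOWN&lt;/code&gt; definition from earlier)
can produce an empty match: at a row where the price does not drop, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOWN*&lt;/code&gt;
still succeeds by matching zero rows.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PATTERN (DOWN*)
DEFINE DOWN AS price &amp;lt; PREV(price)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;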

&lt;p&gt;An empty match does not consume any input rows, but like every match, it is
associated with a row, called the &lt;em&gt;starting row&lt;/em&gt;. That is the row at which the
pattern matching started. Note that if the pattern allows an empty match, it
guarantees that no rows remain unmatched. Also, an empty match, as well as
non-empty matches, gets a sequential number, which can be retrieved by the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_NUMBER&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;Depending on your use case, you can consider empty matches informative or just
see them as a leftover of the algorithm.&lt;/p&gt;

&lt;p&gt;There’s one more thing linked to empty matches. Some patterns have the
dangerous potential of looping endlessly over a piece that doesn’t consume any
rows. It doesn’t have to be as explicit as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;()*&lt;/code&gt;. There are complex patterns
that don’t show their looping potential at first glance. We handled them
carefully so that you never have to waste your time on looping queries.&lt;/p&gt;

&lt;h2 id=&quot;in-a-few-words-whats-so-cool-about-row-pattern-matching&quot;&gt;In a few words, what’s so cool about row pattern matching?&lt;/h2&gt;

&lt;p&gt;From the SQL viewpoint, you can think of row pattern matching as extended
window functions. Window functions allow you to capture some dependencies in
rows of data based on their relative position or value. Row pattern matching
allows you to detect arbitrarily complicated dependencies, based not only on
the input values but also on the details of the actual match and on the match
number.&lt;/p&gt;

&lt;p&gt;Before the introduction of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt;, you had to feed your data to
external tools to reason about trends and patterns. Now, you can achieve it
directly in your query, and even build your query upon the pattern recognition
clause to further process the match results.&lt;/p&gt;

&lt;p&gt;Row pattern matching is typically used:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;in trade applications for tracking trends or identifying customers with
specific behavioral patterns,&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;in shipping applications for tracking packages through all possible valid
paths,&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;in financial applications for detecting unusual incidents, which might signal
fraud.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s your use case?&lt;/p&gt;

&lt;p&gt;I hope you enjoy Trino’s new feature. Refer to
&lt;a href=&quot;https://trino.io/docs/current/sql/match-recognize.html&quot;&gt;Trino docs&lt;/a&gt; for even
more details, examples and usage tips. &lt;a href=&quot;/slack.html&quot;&gt;Please &lt;strong&gt;do&lt;/strong&gt; reach out to us with any
questions or issues&lt;/a&gt;. We plan to support row pattern matching in
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause soon, so stay tuned!&lt;/p&gt;</content>

      
        <author>
          <name>Kasia Findeisen (kasiafi)</name>
        </author>
      

      <summary>The MATCH_RECOGNIZE syntax was introduced in the latest SQL specification of 2016. It is a super powerful tool for analyzing trends in your data. We are proud to announce that Trino supports this great feature since version 356. With MATCH_RECOGNIZE, you can define a pattern using the well-known regular expression syntax, and match it to a set of rows. Upon finding a matching row sequence, you can retrieve all kinds of detailed or summary information about the match, and pass it on to be processed by the subsequent parts of your query. This is a new level of what a pure SQL statement can do. This blog post gives you a taste of row pattern matching capabilities, and a quick overview of the MATCH_RECOGNIZE syntax.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino on ice I: A gentle introduction To Iceberg</title>
      <link href="https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg.html" rel="alternate" type="text/html" title="Trino on ice I: A gentle introduction To Iceberg" />
      <published>2021-05-03T00:00:00+00:00</published>
      <updated>2021-05-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg</id>
      <content type="html" xml:base="https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg.html">&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;100%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/trino-iceberg.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Welcome to the Trino on ice series, covering the details around how the Iceberg
table format works with the Trino query engine. The examples build on each
previous post, so it’s recommended to read the posts sequentially and reference
them as needed later. Here are links to the posts in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/05/03/a-gentle-introduction-to-iceberg.html&quot;&gt;Trino on ice I: A gentle introduction to Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/12/in-place-table-evolution-and-cloud-compatibility-with-iceberg.html&quot;&gt;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/07/30/iceberg-concurrency-snapshots-spec.html&quot;&gt;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2021/08/12/deep-dive-into-iceberg-internals.html&quot;&gt;Trino on ice IV: Deep dive into Iceberg internals&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Back in the &lt;a href=&quot;/blog/2020/10/20/intro-to-hive-connector.html&quot;&gt;Gentle introduction to the Hive connector&lt;/a&gt; 
blog post, I discussed a commonly misunderstood architecture and uses of the 
Trino Hive connector. In short, while some may think the name indicates Trino 
makes a call to a running Hive instance, the Hive connector does not use the 
Hive runtime to answer queries. Instead, the connector is named Hive connector 
because it relies on Hive conventions and implementation details from the Hadoop
ecosystem - the invisible Hive specification.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;I call this specification invisible because it doesn’t exist. It lives in the 
Hive code and the minds of those who developed it. This makes it very 
difficult for anybody else who has to integrate with any distributed object 
storage that uses Hive, since they have to rely on reverse engineering and 
keeping up with the changes. The way you interact with Hive changes based on 
&lt;a href=&quot;https://medium.com/hashmapinc/four-steps-for-migrating-from-hive-2-x-to-3-x-e85a8363a18&quot;&gt;which version of Hive or Hadoop&lt;/a&gt; 
you are running. It also varies depending on whether you run in the cloud or over an object store.
Spark has even &lt;a href=&quot;https://spark.apache.org/docs/2.4.4/sql-migration-guide-hive-compatibility.html&quot;&gt;modified the Hive spec&lt;/a&gt;
in some ways to fit the Hive model to their use cases. It’s a big mess that data 
engineers have put up with for years. Yet despite the confusion and lack of 
organization due to Hive’s number of unwritten assumptions, the Hive connector 
is the most popular connector in use for Trino. Virtually every big data query 
engine uses the Hive model today in some form. As a result it is used by 
numerous companies to store and access data in their data lakes.&lt;/p&gt;

&lt;p&gt;So how did something with no specification become so ubiquitous in data lakes? 
Hive was first in the large object storage and big data world as part of Hadoop.
Hadoop became popular through good marketing that positioned it as the solution
to the explosion of data during the Web 2.0 boom. Of course, Hive didn’t
get everything wrong. In fact, without Hive, and the fact that it is open 
source, there may not have been a unified specification at all. Despite the many
hours data engineers have spent bashing their heads against the wall with all 
the unintended consequences of Hive, it still served a very useful purpose.&lt;/p&gt;

&lt;p&gt;So why did I just rant about Hive for so long if I’m here to tell you about 
&lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg&lt;/a&gt;? It’s impossible for a teenager 
growing up today to truly appreciate music streaming services without knowing 
what it was like to have an iPod with limited storage, or listening to a 
scratched burnt CD that skips, or flipping your tape or record to side-B. Just
as anyone born before the turn of the millennium truly appreciates streaming
services, you too will appreciate Iceberg once you’ve learned the 
intricacies of managing a data lake built on Hive and Hadoop.&lt;/p&gt;

&lt;p&gt;If you haven’t used Hive before, this blog post outlines just a few pain points 
that come with this data warehousing software to give you proper context. If you
have already lived through these headaches, this post acts as a guide for moving
from Hive to Iceberg. This post is the first in a series of blog posts discussing
Apache Iceberg in great detail, through the lens of the Trino query engine user.
If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that 
houses the founding Presto community after the 
&lt;a href=&quot;https://trino.io/blog/2020/12/27/announcing-trino.html&quot;&gt;founders of Presto left Facebook&lt;/a&gt;.
This and the next couple of posts discuss the Iceberg specification and all
the features Iceberg has to offer, often in comparison with Hive.&lt;/p&gt;

&lt;p&gt;Before jumping into the comparisons, what is Iceberg exactly? The first thing to
understand is that Iceberg is not a file format, but a table format. That
distinction may not mean much on its own, but the function of a table format 
becomes clearer as the improvements Iceberg brings over the Hive table 
standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet,
but is the layer between the query engine and the data. Iceberg maps and indexes
the files in order to provide a higher-level abstraction that handles the 
relational table format for data lakes. You will understand more about table 
formats through examples in this series.&lt;/p&gt;

&lt;h2 id=&quot;hidden-partitions&quot;&gt;Hidden Partitions&lt;/h2&gt;

&lt;h3 id=&quot;hive-partitions&quot;&gt;Hive Partitions&lt;/h3&gt;

&lt;p&gt;Since most developers and users interact with the table format via the query 
language, a noticeable difference is the flexibility you have while creating a 
partitioned table. Assume you are trying to create a table for tracking events 
occurring in your system. You run both sets of SQL commands from Trino, just 
using the Hive and Iceberg connectors, which are designated by the catalog name 
(i.e. a catalog name starting with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.&lt;/code&gt; uses the Hive connector, while one starting with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;iceberg.&lt;/code&gt; uses the Iceberg connector). To begin with, the first DDL 
statement attempts to create an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; table in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;logging&lt;/code&gt; schema in the 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive&lt;/code&gt; catalog, which is configured to use the Hive connector. It also 
attempts to partition the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; table on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; field, which is a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TIMESTAMP&lt;/code&gt; field.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &apos;ORC&apos;,
  partitioned_by = ARRAY[&apos;event_time&apos;]
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running this in Trino using the Hive connector produces the following error message.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Partition keys must be the last columns in the table and in the same order as the table properties: [event_time]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Hive DDL is very dependent on column ordering, specifically for 
partition columns. Partition fields must occupy the final column 
positions, in the same order as the partitioning declared in the DDL statement.
The next statement attempts to create the same table, but now with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; field 
moved to the last column position.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE TABLE hive.logging.events (
  level VARCHAR,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time TIMESTAMP
) WITH (
  format = &apos;ORC&apos;,
  partitioned_by = ARRAY[&apos;event_time&apos;]
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This time, the DDL command works successfully, but you likely don’t want to
partition your data on the plain timestamp. Doing so results in a separate 
partition, and therefore at least one file, for each distinct timestamp value in
your table (likely close to one file per event). In Hive, there’s no native way 
to indicate the time granularity at which you want to partition. The workaround
in Hive is to create a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt; column, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time_day&lt;/code&gt;, derived from the 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; column, to hold the date partition value.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time_day VARCHAR
) WITH (
  format = &apos;ORC&apos;,
  partitioned_by = ARRAY[&apos;event_time_day&apos;]
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This method wastes space by adding a new column to your table. Even worse,
it puts the burden of knowledge on the user to include this new column for 
writing data. It is then necessary to use that separate column for any read 
access to take advantage of the performance gains from the partitioning.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;INSERT INTO hive.logging.events
VALUES
(
  &apos;ERROR&apos;,
  timestamp &apos;2021-04-01 12:00:00.000001&apos;,
  &apos;Oh noes&apos;, 
  ARRAY [&apos;Exception in thread &quot;main&quot; java.lang.NullPointerException&apos;], 
  &apos;2021-04-01&apos;
),
(
  &apos;ERROR&apos;,
  timestamp &apos;2021-04-02 15:55:55.555555&apos;,
  &apos;Double oh noes&apos;,
  ARRAY [&apos;Exception in thread &quot;main&quot; java.lang.NullPointerException&apos;],
  &apos;2021-04-02&apos;
),
(
  &apos;WARN&apos;, 
  timestamp &apos;2021-04-02 00:00:11.1122222&apos;,
  &apos;Maybeh oh noes?&apos;,
  ARRAY [&apos;Bad things could be happening??&apos;], 
  &apos;2021-04-02&apos;
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that the partition value in the last column, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;2021-04-01&apos;&lt;/code&gt;, has to match the date of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TIMESTAMP&lt;/code&gt; 
during insertion. Hive performs no validation to make sure this is the case, 
because it only requires a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt; and partitions on whatever 
distinct values it receives.&lt;/p&gt;
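
&lt;p&gt;For example, nothing stops an insert from writing a partition value that
disagrees with the timestamp. The following is a hypothetical statement (the
row contents are invented for illustration); Hive happily files the row under
the wrong day:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;INSERT INTO hive.logging.events
VALUES
(
  &apos;ERROR&apos;,
  timestamp &apos;2021-04-03 08:00:00.000000&apos;,
  &apos;Misfiled oh noes&apos;,
  ARRAY [&apos;Exception in thread &quot;main&quot; java.lang.NullPointerException&apos;],
  &apos;2021-04-01&apos;
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A query filtering on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time_day = &apos;2021-04-03&apos;&lt;/code&gt; never finds this row, even though its
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; falls on that day.&lt;/p&gt;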

&lt;p&gt;On the other hand, if a user runs the following query:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT *
FROM hive.logging.events
WHERE event_time &amp;lt; timestamp &apos;2021-04-02&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;they get the correct results back, but have to scan all the data in the table:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;event_time&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
      &lt;th&gt;call_stack&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;2021-04-01 12:00:00&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
      &lt;td&gt;Exception in thread “main” java.lang.NullPointerException&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This happens because the user forgot to include the 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time_day &amp;lt; &apos;2021-04-02&apos;&lt;/code&gt; predicate in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; 
clause. This eliminates all the benefits that led us to create the partition in
the first place, and yet users of these tables frequently miss it.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT *
FROM hive.logging.events
WHERE event_time &amp;lt; timestamp &apos;2021-04-02&apos; 
AND event_time_day &amp;lt; &apos;2021-04-02&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;event_time&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
      &lt;th&gt;call_stack&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;2021-04-01 12:00:00&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
      &lt;td&gt;Exception in thread “main” java.lang.NullPointerException&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;iceberg-partitions&quot;&gt;Iceberg Partitions&lt;/h3&gt;

&lt;p&gt;The following DDL statement illustrates how these issues are handled in Iceberg
via the Trino Iceberg connector.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  partitioning = ARRAY[&apos;day(event_time)&apos;]
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Take note of a few things. First, notice that the partition on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; 
column is defined without having to move the column to the last position. There 
is also no need to create a separate field to handle the daily partition on the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; field. The &lt;em&gt;&lt;strong&gt;partition specification&lt;/strong&gt;&lt;/em&gt; is maintained internally
by Iceberg, and neither the writer nor the reader of this table needs to know 
anything about it to take advantage of it. This concept
is called &lt;em&gt;&lt;strong&gt;hidden partitioning&lt;/strong&gt;&lt;/em&gt;, where only the table creator/maintainer 
has to know the &lt;em&gt;&lt;strong&gt;partition specification&lt;/strong&gt;&lt;/em&gt;. Here is what the insert 
statement looks like now:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;INSERT INTO iceberg.logging.events
VALUES
(
  &apos;ERROR&apos;,
  timestamp &apos;2021-04-01 12:00:00.000001&apos;,
  &apos;Oh noes&apos;, 
  ARRAY [&apos;Exception in thread &quot;main&quot; java.lang.NullPointerException&apos;]
),
(
  &apos;ERROR&apos;,
  timestamp &apos;2021-04-02 15:55:55.555555&apos;,
  &apos;Double oh noes&apos;,
  ARRAY [&apos;Exception in thread &quot;main&quot; java.lang.NullPointerException&apos;]
),
(
  &apos;WARN&apos;, 
  timestamp &apos;2021-04-02 00:00:11.1122222&apos;,
  &apos;Maybeh oh noes?&apos;,
  ARRAY [&apos;Bad things could be happening??&apos;]
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt; dates are no longer needed. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; field is 
internally converted to the proper partition value for each row. Also,
notice that the same query that ran in Hive returns the same results. The big 
difference is that no extra predicate is needed: filtering on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time&lt;/code&gt; alone both 
prunes partitions and filters the results.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT *
FROM iceberg.logging.events
WHERE event_time &amp;lt; timestamp &apos;2021-04-02&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;level&lt;/th&gt;
      &lt;th&gt;event_time&lt;/th&gt;
      &lt;th&gt;message&lt;/th&gt;
      &lt;th&gt;call_stack&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ERROR&lt;/td&gt;
      &lt;td&gt;2021-04-01 12:00:00&lt;/td&gt;
      &lt;td&gt;Oh noes&lt;/td&gt;
      &lt;td&gt;Exception in thread “main” java.lang.NullPointerException&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
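
&lt;p&gt;If you are curious how Iceberg bucketed the rows, the Trino Iceberg connector
exposes a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$partitions&lt;/code&gt; metadata table for each table. As a sketch (the exact
output columns depend on your Trino version):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT * FROM iceberg.logging.&quot;events$partitions&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This should show one row per daily partition, along with row and file counts,
without any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_time_day&lt;/code&gt; column ever existing in the table itself.&lt;/p&gt;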

&lt;p&gt;So hopefully that gives you a glimpse into what a table format and specification
are, and why Iceberg is such a wonderful improvement over the existing and 
outdated method of storing your data in your data lake. While this post covers
a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
 &lt;img align=&quot;center&quot; width=&quot;50%&quot; height=&quot;100%&quot; src=&quot;/assets/blog/trino-on-ice/see_myself_out.gif&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;If you want to play around with Iceberg using Trino, check out the 
&lt;a href=&quot;https://trino.io/docs/current/connector/iceberg.html&quot;&gt;Trino Iceberg docs&lt;/a&gt;.
The next post covers how table evolution works in Iceberg, as well as how 
Iceberg is an improved storage format for cloud storage.&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Welcome to the Trino on ice series, covering the details around how the Iceberg table format works with the Trino query engine. The examples build on each previous post, so it’s recommended to read the posts sequentially and reference them as needed later. Here are links to the posts in this series: Trino on ice I: A gentle introduction to Iceberg Trino on ice II: In-place table evolution and cloud compatibility with Iceberg Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec Trino on ice IV: Deep dive into Iceberg internals Back in the Gentle introduction to the Hive connector blog post, I discussed a commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, the connector is named Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem - the invisible Hive specification.</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino: The Definitive Guide</title>
      <link href="https://trino.io/blog/2021/04/21/the-definitive-guide.html" rel="alternate" type="text/html" title="Trino: The Definitive Guide" />
      <published>2021-04-21T00:00:00+00:00</published>
      <updated>2021-04-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/04/21/the-definitive-guide</id>
      <content type="html" xml:base="https://trino.io/blog/2021/04/21/the-definitive-guide.html">&lt;p&gt;Just over a year ago we &lt;a href=&quot;https://trino.io/blog/2020/04/11/the-definitive-guide.html&quot;&gt;announced the availability of the first book about
Trino&lt;/a&gt; - our
definitive guide. Back then the project was still called Presto, and the rename
with the end of 2020 was a good reason for us to give the book a refresh.&lt;/p&gt;

&lt;p&gt;Today, we are happy to announce that a new edition now titled &lt;strong&gt;Trino: The
Definitive Guide&lt;/strong&gt; is available.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;get-a-free-copy-of-trino-the-definitive-guide-from-starburst-now&quot;&gt;&lt;a href=&quot;https://www.starburst.io/info/oreilly-trino-guide/&quot;&gt;Get a free copy of Trino: The Definitive Guide&lt;/a&gt; from &lt;a href=&quot;https://www.starburst.io&quot;&gt;Starburst&lt;/a&gt; now!&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/ttdg-cover.png&quot; align=&quot;right&quot; style=&quot;float: right; margin-left: 20px; margin-bottom: 20px; width: 100%; max-width: 350px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The new edition of the book from O’Reilly is available in digital formats
as well as physical copies. You can find more information about the book on &lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;our
permanent page about it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The book is now updated to Trino release 354 for all filenames, installation
methods, commands, names, and properties. We also addressed all problems our
readers found and reported to us.&lt;/p&gt;

&lt;p&gt;Our major supporter, &lt;a href=&quot;https://www.starburst.io&quot;&gt;Starburst&lt;/a&gt;, allowed us to work
on the book and bring it across the finish line again. You can get a
&lt;a href=&quot;https://www.starburst.io/info/oreilly-trino-guide/&quot;&gt;free digital copy from Starburst&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So what are you waiting for? Go get a copy, check out the &lt;a href=&quot;https://github.com/trinodb/trino-the-definitive-guide&quot;&gt;updated example code
repository&lt;/a&gt;,
provide feedback and contact us on &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Looking forward to it all!&lt;/p&gt;

&lt;p&gt;Matt, Manfred and Martin&lt;/p&gt;</content>

      
        <author>
          <name>Matt Fuller, Manfred Moser and Martin Traverso</name>
        </author>
      

      <summary>Just over a year ago we announced the availability of the first book about Trino - our definitive guide. Back then the project was still called Presto, and the rename with the end of 2020 was a good reason for us to give the book a refresh. Today, we are happy to announce that a new edition now titled Trino: The Definitive Guide is available. Get a free copy of Trino: The Definitive Guide from Starburst now!</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino at Writing Day</title>
      <link href="https://trino.io/blog/2021/04/14/wtd-writing-day.html" rel="alternate" type="text/html" title="Trino at Writing Day" />
      <published>2021-04-14T00:00:00+00:00</published>
      <updated>2021-04-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/04/14/wtd-writing-day</id>
      <content type="html" xml:base="https://trino.io/blog/2021/04/14/wtd-writing-day.html">&lt;p&gt;First time Trino blogger, long time lurker on the Trino slack. My name is 
&lt;a href=&quot;https://twitter.com/ZelWms&quot;&gt;Rose Williams&lt;/a&gt; and I’m an open source docs enthusiast! 
I’ve had the pleasure of contributing to this community for the past few months. 
Recently I’ve been working with &lt;a href=&quot;https://twitter.com/bitsondatadev&quot;&gt;Brian Olsen&lt;/a&gt;, our fearless 
developer advocate, as well as some of our other Trino doc contributors, to get 
Trino ready for the Write the Docs &lt;a href=&quot;https://www.writethedocs.org/conf/portland/2021/writing-day/&quot;&gt;Writing Day&lt;/a&gt; open source event!&lt;/p&gt;

&lt;p&gt;If you’re not familiar with &lt;a href=&quot;https://www.writethedocs.org&quot;&gt;Write the Docs&lt;/a&gt;, it’s
a global community of people who care about documentation.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“We consider everyone who cares about communication, documentation, and their
users to be a member of our community. This can be programmers, tech writers,
developer advocates, customer support, marketers, and anyone else who wants
people to have great experiences with software.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;https://www.writethedocs.org/conf/portland/2021/writing-day/&quot;&gt;Writing Day&lt;/a&gt; is
the first day of their upcoming virtual documentation conference, &lt;a href=&quot;https://www.writethedocs.org/conf/portland/2021/&quot;&gt;Write the
Docs Portland (PST)&lt;/a&gt; April
25-27, 2021. The goal of Writing Day is to get a bunch of interesting people in
a room together and introduce them to cool open source projects that they can
onboard and contribute to.&lt;/p&gt;

&lt;p&gt;Writing Day is open to all conference attendees and several Trino enthusiasts are
attending as mentors. Leading up to the conference, we’re focused on identifying
docs issues that are ideal for first time contributors. If you’re a regular
Trino contributor, you might notice that we’re going through and tagging items
as “good first issue” and “docs” - we’ll be using those tags to create an 
&lt;a href=&quot;https://github.com/trinodb/trino/issues?q=is%3Aopen+label%3Adocs+label%3A%22good+first+issue%22&quot;&gt;issues filter&lt;/a&gt; 
for the event. We’re also doing some work on the Trino docs readme to
help folks onboard faster.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.writethedocs.org/conf/portland/2021/tickets/&quot;&gt;Snag a ticket&lt;/a&gt; if
you’re interested in participating; we hope to see you there! Our goal is to
continue curating good first issues for future writers and developers.&lt;/p&gt;

&lt;p&gt;Join the new &lt;a href=&quot;https://trinodb.slack.com/messages/C01TEP0HJTH&quot;&gt;#documentation channel&lt;/a&gt; 
on the &lt;a href=&quot;./slack.html&quot;&gt;Trino slack&lt;/a&gt; and 
&lt;a href=&quot;https://github.com/trinodb/trino/stargazers&quot;&gt;favorite the Trino project&lt;/a&gt; on GitHub.&lt;/p&gt;

&lt;p&gt;If you’re interested in learning more about &lt;a href=&quot;https://www.writethedocs.org&quot;&gt;Write the Docs&lt;/a&gt; 
or &lt;a href=&quot;https://www.writethedocs.org/conf/portland/2021/writing-day/&quot;&gt;Writing Day&lt;/a&gt;, 
feel free to reach out to me (&lt;a href=&quot;https://twitter.com/ZelWms&quot;&gt;Rose Williams&lt;/a&gt;), 
&lt;a href=&quot;https://twitter.com/bitsondatadev&quot;&gt;Brian Olsen&lt;/a&gt;, or 
&lt;a href=&quot;https://twitter.com/mosabua&quot;&gt;Manfred Moser&lt;/a&gt; on twitter or the &lt;a href=&quot;./slack.html&quot;&gt;Trino slack&lt;/a&gt;. You 
can also check out the Write the Docs &lt;a href=&quot;https://www.writethedocs.org/slack/&quot;&gt;slack community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have an open source project that you’re interested in bringing to Writing
Day, chat with me, &lt;a href=&quot;https://twitter.com/ZelWms&quot;&gt;Rose Williams&lt;/a&gt;, on twitter or 
on the Trino or Write the Docs slack communities.&lt;/p&gt;</content>

      
        <author>
          <name>Rose Williams (she/her)</name>
        </author>
      

      <summary>First time Trino blogger, long time lurker on the Trino slack. My name is Rose Williams and I’m an open source docs enthusiast! I’ve had the pleasure of contributing to this community for the past few months. Recently I’ve been working with Brian Olsen, our fearless developer advocate, as well as some of our other Trino doc contributors, to get Trino ready for the Write the Docs Writing Day open source event!</summary>

      
      
    </entry>
  
    <entry>
      <title>Introducing new window features</title>
      <link href="https://trino.io/blog/2021/03/10/introducing-new-window-features.html" rel="alternate" type="text/html" title="Introducing new window features" />
      <published>2021-03-10T00:00:00+00:00</published>
      <updated>2021-03-10T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/03/10/introducing-new-window-features</id>
      <content type="html" xml:base="https://trino.io/blog/2021/03/10/introducing-new-window-features.html">&lt;p&gt;In Trino, we are thrilled to get feedback and feature requests from our
fantastic community, and we’re tirelessly motivated to meet the expectations!
The SQL specification is another source of inspiration. From time to time, we
go through those encrypted scrolls to give you a new feature that you didn’t
even know you needed!&lt;/p&gt;

&lt;p&gt;Recently, there was a push in Trino to extend support for window functions.
In this post, we explain the complexities of window functions, and describe a
couple of our recent additions. If “window” doesn’t sound familiar, read on.
Already a window expert? Skip to &lt;a href=&quot;#new features&quot;&gt;what’s new&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A window is the structure you run your window function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OVER&lt;/code&gt;. It has three
components:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;partitioning&lt;/li&gt;
  &lt;li&gt;ordering&lt;/li&gt;
  &lt;li&gt;frame&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You use partitioning to break your input data into independent chunks. Ordering
is to order rows within the partition. And frame is a kind of “sliding window”.
For every processed row, the frame encloses a certain portion of the sorted
partition. Your window function processes this portion and yields the result
for the row.&lt;/p&gt;

&lt;p&gt;A “running average” is one simple example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT avg(totalprice) OVER (
    PARTITION BY custkey
    ORDER BY orderdate
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM orders
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For a particular customer identified by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;custkey&lt;/code&gt;, it sorts their orders by
date and computes a sequence of average prices since the beginning up to each
consecutive entry. The window frame for a row includes all rows from the start
up to and including that row.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/running-average.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;According to standard SQL, there are three ways to specify the frame. The first way
is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS&lt;/code&gt; (like in the example). With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS&lt;/code&gt;, you can specify frame bounds by a
physical offset from the current row. While &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW&lt;/code&gt; means “between the beginning of the partition and the current
row”, you can also specify precisely where the frame starts and ends, for
example with: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS BETWEEN 10 PRECEDING AND 5 FOLLOWING&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt; is a more complicated way of defining frame on ordered data. It does
not rely on physical offset (in rows), but on logical offset (in value). That
is, the frame includes rows where the value is within a certain range from the
value in the current row.&lt;/p&gt;

&lt;p&gt;Until recently, Trino only supported &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt; in limited cases.
You could use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE UNBOUNDED PRECEDING&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT ROW&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNBOUNDED
FOLLOWING&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNBOUNDED PRECEDING&lt;/code&gt; includes all rows since the partition start,&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNBOUNDED FOLLOWING&lt;/code&gt; includes all rows until the partition end,&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT ROW&lt;/code&gt; is trickier. It includes all rows where values of the sort key
are the same as in the current row. We call them a &lt;em&gt;peer group&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
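
&lt;p&gt;To make the &lt;em&gt;peer group&lt;/em&gt; concrete, here is a small sketch using a made-up
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scores&lt;/code&gt; table. With frame type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt;, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT ROW&lt;/code&gt; frame end pulls in all
peers of the current row, so tied rows get the same running count:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WITH scores(name, points) AS (VALUES
    (&apos;a&apos;, 1), (&apos;b&apos;, 2), (&apos;c&apos;, 2), (&apos;d&apos;, 3))
SELECT
    name,
    points,
    count(*) OVER (
        ORDER BY points
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_count
FROM scores;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both rows with 2 points report a running count of 3, because each one’s frame
extends through its entire peer group; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS&lt;/code&gt; they would report 2 and 3.&lt;/p&gt;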

&lt;p&gt;It’s time to introduce the first new feature:&lt;/p&gt;

&lt;h2 id=&quot;-full-support-for-frame-type-range&quot;&gt;&lt;a name=&quot;new features&quot;&gt;&lt;/a&gt; Full support for frame type RANGE&lt;/h2&gt;

&lt;p&gt;Since &lt;a href=&quot;https://trino.io/docs/current/release/release-346.html&quot;&gt;version 346&lt;/a&gt;, it is
possible to specify &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt; with an offset value. The frame includes all rows
whose value is within this range from the current row.&lt;/p&gt;

&lt;p&gt;Let’s modify our example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT avg(totalprice) OVER (
    PARTITION BY custkey
    ORDER BY orderdate
    RANGE BETWEEN interval &apos;1&apos; month PRECEDING AND CURRENT ROW)
FROM orders
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, for every row, we get the average price from the preceding month. Note that
the offset &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;interval &apos;1&apos; month&lt;/code&gt; applies to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orderdate&lt;/code&gt;, which is the sorting
column.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/running-average-range.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Of course, we don’t have to order by date. The sorting column can be of any
numeric or date/time type, and the offset must be compatible. Also, the offset
doesn’t have to be a literal. It can come from another column of a table or,
generally, it can be any expression, as long as the type matches.&lt;/p&gt;

&lt;p&gt;A frame of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt; does not quite fit in the abstraction of a “sliding
window”. Frames can be bigger or smaller depending not only on the offset
values but also on the actual input data. A long series of similar entries can
produce a huge frame, while a gap in input values can result in an empty frame.&lt;/p&gt;

&lt;p&gt;For illustration, imagine a group of students, and the results of some test they
took. Our table has two columns: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;student_id&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;result&lt;/code&gt;, which is the number
of points. For each student, let’s find how many students did better by 1 to 2
points:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WITH students_results(student_id, result) AS (VALUES
    (&apos;student_1&apos;, 17),
    (&apos;student_2&apos;, 16),
    (&apos;student_3&apos;, 18),
    (&apos;student_4&apos;, 18),
    (&apos;student_5&apos;, 10),
    (&apos;student_6&apos;, 20),
    (&apos;student_7&apos;, 16))
SELECT
    student_id,
    result,
    count(*) OVER (
        ORDER BY result
        RANGE BETWEEN 1 FOLLOWING AND 2 FOLLOWING) AS close_better_scores_count
FROM students_results;

 student_id | result | close_better_scores_count
------------+--------+---------------------------
 student_5  |     10 |                         0
 student_7  |     16 |                         3
 student_2  |     16 |                         3
 student_1  |     17 |                         2
 student_3  |     18 |                         1
 student_4  |     18 |                         1
 student_6  |     20 |                         0
(7 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the frame does not contain the current row. For a particular student,
it only includes students with better results, and not themselves. For the
unfortunate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;student_5&lt;/code&gt;, there are no students with similar test results. The
frame is also empty for the lucky &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;student_6&lt;/code&gt; who scored the most points.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/students-range.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Besides &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt;, there is another way to specify the frame on
ordered data. And yes, Trino supports this mechanism! Let me introduce the
second of our recent additions:&lt;/p&gt;

&lt;h2 id=&quot;support-for-frame-type-groups&quot;&gt;Support for frame type GROUPS&lt;/h2&gt;

&lt;p&gt;This feature, added in
&lt;a href=&quot;https://trino.io/docs/current/release/release-346.html&quot;&gt;version 346&lt;/a&gt;, allows you to
include or exclude the whole &lt;em&gt;peer groups&lt;/em&gt; of rows in ordered data.&lt;/p&gt;

&lt;p&gt;For illustration, let’s consider again the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;students_results&lt;/code&gt; table. For each
student, let’s find the gap between their result and the result of a student (or
students) who did slightly better.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WITH students_results(student_id, result) AS (VALUES
    (&apos;student_1&apos;, 17),
    (&apos;student_2&apos;, 16),
    (&apos;student_3&apos;, 18),
    (&apos;student_4&apos;, 18),
    (&apos;student_5&apos;, 10),
    (&apos;student_6&apos;, 20),
    (&apos;student_7&apos;, 16))
SELECT
    student_id,
    result,
    max(result) OVER (
        ORDER BY result
        GROUPS BETWEEN CURRENT ROW AND 1 FOLLOWING) - result AS gap_till_better_score
FROM students_results;

 student_id | result | gap_till_better_score
------------+--------+-----------------------
 student_5  |     10 |                     6
 student_7  |     16 |                     1
 student_2  |     16 |                     1
 student_1  |     17 |                     1
 student_3  |     18 |                     2
 student_4  |     18 |                     2
 student_6  |     20 |                     0
(7 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The window function returns, for each student, the closest better result. The
frame of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUPS&lt;/code&gt; used here includes all entries equal to the current
entry in terms of points (that is, the student’s &lt;em&gt;peer group&lt;/em&gt;), and the next
group.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/students-groups.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In frames of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUPS&lt;/code&gt;, like in other frame types, the offset doesn’t have
to be constant. It can be any expression, as long as its type is exact numeric
with scale 0. Simply put, we can skip any integer number of groups.&lt;/p&gt;
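&lt;p&gt;As a minimal sketch (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;groups_ahead&lt;/code&gt; column is hypothetical), the
number of peer groups to include could come from the data itself:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT
    student_id,
    result,
    max(result) OVER (
        ORDER BY result
        -- per-row offset: include as many following peer groups as groups_ahead says
        GROUPS BETWEEN CURRENT ROW AND groups_ahead FOLLOWING) AS best_nearby_result
FROM students_results
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;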

&lt;h3 id=&quot;under-the-covers&quot;&gt;Under the covers&lt;/h3&gt;

&lt;p&gt;How do we find the frame bounds efficiently? With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROWS&lt;/code&gt; it’s easy:
we only need to skip a determined number of rows forward or backward.&lt;/p&gt;

&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt;, we need to examine the actual values to see if they fall within
the given range. Our approach is optimized for the case where the offset values
are constant for all rows. Our solution involves caching frame bounds computed
for the preceding row, and using them as the starting point to find frame
bounds for the current row. Ideally, we never have to move the frame bounds
back as we process subsequent rows. In such a case, the amortized cost of frame
bound calculations per row is constant.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/sliding-frame-range.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Our strategy for determining frame bounds for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUPS&lt;/code&gt; is similar. We cache the
frame bounds computed for the preceding row and use them as the starting point
for the current row. If the frame offset is constant, frame bounds slide from
one peer group to another every time the processed row leaves one peer group and
enters the next one.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/sliding-frame-groups.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;support-for-window-clause&quot;&gt;Support for WINDOW clause&lt;/h2&gt;

&lt;p&gt;As all the preceding examples show, a window function is a big chunk of syntax.
What if we wanted to use several window functions over the same window? Say, we
need an average price and a total price from the preceding month. And the top
price. Does it have to look like the query below?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT
    avg(totalprice) OVER (
        PARTITION BY custkey 
        ORDER BY orderdate
        RANGE BETWEEN interval &apos;1&apos; month PRECEDING AND CURRENT ROW),
    sum(totalprice) OVER (
        PARTITION BY custkey 
        ORDER BY orderdate
        RANGE BETWEEN interval &apos;1&apos; month PRECEDING AND CURRENT ROW),
    max(totalprice) OVER (
        PARTITION BY custkey 
        ORDER BY orderdate
        RANGE BETWEEN interval &apos;1&apos; month PRECEDING AND CURRENT ROW)
FROM orders
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Well, no more. Starting with
&lt;a href=&quot;https://trino.io/docs/current/release/release-352.html&quot;&gt;Trino 352&lt;/a&gt;, you can
predefine a window specification, and then reuse or refine it wherever you
need it. This is thanks to the third of our new additions: support for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt;
clause.&lt;/p&gt;

&lt;p&gt;Technically speaking, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause is part of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FROM&lt;/code&gt; clause:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT …
    FROM …
        WHERE …
        GROUP BY …
        HAVING …
        WINDOW …
ORDER BY …
OFFSET …
LIMIT / FETCH …
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause, you can define any number of named windows. Then you
can simply refer to them by their names in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; list or an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt;
clause.&lt;/p&gt;

&lt;p&gt;Let’s check how the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause helps with our example query:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT 
	avg(totalprice) OVER w,
	sum(totalprice) OVER w,
	max(totalprice) OVER w
FROM orders
WINDOW w AS (
    PARTITION BY custkey
    ORDER BY orderdate
    RANGE BETWEEN interval &apos;1&apos; month PRECEDING AND CURRENT ROW)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To be even more concise, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause allows you to define more
specialized windows from existing window definitions:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WINDOW 
	w1 AS (PARTITION BY custkey),
	w2 AS (w1 ORDER BY orderdate),
	w3 AS (w2 RANGE BETWEEN interval &apos;1&apos; month PRECEDING AND CURRENT ROW)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, you can define the window only partially and then complete it
where it’s used:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT 
	avg(totalprice) OVER (w ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS recent_average,
	sum(totalprice) OVER (w ROWS BETWEEN CURRENT ROW AND 10 FOLLOWING) AS next_buys
FROM orders
    WINDOW w AS (PARTITION BY custkey ORDER BY orderdate)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are some ANSI rules, though, that you need to follow when redefining windows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PARTITION BY&lt;/code&gt; is only allowed in the base definition,&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; can only be specified once in the named windows reference chain,&lt;/li&gt;
  &lt;li&gt;the frame can only be specified in the final definition.&lt;/li&gt;
&lt;/ul&gt;
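&lt;p&gt;For example, this hypothetical chain breaks the second rule, because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w2&lt;/code&gt;
tries to specify a second &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt;, so a query using it is rejected:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WINDOW
	w1 AS (PARTITION BY custkey ORDER BY orderdate),
	w2 AS (w1 ORDER BY totalprice) -- invalid: ORDER BY is already specified in w1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;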

&lt;p&gt;In case you wonder: there’s no need to worry if some predefined windows are
eventually unused. Unused windows do not affect the efficiency of your query
execution. Partitioning, sorting, and frame bound computations are costly
operations, so we made sure that unused window parts do not appear in the
query plan.&lt;/p&gt;

&lt;p&gt;There’s one last detail about the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause that needs clarification. The
columns referenced in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause are columns of the input table. In the
following example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;country_code&lt;/code&gt; is clearly a column of the table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;countries&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;... FROM countries WINDOW w AS (ORDER BY country_code)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Obvious enough. So why am I telling you this?&lt;/p&gt;

&lt;p&gt;Window functions can be used in two different clauses of a query, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt;. With the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause, there is a rule that column references
used there refer to the output table rather than the input table. Consider this
query:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WITH countries(country_code) AS (VALUES &apos;pol&apos;, &apos;CAN&apos;, &apos;USA&apos;)
SELECT upper(country_code) AS country_code
    FROM countries
    WINDOW w AS (ORDER BY country_code)
ORDER BY row_number() OVER w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Window &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w&lt;/code&gt; is used in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause. So, does the window’s ordering use
the original &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;country_code&lt;/code&gt; column from the input table, or does it “see” the
uppercased &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;country_code&lt;/code&gt; from the output table?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/country-code.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The SQL spec is clear about it: a column reference in the named window always
refers to the original column, no matter where you use this window. In the
example, the result is ordered according to the original values: lowercase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pol&lt;/code&gt;
after uppercase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;USA&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/window-features/country-code-result.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As expected:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; country_code
--------------
 CAN
 USA
 POL
(3 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And here the story ends. Thanks for your attention! I hope you enjoy Trino’s
new superpowers. In case of questions or issues — &lt;a href=&quot;/slack.html&quot;&gt;you
know where to find us&lt;/a&gt;. More goodies are on the way, so stay tuned! How
about regex matching on tables?&lt;/p&gt;</content>

      
        <author>
          <name>Kasia Findeisen (kasiafi)</name>
        </author>
      

      <summary>In Trino, we are thrilled to get feedback and feature requests from our fantastic community, and we’re tirelessly motivated to meet the expectations! The SQL specification is another source of inspiration. From time to time, we go through those encrypted scrolls to give you a new feature that you didn’t even know you needed!</summary>

      
      
    </entry>
  
    <entry>
      <title>Trino in 2020 - An amazing year in review</title>
      <link href="https://trino.io/blog/2021/01/08/2020-review.html" rel="alternate" type="text/html" title="Trino in 2020 - An amazing year in review" />
      <published>2021-01-08T00:00:00+00:00</published>
      <updated>2021-01-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/01/08/2020-review</id>
      <content type="html" xml:base="https://trino.io/blog/2021/01/08/2020-review.html">&lt;p&gt;&lt;strong&gt;Wow!&lt;/strong&gt; If you had to sum up what happened in the last year in this
great community, &lt;strong&gt;wow&lt;/strong&gt; would be it. It is truly awe-inspiring to be part of
this incredible journey of Trino. Oh yeah, on that note. Our community and
project &lt;a href=&quot;/blog/2020/12/27/announcing-trino.html&quot;&gt;chose the new name Trino&lt;/a&gt;,
to be able to continue to innovate and develop freely as a community of peers.
Presto® and Presto® SQL are a thing of the past.&lt;/p&gt;

&lt;p&gt;Now that that’s out of the way, let’s dive right in and see what all our community
members across the globe have created with us!&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2020/01/01/2019-summary.html&quot;&gt;2019 was a big year for us&lt;/a&gt;, but check
out how 2020 eclipsed even that!&lt;/p&gt;
&lt;h2 id=&quot;by-the-numbers&quot;&gt;By the numbers&lt;/h2&gt;

&lt;p&gt;Even the size and growth of &lt;a href=&quot;/slack.html&quot;&gt;our community on Slack&lt;/a&gt; is impressive:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Started in January 2020 with ~1600 members and 280 weekly active&lt;/li&gt;
  &lt;li&gt;Over 3200 members by December 2020&lt;/li&gt;
  &lt;li&gt;560 members active weekly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The innovation and change of &lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;the source code on GitHub&lt;/a&gt; is a result of the hard work of the community:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Over 4000 commits merged&lt;/li&gt;
  &lt;li&gt;More than 2800 pull requests received&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/release.html#releases-2020&quot;&gt;23 releases&lt;/a&gt;, basically one
every two weeks!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, the excitement around the name change quickly increased the
number of stars we have on GitHub. While some of this certainly stems from an
initial buzz around a shiny new name, we also believe that the name change has
brought clarity to the community. Trino is an improved version, supported by
the founders and creators of Presto®, along with the major contributors.&lt;/p&gt;

&lt;p&gt;And if you have not done so already, make sure to &lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;star the
repository&lt;/a&gt; and &lt;a href=&quot;/slack.html&quot;&gt;join us on slack&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;features-and-code&quot;&gt;Features and code&lt;/h2&gt;

&lt;p&gt;While everything mentioned is already exciting, the true work is visible in the
new features and improvements in Trino. It is a long list, but read on. You
won’t want to miss anything.&lt;/p&gt;

&lt;h3 id=&quot;improvements-to-ansi-sql-support&quot;&gt;Improvements to ANSI SQL support&lt;/h3&gt;

&lt;p&gt;A core feature of Trino is the ability to use the same standard SQL for any
connected data source. These improvements empower all users.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Variable-precision temporal types, with precision down to picoseconds
(10&lt;sup&gt;−12&lt;/sup&gt;s). This is a very important feature for time-critical
systems such as financial transaction processing&lt;/li&gt;
  &lt;li&gt;Correct, and now SQL specification compliant timestamp semantics, making
migration of SQL statements from other compliant systems such as many RDBMSs
easier&lt;/li&gt;
  &lt;li&gt;Implicit coercions for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; clause&lt;/li&gt;
  &lt;li&gt;Support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RANGE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUPS&lt;/code&gt;-based window frames&lt;/li&gt;
  &lt;li&gt;More support for various shapes of correlated subqueries&lt;/li&gt;
  &lt;li&gt;Support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INTERSECT ALL&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXCEPT ALL&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Parameter support in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt; clauses&lt;/li&gt;
  &lt;li&gt;Experimental support for &lt;a href=&quot;/docs/current/sql/select.html?highlight=recursive#with-recursive-clause&quot;&gt;recursive queries&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Enforcement of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT NULL&lt;/code&gt; constraints when inserting data&lt;/li&gt;
  &lt;li&gt;Quantified comparisons (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt; ALL (...)&lt;/code&gt;) in aggregation queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;other-query-improvements&quot;&gt;Other query improvements&lt;/h3&gt;

&lt;p&gt;A number of other features were added to make querying your data sources with
Trino even more powerful:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/language/types.html#t-digest&quot;&gt;T-digest data type&lt;/a&gt; and functions
for approximate quantile computations&lt;/li&gt;
  &lt;li&gt;Support for setting and reading column comments&lt;/li&gt;
  &lt;li&gt;Numerous new functions including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;concat_ws()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;regexp_count()&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;regexp_position()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;contains_sequence()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;murmur3()&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;from_unixtime_nanos()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;from_iso8601_timestamp_nanos()&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;human_readable_seconds()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bitwise&lt;/code&gt; operations, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;luhn_check()&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;approx_most_frequent()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;starts_with()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;

&lt;p&gt;Trino is already &lt;a href=&quot;/index.html&quot;&gt;ludicrously fast&lt;/a&gt;. But then again, even faster is
better, so we worked on that:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Improved pushdown of complex operations into connectors, including
&lt;a href=&quot;/docs/current/optimizer/pushdown.html&quot;&gt;aggregation pushdown&lt;/a&gt; and TopN
pushdown.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/06/14/dynamic-partition-pruning.html&quot;&gt;Dynamic filtering and partition pruning&lt;/a&gt;, which can improve performance of
highly selective joins manyfold.&lt;/li&gt;
  &lt;li&gt;Cost-based decisions for queries containing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN &amp;lt;subquery&amp;gt;&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;information_schema&lt;/code&gt; performance improvements, which benefit third-party BI
tools that need to inspect table metadata, such as DBeaver, DataGrip,
Power BI, Tableau, Looker, and others.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/08/14/dereference-pushdown.html&quot;&gt;Faster queries on nested data in Parquet and ORC&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Faster and more accurate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;approx_percentile&lt;/code&gt;, based on t-digest data structure.&lt;/li&gt;
  &lt;li&gt;Support of Bloom filters in ORC.&lt;/li&gt;
  &lt;li&gt;Experimental, optimized Parquet writer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;security&quot;&gt;Security&lt;/h2&gt;

&lt;p&gt;The more data you access with Trino, the more it becomes critical to secure it.
With that in mind we added a lot of improvements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;a href=&quot;/docs/current/admin/web-interface.html&quot;&gt;Web UI&lt;/a&gt; now requires
authentication. Various actions such as viewing query details, killing
queries, etc., are protected with authorization checks based on the identity
of the user. Additionally, the UI now supports OAuth2 for user identification.&lt;/li&gt;
  &lt;li&gt;External and internal APIs are now properly secured with authentication and
authorization checks. Importantly, this fixes a &lt;a href=&quot;https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15087&quot;&gt;CVE reported
vulnerability&lt;/a&gt;
that affects all older versions of Presto®.&lt;/li&gt;
  &lt;li&gt;A &lt;a href=&quot;/docs/current/security/secrets.html&quot;&gt;new mechanism to externalize secrets in configuration
 files&lt;/a&gt; that makes it easier to integrate
 with third-party secret managers and deployment tools.&lt;/li&gt;
  &lt;li&gt;Support for JSON Web Key (JWK) authentication and &lt;a href=&quot;/docs/current/develop/certificate-authenticator.html&quot;&gt;pluggable certificate
authenticators&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;A new &lt;a href=&quot;/docs/current/security/salesforce.html&quot;&gt;Salesforce authenticator&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;The query engine and access control SPIs now support injecting row filters and
column masks.&lt;/li&gt;
  &lt;li&gt;New syntax for managing permissions (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GRANT/REVOKE&lt;/code&gt; on schema,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER TABLE/SCHEMA/VIEW ... SET AUTHORIZATION&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;data-sources&quot;&gt;Data sources&lt;/h2&gt;

&lt;p&gt;Trino empowers you to use one platform to access all data sources. Connectors
enable this and we added numerous new connectors:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/iceberg.html&quot;&gt;Iceberg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/prometheus.html&quot;&gt;Prometheus&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/oracle.html&quot;&gt;Oracle&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/pinot.html&quot;&gt;Pinot&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/druid.html&quot;&gt;Druid&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/bigquery.html&quot;&gt;BigQuery&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/memsql.html&quot;&gt;MemSQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All other connectors received a large host of improvements. Let’s just look at
two popular connectors:&lt;/p&gt;

&lt;h3 id=&quot;hive-connector-for-hdfs-s3-azure-and-cloud-object-storage-systems&quot;&gt;Hive connector for HDFS, S3, Azure and cloud object storage systems&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Support for complex Hive views, allowing integration with Hive and simplifying
migration from Hive&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/06/01/hive-acid.html&quot;&gt;ACID transactional tables&lt;/a&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; support&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/hive-caching.html&quot;&gt;Built-in storage caching&lt;/a&gt; and
support for &lt;a href=&quot;/docs/current/connector/hive-alluxio.html&quot;&gt;external caching with
Alluxio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;New procedures: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;system.drop_stats()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;register_partition()&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unregister_partition()&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href=&quot;/docs/current/connector/hive-azure.html&quot;&gt;Azure object storage&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href=&quot;/docs/current/connector/hive-s3.html&quot;&gt;S3 encrypted files, flexible S3 security mappings and
Intelligent-Tiering S3 storage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;elasticsearch-connector&quot;&gt;Elasticsearch connector&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;/docs/current/connector/elasticsearch.html&quot;&gt;Elasticsearch connector&lt;/a&gt;
received numerous powerful improvements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Password authentication&lt;/li&gt;
  &lt;li&gt;Support for index aliases&lt;/li&gt;
  &lt;li&gt;Support for array types, Nested, and IP type&lt;/li&gt;
  &lt;li&gt;Support for Elasticsearch 7.x&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;runtime-improvements&quot;&gt;Runtime improvements&lt;/h2&gt;

&lt;p&gt;Operating and maintaining a Trino cluster takes a significant amount of
resources, so any work to improve the runtime has a significant positive
impact:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/installation/deployment.html#java-runtime-environment&quot;&gt;Requirement to use Java
11&lt;/a&gt;, with
better GC performance, overall performance, and improved container
support&lt;/li&gt;
  &lt;li&gt;Support for ARM64-based processors to run Trino&lt;/li&gt;
  &lt;li&gt;Support for requiring a minimum number of workers before a query starts, useful for
implementing autoscaling&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/06/25/data-integrity-protection.html&quot;&gt;Data integrity checks for network transfers&lt;/a&gt; to prevent data corruption during
processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;everything-else&quot;&gt;Everything else&lt;/h2&gt;

&lt;p&gt;There is so much more to capture, and you really would have to read all the
&lt;a href=&quot;/docs/current/release.html#releases-2020&quot;&gt;release notes&lt;/a&gt; in detail to know it
all. To save you from that, here are a few more noteworthy changes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Experimental support for materialized views in Iceberg connector&lt;/li&gt;
  &lt;li&gt;JDBC driver backward compatibility tests&lt;/li&gt;
  &lt;li&gt;Support for multiple event listeners&lt;/li&gt;
  &lt;li&gt;Added Python client support for exec with parameters&lt;/li&gt;
  &lt;li&gt;New look and navigation for the &lt;a href=&quot;/docs/current/index.html&quot;&gt;documentation&lt;/a&gt;, and
lots of new content&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;community-resources-and-events&quot;&gt;Community resources and events&lt;/h2&gt;

&lt;p&gt;Beyond the raw code and helping each other, the community collaborated on other
helpful resources like books and in-depth video tutorials.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/mattsfuller&quot;&gt;Matt&lt;/a&gt;, &lt;a href=&quot;https://github.com/mosabua&quot;&gt;Manfred&lt;/a&gt;,
and &lt;a href=&quot;https://github.com/martint&quot;&gt;Martin&lt;/a&gt; published the book &lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The
Definitive Guide&lt;/a&gt; with O’Reilly. Over 5000
readers took advantage of the &lt;a href=&quot;/blog/2020/04/11/the-definitive-guide.html&quot;&gt;free digital copy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Brian and Manfred launched the live streaming event &lt;a href=&quot;/broadcast/index.html&quot;&gt;Trino Community
Broadcast&lt;/a&gt;, and grew their audience and back catalog to
include some very useful material. If you have not seen it yet, go and &lt;a href=&quot;/broadcast/episodes.html&quot;&gt;watch
some old episodes&lt;/a&gt; and join us in the next ones.&lt;/p&gt;

&lt;p&gt;We also had a number of other online events and presentations, with direct
participation of our community members:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A &lt;a href=&quot;/blog/2020/11/21/a-report-about-presto-conference-tokyo-2020.html&quot;&gt;dedicated conference event&lt;/a&gt;
for the community in Japan was very successful.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;/blog/2020/09/28/argentina-big-data-meetup.html&quot;&gt;Argentina Big Data Meetup&lt;/a&gt; had a large audience from the
community in South America&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A series of virtual events around the project started with a roadmap and
overview meeting and included a number of real-world use case examples at scale:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/05/15/state-of-presto.html&quot;&gt;State of Trino&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/07/22/presto-summit-pinterest.html&quot;&gt;Trino at Pinterest&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/07/06/presto-summit-arm-td.html&quot;&gt;Trino Migration at ARM Treasure Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/06/16/presto-summit-zuora.html&quot;&gt;Trino at Zuora&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another series of training classes with the project founders was hugely
successful. It includes very valuable content for any Trino user, from beginners
to experts, that you should not miss:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/07/15/training-advanced-sql.html&quot;&gt;Advanced SQL in Trino with David&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/07/30/training-query-tuning.html&quot;&gt;Understanding and Tuning Trino Query Processing with Martin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/08/13/training-security.html&quot;&gt;Securing Trino with Dain&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2020/08/27/training-performance.html&quot;&gt;Configuring and Tuning Trino with Dain&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;2020 was a wild ride for us all. Trino and the Trino community definitely
emerged as a winner, and we are looking forward to a very bright future with you
all.&lt;/p&gt;

&lt;p&gt;Several pieces of ongoing work are already underway and very promising:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Optimized Parquet reader, on par with ORC reader support&lt;/li&gt;
  &lt;li&gt;Support for SQL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; statements&lt;/li&gt;
  &lt;li&gt;OAuth 2.0 support for JDBC&lt;/li&gt;
  &lt;li&gt;Support for SQL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WINDOW&lt;/code&gt; clause and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATCH_RECOGNIZE&lt;/code&gt; usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re starting the new year with a shiny new name, a cute little bunny, and a
very vibrant community. The future is looking great for Trino!&lt;/p&gt;

&lt;p&gt;Don’t miss out on all the benefits of Trino. Join us &lt;a href=&quot;/slack.html&quot;&gt;on
Slack&lt;/a&gt; to get started!&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso, Manfred Moser, Brian Olsen</name>
        </author>
      

      <summary>Wow! If you would have to sum up what happened in the last year in this great community, wow would be it. It is truly awe-inspiring to be part of this incredible journey of Trino. Oh yeah, on that note. Our community and project chose the new name Trino, to be able to continue to innovate and develop freely as a community of peers. Presto® and Presto® SQL are a thing of the past. Now that is out of the way, let’s dive right in and see what all our community members across the globe have created with us!</summary>

      
      
    </entry>
  
    <entry>
      <title>Migrating from PrestoSQL to Trino</title>
      <link href="https://trino.io/blog/2021/01/04/migrating-from-prestosql-to-trino.html" rel="alternate" type="text/html" title="Migrating from PrestoSQL to Trino" />
      <published>2021-01-04T00:00:00+00:00</published>
      <updated>2021-01-04T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2021/01/04/migrating-from-prestosql-to-trino</id>
      <content type="html" xml:base="https://trino.io/blog/2021/01/04/migrating-from-prestosql-to-trino.html">&lt;p&gt;As we previously announced, we’re
&lt;a href=&quot;/blog/2020/12/27/announcing-trino.html&quot;&gt;rebranding Presto SQL as Trino&lt;/a&gt;.
Now comes the hard part: migrating to the new version of the software.
We just released the first version,
&lt;a href=&quot;/docs/current/release/release-351.html&quot;&gt;Trino 351&lt;/a&gt;,
which uses the name Trino everywhere, both internally and externally.
Unfortunately, there are some unavoidable compatibility aspects that
administrators of Trino need to know about. We hope this post makes the
transition as smooth as possible.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;things-that-havent-changed&quot;&gt;Things that haven’t changed&lt;/h1&gt;

&lt;p&gt;Let’s start with the good news. For end users running queries against Trino,
everything should be the same. There are no changes to the SQL language,
SQL functions, session properties, etc.&lt;/p&gt;

&lt;p&gt;Users now see &lt;em&gt;Trino&lt;/em&gt; in error messages, a different logo in the web UI,
and error stack traces have a different package name, but otherwise they
won’t know that anything has changed. All of their views, reports,
or other stored queries will work as before.&lt;/p&gt;

&lt;p&gt;Similarly for administrators, except for a few things noted in the
&lt;a href=&quot;/docs/current/release/release-351.html&quot;&gt;Trino 351 release notes&lt;/a&gt;,
all the configuration properties are the same.&lt;/p&gt;

&lt;h1 id=&quot;client-protocol-compatiblity&quot;&gt;Client protocol compatibility&lt;/h1&gt;

&lt;p&gt;The client protocol is how clients, such as the
&lt;a href=&quot;/docs/current/client/cli.html&quot;&gt;CLI&lt;/a&gt; or
&lt;a href=&quot;/docs/current/client/jdbc.html&quot;&gt;JDBC driver&lt;/a&gt;,
talk to Trino. It uses standard HTTP as the underlying communications
protocol, with some custom HTTP headers to communicate values
to and from Trino. Unfortunately, those header names started with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X-Presto-&lt;/code&gt; and thus had to be changed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X-Trino-&lt;/code&gt;.&lt;/p&gt;
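As a rough illustration of what the header rename means on the wire (the localhost coordinator below is hypothetical, and the request is only built, never sent), a client request to the statement endpoint now carries `X-Trino-` headers such as `X-Trino-User`:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TrinoHeaderSketch {
    public static void main(String[] args) {
        // Build (but do not send) a request like the one the CLI or JDBC
        // driver issues; the coordinator address here is a made-up example.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/statement"))
                // Before Trino 351 this header was named X-Presto-User
                .header("X-Trino-User", "alice")
                .POST(HttpRequest.BodyPublishers.ofString("SELECT 1"))
                .build();

        System.out.println(request.headers().firstValue("X-Trino-User").orElse(""));
    }
}
```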

&lt;p&gt;The Trino CLI and JDBC driver send the new headers, so they are
&lt;strong&gt;only compatible with Trino versions 351 and newer&lt;/strong&gt;. Users should
wait to upgrade the CLI or JDBC driver until the Trino servers they
talk to have been upgraded.&lt;/p&gt;

&lt;p&gt;Out of the box, the Trino server does not work with older clients.
However, in order to support a graceful transition, you can allow the
server to support older clients by adding a configuration property:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;protocol.v1.alternate-header-name=Presto
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;We recommend using version 350 of CLI and JDBC driver as the transition version&lt;/strong&gt;.
It has all the newest features such as variable precision timestamps,
has been tested with a range of older server versions, and is the last
version to support older servers.&lt;/p&gt;

&lt;h1 id=&quot;jdbc-driver&quot;&gt;JDBC driver&lt;/h1&gt;

&lt;p&gt;The URL prefix for the JDBC driver now starts with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jdbc:trino:&lt;/code&gt; instead
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jdbc:presto:&lt;/code&gt;. This means that any client applications using the
JDBC driver need to update their connection configuration. The old
prefix is still supported, but will be removed in a future release.&lt;/p&gt;

&lt;p&gt;The class name of the driver is now &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io.trino.jdbc.TrinoDriver&lt;/code&gt;. This is
of no concern to most users, as the driver is normally accessed via the
standard JDBC auto-discovery mechanism based on the URL. As with the URL prefix,
the old name is still supported, but will be removed in a future release.&lt;/p&gt;
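As a minimal sketch of the migration (the coordinator host, catalog, and schema below are hypothetical), updating an existing connection string is a simple prefix swap; with the Trino JDBC driver on the classpath, standard JDBC auto-discovery then selects the driver based on the new prefix, so an explicit Class.forName call is normally unnecessary:

```java
public class TrinoUrlMigration {
    public static void main(String[] args) {
        // Old PrestoSQL-style JDBC URL; the coordinator host, catalog, and
        // schema are made-up examples
        String oldUrl = "jdbc:presto://coordinator.example.com:8080/hive/default";

        // Rewrite the prefix to the new jdbc:trino: form
        String newUrl = oldUrl.replaceFirst("^jdbc:presto:", "jdbc:trino:");

        System.out.println(newUrl);
        // prints jdbc:trino://coordinator.example.com:8080/hive/default
    }
}
```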

&lt;h1 id=&quot;server-rpm&quot;&gt;Server RPM&lt;/h1&gt;

&lt;p&gt;The name of the RPM has changed, so it is treated as a different RPM, and
thus you cannot simply upgrade from the old version to the new version.
All of the directories for the RPM that contained the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;presto&lt;/code&gt; now
use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino&lt;/code&gt; instead. You likely want to uninstall the old RPM, rename
the config and log directories, then install the new RPM.&lt;/p&gt;

&lt;h1 id=&quot;docker-image&quot;&gt;Docker image&lt;/h1&gt;

&lt;p&gt;The &lt;a href=&quot;https://hub.docker.com/r/trinodb/trino&quot;&gt;Trino Docker image&lt;/a&gt; is now
published as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trinodb/trino&lt;/code&gt;. The supported configuration directory is
now &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/trino&lt;/code&gt;. The CLI is now named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;presto&lt;/code&gt;.&lt;/p&gt;

&lt;h1 id=&quot;jmx-mbean-naming&quot;&gt;JMX MBean naming&lt;/h1&gt;

&lt;p&gt;Trino runs on the JVM, which has the JMX framework as a standard way to expose
system and application metrics. Trino exposes a huge number of JMX metrics for
administrators to monitor their clusters. You might be using these metrics
via your monitoring system, or perhaps you are accessing them in SQL via the
Trino &lt;a href=&quot;/docs/current/connector/jmx.html&quot;&gt;JMX connector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The metrics for Trino server now start with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;presto&lt;/code&gt;. You
might need to update this name in your monitoring system, or you can revert
to the old name:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jmx.base-name=presto
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Similarly, the metrics for the Elasticsearch, Hive, Iceberg, Raptor, and Thrift
connectors now start with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino.plugin&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;presto.plugin&lt;/code&gt;. Again,
you might need to update these names in your monitoring system, or you can
revert to the old name. For example, for the Hive connector:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jmx.base-name=presto.plugin.hive
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;thrift-connector&quot;&gt;Thrift connector&lt;/h1&gt;

&lt;p&gt;The &lt;a href=&quot;/docs/current/connector/thrift.html&quot;&gt;Thrift connector&lt;/a&gt; had many
&lt;a href=&quot;/docs/current/release/release-351.html#thrift-connector-changes&quot;&gt;backwards incompatible changes&lt;/a&gt;
to both the Thrift service interface and the configuration properties. You need
to update all of your implementations of the Thrift service used by the connector.&lt;/p&gt;

&lt;h1 id=&quot;spi&quot;&gt;SPI&lt;/h1&gt;

&lt;p&gt;If you have any custom plugins for Trino, such as connectors or functions,
these need to be updated. The package name is now &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io.trino.spi&lt;/code&gt;, and a
few classes were renamed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PrestoException&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TrinoException&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PrestoPrincipal&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TrinoPrincipal&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PrestoWarning&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TrinoWarning&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are no functional changes, so all you should need to do is update
your imports and rename the references to the above class names.&lt;/p&gt;

&lt;h1 id=&quot;migration-guide&quot;&gt;Migration guide&lt;/h1&gt;

&lt;p&gt;Now that you understand what is different and what you need to change,
you can start thinking about the list of steps needed to perform the
migration. The following is a rough plan for upgrading your environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Prepare to deploy the new version&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Let users know the name is changing, so they are not surprised by the logo changes in the UI.&lt;/li&gt;
  &lt;li&gt;Make sure that users are using recent client versions. Ideally, upgrade them all to
version 350, as mentioned above. You can check the HTTP request logs for the coordinator
to see what client versions are in use.&lt;/li&gt;
  &lt;li&gt;Update your server configuration with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;protocol.v1.alternate-header-name=Presto&lt;/code&gt;
to allow supporting all of your existing Presto clients.&lt;/li&gt;
  &lt;li&gt;If you are using the RPM, have a plan to deal with the new RPM name
and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino&lt;/code&gt; directory names.&lt;/li&gt;
  &lt;li&gt;If you are using Docker, use the new image name, make sure your configuration will
be mounted using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino&lt;/code&gt; path name, and remember that the CLI is now named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Update any custom plugins to use the new SPI.&lt;/li&gt;
  &lt;li&gt;Check if you have anything using JMX to monitor your clusters, and decide if you will
update them to the new names or set a Trino config to revert to the old names.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Upgrade your servers to Trino 351+&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Upgrade development and staging servers.&lt;/li&gt;
  &lt;li&gt;Upgrade production servers. If you have multiple clusters, you can do them one
at a time, and verify everything is working before moving on to the next one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Upgrade clients&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Upgrade all clients, including the CLI, JDBC driver, Python client, etc., to their Trino versions.&lt;/li&gt;
  &lt;li&gt;Update any applications using JDBC to use the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jdbc:trino:&lt;/code&gt; connection URL prefix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Cleanup&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Remove the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;protocol.v1.alternate-header-name&lt;/code&gt; configuration property.&lt;/li&gt;
  &lt;li&gt;If you configured Trino to use the old JMX names, convert your monitoring system
to use the new JMX names and remove the fallback configs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;getting-help&quot;&gt;Getting help&lt;/h1&gt;

&lt;p&gt;We’re here to help! If you run into any issues while upgrading, or have any
questions or concerns, &lt;a href=&quot;/slack.html&quot;&gt;ask on Slack&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>David Phillips, Dain Sundstrom</name>
        </author>
      

      <summary>As we previously announced, we’re rebranding Presto SQL as Trino. Now comes the hard part: migrating to the new version of the software. We just released the first version, Trino 351, which uses the name Trino everywhere, both internally and externally. Unfortunately, there are some unavoidable compatibility aspects that administrators of Trino need to know about. We hope this post makes the transition as smooth as possible.</summary>

      
      
    </entry>
  
    <entry>
      <title>We’re rebranding PrestoSQL as Trino</title>
      <link href="https://trino.io/blog/2020/12/27/announcing-trino.html" rel="alternate" type="text/html" title="We’re rebranding PrestoSQL as Trino" />
      <published>2020-12-27T00:00:00+00:00</published>
      <updated>2020-12-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/12/27/announcing-trino</id>
      <content type="html" xml:base="https://trino.io/blog/2020/12/27/announcing-trino.html">&lt;p&gt;We’re rebranding PrestoSQL as Trino. The software and the community you have come to love and depend on aren’t 
going anywhere; we are simply renaming. &lt;strong&gt;Trino is the new name for PrestoSQL&lt;/strong&gt;, the project supported by the founders 
and creators of Presto® along with the major contributors – just under a shiny new name. And now you can find us here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;GitHub: &lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;https://github.com/trinodb/trino&lt;/a&gt;. Please give it a &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/.github/star.png&quot;&gt;star&lt;/a&gt;!&lt;/li&gt;
  &lt;li&gt;Twitter: &lt;a href=&quot;https://twitter.com/trinodb&quot;&gt;@trinodb&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Slack: &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;https://trino.io/slack.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to learn why we’re doing this, read on…&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;In 2012, Dain, David and Martin joined the Facebook data infrastructure team. Together with Eric Hwang, we created 
Presto® to address the problems of low latency interactive analytics over Facebook’s massive Hadoop data warehouse. 
One of our non-negotiable conditions was for Presto® to be an open source project. Open source is in our DNA - we had 
all used and participated in open source projects to various degrees in the past, and we recognized the power of open 
communities and developers coming together to build successful software that can stand the test of time.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-announcement/team.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Over the next six years, we worked hard to build a healthy open source community and ecosystem around the project. We 
worked with developers and users all over the world and welcomed them into the Presto® community. Presto® was on a path 
of increasing growth and success, in large part because of the contributions from developers across many fields and all 
over the world.&lt;/p&gt;

&lt;p&gt;Unfortunately in 2018, it became clear that Facebook management wanted to have tighter control over the project and its 
future. This culminated with their decision to grant Facebook developers commit rights on the project without any prior 
experience in Presto®. We strongly believe that this kind of decision is not compatible with having a healthy, open 
community. Moreover, they made this decision by fiat without engaging the Presto® community. As a matter of principle, 
we had no choice but to leave Facebook in order to focus on making sure Presto® continued to be a successful project 
with an open, collaborative and independent community. In reality, the choice was easy.&lt;/p&gt;

&lt;p&gt;We started the Presto Software Foundation in January 2019 as an independent entity to oversee the development of the 
software and community, continuing the meritocratic system that had been in place over the previous 6 years. The community 
quickly consolidated under this new home. We intentionally stayed unemployed over the next 10 months to focus on expanding 
and strengthening the community by working directly with major users and contributors, as well as reaching out to a wider 
group of users and developers across the globe. This resulted in new use cases and an injection of energy, making the 
project more vibrant than ever before as even more new users and developers became engaged. But, don’t take our word for 
it, let the data speak for itself:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/trino-announcement/commits.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Months after this consolidation, Facebook decided to create a competing community using The Linux Foundation®. As a first 
action, Facebook applied for a trademark on Presto®. This was a surprising, norm-breaking move because up until that point, 
the Presto® name had been used without constraints by commercial and non-commercial products for over 6 years. In September 
of 2019, Facebook established the Presto Foundation at The Linux Foundation®, and immediately began working to enforce this 
new trademark. We spent the better part of the last year trying to agree to terms with Facebook and The Linux Foundation 
that would not negatively impact the community, but unfortunately we were unable to do so. The end result is that we must 
now change the name in a short period of time, with little ability to minimize user disruption.&lt;/p&gt;

&lt;p&gt;On a personal note, and as the founders who named the project Presto® in the first place, this is an incredibly sad and 
disappointing turn of events. And while we will always have fondness for the name Presto®, we have come to accept that a 
name is just a name. To be frank, we’re tired of this endless distraction, and we intend to focus on what matters most 
and what we are best at doing – building high quality software everyone can rely on and fostering a healthy community 
of users and developers that build it and support it. We’re not going anywhere – we’re the same people, the same amazing 
software, under a new name: Trino.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you love this project, you already love Trino. ❤️&lt;/strong&gt;&lt;/p&gt;

&lt;html&gt;
&lt;p style=&quot;font-size:0.8em&quot;&gt;Facebook is a registered trademark of Facebook Inc.  The Linux Foundation and Presto are trademarks of The Linux Foundation.&lt;/p&gt;
&lt;/html&gt;</content>

      
        <author>
          <name>Martin Traverso, Dain Sundstrom, David Phillips</name>
        </author>
      

      <summary>We’re rebranding PrestoSQL as Trino. The software and the community you have come to love and depend on aren’t going anywhere, we are simply renaming. Trino is the new name for PrestoSQL, the project supported by the founders and creators of Presto® along with the major contributors – just under a shiny new name. And now you can find us here: GitHub: https://github.com/trinodb/trino. Please give it a star! Twitter: @trinodb Slack: https://trino.io/slack.html If you want to learn why we’re doing this, read on…</summary>

      
      
    </entry>
  
    <entry>
      <title>A Report about Presto Conference Tokyo 2020 Online</title>
      <link href="https://trino.io/blog/2020/11/21/a-report-about-presto-conference-tokyo-2020.html" rel="alternate" type="text/html" title="A Report about Presto Conference Tokyo 2020 Online" />
      <published>2020-11-21T00:00:00+00:00</published>
      <updated>2020-11-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/11/21/a-report-about-presto-conference-tokyo-2020</id>
<content type="html" xml:base="https://trino.io/blog/2020/11/21/a-report-about-presto-conference-tokyo-2020.html">&lt;p&gt;On November 11th, 2020, the Japan Presto Community held the 2nd Presto Conference, 
welcoming Martin Traverso and Brian Olsen.
The conference was hosted on YouTube Live.
This article summarizes the conference and shares highlights from the talks.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;presto-community-updates&quot;&gt;Presto Community Updates&lt;/h1&gt;

&lt;p&gt;First, Martin introduced recent Presto updates, 
covering changes and enhancements achieved by the community.
Attendees also learned about several new features that will be available soon.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Update / Merge (&lt;a href=&quot;https://github.com/prestosql/presto/issues/3325&quot;&gt;issue #3325&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Materialized views (&lt;a href=&quot;https://github.com/prestosql/presto/pull/3283&quot;&gt;pull request #3283&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Dynamically resolved functions&lt;/li&gt;
  &lt;li&gt;Optimized Parquet reader&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, during the Q&amp;amp;A, he suggested that new developers who want to contribute to PrestoSQL 
check the &lt;a href=&quot;https://github.com/prestosql/presto/labels/good%20first%20issue&quot;&gt;“good first issue”&lt;/a&gt; label on GitHub. 
These issues are a good first step for newcomers to contribute.&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/NxDBBEA67Ws&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h1 id=&quot;presto-community---how-to-get-involved&quot;&gt;Presto Community - How to get involved&lt;/h1&gt;

&lt;p&gt;To help attendees get involved, Martin provided a guide to navigating the Presto community. 
He shared his team’s principles for the community and talked about their education strategy for new Presto users.
The principles are worth quoting here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We are passionate about open source&lt;/li&gt;
  &lt;li&gt;We help others be successful with what we create&lt;/li&gt;
  &lt;li&gt;We create robust long-lasting software&lt;/li&gt;
  &lt;li&gt;We are egalitarian (nobody is more important than anyone else)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;support-presto-as-a-feature-of-saas&quot;&gt;Support Presto as a feature of SaaS&lt;/h1&gt;

&lt;p&gt;Next, Satoru Kamikaseda, Technical Support Engineer at Treasure Data, provided an overview of how Treasure Data supports Presto in their service. 
Presto is heavily used to support many enterprise use cases on their customer data platform, 
and it is becoming the hub component for processing high-throughput workloads from many kinds of clients, such as Spark, ODBC, and JDBC.&lt;/p&gt;

&lt;p&gt;He described statistics about Presto support cases on their platform and how each kind is handled. 
In the stats, one third are investigations of job failures and query results, one third are requests for help with a client’s SQL, 
and the rest are notifications to clients and performance investigations. 
His talk is useful for any SaaS company that provides a query engine to its clients, showing how difficult it is to support a distributed query engine.&lt;/p&gt;

&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/GR6e3dfKKJ8w4c&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt;
&lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/SatoruKamikaseda/support-presto-as-a-feature-of-saas&quot; title=&quot;Support Presto as a feature of SaaS&quot; target=&quot;_blank&quot;&gt;Support Presto as a feature of SaaS&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/SatoruKamikaseda&quot; target=&quot;_blank&quot;&gt;SatoruKamikaseda&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;

&lt;h1 id=&quot;how-to-use-presto-with-aws-efficiently&quot;&gt;How to use Presto with AWS efficiently&lt;/h1&gt;

&lt;p&gt;Noritaka Sekiyama, Sr. Big Data Architect at Amazon Web Services Japan, showed how to use Presto with AWS, 
including Presto on EMR, Presto on EC2, Presto via Athena, and AWS Glue.
He also shared a comparison of the Presto-on-AWS options (EC2, EMR, Athena). 
If you are new to Presto, his talk gives you insight into choosing your first Presto environment.&lt;/p&gt;

&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/kWzJ1XqR96A9di&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt;
&lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/ssuserca76a5/aws-presto&quot; title=&quot;AWS で Presto を徹底的に使いこなすワザ&quot; target=&quot;_blank&quot;&gt;AWS で Presto を徹底的に使いこなすワザ&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/ssuserca76a5&quot; target=&quot;_blank&quot;&gt;Noritaka Sekiyama&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;

&lt;h1 id=&quot;presto--line-2020&quot;&gt;Presto @ LINE 2020&lt;/h1&gt;

&lt;p&gt;LINE provides the biggest mobile messaging app in Japan (think WhatsApp for Japan). Yuya Ebihara, one of the Presto maintainers, 
showed how they have improved Presto on their platform since they presented at &lt;a href=&quot;/blog/2019/07/11/report-for-presto-conference-tokyo.html&quot;&gt;the previous conference&lt;/a&gt;. 
Their Presto usage has grown significantly since 2019: the number of Presto workers went from 100 to 300, and daily queries increased from 20,000 to 50,000. 
We also learned how they upgraded Presto from 314 to 339 and resolved issues along the way.&lt;/p&gt;

&lt;iframe src=&quot;https://docs.google.com/presentation/d/e/2PACX-1vS2QdQjhLsiSuVdWlEmT23ixqoZXkRrKKMRGa1hrZHg65OpcH18RpzARotOMYvIBSwP57lPPAHkUQOx/embed&quot; frameborder=&quot;0&quot; width=&quot;595&quot; height=&quot;485&quot; allowfullscreen=&quot;true&quot; mozallowfullscreen=&quot;true&quot; webkitallowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;

&lt;h1 id=&quot;dive-into-amazon-athena---serverless-presto-2020&quot;&gt;Dive into Amazon Athena - Serverless Presto, 2020&lt;/h1&gt;

&lt;p&gt;Makoto Kawamura, Solutions Architect at Amazon Web Services Japan, 
introduced the latest features of Amazon Athena along with performance tuning tips. It is helpful for developers on AWS who want to explore Amazon Athena.&lt;/p&gt;

&lt;div style=&quot;width: 90%&quot;&gt;&lt;script async=&quot;&quot; class=&quot;speakerdeck-embed&quot; data-id=&quot;92a399aad5344df197279cd4195d9464&quot; data-ratio=&quot;1.77777777777778&quot; src=&quot;//speakerdeck.com/assets/embed.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;

&lt;h1 id=&quot;presto-cassandra-connector-hack-at-repro&quot;&gt;Presto Cassandra Connector Hack at Repro&lt;/h1&gt;

&lt;p&gt;Repro provides a customer engagement platform that enables companies to personalize their communication strategies with the right message at the right time to drive better retention and lifetime value. 
They use Presto as the backend of the segmentation system in their service, building lists of audiences that match certain conditions.&lt;/p&gt;

&lt;p&gt;Takeshi Arabiki gave an in-depth presentation on modifications to the Presto Cassandra connector to stabilize Presto and improve its performance, 
in addition to covering how Repro uses Presto.
His talk covers a wide range of topics, from investigating the bottleneck to resolving it.&lt;/p&gt;

&lt;script async=&quot;&quot; class=&quot;speakerdeck-embed&quot; data-id=&quot;9289d942805a4bf2be908cf42a122a29&quot; data-ratio=&quot;1.77777777777778&quot; src=&quot;//speakerdeck.com/assets/embed.js&quot;&gt;&lt;/script&gt;

&lt;h1 id=&quot;testing-distributed-query-engine-as-a-service&quot;&gt;Testing Distributed Query Engine as a Service&lt;/h1&gt;

&lt;p&gt;Finally, Naoki Takezoe from Treasure Data talked about their challenges with Presto upgrades and 
how hard it is to migrate a variety of workloads while maintaining performance stability. 
In an actual production-scale environment running multiple clients, testing is one of the big challenges. 
He showed how they simulate client workloads with a query simulator they developed, to cover various corner cases and verify data correctness.&lt;/p&gt;

&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/yCrep8qbYUzNzh&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt;
&lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/takezoe/testing-distributed-query-engine-as-a-service&quot; title=&quot;Testing Distributed Query Engine as a Service&quot; target=&quot;_blank&quot;&gt;Testing Distributed Query Engine as a Service&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/takezoe&quot; target=&quot;_blank&quot;&gt;takezoe&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;

&lt;h1 id=&quot;wrap-up&quot;&gt;Wrap Up&lt;/h1&gt;

&lt;p&gt;This conference was the first online Presto conference in Tokyo. 
Unfortunately, we didn’t have a chance to talk with the community developers and creators face-to-face. We hope we’ll get such a great opportunity in the near future.
Even so, it was a great time, with many presentations from community members and a lot of new things to learn from their wonderful experience.
During the conference, the average number of YouTube Live viewers was over 100, 
and the total number of attendees was around 180. 
The previous conference had 89 attendees, so the number of Presto developers and users in Japan seems to be increasing gradually. 
We really appreciate the developers and creators in the community. Thank you so much for coming to the conference, and see you next time!&lt;/p&gt;

&lt;h1 id=&quot;youtube-live-link&quot;&gt;YouTube Live link&lt;/h1&gt;

&lt;p&gt;The event was held mainly in Japanese.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://youtu.be/NxDBBEA67Ws&quot;&gt;Presto Conference Tokyo 2020 Online&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content>

      
        <author>
          <name>Toru Takahashi, Treasure Data</name>
        </author>
      

      <summary>On Nov 11th, 2020, Japan Presto Community held the 2nd Presto Conference welcoming Martin Traverso and Brian Olsen. The conference was hosted on YouTube Live. This article is a summary of the conference, aiming to share their great talks.</summary>

      
      
    </entry>
  
    <entry>
      <title>Announcing Presto Conference Tokyo 2020</title>
      <link href="https://trino.io/blog/2020/10/21/announcing-presto-conference-tokyo-2020.html" rel="alternate" type="text/html" title="Announcing Presto Conference Tokyo 2020" />
      <published>2020-10-21T00:00:00+00:00</published>
      <updated>2020-10-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/10/21/announcing-presto-conference-tokyo-2020</id>
      <content type="html" xml:base="https://trino.io/blog/2020/10/21/announcing-presto-conference-tokyo-2020.html">&lt;p&gt;Last year, &lt;a href=&quot;/blog/2019/07/11/report-for-presto-conference-tokyo.html&quot;&gt;Presto Conference Tokyo 2019&lt;/a&gt; 
was held in Japan with Martin Traverso, Dain Sundstrom and David Phillips, 
the founders of the Presto Software Foundation.&lt;/p&gt;

&lt;p&gt;This year, the event has moved to an online-only format. Presto Conference 
Tokyo 2020 is happening on the 20th of November. 
You can &lt;a href=&quot;https://techplay.jp/event/795265&quot;&gt;find out details and register right now&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;The event includes six sessions from Treasure Data, Amazon Web Services 
Japan, Repro and LINE, as well as open sessions with Martin and Brian Olsen, 
a Developer Advocate at Starburst Data.
This is a valuable opportunity to hear from engineers who are actually using 
Presto. It has something for those who are using Presto for data engineering
and those who don’t use Presto yet but are interested in it.&lt;/p&gt;

&lt;!--more--&gt;</content>

      
        <author>
          <name>Yuya Ebihara, LINE</name>
        </author>
      

      <summary>Last year, Presto Conference Tokyo 2019 was held in Japan with Martin Traverso, Dain Sundstrom and David Phillips, the founders of the Presto Software Foundation. This year, the event has moved to an online-only format. Presto Conference Tokyo 2020 is happening on the 20th of November. You can find out details and register right now! The event includes six sessions from Treasure Data, Amazon Web Services Japan, Repro and LINE, as well as open sessions with Martin and Brian Olsen, a Developer Advocate at Starburst Data. This is a valuable opportunity to hear from engineers who are actually using Presto. It has something for those who are using Presto for data engineering and those who don’t use Presto yet but are interested in it.</summary>

      
      
    </entry>
  
    <entry>
      <title>A gentle introduction to the Hive connector</title>
      <link href="https://trino.io/blog/2020/10/20/intro-to-hive-connector.html" rel="alternate" type="text/html" title="A gentle introduction to the Hive connector" />
      <published>2020-10-20T00:00:00+00:00</published>
      <updated>2020-10-20T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/10/20/intro-to-hive-connector</id>
      <content type="html" xml:base="https://trino.io/blog/2020/10/20/intro-to-hive-connector.html">&lt;p&gt;TL;DR: The Hive connector is what you use in Trino for reading data from object
storage that is organized according to the rules laid out by Hive, without using
the Hive runtime code.&lt;/p&gt;

&lt;p&gt;One of the most confusing aspects when starting Trino is the Hive connector. 
Typically, you seek out the use of Trino when you experience an intensely slow
query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In
fact, the genesis of Trino, formerly known as Presto, came about due to these 
slow Hive query conditions at Facebook back in 2012.&lt;/p&gt;

&lt;p&gt;So when you learn that Trino has a Hive connector,
it can be rather confusing since you moved to Trino to circumvent the slowness
of your current Hive cluster. Another common source of confusion is when you
want to query your data from your cloud object storage, such as AWS S3, MinIO, 
and Google Cloud Storage. This too uses the Hive connector. If that 
confuses you, don’t worry, you are not alone. This blog aims to explain this
commonly confusing nomenclature.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;hive-architecture&quot;&gt;Hive architecture&lt;/h1&gt;

&lt;p&gt;To understand the origins and inner workings of Trino’s Hive connector, you
first need to know a few high level components of the Hive architecture.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/intro-to-hive-connector/hive.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can simplify the Hive architecture to four components:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The runtime&lt;/em&gt; contains the logic of the query engine that translates the
SQL-like Hive Query Language (HQL) into MapReduce jobs that run over files stored 
in the filesystem.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The storage&lt;/em&gt; component is simply that: it stores files in various formats, along
with index structures to recall these files. The file formats can be anything
from formats as simple as JSON and CSV to more complex columnar formats like ORC
and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed
Filesystem (HDFS). As cloud-based options became more prevalent, object storage
like Amazon S3, Azure Blob Storage, Google Cloud Storage, and others needed
to be leveraged as well and replaced HDFS as the storage component.&lt;/p&gt;

&lt;p&gt;In order for Hive to process these files, it must have a mapping
from SQL tables in &lt;em&gt;the runtime&lt;/em&gt; to files and directories in &lt;em&gt;the storage&lt;/em&gt;
component. To accomplish this, Hive uses the Hive Metastore Service (HMS), 
often shortened to &lt;em&gt;the metastore&lt;/em&gt;, to manage metadata about the files, such
as table columns, file locations, and file formats.&lt;/p&gt;

&lt;p&gt;The last component, not included in the image, is Hive’s &lt;em&gt;data organization
specification&lt;/em&gt;. This specification is documented only in the Hive code itself, 
and it has been reverse engineered by other systems like Trino to remain 
compatible with Hive.&lt;/p&gt;
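
&lt;p&gt;To make this concrete, a partitioned Hive table typically maps to a
directory tree like the following sketch (the table, partition column, and
paths here are hypothetical, and naming details vary between Hive versions):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/user/hive/warehouse/sales/    # table location recorded in the metastore
  ds=2020-10-19/               # one directory per partition value
    000000_0                   # data files: ORC, Parquet, CSV, ...
    000001_0
  ds=2020-10-20/
    000000_0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;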

&lt;p&gt;Trino reuses all of these components except for &lt;em&gt;the runtime&lt;/em&gt;. This is the same
approach most compute engines, such as Spark, Drill, and Impala, take when
dealing with data in object stores. When you think of the Hive
connector, you should think about a connector that is capable of reading data
organized by the unwritten Hive specification.&lt;/p&gt;

&lt;h3 id=&quot;trino-runtime-replaces-hive-runtime&quot;&gt;Trino runtime replaces Hive runtime&lt;/h3&gt;

&lt;p&gt;In the early days of big data systems, many expected query turnaround to take a 
long time due to the high volume of unstructured data in ETL workloads. The
primary goal in early iterations of these systems was simply throughput over
large volumes of data while maintaining fault-tolerance. Now, more businesses
want to run fast interactive queries over their big data instead of running jobs
that take hours and produce possibly undesirable results. Many companies have
petabytes of data and metadata in their data warehouse. Data in storage is
cumbersome to move and the data in the metastore takes a long time to repopulate
in other formats. Since only the runtime that executed Hive queries needs
replacement, the Trino engine utilizes the existing metastore metadata and
files residing in storage, and the Trino runtime effectively replaces the
Hive runtime responsible for analyzing the data.&lt;/p&gt;

&lt;h1 id=&quot;trino-architecture&quot;&gt;Trino Architecture&lt;/h1&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/intro-to-hive-connector/trino.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;the-hive-connector-nomenclature&quot;&gt;The Hive connector nomenclature&lt;/h3&gt;

&lt;p&gt;Notice that the only change in the Trino architecture is &lt;em&gt;the runtime&lt;/em&gt;. The
HMS still exists along with &lt;em&gt;the storage&lt;/em&gt;. This is not by accident. This design
addresses a common problem faced by many companies: it simplifies the
migration from using Hive to using Trino. Regardless of &lt;em&gt;the storage&lt;/em&gt; component
used, &lt;em&gt;the runtime&lt;/em&gt; makes use of the HMS, and that is why this connector is
called the Hive connector.&lt;/p&gt;

&lt;p&gt;The confusion tends to arise when you search for a connector
from the context of the storage system you want to query. You may not even be 
aware that &lt;em&gt;the metastore&lt;/em&gt; is a necessity or even exists. Typically, you look for an
S3 connector, a GCS connector, or a MinIO connector. All you need is the Hive 
connector and the HMS to manage the metadata of the objects in your storage.&lt;/p&gt;
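
&lt;p&gt;As a sketch, a catalog file for querying MinIO through the Hive connector
might look like the following (the file name, metastore host, endpoint, and
credentials are placeholders for your environment, and property names can
differ between versions):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# etc/catalog/minio.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;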

&lt;h3 id=&quot;the-hive-metastore-service&quot;&gt;The Hive Metastore Service&lt;/h3&gt;

&lt;p&gt;The HMS is the only Hive process used in the entire Trino ecosystem when using
the Hive connector. The HMS is actually a simple service with a binary API using
&lt;a href=&quot;https://thrift.apache.org/&quot;&gt;the Thrift protocol&lt;/a&gt;. This service makes updates to
the metadata, stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There
are compatible replacements for the HMS, such as AWS Glue, which acts as a
drop-in substitute.&lt;/p&gt;

&lt;h3 id=&quot;getting-started-with-the-hive-connector-on-trino&quot;&gt;Getting started with the Hive Connector on Trino&lt;/h3&gt;

&lt;p&gt;To drive this point home, I created a tutorial that showcases using Trino and
looking at the metadata it produces. In the following scenario, the docker 
environment contains four docker containers:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trino&lt;/code&gt; - &lt;em&gt;the runtime&lt;/em&gt; in this scenario that replaces Hive.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minio&lt;/code&gt; - &lt;em&gt;the storage&lt;/em&gt; in this scenario, an open-source object storage server.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive-metastore&lt;/code&gt; -  &lt;em&gt;the metastore&lt;/em&gt; service instance.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mariadb&lt;/code&gt; - the database that &lt;em&gt;the metastore&lt;/em&gt; uses to store the metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can play around with the system and optionally view the configurations. The
scenario asks you to run a query to populate data in MinIO and then see the
resulting metadata populated in MariaDB by the HMS. The next step asks you to
run queries over the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mariadb&lt;/code&gt; database which holds the generated
metadata from &lt;em&gt;the metastore&lt;/em&gt;.&lt;/p&gt;
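
&lt;p&gt;As an illustration of those two steps, you might run something like the
following (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minio&lt;/code&gt; catalog and schema names are placeholders, and the
metastore’s internal tables, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TBLS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SDS&lt;/code&gt;, vary slightly between Hive versions):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- In Trino: create a schema and a table backed by MinIO
CREATE SCHEMA minio.tiny WITH (location = 's3a://tiny/');
CREATE TABLE minio.tiny.customer AS SELECT * FROM tpch.tiny.customer;

-- In MariaDB: inspect the metadata the HMS recorded
SELECT t.TBL_NAME, s.LOCATION
FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;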

&lt;p&gt;If you have any questions or run into any issues with the example, you can find
us on &lt;a href=&quot;/slack.html&quot;&gt;slack&lt;/a&gt; on the #dev or #general channels.&lt;/p&gt;

&lt;p&gt;Have fun!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio&quot; target=&quot;_blank&quot;&gt;
&lt;img src=&quot;/assets/blog/intro-to-hive-connector/intro-to-hive.jpeg&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>TL;DR: The Hive connector is what you use in Trino for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code. One of the most confusing aspects when starting Trino is the Hive connector. Typically, you seek out the use of Trino when you experience an intensely slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of Trino, formerly known as Presto, came about due to these slow Hive query conditions at Facebook back in 2012. So when you learn that Trino has a Hive connector, it can be rather confusing since you moved to Trino to circumvent the slowness of your current Hive cluster. Another common source of confusion is when you want to query your data from your cloud object storage, such as AWS S3, MinIO, and Google Cloud Storage. This too uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.</summary>

      
      
    </entry>
  
    <entry>
      <title>Launching Presto First Steps training</title>
      <link href="https://trino.io/blog/2020/10/07/presto-first-steps.html" rel="alternate" type="text/html" title="Launching Presto First Steps training" />
      <published>2020-10-07T00:00:00+00:00</published>
      <updated>2020-10-07T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/10/07/presto-first-steps</id>
      <content type="html" xml:base="https://trino.io/blog/2020/10/07/presto-first-steps.html">&lt;p&gt;Writing the book &lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive
Guide&lt;/a&gt; with Matt and Martin earlier this
year, and then publishing it with &lt;a href=&quot;https://www.oreilly.com/&quot;&gt;O’Reilly&lt;/a&gt; was a
great experience and has been a great success. Lots of readers took advantage of
getting a &lt;a href=&quot;/blog/2020/04/11/the-definitive-guide.html&quot;&gt;free digital copy of the book from Starburst&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now it is time to follow up with a training class. I am pleased to let you know
that you can join me for three hours of
&lt;a href=&quot;https://learning.oreilly.com/live-training/courses/presto-first-steps/0636920462859/&quot;&gt;Presto First Steps&lt;/a&gt;
in November.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The new course is aimed at beginners with Presto, who want to accelerate their
initial understanding and adoption. You ramp up quickly to install and configure
Presto, use the CLI, and learn how to query connected data sources with SQL. The
class is completely interactive, and I look forward to many of you joining me
and bringing lots of great questions.&lt;/p&gt;

&lt;p&gt;The class includes three interactive training exercises on
&lt;a href=&quot;https://katacoda.com/&quot;&gt;Katacoda&lt;/a&gt;. They allow you to get hands on experience
with Presto immediately. Lots of useful tips and tricks are covered in my
material, and of course I plan to run a bunch of additional demos. You can find
more details about the content of the class in &lt;a href=&quot;https://learning.oreilly.com/live-training/courses/presto-first-steps/0636920462859/&quot;&gt;the registration
page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Don’t miss out and make sure you &lt;a href=&quot;https://learning.oreilly.com/live-training/courses/presto-first-steps/0636920462859/&quot;&gt;reserve your ticket
now&lt;/a&gt;!&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Writing the book Trino: The Definitive Guide with Matt and Martin earlier this year, and then publishing it with O’Reilly was a great experience and has been a great success. Lots of readers took advantage of getting a free digital copy of the book from Starburst. Now it is time to follow up with a training class. I am pleased to let you know that you can join me for three hours of Presto First Steps in November.</summary>

      
      
    </entry>
  
    <entry>
      <title>Hello I&apos;m Brian, Presto Developer Advocate</title>
      <link href="https://trino.io/blog/2020/10/01/intro-developer-advocate.html" rel="alternate" type="text/html" title="Hello I&apos;m Brian, Presto Developer Advocate" />
      <published>2020-10-01T00:00:00+00:00</published>
      <updated>2020-10-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/10/01/intro-developer-advocate</id>
      <content type="html" xml:base="https://trino.io/blog/2020/10/01/intro-developer-advocate.html">&lt;p&gt;Hello, Presto nation!&lt;/p&gt;

&lt;p&gt;My name is Brian, and I’m a new developer advocate working at Starburst. Let me 
give you a little background on how I got here, and cover how my role can help
the Presto community.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/developer-advocate/brian.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;My career in computation and databases started in the military. As luck would
have it, I worked on a big data team as my first job out of college! I was in a
Hive shop that dealt with the typical outdated runtime and slow query
turnaround. Eventually, our architect introduced us to Presto as an alternative.
I worked with him to start testing and moving our existing use cases built on
Hive to use Presto. We also used Elasticsearch and had a few cases that needed
to perform joins and unions over the datasets in both Elasticsearch and Hive.
There were a few use cases that were not going to immediately be transferable
without some modification to the Presto Elasticsearch connector.&lt;/p&gt;

&lt;h2 id=&quot;joining-the-presto-community&quot;&gt;Joining the Presto community&lt;/h2&gt;

&lt;p&gt;The first modification was &lt;a href=&quot;https://github.com/trinodb/trino/issues/2441&quot;&gt;adding support for Elasticsearch array 
types&lt;/a&gt;, and the second was, 
&lt;a href=&quot;https://github.com/trinodb/trino/issues/754&quot;&gt;support for nested types&lt;/a&gt;. My 
first interaction with the Presto community was incredible! As a serial
open-source attempter, I always wanted to get invested in an open-source
project. I had started pull requests in various projects. Sometimes I ran into 
unpleasant maintainers; in other cases, the rules were daunting or too confusing
to get started with. Or I created a pull request only to have it sit there with no
communication as to why it wasn’t accepted or even looked at. However, when I
first joined &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt;, I searched to see if there was already a
discussion about array types in the history. I ran into &lt;a href=&quot;https://trinodb.slack.com/archives/CP1MUNEUX/p1570064139005900&quot;&gt;a discussion between 
Dain and Martin about this 
issue&lt;/a&gt;. I
conversed with Martin, who was incredibly polite and willing to take time to 
discuss how this should be implemented.&lt;/p&gt;

&lt;h2 id=&quot;contributing&quot;&gt;Contributing&lt;/h2&gt;

&lt;p&gt;When I actually pulled the code, I saw how well written and maintained it was
compared to many open-source projects I had seen in the past. I made a few
changes, wrote a test around my use case, and signed a CLA agreement. After a
couple of weeks, my pull request was merged and I had finally contributed to an
open-source project. After that interaction, and seeing the code, I wanted to do
more. I really saw something special with this community.&lt;/p&gt;

&lt;p&gt;While many Presto contributors are doing amazing work contributing code, I
noticed there were some holes in other areas of the community that needed to be
filled. I started answering questions on Slack, LinkedIn, and Twitter and I
planned out a Udemy course for Presto. The &lt;a href=&quot;https://youtu.be/RPaG0Gu2I6c&quot;&gt;initial 
video&lt;/a&gt; I piloted is about tuning the memory
configuration of Presto.&lt;/p&gt;

&lt;h2 id=&quot;becoming-a-developer-advocate&quot;&gt;Becoming a developer advocate&lt;/h2&gt;

&lt;p&gt;Around this time I got into contact with some folks at Starburst about joining 
them to work with the community and Presto full-time! As I joined, we hadn’t
figured out what my exact role was at Starburst. Eventually, we decided I would
best serve as a developer advocate. What I’ve come to find is that this role 
aims to do exactly what I set out to do before I joined. As a developer
advocate, I serve the community and act as a liaison between Starburst and the
Presto community. Up until this time, that responsibility has been unofficially
shared by many of the maintainers of Presto. I am here to simply take some of
that responsibility from them and focus all of my efforts on community growth
and health.&lt;/p&gt;

&lt;p&gt;The health of a community is difficult to define and is generally
subject to various signals that we can observe. These signals include an
increase in helpful interactions within the community, new members joining the
community, members who are actively engaging in the community, diversity of the
community, and more. If we start by focusing on making the community successful,
the success of the project will follow. Keeping the goal in mind that co-creator
David Phillips mentions:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is the type of project that we look at Postgres as the inspiration. 
Postgres started in the eighties, it became a SQL system in the nineties, and
it’s still in active use and active development today. We say we want Presto
to have the same kind of history. - David Phillips&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;next-steps&quot;&gt;Next Steps&lt;/h2&gt;

&lt;p&gt;My first goal is to create a larger set of free learning materials that expand
upon my initial goals when planning for my Udemy course. I recently started a
show with Manfred Moser called the Presto Community Broadcast. The show landing 
page is &lt;a href=&quot;/broadcast.html&quot;&gt;here&lt;/a&gt; and contains all the information about the show
schedule and where to find new and old episodes. This helps as we can use any
relevant material we create on this show for future teaching or blogs. We want
these live sessions to be interactive, and look forward to your feedback to
understand if our efforts are actually helping, or if you have ideas to improve
the show. This show, along with blogs, documentation, and interactive tutorials, 
is how I initially intend to answer some common questions that come in
through our &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt; and &lt;a href=&quot;https://stackoverflow.com/questions/tagged/presto&quot;&gt;Stack 
Overflow&lt;/a&gt; channels. Another
goal of adding these materials is to attract new members to the community. Not
all the material may be super relevant to the existing members of the community,
but this makes the community much more viable for newer members.&lt;/p&gt;

&lt;p&gt;Outside of providing new learning materials, your feedback helps us to
understand common problems and allows us to fix them. This feedback will aid us
in focusing on issues that are commonly voiced within the community but somehow
get lost in translation. This could mean improving the Presto code itself,
making the documentation better, or addressing common confusion, even if the
confusion comes from a force outside of the Presto community.&lt;/p&gt;

&lt;p&gt;For example, I recently &lt;a href=&quot;https://bitsondata.dev/what-is-benchmarketing-and-why-is-it-bad/&quot;&gt;wrote a 
blog&lt;/a&gt; about
some shady benchmarketing practices that were painting Presto in a bad light. 
The goal here was to make fun of the wildly bogus claims brought against Presto 
and the community. What better way to do that than to write a nerdy Justin
Bieber parody?&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/FSy8V-R0_Zw&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;While I have hopefully convinced you all of my mission here, I can’t accomplish
any of this in a vacuum. The whole point of my work starts and ends with all of
you. I look forward to speaking with and one day post COVID-19, meeting you all
at meetups and conferences. For now virtual meetups and the Presto Community
Broadcast are a great start. If you have ideas or want to reach out to introduce
yourself, you can find me on 
&lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/bitsondatadev&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading this and being a part of this community. One last thing to
tell you about myself, I’m a sucker for cheesy sign-offs so…&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For fast data at resto, Presto is the besto!&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Hello, Presto nation! My name is Brian, and I’m a new developer advocate working at Starburst. Let me give you a little background on how I got here, and cover how my role can help the Presto community.</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto at Argentina Big Data Meetup 2020-09-23</title>
      <link href="https://trino.io/blog/2020/09/28/argentina-big-data-meetup.html" rel="alternate" type="text/html" title="Presto at Argentina Big Data Meetup 2020-09-23" />
      <published>2020-09-28T00:00:00+00:00</published>
      <updated>2020-09-28T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/09/28/argentina-big-data-meetup</id>
      <content type="html" xml:base="https://trino.io/blog/2020/09/28/argentina-big-data-meetup.html">&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/IkjNcW7cS2w&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Martin made a guest appearance at the 
&lt;a href=&quot;https://www.meetup.com/Argentina-Big-Data-Meetup/&quot;&gt;Argentina Big Data Meetup&lt;/a&gt;
(online), where in the first hour he talks about Presto’s past, present, and
future. This includes the history from Facebook to Starburst, some context to
some early architectural decisions, as well as why Presto was open-sourced. 
Finally, Martin covers recent changes along with some upcoming changes on the
roadmap.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/assets/blog/argentina-big-data-meetup/Presto%20-%20Big%20Data%20Meetup%20Argentina%202020-09-23.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next hour is an interesting talk given by Federico Palladoro, covering his
company Jampp’s migration strategy from Presto on EMR to Docker, comparing Nomad
and Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/assets/blog/argentina-big-data-meetup/Big%20Data%20Meetup_%20Presto%20on%20Docker.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These presentations are in Spanish.&lt;/p&gt;

&lt;!--more--&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Martin made a guest appearance at the Argentina Big Data Meetup (online) where in the first hour Martin talks about Presto’s past, present, and future. This includes the history from Facebook to Starburst, some context to some early architectural decisions, as well as, why Presto was open-sourced. Finally, Martin covers recent changes along with some upcoming changes on the roadmap. Slides The next hour is an interesting talk given by Federico Palladoro covering his company, Jampp’s, migration strategy from EMR Presto to Docker using Nomad vs Kubernetes. Slides These presentations are in Spanish.</summary>

      
      
    </entry>
  
    <entry>
      <title>Read support for original files of Hive transactional tables in Presto</title>
      <link href="https://trino.io/blog/2020/09/23/hive-acid-original-files.html" rel="alternate" type="text/html" title="Read support for original files of Hive transactional tables in Presto" />
      <published>2020-09-23T00:00:00+00:00</published>
      <updated>2020-09-23T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/09/23/hive-acid-original-files</id>
      <content type="html" xml:base="https://trino.io/blog/2020/09/23/hive-acid-original-files.html">&lt;p&gt;In &lt;a href=&quot;https://trino.io/docs/current/release/release-331.html&quot;&gt;Presto 331&lt;/a&gt;,
read support for Hive transactional tables was introduced. It works well if a
user creates a new Hive transactional table and reads it from Presto. However,
if an existing table is converted to a Hive transactional table, Presto would
fail to read data from such a table because read support for original files was
missing. Original files are those files in a Hive transactional table that
existed before the table was converted into a Hive transactional table.
Until version 340, Presto expected all files in a Hive transactional table to be
in Hive ACID format. Users would have to perform a major compaction to convert
original files into ACID files (i.e. base files) in such tables. This is not
always possible as the original flat table (table in non-ACID format) could be
huge and converting all the existing data into ACID format can be very
expensive.&lt;/p&gt;

&lt;p&gt;This blog is an extension of the blog &lt;a href=&quot;/blog/2020/06/01/hive-acid.html&quot;&gt;Hive ACID and transactional tables’
support in Presto&lt;/a&gt;. It first describes
original files and then goes into details of read support for such files that
was added in Presto 340.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;what-are-the-original-files&quot;&gt;What are the original files?&lt;/h1&gt;

&lt;p&gt;Files present in non-transactional ORC tables have the standard ORC schema. When
a flat table is converted into a transactional table, existing files are not
converted into Hive ACID format. Such files in a transactional table, which are
not in Hive ACID format, are called original files. These files are named
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;000000_X&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;000000_X_copy_Y&lt;/code&gt;. These files don’t have ACID columns, and their
schema differs as follows:&lt;/p&gt;

&lt;p&gt;Table Schema&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;n_nationkey : int,
n_name : string,
n_regionkey : int,
n_comment : string
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Original File Schema&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;struct {
    n_nationkey : int,
    n_name : string,
    n_regionkey : int,
    n_comment : string
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Delta File Schema&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;struct {
    operation : int,
    originalTransaction : bigint,
    bucket : int,
    rowId : bigint,
    currentTransaction : bigint,
    row : struct {
        n_nationkey : int,
        n_name : string,
        n_regionkey : int,
        n_comment : string
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before Presto 340, Presto failed any query that read from a Hive
transactional table containing original files.&lt;/p&gt;

&lt;h1 id=&quot;update-and-delete-support-on-original-files&quot;&gt;Update and delete support on original files&lt;/h1&gt;

&lt;p&gt;Hive supports updates and deletes on rows in original files by synthetically
generating ACID columns for those files. Presto follows the same mechanism of
generating ACID columns synthetically, as discussed below.&lt;/p&gt;

&lt;h2 id=&quot;acid-column-generation-on-original-files&quot;&gt;ACID column generation on original files&lt;/h2&gt;

&lt;p&gt;Files in Hive ACID format have five ACID columns, but only three of them,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;originalTransactionId&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bucketId&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rowId&lt;/code&gt;, are needed to uniquely identify a row. In
this section, we will see how these three columns are synthetically generated for
original files.&lt;/p&gt;

&lt;h3 id=&quot;original-transaction-id&quot;&gt;Original transaction ID&lt;/h3&gt;

&lt;p&gt;An original transaction ID is the write ID when a record is first created. For
original files, the original transaction ID is always 0.&lt;/p&gt;

&lt;h3 id=&quot;bucket-id&quot;&gt;Bucket ID&lt;/h3&gt;

&lt;p&gt;Bucket ID is retrieved from the original file name. For the original file
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000ABC_DEF&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000ABC_DEF_copy_G&lt;/code&gt;, the bucket ID will be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ABC&lt;/code&gt;.&lt;/p&gt;
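The parsing described above can be sketched in a few lines. This is an illustrative snippet only, not Presto's actual parser (which relies on Hive's AcidUtils); the class and method names are hypothetical:

```java
// Illustrative sketch: derive the bucket ID from an original file name such as
// 000000_0 or 000123_0_copy_2. The digits before the first underscore, with
// leading zeros stripped, form the bucket ID.
public class BucketIdParser {
    static int bucketId(String fileName) {
        // Take everything before the first underscore; parseInt drops leading zeros.
        String prefix = fileName.substring(0, fileName.indexOf('_'));
        return Integer.parseInt(prefix);
    }

    public static void main(String[] args) {
        System.out.println(bucketId("000000_0"));        // 0
        System.out.println(bucketId("000123_0_copy_2")); // 123
    }
}
```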

&lt;h3 id=&quot;row-id&quot;&gt;Row ID&lt;/h3&gt;

&lt;p&gt;To calculate the row ID, first the total row count of all the original files
that come before the current one in lexicographical order is computed.
The global row ID is then the sum of that value and the local row ID within
the current original file.&lt;/p&gt;

&lt;p&gt;Here is an example of calculating the global row ID of the 3rd row of the original
file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;000000_0_copy_2&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;000000_0            -&amp;gt; 	X1 Rows (returned by ORC footer field numberOfRows)

000000_0_copy_1     -&amp;gt; 	X2 Rows (returned by ORC footer field numberOfRows)

000000_0_copy_2     -&amp;gt;	[ Row 0 ]
                        [ Row 1 ]
                        [ Row 2 ]   &amp;lt;- Local Row ID (returned by filePosition in OrcRecordReader) = 2
                                       Global Row ID = (X1+X2+2)
                        [ Row 3 ]

000000_0_copy_3     -&amp;gt;  X4 Rows
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since additional computation is required to generate row IDs
while reading original files, reads are slower than for ACID format files
in a transactional table.&lt;/p&gt;
&lt;/blockquote&gt;
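The row ID derivation above boils down to a prefix sum over lexicographically preceding files. The sketch below is illustrative only; the file names and row counts are made up, and this is not Presto's actual OriginalFilesUtils implementation:

```java
import java.util.Map;

// Sketch: global row ID = (sum of row counts of all original files that
// precede this file lexicographically) + local row ID within this file.
// In Presto, per-file row counts come from the ORC footer field numberOfRows.
public class OriginalFileRowId {
    static long globalRowId(Map<String, Long> rowCounts, String fileName, long localRowId) {
        long preceding = 0;
        for (Map.Entry<String, Long> e : rowCounts.entrySet()) {
            if (e.getKey().compareTo(fileName) < 0) { // strictly earlier files only
                preceding += e.getValue();
            }
        }
        return preceding + localRowId;
    }

    public static void main(String[] args) {
        Map<String, Long> rowCounts = Map.of(
                "000000_0", 100L,
                "000000_0_copy_1", 50L,
                "000000_0_copy_2", 4L);
        // 3rd row (local row ID 2) of 000000_0_copy_2 -> 100 + 50 + 2
        System.out.println(globalRowId(rowCounts, "000000_0_copy_2", 2)); // 152
    }
}
```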

&lt;p&gt;Once Presto has the three ACID columns for a row, it can check for an update or
delete on it. Delete deltas written by Hive for original files have row IDs
generated by the same strategy as discussed above. Hence, the same logic of
filtering out deleted rows as discussed in &lt;a href=&quot;/blog/2020/06/01/hive-acid.html&quot;&gt;Hive ACID and transactional tables’ support in Presto
&lt;/a&gt; works with original files too.&lt;/p&gt;

&lt;h1 id=&quot;changes-in-presto-to-support-reading-original-files&quot;&gt;Changes in Presto to support reading original files&lt;/h1&gt;

&lt;p&gt;Presto’s split generation logic and ORC reader were modified to add read support
for original files. The following changes were made at the coordinator and worker
level:&lt;/p&gt;

&lt;h2 id=&quot;split-generation&quot;&gt;Split generation&lt;/h2&gt;

&lt;p&gt;A new class named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo&lt;/code&gt; stores the original files and delete delta files for a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HiveSplit&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BackgroundSplitLoader.loadPartitions&lt;/code&gt; is called in an executor to create splits for each partition. In addition
to the steps mentioned in the blog &lt;a href=&quot;/blog/2020/06/01/hive-acid.html&quot;&gt;Hive ACID and transactional tables’ support in
Presto&lt;/a&gt;, Presto does the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Original files and ACID subdirectories (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt;) are
discovered by listing the partition location using the Hive AcidUtils helper class.&lt;/li&gt;
  &lt;li&gt;A registry for delete deltas, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DeleteDeltaInfo&lt;/code&gt;, is created, containing the minimal
information from which workers can construct the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; path.&lt;/li&gt;
  &lt;li&gt;A registry for original files, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OriginalFileInfo&lt;/code&gt;, is created, containing
information such as file name, size and bucket ID.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo.Builder&lt;/code&gt; keeps a map
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo.Builder.bucketIdToOriginalFileInfoMap&lt;/code&gt; from bucket ID to the list of
original files belonging to that bucket.&lt;/li&gt;
  &lt;li&gt;Hive splits are created for each original file and for the base and delta directories.
Each Hive split carries an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For an original file split, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo&lt;/code&gt; has:&lt;/p&gt;

    &lt;ol&gt;
      &lt;li&gt;&lt;strong&gt;Bucket ID:&lt;/strong&gt; Bucket ID of the original file.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;OriginalFilesList:&lt;/strong&gt; List of all the original files belonging to the
 same bucket, calculated from
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo.Builder.bucketIdToOriginalFileInfoMap&lt;/code&gt;.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;DeleteDeltaFilesList:&lt;/strong&gt; List of delete deltas.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For a base/delta file split, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo&lt;/code&gt; has:&lt;/p&gt;

    &lt;ol&gt;
      &lt;li&gt;&lt;strong&gt;DeleteDeltaFilesList:&lt;/strong&gt; List of delete deltas.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;reading-hive-original-files-data-in-workers&quot;&gt;Reading Hive original files data in workers&lt;/h2&gt;

&lt;p&gt;Hive splits generated during the split generation phase make their way to worker
nodes, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcPageSourceFactory&lt;/code&gt; is used to create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PageSource&lt;/code&gt; for the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TableScan&lt;/code&gt; operator. In addition to the steps mentioned in the blog &lt;a href=&quot;/blog/2020/06/01/hive-acid.html&quot;&gt;Hive ACID and
transactional tables’ support in Presto&lt;/a&gt;, Presto does the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcDeletedRows&lt;/code&gt; is created for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; locations, if any.&lt;/li&gt;
  &lt;li&gt;For an original file split, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcPageSourceFactory&lt;/code&gt; fetches &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;originalFilesList&lt;/code&gt;
from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AcidInfo&lt;/code&gt;, calculates &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;originalFileRowId&lt;/code&gt; by calling
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OriginalFilesUtils.getPrecedingRowCount&lt;/code&gt;, and sends this information to
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcPageSource&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcPageSource&lt;/code&gt; returns the rows from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcRecordReader&lt;/code&gt; that are not present in
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcDeletedRows&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;follow-up&quot;&gt;Follow up&lt;/h1&gt;

&lt;p&gt;For an original file split, the current implementation may take quadratic time
in the worst case to calculate the global row ID, since it reads row counts from
the footers of the original files. This could be optimized by keeping a
query-level cache on worker nodes, or by precomputing global row IDs in the
coordinator during split computation.&lt;/p&gt;

&lt;h1 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h1&gt;

&lt;p&gt;I would like to express my gratitude to everyone who helped me throughout
the development of this feature. Thank you
&lt;a href=&quot;https://in.linkedin.com/in/shubham-tagra-267a5838&quot;&gt;Shubham Tagra&lt;/a&gt; for
brainstorming sessions and providing continuous guidance on Presto Hive ACID.
Thank you &lt;a href=&quot;https://www.linkedin.com/in/piotrfindeisen/&quot;&gt;Piotr Findeisen&lt;/a&gt; for
helping me further refine the code with insightful code reviews.&lt;/p&gt;</content>

      
        <author>
          <name>Harmandeep Singh, Qubole</name>
        </author>
      

      <summary>In Presto 331, read support for Hive transactional tables was introduced. It works well, if a user creates a new Hive transactional table and reads it from Presto. However, if an existing table is converted to a Hive transactional table, Presto would fail to read data from such a table because read support for original files was missing. Original files are those files in a Hive transactional table that existed before the table was converted into a Hive transactional table. Until version 340, Presto expected all files in a Hive transactional table to be in Hive ACID format. Users would have to perform a major compaction to convert original files into ACID files (i.e. base files) in such tables. This is not always possible as the original flat table (table in non-ACID format) could be huge and converting all the existing data into ACID format can be very expensive. This blog is an extension of the blog Hive ACID and transactional tables’ support in Presto. It first describes original files and then goes into details of read support for such files that was added in Presto 340.</summary>

      
      
    </entry>
  
    <entry>
      <title>Configuring and Tuning Presto Performance with Dain</title>
      <link href="https://trino.io/blog/2020/08/27/training-performance.html" rel="alternate" type="text/html" title="Configuring and Tuning Presto Performance with Dain" />
      <published>2020-08-27T00:00:00+00:00</published>
      <updated>2020-08-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/08/27/training-performance</id>
      <content type="html" xml:base="https://trino.io/blog/2020/08/27/training-performance.html">&lt;p&gt;With the help of &lt;a href=&quot;/blog/2020/07/15/training-advanced-sql.html&quot;&gt;David’s training about advanced SQL&lt;/a&gt;, you composed a number of useful queries.
You gained valuable insights from the resulting data. However, these complex
queries take time to run. If only you could make them run faster. I think we
have just what you need:&lt;/p&gt;

&lt;p&gt;Join us for a free webinar &lt;strong&gt;Understanding and Tuning Presto Query Processing&lt;/strong&gt;
with Dain Sundstrom.&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;We did it again! Joined by over 120 eager students, we discussed all sorts of
aspects of sizing and tuning your Presto cluster. Yet again we received so many
questions that we went over our planned time budget. The material covered is
crucial to run a Presto deployment successfully in production, so make sure you
check out the recording and the slide deck:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.starburst.io/wp-content/uploads/2020/09/Presto-Training-Series-Configuring-Tuning-Presto-Performance.pdf&quot;&gt;Download the slides&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/Pu80FkBRP-k&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;In our new &lt;a href=&quot;https://bit.ly/2NO26Cm&quot;&gt;Presto Training Series&lt;/a&gt; we give Presto users
an opportunity to learn advanced skills from the co-creators of Presto –
&lt;a href=&quot;https://github.com/electrum&quot;&gt;David Phillips&lt;/a&gt;, 
&lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt; and 
&lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt;. Beyond the basics, each of the four 
training sessions covers critical topics for scaling Presto to more users and
use cases.&lt;/p&gt;

&lt;p&gt;This training session is geared towards helping users tune and size their Presto
deployment for optimal performance. Delivered by Dain Sundstrom,  this session
covers the following topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Cluster configuration and node sizing&lt;/li&gt;
  &lt;li&gt;Memory configuration and management&lt;/li&gt;
  &lt;li&gt;Improving task concurrency and worker scheduling&lt;/li&gt;
  &lt;li&gt;Tuning your JVM configuration&lt;/li&gt;
  &lt;li&gt;Investigating queries for join order and other criteria&lt;/li&gt;
  &lt;li&gt;Tuning the cost-based optimizer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Date: Wednesday, 9 September 2020&lt;/p&gt;

&lt;p&gt;Time: 10am PDT (San Francisco), 1pm EDT (New York), 6pm BST (London), 5pm UTC&lt;/p&gt;

&lt;p&gt;Duration: 2h&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://bit.ly/38kt5ih&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many Presto users joining us.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>With the help of David’s training about advanced SQL, you composed a number of useful queries. You gained valuable insights from the resulting data. However these complex queries take time to run. If only you could make them run faster. I think we have just what you need: Join us for a free webinar Understanding and Tuning Presto Query Processing with Dain Sundstrom. Update: We did it again! Joined by over 120 eager students we discussed all sorts of aspects of sizing and tuning your Presto cluster. Yet again we received so many questions that we went over our planned time budget. The material covered is crucial to run a Presto deployment successfully in production, so make sure you check out the recording and the slide deck: Download the slides</summary>

      
      
    </entry>
  
    <entry>
      <title>Faster Queries on Nested Data</title>
      <link href="https://trino.io/blog/2020/08/14/dereference-pushdown.html" rel="alternate" type="text/html" title="Faster Queries on Nested Data" />
      <published>2020-08-14T00:00:00+00:00</published>
      <updated>2020-08-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/08/14/dereference-pushdown</id>
      <content type="html" xml:base="https://trino.io/blog/2020/08/14/dereference-pushdown.html">&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-334.html&quot;&gt;Presto 334&lt;/a&gt;
adds significant performance improvements for queries
accessing nested fields inside struct columns. These queries are optimized through
the pushdown of dereference expressions: query execution eagerly prunes
structural data, extracting only the necessary fields.&lt;/p&gt;

&lt;h1 id=&quot;motivation&quot;&gt;Motivation&lt;/h1&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RowType&lt;/code&gt; is a built-in data type of Presto, storing the in-memory
representation of commonly used nested data types of the connectors, e.g. the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STRUCT&lt;/code&gt; type in Hive. Datasets often contain wide and deeply nested structural
columns, e.g. a struct column having hundreds of fields, with the fields being
nested themselves.&lt;/p&gt;

&lt;p&gt;Although such &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RowType&lt;/code&gt; columns can contain plenty of data, most
analytical queries access just a few of their fields. Without dereference
pushdown, Presto scans the whole column and shuffles all that data around
before projecting the necessary fields. This suboptimal execution causes higher
CPU usage, higher memory usage, and higher query latencies than necessary. The
unnecessary operations get even more expensive with wider/deeper structs and
more complex query plans.&lt;/p&gt;

&lt;p&gt;LinkedIn’s data ecosystem makes heavy usage of nested columns. It is common to
have 2-3 levels of nesting, and up to 50 fields in most of our tracking tables.
Because of the query execution inefficiency for nested fields, ETL pipelines
were set up at LinkedIn to copy the nested columns as a set of top-level columns
 corresponding to subfields. This step added overhead in our ingestion process
and delayed data availability for analytics. It also caused ORC schemas to be
inconsistent with the rest of the infrastructure, making it harder to migrate
from existing flows on row-oriented formats.&lt;/p&gt;

&lt;p&gt;Similarly, Lyft’s schemas make heavy use of nested data to decompose a ride
into its routes, riders, segments, modes, and geo-coordinates. Prior to the
performance improvements, analytical queries would either need to be run on
clusters with very long timeouts, or the data would have to be flattened before
being analyzed, adding an extra ETL step. Not only would this be costly, it
would also cause the original schema to diverge in our data warehouse making it
more difficult for data scientists to understand.&lt;/p&gt;

&lt;p&gt;The dereference pushdown optimization in Presto is having a massive impact on
the ingestion story at both LinkedIn and Lyft. Nested data is now being made
available faster for consumption with a consistency of structure across all
stores, while maintaining performance parity for analytical queries.&lt;/p&gt;

&lt;h1 id=&quot;example&quot;&gt;Example&lt;/h1&gt;

&lt;p&gt;Say we have a Hive table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jobs&lt;/code&gt;, with a struct-typed column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info&lt;/code&gt; in the
schema. The column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info&lt;/code&gt; is wide and deeply nested, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW(company
varchar, requirements ROW(skills array(...), education ROW(...), salary ...) ,
...)&lt;/code&gt;. Most queries access a small percentage of data from this struct
using the dereference projection (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.&lt;/code&gt; operator). Consider the query &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q&lt;/code&gt;
below.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;appid&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;job_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;company&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;applications&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jobs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jobid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jobid&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It should suffice to scan only the single field &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;company&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;J.job_info&lt;/code&gt; to
execute this query. But without dereference pushdown, Presto scans and
shuffles everything from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info&lt;/code&gt;, only to project a single field at the end.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dereference-pushdown/original_plan.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;solution-pushdown-of-dereference-expressions&quot;&gt;Solution: Pushdown of Dereference Expressions&lt;/h1&gt;

&lt;p&gt;With dereference pushdown, Presto optimizes queries by extracting the required
 fields from a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; as early as possible. This is achieved by modifying the
query plan through a set of optimizers, and can be broadly divided into two
parts.&lt;/p&gt;

&lt;p&gt;First, dereference projections are extracted in the query plan and pushed as
close to the table scan as possible. This happens independently of the
connector. Second, there is a further improvement for Hive tables: the
Hive connector and ORC/Parquet readers have been optimized to scan only the
required subfield columns.&lt;/p&gt;

&lt;p&gt;Pushdown of predicates on the subfields is also a crucial optimization. For
example, if a query has filters on subfields (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a.b &amp;gt; 5&lt;/code&gt;), they should be
utilized by the ORC/Parquet readers while scanning files. The pushdown helps with
the pruning of files, stripes, and row groups based on column-level statistics.
This optimization is achieved as a byproduct of the above two optimizations.&lt;/p&gt;

&lt;p&gt;With the dereference pushdown, queries observe significant performance gains in
terms of CPU/memory usage and query runtime, roughly proportional to the
relative size of nested columns compared to the accessed fields.&lt;/p&gt;

&lt;h2 id=&quot;pushdown-in-query-plan&quot;&gt;Pushdown in Query Plan&lt;/h2&gt;

&lt;p&gt;The goal here is to execute dereference projections as early as possible. This
usually means performing them right after the table scans.&lt;/p&gt;

&lt;p&gt;A projection operation that performs dereferencing on input symbols (i.e.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info.company&lt;/code&gt;) reduces the amount of data going up the plan tree. Pushing
dereference projections down means that we are pruning data early. It reduces
the amount of data being processed and shuffled in query execution. For the
example query &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q&lt;/code&gt;, the query plan looks like the following when dereference
pushdown is enabled.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dereference-pushdown/transformed_plan.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The projection &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info.company&lt;/code&gt; now directly follows the scan of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jobs&lt;/code&gt; table,
 avoiding the propagation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info&lt;/code&gt; through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Limit&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Join&lt;/code&gt; nodes. Note
that all of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info&lt;/code&gt; is still being scanned, and pruning it in the reader
requires connector-dependent optimizations.&lt;/p&gt;

&lt;h2 id=&quot;pushdown-in-the-hive-connector&quot;&gt;Pushdown in the Hive Connector&lt;/h2&gt;

&lt;p&gt;In columnar formats like ORC and Parquet, the data is laid out in a columnar
fashion even for subfields. If we have a column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STRUCT(f1, f2, f3)&lt;/code&gt;, the
subfields &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f1&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f2&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f3&lt;/code&gt; are stored as independent columns. An optimized
query engine should only scan the required fields through its ORC reader,
skipping the rest. This optimization has been added for the Hive connector.&lt;/p&gt;

&lt;p&gt;Dereference projections above a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TableScanNode&lt;/code&gt; are pushed down in the Hive
connector as “virtual” (or “projected”) columns. The query plan is modified to
refer to these new columns. For the query &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jobs&lt;/code&gt; table would be scanned
differently with this optimization, as shown below. The projection is now
embedded in the Hive connector. Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info#company&lt;/code&gt; can be thought of as
a virtual column representing the subfield &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_info.company&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dereference-pushdown/connector_pushdown.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The Hive connector handles the projections before returning columns to Presto’s
engine. It provides the required virtual columns to format-specific readers.
ORC and Parquet readers optimize their scans based on the subfields required,
increasing their read throughput. Subfield pruning is not possible for
row-oriented format readers (e.g. Avro). For these, the Hive connector performs an
adaptation to project the required fields.&lt;/p&gt;
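As a rough sketch of that adaptation step for row-oriented formats, the whole struct is materialized first and the required subfield is projected afterwards. The Map-based row representation and method name below are hypothetical, chosen only to illustrate the dereference; they are not the connector's actual types:

```java
import java.util.Map;

// Sketch: after a row-oriented reader returns the full struct, walk the
// dereference path (e.g. job_info.company) to project the required subfield.
public class SubfieldProjection {
    @SuppressWarnings("unchecked")
    static Object dereference(Map<String, Object> row, String... path) {
        Object current = row;
        for (String field : path) {
            // Each step descends one level into the nested struct.
            current = ((Map<String, Object>) current).get(field);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> jobInfo = Map.of("company", "Acme", "requirements", Map.of("salary", 100));
        Map<String, Object> row = Map.of("job_info", jobInfo);
        System.out.println(dereference(row, "job_info", "company")); // Acme
    }
}
```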

&lt;h2 id=&quot;pushdown-of-predicates-on-subfields&quot;&gt;Pushdown of Predicates on Subfields&lt;/h2&gt;

&lt;p&gt;Columnar formats store per-column statistics in the data files, which can be
used by the readers for filtering. For example, if a query contains the filter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y = 5&lt;/code&gt; for a
top-level column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y&lt;/code&gt;, Presto’s ORC reader can skip ORC stripes and files by
looking at the upper and lower bounds for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y&lt;/code&gt; in the statistics.&lt;/p&gt;
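For an equality predicate, this statistics-based pruning reduces to a range check against the recorded bounds. The following is a minimal sketch with a simplified signature, not Presto's actual reader API:

```java
// Sketch: a stripe (or file) whose [min, max] statistics for a column cannot
// contain the predicate value can be skipped without reading its data.
public class StripePruning {
    // True if the equality predicate "column = value" can never match
    // within a stripe whose column statistics are [min, max].
    static boolean canSkipEquality(long min, long max, long value) {
        return value < min || value > max;
    }

    public static void main(String[] args) {
        System.out.println(canSkipEquality(10, 20, 5)); // true: 5 lies outside [10, 20]
        System.out.println(canSkipEquality(1, 8, 5));   // false: stripe may contain 5
    }
}
```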

&lt;p&gt;The same concept of predicate-based pruning can work for filters involving
subfields, since statistics are also stored for subfield columns. That is,
Presto’s ORC/Parquet reader should be able to filter based on a constraint like
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x.f1 = 5&lt;/code&gt; for more optimal scans. Good news! In the final optimized plan,
predicates on a subfield are pushed down to the Hive connector as a constraint
on the corresponding virtual column, and later used for optimizing the scan.
The complete logic is too involved to explain here, but it can be illustrated
through the following example.&lt;/p&gt;

&lt;p&gt;Given an initial plan with a predicate on a dereferenced field (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x.f1 = 5&lt;/code&gt;), a
chain of optimizers transforms it into a more optimal plan with reader-level
predicates. In the future, the same optimization will be added to the Parquet
reader.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dereference-pushdown/predicate_pushdown.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the final plan, the Hive connector knows to scan the column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y&lt;/code&gt; and the subfield
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x.f1&lt;/code&gt;. It also takes advantage of the “virtual” column constraint &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x#f1 = 5&lt;/code&gt;
for reader-level pruning.&lt;/p&gt;

&lt;h2 id=&quot;performance-improvement&quot;&gt;Performance Improvement&lt;/h2&gt;

&lt;p&gt;Dereference pushdown improves performance for queries accessing nested fields
in multiple ways. First, it increases the read throughput for table scans,
reducing CPU time. The pruning of fields during the scan also means less
data to process for all downstream operators and tasks, so the early
projections result in more efficient execution for any operations that
shuffle or copy data. Moreover, for ORC/Parquet, read performance
improves in the case of selective filters on subfields.&lt;/p&gt;

&lt;p&gt;Below are some experimental results on a production dataset at LinkedIn which
contains 3 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STRUCT&lt;/code&gt; columns, each having ~20-30 small subfields. The
example queries used in the analysis access only a few subfields. The queries
are listed by their approximate query shape for the sake of brevity. The
plots compare CPU usage, peak memory usage, and average query wall time.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src=&quot;/assets/blog/dereference-pushdown/cpu_perf.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td&gt;&lt;img src=&quot;/assets/blog/dereference-pushdown/memory_perf.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dereference-pushdown/runtime_perf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CPU usage and peak memory usage show orders-of-magnitude improvement in the
presence of dereference pushdown. Query wall times also drop considerably,
and the improvement is more drastic for the relatively complex &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt; query,
as expected.&lt;/p&gt;

&lt;p&gt;Please note that these are not benchmarks! The performance improvement you’ll
see will vary depending on how many fields your nested data contains
versus how many you reference. At Lyft we saw improvements of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;50x&lt;/code&gt; for some
queries!&lt;/p&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;p&gt;The pushdown of dereference expressions can be extended to arrays: dereference
operations applied after unnesting an array should also get pushed
down to the readers. For example, using our jobs table from before, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jobs.job_info&lt;/code&gt; structure may contain a repeated structure such as
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;required_skills&lt;/code&gt;. With the following query, the entire &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;required_skills&lt;/code&gt;
structure would be read even though only a small part of it is referenced.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;description&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jobs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CROSS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNNEST&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;job_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;required_skills&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;years_of_experience&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The work for this improvement is being tracked in &lt;a href=&quot;https://github.com/trinodb/trino/issues/3925&quot;&gt;this issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Similar to the Hive connector, connector-level dereference pushdown can be extended
to other connectors that support nested types.&lt;/p&gt;

&lt;p&gt;Another future improvement will be the pushdown of predicates on subfields for
data stored in Parquet format. Although the pruning of nested fields occurs
with Parquet, the predicates are not yet pushed down into the reader.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Pushing down dereference operations in the query provides massive performance
gains, especially while operating on large structs. At LinkedIn and Lyft, this
feature has shown great impact for analytical queries on nested datasets.&lt;/p&gt;

&lt;p&gt;We’re excited for the Presto community to try it out. Feel free to dig into
&lt;a href=&quot;https://github.com/trinodb/trino/issues/1953&quot;&gt;this GitHub issue&lt;/a&gt; for
technical details. Please reach out to us on &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt; for further
discussion or to report issues.&lt;/p&gt;</content>

      
        <author>
          <name>Pratham Desai (LinkedIn), James Taylor (Lyft)</name>
        </author>
      

      <summary>Presto 334 adds significant performance improvements for queries accessing nested fields inside struct columns. They have been optimized through the pushdown of dereference expressions. With this feature, the query execution prunes structural data eagerly, extracting the necessary fields.</summary>

      
      
    </entry>
  
    <entry>
      <title>Securing Presto with Dain</title>
      <link href="https://trino.io/blog/2020/08/13/training-security.html" rel="alternate" type="text/html" title="Securing Presto with Dain" />
      <published>2020-08-13T00:00:00+00:00</published>
      <updated>2020-08-13T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/08/13/training-security</id>
      <content type="html" xml:base="https://trino.io/blog/2020/08/13/training-security.html">&lt;p&gt;All the useful and fast-running queries you created with the knowledge from
&lt;a href=&quot;/blog/2020/07/15/training-advanced-sql.html&quot;&gt;David’s training about advanced SQL&lt;/a&gt; and &lt;a href=&quot;/blog/2020/07/30/training-query-tuning.html&quot;&gt;Martin’s training about query
tuning&lt;/a&gt; created a problem. You
now have lots of users on your Presto cluster who want to access all sorts of
different data sources and have different privileges, and corporate security has asked
about your plans. How about you tap into some help from Dain:&lt;/p&gt;

&lt;p&gt;Join us for a free webinar &lt;strong&gt;Securing Presto&lt;/strong&gt; with Dain Sundstrom.&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;What a great training session! Dain captivated the audience, and lots of questions
were covered beyond all the great material from the slides. Everything is now
available for your convenience:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.starburst.io/wp-content/uploads/2020/08/Presto-Training-Securing-Presto.pdf&quot;&gt;Download the slides&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/KiMyRc3PSh0&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;In our new &lt;a href=&quot;https://bit.ly/2NO26Cm&quot;&gt;Presto Training Series&lt;/a&gt; we give Presto users
an opportunity to learn advanced skills from the co-creators of Presto –
&lt;a href=&quot;https://github.com/electrum&quot;&gt;David Phillips&lt;/a&gt;, 
&lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt; and 
&lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt;. Beyond the basics, each of the four 
training sessions covers critical topics for scaling Presto to more users and
use cases.&lt;/p&gt;

&lt;p&gt;In this training session Dain teaches you how to securely deploy Presto at
scale. We cover how to secure Presto itself, access to Presto, and access to
your underlying data. This session covers the following topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Presto authentication, including password &amp;amp; LDAP Authentication&lt;/li&gt;
  &lt;li&gt;Authorization to access your data sources&lt;/li&gt;
  &lt;li&gt;Encryption including Presto client-to-coordinator communication&lt;/li&gt;
  &lt;li&gt;Secure communication in the cluster&lt;/li&gt;
  &lt;li&gt;Support for Kerberos&lt;/li&gt;
  &lt;li&gt;Secrets usage for configuration files including catalogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Date: Wednesday, 26 August 2020&lt;/p&gt;

&lt;p&gt;Time: 10am PDT (San Francisco), 1pm EDT (New York), 6pm BST (London), 5pm UTC&lt;/p&gt;

&lt;p&gt;Duration: 2h&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://bit.ly/3ioQu7c&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many Presto users joining us.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>All the useful and fast-running queries you created with the knowledge from David’s training about advanced SQL and Martin’s training about query tuning created a problem. You now have lots of users on your Presto cluster who want to access all sorts of different data sources and have different privileges, and corporate security has asked about your plans. How about you tap into some help from Dain: Join us for a free webinar Securing Presto with Dain Sundstrom. Update: What a great training session! Dain captivated the audience, and lots of questions were covered beyond all the great material from the slides. Everything is now available for your convenience: Download the slides</summary>

      
      
    </entry>
  
    <entry>
      <title>Happy Eighth Birthday Presto!</title>
      <link href="https://trino.io/blog/2020/08/08/presto-eighth-birthday.html" rel="alternate" type="text/html" title="Happy Eighth Birthday Presto!" />
      <published>2020-08-08T00:00:00+00:00</published>
      <updated>2020-08-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/08/08/presto-eighth-birthday</id>
      <content type="html" xml:base="https://trino.io/blog/2020/08/08/presto-eighth-birthday.html">&lt;p&gt;Today, Presto turned eight years old! As Presto co-creator
Dain Sundstrom points out, there’s a reason why the eighth birthday is a
little special:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/daindumb/status/1292296395219595264&quot; target=&quot;_blank&quot;&gt;
&lt;img src=&quot;/assets/blog/presto-eighth-birthday/dain-tweet.png&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Even though Presto is a relatively young project, countless consumers, 
developers, and business personnel have felt its impact. It’s pretty clear
that a lot has been going on with this project since its inception eight years
ago. Recently, the Presto project hit a stunning twenty thousand commits:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/mtraverso/status/1289036458670448641&quot; target=&quot;_blank&quot;&gt;
&lt;img src=&quot;/assets/blog/presto-eighth-birthday/martin-tweet.png&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It makes you ponder how Presto became so successful in such a short amount of
time. Should the credit be given to the four founders who brought Presto to
life? Perhaps the supporting companies that provided the conditions that
called for such innovation? Or was it the community built around Presto since
its inception that has enabled this radical success?&lt;/p&gt;

&lt;p&gt;In my mind, it’s a combination of these conditions, but with a special
emphasis on the latter. Without the founders’ dedication to designing Presto
for speed and extensibility, and their emphasis on a welcoming and
inclusive open-source community, we wouldn’t have seen Presto outside the
walls of Facebook. Without companies like Facebook, Teradata, Netflix, and
Treasure Data that acted as catalysts for this change, we wouldn’t have the initial
use cases that tested Presto’s scalable design and shined a light on Presto
to bring awareness to the masses. Finally, without the passionate community
of developers who took an interest in giving back their time and efforts, 
Presto wouldn’t be anywhere near as robust or flexible as it is today. Now 
Presto has reached an unprecedented level of maturity and helped many
developers, scientists, and analysts find the answers they were looking for. 
It speaks volumes about just how special the project really is.&lt;/p&gt;

&lt;p&gt;This community of developers is really special in that the barrier to entry
for developers new to OSS (open source software) is refreshingly low.
Speaking from personal experience as a serial OSS attempter, when
I joined I noticed everyone treating each other with respect, a
willingness to teach, and a deliberate openness to new ideas. I interfaced
with engineers working at Starburst, the founders of Presto, and many
passionate developers like myself who also knew a thing or two about the
project and were so helpful to me. This was unlike other experiences I had
in the past, where joining an open source community felt like an elite club that
only existing members had access to. To me, this inclusiveness is why the
Presto community is thriving.&lt;/p&gt;

&lt;p&gt;The Presto community is most vibrant in &lt;a href=&quot;/slack.html&quot;&gt;the Slack channel&lt;/a&gt;. Here users and
developers ask questions about installing and using Presto, discuss
bug fixes or design changes, or sometimes just share great experiences or
news related to Presto. The Slack channel has recently grown to 2300 users
with around 500 active users at any given time.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/prestosql/status/1278393800092643328&quot; target=&quot;_blank&quot;&gt;
&lt;img src=&quot;/assets/blog/presto-eighth-birthday/presto-tweet.png&quot; /&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To celebrate Presto really means to celebrate this community, and while we
can’t thank every individual who has contributed, we want to thank just a
handful of you for your hard work. Thanks to these engineers for their
contributions to the Presto project!&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/ebyhr&quot;&gt;ebyhr&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kasiafi&quot;&gt;kasiafi&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/Praveen2112&quot;&gt;Praveen2112&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/phd3&quot;&gt;phd3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/lxynov&quot;&gt;lxynov&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pettyjamesm&quot;&gt;pettyjamesm&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/Lewuathe&quot;&gt;Lewuathe&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/raunaqmorarka&quot;&gt;raunaqmorarka&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/elonazoulay&quot;&gt;elonazoulay&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/luohao&quot;&gt;luohao&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While linking you to a blog post may not be a satisfactory thank you, the
gratitude is perhaps best &lt;a href=&quot;https://groups.google.com/g/presto-users/c/647v2ckRyGA&quot;&gt;stated on the presto-users&lt;/a&gt; Google group by co-creator Martin Traverso:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When Dain, David, Eric and I started the project that many
years ago, we had the goal to make it open source and build a community
around it. What we never imagined was how far it would go, how widely it
would be adopted across the entire world, and how many amazing people we
would meet and get a chance to work with along the way.&lt;/p&gt;

  &lt;p&gt;Congratulations to everyone who played a part in that journey. It’s been a
great ride so far. Here’s to another 8 years!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thanks to everyone who has contributed to Presto, and congratulations to the
founders for starting such an amazing project. Together let’s make Presto the
most useful analytics tool yet!&lt;/p&gt;</content>

      
        <author>
          <name>Brian Olsen</name>
        </author>
      

      <summary>Today, Presto turned eight years old! As Presto co-creator Dain Sundstrom points out, there’s a reason why the eighth birthday is a little special:</summary>

      
      
    </entry>
  
    <entry>
      <title>Understanding and Tuning Presto Query Processing with Martin</title>
      <link href="https://trino.io/blog/2020/07/30/training-query-tuning.html" rel="alternate" type="text/html" title="Understanding and Tuning Presto Query Processing with Martin" />
      <published>2020-07-30T00:00:00+00:00</published>
      <updated>2020-07-30T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/07/30/training-query-tuning</id>
      <content type="html" xml:base="https://trino.io/blog/2020/07/30/training-query-tuning.html">&lt;p&gt;With the help of &lt;a href=&quot;/blog/2020/07/15/training-advanced-sql.html&quot;&gt;David’s training about advanced SQL&lt;/a&gt; you composed a number of useful queries.
You gained valuable insights from the resulting data. However, these complex
queries take time to run. If only you could make them run faster. I think we
have just what you need coming up.&lt;/p&gt;

&lt;p&gt;Join us for a free webinar &lt;strong&gt;Understanding and Tuning Presto Query Processing&lt;/strong&gt;
with Martin Traverso.&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;We are delighted that such an advanced topic attracted close to 150 attendees.
Everyone learned a lot, and many additional questions came up during class and in
the Q&amp;amp;A overtime. Take advantage of the slides and recording to recap, or if
you could not attend:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.starburst.io/wp-content/uploads/2020/08/Presto-Training-Understanding-and-Tuning-Presto-Query-Processing.pdf&quot;&gt;Download the slides&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/GcS02yTNwC0&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;In our new &lt;a href=&quot;https://bit.ly/2NO26Cm&quot;&gt;Presto Training Series&lt;/a&gt; we give Presto users
an opportunity to learn advanced skills from the co-creators of Presto –
&lt;a href=&quot;https://github.com/electrum&quot;&gt;David Phillips&lt;/a&gt;, 
&lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt; and 
&lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt;. Beyond the basics, each of the four 
training sessions covers critical topics for scaling Presto to more users and
use cases.&lt;/p&gt;

&lt;p&gt;In this training session Martin helps you understand how Presto executes queries.
That knowledge can help you improve query performance. For example, the explain
plan is a powerful tool, but reading plans and making sense of them can be
overwhelming. We explore how to create an explain plan for your query and how to
read it. We look at the work the cost-based optimizer performs and how you can
potentially help Presto run your queries even faster. This session covers the
following topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Explain the EXPLAIN&lt;/li&gt;
  &lt;li&gt;Learn how queries are analyzed and executed&lt;/li&gt;
  &lt;li&gt;Understand what the optimizer does, including some of its limitations&lt;/li&gt;
  &lt;li&gt;Showcase the cost-based optimizer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Date: Wednesday, 12 August 2020&lt;/p&gt;

&lt;p&gt;Time: 10am PDT (San Francisco), 1pm EDT (New York), 6pm BST (London), 5pm UTC&lt;/p&gt;

&lt;p&gt;Duration: 2h&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://bit.ly/2VB9DZP&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many Presto users joining us.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>With the help of David’s training about advanced SQL you composed a number of useful queries. You gained valuable insights from the resulting data. However, these complex queries take time to run. If only you could make them run faster. I think we have just what you need coming up. Join us for a free webinar Understanding and Tuning Presto Query Processing with Martin Traverso. Update: We are delighted that such an advanced topic attracted close to 150 attendees. Everyone learned a lot, and many additional questions came up during class and in the Q&amp;amp;A overtime. Take advantage of the slides and recording to recap, or if you could not attend: Download the slides</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto for Analytics at Pinterest</title>
      <link href="https://trino.io/blog/2020/07/22/presto-summit-pinterest.html" rel="alternate" type="text/html" title="Presto for Analytics at Pinterest" />
      <published>2020-07-22T00:00:00+00:00</published>
      <updated>2020-07-22T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/07/22/presto-summit-pinterest</id>
      <content type="html" xml:base="https://trino.io/blog/2020/07/22/presto-summit-pinterest.html">&lt;p&gt;After &lt;a href=&quot;/blog/2020/05/15/state-of-presto.html&quot;&gt;State of Presto&lt;/a&gt; and the two
real-world examples from &lt;a href=&quot;/blog/2020/06/16/presto-summit-zuora.html&quot;&gt;Zuora&lt;/a&gt;
and &lt;a href=&quot;/blog/2020/07/06/presto-summit-arm-td.html&quot;&gt;Arm Treasure Data&lt;/a&gt;, I hope
you are ready to hear from a well-known brand using Presto in their analytics
ecosystem – &lt;a href=&quot;https://www.pinterest.com&quot;&gt;Pinterest&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presto: A key component for analytics at Pinterest&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;Our webinar was well received and prompted a whole bunch of questions. Check out
the slides and video recording:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.starburst.io/wp-content/uploads/2020/08/Presto-Summit-Webinar-Series-Presto-at-Pinterest.pdf&quot;&gt;Download the slides&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/mZ59CTOPkl8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;Join us to learn how Pinterest uses Presto to meet the company’s rapidly
increasing analytics needs, while keeping costs low.&lt;/p&gt;

&lt;p&gt;Presto plays an important role in Pinterest’s analytics ecosystem. Find out how
Pinterest runs Presto, how the company leverages warning systems to guide
users to write better queries, and how Pinterest scales up their clusters to
meet their rapidly growing and complex workloads.&lt;/p&gt;

&lt;p&gt;The following topics are discussed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Presto integrated with Pinterest infrastructure&lt;/li&gt;
  &lt;li&gt;Setup of warning systems to guide users to write better queries&lt;/li&gt;
  &lt;li&gt;Management of complex workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Speakers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.linkedin.com/in/puchengy/&quot;&gt;Pucheng Yang&lt;/a&gt; is a software engineer
at Pinterest working on the Presto, SparkSQL and Hive query engines. He joined
the company two years ago as a new grad.&lt;/li&gt;
  &lt;li&gt;Yi He is a software engineer at Pinterest. Prior to Pinterest, he worked at
Facebook on Presto OLAP and query federation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Date: Wednesday, 19 August 2020&lt;/p&gt;

&lt;p&gt;Time: 10am PDT (San Francisco), 1pm EDT (New York), 6pm BST (London), 5pm UTC&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://bit.ly/32FfRfm&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many Presto users joining us and participating in the webinar
with their questions.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>After State of Presto and the two real-world examples from Zuora and Arm Treasure Data, I hope you are ready to hear from a well-known brand using Presto in their analytics ecosystem – Pinterest: Presto: A key component for analytics at Pinterest Update: Our webinar was well received and prompted a whole bunch of questions. Check out the slides and video recording: Download the slides</summary>

      
      
    </entry>
  
    <entry>
      <title>Advanced SQL in Presto with David</title>
      <link href="https://trino.io/blog/2020/07/15/training-advanced-sql.html" rel="alternate" type="text/html" title="Advanced SQL in Presto with David" />
      <published>2020-07-15T00:00:00+00:00</published>
      <updated>2020-07-15T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/07/15/training-advanced-sql</id>
      <content type="html" xml:base="https://trino.io/blog/2020/07/15/training-advanced-sql.html">&lt;p&gt;You have read our book &lt;a href=&quot;/blog/2020/04/11/the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt;, practiced with various SQL examples, and
consulted our &lt;a href=&quot;https://trino.io/docs&quot;&gt;Presto documentation&lt;/a&gt;. Great steps to
become a Presto and SQL expert. However, learning efficient and advanced SQL can
take years of experience. Luckily we have some help from an expert coming your
way.&lt;/p&gt;

&lt;p&gt;Join us for a free webinar &lt;strong&gt;Advanced SQL in Presto&lt;/strong&gt; with David Phillips.&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;With nearly 200 live attendees and a two hour session we ended with lots of
questions from the engaged audience. After 20 minutes overtime we wrapped up the
successful event. Check out the presentation slides and the recording:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.starburst.io/wp-content/uploads/2020/07/Presto-Training-Series-Advanced-SQL-Features-in-Presto.pdf&quot;&gt;Download the slides&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/HN_95ObHAiw&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;In our new &lt;a href=&quot;https://bit.ly/2NO26Cm&quot;&gt;Presto Training Series&lt;/a&gt; we give Presto users
an opportunity to learn advanced skills from the co-creators of Presto –
&lt;a href=&quot;https://github.com/electrum&quot;&gt;David Phillips&lt;/a&gt;, 
&lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt; and 
&lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt;. Beyond the basics, each of the four 
training sessions covers critical topics for scaling Presto to more users and
use cases.&lt;/p&gt;

&lt;p&gt;Our first session is geared towards helping users understand how to
run more complex and comprehensive SQL queries with Presto. Delivered by David
Phillips, this session covers the following topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Using JSON and other complex data types&lt;/li&gt;
  &lt;li&gt;Advanced aggregation techniques&lt;/li&gt;
  &lt;li&gt;Window functions&lt;/li&gt;
  &lt;li&gt;Array and map functions&lt;/li&gt;
  &lt;li&gt;Lambda expressions&lt;/li&gt;
  &lt;li&gt;Many other SQL functions and features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Date: Wednesday, 29 July 2020&lt;/p&gt;

&lt;p&gt;Time: 10am PDT (San Francisco), 1pm EDT (New York), 6pm BST (London), 5pm UTC&lt;/p&gt;

&lt;p&gt;Duration: 2h&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://bit.ly/2YOtx5f&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many Presto users joining us.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>You have read our book Trino: The Definitive Guide, practiced with various SQL examples, and consulted our Presto documentation. Great steps to become a Presto and SQL expert. However, learning efficient and advanced SQL can take years of experience. Luckily we have some help from an expert coming your way. Join us for a free webinar Advanced SQL in Presto with David Phillips. Update: With nearly 200 live attendees and a two hour session we ended with lots of questions from the engaged audience. After 20 minutes overtime we wrapped up the successful event. Check out the presentation slides and the recording: Download the slides</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Migration at Arm Treasure Data</title>
      <link href="https://trino.io/blog/2020/07/06/presto-summit-arm-td.html" rel="alternate" type="text/html" title="Presto Migration at Arm Treasure Data" />
      <published>2020-07-06T00:00:00+00:00</published>
      <updated>2020-07-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/07/06/presto-summit-arm-td</id>
      <content type="html" xml:base="https://trino.io/blog/2020/07/06/presto-summit-arm-td.html">&lt;p&gt;Both events of our virtual Presto Summit tour,
&lt;a href=&quot;/blog/2020/05/15/state-of-presto.html&quot;&gt;State of Presto&lt;/a&gt; and the
&lt;a href=&quot;/blog/2020/06/16/presto-summit-zuora.html&quot;&gt;Zuora presentation&lt;/a&gt;,
were well received, and recordings are available for you to watch. Your next
chance to learn more about Presto in the real world comes from Arm Treasure
Data and is presented by Taro L. Saito:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presto at Arm Treasure Data: A Journey of Migrating 1 Million Presto Queries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;We had a great event with some in-depth, detailed questions from the audience.
Check out the recording to learn more:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/NGMugRsNraE&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;Join us to discover how, as part of their customer data platform, Arm Treasure
Data utilizes Presto as the query engine processing over 1 million queries per
day. This system supports the data business of over 500 companies in three
regions: US, EU, and Asia.&lt;/p&gt;

&lt;p&gt;Arm Treasure Data had been using Presto 0.205, and in 2019 started a big
migration project to Presto 317. Although they performed extensive query
simulations to check for incompatibilities, the team faced many unexpected challenges.
In this session you learn more about their migration of the production system:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Technical details on many challenges&lt;/li&gt;
  &lt;li&gt;Key lessons learned&lt;/li&gt;
  &lt;li&gt;Latest updates on AWS Graviton2, the next generation of 64-bit Arm instance
types that can be used for running Presto&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our speaker, Taro L. Saito, is a principal software engineer at Arm Treasure
Data and holds a Ph.D. in computer science from the University of Tokyo. He has built a
cloud database service at Arm Treasure Data, which processes millions
of queries every day. Previously, he worked as an assistant professor at the
University of Tokyo, studying distributed database systems and their
applications to genome sciences. He has created several open-source projects,
including Airframe, MessagePack, and various sbt plugins (sbt-sonatype,
sbt-pack) for Scala that help to publish thousands of OSS projects.&lt;/p&gt;

&lt;p&gt;Date: Thursday, 16 July 2020&lt;/p&gt;

&lt;p&gt;Time: 10am PDT (San Francisco), 1pm EDT (New York), 6pm BST (London), 5pm UTC&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://bit.ly/38wrS80&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many Presto users joining us.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Both events of our virtual Presto Summit tour, State of Presto and the Zuora presentation, were well received, and recordings are available for you to watch. Your next chance to learn more about Presto in the real world comes from Arm Treasure Data and is presented by Taro L. Saito: Presto at Arm Treasure Data: A Journey of Migrating 1 Million Presto Queries Update: We had a great event with some in-depth, detailed questions from the audience. Check out the recording to learn more:</summary>

      
      
    </entry>
  
    <entry>
      <title>Data Integrity Protection in Presto</title>
      <link href="https://trino.io/blog/2020/06/25/data-integrity-protection.html" rel="alternate" type="text/html" title="Data Integrity Protection in Presto" />
      <published>2020-06-25T00:00:00+00:00</published>
      <updated>2020-06-25T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/06/25/data-integrity-protection</id>
      <content type="html" xml:base="https://trino.io/blog/2020/06/25/data-integrity-protection.html">&lt;p&gt;It all started on a Thursday afternoon in March, when &lt;a href=&quot;https://github.com/sopel39&quot;&gt;Karol Sobczak&lt;/a&gt;
was grilling Presto with heavy rounds of benchmarks, as we were ramping up to the Starburst Enterprise
Presto 332-e release. Karol discovered what seemed to be a serious regression, which turned out to be an even more
serious cloud environment issue.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;presto-benchmarks&quot;&gt;Presto Benchmarks&lt;/h1&gt;

&lt;p&gt;At the Presto project, we take stability and efficiency seriously, so releases undergo
rigorous performance benchmarks. The intention is to safeguard against any performance regressions
or stability problems. Usually, performance improvements are benchmarked separately when they
are added to the codebase. At Starburst, those benchmarks are even more important, especially
for the Starburst Enterprise Presto LTS releases.&lt;/p&gt;

&lt;p&gt;On a side note, we use &lt;a href=&quot;https://github.com/trinodb/benchto&quot;&gt;Benchto&lt;/a&gt; for organizing
&lt;a href=&quot;https://github.com/trinodb/trino/tree/master/presto-benchto-benchmarks&quot;&gt;Presto benchmark suites&lt;/a&gt;,
executing them and collecting the results. We use managed &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt; in a public
cloud for provisioning Presto clusters, along with &lt;a href=&quot;https://www.starburst.io/platform/deployment-options/starburst-on-kubernetes/&quot;&gt;Starburst Enterprise Presto Kubernetes&lt;/a&gt;.
We use &lt;a href=&quot;https://jupyter.org/&quot;&gt;Jupyter&lt;/a&gt; for producing result reports in HTML and PDF formats.&lt;/p&gt;

&lt;h1 id=&quot;alleged-regression&quot;&gt;Alleged Regression&lt;/h1&gt;

&lt;p&gt;It all started in March, when &lt;a href=&quot;https://github.com/sopel39&quot;&gt;Karol Sobczak&lt;/a&gt;
was grilling Presto with heavy rounds of benchmarks for the Starburst Enterprise Presto 332-e release.
On one Thursday afternoon he reported stability problems, with a few benchmark runs failing with
exceptions similar to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Query failed (#20200326_150852_00338_dj225): Unknown block encoding:
LONG_ARRAY� � �� � @@@���� �@  @ � �@@@ @@� @�@D�� @@��@ `� @@� @#�@ � 0�
... (9550 more bytes)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In Presto, a block encoding is a way of encoding a particular block type (here, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LongArrayBlock&lt;/code&gt;).
Encodings are used when exchanging blocks of data between Presto nodes, or when spilling to disk.
Blocks form a polymorphic class hierarchy, so every time a block is encoded, we need
to also store the encoding identifier. The encoding identifier (here, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LONG_ARRAY&lt;/code&gt; string)
is written as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;string length&amp;gt;&lt;/code&gt; (a 4-byte, signed integer in little-endian) followed by
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;string bytes&amp;gt;&lt;/code&gt; containing the UTF-8 representation of the encoding id. Clearly, in the case above,
the receiver read the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;encoding id length&amp;gt;&lt;/code&gt; as 9623 instead of 10! How could that ever be possible?&lt;/p&gt;
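&lt;p&gt;As an illustration only (Presto’s actual implementation is in Java), the following Python sketch shows this length-prefixed layout; the function names are made up for the example:&lt;/p&gt;

```python
# Sketch of the length-prefixed block encoding id described above:
# a 4-byte little-endian signed length, then the UTF-8 bytes of the id.
def write_encoding_id(encoding_id):
    raw = encoding_id.encode("utf-8")
    return len(raw).to_bytes(4, "little", signed=True) + raw

def read_encoding_id(buf):
    length = int.from_bytes(buf[0:4], "little", signed=True)
    return buf[4:4 + length].decode("utf-8")

frame = write_encoding_id("LONG_ARRAY")
assert frame[0:4] == (10).to_bytes(4, "little", signed=True)
assert read_encoding_id(frame) == "LONG_ARRAY"
```

&lt;p&gt;With framing like this, any displacement of the length prefix makes the receiver interpret arbitrary bytes as the encoding id, which is exactly the symptom in the exception above.&lt;/p&gt;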

&lt;p&gt;Presto 332 brought a lot of good changes, and the upgrade to Java 11 was one of them.
Therefore, Starburst Enterprise Presto 332-e was the first Starburst release using Java 11 by default.
For earlier releases, we ran benchmarks using AWS EC2 machines orchestrated with &lt;a href=&quot;https://www.starburst.io/platform/deployment-options/aws/&quot;&gt;Starburst’s Presto
CloudFormation Template (CFT)&lt;/a&gt;. This was also the first time we ran
Presto release benchmarks on Kubernetes clusters, with AWS EKS. We could suspect many different factors
as the cause. We started to sift through the code, the team’s “collective brain”, and
the Internet for any ideas. One of the important sources was Vijay Pandurangan’s writeup on the &lt;a href=&quot;https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19&quot;&gt;data
corruption bug discovered by Twitter in 2015&lt;/a&gt;. Of course, we also repeated benchmark runs. Seeing is believing.&lt;/p&gt;

&lt;h1 id=&quot;production-issues&quot;&gt;Production issues&lt;/h1&gt;

&lt;p&gt;On the next day, a customer reported similar problems with their Presto cluster. Of course, they
were not running the yet-to-be-released version that we were still benchmarking. They ran into what seemed to
be a very serious regression in the Starburst Enterprise Presto 323-e release line. The customer was also using
the AWS cloud, but not the Kubernetes deployment. They were using the &lt;a href=&quot;https://www.starburst.io/platform/deployment-options/aws/&quot;&gt;CFT-based deployment&lt;/a&gt;
– the same stack we had been using for all our release benchmarks so far – and we had never run into issues like this before.
As the customer was using a fresh-off-the-press latest minor release, we decided (in the spirit of the global health care trend)
to “quarantine” that release and roll back the customer installation to the previous version.&lt;/p&gt;

&lt;p&gt;However, the fact that a small bug fix release triggered data problems was unnerving. The fact that we
had not discovered any of these problems before was even more unnerving.&lt;/p&gt;

&lt;h1 id=&quot;more-testing--the-data-corruption&quot;&gt;More testing – the data corruption&lt;/h1&gt;

&lt;p&gt;As we were running more and more, and even more test runs, we discovered new failure modes.
For example:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Query failed (#20200327_001931_00020_8di4r): Cannot cast DECIMAL(7, 2) &apos;18734974449861284.67&apos; to DECIMAL(12, 2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Well, this message is not &lt;em&gt;wrong&lt;/em&gt;. It’s not possible to cast &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;18734974449861284.67&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECIMAL(12, 2)&lt;/code&gt;.
Except that it is &lt;em&gt;also&lt;/em&gt; not possible to have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECIMAL(7, 2)&lt;/code&gt; with such a value. Something had gone wrong with the
data. At that moment, we realized the problem was very serious, because data could become corrupted.
This corrupted data could lead to a failure (like the one above), but it could also lead to incorrect query results,
or incorrect data being persisted (in the case of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE TABLE AS&lt;/code&gt; queries). We created
a virtual War Room (that is, a Slack channel) and gathered all our Presto experts and our experienced field team
to discuss potential causes, further diagnostics, and mitigation strategies.&lt;/p&gt;

&lt;p&gt;Since the problem was affecting data exchanges between Presto nodes, we listed the following strategies
to try to dissect the problem:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;determining which query or queries are causing failures,&lt;/li&gt;
  &lt;li&gt;running with HTTP/2,&lt;/li&gt;
  &lt;li&gt;reverting to running on Java 8,&lt;/li&gt;
  &lt;li&gt;enabling exchange compression (as decompression is very sensitive to data corruption),&lt;/li&gt;
  &lt;li&gt;trying to upgrade Jetty,&lt;/li&gt;
  &lt;li&gt;determining whether failures correlate with JVM GC activity,&lt;/li&gt;
  &lt;li&gt;inspecting the source code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;different-configuration&quot;&gt;Different configuration&lt;/h1&gt;

&lt;p&gt;We were able to quickly prototype and verify some of the ideas. Switching to HTTP/2 or
upgrading Jetty to the latest version did not help. Nor did downgrading to a Jetty version
that we had been using for a long time. We also verified that the problem was reproducible with Java 8,
so we concluded that Java 11 was not the cause of it.&lt;/p&gt;

&lt;h1 id=&quot;checksums&quot;&gt;Checksums&lt;/h1&gt;

&lt;p&gt;We identified that the problem occurs somewhere within exchanges, between one Presto worker
node serializing a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Page&lt;/code&gt; object (the basic unit of data processing in Presto) and another node
deserializing it.&lt;/p&gt;

&lt;p&gt;While the decimal cast failure didn’t directly point at the data corruption problem (there could
be many other reasons for it), there was no other explanation for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Unknown block encoding&lt;/code&gt; exceptions.
The serialization is done in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PagesSerde.serialize&lt;/code&gt; (used by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TaskOutputOperator&lt;/code&gt;, the data sender) and
deserialization is done in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PagesSerde.deserialize&lt;/code&gt; (used by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExchangeOperator&lt;/code&gt;, the
receiver of the data). As the logic is nicely encapsulated in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PagesSerde&lt;/code&gt; class, we
added checksums to the serialized data: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;checksum&amp;gt; &amp;lt;serialized page&amp;gt;&lt;/code&gt;.
This felt like a smart move – except that it gave us nothing more than a confirmation that
there was a problem (“checksum failure”) – which we already knew.&lt;/p&gt;

&lt;p&gt;We considered adding logging to capture data going out from one node and coming in on
another node, but that would be a huge amount of logs. One run of the benchmarks transfers
hundreds of terabytes of data between the nodes.&lt;/p&gt;

&lt;p&gt;We went ahead and created a Presto build that added data redundancy to be able to reconstruct
the data on the receiving side.
There are many &lt;a href=&quot;https://en.wikipedia.org/wiki/Erasure_code&quot;&gt;well-known error-correction codes&lt;/a&gt;
(e.g. &lt;a href=&quot;https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction&quot;&gt;Reed–Solomon error correction&lt;/a&gt;
available in Hadoop 3). In our case, speed of &lt;em&gt;implementation&lt;/em&gt; (a.k.a. simplicity) was a deciding factor,
so we added data mirroring: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;checksum&amp;gt; &amp;lt;serialized page&amp;gt; &amp;lt;serialized page&amp;gt;&lt;/code&gt;.
To avoid logging all the data exchanges, we attached the serialized pages (both copies)
to the exceptions being raised.&lt;/p&gt;
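&lt;p&gt;A minimal sketch of that debugging frame layout, in Python rather than Presto’s Java, with CRC32 standing in for whatever checksum function the build actually used:&lt;/p&gt;

```python
# Sketch of the assumed debugging-build frame: checksum, then two
# identical copies of the serialized page. crc32 is a stand-in here.
import zlib

def wrap(page_bytes):
    checksum = zlib.crc32(page_bytes).to_bytes(4, "little")
    return checksum + page_bytes + page_bytes

def unwrap(frame, page_size):
    checksum = int.from_bytes(frame[0:4], "little")
    first = frame[4:4 + page_size]
    second = frame[4 + page_size:4 + 2 * page_size]
    if zlib.crc32(first) != checksum:
        # First copy corrupted in transit; the mirror lets us see *how*.
        return ("corrupted", first, second)
    return ("ok", first, second)

frame = wrap(b"serialized page")
assert unwrap(frame, 15)[0] == "ok"
```

&lt;p&gt;Mirroring doubles the exchange volume, which is why a proper error-correction code would be preferable in production – but for a short-lived diagnostic build, simplicity wins.&lt;/p&gt;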

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java.sql.SQLException: Query failed (#20200401_113622_00676_p7qp7): Hash mismatch, read: 1251072184702746109, calculated: 7591448164918409110
    Suppressed: java.lang.RuntimeException: Slice, first half: 040000000A0000004C4F4E475F415252.... (945 kilobytes)
    Suppressed: java.lang.RuntimeException: Slice, secnd half: 040000000A0000004C4F4E475F415252.... (945 kilobytes)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The exception told us the first part had changed, since the read checksum did not match the calculated
checksum (the calculated checksum was derived from the first copy of the data and differed from the checksum
computed on the sending side).
With the encoded data in the exception like that, it was easy to extract the actual data and compare it,
so now we could see &lt;em&gt;how&lt;/em&gt; the data was changed.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cat failure.txt | grep &apos;Slice, first half&apos; | cut -d: -f4- | sed &apos;s/^ *//&apos; | xxd -r -p &amp;gt; changed
cat failure.txt | grep &apos;Slice, secnd half&apos; | cut -d: -f4- | sed &apos;s/^ *//&apos; | xxd -r -p &amp;gt; original
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Comparing binary files is fun, but in practice it can be more convenient to compare &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hexdump&lt;/code&gt; output.
The output below was created with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vimdiff &amp;lt;(hexdump -Cv original) &amp;lt;(hexdump -Cv changed)&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;++--6064 lines: 00000000  04 00 00 00 0a 00 00 00  4c 4f 4...|+ +--6064 lines: 00000000  04 00 00 00 0a 00 00...
 00017b00  00 cb 6a 25 00 00 00 00  00 cb 6a 25 00 00 00 00  |  00 cb 6a 25 00 00 00 00  00 cb 6a 25 00 00 00 00
 00017b10  00 cb 6a 25 00 00 00 00  00 cb 6a 25 00 00 00 00  |  00 cb 6a 25 00 00 00 00  00 cb 6a 25 00 00 00 00
 00017b20  00 cb 6a 25 00 00 00 00  00 e1 67 25 00 00 00 00  |  00 cb 6a 25 00 00 00 00  00 e1 67 25 00 00 00 00
 00017b30  00 e1 67 25 00 00 00 00  00 e1 67 25 00 00 00 00  |  00 e1 67 25 00 00 00 00  00 e1 67 25 00 00 00 00
 00017b40  00 e1 67 25 00 00 00 00  00 e1 67 25 00 00 00 00  |  00 e1 67 25 00 00 00 00  00 e1 67 25 00 00 00 00
 00017b50  00 e1 67 25 00 00 00 00  00 e1 67 25 00 00 00 00  |  00 e1 67 25 00 00 00 00  00 e1 67 25 00 00 00 00
 00017b60  00 e1 67 25 00 00 00 00  00 e1 67 25 00 00 00 00  |  00 e1 67 25 00 00 00 00  e1 67 25 00 00 00 00 00
 00017b70  00 e1 67 25 00 00 00 00  00 fb 69 25 00 00 00 00  |  e1 67 25 00 00 00 00 00  fb 69 25 00 00 00 00 00
 00017b80  00 fb 69 25 00 00 00 00  00 fb 69 25 00 00 00 00  |  fb 69 25 00 00 00 00 00  fb 69 25 00 00 00 00 00
 00017b90  00 fb 69 25 00 00 00 00  00 fb 69 25 00 00 00 00  |  fb 69 25 00 00 00 00 00  fb 69 25 00 00 00 00 00
 00017ba0  00 fb 69 25 00 00 00 00  00 fb 69 25 00 00 00 00  |  fb 69 25 00 00 00 00 00  fb 69 25 00 00 00 00 00
 00017bb0  00 fb 69 25 00 00 00 00  00 fb 69 25 00 00 00 00  |  fb 69 25 00 00 00 00 00  fb 69 25 00 00 00 00 00
 00017bc0  00 fb 69 25 00 00 00 00  00 fb 69 25 00 00 00 00  |  fb 69 25 00 00 00 00 00  fb 69 25 00 00 00 00 00
 00017bd0  00 fb 69 25 00 00 00 00  00 fb 69 25 00 00 00 00  |  fb 69 25 00 00 00 00 00  fb 69 25 00 00 00 00 00
 00017be0  00 fb 69 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  fb 69 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017bf0  00 5e 6a 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  5e 6a 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017c00  00 5e 6a 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  5e 6a 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017c10  00 5e 6a 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  5e 6a 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017c20  00 5e 6a 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  5e 6a 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017c30  00 5e 6a 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  5e 6a 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017c40  00 5e 6a 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  5e 6a 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017c50  00 5e 6a 25 00 00 00 00  00 5e 6a 25 00 00 00 00  |  5e 6a 25 00 00 00 00 00  5e 6a 25 00 00 00 00 00
 00017c60  00 34 68 25 00 00 00 00  00 34 68 25 00 00 00 00  |  34 68 25 00 00 00 00 00  34 68 25 00 00 00 00 00
 00017c70  00 34 68 25 00 00 00 00  00 34 68 25 00 00 00 00  |  34 68 25 00 00 00 00 00  34 68 25 00 00 00 00 00
 00017c80  00 34 68 25 00 00 00 00  00 34 68 25 00 00 00 00  |  34 68 25 00 00 00 00 00  34 68 25 00 00 00 00 00
 00017c90  00 34 68 25 00 00 00 00  00 34 68 25 00 00 00 00  |  34 68 25 00 00 00 00 00  34 68 25 00 00 00 00 00
 00017ca0  00 34 68 25 00 00 00 00  00 2e 6b 25 00 00 00 00  |  34 68 25 00 00 00 00 00  2e 6b 25 00 00 00 00 00
 00017cb0  00 2e 6b 25 00 00 00 00  00 2e 6b 25 00 00 00 00  |  2e 6b 25 00 00 00 00 00  2e 6b 25 00 00 00 00 00
 00017cc0  00 2e 6b 25 00 00 00 00  00 2e 6b 25 00 00 00 00  |  2e 6b 25 00 00 00 00 00  2e 6b 25 00 00 00 00 00
 00017cd0  00 2e 6b 25 00 00 00 00  00 2e 6b 25 00 00 00 00  |  2e 6b 25 00 00 00 00 00  2e 6b 25 00 00 00 00 00
 00017ce0  00 2e 6b 25 00 00 00 00  00 2e 6b 25 00 00 00 00  |  2e 6b 25 00 00 00 00 00  2e 6b 25 00 00 00 00 00
 00017cf0  00 2e 6b 25 00 00 00 00  00 2e 6b 25 00 00 00 00  |  2e 6b 25 00 00 00 00 00  2e 6b 25 00 00 00 00 00
 00017d00  00 2e 6b 25 00 00 00 00  00 2e 6b 25 00 00 00 00  |  2e 6b 25 00 00 00 00 00  2e 6b 25 00 00 00 00 00
 00017d10  00 2e 6b 25 00 00 00 00  00 cf 68 25 00 00 00 00  |  2e 6b 25 00 00 00 00 00  cf 68 25 00 00 00 00 00
 00017d20  00 cf 68 25 00 00 00 00  00 cf 68 25 00 00 00 00  |  cf 68 25 00 00 00 00 00  cf 68 25 00 00 00 00 00
 00017d30  00 cf 68 25 00 00 00 00  00 cf 68 25 00 00 00 00  |  cf 68 25 00 00 00 00 00  cf 68 25 00 00 00 00 00
 00017d40  00 cf 68 25 00 00 00 00  00 cf 68 25 00 00 00 00  |  cf 68 25 00 00 00 00 00  cf 68 25 00 00 00 00 00
 00017d50  00 cf 68 25 00 00 00 00  00 cf 68 25 00 00 00 00  |  cf 68 25 00 00 00 00 00  cf 68 25 00 00 00 00 00
 00017d60  00 cf 68 25 00 00 00 00  00 cf 68 25 00 00 00 00  |  cf 68 25 00 00 00 00 00  cf 68 25 00 00 00 00 00
 00017d70  00 cf 68 25 00 00 00 00  00 cf 68 25 00 00 00 00  |  cf 68 25 00 00 00 00 00  cf 68 25 00 00 00 00 00
 00017d80  00 cf 68 25 00 00 00 00  00 6b 69 25 00 00 00 00  |  cf 68 25 00 00 00 00 00  6b 69 25 00 00 00 00 00
 00017d90  00 6b 69 25 00 00 00 00  00 6b 69 25 00 00 00 00  |  6b 69 25 00 00 00 00 00  6b 69 25 00 00 00 00 00
 00017da0  00 6b 69 25 00 00 00 00  00 6b 69 25 00 00 00 00  |  6b 69 25 00 00 00 00 00  6b 69 25 00 00 00 00 00
 00017db0  00 6b 69 25 00 00 00 00  00 6b 69 25 00 00 00 00  |  6b 69 25 00 00 00 00 00  6b 69 25 00 00 00 00 00
 00017dc0  00 6b 69 25 00 00 00 00  00 7e 66 25 00 00 00 00  |  6b 69 25 00 00 00 00 00  7e 66 25 00 00 00 00 00
 00017dd0  00 7e 66 25 00 00 00 00  00 7e 66 25 00 00 00 00  |  7e 66 25 00 00 00 00 00  7e 66 25 00 00 00 00 00
 00017de0  00 7e 66 25 00 00 00 00  00 7e 66 25 00 00 00 00  |  7e 66 25 00 00 00 00 00  7e 66 25 00 00 00 00 00
 00017df0  00 7e 66 25 00 00 00 00  00 7e 66 25 00 00 00 00  |  7e 66 25 00 00 00 00 00  7e 66 25 00 00 00 00 00
 00017e00  00 7e 66 25 00 00 00 00  00 7e 66 25 00 00 00 00  |  7e 66 25 00 00 00 00 00  7e 66 25 00 00 00 00 00
 00017e10  00 7e 66 25 00 00 00 00  00 7e 66 25 00 00 00 00  |  7e 66 25 00 00 00 00 00  7e 66 25 00 00 00 00 00
 00017e20  00 7e 66 25 00 00 00 00  00 7e 66 25 00 00 00 00  |  7e 66 25 00 00 00 00 00  7e 66 25 00 00 00 00 00
 00017e30  00 a9 66 25 00 00 00 00  00 a9 66 25 00 00 00 00  |  a9 66 25 00 00 00 00 00  a9 66 25 00 00 00 00 00
 00017e40  00 a9 66 25 00 00 00 00  00 a9 66 25 00 00 00 00  |  a9 66 25 00 00 00 00 00  a9 66 25 00 00 00 00 00
 00017e50  00 a9 66 25 00 00 00 00  00 a9 66 25 00 00 00 00  |  a9 66 25 00 00 00 00 00  a9 66 25 00 00 00 00 00
 00017e60  00 a9 66 25 00 00 00 00  00 a9 66 25 00 00 00 00  |  a9 66 25 00 00 00 00 00  a9 66 25 00 00 00 00 00
 00017e70  00 a9 66 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  a9 66 25 00 00 00 00 00  fb 67 25 00 00 00 00 00
 00017e80  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  fb 67 25 00 00 00 00 00  fb 67 25 00 00 00 00 00
 00017e90  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  fb 67 25 00 00 00 00 00  fb 67 25 00 00 00 00 00
 00017ea0  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00
 00017eb0  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00
 00017ec0  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00
 00017ed0  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00
 00017ee0  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00  |  00 fb 67 25 00 00 00 00  00 fb 67 25 00 00 00 00
 00017ef0  00 fb 67 25 00 00 00 00  00 5e 6b 25 00 00 00 00  |  00 fb 67 25 00 00 00 00  00 5e 6b 25 00 00 00 00
++--23429 lines: 00017f00  00 5e 6b 25 00 00 00 00  00 5e ...|+ +--23429 lines: 00017f00  00 5e 6b 25 00 00 0...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is perhaps no surprise that 0 bytes occupied a lot of the data transfer. For performance reasons,
Presto uses a fixed-length representation for fixed-length data types, such as integers or decimals.
Compressing data for network exchanges makes sense if your network is saturated and
your CPU is not, and it is off by default. If we replace 0 bytes with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__&lt;/code&gt;, we see that the difference
between the original (left) and the changed data (right) is pretty interesting: it looks like one 0 byte was
shifted from offset &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x00017b60+5&lt;/code&gt; (approximately) to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x00017e90+12&lt;/code&gt; (approximately).
This is a very unusual data change. We got other failure samples showing similar data changes,
with varying offsets.&lt;/p&gt;
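&lt;p&gt;The effect is easy to reproduce synthetically. In this hypothetical Python sketch (offsets and values chosen purely for illustration), deleting a single zero byte and re-inserting one further on displaces every 8-byte little-endian value in between, so those values read back divided by 256:&lt;/p&gt;

```python
# Hypothetical illustration of the observed corruption pattern: drop one
# zero byte at one offset, re-insert a zero byte at a later offset.
values = [0x2567e100] * 8            # values whose low byte happens to be zero
data = b"".join(v.to_bytes(8, "little") for v in values)

drop, reinsert = 8, 40               # offsets chosen for the sketch
corrupt = data[:drop] + data[drop + 1:reinsert] + b"\x00" + data[reinsert:]

assert len(corrupt) == len(data)     # same total length, so no framing error
decoded = [int.from_bytes(corrupt[i:i + 8], "little")
           for i in range(0, len(corrupt), 8)]
# values inside the shifted window read back divided by 256
assert decoded[0] == 0x2567e100      # before the window: intact
assert decoded[1] == 0x2567e100 // 256
assert decoded[5] == 0x2567e100      # after the window: intact again
```

&lt;p&gt;Because the overall length is unchanged, checksums over the whole payload are the only thing that catches this – field-by-field validation sees plausible, merely wrong, values.&lt;/p&gt;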

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;++--6064 lines: 00000000  04 00 00 00 0a 00 00 00  4c 4f 4...|+ +--6064 lines: 00000000  04 00 00 00 0a 00 00...
 00017b00  __ cb 6a 25 __ __ __ __  __ cb 6a 25 __ __ __ __  |  __ cb 6a 25 __ __ __ __  __ cb 6a 25 __ __ __ __
 00017b10  __ cb 6a 25 __ __ __ __  __ cb 6a 25 __ __ __ __  |  __ cb 6a 25 __ __ __ __  __ cb 6a 25 __ __ __ __
 00017b20  __ cb 6a 25 __ __ __ __  __ e1 67 25 __ __ __ __  |  __ cb 6a 25 __ __ __ __  __ e1 67 25 __ __ __ __
 00017b30  __ e1 67 25 __ __ __ __  __ e1 67 25 __ __ __ __  |  __ e1 67 25 __ __ __ __  __ e1 67 25 __ __ __ __
 00017b40  __ e1 67 25 __ __ __ __  __ e1 67 25 __ __ __ __  |  __ e1 67 25 __ __ __ __  __ e1 67 25 __ __ __ __
 00017b50  __ e1 67 25 __ __ __ __  __ e1 67 25 __ __ __ __  |  __ e1 67 25 __ __ __ __  __ e1 67 25 __ __ __ __
 00017b60  __ e1 67 25 __ __ __ __  __ e1 67 25 __ __ __ __  |  __ e1 67 25 __ __ __ __  e1 67 25 __ __ __ __ __
 00017b70  __ e1 67 25 __ __ __ __  __ fb 69 25 __ __ __ __  |  e1 67 25 __ __ __ __ __  fb 69 25 __ __ __ __ __
 00017b80  __ fb 69 25 __ __ __ __  __ fb 69 25 __ __ __ __  |  fb 69 25 __ __ __ __ __  fb 69 25 __ __ __ __ __
 00017b90  __ fb 69 25 __ __ __ __  __ fb 69 25 __ __ __ __  |  fb 69 25 __ __ __ __ __  fb 69 25 __ __ __ __ __
 00017ba0  __ fb 69 25 __ __ __ __  __ fb 69 25 __ __ __ __  |  fb 69 25 __ __ __ __ __  fb 69 25 __ __ __ __ __
 00017bb0  __ fb 69 25 __ __ __ __  __ fb 69 25 __ __ __ __  |  fb 69 25 __ __ __ __ __  fb 69 25 __ __ __ __ __
 00017bc0  __ fb 69 25 __ __ __ __  __ fb 69 25 __ __ __ __  |  fb 69 25 __ __ __ __ __  fb 69 25 __ __ __ __ __
 00017bd0  __ fb 69 25 __ __ __ __  __ fb 69 25 __ __ __ __  |  fb 69 25 __ __ __ __ __  fb 69 25 __ __ __ __ __
 00017be0  __ fb 69 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  fb 69 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017bf0  __ 5e 6a 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  5e 6a 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017c00  __ 5e 6a 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  5e 6a 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017c10  __ 5e 6a 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  5e 6a 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017c20  __ 5e 6a 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  5e 6a 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017c30  __ 5e 6a 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  5e 6a 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017c40  __ 5e 6a 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  5e 6a 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017c50  __ 5e 6a 25 __ __ __ __  __ 5e 6a 25 __ __ __ __  |  5e 6a 25 __ __ __ __ __  5e 6a 25 __ __ __ __ __
 00017c60  __ 34 68 25 __ __ __ __  __ 34 68 25 __ __ __ __  |  34 68 25 __ __ __ __ __  34 68 25 __ __ __ __ __
 00017c70  __ 34 68 25 __ __ __ __  __ 34 68 25 __ __ __ __  |  34 68 25 __ __ __ __ __  34 68 25 __ __ __ __ __
 00017c80  __ 34 68 25 __ __ __ __  __ 34 68 25 __ __ __ __  |  34 68 25 __ __ __ __ __  34 68 25 __ __ __ __ __
 00017c90  __ 34 68 25 __ __ __ __  __ 34 68 25 __ __ __ __  |  34 68 25 __ __ __ __ __  34 68 25 __ __ __ __ __
 00017ca0  __ 34 68 25 __ __ __ __  __ 2e 6b 25 __ __ __ __  |  34 68 25 __ __ __ __ __  2e 6b 25 __ __ __ __ __
 00017cb0  __ 2e 6b 25 __ __ __ __  __ 2e 6b 25 __ __ __ __  |  2e 6b 25 __ __ __ __ __  2e 6b 25 __ __ __ __ __
 00017cc0  __ 2e 6b 25 __ __ __ __  __ 2e 6b 25 __ __ __ __  |  2e 6b 25 __ __ __ __ __  2e 6b 25 __ __ __ __ __
 00017cd0  __ 2e 6b 25 __ __ __ __  __ 2e 6b 25 __ __ __ __  |  2e 6b 25 __ __ __ __ __  2e 6b 25 __ __ __ __ __
 00017ce0  __ 2e 6b 25 __ __ __ __  __ 2e 6b 25 __ __ __ __  |  2e 6b 25 __ __ __ __ __  2e 6b 25 __ __ __ __ __
 00017cf0  __ 2e 6b 25 __ __ __ __  __ 2e 6b 25 __ __ __ __  |  2e 6b 25 __ __ __ __ __  2e 6b 25 __ __ __ __ __
 00017d00  __ 2e 6b 25 __ __ __ __  __ 2e 6b 25 __ __ __ __  |  2e 6b 25 __ __ __ __ __  2e 6b 25 __ __ __ __ __
 00017d10  __ 2e 6b 25 __ __ __ __  __ cf 68 25 __ __ __ __  |  2e 6b 25 __ __ __ __ __  cf 68 25 __ __ __ __ __
 00017d20  __ cf 68 25 __ __ __ __  __ cf 68 25 __ __ __ __  |  cf 68 25 __ __ __ __ __  cf 68 25 __ __ __ __ __
 00017d30  __ cf 68 25 __ __ __ __  __ cf 68 25 __ __ __ __  |  cf 68 25 __ __ __ __ __  cf 68 25 __ __ __ __ __
 00017d40  __ cf 68 25 __ __ __ __  __ cf 68 25 __ __ __ __  |  cf 68 25 __ __ __ __ __  cf 68 25 __ __ __ __ __
 00017d50  __ cf 68 25 __ __ __ __  __ cf 68 25 __ __ __ __  |  cf 68 25 __ __ __ __ __  cf 68 25 __ __ __ __ __
 00017d60  __ cf 68 25 __ __ __ __  __ cf 68 25 __ __ __ __  |  cf 68 25 __ __ __ __ __  cf 68 25 __ __ __ __ __
 00017d70  __ cf 68 25 __ __ __ __  __ cf 68 25 __ __ __ __  |  cf 68 25 __ __ __ __ __  cf 68 25 __ __ __ __ __
 00017d80  __ cf 68 25 __ __ __ __  __ 6b 69 25 __ __ __ __  |  cf 68 25 __ __ __ __ __  6b 69 25 __ __ __ __ __
 00017d90  __ 6b 69 25 __ __ __ __  __ 6b 69 25 __ __ __ __  |  6b 69 25 __ __ __ __ __  6b 69 25 __ __ __ __ __
 00017da0  __ 6b 69 25 __ __ __ __  __ 6b 69 25 __ __ __ __  |  6b 69 25 __ __ __ __ __  6b 69 25 __ __ __ __ __
 00017db0  __ 6b 69 25 __ __ __ __  __ 6b 69 25 __ __ __ __  |  6b 69 25 __ __ __ __ __  6b 69 25 __ __ __ __ __
 00017dc0  __ 6b 69 25 __ __ __ __  __ 7e 66 25 __ __ __ __  |  6b 69 25 __ __ __ __ __  7e 66 25 __ __ __ __ __
 00017dd0  __ 7e 66 25 __ __ __ __  __ 7e 66 25 __ __ __ __  |  7e 66 25 __ __ __ __ __  7e 66 25 __ __ __ __ __
 00017de0  __ 7e 66 25 __ __ __ __  __ 7e 66 25 __ __ __ __  |  7e 66 25 __ __ __ __ __  7e 66 25 __ __ __ __ __
 00017df0  __ 7e 66 25 __ __ __ __  __ 7e 66 25 __ __ __ __  |  7e 66 25 __ __ __ __ __  7e 66 25 __ __ __ __ __
 00017e00  __ 7e 66 25 __ __ __ __  __ 7e 66 25 __ __ __ __  |  7e 66 25 __ __ __ __ __  7e 66 25 __ __ __ __ __
 00017e10  __ 7e 66 25 __ __ __ __  __ 7e 66 25 __ __ __ __  |  7e 66 25 __ __ __ __ __  7e 66 25 __ __ __ __ __
 00017e20  __ 7e 66 25 __ __ __ __  __ 7e 66 25 __ __ __ __  |  7e 66 25 __ __ __ __ __  7e 66 25 __ __ __ __ __
 00017e30  __ a9 66 25 __ __ __ __  __ a9 66 25 __ __ __ __  |  a9 66 25 __ __ __ __ __  a9 66 25 __ __ __ __ __
 00017e40  __ a9 66 25 __ __ __ __  __ a9 66 25 __ __ __ __  |  a9 66 25 __ __ __ __ __  a9 66 25 __ __ __ __ __
 00017e50  __ a9 66 25 __ __ __ __  __ a9 66 25 __ __ __ __  |  a9 66 25 __ __ __ __ __  a9 66 25 __ __ __ __ __
 00017e60  __ a9 66 25 __ __ __ __  __ a9 66 25 __ __ __ __  |  a9 66 25 __ __ __ __ __  a9 66 25 __ __ __ __ __
 00017e70  __ a9 66 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  a9 66 25 __ __ __ __ __  fb 67 25 __ __ __ __ __
 00017e80  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  fb 67 25 __ __ __ __ __  fb 67 25 __ __ __ __ __
 00017e90  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  fb 67 25 __ __ __ __ __  fb 67 25 __ __ __ __ __
 00017ea0  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __
 00017eb0  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __
 00017ec0  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __
 00017ed0  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __
 00017ee0  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __  |  __ fb 67 25 __ __ __ __  __ fb 67 25 __ __ __ __
 00017ef0  __ fb 67 25 __ __ __ __  __ 5e 6b 25 __ __ __ __  |  __ fb 67 25 __ __ __ __  __ 5e 6b 25 __ __ __ __
++--23429 lines: 00017f00  00 5e 6b 25 00 00 00 00  00 5e ...|+ +--23429 lines: 00017f00  00 5e 6b 25 00 00 00...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;outside-of-presto&quot;&gt;Outside of Presto&lt;/h1&gt;

&lt;p&gt;We captured a cluster of 10 nodes manifesting the problem and held on to it for further investigation.
Our testing showed that TPC-DS query 72 is significantly more likely to fail than other queries.
On the isolated cluster, a loop running TPC-DS query 72 would reproduce a failure within 2 hours.
We added information to the exception reporting the checksum failure, to identify on which
node the failure happens and which node is the sender of the data. For all the failures on the isolated
10-node cluster, the failure would always happen with one worker node (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10.83.28.124&lt;/code&gt;, the Receiver) reading data
from one particular other worker node (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10.142.0.84&lt;/code&gt;, the Sender). We stopped all other workers and attempted to
reproduce the problem outside of Presto.&lt;/p&gt;

&lt;p&gt;One of the things we tried was checking the network reliability with netcat.
On the Sender node, we ran the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dd if=/dev/urandom of=/tmp/small-data bs=$[1024*1024] count=1
ncat -l 20165 --keep-open --max-conns 100 --sh-exec &quot;cat /tmp/small-data&quot; -v
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the Receiver node, we ran the following in a loop:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ncat --recv-only 10.142.0.84 20165 &amp;gt; &quot;/tmp/received&quot;
sha1sum &quot;/tmp/received&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running this in a loop for just a few dozen seconds resulted in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/received&lt;/code&gt; differing
from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/small-data&lt;/code&gt;. Sometimes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/received&lt;/code&gt; would be “just” a prefix of the original data,
and sometimes there would be data displacements within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/received&lt;/code&gt; file. We cross-checked these
observations on a different pair of nodes and also on a different public cloud, using the same netcat version.
We observed the same behavior everywhere we checked, with a varying but high error rate, over 1%. This high
error rate was what led us to discard this evidence – either there was something wrong with the way we
used netcat, we violated netcat’s assumptions, or netcat was not the right tool for this task.&lt;/p&gt;

&lt;p&gt;We searched for other tools that we could use. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;iperf&lt;/code&gt; is a well-known tool for stressing the network.
Sadly, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;iperf&lt;/code&gt; &lt;a href=&quot;https://github.com/esnet/iperf/issues/157&quot;&gt;does not have the ability to verify exchanged data integrity yet&lt;/a&gt;.
We deployed a &lt;a href=&quot;https://github.com/findepi/netsum&quot;&gt;home-made, Java-based tool&lt;/a&gt; instead. Using this tool,
we were able to reproduce the data corruption problem between the Sender and Receiver nodes. The error rate
was very low. To reproduce the problem we had to saturate the network and use multiple concurrent TCP connections
(which is very similar to how Presto uses the network). This validated our
observation that the data corruption problem was happening outside of Presto. Interestingly, we were unable
to reproduce the problem when stressing the network with a single TCP connection.&lt;/p&gt;
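&lt;p&gt;The essence of such a tool can be sketched in a few lines of Python. This is a hypothetical stand-in for the actual Java tool, not its real code: it streams a payload over TCP with a trailing SHA-1 digest, and the receiver recomputes the digest to detect corruption in transit:&lt;/p&gt;

```python
# Minimal sketch (hypothetical, not the actual netsum tool): stream a
# payload over TCP with a trailing SHA-1 digest, verify it on receipt.
import hashlib
import socket
import threading

def send_with_checksum(sock, payload):
    # Append the digest so the receiver can verify end-to-end integrity.
    sock.sendall(payload + hashlib.sha1(payload).digest())
    sock.close()

def recv_and_verify(sock):
    chunks = []
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        chunks.append(chunk)
    data = b"".join(chunks)
    body, digest = data[:-20], data[-20:]
    # Any corruption in transit makes the recomputed digest differ.
    return hashlib.sha1(body).digest() == digest

if __name__ == "__main__":
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    payload = bytes(range(256)) * 4096  # 1 MiB test payload
    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = server.accept()
    sender = threading.Thread(target=send_with_checksum, args=(conn, payload))
    sender.start()
    print(recv_and_verify(client))  # True when the data arrived intact
    sender.join()
```

&lt;p&gt;Saturating the link requires running many such transfers concurrently over multiple connections; the point here is only the checksum-on-receive structure.&lt;/p&gt;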

&lt;h1 id=&quot;mystery-unsolved&quot;&gt;Mystery unsolved&lt;/h1&gt;

&lt;p&gt;Obviously, with such strong evidence gathered so far, we opened a support ticket with AWS.
The support team was great and did a lot of investigation on their own. Unfortunately, the problem went
away before the support team was able to get to the bottom of it. It was April already.
Perhaps, one day someone will find the smoking gun and write the rest of this story.&lt;/p&gt;

&lt;h1 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h1&gt;

&lt;p&gt;We implemented a data integrity protection measure in Presto. We used &lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso’s&lt;/a&gt;
Java implementation of the &lt;a href=&quot;https://github.com/Cyan4973/xxHash&quot;&gt;XXHash64&lt;/a&gt; algorithm. Thanks to its
speed, we could enable it by default, with negligible impact on overall query performance.
By default, data integrity violation results in query failure, but Presto can be configured to retry as well,
by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exchange.data-integrity-verification&lt;/code&gt; configuration property.&lt;/p&gt;
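&lt;p&gt;Conceptually, the verification works like the sketch below. Presto computes XXHash64 over the exchanged data; the stand-in uses CRC32 from the Python standard library only because XXHash64 is not available there, and the function names are illustrative rather than Presto’s actual code:&lt;/p&gt;

```python
# Conceptual sketch of exchange data verification. Presto uses XXHash64;
# zlib.crc32 stands in here only because it ships with the standard library.
import zlib

def checksum(pages):
    # Fold all page bytes into a single running checksum, as the sender would.
    value = 0
    for page in pages:
        value = zlib.crc32(page, value)
    return value

def verify(pages, read_checksum):
    calculated = checksum(pages)
    if calculated != read_checksum:
        raise ValueError(
            "Data corruption, read checksum: %#x, calculated checksum: %#x"
            % (read_checksum, calculated))

pages = [b"page-1", b"page-2"]
token = checksum(pages)  # sent alongside the data
verify(pages, token)     # passes for intact data
try:
    verify([b"page-1", b"page-X"], token)  # a flipped byte is detected
except ValueError as error:
    print(error)
```

&lt;p&gt;XXHash64 fills the same role as the CRC here, but at a throughput that makes always-on verification affordable.&lt;/p&gt;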

&lt;p&gt;This chapter of Presto’s history should remain closed, and we should be able to forget about all this.
However, a couple days ago, a customer running Presto on Azure Kubernetes Service (AKS) reported an exception like
the one below. On the next day, we bumped into this as well. We were doing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE TABLE AS SELECT&lt;/code&gt;
to prepare a new benchmark dataset on Azure Storage.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Query failed (#20200622_124803_00000_abcde): Checksum verification failure on 10.12.3.47
    when reading from http://10.12.3.53:8080/v1/task/20200622_124803_00000_abcde.2.6/results/5/8:
    Data corruption, read checksum: 0xe17e6eaeb665dc6e, calculated checksum: 0xb3540697373195f1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is no fun when a query fails like this. However, what a joy, and what a point of pride, that it did not silently
return incorrect query results. Rest assured, Presto will not return incorrect results, wherever you
run it.&lt;/p&gt;

&lt;h1 id=&quot;credits&quot;&gt;Credits&lt;/h1&gt;

&lt;p&gt;Special thanks go to our customers, for your understanding and the trust you have in us.
Without you, Starburst wouldn’t be as fun a place as it is!
Thanks to &lt;a href=&quot;https://github.com/lukasz-walkiewicz&quot;&gt;Łukasz Walkiewicz&lt;/a&gt; and &lt;a href=&quot;https://github.com/sopel39&quot;&gt;Karol Sobczak&lt;/a&gt;
for fantastic benchmark and experimentation automation and your help with running the experiments!
Thanks to &lt;a href=&quot;https://github.com/willmostly&quot;&gt;Will Morrison&lt;/a&gt; for finding the Sender and Receiver machines
that reproduced the problem so nicely!
Thanks to &lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt;, &lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt;
and &lt;a href=&quot;https://github.com/electrum&quot;&gt;David Phillips&lt;/a&gt; for guidance, ideas, clever tips and code pointers!
Thanks to &lt;a href=&quot;https://github.com/losipiuk&quot;&gt;Łukasz Osipiuk&lt;/a&gt; for running experiments, cross-checking
the results, and helping us keep our sanity. Shout out to the whole Starburst team – it was truly a team effort!&lt;/p&gt;

&lt;p&gt;□&lt;/p&gt;</content>

      
        <author>
          <name>Piotr Findeisen, Starburst Data</name>
        </author>
      

      <summary>It all started on a Thursday afternoon in March, when Karol Sobczak was grilling Presto with heavy rounds of benchmarks as we were ramping up to the Starburst Enterprise Presto 332-e release. Karol discovered what seemed to be a serious regression, and it turned out to be an even more serious cloud environment issue.</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto at Zuora</title>
      <link href="https://trino.io/blog/2020/06/16/presto-summit-zuora.html" rel="alternate" type="text/html" title="Presto at Zuora" />
      <published>2020-06-16T00:00:00+00:00</published>
      <updated>2020-06-16T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/06/16/presto-summit-zuora</id>
      <content type="html" xml:base="https://trino.io/blog/2020/06/16/presto-summit-zuora.html">&lt;p&gt;The Presto Summit is morphing into a series of virtual events, and we already
started with the &lt;a href=&quot;/blog/2020/05/15/state-of-presto.html&quot;&gt;State of Presto webinar&lt;/a&gt; recently. Next up is a talk about Presto with
lots of practical insights at &lt;a href=&quot;https://zuora.com/&quot;&gt;Zuora&lt;/a&gt; presented by Henning
Schmiedehausen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using Presto as Query Layer in a Distributed Microservices Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;We had a great event with lots of questions from the audience, taking us beyond
the planned time frame. Check out the recording to learn more:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/ICAPZksjP0k&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;Presto has found its place as a SQL-based query engine for big data in the new
stack, but it does not have to be limited to big data and large scale analytics
applications.&lt;/p&gt;

&lt;p&gt;In this presentation, Henning highlights how Presto helped Zuora to transform
its monolithic data architecture for an online transactional system into a
loosely coupled, services-based architecture. In doing so, it helped to solve the
most pressing problem when splitting up data: providing direct access to
production data across many services and enabling complex data queries across
live data. Zuora Data Query was an instant success when it was launched.&lt;/p&gt;

&lt;p&gt;In this webinar you will discover:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The technical architecture that embedded Presto in the Zuora service stack&lt;/li&gt;
  &lt;li&gt;The pieces of Presto that could be used directly off the shelf&lt;/li&gt;
  &lt;li&gt;How we productized it into a system that now serves huge numbers of small
queries against live data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our speaker, Henning Schmiedehausen, Chief Architect at Zuora, is a thought
leader in the open source Java community with more than 25 years of experience
contributing to successful open source projects. At Zuora he serves as the chief
architect and is responsible for the technical aspects of transforming the Zuora
system to a new, scalable, and flexible Microservices Architecture. Prior to
Zuora he worked at Facebook and Groupon as a principal engineer. Henning also
served as a board member at the Apache Software Foundation.&lt;/p&gt;

&lt;p&gt;Date: Tuesday, 30 June 2020&lt;/p&gt;

&lt;p&gt;Time: 10am PDT (San Francisco), 1pm EDT (New York), 6pm BST (London), 5pm UTC&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://bit.ly/2YfPNne&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many Presto users joining us.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>The Presto Summit is morphing into a series of virtual events, and we already started with the State of Presto webinar recently. Next up is a talk about Presto with lots of practical insights at Zuora presented by Henning Schmiedehausen: Using Presto as Query Layer in a Distributed Microservices Architecture Update: We had a great event with lots of questions from the audience, taking us beyond the planned time frame. Check out the recording to learn more:</summary>

      
      
    </entry>
  
    <entry>
      <title>Dynamic partition pruning</title>
      <link href="https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html" rel="alternate" type="text/html" title="Dynamic partition pruning" />
      <published>2020-06-14T00:00:00+00:00</published>
      <updated>2020-06-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/06/14/dynamic-partition-pruning</id>
      <content type="html" xml:base="https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html">&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Star_schema&quot;&gt;Star-schema&lt;/a&gt; is one of the most widely used data mart patterns. 
The star schema consists of fact tables (usually partitioned) and dimension tables, 
which are used to filter rows from fact tables.
Consider the following query which captures a common pattern of a fact table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;store_sales&lt;/code&gt; partitioned by the column 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ss_sold_date_sk&lt;/code&gt; joined with a filtered dimension table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date_dim&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT COUNT(*) FROM 
store_sales JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
WHERE d_following_holiday=&apos;Y&apos; AND d_year = 2000;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Without dynamic filtering, Presto will push predicates for the dimension table down to the table scan on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date_dim&lt;/code&gt;, but
it will scan all the data in the fact table, since there are no filters on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;store_sales&lt;/code&gt; in the query.
The join operator will end up throwing away most of the probe-side rows, as the join criterion is highly selective.
The current implementation of &lt;a href=&quot;https://trino.io/blog/2019/06/30/dynamic-filtering.html&quot;&gt;dynamic filtering&lt;/a&gt; improves
on this; however, it is limited to broadcast joins on tables stored in ORC or Parquet format.
Additionally, it does not take advantage of the layout of partitioned Hive tables.&lt;/p&gt;

&lt;p&gt;With dynamic partition pruning, which extends the current implementation of dynamic filtering, every worker node collects
the values eligible for the join from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date_dim.d_date_sk&lt;/code&gt; column and passes them to the coordinator.
The coordinator can then skip processing the partitions of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;store_sales&lt;/code&gt; which don’t meet the join criteria.
This greatly reduces the amount of data scanned from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;store_sales&lt;/code&gt; table by worker nodes.
This optimization is applicable to any storage format and to both broadcast and partitioned joins.&lt;/p&gt;
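&lt;p&gt;The mechanism can be illustrated with a small sketch (illustrative Python, not Presto’s actual code): collect the eligible join keys from the filtered dimension rows, then keep only the fact-table partitions whose partition key appears in that set:&lt;/p&gt;

```python
# Illustrative sketch of dynamic partition pruning: prune fact-table
# partitions using join keys collected from the filtered dimension table.
def collect_build_side_keys(dimension_rows, predicate, key):
    # Each worker collects eligible join-key values from its build-side rows.
    return {row[key] for row in dimension_rows if predicate(row)}

def prune_partitions(partitions, eligible_keys):
    # The coordinator skips partitions whose key cannot match the join.
    return [p for p in partitions if p in eligible_keys]

# Hypothetical sample data mirroring the query above:
date_dim = [
    {"d_date_sk": 2451545, "d_year": 2000, "d_following_holiday": "Y"},
    {"d_date_sk": 2451546, "d_year": 2000, "d_following_holiday": "N"},
    {"d_date_sk": 2451911, "d_year": 2001, "d_following_holiday": "Y"},
]
keys = collect_build_side_keys(
    date_dim,
    lambda r: r["d_following_holiday"] == "Y" and r["d_year"] == 2000,
    "d_date_sk")
# store_sales partitions are identified by ss_sold_date_sk values:
print(prune_partitions([2451545, 2451546, 2451911], keys))  # [2451545]
```

&lt;p&gt;Only one of the three partitions survives pruning, so worker nodes never read the other two at all.&lt;/p&gt;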

&lt;!--more--&gt;

&lt;h1 id=&quot;design-considerations&quot;&gt;Design considerations&lt;/h1&gt;

&lt;p&gt;This optimization requires dynamic filters collected by worker nodes to be communicated to the coordinator over the network.
We needed to ensure that this additional communication overhead does not overload the coordinator.
This was achieved by packing dynamic filters into Presto’s existing framework for sending status updates from worker to coordinator.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/server/DynamicFilterService.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DynamicFilterService&lt;/code&gt;&lt;/a&gt; 
was added on the coordinator node to perform dynamic filter collection asynchronously.
Queries registered with this service can request dynamic filters while scheduling splits without blocking any operations.
This service is also responsible for ensuring that all the build-side tasks of a join stage have completed execution before 
constructing dynamic filters to be used in the scheduling of probe-side table scans by the coordinator.&lt;/p&gt;
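&lt;p&gt;The non-blocking contract can be sketched as follows (illustrative Python only; the name mirrors, but is not, the actual Java class): the filter is handed out only after every build-side task has reported, and callers that ask earlier simply proceed without it:&lt;/p&gt;

```python
# Hypothetical sketch of asynchronous dynamic filter collection: split
# scheduling asks for the filter without blocking, and gets it only once
# every build-side task has reported its values.
class DynamicFilterService:
    def __init__(self, build_side_tasks):
        self.pending = set(build_side_tasks)
        self.collected = set()

    def report(self, task, values):
        # Called as each build-side task completes on a worker.
        self.pending.discard(task)
        self.collected.update(values)

    def current_filter(self):
        # Non-blocking: callers fall back to scanning everything until
        # the filter is complete.
        if self.pending:
            return None
        return self.collected

service = DynamicFilterService(["task-0", "task-1"])
print(service.current_filter())  # None: build side still running
service.report("task-0", {1, 2})
service.report("task-1", {2, 3})
print(service.current_filter())  # {1, 2, 3}
```

&lt;p&gt;Returning an incomplete filter would be incorrect, since it could prune partitions that a still-running build-side task would have made eligible.&lt;/p&gt;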

&lt;h1 id=&quot;implementation&quot;&gt;Implementation&lt;/h1&gt;

&lt;p&gt;To identify opportunities for dynamic filtering in the logical plan, we rely on the implementation added in
&lt;a href=&quot;https://github.com/trinodb/trino/pull/91&quot;&gt;#91&lt;/a&gt;. Dynamic filters are modeled as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FunctionCall&lt;/code&gt; expressions which 
evaluate to a boolean value. They are created in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PredicatePushDown&lt;/code&gt; optimizer rule from the equi-join clauses of inner join 
nodes and pushed down in the plan along with other predicates. Dynamic filters are added to the plan after the cost-based 
optimization rules. This ensures that dynamic filters do not interfere with cost estimation and join reordering.
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PredicatePushDown&lt;/code&gt; rule can end up pushing dynamic filters to unsupported places in the plan via predicate inference.
This was solved by adding the 
&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/sql/planner/iterative/rule/RemoveUnsupportedDynamicFilters.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RemoveUnsupportedDynamicFilters&lt;/code&gt;&lt;/a&gt;
optimizer rule which is responsible for ensuring that:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Dynamic filters are present only directly above a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TableScan&lt;/code&gt; node and only if the subtree is on the probe side of some downstream &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JoinNode&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Dynamic filters are removed from a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JoinNode&lt;/code&gt; if there is no consumer for them in its probe-side subtree.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also run &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/sql/planner/sanity/DynamicFiltersChecker.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DynamicFiltersChecker&lt;/code&gt;&lt;/a&gt;
at the end of the planning phase to ensure that the above conditions have been satisfied by the optimized plan.&lt;/p&gt;

&lt;p&gt;We reuse the existing &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/operator/DynamicFilterSourceOperator.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DynamicFilterSourceOperator&lt;/code&gt;&lt;/a&gt;
in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LocalExecutionPlanner&lt;/code&gt; to collect build-side values from each inner join on each worker node. In addition to passing the collected &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TupleDomain&lt;/code&gt;
to &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/sql/planner/LocalDynamicFiltersCollector.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LocalDynamicFiltersCollector&lt;/code&gt;&lt;/a&gt; 
within the same worker node for use in broadcast join probe-side scans, we also pass them to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TaskContext&lt;/code&gt; to populate task 
status updates for the coordinator.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ContinuousTaskStatusFetcher&lt;/code&gt; on the coordinator node pulls task status updates from all worker nodes every
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;task.status-refresh-max-wait&lt;/code&gt; seconds (1 second by default) at the latest, or sooner if the task status changes. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DynamicFilterService&lt;/code&gt; 
on the coordinator regularly polls for dynamic filters from task status updates through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SqlQueryExecution&lt;/code&gt; and provides
an interface to supply dynamic filters when they are ready. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConnectorSplitManager#getSplits&lt;/code&gt; API has been updated to
optionally utilize dynamic filters supplied by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DynamicFilterService&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the Hive connector, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BackgroundHiveSplitLoader&lt;/code&gt; can apply dynamic filtering by either completely skipping the listing
of files within a partition, or by avoiding the creation of splits within a loaded partition if the dynamic filters 
become available in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;InternalHiveSplitFactory#createInternalHiveSplit&lt;/code&gt; due to lazy enumeration of splits.&lt;/p&gt;
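&lt;p&gt;The lazy enumeration is what makes mid-scan pruning possible, as in this hypothetical sketch (not the actual connector code): splits are generated on demand, so a dynamic filter that becomes available part-way through still prunes all remaining partitions:&lt;/p&gt;

```python
# Hypothetical sketch of lazy split pruning: partitions are enumerated
# lazily, so a dynamic filter that arrives mid-scan still prunes the
# remaining work, including skipping file listings for whole partitions.
def generate_splits(partitions, files_in, dynamic_filter):
    for partition in partitions:
        # Skip the whole partition (no file listing) once the filter is known.
        if dynamic_filter.ready and partition not in dynamic_filter.values:
            continue
        for file_name in files_in(partition):
            yield (partition, file_name)

class DynamicFilter:
    def __init__(self):
        self.ready = False
        self.values = set()

f = DynamicFilter()
splits = generate_splits([1, 2, 3], lambda p: ["a", "b"], f)
first = next(splits)  # filter not ready yet: partition 1 is scanned
f.ready = True
f.values = {3}        # filter arrives: only partition 3 can match
rest = list(splits)   # partition 2 is skipped entirely
print(first, rest)
```

&lt;p&gt;Splits produced before the filter arrived are still processed, which is safe: pruning only skips work, and never changes results.&lt;/p&gt;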

&lt;h1 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;/h1&gt;

&lt;p&gt;We ran TPC-DS queries on a cluster of 5 r4.8xlarge worker nodes, using data stored in ORC format.
The TPC-DS tables were partitioned as follows:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;catalog_returns&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cr_returned_date_sk&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;catalog_sales&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cs_sold_date_sk&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;store_returns&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sr_returned_date_sk&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;store_sales&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ss_sold_date_sk&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;web_returns&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wr_returned_date_sk&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;web_sales&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ws_sold_date_sk&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/hdinsight/tpcds-hdinsight/blob/master/ddl/createAllORCTables.hql&quot;&gt;createAllORCTables.hql&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following queries ran faster by more than 20% with dynamic partition pruning (measuring elapsed time in seconds,
CPU time in minutes, and data read in MB).&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Query&lt;/th&gt;
      &lt;th&gt;Baseline elapsed&lt;/th&gt;
      &lt;th&gt;Dynamic partition pruning elapsed&lt;/th&gt;
      &lt;th&gt;Baseline CPU&lt;/th&gt;
      &lt;th&gt;Dynamic partition pruning CPU&lt;/th&gt;
      &lt;th&gt;Baseline data read&lt;/th&gt;
      &lt;th&gt;Dynamic partition pruning data read&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;q01&lt;/td&gt;
      &lt;td&gt;10.96&lt;/td&gt;
      &lt;td&gt;8.50&lt;/td&gt;
      &lt;td&gt;10.2&lt;/td&gt;
      &lt;td&gt;8.9&lt;/td&gt;
      &lt;td&gt;17.91&lt;/td&gt;
      &lt;td&gt;14.53&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q04&lt;/td&gt;
      &lt;td&gt;21.63&lt;/td&gt;
      &lt;td&gt;10.80&lt;/td&gt;
      &lt;td&gt;23.6&lt;/td&gt;
      &lt;td&gt;16.1&lt;/td&gt;
      &lt;td&gt;34.81&lt;/td&gt;
      &lt;td&gt;12.99&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q05&lt;/td&gt;
      &lt;td&gt;41.38&lt;/td&gt;
      &lt;td&gt;14.94&lt;/td&gt;
      &lt;td&gt;57.1&lt;/td&gt;
      &lt;td&gt;16.8&lt;/td&gt;
      &lt;td&gt;54.81&lt;/td&gt;
      &lt;td&gt;11.45&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q07&lt;/td&gt;
      &lt;td&gt;12.35&lt;/td&gt;
      &lt;td&gt;9.26&lt;/td&gt;
      &lt;td&gt;26.4&lt;/td&gt;
      &lt;td&gt;14.6&lt;/td&gt;
      &lt;td&gt;30.28&lt;/td&gt;
      &lt;td&gt;17.31&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q08&lt;/td&gt;
      &lt;td&gt;10.48&lt;/td&gt;
      &lt;td&gt;6.43&lt;/td&gt;
      &lt;td&gt;11.0&lt;/td&gt;
      &lt;td&gt;4.7&lt;/td&gt;
      &lt;td&gt;10.19&lt;/td&gt;
      &lt;td&gt;3.52&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q11&lt;/td&gt;
      &lt;td&gt;20.04&lt;/td&gt;
      &lt;td&gt;14.82&lt;/td&gt;
      &lt;td&gt;35.6&lt;/td&gt;
      &lt;td&gt;27.8&lt;/td&gt;
      &lt;td&gt;25.37&lt;/td&gt;
      &lt;td&gt;9.72&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q17&lt;/td&gt;
      &lt;td&gt;24.05&lt;/td&gt;
      &lt;td&gt;9.87&lt;/td&gt;
      &lt;td&gt;26.4&lt;/td&gt;
      &lt;td&gt;12.0&lt;/td&gt;
      &lt;td&gt;30.18&lt;/td&gt;
      &lt;td&gt;9.75&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q18&lt;/td&gt;
      &lt;td&gt;13.98&lt;/td&gt;
      &lt;td&gt;6.00&lt;/td&gt;
      &lt;td&gt;17.5&lt;/td&gt;
      &lt;td&gt;7.7&lt;/td&gt;
      &lt;td&gt;20.29&lt;/td&gt;
      &lt;td&gt;8.81&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q25&lt;/td&gt;
      &lt;td&gt;18.91&lt;/td&gt;
      &lt;td&gt;8.04&lt;/td&gt;
      &lt;td&gt;26.9&lt;/td&gt;
      &lt;td&gt;9.1&lt;/td&gt;
      &lt;td&gt;37.54&lt;/td&gt;
      &lt;td&gt;11.12&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q27&lt;/td&gt;
      &lt;td&gt;11.98&lt;/td&gt;
      &lt;td&gt;5.58&lt;/td&gt;
      &lt;td&gt;25.1&lt;/td&gt;
      &lt;td&gt;8.6&lt;/td&gt;
      &lt;td&gt;26.69&lt;/td&gt;
      &lt;td&gt;10.12&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q29&lt;/td&gt;
      &lt;td&gt;24.11&lt;/td&gt;
      &lt;td&gt;15.46&lt;/td&gt;
      &lt;td&gt;30.5&lt;/td&gt;
      &lt;td&gt;18.5&lt;/td&gt;
      &lt;td&gt;30.18&lt;/td&gt;
      &lt;td&gt;13.50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q31&lt;/td&gt;
      &lt;td&gt;27.81&lt;/td&gt;
      &lt;td&gt;12.77&lt;/td&gt;
      &lt;td&gt;48.2&lt;/td&gt;
      &lt;td&gt;21.3&lt;/td&gt;
      &lt;td&gt;39.53&lt;/td&gt;
      &lt;td&gt;13.73&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q32&lt;/td&gt;
      &lt;td&gt;11.51&lt;/td&gt;
      &lt;td&gt;8.15&lt;/td&gt;
      &lt;td&gt;12.7&lt;/td&gt;
      &lt;td&gt;10.3&lt;/td&gt;
      &lt;td&gt;15.05&lt;/td&gt;
      &lt;td&gt;12.76&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q33&lt;/td&gt;
      &lt;td&gt;15.95&lt;/td&gt;
      &lt;td&gt;4.31&lt;/td&gt;
      &lt;td&gt;24.3&lt;/td&gt;
      &lt;td&gt;5.4&lt;/td&gt;
      &lt;td&gt;31.26&lt;/td&gt;
      &lt;td&gt;6.67&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q35&lt;/td&gt;
      &lt;td&gt;15.10&lt;/td&gt;
      &lt;td&gt;5.22&lt;/td&gt;
      &lt;td&gt;13.8&lt;/td&gt;
      &lt;td&gt;6.2&lt;/td&gt;
      &lt;td&gt;4.83&lt;/td&gt;
      &lt;td&gt;1.70&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q36&lt;/td&gt;
      &lt;td&gt;11.68&lt;/td&gt;
      &lt;td&gt;6.43&lt;/td&gt;
      &lt;td&gt;22.4&lt;/td&gt;
      &lt;td&gt;11.4&lt;/td&gt;
      &lt;td&gt;24.28&lt;/td&gt;
      &lt;td&gt;12.78&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q38&lt;/td&gt;
      &lt;td&gt;21.08&lt;/td&gt;
      &lt;td&gt;16.20&lt;/td&gt;
      &lt;td&gt;39.4&lt;/td&gt;
      &lt;td&gt;31.6&lt;/td&gt;
      &lt;td&gt;5.65&lt;/td&gt;
      &lt;td&gt;3.15&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q40&lt;/td&gt;
      &lt;td&gt;37.40&lt;/td&gt;
      &lt;td&gt;11.98&lt;/td&gt;
      &lt;td&gt;37.7&lt;/td&gt;
      &lt;td&gt;8.4&lt;/td&gt;
      &lt;td&gt;17.02&lt;/td&gt;
      &lt;td&gt;9.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q46&lt;/td&gt;
      &lt;td&gt;11.57&lt;/td&gt;
      &lt;td&gt;9.06&lt;/td&gt;
      &lt;td&gt;24.4&lt;/td&gt;
      &lt;td&gt;17.3&lt;/td&gt;
      &lt;td&gt;18.51&lt;/td&gt;
      &lt;td&gt;14.19&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q48&lt;/td&gt;
      &lt;td&gt;20.48&lt;/td&gt;
      &lt;td&gt;12.65&lt;/td&gt;
      &lt;td&gt;42.3&lt;/td&gt;
      &lt;td&gt;22.5&lt;/td&gt;
      &lt;td&gt;20.71&lt;/td&gt;
      &lt;td&gt;11.54&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q49&lt;/td&gt;
      &lt;td&gt;26.69&lt;/td&gt;
      &lt;td&gt;16.01&lt;/td&gt;
      &lt;td&gt;38.8&lt;/td&gt;
      &lt;td&gt;12.0&lt;/td&gt;
      &lt;td&gt;68.67&lt;/td&gt;
      &lt;td&gt;30.57&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q50&lt;/td&gt;
      &lt;td&gt;46.90&lt;/td&gt;
      &lt;td&gt;33.22&lt;/td&gt;
      &lt;td&gt;43.4&lt;/td&gt;
      &lt;td&gt;42.5&lt;/td&gt;
      &lt;td&gt;21.30&lt;/td&gt;
      &lt;td&gt;16.77&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q54&lt;/td&gt;
      &lt;td&gt;43.05&lt;/td&gt;
      &lt;td&gt;11.39&lt;/td&gt;
      &lt;td&gt;27.5&lt;/td&gt;
      &lt;td&gt;14.8&lt;/td&gt;
      &lt;td&gt;17.71&lt;/td&gt;
      &lt;td&gt;11.52&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q56&lt;/td&gt;
      &lt;td&gt;16.23&lt;/td&gt;
      &lt;td&gt;4.12&lt;/td&gt;
      &lt;td&gt;23.8&lt;/td&gt;
      &lt;td&gt;5.5&lt;/td&gt;
      &lt;td&gt;31.26&lt;/td&gt;
      &lt;td&gt;6.72&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q60&lt;/td&gt;
      &lt;td&gt;16.39&lt;/td&gt;
      &lt;td&gt;6.02&lt;/td&gt;
      &lt;td&gt;25.1&lt;/td&gt;
      &lt;td&gt;6.6&lt;/td&gt;
      &lt;td&gt;31.26&lt;/td&gt;
      &lt;td&gt;7.42&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q61&lt;/td&gt;
      &lt;td&gt;17.18&lt;/td&gt;
      &lt;td&gt;5.50&lt;/td&gt;
      &lt;td&gt;33.4&lt;/td&gt;
      &lt;td&gt;7.1&lt;/td&gt;
      &lt;td&gt;42.63&lt;/td&gt;
      &lt;td&gt;9.37&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q66&lt;/td&gt;
      &lt;td&gt;13.67&lt;/td&gt;
      &lt;td&gt;6.59&lt;/td&gt;
      &lt;td&gt;19.1&lt;/td&gt;
      &lt;td&gt;8.9&lt;/td&gt;
      &lt;td&gt;19.63&lt;/td&gt;
      &lt;td&gt;8.34&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q69&lt;/td&gt;
      &lt;td&gt;9.89&lt;/td&gt;
      &lt;td&gt;7.46&lt;/td&gt;
      &lt;td&gt;10.5&lt;/td&gt;
      &lt;td&gt;6.1&lt;/td&gt;
      &lt;td&gt;4.83&lt;/td&gt;
      &lt;td&gt;3.16&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q71&lt;/td&gt;
      &lt;td&gt;17.32&lt;/td&gt;
      &lt;td&gt;6.11&lt;/td&gt;
      &lt;td&gt;23.3&lt;/td&gt;
      &lt;td&gt;6.6&lt;/td&gt;
      &lt;td&gt;31.26&lt;/td&gt;
      &lt;td&gt;8.06&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q74&lt;/td&gt;
      &lt;td&gt;16.86&lt;/td&gt;
      &lt;td&gt;9.44&lt;/td&gt;
      &lt;td&gt;24.1&lt;/td&gt;
      &lt;td&gt;17.6&lt;/td&gt;
      &lt;td&gt;22.59&lt;/td&gt;
      &lt;td&gt;8.08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q75&lt;/td&gt;
      &lt;td&gt;122.04&lt;/td&gt;
      &lt;td&gt;69.45&lt;/td&gt;
      &lt;td&gt;102.7&lt;/td&gt;
      &lt;td&gt;62.9&lt;/td&gt;
      &lt;td&gt;110.86&lt;/td&gt;
      &lt;td&gt;63.91&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q77&lt;/td&gt;
      &lt;td&gt;23.94&lt;/td&gt;
      &lt;td&gt;7.51&lt;/td&gt;
      &lt;td&gt;29.3&lt;/td&gt;
      &lt;td&gt;6.8&lt;/td&gt;
      &lt;td&gt;49.95&lt;/td&gt;
      &lt;td&gt;12.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q80&lt;/td&gt;
      &lt;td&gt;43.46&lt;/td&gt;
      &lt;td&gt;18.57&lt;/td&gt;
      &lt;td&gt;45.8&lt;/td&gt;
      &lt;td&gt;11.5&lt;/td&gt;
      &lt;td&gt;37.25&lt;/td&gt;
      &lt;td&gt;11.78&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;q85&lt;/td&gt;
      &lt;td&gt;20.97&lt;/td&gt;
      &lt;td&gt;16.54&lt;/td&gt;
      &lt;td&gt;16.9&lt;/td&gt;
      &lt;td&gt;14.7&lt;/td&gt;
      &lt;td&gt;14.65&lt;/td&gt;
      &lt;td&gt;10.52&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dynamic-partition-pruning/benchmark.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;18 TPC-DS queries improved runtime by over 50% while decreasing CPU usage by an average of 64%.
Data read was decreased by 66%.&lt;/li&gt;
  &lt;li&gt;7 TPC-DS queries improved by 30% to 50% while decreasing CPU usage by an average of 47%.
Data read was decreased by 54%.&lt;/li&gt;
  &lt;li&gt;29 TPC-DS queries improved by 10% to 30% while decreasing CPU by an average of 20%.
Data read was decreased by 27%.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that the baseline here includes the improvements from the existing 
&lt;a href=&quot;https://github.com/trinodb/trino/pull/1686&quot;&gt;node local dynamic filtering&lt;/a&gt; implementation.&lt;/p&gt;

&lt;h1 id=&quot;discussion&quot;&gt;Discussion&lt;/h1&gt;

&lt;p&gt;For dynamic filtering to work, the smaller dimension table needs to be chosen as a join’s build side.
The cost-based optimizer can do this automatically using table statistics from the metastore.
Therefore, we generated table statistics prior to running this benchmark and relied on the CBO to correctly place
the smaller table on the build side of the join.&lt;/p&gt;
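
&lt;p&gt;As a sketch, the statistics can be collected with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt; statement before running the benchmark queries; the catalog and schema names here are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Collect table and column statistics so the CBO can place
-- the smaller table on the build side of the join
ANALYZE hive.tpcds.store_sales;
ANALYZE hive.tpcds.date_dim;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;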

&lt;p&gt;It is quite common for large fact tables to be partitioned by dimensions like time.
Queries joining such tables with filtered dimension tables benefit significantly from dynamic partition pruning. 
This optimization is applicable to partitioned Hive tables stored in any data format.
It also works with both broadcast and partitioned joins. Other connectors can easily take advantage of dynamic filters
by implementing the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConnectorSplitManager#getSplits&lt;/code&gt; API, which supplies dynamic filters to the connector.&lt;/p&gt;
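
&lt;p&gt;For illustration, the query shape that benefits is a partitioned fact table joined with a filtered dimension table; the tables below follow the TPC-DS schema used in the benchmark:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- store_sales is partitioned on ss_sold_date_sk; the dynamic filter
-- built from the filtered date_dim rows can prune whole partitions
SELECT COUNT(*)
FROM store_sales
JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
WHERE d_following_holiday = &apos;Y&apos;
  AND d_year = 2000;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;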

&lt;h1 id=&quot;future-work&quot;&gt;Future work&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;Support for using &lt;a href=&quot;https://github.com/trinodb/trino/pull/3871&quot;&gt;min-max range&lt;/a&gt; in DynamicFilterSourceOperator when 
the build-side contains too many values.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/3972&quot;&gt;Passing dynamic filters back to the worker nodes&lt;/a&gt; from the coordinator
to allow ORC and Parquet readers to use dynamic filters with partitioned joins.&lt;/li&gt;
  &lt;li&gt;Allow connectors to &lt;a href=&quot;https://github.com/trinodb/trino/pull/3414&quot;&gt;block probe-side scan&lt;/a&gt; until dynamic filters are ready.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/2674&quot;&gt;Support dynamic filtering with inequality operators&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/2190&quot;&gt;Support for semi-joins&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Take advantage of dynamic filters in connectors other than Hive.&lt;/li&gt;
&lt;/ul&gt;</content>

      
        <author>
          <name>Raunaq Morarka, Qubole and Karol Sobczak, Starburst Data</name>
        </author>
      

      <summary>Star-schema is one of the most widely used data mart patterns. The star schema consists of fact tables (usually partitioned) and dimension tables, which are used to filter rows from fact tables. Consider the following query which captures a common pattern of a fact table store_sales partitioned by the column ss_sold_date_sk joined with a filtered dimension table date_dim: SELECT COUNT(*) FROM store_sales JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk WHERE d_following_holiday=&apos;Y&apos; AND d_year = 2000; Without dynamic filtering, Presto will push predicates for the dimension table to the table scan on date_dim but it will scan all the data in the fact table since there are no filters on store_sales in the query. The join operator will end up throwing away most of the probe-side rows as the join criteria is highly selective. The current implementation of dynamic filtering improves on this, however it is limited only to broadcast joins on tables stored in ORC or Parquet format. Additionally, it does not take advantage of the layout of partitioned Hive tables. With dynamic partition pruning, which extends the current implementation of dynamic filtering, every worker node collects values eligible for the join from date_dim.d_date_sk column and passes it to the coordinator. Coordinator can then skip processing of the partitions of store_sales which don’t meet the join criteria. This greatly reduces the amount of data scanned from store_sales table by worker nodes. This optimization is applicable to any storage format and to both broadcast and partitioned join.</summary>

      
      
    </entry>
  
    <entry>
      <title>Hive ACID and transactional tables&apos; support in Presto</title>
      <link href="https://trino.io/blog/2020/06/01/hive-acid.html" rel="alternate" type="text/html" title="Hive ACID and transactional tables&apos; support in Presto" />
      <published>2020-06-01T00:00:00+00:00</published>
      <updated>2020-06-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/06/01/hive-acid</id>
      <content type="html" xml:base="https://trino.io/blog/2020/06/01/hive-acid.html">&lt;p&gt;Hive ACID and transactional tables have been supported in Presto since the 331
release. Hive ACID support is an important step towards GDPR/CCPA compliance,
and also towards Hive 3 support as &lt;a href=&quot;https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/hive-overview/content/hive_upgrade_changes.html&quot;&gt;certain distributions&lt;/a&gt;
of Hive 3 create transactional tables by default.&lt;/p&gt;

&lt;p&gt;In this blog post we cover the concepts of Hive ACID and transactional
tables along with the changes done in Presto to support them. We also cover the
performance tests on this integration and look at the future plans for this
feature.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;how-to-use-hive-acid-and-transactional-tables-in-presto&quot;&gt;How to use Hive ACID and transactional tables in Presto&lt;/h1&gt;

&lt;p&gt;Hive transactional tables are readable in Presto without any configuration
changes; you only need to meet these requirements:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Use Presto version 331 or higher&lt;/li&gt;
  &lt;li&gt;Use a Hive 3 Metastore server. Presto does not support Hive transactional
tables created with Hive before version 3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that Presto cannot create or write to Hive transactional tables yet. You
can create and write to Hive transactional tables via
&lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions&quot;&gt;Hive&lt;/a&gt;
or via Spark with &lt;a href=&quot;https://github.com/qubole/spark-acid&quot;&gt;Hive ACID Data Source plugin&lt;/a&gt; and
use Presto to read these tables.&lt;/p&gt;
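
&lt;p&gt;As a minimal sketch, assuming a Hive 3 cluster, the flow looks like this; the table name and catalog are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- In Hive: create and populate a CRUD transactional table
CREATE TABLE events (id INT, payload STRING)
STORED AS ORC
TBLPROPERTIES (&apos;transactional&apos;=&apos;true&apos;);

INSERT INTO events VALUES (1, &apos;created&apos;), (2, &apos;updated&apos;);

-- In Presto: read it like any other Hive table
SELECT * FROM hive.default.events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;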

&lt;h1 id=&quot;what-is-hive-acid-and-hive-transactional-tables&quot;&gt;What are Hive ACID and Hive transactional tables&lt;/h1&gt;
&lt;p&gt;Hive transactional tables are the tables in Hive that provide ACID semantics.
This excerpt from
&lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions&quot;&gt;Hive documentation&lt;/a&gt;
covers ACID traits well:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“ACID stands for four traits of database transactions:
Atomicity (an operation either succeeds completely or fails,
it does not leave partial data), Consistency (once an application performs an
operation the results of that operation are visible to it in every subsequent
operation), Isolation (an incomplete operation by one user does not cause
unexpected side effects for other users), and Durability (once an operation is
complete it will be preserved even in the face of machine or system failure).
These traits have long been expected of database systems as part of their
transaction functionality.“&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;need-for-hive-acid-and-transactional-tables&quot;&gt;Need for Hive ACID and transactional tables&lt;/h1&gt;
&lt;p&gt;In any organisation, there is always a need to update or delete existing entries
in tables, e.g., a user updates the review for an item purchased a week ago, or a
transaction’s status changes after a day.
With regulations like GDPR/CCPA, updates and deletes become even more frequent, as
users can ask the organisation to delete the data it holds on them, and
organisations are obligated to fulfill these requests.&lt;/p&gt;

&lt;p&gt;The standard practice for updating data has been to overwrite the partition or
table with the updated data, but this is inefficient and unreliable. Rewriting
all of the existing data just to update a few entries takes a lot of resources,
and more importantly there are isolation issues: the overwrite starts deleting
data that in-flight reads are still consuming. To solve these issues
several solutions have been developed; many of them are covered
&lt;a href=&quot;https://www.qubole.com/blog/qubole-open-sources-multi-engine-support-for-updates-and-deletes-in-data-lakes/&quot;&gt;in this blog post&lt;/a&gt;,
and Hive ACID is one of them.&lt;/p&gt;

&lt;h1 id=&quot;concepts-of-hive-acid-and-transactional-tables&quot;&gt;Concepts of Hive ACID and transactional tables&lt;/h1&gt;

&lt;p&gt;Hive adds several concepts, such as transactions, WriteIds, delta directories,
and locks, to achieve ACID semantics. To understand the changes made in Presto to
support Hive ACID and transactional tables, covered in the next section, it is
important to understand these concepts first. So let’s look at them in detail.&lt;/p&gt;

&lt;h2 id=&quot;types-of-hive-transactional-tables&quot;&gt;Types of Hive transactional tables&lt;/h2&gt;
&lt;p&gt;There are two types of Hive transactional tables: Insert-Only transactional
tables and CRUD transactional tables.
The following table compares the two:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Type of transactional table&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Hive DML Operations Supported&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Input Formats supported&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Synthetic columns in file?&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Additional Table Properties&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Insert-Only Transactional Tables&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;INSERT&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;All input formats&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;No&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;transactional&apos;=&apos;true&apos;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;transactional_properties&apos;=&apos;insert_only&apos;&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;CRUD Transactional Tables&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;INSERT, UPDATE, DELETE&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;ORC&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;transactional&apos;=&apos;true&apos;&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
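
&lt;p&gt;In Hive DDL the two types differ only in the storage format and the table properties they are created with; the table names below are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Insert-Only transactional table: any input format works
CREATE TABLE clicks_insert_only (id INT, url STRING)
TBLPROPERTIES (&apos;transactional&apos;=&apos;true&apos;,
               &apos;transactional_properties&apos;=&apos;insert_only&apos;);

-- CRUD transactional table: must be stored as ORC
CREATE TABLE clicks_crud (id INT, url STRING)
STORED AS ORC
TBLPROPERTIES (&apos;transactional&apos;=&apos;true&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;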

&lt;h2 id=&quot;hive-transactions&quot;&gt;Hive Transactions&lt;/h2&gt;
&lt;p&gt;Hive transactional tables should only be accessed under Hive transactions. Note that
these transactions are different from Presto transactions and are managed by
Hive. Running DML queries under separate transactions provides atomicity: each
transaction is independent, and when rolled back it has no impact on the
state of the table.&lt;/p&gt;

&lt;h2 id=&quot;writeids&quot;&gt;WriteIds&lt;/h2&gt;
&lt;p&gt;DML queries under a transaction write to a unique location under the partition/table,
described in detail later in the “New Sub-Directories” section. This location is derived
from the WriteId allocated to the transaction. This provides isolation of DML queries,
so such queries can run in parallel, whenever they can, without interfering
with each other.&lt;/p&gt;

&lt;h2 id=&quot;valid-writeids&quot;&gt;Valid WriteIds&lt;/h2&gt;
&lt;p&gt;Read queries under a transaction get a list of valid WriteIds that belong to the
transactions which were successfully committed. This ensures Consistency by
making the results of committed transactions available to all future
transactions, and also provides Isolation: DML and read queries can run in
parallel, with read queries never seeing partial data written by DML queries.&lt;/p&gt;

&lt;h2 id=&quot;new-sub-directories&quot;&gt;New Sub-Directories&lt;/h2&gt;
&lt;p&gt;Results of a DML query are written to a unique location derived from the WriteId
of the transaction. These unique locations are delta directories under the
partition/table location. Apart from the WriteId, this unique location encodes
the DML operation, and depending on the operation type there can be two
types of delta directories:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Delete Delta Directory: This delta directory is created for results of
DELETE statements and is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta_&amp;lt;writeId&amp;gt;_&amp;lt;writeId&amp;gt;&lt;/code&gt; under
partition/table location.&lt;/li&gt;
  &lt;li&gt;Delta Directory: This type is created for the results of INSERT statements
and is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta_&amp;lt;writeId&amp;gt;_&amp;lt;writeId&amp;gt;&lt;/code&gt; under partition/table location.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Apart from delta directories, there is another sub-directory, called the “base
directory”, which is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base_&amp;lt;writeId&amp;gt;&lt;/code&gt; under the partition/table
location. This type of directory is created by an INSERT OVERWRITE TABLE query or
by major compaction, which is described later.&lt;/p&gt;
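
&lt;p&gt;A hedged sketch of how these sub-directories accumulate for an unpartitioned table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt;; the WriteIds and the source table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s&lt;/code&gt; are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;INSERT INTO t VALUES (1);      -- creates delta_0000001_0000001/
INSERT INTO t VALUES (2);      -- creates delta_0000002_0000002/
DELETE FROM t WHERE id = 1;    -- creates delete_delta_0000003_0000003/
INSERT OVERWRITE TABLE t
SELECT * FROM s;               -- creates base_0000004/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;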

&lt;p&gt;The following animation shows how these new sub-directories are created in the
filesystem along with transaction management at metastore with different
queries:
&lt;img src=&quot;/assets/blog/hive-acid/directories.gif&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;rowid&quot;&gt;RowID&lt;/h2&gt;
&lt;p&gt;To uniquely identify each row in the table, a synthetic rowId is created and
added to each row. RowIds are added to CRUD transactional tables only, because
they are needed only for DELETE statements. When a DELETE is performed, the
rowIds of the rows it deletes are written into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt;
directory, and subsequent reads will read all but these rows.&lt;/p&gt;

&lt;p&gt;RowId is made up of five fields today: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operation&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;originalTransaction&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bucket&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rowId&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;currentTransaction&lt;/code&gt;, but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operation&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;currentTransaction&lt;/code&gt; fields
are redundant now.
RowId is added in the root STRUCT of ORC and hence the schema of ORC files is
different from the schema defined in the table, e.g.:&lt;/p&gt;

&lt;p&gt;Schema of CRUD transactional Hive Table:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;n_nationkey : int,
n_name : string,
n_regionkey : int,
n_comment : string
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Schema of ORC file for this table:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;struct {
    operation : int,
    originalTransaction : bigint,
    bucket : int,
    rowId : bigint,
    currentTransaction : bigint,
    row : struct {
        n_nationkey : int,
        n_name : string,
        n_regionkey : int,
        n_comment : string
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that one level of nesting of the table schema, like the inner struct above, is
applicable to flat Hive tables too. The two-level nesting of data columns is
added for ORC files of CRUD transactional tables to keep the rowId columns isolated
from the data columns.&lt;/p&gt;

&lt;h2 id=&quot;compactions&quot;&gt;Compactions&lt;/h2&gt;
&lt;p&gt;The mechanism described above, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories for each
transaction, makes DML queries execute fast but has
the following impact on read queries:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Many delta directories, each containing little data, slow down
execution of read queries. This is the well-known
small-files problem, where engines end up spending more time opening files than actually
processing the data.&lt;/li&gt;
  &lt;li&gt;Cross-referencing all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories to remove all deleted rows
slows down reads.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve these problems, Hive compacts delta directories asynchronously at two
levels:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Minor Compaction: This compaction combines active &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt; directories into one
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt; directory and active &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories into one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt;
directory thereby decreasing the number of small files. Limiting scope of this
compaction to combining only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt; directories keeps it fast. Minor compaction
is automatically triggered as soon as active delta directories count reaches
10 (configurable). This compaction creates new delta directories like
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta_&amp;lt;start_write_id&amp;gt;_&amp;lt;end_write_id&amp;gt;&lt;/code&gt; where [start_write_id, end_write_id]
gives the range of existing delta directories that were compacted. A similar naming
convention is used for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories.&lt;/li&gt;
  &lt;li&gt;Major Compaction: Minor compaction does not merge base, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories, as that requires rewriting the data with only the
non-deleted rows and is hence time-consuming. This work is handled by a separate, less
frequent and longer-running compaction called Major compaction. Major
compaction is triggered when the total size of delta directories reaches
10% (configurable) of the base directory size. This compaction creates a new
Base directory.&lt;/li&gt;
&lt;/ol&gt;
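
&lt;p&gt;Besides the automatic triggers, compactions can also be requested manually with Hive DDL, as a sketch; the table name is illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Merge active delta/delete_delta directories into one of each
ALTER TABLE t COMPACT &apos;minor&apos;;

-- Rewrite base and deltas into a new base directory
ALTER TABLE t COMPACT &apos;major&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;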

&lt;h2 id=&quot;locks&quot;&gt;Locks&lt;/h2&gt;
&lt;p&gt;Hive uses shared locks to control what operations can run in parallel on
partition/table. For example, DML queries take a write-lock on partitions they
are modifying while read queries take a read-lock on partitions they are
reading. The read-locks taken by read queries prevent Hive from cleaning up the
delta directories that have been compacted while they are still being read by a
query.&lt;/p&gt;

&lt;h1 id=&quot;changes-in-presto-to-support-hive-acid-and-transactional-ables&quot;&gt;Changes in Presto to support Hive ACID and transactional tables&lt;/h1&gt;

&lt;p&gt;At a high level, there are changes in two places in Presto to support Hive ACID
and transactional tables: in the split generation logic that runs on the coordinator,
and in the ORC reader that is used on the workers.&lt;/p&gt;

&lt;h2 id=&quot;split-generation&quot;&gt;Split generation&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Hive ACID state is set up in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SemiTransactionalHiveMetastore.beginQuery&lt;/code&gt;,
only for Hive transactional tables:
    &lt;ol&gt;
      &lt;li&gt;A new Hive transaction is opened per query&lt;/li&gt;
      &lt;li&gt;A shared read-lock is obtained from Metastore server for the partitions
 read in the query&lt;/li&gt;
      &lt;li&gt;A heartbeat mechanism is set up to periodically inform the Metastore server
 about liveness. The heartbeat frequency is obtained from the
 Metastore server but can be overridden with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.transaction-heartbeat-interval&lt;/code&gt;
 property.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BackgroundSplitLoader&lt;/code&gt; is set up with valid WriteIds for the partitions as
provided by Metastore server&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BackgroundSplitLoader.loadPartitions&lt;/code&gt; is called in an Executor to create
splits for each partition:
    &lt;ol&gt;
      &lt;li&gt;ACID sub-directories: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories are
 figured out by listing the partition location&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DeleteDeltaLocations&lt;/code&gt;, a registry of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories, is
 created. It contains minimal information through which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt;
 directory paths can be recreated at workers.&lt;/li&gt;
      &lt;li&gt;HiveSplits are created with each location of base and delta directories.
 Each HiveSplit contains the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DeleteDeltaLocations&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;If the table is an Insert-Only transactional table then
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DeleteDeltaLocations&lt;/code&gt; is empty and the HiveSplit is the same as a HiveSplit
 on a flat/non-transactional Hive table&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;reading-hive-transactional-data-in-workers&quot;&gt;Reading Hive transactional data in workers&lt;/h2&gt;

&lt;p&gt;The HiveSplits generated during the split generation phase make their way to
worker nodes, where OrcPageSourceFactory is used to create a PageSource for the
TableScan operator.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Insert-Only transactional tables are read in the same way as non-transactional
tables: an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcPageSource&lt;/code&gt; is created for their splits, which reads the
data for the split and makes it available to TableScanOperator&lt;/li&gt;
  &lt;li&gt;CRUD transactional tables need special handling during reads because their file
schema does not match the table schema: the synthetic RowId column
introduces additional struct nesting, as mentioned earlier:
    &lt;ol&gt;
      &lt;li&gt;RowId columns are added to the list of columns to be read from file&lt;/li&gt;
      &lt;li&gt;The ORC reader is set up to access columns by name from the file instead of
 using the column indexes from the table schema, equivalent to forcing
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.orc.use-column-names=true&lt;/code&gt; for CRUD transactional tables&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcRecordReader&lt;/code&gt; is created for the ORC file of the split&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcDeletedRows&lt;/code&gt; is created for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; locations, if any.&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcPageSource&lt;/code&gt; is created that returns rows from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcRecordReader&lt;/code&gt;
 which are not present in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OrcDeletedRows&lt;/code&gt;. This cross referencing of deleted
 rows is done lazily for each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Block&lt;/code&gt; of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Page&lt;/code&gt; only when that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Block&lt;/code&gt; is
 needed to be read from the PageSource. This works well with the lazy
 materialization logic of Presto to skip over Blocks if a predicate does not
 apply to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Page&lt;/code&gt; at all.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;performance-numbers&quot;&gt;Performance numbers&lt;/h1&gt;
&lt;p&gt;Each INSERT on a Hive transactional table can create additional splits for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta&lt;/code&gt;
directories, and each DELETE can create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_delta&lt;/code&gt; directories, which add
the extra work of cross-referencing deleted rows while reading the split. To
measure the impact of these operations on reads from Presto, we ran the following
performance tests, where multiple Hive transactional tables were created with a
varying number of INSERT and DELETE operations and the runtimes of different
read-focused Presto queries were recorded:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Table Type&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Description&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;delta directories&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;delete_delta directories&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Flat&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;TPCDS store_sales scale 3000 table, 8.6B rows&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Only Base&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Hive transactional store_sales scale 3000 table: 8.6B rows&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Base + 1-Delete&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Derived from “Only Base” with rows having customer_id=100 deleted by 1 DELETE query: 347 deleted entries&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Base + 1-Delete + 1-Insert&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Derived from “Base + 1 Delete” with deleted rows added back by 1 INSERT query: 347 deleted entries + 347 inserted entries&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Base + 5-Deletes&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Derived from “Only Base” with rows for 5 customer_ids deleted by 5 DELETE queries: 1355 rows deleted&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Base + 5-Deletes + 5-Inserts&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Derived from “Base + 1 Delete” with deleted rows added back by 5 INSERT queries: 1355 deleted entries + 1355 inserted entries&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;5&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
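
&lt;p&gt;The derived table variants were produced with simple Hive DML along these lines; the statements and the backup table are illustrative (the TPC-DS column is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ss_customer_sk&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- &quot;Base + 1-Delete&quot;: remove the rows of one customer
DELETE FROM store_sales WHERE ss_customer_sk = 100;

-- &quot;Base + 1-Delete + 1-Insert&quot;: add the deleted rows back
INSERT INTO store_sales
SELECT * FROM store_sales_backup WHERE ss_customer_sk = 100;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;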

&lt;p&gt;The following are the results of these tests, run on a cluster with 5 c3.4xlarge
machines on AWS:
&lt;img src=&quot;/assets/blog/hive-acid/perf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There is an impact of deleted rows on read performance, which
is expected, as the work for the reader increases in this case. But with
predicates in place, this impact is reduced, as the amount of data to be read
goes down.&lt;/p&gt;

&lt;h1 id=&quot;ongoing-and-future-work&quot;&gt;Ongoing and Future work&lt;/h1&gt;
&lt;p&gt;There has been ongoing work on the Hive ACID integration, and some improvements
are planned for the future, notably:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Bucketed Hive transactional table support has been added (&lt;a href=&quot;https://github.com/trinodb/trino/pull/1591&quot;&gt;#1591&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Support for original files is in progress (&lt;a href=&quot;https://github.com/trinodb/trino/pull/2930&quot;&gt;#2930&lt;/a&gt;);
this will allow Presto to read Hive tables that were converted to
transactional tables at some point after holding non-transactional data&lt;/li&gt;
  &lt;li&gt;Write support will be taken up in the future (&lt;a href=&quot;https://github.com/trinodb/trino/issues/1956&quot;&gt;#1956&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;There is ongoing work on the Hive side for ACID on the Parquet format. Once that
lands, Presto’s implementation will be extended to support Parquet too.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;acknowledgements-and-conclusion&quot;&gt;Acknowledgements and Conclusion&lt;/h1&gt;
&lt;p&gt;Thanks to the folks who helped out in the development of this feature:
&lt;a href=&quot;https://www.linkedin.com/in/abhishek-somani-a946aa1b&quot;&gt;Abhishek Somani&lt;/a&gt; provided
continuous guidance on internals of Hive ACID,
&lt;a href=&quot;https://www.linkedin.com/in/dainsundstrom&quot;&gt;Dain&lt;/a&gt; helped with simplifying the
ORC reader, and together with &lt;a href=&quot;https://www.linkedin.com/in/piotrfindeisen/&quot;&gt;Piotr&lt;/a&gt;
helped refine the code through multiple rounds of reviews.&lt;/p&gt;

&lt;p&gt;While we continue development on this feature towards full-fledged support,
including writes, you can start using it on Hive transactional tables that do
not contain files in flat format. If you have such tables and want to use Presto
with them, you can apply &lt;a href=&quot;https://github.com/trinodb/trino/pull/2930&quot;&gt;this fix&lt;/a&gt;
to your Presto installation, or you can trigger a major compaction on all
partitions to migrate the full table into the CRUD transactional table format.&lt;/p&gt;</content>

      
        <author>
          <name>Shubham Tagra, Qubole</name>
        </author>
      

      <summary>Hive ACID and transactional tables are supported in Presto since the 331 release. Hive ACID support is an important step towards GDPR/CCPA compliance, and also towards Hive 3 support as certain distributions of Hive 3 create transactional tables by default. In this blog post we cover the concepts of Hive ACID and transactional tables along with the changes done in Presto to support them. We also cover the performance tests on this integration and look at the future plans for this feature.</summary>

      
      
    </entry>
  
    <entry>
      <title>Apache Pinot Connector</title>
      <link href="https://trino.io/blog/2020/05/25/pinot-connector.html" rel="alternate" type="text/html" title="Apache Pinot Connector" />
      <published>2020-05-25T00:00:00+00:00</published>
      <updated>2020-05-25T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/05/25/pinot-connector</id>
      <content type="html" xml:base="https://trino.io/blog/2020/05/25/pinot-connector.html">&lt;p&gt;Presto 334 introduces the new &lt;a href=&quot;https://trino.io/docs/current/connector/pinot.html&quot;&gt;Pinot Connector&lt;/a&gt;
which allows Presto to query data stored in &lt;a href=&quot;https://pinot.apache.org/&quot;&gt;Apache Pinot™&lt;/a&gt;.
Not only does this allow access to Pinot tables but gives users the ability to do things they could not do with Pinot
alone such as join Pinot tables to other tables and use Presto’s scalar functions, window functions and complex aggregations.&lt;/p&gt;

&lt;p&gt;Pinot UDFs can be used directly by including the Pinot SQL query in quotes, as explained below in the &lt;em&gt;Pinot SQL Passthrough&lt;/em&gt; section.
This enables aggregations and other complex query types to be executed directly in Pinot.&lt;/p&gt;

&lt;p&gt;This connector supports Pinot 0.3.0 and newer.&lt;/p&gt;

&lt;h1 id=&quot;setup&quot;&gt;Setup&lt;/h1&gt;

&lt;p&gt;Create a properties file in the catalog directory, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etc/catalog/pinot.properties&lt;/code&gt; which includes at least the
following to get started:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;connector.name=pinot
pinot.controller-urls=host1:9000,host2:9000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pinot.controller-urls&lt;/code&gt; property is a comma-separated list of controller hosts. If Pinot is deployed via &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pinot.controller-urls&lt;/code&gt; needs to point to the controller Service endpoint. The Pinot broker and server must be accessible
via DNS, as Pinot returns hostnames and not IP addresses.&lt;/p&gt;

&lt;p&gt;If you have fewer Pinot servers than Presto workers, or a relatively small number of rows per Pinot segment,
you can minimize the requests to Pinot by increasing the number of Pinot segments per split (the default is 1 segment per split):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pinot.segments-per-split=15
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If DNS resolution is slow or you get &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Request timed out&lt;/code&gt; errors, you can increase the request timeout as follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pinot.request-timeout=3m
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;schema&quot;&gt;Schema&lt;/h1&gt;

&lt;p&gt;Pinot supports the following data types; null values are currently not supported. The corresponding Presto data types are:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Pinot Datatype&lt;/th&gt;
      &lt;th&gt;Presto Datatype&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;boolean&lt;/td&gt;
      &lt;td&gt;boolean&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;integer&lt;/td&gt;
      &lt;td&gt;integer&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;float, double&lt;/td&gt;
      &lt;td&gt;double&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;string, bytes*&lt;/td&gt;
      &lt;td&gt;varchar&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;integer_array&lt;/td&gt;
      &lt;td&gt;array(integer)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;float_array, double_array&lt;/td&gt;
      &lt;td&gt;array(double)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;long_array&lt;/td&gt;
      &lt;td&gt;array(bigint)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;string_array&lt;/td&gt;
      &lt;td&gt;array(varchar)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;ul&gt;
  &lt;li&gt;The Pinot &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytes&lt;/code&gt; type is converted to a hex-encoded varchar. See the &lt;a href=&quot;https://pinot.apache.org/&quot;&gt;Pinot docs&lt;/a&gt; for more information.&lt;/li&gt;
&lt;/ul&gt;
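&lt;p&gt;Since a Pinot &lt;code&gt;bytes&lt;/code&gt; column arrives as a hex-encoded varchar, it can be decoded back to raw bytes on the client side. A minimal Python sketch; the sample value is made up for illustration:&lt;/p&gt;

```python
# Decode a hex-encoded varchar as returned for a Pinot "bytes" column.
# The sample value is hypothetical, for illustration only.
hex_value = "48656c6c6f2050696e6f74"

raw = bytes.fromhex(hex_value)
# raw is now b"Hello Pinot"
```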

&lt;h1 id=&quot;pinot-sql-passthrough&quot;&gt;Pinot SQL Passthrough&lt;/h1&gt;

&lt;p&gt;If you would like to leverage Pinot’s fast aggregations you can use a “dynamic” table where you specify the Pinot SQL 
query as the table name and it is passed directly to Pinot:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pinot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;SELECT col3, col4, MAX(col1), COUNT(col2) FROM pinot_table GROUP BY col3, col4&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;FOO&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;BAR&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The filter in the outer Presto query will be pushed down into the Pinot query via Presto’s
&lt;a href=&quot;https://github.com/trinodb/trino/blob/334/presto-spi/src/main/java/io/prestosql/spi/connector/ConnectorMetadata.java#L746&quot;&gt;applyFilter()&lt;/a&gt;.
These queries are routed to the broker and
should not return huge amounts of data as broker queries currently return a single response with all the results. This
is more suited to aggregate queries.&lt;/p&gt;

&lt;p&gt;Limits are pushed into the “dynamic” Pinot query via Presto’s
&lt;a href=&quot;https://github.com/trinodb/trino/blob/334/presto-spi/src/main/java/io/prestosql/spi/connector/ConnectorMetadata.java#L727&quot;&gt;applyLimit()&lt;/a&gt;.
Pinot functions such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PERCENTILEEST&lt;/code&gt; can be used in the quoted SQL.&lt;/p&gt;

&lt;p&gt;The above query would yield the following Pinot PQL query:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;COUNT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pinot_table&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;FOO&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;BAR&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you are returning a larger dataset, you can issue a normal Presto query, which gets routed to the Pinot servers that
store the Pinot segments. Filters and limits are pushed down to Pinot for regular queries as well.&lt;/p&gt;

&lt;h1 id=&quot;future-work&quot;&gt;Future Work&lt;/h1&gt;

&lt;p&gt;As Presto and Pinot continue to evolve the Pinot connector will leverage new features such as aggregation pushdown and more.&lt;/p&gt;</content>

      
        <author>
          <name>Elon Azoulay</name>
        </author>
      

      <summary>Presto 334 introduces the new Pinot Connector which allows Presto to query data stored in Apache Pinot™. Not only does this allow access to Pinot tables but gives users the ability to do things they could not do with Pinot alone such as join Pinot tables to other tables and use Presto’s scalar functions, window functions and complex aggregations.</summary>

      
      
    </entry>
  
    <entry>
      <title>State of Presto</title>
      <link href="https://trino.io/blog/2020/05/15/state-of-presto.html" rel="alternate" type="text/html" title="State of Presto" />
      <published>2020-05-15T00:00:00+00:00</published>
      <updated>2020-05-15T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/05/15/state-of-presto</id>
      <content type="html" xml:base="https://trino.io/blog/2020/05/15/state-of-presto.html">&lt;p&gt;Presto is continuing to gain adoption across many industries and use cases. Our
community is growing rapidly and there is a lot going on, so we are taking the
Presto Summit online. And we are starting with a State of Presto webinar with
the founders of the project.&lt;/p&gt;

&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;We had a great event with lots of questions from the audience, taking us beyond
the planned time frame. Check out the recording to learn more:&lt;/p&gt;

&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/epdgIsAT3EA&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;!--more--&gt;

&lt;p&gt;Join us virtually to hear Presto co-creators 
&lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt;,
&lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt;, and 
&lt;a href=&quot;https://github.com/electrum&quot;&gt;David Phillips&lt;/a&gt; talk about the state of Presto,
followed by a live Q&amp;amp;A moderated by Presto maintainer
&lt;a href=&quot;https://github.com/findepi&quot;&gt;Piotr Findeisen&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Agenda:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;2020 project milestones&lt;/li&gt;
  &lt;li&gt;Community and technical growth&lt;/li&gt;
  &lt;li&gt;Recent Presto updates&lt;/li&gt;
  &lt;li&gt;Project roadmap&lt;/li&gt;
  &lt;li&gt;Live Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Date: Thursday, 21 May 2020&lt;/p&gt;

&lt;p&gt;Time: 11am PDT (San Francisco), 2pm EDT (New York), 7pm BST (London), 6pm UTC&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;register-now&quot;&gt;&lt;a href=&quot;https://www.starburstdata.com/webinar-state-of-presto/?utm_campaign=Webinar%20-%20State%20of%20Presto%20-%202020%20-%20May&amp;amp;utm_source=trino.io&amp;amp;utm_medium=blog&quot;&gt;Register now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;We look forward to many questions and a lively webinar.&lt;/p&gt;</content>

      
        <author>
          <name>Manfred Moser</name>
        </author>
      

      <summary>Presto is continuing to gain adoption across many industries and use cases. Our community is growing rapidly and there is a lot going on, so we are taking the Presto Summit online. And we are starting with a State of Presto webinar with the founders of the project. Update: We had a great event with lots of questions from the audience, taking us beyond the planned time frame. Check out the recording to learn more:</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto on FLOSS Weekly</title>
      <link href="https://trino.io/blog/2020/05/06/floss-weekly.html" rel="alternate" type="text/html" title="Presto on FLOSS Weekly" />
      <published>2020-05-06T00:00:00+00:00</published>
      <updated>2020-05-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/05/06/floss-weekly</id>
      <content type="html" xml:base="https://trino.io/blog/2020/05/06/floss-weekly.html">&lt;p&gt;Spreading the word about our project is an important task to grow the community
around Presto. With a large, lively community we can ensure the success of
Presto. Today we had the opportunity to talk about Presto on the long-running
open source podcast &lt;a href=&quot;https://twit.tv/shows/floss-weekly&quot;&gt;FLOSS Weekly&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;a href=&quot;http://www.stonehenge.com/merlyn/&quot;&gt;Randal Schwartz&lt;/a&gt; was joined by his co-host
&lt;a href=&quot;https://webmink.com/about/&quot;&gt;Simon Phipps&lt;/a&gt;. We introduced Presto overall and
talked about use cases of Presto and the problems it can solve. Both hosts, as
well as the live audience, had some great questions and we did our best to
answer them.&lt;/p&gt;

&lt;p&gt;We moved through the history of Presto, current users and usage, the community
around the project, and Dain talked about some of the upcoming improvements. In
the end it seemed like we just scratched the surface and all wanted to keep
talking about the project.&lt;/p&gt;

&lt;p&gt;It was a great conversation and you should check it out!&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;watch-a-recording-of-the-presto-episode-of-floss-weekly-now&quot;&gt;&lt;a href=&quot;https://twit.tv/shows/floss-weekly/episodes/577?autostart=false&quot;&gt;Watch a recording of the Presto episode of FLOSS Weekly now!&lt;/a&gt;&lt;/h2&gt;
&lt;/blockquote&gt;</content>

      
        <author>
          <name>Dain Sundstrom and Manfred Moser</name>
        </author>
      

      <summary>Spreading the word about our project is an important task to grow the community around Presto. With a large, lively community we can ensure the success of Presto. Today we had the opportunity to talk about Presto on the long running open source podcast FLOSS Weekly.</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto: The Definitive Guide</title>
      <link href="https://trino.io/blog/2020/04/11/the-definitive-guide.html" rel="alternate" type="text/html" title="Presto: The Definitive Guide" />
      <published>2020-04-11T00:00:00+00:00</published>
      <updated>2020-04-11T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/04/11/the-definitive-guide</id>
<content type="html" xml:base="https://trino.io/blog/2020/04/11/the-definitive-guide.html">&lt;p&gt;Nearly two years ago Matt and Martin got the ball rolling on a book
about Presto. A thriving project and community like the one around
Dain, David and Martin, the founders and creators of Presto, just needs a book.
Even in this digital age of online documentation, communities on chat and other
platforms, and videos everywhere, there is great value in a well-structured and
well-written book. Today, we are happy to announce our book &lt;strong&gt;Presto: The
Definitive Guide&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;h2 id=&quot;get-a-free-copy-of-trino-the-definitive-guide-from-starburst-now&quot;&gt;&lt;a href=&quot;https://www.starburst.io/info/oreilly-trino-guide/&quot;&gt;Get a free copy of Trino: The Definitive Guide&lt;/a&gt; from &lt;a href=&quot;https://www.starburst.io&quot;&gt;Starburst&lt;/a&gt; now!&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;This first book about Presto is finally available for you all to get, read, and
hopefully learn from.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Update April 2021&lt;/strong&gt;: The project has moved to the
&lt;a href=&quot;/blog/2020/12/27/announcing-trino.html&quot;&gt;new name Trino&lt;/a&gt;, and the content
of our book
&lt;a href=&quot;/blog/2021/04/21/the-definitive-guide.html&quot;&gt;has been updated&lt;/a&gt; to
&lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;Trino: The Definitive Guide&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/ttdg-cover.png&quot; align=&quot;right&quot; style=&quot;float: right; margin-left: 20px; margin-bottom: 20px; width: 100%; max-width: 350px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With the help of O’Reilly, the book is now available in digital form, and paper
copies are just around the corner as well. You can find more information about
the book on &lt;a href=&quot;/trino-the-definitive-guide.html&quot;&gt;our permanent page about
it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is based on the very recent 330 release of Presto, but applicable to any
Presto version. The book is broken up into three parts. No matter if
you are a beginner keen to learn, a user with just a bit of command line and SQL
knowledge, or an advanced or even expert Presto user, we are certain that you
can learn something from the book and encourage you to check it out.&lt;/p&gt;

&lt;p&gt;The first part of the book establishes what Presto is, and gets you quick wins
to install a minimal setup, run it, connect to it with the CLI and an
application using the JDBC driver and run some SQL queries.&lt;/p&gt;

&lt;p&gt;The second part dives into the details of the Presto architecture, query
planning, connectors for all sorts of data sources and SQL usage. There is a lot
to learn and digest in these main sections.&lt;/p&gt;

&lt;p&gt;In the third part we round things out with tuning tips, a good overview
of the Web UI, usage of other tools, security configuration and more tips to get
Presto into production.&lt;/p&gt;

&lt;p&gt;Of course, putting all this information together requires work from many people.
And in fact we did get lots of help from members of the Presto community and
O’Reilly.&lt;/p&gt;

&lt;p&gt;Specifically, we have some great news from our major supporter, Starburst!
Starburst allowed us to work on the book and bring it across the finish line.&lt;/p&gt;

&lt;p&gt;And that turns out to be great news for you all as well. Not only is the book
finished now, you can also get a
&lt;a href=&quot;https://www.starburst.io/info/oreilly-trino-guide/&quot;&gt;free digital copy of Trino: The Definitive Guide&lt;/a&gt;
from &lt;a href=&quot;https://www.starburst.io&quot;&gt;Starburst&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So what are you waiting for? Go get a copy, check out the &lt;a href=&quot;https://github.com/trinodb/trino-the-definitive-guide&quot;&gt;code repository for
the book&lt;/a&gt;, provide
feedback and contact us on &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Looking forward to it all!&lt;/p&gt;

&lt;p&gt;Matt, Manfred and Martin&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Exhausted, but happy authors&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Matt Fuller, Manfred Moser and Martin Traverso</name>
        </author>
      

      <summary>Nearly two years ago Matt and Martin got the ball rolling on getting a book about Presto happening. A thriving project and community like everyone around Dain, David and Martin, the founders and creators of Presto, just needs a book. Even in this digital age of online documentation, communities on chat and other platforms, and videos everywhere, there is great value in a well structured and written book. Today, we are happy to announce that our book Presto: The Definitive Guide. Get a free copy of Trino: The Definitive Guide from Starburst now! This first book about Presto, is finally available for you all to get, read and hopefully learn from. Update April 2021: The project has moved to the new name Trino, and the content of our book has been updated to Trino: The Definitive Guide.</summary>

      
      
    </entry>
  
    <entry>
      <title>Beyond LIMIT, Presto meets OFFSET and TIES</title>
      <link href="https://trino.io/blog/2020/02/03/beyond-limit-presto-meets-offset-and-ties.html" rel="alternate" type="text/html" title="Beyond LIMIT, Presto meets OFFSET and TIES" />
      <published>2020-02-03T00:00:00+00:00</published>
      <updated>2020-02-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/02/03/beyond-limit-presto-meets-offset-and-ties</id>
      <content type="html" xml:base="https://trino.io/blog/2020/02/03/beyond-limit-presto-meets-offset-and-ties.html">&lt;p&gt;Presto follows the SQL Standard faithfully. We extend it only when it is well justified,
we strive to never break it and we always prefer the standard way of doing things.
There was one situation where we stumbled, though. We had a non-standard way of limiting
query results with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT n&lt;/code&gt; without implementing the standard way of doing that first.
We have corrected that, adding the ANSI SQL way of limiting query results, discarding initial
results and – a hidden gem – retaining additional rows in case of ties.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;limiting-query-results&quot;&gt;Limiting query results&lt;/h1&gt;

&lt;p&gt;Probably everyone using relational databases knows the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT n&lt;/code&gt; syntax for limiting query
results. It is supported by e.g. MySQL, PostgreSQL and many more SQL engines following
their example. It is so common that one could think that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT n&lt;/code&gt; is the standard way
of limiting the query results.  Let’s have a look at how various popular SQL engines
provide this feature.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;DB2, MySQL, MariaDB, PostgreSQL, Redshift, MemSQL, SQLite and many others provide the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;... LIMIT n&lt;/code&gt; syntax.&lt;/li&gt;
  &lt;li&gt;SQL Server provides &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT TOP n ...&lt;/code&gt; syntax.&lt;/li&gt;
  &lt;li&gt;Oracle provides &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;... WHERE ROWNUM &amp;lt;= n&lt;/code&gt; syntax.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And what does the SQL Standard say?&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FETCH&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FIRST&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ROWS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ONLY&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we look again at the database systems mentioned above, it turns out many of them support the standard
syntax too: Oracle, DB2, SQL Server and PostgreSQL (although that’s not documented currently).&lt;/p&gt;

&lt;p&gt;And Presto? Presto has supported &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT n&lt;/code&gt; since 2012. In &lt;a href=&quot;https://trino.io/docs/current/release/release-310.html&quot;&gt;Presto 310&lt;/a&gt;,
we also added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST n ROWS ONLY&lt;/code&gt; support.&lt;/p&gt;

&lt;p&gt;Let’s have a look beyond the limits.&lt;/p&gt;

&lt;h1 id=&quot;tie-break&quot;&gt;Tie break&lt;/h1&gt;

&lt;p&gt;Admittedly, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST n ROWS ONLY&lt;/code&gt; syntax is way more verbose than the short &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT n&lt;/code&gt; syntax Presto
always supported (and still does). However, it is also more powerful: it allows selecting rows “top n,
ties included”. Consider a case where you want to list the top 3 students with the highest score on an exam.
What happens if the 3&lt;sup&gt;rd&lt;/sup&gt;, 4&lt;sup&gt;th&lt;/sup&gt; and 5&lt;sup&gt;th&lt;/sup&gt; students have equal scores? Which
ones should be returned? Instead of getting an arbitrary (and indeterminate) result you can use
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST n ROWS WITH TIES&lt;/code&gt; syntax:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exam_result&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;student_id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FETCH&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FIRST&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ROWS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TIES&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST n ROWS WITH TIES&lt;/code&gt; clause retains all rows with equal values of the ordering keys (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause) as
the last row that would be returned by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST n ROWS ONLY&lt;/code&gt; clause.&lt;/p&gt;
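&lt;p&gt;The semantics can also be emulated on engines without &lt;code&gt;WITH TIES&lt;/code&gt; by filtering on a &lt;code&gt;RANK()&lt;/code&gt; window function. A minimal Python sketch using SQLite; the table and scores are made up for illustration:&lt;/p&gt;

```python
import sqlite3

# Emulate FETCH FIRST 3 ROWS WITH TIES using RANK(): every row whose rank
# falls within the first 3 is kept, so ties on the boundary are included.
# The table and scores below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exam_result (student_name TEXT, score INT)")
conn.executemany(
    "INSERT INTO exam_result VALUES (?, ?)",
    [("ann", 90), ("bob", 85), ("cat", 80), ("dan", 80), ("eve", 70)],
)
rows = conn.execute(
    """
    SELECT student_name, score FROM (
        SELECT student_name, score,
               RANK() OVER (ORDER BY score DESC) AS rnk
        FROM exam_result
    ) WHERE rnk BETWEEN 1 AND 3
    ORDER BY score DESC
    """
).fetchall()
# "cat" and "dan" tie at 80, so 4 rows come back instead of 3
```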

&lt;h1 id=&quot;offset&quot;&gt;Offset&lt;/h1&gt;

&lt;p&gt;Per the SQL Standard, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST n ROWS ONLY&lt;/code&gt; clause can be prepended with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET m&lt;/code&gt;, to skip &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m&lt;/code&gt; initial rows.
In such a case, it makes sense to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH NEXT ...&lt;/code&gt; variant of the clause – it’s allowed with and without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt;,
but definitely looks better with that clause.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exam_result&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;student_id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;OFFSET&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FETCH&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NEXT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ROWS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TIES&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As an extension to the SQL Standard, and for brevity, we also allow &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exam_result&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;student_id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;OFFSET&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;concluding-notes&quot;&gt;Concluding notes&lt;/h1&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt; / &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST ... ROWS ONLY&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST ... WITH TIES&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt; are powerful and very useful clauses
that come in especially handy when writing ad-hoc queries over big data sets. They offer a certain syntactic freedom beyond
what is described here, so check out the documentation of the &lt;a href=&quot;/docs/current/sql/select.html#offset-clause&quot;&gt;OFFSET Clause&lt;/a&gt; and
&lt;a href=&quot;/docs/current/sql/select.html#limit-or-fetch-first-clauses&quot;&gt;LIMIT or FETCH FIRST Clauses&lt;/a&gt; for all the options.
Since the semantics of these clauses depend on query results being well ordered, they are best used with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause that
defines a proper ordering. Without one, the results are arbitrary (except for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITH TIES&lt;/code&gt;, which requires &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt;), which may or may
not be a problem, depending on the use case.&lt;/p&gt;

&lt;p&gt;For scheduled queries, or queries that are part of some workflow (as opposed to ad-hoc), we recommend using query
predicates (where relevant) instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt;. Read more at
&lt;a href=&quot;https://use-the-index-luke.com/sql/partial-results/fetch-next-page&quot;&gt;https://use-the-index-luke.com/sql/partial-results/fetch-next-page&lt;/a&gt;.&lt;/p&gt;
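&lt;p&gt;For example, instead of paging with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt;, a scheduled job can remember where the previous page ended and filter on that value. A minimal sketch, reusing the tables from the examples above and assuming the previous page ended at a score of 82:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT student_name, score
FROM student s JOIN exam_result e ON s.id = e.student_id
WHERE score &gt; 82 -- the highest score seen on the previous page
ORDER BY score
FETCH FIRST 3 ROWS ONLY
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When scores can repeat, add a unique tie-breaking column (such as the student id) to both the predicate and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause, so that rows are neither skipped nor duplicated between pages.&lt;/p&gt;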

&lt;p&gt;□&lt;/p&gt;</content>

      
        <author>
          <name>Piotr Findeisen, Starburst Data</name>
        </author>
      

      <summary>Presto follows the SQL Standard faithfully. We extend it only when it is well justified, we strive to never break it, and we always prefer the standard way of doing things. There was one situation where we stumbled, though. We had a non-standard way of limiting query results with LIMIT n without implementing the standard way of doing that first. We have corrected that, adding the ANSI SQL way of limiting query results, discarding initial results, and – a hidden gem – retaining initial results in case of ties.</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto in 2019: Year in Review</title>
      <link href="https://trino.io/blog/2020/01/01/2019-summary.html" rel="alternate" type="text/html" title="Presto in 2019: Year in Review" />
      <published>2020-01-01T00:00:00+00:00</published>
      <updated>2020-01-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2020/01/01/2019-summary</id>
      <content type="html" xml:base="https://trino.io/blog/2020/01/01/2019-summary.html">&lt;p&gt;What a great year for the Presto community! We started the year with the launch of the 
&lt;a href=&quot;/blog/2019/01/31/presto-software-foundation-launch.html&quot;&gt;Presto Software Foundation&lt;/a&gt;, 
with the long-term goal of ensuring the project remains collaborative, open, and independent from 
any corporate interest for years to come.&lt;/p&gt;

&lt;p&gt;Since then, the community around Presto has grown and consolidated. We’ve seen contributions 
from more than 120 people across over 20 companies. Every week, 280 users and developers 
interact in the project’s &lt;a href=&quot;/slack.html&quot;&gt;Slack channel&lt;/a&gt;. We’d like to take this opportunity to thank 
everyone who contributed to the project in one way or another. Presto wouldn’t be what it is without your 
help.&lt;/p&gt;

&lt;p&gt;With the collaboration of companies such as &lt;a href=&quot;https://starburstdata.com&quot;&gt;Starburst&lt;/a&gt;, &lt;a href=&quot;https://qubole.com&quot;&gt;Qubole&lt;/a&gt;, 
&lt;a href=&quot;https://varada.io&quot;&gt;Varada&lt;/a&gt;, &lt;a href=&quot;https://twitter.com&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.treasuredata.com&quot;&gt;ARM Treasure Data&lt;/a&gt;,
&lt;a href=&quot;https://wix.com&quot;&gt;Wix&lt;/a&gt;, &lt;a href=&quot;https://www.redhat.com&quot;&gt;Red Hat&lt;/a&gt;, and the &lt;a href=&quot;https://www.meetup.com/Big-things-are-happening-here/&quot;&gt;Big Things community&lt;/a&gt;,
we ran several Presto summits across the world:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/05/03/Presto-Conference-Israel.html&quot;&gt;Tel Aviv, Israel, April 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/05/17/Presto-Summit.html&quot;&gt;San Francisco, USA, June 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/07/11/report-for-presto-conference-tokyo.html&quot;&gt;Tokyo, Japan, July 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/09/05/Presto-Summit-Bangalore.html&quot;&gt;Bangalore, India, September, 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.starburstdata.com/technical-blog/nyc-presto-summit-recap/&quot;&gt;New York, USA, December 2019&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these events were a huge success and brought thousands of Presto users, contributors and other community members together to 
share their knowledge and experiences.&lt;/p&gt;

&lt;p&gt;The project has been more active than ever. We completed 28 releases comprising more than 2,850 
commits in over 1,500 pull requests. Of course, that alone is not a good measure of progress, so 
let’s take a closer look at everything that went in. And there is a lot to look at!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;language-features&quot;&gt;Language Features&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/docs/current/sql/select.html#limit-or-fetch-first-clauses&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST n ROWS [ONLY | WITH TIES]&lt;/code&gt;&lt;/a&gt; 
standard syntax. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITH TIES&lt;/code&gt; clause is particularly useful when some of the rows have the same 
value for the columns used to order the results of a query. Consider a case where you want to 
list the top 5 students with the highest scores on an exam. If the 6th student has the same score as the 5th, you 
want to know that as well, instead of getting an arbitrary, non-deterministic result:&lt;/p&gt;

    &lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;student&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exam_result&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;student_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FETCH&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FIRST&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ROWS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TIES&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/sql/select.html#offset-clause&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt;&lt;/a&gt; syntax, which is especially useful in ad-hoc queries.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/sql/comment.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COMMENT ON &amp;lt;table&amp;gt;&lt;/code&gt;&lt;/a&gt; syntax to 
set or remove table comments. Comments can be shown via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DESCRIBE&lt;/code&gt;
or the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;system.metadata.table_comments&lt;/code&gt; table.&lt;/li&gt;
  &lt;li&gt;Support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LATERAL&lt;/code&gt; in the context of an outer join.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNNEST&lt;/code&gt; in the context of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEFT JOIN&lt;/code&gt;. With this feature, it is now possible 
to preserve the outer row when the array contains zero elements or is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt;. Most common usages
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNNEST&lt;/code&gt; in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN&lt;/code&gt; should actually use this form.&lt;/p&gt;

    &lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNNEST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IGNORE NULLS&lt;/code&gt; clause for window functions. This is useful when combined with 
functions such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lead&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lag&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;first_value&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_value&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nth_value&lt;/code&gt; if the dataset contains nulls.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; expansion using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.*&lt;/code&gt; operator.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/sql/create-schema.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE SCHEMA&lt;/code&gt;&lt;/a&gt; syntax and support 
in various connectors (Hive, Iceberg, MySQL, PostgreSQL, Redshift, SQL Server, Phoenix).&lt;/li&gt;
  &lt;li&gt;Support for correlated subqueries containing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt;+&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Subscript operator to access &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; type fields by index. This greatly improves usability 
and readability of queries when dealing with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW&lt;/code&gt; types containing anonymous fields.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/blog/2019-review/row-ordinal.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
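&lt;p&gt;As a small illustration of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IGNORE NULLS&lt;/code&gt; clause listed above, the following sketch returns, for each row, the most recent earlier non-null reading. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sensor_reading&lt;/code&gt; table is hypothetical:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT ts, reading,
       lag(reading) IGNORE NULLS OVER (ORDER BY ts) AS previous_reading
FROM sensor_reading
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;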

&lt;h2 id=&quot;query-engine&quot;&gt;Query Engine&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Generalize conditional, lazy loading and processing (a.k.a. Late Materialization) beyond 
Table Scan, Filter and Projection to support Join, Window, TopN and SemiJoin operators. This can dramatically 
reduce latency, CPU and I/O for highly selective queries. This is one of the most important performance 
optimizations in recent times, and we will blog about it more in the coming weeks.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/05/21/optimizing-the-casts-away.html&quot;&gt;Unwrap cast/predicate pushdown&lt;/a&gt; optimizations.&lt;/li&gt;
  &lt;li&gt;Connector pushdown during planning for operations such as limit, table sample, or projections. This allows 
connectors to optimize how data is accessed before it’s provided to the Presto engine for further processing.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/06/30/dynamic-filtering.html&quot;&gt;Dynamic filtering&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;The cost-based optimizer can now consider the &lt;a href=&quot;https://github.com/trinodb/trino/pull/247&quot;&gt;estimated query peak memory&lt;/a&gt; 
footprint. This is especially useful for optimizing bigger queries, where not all parts of the query can 
be run concurrently.&lt;/li&gt;
  &lt;li&gt;Improved handling of &lt;a href=&quot;https://github.com/trinodb/trino/pull/1431&quot;&gt;projections&lt;/a&gt;, 
&lt;a href=&quot;https://github.com/trinodb/trino/pull/864&quot;&gt;aggregations&lt;/a&gt; and &lt;a href=&quot;https://github.com/trinodb/trino/pull/1359&quot;&gt;cross joins&lt;/a&gt; 
in the cost-based optimizer.&lt;/li&gt;
  &lt;li&gt;Improved accounting and reporting of physical and network data read or transmitted during query processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/08/23/unnest-operator-performance-enhancements.html&quot;&gt;10x performance improvement for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNNEST&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;2-7x improvement in performance of &lt;a href=&quot;/blog/2019/04/23/even-faster-orc.html&quot;&gt;ORC decoders&lt;/a&gt;, resulting in a 
10% global CPU improvement for the TPC-DS benchmark.&lt;/li&gt;
  &lt;li&gt;Improvements when reading small Parquet files, files with a large number of columns, or files with small row
groups. We found this very useful, for example, when working with data exported from Snowflake.&lt;/li&gt;
  &lt;li&gt;Support for new ORC bloom filters.&lt;/li&gt;
  &lt;li&gt;Remove &lt;a href=&quot;/blog/2019/06/03/redundant-order-by.html&quot;&gt;redundant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt;&lt;/a&gt; clauses.&lt;/li&gt;
  &lt;li&gt;Improvements for &lt;a href=&quot;/blog/2019/06/03/redundant-order-by.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT-IN&lt;/code&gt;&lt;/a&gt; with subquery expressions (i.e., semijoin).&lt;/li&gt;
  &lt;li&gt;Huge performance improvements when &lt;a href=&quot;https://github.com/trinodb/trino/pull/1329&quot;&gt;reading from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;information_schema&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Reduce query latency and Hive metastore load for both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; queries.&lt;/li&gt;
  &lt;li&gt;Improve metadata handling during planning. This can result in dramatic improvements in latency, 
especially for connectors such as MySQL, PostgreSQL, Redshift, SQL Server, etc. Some queries like 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SHOW SCHEMAS&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SHOW TABLES&lt;/code&gt; that could take several minutes to complete now finish in a few seconds.&lt;/li&gt;
  &lt;li&gt;Improved stability, performance, and security when spilling is enabled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;functions&quot;&gt;Functions&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/functions/array.html#combinations&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;combinations&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/functions/conversion.html#format&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;format&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/functions/uuid.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UUID&lt;/code&gt; type&lt;/a&gt; and related functions.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/functions/array.html#all_match&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;all_match&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;/docs/current/functions/array.html#any_match&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;any_match&lt;/code&gt;&lt;/a&gt; and 
&lt;a href=&quot;/docs/current/functions/array.html#none_match&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none_match&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Support flexible aggregation with lambda expressions using
  &lt;a href=&quot;/docs/current/functions/aggregate.html#reduce_agg&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce_agg&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;New date and time functions: &lt;a href=&quot;/docs/current/functions/datetime.html#last_day_of_month&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_day_of_month&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;/docs/current/functions/datetime.html#at_timezone&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;at_timezone&lt;/code&gt;&lt;/a&gt; and 
&lt;a href=&quot;/docs/current/functions/datetime.html#with_timezone&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;with_timezone&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
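&lt;p&gt;To give a feel for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce_agg&lt;/code&gt;, the sketch below sums values per group using lambda expressions; the inline &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALUES&lt;/code&gt; data is purely illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT id, reduce_agg(value, 0, (a, b) -&gt; a + b, (a, b) -&gt; a + b) AS sum_values
FROM (VALUES (1, 3), (1, 4), (1, 5), (2, 6), (2, 7)) AS t(id, value)
GROUP BY id
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This returns 12 for id 1 and 13 for id 2.&lt;/p&gt;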

&lt;h2 id=&quot;security&quot;&gt;Security&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/sql/create-role.html&quot;&gt;Role-based access control&lt;/a&gt; and related commands.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/sql/create-view.html#security&quot;&gt;INVOKER security mode&lt;/a&gt; for views, which allows views to be run using the permissions of the 
current user.&lt;/li&gt;
  &lt;li&gt;Prevent replay attacks and result hijacking in client APIs.&lt;/li&gt;
  &lt;li&gt;JWT-based &lt;a href=&quot;/docs/current/security/internal-communication.html#internal-authentication&quot;&gt;internal communication&lt;/a&gt; authentication,
which removes the need for Kerberos or certificates and greatly simplifies secure setups.&lt;/li&gt;
  &lt;li&gt;Credential passthrough, which allows Presto to authenticate with the underlying data source with 
credentials provided by the user running a query. This is especially useful when dealing with
Google Storage in GCP or SQL databases that manage user authentication and authorization on 
their own.&lt;/li&gt;
  &lt;li&gt;Impersonation for &lt;a href=&quot;/docs/current/connector/hive.html#hive-thrift-metastore-configuration-properties&quot;&gt;Hive metastore&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Support for reading and writing encrypted files in HDFS using Hadoop KMS.&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href=&quot;https://trino.io/docs/current/admin/spill.html#spill-encryption&quot;&gt;encrypting spilled data&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;geospatial&quot;&gt;Geospatial&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;New geospatial functions: 
&lt;a href=&quot;/docs/current/functions/geospatial.html#ST_Points&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ST_Points&lt;/code&gt;&lt;/a&gt;, 
&lt;a href=&quot;/docs/current/functions/geospatial.html#ST_Length&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ST_Length&lt;/code&gt;&lt;/a&gt;, 
&lt;a href=&quot;/docs/current/functions/geospatial.html#ST_Area&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ST_Area&lt;/code&gt;&lt;/a&gt;, 
&lt;a href=&quot;/docs/current/functions/geospatial.html#line_interpolate_point&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;line_interpolate_point&lt;/code&gt;&lt;/a&gt; and 
&lt;a href=&quot;/docs/current/functions/geospatial.html#line_interpolate_points&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;line_interpolate_points&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SphericalGeography&lt;/code&gt; type and &lt;a href=&quot;/docs/current/functions/geospatial.html#to_spherical_geography&quot;&gt;related functions&lt;/a&gt; 
to support spatial features in geographic coordinates (latitude / longitude) using a spherical model of the earth.&lt;/li&gt;
  &lt;li&gt;Support for Google Maps Polyline format via &lt;a href=&quot;/docs/current/functions/geospatial.html#to_encoded_polyline&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;to_encoded_polyline&lt;/code&gt;&lt;/a&gt;
and &lt;a href=&quot;/docs/current/functions/geospatial.html#from_encoded_polyline&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;from_encoded_polyline&lt;/code&gt;&lt;/a&gt; functions.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/functions/geospatial.html#geometry_from_hadoop_shape&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;geometry_from_hadoop_shape&lt;/code&gt;&lt;/a&gt; to decode geometry objects in 
Spatial Framework for Hadoop representation.&lt;/li&gt;
&lt;/ul&gt;
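&lt;p&gt;As a quick taste of these functions, the following sketch computes the area of a unit square, which yields 1.0; the result is expressed in the units of the input coordinates:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT ST_Area(ST_GeometryFromText('POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;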

&lt;h2 id=&quot;cloud-integration&quot;&gt;Cloud Integration&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Support for Azure Data Lake Blob and ADLS Gen2 storage.&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href=&quot;/docs/current/connector/hive-gcs-tutorial.html&quot;&gt;Google Cloud Storage&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Several &lt;a href=&quot;/blog/2019/05/06/faster-s3-reads.html&quot;&gt;performance improvements&lt;/a&gt; for AWS S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;cli-and-jdbc-driver&quot;&gt;CLI and JDBC Driver&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;JSON output format and improvements to CSV output format.&lt;/li&gt;
  &lt;li&gt;Support and stability improvements for running the CLI and JDBC driver with Java 11.&lt;/li&gt;
  &lt;li&gt;Improve compatibility of JDBC driver with third-party tools.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Syntax highlighting and multi-line editing.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/blog/2019-review/presto-cli.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;new-connectors&quot;&gt;New Connectors&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/elasticsearch.html&quot;&gt;Elasticsearch&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/googlesheets.html&quot;&gt;Google Sheets&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/kinesis.html&quot;&gt;Amazon Kinesis&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/2019/06/04/phoenix-connector.html&quot;&gt;Apache Phoenix&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/docs/current/connector/memsql.html&quot;&gt;MemSQL&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Apache Iceberg (preview version still under development)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;other-improvements&quot;&gt;Other Improvements&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://hub.docker.com/r/prestosql/presto&quot;&gt;Presto Docker image&lt;/a&gt; that provides an out-of-the-box single-node 
cluster with the JMX, memory, TPC-DS, and TPC-H catalogs. It can be deployed as a full cluster by 
mounting in configuration and can be used for Kubernetes deployments.&lt;/li&gt;
  &lt;li&gt;Support for LZ4 and Zstd compression in Parquet and ORC. LZ4 is currently the recommended algorithm for fast, lightweight
compression, and Zstd is recommended otherwise.&lt;/li&gt;
  &lt;li&gt;Support for insert-only Hive transactional tables and Hive bucketing v2 as part of 
&lt;a href=&quot;/blog/2019/12/28/hive-3.html&quot;&gt;making Presto compatible with Hive 3&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Improvements in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt; statement for Hive connector.&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href=&quot;/blog/2019/05/29/improved-hive-bucketing.html&quot;&gt;multiple files per bucket&lt;/a&gt; 
for Hive tables. This allows inserting data into bucketed tables without having to rewrite entire partitions
and improves Presto compatibility with Hive and other tools.&lt;/li&gt;
  &lt;li&gt;Support for upper- and mixed-case table and column names in JDBC-based connectors.&lt;/li&gt;
  &lt;li&gt;New features and improvements in type mappings in PostgreSQL, MySQL, SQL Server and Redshift
connectors. This includes support for PostgreSQL arrays and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp with time zone&lt;/code&gt; type, and 
the ability to read columns of unsupported types.&lt;/li&gt;
  &lt;li&gt;Improvements in &lt;a href=&quot;https://github.com/trinodb/trino/pull/833&quot;&gt;Hive compatibility with Hive version 2.3&lt;/a&gt; 
and &lt;a href=&quot;https://github.com/trinodb/trino/pull/1937&quot;&gt;with Cloudera (CDH)’s Hive&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Connector provided view definitions, which allow connectors to generate the definition dynamically at query time. 
For example, the connector can provide a union of two tables filtered on a disjoint time range, with the cutoff 
time determined at resolution time.&lt;/li&gt;
  &lt;li&gt;Lots and lots of bug fixes!&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;coming-up&quot;&gt;Coming Up…&lt;/h1&gt;

&lt;p&gt;These are some of the projects that are currently in progress and are likely to land in the short term.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Support for pushing down row dereference expressions into connectors. This will help reduce 
the amount of data and CPU needed to process highly nested columnar formats such as ORC and Parquet.&lt;/li&gt;
  &lt;li&gt;Extend dynamic filtering to support distributed joins and other operators. Use dynamic filters for 
pruning partitions at runtime when querying Hive.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/2418&quot;&gt;Extended Late Materialization&lt;/a&gt; support to queries involving 
complex correlated subqueries.&lt;/li&gt;
  &lt;li&gt;Finalize &lt;a href=&quot;/blog/2019/12/28/hive-3.html&quot;&gt;Hive 3 support&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Improved &lt;a href=&quot;https://github.com/trinodb/trino/pull/2358&quot;&gt;INSERT into partitioned tables&lt;/a&gt;, which will help with 
large ETL queries.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/issues/1324&quot;&gt;Improvements and features&lt;/a&gt; in Iceberg connector.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/2028&quot;&gt;Pinot&lt;/a&gt; connector.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/1959&quot;&gt;Oracle&lt;/a&gt; connector.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/2397&quot;&gt;Influx&lt;/a&gt; connector.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/2321&quot;&gt;Prometheus&lt;/a&gt; connector.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trinodb.slack.com/archives/CQT2JH4KG/p1576038838027500&quot;&gt;Salesforce&lt;/a&gt; connector.&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href=&quot;https://github.com/trinodb/trino/pull/2106&quot;&gt;Confluent registry in Kafka connector&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Revamp of the function registry and function resolution to support dynamically-resolved 
functions and SQL-defined functions.&lt;/li&gt;
  &lt;li&gt;A new &lt;a href=&quot;https://github.com/trinodb/trino/pull/2004&quot;&gt;Parquet writer&lt;/a&gt; optimized to work efficiently 
within Presto.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;… and many, many more.&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso</name>
        </author>
      

      <summary>What a great year for the Presto community! We started the year with the launch of the Presto Software Foundation, with the long-term goal of ensuring the project remains collaborative, open, and independent from any corporate interest for years to come. Since then, the community around Presto has grown and consolidated. We’ve seen contributions from more than 120 people across over 20 companies. Every week, 280 users and developers interact in the project’s Slack channel. We’d like to take this opportunity to thank everyone who contributed to the project in one way or another. Presto wouldn’t be what it is without your help. With the collaboration of companies such as Starburst, Qubole, Varada, Twitter, ARM Treasure Data, Wix, Red Hat, and the Big Things community, we ran several Presto summits across the world: Tel Aviv, Israel, April 2019; San Francisco, USA, June 2019; Tokyo, Japan, July 2019; Bangalore, India, September 2019; New York, USA, December 2019. All these events were a huge success and brought thousands of Presto users, contributors and other community members together to share their knowledge and experiences. The project has been more active than ever. We completed 28 releases comprising more than 2,850 commits in over 1,500 pull requests. Of course, that alone is not a good measure of progress, so let’s take a closer look at everything that went in. And there is a lot to look at!</summary>

      
      
    </entry>
  
    <entry>
      <title>Hive 3 support in Presto</title>
      <link href="https://trino.io/blog/2019/12/28/hive-3.html" rel="alternate" type="text/html" title="Hive 3 support in Presto" />
      <published>2019-12-28T00:00:00+00:00</published>
      <updated>2019-12-28T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/12/28/hive-3</id>
      <content type="html" xml:base="https://trino.io/blog/2019/12/28/hive-3.html">&lt;p&gt;The Hive community is centered around a few different Hive distributions, one of them
being Hortonworks Data Platform (HDP). Even after the Cloudera-Hortonworks merger, there
is strong interest in HDP 3, featuring Hive 3. Presto is ready for the game.&lt;/p&gt;

&lt;p&gt;In this post, we summarize which Hive 3 features Presto already supports, covering
all the work that went into Presto to achieve that. We also outline the next steps
ahead.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;There are several Hive versions in active use by the Hive community: 0.x, 1.x, 2.x
and 3.x. The Hive 3 major release brings a number of interesting features, including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;support for Hadoop Erasure Coding (EC), allowing &lt;a href=&quot;https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/&quot;&gt;much better HDFS storage capacity
utilization&lt;/a&gt;
without reducing data availability,&lt;/li&gt;
  &lt;li&gt;update to ORC ACID transactional tables - they no longer need to be bucketed,&lt;/li&gt;
  &lt;li&gt;transactional tables for all file formats (full ACID for ORC, “insert-only” for other formats),&lt;/li&gt;
  &lt;li&gt;materialized views,&lt;/li&gt;
  &lt;li&gt;a new bucketing function, offering better data distribution and less data skew,&lt;/li&gt;
  &lt;li&gt;new timestamp semantics and timestamp-related changes in file formats,&lt;/li&gt;
  &lt;li&gt;and a lot more (skipping features and changes that are not interesting from the
Presto perspective).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s no surprise that many people want to try out all these features and run Hive 3,
either the Apache project’s official release or the one shipped in HDP version 3.&lt;/p&gt;

&lt;h1 id=&quot;hive-3-in-presto&quot;&gt;Hive 3 in Presto&lt;/h1&gt;

&lt;p&gt;The Presto community expressed interest in using Presto with Hive 3, both in the project’s
&lt;a href=&quot;https://github.com/trinodb/trino/issues/576&quot;&gt;issues&lt;/a&gt; and on &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You spoke, we listened. Actually – we, the community, spoke &lt;em&gt;and&lt;/em&gt; listened.&lt;/p&gt;

&lt;p&gt;Through collaboration between Starburst, Qubole and the wider Presto community, Presto has
gradually improved its compatibility with Hive 3:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Presto 319 &lt;a href=&quot;https://github.com/trinodb/trino/pull/1532&quot;&gt;fixed issues with backwards-incompatible changes in Hive metastore thrift API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Presto 320 &lt;a href=&quot;https://github.com/trinodb/trino/pull/1614&quot;&gt;added continuous integration with Hive 3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Presto 321 &lt;a href=&quot;https://github.com/trinodb/trino/pull/1697&quot;&gt;added support for Hive bucketing v2&lt;/a&gt;
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;bucketing_version&quot;=&quot;2&quot;&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;Presto 325 &lt;a href=&quot;https://github.com/trinodb/trino/pull/1958&quot;&gt;added continuous integration with HDP 3’s Hive 3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Presto 327 &lt;a href=&quot;https://github.com/trinodb/trino/pull/1034&quot;&gt;added support for reading from insert-only transactional tables&lt;/a&gt;, and &lt;a href=&quot;https://github.com/trinodb/trino/pull/2099&quot;&gt;added compatibility with timestamp
values stored in ORC by Hive 3.1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Upcoming improvements already being worked on include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/2068&quot;&gt;Read support for ORC ACID tables&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trinodb/trino/pull/1591&quot;&gt;Read support for bucketed ORC ACID tables&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;try-it-out&quot;&gt;Try it out&lt;/h1&gt;

&lt;p&gt;The &lt;a href=&quot;https://twitter.com/findepi/status/1204783485094944768&quot;&gt;amazing Presto community&lt;/a&gt; is working hard on
getting Hive 3 support fully integrated in the Presto project and a lot is already accomplished.
Chances are that all you need is already included in the latest release. If you need one of the upcoming
improvements, watch the pull requests linked above, the &lt;a href=&quot;https://github.com/trinodb/trino/issues/1218&quot;&gt;roadmap issue&lt;/a&gt;,
join &lt;a href=&quot;/slack.html&quot;&gt;Slack&lt;/a&gt; and stay tuned for upcoming release announcements. In the meantime, you
can try out the features today by running the &lt;a href=&quot;https://docs.starburstdata.com/latest/release/release-323-e.html&quot;&gt;323-e release&lt;/a&gt; of Starburst Presto.&lt;/p&gt;

</content>

      
        <author>
          <name>Piotr Findeisen, Starburst Data</name>
        </author>
      

      <summary>The Hive community is centered around a few different Hive distributions, one of them being Hortonworks Data Platform (HDP). Even after the Cloudera-Hortonworks merger there is vivid interest in HDP 3, featuring Hive 3. Presto is ready for the game. In this post, we summarize which Hive 3 features Presto already supports, covering all the work that went into Presto to achieve that. We also outline the next steps ahead.</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Experiment with Graviton Processor</title>
      <link href="https://trino.io/blog/2019/12/23/Presto-Experiment-with-Graivton-Processor.html" rel="alternate" type="text/html" title="Presto Experiment with Graviton Processor" />
      <published>2019-12-23T00:00:00+00:00</published>
      <updated>2019-12-23T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/12/23/Presto-Experiment-with-Graivton-Processor</id>
      <content type="html" xml:base="https://trino.io/blog/2019/12/23/Presto-Experiment-with-Graivton-Processor.html">&lt;p&gt;This December, AWS announced new instance types powered by the &lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2019/12/announcing-new-amazon-ec2-m6g-c6g-and-r6g-instances-powered-by-next-generation-arm-based-aws-graviton2-processors/&quot;&gt;Arm-based AWS Graviton2 Processor&lt;/a&gt;. The M6g, C6g, and R6g instances are designed to deliver up to 40% better price/performance than the current generation of instance types, so they promise significant cost savings. Presto is just a Java application, so we should be able to run our workloads on these cost-effective instances without any modification.&lt;/p&gt;

&lt;p&gt;But is that true? Initially, we did not have a clear answer to how much effort it would take to bring Presto to a different processor architecture. Not having to care about the underlying platform is generally beneficial for development, but if a different processor can improve the performance and stability of Presto, we must care about it. Anything unclear must be proven by experiment.&lt;/p&gt;

&lt;p&gt;This article reports what we need to do to run Presto on the Arm platform, and how much benefit we can potentially obtain from the Graviton processor.&lt;/p&gt;

&lt;p&gt;As the Graviton 2 based instance types are still in preview, we ran Presto on an A1 instance, which contains the first generation of the Graviton processor. It should still be a helpful anchor for understanding the potential benefit of the Graviton 2 processor.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;how-to-make-presto-compatible-with-arm&quot;&gt;How to make Presto compatible with Arm&lt;/h1&gt;

&lt;p&gt;We first need to build a Presto binary that supports the Arm platform. It turns out there is not much to do: as long as the JVM supports the Arm platform, Presto should work without any modification to the application code. However, Presto restricts the platforms it runs on, to protect its functionality, including plugins. For example, the latest Presto supports only &lt;a href=&quot;https://github.com/trinodb/trino/blob/ee05ee5221690d66598039c6e397f7c7cb4c202b/presto-main/src/main/java/io/prestosql/server/PrestoSystemRequirements.java#L69&quot;&gt;x86 and PowerPC architectures&lt;/a&gt;. This limitation prevents us from using Presto on the Arm platform.&lt;/p&gt;

&lt;p&gt;To make Presto runnable on an Arm machine, we need to modify the &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/server/PrestoSystemRequirements.java&quot;&gt;PrestoSystemRequirements&lt;/a&gt; class to allow the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64&lt;/code&gt; architecture. For experimental purposes, we can apply a patch like the following to remove the restriction altogether.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;diff --git a/presto-main/src/main/java/io/prestosql/server/PrestoSystemRequirements.java b/presto-main/src/main/java/io/prestosql/server/PrestoSystemRequirements.java
index 07b7d12c64..b6a1249681 100644
--- a/presto-main/src/main/java/io/prestosql/server/PrestoSystemRequirements.java
+++ b/presto-main/src/main/java/io/prestosql/server/PrestoSystemRequirements.java
@@ -71,9 +71,9 @@ final class PrestoSystemRequirements
 String osName = StandardSystemProperty.OS_NAME.value();
 String osArch = StandardSystemProperty.OS_ARCH.value();
 if (&quot;Linux&quot;.equals(osName)) {
- if (!&quot;amd64&quot;.equals(osArch) &amp;amp;&amp;amp; !&quot;ppc64le&quot;.equals(osArch)) {
- failRequirement(&quot;Presto requires amd64 or ppc64le on Linux (found %s)&quot;, osArch);
- }
 if (&quot;ppc64le&quot;.equals(osArch)) {
 warnRequirement(&quot;Support for the POWER architecture is experimental&quot;);
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This patch is all we need to run Presto on the Arm platform. It should work in most cases, except when using the &lt;a href=&quot;https://trino.io/docs/current/connector/hive.html&quot;&gt;Hive connector&lt;/a&gt;, which depends on native code that is not yet available for the Arm platform.&lt;/p&gt;

&lt;h1 id=&quot;prepare-docker-images&quot;&gt;Prepare Docker Images&lt;/h1&gt;

&lt;p&gt;A Docker container is a convenient way to run Presto experimentally, thanks to its availability and ease of use. But there is one thing to do to build a Docker image that supports multiple platforms.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.docker.com/buildx/working-with-buildx/&quot;&gt;Docker buildx&lt;/a&gt; is an experimental feature providing full support for the &lt;a href=&quot;https://github.com/moby/buildkit&quot;&gt;Moby BuildKit toolkit&lt;/a&gt;. It lets us build a Docker image supporting multiple platforms, including Arm, with a one-line command. However, the feature is not enabled in a typical Docker installation; on macOS, it must be enabled via the experimental flag as follows.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/presto-experiment-with-graviton-processor/docker-daemon.png&quot; alt=&quot;Docker Daemon Experimental Feature&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And make sure to restart the Docker daemon. We can then build the Docker image for Presto supporting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64&lt;/code&gt; architecture with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buildx&lt;/code&gt; command. We used the source code of &lt;a href=&quot;https://github.com/trinodb/trino/commit/b0c07249de5c70a70b3037875df4fd0477dec9fc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;317-SNAPSHOT&lt;/code&gt;&lt;/a&gt; with the earlier patch to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PrestoSystemRequirements&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ docker buildx build \
 --build-arg VERSION=317-SNAPSHOT \
 --platform linux/arm64 \
 -f presto-base/Dockerfile-aarch64 \
 -t lewuathe/presto-base:317-SNAPSHOT-aarch64 \
 presto-base --push

$ docker buildx build \
 --build-arg VERSION=317-SNAPSHOT-aarch64 \
 --platform linux/arm64 \
 -t lewuathe/presto-coordinator:317-SNAPSHOT-aarch64 \
 presto-coordinator --push

$ docker buildx build \
 --build-arg VERSION=317-SNAPSHOT-aarch64 \
 --platform linux/arm64 \
 -t lewuathe/presto-worker:317-SNAPSHOT-aarch64 \
 presto-worker --push
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We should be able to specify multiple platform names in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--platform&lt;/code&gt; option. Unfortunately, the Docker image of OpenJDK for Arm is distributed under &lt;a href=&quot;https://hub.docker.com/r/arm64v8/openjdk/&quot;&gt;a separate organization&lt;/a&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;arm64v8/openjdk&lt;/code&gt;, so building an image supporting Arm requires a separate &lt;a href=&quot;https://github.com/Lewuathe/docker-presto-cluster/blob/master/presto-base/Dockerfile-aarch64&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dockerfile&lt;/code&gt;&lt;/a&gt;. In any case, Docker images containing Presto with Arm support are now available.&lt;/p&gt;

&lt;h1 id=&quot;setup-a1-instance&quot;&gt;Setup A1 Instance&lt;/h1&gt;

&lt;p&gt;The following setup prepares the environment to run docker-compose on the A1 instance. &lt;a href=&quot;https://github.com/docker/compose/issues/5342&quot;&gt;As no docker-compose binary for Arm&lt;/a&gt; is distributed officially, we need to build and install docker-compose with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip&lt;/code&gt;. Make sure to run these commands after the instance initialization completes.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Install Docker
$ sudo yum update -y
$ sudo amazon-linux-extras install docker -y
$ sudo service docker start
$ sudo usermod -a -G docker ec2-user

# Install docker-compose
$ sudo yum install python2-pip gcc libffi-devel openssl-devel -y
$ sudo pip install -U docker-compose
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;performance-comparison&quot;&gt;Performance Comparison&lt;/h1&gt;

&lt;p&gt;Let’s briefly look at the performance the Graviton processor provides. We use &lt;a href=&quot;https://aws.amazon.com/ec2/instance-types/a1/&quot;&gt;a1.4xlarge&lt;/a&gt; as the benchmark instance for the Graviton processor.&lt;/p&gt;

&lt;p&gt;Here is our specification of the benchmark conditions.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We use the commit &lt;a href=&quot;https://github.com/trinodb/trino/commit/b0c07249de5c70a70b3037875df4fd0477dec9fc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b0c07249de5c70a70b3037875df4fd0477dec9fc&lt;/code&gt;&lt;/a&gt; + the patch previously described.&lt;/li&gt;
  &lt;li&gt;1 coordinator + 2 worker processes run by &lt;a href=&quot;https://docs.docker.com/compose/&quot;&gt;docker-compose&lt;/a&gt; on a single instance.&lt;/li&gt;
  &lt;li&gt;We compare a1.4xlarge with c5.4xlarge, which has the same number of CPU cores and the same memory as a1.4xlarge. We also compare with m5.2xlarge, whose on-demand cost is close to that of a1.4xlarge.&lt;/li&gt;
  &lt;li&gt;We use &lt;a href=&quot;https://github.com/trinodb/trino/tree/master/presto-benchto-benchmarks/src/main/resources/sql/presto/tpch&quot;&gt;q01, q10, q18, and q20&lt;/a&gt; run on the TPCH connector. Since the Presto TPCH connector does not access external storage, we can measure pure CPU performance without worrying about network variance.&lt;/li&gt;
  &lt;li&gt;We choose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tiny&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sf1&lt;/code&gt; as the scaling factors of the TPCH connector.&lt;/li&gt;
  &lt;li&gt;For every query, we measure the average runtime of 5 runs after 5 warmup runs.&lt;/li&gt;
&lt;/ul&gt;
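&lt;p&gt;The measurement methodology above can be sketched in a few lines. This is only a minimal illustration of the timing logic, not the actual harness used in this experiment; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_query&lt;/code&gt; is a hypothetical callable that executes a single query.&lt;/p&gt;

```python
import time

def benchmark(run_query, warmups=5, runs=5):
    # Discard the warmup executions, as in the methodology above.
    for _ in range(warmups):
        run_query()
    # Average the runtime of the measured executions, in milliseconds.
    timings = []
    for _ in range(runs):
        start = time.monotonic()
        run_query()
        timings.append((time.monotonic() - start) * 1000.0)
    return sum(timings) / len(timings)
```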

&lt;h4 id=&quot;openjdk-8&quot;&gt;OpenJDK 8&lt;/h4&gt;
&lt;p&gt;Here is the result of our experiment. The vertical axis represents the running time in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/presto-experiment-with-graviton-processor/openjdk8-performance.png&quot; alt=&quot;OpenJDK 8 Performance&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It shows that c5.4xlarge consistently achieves the best performance in every case. Between a1.4xlarge and m5.2xlarge, the winner varies by query type; the two instances are roughly competitive with each other.&lt;/p&gt;

&lt;p&gt;Although we use OpenJDK 8 in this case, it may not generate code fully optimized for the Arm architecture. In general, later versions such as &lt;a href=&quot;https://medium.com/@carlosedp/java-benchmarks-on-arm64-17edd8b9ff79&quot;&gt;OpenJDK 9 or 11 give us better performance&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;openjdk-11&quot;&gt;OpenJDK 11&lt;/h4&gt;
&lt;p&gt;Let’s try running Presto with OpenJDK 11. There is one thing to do first: since JDK 9, self-attachment via the &lt;a href=&quot;https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8180425&quot;&gt;Attach API&lt;/a&gt; is disabled by default. We found that we needed to allow it by adding the following option to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jvm.config&lt;/code&gt; file; otherwise, an error appears at the bootstrap phase.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-Djdk.attach.allowAttachSelf=true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here is the performance comparison with OpenJDK 11.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/presto-experiment-with-graviton-processor/openjdk11-performance.png&quot; alt=&quot;OpenJDK 11 Performance&quot; /&gt;&lt;/p&gt;

&lt;p&gt;a1.4xlarge and c5.4xlarge achieve even higher performance than with OpenJDK 8 in every case. In contrast, m5.2xlarge shows slower results in some cases.
While c5.4xlarge is still the best instance in terms of performance, the gaps between instances are smaller than in the OpenJDK 8 cases. In particular, a1.4xlarge shows relatively competitive performance with the smaller dataset (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tiny&lt;/code&gt;). How does the scaling factor influence performance? We’ll see.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/presto-experiment-with-graviton-processor/sf-comparison.png&quot; alt=&quot;Scaling Factor Comparison&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The above chart shows how performance is affected by the scaling factor. c5.4xlarge demonstrates the most stable running time, regardless of the scaling factor; if stable performance is the priority, c5.4xlarge is the best option on the list. a1.4xlarge and m5.2xlarge show similar volatility against the scaling factor this time.&lt;/p&gt;

&lt;p&gt;Considering that a1.4xlarge is 40% cheaper than c5.4xlarge, it may make sense to use a1.4xlarge in specific cases: the on-demand cost of &lt;a href=&quot;https://aws.amazon.com/ec2/pricing/on-demand/&quot;&gt;a1.4xlarge is $9.8/day, while c5.4xlarge is $16.3/day&lt;/a&gt;. The public announcement says &lt;a href=&quot;https://aws.amazon.com/ec2/graviton/&quot;&gt;Graviton 2 delivers 7x performance compared to the Graviton processor&lt;/a&gt;, so we may expect even better results from the new generation. We cannot wait for the general availability of Graviton 2.&lt;/p&gt;
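&lt;p&gt;As a quick sanity check of the price gap, using the daily on-demand figures quoted above:&lt;/p&gt;

```python
# On-demand prices quoted above, in USD per day.
a1_4xlarge = 9.8
c5_4xlarge = 16.3

# Fraction saved by choosing a1.4xlarge over c5.4xlarge.
savings = 1 - a1_4xlarge / c5_4xlarge
print(f"a1.4xlarge is about {savings:.0%} cheaper")  # roughly 40%
```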

&lt;h4 id=&quot;amazon-corretto&quot;&gt;Amazon Corretto&lt;/h4&gt;
&lt;p&gt;How about other JVM distributions? Amazon Corretto also supports the Arm architecture, and it distributes &lt;a href=&quot;https://hub.docker.com/layers/amazoncorretto/library/amazoncorretto/11/images/sha256-8f06c4a09e6a0784d6da3fb580bd57c4881df3fc8f56de1f3c0fd66dde20e43c&quot;&gt;a Docker image built for Arm&lt;/a&gt;. Let’s try Amazon Corretto in the same way.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/presto-experiment-with-graviton-processor/a1-instance-performance.png&quot; alt=&quot;A1 Performance&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This chart illustrates the performance results for three JDK implementations: OpenJDK 8, OpenJDK 11, and Amazon Corretto 11. Overall, OpenJDK 11 seems to be the best, but interestingly, Amazon Corretto achieves even better performance in some of the sf1 cases. This indicates that Presto with Amazon Corretto may provide better performance for some query types.&lt;/p&gt;

&lt;h1 id=&quot;wrap-up&quot;&gt;Wrap Up&lt;/h1&gt;

&lt;p&gt;As Presto is just a Java application, there is not much to do to support the Arm platform. Applying one patch and one JVM option gives us a Presto binary supporting the latest platform. It is always exciting to see a new technology used in a complicated distributed system such as Presto. The combination of cutting-edge technologies surely takes us on a journey to a new horizon of technological innovation.&lt;/p&gt;

&lt;p&gt;Last but not least, we used docker-compose and the TPCH connector to quickly execute queries against a Presto cluster on the Arm platform. Note that the performance of a distributed system such as Presto depends on many kinds of factors, so be sure to run your own benchmark carefully before adopting a new instance type in your production environment.&lt;/p&gt;

&lt;p&gt;We have uploaded the Docker images used for this experiment publicly. Feel free to use them if you are interested in running Presto on the Arm platform.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Image for Armv8 using OpenJDK 11
$ docker pull lewuathe/presto-coordinator:327-SNAPSHOT-aarch64
$ docker pull lewuathe/presto-worker:327-SNAPSHOT-aarch64


# Image for Armv8 using Amazon Corretto 11
$ docker pull lewuathe/presto-coordinator:327-SNAPSHOT-corretto
$ docker pull lewuathe/presto-worker:327-SNAPSHOT-corretto
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I have also raised &lt;a href=&quot;https://github.com/trinodb/trino/issues/2262&quot;&gt;an issue&lt;/a&gt; to start a discussion about supporting the Arm architecture in the community. It would be great to get feedback from anyone who is interested.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</content>

      
        <author>
          <name>Kai Sasaki, Arm Treasure Data</name>
        </author>
      

      <summary>This December, AWS announced new instance types powered by the Arm-based AWS Graviton2 Processor. The M6g, C6g, and R6g instances are designed to deliver up to 40% better price/performance than the current generation of instance types, so they promise significant cost savings. Presto is just a Java application, so we should be able to run our workloads on these cost-effective instances without any modification. But is that true? Initially, we did not have a clear answer to how much effort it would take to bring Presto to a different processor architecture. Not having to care about the underlying platform is generally beneficial for development, but if a different processor can improve the performance and stability of Presto, we must care about it. Anything unclear must be proven by experiment. This article reports what we need to do to run Presto on the Arm platform, and how much benefit we can potentially obtain from the Graviton processor. As the Graviton 2 based instance types are still in preview, we ran Presto on an A1 instance, which contains the first generation of the Graviton processor. It should still be a helpful anchor for understanding the potential benefit of the Graviton 2 processor.</summary>

      
      
    </entry>
  
    <entry>
      <title>First Presto Summit in India, Bangalore, September 2019</title>
      <link href="https://trino.io/blog/2019/09/05/Presto-Summit-Bangalore.html" rel="alternate" type="text/html" title="First Presto Summit in India, Bangalore, September 2019" />
      <published>2019-09-05T00:00:00+00:00</published>
      <updated>2019-09-05T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/09/05/Presto-Summit-Bangalore</id>
      <content type="html" xml:base="https://trino.io/blog/2019/09/05/Presto-Summit-Bangalore.html">&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/MyPost.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.qubole.com/developers/presto-on-qubole/&quot;&gt;Qubole&lt;/a&gt; organized the first ever Presto Summit in India on September 05, 2019.
Bangalore, as the technology and startup hub of India, was the perfect venue for India’s first Presto Summit. Presto has seen a lot
of interest and adoption in the South Asia and Asia Pacific region, as was evident from the
turnout at the last two Presto meetups organized by Qubole over the past year. The conference venue, Courtyard by Marriott
on Outer Ring Road (ORR) - a 17 KM stretch that hosts 10% of Bangalore’s working population (around 1 million people) -
proved to be an ideal destination for Presto enthusiasts, several of whom work in its immediate vicinity.&lt;/p&gt;

&lt;p&gt;With 150 attendees from more than 75 companies, the Presto community in India was super excited and
eager to meet and interact with the Presto co-creators - &lt;a href=&quot;https://www.linkedin.com/in/traversomartin/&quot;&gt;Martin Traverso&lt;/a&gt;,
&lt;a href=&quot;https://www.linkedin.com/in/dainsundstrom/&quot;&gt;Dain Sundstrom&lt;/a&gt; and
&lt;a href=&quot;https://www.linkedin.com/in/electrum/&quot;&gt;David Phillips&lt;/a&gt; - who flew down to Bangalore for the event.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;welcome-note-by-joydeep-sen-sarma&quot;&gt;Welcome Note by Joydeep Sen Sarma&lt;/h1&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A1895.JPG&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.linkedin.com/in/joydeeps/&quot;&gt;Joydeep Sen Sarma&lt;/a&gt;, co-creator of Hive and co-founder of Qubole, kicked off the event by welcoming
the Presto co-creators, speakers and all the attendees. He also provided a brief historical perspective
on Qubole’s contributions to Presto and highlighted the importance of Presto to Qubole’s customer base.&lt;/p&gt;

&lt;h1 id=&quot;keynote-by-martin-dain-and-david&quot;&gt;Keynote by Martin, Dain and David&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%201.%20Keynote%20by%20Martin%2C%20David%2C%20Dain.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/viBY8Fa3OjI&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A1911.JPG&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This was followed by the most awaited presentation of the day -
the keynote from Martin, Dain and David. Martin took the audience through Presto’s journey - from its birth, growth,
and adoption at Facebook to the present, with the formation of the Presto Software Foundation
for wider community involvement. He also highlighted some of their design choices and some missteps along the way.&lt;/p&gt;

&lt;h1 id=&quot;presto-at-grab&quot;&gt;Presto at Grab&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%202.%20Talk%20by%20Edwin%20Law%20Grab.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/0TR7Nzs8asc&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/grab-talk.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The first industry speaker of the day was &lt;a href=&quot;https://www.linkedin.com/in/edwinlawhh/&quot;&gt;Edwin Hui Hean Law&lt;/a&gt;,
Data Engineering Lead at &lt;a href=&quot;https://www.grab.com/sg/&quot;&gt;Grab, Singapore&lt;/a&gt;. He and his team flew all the way
from Singapore for the Presto Summit - a true testament to their passion and interest in Presto. His talk
covered Grab’s experience of using Presto on Amazon EMR, followed by their migration to Presto on Qubole,
with his insights on the relative pros and cons of the two platforms. The final part of his talk covered his
team’s recent experimentation with Presto on Kubernetes.&lt;/p&gt;

&lt;h1 id=&quot;read-support-for-hive-acid-tables-in-presto&quot;&gt;Read Support for Hive ACID tables in Presto&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%203.%20Talk%20by%20Shubham%20Tagra%20Qubole.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/Q2Nv18ohegA&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A2023.JPG&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Next, &lt;a href=&quot;https://www.linkedin.com/in/shubham-tagra-267a5838/&quot;&gt;Shubham Tagra&lt;/a&gt;, Sr. Staff at &lt;a href=&quot;https://www.qubole.com/developers/presto-on-qubole/&quot;&gt;Qubole&lt;/a&gt;,
presented his work on read support for Hive ACID tables in Presto. This has become increasingly important with the arrival of
data privacy regulations like GDPR and CCPA that grant users the “right to erasure” and the “right to rectification”:
organisations storing user data are obligated to delete or update that data upon a user’s request.
Hive ACID is an open source solution that addresses these problems around deletes and updates.
Shubham’s talk covered why he picked Hive ACID over other options available in open source, as well as
details of the Hive ACID and Presto integration that he added.&lt;/p&gt;

&lt;h1 id=&quot;presto-optimizations-at-zoho-corporation&quot;&gt;Presto Optimizations at Zoho Corporation&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%204.%20Talk%20by%20Praveen%20Krishna%20Zoho.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/mffX12yZTaU&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A2072.JPG&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After lunch, &lt;a href=&quot;https://www.linkedin.com/in/praveenkrishna2112/&quot;&gt;Praveen Krishna&lt;/a&gt; from &lt;a href=&quot;https://www.zohocorp.com/&quot;&gt;Zoho Corporation&lt;/a&gt;
presented a summary of his team’s journey with Presto. In order to serve their teams with a fairly small cluster,
they had to optimize Presto at various levels. Praveen’s team started by analyzing the various phases of query execution
and their impact on performance. They optimized Presto’s planner and reduced planning time by
20-30% for queries involving multiple joins on wide tables. He also highlighted how they integrated
Apache Lucene to speed up full-text search operations. After several iterations, his team came up with a model
where they maintain the Lucene index for each row group in the ORC file itself. For columns with a higher null ratio,
replacing normal blocks with run-length-encoded blocks reduced memory consumption. With this logic implemented
in the ORC reader and core Presto, they were able to reduce memory pressure in the cluster.&lt;/p&gt;

&lt;h1 id=&quot;presto-at-walmart-labs&quot;&gt;Presto at Walmart Labs&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%205.%20Talk%20by%20Ashish%20Tadose%20Walmart%20Labs.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/wap7Hr7P8Bo&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A2092.JPG&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The second presentation in this session was from &lt;a href=&quot;https://www.linkedin.com/in/ashish-tadose-78773b22/&quot;&gt;Ashish Kumar Tadose&lt;/a&gt;, 
Principal Engineer at &lt;a href=&quot;https://www.walmartlabs.com/&quot;&gt;Walmart Labs&lt;/a&gt;. He gave an overview of how his team is 
using Presto on Google Cloud Platform (GCP). 
He highlighted the challenges associated with querying diverse data sources at Walmart and how his team has 
tackled these challenges using Presto. His talk also described how his team has implemented monitoring, autoscaling, 
caching (via Alluxio), and security policies (via Ranger).&lt;/p&gt;

&lt;h1 id=&quot;presto-at-inmobi&quot;&gt;Presto at InMobi&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%206.%20Talk%20by%20Rohit%20Chatter%20InMobi.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/zEvqrAss7Iw&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A2222.JPG&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After a coffee break, &lt;a href=&quot;https://www.linkedin.com/in/rohit-chatter-525b62/&quot;&gt;Rohit Chatter&lt;/a&gt;, CTO at &lt;a href=&quot;https://www.inmobi.com/&quot;&gt;InMobi&lt;/a&gt;, 
provided a historical perspective on how his team migrated from Hive in private data centers to Presto on the 
public cloud. His talk covered various aspects of how his team handles autoscaling and workload management in the cloud.&lt;/p&gt;

&lt;h1 id=&quot;presto-scheduler-changes-for-rubix&quot;&gt;Presto Scheduler Changes for Rubix&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%207.%20Talk%20by%20Garvit%20Gupta%2C%20Microsoft%20and%20Ankit%20Dixit%2C%20Qubole.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/x8xIWuQnEFs&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A2258.JPG&quot; alt=&quot;&quot; /&gt;
&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A2248.JPG&quot; alt=&quot;&quot; /&gt;
Next, &lt;a href=&quot;https://www.linkedin.com/in/garvitg/&quot;&gt;Garvit Gupta&lt;/a&gt; from &lt;a href=&quot;http://www.microsoft.com&quot;&gt;Microsoft&lt;/a&gt; presented his work on 
Presto scheduler changes for data locality and optimized scheduling for caching engines like &lt;a href=&quot;https://www.qubole.com/rubix/&quot;&gt;RubiX&lt;/a&gt;. 
This work was done primarily during his internship at Qubole. The talk was co-presented 
by &lt;a href=&quot;https://www.linkedin.com/in/ankit-dixit-a725545b/&quot;&gt;Ankit Dixit&lt;/a&gt; from &lt;a href=&quot;https://www.qubole.com/developers/presto-on-qubole/&quot;&gt;Qubole&lt;/a&gt;, 
who first gave an overview of the RubiX caching engine and its architecture. Garvit highlighted the need to consider locality as another dimension 
when assigning splits to nodes, and how this led to the implementation of a new Presto scheduler. 
The new scheduling model prioritizes locality while ensuring a uniform distribution of work across nodes, and 
improves the efficacy of any data caching framework used with Presto. His talk covered the new scheduler 
changes in detail, and concluded with performance numbers showing up to 9x improvement in cached/local reads with RubiX.&lt;/p&gt;

&lt;h1 id=&quot;presto-at-miq-digital&quot;&gt;Presto at MiQ Digital&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://go.qubole.com/rs/510-QPZ-296/images/Presto%20Summit%20India%20-%208.%20Talk%20by%20Rohit%20Srivastava%20MIQ.pdf&quot;&gt;Slides&lt;/a&gt;
&lt;a href=&quot;https://youtu.be/nOmI48iqlU4&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Bangalore-2019/JE1A2274.JPG&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The final presentation of the day was from &lt;a href=&quot;https://www.linkedin.com/in/rohitsrivastava20/&quot;&gt;Rohit Srivastava&lt;/a&gt;, 
Engineering Manager at &lt;a href=&quot;http://www.wearemiq.com/&quot;&gt;MiQ Digital&lt;/a&gt;, who presented an overview of the Unified Insights &amp;amp; Data 
Analytics platform at MiQ. He highlighted several challenges his team had to overcome: scaling the 
team, infrastructure, and company; dealing with data copies; the duplication of data pre-processing and the cost and 
effort that goes into it; and meeting strict SLAs. He described how using Presto on Qubole for all 
dashboarding needs, along with standardizing most of their data into the Apache Parquet format 
on S3, has helped overcome some of these challenges.&lt;/p&gt;

&lt;p&gt;In summary, the first Presto Summit in India had a great mix of talks - some covered Presto usage and 
the experience of operating large Presto deployments across multiple clouds, while others focused on niche 
technical contributions around Presto scheduler changes for data locality, speeding up the ORC reader, and read support for 
Hive ACID tables in Presto. Participants had interesting and engaging questions for all the speakers and, in general, 
enjoyed interacting with the Presto founders and other Presto users and developers in the region.&lt;/p&gt;

&lt;p&gt;Videos and slides for all talks can be found &lt;a href=&quot;https://go.qubole.com/2019-09-05---FE---Presto-Summit-19-Bangalore_Post-Summit-Videos-LP-2.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We look forward to the next Presto Summit in this region soon!&lt;/p&gt;</content>

      
        <author>
          <name>Vijay Mann, Director of Engineering, Qubole</name>
        </author>
      

      <summary>Qubole organized the first ever Presto Summit in India on September 05, 2019. Bangalore, as the technology and startup hub of India, was the perfect venue for India’s first Presto Summit. Presto has seen a lot of interest and adoption in the South Asia and Asia-Pacific region, as was evident from the turnout at the last two Presto meetups organized by Qubole over the past year. The conference venue, Courtyard By Marriott on Outer Ring Road (ORR) - a 17 km stretch that hosts 10% of Bangalore’s working population (around 1 million people) - proved to be an ideal destination for Presto enthusiasts, several of whom work in its immediate vicinity. With 150 attendees from more than 75 companies, the Presto community in India was excited and eager to meet and interact with Presto co-creators Martin Traverso, Dain Sundstrom and David Phillips, who flew to Bangalore for this event.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/Bangalore-2019/MyPost.png" />
      
    </entry>
  
    <entry>
      <title>Unnest Operator Performance Enhancement with Dictionary Blocks</title>
      <link href="https://trino.io/blog/2019/08/23/unnest-operator-performance-enhancements.html" rel="alternate" type="text/html" title="Unnest Operator Performance Enhancement with Dictionary Blocks" />
      <published>2019-08-23T00:00:00+00:00</published>
      <updated>2019-08-23T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/08/23/unnest-operator-performance-enhancements</id>
      <content type="html" xml:base="https://trino.io/blog/2019/08/23/unnest-operator-performance-enhancements.html">&lt;p&gt;Queries with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN UNNEST&lt;/code&gt; clause are expected to see a significant performance improvement starting with version 316.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;executive-summary&quot;&gt;Executive Summary&lt;/h1&gt;

&lt;p&gt;The execution plans for queries with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN UNNEST&lt;/code&gt; clause contain an Unnest Operator. The previous implementation of the Unnest Operator performed a deep copy of all input blocks to generate output blocks. This caused high CPU consumption and memory allocation in the operator, and hurt the performance of such queries. The impact was worse for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNNEST&lt;/code&gt; queries accessing a large number of columns, or even a few columns with deeply nested schemas.&lt;/p&gt;

&lt;p&gt;We realized that the implementation could be made more efficient by avoiding copies in the Unnest Operator wherever possible. Using dictionary blocks to create output blocks that point to input elements gives significant CPU and memory benefits by avoiding those copies. The benchmark results for the new Unnest Operator implementation show more than a 10x gain in CPU time and a 3x-5x gain in memory allocation.&lt;/p&gt;

&lt;p&gt;Let’s try to understand this change with an example. At LinkedIn, the most common usage of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN UNNEST&lt;/code&gt; clause is unnesting a single array or map column. A sample query with the clause looks like the following:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnest_c1&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CROSS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNNEST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnest_c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The plots below compare the performance of the Unnest Operator in the previous and the current implementation for three different cases. Each case evaluates the Unnest Operator performance for a query like the one above, on a table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; with two columns, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c0&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c1&lt;/code&gt;. In all three cases, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c0&lt;/code&gt; is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt; column, while the nested column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c1&lt;/code&gt; is of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARRAY(VARCHAR)&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MAP(VARCHAR, VARCHAR)&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARRAY(ROW(VARCHAR, VARCHAR, VARCHAR))&lt;/code&gt; respectively. All the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt; elements in both columns have length 50, and the arrays in the second column have lengths distributed uniformly between 0 and 300.&lt;/p&gt;

&lt;p&gt;We used a JMH &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/test/java/io/prestosql/operator/BenchmarkUnnestOperator.java&quot;&gt;benchmark&lt;/a&gt; to measure the performance of the queries in terms of CPU time and memory allocation per operation. An “operation” (for the purposes of this measurement) is defined as the processing of 10,000 rows by an unnest operator.
These results reflect the speedup of the operator alone and may not extend to the overall query execution.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/unnest-operator-dictionary-block/unnest-blogpost-cpu.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The figure above compares the CPU times before and after the enhancement. In all three cases, every operation finishes more than 10x faster. The new implementation removes the need for copying data during output block generation, giving us significant CPU time savings.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/unnest-operator-dictionary-block/unnest-blogpost-memory.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The figure above compares the memory allocation per operation before and after the enhancement. The new Unnest Operator implementation does not allocate new large memory chunks for output blocks. Instead, it uses integer typed pointers pointing to input block elements, which results in smaller memory allocations than creating new VARCHAR blocks. This brings down the allocation rate by 3x-5x in this example.&lt;/p&gt;

&lt;p&gt;Let’s dig into the design and implementation details.&lt;/p&gt;

&lt;h1 id=&quot;background&quot;&gt;Background&lt;/h1&gt;

&lt;p&gt;An Operator in Presto performs a step of computation on data. The local execution plan for a task involves pipelines of operators. Operators process pages coming from the previous Operator in the pipeline, and produce output pages for the next one. Code for an Operator has to be efficient, since it may be evaluated billions of times for a single query.&lt;/p&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Page&lt;/code&gt; is made up of a set of blocks storing data for different columns. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; is one of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Block&lt;/code&gt; implementations in Presto. The elements in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; are represented using an integer array (called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ids&lt;/code&gt;) and a reference to another block. The values in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ids&lt;/code&gt; array represent elements of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; by pointing to element indices in the referenced block. DictionaryBlocks enable more efficient encoding of columns with duplicate values.&lt;/p&gt;
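A minimal sketch of this idea in Python (not Presto's actual Java implementation; the class and field names here just mirror the concepts above): the element at position i of a dictionary block is simply values[ids[i]], so duplicated values are stored only once.

```python
# Conceptual sketch of a dictionary-encoded block: element i is
# values[ids[i]], so duplicate values are stored only once.
class DictionaryBlock:
    def __init__(self, values, ids):
        self.values = values  # the referenced ("dictionary") block
        self.ids = ids        # integer indices into values

    def get(self, position):
        return self.values[self.ids[position]]

    def __len__(self):
        return len(self.ids)

# A column with many duplicates: five entries, two distinct strings.
block = DictionaryBlock(["engineer", "manager"], [0, 0, 1, 0, 1])
assert [block.get(i) for i in range(len(block))] == [
    "engineer", "engineer", "manager", "engineer", "manager"]
```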

&lt;p&gt;The Unnest Operator was implemented before the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; was added. We saw an opportunity to enhance the performance of this Operator by using DictionaryBlocks. A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; can enable the Unnest Operator to reuse already constructed input blocks. Using DictionaryBlock for the Unnest operator eliminates the need for expensive copies and results in significant compute and memory savings.&lt;/p&gt;

&lt;h1 id=&quot;design&quot;&gt;Design&lt;/h1&gt;

&lt;p&gt;Consider the following &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN UNNEST&lt;/code&gt; query on a table with one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt; column and one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARRAY(VARCHAR)&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/unnest-operator-dictionary-block/unnest-blogpost-input-data.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_position&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CROSS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNNEST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions_held&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_position&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/unnest-operator-dictionary-block/unnest-blogpost-output-data.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Elements of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; column are replicated while we unnest elements in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;positions_held&lt;/code&gt; column. In this example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; is a “replicated column”, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;positions_held&lt;/code&gt; will be referred to as an “unnested column”.&lt;/p&gt;

&lt;p&gt;Multiple unnest columns are also allowed (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNNEST(positions_held, company_name) AS U(unnested_position, unnested_company)&lt;/code&gt;), but that case is less common. It requires special handling, which we discuss &lt;a href=&quot;#dealing-with-multiple-unnest-columns&quot;&gt;later&lt;/a&gt; in the post.&lt;/p&gt;

&lt;p&gt;In the old design, an element from a replicated column would get copied over &lt;em&gt;n&lt;/em&gt; times for building the output, where &lt;em&gt;n&lt;/em&gt; is the cardinality of the element in the unnest column. For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Alice&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bob&lt;/code&gt; will be copied 2 and 3 times respectively. In the new design, the output block will contain &lt;em&gt;n&lt;/em&gt; pointers to the element in the input block, without actually copying. It will store a reference to the input block as well. The benefits here are proportional to the replicated column element sizes. &lt;em&gt;The bigger the element size, the greater the speedup.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/unnest-operator-dictionary-block/unnest-blogpost-replicate-name.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
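To make the pointer-based replication concrete, here is a small Python sketch (a simplification, not Presto's Java code): instead of copying "Alice" twice and "Bob" three times, the output just repeats row indices into the input block.

```python
# Sketch: build the replicated column of an UNNEST output without
# copying. Each input row index is repeated n times, where n is the
# cardinality of that row's array in the unnest column.
def replicate_ids(cardinalities):
    ids = []
    for row, n in enumerate(cardinalities):
        ids.extend([row] * n)
    return ids

names = ["Alice", "Bob"]     # input block of the replicated column
cardinalities = [2, 3]       # lengths of positions_held per row
ids = replicate_ids(cardinalities)
assert ids == [0, 0, 1, 1, 1]
# The output "dictionary block" is (names, ids); reading through it:
assert [names[i] for i in ids] == ["Alice", "Alice", "Bob", "Bob", "Bob"]
```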

&lt;p&gt;Unnest columns are handled the same way. The previous design would copy them over one by one. This becomes CPU intensive and requires new memory allocations, especially in case of deeply nested columns, since a deep copy is required. In the new design, we try to use pointers instead of copies in most of the cases. The following figure shows the output block structure of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unnested_positions&lt;/code&gt; column in the query above, for the old and the new implementation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/unnest-operator-dictionary-block/unnest-blogpost-unnest-positions.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The indices in the output block &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B3&lt;/code&gt; shown above are strictly increasing starting from 0, but that is not always the case. The same input block can be used to generate multiple output blocks, with a different set of indices. Another interesting scenario is when multiple columns are being unnested. In that case, the output may require null appends because of the difference in cardinalities. We look for null elements in the input block and use their indices for handling the null-appends. If that is not possible, we have to fall back to copying data. We discuss this in more detail in the next section.&lt;/p&gt;

&lt;h1 id=&quot;implementation-challenges&quot;&gt;Implementation Challenges&lt;/h1&gt;

&lt;h4 id=&quot;extracting-input-from-nested-blocks&quot;&gt;Extracting Input from Nested Blocks&lt;/h4&gt;

&lt;p&gt;Data in the input unnest columns is represented in terms of nested structures (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ArrayBlock&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MapBlock&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RowBlock&lt;/code&gt;), which create a layer of indirection on top of the actual element blocks. For the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;positions_held&lt;/code&gt; column from the example above, the input block is an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ArrayBlock&lt;/code&gt;, which contains:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;offset information for representing arrays in every row&lt;/li&gt;
  &lt;li&gt;actual data in the form of an underlying element block storing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt;s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For building an output &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt;, we create pointers to this underlying block. While processing entries from input array block, array offsets are translated to indices of the underlying block. Similar translation has been implemented for unnest columns with array type, map type and array of row type. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ColumnarMap&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ColumnarArray&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ColumnarRow&lt;/code&gt; structures are used for enabling such translation of indices.&lt;/p&gt;
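The offset-to-index translation can be illustrated with a simplified Python sketch (hypothetical structure; the real implementation works on Presto's ColumnarArray and related classes): an array block stores a flat element block plus per-row offsets, and unnesting a row emits the index range between consecutive offsets.

```python
# Sketch: an "array block" as offsets into a flat element block.
# Row r holds elements[offsets[r]:offsets[r + 1]], so unnesting row r
# emits the indices offsets[r] .. offsets[r + 1] - 1 without copying.
elements = ["CEO", "Engineer", "Engineer", "Manager", "VP"]
offsets = [0, 2, 5]  # row 0 -> elements[0:2], row 1 -> elements[2:5]

def unnest_indices(offsets, rows):
    ids = []
    for r in rows:
        ids.extend(range(offsets[r], offsets[r + 1]))
    return ids

ids = unnest_indices(offsets, [0, 1])
assert ids == [0, 1, 2, 3, 4]
assert [elements[i] for i in ids] == elements
```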

&lt;h4 id=&quot;dealing-with-multiple-unnest-columns&quot;&gt;Dealing with Multiple Unnest Columns&lt;/h4&gt;

&lt;p&gt;When a table has more than one nested column, a user may want to unnest multiple columns in the same query. Consider a table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; with 3 columns: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;schools_attended&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;graduation_dates&lt;/code&gt;, of types &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VARCHAR&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARRAY(VARCHAR)&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARRAY(VARCHAR)&lt;/code&gt; respectively. Every row in this table indicates the schools attended and the corresponding graduation dates for a person. Let’s say a user wants to unnest the contents of the two array columns into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unnested_school&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unnested_graduation_date&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One naive way of doing this is to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN UNNEST&lt;/code&gt; clause twice, on the two different columns. This translates to two different &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNNEST&lt;/code&gt; operators (as shown in the query below), each with a single unnest column, producing two independent cross joins, and the execution proceeds the way we discussed earlier. This query structure is not very helpful, since it produces a blown-up cross join of the data.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_school&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_graduation_date&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CROSS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNNEST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schools_attended&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_school&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;CROSS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNNEST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graduation_dates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_graduation_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The correct way to unnest the two columns is to use them in the same unnest clause, as shown below.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_school&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_graduation_date&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;CROSS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNNEST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schools_attended&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graduation_dates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnested_school&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unnested_graduation_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The arrays/maps being unnested in multiple columns can have different cardinalities. In this example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;graduation_date&lt;/code&gt; value for the last school may not be present, if the user has not yet graduated. Null elements need to be appended to the output unnest columns in such cases.&lt;/p&gt;

&lt;p&gt;In the example data shown below, a NULL element is appended in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unnested_graduation_date&lt;/code&gt; column since the array in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;graduation_dates&lt;/code&gt; column is shorter than that in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;schools_attended&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/unnest-operator-dictionary-block/unnest-blogpost-corner-case.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Since we are using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; for building the unnest output column, appending a null gets slightly tricky. How do we create a pointer for representing a NULL? The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; implementation, as of now, does not have a way to represent null elements. In such cases, we first check for existence of a null element in the input block. If we find a NULL element there, we use the index of that element while appending NULLs in the output. Otherwise we copy elements from the input to create a new output block, like we used to do in the previous implementation.&lt;/p&gt;
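The fallback logic can be sketched in Python as follows (a simplification of the strategy described above, not the actual Java code): reuse the index of an existing null element of the input when one exists, otherwise materialize a copy.

```python
# Sketch of the null-append strategy: point appended NULLs at an
# existing null element of the input block if there is one; otherwise
# fall back to copying the data into a new block.
def build_output(values, ids, null_appends):
    null_idx = next((i for i, v in enumerate(values) if v is None), None)
    if null_idx is not None:
        # Dictionary path: each appended NULL is an index pointing at
        # the existing null element.
        return values, ids + [null_idx] * null_appends
    # Copy path: materialize a new block with trailing nulls.
    copied = [values[i] for i in ids] + [None] * null_appends
    return copied, list(range(len(copied)))

# Input already contains a null: the dictionary path is taken.
values, ids = build_output(["2001", "2005", None], [0, 1], null_appends=1)
assert [values[i] for i in ids] == ["2001", "2005", None]

# No null in the input: we fall back to copying.
values, ids = build_output(["2001", "2005"], [0, 1], null_appends=1)
assert [values[i] for i in ids] == ["2001", "2005", None]
```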

&lt;p&gt;In cases with multiple unnest columns, the lengths of the arrays/maps are usually the same, and misalignments are infrequent. That said, misalignments can result in copying data while building output blocks if no NULL elements are present in the input. This may reduce the CPU and memory savings (and even increase the average memory allocation in some cases), but this specific case is not common.&lt;/p&gt;

&lt;h1 id=&quot;future-work&quot;&gt;Future Work&lt;/h1&gt;

&lt;p&gt;Performance of queries with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN UNNEST&lt;/code&gt; clause can be further improved through the following optimizations.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;While unnesting a deeply nested column of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;array(row(.....))&lt;/code&gt;, the user is often interested in a small subset of fields from the row. Such cases can benefit from optimization of the logical plan through the pushdown of dereference projections. There are ongoing efforts in the community in this direction.&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
  &lt;li&gt;The dictionary blocks created in the discussed implementation use the input block as a reference. What happens if the input itself is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt;? We end up with two levels of dereferencing. Such cases can be further optimized by collapsing the multiple indirections into a single one.&lt;/li&gt;
&lt;/ul&gt;
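A brief Python sketch of collapsing the indirection mentioned above (hypothetical and simplified): composing the two index arrays yields a single dictionary over the base values.

```python
# Sketch: collapse two levels of dictionary indirection into one.
# outer_ids index into inner_ids, which index into the base values.
def flatten(inner_ids, outer_ids):
    return [inner_ids[i] for i in outer_ids]

base = ["a", "b", "c"]
inner_ids = [2, 0, 1]      # first dictionary over base
outer_ids = [1, 1, 2, 0]   # dictionary built on top of the dictionary
flat = flatten(inner_ids, outer_ids)
assert flat == [0, 0, 1, 2]
# A single lookup now suffices:
assert [base[i] for i in flat] == [base[inner_ids[i]] for i in outer_ids]
```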

&lt;ul&gt;
  &lt;li&gt;The common case for unnest column does not involve any NULL appends. The unnested output &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; in this case represents a range over the input block. We can avoid the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; creation by using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getRegion&lt;/code&gt; method on the input block.&lt;/li&gt;
&lt;/ul&gt;

&lt;!--more--&gt;

&lt;ul&gt;
  &lt;li&gt;For variable-width and complex columns, usage of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DictionaryBlock&lt;/code&gt; can be beneficial in terms of CPU and memory. This may be overkill for primitive types (booleans or integers) and we might be better off copying rather than creating a dictionary block. Selectively choosing to use dictionary blocks based on the type can be helpful.&lt;/li&gt;
&lt;/ul&gt;
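&lt;p&gt;The dictionary-collapsing idea above can be sketched with plain index arrays. The following is a hypothetical Python illustration, not Presto’s actual Java implementation: a dictionary over a dictionary is just a composition of index arrays, so two levels of lookup can be flattened into one.&lt;/p&gt;

```python
# Hypothetical sketch: a DictionaryBlock is modeled as (values, ids),
# where block[i] == values[ids[i]]. Two stacked dictionaries compose
# their index arrays, so they can be collapsed into a single level.

def collapse(values, inner_ids, outer_ids):
    """Flatten dictionary-over-dictionary into one indirection."""
    flat_ids = [inner_ids[i] for i in outer_ids]
    return values, flat_ids

values = ["a", "b", "c"]
inner_ids = [2, 0, 1]          # first dictionary over values
outer_ids = [1, 1, 2, 0]       # second dictionary over the first

vals, ids = collapse(values, inner_ids, outer_ids)
resolved = [vals[i] for i in ids]
print(resolved)  # ['a', 'a', 'b', 'c'] -- one lookup per row instead of two
```

&lt;p&gt;The composition is independent of the value type, which is why collapsing can be a pure index-array operation.&lt;/p&gt;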

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;LinkedIn’s data ecosystem makes heavy use of tables with deeply nested columns, and this change is beneficial for handling Presto queries on such tables. In our internal experiments with production data, we have seen queries perform up to ~9x faster with as much as ~13x less CPU usage.&lt;/p&gt;

&lt;p&gt;We look forward to people in the community trying this out starting with the 316 release. We would love to hear others’ observations of performance after this change. Feel free to reach out to me on &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Slack&lt;/a&gt; (handle @padesai) or &lt;a href=&quot;https://www.linkedin.com/in/pratham-desai/&quot;&gt;LinkedIn&lt;/a&gt; with questions or feedback.&lt;/p&gt;</content>

      
        <author>
          <name>Pratham Desai, LinkedIn</name>
        </author>
      

      <summary>Queries with CROSS JOIN UNNEST clause are expected to have a significant performance improvement starting version 316.</summary>

      
      
    </entry>
  
    <entry>
      <title>A Report of First Ever Presto Conference Tokyo</title>
      <link href="https://trino.io/blog/2019/07/11/report-for-presto-conference-tokyo.html" rel="alternate" type="text/html" title="A Report of First Ever Presto Conference Tokyo" />
      <published>2019-07-11T00:00:00+00:00</published>
      <updated>2019-07-11T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/07/11/report-for-presto-conference-tokyo</id>
      <content type="html" xml:base="https://trino.io/blog/2019/07/11/report-for-presto-conference-tokyo.html">&lt;p&gt;Nowadays, Presto is attracting attention from a wide variety of companies all around 
the world, and Japan is no exception. Many companies use Presto as their primary data 
processing engine.&lt;/p&gt;

&lt;p&gt;To keep the community members in Japan in touch with each other, we held the 
first ever Presto conference in Tokyo, welcoming the Presto creators, &lt;a href=&quot;https://github.com/dain&quot;&gt;Dain Sundstrom&lt;/a&gt;, 
&lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt;, and &lt;a href=&quot;https://github.com/electrum&quot;&gt;David Phillips&lt;/a&gt;. 
The conference was hosted at the Tokyo office of &lt;a href=&quot;https://www.treasuredata.com/&quot;&gt;Arm Treasure Data&lt;/a&gt;. 
This article summarizes the conference, aiming to convey the excitement in the room.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/presto-conference-tokyo/overall-view.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;presto-current-and-future&quot;&gt;Presto: Current and Future&lt;/h1&gt;

&lt;p&gt;First of all, the Presto creators introduced their recent work and the software foundation 
launched last year. They covered the following changes and enhancements achieved by 
the community recently.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Presto Software Foundation&lt;/li&gt;
  &lt;li&gt;New Connectors
    &lt;ul&gt;
      &lt;li&gt;Phoenix&lt;/li&gt;
      &lt;li&gt;Elasticsearch&lt;/li&gt;
      &lt;li&gt;Apache Ranger&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Attendees also learned about several plans for the near future.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Support for more complex pushdown to connectors&lt;/li&gt;
  &lt;li&gt;Case-sensitive identifiers&lt;/li&gt;
  &lt;li&gt;Timestamp semantics&lt;/li&gt;
  &lt;li&gt;Dynamic filtering&lt;/li&gt;
  &lt;li&gt;New connectors such as Iceberg, Kinesis, and Druid&lt;/li&gt;
  &lt;li&gt;Coordinator high availability&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;reading-the-source-code-of-presto&quot;&gt;Reading The Source Code of Presto&lt;/h1&gt;

&lt;p&gt;To get attendees accustomed to the technical talks at the conference, 
&lt;a href=&quot;https://github.com/xerial&quot;&gt;Leo&lt;/a&gt; provided a guide to walking through the 
Presto source code. Since the Presto source code repository is enormous, such a guide 
is helpful for developers exploring the forest of the codebase.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/vTpEZFzu03tVhv&quot; width=&quot;440&quot; height=&quot;330&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/taroleo/reading-the-source-code-of-presto&quot; title=&quot;Reading The Source Code of Presto&quot; target=&quot;_blank&quot;&gt;Reading The Source Code of Presto&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/taroleo&quot; target=&quot;_blank&quot;&gt;Taro L. Saito&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;
&lt;/div&gt;

&lt;h1 id=&quot;presto-at-arm-treasure-data&quot;&gt;Presto At Arm Treasure Data&lt;/h1&gt;

&lt;p&gt;Then &lt;a href=&quot;https://github.com/Lewuathe&quot;&gt;Kai&lt;/a&gt; (that’s me) provided an overview of how Arm Treasure 
Data uses Presto in its service. Presto is heavily used to support many enterprise use 
cases, including IoT data analysis, and it is becoming the hub component processing 
high-throughput workloads from many kinds of clients such as Spark, ODBC, and JDBC.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/cVfDINF85hx0Vx&quot; width=&quot;440&quot; height=&quot;330&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/taroleo/presto-at-arm-treasure-data-2019-updates&quot; title=&quot;Presto At Arm Treasure Data - 2019 Updates&quot; target=&quot;_blank&quot;&gt;Presto At Arm Treasure Data - 2019 Updates&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/taroleo&quot; target=&quot;_blank&quot;&gt;Taro L. Saito&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;
&lt;/div&gt;

&lt;h1 id=&quot;large-scale-migration-from-hive-to-presto-in-yahoo-japan&quot;&gt;Large Scale Migration from Hive to Presto in Yahoo! JAPAN&lt;/h1&gt;

&lt;p&gt;We learned how hard it is to migrate a large-scale workload from Hive to Presto from the 
presentation given by &lt;a href=&quot;https://github.com/oneonestar&quot;&gt;Star&lt;/a&gt; from Yahoo! JAPAN. Quite a few people 
seemed to be interested in the tool they created to convert HiveQL into Presto SQL. They may 
have faced the same types of challenges.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/ld3tI0uIzAQe1&quot; width=&quot;440&quot; height=&quot;330&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/techblogyahoo/large-scale-migration-fromhive-to-presto-at-yahoo-japan&quot; title=&quot;Large scale migration fromHive to Presto at Yahoo! JAPAN&quot; target=&quot;_blank&quot;&gt;Large scale migration fromHive to Presto at Yahoo! JAPAN&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/techblogyahoo&quot; target=&quot;_blank&quot;&gt;Yahoo!デベロッパーネットワーク&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;
&lt;/div&gt;

&lt;h1 id=&quot;presto-at-line&quot;&gt;Presto At LINE&lt;/h1&gt;

&lt;p&gt;LINE is the biggest company providing a mobile communication tool in Japan (think of it as the WhatsApp of Japan). 
&lt;a href=&quot;https://github.com/wyukawa&quot;&gt;Wataru Yukawa&lt;/a&gt; and &lt;a href=&quot;https://github.com/ebyhr&quot;&gt;Yuya Ebihara&lt;/a&gt; showed us how 
they improve their platform by collaborating with the community. We learned about difficulties 
and challenges primarily caused by dependencies on other Hadoop ecosystem components such as HDFS and Spark.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/Hx9oz6Pi1su5rj&quot; width=&quot;440&quot; height=&quot;330&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen=&quot;&quot;&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/wyukawa/presto-conferencetokyo2019&quot; title=&quot;Presto conferencetokyo2019&quot; target=&quot;_blank&quot;&gt;Presto conferencetokyo2019&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/wyukawa&quot; target=&quot;_blank&quot;&gt;wyukawa &lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;One notable moment in the session was the discussion about how to make the error 
messages provided by Presto excellent. David and the other creators genuinely care about the 
error messages shown by the system. Improving error messages is one of the best ways to reduce 
the time spent dealing with inquiries about errors. That is the primary reason they keep the 
error messages easy to understand.&lt;/p&gt;

&lt;h1 id=&quot;qa-session&quot;&gt;Q&amp;amp;A Session&lt;/h1&gt;

&lt;p&gt;At the end of the conference, attendees got a chance to freely ask the Presto creators about a bunch of 
topics, not only Presto technical matters but also their working style and thoughts. Here is a part of 
the Q&amp;amp;A from the conference.&lt;/p&gt;

&lt;p&gt;Q: What do you expect most from the Japan community?&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Judging from our communication with the Israeli community, gaining diversity of use cases will make 
Presto better. We are expecting that kind of diversity. Japan surely has a unique community solving 
its own difficulties. Having a Japanese Slack channel might be a good idea to help each other :)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Q: How do you review the pull request code? How to keep the quality of the code review process?&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Code review difficulty depends on the complexity of the PR itself. We use IntelliJ extensively to read 
the code base. There are mainly two things that keep the code review quality high. One is that getting 
involved in actual code reviews will make you a good reviewer. The other is automating minor checks 
such as code style. These things help keep the code review process functional.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Making the code readable is the most important thing in the Presto codebase.&lt;/p&gt;
  &lt;ul&gt;
    &lt;li&gt;Do not use abbreviations or slang, because not everyone can understand those words at a glance&lt;/li&gt;
    &lt;li&gt;Write a comment -&amp;gt; write the code -&amp;gt; delete the comment. That is the process that makes the code readable in itself.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Q: SQL on Everything approach vs. pursuing the performance. Which direction should Presto move forward?&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;It depends on the community’s decision. However, in our discussions with several companies 
in the community, not a single company has been unconcerned about the performance of Presto.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;wrap-up&quot;&gt;Wrap Up&lt;/h1&gt;

&lt;p&gt;This conference was the first ever Presto conference in Tokyo with the Presto creators invited. We were
able to have exciting discussions with the community developers and creators. One of the great 
things we found at the conference was the enthusiasm of the creators for making Presto usable 
by every developer. They genuinely care about the error messages seen by users and the quality 
of the code read by developers. Thanks to this kind of usability, from the viewpoint of both 
users and developers, Presto keeps gaining attraction from the community.&lt;/p&gt;

&lt;p&gt;It was a great time to have many conversations with the community members. We really appreciate 
the developers in the community and the creators. Thank you so much for coming to the conference, 
and see you next time!&lt;/p&gt;

&lt;h1 id=&quot;reference&quot;&gt;Reference&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://techplay.jp/event/733772&quot;&gt;Presto Conference Tokyo 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.slideshare.net/taroleo/reading-the-source-code-of-presto&quot;&gt;Reading The Source Code of Presto&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.slideshare.net/taroleo/presto-at-arm-treasure-data-2019-updates&quot;&gt;Presto At Arm Treasure Data - 2019 Updates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.slideshare.net/techblogyahoo/large-scale-migration-fromhive-to-presto-at-yahoo-japan&quot;&gt;Large Scale Migration from Hive to Presto in Yahoo! JAPAN&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.slideshare.net/wyukawa/presto-conferencetokyo2019&quot;&gt;Presto At LINE&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content>

      
        <author>
          <name>Kai Sasaki, Arm Treasure Data</name>
        </author>
      

      <summary>Nowadays, Presto is attracting attention from a wide variety of companies all around the world, and Japan is no exception. Many companies use Presto as their primary data processing engine. To keep the community members in Japan in touch with each other, we held the first ever Presto conference in Tokyo, welcoming the Presto creators, Dain Sundstrom, Martin Traverso, and David Phillips. The conference was hosted at the Tokyo office of Arm Treasure Data. This article summarizes the conference, aiming to convey the excitement in the room.</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/presto-conference-tokyo/overall-view.jpg" />
      
    </entry>
  
    <entry>
      <title>Introduction to Trino Cost-Based Optimizer</title>
      <link href="https://trino.io/blog/2019/07/04/cbo-introduction.html" rel="alternate" type="text/html" title="Introduction to Trino Cost-Based Optimizer" />
      <published>2019-07-04T00:00:00+00:00</published>
      <updated>2019-07-04T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/07/04/cbo-introduction</id>
      <content type="html" xml:base="https://trino.io/blog/2019/07/04/cbo-introduction.html">&lt;p&gt;Last edited 15 June 2022: Update to use the Trino project name.&lt;/p&gt;

&lt;p&gt;The Cost-Based Optimizer (CBO) in Trino achieves stunning results in industry
standard benchmarks (and not only in benchmarks)! The CBO makes decisions based
on several factors, including shape of the query, filters and table statistics.
I would like to tell you more about what the table statistics are in Trino and
what information can be derived from them.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;This post was originally published at &lt;a href=&quot;https://www.starburstdata.com/technical-blog/introduction-to-presto-cost-based-optimizer/&quot;&gt;Starburst Data Engineering
Blog&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;background&quot;&gt;Background&lt;/h1&gt;

&lt;p&gt;Before diving deep into how Trino analyzes statistics, let’s set the stage so
that our considerations are framed in some context. Let’s consider a Data
Scientist who wants to know which customers spend the most money with the
company, based on history of orders (probably to offer them some discounts).
They would probably fire up a query like this:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lineitem&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderkey&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, Trino needs to create an execution plan for this query. It does so by
first transforming a query to a plan in the simplest possible way — here it
will create CROSS JOINS for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FROM customer c, orders o, lineitem l&lt;/code&gt; part of the
query and FILTER for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE c.custkey = o.custkey AND l.orderkey = o.orderkey&lt;/code&gt;.
The initial plan is very naïve — CROSS JOINS will produce humongous amounts of
intermediate data. There is no point in even trying to execute such a plan and
Trino won’t do that. Instead, it applies transformation to make the plan more
what user probably wanted, as shown below. Note: for succinctness, only part of
the query plan is drawn, without aggregation (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt;) and sorting (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER
BY&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/trino-eliminate-cross-join.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Indeed, this is much better than the CROSS JOINS. But we can do even better, if
we consider &lt;em&gt;cost&lt;/em&gt;.&lt;/p&gt;

&lt;h1 id=&quot;cost-based-optimizer&quot;&gt;Cost-Based Optimizer&lt;/h1&gt;

&lt;p&gt;Without going into database internals on how JOIN is implemented, let’s take
for granted that it makes a big difference which table is right and which is
left in the JOIN. (Simple explanation would be that the table on the right
basically needs to be kept in the memory while JOIN result is calculated).
Because of that, the following plans produce the same result, but may have
different execution time or memory requirements.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/trino-join-flip.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CPU time, memory requirements and network bandwidth usage are the three
dimensions that contribute to query execution time, both in single query and
concurrent workloads. These dimensions are captured as the &lt;em&gt;cost&lt;/em&gt; in Trino.&lt;/p&gt;

&lt;p&gt;Our Data Scientist knows that most of the customers made at least one order and
every order had at least one item (and many orders had many items), so
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lineitem&lt;/code&gt; is the biggest table, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; is medium and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;customer&lt;/code&gt; is the
smallest. When joining &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;customer&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt;, having &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; on the right
side of the JOIN is not a good idea! However, how can the planner know that? In
the real world, the query planner cannot reliably deduce information just from
table names. This is where table statistics kick in.&lt;/p&gt;

&lt;h2 id=&quot;table-statistics&quot;&gt;Table statistics&lt;/h2&gt;

&lt;p&gt;Trino has &lt;a href=&quot;https://trino.io/docs/current/develop/connectors.html&quot;&gt;connector-based
architecture&lt;/a&gt;. A
connector can provide &lt;a href=&quot;https://trino.io/docs/current/optimizer/statistics.html&quot;&gt;table and column
statistics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;number of rows in a table,&lt;/li&gt;
  &lt;li&gt;number of distinct values in a column,&lt;/li&gt;
  &lt;li&gt;fraction of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt; values in a column,&lt;/li&gt;
  &lt;li&gt;minimum/maximum value in a column,&lt;/li&gt;
  &lt;li&gt;average data size for a column.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, if some information is missing — e.g. the average text length in a
varchar column is unknown — a connector can still provide the other information, and
the Cost-Based Optimizer will be able to use it.&lt;/p&gt;
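&lt;p&gt;As a rough sketch (the field names here are illustrative and do not match Trino’s actual SPI), partially known statistics can be modeled as a record whose fields default to “unknown”, so the optimizer consumes whichever values happen to be present:&lt;/p&gt;

```python
# Illustrative sketch only -- these field names do not match Trino's
# actual connector SPI. The point: every statistic is optional, and
# the optimizer uses whichever values are known.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnStats:
    distinct_values: Optional[int] = None
    null_fraction: Optional[float] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    avg_data_size: Optional[float] = None  # bytes; may be unknown

stats = ColumnStats(distinct_values=1_500_000, null_fraction=0.0)
known = {k: v for k, v in vars(stats).items() if v is not None}
print(known)  # only the provided statistics are available for costing
```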

&lt;p&gt;In our Data Scientist’s example, data sizes can look something like the
following:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/trino-data-table-statistics.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Having this knowledge, &lt;a href=&quot;https://trino.io/docs/current/optimizer/cost-based-optimizations.html&quot;&gt;Trino’s Cost-Based
Optimizer&lt;/a&gt;
will come up with completely different join ordering in the plan.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/trino-cbo-results.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;filter-statistics&quot;&gt;Filter statistics&lt;/h2&gt;

&lt;p&gt;As we saw, knowing the sizes of the tables involved in a query is fundamental
to properly reordering the joins in the query plan. However, knowing just the
sizes is not enough. Returning to our example, the Data Scientist might want to
drill down into results of their previous query, to know which customers
repeatedly bought and spent most money on a particular item (clearly, this must
be some consumable, or a mobile phone). For this, they will use almost
identical query as the original one, adding one more condition.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lineitem&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderkey&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;item&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;106170&lt;/span&gt;                              &lt;span class=&quot;c1&quot;&gt;--- additional condition&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The additional FILTER might be applied after the JOIN or before. Obviously,
filtering as early as possible is the best strategy, but this also means the
actual size of the data involved in the JOIN will be different now. In our Data
Scientist’s example, the join order will indeed be different.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/trino-cbo-results-with-filter.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
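&lt;p&gt;For an equality predicate such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l.item = 106170&lt;/code&gt;, a common textbook heuristic (shown here as a simplified sketch, not Trino’s exact estimation code) divides the non-NULL row count by the number of distinct values:&lt;/p&gt;

```python
# Simplified selectivity sketch -- not Trino's exact estimation code.
# For an equality predicate col = constant, a standard heuristic is:
#   surviving_rows ~= rows * (1 - null_fraction) / distinct_values

def estimate_equality_filter(rows, distinct_values, null_fraction=0.0):
    if distinct_values is None or distinct_values == 0:
        return None  # unknown statistics: no estimate possible
    return rows * (1.0 - null_fraction) / distinct_values

# e.g. 6M lineitem rows and 2k distinct items: roughly 3k rows pass
print(estimate_equality_filter(6_000_000, 2_000))  # 3000.0
```

&lt;p&gt;With the filtered row count this much smaller, the side of the join on which the filtered table lands can flip, which is exactly what the plan above shows.&lt;/p&gt;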

&lt;h1 id=&quot;under-the-hood&quot;&gt;Under the Hood&lt;/h1&gt;

&lt;h2 id=&quot;execution-time-and-cost&quot;&gt;Execution Time and Cost&lt;/h2&gt;

&lt;p&gt;From an external perspective, only three things really matter:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;execution time,&lt;/li&gt;
  &lt;li&gt;execution cost (in dollars),&lt;/li&gt;
  &lt;li&gt;ability to run (sufficiently) many concurrent queries at a time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The execution time is often called “wall time” to emphasize that we’re not
really interested in “CPU time” or number of machines/nodes/threads involved.
Our Data Scientist’s clock on the wall is the ultimate judge. It would be nice
if they were not forced to get coffee/eat lunch during each query they run. On
the other hand, a CFO will be interested in keeping cluster costs at the lowest
possible level (without, of course, impeding employees’ effectiveness). Lastly,
a System Administrator needs to ensure that all cluster users can work at the
same time. That is, that the cluster can handle many queries at a time,
yielding enough throughput that “wall time” observed by each of the users is
satisfactory.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/under-the-hood.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It is possible to optimize for only one of the above dimensions. For example,
we can have a single-node cluster and the CFO will be happy (but employees will go
somewhere else). Conversely, we may have a thousand-node cluster even if the
company cannot afford it. Users will be (initially) happy, until the company
goes bankrupt. Ultimately, however, we need to balance these trade-offs, which
basically means that queries need to be executed as fast as possible, with as
little resources as possible.&lt;/p&gt;

&lt;p&gt;In Trino, this is modeled with the concept of the cost, which captures
properties like CPU cost, memory requirements and network bandwidth usage.
Different variants of a query execution plan are explored, assigned a cost and
compared. The variant with the least overall cost is selected for execution.
This approach neatly balances the needs of cluster users, administrators and
the CFO.&lt;/p&gt;
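&lt;p&gt;The variant-selection loop can be sketched as a toy model. The cost formulas and row counts below are made up for illustration and are not Trino’s actual cost functions:&lt;/p&gt;

```python
# Toy cost model -- formulas and numbers are illustrative only.
# The build (right) side of a hash join is held in memory, so its
# size dominates memory cost; both sides contribute to CPU cost.

def join_cost(probe_rows, build_rows):
    cpu = probe_rows + build_rows    # rows processed on both sides
    memory = build_rows              # hash table kept on the right side
    network = build_rows             # right side is shuffled/broadcast
    return cpu + memory + network    # collapse into one comparable scalar

variants = {
    "customer JOIN orders": join_cost(probe_rows=1_500_000, build_rows=15_000_000),
    "orders JOIN customer": join_cost(probe_rows=15_000_000, build_rows=1_500_000),
}
best = min(variants, key=variants.get)
print(best)  # orders JOIN customer -- the smaller table goes on the right
```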

&lt;p&gt;The cost of each operation in the query plan is calculated in a way appropriate
for the type of the operation, taking into account statistics of the data
involved in the operation. Now, let’s see where the statistics come from.&lt;/p&gt;

&lt;h2 id=&quot;statistics&quot;&gt;Statistics&lt;/h2&gt;

&lt;p&gt;In our Data Scientist’s example, the row counts for tables were taken directly
from table statistics, i.e. provided by a connector. But where did “~3K rows”
come from? Let’s dive into some nitty-gritty details.&lt;/p&gt;

&lt;p&gt;A query execution plan is made of “building block” operations, including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;table scans (reading the table; at runtime this is actually combined with a
filter)&lt;/li&gt;
  &lt;li&gt;filters (SQL’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause or any other conditions deduced by the query
planner)&lt;/li&gt;
  &lt;li&gt;projections (i.e. computing output expressions)&lt;/li&gt;
  &lt;li&gt;joins&lt;/li&gt;
  &lt;li&gt;aggregations (in fact there are a few different “building blocks” for
aggregations, but that’s a story for another time)&lt;/li&gt;
  &lt;li&gt;sorting (SQL’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;limiting (SQL’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;sorting and limiting combined (SQL’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY .. LIMIT ..&lt;/code&gt; deserves
specialized support)&lt;/li&gt;
  &lt;li&gt;and a lot more!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How the statistics are computed for the most interesting “building blocks”
is discussed below.&lt;/p&gt;

&lt;h2 id=&quot;table-scan-statistics&quot;&gt;Table Scan statistics&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/table-scan-statistics.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As explained in the “Table statistics” section, the connector which defines the
table is responsible for providing the table statistics. Furthermore, the
connector is informed about any filtering conditions that are to be
applied to the data read from the table. This may be important, e.g. in the case
of a Hive partitioned table, where statistics are stored on a per-partition basis.
If the filtering condition excludes some (or many) partitions, the statistics
cover a smaller data set (the remaining partitions) and are therefore more
accurate.&lt;/p&gt;

&lt;p&gt;To recall, a connector can provide the following table and column statistics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;number of rows in a table,&lt;/li&gt;
  &lt;li&gt;number of distinct values in a column,&lt;/li&gt;
  &lt;li&gt;fraction of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt; values in a column,&lt;/li&gt;
  &lt;li&gt;minimum/maximum value in a column,&lt;/li&gt;
  &lt;li&gt;average data size for a column.&lt;/li&gt;
&lt;/ul&gt;
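&lt;p&gt;These statistics can be inspected directly from SQL with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SHOW STATS&lt;/code&gt; command. As a quick sketch (the table name is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SHOW STATS FOR lineitem;
-- statistics restricted to the rows a filter would select:
SHOW STATS FOR (SELECT * FROM lineitem WHERE item = 106170);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;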

&lt;h2 id=&quot;filter-statistics-1&quot;&gt;Filter statistics&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/filter-statistics.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When considering a filtering operation, a filter’s condition is analyzed and
the following estimations are calculated:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the probability that a data row will pass the filtering condition; from
this, the expected number of rows after the filter is derived,&lt;/li&gt;
  &lt;li&gt;the fraction of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt; values for columns involved in the filtering condition (for
most conditions, this will simply be 0%),&lt;/li&gt;
  &lt;li&gt;the number of distinct values for columns involved in the filtering condition,&lt;/li&gt;
  &lt;li&gt;the number of distinct values for columns that were not part of the filtering
condition, if their original number of distinct values was higher than the
expected number of data rows that pass the filter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, for a condition like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l.item = 106170&lt;/code&gt; we can observe that:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;no rows with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l.item&lt;/code&gt; being &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt; will meet the condition,&lt;/li&gt;
  &lt;li&gt;there will be only one distinct value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l.item&lt;/code&gt; (106170) after the
filtering operation,&lt;/li&gt;
  &lt;li&gt;on average, the number of data rows expected to pass the filter will be equal to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;number_of_input_rows * fraction_of_non_nulls / distinct_values&lt;/code&gt;. (This
assumes, of course, that users most often drill down into the data they really
have, which is a reasonable and safe assumption to make.)&lt;/li&gt;
&lt;/ul&gt;
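&lt;p&gt;As a numeric sketch (the input row count and distinct-value count below are assumed purely for illustration): if the filter’s input has 6,000,000 rows, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l.item&lt;/code&gt; has no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt; values and 2,000 distinct values, the estimate works out to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;number_of_input_rows * fraction_of_non_nulls / distinct_values
= 6,000,000 * 1.0 / 2,000
= 3,000 rows
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;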

&lt;h2 id=&quot;projection-statistics&quot;&gt;Projection statistics&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/cbo-introduction/projection-statistics.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Projections (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l.item - 1 AS iid&lt;/code&gt;) are similar to filters, except that, of
course, they do not impact the expected number of rows after the operation.&lt;/p&gt;

&lt;p&gt;For a projection, the following types of column statistics are calculated (if
possible for given projection expression):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;number of distinct values produced by the projection,&lt;/li&gt;
  &lt;li&gt;fraction of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt; values produced by the projection,&lt;/li&gt;
  &lt;li&gt;minimum/maximum value produced by the projection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naturally, if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;iid&lt;/code&gt; is only returned to the user, then these statistics are not
useful. However, if it is later used in a filter or join operation, these
statistics are important to correctly estimate the number of rows that meet the
filter condition or are returned from the join.&lt;/p&gt;
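&lt;p&gt;To make the propagation concrete, here is a sketch for a projection like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l.item - 1 AS iid&lt;/code&gt; (the input statistics below are assumed for illustration; subtracting a constant shifts the min/max by one and preserves the distinct count and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NULL&lt;/code&gt; fraction):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;l.item: distinct values = 2,000, NULL fraction = 0%, min = 1, max = 200,000
iid:    distinct values = 2,000, NULL fraction = 0%, min = 0, max = 199,999
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;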

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Summing up, Trino’s Cost-Based Optimizer is conceptually very simple:
alternative query plans are considered, and the best plan is chosen and executed.
The details are not so simple, though. Fortunately, to use
&lt;a href=&quot;https://trino.io/&quot;&gt;Trino&lt;/a&gt;, one doesn’t need to know all these details.
Of course, anyone with a technical inclination who likes to wander in database
internals is invited to study &lt;a href=&quot;https://github.com/trinodb/trino&quot;&gt;the Trino code&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Enabling Trino CBO is really simple:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;optimizer.join-reordering-strategy=AUTOMATIC&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;join-distribution-type=AUTOMATIC&lt;/code&gt; in your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.properties&lt;/code&gt;,&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://trino.io/docs/current/sql/analyze.html&quot;&gt;analyze&lt;/a&gt; your tables,&lt;/li&gt;
  &lt;li&gt;no, there is no third step. That’s it!&lt;/li&gt;
&lt;/ul&gt;
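&lt;p&gt;In other words (the catalog, schema, and table names below are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# etc/config.properties
optimizer.join-reordering-strategy=AUTOMATIC
join-distribution-type=AUTOMATIC
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ANALYZE hive.default.lineitem;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;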

&lt;p&gt;Take Trino CBO for a spin today and let us know about &lt;em&gt;your&lt;/em&gt; Trino
experience!&lt;/p&gt;

&lt;p&gt;□&lt;/p&gt;</content>

      
        <author>
          <name>Piotr Findeisen, Starburst Data</name>
        </author>
      

      <summary>Last edited 15 June 2022: Update to use the Trino project name. The Cost-Based Optimizer (CBO) in Trino achieves stunning results in industry standard benchmarks (and not only in benchmarks)! The CBO makes decisions based on several factors, including shape of the query, filters and table statistics. I would like to tell you more about what the table statistics are in Trino and what information can be derived from them.</summary>

      
      
    </entry>
  
    <entry>
      <title>Dynamic filtering for highly-selective join optimization</title>
      <link href="https://trino.io/blog/2019/06/30/dynamic-filtering.html" rel="alternate" type="text/html" title="Dynamic filtering for highly-selective join optimization" />
      <published>2019-06-30T00:00:00+00:00</published>
      <updated>2019-06-30T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/06/30/dynamic-filtering</id>
      <content type="html" xml:base="https://trino.io/blog/2019/06/30/dynamic-filtering.html">&lt;p&gt;By using dynamic filtering via run-time predicate pushdown, we can significantly optimize highly-selective inner-joins.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;In the highly-selective join scenario, most of the probe-side rows are dropped immediately after being read, since they 
don’t match the join criteria.&lt;/p&gt;
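&lt;p&gt;A typical shape of such a query looks like this (the table and column names are hypothetical): the filtered dimension table yields few build-side rows, so most of the fact-table rows read on the probe side are dropped by the join:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT s.orderkey, s.totalprice
FROM sales s
JOIN items i ON s.item_id = i.id
WHERE i.category = &apos;toys&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;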

&lt;p&gt;Our idea was to extend Presto’s predicate pushdown support from the planning phase to run-time, in order to skip reading 
the non-relevant rows from &lt;a href=&quot;https://www.slideshare.net/OriReshef/presto-for-apps-deck-varada-prestoconf&quot;&gt;our connector&lt;/a&gt; 
into Presto&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. It should allow much faster joins, when the build-side scan results in a low-cardinality table:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dynamic-filtering/dynamic-filtering.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The approach above is called “dynamic filtering”, and there is &lt;a href=&quot;https://github.com/trinodb/trino/issues/52&quot;&gt;an ongoing effort&lt;/a&gt; 
to integrate it into Presto.&lt;/p&gt;

&lt;p&gt;The main difficulty is the need to pass the build-side values from the inner-join operator to the probe-side scan operator, 
since the operators may run on different machines. A possible solution is to use the coordinator to facilitate the message 
passing. However, it requires multiple changes in the existing Presto codebase and careful design is needed to avoid overloading
the coordinator.&lt;/p&gt;

&lt;p&gt;Since it’s a complex feature with lots of moving parts, we suggest the approach below that allows solving it in a simpler way 
for specific join use-cases. We note that parts of the implementation below will also help implementing the general dynamic 
filtering solution.&lt;/p&gt;

&lt;h1 id=&quot;design&quot;&gt;Design&lt;/h1&gt;

&lt;p&gt;Our approach relies on the &lt;a href=&quot;https://www.starburst.io/wp-content/uploads/2018/09/Presto-Cost-Based-Query-Optimizer-WP.pdf&quot;&gt;cost-based optimizer&lt;/a&gt; 
(CBO) that allows using “broadcast” join, since in our case the build-side is much smaller than the probe-side. In this case, 
the probe-side scan and the inner-join operators are running in the same process - so the message passing between them becomes 
much simpler.&lt;/p&gt;

&lt;p&gt;Therefore, most of the required changes are at the 
&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/sql/planner/LocalExecutionPlanner.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LocalExecutionPlanner&lt;/code&gt;&lt;/a&gt; 
class, and there are no dependencies on the planner or the coordinator.&lt;/p&gt;

&lt;h1 id=&quot;implementation&quot;&gt;Implementation&lt;/h1&gt;

&lt;p&gt;First, we make sure that a broadcast join is used and that the local stage query plan contains the probe-side 
&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/sql/planner/plan/TableScanNode.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TableScan&lt;/code&gt;&lt;/a&gt; node.
Otherwise, we don’t apply the optimization, since we need access to the probe-side &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/split/PageSourceProvider.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PageSourceProvider&lt;/code&gt;&lt;/a&gt; 
for predicate pushdown.&lt;/p&gt;

&lt;p&gt;Then, we add a new “collection” operator, just before the hash-builder operator as described below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dynamic-filtering/operators.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This operator collects the build-side values, and after its input is over, exposes the resulting dynamic filter as a 
&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/predicate/TupleDomain.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TupleDomain&lt;/code&gt;&lt;/a&gt; 
to the probe-side &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/split/PageSourceProvider.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PageSourceProvider&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since the probe-side scan operators are running concurrently with the build-side collection, we don’t block the first probe-side 
splits - but allow them to be processed while dynamic filters collection is in progress.&lt;/p&gt;

&lt;p&gt;The lookup-join operator is not changed, but the optimization above allows it to process far fewer probe-side rows, while 
keeping the result the same.&lt;/p&gt;

&lt;h1 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;/h1&gt;

&lt;p&gt;We ran TPC-DS queries on an i3.metal 3-node Varada cluster using TPC-DS scale 1000 data.
The following queries benefit the most from our dynamic filtering implementation (measuring the elapsed time in seconds).&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Query&lt;/th&gt;
      &lt;th&gt;Dynamic filtering &amp;amp; CBO&lt;/th&gt;
      &lt;th&gt;Only CBO&lt;/th&gt;
      &lt;th&gt;No CBO&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q10.sql&quot;&gt;q10&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;2.5&lt;/td&gt;
      &lt;td&gt;8.9&lt;/td&gt;
      &lt;td&gt;10.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q20.sql&quot;&gt;q20&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;3.9&lt;/td&gt;
      &lt;td&gt;12.6&lt;/td&gt;
      &lt;td&gt;26.7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q31.sql&quot;&gt;q31&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;6.5&lt;/td&gt;
      &lt;td&gt;34.8&lt;/td&gt;
      &lt;td&gt;41.5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q32.sql&quot;&gt;q32&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;6.9&lt;/td&gt;
      &lt;td&gt;23.0&lt;/td&gt;
      &lt;td&gt;29.7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q34.sql&quot;&gt;q34&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;3.1&lt;/td&gt;
      &lt;td&gt;11.4&lt;/td&gt;
      &lt;td&gt;14.1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q69.sql&quot;&gt;q69&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;2.7&lt;/td&gt;
      &lt;td&gt;8.9&lt;/td&gt;
      &lt;td&gt;9.9&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q71.sql&quot;&gt;q71&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;9.9&lt;/td&gt;
      &lt;td&gt;91.8&lt;/td&gt;
      &lt;td&gt;107.4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q77.sql&quot;&gt;q77&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;3.5&lt;/td&gt;
      &lt;td&gt;17.9&lt;/td&gt;
      &lt;td&gt;18.1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q96.sql&quot;&gt;q96&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;1.9&lt;/td&gt;
      &lt;td&gt;8.0&lt;/td&gt;
      &lt;td&gt;10.2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q98.sql&quot;&gt;q98&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;5.8&lt;/td&gt;
      &lt;td&gt;26.5&lt;/td&gt;
      &lt;td&gt;57.1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/dynamic-filtering/benchmark.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For example, running the &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q71.sql&quot;&gt;TPC-DS q71 query&lt;/a&gt; 
results in ~9x performance improvement:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Dynamic filtering&lt;/th&gt;
      &lt;th&gt;Enabled&lt;/th&gt;
      &lt;th&gt;Disabled&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Elapsed (sec)&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;92&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CPU (min)&lt;/td&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;127&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Data read (GB)&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;112&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h1 id=&quot;discussion&quot;&gt;Discussion&lt;/h1&gt;

&lt;p&gt;These queries are joining large fact “sales” tables with much smaller and filtered dimension tables (e.g. “items”, “customers”, “stores”) - 
resulting in significant optimization by using dynamic filtering.&lt;/p&gt;

&lt;p&gt;Note that we rely on the fact that our connector allows efficient run-time filtering of the probe-side table, by using an inline index 
for every column of each split.&lt;/p&gt;

&lt;p&gt;We also rely on the CBO and statistics’ estimation to correctly convert join distribution type to “broadcast” join. Since current statistics’ 
estimation doesn’t support all query plans, this optimization cannot be currently applied for some types of 
&lt;a href=&quot;https://github.com/trinodb/trino/blob/58b86da0eda9d479d418d9752b8cdd4d2c44d9ae/presto-main/src/main/java/io/prestosql/cost/AggregationStatsRule.java&quot;&gt;aggregations&lt;/a&gt; 
(e.g. &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q19.sql&quot;&gt;TPC-DS q19 query&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In addition, our current dynamic filtering doesn’t support multiple join operators in the same stage, so there are some TPC-DS queries 
(e.g. &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-product-tests/src/main/resources/sql-tests/testcases/tpcds/q13.sql&quot;&gt;q13&lt;/a&gt;) 
that may be optimized further.&lt;/p&gt;

&lt;h1 id=&quot;future-work&quot;&gt;Future work&lt;/h1&gt;

&lt;p&gt;The implementation above is currently in the process of being &lt;a href=&quot;https://github.com/trinodb/trino/pull/931&quot;&gt;reviewed&lt;/a&gt; and will be 
available in a release soon. In addition, we intend to improve the existing implementation to resolve the limitations described above, 
and to support more join patterns.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Initially we had experimented with adding &lt;a href=&quot;https://github.com/trinodb/trino/blob/1afbe98bb1eebfcf9050efa5c9a6bb6ccad80c8c/presto-spi/src/main/java/io/prestosql/spi/connector/ConnectorMetadata.java#L527-L533&quot;&gt;Index Join support&lt;/a&gt; to our connector, but since it requires a global index and efficient lookups for high performance, we switched to the dynamic filtering approach. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content>

      
        <author>
          <name>Roman Zeyde</name>
        </author>
      

      <summary>By using dynamic filtering via run-time predicate pushdown, we can significantly optimize highly-selective inner-joins.</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 315</title>
      <link href="https://trino.io/blog/2019/06/15/release-315.html" rel="alternate" type="text/html" title="Release 315" />
      <published>2019-06-15T00:00:00+00:00</published>
      <updated>2019-06-15T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/06/15/release-315</id>
      <content type="html" xml:base="https://trino.io/blog/2019/06/15/release-315.html">&lt;p&gt;This version adds support for
&lt;a href=&quot;https://trino.io/docs/current/sql/select.html#limit-or-fetch-first-clauses&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST ... WITH TIES&lt;/code&gt;&lt;/a&gt;
syntax, locality-awareness to default scheduler for better workload balancing, the new
&lt;a href=&quot;https://trino.io/docs/current/functions/conversion.html#format&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;format()&lt;/code&gt;&lt;/a&gt; function,
and improved support for ORC bloom filters. Additionally, connectors can now provide
view definitions, which opens up several new use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-315.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version adds support for FETCH FIRST ... WITH TIES syntax, locality-awareness to default scheduler for better workload balancing, the new format() function, and improved support for ORC bloom filters. Additionally, connectors can now provide view definitions, which opens up several new use cases. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 314</title>
      <link href="https://trino.io/blog/2019/06/08/release-314.html" rel="alternate" type="text/html" title="Release 314" />
      <published>2019-06-08T00:00:00+00:00</published>
      <updated>2019-06-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/06/08/release-314</id>
      <content type="html" xml:base="https://trino.io/blog/2019/06/08/release-314.html">&lt;p&gt;This version adds support for reading ZSTD and LZ4-compressed Parquet data
and writing ZSTD-compressed ORC data, improves compatibility with the Hive
2.3+ metastore, supports mixed-case field names in Elasticsearch, adds JSON
output format for the CLI, and improves the rendering of the plan structure
in &lt;a href=&quot;https://trino.io/docs/current/sql/explain.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt;&lt;/a&gt; output.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-314.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version adds support for reading ZSTD and LZ4-compressed Parquet data and writing ZSTD-compressed ORC data, improves compatibility with the Hive 2.3+ metastore, supports mixed-case field names in Elasticsearch, adds JSON output format for the CLI, and improves the rendering of the plan structure in EXPLAIN output. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Apache Phoenix Connector</title>
      <link href="https://trino.io/blog/2019/06/04/phoenix-connector.html" rel="alternate" type="text/html" title="Apache Phoenix Connector" />
      <published>2019-06-04T00:00:00+00:00</published>
      <updated>2019-06-04T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/06/04/phoenix-connector</id>
      <content type="html" xml:base="https://trino.io/blog/2019/06/04/phoenix-connector.html">&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-312.html&quot;&gt;Presto 312&lt;/a&gt;
introduces a new &lt;a href=&quot;https://trino.io/docs/current/connector/phoenix.html&quot;&gt;Apache Phoenix Connector&lt;/a&gt;, 
which allows Presto to query data stored in &lt;a href=&quot;https://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;
using &lt;a href=&quot;https://phoenix.apache.org/&quot;&gt;Apache Phoenix&lt;/a&gt;.  This unlocks new capabilities that previously
weren’t possible with Phoenix alone, such as federation (querying of multiple Phoenix clusters) and
joining Phoenix data with data from other Presto data sources.&lt;/p&gt;

&lt;h1 id=&quot;setup&quot;&gt;Setup&lt;/h1&gt;
&lt;p&gt;To get started, simply drop in a new catalog properties file, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etc/catalog/phoenix.properties&lt;/code&gt;,
which defines the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;connector.name=phoenix
phoenix.connection-url=jdbc:phoenix:host1,host2,host3:2181:/hbase
phoenix.config.resources=/path/to/hbase-site.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phoenix.connection-url&lt;/code&gt; is the standard Phoenix connection string, which contains the zookeeper
quorum host information and root zookeeper node.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phoenix.config.resources&lt;/code&gt; is a comma separated list of configuration files, used to specify any
&lt;a href=&quot;https://phoenix.apache.org/tuning.html&quot;&gt;custom connection properties&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;schema&quot;&gt;Schema&lt;/h1&gt;
&lt;p&gt;For the most part, data types in Phoenix match up with those in Presto, with a few
&lt;a href=&quot;https://trino.io/docs/current/connector/phoenix.html#data-types&quot;&gt;minor exceptions&lt;/a&gt;.  One thing
to note, however, is that tables in Phoenix require a primary key, whereas Presto has no concept of
primary keys.  To handle this, the Phoenix connector uses a table property to specify the primary key. 
For example, consider the following statement in Phoenix:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;example&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;pk_part_1&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;pk_part_2&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bigint&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CONSTRAINT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pk&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pk_part_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pk_part_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The equivalent statement in Presto would look something like:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;phoenix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;pk_part_1&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;pk_part_2&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bigint&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;rowkeys&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pk_part_1,pk_part_2&apos;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Additional Phoenix and HBase table properties can be specified in a 
&lt;a href=&quot;https://trino.io/docs/current/connector/phoenix.html#table-properties-phoenix&quot;&gt;similar way&lt;/a&gt;. 
Note also that the default (empty) schema in Phoenix will always map to a Presto schema named “default”.&lt;/p&gt;

&lt;h1 id=&quot;beyond-mapreduce&quot;&gt;Beyond MapReduce&lt;/h1&gt;
&lt;p&gt;When Phoenix users want to run long-running queries that scan over all or most of the data in a table,
they have typically used the Phoenix &lt;a href=&quot;https://phoenix.apache.org/phoenix_mr.html&quot;&gt;MapReduce integration&lt;/a&gt;. 
However, this approach has limitations, as the documentation states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: The SELECT query must not perform any aggregation or use DISTINCT as these are not supported by our map-reduce integration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is because the framework only constructs simple Mappers which scan over each region.  To
do more complex operations like aggregations, the framework would need Reducers as well.
Someone could implement that, but then they would essentially be on the path towards rewriting
Hive from scratch.&lt;/p&gt;

&lt;p&gt;Presto now provides the ability to do these more complex operations.  The Phoenix connector
performs the same filtered scans as the MapReduce framework, but now the Presto engine does
the aggregations, joins, etc.&lt;/p&gt;
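&lt;p&gt;For instance, using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example&lt;/code&gt; table defined above, an aggregation that the MapReduce integration cannot perform is now a plain query:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT pk_part_1, sum(val) AS total
FROM phoenix.default.example
GROUP BY pk_part_1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;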

&lt;h1 id=&quot;federation&quot;&gt;Federation&lt;/h1&gt;
&lt;p&gt;With the Phoenix connector, querying multiple Phoenix clusters is as easy as querying the
respective catalogs.  As a simple example, suppose we have one cluster in region &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;us-west&lt;/code&gt; and
another cluster in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;us-east&lt;/code&gt;.  If we create two catalog files, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phoenix_west.properties&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phoenix_east.properties&lt;/code&gt;, then we can query both:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;us-west&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;region&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;phoenix_west&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;UNION&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;us-east&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;region&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;phoenix_east&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;joining-with-other-data-sources&quot;&gt;Joining with other data sources&lt;/h1&gt;
&lt;p&gt;Another nice feature of Presto is the ability to join data in Phoenix with other data sources.
Suppose we have the following tables:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;customer (
  custkey bigint,
  comment varchar,
  ...
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;orders (
  orderkey bigint,
  custkey bigint,
  totalprice double,
  ...
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Suppose further that:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Either table can hold large amounts of data&lt;/li&gt;
  &lt;li&gt;The customer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;comment&lt;/code&gt; field can change frequently&lt;/li&gt;
  &lt;li&gt;We want to be able to query for orders with a certain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;totalprice&lt;/code&gt; range, and join with the
customer table to get the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;comment&lt;/code&gt; for these orders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phoenix/HBase is a row-oriented storage solution with very fast lookup by primary key.  On the
other hand, ORC is a column-oriented file format that can filter results by column value very
efficiently.  So in this use case, it might make sense to store the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;customer&lt;/code&gt; table in Phoenix
with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;custkey&lt;/code&gt; as the primary key, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; table in ORC, perhaps in an object store like
S3.  We can then use Presto to leverage the strengths of each of our data stores and combine OLTP
with OLAP:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;comment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;totalprice&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;phoenix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;totalprice&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;totalprice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;custkey&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;insertingupdating-data&quot;&gt;Inserting/Updating data&lt;/h1&gt;
&lt;p&gt;In the prior example, since our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;customer&lt;/code&gt; data is coming from Phoenix, our OLTP store, we can
easily insert new data:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;phoenix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tpch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;some comment&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Since Presto’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; translates to Phoenix’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPSERT&lt;/code&gt;, inserting is the same as updating:
if there’s already a row with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;custkey&lt;/code&gt; of 101, then its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;comment&lt;/code&gt; gets updated instead.&lt;/p&gt;
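&lt;p&gt;For example, running a second &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; with the same key overwrites the existing row rather than failing or creating a duplicate (hypothetical values shown):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;INSERT INTO phoenix.tpch.customer VALUES (101, &apos;updated comment&apos;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;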

&lt;h1 id=&quot;future-work&quot;&gt;Future work&lt;/h1&gt;
&lt;p&gt;With upcoming improvements to Presto, there will be opportunities to further optimize the performance
of the Phoenix connector.&lt;/p&gt;

&lt;p&gt;One of the biggest ways Phoenix optimizes performance is through the use of 
&lt;a href=&quot;https://www.3pillarglobal.com/insights/hbase-coprocessors&quot;&gt;HBase coprocessors&lt;/a&gt;, which allow custom
code to be run on each regionserver.  For example, to do aggregations, Phoenix runs a partial
aggregation in the coprocessor of each table region, and the result for each region is then passed
back to the client for a final aggregation.  That way, the table data itself doesn’t need to be
sent from each region to the client - just the partial aggregation result.  However, currently only
filters are pushed down to the Phoenix connector.  With the ongoing work in Presto to support more
&lt;a href=&quot;https://github.com/trinodb/trino/issues/18&quot;&gt;complex pushdown&lt;/a&gt; to connectors, we will be able to
push down operations like aggregations to the Phoenix connector, which in turn can push them further
down to the HBase coprocessors.&lt;/p&gt;
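&lt;p&gt;As a sketch, with aggregation pushdown a query like the following (against a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phoenix.tpch.orders&lt;/code&gt; table) could let each regionserver compute partial counts and sums in its coprocessor, sending back only those partial results instead of the raw rows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT custkey, count(*), sum(totalprice)
FROM phoenix.tpch.orders
GROUP BY custkey
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;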

&lt;p&gt;Another area of potential improvement is integration with Presto’s 
&lt;a href=&quot;https://www.starburstdata.com/technical-blog/introduction-to-presto-cost-based-optimizer/&quot;&gt;cost-based optimizer&lt;/a&gt;,
which can analyze table statistics to do things like join reordering. Phoenix already supports
&lt;a href=&quot;https://phoenix.apache.org/update_statistics.html&quot;&gt;statistics collection&lt;/a&gt;, with more improvements
underway, so this is just a matter of integrating with the Presto statistics framework.&lt;/p&gt;
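&lt;p&gt;Once that integration lands, the statistics Phoenix collects could be inspected from Presto with the existing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SHOW STATS&lt;/code&gt; command, for example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SHOW STATS FOR phoenix.tpch.customer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;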

&lt;h1 id=&quot;questions&quot;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;If you have any questions about the connector, or Phoenix in general, feel free to ask on the
Phoenix dev mailing list: &lt;a href=&quot;mailto:dev@phoenix.apache.org&quot;&gt;dev@phoenix.apache.org&lt;/a&gt;.&lt;/p&gt;</content>

      
        <author>
          <name>Vincent Poon</name>
        </author>
      

      <summary>Presto 312 introduces a new Apache Phoenix Connector, which allows Presto to query data stored in HBase using Apache Phoenix. This unlocks new capabilities that previously weren’t possible with Phoenix alone, such as federation (querying of multiple Phoenix clusters) and joining Phoenix data with data from other Presto data sources.</summary>

      
      
    </entry>
  
    <entry>
      <title>Removing redundant ORDER BY</title>
      <link href="https://trino.io/blog/2019/06/03/redundant-order-by.html" rel="alternate" type="text/html" title="Removing redundant ORDER BY" />
      <published>2019-06-03T00:00:00+00:00</published>
      <updated>2019-06-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/06/03/redundant-order-by</id>
      <content type="html" xml:base="https://trino.io/blog/2019/06/03/redundant-order-by.html">&lt;p&gt;Optimizers are all about doing work in the most cost-effective manner and avoiding unnecessary work.
Some SQL constructs such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; do not affect query results in many situations, and can negatively
affect performance unless the optimizer is &lt;em&gt;smart enough&lt;/em&gt; to remove them.&lt;/p&gt;

&lt;p&gt;Until very recently, Presto would insert a sorting step for each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause in a query. This, combined
with users and tools inadvertently using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; in places that have no effect, could result in severe
performance degradation and waste of resources. We finally fixed this in
&lt;a href=&quot;https://trino.io/docs/current/release/release-312.html&quot;&gt;Presto 312&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Quoting from the SQL specification (ISO 9075 Part 2):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;query expression&amp;gt;&lt;/code&gt; can contain an optional &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;order by clause&amp;gt;&lt;/code&gt;. The ordering of the rows of the table
 specified by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;query expression&amp;gt;&lt;/code&gt; is guaranteed only for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;query expression&amp;gt;&lt;/code&gt; that immediately 
 contains the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;order by clause&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that a query engine is free to ignore any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause that doesn’t fit that context. Let’s consider
some examples where the clause is irrelevant.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;some_table&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;another_table&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;field&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;While this query has the semblance of creating a sorted table, that’s not so. Tables in SQL are inherently
unordered. Once the data is written, there’s no guarantee it will come out sorted when read. This is 
particularly true for a parallel, distributed query engine like Presto that reads and processes data using
many threads simultaneously. Note that some storage engines may store data sorted, but that is not controlled
during data insertion. Executing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; just causes the query to perform poorly due to reduced 
parallelism in the merging step of a distributed sort, and consumes more CPU and memory to sort the data.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;some_table&lt;/span&gt; 
   &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;another_table&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; 
   &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;some_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;key&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;key&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case, whether the tables involved in the join are sorted doesn’t matter, since Presto is going to 
build a hash lookup table out of one of them to execute the join operation. As in the previous example,
preserving the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; just causes the query to perform poorly.&lt;/p&gt;

&lt;p&gt;When &lt;em&gt;does&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; matter? Since it is “guaranteed only for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;query expression&amp;gt;&lt;/code&gt; that immediately 
contains the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;order by clause&amp;gt;&lt;/code&gt;”, only operations that are part of the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;query expression&amp;gt;&lt;/code&gt; are 
sensitive to it.&lt;/p&gt;

&lt;p&gt;A query expression is a block with the following structure:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;query expression&amp;gt; ::=
  [ &amp;lt;with clause&amp;gt; ] 
  &amp;lt;query expression body&amp;gt;
  [ &amp;lt;order by clause&amp;gt; ] 
  [ &amp;lt;result offset clause&amp;gt; ] 
  [ &amp;lt;fetch first clause&amp;gt; ]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;query expression body&amp;gt;&lt;/code&gt; resolves to one of the set operations (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNION&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INTERSECT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXCEPT&lt;/code&gt;), 
a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; construct, or a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALUES&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TABLE&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;The only operations that occur after an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST&lt;/code&gt; (a.k.a., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt;) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt;. So, 
unless a subquery contains one of these two clauses, the query engine is free to remove the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; 
clause without breaking the semantics dictated by the specification.&lt;/p&gt;

&lt;p&gt;Here’s an example where the clause is meaningful:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;some_table&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;field&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;another_table&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Other databases tackle this in a variety of ways. &lt;a href=&quot;https://mariadb.com/kb/en/library/why-is-order-by-in-a-from-subquery-ignored/&quot;&gt;MariaDB&lt;/a&gt;
and &lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.remove.orderby.in.subquery&quot;&gt;Hive 3.0&lt;/a&gt;
will ignore redundant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clauses. SQL Server, on the other hand, will produce an error:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table
expressions, unless TOP or FOR XML is also specified.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;whats-the-catch&quot;&gt;What’s the catch?&lt;/h2&gt;

&lt;p&gt;It is a common mistake for users to think the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause is meaningful regardless of where it 
appears in a query. The fact that, for implementation reasons, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; has been significant for Presto in some of these cases 
complicates matters. We often see users rely on this when formulating queries where aggregation or window functions 
are sensitive to the order of their inputs:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;array_agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nation&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row_number&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nation&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Right Way™ of doing this in SQL is to use the aggregation or window-specific &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause. For the
examples above:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;array_agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row_number&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In order to ease the transition, the new behavior can be turned off globally via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;optimizer.skip-redundant-sort&lt;/code&gt;
configuration option or on a per-session basis via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skip_redundant_sort&lt;/code&gt; session property. 
These options will be removed in a future version.&lt;/p&gt;
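&lt;p&gt;For example, to restore the old behavior for the current session only:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SET SESSION skip_redundant_sort = false;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;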

&lt;p&gt;Additionally, any time Presto detects a redundant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ORDER BY&lt;/code&gt; clause, it will warn users about it:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/redundant-order-by/redundant-order-by.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso</name>
        </author>
      

      <summary>Optimizers are all about doing work in the most cost-effective manner and avoiding unnecessary work. Some SQL constructs such as ORDER BY do not affect query results in many situations, and can negatively affect performance unless the optimizer is smart enough to remove them.</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 313</title>
      <link href="https://trino.io/blog/2019/06/01/release-313.html" rel="alternate" type="text/html" title="Release 313" />
      <published>2019-06-01T00:00:00+00:00</published>
      <updated>2019-06-01T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/06/01/release-313</id>
      <content type="html" xml:base="https://trino.io/blog/2019/06/01/release-313.html">&lt;p&gt;This version fixes incorrect results for queries involving &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUPING SETS&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LIMIT&lt;/code&gt;, fixes selecting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UUID&lt;/code&gt; type from the CLI and JDBC driver,
and adds support for compression and encryption when using
&lt;a href=&quot;https://trino.io/docs/current/admin/spill.html&quot;&gt;Spill to Disk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-313.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version fixes incorrect results for queries involving GROUPING SETS and LIMIT, fixes selecting the UUID type from the CLI and JDBC driver, and adds support for compression and encryption when using Spill to Disk. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Using Precomputed Hash in SemiJoin Operations</title>
      <link href="https://trino.io/blog/2019/05/30/semijoin-precomputed-hasd.html" rel="alternate" type="text/html" title="Using Precomputed Hash in SemiJoin Operations" />
      <published>2019-05-30T00:00:00+00:00</published>
      <updated>2019-05-30T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/30/semijoin-precomputed-hasd</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/30/semijoin-precomputed-hasd.html">&lt;p&gt;Queries involving &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT IN&lt;/code&gt; over a subquery are much faster in 
&lt;a href=&quot;https://trino.io/docs/current/release/release-312.html&quot;&gt;Presto 312&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/semijoin-precomputed-hash/semijoin-precomputed-hash-gains.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;We ran the benchmark above with 3 workers (r3.2xlarge) and 1 coordinator (r3.xlarge) on 
TPC-DS scale 1000 stored in ORC format using the following queries:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;store_sales&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;store_sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_customer_sk&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c_customer_sk&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;store_sales&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;store_sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_store_sk&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s_store_sk&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;store&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s_hours&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;8AM-4PM&apos;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;what-was-the-improvement&quot;&gt;What was the improvement?&lt;/h1&gt;

&lt;p&gt;We found that the optimization to use precomputed hashes, which is enabled by 
default, was missing in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SemiJoin&lt;/code&gt; operator.  Hash values were precomputed at the leaf 
stages, but they were not being used in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SemiJoin&lt;/code&gt; operator, leading to re-calculation 
of the hash values at this operator. Since queries involving &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT IN&lt;/code&gt; over a 
subquery use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SemiJoin&lt;/code&gt; operator, &lt;a href=&quot;https://github.com/trinodb/trino/pull/767&quot;&gt;the fix to use the precomputed hash in the SemiJoin operator&lt;/a&gt; 
improves the performance of such queries significantly.&lt;/p&gt;

&lt;h1 id=&quot;how-does-optimize-hash-generation-optimization-work&quot;&gt;How does the &lt;em&gt;optimize-hash-generation&lt;/em&gt; optimization work?&lt;/h1&gt;

&lt;p&gt;Presto divides a query plan into parts called stages, which can run in parallel on 
multiple nodes, with each node working on a different set of data. There are two types of stages:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Leaf stages: the stages at the leaves of the query plan, which read 
data from a data source, such as a Hive table.&lt;/li&gt;
  &lt;li&gt;Intermediate stages: all other stages, which process 
data from upstream stages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Exchange&lt;/code&gt; operator shuffles and transfers the output of upstream stages to the
intermediate stages. For certain operators, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt;, the output of
the leaf stage is partitioned by the values of a column, and the shuffle ensures
that a given partition is always processed by the same task of the intermediate stage.
This partitioning requires computing a hash of that column’s values during the exchange,
and the same hash is needed again in the intermediate stage when the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt;
or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt; is executed. To avoid redundant calculations, Presto computes this hash
in the leaf stage, uses it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Exchange&lt;/code&gt; operator, and includes it in the output so that the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt; in the intermediate stage can reuse it.&lt;/p&gt;
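The reuse can be sketched in a few lines of Python. This is an illustrative model, not Presto code; the stage functions, the task count, and Python's built-in hash standing in for the engine's hash function are all made up for the example:

```python
NUM_TASKS = 4  # hypothetical number of intermediate-stage tasks

def leaf_stage(rows):
    # Compute the hash of the partitioning column once, alongside each row.
    return [(row, hash(row["city"])) for row in rows]

def exchange(rows_with_hash):
    # Route each row to a task using the precomputed hash; rows with the
    # same city always land in the same partition.
    partitions = {task: [] for task in range(NUM_TASKS)}
    for row, h in rows_with_hash:
        partitions[h % NUM_TASKS].append((row, h))
    return partitions

def final_aggregation(partition):
    # Reuse the same precomputed hash as part of the group lookup key,
    # avoiding a second hash computation per row.
    counts = {}
    for row, h in partition:
        key = (h, row["city"])
        counts[key] = counts.get(key, 0) + 1
    return {city: n for (_, city), n in counts.items()}

rows = [{"city": c} for c in ["Oslo", "Pune", "Oslo", "Pune", "Lima"]]
partitions = exchange(leaf_stage(rows))
result = {}
for partition in partitions.values():
    result.update(final_aggregation(partition))
# result == {"Oslo": 2, "Pune": 2, "Lima": 1}
```

Each row's hash is computed exactly once, in the leaf stage, and both the shuffle and the aggregation consume it.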

&lt;p&gt;Consider this query to count the number of stores per city:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;city&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stores&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;city&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The simplified query plan and its division into stages look like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/semijoin-precomputed-hash/query-plan.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The leaf stage (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Stage2&lt;/code&gt;) reads the table from a data source, feeds the partially 
aggregated data to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Stage1&lt;/code&gt; where final aggregation happens, and finally, the result is available 
via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Stage0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each row produced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Stage2&lt;/code&gt; needs to be partitioned by the value of its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;city&lt;/code&gt; column to ensure
that data for the same city is processed by the same task of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Stage1&lt;/code&gt;. After the exchange, when a row is consumed
in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Stage1&lt;/code&gt;, it needs to be hashed again to find the group for the row, so that the final aggregation
accumulates results for each city in its corresponding group bucket. Hashing the
values of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;city&lt;/code&gt; column twice is avoided by doing this calculation once while reading the data and then
using it in both the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Exchange&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Final Aggregation&lt;/code&gt; operations, which reduces the CPU usage of the query.
Additionally, pushing this calculation into the leaf stage, which is better parallelized when there is
a large number of splits, improves query latency.&lt;/p&gt;

&lt;h1 id=&quot;how-to-get-this-fix&quot;&gt;How to get this fix?&lt;/h1&gt;

&lt;p&gt;This fix is available in Presto version 312 and above. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;optimize-hash-generation&lt;/code&gt; setting is enabled
by default, so the fix takes effect as soon as you upgrade your Presto installation.&lt;/p&gt;</content>

      
        <author>
          <name>Shubham Tagra, Qubole</name>
        </author>
      

      <summary>Queries involving IN and NOT IN over a subquery are much faster in Presto 312.</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 312</title>
      <link href="https://trino.io/blog/2019/05/29/release-312.html" rel="alternate" type="text/html" title="Release 312" />
      <published>2019-05-29T00:00:00+00:00</published>
      <updated>2019-05-29T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/29/release-312</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/29/release-312.html">&lt;p&gt;This version has many performance improvements (including
&lt;a href=&quot;/blog/2019/05/21/optimizing-the-casts-away.html&quot;&gt;cast optimization&lt;/a&gt;),
a new &lt;a href=&quot;https://trino.io/docs/current/language/types.html#uuid-type&quot;&gt;UUID&lt;/a&gt; data type
and &lt;a href=&quot;https://trino.io/docs/current/functions/uuid.html#uuid&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uuid()&lt;/code&gt;&lt;/a&gt; function,
a new &lt;a href=&quot;https://trino.io/docs/current/connector/phoenix.html&quot;&gt;Apache Phoenix connector&lt;/a&gt;,
support for the PostgreSQL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TIMESTAMP WITH TIME ZONE&lt;/code&gt; data type,
support for the MySQL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JSON&lt;/code&gt; data type,
&lt;a href=&quot;/blog/2019/05/29/improved-hive-bucketing.html&quot;&gt;improved support for Hive bucketed tables&lt;/a&gt;,
and some bug fixes.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-312.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version has many performance improvements (including cast optimization), a new UUID data type and uuid() function, a new Apache Phoenix connector, support for the PostgreSQL TIMESTAMP WITH TIME ZONE data type, support for the MySQL JSON data type, improved support for Hive bucketed tables, and some bug fixes. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Improved Hive Bucketing</title>
      <link href="https://trino.io/blog/2019/05/29/improved-hive-bucketing.html" rel="alternate" type="text/html" title="Improved Hive Bucketing" />
      <published>2019-05-29T00:00:00+00:00</published>
      <updated>2019-05-29T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/29/improved-hive-bucketing</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/29/improved-hive-bucketing.html">&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-312.html&quot;&gt;Presto 312&lt;/a&gt;
adds support for the more flexible bucketing introduced in recent
versions of Hive. Specifically, it allows any number of files per bucket,
including zero. This allows inserting data into an existing partition without
having to rewrite the entire partition, and improves the performance of
writes by not requiring the creation of files for empty buckets.&lt;/p&gt;

&lt;h1 id=&quot;hive-bucketing-overview&quot;&gt;Hive bucketing overview&lt;/h1&gt;

&lt;p&gt;Hive bucketing is a simple form of hash partitioning. A table is bucketed
on one or more columns with a fixed number of hash buckets. For example,
a table definition in Presto syntax looks like this:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;page_views&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;page_url&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;partitioned_by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ARRAY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;dt&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;bucketed_by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ARRAY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;bucket_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The bucketing happens within each partition of the table (or across the entire
table if it is not partitioned). In the above example, the table is partitioned
by date and is declared to have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;50&lt;/code&gt; buckets using the user ID column. This
means that the table will have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;50&lt;/code&gt; buckets for each date. The assigned bucket
for each row is determined by hashing the user ID value, so all rows with the
same user ID go into the same bucket.&lt;/p&gt;
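The assignment rule can be sketched in a line of Python; Python's built-in hash stands in here for Hive's actual bucketing hash function, which is different:

```python
BUCKET_COUNT = 50  # matches the table definition above

def bucket_for(user_id):
    # A row's bucket is the hash of the bucketing column modulo the bucket
    # count, so equal user IDs always land in the same bucket.
    return hash(user_id) % BUCKET_COUNT

assert bucket_for(12345) == bucket_for(12345)   # deterministic
assert bucket_for(12345) in range(BUCKET_COUNT)  # always a valid bucket
```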

&lt;h1 id=&quot;original-hive-bucketing&quot;&gt;Original Hive bucketing&lt;/h1&gt;

&lt;p&gt;Originally, Hive required exactly one file per bucket. The files were named
such that the bucket number was implicit based on the file’s position within
the lexicographic ordering of the file names. For example, each of the following
lists of files represents buckets &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt;, respectively:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;00000_0
00001_0
00002_0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;file0
file3
file5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;bucketA
bucketB
bucketD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The file names are meaningless aside from their ordering with respect to the
other file names.&lt;/p&gt;
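In other words, recovering the bucket numbers only requires sorting the names. A small Python sketch of the positional scheme:

```python
def implicit_buckets(file_names):
    # Under the original scheme, a file's bucket number is simply its
    # position in the lexicographic ordering of the directory listing.
    return {name: bucket for bucket, name in enumerate(sorted(file_names))}

# Each of the example listings above maps to buckets 0, 1, and 2:
assert implicit_buckets(["file3", "file0", "file5"]) == {
    "file0": 0, "file3": 1, "file5": 2,
}
assert implicit_buckets(["bucketD", "bucketA", "bucketB"])["bucketD"] == 2
```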

&lt;h1 id=&quot;whats-the-problem&quot;&gt;What’s the problem?&lt;/h1&gt;

&lt;p&gt;The original Hive bucketing scheme has a couple of problems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Inserting data into the table by adding additional files is not possible.
Instead, an insert operation requires rewriting all of the existing files,
which can be quite expensive.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the data is sparse, some of the buckets might be empty, but because there
must be a file for every bucket, the writer must create an empty file for
each bucket. Some file formats, such as ORC, support zero-byte files as empty
files. Other formats require writing a file with a valid header and footer.
Creating these files adds latency to the write operation, and storing these
tiny files is inefficient for file systems like HDFS which are designed for
large files.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;improved-hive-bucketing&quot;&gt;Improved Hive bucketing&lt;/h1&gt;

&lt;p&gt;Newer versions of Hive support a bucketing scheme where the bucket number is
included in the file name. This is the same naming scheme that Hive has always
used, so it is backward compatible with existing data. The naming convention
puts the bucket number at the start of the file name, zero-padded with leading
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt;s so that the names still sort in bucket order.&lt;/p&gt;

&lt;p&gt;The following list of files shows what data written by Hive might look like for
a table with a bucket count of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;000000_0            # bucket 0
000000_0_copy_1     # bucket 0
000000_0_copy_2     # bucket 0
000001_0            # bucket 1
000001_0_copy_1     # bucket 1
000003_0            # bucket 3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can see that there are multiple files for buckets &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt;, one file for
bucket &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3&lt;/code&gt;, and no files for bucket &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unfortunately, Presto used a different naming convention that was valid
according to the lexicographical ordering requirement, but not the newer
explicit numbering convention. File names written by Presto used to look
like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;20180102_030405_00641_x1y2z_bucket-00234
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;20180102_030405_00641_x1y2z&lt;/code&gt; value at the start of the file name
is the Presto query ID for the query that wrote the data. This is followed
by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bucket-&lt;/code&gt; plus the padded bucket number. Presto now writes file names
that match the new Hive naming convention, with the bucket number at the
start and the query ID at the end:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;000234_0_20180102_030405_00641_x1y2z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When reading bucketed tables, Presto supports both the new Hive convention
and the old Presto convention. Additionally, it still supports the original
Hive scheme when the files do not match either of the naming conventions,
keeping the requirement that there must be exactly one file per bucket.&lt;/p&gt;
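A simplified Python sketch of recognizing the two explicit conventions; the regular expressions and function name here are illustrative, and Presto's actual parsing logic is more involved:

```python
import re

# New Hive convention: zero-padded bucket number first, e.g. 000234_0...
HIVE_NAME = re.compile(r"^(\d+)_\d+")
# Old Presto convention: query ID first, then "bucket-" and the number.
OLD_PRESTO_NAME = re.compile(r"bucket-(\d+)$")

def bucket_number(file_name):
    match = OLD_PRESTO_NAME.search(file_name)
    if match:
        return int(match.group(1))
    match = HIVE_NAME.match(file_name)
    if match:
        return int(match.group(1))
    # Neither convention matched: fall back to the original positional
    # scheme, which requires exactly one file per bucket.
    return None

assert bucket_number("000234_0_20180102_030405_00641_x1y2z") == 234
assert bucket_number("20180102_030405_00641_x1y2z_bucket-00234") == 234
assert bucket_number("000001_0_copy_1") == 1
```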

&lt;h1 id=&quot;skipping-empty-buckets-for-faster-writes&quot;&gt;Skipping empty buckets for faster writes&lt;/h1&gt;

&lt;p&gt;Now that Hive and Presto no longer require files for empty buckets, Presto
does not need to create them. They are still created by default for
compatibility with earlier versions of Hive, Presto, and other tools, but
we expect to disable this behavior in a future release, making writes faster by default.
You may also choose to disable it now if that works for your environment.
This is controlled by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hive.create-empty-bucket-files&lt;/code&gt; configuration
property or the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_empty_bucket_files&lt;/code&gt; session property.&lt;/p&gt;</content>

      
        <author>
          <name>David Phillips</name>
        </author>
      

      <summary>Presto 312 adds support for the more flexible bucketing introduced in recent versions of Hive. Specifically, it allows any number of files per bucket, including zero. This allows inserting data into an existing partition without having to rewrite the entire partition, and improves the performance of writes by not requiring the creation of files for empty buckets.</summary>

      
      
    </entry>
  
    <entry>
      <title>Optimizing the Casts Away</title>
      <link href="https://trino.io/blog/2019/05/21/optimizing-the-casts-away.html" rel="alternate" type="text/html" title="Optimizing the Casts Away" />
      <published>2019-05-21T00:00:00+00:00</published>
      <updated>2019-05-21T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/21/optimizing-the-casts-away</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/21/optimizing-the-casts-away.html">&lt;p&gt;The next release of Presto (version 312) will include a new optimization to remove unnecessary casts 
which might have been added implicitly by the query planner or explicitly by users when they wrote the query.&lt;/p&gt;

&lt;p&gt;This is a long post explaining how the optimization works. If you’re only interested in the results,
skip to the &lt;a href=&quot;#results&quot;&gt;last section&lt;/a&gt;. For the full details, read on!&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS_CHTML&quot;&gt;
&lt;/script&gt;

&lt;div style=&quot;display:none&quot;&gt;
$$ 
\newcommand\cast[2]{
    \text{cast}_{\text{#1} \rightarrow \text{#2}}
} 
\newcommand\trueOrNull[1]{
  \text{if}(#1 \text{ is null}, \text{null}, \text{true})
} 
\newcommand\falseOrNull[1]{
  \text{if}(#1 \text{ is null}, \text{null}, \text{false})
} 
$$
&lt;/div&gt;

&lt;p&gt;Like many programming languages, SQL allows certain operations between values of different 
types if there are implicit conversions (a.k.a., implicit casts or coercions) between those types.
This improves usability, as it allows writing expressions like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1.5 &amp;gt; 2&lt;/code&gt; without worrying &lt;em&gt;too much&lt;/em&gt;
about whether the types are compatible (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1.5&lt;/code&gt; is of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(2,1)&lt;/code&gt;, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt; is an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;During query analysis and planning, Presto introduces explicit casts for any implicit conversion in the
original query as it translates it into the intermediate query plan representation the engine uses 
internally for optimization and execution. This eliminates a layer of complexity for the optimizer, 
which, as a result, doesn’t need to reason about types (type inference) or worry about whether expressions 
are properly typed.&lt;/p&gt;

&lt;p&gt;More importantly, it simplifies the job of defining and implementing operators (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt;, etc). 
Without implicit conversions, there would need to exist a variant of every operator for every combination
 of compatible types. For example, it would be necessary to have an implementation of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt; operator for 
 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(tinyint, tinyint)&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(tinyint, smallint)&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(tinyint, integer)&lt;/code&gt;, 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(tinyint, bigint)&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(smallint, integer)&lt;/code&gt;, and so on.&lt;/p&gt;

&lt;p&gt;Given two columns, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s :: tinyint&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t :: smallint&lt;/code&gt;, and an expression such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s = t&lt;/code&gt;, the planner 
determines that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; can be implicitly coerced to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt; and derives the following expression:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS smallint) = t   
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is not without challenges. The predicate pushdown logic relies on simple equality and 
range comparisons to move predicates around, and importantly, to infer that certain predicates
in one branch of a join can be used to constrain the values on the other side of the join. An
expression like the one above is not “simple” from this perspective due to the type conversion 
involved, and it can defeat the (arguably simplistic) predicate inference algorithm.&lt;/p&gt;

&lt;p&gt;Secondly, if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is a constant (or an expression that is effectively constant), the engine has to 
convert every value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s&lt;/code&gt; it sees during query execution in order to compare it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt;. This 
brings up the obvious question: “can’t it somehow convert &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; and compare directly”?
It would look like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s = CAST(t AS tinyint)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is a constant, the term &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CAST(t AS tinyint)&lt;/code&gt; can be trivially pre-computed and reused 
for the entire query. It’s not that simple in the general case, though. A narrowing cast, such
as a conversion from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt;, or from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt;, can fail or alter
the value due to rounding or truncation, so we must take special care to avoid errors or 
change query semantics. We discuss this at length in the sections below.&lt;/p&gt;
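The core safety check can be sketched in Python, with a range standing in for tinyint. The function name and the FALSE shortcut are illustrative only; the real planner also has to preserve null semantics and handles range comparisons separately:

```python
TINYINT_RANGE = range(-128, 128)  # tinyint values are -128..127

def rewrite_equality(constant):
    # Rewrite CAST(s AS smallint) = constant into a predicate on s itself.
    if constant in TINYINT_RANGE:
        # Safe: the constant round-trips through tinyint unchanged.
        return f"s = TINYINT '{constant}'"
    # No tinyint value widens to this constant, so the comparison can
    # never be true; null handling is omitted for brevity.
    return "FALSE"

assert rewrite_equality(127) == "s = TINYINT '127'"
assert rewrite_equality(1000) == "FALSE"
```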

&lt;h1 id=&quot;some-properties-of-well-behaved-implicit-casts&quot;&gt;Some properties of (well-behaved) implicit casts&lt;/h1&gt;

&lt;p&gt;Let’s take a short detour and talk briefly about some properties of well-behaved implicit 
casts we can exploit to do the transformation we described in the previous section.&lt;/p&gt;

&lt;p&gt;Since the query engine is free to insert implicit casts wherever it sees fit, these functions
need to follow some ground rules. Failure to do so can result in queries producing incorrect
results due to changes in query semantics.&lt;/p&gt;

&lt;p&gt;Implicit casts need to have the following properties:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Injective_function&quot;&gt;Injective&lt;/a&gt;. Given \(\cast{S}{T}\) every value in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; 
must map to a distinct value in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; (this does not imply that every value in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; has to map to a value 
in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt;, though).&lt;/li&gt;
  &lt;li&gt;Order-preserving. Given \(s_1 \in S\) and \(s_2 \in S\),&lt;/li&gt;
&lt;/ul&gt;

\[\begin{equation}
s_1 = s_2 \quad \Rightarrow \quad \cast{S}{T}(s_1) = \cast{S}{T}(s_2) \\
s_1 &amp;lt; s_2 \quad \Rightarrow \quad \cast{S}{T}(s_1) &amp;lt; \cast{S}{T}(s_2) \\
s_1 &amp;gt; s_2 \quad \Rightarrow \quad \cast{S}{T}(s_1) &amp;gt; \cast{S}{T}(s_2)
\end{equation}\]

&lt;p&gt;For exact numeric types (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal&lt;/code&gt;, etc.), this holds as long as 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; has enough integer digits to hold the integral part of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; and enough fractional digits to 
hold the fractional part of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt;.&lt;/p&gt;
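These properties are easy to spot-check. In this sketch, Python ints stand in for tinyint and Python's Decimal for a wider decimal(4,1) type:

```python
from decimal import Decimal

def widen(s):
    # tinyint value to decimal(4,1): always exact, since decimal(4,1) has
    # enough integer digits for -128..127 and the fractional part is zero.
    return Decimal(s).quantize(Decimal("0.1"))

widened = [widen(s) for s in range(-128, 128)]

# Injective: distinct inputs map to distinct outputs.
assert len(set(widened)) == len(widened)
# Order-preserving: the widened sequence is still strictly increasing.
assert widened == sorted(widened)
```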

&lt;p&gt;As an example, the picture below depicts how every value of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt;, which has a range
of \([-128, 127]\), maps to a distinct value of a wider type such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt;. Also, every value 
of the wider type that is within the range of representable values of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; has a distinct 
mapping to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt;. So, for the values within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; range, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt;
conversion is &lt;a href=&quot;https://en.wikipedia.org/wiki/Bijection&quot;&gt;bijective&lt;/a&gt;. This is not necessary for the 
transformation to work, but it simplifies one of the cases we’ll consider. We’ll cover this more later.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/optimizing-casts/tinyint-integer.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, some conversions such as those between integer types and decimal types with fractional
parts are injective but not bijective, even when excluding the values outside the range of the narrower
 type.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/optimizing-casts/tinyint-decimal.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The properties clearly hold for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bigint&lt;/code&gt;. They also hold for:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(3,0)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(4,1)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(5,2)&lt;/code&gt; → …&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(5,0)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(6,1)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(7,2)&lt;/code&gt; → …&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(10,0)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(11,1)&lt;/code&gt; → …&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bigint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(19,0)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(20, 1)&lt;/code&gt; → …&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It even works for conversions between exact and approximate numbers, such as:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does &lt;em&gt;not&lt;/em&gt; work for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bigint&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt; when precision is large
because not all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bigint&lt;/code&gt;s fit in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt; (64 bits vs 53-bit mantissa) and not all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer&lt;/code&gt;s fit in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; 
(32 bits vs a 24-bit mantissa). Sadly, for legacy reasons Presto allows those conversions implicitly. We “justify” 
it with the argument that “users are dealing with approximate numerics anyway, and since the conversions only 
lose precision in the least significant digits, they are sort of ok”. This is something we’ll revisit in the
future once we have a reasonable story for handling the inherent break in backward compatibility that
removing such conversions would entail.&lt;/p&gt;

&lt;p&gt;Finally, the properties also apply for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;varchar&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;varchar&lt;/code&gt; conversions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;varchar(0)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;varchar(1)&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;varchar(2)&lt;/code&gt; → … → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;varchar&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;getting-to-the-point&quot;&gt;Getting to the point…&lt;/h1&gt;

&lt;p&gt;With this in mind, let’s look at the simplest scenario: conversions between integer types.&lt;/p&gt;

&lt;p&gt;As in the example we covered in the introduction, the transformation is straightforward 
when the constant can be represented in the narrower type. Given &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s :: tinyint&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS smallint) = smallint &apos;1&apos;     ⟺  s = tinyint &apos;1&apos;
CAST(s AS smallint) = smallint &apos;127&apos;   ⟺  s = tinyint &apos;127&apos;
CAST(s AS smallint) = smallint &apos;-128&apos;  ⟺  s = tinyint &apos;-128&apos;

CAST(s AS smallint) &amp;gt; smallint &apos;10&apos;    ⟺  s &amp;gt; tinyint &apos;10&apos;
CAST(s AS smallint) &amp;lt; smallint &apos;10&apos;    ⟺  s &amp;lt; tinyint &apos;10&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course, when the value is at the edge of the range of the narrower type, we can cleverly 
turn some inequalities into equalities:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS smallint) &amp;gt;= smallint &apos;127&apos;   ⟺  s &amp;gt;= tinyint &apos;127&apos;  
                                        ⟺  s =  tinyint &apos;127&apos;
                                       
CAST(s AS smallint) &amp;lt;= smallint &apos;-128&apos;  ⟺  s &amp;lt;= tinyint &apos;-128&apos;  
                                        ⟺  s =  tinyint &apos;-128&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
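&lt;p&gt;As a quick sanity check, these edge-of-range collapses can be verified exhaustively over the &lt;code&gt;tinyint&lt;/code&gt; domain. Plain Python stands in for the SQL semantics here; this is not engine code:&lt;/p&gt;

```python
# Exhaustive check that the edge-of-range inequalities collapse to
# equalities over the full tinyint domain [-128, 127].
TINYINT = range(-128, 128)

assert all((s >= 127) == (s == 127) for s in TINYINT)
assert all((s <= -128) == (s == -128) for s in TINYINT)
```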

&lt;p&gt;Additionally, we may be able to tell that an expression is always &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;. Special
care needs to be taken when the value is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;, though, since in SQL any comparison with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt; 
yields &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS smallint) &amp;gt; smallint &apos;127&apos;    ⟺  s &amp;gt; tinyint &apos;127&apos;  
                                        ⟺  if(s is null, null, false)
                                        
CAST(s AS smallint) &amp;lt;= smallint &apos;127&apos;   ⟺  s &amp;lt;= tinyint &apos;127&apos;  
                                        ⟺  if(s is null, null, true)

CAST(s AS smallint) &amp;lt; smallint &apos;-128&apos;   ⟺  s &amp;lt; tinyint &apos;-128&apos;  
                                        ⟺  if(s is null, null, false)
                                        
CAST(s AS smallint) &amp;gt;= smallint &apos;-128&apos;  ⟺  s &amp;gt;= tinyint &apos;-128&apos;  
                                        ⟺  if(s is null, null, true)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can make similar inferences when the value is outside the range of possible values
for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt;. For equality comparisons, it’s trivial.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS smallint) = smallint &apos;1000&apos;  ⟺  if(s is null, null, false)    
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Conversely,&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS smallint) &amp;lt;&amp;gt; smallint &apos;1000&apos;  ⟺  if(s is null, null, true)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Just like the earlier cases involving comparisons with values at the edge of the range,
we can apply the same idea when the value falls outside of the range:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS smallint) &amp;lt; smallint &apos;1000&apos;   ⟺  if(s is null, null, true) 
CAST(s AS smallint) &amp;lt; smallint &apos;-1000&apos;  ⟺  if(s is null, null, false)

CAST(s AS smallint) &amp;gt; smallint &apos;1000&apos;   ⟺  if(s is null, null, false) 
CAST(s AS smallint) &amp;gt; smallint &apos;-1000&apos;  ⟺  if(s is null, null, true)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
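&lt;p&gt;The same exhaustive style confirms the out-of-range reductions: for every non-null &lt;code&gt;tinyint&lt;/code&gt; value the comparison is a constant, and only the null case requires the &lt;code&gt;if(...)&lt;/code&gt; guard. Again, plain Python standing in for the SQL semantics:&lt;/p&gt;

```python
# For constants outside the tinyint range, every comparison is constant
# over the whole domain; only the null case needs the if(...) guard.
TINYINT = range(-128, 128)

assert all(s < 1000 for s in TINYINT)        # always true
assert not any(s < -1000 for s in TINYINT)   # always false
assert not any(s > 1000 for s in TINYINT)    # always false
assert all(s > -1000 for s in TINYINT)       # always true
assert not any(s == 1000 for s in TINYINT)   # equality is always false
```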

&lt;h1 id=&quot;unrepresentable-values&quot;&gt;Unrepresentable values&lt;/h1&gt;

&lt;p&gt;Values that are outside the range of the narrower type may not be the only ones without a mapping. 
For example, for a type such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decimal(2,1)&lt;/code&gt;, any value with a fractional part (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1.5&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2.3&lt;/code&gt;) cannot 
be represented as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We can tell whether a value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; is representable in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; by converting it to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; and back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt;. We’ll 
call this value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&apos;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t &amp;lt;&amp;gt; t&apos;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is not representable in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt;, and rules similar to those for out-of-range values apply when the 
expression involves an equality. For example, given &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s :: tinyint&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS double) =  double &apos;1.1&apos;  ⟺  if(s is null, null, false)    
CAST(s AS double) &amp;lt;&amp;gt; double &apos;1.1&apos;  ⟺  if(s is null, null, true)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
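&lt;p&gt;The round-trip test is easy to express in code. Here is a minimal sketch, with Python&apos;s &lt;code&gt;int()&lt;/code&gt; standing in for a hypothetical &lt;code&gt;double&lt;/code&gt; → &lt;code&gt;tinyint&lt;/code&gt; cast:&lt;/p&gt;

```python
def representable_in_tinyint(t: float) -> bool:
    """Round-trip t through the narrower type and compare: t -> s' -> t'."""
    if not -128.0 <= t <= 127.0:   # the cast to tinyint would fail outright
        return False
    s_prime = int(t)               # T -> S (int() stands in for the cast)
    t_prime = float(s_prime)       # S -> T, i.e. t'
    return t_prime == t            # t <> t' means t is not representable

assert representable_in_tinyint(1.0)
assert not representable_in_tinyint(1.1)     # no tinyint maps to 1.1
assert not representable_in_tinyint(1000.0)  # out of range entirely
```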

&lt;p&gt;When some values in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; are not representable in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt;, the cast between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T → S&lt;/code&gt; will generally either truncate
or round. The SQL specification doesn’t mandate which of those alternatives an implementation should follow,
and even allows that to vary for conversions between various combinations of types.&lt;/p&gt;

&lt;p&gt;This throws a bit of a wrench in our plans, so to speak. If we can’t tell whether a cast will round or truncate,
how would we know whether a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt; comparison should turn into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;=&lt;/code&gt; in the resulting expression? To 
illustrate, let’s consider this example. Given &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s :: tinyint&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CAST(s AS double) &amp;gt; double &apos;1.9&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the conversion from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tinyint&lt;/code&gt; truncates, the expression above is equivalent to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s &amp;gt; tinyint &apos;1&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the other hand, if the conversion rounds, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1.9&lt;/code&gt; becomes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt;, and the expression is equivalent to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s &amp;gt;= tinyint &apos;2&apos;              
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In order to know which operator to use in the transformed expression (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt; vs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;=&lt;/code&gt;), it is therefore 
crucial to distinguish between those two behaviors. The good news is that there’s a simple and elegant way
out of this hole.&lt;/p&gt;

&lt;p&gt;An important observation is that we don’t need to know how the conversion behaves &lt;em&gt;in general&lt;/em&gt;, but only how 
it behaves when applied to the constant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt;. Regardless of whether the conversion truncates or rounds, for a 
given value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt;, the outcome can be seen to &lt;em&gt;round up&lt;/em&gt; or &lt;em&gt;round down&lt;/em&gt;, as depicted below.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src=&quot;/assets/blog/optimizing-casts/round-down.svg&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td&gt;&lt;img src=&quot;/assets/blog/optimizing-casts/round-up.svg&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;We can easily tell which of those scenarios applies by comparing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&apos;&lt;/code&gt;: if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t &amp;gt; t&apos;&lt;/code&gt;, the operation rounded
down. Conversely, if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t &amp;lt; t&apos;&lt;/code&gt;, it rounded up. If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t = t&apos;&lt;/code&gt;, the value is representable in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt;, and the rules from the 
previous section apply.&lt;/p&gt;
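&lt;p&gt;This observation translates directly into code. The sketch below (a hypothetical rewrite for &lt;code&gt;&amp;gt;&lt;/code&gt; only) works whether the engine&apos;s &lt;code&gt;double&lt;/code&gt; → &lt;code&gt;tinyint&lt;/code&gt; cast truncates or rounds, because it only inspects the round trip of the single constant &lt;code&gt;t&lt;/code&gt;:&lt;/p&gt;

```python
import math

def rewrite_gt(t: float, cast_to_tinyint) -> tuple:
    """Rewrite CAST(s AS double) > t into (operator, constant) on s :: tinyint."""
    s_prime = cast_to_tinyint(t)   # T -> S, however the cast behaves
    t_prime = float(s_prime)       # S -> T, i.e. t'
    if t < t_prime:                # the cast rounded up
        return ('>=', s_prime)
    return ('>', s_prime)          # rounded down, or t was representable

# The 1.9 example from the text, under both possible cast behaviors:
assert rewrite_gt(1.9, math.trunc) == ('>', 1)   # truncating cast
assert rewrite_gt(1.9, round) == ('>=', 2)       # rounding cast
```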

&lt;h1 id=&quot;oh-the-nullability&quot;&gt;Oh, the nullability&lt;/h1&gt;

&lt;p&gt;Let’s take another quick detour and talk about the issue of nullability. After all, no discussion about
SQL is complete without an exploration of the semantics of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;SQL uses &lt;a href=&quot;https://en.wikipedia.org/wiki/Three-valued_logic#Application_in_SQL&quot;&gt;three-valued logic&lt;/a&gt;. In addition
to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;, logical expressions can evaluate to an &lt;em&gt;unknown&lt;/em&gt; value, which is indicated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;.
Logical operations &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt; behave according to the following rules:&lt;/p&gt;

\[\begin{array}{|c|c|c|c|}
\hline
\text{A} &amp;amp; \text{B} &amp;amp; \text{A and B} &amp;amp; \text{A or B} \\ 
\hline
\text{true}&amp;amp; \text{null} &amp;amp; \text{null} &amp;amp; \text{true} \\ 
\hline
\text{false}&amp;amp; \text{null} &amp;amp; \text{false} &amp;amp; \text{null} \\ 
\hline
\end{array}\]

&lt;p&gt;The logical comparison operators =, &amp;lt;&amp;gt;, &amp;gt;, ≥, &amp;lt;, ≤ evaluate to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt; when one or both operands are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;.
Hence, if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;, our expression &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cast(s as smallint) = t&lt;/code&gt; can be simply replaced with a constant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As we mentioned in the previous section, there are cases where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cast(s as smallint) = t&lt;/code&gt; can be reduced to 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;, &lt;em&gt;except&lt;/em&gt; for the fact that if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s&lt;/code&gt; is null, the expression needs to return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt; to preserve
semantics. So, we use the following forms to capture this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;if(s IS null, null, false)
if(s IS null, null, true)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The catch is that the optimizer does not understand the semantics of these &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if&lt;/code&gt; expressions and cannot 
use them to derive additional properties. In essence, they become an optimization barrier. On the other hand,
the optimizer is pretty good at manipulating logical conjunctions (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt;) and disjunctions (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt;). So, let’s see 
how we can use boolean logic to obtain an equivalent formulation.&lt;/p&gt;

&lt;p&gt;We can exploit the properties of SQL boolean logic to derive expressions that behave in the same manner as the 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if()&lt;/code&gt; constructs from above:&lt;/p&gt;

\[\begin{align}
    \text{if}(s \text{ is null}, \text{null}, \text{false}) &amp;amp; \iff (s \text{ is null}) \text{ and null} \\
    \text{if}(s \text{ is null}, \text{null}, \text{true})  &amp;amp; \iff (s \text{ is not null}) \text{ or null} \\
\end{align}\]

&lt;p&gt;Let’s break it down to see why that works.&lt;/p&gt;

\[\begin{align}         
   \text{if}(s \text{ is null}, \text{null}, \text{false}) &amp;amp; = (s \text{ is null}) \text{ and null} \\ 
      &amp;amp; = \begin{cases}
             \text{true and null}  &amp;amp; = \text{null},   &amp;amp; \text{if } s \text{ is null} \\
             \text{false and null} &amp;amp; = \text{false},  &amp;amp; \text{if } s \text{ is not null} 
          \end{cases} \\[5pt]
   \text{if}(s \text{ is null}, \text{null}, \text{true})  &amp;amp; = (s \text{ is not null}) \text{ or null} \\
      &amp;amp; = \begin{cases}
              \text{false or null}  &amp;amp; = \text{null},   &amp;amp; \text{if } s \text{ is null} \\
              \text{true or null}   &amp;amp; = \text{true},   &amp;amp; \text{if } s \text{ is not null} 
           \end{cases}
\end{align}\]
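&lt;p&gt;We can replay this derivation mechanically. In the sketch below, &lt;code&gt;None&lt;/code&gt; stands in for &lt;code&gt;null&lt;/code&gt;, and two small functions model SQL&apos;s three-valued &lt;code&gt;AND&lt;/code&gt; and &lt;code&gt;OR&lt;/code&gt;:&lt;/p&gt;

```python
def tv_and(a, b):
    """SQL three-valued AND, with None standing in for null."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def tv_or(a, b):
    """SQL three-valued OR, with None standing in for null."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def if_s_is_null(s, otherwise):
    """The if(s IS null, null, otherwise) construct."""
    return None if s is None else otherwise

# The two equivalences hold for null and non-null s alike:
for s in (None, 0, 42):
    assert tv_and(s is None, None) == if_s_is_null(s, False)
    assert tv_or(s is not None, None) == if_s_is_null(s, True)
```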

&lt;h1 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h1&gt;

&lt;p&gt;Now that we’ve had a taste of how this optimization works, let’s put it all together into one rule to rule
them all.&lt;/p&gt;

&lt;p&gt;Given an expression of the following form,&lt;/p&gt;

\[\cast{S}{T}(s) \otimes t \quad s \in S, t \in T, \otimes \in \{=, \ne, &amp;lt;, \le, &amp;gt;, \ge\}\]

&lt;p&gt;we derive a transformation based on the rules below.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;If \(t \text{ is null} \Rightarrow \cast{S}{T}(s) \otimes t \iff \text{null} \tag{1}\) \(\\[5pt]\)&lt;/li&gt;
  &lt;li&gt;If \(\exists s&apos; \in S \ldotp s&apos; = \cast{T}{S}(t)\), we calculate \(t&apos; = \cast{S}{T}(s&apos;)\) and consider 
the following cases:
    &lt;ol&gt;
      &lt;li&gt;&lt;a name=&quot;2.1&quot;&gt;&lt;/a&gt; If \(t = t&apos; \Rightarrow \cast{S}{T}(s) \otimes t \iff s \otimes \cast{T}{S}(t) \tag{2.1}\) \(\\[5pt]\)
        &lt;ul&gt;
          &lt;li&gt;&lt;a name=&quot;2.1.1&quot;&gt;&lt;/a&gt; In the special case where \(\\[5pt]\) \(\quad  s&apos; = \text{min}_S  \Rightarrow   
 \left\{
  \begin{array}{@{}ll@{}}
 \cast{S}{T}(s) &amp;gt; t   &amp;amp; \iff s \ne \text{min}_{S}     \\
 \cast{S}{T}(s) \ge t &amp;amp; \iff \trueOrNull{s}           \\
 \cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff \falseOrNull{s}          \\
 \cast{S}{T}(s) \le t &amp;amp; \iff s = \text{min}_{S}
  \end{array}\right. \tag{2.1.1}  \\[5pt]\)&lt;/li&gt;
          &lt;li&gt;&lt;a name=&quot;2.1.2&quot;&gt;&lt;/a&gt; In the special case where \(\\[5pt]\) \(\quad s&apos; = \text{max}_S  \Rightarrow 
 \left\{
  \begin{array}{@{}ll@{}}
\cast{S}{T}(s) &amp;gt; t   &amp;amp; \iff \falseOrNull{s}        \\
\cast{S}{T}(s) \ge t &amp;amp; \iff s = \text{max}_{S}     \\
\cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff s \ne \text{max}_{S}   \\
\cast{S}{T}(s) \le t &amp;amp; \iff \trueOrNull{s}
  \end{array}\right. \tag{2.1.2} \\[5pt]\)&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Otherwise, \(\\[5pt]\) \(\quad  t \ne t&apos; \Rightarrow 
 \left\{
  \begin{array}{@{}ll@{}}
   \cast{S}{T}(s) = t   &amp;amp; \iff \falseOrNull{s}        \\
   \cast{S}{T}(s) \ne t &amp;amp; \iff \trueOrNull{s}            
  \end{array}\right. \tag{2.2} \\[5pt]\)&lt;/p&gt;

        &lt;ul&gt;
          &lt;li&gt;
            &lt;p&gt;Further, if \(\\[5pt]\) \(\quad \quad  t &amp;lt; t&apos; \Rightarrow 
 \left\{
  \begin{array}{@{}ll@{}}
\cast{S}{T}(s) &amp;gt; t   &amp;amp; \iff s \ge \cast{T}{S}(t)    \\
\cast{S}{T}(s) \ge t &amp;amp; \iff s \ge \cast{T}{S}(t)    \\
\cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff s &amp;lt;  \cast{T}{S}(t)     \\
\cast{S}{T}(s) \le t &amp;amp; \iff s &amp;lt;  \cast{T}{S}(t)
  \end{array}\right. \tag{2.2.1} \\[5pt]\)&lt;br /&gt;
 In the special case where \(\\[5pt]\) \(\quad \quad s&apos; = \text{max}_S  \Rightarrow  
 \left\{
  \begin{array}{@{}ll@{}}
\cast{S}{T}(s) &amp;gt; t   &amp;amp; \iff s = \text{max}_{S}    \\
\cast{S}{T}(s) \ge t &amp;amp; \iff s = \text{max}_{S}    \\
  \end{array}\right. \\[5pt] \tag{2.2.1.1}\)&lt;/p&gt;
          &lt;/li&gt;
          &lt;li&gt;
            &lt;p&gt;Otherwise, if \(\\[5pt]\) \(\quad \quad  t &amp;gt; t&apos; \Rightarrow
 \left\{
  \begin{array}{@{}ll@{}}
\cast{S}{T}(s) &amp;gt; t   &amp;amp; \iff s &amp;gt;    \cast{T}{S}(t)    \\
\cast{S}{T}(s) \ge t &amp;amp; \iff s &amp;gt;    \cast{T}{S}(t)    \\
\cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff s \le  \cast{T}{S}(t)    \\
\cast{S}{T}(s) \le t &amp;amp; \iff s \le  \cast{T}{S}(t)
  \end{array}\right. \\[5pt] \tag{2.2.2}\)&lt;br /&gt;
 In the special case where \(\\[5pt]\) \(\quad \quad s&apos; = \text{min}_S  \Rightarrow  
  \left\{
  \begin{array}{@{}ll@{}}
\cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff s = \text{min}_{S}    \\
\cast{S}{T}(s) \le t &amp;amp; \iff s = \text{min}_{S}
 \end{array}\right. \\[5pt] \tag{2.2.2.1}\)&lt;/p&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;If \(\cast{T}{S}\) is undefined or \(\cast{T}{S}(t)\) fails, \(\\[5pt]\) \(t &amp;lt; \cast{S}{T}(\text{min}_S) \Rightarrow  
  \left\{
 \begin{array}{@{}ll@{}}
         \cast{S}{T}(s) =   t &amp;amp; \iff \falseOrNull{s}    \\
         \cast{S}{T}(s) \ne t &amp;amp; \iff \trueOrNull{s}     \\
         \cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff \falseOrNull{s}    \\
         \cast{S}{T}(s) \le t &amp;amp; \iff \falseOrNull{s}    \\
         \cast{S}{T}(s) &amp;gt;   t &amp;amp; \iff \trueOrNull{s}     \\
         \cast{S}{T}(s) \ge t &amp;amp; \iff \trueOrNull{s}     
\end{array}\right. \\[5pt] \tag{3.1}\)
\(t = \cast{S}{T}(\text{min}_S) \Rightarrow  
  \left\{
 \begin{array}{@{}ll@{}}
         \cast{S}{T}(s) =   t &amp;amp; \iff s = \text{min}_S       \\
         \cast{S}{T}(s) \ne t &amp;amp; \iff s &amp;gt; \text{min}_S       \\
         \cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff \falseOrNull{s}        \\
         \cast{S}{T}(s) \le t &amp;amp; \iff s = \text{min}_S       \\
         \cast{S}{T}(s) &amp;gt;   t &amp;amp; \iff s &amp;gt; \text{min}_S       \\
         \cast{S}{T}(s) \ge t &amp;amp; \iff \trueOrNull{s}     
\end{array}\right. \\[5pt] \tag{3.2}\)
\(t &amp;gt; \cast{S}{T}(\text{max}_S) \Rightarrow  
  \left\{
    \begin{array}{@{}ll@{}}
            \cast{S}{T}(s) =   t &amp;amp; \iff \falseOrNull{s}    \\
            \cast{S}{T}(s) \ne t &amp;amp; \iff \trueOrNull{s}     \\
            \cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff \trueOrNull{s}     \\
            \cast{S}{T}(s) \le t &amp;amp; \iff \trueOrNull{s}     \\
            \cast{S}{T}(s) &amp;gt;   t &amp;amp; \iff \falseOrNull{s}    \\
            \cast{S}{T}(s) \ge t &amp;amp; \iff \falseOrNull{s}    
   \end{array}\right. \\[5pt] \tag{3.3}\)
\(t = \cast{S}{T}(\text{max}_S) \Rightarrow  
 \left\{
   \begin{array}{@{}ll@{}}
           \cast{S}{T}(s) =   t &amp;amp; \iff s = \text{max}_S   \\
           \cast{S}{T}(s) \ne t &amp;amp; \iff s &amp;lt; \text{max}_S   \\
           \cast{S}{T}(s) &amp;lt;   t &amp;amp; \iff s &amp;lt; \text{max}_S   \\
           \cast{S}{T}(s) \le t &amp;amp; \iff \trueOrNull{s}     \\
           \cast{S}{T}(s) &amp;gt;   t &amp;amp; \iff \falseOrNull{s}    \\
           \cast{S}{T}(s) \ge t &amp;amp; \iff s = \text{max}_S       
  \end{array}\right. \\[5pt] \tag{3.4}\) &lt;br /&gt;
 Otherwise, the transformation is not applicable.&lt;/li&gt;
&lt;/ol&gt;
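&lt;p&gt;To make the rule concrete, here is a simplified sketch in Python covering the integer-source case (&lt;code&gt;s :: tinyint&lt;/code&gt;, &lt;code&gt;T = double&lt;/code&gt;), assuming a truncating cast and omitting the null guard. It is brute-force checked against direct evaluation; it is an illustration, not production code:&lt;/p&gt;

```python
import math

TINY_MIN, TINY_MAX = -128, 127
OPS = {'=':  lambda a, b: a == b, '<>': lambda a, b: a != b,
       '<':  lambda a, b: a < b,  '<=': lambda a, b: a <= b,
       '>':  lambda a, b: a > b,  '>=': lambda a, b: a >= b}

def rewrite(op, t):
    """Rewrite CAST(s AS double) <op> t into a predicate on s alone."""
    if t < TINY_MIN or t > TINY_MAX:
        # Rule 3: every tinyint lies on the same side of t, so the
        # comparison is constant; 0 serves as a representative value.
        truth = OPS[op](0.0, t)
        return lambda s: truth
    s_prime = math.trunc(t)       # CAST(t AS tinyint), i.e. s'
    t_prime = float(s_prime)      # the round trip, t'
    if t == t_prime:              # rule 2.1: t is representable
        return lambda s: OPS[op](s, s_prime)
    if op == '=':                 # rule 2.2: no s maps exactly to t
        return lambda s: False
    if op == '<>':
        return lambda s: True
    if t < t_prime:               # rule 2.2.1: the cast rounded up
        if op in ('>', '>='):
            return lambda s: s >= s_prime
        return lambda s: s < s_prime
    if op in ('>', '>='):         # rule 2.2.2: the cast rounded down
        return lambda s: s > s_prime
    return lambda s: s <= s_prime

# Brute-force equivalence check over the whole tinyint domain:
for t in (1.0, 1.9, -1.9, 64.5, 127.0, -128.0, 1000.0, -1000.0):
    for op in OPS:
        for s in range(TINY_MIN, TINY_MAX + 1):
            assert rewrite(op, t)(s) == OPS[op](float(s), t), (op, t, s)
```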

&lt;h1 id=&quot;omgwtfnan&quot;&gt;OMGWTFNaN&lt;/h1&gt;

&lt;p&gt;As if all of this weren’t enough, there’s an additional complication we need to handle for types such
as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt;. Those types are what the SQL specification calls &lt;em&gt;approximate numeric&lt;/em&gt; types.
Presto implements them as &lt;a href=&quot;https://en.wikipedia.org/wiki/IEEE_754&quot;&gt;IEEE-754&lt;/a&gt; single and double 
precision floating point numbers, respectively.&lt;/p&gt;

&lt;p&gt;In addition to finite numbers, IEEE-754 defines an additional set of values: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;∞&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt; (not a number).
It is worth noting that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-∞&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+∞&lt;/code&gt; do not behave like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;∞&lt;/code&gt; in the mathematical sense. They are actual values
in the ordered set of numbers, but they don’t represent any finite number. Therefore, the following relations hold:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-∞ &amp;lt; -1.23E30 &amp;lt; 0 &amp;lt; 3.45E25 &amp;lt; +∞
-∞ = -∞
+∞ = +∞ 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-∞&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+∞&lt;/code&gt; can be treated as regular values, we can use them as the minimum and maximum values of the range
for these types. Any other choice would not work, since all values of a type must be contained within the range of the type
for the transformation to be valid. That is,&lt;/p&gt;

\[\forall v \in T \quad T_{\text{min}} \le v \le T_{\text{max}}\]

&lt;p&gt;Let’s look at an example to understand why this is necessary. Instead of using \([-∞, ∞]\) as the range, 
let’s say we picked the minimum and maximum representable values for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; type (-3.4028235E38 and 3.4028235E38), and
consider this expression (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s :: real&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cast(s AS double) &amp;gt;= double &apos;3.4028235E38&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From the rules in the previous section, \(t = 3.4028235\text{E}38\), \(s&apos; = 3.4028235\text{E}38\) and \(t&apos; = 3.4028235\text{E}38\). Since 
\(t = t&apos;\) and \(s&apos; = \text{max}_S\), from &lt;a href=&quot;#2.1.2&quot;&gt;rule 2.1.2&lt;/a&gt;, the expression reduces to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s = 3.4028235E38 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is clearly incorrect. When &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s = Infinity&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cast(s AS double)&lt;/code&gt; results in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double &apos;Infinity&apos;&lt;/code&gt;, which is not equal
to 3.4028235E38.&lt;/p&gt;
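&lt;p&gt;A couple of lines of Python (whose floats are also IEEE-754 doubles) make the failure concrete:&lt;/p&gt;

```python
import math

s = math.inf                         # s :: real is Infinity
assert float(s) >= 3.4028235e38      # the original predicate is true...
assert not (s == 3.4028235e38)       # ...but the "reduced" form is false
```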

&lt;p&gt;On the other hand, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt; doesn’t obey any of the comparison rules. It is neither equal to nor distinct from itself, and
it is neither larger nor smaller than any other value:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;NaN =  NaN  ⟺  false  
NaN &amp;lt;&amp;gt; NaN  ⟺  false
NaN &amp;gt; 0     ⟺  false
NaN = 0     ⟺  false
NaN &amp;lt; 0     ⟺  false
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt; is not part of the ordered set of values for these types, and the requirement that every value be contained 
in the range doesn’t hold. From &lt;a href=&quot;#2.1.1&quot;&gt;rule 2.1.1&lt;/a&gt;, an expression such as:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cast(s AS double) &amp;gt;= double &apos;-Infinity&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;reduces to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if(s is null, null, true)&lt;/code&gt;, which is incorrect, since the expression returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt; when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Is all hope lost for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;double&lt;/code&gt;? Fortunately, not. The range is only needed as an optimization. If we
forgo defining a range for types that don’t have the required properties, the special cases &lt;a href=&quot;#2.1.1&quot;&gt;2.1.1&lt;/a&gt; and 
&lt;a href=&quot;#2.1.2&quot;&gt;2.1.2&lt;/a&gt; don’t apply, and by &lt;a href=&quot;#2.1&quot;&gt;rule 2.1&lt;/a&gt;, the expression is equivalent to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s &amp;gt;= real &apos;-Infinity&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which correctly returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt; when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt;.&lt;/p&gt;
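&lt;p&gt;Again, this is easy to confirm with IEEE-754 doubles in Python:&lt;/p&gt;

```python
import math

s = math.nan
# Reducing the predicate to `true` would be wrong for NaN; the unreduced
# comparison against -Infinity correctly evaluates to false instead.
assert not (float(s) >= -math.inf)   # cast(s AS double) >= -Infinity
assert not (s >= -math.inf)          # s >= real '-Infinity' agrees
```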

&lt;h1 id=&quot;-show-me-the-money&quot;&gt;&lt;a name=&quot;results&quot;&gt;&lt;/a&gt; Show me the money!&lt;/h1&gt;

&lt;p&gt;So, does all of this even matter? Why, yes! Glad you asked.&lt;/p&gt;

&lt;p&gt;As with any performance optimization, you can improve things by working smarter (avoiding work that can be 
proven unnecessary) or by working harder (doing the work you must do more efficiently). This
optimization does a little of both. Let’s consider three scenarios where it has a positive effect.&lt;/p&gt;

&lt;h4 id=&quot;dead-code&quot;&gt;Dead code&lt;/h4&gt;

&lt;p&gt;Since in some cases the optimizer can prove that a comparison will always produce &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;, regardless of the input,
it can short-circuit entire conditions or subplans before a single row of data is read. Some query-generation 
tools are not very sophisticated and may emit queries containing such constructs. Also, everyone makes
mistakes, and it’s not hard to end up with queries that contain what’s effectively &lt;em&gt;dead code&lt;/em&gt;.  The last thing you
want is to sit in front of the screen waiting for a query to complete … waiting … waiting … just for Presto
to tell you &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;¯\_(ツ)_/¯&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, given:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;smallint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- &amp;lt;insert lots of rows into t&amp;gt; --&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000000&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Produces the following query plan (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Values&lt;/code&gt; is an empty inline table):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- Output[x]
  - Values
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
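The reasoning behind this plan can be sketched outside of SQL: a smallint can never exceed its maximum value of 32767, so a comparison against 1000000 is provably false for every row. A small Python illustration (not the optimizer's actual code):

```python
# Range of Trino's smallint (16-bit signed integer).
SMALLINT_MIN, SMALLINT_MAX = -(2 ** 15), 2 ** 15 - 1

literal = 1000000

# Because the literal lies above the entire smallint range, the filter
# "x > 1000000" cannot match any row, and the table scan can be removed.
always_false = literal > SMALLINT_MAX
print(always_false)  # True
```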

&lt;h4 id=&quot;improved-join-performance&quot;&gt;Improved JOIN performance&lt;/h4&gt;

&lt;p&gt;What’s nice about this optimization is that it &lt;em&gt;enables&lt;/em&gt; other optimizations to work better. We mentioned earlier
that comparisons that are not simple expressions between columns, or between columns and constants, make it harder for the
predicate pushdown optimization to infer predicates that can be propagated to the other branch of a join.&lt;/p&gt;

&lt;p&gt;Given two tables:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;smallint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t2&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the following query:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t2&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;1&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The query plan without this optimization is:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- Output[name]
  - InnerJoin[expr = v]
    - ScanFilterProject[t1, filter = CAST(v AS bigint) = BIGINT &apos;1&apos;]
        expr := CAST(v AS bigint)
    - TableScan[t2]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The optimization allows the predicate pushdown logic to apply the condition to the other side of the join, producing
a much better plan. If data in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t1&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t2&lt;/code&gt; is somehow organized by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;v&lt;/code&gt; (e.g., a partition key in Hive), or if the
connector understands how to apply the filter at the source, the query won’t even need to read certain parts of the
table. The query plan with the optimization enabled:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- Output[name]
  - CrossJoin
    - ScanFilterProject[t1, filter = (v = SMALLINT &apos;1&apos;)]
    - ScanFilterProject[t2, filter = (v = BIGINT &apos;1&apos;)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
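The inference at work here can be summarized as transitivity: given the join condition t1.v = t2.v and the filter t1.v = 1, the planner can derive t2.v = 1 and apply it to the other side. A simplified Python model of this idea (not Trino's planner code):

```python
def infer_pushdown(join_pairs, filters):
    """Derive filters for join partners: join_pairs is a list of
    (left_column, right_column) equi-join pairs; filters maps a column
    to a constant it is compared against with equality."""
    derived = {}
    for left, right in join_pairs:
        if left in filters:
            derived[right] = filters[left]
        if right in filters:
            derived[left] = filters[right]
    return derived

# t1.v = t2.v combined with t1.v = 1 lets us push t2.v = 1 to the other side.
print(infer_pushdown([("t1.v", "t2.v")], {"t1.v": 1}))  # {'t2.v': 1}
```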

&lt;h4 id=&quot;best-bang-for-the-buck&quot;&gt;Best bang for the buck&lt;/h4&gt;

&lt;p&gt;Finally, if the condition absolutely needs to be evaluated, the transformed expression could be significantly
more efficient, especially when the cast between the two types is expensive. To illustrate, given a table
with 1 billion rows and a column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k :: bigint&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count_if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;decimal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;19&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Without the optimization:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- [...]
    - ScanProject
===&amp;gt;    CPU: 3.75m (66.34%), Scheduled: 5.56m (145.22%)
        expr := (CAST(&quot;k&quot; AS decimal(19,0)) &amp;gt; CAST(DECIMAL &apos;0&apos; AS decimal(19,0)))
        
        
Query 20190515_072240_00006_rgzb4, FINISHED, 4 nodes
Splits: 110 total, 110 done (100.00%)
0:22 [1000M rows, 8.4GB] [46M rows/s, 395MB/s]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the optimization:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- [...]
    - ScanProject
===&amp;gt;    CPU: 29.93s (58.17%), Scheduled: 47.44s (145.07%)
        expr := (&quot;k&quot; &amp;gt; BIGINT &apos;0&apos;)
        
        
Query 20190515_071912_00005_bz6cb, FINISHED, 4 nodes
Splits: 110 total, 110 done (100.00%)
0:03 [1000M rows, 8.4GB] [335M rows/s, 2.81GB/s]        
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
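The two plans compute the same result; the rewrite simply narrows the literal to the column's type once, at planning time, instead of widening every row. A rough Python model of the equivalence (a sketch, assuming values stay within the bigint range):

```python
from decimal import Decimal

def eval_with_cast(k):
    # Unoptimized: widen k to decimal(19,0) for every row.
    return Decimal(k) > Decimal(0)

def eval_unwrapped(k):
    # Optimized: compare natively as bigint.
    return k > 0

# The two predicates agree for all bigint values of k.
print(all(eval_with_cast(k) == eval_unwrapped(k)
          for k in (-(2 ** 63), -1, 0, 1, 2 ** 63 - 1)))  # True
```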

&lt;p&gt;Thirsty for more? Here’s the &lt;a href=&quot;https://github.com/trinodb/trino/blob/master/presto-main/src/main/java/io/prestosql/sql/planner/iterative/rule/UnwrapCastInComparison.java&quot;&gt;code&lt;/a&gt;. 
Happy querying!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Many thanks to &lt;a href=&quot;https://github.com/kasiafi&quot;&gt;kasiafi&lt;/a&gt; for their thoughtful and thorough feedback on early
drafts of this post.&lt;/em&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso</name>
        </author>
      

      <summary>The next release of Presto (version 312) will include a new optimization to remove unnecessary casts which might have been added implicitly by the query planner or explicitly by users when they wrote the query.</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Summit 2019 @TwitterSF</title>
      <link href="https://trino.io/blog/2019/05/17/Presto-Summit.html" rel="alternate" type="text/html" title="Presto Summit 2019 @TwitterSF" />
      <published>2019-05-17T00:00:00+00:00</published>
      <updated>2019-05-17T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/17/Presto-Summit</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/17/Presto-Summit.html">&lt;p&gt;Next month will mark the 2nd annual Presto Summit hosted by the
&lt;a href=&quot;https://trino.io/foundation.html&quot;&gt;Presto Software Foundation&lt;/a&gt;,
&lt;a href=&quot;https://starburstdata.com&quot;&gt;Starburst Data&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com&quot;&gt;Twitter&lt;/a&gt;. Last year’s event was
a great success (see the
&lt;a href=&quot;https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/&quot;&gt;Presto Summit 2018 recap&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Please join the community of Presto users and developers for an all-day event dedicated to the world’s fastest 
distributed SQL query engine. At the Summit we’ll share the latest on Presto and learn how some of the most 
innovative companies are using this technology to power their analytics platforms.&lt;/p&gt;

&lt;p&gt;The agenda will feature talks from some of the world’s largest and most innovative Presto users:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Comcast&lt;/li&gt;
  &lt;li&gt;Twitter&lt;/li&gt;
  &lt;li&gt;Nordstrom&lt;/li&gt;
  &lt;li&gt;Grubhub&lt;/li&gt;
  &lt;li&gt;Lyft&lt;/li&gt;
  &lt;li&gt;Netflix&lt;/li&gt;
  &lt;li&gt;LinkedIn&lt;/li&gt;
  &lt;li&gt;Criteo&lt;/li&gt;
  &lt;li&gt;Starburst&lt;/li&gt;
  &lt;li&gt;Presto Software Foundation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(the details will be announced soon)&lt;/p&gt;

&lt;p&gt;If you wish to speak at the event, the call for papers is still open:
&lt;a href=&quot;https://www.starburstdata.com/2019-presto-summit-speaker-registration/&quot;&gt;2019 Presto Summit – Speaker Registration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please RSVP to secure your spot (space is limited):
&lt;a href=&quot;https://prestosummit.splashthat.com/&quot;&gt;Presto Summit 2019 @TwitterSF&lt;/a&gt;&lt;/p&gt;</content>

      
        <author>
          <name>Kamil Bajda-Pawlikowski</name>
        </author>
      

      <summary>Next month will mark the 2nd annual Presto Summit hosted by the Presto Software Foundation, Starburst Data, and Twitter. Last year’s event was a great success (see the Presto Summit 2018 recap).</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 311</title>
      <link href="https://trino.io/blog/2019/05/15/release-311.html" rel="alternate" type="text/html" title="Release 311" />
      <published>2019-05-15T00:00:00+00:00</published>
      <updated>2019-05-15T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/15/release-311</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/15/release-311.html">&lt;p&gt;This version adds standard
&lt;a href=&quot;https://trino.io/docs/current/sql/select.html#offset-clause&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OFFSET&lt;/code&gt;&lt;/a&gt;
syntax, a new function
&lt;a href=&quot;https://trino.io/docs/current/functions/array.html#combinations&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;combinations()&lt;/code&gt;&lt;/a&gt;
for computing k-combinations of array elements,
and support for nested collections in Cassandra.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-311.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version adds standard OFFSET syntax, a new function combinations() for computing k-combinations of array elements, and support for nested collections in Cassandra. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Community Meeting 2019-05-08</title>
      <link href="https://trino.io/blog/2019/05/08/Presto-Community-Meeting.html" rel="alternate" type="text/html" title="Presto Community Meeting 2019-05-08" />
      <published>2019-05-08T00:00:00+00:00</published>
      <updated>2019-05-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/08/Presto-Community-Meeting</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/08/Presto-Community-Meeting.html">&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/FL0O62iCkE8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h3 id=&quot;agenda&quot;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Existing function support&lt;/li&gt;
  &lt;li&gt;Function namespaces&lt;/li&gt;
  &lt;li&gt;Connector-resolved functions&lt;/li&gt;
  &lt;li&gt;SQL-defined functions&lt;/li&gt;
  &lt;li&gt;Remote functions&lt;/li&gt;
  &lt;li&gt;Polymorphic table functions&lt;/li&gt;
&lt;/ul&gt;

&lt;!--more--&gt;</content>

      

      <summary>Agenda Existing function support Function namespaces Connector-resolved functions SQL-defined functions Remote functions Polymorphic table functions</summary>

      
      
    </entry>
  
    <entry>
      <title>Faster S3 Reads</title>
      <link href="https://trino.io/blog/2019/05/06/faster-s3-reads.html" rel="alternate" type="text/html" title="Faster S3 Reads" />
      <published>2019-05-06T00:00:00+00:00</published>
      <updated>2019-05-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/06/faster-s3-reads</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/06/faster-s3-reads.html">&lt;p&gt;Presto is known for working well with Amazon S3. We recently made an
improvement that greatly reduces network utilization and latency when
reading ORC or Parquet data.&lt;/p&gt;

&lt;h1 id=&quot;the-problem&quot;&gt;The problem&lt;/h1&gt;

&lt;p&gt;The improvement started with a question
from &lt;a href=&quot;https://github.com/bzillins&quot;&gt;Brenton Zillins&lt;/a&gt;
at &lt;a href=&quot;https://www.stackpath.com/&quot;&gt;Stackpath&lt;/a&gt;
on our &lt;a href=&quot;https://trino.io/slack.html&quot;&gt;Slack&lt;/a&gt; workspace. He noticed
that the network traffic to Presto workers was many times larger than the
amount of input data reported by Presto for the query.&lt;/p&gt;

&lt;p&gt;After a lively discussion on the Slack channel, we found the cause. Parquet
would perform a positioned read against the S3 file system to ask for an
exact byte range (start and end). However, the file system only implemented
the streaming API, so it would tell S3 about the starting location, but
not the end location. The file system would stop reading from the stream once
it reached the requested end location, but substantial additional data could
be read from S3 due to various buffers in different parts of the system.&lt;/p&gt;

&lt;p&gt;The streaming API has an additional problem. Establishing a new connection
to S3 incurs latency, especially when using secure connections over TLS.
There is no way to abort a streaming request to S3, other than by closing
the connection, so the file system is forced to close connections after
every request, thus preventing the connection from being reused.&lt;/p&gt;

&lt;h1 id=&quot;the-fix&quot;&gt;The fix&lt;/h1&gt;

&lt;p&gt;We solved this by implementing positioned reads in the S3 file system.
Positioned reads, which are the only kind used by ORC and Parquet, work by
asking S3 for the exact byte range required. These reads use the minimal
amount of network traffic and allow the connection to be reused.&lt;/p&gt;
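Conceptually, a positioned read maps onto an HTTP Range request, which lets S3 return exactly the requested bytes and keep the connection alive afterwards. A sketch of the idea in Python (a hypothetical helper, not the actual Presto file system code):

```python
def range_header(offset, length):
    # HTTP byte ranges are inclusive, so reading `length` bytes at
    # `offset` requests bytes offset .. offset + length - 1.
    return {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}

# A positioned read of 50 bytes starting at offset 100:
print(range_header(100, 50))  # {'Range': 'bytes=100-149'}
```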

&lt;p&gt;Brenton tested out the change and reported success:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This PR brought us from &amp;gt;1 GB/s object read rate to under 10 MB/s for
the same query. Thank you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this issue is obvious in retrospect, we are surprised that it took
so long to find it, given that S3 is one of the most popular storage systems.
This is a great example of how the community makes everything better.
Being observant and reporting an issue can be a huge win for everyone.&lt;/p&gt;

&lt;h1 id=&quot;how-to-get-it&quot;&gt;How to get it&lt;/h1&gt;

&lt;p&gt;This improvement is in &lt;a href=&quot;https://trino.io/download.html&quot;&gt;Presto 302+&lt;/a&gt;,
so you will need to upgrade if you are using an earlier version.&lt;/p&gt;</content>

      
        <author>
          <name>David Phillips</name>
        </author>
      

      <summary>Presto is known for working well with Amazon S3. We recently made an improvement that greatly reduces network utilization and latency when reading ORC or Parquet data.</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 310</title>
      <link href="https://trino.io/blog/2019/05/03/release-310.html" rel="alternate" type="text/html" title="Release 310" />
      <published>2019-05-03T00:00:00+00:00</published>
      <updated>2019-05-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/03/release-310</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/03/release-310.html">&lt;p&gt;This version adds standard
&lt;a href=&quot;https://trino.io/docs/current/sql/select.html#limit-or-fetch-first-clauses&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FETCH FIRST&lt;/code&gt;&lt;/a&gt;
syntax, support for using an
&lt;a href=&quot;https://trino.io/docs/current/connector/hive.html#s3-credentials&quot;&gt;alternate AWS role&lt;/a&gt;
when accessing S3 or Glue, and improved handling of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DECIMAL&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOUBLE&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REAL&lt;/code&gt;
when Hive table and partition metadata differ.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-310.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version adds standard FETCH FIRST syntax, support for using an alternate AWS role when accessing S3 or Glue, and improved handling of DECIMAL, DOUBLE, and REAL when Hive table and partition metadata differ. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>A review of the first international Presto Conference, Tel Aviv, April 2019</title>
      <link href="https://trino.io/blog/2019/05/03/Presto-Conference-Israel.html" rel="alternate" type="text/html" title="A review of the first international Presto Conference, Tel Aviv, April 2019" />
      <published>2019-05-03T00:00:00+00:00</published>
      <updated>2019-05-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/05/03/Presto-Conference-Israel</id>
      <content type="html" xml:base="https://trino.io/blog/2019/05/03/Presto-Conference-Israel.html">&lt;p&gt;&lt;strong&gt;Community&lt;/strong&gt;, &lt;em&gt;noun&lt;/em&gt;: “A feeling of fellowship with others, as a result of sharing common attributes, interests, and goals”&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/audience.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The fun picture you see here was taken at the first lecture of the first international
Presto Conference in Israel last month.&lt;/p&gt;

&lt;p&gt;The atmosphere in the room during the various presentations was unique. It’s as if you
could physically feel the brainpower of 250 engineers fascinated by technology in one room.&lt;/p&gt;

&lt;p&gt;We would like to share with you a bit of the content that was discussed during
the conference. Enjoy the read and the videos!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;presto-software-foundation-presentation&quot;&gt;Presto Software Foundation presentation&lt;/h1&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/intro.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The day started with &lt;a href=&quot;https://www.linkedin.com/in/dainsundstrom/&quot;&gt;Dain Sundstrom&lt;/a&gt;,
&lt;a href=&quot;https://www.linkedin.com/in/traversomartin/&quot;&gt;Martin Traverso&lt;/a&gt;, and
&lt;a href=&quot;https://www.linkedin.com/in/electrum/&quot;&gt;David Phillips&lt;/a&gt;, Presto founders
who gave us a great panoramic view on &lt;a href=&quot;https://trino.io/foundation.html&quot;&gt;Presto Software Foundation&lt;/a&gt;,
past, present, and future roadmap.&lt;/p&gt;

&lt;p&gt;The Presto founders presented in their talk the following topics:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Presto foundation creation&lt;/li&gt;
  &lt;li&gt;ORC improvements&lt;/li&gt;
  &lt;li&gt;The complex pushdown algorithm in details&lt;/li&gt;
  &lt;li&gt;The opensource roadmap strategy and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/pushdown.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can find the entire video of the presentation &lt;a href=&quot;https://vimeo.com/331764101&quot;&gt;here&lt;/a&gt; and the
slides &lt;a href=&quot;https://www.slideshare.net/OriReshef/presto-summit-israel-201904&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;varada-presentation&quot;&gt;Varada presentation&lt;/h1&gt;

&lt;p&gt;&lt;a href=&quot;https://www.linkedin.com/in/david-krakov/&quot;&gt;David Krakov&lt;/a&gt;, co-founder and CTO at &lt;a href=&quot;https://varada.io&quot;&gt;Varada&lt;/a&gt;,
explained how Varada is an example of how Presto can be leveraged to create innovative new technology that
allows interactive analytics on top of data sets extracted from data lakes, or in other words, Presto for apps.&lt;/p&gt;

&lt;p&gt;David presented the axes of innovation that the Varada team created to achieve indexed big
data on a distributed platform:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;SSD and NVMeF distributed calculation&lt;/li&gt;
  &lt;li&gt;All dimensions are indexed in the ingest process&lt;/li&gt;
  &lt;li&gt;Synchronization&lt;/li&gt;
  &lt;li&gt;Fully automated copy management directly connected to the raw data in the data lake.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/varada1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can find the video of the presentation &lt;a href=&quot;https://vimeo.com/331767154&quot;&gt;here&lt;/a&gt; and the slides
&lt;a href=&quot;https://www.slideshare.net/OriReshef/presto-for-apps-deck-varada-prestoconf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;wix-open-sourcing-quix&quot;&gt;WiX open sourcing Quix&lt;/h1&gt;

&lt;p&gt;The big announcement of the conference came from &lt;a href=&quot;https://www.linkedin.com/in/valeryfrolov/&quot;&gt;Valery Frolov&lt;/a&gt;
of &lt;a href=&quot;http://wix.com/&quot;&gt;Wix&lt;/a&gt;. As a web-scale data-driven company, with 150M users, Wix has more than 1000 users
of Presto, and over 100K daily queries.&lt;/p&gt;

&lt;p&gt;All those queries come through a unified front end for data discovery, transformation, and query: the Quix
IDE. Quix is simultaneously:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;A notebook manager for users to write and share executable notes&lt;/li&gt;
  &lt;li&gt;A dataset explorer showing catalogs and metadata&lt;/li&gt;
  &lt;li&gt;A feature-rich SQL query editor&lt;/li&gt;
  &lt;li&gt;A job scheduler for ETL jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wix has open-sourced most of Quix, available under an MIT license at &lt;a href=&quot;https://github.com/wix-incubator/quix&quot;&gt;https://github.com/wix-incubator/quix&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/wix.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As a Presto-centric company, Wix has developed a few more exciting enhancements:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;HBase + Parquet interleaving to mix compacted historic data and the latest 14 days of data&lt;/li&gt;
  &lt;li&gt;One SQL - a query rewriter that unifies usage of Presto and BigQuery to one SQL&lt;/li&gt;
  &lt;li&gt;ActiveDirectory data security layer to control access to data&lt;/li&gt;
  &lt;li&gt;Google Drive integration - run Presto SQL directly on Google Sheets. This is one of the coolest connectors
to be created and generated a lot of excitement. Can’t wait for Wix to open source this one as well!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See more in the &lt;a href=&quot;https://vimeo.com/331767442&quot;&gt;video&lt;/a&gt;,
&lt;a href=&quot;https://www.slideshare.net/OriReshef/quix-presto-ide-presto-summit-il&quot;&gt;slides&lt;/a&gt;,
&lt;a href=&quot;https://github.com/wix-incubator/quix&quot;&gt;source code&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;ironsource----analyzing-data-at-a-petabyte-scale&quot;&gt;Ironsource -  Analyzing data at a petabyte scale.&lt;/h1&gt;

&lt;p&gt;&lt;a href=&quot;https://www.ironsrc.com/&quot;&gt;Ironsource&lt;/a&gt; is the ad network of choice for the gaming industry, supplying
solutions for application developers, customer engagement, and ad monetization. Ironsource collects
terabytes of events on a daily basis.&lt;/p&gt;

&lt;p&gt;In his talk, &lt;a href=&quot;https://www.linkedin.com/in/korenor/&quot;&gt;Or Koren&lt;/a&gt;, head of the data team at Ironsource, shared
their journey from terabyte scale to petabyte scale. He showed how their entire interactive
analytics platform was rebuilt around Presto, and the huge savings that resulted, including new
business insights coming from their data science and data analyst teams.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src=&quot;/assets/blog/Israel-2019/ironsource1.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td&gt;&lt;img src=&quot;/assets/blog/Israel-2019/ironsource2.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The before and after slides that Or presented show very clearly the reduction in cost and the increase
in efficiency that the use of Presto brought to Ironsource.&lt;/p&gt;

&lt;p&gt;See Or’s slides &lt;a href=&quot;https://www.slideshare.net/OriReshef/data-analytics-at-a-petabyte-scale-final&quot;&gt;here&lt;/a&gt; and the
talk &lt;a href=&quot;https://vimeo.com/333732300&quot;&gt;video&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;datorama-on-mutable-data-at-scale&quot;&gt;Datorama on mutable data at scale&lt;/h1&gt;

&lt;p&gt;A charismatic presenter, &lt;a href=&quot;https://www.linkedin.com/in/afinkelstein/&quot;&gt;Alexey Finkelstein&lt;/a&gt; from
&lt;a href=&quot;https://datorama.com/&quot;&gt;Salesforce Datorama&lt;/a&gt; had the room rolling with laughter more than once, and
on a topic that is no laughing matter: managing mutable data with Presto. Datorama provides a marketing intelligence
platform with 30,000 customers, who can interact with 1.5PB of data available for interactive
queries.&lt;/p&gt;

&lt;p&gt;For that, Datorama provides a “data lake as a service”, called a DatoLake. Files on data lakes by their nature
are not transactionally updatable on a row level, but the users of Datorama require the ability to delete or update
specific rows in a transactional manner.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/datorama.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To solve this, Datorama embarked on a journey based on partitioning the data by a version number (such as
20190101_&lt;strong&gt;009&lt;/strong&gt;) and rebuilding a partition when updates occur. There were three attempts along the
journey, with lessons learned at each step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;At first, using an external Postgres metastore to store the versions, swapping versions in that metastore, and
using it as part of a sub-query in Presto to select the correct version. This approach did not allow partition pruning
to be pushed down.&lt;/li&gt;
  &lt;li&gt;Next, moving the metastore query to happen before query generation, and dynamically generating the right filter
for each sub-query. This approach required two-pass processing for each query and did not support direct SQL access for clients.&lt;/li&gt;
  &lt;li&gt;And finally, swapping the partition in a transactional manner directly in the Hive Metastore database (MySQL),
and refreshing the Presto Hive cache. With this approach, queries do not need to know about the version change, and
full separation of the mutability logic from the query is achieved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See much more details in the &lt;a href=&quot;https://vimeo.com/333759030&quot;&gt;video&lt;/a&gt;, &lt;a href=&quot;https://www.slideshare.net/OriReshef/mutable-data-scale&quot;&gt;slides&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;varada-join-optimization-and-dynamic-filtering&quot;&gt;Varada, Join Optimization and Dynamic filtering&lt;/h1&gt;

&lt;p&gt;&lt;a href=&quot;https://www.linkedin.com/in/romanzeyde/&quot;&gt;Roman Zeyde&lt;/a&gt; is Varada’s Presto architect. Roman has a unique
algorithmic background, being a Talpiot graduate and an ex-Googler.&lt;/p&gt;

&lt;p&gt;Roman’s talk discussed a new approach to make Joins work faster. Varada will contribute Roman’s work on dynamic
filtering back to the community. Stay tuned :)&lt;/p&gt;

&lt;p&gt;The talk went over the following major topics:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Presto Cost Based Optimizer feature as a basis for Join optimization&lt;/li&gt;
  &lt;li&gt;Join optimization strategies&lt;/li&gt;
  &lt;li&gt;Dynamic filtering in the application for join optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/varada2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Roman’s &lt;a href=&quot;https://vimeo.com/331946107&quot;&gt;talk&lt;/a&gt;, &lt;a href=&quot;https://www.slideshare.net/OriReshef/dynamic-filtering-for-presto-join-optimisation&quot;&gt;slides&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;qa-session&quot;&gt;Q&amp;amp;A session&lt;/h1&gt;

&lt;p&gt;The event finished with an hour-long Q&amp;amp;A session led by &lt;a href=&quot;https://www.linkedin.com/in/demibenari/&quot;&gt;Demi Ben-Ari&lt;/a&gt;, VP R&amp;amp;D at
&lt;a href=&quot;https://www.panorays.com/&quot;&gt;Panorays&lt;/a&gt; and co-founder of Big Things, an Israeli meetup group with 5,000 members,
all fans of big data technologies.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/Israel-2019/qa.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;See you all at the Second International Presto Conference in Tel Aviv!&lt;/p&gt;</content>

      
        <author>
          <name>Ori Reshef, VP Product, Varada</name>
        </author>
      

      <summary>Community, noun: “A feeling of fellowship with others, as a result of sharing common attributes, interests, and goals” The fun picture you see here was taken at the first lecture of the First international Presto summit in Israel last month. The atmosphere in the room during the various presentations was unique. It’s as if you could physically feel the brainpower of 250 engineers fascinated by technology in one room. We would like to share with you a bit of the content that was discussed during the conference. Enjoy the read and the videos!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/Israel-2019/audience.jpg" />
      
    </entry>
  
    <entry>
      <title>Release 309</title>
      <link href="https://trino.io/blog/2019/04/25/release-309.html" rel="alternate" type="text/html" title="Release 309" />
      <published>2019-04-25T00:00:00+00:00</published>
      <updated>2019-04-25T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/04/25/release-309</id>
      <content type="html" xml:base="https://trino.io/blog/2019/04/25/release-309.html">&lt;p&gt;This version adds support for case-insensitive name matching in
JDBC-based connectors, more data types in the
&lt;a href=&quot;https://trino.io/docs/current/connector/postgresql.html&quot;&gt;PostgreSQL connector&lt;/a&gt;,
and some bug fixes.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-309.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version adds support for case-insensitive name matching in JDBC-based connectors, more data types in PostgreSQL connector, and some bug fixes. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Even Faster ORC</title>
      <link href="https://trino.io/blog/2019/04/23/even-faster-orc.html" rel="alternate" type="text/html" title="Even Faster ORC" />
      <published>2019-04-23T00:00:00+00:00</published>
      <updated>2019-04-23T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/04/23/even-faster-orc</id>
<content type="html" xml:base="https://trino.io/blog/2019/04/23/even-faster-orc.html">&lt;p&gt;Trino is known for being the fastest SQL-on-Hadoop engine, and our custom ORC
reader implementation is a big reason for this speed – now it is even faster!&lt;/p&gt;

&lt;h2 id=&quot;why-is-this-important&quot;&gt;Why is this important?&lt;/h2&gt;

&lt;p&gt;For the TPC-DS benchmark, the new reader reduced overall query time by ~5%
and CPU usage by ~9%, which improves user experience while reducing cost.&lt;/p&gt;

&lt;h2 id=&quot;what-improved&quot;&gt;What improved?&lt;/h2&gt;

&lt;p&gt;ORC uses a two-step process to decode data. The first step is a traditional
compression algorithm, like gzip, that generically reduces data size. The second
step uses data-type-specific compression algorithms that convert the raw bytes
into values (e.g., text, numbers, timestamps). It is this latter step that we
improved.&lt;/p&gt;
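&lt;p&gt;The two steps can be sketched as follows. This is a minimal illustration, not Trino’s actual reader code: it uses gzip for the generic step and little-endian longs for the type-specific step, and all class and method names are made up for the example.&lt;/p&gt;

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of two-step decoding: a generic decompression
// pass shared by all column types, followed by a type-specific pass that
// turns the raw bytes into typed values.
public class TwoStepDecode {
    // step 1: generic decompression (gzip here)
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            in.transferTo(out);
            return out.toByteArray();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // step 2: type-specific decoding of the raw bytes into long values
    static long[] decodeLongs(byte[] raw) {
        long[] values = new long[raw.length / Long.BYTES];
        ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(values);
        return values;
    }

    // helper used only to produce example input
    static byte[] compress(byte[] raw) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                gzip.write(raw);
            }
            return out.toByteArray();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```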

&lt;h2 id=&quot;how-much-faster-is-the-decoder&quot;&gt;How much faster is the decoder?&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/blog/orc-speedup.svg&quot; alt=&quot;ORC Speedup&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;why-exactly-is-this-faster&quot;&gt;Why exactly is this faster?&lt;/h2&gt;

&lt;p&gt;Explaining why the new code is faster requires a brief explanation of the
existing code. In the old code, a typical value reader looked like this:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;presentStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;skip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;RunLengthEncodedBlock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nc&quot;&gt;BlockBuilder&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;createBlockBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;presentStream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;writeLong&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;presentStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;nextBit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;writeLong&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;appendNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code does a few things well. First, for the &lt;em&gt;all values are null&lt;/em&gt; case, it
returns a run length encoded block which has custom optimizations throughout
Trino (this &lt;a href=&quot;https://github.com/trinodb/trino/pull/229&quot;&gt;optimization&lt;/a&gt; was
recently added by &lt;a href=&quot;https://github.com/Praveen2112&quot;&gt;Praveen Krishna&lt;/a&gt;). Second,
it separates the unconditional &lt;em&gt;no nulls&lt;/em&gt; loop from the conditional &lt;em&gt;mixed nulls&lt;/em&gt;
loop. It is common to have a column without nulls, so it makes sense to split
this out, since unconditional loops are faster than conditional loops.&lt;/p&gt;

&lt;p&gt;On the downside, this code has several performance issues:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Many data encodings can be efficiently read in bulk, but this code reads one
value at a time.&lt;/li&gt;
  &lt;li&gt;In some cases, the code can be called with different type instances, which
results in slow dynamic dispatch call sites in the loop.&lt;/li&gt;
  &lt;li&gt;Value reading in the null loop is conditional, which is expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;optimize-for-bulk-reads&quot;&gt;Optimize for bulk reads&lt;/h3&gt;

&lt;p&gt;As you can see from the code above, Trino always loads values in batches
(typically 1024). This makes the reader and the downstream code more efficient, as
the overhead of processing data is amortized over the batch, and in some cases
data can be processed in parallel. ORC has a small number of low-level decoders
for booleans, numbers, bytes, and so on. These encodings are optimized for each
data type, which means each must be optimized individually. In some cases, the
decoders already had internal batch output buffers, so the optimization was
trivial. In another equally trivial case, we changed the float and double stream
decoders from loading a value byte at a time to bulk loading an entire array of
values directly from the input, improving performance by more than 10x.&lt;/p&gt;
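&lt;p&gt;The float/double change amounts to the difference between the two methods below. This is an illustrative sketch, not the actual decoder: the class and method names are invented, and a little-endian encoding is assumed.&lt;/p&gt;

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the double decoder change: instead of assembling
// each IEEE 754 value a byte at a time, reinterpret the whole run of raw
// bytes as doubles in one bulk operation.
public class BulkDoubleRead {
    // old style: reassemble one value at a time from individual bytes
    static double[] readOneAtATime(byte[] input, int count) {
        double[] values = new double[count];
        for (int i = 0; i < count; i++) {
            long bits = 0;
            for (int b = 0; b < 8; b++) {
                // little-endian: low-order byte first
                bits |= (input[i * 8 + b] & 0xFFL) << (8 * b);
            }
            values[i] = Double.longBitsToDouble(bits);
        }
        return values;
    }

    // new style: bulk load the entire array directly from the input
    static double[] readBulk(byte[] input, int count) {
        double[] values = new double[count];
        ByteBuffer.wrap(input).order(ByteOrder.LITTLE_ENDIAN)
                .asDoubleBuffer().get(values, 0, count);
        return values;
    }
}
```

Both methods produce identical output; the bulk variant simply does far less per-value work.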

&lt;p&gt;Some changes, however, were significantly more complex. One example is the
boolean reader, which was changed from decoding a single bit at a time to
decoding 8 bits at a time. This sounds simple, but in practice doing this
efficiently is complex, since reads are not aligned to 8 bits, and there is the
general problem of forming JVM friendly loops. For those interested, the code is
&lt;a href=&quot;https://github.com/trinodb/trino/blob/308/presto-orc/src/main/java/io/prestosql/orc/stream/BooleanInputStream.java#L218&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
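&lt;p&gt;The core idea can be sketched like this. This is a simplified illustration, not the linked implementation: it assumes most-significant-bit-first packing and only handles the aligned case in the fast path, whereas the real reader must also cope with reads that start mid-byte.&lt;/p&gt;

```java
// Hypothetical sketch of the boolean decoder idea: instead of extracting
// one bit per call, expand all 8 bits of each byte into the output array
// with straight-line code, falling back to bit-by-bit for the tail.
public class BooleanBatchDecode {
    // one bit at a time (old approach)
    static boolean[] decodeBitByBit(byte[] packed, int count) {
        boolean[] out = new boolean[count];
        for (int i = 0; i < count; i++) {
            out[i] = (packed[i / 8] & (0x80 >>> (i % 8))) != 0;
        }
        return out;
    }

    // 8 bits at a time (new approach, aligned fast path)
    static boolean[] decodeByByte(byte[] packed, int count) {
        boolean[] out = new boolean[count];
        for (int i = 0; i + 8 <= count; i += 8) {
            int b = packed[i / 8] & 0xFF;
            out[i]     = (b & 0x80) != 0;
            out[i + 1] = (b & 0x40) != 0;
            out[i + 2] = (b & 0x20) != 0;
            out[i + 3] = (b & 0x10) != 0;
            out[i + 4] = (b & 0x08) != 0;
            out[i + 5] = (b & 0x04) != 0;
            out[i + 6] = (b & 0x02) != 0;
            out[i + 7] = (b & 0x01) != 0;
        }
        // tail for counts that are not a multiple of 8
        for (int i = count & ~7; i < count; i++) {
            out[i] = (packed[i / 8] & (0x80 >>> (i % 8))) != 0;
        }
        return out;
    }
}
```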

&lt;h3 id=&quot;avoid-dynamic-dispatch-in-loops&quot;&gt;Avoid dynamic dispatch in loops&lt;/h3&gt;

&lt;p&gt;This is the kind of problem that is not obvious when reading code, and it is
easily missed in benchmarks. The core problem happens when you have a loop
containing a method call whose target class can vary over the lifetime of the
execution. For example, this simple loop from above may or may not be fast,
depending on how many different classes it sees for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type&lt;/code&gt; across multiple
executions:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;writeLong&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Most of the ORC column readers can only be called with a single type
implementation, but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LongStreamReader&lt;/code&gt; is called with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BIGINT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INTEGER&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SMALLINT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TINYINT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DATE&lt;/code&gt; types. This causes the JVM to generate a dynamic
dispatch in the core of the loop. Besides the obvious extra work to select the
target code and branch prediction problems, dynamic dispatch calls are normally
not inlined, which disables many powerful optimizations in the JVM. The good news
is that the fix is trivial:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;instanceof&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BigintType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;BlockBuilder&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;createBlockBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;writeLong&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;instanceof&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;IntegerType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;BlockBuilder&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;createBlockBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;writeLong&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The hard part is knowing that this is a problem. The existing benchmarks for ORC
only tested a single type at a time, which allowed the JVM to inline the target
method and produce much more efficient code. In this case, we happen to know that
the code is invoked with multiple types, so we updated the benchmark to
warm up the JVM with multiple types before benchmarking.&lt;/p&gt;

&lt;p&gt;For more information on this kind of optimization, I suggest reading Aleksey
Shipilëv’s blog posts on JVM performance. Specifically, &lt;a href=&quot;https://shipilev.net/blog/2015/black-magic-method-dispatch&quot;&gt;The Black Magic of (Java)
Method Dispatch&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;improve-null-reading&quot;&gt;Improve null reading&lt;/h3&gt;

&lt;p&gt;With the above improvements, we were getting great performance of 0.5ns to 3ns
per value for most types without nulls, but the benchmarks with nulls were taking
an additional ~6ns per value. Some of that is expected, since we must decode the
additional &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present&lt;/code&gt; boolean stream, but booleans decode at a rate of ~0.5ns per
value, so that isn’t the problem. &lt;a href=&quot;https://github.com/martint&quot;&gt;Martin Traverso&lt;/a&gt;
and I built and benchmarked many different implementations, but we only found one
with really good performance.&lt;/p&gt;

&lt;p&gt;The first implementation we built was simply to bulk read a null array, bulk read
the values packed into the front of an array, and then spread the nulls across
the array:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// bulk read and count null values&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;boolean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;boolean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;presentStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getUnsetBits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// bulk read non-null values into an array large enough for the full result&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;longNonNullValueTemp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// copy non-null values into output position (in reverse order)&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullSuppressedPosition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outputPosition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outputPosition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outputPosition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;--)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outputPosition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outputPosition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outputPosition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nullSuppressedPosition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;nullSuppressedPosition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;--;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is better because it always bulk reads the values, but there is still a ~4ns
per value penalty for nulls. We haven’t been able to explain why it happens, but
we’ve observed that the number drops dramatically after we adjusted the code to
assign to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;result[outputPosition]&lt;/code&gt; outside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if&lt;/code&gt; block. We can’t do that
in-place, as in the snippet above, so we introduce a temporary buffer:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// bulk read and count null values&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;boolean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;boolean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;presentStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getUnsetBits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// bulk read non-null values into a temporary array&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tempBuffer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextBatchSize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// copy values into result&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;position&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tempBuffer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;position&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;isNull&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;position&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;With this change, the null penalty drops to ~1.5ns per value, which is reasonable
given that just reading the null flag costs ~0.5ns per value. There are two
downsides to this approach. First, it requires an extra temporary buffer, but
since the reader is single-threaded, we can reuse one buffer for the whole file
read. Second, the null values are no longer zero. This should not be a problem
for correctly written code, but it could trigger latent bugs. We did find
another approach that left the null values zeroed, but it was a bit slower and
required yet another temporary buffer, so we settled on this approach.&lt;/p&gt;
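&lt;p&gt;The expansion pass above can be sketched as a minimal, self-contained Java
example. The &lt;code&gt;expand&lt;/code&gt; method and the driver are hypothetical
illustrations of the technique, not the actual reader code; note how the null
positions end up holding copies of neighboring values rather than zero.&lt;/p&gt;

```java
// Hypothetical sketch of the dense-decode-then-expand technique described
// above. Non-null values are assumed to have been decoded densely into
// tempBuffer; expand() spreads them out to their final positions. Null slots
// receive whatever value "position" currently points at, so they are not
// guaranteed to be zero.
public class NullExpansion {
    static long[] expand(long[] tempBuffer, boolean[] isNull) {
        long[] result = new long[isNull.length];
        int position = 0;
        for (int i = 0; i != isNull.length; i++) {
            result[i] = tempBuffer[position];
            if (!isNull[i]) {
                position++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // three non-null values with a null between each
        long[] dense = {10, 20, 30};
        boolean[] isNull = {false, true, false, true, false};
        System.out.println(java.util.Arrays.toString(expand(dense, isNull)));
        // prints [10, 20, 20, 30, 30]: the null slots hold copies, not zeros
    }
}
```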

&lt;h2 id=&quot;how-much-will-my-setup-improve&quot;&gt;How much will my setup improve?&lt;/h2&gt;

&lt;p&gt;We tested the performance using the standard TPC-DS and TPC-H benchmarks on zlib
compressed ORC files:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Benchmark&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Duration improvement&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;CPU improvement&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;TPC-DS&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5.6%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;9.3%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;TPC-H&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4.5%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;8.3%&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;There are a number of reasons you may get a larger or smaller win:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The exact queries matter: In the benchmarks above, some queries saved more than
20% CPU while others saved only 1%.&lt;/li&gt;
  &lt;li&gt;The compression matters: In our tests we used zlib, which is the most expensive
compression supported by ORC. Compression algorithms that use less CPU (e.g.,
Zstd, LZ4, or Snappy) will generally see larger relative improvements.&lt;/li&gt;
  &lt;li&gt;This improvement is only in &lt;a href=&quot;https://trino.io/download.html&quot;&gt;Trino 309+&lt;/a&gt;,
so if you are using an earlier version you will need to upgrade. Also, if you are
still using Facebook’s version of Presto, you can either upgrade to Trino 309+ or
wait to see if they backport it.&lt;/li&gt;
&lt;/ul&gt;</content>

      
        <author>
          <name>Dain Sundstrom, Martin Traverso</name>
        </author>
      

      <summary>Trino is known for being the fastest SQL on Hadoop engine, and our custom ORC reader implementation is a big reason for this speed – now it is even faster!</summary>

      
      
        <media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://trino.io/assets/blog/orc-speedup.png" />
      
    </entry>
  
    <entry>
      <title>Release 308</title>
      <link href="https://trino.io/blog/2019/04/12/release-308.html" rel="alternate" type="text/html" title="Release 308" />
      <published>2019-04-12T00:00:00+00:00</published>
      <updated>2019-04-12T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/04/12/release-308</id>
      <content type="html" xml:base="https://trino.io/blog/2019/04/12/release-308.html">&lt;p&gt;This version includes significant 
&lt;a href=&quot;/blog/2019/04/23/even-faster-orc.html&quot;&gt;performance improvements&lt;/a&gt;
when reading ORC data, authorization checks for 
&lt;a href=&quot;https://trino.io/docs/current/sql/show-columns.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SHOW COLUMNS&lt;/code&gt;&lt;/a&gt;,
and limit pushdown for JDBC-based connectors.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-308.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version includes significant performance improvements when reading ORC data, authorization checks for SHOW COLUMNS, and limit pushdown for JDBC-based connectors. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 307</title>
      <link href="https://trino.io/blog/2019/04/08/release-307.html" rel="alternate" type="text/html" title="Release 307" />
      <published>2019-04-08T00:00:00+00:00</published>
      <updated>2019-04-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/04/08/release-307</id>
      <content type="html" xml:base="https://trino.io/blog/2019/04/08/release-307.html">&lt;p&gt;This version includes some important security fixes, support for inner and outer
joins involving lateral derived tables (&lt;a href=&quot;https://trino.io/docs/current/sql/select.html#lateral&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LATERAL&lt;/code&gt;&lt;/a&gt;),
new syntax for setting &lt;a href=&quot;https://trino.io/docs/current/sql/comment.html&quot;&gt;table comments&lt;/a&gt;, and performance
improvements.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-307.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version includes some important security fixes, support for inner and outer joins involving lateral derived tables (LATERAL), new syntax for setting table comments, and performance improvements. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Community Meeting 2019-04-03</title>
      <link href="https://trino.io/blog/2019/04/03/Presto-Community-Meeting.html" rel="alternate" type="text/html" title="Presto Community Meeting 2019-04-03" />
      <published>2019-04-03T00:00:00+00:00</published>
      <updated>2019-04-03T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/04/03/Presto-Community-Meeting</id>
      <content type="html" xml:base="https://trino.io/blog/2019/04/03/Presto-Community-Meeting.html">&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/VQhDBPltUyk&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h3 id=&quot;agenda&quot;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Memory management&lt;/li&gt;
  &lt;li&gt;Spilling&lt;/li&gt;
&lt;/ul&gt;

&lt;!--more--&gt;</content>

      

      <summary>Agenda Memory management Spilling</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 306</title>
      <link href="https://trino.io/blog/2019/03/16/release-306.html" rel="alternate" type="text/html" title="Release 306" />
      <published>2019-03-16T00:00:00+00:00</published>
      <updated>2019-03-16T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/03/16/release-306</id>
      <content type="html" xml:base="https://trino.io/blog/2019/03/16/release-306.html">&lt;p&gt;This version includes some bug fixes, as well as performance improvements when decoding ORC data.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-306.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version includes some bug fixes, as well as performance improvements when decoding ORC data. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Community Meeting 2019-03-13</title>
      <link href="https://trino.io/blog/2019/03/13/Presto-Community-Meeting.html" rel="alternate" type="text/html" title="Presto Community Meeting 2019-03-13" />
      <published>2019-03-13T00:00:00+00:00</published>
      <updated>2019-03-13T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/03/13/Presto-Community-Meeting</id>
      <content type="html" xml:base="https://trino.io/blog/2019/03/13/Presto-Community-Meeting.html">&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/hMmFM1MBEB8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h3 id=&quot;agenda&quot;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Dynamic Filtering&lt;/li&gt;
  &lt;li&gt;Changes to TIMESTAMP semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;!--more--&gt;</content>

      

      <summary>Agenda Dynamic Filtering Changes to TIMESTAMP semantics</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 305</title>
      <link href="https://trino.io/blog/2019/03/08/release-305.html" rel="alternate" type="text/html" title="Release 305" />
      <published>2019-03-08T00:00:00+00:00</published>
      <updated>2019-03-08T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/03/08/release-305</id>
      <content type="html" xml:base="https://trino.io/blog/2019/03/08/release-305.html">&lt;p&gt;Changes in this version include peak-memory awareness in
&lt;a href=&quot;https://trino.io/docs/current/optimizer/cost-based-optimizations.html&quot;&gt;cost-based optimizer&lt;/a&gt;,
improved handling of CSV output in &lt;a href=&quot;https://trino.io/docs/current/client/cli.html&quot;&gt;CLI&lt;/a&gt;,
and performance improvements for Parquet.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-305.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>Changes in this version include peak-memory awareness in cost-based optimizer, improved handling of CSV output in CLI, and performance improvements for Parquet. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 304</title>
      <link href="https://trino.io/blog/2019/02/27/release-304.html" rel="alternate" type="text/html" title="Release 304" />
      <published>2019-02-27T00:00:00+00:00</published>
      <updated>2019-02-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/02/27/release-304</id>
      <content type="html" xml:base="https://trino.io/blog/2019/02/27/release-304.html">&lt;p&gt;New features include &lt;a href=&quot;https://trino.io/docs/current/admin/spill.html&quot;&gt;spilling&lt;/a&gt; for queries
that use ORDER BY or window functions, support for PostgreSQL’s json and jsonb types, and a Hive 
&lt;a href=&quot;https://trino.io/docs/current/connector/hive.html#procedures&quot;&gt;procedure&lt;/a&gt; to synchronize 
partition metadata with the file system.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-304.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>New features include spilling for queries that use ORDER BY or window functions, support for PostgreSQL’s json and jsonb types, and a Hive procedure to synchronize partition metadata with the file system. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Community Meeting 2019-02-27</title>
      <link href="https://trino.io/blog/2019/02/27/Presto-Community-Meeting.html" rel="alternate" type="text/html" title="Presto Community Meeting 2019-02-27" />
      <published>2019-02-27T00:00:00+00:00</published>
      <updated>2019-02-27T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/02/27/Presto-Community-Meeting</id>
      <content type="html" xml:base="https://trino.io/blog/2019/02/27/Presto-Community-Meeting.html">&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/7bclzfYUfQg&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h3 id=&quot;agenda&quot;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Pushdown of complex operations (filter, project, join, etc.)&lt;/li&gt;
  &lt;li&gt;Coordinator high availability&lt;/li&gt;
&lt;/ul&gt;

&lt;!--more--&gt;</content>

      

      <summary>Agenda Pushdown of complex operations (filter, project, join, etc.) Coordinator high availability</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 303</title>
      <link href="https://trino.io/blog/2019/02/14/release-303.html" rel="alternate" type="text/html" title="Release 303" />
      <published>2019-02-14T00:00:00+00:00</published>
      <updated>2019-02-14T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/02/14/release-303</id>
      <content type="html" xml:base="https://trino.io/blog/2019/02/14/release-303.html">&lt;p&gt;This version includes bug fixes and performance improvements.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-303.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>This version includes bug fixes and performance improvements. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 302</title>
      <link href="https://trino.io/blog/2019/02/06/release-302.html" rel="alternate" type="text/html" title="Release 302" />
      <published>2019-02-06T00:00:00+00:00</published>
      <updated>2019-02-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/02/06/release-302</id>
      <content type="html" xml:base="https://trino.io/blog/2019/02/06/release-302.html">&lt;p&gt;New features include native support for 
&lt;a href=&quot;https://trino.io/docs/current/connector/hive-gcs-tutorial.html&quot;&gt;Google Cloud Storage&lt;/a&gt; 
and a connector for 
&lt;a href=&quot;https://trino.io/docs/current/connector/elasticsearch.html&quot;&gt;Elasticsearch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-302.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>New features include native support for Google Cloud Storage and a connector for Elasticsearch. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Community Meeting 2019-02-06</title>
      <link href="https://trino.io/blog/2019/02/06/Presto-Community-Meeting.html" rel="alternate" type="text/html" title="Presto Community Meeting 2019-02-06" />
      <published>2019-02-06T00:00:00+00:00</published>
      <updated>2019-02-06T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/02/06/Presto-Community-Meeting</id>
      <content type="html" xml:base="https://trino.io/blog/2019/02/06/Presto-Community-Meeting.html">&lt;div class=&quot;video-responsive&quot;&gt;
    &lt;iframe width=&quot;720&quot; height=&quot;405&quot; src=&quot;https://www.youtube.com/embed/YfDe_YVzMyI&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;h3 id=&quot;agenda&quot;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;About the Foundation&lt;/li&gt;
  &lt;li&gt;Getting involved&lt;/li&gt;
  &lt;li&gt;Summary of new features&lt;/li&gt;
  &lt;li&gt;Top requested features&lt;/li&gt;
  &lt;li&gt;Release verification&lt;/li&gt;
&lt;/ul&gt;

&lt;!--more--&gt;</content>

      

      <summary>Agenda About the Foundation Getting involved Summary of new features Top requested features Release verification</summary>

      
      
    </entry>
  
    <entry>
      <title>Release 301</title>
      <link href="https://trino.io/blog/2019/01/31/release-301.html" rel="alternate" type="text/html" title="Release 301" />
      <published>2019-01-31T00:00:00+00:00</published>
      <updated>2019-01-31T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/01/31/release-301</id>
      <content type="html" xml:base="https://trino.io/blog/2019/01/31/release-301.html">&lt;p&gt;New features include role-based access control and 
&lt;a href=&quot;https://trino.io/docs/current/sql/create-role.html&quot;&gt;role management&lt;/a&gt;, 
&lt;a href=&quot;https://trino.io/docs/current/sql/create-view.html#security&quot;&gt;invoker security&lt;/a&gt;
mode for views, and &lt;a href=&quot;https://trino.io/docs/current/sql/analyze.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt;&lt;/a&gt;
syntax for collecting table statistics.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://trino.io/docs/current/release/release-301.html&quot;&gt;Release notes&lt;/a&gt; &lt;br /&gt;
&lt;a href=&quot;https://trino.io/download.html&quot;&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;!--more--&gt;</content>

      

      <summary>New features include role-based access control and role management, invoker security mode for views, and ANALYZE syntax for collecting table statistics. Release notes Download</summary>

      
      
    </entry>
  
    <entry>
      <title>Presto Software Foundation Launch</title>
      <link href="https://trino.io/blog/2019/01/31/presto-software-foundation-launch.html" rel="alternate" type="text/html" title="Presto Software Foundation Launch" />
      <published>2019-01-31T00:00:00+00:00</published>
      <updated>2019-01-31T00:00:00+00:00</updated>
      <id>https://trino.io/blog/2019/01/31/presto-software-foundation-launch</id>
      <content type="html" xml:base="https://trino.io/blog/2019/01/31/presto-software-foundation-launch.html">&lt;p&gt;We are pleased to &lt;a href=&quot;https://www.prweb.com/releases/prweb16070792.htm&quot;&gt;announce&lt;/a&gt;
the launch of the Presto Software Foundation,
a not-for-profit organization dedicated to the advancement of the Presto
open source distributed SQL engine. The foundation is committed to ensuring
the project remains open, collaborative and independent for decades to come.&lt;/p&gt;

&lt;p&gt;We started the Presto project in 2012 as a small team at Facebook,
with the goals of building a high performance, standards compliant, easy-to-use
and dependable query engine capable of scaling to the largest datasets
(exabyte scale) in the world. From day one, we designed and developed Presto
to be maintained by an independent open source community.&lt;/p&gt;

&lt;p&gt;In 2013, we released Presto under the Apache License and opened development to the public.
Since then, the Presto community has expanded globally, with developers in
Brazil, China, Germany, India, Israel, Japan, Poland, Singapore, the U.S., the U.K.,
and more. In recent years, the center of gravity of the Presto community has shifted,
with the majority of contributions now coming from developers outside of Facebook.&lt;/p&gt;

&lt;p&gt;From the beginning, we stressed the importance of code quality, architectural
extensibility, and open collaboration with the community. With the rapid expansion
of both the Presto user base and Presto developer community over the last several
years, establishing a non-profit to institutionalize these values is the next
logical step to ensure that this project stands the test of time.&lt;/p&gt;

&lt;p&gt;The foundation is dedicated to preserving the vision of high quality, performant
and dependable software developed by an open, collaborative and independent
community of developers throughout the world. Everyone is welcome to participate,
whether it be via code contributions, suggestions for improvements, or bug reports.&lt;/p&gt;</content>

      
        <author>
          <name>Martin Traverso, Dain Sundstrom, David Phillips</name>
        </author>
      

      <summary>We are pleased to announce the launch of the Presto Software Foundation, a not-for-profit organization dedicated to the advancement of the Presto open source distributed SQL engine. The foundation is committed to ensuring the project remains open, collaborative and independent for decades to come.</summary>

      
      
    </entry>
  
</feed>
