Last summer Microsoft rebranded the Azure Kusto query engine as Azure Data Explorer. While it does not support fully elastic scaling, it at least allows scaling a cluster up and out via an API or the Azure portal to adapt to different workloads. It also offers Parquet support out of the box, which prompted me to spend some time looking into it.
Azure Data Explorer
With the heavy use of Apache Parquet datasets within my team at Blue Yonder, we are always looking for managed, scalable/elastic query engines on flat files besides the usual suspects like Drill, Hive, Presto or Impala.
For the following tests I deployed an Azure Data Explorer cluster with two Standard_D14_v2 instances, each with 16 vCores, 112 GiB RAM, 800 GiB SSD storage and the network bandwidth class "extremely high" (which corresponds to 8 NICs).
Data Preparation NY Taxi Dataset
As in the blog post on understanding Parquet predicate pushdown, we use the NY Taxi dataset for the tests because it has a reasonable size and some nice properties: it covers different data types and includes some messy data (like all real-world data engineering problems).
We can convert the csv files to parquet with pandas and pyarrow:
Each CSV file is about 700 MiB and each Parquet file about 180 MiB, with roughly 10 million rows per file.
Data Ingestion
Azure Data Explorer supports control and query commands to interact with the cluster. Kusto control commands always start with a dot and are used to manage the service, query information about it, and explore, create and alter tables. The primary query language is the Kusto Query Language, but a subset of T-SQL is also supported.
The table schema definition supports a number of scalar data types:
Type | Storage Type (internal name) |
---|---|
bool | I8 |
datetime | DateTime |
guid | UniqueId |
int | I32 |
long | I64 |
real | R64 |
string | StringBuffer |
timespan | TimeSpan |
decimal | Decimal |
To create a table for the NY Taxi dataset, we can use a `.create table` control command that lists the table name, the column names and the corresponding data types.
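As a rough sketch, the command text can be assembled from a column-to-type mapping. The table name `nytaxi`, the helper `create_table_command` and the column subset below are assumptions for illustration, not the schema used in the tests; the allowed types are the scalar types from the table above.

```python
# Scalar type names taken from the table above.
KUSTO_TYPES = {
    "bool", "datetime", "guid", "int", "long",
    "real", "string", "timespan", "decimal",
}


def create_table_command(table_name, columns):
    """Build a `.create table` control command from a {column: type} mapping."""
    for name, kusto_type in columns.items():
        if kusto_type not in KUSTO_TYPES:
            raise ValueError(f"unsupported type {kusto_type!r} for column {name!r}")
    column_list = ", ".join(f"{name}:{kusto_type}" for name, kusto_type in columns.items())
    return f".create table {table_name} ({column_list})"


# Hypothetical subset of the NY Taxi columns, for illustration only.
TAXI_COLUMNS = {
    "VendorID": "int",
    "tpep_pickup_datetime": "datetime",
    "tpep_dropoff_datetime": "datetime",
    "passenger_count": "int",
    "trip_distance": "real",
    "total_amount": "real",
}

print(create_table_command("nytaxi", TAXI_COLUMNS))
```

The resulting string can be pasted into the query window of the Azure Data Explorer web UI or sent through a client library.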
Ingest data into the Azure Data Explorer
The `.ingest into table` command can read data from Azure Blob Storage or Azure Data Lake Storage and import it into the cluster. This means the data is ingested and stored locally for better performance. Authentication is done with Azure SAS tokens.

Importing one month of CSV data takes about 110 seconds. For reference, parsing the same CSV file with `pandas.read_csv` takes about 19 seconds.

One should always use obfuscated strings (the `h` in front of the string values) to ensure that the SAS token is never recorded or logged.
Ingesting Parquet data from Azure Blob Storage uses a similar command and determines the file format from the file extension. Besides CSV and Parquet, several more data formats such as JSON, JSON Lines, ORC and Avro are supported. According to the documentation it is also possible to specify the format explicitly by appending `with (format='parquet')`.

Loading the data from Parquet took only 30 seconds, which is already a nice speedup. One can also list multiple Parquet files in the blob store to load the data in one run, but I did not see a performance improvement: the total time was no better than the single-file duration multiplied by the number of files, which I interpret as no parallel import happening.
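The ingest command described above can be sketched as a small string builder. The helper name `ingest_command`, the storage account URL and the SAS token value are placeholders and assumptions; only the `.ingest into table` command, the `h` prefix and the `with (format='parquet')` suffix come from the text.

```python
def ingest_command(table_name, blob_urls, sas_token, data_format=None):
    """Build an `.ingest into table` command for one or more blobs.

    Each source is written as an obfuscated string literal (h'...') so the
    SAS token is not recorded in logs.
    """
    token = sas_token.lstrip("?")  # accept tokens with or without a leading '?'
    sources = ", ".join(f"h'{url}?{token}'" for url in blob_urls)
    command = f".ingest into table {table_name} ({sources})"
    if data_format is not None:
        # Optional explicit format instead of relying on the file extension.
        command += f" with (format='{data_format}')"
    return command


# Placeholder account, container, file and token values.
print(
    ingest_command(
        "nytaxi",
        ["https://myaccount.blob.core.windows.net/taxi/2018-01.parquet"],
        "sv=PLACEHOLDER",
        data_format="parquet",
    )
)
```

Passing several URLs produces a single command with multiple sources, which corresponds to the multi-file load mentioned above.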
Once the data is ingested, one can query it with Azure Data Explorer either in the Kusto Query Language or in T-SQL.
Query External Tables
Loading the data into the cluster gives the best performance, but often one just wants to run an ad hoc query on Parquet data in the blob storage. External tables support exactly this scenario, and this time using multiple files/partitioning did help to speed up the query.
Querying external data looks similar but has the benefit that one does not have to load the data into the cluster. In a follow-up post I'll do some performance benchmarks. Based on my first experiences, the query engine seems to be aware of some Parquet properties like columnar storage and predicate pushdown, because queries return results faster than loading the full data from the blob storage (with the 30 MB/s limit) would take.
Export Data
For the sake of completeness, here is how data can be exported from the cluster back to a Parquet file in Azure Blob Storage.
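The original export listing is not available, so this is only a sketch based on my understanding of the documented `.export to <format> (<storage connection string>) <| <query>` form; check it against the current documentation before relying on it. The helper name `export_command`, the container URL, the token and the query are placeholders.

```python
def export_command(container_url, sas_token, query):
    """Build an `.export to parquet` command that writes a query result to blob storage.

    The storage target is written as an obfuscated string literal (h'...')
    so the SAS token is not logged.
    """
    token = sas_token.lstrip("?")  # accept tokens with or without a leading '?'
    target = f"h'{container_url}?{token}'"
    return f".export to parquet ({target}) <| {query}"


# Placeholder container, token and query.
print(
    export_command(
        "https://myaccount.blob.core.windows.net/export",
        "sv=PLACEHOLDER",
        "nytaxi | where passenger_count > 1",
    )
)
```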