Self-Hosting Kusto

This is going to be an unusual one, partly because it is about a Microsoft product (which I typically avoid) and partly because it’s going to be a bit of a brain dump.

In short, Kusto is the original name for the engine behind Azure Data Explorer.

Henceforth referred to as ADX, it is a high-performance distributed database for time-series analytics that I have had a love-hate relationship with for quite some time now–KQL, its query language, is used everywhere in Azure with slightly irritating variations, and I’ve been bitten by a few inconsistencies in the past.

But ADX is a staple in any serious high-speed approach to analytics across ludicrously large data volumes in Azure and has a bunch of built-in anomaly detection functionalities that I can’t easily replicate elsewhere. So, I decided to set up my own–locally.

Kusto Emulator Setup

You can do that by using the Kusto Emulator, which comes in a Docker container. It has a few limitations (it doesn’t come with any sort of GUI or authentication, to begin with), but it’s easy enough to get running–here’s what I did to last week’s stack, carved out into something you can just drop into docker-compose:

services:
  jupyter:
    container_name: jupyter
    hostname: jupyter
    image: quay.io/jupyter/pytorch-notebook:cuda12-python-3.11
    volumes:
      - /mnt/data/jupyter:/home/jovyan:rw
    ports:
      - 8888:8888
    links:
      - kusto
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  kusto:
    container_name: kusto
    hostname: kusto
    image: mcr.microsoft.com/azuredataexplorer/kustainer-linux
    volumes: 
      - /mnt/data/kusto:/kustodata
    ports:
      - 8080:8080
    restart: always
    environment:
      - ACCEPT_EULA=Y

You’ll notice that kusto is linked directly to jupyter–that’s for convenience, and I would recommend you don’t expose port 8080 unless you absolutely require it, since the emulator does not have any authentication whatsoever.

Using Jupyter as a Front-End

There is a kqlmagic kernel available, so the only thing you really need to do to get started is:

!pip install kqlmagic
%reload_ext Kqlmagic 
%kql --activate_kernel

This installs the extension, reloads Jupyter, and sets KQL as the default language in the notebook (as usual, you can use %%kql and %%py magics to explicitly set the cell language).

Connecting to Kusto

With my setup above, this is how you get started talking to the Kusto emulator:

azureDataExplorer://anonymous;cluster='http://kusto:8080';database='NetDefaultDB';alias='default'

Creating a Database

From that moment on, you can create your own database, specifying both metadata and data folders:

.create database telemetry persist (
  @"/kustodata/dbs/telemetry/md",
  @"/kustodata/dbs/telemetry/data"
)

You can then reconnect to that database or try attaching to it:

.attach database telemetry from @"/kustodata/dbs/telemetry/md"

…and since I ran this instead of reconnecting to the new database, this is where you might start noticing that, at least for now, debugging KQL in Jupyter is a bit of a challenge:

BadRequest: Request is invalid and cannot be executed.
Error details:
ClientRequestId='Kqlmagic.execute;2fc751d7-c3f1-49f5-996e-c287cc51407a/a3a02fa3-3c02-4ce3-a66c-aa00f6ce1971/AzureDataExplorer', ActivityId='19784e04-f9bb-426f-a3e0-71a5785f4072', Timestamp='2024-06-28T18:29:09.7367751Z'.

I have yet to figure out a good way to get meaningful error messages, but to be fair, I also have trouble with production cluster error messages being somewhat terse…

Managing Tables

The commands for managing tables are somewhat obvious, and you can pipe the results directly to KQL operators:

.show tables
| count

Since the above returns 0, let’s create a table:

.create table metrics(timestamp:datetime, sensor:string, metric:string, value:real)

This is the almost verbatim translation of my home telemetry SQLite schema, except I don’t need indexes or (in this case) a primary key:

PRAGMA auto_vacuum = INCREMENTAL;
CREATE TABLE metrics(id INTEGER PRIMARY KEY AUTOINCREMENT, sensor STRING NOT NULL, metric STRING NOT NULL, value STRING NOT NULL, time TIMESTAMP DEFAULT (strftime('%s', 'now')));
CREATE INDEX metrics_sensor_index on metrics(sensor);
CREATE INDEX metrics_metric_index on metrics(metric);
CREATE INDEX metrics_time_index on metrics(time);

Importing and Processing Data

I prefer to use Parquet as an import format because it provides a bunch of assurances regarding data formats (it’s what I use in most large-scale cloud projects anyway). Since I had already dumped 10GB of my telemetry data into CSV, I whipped up a quick converter using Polars (which is another of my staple libraries for high-performance data processing):

import polars as pl

# I have a few bogus test entries in the file
df = pl.read_csv("telemetry.csv", null_values=["null"])

# enforce a schema (everything else is already inferred as a string)
proper_schema = df.with_columns(
    pl.col("timestamp").str.to_datetime(),
    pl.col("value").cast(pl.Float64),
)

# write out a compressed parquet file
proper_schema.write_parquet(
    "import.parquet",
    compression="zstd",
    row_group_size=100000
)

I was then able to ingest it like this:

.ingest into table metrics(@"/kustodata/raw/import.parquet") with (format="parquet")

Querying and Manipulating Results

This is the easy bit:

metrics
| where sensor == "office_temp" and metric == "humidity"
| order by timestamp asc
| take 1000

This spits out something that looks like a dataframe inside Jupyter–and it is. You can get the last result data inside Python cells like so:

# not the nicest syntax, but it works
df = _kql_raw_result_.to_dataframe()

# now do something with the data
df.plot(x='timestamp', y='value')

…and yes, you’ll have to use Python to plot charts, since the render command in KQL does nothing in the simulator. But that’s not really a problem, since there is no shortage of Jupyter visualization libraries.

Time Series Anomaly Detection

Now for the stuff I wanted to get some practice on–this is pretty much the textbook example of how to do basic anomaly detection in KQL:

Create a resampled time series from the data
Use series_decompose_anomalies to generate three other time series (anomalies, a score, and a baseline)

// time interval
let min_t = datetime(2024-05-01);
let max_t = datetime(2024-06-28);

// resampling factor
let dt = 1h;

metrics
| where sensor == "office_temp" and metric == "temperature"
| order by timestamp asc
| make-series resampled=avg(value) on timestamp from min_t to max_t step dt by sensor
| extend (anomalies, score, baseline) = series_decompose_anomalies(resampled, 1.5, -1, 'linefit')

You’d ordinarily pipe these to render anomalychart, but since we don’t have that, you need to fish out the results from the output (nested) dataframe you get out of the above:

import pandas as pd

df = _kql_raw_result_.to_dataframe()
plot_data = pd.DataFrame()

# assemble the data for plotting sanely
for series in ['timestamp', 'anomalies', 'score', 'baseline']:
  plot_data[series] = df[series][0]

Plotting the data in a sensible way is left as an exercise to the reader (I am rather partial to Altair or PygWalker, but your mileage may vary).

I’ve intentionally left out the actual dataframe outputs and charts because they are not particularly interesting, but I can tell you that the anomaly detection works as expected and the results are exactly what I would get from a production cluster (you can set up a free one for a limited time, but I don’t have the patience to do that right now).

Conclusion

I’m still exploring the emulator’s functionality and its performance envelope, and the kind of additional analysis I need to do for work is fairly specific, so I can’t really write about it–other than to say that so far, all the features I tried just worked once I was able to understand the parameters and syntax well enough to avoid cryptic error messages.

What I can tell you is that self-hosted Kusto is blisteringly fast, at least on my kitted-out analytics sandbox–where 10GB of data, even without compression, is one-tenth of the available RAM… so your mileage may vary.

Tao of Mac