How Do You Aggregate Data in Vega-Lite?

Ever wondered how to summarize your data effectively in a visualization? Let’s take a deep dive into data aggregation with Vega-Lite, making it easy for you to produce insightful and concise visualizations!

What Is Data Aggregation and How Can It Solve Your Problems?

Aggregation is an important concept that beginners often overlook, which leads to many kinds of mistakes when building data visualizations.

For example, consider this odd-looking line chart of the Cars dataset, which tries to show how Acceleration changes over Year:

{
  "data": {
    "url": "data/cars.json"
  },
  "mark": "line",
  "encoding": {
    "x": {
      "field": "Year",
      "type": "temporal"
    },
    "y": {
      "field": "Acceleration",
      "type": "quantitative"
    }
  }
}

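The visualization above shows a common beginner mistake: plotting raw, unaggregated records. Each year contains many cars, so the line passes through every individual Acceleration value instead of a single summary value per year.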

Data aggregation is the process of summarizing a large dataset into smaller chunks, making it easier to spot patterns and trends. Whether you’re a data scientist, analyst, or just getting started with data visualization, understanding aggregation allows you to transform complex datasets into meaningful visuals.
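As a rough sketch of how aggregation repairs the line chart above, the spec below adds a mean aggregate to the y encoding so that each year contributes one point. The choice of "mean" is an assumption made for illustration; the "description" text is only an annotation and does not affect rendering.

```json
{
  "description": "Mean Acceleration per Year: one summary value per year instead of one point per car.",
  "data": {
    "url": "data/cars.json"
  },
  "mark": "line",
  "encoding": {
    "x": {
      "field": "Year",
      "type": "temporal"
    },
    "y": {
      "aggregate": "mean",
      "field": "Acceleration",
      "type": "quantitative"
    }
  }
}
```

Another operation such as "median" could be swapped in the same way if it suits the data better.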

In Vega-Lite, you can aggregate data using the aggregate property of an encoding field or through the aggregate transform. Let’s explore both!

Using Aggregate in Encoding Field Definition

You can aggregate directly in an encoding field definition by using the aggregate property:

Property | Type | Description
aggregate | AggregationOperations | Defines the operation applied to the data.

Here is an example that shows how you can use the aggregate property in the encoding field of a Vega-Lite specification:

{
  "data": {
    "url": "data/cars.json"
  },
  "mark": "bar",
  "encoding": {
    "x": {
      "aggregate": "mean",
      "field": "Acceleration",
      "type": "quantitative"
    },
    "y": {
      "field": "Cylinders",
      "type": "nominal"
    }
  }
}

Figure: the same bar chart without aggregation and with aggregation.

In this example, we created a bar chart to show the relationship between acceleration and number of cylinders. The two charts differ only in the aggregate property: without it, one bar mark is drawn for every car; with the mean aggregation, each cylinder count gets a single bar showing its mean acceleration.
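For comparison, the unaggregated chart presumably comes from the same spec with the aggregate property removed. The spec below is a reconstruction under that assumption, not the original source of the figure.

```json
{
  "description": "Same bar chart without aggregation: one bar mark per car record.",
  "data": {
    "url": "data/cars.json"
  },
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "Acceleration",
      "type": "quantitative"
    },
    "y": {
      "field": "Cylinders",
      "type": "nominal"
    }
  }
}
```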

Extra Dimension with detail Channel

The detail channel adds more group-by fields without mapping them to visual properties. The example below adds the country of origin as an extra grouping field:

{
  "data": {
    "url": "data/cars.json"
  },
  "mark": "point",
  "encoding": {
    "x": {
      "aggregate": "mean",
      "field": "Acceleration",
      "type": "quantitative"
    },
    "y": {
      "field": "Cylinders",
      "type": "nominal"
    },
    "detail": {
      "field": "Origin",
      "type": "nominal"
    }
  }
}

Figure: the same point chart without "detail" and with "detail".

In this example, we used the detail channel to further differentiate the points by country of origin.
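For contrast, here is a sketch that groups by Origin through the color channel instead of detail. It assumes the x value is aggregated with "mean" (an illustrative choice): the grouping is the same as with detail, but the groups also become visually distinguishable.

```json
{
  "description": "Mean Acceleration per Cylinders and Origin, with Origin mapped to color rather than detail.",
  "data": {
    "url": "data/cars.json"
  },
  "mark": "point",
  "encoding": {
    "x": {
      "aggregate": "mean",
      "field": "Acceleration",
      "type": "quantitative"
    },
    "y": {
      "field": "Cylinders",
      "type": "nominal"
    },
    "color": {
      "field": "Origin",
      "type": "nominal"
    }
  }
}
```

Use detail when the extra grouping should split the marks without adding a legend or another visual variable.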

Using Aggregate Transform

Alternatively, you can use the aggregate transform to summarize your data within the transform array:

{
  "data": { "url": "data/cars.json" },
  "transform": [
    {
      "aggregate": [{
        "op": "mean",
        "field": "Acceleration",
        "as": "mean_acc"
      }],
      "groupby": ["Cylinders"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": { "field": "Cylinders", "type": "ordinal" },
    "y": { "field": "mean_acc", "type": "quantitative" }
  }
}

This is equivalent to:

{
  "data": { "url": "data/cars.json" },
  "mark": "bar",
  "encoding": {
    "x": { "field": "Cylinders", "type": "ordinal" },
    "y": {
      "field": "Acceleration",
      "type": "quantitative",
      "aggregate": "mean"
    }
  }
}

An aggregate transform in the transform array has the following properties:

Property | Type | Description
aggregate | AggregatedFieldDef[] | Required. Array of objects that define fields to aggregate.
groupby | String[] | The data fields to group by. If not specified, a single group containing all data objects will be used.
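To illustrate the default grouping, the sketch below omits groupby, so all cars fall into a single group and the chart shows one bar with the overall mean. The output name mean_acc is an arbitrary choice.

```json
{
  "description": "No groupby: a single group containing every record, so one overall mean.",
  "data": { "url": "data/cars.json" },
  "transform": [
    {
      "aggregate": [{
        "op": "mean",
        "field": "Acceleration",
        "as": "mean_acc"
      }]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": { "field": "mean_acc", "type": "quantitative" }
  }
}
```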

Aggregated Field Definition

Property | Type | Description
op | String | Required. The aggregation operation to apply to the fields (e.g., "sum", "average", or "count").
field | String | The data field for which to compute aggregate function. This is required for all aggregation operations except "count".
as | String | Required. The output field names to use for each aggregated field.
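The following sketch combines these properties: two aggregated field definitions in one transform, a "count" without a field, and two group-by fields. The output names num_cars and mean_acc and the choice of point size and color are illustrative assumptions.

```json
{
  "description": "Count and mean per (Cylinders, Origin) group; count needs no field.",
  "data": { "url": "data/cars.json" },
  "transform": [
    {
      "aggregate": [
        { "op": "count", "as": "num_cars" },
        { "op": "mean", "field": "Acceleration", "as": "mean_acc" }
      ],
      "groupby": ["Cylinders", "Origin"]
    }
  ],
  "mark": "point",
  "encoding": {
    "x": { "field": "mean_acc", "type": "quantitative" },
    "y": { "field": "Cylinders", "type": "ordinal" },
    "size": { "field": "num_cars", "type": "quantitative" },
    "color": { "field": "Origin", "type": "nominal" }
  }
}
```

Because the transform runs first, the encoding refers only to the output names given in "as" and to the group-by fields.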

Aggregation Operations

Operation | Description
count | The total count of data objects in the group.
valid | The count of field values that are not null, undefined or NaN.
values | A list of data objects in the group.
missing | The count of null or undefined field values.
distinct | The count of distinct field values.
sum | The sum of field values.
product | The product of field values.
mean | The mean (average) field value.
average | The mean (average) field value. Identical to mean.
variance | The sample variance of field values.
variancep | The population variance of field values.
stdev | The sample standard deviation of field values.
stdevp | The population standard deviation of field values.
stderr | The standard error of field values.
median | The median field value.
q1 | The lower quartile boundary of field values.
q3 | The upper quartile boundary of field values.
ci0 | The lower boundary of the bootstrapped 95% confidence interval of the mean field value.
ci1 | The upper boundary of the bootstrapped 95% confidence interval of the mean field value.
min | The minimum field value.
max | The maximum field value.
argmin | An input data object containing the minimum field value.
argmax | An input data object containing the maximum field value.
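Different operations can be combined in one chart. As a small sketch, the spec below uses "min" on x and "max" on x2 to draw a range bar of Acceleration for each cylinder count; the choice of fields is illustrative.

```json
{
  "description": "Range bar from min to max Acceleration for each Cylinders value.",
  "data": { "url": "data/cars.json" },
  "mark": "bar",
  "encoding": {
    "x": {
      "aggregate": "min",
      "field": "Acceleration",
      "type": "quantitative"
    },
    "x2": {
      "aggregate": "max",
      "field": "Acceleration"
    },
    "y": { "field": "Cylinders", "type": "ordinal" }
  }
}
```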

Usage of argmax / argmin

This example shows the Production Budget of the movie that has the highest US Gross in each major genre.

{
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "aggregate": {"argmax": "US Gross"},
      "field": "Production Budget",
      "type": "quantitative"
    },
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
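The same result can be sketched with an aggregate transform. Here argmax returns the whole input data object, so the encoding reads the nested Production Budget value from it using bracket notation. The output name argmax_US_Gross is an arbitrary choice.

```json
{
  "description": "argmax in a transform: the output field holds the full record with the highest US Gross per genre.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "aggregate": [{
        "op": "argmax",
        "field": "US Gross",
        "as": "argmax_US_Gross"
      }],
      "groupby": ["Major Genre"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "argmax_US_Gross['Production Budget']",
      "type": "quantitative"
    },
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```

Replacing "argmax" with "argmin" would pick the movie with the lowest US Gross in each genre instead.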

FAQs

1. What's the difference between using the encoding field and the transform array for aggregation?

Using the encoding field for aggregation directly ties the summary statistic to a visual property. In contrast, the transform array performs the aggregation before mapping the data to the visual encoding, allowing more complex preprocessing of data.
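As one sketch of such preprocessing, the spec below counts cars per cylinder count in a transform and then filters on the aggregated result, which an encoding-level aggregate alone cannot express. The output name num_cars and the threshold of 10 are illustrative assumptions.

```json
{
  "description": "Aggregate first, then filter on the aggregated field before encoding.",
  "data": { "url": "data/cars.json" },
  "transform": [
    {
      "aggregate": [{ "op": "count", "as": "num_cars" }],
      "groupby": ["Cylinders"]
    },
    { "filter": "datum.num_cars >= 10" }
  ],
  "mark": "bar",
  "encoding": {
    "x": { "field": "Cylinders", "type": "ordinal" },
    "y": { "field": "num_cars", "type": "quantitative" }
  }
}
```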