Maryam Alavi

Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a statistical technique used to understand the relationship between two multivariate data sets. Instead of looking at single variables in isolation, CCA helps you uncover how two groups of variables move together.

Typical examples include:

  • Relating demographic variables (age, income, education) to spending behavior (online spending, in-store spending, savings rate).
  • Relating brain imaging signals to cognitive test scores.
  • Relating marketing campaign metrics to sales performance.

In all of these cases, the goal is not to correlate one variable with another but to understand how whole sets of variables are related.


What is Canonical Correlation Analysis?

Canonical Correlation Analysis measures the linear relationship between two multidimensional random variables, often called X and Y.

  • In multiple regression, you usually relate one outcome to many predictors.
  • In CCA, you relate many variables in X to many variables in Y at the same time.

CCA finds linear combinations (weighted sums) of variables in each set:

  • One combination for X:
    \( U = a_1 X_1 + a_2 X_2 + \dots + a_p X_p \)
  • One combination for Y:
    \( V = b_1 Y_1 + b_2 Y_2 + \dots + b_q Y_q \)

such that the correlation between U and V is as large as possible.
These linear combinations \(U\) and \(V\) are called canonical variates (or canonical components).

CCA then finds the next pair of canonical variates: the pair with the largest possible correlation among combinations that are uncorrelated with every earlier pair. This continues for at most min(p, q) pairs.
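
Written as an optimization problem, the first pair of weight vectors solves the following. The symbols \(\Sigma_{XX}\), \(\Sigma_{YY}\), and \(\Sigma_{XY}\) are notation introduced here for the within-set and between-set covariance matrices; they are an assumption of this sketch and are not used elsewhere in the article.

```latex
\[
\rho_1 = \max_{a,\,b}\ \operatorname{corr}(U, V)
       = \max_{a,\,b}\ \frac{a^\top \Sigma_{XY}\, b}
                             {\sqrt{a^\top \Sigma_{XX}\, a}\ \sqrt{b^\top \Sigma_{YY}\, b}},
\qquad U = a^\top X,\quad V = b^\top Y .
\]
```

Because rescaling \(a\) or \(b\) leaves this ratio unchanged, the weights are usually normalized so that \(U\) and \(V\) each have unit variance.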


How Does Canonical Correlation Analysis Work?

At a high level, CCA follows these steps:

  1. Standardization
    Standardize each variable to have mean 0 and variance 1.
    Rescaling does not change the canonical correlations themselves, but it makes the weights comparable across variables measured on different scales (e.g., dollars vs. years).

  2. Covariance Matrices
    Compute the covariance matrices within and between the two sets of variables:

    • Covariance of X with itself,
    • Covariance of Y with itself,
    • Covariance between X and Y.
  3. Eigenvalue / Singular Value Decomposition
    Solve an eigenvalue problem (or SVD in many implementations) involving these covariance matrices to obtain:

    • Canonical correlations (one per component; these are the singular values, or the square roots of the eigenvalues),
    • Canonical weights (coefficients that define the canonical variates).
  4. Canonical Correlation Coefficients
    Each canonical component pair has an associated canonical correlation between its U and V.

    • Values close to 1 indicate a strong relationship,
    • Values near 0 indicate a weak relationship.
  5. Canonical Variates and Loadings
    The weights are used to compute the canonical variates (U, V).
    You can also look at loadings (correlations between the original variables and the canonical variates) to interpret which original variables contribute the most.
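
The five steps above can be sketched in plain NumPy. This is a minimal illustration of the classical closed-form solution (whitening followed by an SVD), not the routine scikit-learn runs internally; the helper names `inv_sqrt` and `cca_closed_form` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T


def cca_closed_form(X, Y):
    """Classical CCA via whitening + SVD (illustrative helper, not a library API)."""
    n = X.shape[0]
    # Step 1: standardize each column (mean 0, variance 1)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    # Step 2: within-set and between-set covariance matrices
    Sxx = Xs.T @ Xs / (n - 1)
    Syy = Ys.T @ Ys / (n - 1)
    Sxy = Xs.T @ Ys / (n - 1)
    # Step 3: SVD of the whitened cross-covariance matrix
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    left, corrs, right_t = np.linalg.svd(Kx @ Sxy @ Ky, full_matrices=False)
    # Step 4: the singular values are the canonical correlations.
    # Step 5: un-whitening the singular vectors gives the canonical weights.
    A, B = Kx @ left, Ky @ right_t.T
    x_variates, y_variates = Xs @ A, Ys @ B
    # Loadings: both sides have unit variance, so covariance equals correlation
    x_loadings = Xs.T @ x_variates / (n - 1)
    return corrs, A, B, x_variates, y_variates, x_loadings


# Synthetic data: one shared latent factor, 3 variables in X and 2 in Y
rng = np.random.RandomState(0)
n = 500
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(scale=0.5, size=n) for _ in range(3)])
Y = np.column_stack([latent + rng.normal(scale=0.5, size=n) for _ in range(2)])

corrs, A, B, x_var, y_var, x_load = cca_closed_form(X, Y)
print("Canonical correlations:", np.round(corrs, 3))
print("X loadings on component 1:", np.round(x_load[:, 0], 3))
```

With 3 variables in X and 2 in Y, the sketch returns min(3, 2) = 2 canonical correlations, sorted from largest to smallest.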


When Should You Use CCA?

CCA is useful when:

  • You have two sets of variables measured on the same observations.
  • You suspect there is a multivariate relationship between the two sets.
  • You care about patterns across groups of variables, not just one-to-one correlations.

Typical use cases:

  • Neuroscience: relate brain imaging features (X) to behavioral scores (Y).
  • Marketing analytics: relate channel engagement metrics (X) to revenue metrics (Y).
  • Education: relate study habits (X) to performance outcomes (Y).

Canonical Correlation Analysis in Python (with scikit-learn)

Step-by-Step Tutorial

Let’s walk through a complete example of Canonical Correlation Analysis in Python using scikit-learn.

We’ll simulate two sets of variables:

  • X: study behavior features (hours of study, number of practice tests, attendance).
  • Y: exam performance features (midterm score, final score, project score).

We’ll make them correlated so that CCA can recover meaningful relationships.

Note: This is a toy example to illustrate how to run CCA and interpret its output. In real projects, you’d replace the synthetic data with your own.


Step 1: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import StandardScaler

Step 2: Create Sample Data

We create a synthetic dataset where X and Y are related through a latent “ability” factor.

# Reproducibility
rng = np.random.RandomState(42)
n_samples = 200
 
# Latent "ability" factor
ability = rng.normal(size=n_samples)
 
# X: study behavior (3 variables)
X = np.column_stack([
    ability + rng.normal(scale=0.5, size=n_samples),  # hours of study
    ability + rng.normal(scale=0.7, size=n_samples),  # practice tests
    ability + rng.normal(scale=0.6, size=n_samples),  # attendance
])
 
# Y: performance (3 variables)
Y = np.column_stack([
    ability + rng.normal(scale=0.5, size=n_samples),  # midterm score
    ability + rng.normal(scale=0.5, size=n_samples),  # final score
    ability + rng.normal(scale=0.5, size=n_samples),  # project score
])
 
print(X.shape, Y.shape)  # (200, 3) (200, 3)
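
Before running CCA, it can help to confirm that the two blocks really are related. The sketch below regenerates the same synthetic data so that it runs on its own, then prints the pairwise correlations between each X variable and each Y variable. The short variable labels are made up for display only.

```python
import numpy as np

# Same recipe as above, repeated so this snippet is self-contained
rng = np.random.RandomState(42)
n_samples = 200
ability = rng.normal(size=n_samples)
X = np.column_stack([ability + rng.normal(scale=s, size=n_samples) for s in (0.5, 0.7, 0.6)])
Y = np.column_stack([ability + rng.normal(scale=s, size=n_samples) for s in (0.5, 0.5, 0.5)])

x_names = ["hours", "practice_tests", "attendance"]  # display labels (assumed)
y_names = ["midterm", "final", "project"]

# np.corrcoef stacks the variables of both blocks into one matrix;
# the top-right 3x3 block holds the X-versus-Y correlations.
cross_corr = np.corrcoef(X.T, Y.T)[:3, 3:]

print(" " * 16 + "  ".join(f"{name:>8}" for name in y_names))
for name, row in zip(x_names, cross_corr):
    print(f"{name:>15} " + "  ".join(f"{r:8.2f}" for r in row))
```

Since every variable is the unit-variance ability factor plus independent noise, the expected correlation between an X variable with noise scale s_x and a Y variable with noise scale s_y is 1 / sqrt((1 + s_x²)(1 + s_y²)), which is 0.8 when both scales are 0.5.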

Step 3: Standardize Each Block

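Although classical CCA only requires centered data, standardizing each block keeps the weights comparable when variables are on different scales. scikit-learn’s CCA also centers and scales its inputs by default (scale=True), so this explicit step mainly makes the preprocessing visible and gives you standardized arrays for computing loadings later.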

scaler_x = StandardScaler()
scaler_y = StandardScaler()
 
X_scaled = scaler_x.fit_transform(X)
Y_scaled = scaler_y.fit_transform(Y)
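
A small check of what rescaling does and does not change. The sketch below uses NumPy only. `canonical_correlations` is an illustrative helper (not a scikit-learn function) that computes the canonical correlations from QR decompositions of the centered blocks, and the data and scale factors are made up for the demonstration.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations as singular values of Qx^T Qy (illustrative helper)."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))  # orthonormal basis for centered X
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))  # orthonormal basis for centered Y
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)


rng = np.random.RandomState(0)
n = 300
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([latent + rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)])

# Put the columns on very different scales (think dollars vs. years)
X_rescaled = X * np.array([12_000.0, 8.0])
Y_rescaled = Y * np.array([100.0, 0.01, 5.0])

r_original = canonical_correlations(X, Y)
r_rescaled = canonical_correlations(X_rescaled, Y_rescaled)

print("Original scale:", np.round(r_original, 4))
print("Rescaled:      ", np.round(r_rescaled, 4))
```

The two sets of correlations agree up to floating-point error. The canonical weights do change with the units, which is why standardizing first makes them easier to compare.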

Step 4: Initialize the CCA Model

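We’ll ask for two canonical components. With three variables in each block, up to three are possible, but the data has only one shared factor, so the second correlation should be much weaker than the first.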

cca = CCA(n_components=2)

Step 5: Fit the Model and Transform the Data

X_c, Y_c = cca.fit_transform(X_scaled, Y_scaled)
 
print("Shape of canonical variates for X:", X_c.shape)
print("Shape of canonical variates for Y:", Y_c.shape)

  • X_c and Y_c contain the canonical variates (U and V) for each observation.
  • Each column is a canonical component; column i of X_c pairs with column i of Y_c.

To compute the canonical correlations:

corrs = []
for i in range(X_c.shape[1]):
    corr = np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1]
    corrs.append(corr)
 
print("Canonical correlations:", corrs)

Step 6: Visualize the Results

Let’s visualize the first pair of canonical variates (component 1).

plt.figure(figsize=(6, 5))
plt.scatter(X_c[:, 0], Y_c[:, 0], alpha=0.7)
plt.xlabel("First canonical variate of X")
plt.ylabel("First canonical variate of Y")
plt.title("Canonical Correlation: {:.3f}".format(corrs[0]))
plt.grid(True)
plt.tight_layout()
plt.show()

The scatter plot shows the relationship between the first canonical variate of X and the first canonical variate of Y. If the points fall in a tight band along the diagonal, the canonical correlation is high; a diffuse cloud indicates a weak relationship.


Step 7: Interpret the Results

To understand which original variables drive the relationship, inspect the weights and loadings.

# Canonical weights (how each original variable contributes)
x_weights = cca.x_weights_
y_weights = cca.y_weights_
 
print("X weights (study behavior):")
print(x_weights)
 
print("\nY weights (performance):")
print(y_weights)

Interpretation tips:

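  • Larger absolute values in a weight vector mean that variable contributes more to that canonical variate, provided the variables are standardized.
  • Signs are only meaningful relative to one another. Flipping the sign of both variates in a pair leaves their correlation unchanged, so compare signs within a component rather than reading a single sign in isolation.
  • In scikit-learn, x_rotations_ and y_rotations_ are the matrices that transform() applies to the data; beyond the first component they may differ from x_weights_ and y_weights_.
  • You can also compute loadings (correlation of each original variable with its canonical variate) for more interpretable results:

# Loadings for X: correlation between X and X_c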
x_loadings = np.corrcoef(X_scaled.T, X_c.T)[:X.shape[1], X.shape[1]:]
# Loadings for Y: correlation between Y and Y_c
y_loadings = np.corrcoef(Y_scaled.T, Y_c.T)[:Y.shape[1], Y.shape[1]:]
 
print("X loadings (variables vs canonical variates):")
print(x_loadings)
 
print("\nY loadings (variables vs canonical variates):")
print(y_loadings)

For our synthetic study example, you’ll typically see:

  • All study behavior variables loading positively on the first canonical variate of X.
  • All performance variables loading positively on the first canonical variate of Y.
  • A high canonical correlation for the first component, capturing the shared “ability” factor.

Interpreting CCA in Practice

When using CCA on real data, focus on:

  1. Magnitude of canonical correlations

    • High first correlation → strong multivariate relationship between the two sets.
    • Subsequent correlations are never larger: canonical correlations are ordered from largest to smallest by construction.
  2. Number of meaningful components

    • Not all canonical components are equally useful.
    • You might keep only those with sufficiently high correlation or that pass a statistical test (for example, a Wilks’ lambda test or a permutation test).
  3. Variable contributions (loadings/weights)

    • Which variables in each set contribute most to each canonical variate?
    • Do the patterns make domain sense (e.g., all “effort” variables in one dimension)?
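
One way to judge whether the first canonical correlation is larger than chance would produce is a permutation test: shuffle the rows of Y to break the pairing with X, recompute the correlation, and repeat. The sketch below is a minimal NumPy version on made-up data. `first_canonical_corr` is an illustrative helper, and the number of permutations is an arbitrary choice.

```python
import numpy as np


def first_canonical_corr(X, Y):
    """Largest canonical correlation via QR of the centered blocks (illustrative helper)."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0]


# Synthetic data with one shared latent factor
rng = np.random.RandomState(0)
n = 200
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(scale=0.5, size=n) for _ in range(3)])
Y = np.column_stack([latent + rng.normal(scale=0.5, size=n) for _ in range(3)])

observed = first_canonical_corr(X, Y)

# Null distribution: permuting the rows of Y destroys any X-Y relationship
n_perm = 500
null = np.array([
    first_canonical_corr(X, Y[rng.permutation(n)]) for _ in range(n_perm)
])

# Add-one correction keeps the estimated p-value strictly positive
p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)

print(f"Observed first canonical correlation: {observed:.3f}")
print(f"Largest value under permutation:      {null.max():.3f}")
print(f"Permutation p-value:                  {p_value:.4f}")
```

Note that even under the null, the first canonical correlation is not centered on zero, because CCA always picks the best-looking combination. That is why the comparison is against the permuted values rather than against zero.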

Limitations and Best Practices

Before using CCA, keep in mind:

  • Linearity: CCA captures linear relationships. Non-linear patterns may require kernel CCA or other methods.

  • Sample size vs. number of variables: If you have many variables and few observations, classic CCA can overfit. Consider:

    • Reducing dimensionality first (e.g., PCA),
    • Using regularized CCA.
  • Multicollinearity: Highly correlated variables within each set can make interpretation harder. Feature selection or regularization can help.

  • Interpretation requires domain knowledge: Statistical significance doesn’t automatically mean practical importance, so always interpret results in context.
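
The overfitting risk is easy to reproduce. In the sketch below, X and Y are independent noise, so the true canonical correlation is zero, yet with 40 observations and 15 variables per block the in-sample value comes out high. The ridge term `reg` added to the within-set covariance matrices is one common form of regularized CCA. The helper names, the value of `reg`, and the data sizes are assumptions chosen for illustration.

```python
import numpy as np


def fit_first_pair(X, Y, reg=0.0):
    """First pair of canonical weights via whitening + SVD, with an optional ridge term."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return (vecs / np.sqrt(vals)) @ vecs.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    left, _, right_t = np.linalg.svd(Kx @ Sxy @ Ky)
    return Kx @ left[:, 0], Ky @ right_t[0]  # weight vectors a and b


def projected_corr(X, Y, a, b):
    """Pearson correlation between the projections X @ a and Y @ b."""
    return np.corrcoef(X @ a, Y @ b)[0, 1]


rng = np.random.RandomState(0)
p = q = 15
X_train, Y_train = rng.normal(size=(40, p)), rng.normal(size=(40, q))
X_test, Y_test = rng.normal(size=(1000, p)), rng.normal(size=(1000, q))

# Unregularized fit: high in-sample correlation from pure noise
a, b = fit_first_pair(X_train, Y_train)
r_train = projected_corr(X_train, Y_train, a, b)
r_test = projected_corr(X_test, Y_test, a, b)

# Ridge-regularized fit on the same training data
a_reg, b_reg = fit_first_pair(X_train, Y_train, reg=1.0)
r_train_reg = projected_corr(X_train, Y_train, a_reg, b_reg)

print(f"In-sample correlation (no regularization): {r_train:.3f}")
print(f"Held-out correlation  (no regularization): {r_test:.3f}")
print(f"In-sample correlation (reg=1.0):           {r_train_reg:.3f}")
```

The gap between the in-sample and held-out values is the signature of overfitting. Regularization lowers the in-sample correlation, and in practice `reg` would be chosen by cross-validation rather than fixed in advance.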


Conclusion

Canonical Correlation Analysis is a powerful tool for exploring relationships between two sets of variables, going beyond simple pairwise correlations.

With Python and scikit-learn, implementing CCA is straightforward:

  1. Prepare and standardize your data,
  2. Fit a CCA model,
  3. Examine canonical correlations,
  4. Interpret weights and loadings to understand which variables drive the relationship.

Whether you’re analyzing socio-economic indicators vs. outcomes, study behavior vs. performance, or any other paired multivariate data, CCA can help you uncover shared structure and meaningful patterns between the two sides.