Cell-Type Classification with HIPPIE

SpikeLab has optional integration with HIPPIE, a pretrained multimodal generative model for neuron classification. HIPPIE encodes each neuron’s waveform, interspike-interval distribution, and autocorrelogram into a shared 30-D latent space, then uses UMAP + HDBSCAN for unsupervised cell-type discovery.

Installation

HIPPIE is an optional dependency — install it alongside SpikeLab:

pip install "spikelab[hippie]"

This pulls in PyTorch, HuggingFace Hub, umap-learn, and hdbscan in addition to the HIPPIE package itself. PyTorch with CUDA must be installed separately if GPU inference is desired.

Note

Nothing in the base SpikeLab install is affected. The HIPPIE adapter is never imported unless you explicitly call it.
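
Scripts that run in both kinds of environment may want to check for the optional dependency before importing the adapter. A minimal sketch, assuming the HIPPIE package is importable under the top-level name `hippie` (as in the direct-API examples in this document); `has_module` is a hypothetical helper, not part of SpikeLab:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("hippie"):
    print("HIPPIE is installed; classify_neurons can be used")
else:
    print('HIPPIE not found; install it with: pip install "spikelab[hippie]"')
```

Using `find_spec` rather than a bare `import` keeps the check cheap, since nothing heavy such as PyTorch is loaded just to answer the question.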

Data requirements

HIPPIE requires three features per neuron:

Average waveform — stored as avg_waveform in neuron_attributes
Spike trains — always present in a SpikeData object
Recording technology — passed as tech_id at call time

The waveform is the only thing that may need preparation. The three pipelines below cover the most common starting points.

Pipeline A — Kilosort output + raw `.bin` file

This is the typical Neuropixels + Kilosort4 workflow. Kilosort gives you spike times only; waveforms are extracted from the raw voltage trace afterward.

Note

Attaching raw data to a SpikeData object is currently a Python-only step — there is no MCP tool for it. Use this path when scripting directly.

import numpy as np
from spikelab.data_loaders import load_spikedata_from_kilosort
from spikelab.spikedata.hippie_adapter import classify_neurons

# 1. Load spike times from Kilosort output directory
sd = load_spikedata_from_kilosort(
    folder_path="/path/to/kilosort_output/",
    fs_Hz=30000,                       # Neuropixels default
    cluster_info_tsv="cluster_info.tsv",
    include_noise=False,
)

# 2. Attach the raw voltage recording
#    SpikeLab expects shape (n_channels, n_samples). SpikeGLX-style .bin files
#    are written sample-major (all channels for sample 0, then sample 1, ...),
#    so map the file as (n_samples, n_channels) and transpose the view.
raw = np.memmap(
    "/path/to/recording.ap.bin",
    dtype="int16",
    mode="r",
).reshape(-1, 385).T                   # adjust 385 to your channel count
#    Note: astype() copies the whole recording into RAM as float32.
sd.raw_data = raw.astype(np.float32)
sd.raw_time = 30.0                     # sampling rate in kHz (30 000 Hz)

# 3. Extract average waveforms for all units in one call.
#    store=True writes avg_waveform into neuron_attributes automatically.
sd.get_waveform_traces(
    unit=None,                         # None = all units
    ms_before=1.0,
    ms_after=2.0,
    store=True,
)

# 4. Run HIPPIE: embed → UMAP → HDBSCAN
result = classify_neurons(
    sd,
    tech_id="neuropixels",             # or tech_id=0
    run_umap=True,
    run_hdbscan=True,
    hdbscan_kwargs={"min_cluster_size": 5},
)

# 5. Store results back into neuron_attributes
sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

n_clustered = (result["cluster_labels"] >= 0).sum()   # neurons assigned to any cluster
n_noise = (result["cluster_labels"] < 0).sum()        # neurons labelled -1
print(f"{sd.N} neurons → {n_clustered} clustered, {n_noise} noise")

What `get_waveform_traces` does in step 3

For every unit it finds the peak channel (from neuron_to_channel_map), extracts a short voltage snippet around each spike, averages them, and stores the average as neuron_attributes[i]["avg_waveform"]. The adapter then reads those stored values — no raw data is needed after this point.
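
The averaging step described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not SpikeLab's implementation: `average_waveform` is a hypothetical helper, the trace is a single already-selected peak channel, and spikes whose window would run past the edges of the recording are simply skipped.

```python
import numpy as np


def average_waveform(trace, spike_samples, fs_hz, ms_before=1.0, ms_after=2.0):
    """Average voltage snippets around each spike on one channel.

    trace         : 1-D array, voltage on the unit's peak channel
    spike_samples : spike times as integer sample indices
    fs_hz         : sampling rate in Hz
    """
    n_before = int(round(ms_before * fs_hz / 1000.0))
    n_after = int(round(ms_after * fs_hz / 1000.0))
    snippets = []
    for s in spike_samples:
        start, stop = s - n_before, s + n_after
        if start < 0 or stop > trace.shape[0]:
            continue  # skip spikes whose window runs off the recording
        snippets.append(trace[start:stop])
    if not snippets:
        return None  # no usable spikes for this unit
    return np.mean(snippets, axis=0)


# Synthetic check: a flat trace with an identical deflection at each spike
fs_hz = 30000
trace = np.zeros(fs_hz, dtype=np.float32)       # 1 s of signal
spikes = np.array([3000, 9000, 15000, 21000])
trace[spikes] = -50.0
avg = average_waveform(trace, spikes, fs_hz)
print(avg.shape, avg.min())
```

With the default window at 30 kHz, each snippet is 90 samples long (30 before the spike, 60 after), and the spike itself sits at index 30 of the average.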

Pipeline B — NWB file with raw traces

NWB files produced by SpikeInterface, the Allen Brain Atlas pipeline, or similar tools often embed both spike times and raw traces in a single file. This is the only path that works end-to-end from the MCP / agent interface.

Python

from spikelab.data_loaders import load_spikedata_from_nwb
from spikelab.spikedata.hippie_adapter import classify_neurons

sd = load_spikedata_from_nwb("/path/to/recording.nwb")

# Extract waveforms for all units
sd.get_waveform_traces(unit=None, ms_before=1.0, ms_after=2.0, store=True)

result = classify_neurons(sd, tech_id="neuropixels")
sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

MCP / agent

Give an agent these prompts in order:

"Load the NWB file at /path/to/recording.nwb"

"Extract waveforms for all N units with 1 ms before and 2 ms after the spike"
   (the agent will call get_waveform_traces once per unit)

"Classify the neurons using HIPPIE with tech_id 0 (neuropixels)"

"How many clusters did HIPPIE find?  List cluster IDs and neuron counts."

Note

The MCP get_waveform_traces tool extracts one unit at a time. For a recording with many units the agent needs to call it N times before HIPPIE can run. See Known limitation: MCP waveform extraction is per-unit below.

Pipeline C — Waveforms already available

If avg_waveform is already in neuron_attributes — e.g. loaded from an HDF5 workspace, set manually from an upstream pipeline, or computed in a previous session — skip straight to classification:

from spikelab.spikedata.hippie_adapter import classify_neurons

# sd already has avg_waveform in neuron_attributes
result = classify_neurons(sd, tech_id="neuropixels")

sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

To check whether waveforms are already present before classifying:

waves = sd.get_neuron_attribute("avg_waveform")
if waves is None or any(w is None for w in waves):
    print("Waveforms missing — run get_waveform_traces first")
else:
    print(f"Waveforms ready for {sd.N} units")

Quick start (waveforms already present)

from spikelab.spikedata.hippie_adapter import classify_neurons

result = classify_neurons(
    sd,
    tech_id="neuropixels",   # or 0, 1, 2, 3  — see Technology IDs below
    run_umap=True,
    run_hdbscan=True,
)

sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

Return values

Key	Shape	Description
`embeddings`	`(N, 30)`	Latent z_mean vectors from the HIPPIE encoder
`umap_coords`	`(N, 2)`	2-D UMAP projection (present when `run_umap=True`)
`cluster_labels`	`(N,)`	HDBSCAN cluster IDs; `-1` = noise / unclustered (present when `run_hdbscan=True`)
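
Because `-1` marks noise, the number of clusters is the number of distinct non-negative labels, not the number of labelled neurons. A short sketch, with made-up labels standing in for `result["cluster_labels"]`:

```python
import numpy as np

# Made-up labels standing in for result["cluster_labels"]
labels = np.array([0, 0, 1, -1, 2, 1, 0, -1, 2, 2])

clustered = labels[labels >= 0]
cluster_ids, counts = np.unique(clustered, return_counts=True)

n_clusters = cluster_ids.size            # distinct clusters
n_noise = int((labels < 0).sum())        # neurons labelled -1

for cid, n in zip(cluster_ids, counts):
    print(f"cluster {cid}: {n} neurons")
print(f"{n_clusters} clusters, {n_noise} noise neurons")
```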

Technology IDs

The pretrained checkpoint was trained on recordings from four technology families. Pass the matching tech_id for best results:

`tech_id`	Technology
`0` / `"neuropixels"`	Neuropixels probes (default)
`1` / `"silicon_probe"`	Silicon probes (non-Neuropixels)
`2` / `"juxtacellular"`	Juxtacellular recordings
`3` / `"tetrodes"`	Tetrode recordings
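
When one script processes recordings from several technologies, it can help to normalise the identifier before calling `classify_neurons`. The mapping below only restates the table above; `resolve_tech_id` is a hypothetical helper, not a SpikeLab function.

```python
TECH_IDS = {
    "neuropixels": 0,
    "silicon_probe": 1,
    "juxtacellular": 2,
    "tetrodes": 3,
}


def resolve_tech_id(tech):
    """Map a technology name or integer ID to its integer ID."""
    if isinstance(tech, str):
        key = tech.strip().lower()
        if key not in TECH_IDS:
            raise ValueError(
                f"unknown technology {tech!r}; expected one of {sorted(TECH_IDS)}"
            )
        return TECH_IDS[key]
    if tech in TECH_IDS.values():
        return int(tech)
    raise ValueError(f"unknown technology ID {tech!r}; expected 0-3")


print(resolve_tech_id("Tetrodes"), resolve_tech_id(1))
```

Failing early on an unrecognised name avoids silently embedding neurons under the wrong technology conditioning.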

Unsupervised VAE compression (no conditioning)

If you do not have cell-type or technology labels, or simply want to learn a compressed representation of your own dataset from scratch, use the unconditioned VAE pipeline. It trains the same ResNet18 + fusion encoder architecture as the pretrained model, but with all conditioning removed: there are no class embeddings and no technology embeddings. The only training signal is reconstruction plus KL divergence (the beta-VAE ELBO with beta=1, which is the standard VAE objective).

Results are stored as vae_embedding, vae_umap_x, vae_umap_y, and vae_cluster in neuron_attributes, keeping them separate from the pretrained-model attributes (hippie_embedding etc.).
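
The training signal described above (reconstruction plus KL divergence) can be written out explicitly. The sketch below computes a beta-VAE loss for a diagonal-Gaussian posterior in NumPy. Using squared error as the reconstruction term is an assumption made for illustration, since this page does not state which reconstruction loss HIPPIE uses, and `beta_vae_loss` is a hypothetical name.

```python
import numpy as np


def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """ELBO-style loss: reconstruction + beta * KL(q(z|x) || N(0, I))."""
    # Reconstruction term (assumed squared error, summed over features,
    # averaged over the batch)
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I)
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return recon + beta * kl, recon, kl


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50))          # e.g. 4 neurons, 50-sample waveforms
mu = np.ones((4, 30))                 # 30-D latent, posterior mean shifted to 1
logvar = np.zeros((4, 30))            # unit posterior variance
loss, recon, kl = beta_vae_loss(x, x, mu, logvar, beta=1.0)
print(loss, recon, kl)
```

With a perfect reconstruction the loss reduces to the KL term, which here is 0.5 per latent dimension. Setting `beta` above 1 would weight that term more heavily; the pipeline described here uses `beta=1`.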

Train and compress in Python

from spikelab.spikedata.hippie_adapter import (
    train_vae_on_spikedata,
    compress_neurons,
)

# Step 1 — train on your data (requires avg_waveform in neuron_attributes)
compressor = train_vae_on_spikedata(
    sd,
    output_dir="./my_vae",
    z_dim=30,          # latent dimensionality
    n_epochs=100,
    batch_size=256,
    device="cpu",      # or "cuda"
)
# Checkpoint saved to ./my_vae/vae_best.ckpt

# Step 2 — compress (can reuse the returned compressor, or reload later)
result = compress_neurons(sd, compressor, run_umap=True, run_hdbscan=True)

sd.set_neuron_attribute("vae_cluster",   result["cluster_labels"])
sd.set_neuron_attribute("vae_umap_x",    result["umap_coords"][:, 0])
sd.set_neuron_attribute("vae_umap_y",    result["umap_coords"][:, 1])
sd.set_neuron_attribute("vae_embedding", result["embeddings"])

Reload and compress new data later

from spikelab.spikedata.hippie_adapter import compress_neurons

# Load a previously trained checkpoint by path
result = compress_neurons(sd_new, "./my_vae/vae_best.ckpt")

Use the VAE API directly

from hippie.vae import train_vae, VAECompressor

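# Train. wave, isi, and acg are per-neuron feature arrays; they are assumed
# here to be preprocessed the same way as for HIPPIEClassifier
# (see hippie_adapter.extract_features()).
compressor = train_vae(wave, isi, acg, output_dir="./my_vae", z_dim=30, n_epochs=100)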

# Or load an existing checkpoint
compressor = VAECompressor.from_checkpoint("./my_vae/vae_best.ckpt")

# Encode
embeddings = compressor.get_embeddings(wave, isi, acg)  # (N, z_dim)
coords      = compressor.umap_reduce(embeddings)
labels      = compressor.hdbscan_cluster(coords, min_cluster_size=5)

MCP / agent

1. "Train an unconditioned VAE on the neurons in namespace 'probe0',
   saving to ./my_vae, with 50 epochs."
   → calls train_vae_hippie

2. "Compress the neurons in 'probe0' using the checkpoint at ./my_vae/vae_best.ckpt."
   → calls compress_neurons_hippie

3. "How many clusters did the VAE find?"

How it differs from the pretrained classifier

	Pretrained HIPPIE (`classify_neurons`)	Unconditioned VAE (`compress_neurons`)
Requires labels	No (inference only)	No (train & infer)
Requires `tech_id`	Yes	No
Trains on your data	No	Yes
Learns from 11 datasets	Yes (pretrained)	No (your data only)
Latent space shaped by	Cell types + technology	Reconstruction only
Best for	Known-technology recordings	Exploratory compression, novel datasets

Advanced options

Tuning UMAP and HDBSCAN

result = classify_neurons(
    sd,
    tech_id=0,
    umap_kwargs={"n_neighbors": 15, "min_dist": 0.05},
    hdbscan_kwargs={"min_cluster_size": 10, "min_samples": 5},
)

Embeddings only (no clustering)

Useful when you want to inspect the latent space before deciding on clustering parameters:

# tech_id is omitted here, so the default (0, Neuropixels) applies
result = classify_neurons(sd, run_umap=False, run_hdbscan=False)
embeddings = result["embeddings"]   # (N, 30)

Using the HIPPIE API directly

For full control over preprocessing or batching, use the HIPPIEClassifier class (hippie.inference.HIPPIEClassifier) directly:

from hippie import HIPPIEClassifier

clf = HIPPIEClassifier.from_pretrained("Jesusgf23/hippie", device="cpu")

# Inputs must be preprocessed — see hippie_adapter.extract_features()
# for the exact normalization applied to each modality.
embeddings = clf.get_embeddings(wave, isi, acg, tech_id=0)
coords      = clf.umap_reduce(embeddings, n_neighbors=30)
labels      = clf.hdbscan_cluster(coords, min_cluster_size=5)

# Load from a local checkpoint instead of HuggingFace
clf2 = HIPPIEClassifier.from_checkpoint("./my_trained_model.ckpt")

Using via the MCP server

The classify_neurons_hippie tool is available in the SpikeLab MCP server once spikelab[hippie] is installed. After the tool runs, it writes hippie_embedding, hippie_umap_x, hippie_umap_y, and hippie_cluster directly into neuron_attributes, making them accessible to all downstream tools.

Example agent prompts

"Classify the neurons in this recording using HIPPIE."

"Run HIPPIE cell-type classification with tech_id 1 (silicon probe)."

"Embed the neurons with HIPPIE and cluster with HDBSCAN, minimum cluster size 10."

"What cell types did HIPPIE find? List the cluster IDs and neuron counts."

"Plot the HIPPIE UMAP coloured by cluster label."

Known limitation: MCP waveform extraction is per-unit

The current get_waveform_traces MCP tool extracts waveforms for a single unit per call. For a recording with N neurons, an agent must call it N times before classify_neurons_hippie can run.

Workaround until a bulk extract_all_waveforms tool is added:

Use Pipeline A or B in Python, where get_waveform_traces(unit=None) processes all units in one call.
Or pre-compute waveforms in Python and save the workspace; the agent can then load it and run classify_neurons_hippie directly.

How it works

1. Feature extraction. For each neuron, SpikeLab computes:
   - Waveform (50 samples, min-max normalized to [-1, 1])
   - ISI histogram (100 log-spaced bins from 1 ms to 5 s, log(x+1) transformed, then min-max normalized)
   - Autocorrelogram (100 bins, 0–100 ms, min-max normalized)
2. Encoding. Three modality-specific ResNet18 encoders project each neuron’s features into a shared 30-D latent space, conditioned on the recording technology (tech_id).
3. UMAP. The 30-D embeddings are projected to 2-D using cosine-distance UMAP for visualization and clustering.
4. HDBSCAN. Density clusters are found in the 2-D UMAP space. Neurons that do not belong to any cluster receive label -1.
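
The feature-extraction step above can be sketched in NumPy. This is not SpikeLab's code (the exact normalization lives in hippie_adapter.extract_features()). It makes three assumptions: log(x+1) is applied to the bin counts, the ISI histogram and autocorrelogram are scaled to [0, 1], and the waveform is brought to 50 samples by linear interpolation. All function names are hypothetical.

```python
import numpy as np


def minmax(x, lo=0.0, hi=1.0):
    """Rescale x linearly to [lo, hi]; a constant input maps to lo."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.full_like(x, lo)
    return lo + (x - x.min()) * (hi - lo) / span


def waveform_feature(avg_waveform, n_samples=50):
    """Resample to 50 points (assumed linear interpolation), scale to [-1, 1]."""
    w = np.asarray(avg_waveform, dtype=float)
    x_new = np.linspace(0, w.size - 1, n_samples)
    return minmax(np.interp(x_new, np.arange(w.size), w), -1.0, 1.0)


def isi_feature(spike_times_s, n_bins=100):
    """ISI histogram on log-spaced bins from 1 ms to 5 s, log(x+1), min-max."""
    isi = np.diff(np.sort(spike_times_s))
    edges = np.logspace(np.log10(1e-3), np.log10(5.0), n_bins + 1)
    hist, _ = np.histogram(isi, bins=edges)
    return minmax(np.log1p(hist))


def acg_feature(spike_times_s, n_bins=100, window_s=0.1):
    """Autocorrelogram over positive lags up to 100 ms, min-max normalized."""
    t = np.sort(spike_times_s)
    lags = (t[None, :] - t[:, None]).ravel()     # all pairwise differences
    lags = lags[(lags > 0) & (lags <= window_s)]
    hist, _ = np.histogram(lags, bins=n_bins, range=(0.0, window_s))
    return minmax(hist)


# Synthetic neuron: Poisson-like spike train (~20 Hz) and a smooth trough
rng = np.random.default_rng(0)
spikes = np.cumsum(rng.exponential(0.05, size=500))
wave = waveform_feature(-np.hanning(90))
isi = isi_feature(spikes)
acg = acg_feature(spikes)
print(wave.shape, isi.shape, acg.shape)
```

The pairwise-difference autocorrelogram is quadratic in the number of spikes, which is fine for a demonstration but would need a windowed search for long recordings.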

Checkpoint

The pretrained model (hippie_techcond_v1.ckpt) is hosted at huggingface.co/Jesusgf23/hippie. It is downloaded automatically on first use and cached locally (HuggingFace default cache, or override with cache_dir). The file is 293 MB; subsequent calls use the local cache.

The model was pretrained on 11 labeled electrophysiology datasets spanning mouse, rat, and macaque across multiple brain regions and recording technologies.
