Cell-Type Classification with HIPPIE
SpikeLab has optional integration with HIPPIE, a pretrained multimodal generative model for neuron classification. HIPPIE encodes each neuron’s waveform, interspike-interval distribution, and autocorrelogram into a shared 30-D latent space, then uses UMAP + HDBSCAN for unsupervised cell-type discovery.
Installation
HIPPIE is an optional dependency — install it alongside SpikeLab:
pip install "spikelab[hippie]"
This pulls in PyTorch, HuggingFace Hub, umap-learn, and hdbscan in addition to the HIPPIE package itself. PyTorch with CUDA must be installed separately if GPU inference is desired.
Note
Nothing in the base SpikeLab install is affected. The HIPPIE adapter is never imported unless you explicitly call it.
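Before calling the adapter, it can help to confirm that the optional packages are importable. The sketch below uses only the standard library; the module names are assumed import names for the packages listed above (for example, umap-learn is imported as umap), and missing_optional_modules is a hypothetical helper, not part of SpikeLab.

```python
import importlib.util

# Assumed import names for the optional dependencies; the distribution name
# and the import name can differ (umap-learn is imported as `umap`).
OPTIONAL_MODULES = ["hippie", "torch", "huggingface_hub", "umap", "hdbscan"]


def missing_optional_modules(names=OPTIONAL_MODULES):
    """Return the module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_optional_modules()
if missing:
    print("Missing optional modules:", ", ".join(missing))
else:
    print("All HIPPIE dependencies appear to be installed")
```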
Data requirements
HIPPIE requires three features per neuron:
- Average waveform: stored as avg_waveform in neuron_attributes
- Spike trains: always present in a SpikeData object
- Recording technology: passed as tech_id at call time
The waveform is the only thing that may need preparation. The three pipelines below cover the most common starting points.
Pipeline A — Kilosort output + raw .bin file
This is the typical Neuropixels + Kilosort4 workflow. Kilosort gives you spike times only; waveforms are extracted from the raw voltage trace afterward.
Note
Attaching raw data to a SpikeData object is currently a Python-only
step — there is no MCP tool for it. Use this path when scripting directly.
import numpy as np
from spikelab.data_loaders import load_spikedata_from_kilosort
from spikelab.spikedata.hippie_adapter import classify_neurons
# 1. Load spike times from Kilosort output directory
sd = load_spikedata_from_kilosort(
folder_path="/path/to/kilosort_output/",
fs_Hz=30000, # Neuropixels default
cluster_info_tsv="cluster_info.tsv",
include_noise=False,
)
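# 2. Attach the raw voltage recording.
#    SpikeLab expects shape (n_channels, n_samples).
#    np.memmap avoids reading the file eagerly. This assumes the usual
#    SpikeGLX/Kilosort layout: int16, sample-major (n_samples, n_channels) on disk.
n_channels = 385  # adjust to your probe configuration
raw = np.memmap("/path/to/recording.ap.bin", dtype="int16", mode="r")
raw = raw.reshape(-1, n_channels).T  # view with shape (n_channels, n_samples)
# Note: astype() copies the whole recording into RAM as float32.
sd.raw_data = raw.astype(np.float32)
sd.raw_time = 30.0  # sampling rate in kHz (30 000 Hz)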
# 3. Extract average waveforms for all units in one call.
# store=True writes avg_waveform into neuron_attributes automatically.
sd.get_waveform_traces(
unit=None, # None = all units
ms_before=1.0,
ms_after=2.0,
store=True,
)
# 4. Run HIPPIE: embed → UMAP → HDBSCAN
result = classify_neurons(
sd,
tech_id="neuropixels", # or tech_id=0
run_umap=True,
run_hdbscan=True,
hdbscan_kwargs={"min_cluster_size": 5},
)
# 5. Store results back into neuron_attributes
sd.set_neuron_attribute("hippie_cluster", result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x", result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y", result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])
# Count neurons (not clusters): label -1 marks noise.
n_clustered = int((result["cluster_labels"] >= 0).sum())
n_noise = int((result["cluster_labels"] < 0).sum())
print(f"{sd.N} neurons → {n_clustered} clustered, {n_noise} noise")
What get_waveform_traces does in step 3
For every unit it finds the peak channel (from neuron_to_channel_map),
extracts a short voltage snippet around each spike, averages them, and
stores the average as neuron_attributes[i]["avg_waveform"]. The adapter
then reads those stored values — no raw data is needed after this point.
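The averaging step described above can be sketched in plain NumPy. This is an illustration of spike-triggered averaging on one channel, not SpikeLab's actual implementation: average_waveform is a hypothetical helper, spike times are assumed to be in milliseconds, and spikes too close to the edges of the recording are simply skipped.

```python
import numpy as np


def average_waveform(raw, spike_times_ms, channel, fs_kHz, ms_before=1.0, ms_after=2.0):
    """Spike-triggered average on one channel.

    raw: array of shape (n_channels, n_samples)
    spike_times_ms: spike times in milliseconds (assumed unit)
    fs_kHz: sampling rate in kHz, i.e. samples per millisecond
    """
    n_before = int(round(ms_before * fs_kHz))
    n_after = int(round(ms_after * fs_kHz))
    centers = np.round(np.asarray(spike_times_ms) * fs_kHz).astype(int)
    # Drop spikes whose window would run past either end of the recording.
    keep = (centers - n_before >= 0) & (centers + n_after <= raw.shape[1])
    centers = centers[keep]
    if centers.size == 0:
        return np.full(n_before + n_after, np.nan)
    snippets = np.stack([raw[channel, c - n_before : c + n_after] for c in centers])
    return snippets.mean(axis=0)


# Synthetic check: three one-sample "spikes" on channel 1 of a silent recording.
fs_kHz = 30.0
raw = np.zeros((2, 3000), dtype=np.float32)
spikes_ms = np.array([10.0, 40.0, 70.0])
for t in spikes_ms:
    raw[1, int(t * fs_kHz)] = -100.0
wf = average_waveform(raw, spikes_ms, channel=1, fs_kHz=fs_kHz)
print(wf.shape, int(wf.argmin()))
```

With ms_before=1.0 and ms_after=2.0 at 30 kHz, each snippet is 90 samples long and the trough sits 30 samples in.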
Pipeline B — NWB file with raw traces
NWB files produced by SpikeInterface, the Allen Brain Atlas pipeline, or similar tools often embed both spike times and raw traces in a single file. This is the only path that works end-to-end from the MCP / agent interface.
Python
from spikelab.data_loaders import load_spikedata_from_nwb
from spikelab.spikedata.hippie_adapter import classify_neurons
sd = load_spikedata_from_nwb("/path/to/recording.nwb")
# Extract waveforms for all units
sd.get_waveform_traces(unit=None, ms_before=1.0, ms_after=2.0, store=True)
result = classify_neurons(sd, tech_id="neuropixels")
sd.set_neuron_attribute("hippie_cluster", result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x", result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y", result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])
MCP / agent
Give an agent these prompts in order:
1. "Load the NWB file at /path/to/recording.nwb"
2. "Extract waveforms for all N units with 1 ms before and 2 ms after the spike"
(the agent will call get_waveform_traces once per unit)
3. "Classify the neurons using HIPPIE with tech_id 0 (neuropixels)"
4. "How many clusters did HIPPIE find? List cluster IDs and neuron counts."
Note
The MCP get_waveform_traces tool extracts one unit at a time.
For a recording with many units the agent needs to call it N times before
HIPPIE can run. See Known limitation: MCP waveform extraction is per-unit below.
Pipeline C — Waveforms already available
If avg_waveform is already in neuron_attributes — e.g. loaded from an
HDF5 workspace, set manually from an upstream pipeline, or computed in a
previous session — skip straight to classification:
from spikelab.spikedata.hippie_adapter import classify_neurons
# sd already has avg_waveform in neuron_attributes
result = classify_neurons(sd, tech_id="neuropixels")
sd.set_neuron_attribute("hippie_cluster", result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x", result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y", result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])
To check whether waveforms are already present before classifying:
waves = sd.get_neuron_attribute("avg_waveform")
if waves is None or any(w is None for w in waves):
    print("Waveforms missing: run get_waveform_traces first")
else:
    print(f"Waveforms ready for {sd.N} units")
Quick start (waveforms already present)
from spikelab.spikedata.hippie_adapter import classify_neurons
result = classify_neurons(
sd,
tech_id="neuropixels", # or 0, 1, 2, 3 — see Technology IDs below
run_umap=True,
run_hdbscan=True,
)
sd.set_neuron_attribute("hippie_cluster", result["cluster_labels"])
sd.set_neuron_attribute("hippie_umap_x", result["umap_coords"][:, 0])
sd.set_neuron_attribute("hippie_umap_y", result["umap_coords"][:, 1])
sd.set_neuron_attribute("hippie_embedding", result["embeddings"])
Return values
| Key | Shape | Description |
|---|---|---|
| embeddings | (N, 30) | Latent z_mean vectors from the HIPPIE encoder |
| umap_coords | (N, 2) | 2-D UMAP projection (present when run_umap=True) |
| cluster_labels | (N,) | HDBSCAN cluster IDs; -1 marks noise |
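To turn cluster_labels into cluster IDs and neuron counts, a small NumPy summary is enough. The sketch below assumes the labels are an integer array with -1 for noise, as described in this document; summarize_clusters is a hypothetical helper and the example labels are made up.

```python
import numpy as np


def summarize_clusters(cluster_labels):
    """Return ({cluster_id: n_neurons}, n_noise); negative labels count as noise."""
    labels = np.asarray(cluster_labels)
    ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    sizes = {int(i): int(c) for i, c in zip(ids, counts)}
    n_noise = int((labels < 0).sum())
    return sizes, n_noise


labels = np.array([0, 0, 1, -1, 1, 1, -1, 2])  # made-up example labels
sizes, n_noise = summarize_clusters(labels)
print(f"{len(sizes)} clusters, {n_noise} noise neurons: {sizes}")
```

In practice, pass result["cluster_labels"] instead of the example array.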
Technology IDs
The pretrained checkpoint was trained on recordings from four technology
families. Pass the matching tech_id for best results:
| tech_id | Technology |
|---|---|
| 0 | Neuropixels probes (default) |
| 1 | Silicon probes (non-Neuropixels) |
| 2 | Juxtacellular recordings |
| 3 | Tetrode recordings |
Unsupervised VAE compression (no conditioning)
If you do not have cell-type or technology labels, or simply want to learn a compressed representation of your own dataset from scratch, the unconditioned VAE pipeline trains the same ResNet18 + fusion encoder architecture as the pretrained model but with all conditioning removed. The only training signal is reconstruction + KL divergence (beta-VAE ELBO, beta=1) — no class embeddings, no technology embeddings.
Results are stored as vae_embedding, vae_umap_x, vae_umap_y,
and vae_cluster in neuron_attributes, keeping them separate from the
pretrained-model attributes (hippie_embedding etc.).
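For reference, the training objective mentioned above (reconstruction plus beta-weighted KL divergence) can be written out in a few lines of NumPy. This is a sketch under stated assumptions, not HIPPIE's training code: the reconstruction term is taken to be a summed squared error, the posterior is a diagonal Gaussian parameterized by z_mean and z_logvar, and beta_vae_loss is a hypothetical name.

```python
import numpy as np


def beta_vae_loss(x, x_recon, z_mean, z_logvar, beta=1.0):
    """Negative ELBO averaged over the batch.

    Reconstruction term: summed squared error per sample (an assumption here).
    KL term: KL(N(z_mean, exp(z_logvar)) || N(0, I)) in closed form.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)
    kl = -0.5 * np.sum(1.0 + z_logvar - z_mean**2 - np.exp(z_logvar), axis=1)
    return float(np.mean(recon + beta * kl))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50))          # stand-in batch of 4 feature vectors
z_mean = np.ones((4, 30))             # posterior means shifted away from the prior
z_logvar = np.zeros((4, 30))          # unit posterior variance
print(beta_vae_loss(x, x, z_mean, z_logvar, beta=1.0))
```

With a perfect reconstruction the loss reduces to the KL term, which grows as the posterior moves away from the standard normal prior.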
Train and compress in Python
from spikelab.spikedata.hippie_adapter import (
train_vae_on_spikedata,
compress_neurons,
)
# Step 1 — train on your data (requires avg_waveform in neuron_attributes)
compressor = train_vae_on_spikedata(
sd,
output_dir="./my_vae",
z_dim=30, # latent dimensionality
n_epochs=100,
batch_size=256,
device="cpu", # or "cuda"
)
# Checkpoint saved to ./my_vae/vae_best.ckpt
# Step 2 — compress (can reuse the returned compressor, or reload later)
result = compress_neurons(sd, compressor, run_umap=True, run_hdbscan=True)
sd.set_neuron_attribute("vae_cluster", result["cluster_labels"])
sd.set_neuron_attribute("vae_umap_x", result["umap_coords"][:, 0])
sd.set_neuron_attribute("vae_umap_y", result["umap_coords"][:, 1])
sd.set_neuron_attribute("vae_embedding", result["embeddings"])
Reload and compress new data later
from spikelab.spikedata.hippie_adapter import compress_neurons
# Load a previously trained checkpoint by path
result = compress_neurons(sd_new, "./my_vae/vae_best.ckpt")
Use the VAE API directly
from hippie.vae import train_vae, VAECompressor
# Train
compressor = train_vae(wave, isi, acg, output_dir="./my_vae", z_dim=30, n_epochs=100)
# Or load an existing checkpoint
compressor = VAECompressor.from_checkpoint("./my_vae/vae_best.ckpt")
# Encode
embeddings = compressor.get_embeddings(wave, isi, acg) # (N, z_dim)
coords = compressor.umap_reduce(embeddings)
labels = compressor.hdbscan_cluster(coords, min_cluster_size=5)
MCP / agent
1. "Train an unconditioned VAE on the neurons in namespace 'probe0',
saving to ./my_vae, with 50 epochs."
→ calls train_vae_hippie
2. "Compress the neurons in 'probe0' using the checkpoint at ./my_vae/vae_best.ckpt."
→ calls compress_neurons_hippie
3. "How many clusters did the VAE find?"
How it differs from the pretrained classifier
| | Pretrained HIPPIE (classify_neurons) | Unconditioned VAE (compress_neurons) |
|---|---|---|
| Requires labels | No (inference only) | No (train & infer) |
| Requires tech_id | Yes | No |
| Trains on your data | No | Yes |
| Learns from 11 datasets | Yes (pretrained) | No (your data only) |
| Latent space shaped by | Cell types + technology | Reconstruction only |
| Best for | Known-technology recordings | Exploratory compression, novel datasets |
Advanced options
Tuning UMAP and HDBSCAN
result = classify_neurons(
sd,
tech_id=0,
umap_kwargs={"n_neighbors": 15, "min_dist": 0.05},
hdbscan_kwargs={"min_cluster_size": 10, "min_samples": 5},
)
Embeddings only (no clustering)
Useful when you want to inspect the latent space before deciding on clustering parameters:
result = classify_neurons(sd, run_umap=False, run_hdbscan=False)
embeddings = result["embeddings"] # (N, 30)
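One way to inspect the latent space before clustering is to look at each neuron's nearest neighbours. The sketch below uses cosine distance, the metric this document says is used for the UMAP projection; the helper names are hypothetical and the random array only stands in for result["embeddings"].

```python
import numpy as np


def cosine_distance_matrix(embeddings):
    """Pairwise cosine distance (1 - cosine similarity) between rows."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return 1.0 - unit @ unit.T


def nearest_neighbors(embeddings, k=5):
    """Indices of the k nearest other neurons for each neuron."""
    dist = cosine_distance_matrix(embeddings)
    np.fill_diagonal(dist, np.inf)  # exclude self-matches
    return np.argsort(dist, axis=1)[:, :k]


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 30))  # stand-in for result["embeddings"]
neighbors = nearest_neighbors(fake_embeddings, k=3)
print(neighbors.shape)
```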
Using the HIPPIE API directly
For full control over preprocessing or batching, import
hippie.inference.HIPPIEClassifier directly:
from hippie import HIPPIEClassifier
clf = HIPPIEClassifier.from_pretrained("Jesusgf23/hippie", device="cpu")
# Inputs must be preprocessed — see hippie_adapter.extract_features()
# for the exact normalization applied to each modality.
embeddings = clf.get_embeddings(wave, isi, acg, tech_id=0)
coords = clf.umap_reduce(embeddings, n_neighbors=30)
labels = clf.hdbscan_cluster(coords, min_cluster_size=5)
# Load from a local checkpoint instead of HuggingFace
clf2 = HIPPIEClassifier.from_checkpoint("./my_trained_model.ckpt")
Using via the MCP server
The classify_neurons_hippie tool is available in the SpikeLab MCP
server once spikelab[hippie] is installed. After the tool runs, it
writes hippie_embedding, hippie_umap_x, hippie_umap_y, and
hippie_cluster directly into neuron_attributes, making them
accessible to all downstream tools.
Example agent prompts
"Classify the neurons in this recording using HIPPIE."
"Run HIPPIE cell-type classification with tech_id 1 (silicon probe)."
"Embed the neurons with HIPPIE and cluster with HDBSCAN, minimum cluster size 10."
"What cell types did HIPPIE find? List the cluster IDs and neuron counts."
"Plot the HIPPIE UMAP coloured by cluster label."
Known limitation: MCP waveform extraction is per-unit
The current get_waveform_traces MCP tool extracts waveforms for a
single unit per call. For a recording with N neurons, an agent must
call it N times before classify_neurons_hippie can run.
Workaround until a bulk extract_all_waveforms tool is added:
- Use Pipeline A or the Python path of Pipeline B, where get_waveform_traces(unit=None) processes all units in one call.
- Or pre-compute waveforms in Python and save the workspace; the agent can then load it and run classify_neurons_hippie directly.
How it works
1. Feature extraction: for each neuron, SpikeLab computes:
   - Waveform (50 samples, min-max normalized to [-1, 1])
   - ISI histogram (100 log-spaced bins from 1 ms to 5 s, log(x+1) transformed, then min-max normalized)
   - Autocorrelogram (100 bins, 0–100 ms, min-max normalized)
2. Encoding: three modality-specific ResNet18 encoders project each neuron’s features into a shared 30-D latent space, conditioned on the recording technology (tech_id).
3. UMAP: the 30-D embeddings are projected to 2-D using cosine-distance UMAP for visualization and clustering.
4. HDBSCAN: density clusters are found in the 2-D UMAP space. Neurons that do not belong to any cluster receive label -1.
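The feature-extraction step can be approximated in NumPy as follows. This is a rough sketch rather than the adapter's code (see hippie_adapter.extract_features() for the exact normalization). It assumes spike times in milliseconds, applies log(x+1) to the bin counts, scales the ISI and autocorrelogram features to [0, 1], and does not show resampling the waveform to 50 samples. All function names are hypothetical.

```python
import numpy as np


def minmax(x, lo=0.0, hi=1.0):
    """Rescale x linearly to [lo, hi]; a constant input maps to lo."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.full_like(x, lo)
    return lo + (hi - lo) * (x - x.min()) / span


def normalize_waveform(waveform):
    """Min-max normalize a waveform to [-1, 1]."""
    return minmax(waveform, -1.0, 1.0)


def isi_feature(spike_times_ms, n_bins=100):
    """Log-spaced ISI histogram from 1 ms to 5 s, log(x+1), then min-max."""
    isi = np.diff(np.sort(spike_times_ms))
    edges = np.logspace(np.log10(1.0), np.log10(5000.0), n_bins + 1)
    counts, _ = np.histogram(isi, bins=edges)
    return minmax(np.log1p(counts))


def acg_feature(spike_times_ms, n_bins=100, window_ms=100.0):
    """Autocorrelogram over lags from 0 to window_ms, min-max normalized."""
    t = np.sort(np.asarray(spike_times_ms, dtype=float))
    edges = np.linspace(0.0, window_ms, n_bins + 1)
    counts = np.zeros(n_bins)
    for i in range(t.size - 1):
        # Only later spikes within the window contribute lags for spike i.
        j = np.searchsorted(t, t[i] + window_ms, side="right")
        counts += np.histogram(t[i + 1 : j] - t[i], bins=edges)[0]
    return minmax(counts)


# A perfectly regular 100 Hz train: every ISI is 10 ms.
train_ms = np.arange(0.0, 1000.0, 10.0)
isi_vec = isi_feature(train_ms)
acg_vec = acg_feature(train_ms)
print(isi_vec.shape, acg_vec.shape, int(acg_vec.argmax()))
```

For the regular train, all ISI mass falls into a single bin and the autocorrelogram peaks at the 10 ms lag.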
Checkpoint
The pretrained model (hippie_techcond_v1.ckpt) is hosted at
huggingface.co/Jesusgf23/hippie. It is downloaded automatically on
first use and cached locally (HuggingFace default cache, or override with
cache_dir). The file is 293 MB; subsequent calls use the local cache.
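To see where the checkpoint lands when cache_dir is not overridden, the default Hugging Face Hub cache location can be resolved from environment variables. The sketch below mirrors the documented defaults of the Hub client (HF_HUB_CACHE, then HF_HOME/hub, then ~/.cache/huggingface/hub) using only the standard library; the helper names are hypothetical and the exact folder layout may vary between huggingface_hub versions.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None):
    """Resolve the default Hugging Face Hub cache directory from the environment."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base).expanduser() / "huggingface" / "hub"


def cached_repo_dir(repo_id, env=None):
    """Folder the Hub client uses for a model repo, e.g. models--user--name."""
    return hf_hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


repo_dir = cached_repo_dir("Jesusgf23/hippie")
print(repo_dir, "(exists)" if repo_dir.exists() else "(not downloaded yet)")
```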
The model was pretrained on 11 labeled electrophysiology datasets spanning mouse, rat, and macaque across multiple brain regions and recording technologies.