.. _hippie_classification:

Cell-Type Classification with HIPPIE
=====================================

SpikeLab has optional integration with `HIPPIE`_, a pretrained multimodal
generative model for neuron classification.  HIPPIE encodes each neuron's
**waveform**, **interspike-interval distribution**, and **autocorrelogram**
into a shared 30-D latent space, then uses UMAP + HDBSCAN for unsupervised
cell-type discovery.

.. _HIPPIE: https://huggingface.co/Jesusgf23/hippie

Installation
------------

HIPPIE is an optional dependency — install it alongside SpikeLab::

    pip install "spikelab[hippie]"

This pulls in PyTorch, HuggingFace Hub, umap-learn, and hdbscan in addition
to the HIPPIE package itself.  PyTorch with CUDA must be installed separately
if GPU inference is desired.

.. note::

   Nothing in the base SpikeLab install is affected.  The HIPPIE adapter is
   never imported unless you explicitly call it.

Data requirements
-----------------

HIPPIE requires three features per neuron:

* **Average waveform** — stored as ``avg_waveform`` in ``neuron_attributes``
* **Spike trains** — always present in a ``SpikeData`` object
* **Recording technology** — passed as ``tech_id`` at call time

The waveform is the only thing that may need preparation.  The three
pipelines below cover the most common starting points.

.. _hippie_pipeline_a:

Pipeline A — Kilosort output + raw ``.bin`` file
-------------------------------------------------

This is the typical Neuropixels + Kilosort4 workflow.  Kilosort gives you
spike times only; waveforms are extracted from the raw voltage trace
afterward.

.. note::

   Attaching raw data to a ``SpikeData`` object is currently a Python-only
   step — there is no MCP tool for it.  Use this path when scripting directly.

.. code-block:: python

    import numpy as np
    from spikelab.data_loaders import load_spikedata_from_kilosort
    from spikelab.spikedata.hippie_adapter import classify_neurons

    # 1. Load spike times from Kilosort output directory
    sd = load_spikedata_from_kilosort(
        folder_path="/path/to/kilosort_output/",
        fs_Hz=30000,                       # Neuropixels default
        cluster_info_tsv="cluster_info.tsv",
        include_noise=False,
    )

    # 2. Attach the raw voltage recording
    #    Shape must be (n_channels, n_samples).
    #    Use np.memmap for large files to avoid loading everything into RAM.
    raw = np.memmap(
        "/path/to/recording.ap.bin",
        dtype="int16",
        mode="r",
        shape=(385, n_samples),            # adjust n_channels and n_samples
    )
    sd.raw_data = raw.astype(np.float32)
    sd.raw_time = 30.0                     # sampling rate in kHz (30 000 Hz)

    # 3. Extract average waveforms for all units in one call.
    #    store=True writes avg_waveform into neuron_attributes automatically.
    sd.get_waveform_traces(
        unit=None,                         # None = all units
        ms_before=1.0,
        ms_after=2.0,
        store=True,
    )

    # 4. Run HIPPIE: embed → UMAP → HDBSCAN
    result = classify_neurons(
        sd,
        tech_id="neuropixels",             # or tech_id=0
        run_umap=True,
        run_hdbscan=True,
        hdbscan_kwargs={"min_cluster_size": 5},
    )

    # 5. Store results back into neuron_attributes
    sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
    sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
    sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
    sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

    n_clusters = (result["cluster_labels"] >= 0).sum()
    print(f"{sd.N} neurons → {n_clusters} clustered, "
          f"{(result['cluster_labels'] < 0).sum()} noise")

What ``get_waveform_traces`` does in step 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For every unit it finds the peak channel (from ``neuron_to_channel_map``),
extracts a short voltage snippet around each spike, averages them, and
stores the average as ``neuron_attributes[i]["avg_waveform"]``.  The adapter
then reads those stored values — no raw data is needed after this point.


.. _hippie_pipeline_b:

Pipeline B — NWB file with raw traces
--------------------------------------

NWB files produced by SpikeInterface, the Allen Brain Atlas pipeline, or
similar tools often embed both spike times and raw traces in a single file.
This is the only path that works end-to-end from the **MCP / agent interface**.

Python
~~~~~~

.. code-block:: python

    from spikelab.data_loaders import load_spikedata_from_nwb
    from spikelab.spikedata.hippie_adapter import classify_neurons

    sd = load_spikedata_from_nwb("/path/to/recording.nwb")

    # Extract waveforms for all units
    sd.get_waveform_traces(unit=None, ms_before=1.0, ms_after=2.0, store=True)

    result = classify_neurons(sd, tech_id="neuropixels")
    sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
    sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
    sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
    sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

MCP / agent
~~~~~~~~~~~

Give an agent these prompts in order:

.. code-block:: text

    1. "Load the NWB file at /path/to/recording.nwb"

    2. "Extract waveforms for all N units with 1 ms before and 2 ms after the spike"
       (the agent will call get_waveform_traces once per unit)

    3. "Classify the neurons using HIPPIE with tech_id 0 (neuropixels)"

    4. "How many clusters did HIPPIE find?  List cluster IDs and neuron counts."

.. note::

   The MCP ``get_waveform_traces`` tool extracts one unit at a time.
   For a recording with many units the agent needs to call it N times before
   HIPPIE can run.  See :ref:`hippie_mcp_gap` below.


.. _hippie_pipeline_c:

Pipeline C — Waveforms already available
-----------------------------------------

If ``avg_waveform`` is already in ``neuron_attributes`` — e.g. loaded from an
HDF5 workspace, set manually from an upstream pipeline, or computed in a
previous session — skip straight to classification:

.. code-block:: python

    from spikelab.spikedata.hippie_adapter import classify_neurons

    # sd already has avg_waveform in neuron_attributes
    result = classify_neurons(sd, tech_id="neuropixels")

    sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
    sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
    sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
    sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

To check whether waveforms are already present before trying:

.. code-block:: python

    waves = sd.get_neuron_attribute("avg_waveform")
    if waves is None or any(w is None for w in waves):
        print("Waveforms missing — run get_waveform_traces first")
    else:
        print(f"Waveforms ready for {sd.N} units")

Quick start (waveforms already present)
----------------------------------------

.. code-block:: python

    from spikelab.spikedata.hippie_adapter import classify_neurons

    result = classify_neurons(
        sd,
        tech_id="neuropixels",   # or 0, 1, 2, 3  — see Technology IDs below
        run_umap=True,
        run_hdbscan=True,
    )

    sd.set_neuron_attribute("hippie_cluster",   result["cluster_labels"])
    sd.set_neuron_attribute("hippie_umap_x",    result["umap_coords"][:, 0])
    sd.set_neuron_attribute("hippie_umap_y",    result["umap_coords"][:, 1])
    sd.set_neuron_attribute("hippie_embedding", result["embeddings"])

Return values
~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Key
     - Shape
     - Description
   * - ``embeddings``
     - ``(N, 30)``
     - Latent z_mean vectors from the HIPPIE encoder
   * - ``umap_coords``
     - ``(N, 2)``
     - 2-D UMAP projection (present when ``run_umap=True``)
   * - ``cluster_labels``
     - ``(N,)``
     - HDBSCAN cluster IDs; ``-1`` = noise / unclustered
       (present when ``run_hdbscan=True``)

Technology IDs
~~~~~~~~~~~~~~

The pretrained checkpoint was trained on recordings from four technology
families.  Pass the matching ``tech_id`` for best results:

.. list-table::
   :header-rows: 1
   :widths: 10 25

   * - ``tech_id``
     - Technology
   * - ``0`` / ``"neuropixels"``
     - Neuropixels probes *(default)*
   * - ``1`` / ``"silicon_probe"``
     - Silicon probes (non-Neuropixels)
   * - ``2`` / ``"juxtacellular"``
     - Juxtacellular recordings
   * - ``3`` / ``"tetrodes"``
     - Tetrode recordings

Unsupervised VAE compression (no conditioning)
----------------------------------------------

If you do not have cell-type or technology labels, or simply want to learn
a compressed representation of *your own dataset* from scratch, the
unconditioned VAE pipeline trains the same ResNet18 + fusion encoder
architecture as the pretrained model but with all conditioning removed.
The only training signal is reconstruction + KL divergence (beta-VAE ELBO,
beta=1) — no class embeddings, no technology embeddings.

Results are stored as ``vae_embedding``, ``vae_umap_x``, ``vae_umap_y``,
and ``vae_cluster`` in ``neuron_attributes``, keeping them separate from the
pretrained-model attributes (``hippie_embedding`` etc.).

Train and compress in Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from spikelab.spikedata.hippie_adapter import (
        train_vae_on_spikedata,
        compress_neurons,
    )

    # Step 1 — train on your data (requires avg_waveform in neuron_attributes)
    compressor = train_vae_on_spikedata(
        sd,
        output_dir="./my_vae",
        z_dim=30,          # latent dimensionality
        n_epochs=100,
        batch_size=256,
        device="cpu",      # or "cuda"
    )
    # Checkpoint saved to ./my_vae/vae_best.ckpt

    # Step 2 — compress (can reuse the returned compressor, or reload later)
    result = compress_neurons(sd, compressor, run_umap=True, run_hdbscan=True)

    sd.set_neuron_attribute("vae_cluster",   result["cluster_labels"])
    sd.set_neuron_attribute("vae_umap_x",    result["umap_coords"][:, 0])
    sd.set_neuron_attribute("vae_umap_y",    result["umap_coords"][:, 1])
    sd.set_neuron_attribute("vae_embedding", result["embeddings"])

Reload and compress new data later
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from spikelab.spikedata.hippie_adapter import compress_neurons

    # Load a previously trained checkpoint by path
    result = compress_neurons(sd_new, "./my_vae/vae_best.ckpt")

Use the VAE API directly
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from hippie.vae import train_vae, VAECompressor

    # Train
    compressor = train_vae(wave, isi, acg, output_dir="./my_vae", z_dim=30, n_epochs=100)

    # Or load an existing checkpoint
    compressor = VAECompressor.from_checkpoint("./my_vae/vae_best.ckpt")

    # Encode
    embeddings = compressor.get_embeddings(wave, isi, acg)  # (N, z_dim)
    coords      = compressor.umap_reduce(embeddings)
    labels      = compressor.hdbscan_cluster(coords, min_cluster_size=5)

MCP / agent
~~~~~~~~~~~

.. code-block:: text

    1. "Train an unconditioned VAE on the neurons in namespace 'probe0',
       saving to ./my_vae, with 50 epochs."
       → calls train_vae_hippie

    2. "Compress the neurons in 'probe0' using the checkpoint at ./my_vae/vae_best.ckpt."
       → calls compress_neurons_hippie

    3. "How many clusters did the VAE find?"

How it differs from the pretrained classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * -
     - Pretrained HIPPIE (``classify_neurons``)
     - Unconditioned VAE (``compress_neurons``)
   * - Requires labels
     - No (inference only)
     - No (train & infer)
   * - Requires ``tech_id``
     - Yes
     - No
   * - Trains on your data
     - No
     - Yes
   * - Learns from 11 datasets
     - Yes (pretrained)
     - No (your data only)
   * - Latent space shaped by
     - Cell types + technology
     - Reconstruction only
   * - Best for
     - Known-technology recordings
     - Exploratory compression, novel datasets

Advanced options
----------------

Tuning UMAP and HDBSCAN
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    result = classify_neurons(
        sd,
        tech_id=0,
        umap_kwargs={"n_neighbors": 15, "min_dist": 0.05},
        hdbscan_kwargs={"min_cluster_size": 10, "min_samples": 5},
    )

Embeddings only (no clustering)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Useful when you want to inspect the latent space before deciding on
clustering parameters:

.. code-block:: python

    result = classify_neurons(sd, run_umap=False, run_hdbscan=False)
    embeddings = result["embeddings"]   # (N, 30)

Using the HIPPIE API directly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For full control over preprocessing or batching, import
:class:`hippie.inference.HIPPIEClassifier` directly:

.. code-block:: python

    from hippie import HIPPIEClassifier

    clf = HIPPIEClassifier.from_pretrained("Jesusgf23/hippie", device="cpu")

    # Inputs must be preprocessed — see hippie_adapter.extract_features()
    # for the exact normalization applied to each modality.
    embeddings = clf.get_embeddings(wave, isi, acg, tech_id=0)
    coords      = clf.umap_reduce(embeddings, n_neighbors=30)
    labels      = clf.hdbscan_cluster(coords, min_cluster_size=5)

    # Load from a local checkpoint instead of HuggingFace
    clf2 = HIPPIEClassifier.from_checkpoint("./my_trained_model.ckpt")

Using via the MCP server
------------------------

The ``classify_neurons_hippie`` tool is available in the SpikeLab MCP
server once ``spikelab[hippie]`` is installed.  After the tool runs, it
writes ``hippie_embedding``, ``hippie_umap_x``, ``hippie_umap_y``, and
``hippie_cluster`` directly into ``neuron_attributes``, making them
accessible to all downstream tools.

Example agent prompts
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

    "Classify the neurons in this recording using HIPPIE."

    "Run HIPPIE cell-type classification with tech_id 1 (silicon probe)."

    "Embed the neurons with HIPPIE and cluster with HDBSCAN, minimum cluster size 10."

    "What cell types did HIPPIE find? List the cluster IDs and neuron counts."

    "Plot the HIPPIE UMAP coloured by cluster label."

.. _hippie_mcp_gap:

Known limitation: MCP waveform extraction is per-unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The current ``get_waveform_traces`` MCP tool extracts waveforms for a
**single unit** per call.  For a recording with N neurons, an agent must
call it N times before ``classify_neurons_hippie`` can run.

Workaround until a bulk ``extract_all_waveforms`` tool is added:

* Use Pipeline A or C in Python, where ``get_waveform_traces(unit=None)``
  processes all units in one call.
* Or pre-compute waveforms in Python and save the workspace; the agent can
  then load it and run ``classify_neurons_hippie`` directly.

How it works
------------

1. **Feature extraction** — For each neuron, SpikeLab computes:

   * *Waveform* (50 samples, min-max normalized to [-1, 1])
   * *ISI histogram* (100 log-spaced bins from 1 ms to 5 s, log(x+1)
     transformed, then min-max normalized)
   * *Autocorrelogram* (100 bins, 0–100 ms, min-max normalized)

2. **Encoding** — Three modality-specific ResNet18 encoders project each
   neuron's features into a shared 30-D latent space, conditioned on the
   recording technology (``tech_id``).

3. **UMAP** — The 30-D embeddings are projected to 2-D using cosine-distance
   UMAP for visualization and clustering.

4. **HDBSCAN** — Density clusters are found in the 2-D UMAP space.
   Neurons that do not belong to any cluster receive label ``-1``.

Checkpoint
----------

The pretrained model (``hippie_techcond_v1.ckpt``) is hosted at
`huggingface.co/Jesusgf23/hippie`_.  It is downloaded automatically on
first use and cached locally (HuggingFace default cache, or override with
``cache_dir``).  The file is 293 MB; subsequent calls use the local cache.

.. _huggingface.co/Jesusgf23/hippie: https://huggingface.co/Jesusgf23/hippie

The model was pretrained on 11 labeled electrophysiology datasets spanning
mouse, rat, and macaque across multiple brain regions and recording
technologies.