Data Loaders

Functions for loading spike train data from various file formats, including pickle, NWB, and Neo-compatible formats.

Lightweight loaders that convert common neurophysiology formats into spikedata.SpikeData objects.

Supported inputs (best-effort, optional deps):

HDF5 (generic): spike times, (indices,times), or raster matrices
NWB: reads Units table spike_times (via pynwb if available, else h5py)
KiloSort/Phy outputs: spike_times.npy + spike_clusters.npy (+ optional TSV)
SpikeInterface: from a SortingExtractor
IBL (International Brain Laboratory): via ONE API + brainwidemap

Times are converted to milliseconds to match SpikeData conventions. These helpers avoid hard dependencies: optional libraries are imported lazily.

spikelab.data_loaders.data_loaders.load_spikedata_from_hdf5(filepath, *, raster_dataset=None, raster_bin_size_ms=None, spike_times_dataset=None, spike_times_index_dataset=None, spike_times_unit='s', fs_Hz=None, group_per_unit=None, group_time_unit='s', idces_dataset=None, times_dataset=None, times_unit='s', raw_dataset=None, raw_time_dataset=None, raw_time_unit='s', length_ms=None, metadata=None)[source]

Load spike trains from a generic HDF5 file using one of four supported input styles.

Exactly one input style must be specified. The four styles are: raster matrix, ragged arrays, group-per-unit, and paired arrays.

Parameters:

filepath (str) – Path to the HDF5 file.
raster_dataset (str | None) – Dataset path for a 2D raster/counts matrix (units x time). Activates raster style.
raster_bin_size_ms (float | None) – Bin width in milliseconds. Required for raster style.
spike_times_dataset (str | None) – Dataset path for flat concatenated spike times. Activates ragged style (requires spike_times_index_dataset).
spike_times_index_dataset (str | None) – Dataset path for cumulative end-of-unit indices into the flat spike times array.
spike_times_unit (str) – Time unit for ragged spike times (‘s’, ‘ms’, or ‘samples’).
fs_Hz (float | None) – Sampling frequency in Hz. Required when any time unit is ‘samples’.
group_per_unit (str | None) – HDF5 group path containing one dataset per unit. Activates group-per-unit style.
group_time_unit (str) – Time unit for group-per-unit datasets (‘s’, ‘ms’, or ‘samples’).
idces_dataset (str | None) – Dataset path for unit index array. Activates paired-arrays style (requires times_dataset).
times_dataset (str | None) – Dataset path for spike times array (paired with idces_dataset).
times_unit (str) – Time unit for paired spike times (‘s’, ‘ms’, or ‘samples’).
raw_dataset (str | None) – Dataset path for optional raw analog data.
raw_time_dataset (str | None) – Dataset path for the raw data time vector.
raw_time_unit (str) – Time unit for the raw time vector (‘s’, ‘ms’, or ‘samples’).
length_ms (float | None) – Recording duration in milliseconds. If not provided, inferred from the latest spike time.
metadata (Mapping | None) – Additional metadata to attach to the resulting SpikeData.

Returns:

The loaded spike train data.

Return type:

sd (SpikeData)

Raises:

ValueError – If not exactly one input style is specified, or if required arguments are missing.
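
As an illustration of the ragged style, the sketch below splits a flat spike-time array into per-unit trains using cumulative end-of-unit indices and converts the result to milliseconds. It is a minimal pure-Python sketch, not the loader's implementation; the helper name split_ragged and its arguments are hypothetical.

```python
def split_ragged(flat_times, end_indices, unit="s", fs_hz=None):
    """Split a flat spike-time array into per-unit trains in milliseconds."""
    if unit == "s":
        scale = 1000.0
    elif unit == "ms":
        scale = 1.0
    elif unit == "samples":
        if fs_hz is None:
            raise ValueError("fs_hz is required when unit is 'samples'")
        scale = 1000.0 / fs_hz
    else:
        raise ValueError(f"unknown unit: {unit!r}")

    trains, start = [], 0
    for end in end_indices:
        # Each index marks where one unit's spikes end in the flat array.
        trains.append([t * scale for t in flat_times[start:end]])
        start = end
    return trains


flat = [0.010, 0.250, 0.100, 0.400, 0.900]  # seconds, all units concatenated
ends = [2, 5]  # unit 0 owns flat[0:2], unit 1 owns flat[2:5]
trains = split_ragged(flat, ends, unit="s")
```

The same scale factors apply to the group-per-unit and paired-arrays styles; only the way spikes are assigned to units differs.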

spikelab.data_loaders.data_loaders.load_spikedata_from_hdf5_raw_thresholded(filepath, dataset, *, fs_Hz, threshold_sigma=5.0, filter=True, hysteresis=True, direction='both')[source]

Threshold-and-detect spikes from an HDF5 dataset of raw traces.

Parameters:

filepath (str) – Path to HDF5 file.
dataset (str) – HDF5 dataset path containing raw traces shaped (channels, time).
fs_Hz (float) – Sampling frequency in Hz.
threshold_sigma (float) – Threshold in units of per-channel standard deviation.
filter (dict | bool) – If True, apply default Butterworth bandpass; if dict, pass to filter; if False, no filtering.
hysteresis (bool) – Use rising-edge detection if True.
direction (str) – ‘both’ | ‘up’ | ‘down’.

Returns:

The detected spike train data.

Return type:

sd (SpikeData)
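
To make the threshold_sigma, hysteresis, and direction parameters concrete, the following sketch detects rising-edge threshold crossings on a single channel. It assumes the threshold is measured around the channel mean using the population standard deviation and omits filtering entirely; the function name detect_rising_edges is hypothetical and the library's actual detector may differ.

```python
import statistics


def detect_rising_edges(trace, fs_hz, threshold_sigma=5.0, direction="both"):
    """Return times (ms) at which the trace first crosses the threshold."""
    mean = statistics.fmean(trace)
    thresh = threshold_sigma * statistics.pstdev(trace)

    def above(x):
        dev = x - mean
        if direction == "up":
            return dev > thresh
        if direction == "down":
            return dev < -thresh
        return abs(dev) > thresh  # 'both'

    times_ms, was_above = [], False
    for i, x in enumerate(trace):
        is_above = above(x)
        # Rising edge: report only the first sample of each excursion.
        if is_above and not was_above:
            times_ms.append(i * 1000.0 / fs_hz)
        was_above = is_above
    return times_ms


trace = [0.0] * 100
trace[40] = trace[41] = 50.0  # one two-sample positive deflection
trace[70] = -50.0             # one negative deflection
spikes = detect_rising_edges(trace, fs_hz=1000.0, threshold_sigma=3.0)
```

The two-sample deflection yields a single event because only the first supra-threshold sample of each excursion is reported.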

spikelab.data_loaders.data_loaders.load_spikedata_from_nwb(filepath, *, prefer_pynwb=True, length_ms=None, start_time_ms=None, allow_no_units=False)[source]

Load spike trains from an NWB file’s Units table.

Parameters:

filepath (str) – Path to the NWB file.
prefer_pynwb (bool) – If True, try pynwb first; if False, try h5py.
length_ms (float | None) – Recording duration in milliseconds. When None, reads from the file-level length_ms attribute (written by export_spikedata_to_nwb); falls back to inferring from the latest spike time if the attribute is absent.
start_time_ms (float | None) – Recording start time in milliseconds. When None, reads from the file-level start_time attribute (written by export_spikedata_to_nwb); falls back to 0.0 if the attribute is absent. Mirrors the length_ms ladder.
allow_no_units (bool) – When True, files without a Units table return a SpikeData with N=0 and empty trains rather than raising ValueError. The file-level metadata in sd.metadata is still populated. Useful for metadata-only callers (e.g. ingestion pipelines that need to gate on “is this sorted?” without crashing on unsorted inputs). Only honored on the pynwb path; the h5py fallback still requires the /units group.

Returns:

The loaded spike train data. Under the pynwb path, sd.metadata is populated with file-level NWB metadata in addition to the usual source_file / format: identifier, session_description, session_start_time (ISO string), subject_id, species, sex, age, date_of_birth (ISO), device_names (sorted list), sampling_rate_hz, duration_seconds, unit_count, and electrodes_by_channel ({channel_id: {"location", "group_name", "x", "y", "z"}}). Each entry in sd.neuron_attributes gains a location_label key (textual region from the electrodes table) and a group_name key alongside the existing location 3D coordinate list. The h5py fallback path doesn’t populate these extra fields; length_ms / start_time remain the only file-level attrs it carries.

Return type:

sd (SpikeData)
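
The fallback ladder described for length_ms and start_time_ms can be sketched as two small resolvers. This is an illustrative sketch only: the function names are hypothetical, and returning 0.0 for a recording with no spikes is an assumption rather than documented behaviour.

```python
def resolve_length_ms(length_ms, file_attr, trains_ms):
    """Explicit argument, then file attribute, then latest spike time."""
    if length_ms is not None:
        return float(length_ms)
    if file_attr is not None:
        return float(file_attr)
    # Assumption: a recording with no spikes falls back to 0.0.
    return max((t for train in trains_ms for t in train), default=0.0)


def resolve_start_time_ms(start_time_ms, file_attr):
    """Explicit argument, then file attribute, then 0.0."""
    if start_time_ms is not None:
        return float(start_time_ms)
    return float(file_attr) if file_attr is not None else 0.0


trains = [[10.0, 250.0], [900.0]]
inferred = resolve_length_ms(None, None, trains)
```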

spikelab.data_loaders.data_loaders.load_spikedata_from_kilosort(folder, *, fs_Hz, spike_times_file='spike_times.npy', spike_clusters_file='spike_clusters.npy', cluster_info_tsv=None, time_unit='samples', include_noise=False, length_ms=None, channel_map_file='channel_map.npy', channel_positions_file='channel_positions.npy')[source]

Load KiloSort/Phy outputs into SpikeData.

Parameters:

folder (str) – Path to the KiloSort/Phy output directory.
fs_Hz (float) – Sampling frequency in Hz.
spike_times_file (str) – Path to the spike_times.npy file.
spike_clusters_file (str) – Path to the spike_clusters.npy file.
cluster_info_tsv (str | None) – Path to the cluster info TSV file.
time_unit (str) – Unit of the spike times (‘samples’, ‘s’, or ‘ms’).
include_noise (bool) – If True, include noise clusters.
length_ms (float | None) – Recording duration in milliseconds.
channel_map_file (str) – Filename of the channel map file relative to folder. Expected format: 1D numpy array mapping cluster indices to channel numbers.
channel_positions_file (str) – Filename of the channel positions file relative to folder. Expected format: 2D numpy array of shape (channels, 3) containing channel positions.

Returns:

The loaded spike train data.

Return type:

sd (SpikeData)

Notes

This loader does not extract or include waveform data; only spike times and cluster assignments are loaded.
Reads spike_times.npy (samples) and spike_clusters.npy; groups times per cluster and converts to ms using fs_Hz.
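
The grouping step in the notes above can be sketched as follows: parallel arrays of spike samples and cluster labels are bucketed per cluster and converted to milliseconds with fs_Hz. This pure-Python sketch uses plain lists in place of the .npy arrays, and the helper name group_by_cluster is hypothetical.

```python
from collections import defaultdict


def group_by_cluster(spike_samples, spike_clusters, fs_hz):
    """Bucket spike samples per cluster and convert them to milliseconds."""
    if len(spike_samples) != len(spike_clusters):
        raise ValueError("spike_samples and spike_clusters must align")
    trains = defaultdict(list)
    for sample, cluster in zip(spike_samples, spike_clusters):
        trains[cluster].append(sample * 1000.0 / fs_hz)
    # Sort clusters and spike times for a deterministic result.
    return {c: sorted(trains[c]) for c in sorted(trains)}


samples = [300, 600, 900, 1500]   # stand-in for spike_times.npy
clusters = [7, 2, 7, 2]           # stand-in for spike_clusters.npy
trains_ms = group_by_cluster(samples, clusters, fs_hz=30000.0)
```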

spikelab.data_loaders.data_loaders.load_spikedata_from_spikeinterface(sorting, *, sampling_frequency=None, unit_ids=None, segment_index=0)[source]

Convert a SpikeInterface SortingExtractor-like object to SpikeData.

Parameters:

sorting (object) – Exposes get_unit_ids(), get_sampling_frequency(), get_unit_spike_train(…).
sampling_frequency (float | None) – Optional override for sampling frequency (Hz).
unit_ids (Sequence | None) – Optional subset of unit IDs to include. When provided, the order of the returned SpikeData’s units follows the caller’s order (after presence validation).
segment_index (int) – Segment index for multi-segment sortings.

Returns:

The converted spike train data.

Return type:

sd (SpikeData)

Notes

When unit_ids is None, the resulting unit order follows sorting.get_unit_ids() order, which is backend-dependent (KiloSort returns sequential IDs; some SpikeInterface variants reorder by sort metric). Two SpikeData objects built from different backends may therefore index the same physical unit at different positions. Pass an explicit unit_ids sequence when the unit ordering matters across backends.
neuron_attributes[i]["unit_id"] records the original backend ID, providing a stable mapping from position to source ID irrespective of the order convention.
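
The ordering behaviour can be illustrated with a duck-typed stand-in for a sorting object. FakeSorting and to_trains_ms below are hypothetical names used only for this sketch; the sketch mirrors the documented contract (caller-specified order after presence validation, with unit_id kept per position) and does not reproduce the loader itself.

```python
class FakeSorting:
    """Hypothetical stand-in exposing the three methods the loader expects."""

    def __init__(self, trains_in_samples, fs_hz):
        self._trains = trains_in_samples
        self._fs_hz = fs_hz

    def get_unit_ids(self):
        return list(self._trains)

    def get_sampling_frequency(self):
        return self._fs_hz

    def get_unit_spike_train(self, unit_id, segment_index=0):
        return self._trains[unit_id]


def to_trains_ms(sorting, unit_ids=None):
    """Convert spike samples to ms, keeping the requested unit order."""
    available = sorting.get_unit_ids()
    if unit_ids is None:
        unit_ids = available  # backend-dependent order
    missing = [u for u in unit_ids if u not in available]
    if missing:
        raise ValueError(f"unknown unit ids: {missing}")
    fs_hz = sorting.get_sampling_frequency()
    trains = [
        [s * 1000.0 / fs_hz for s in sorting.get_unit_spike_train(u, segment_index=0)]
        for u in unit_ids
    ]
    attrs = [{"unit_id": u} for u in unit_ids]  # position -> source ID
    return trains, attrs


sorting = FakeSorting({"b": [30], "a": [60, 90]}, fs_hz=30000.0)
trains, attrs = to_trains_ms(sorting, unit_ids=["a", "b"])
```

Passing unit_ids explicitly pins position 0 to unit "a" regardless of the order the backend reports.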

spikelab.data_loaders.data_loaders.load_spikedata_from_spikeinterface_recording(recording, *, segment_index=0, threshold_sigma=5.0, filter=False, hysteresis=True, direction='both')[source]

Convert a SpikeInterface BaseRecording-like object into SpikeData.

Parameters:

recording (object) – Exposes get_traces(segment_index=…), get_sampling_frequency(), get_num_channels().
segment_index (int) – Segment index for multi-segment recordings.
threshold_sigma (float) – Threshold in units of per-channel standard deviation.
filter (dict | bool) – If True, apply default Butterworth bandpass; if dict, pass to filter; if False, no filtering.
hysteresis (bool) – Use rising-edge detection if True.
direction (str) – ‘both’ | ‘up’ | ‘down’.

Returns:

The converted spike train data.

Return type:

sd (SpikeData)

spikelab.data_loaders.data_loaders.load_spikedata_from_pickle(filepath, *, allow_remote=False, aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, region_name=None)[source]

Load a SpikeData object from a pickle file.

Warning

Only load pickle files from trusted sources. Pickle deserialization can execute arbitrary code and should never be used with untrusted data. The file is deserialized before type checking — malicious payloads execute regardless of the subsequent isinstance check. Remote (S3) loads require allow_remote=True so the caller has to opt in.

Parameters:

filepath (str) – Path to the pickle file, or an S3 URL (s3://bucket/key). Remote URLs require allow_remote=True.
allow_remote (bool) – When False (default), S3 URLs are rejected with a ValueError. Pass True to opt in to loading a pickle from a remote bucket; a UserWarning is also emitted at the call site so the risk surfaces in batch-job logs.
aws_access_key_id (str | None) – AWS access key ID for S3 downloads.
aws_secret_access_key (str | None) – AWS secret access key for S3 downloads.
aws_session_token (str | None) – AWS session token for temporary credentials.
region_name (str | None) – AWS region name for S3 access.

Returns:

The deserialized SpikeData object.

Return type:

sd (SpikeData)
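
The opt-in gate for remote pickles described above might look like the sketch below, which only classifies the path and never deserializes anything. The function name check_pickle_source and the message wording are hypothetical; only the documented behaviour (ValueError without allow_remote, UserWarning with it) is taken from the text.

```python
import warnings
from urllib.parse import urlparse


def check_pickle_source(filepath, allow_remote=False):
    """Return (bucket, key) for permitted S3 URLs, or None for local paths."""
    parsed = urlparse(filepath)
    if parsed.scheme != "s3":
        return None  # treated as a local path
    if not allow_remote:
        raise ValueError("S3 URLs require allow_remote=True")
    # Surface the risk in logs even when the caller has opted in.
    warnings.warn(
        "Loading a pickle from a remote bucket can execute arbitrary code",
        UserWarning,
        stacklevel=2,
    )
    return parsed.netloc, parsed.path.lstrip("/")


local = check_pickle_source("/data/session.pkl")
```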

spikelab.data_loaders.data_loaders.load_spikedata_from_ibl(eid, pid, *, length_ms=None, collection=None)[source]

Load spike trains for a single IBL probe into SpikeData.

Authenticates against the public IBL server automatically. Only units labelled as good (label == 1) in the Brain-Wide Map unit table are included. Trial event times are stored in SpikeData.metadata as individual numpy arrays, all in milliseconds.

Parameters:

eid (str) – IBL experiment ID (UUID string).
pid (str) – IBL probe ID (UUID string).
length_ms (float | None) – Recording duration in milliseconds. If not provided, the maximum spike time across all units is used.
collection (str | None) – If provided, skip the heuristic collection search and load spikes directly from this collection (e.g. "alf/probe00/pykilosort"). Saves 3-4 network round-trips per call when the caller already knows the canonical collection (e.g. in batch workflows that resolve the collection once and reuse it). None (default) falls back to the PID-suffix heuristic + fallback chain.

Returns:

Loaded spike train data.

neuron_attributes carries the Beryl region per unit plus, when the Brain-Wide Map table provides them, Allen acronym and atlas_id, the Cosmos parcellation parent, and per-unit QC fields (firing_rate, presence_ratio, amp_median, contamination, drift, noise_cutoff, cluster_id). metadata carries:

Existing trial fields: eid, pid, n_trials, trial_start_times, trial_end_times, stim_on_times, stim_off_times, go_cue_times, response_times, feedback_times, first_movement_times, choice, feedback_type, contrast_left, contrast_right, probability_left.

File-level identification: identifier (= eid), format ("IBL"), unit_count, sampling_rate_hz (when present on the spikes object), duration_seconds.

Session metadata (best-effort, one extra REST call): session_start_time, session_end_time, lab, task_protocol, project, session_number, procedures, qc.

Subject metadata (same REST chain): subject_id, species, sex, date_of_birth, age_weeks, strain, genotype, responsible_user.

Probe insertion (best-effort REST call): probe_name, probe_model, and insertion_* coordinates for any of {x, y, z, theta, phi, depth} present.

electrodes_by_channel (best-effort one.load_object("channels") call): per-channel location (Allen acronym), atlas_id (Allen Structure ID), x/y/z (ML/AP/DV in mm), local_x/local_y (probe-relative, μm), raw_index. Key shape matches the NWB loader’s electrodes_by_channel for cross-format consumers.

All time arrays are in milliseconds.

Return type:

sd (SpikeData)

Notes

Requires one-api and brainwidemap packages (optional dependencies).
Spike times are converted from seconds (IBL convention) to milliseconds.
Trial times are converted from seconds to milliseconds.
When collection is None, the probe collection is inferred from the PID suffix; falls back through alf/probe00/pykilosort, alf/probe01/pykilosort, and alf.
Session, subject, insertion, and channels lookups are best-effort. A failure on any of them yields an absent metadata field rather than raising — the spike-train load succeeds as long as the units table is reachable.
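
The best-effort behaviour in the last note can be sketched as a loop that merges whatever each lookup returns and skips any lookup that raises. All names and values below are hypothetical placeholders; the real loader's REST calls and error handling are not shown in this reference.

```python
def collect_best_effort(lookups):
    """Merge the dicts returned by each lookup, skipping any that raise."""
    metadata = {}
    for fetch in lookups:
        try:
            metadata.update(fetch())
        except Exception:
            # A failed lookup leaves its fields absent instead of aborting.
            continue
    return metadata


def session_lookup():
    return {"lab": "example_lab", "task_protocol": "example_protocol"}


def insertion_lookup():
    raise ConnectionError("simulated REST failure")


def subject_lookup():
    return {"subject_id": "example_subject"}


metadata = collect_best_effort([session_lookup, insertion_lookup, subject_lookup])
```

Consumers should therefore test for a key's presence rather than assume every documented field exists.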

spikelab.data_loaders.data_loaders.query_ibl_probes(target_regions=None, *, min_units=0, min_fraction_in_target=0.0)[source]

Search the IBL Brain-Wide Map database for probes matching given criteria.

Authenticates against the public IBL server automatically. Filters probes by brain region and unit count. Returns matching (eid, pid) pairs alongside a per-probe statistics DataFrame.

Parameters:

target_regions (list[str] | None) – Beryl atlas region names to filter by (e.g. ["MOs", "MOp"]). If None, no region filter is applied.
min_units (int) – Minimum number of good units required per probe. Default 0 (no minimum).
min_fraction_in_target (float) – Minimum fraction (0–1) of good units that must fall within target_regions. Ignored when target_regions is None. Default 0.0.

Returns:

List of (eid, pid) pairs for probes that pass all filters, sorted by descending good unit count.
stats (pd.DataFrame): One row per matching probe with columns eid, pid, n_good_units, and (when target_regions is not None) n_in_target and fraction_in_target.

Return type:

probes (list[tuple[str, str]])

Notes

Requires one-api and brainwidemap packages (optional dependencies).
bwm_units() fetches the full Brain-Wide Map unit table from the IBL server; this may take several seconds on first call.
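
The filtering rules can be sketched on a toy unit table. The sketch assumes one (eid, pid, region) row per good unit and uses plain tuples instead of a DataFrame; filter_probes is a hypothetical name and the real function may aggregate differently.

```python
def filter_probes(rows, target_regions=None, min_units=0,
                  min_fraction_in_target=0.0):
    """Return (eid, pid) pairs passing the filters, most units first."""
    per_probe = {}
    for eid, pid, region in rows:
        stats = per_probe.setdefault(
            (eid, pid), {"n_good_units": 0, "n_in_target": 0}
        )
        stats["n_good_units"] += 1
        if target_regions is not None and region in target_regions:
            stats["n_in_target"] += 1

    kept = []
    for key, stats in per_probe.items():
        if stats["n_good_units"] < min_units:
            continue
        if target_regions is not None:
            fraction = stats["n_in_target"] / stats["n_good_units"]
            if fraction < min_fraction_in_target:
                continue
        kept.append((key, stats["n_good_units"]))
    kept.sort(key=lambda item: item[1], reverse=True)
    return [key for key, _ in kept]


rows = [
    ("e1", "p1", "MOs"), ("e1", "p1", "MOs"), ("e1", "p1", "VISp"),
    ("e2", "p2", "MOp"), ("e2", "p2", "VISp"),
    ("e2", "p2", "VISp"), ("e2", "p2", "VISp"),
]
motor = filter_probes(rows, ["MOs", "MOp"], min_units=2,
                      min_fraction_in_target=0.5)
```

Here probe p2 has more good units but only a quarter of them fall in the target regions, so it is excluded once the fraction filter applies.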

spikelab.data_loaders.data_loaders.load_spikedata_from_dandi(asset_id, *, dandiset_id=None, version='draft', download_dir=None, api_token=None, api_base='https://api.dandiarchive.org/api', request_timeout_seconds=30.0, download_timeout_seconds=600.0, allow_no_units=False, length_ms=None, start_time_ms=None)[source]

Download one DANDI NWB asset and load it as a SpikeData.

Resolves the asset’s download URL via DANDI’s asset-detail endpoint (or accepts a direct URL when asset_id looks like one), streams the bytes to disk, then delegates to load_spikedata_from_nwb() for the actual NWB parsing. DANDI provenance fields are added to SpikeData.metadata.

Parameters:

asset_id (str) – Either a DANDI asset UUID (the asset_id field from list_dandi_assets()) or a fully-qualified asset download URL.
dandiset_id (str | None) – Owning dandiset id (e.g. "000006"). Used to build the source_reference provenance string; optional when only the asset_id is known.
version (str) – Dandiset version. Recorded on metadata.
download_dir (str | None) – Directory the downloaded file lives in. When None, a tempfile.TemporaryDirectory is used and the file is deleted after the load. When a path is supplied, the directory is created if needed and the file is kept — caller manages cleanup. The file path is then recorded on SpikeData.metadata as downloaded_path.
api_token (str | None) – Personal access token. Required for embargoed dandisets. Defaults to the DANDI_API_TOKEN env var when not supplied.
api_base (str) – API root override.
request_timeout_seconds (float) – Per-API-call timeout (asset detail lookup, etc.).
download_timeout_seconds (float) – Per-download timeout. Default 10 min covers roughly 7.5 GB at a sustained 100 Mbps; raise for larger assets or slower links.
allow_no_units (bool) – Passed through to load_spikedata_from_nwb(). True lets metadata-only callers load files without a Units table.
length_ms (float | None) – Passed through.
start_time_ms (float | None) – Passed through.

Returns:

Loaded spike data. sd.metadata carries all the keys load_spikedata_from_nwb() populates, plus DANDI-specific fields: dandi_asset_id, dandi_dandiset_id (when dandiset_id is provided), dandi_version, source_reference (DANDI URL), downloaded_path (only when download_dir is supplied).

Return type:

sd (SpikeData)

Raises:

urllib.error.URLError – On network / HTTP failure.
ValueError / ImportError – From the delegated NWB load.

Notes

Public dandisets work without authentication. Embargoed dandisets need a personal access token (Account → My Tokens on dandiarchive.org). Streaming download keeps memory bounded independent of asset size; on-disk space proportional to the file is required.

DANDI also hosts raw recordings (NWB files with an ElectricalSeries acquisition but no Units table). This loader does NOT materialise the raw voltage traces — the function name signals “spike data only”. For metadata triage on raw assets, pass allow_no_units=True: the returned SpikeData has N=0 but metadata is fully populated (subject, session, electrodes_by_channel, sampling_rate_hz, duration_seconds, etc.). Loading the raw ElectricalSeries as a SpikeInterface BaseRecording is a separate operation; pair this loader’s metadata triage with SpikeInterface’s NWB reader on the downloaded_path for that case.
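
The bounded-memory property of a streaming download can be sketched with the standard library: bytes are copied from a file-like source to disk in fixed-size chunks, so peak memory depends on the chunk size and not on the asset size. Here an in-memory buffer stands in for the HTTP response body; stream_to_disk is a hypothetical helper, not the loader's code.

```python
import io
import os
import shutil
import tempfile


def stream_to_disk(source, dest_path, chunk_bytes=1024 * 1024):
    """Copy a file-like source to dest_path one chunk at a time."""
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(source, out, length=chunk_bytes)
    return os.path.getsize(dest_path)


payload = io.BytesIO(b"x" * 3_000_000)  # stands in for a response body
with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, "asset.nwb")
    n_bytes = stream_to_disk(payload, dest)
    # The file is removed when the temporary directory is cleaned up,
    # matching the documented download_dir=None behaviour.
```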

spikelab.data_loaders.data_loaders.list_dandi_assets(dandiset_id, *, version='draft', path_glob=None, api_token=None, api_base='https://api.dandiarchive.org/api', page_size=100, request_timeout_seconds=30.0)[source]

Yield assets in a DANDI dandiset version.

Parameters:

dandiset_id (str) – Six-digit DANDI identifier (e.g. "000006"). Leading zeros matter.
version (str) – Dandiset version. "draft" (default) is the in-progress version; published versions are tagged like "0.231012.0".
path_glob (str | None) – Optional glob pattern (e.g. "*.nwb") the API filters on server-side. Cheaper than client-side filtering when most assets aren’t of interest.
api_token (str | None) – Personal access token. Required for embargoed dandisets. Defaults to the DANDI_API_TOKEN env var when not supplied; public dandisets work without one.
api_base (str) – API root. Override for staging or self-hosted DANDI.
page_size (int) – Per-page result count. Default 100 — the iterator pages internally, so this only affects request granularity.
request_timeout_seconds (float) – Per-request timeout.

Yields:

dict –

One asset per yielded value, with keys: asset_id (str, UUID), path (str, dandiset-relative), size (int, bytes), download_url (str), dandiset_id (str), version (str).

Notes

Pagination is handled transparently — caller iterates without worrying about next_page. Large dandisets can have thousands of assets, so consumers should consume the iterator lazily rather than materialising the full list.
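
The lazy pagination pattern can be sketched as a generator that follows a next link only when the consumer asks for more items. The page layout (a results list plus a next pointer) and every name below are assumptions made for illustration; a fake fetch function replaces the HTTP layer.

```python
def iter_assets(fetch_page, first_url):
    """Yield every item across pages, fetching each page only when needed."""
    url = first_url
    while url is not None:
        page = fetch_page(url)
        yield from page["results"]
        url = page.get("next")


# Hypothetical two-page response set standing in for the HTTP layer.
PAGES = {
    "page-1": {
        "results": [{"path": "sub-01/a.nwb"}, {"path": "sub-01/b.nwb"}],
        "next": "page-2",
    },
    "page-2": {"results": [{"path": "sub-02/c.nwb"}], "next": None},
}
requested = []


def fake_fetch(url):
    requested.append(url)
    return PAGES[url]


assets = iter_assets(fake_fetch, "page-1")
first = next(assets)  # only page-1 has been requested at this point
```

Wrapping the iterator in list() would force every page to be fetched up front, which is what the note above advises against for large dandisets.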

spikelab.data_loaders.data_loaders.load_recording_from_dandi(asset_id, zarr_dest, *, dandiset_id=None, version='draft', electrical_series_path=None, overwrite=False, download_dir=None, keep_nwb=False, api_token=None, api_base='https://api.dandiarchive.org/api', request_timeout_seconds=30.0, download_timeout_seconds=600.0, save_kwargs=None)[source]

Download a DANDI NWB asset and convert its raw ElectricalSeries to SpikeInterface Zarr format.

Complementary to load_spikedata_from_dandi(): that one is for pre-sorted Units tables; this one is for the raw voltage traces (the ElectricalSeries acquisition objects DANDI hosts but that no analysis tooling pre-processes for you). Output is a Zarr directory that any consumer of SpikeInterface — e.g. a spike sorter run later — can re-open via spikeinterface.core.read_zarr_recording().

Parameters:

asset_id (str) – DANDI asset UUID or a fully-qualified asset URL. Same shapes accepted by load_spikedata_from_dandi().
zarr_dest (str) – Target Zarr directory. Created if absent.
dandiset_id (str | None) – Owning dandiset id, used for the source_reference provenance string.
version (str) – Dandiset version. Recorded on metadata.
electrical_series_path (str | None) – When the NWB file has multiple ElectricalSeries objects, the HDMF location of the one to convert (e.g. "acquisition/ElectricalSeriesAP"). None (default) lets SpikeInterface auto-pick — works when there’s exactly one.
overwrite (bool) – When True, an existing zarr_dest is removed first. When False (default), an existing target raises.
download_dir (str | None) – Directory the downloaded .nwb lives in. None uses a tempdir. The NWB is removed after the Zarr write unless keep_nwb=True.
keep_nwb (bool) – When True, the downloaded .nwb is left on disk after the Zarr is written. Useful for callers that want to content-hash the original bytes (e.g. gateway ingestion). The downloaded path is included in the return dict under downloaded_nwb_path.
api_token (str | None) – DANDI personal access token. Defaults to DANDI_API_TOKEN env var.
api_base (str) – API root override.
request_timeout_seconds (float) – Per-API-call timeout.
download_timeout_seconds (float) – Per-download timeout.
save_kwargs (dict | None) – Forwarded to BaseRecording.save() — e.g. {"n_jobs": 4, "chunk_duration_s": 1.0}. Defaults to {}.

Returns:

Conversion outcome with the following keys:

zarr_path: Absolute path to the Zarr directory.
recording_metadata_path: JSON sidecar (DANDI provenance + NWB file-level metadata + recording shape).
downloaded_nwb_path: Present only when keep_nwb is True. Path to the source NWB file.
dandi_asset_id, dandi_dandiset_id (when supplied), dandi_version, source_reference: provenance.
sampling_rate_hz, n_channels, n_samples, duration_seconds: recording shape, surfaced from the SpikeInterface extractor for callers that don’t want to re-open the Zarr just to check.
Subject + session fields merged from the NWB metadata (when present): identifier, subject_id, species, sex, session_start_time, etc.

Return type:

dict

Raises:

ImportError – If spikeinterface (or its NWB extractor) isn’t installed.
ValueError – If asset detail fetch fails or the NWB file has no ElectricalSeries the extractor can resolve.
FileExistsError – If zarr_dest exists and overwrite=False.

Notes

Streaming download keeps memory bounded. Zarr writes are proportional to the recording size — a 1-hour, 384-channel, 30 kHz Neuropixels session is ~80 GB raw → ~20–40 GB with the default LZ4 compressor. Plan disk accordingly.

SpikeInterface’s NWB extractor decides chunking + dtype from the source ElectricalSeries. Pass save_kwargs={"n_jobs": N} to parallelise the chunk write for large recordings.

For DANDI assets that ARE pre-sorted (Units table present), use load_spikedata_from_dandi() instead — that path stops at the spike trains rather than rewriting the voltage traces.
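
The raw-size figure quoted in the first note follows from simple arithmetic, sketched below so it can be repeated for other recordings. The sketch assumes 2 bytes per sample (int16), which is an assumption about the source data and not something this reference states; raw_size_gb is a hypothetical helper.

```python
def raw_size_gb(n_channels, fs_hz, duration_s, bytes_per_sample=2):
    """Uncompressed size in decimal gigabytes (assumes int16 by default)."""
    return n_channels * fs_hz * duration_s * bytes_per_sample / 1e9


# 1 hour of 384-channel data sampled at 30 kHz
size_gb = raw_size_gb(n_channels=384, fs_hz=30_000, duration_s=3600)
```

The result is about 83 GB, in line with the ~80 GB quoted above; the compressed Zarr size depends on the data and the compressor.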

spikelab.data_loaders.data_loaders.load_spikedata_from_spikelab_sorted_npz(filepath, *, length_ms=None)[source]

Load a SpikeLab compiled sorting result (.npz) into SpikeData.

These .npz files are produced by sort_with_kilosort2()’s compile_results step and contain per-unit spike trains, electrode locations, waveform templates, and quality metrics.

Parameters:

filepath (str) – Path to the .npz file.
length_ms (float | None) – Recording duration in milliseconds. Inferred from the latest spike time when None.

Returns:

The loaded spike train data with neuron attributes (unit_id, location, electrode, template, amplitudes, etc.).

Return type:

sd (SpikeData)