scASprofiler-impute CLI

The scASprofiler-impute CLI implements a VAE-GAN framework for imputing highly sparse single-cell alternative splicing (AS) / junction count matrices. This command supports two modes: training and imputation. In training mode, scASprofiler learns a conditional generative model that captures cluster-specific splicing distributions, where each cell is associated with a discrete label (cluster / cell type). In imputation mode, it uses the trained decoder to generate in silico splicing profiles and then performs weighted KNN-style imputation (median over nearest synthetic neighbors) to fill in missing entries, while preserving observed values.
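The imputation step described above can be illustrated with a minimal sketch. This is not the scASprofiler implementation: the function name `knn_median_impute`, the root-mean-square distance over observed features, and the unweighted median are assumptions chosen for clarity, and the sketch stores cells as rows (the transpose of the feature × cell input file). It uses only the Python standard library, so a real implementation would likely be vectorized.

```python
import math
from statistics import median


def knn_median_impute(cells, synthetic, k=10):
    """Fill NaN entries of each cell from its k nearest synthetic profiles.

    cells:     list of per-cell profiles; NaN marks a missing entry.
    synthetic: list of complete (NaN-free) generated profiles.
    Hypothetical helper for illustration only.
    """
    imputed = []
    for cell in cells:
        # Indices of the features actually observed in this cell.
        obs_idx = [j for j, v in enumerate(cell) if not math.isnan(v)]

        def distance(profile):
            # Mask-aware distance: compare on observed features only.
            if not obs_idx:
                return 0.0
            sq = sum((profile[j] - cell[j]) ** 2 for j in obs_idx)
            return math.sqrt(sq / len(obs_idx))

        # sorted() is stable, so ties keep the original synthetic order.
        nearest = sorted(synthetic, key=distance)[:k]

        # Observed values are preserved; only NaN entries are replaced
        # by the per-feature median over the nearest neighbors.
        filled = [
            median(p[j] for p in nearest) if math.isnan(v) else v
            for j, v in enumerate(cell)
        ]
        imputed.append(filled)
    return imputed
```

For example, a cell `[1.0, nan, 3.0]` whose three nearest synthetic profiles have second-feature values 2.0, 4.0, and 6.0 would have its missing entry filled with their median, while the two observed values stay unchanged.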

You can generate a complete single-cell splice junction count matrix with a command like this:

scASprofiler-impute  train   --data-Sj /out/filter_sj_counts.csv   --data-c onlyfillna_as_PRJEB15062_smart_seq2_label.txt   --outdir ./  --clusters 2   --n-epochs 1000 --batch-size 8   --drop-prob 0.1 --patience 10   --overwrite   --run-impute   --name scasp_drop_0.1   -k 10

By default, one output file is written to the outdir. This file contains all the information needed for quantification with scASprofiler-quantify.

Options

More parameters are available (scASprofiler-impute train --help always shows the options for the version you are using):

parameter settings

Options:
    data_sj: path to the splice junction matrix file (feature × cell; missing values as NaN), used as the model input for training/imputation.

    data_c: path to the cell label/cluster file (one label per cell, aligned to the columns of the SJ matrix), used for conditional generation.

    outdir: output directory for saving checkpoints and imputed results.

    name: user-defined job name used to prefix output files; if empty, a name is derived from input filenames.

    n_epochs: number of training epochs.

    batch_size: mini-batch size used during training.

    drop_prob: fraction of observed entries randomly masked to build a pseudo-validation set for early stopping and model selection.

    patience: early-stopping patience; training stops if validation MSE does not improve for this many epochs.

    threthold: convergence threshold (reserved for convergence control; depending on the implementation, it may be used to judge whether training has stabilized).

    channels: number of input channels for the reshaped SJ “image” (typically 1 for a single matrix).

    latent_dim: dimensionality of the latent space in the VAE encoder/decoder.

    clusters: number of clusters/classes in the label file (one-hot conditioning dimension).

    overwrite: whether to overwrite existing checkpoints with the same job configuration.

    run_impute: whether to run imputation immediately after training finishes.

    no_run_impute: disable automatic post-training imputation.

    sim_size: number of synthetic samples generated per cluster during imputation (larger gives smoother KNN statistics but costs more time/memory).

    k: number of nearest synthetic neighbors used in mask-aware KNN imputation.
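To make the interaction between drop_prob and patience more concrete, here is a small sketch under stated assumptions. The helper names (`mask_observed`, `validation_mse`, `stopping_epoch`) are hypothetical and are not part of scASprofiler; the sketch only mirrors the descriptions above: a fraction of observed entries is hidden, validation MSE is computed on those hidden entries, and training stops once the MSE has not improved for `patience` consecutive epochs.

```python
import math
import random


def mask_observed(matrix, drop_prob=0.1, seed=0):
    """Hide a random fraction of observed (non-NaN) entries.

    Returns the masked copy and the list of held-out (row, col) positions.
    The input matrix is left untouched.
    """
    rng = random.Random(seed)
    train = [row[:] for row in matrix]
    held_out = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            # Only observed entries are candidates for masking.
            if not math.isnan(value) and rng.random() < drop_prob:
                train[i][j] = math.nan
                held_out.append((i, j))
    return train, held_out


def validation_mse(prediction, truth, held_out):
    """Mean squared error restricted to the held-out positions."""
    if not held_out:
        raise ValueError("no held-out entries; increase drop_prob")
    total = sum((prediction[i][j] - truth[i][j]) ** 2 for i, j in held_out)
    return total / len(held_out)


def stopping_epoch(val_losses, patience=10):
    """Return the epoch index at which early stopping would trigger."""
    best = math.inf
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    # Never triggered: training runs through the last epoch.
    return len(val_losses) - 1
```

In this sketch, a larger drop_prob gives a more stable validation estimate but leaves fewer observed entries for training, and a larger patience tolerates longer plateaus before stopping.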