scASprofiler-perp CLI
scASprofiler provides three command-line interfaces (CLIs) that are directly available in your Python path:
scASprofiler-perp, scASprofiler-impute, and scASprofiler-quantify.
This module implements the core functionality of scASprofiler for building a high-quality single-cell splicing junction (SJ) count matrix. Starting from STAR junction outputs, it performs junction preprocessing, gene annotation mapping with a GTF file, intron grouping by splice sites (3’ / 5’), and multi-round quality control (QC) filtering. The final outputs include group-aware SJ matrices, a junction metadata table, and a normalized/padded matrix ready for downstream modeling.
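To make the "intron grouping by splice sites" step concrete, here is a minimal sketch in plain Python. It is not scASprofiler's internal code: the function names `parse_sj_line` and `group_by_splice_site` are hypothetical, and treating junctions of undefined strand like plus-strand junctions is an assumption made only for illustration. The column layout is that of STAR's standard SJ.out.tab file.

```python
from collections import defaultdict

# STAR SJ.out.tab columns (tab-separated, no header):
# chrom, intron start, intron end, strand (0 undefined, 1 '+', 2 '-'),
# intron motif, annotated flag, unique reads, multi-mapped reads, max overhang
STRAND = {"0": ".", "1": "+", "2": "-"}

def parse_sj_line(line):
    """Parse one SJ.out.tab line into a dict (hypothetical helper)."""
    f = line.rstrip("\n").split("\t")
    return {
        "chrom": f[0],
        "start": int(f[1]),
        "end": int(f[2]),
        "strand": STRAND[f[3]],
        "unique": int(f[6]),
        "multi": int(f[7]),
    }

def group_by_splice_site(junctions, site="5p"):
    """Group junctions that share a 5' (donor) or 3' (acceptor) site."""
    groups = defaultdict(list)
    for j in junctions:
        # On the minus strand the donor is the intron end, otherwise the start.
        # Assumption: undefined strand ('.') is handled like '+'.
        minus = j["strand"] == "-"
        five_prime = j["end"] if minus else j["start"]
        three_prime = j["start"] if minus else j["end"]
        anchor = five_prime if site == "5p" else three_prime
        groups[(j["chrom"], j["strand"], anchor)].append(j)
    return dict(groups)

# Three toy junctions on chr1, plus strand.
lines = [
    "chr1\t100\t200\t1\t1\t1\t5\t0\t30",
    "chr1\t100\t300\t1\t1\t1\t2\t1\t25",
    "chr1\t150\t300\t1\t1\t0\t4\t0\t40",
]
junctions = [parse_sj_line(line) for line in lines]
by_donor = group_by_splice_site(junctions, "5p")
by_acceptor = group_by_splice_site(junctions, "3p")
```

In this toy input, the first two junctions share the donor at position 100 and the last two share the acceptor at position 300, so each grouping yields one group of two junctions and one singleton.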
You can generate a filtered single-cell splice junction count matrix from a set of SJ files on the command line like this:
# for smart-seq
scASprofiler-perp run --sj-dir /SJ --gtf gencode.v46.annotation.gtf --outdir ./out --samples-ps 1 --sites-ps 20 --sites-thres 10 --samples-thres 1000
# for droplet, e.g. 10x Genomics
scASprofiler-perp run --sj-dir /SJ --gtf gencode.v46.annotation.gtf --outdir ./out --samples-ps 1 --sites-ps 20 --sites-thres 10 --samples-thres 1000 --x10
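If you drive the tool from a Python script, the same invocation can be assembled as an argument list. This is a sketch only: `build_prep_command` is a hypothetical helper, and it uses just the flags and values shown in the commands above.

```python
import shlex

def build_prep_command(sj_dir, gtf, outdir, droplet=False):
    """Assemble the argv for `scASprofiler-perp run` (hypothetical helper)."""
    argv = [
        "scASprofiler-perp", "run",
        "--sj-dir", sj_dir,
        "--gtf", gtf,
        "--outdir", outdir,
        "--samples-ps", "1",
        "--sites-ps", "20",
        "--sites-thres", "10",
        "--samples-thres", "1000",
    ]
    if droplet:
        # Droplet data (e.g. 10x Genomics) adds the --x10 switch.
        argv.append("--x10")
    return argv

argv = build_prep_command("/SJ", "gencode.v46.annotation.gtf", "./out", droplet=True)
print(shlex.join(argv))
# To actually run it: subprocess.run(argv, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.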
By default, you will have three output files in the outdir: filter_sj_counts.csv,
sj_meta.csv, and raw_sj_counts.csv. The file
filter_sj_counts.csv contains all the information needed for imputation, e.g., by
scASprofiler-impute.
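Before passing results on to the next step, it can be useful to confirm that the three default files were written. The check below is a small sketch; `missing_outputs` is a hypothetical helper, and the demo uses a temporary directory in place of a real outdir.

```python
import tempfile
from pathlib import Path

# The three default output file names listed above.
EXPECTED_OUTPUTS = ("filter_sj_counts.csv", "sj_meta.csv", "raw_sj_counts.csv")

def missing_outputs(outdir):
    """Return the expected output files that are absent from outdir."""
    out = Path(outdir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]

# Demo: a temporary directory containing only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sj_meta.csv").touch()
    demo_missing = missing_outputs(tmp)

print(demo_missing)
```

For a real run you would call `missing_outputs("./out")` and stop if the returned list is not empty.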
Options
More parameters can be set (scASprofiler-perp run --help always shows the options for the version
you are using):
Options:
sj_dir: directory of STAR splicing junction files (e.g., SJ.out.tab); used as the input junction table for filtering and grouping.
gtf: GTF annotation file path.
outdir: output directory for all generated results.
samples_ps: group-wise thresholding parameter; minimum number of observed (non-NaN) cells per junction within an intron group. Junctions below this are set to missing (NaN).
sites_ps: group-wise thresholding parameter; minimum total counts per cell within an intron group. Cells below this are set to missing (NaN) within that group.
sites_thres: site-level QC threshold; minimum number of expressing cells per junction (row-wise non-NaN count) to retain a junction.
samples_thres: cell-level QC threshold; minimum number of expressing junctions per cell (column-wise non-NaN count) to retain a cell.
use_ray: whether to enable Ray parallelism for group-wise threshold filtering (useful for large datasets).
num_cpus: number of CPUs to allocate for Ray-based parallel computation.
filter_unique_gene: whether to keep only junctions uniquely assigned to a single gene (reduces cross-gene ambiguity).
keep_multi_gene: whether to retain junctions that map to multiple genes (less strict; may include ambiguous loci).
use_multi: whether to include multi-mapped reads/junction counts if present in the input matrix (behavior depends on how the upstream counts were generated).
plate: pipeline switch; whether to use the plate-based workflow (smart-seq2).
x10: alias for the 10x (droplet-based) pipeline; if enabled, overrides the plate/tenx selection.
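The four threshold parameters (samples_ps, sites_ps, sites_thres, samples_thres) are easier to follow on a toy matrix. The sketch below illustrates the descriptions above in plain Python and is not the tool's actual code. The function names `mask_group` and `qc_filter` are hypothetical, and the order of operations (cells masked before junctions, group-wise masking before QC) is an assumption. Rows are junctions and columns are cells.

```python
import math

NAN = float("nan")

def observed(values):
    """Keep only the non-NaN entries."""
    return [v for v in values if not math.isnan(v)]

def mask_group(group, samples_ps, sites_ps):
    """Group-wise masking for ONE intron group (junctions x cells)."""
    n_cells = len(group[0])
    out = [row[:] for row in group]
    # sites_ps: a cell with too few total counts in this group becomes NaN.
    for c in range(n_cells):
        if sum(observed([row[c] for row in out])) < sites_ps:
            for row in out:
                row[c] = NAN
    # samples_ps: a junction observed in too few cells becomes NaN.
    for row in out:
        if len(observed(row)) < samples_ps:
            row[:] = [NAN] * n_cells
    return out

def qc_filter(matrix, sites_thres, samples_thres):
    """Drop junctions (rows), then cells (columns), with too few non-NaN values."""
    kept_rows = [i for i, r in enumerate(matrix) if len(observed(r)) >= sites_thres]
    rows = [matrix[i] for i in kept_rows]
    kept_cols = [
        c for c in range(len(matrix[0]))
        if len(observed([r[c] for r in rows])) >= samples_thres
    ]
    return [[r[c] for c in kept_cols] for r in rows], kept_rows, kept_cols

# One intron group with 3 junctions and 3 cells.
group = [
    [15, 2, 30],
    [10, 1, 0],
    [NAN, NAN, 5],
]
masked = mask_group(group, samples_ps=2, sites_ps=20)
filtered, kept_rows, kept_cols = qc_filter(masked, sites_thres=2, samples_thres=2)
```

Here the second cell has only 3 counts in the group, so it is masked. The third junction is then observed in a single cell and is masked as well. The QC step finally keeps junctions 0 and 1 and cells 0 and 2.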