Generating an Expression Matrix from Sequencing Data

Quality control: fastp

fastp -i sample.raw.r1.fq.gz \
-I sample.raw.r2.fq.gz \
-o sample.clean.r1.fq.gz \
-O sample.clean.r2.fq.gz \
-j sample.QC.json \
-h sample.QC.html \
--adapter_sequence_r2 AAAAAAAAAAA
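
After trimming, it can be worth confirming that both mates still contain the same number of reads, since the later steps assume R1 and R2 stay in sync. The following is a minimal sketch that relies only on gzip and wc; the function names count_reads and check_pairs are hypothetical helpers introduced here, not part of fastp.

```shell
# Count reads in a gzipped FASTQ file (one record = 4 lines).
count_reads() {
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Succeed only if both mates contain the same number of reads.
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    echo "R1=$r1 R2=$r2"
    [ "$r1" = "$r2" ]
}

# Usage (file names taken from the fastp command above):
# check_pairs sample.clean.r1.fq.gz sample.clean.r2.fq.gz
```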

Installing fastp: https://github.com/OpenGene/fastp

wget http://opengene.org/fastp/fastp
chmod a+x ./fastp

Barcode filtering: UMI-tools

umi_tools whitelist --stdin sample.clean.r1.fq.gz \
--extract-method=regex \
--bc-pattern="(?P<cell_1>.{9})(?P<discard_1>.{12})(?P<cell_2>.{9})(?P<discard_2>.{13})(?P<cell_3>.{9})(?P<umi_1>.{8})(?<plotT>TTTTTTTT){s<=2}.*" \
--expect-cells=10000  \
--plot-prefix=sample \
--log2stderr \
--subset-reads=100000000 \
--knee-method=density \
--allow-threshold-error > sample.whitelist.txt
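
Before running the extraction, a quick look at the whitelist helps confirm that the number of accepted barcodes is in the range suggested by --expect-cells. The sketch below assumes the tab-separated whitelist layout written by umi_tools whitelist (column 1: accepted barcode, column 3: its count); whitelist_summary is a hypothetical helper name.

```shell
# Summarise a umi_tools whitelist: number of accepted barcodes and the
# sum of their counts (column 3 of the tab-separated file).
whitelist_summary() {
    awk -F'\t' '{ n++; total += $3 }
                END { printf "barcodes=%d reads=%d\n", n, total }' "$1"
}

# Usage (file name taken from the whitelist command above):
# whitelist_summary sample.whitelist.txt
```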

umi_tools extract --extract-method=regex \
--bc-pattern="(?P<cell_1>.{9})(?P<discard_1>.{12})(?P<cell_2>.{9})(?P<discard_2>.{13})(?P<cell_3>.{9})(?P<umi_1>.{8})(?<plotT>TTTTTTTT){s<=2}.*" \
--stdin sample.clean.r1.fq.gz \
--stdout sample.extracted.r1.fq.gz \
--read2-in sample.clean.r2.fq.gz \
--read2-out=sample.extracted.r2.fq.gz \
--filter-cell-barcode \
--whitelist=sample.whitelist.txt
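
umi_tools extract moves the cell barcode and UMI into the read name, as NAME_CELLBARCODE_UMI. Under that assumption, the sketch below tallies reads per cell barcode directly from the extracted FASTQ; reads_per_cell is a hypothetical helper name, and the output is sorted with the most abundant barcode first.

```shell
# Tally reads per cell barcode from a FASTQ produced by umi_tools extract.
# Assumes the read ID (text before the first space) ends in _CELL_UMI.
reads_per_cell() {
    gzip -cd "$1" |
        awk 'NR % 4 == 1 {
                 n = split($1, p, "_")          # last two parts: cell, UMI
                 if (n >= 3) count[p[n - 1]]++
             }
             END { for (c in count) print count[c] "\t" c }' |
        sort -k1,1nr -k2,2
}

# Usage (file name taken from the extract command above):
# reads_per_cell sample.extracted.r2.fq.gz | head
```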

Installing umi_tools: python3 -m pip install umi_tools

Alignment: STAR

Building the reference genome index

# --sjdbOverhang should be set to the maximum read length minus 1
STAR --runMode genomeGenerate \
--genomeDir /opt/star/index \
--genomeFastaFiles GRCh38.p13.genome.fa \
--sjdbGTFfile gencode.v43.annotation.gtf \
--sjdbOverhang 149 \
--runThreadN 10
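
The value 149 corresponds to 150 bp reads. If the read length is not known in advance, it can be estimated from the FASTQ that will be aligned. This is a minimal sketch: max_read_length and sjdb_overhang are hypothetical helper names, and the default of 100000 sampled records is an arbitrary choice made for this example.

```shell
# Longest read among the first N records of a gzipped FASTQ
# (N defaults to 100000, an arbitrary sample size for this sketch).
max_read_length() {
    gzip -cd "$1" 2>/dev/null | head -n "$(( ${2:-100000} * 4 ))" |
        awk 'NR % 4 == 2 && length($0) > max { max = length($0) }
             END { print max + 0 }'
}

# --sjdbOverhang = maximum read length - 1
sjdb_overhang() {
    echo "$(( $(max_read_length "$@") - 1 ))"
}

# Usage (the cDNA read, R2, is the one aligned in this workflow):
# sjdb_overhang sample.extracted.r2.fq.gz
```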

Download the reference genome and the GTF annotation file: https://www.gencodegenes.org/human/

Reference genome: choose Genome sequence (GRCh38.p13), regions ALL
GTF annotation: choose Comprehensive gene annotation, regions CHR

Alignment

STAR --runThreadN 4 \
--genomeDir /opt/star/index \
--readFilesIn sample.extracted.r2.fq.gz \
--readFilesCommand zcat \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix sample
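
Because --outFilterMultimapNmax 1 keeps only uniquely mapped reads, the unique mapping rate is a useful first check on the alignment. With the prefix "sample", STAR writes its summary to sampleLog.final.out; the sketch below pulls the rate out of that file. star_unique_rate is a hypothetical helper name.

```shell
# Print the "Uniquely mapped reads %" value from a STAR Log.final.out file.
star_unique_rate() {
    awk -F'|' '/Uniquely mapped reads %/ {
                   gsub(/[ \t]/, "", $2)   # strip padding around the value
                   print $2
               }' "$1"
}

# Usage (file name follows from --outFileNamePrefix sample):
# star_unique_rate sampleLog.final.out
```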

Installing STAR: https://github.com/alexdobin/STAR

wget https://github.com/alexdobin/STAR/archive/2.7.10b.tar.gz
tar -xzf 2.7.10b.tar.gz
cd STAR-2.7.10b/source
make STAR

Expression quantification: featureCounts

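# -R BAM writes a BAM with XS/XT tags, which umi_tools count reads below
featureCounts -T 4 \
-a gencode.v43.annotation.gtf \
-g gene_name \
-R BAM \
-o sample \
sampleAligned.sortedByCoord.out.bam

# Sort and index the tagged BAM (requires samtools)
samtools sort sampleAligned.sortedByCoord.out.bam.featureCounts.bam -o sample.sorted.bam
samtools index sample.sorted.bam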

umi_tools count \
--per-gene \
--gene-tag=XT \
--assigned-status-tag=XS \
--per-cell \
--wide-format-cell-counts \
-I sample.sorted.bam \
-S sample.counts.tsv.gz
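
With --wide-format-cell-counts the output is a tab-separated matrix with one row per gene and one column per cell, preceded by a header row. Assuming that layout, the sketch below reports the matrix dimensions as a quick sanity check; matrix_shape is a hypothetical helper name.

```shell
# Report the number of genes (rows) and cells (columns) in a gzipped
# wide-format count matrix whose first column holds the gene names.
matrix_shape() {
    gzip -cd "$1" |
        awk -F'\t' 'NR == 1 { cells = NF - 1 }
                    END { genes = (NR > 0) ? NR - 1 : 0
                          printf "genes=%d cells=%d\n", genes, cells }'
}

# Usage (file name taken from the umi_tools count command above):
# matrix_shape sample.counts.tsv.gz
```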

Installing featureCounts (download and extract; it is then ready to use): https://sourceforge.net/projects/subread/

Official workflow

Software and file preparation

Install Docker:
See: https://giftbear.github.io/2021/12/01/Linux%E5%AE%89%E8%A3%85Docker/
Install CWL:
python3 -m pip install cwlref-runner
Download the reference genome and annotation file:
http://bd-rhapsody-public.s3-website-us-east-1.amazonaws.com/Rhapsody-WTA/GRCh38-PhiX-gencodev29/GRCh38-PhiX-gencodev29-20181205.tar.gz
http://bd-rhapsody-public.s3-website-us-east-1.amazonaws.com/Rhapsody-WTA/GRCh38-PhiX-gencodev29/gencodev29-20181205.gtf
Download the pipeline cwl and yml files:
https://bitbucket.org/CRSwDev/cwl/downloads/

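Running the pipeline

Configure the yml file: set the locations of the sequencing data, the reference genome, and the annotation file
Run: /opt/software/python/bin/cwl-runner --outdir ./ rhapsody_wta_1.12.1.cwl template_wta_1.12.1.yml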
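
A missing input is easier to diagnose before the pipeline starts than from a CWL error partway through the run. The sketch below is one possible pre-flight check, not part of the official workflow; check_inputs is a hypothetical helper name, and the list of files to pass should match whatever the yml file actually points to.

```shell
# Return non-zero if any of the given files is missing or empty.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (workflow files and references named as in the steps above):
# check_inputs rhapsody_wta_1.12.1.cwl template_wta_1.12.1.yml \
#     GRCh38-PhiX-gencodev29-20181205.tar.gz gencodev29-20181205.gtf
```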

*Reference: https://www.bdbiosciences.com/content/dam/bdb/marketing-documents/BD_Single_Cell_Genomics_Analysis_Setup_User_Guide_v2.pdf