全基因组重测序数据鉴定4个群体群体特异性酶切标签序列

1、准备pear软件组装的重测序数据(将双端变为单端)

输入:sample_clean_1.fastq.gz、sample_clean_2.fastq.gz

输出:sample.assembled.fasta –> pop.fasta(pop:WCC、RCC、WR1、WR2)

1
2
3
4
5
6
7
8
## 假设每个样本已经有sample.assembled.fasta
## 合并WCC、RCC、WR1、WR2 4个群体的组装数据
# WCC:bx*.fasta x 10、bc*.fasta x 10;RCC:hx*.fasta x 10、hc*.fasta x 10;WR1:h1c*.fasta x20;WR2:h2x*.fasta x 10、h2c*.fasta x 10

cat bx*.fasta bc*.fasta > WCC.fasta
cat hx*.fasta hc*.fasta > RCC.fasta
cat h1c*.fasta > WR1.fasta
cat h2x*.fasta h2c*.fasta > WR2.fasta

2 、使用Extract_Cut_Site.pl脚本提取4个群体的酶切标签序列

输入:pop.fasta

输出:pop.BsaXI.tag.fasta

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# 机器:carp
# 账户:liudong
# 工作目录:/home/liudong/zengy/Enzyme_Extract
# 目录结构:
[liudong@carp Enzyme_Extract]$ ll -rth
总用量 4.0K
drwxrwxr-x. 2 liudong liudong 4.0K 3月  29 11:19 01_all_assembled_fasta
drwxrwxr-x. 2 liudong liudong   10 4月   2 22:27 02_all_pop_BsaXI_enzyme_tag_fasta
drwxrwxr-x. 2 liudong liudong   10 4月   2 22:30 scripts
############
## 生成第一步运行酶切标签序列提取
for pop in WCC RCC WR1 WR2;do echo "perl scripts/Extract_cut_site.pl -r 01_all_assembled_fasta/${pop}.fasta -c [AGTC]{9}AC[AGTC]{5}CTCC[ATGC]{7}-[ATGC]{7}GGAG[AGTC]{5}GT[ATGC]{9} -l 27-27 -o 02_all_pop_BsaXI_enzyme_tag_fasta/${pop}.BsaXI.tag.fasta";done > 01_Extract_cut_site.sh
# 提交任务脚本
nohup bash 01_Extract_cut_site.sh > 01_Extract_cut_site.log &

3、对提取的4个群体的BsaXI酶切序列进行各自内部去冗余序列操作

输入:pop.BsaXI.tag.fasta

输出:pop.BsaXI.tag.rmdup.fasta

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# 批量生成脚本
ls *.fasta |while read fa;do echo "cd-hit -i ${fa} -o ${fa/.fasta/}.rmdup.fasta -c 0.8 -T 4 -n 4 -M 2000";done > 02_rmdup.sh
# 运行脚本
nohup bash 02_rmdup.sh  > 02_rmdup.log &
# work dir
(genome) [liudong@carp 02_all_pop_BsaXI_enzyme_tag_fasta]$ pwd
/home/liudong/zengy/Enzyme_Extract/02_all_pop_BsaXI_enzyme_tag_fasta
(genome) [liudong@carp 02_all_pop_BsaXI_enzyme_tag_fasta]$ ls
02_rmdup.log  RCC.BsaXI.tag.fasta        WCC.BsaXI.tag.fasta        WR1.BsaXI.tag.fasta        WR2.BsaXI.tag.fasta
02_rmdup.sh   RCC.BsaXI.tag.fasta.add.A  WCC.BsaXI.tag.fasta.add.A  WR1.BsaXI.tag.fasta.add.A  WR2.BsaXI.tag.fasta.add.A

4、 对去除冗余序列后的4个群体的酶切标签tag序列进行群体间的相似度比较,去除群体间相似度>阈值(0.8开始)的序列

最终阈值:0.6

外部程序:blast2bed (shell脚本实际是,chmod 755即可使用)

输入:RCC.BsaXI.tag.rmdup.fasta、WCC.BsaXI.tag.rmdup.fasta、WR1.BsaXI.tag.rmdup.fasta、WR2.BsaXI.tag.rmdup.fasta

输出:all.pop.rmdup.fa、${pop}.specific.tag.fasta

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# 合并所有群体提取出来的BsaXI酶切标签序列
cat *.BsaXI.tag.rmdup.fasta > all.pop.fa
# 使用cd-hit对所有序列进行去冗余,保留下来的就是所有群体之间互不相似的特异性酶切标签序列
nohup cd-hit -i all.pop.fa -o all.pop.rmdup.fa  -c 0.6  -T 4 -n 3  > all_cdhit.log &
# 将每个群体原来的序列标签单独提出来作为备用搜索文件,去冗余之后的序列也提取id作为搜索文件
for pop in WCC RCC WR1 WR2;do grep '>' ${pop}.BsaXI.tag.rmdup.fasta | sed 's/>//g' > ${pop}.id.txt;done
grep '>' all.pop.rmdup.fa | sed 's/>//g' > all.id.txt
# 搜索并分配去掉冗余序列后的序列其每条序列所对应群体的序列ID,并且提取对应群体序列作为最终结果
for pop in WCC RCC WR1 WR2;do cat all.id.txt |while read id;do grep $id ${pop}.id.txt;done > ${pop}.selected.id.txt;done
for pop in WCC RCC WR1 WR2;do seqkit grep -f ${pop}.selected.id.txt ${pop}.BsaXI.tag.rmdup.fasta > ${pop}.selected.BsaXI.tag.fasta;done
# 提取对应id的原始reads
ls *.selected.id.txt | while read file;do awk -F'-' '{print $1}' $file >  ${file/.id.txt/}.reads.id.txt;done
ls *.reads.id.txt | while read file;do seqkit grep -f $file ../01_all_assembled_fasta/${file/.selected.reads.id.txt/}.fasta > ${file/.selected.reads.id.txt/}.selected.reads.fasta;done
# 将tag序列回帖至群体的reads序列中
## 给每个群体的reads建库
for pop in WCC RCC WR1 WR2;do makeblastdb -dbtype nucl -in ${pop}.selected.reads.fasta -input_type fasta -parse_seqids -out ${pop}.blastdb;done
## 比对每个群体的特异酶切标签至其对应的群体reads中
for pop in WCC RCC WR1 WR2;do blastn  -task blastn-short -word_size 27 -evalue 1 -query ${pop}.selected.BsaXI.tag.fasta -db ${pop}.blastdb -out ${pop}.txt -outfmt 6 ;done   
## 过滤掉,保留每个查询序列的最高匹配
for pop in WCC RCC WR1 WR2;do awk 'BEGIN{FS=OFS="\t"} !arr[$1]++{print $0}' ${pop}.txt > t;mv t ${pop}.txt;done
## blast outfmt6格式转bed
for pop in WCC RCC WR1 WR2;do bash ./blast2bed ${pop}.txt;done
# 看一下所有bed文件重复的参考reads序列id的行,输出到share.bed(就是说一个ref被多个query比对上)
awk '{file_map[$1] = file_map[$1] FILENAME " "; count[$1]++; lines[$1] = lines[$1] $0 "\n"} END {for (val in count) if (count[val] > 1) print val " appears in: " file_map[val] "\n" lines[val]}' WCC.txt.bed  RCC.txt.bed  WR1.txt.bed  WR2.txt.bed  > share.bed

最终文件:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
WCC.txt.bed
RCC.txt.bed
WR1.txt.bed
WR2.txt.bed
WCC.selected.BsaXI.tag.fasta
RCC.selected.BsaXI.tag.fasta
WR1.selected.BsaXI.tag.fasta
WR2.selected.BsaXI.tag.fasta
WCC.selected.reads.fasta
RCC.selected.reads.fasta
WR1.selected.reads.fasta
WR2.selected.reads.fasta

用于后续设计引物和筛选合适序列位置。