Remove a specified scaffold from a FASTA file

Aim

From a genome sequence file (in fasta format), remove the scaffold named “MT” (=the mitochrondrial genome) and its sequence.

Obstacle

There are many possible solutions to this task: this easiest way is definitely to open the file using a plain text editor and delete the relevant lines manually.
Although it’s also intuitive to use command-line tools, such as sed, grep, or awk, to edit files, the sequence may be split into multiple lines (You can actually store the entire sequences into one single line, but the file will be so difficult and slow to view or edit using the macOS default plain text editor) and simple command-line codes may not be able to correctly capture this.
In some cases, the genome sequence file may be extremely large (such as the example we encounter in this post), which makes it difficult to edit through plain text editor. Thus we need bioinformatics tools that can specifically deal with fasta-format files to solve this problem.

Data preparation

The genome sequence file of zebrafish (Danio rerio) from Ensembl release 104 is extremely large in size (53GB). Such large size made it almost impossible to edit directly using plain text editor.
- Zebrafish genome on Ensembl: https://asia.ensembl.org/Danio_rerio/Info/Index
- Site for download: http://ftp.ensembl.org/pub/release-104/fasta/danio_rerio/dna/
- We use the file: 　Danio_rerio.GRCz11.dna.toplevel.fa.gz　 which contains the sequence of all chromosomes and scaffolds, and unzip the file by .

Let’s start by trying to edit an example or truncated file with less information so that we can clearly see whether our code works as expected.
- The example file should contain scaffolds with IDs containing and not containing “MT” (the target to be deleted).
- The sequence of each scaffold should span multiple lines.

Sample data

>MT
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG

>MT-ND1 Mus musculus, mitochondrion
GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTA
ACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTT
GGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCA
ATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACA
CTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTA
GGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA
TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATT
TCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTAC
TCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCC
ATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTG
ACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTC
GCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT
ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATA
ATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTC
CGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTA
TGTATGTGACATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATAT

>Hsapiens_MT-COX1
ATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGGAACACTATACCTA
TTATTCGGCGCATGAGCTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAG
CTGGGCCAGCCAGGCAACCTTCTAGGTAACGACCACATCTACAACGTTATCGTCACAGCC
CATGCATTTGTAATAATCTTCTTCATAGTAATACCCATCATAATCGGAGGCTTTGGCAAC
TGACTAGTTCCCCTAATAATCGGTGCCCCCGATATGGCGTTTCCCCGCATAAACAACATA
AGCTTCTGACTCTTACCTCCCTCTCTCCTACTCCTGCTCGCATCTGCTATAGTGGAGGCC
GGAGCAGGAACAGGTTGAACAGTCTACCCTCCCTTAGCAGGGAACTACTCCCACCCTGGA
GCCTCCGTAGACCTAACCATCTTCTCCTTACACCTAGCAGGTGTCTCCTCTATCTTAGGG
GCCATCAATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA
ACGCCCCTCTTCGTCTGATCCGTCCTAATCACAGCAGTCCTACTTCTCCTATCTCTCCCA
GTCCTAGCTGCTGGCATCACTATACTACTAACAGACCGCAACCTCAACACCACCTTCTTC
GACCCCGCCGGAGGAGGAGACCCCATTCTATACCAACACCTATTCTGATTTTTCGGTCAC
CCTGAAGTTTATATTCTTATCCTACCAGGCTTCGGAATAATCTCCCATATTGTAACTTAC
TACTCCGGAAAAAAAGAACCATTTGGATACATAGGTATGGTCTGAGCTATGATATCAATT
GGCTTCCTAGGGTTTATCGTGTGAGCACACCATATATTTACAGTAGGAATAGACGTAGAC
ACACGAGCATATTTCACCTCCGCTACCATAATCATCGCTATCCCCACCGGCGTCAAAGTA
TTTAGCTGACTCGCCACACTCCACGGAAGCAATATGAAATGATCTGCTGCAGTGCTCTGA
GCCCTAGGATTCATCTTTCTTTTCACCGTAGGTGGCCTGACTGGCATTGTATTAGCAAAC
TCATCACTAGACATCGTACTACACGACACGTACTACGTTGTAGCCCACTTCCACTATGTC
CTATCAATAGGAGCTGTATTTGCCATCATAGGAGGCTTCATTCACTGATTTCCCCTATTC
TCAGGCTACACCCTAGACCAAACCTACGCCAAAATCCATTTCACTATCATATTCATCGGC
GTAAATCTAACTTTCTTCCCACAACACTTTCTCGGCCTATCCGGAATGCCCCGACGTTAC
TCGGACTACCCCGATGCATACACCACATGAAACATCCTATCATCTGTAGGCTCATTCATT
TCTCTAACAGCAGTAATATTAATAATTTTCATGATTTGAGAAGCCTTCGCTTCGAAGCGA
AAAGTCCTAATAGTAGAAGAACCCTCCATAAACCTGGAGTGACTATATGGATGCCCCCCA
CCCTACCACACATTCGAAGAACCCGTATACATAAAATCTAGA

>HoxA1 partial
TGAAACATTTTCAGCACCAGTCACCCACTGTTCCCAACTGCTTGTCAACAATGGGCCAGA
ACTGTGGAGCTGGCCTAAACAATGACAGTCCTGAGGCCCTTGAGGTCCCCTCTTTGCAGG
ACTTTAGCGTTTTCTCCACAGATTCCTGCCTGCAGCTTTCAGATGCAGTTTCACCCAGTT
TGCCAGGTTCCCTCGACAGTCCCGTAGATATTTCAGCTGACAGCTTAGACTTTTTTACAG
ACACACTCACCACAATCGACTTGCAGCATCTGAATTACTAAAAACATTAAAGCAAAACAA
AGCATCACCAAACAAAAACTCCTTTGACCAGGTGGTTTTGCCTTCTTTTATTTGGGAGTT
TATTTTTTATTTTCTTCTTGACCTACCCCTTCCCTCCTTTAAGTGTTGAGGATTTTCTGT
TTAGTGATTCCCTGACCCAGTTTCAAACAGAGCCATCTTTTACAGATTATTTTGGAGTTT
TAGTTGTTTT

Let’s name the file as:　sample.fa
The file contains four scaffolds with IDs:
1. MT
2. MT-ND1
3. Hsapiens_MT-COX1
4. HoxA1
Note: Only the string in front of the first space is recognized as the ID.

Solution

Use the tool 　seqkit　^[1] to accomplish our task.
It is a standalone tool which does not require pre-configuration and has no dependencies.^[1]
- Download link: https://bioinf.shenwei.me/seqkit/download/
- Unzip by: tar -zxvf seqkit*.tar.gz
- Try whether the downloaded executable works by typing ./seqkit version. This should return the version of the downloaded tool. (mine: 2.1.0)

The function 　seqkit grep　 can be used to search sequences by ID or sequence motifs.
To exclude sequences that match a certain regular expression, we can set the options
　-rvip　^[2]
- -r : specify that the following pattern is regular expression
- -v : keep only the sequences that do NOT match the regular expression
- -i : ignore upper/lower case
- -p "pattern" : followed by the specified pattern (=regular expression)
- Other options are documented in detail here: https://bioinf.shenwei.me/seqkit/usage/#grep

［Example 1］Remove the scaffold named exactly “MT”

./seqkit grep -rvip "^MT$" sample.fa  > outfile.1.fa

The remained sequences:
- ② MT-ND1
- ③ Hsapiens_MT-COX1
- ④ HoxA1
The option -i is unnecessary if upper/lower case has to be matched strictly.

［Example 2］Remove scaffolds with IDs starting with “MT”

./seqkit grep -rvip "^MT" sample.fa  > outfile.2.fa

The remained sequences:
- ③ Hsapiens_MT-COX1
- ④ HoxA1

［Example 3］Remove scaffolds with IDs ending with the number “1”

Three of the sequences actually have IDs ending with the number “1”. How about removing all these three sequences?

./seqkit grep -rvip "1$" sample.fa  > outfile.3.fa

The remained sequences:
- ① MT

References

Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. Plos One 11, e0163962 (2016).
https://www.biostars.org/p/446424/

BIOLOGIST J

Remove a specified scaffold from a FASTA file

Aim

Obstacle

Data preparation

Sample data

Solution

References

Like this:

One response

Leave a Reply to FASTA file manipulation: number of nucleotides per line – BIOLOGIST JCancel reply

Remove a specified scaffold from a FASTA file

Aim

Obstacle

Data preparation

Sample data

Solution

References

Share this:

Like this:

One response

Leave a Reply to FASTA file manipulation: number of nucleotides per line – BIOLOGIST JCancel reply

Discover more from BIOLOGIST J