Aim
Given a file of nucleotide (or amino acid) sequences in FASTA format, how can we change the number of nucleotides (or amino acids) shown per line?
Sample data
- The sample file contains sequences in FASTA format:
  - The sequence ID line is marked by a starting ">" symbol.
  - Each sequence is “chopped” into multiple lines, each line showing 20 nucleotides.
- Now we want to edit this file so that each line shows 40 nucleotides.
>mitochondrion partial, Hippocampus hippocampus
GCCATCGTAGCTTAAGTCTT
AAAGCATAACACTGAAGATG
TTATTATGAACCCTAGAAAA
TTCCGAAGGCACAAAGGCTT
GGTCCTAGCTTTACTATTAT
TTACTACCAAACTTACACAT
GCAAGCATCCGCACCCCCGT
GAGAATGCCCTTAACCCTCT
TATGAGATCAAGGAGCTGGT
ATCAGGCGCATATAATTGCC
CACAACACCTTGCTTAGCCA
CACCCCCAAGGGAATTCAGC
AGTGATAAACATTAAGCCAT
AAGTGTAAACTTGACTTAGT
TAAGGTTTTTAGAGCCGGTA
AAACTCGTGCCAGCCACCGC
GGTTATACGAGAGGCTCAAG
ATAATAGAAATCGGCGTAAA
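Before editing, it can help to confirm how many nucleotides each line currently holds. The following is a minimal sketch using only standard Unix tools (awk, sort, uniq); the helper name line_widths and the scratch file sample_check.fa are hypothetical and not part of seqkit. It writes the first three sequence lines of the sample to a file and tallies the line lengths.

```shell
# Write a shortened copy of the sample (header plus first three sequence lines)
# to a scratch file. The file name is an assumption made for this sketch.
cat > sample_check.fa <<'EOF'
>mitochondrion partial, Hippocampus hippocampus
GCCATCGTAGCTTAAGTCTT
AAAGCATAACACTGAAGATG
TTATTATGAACCCTAGAAAA
EOF

# line_widths FILE: hypothetical helper.
# Prints the length of every non-header line, then counts how often each
# length occurs (output columns: count, length).
line_widths() {
  awk '!/^>/ { print length($0) }' "$1" | sort -n | uniq -c
}

line_widths sample_check.fa
```

For this shortened sample the tally should report three lines of length 20. In a real file, the last line of each sequence is often shorter than the rest, so more than one length may appear.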
Solution
- Use the FASTA/Q file manipulation tool seqkit [1] to accomplish our task.
  - Installation: https://bioinf.shenwei.me/seqkit/download/
  - ref. another previous post: Remove a specified scaffold from a FASTA file
- We will use the function seqkit seq [2] to change the number of nucleotides shown per line.
  - Set the desired length with the option -w [3].
    - This is a global option (that can be used with any of the sub-functions of seqkit) to set how many nucleotides to show per line in the output file.
[Example 1] 40 nucleotides to be shown per line
./seqkit seq -w 40 infile.fa > outfile.fa
- The output file "outfile.fa" will show 40 nucleotides per line for each sequence.
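To make the effect of the line width concrete, here is a rough seqkit-free illustration in awk. It is only a sketch of what re-wrapping does, not how seqkit is implemented; the helper name rewrap, the header ">demo" and the file demo.fa are hypothetical. It joins the lines of each sequence and prints them again in chunks of the requested width.

```shell
# rewrap WIDTH FILE: hypothetical helper, not part of seqkit.
# Joins all lines of each sequence, then prints them WIDTH characters per line.
# A WIDTH of 0 (or less) prints each sequence on one line.
rewrap() {
  awk -v w="$1" '
    function emit(   i) {
      if (seq == "") return
      if (w <= 0) {
        print seq                              # no wrapping requested
      } else {
        for (i = 1; i <= length(seq); i += w)  # one chunk of w characters per line
          print substr(seq, i, w)
      }
      seq = ""
    }
    /^>/ { emit(); print; next }   # header: flush the previous sequence, print the header
    { seq = seq $0 }               # sequence line: append to the buffer
    END { emit() }                 # flush the last sequence
  ' "$2"
}

# Demo input: three 20-nucleotide lines taken from the sample data.
printf '>demo\nGCCATCGTAGCTTAAGTCTT\nAAAGCATAACACTGAAGATG\nTTATTATGAACCCTAGAAAA\n' > demo.fa

rewrap 40 demo.fa
```

With a width of 40, the first two demo lines are merged into one 40-nucleotide line and the remaining 20 nucleotides form a shorter final line. This sketch holds each whole sequence in memory, so for large files seqkit remains the better choice.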
[Example 2] The entire sequence in one single line
./seqkit seq -w 0 infile.fa > outfile.fa
- Setting the option to -w 0 will combine all “chopped” lines of a sequence into one long string that is shown on a single line.
- Note that when the sequences are extremely long (such as a chromosome-level assembly), this single-line file may become too heavy for a plain text editor to handle.
- Similarly, for a FASTQ [4] file (which seqkit can also handle), setting -w 0 will convert a multi-line FASTQ file to a standard four-line FASTQ file.
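One way to check that a FASTA file really has one line per sequence is to compare the number of header lines with the number of sequence lines. The sketch below assumes FASTA input (it does not apply to FASTQ); the helper name is_single_line and the two demo files are hypothetical.

```shell
# is_single_line FILE: hypothetical helper.
# Succeeds (exit status 0) when the number of header lines equals the number
# of non-empty sequence lines, i.e. every sequence sits on exactly one line.
is_single_line() {
  awk '/^>/ { h++; next } NF { s++ } END { exit !(h == s) }' "$1"
}

# Demo files: one already unwrapped, one still wrapped at 20 nucleotides.
printf '>demo_a\nGCCATCGTAGCTTAAGTCTTAAAGCATAACACTGAAGATG\n>demo_b\nTTATTATGAACC\n' > unwrapped_demo.fa
printf '>demo_a\nGCCATCGTAGCTTAAGTCTT\nAAAGCATAACACTGAAGATG\n' > wrapped_demo.fa

is_single_line unwrapped_demo.fa && echo "unwrapped_demo.fa: one line per sequence"
is_single_line wrapped_demo.fa || echo "wrapped_demo.fa: still wrapped"
```

Because the check reads the file as a stream, it also works on files that are too large to open comfortably in a text editor.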
References
- [1] Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
- [2] https://bioinf.shenwei.me/seqkit/usage/#seq
- [3] https://bioinf.shenwei.me/seqkit/usage/#seqkit
- [4] https://en.wikipedia.org/wiki/FASTQ_format
