FASTA file manipulation: number of nucleotides per line

Aim

Given a file of nucleotide (or amino acid) sequences in FASTA format, how can we change the number of nucleotides (or amino acids) shown per line?


Sample data

  • The sample file contains sequences in FASTA format:
    • Each sequence ID line starts with a > symbol.
    • Each sequence is “chopped” into multiple lines of 20 nucleotides each.
  • We want to edit this file so that each line shows 40 nucleotides.
>mitochondrion partial, Hippocampus hippocampus
GCCATCGTAGCTTAAGTCTT
AAAGCATAACACTGAAGATG
TTATTATGAACCCTAGAAAA
TTCCGAAGGCACAAAGGCTT
GGTCCTAGCTTTACTATTAT
TTACTACCAAACTTACACAT
GCAAGCATCCGCACCCCCGT
GAGAATGCCCTTAACCCTCT
TATGAGATCAAGGAGCTGGT
ATCAGGCGCATATAATTGCC
CACAACACCTTGCTTAGCCA
CACCCCCAAGGGAATTCAGC
AGTGATAAACATTAAGCCAT
AAGTGTAAACTTGACTTAGT
TAAGGTTTTTAGAGCCGGTA
AAACTCGTGCCAGCCACCGC
GGTTATACGAGAGGCTCAAG
ATAATAGAAATCGGCGTAAA

Solution

  • We will use SeqKit [1], specifically the command seqkit seq [2], to change the number of nucleotides shown per line.
    • Set the desired line width with the option -w [3].
    • This is a global option (usable with any seqkit subcommand) that sets how many nucleotides appear on each line of the FASTA output.

[Example 1] 40 nucleotides per line

./seqkit seq -w 40 infile.fa > outfile.fa
  • The output file "outfile.fa" will show 40 nucleotides per line for each sequence (the last line of a sequence may be shorter).
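
To make the effect of re-wrapping concrete, here is a minimal pure-Python sketch of the same transformation. It is not how seqkit is implemented; the function name rewrap_fasta is hypothetical, and the sketch assumes a plain-text FASTA in which every header line starts with >. A width of 0 is treated as "no wrapping", mirroring the -w 0 convention.

```python
def rewrap_fasta(text, width):
    """Return FASTA text with each sequence re-wrapped to `width` residues per line.

    width <= 0 means no wrapping (whole sequence on one line).
    Hypothetical helper for illustration; not part of seqkit.
    """
    out = []
    chunks = []  # sequence pieces collected for the current record

    def flush():
        # Join the pieces of the current record and emit them at the new width.
        seq = "".join(chunks)
        chunks.clear()
        if not seq:
            return
        if width <= 0:
            out.append(seq)
        else:
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            flush()           # finish the previous record, if any
            out.append(line)  # header lines are copied unchanged
        else:
            chunks.append(line)
    flush()
    return "\n".join(out) + "\n" if out else ""


# First three lines of the sample record above (20 nucleotides each).
sample = """>mitochondrion partial, Hippocampus hippocampus
GCCATCGTAGCTTAAGTCTT
AAAGCATAACACTGAAGATG
TTATTATGAACCCTAGAAAA
"""
print(rewrap_fasta(sample, 40), end="")
```

With a width of 40, the 60 nucleotides of this excerpt come out as one line of 40 followed by one line of 20; the header line is untouched.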

[Example 2] The entire sequence on a single line

./seqkit seq -w 0 infile.fa > outfile.fa
  • Setting -w 0 disables line wrapping: the “chopped” lines of each sequence are joined into one long string shown on a single line.
  • Note that when the sequences are extremely long (such as chromosome-level assemblies), the single-line file may become too heavy for a plain text editor to handle.
  • Similarly, for FASTQ [4] files (which seqkit can also handle), setting -w 0 will convert a multi-line FASTQ file to a standard four-line FASTQ file.
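
After either example, it can be useful to confirm that the output file really has the expected layout. The sketch below counts the residues on each sequence line, grouped by record. The helper name sequence_line_lengths and the toy records are made up for illustration; in practice you would pass in the contents of outfile.fa.

```python
def sequence_line_lengths(text):
    """Map each FASTA header (without the leading '>') to its sequence-line lengths.

    Hypothetical helper for illustration; assumes plain-text FASTA input.
    """
    lengths = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:]
            lengths[current] = []
        elif current is not None:
            lengths[current].append(len(line))
    return lengths


# Toy records: one wrapped at 10 residues per line, one on a single line.
wrapped = ">seq1\nACGTACGTAC\nGTACGTACGT\nACGT\n"
single = ">seq1\nACGTACGTACGTACGTACGTACGT\n"

print(sequence_line_lengths(wrapped))
print(sequence_line_lengths(single))
```

A file wrapped at a fixed width should show that width for every line except possibly the last of each record, while a file written with -w 0 should show exactly one length per record.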

References

  1. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE 11, e0163962 (2016).
  2. https://bioinf.shenwei.me/seqkit/usage/#seq
  3. https://bioinf.shenwei.me/seqkit/usage/#seqkit
  4. https://en.wikipedia.org/wiki/FASTQ_format
