FASTA format - Principum

> [!fact]- Fun Fact!
> FASTA is the name of software developed in 1985 for protein similarity searching. Although not many remember the software, the file format lives on ubiquitously in bioinformatics.
> FASTA is also an acronym for fast-all, as it was developed to work with both alphabets, nucleotide and peptide, which previously had their own software and file formats, FASTN and FASTP respectively.

FASTA is a ubiquitous file format which contains sequence data in either nucleotide or amino acid codes. It contains a header and a sequence, which look something like:

```markdown
>Sequence_1
ATGCGATCGATGCATGCA
```

#### The header

The header **must** always be unique within the file and begin with the greater-than symbol `>`, preferably with only underscores `_` splitting the name. Other symbols can cause issues with some software commonly used in bioinformatics.

The header can also include a significant amount of information about the origins of the data held in the file. This extra contextual information is known as the source modifiers. For example (taken from the NCBI website):

```markdown
>Seq1 [organism=Streptomyces lavendulae] [strain=456A] Streptomyces lavendulae strain 456A mitomycin radical oxidase (mcrA) gene, complete cds.
>ABCD [organism=Plasmodium falciparum] [isolate=ABCD] Plasmodium falciparum isolate ABCD merozoite surface protein 2 (msp2) gene, partial cds.
>DNA.new [organism=Homo sapiens] [chromosome=17] [map=17q21] [moltype=mRNA] Homo sapiens breast and ovarian cancer susceptibility protein (BRCA1) mRNA, complete cds.
```

There is a fixed set of source modifiers, which can be found here: [NCBI Source Modifiers](https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html)

> [!info] Note
> Not all software uses the source modifiers, and some will even crash on them. So keep in mind it may be a safer option to "clean" the headers and reduce them to just the sequence ID.

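
The header conventions described above (a leading `>`, a unique ID, optional `[key=value]` source modifiers) can be sketched in Python. This is a minimal illustration rather than a reference parser: the names `parse_header`, `clean_fasta` and `MODIFIER_RE` are hypothetical, and it assumes the sequence ID is the first whitespace-delimited token after `>`.

```python
import re

# Assumption: source modifiers always appear as [key=value] with no nested brackets.
MODIFIER_RE = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_header(line):
    """Split a FASTA header line into (sequence ID, modifiers, description)."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    body = line[1:].strip()
    parts = body.split(None, 1)
    # Assumption: the ID is the first whitespace-delimited token.
    seq_id = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    modifiers = {k.strip(): v.strip() for k, v in MODIFIER_RE.findall(rest)}
    description = MODIFIER_RE.sub("", rest).strip()
    return seq_id, modifiers, description


def clean_fasta(lines):
    """Yield FASTA lines with each header reduced to '>ID'; reject duplicate IDs."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id, _, _ = parse_header(line)
            if seq_id in seen:
                raise ValueError(f"duplicate header ID: {seq_id}")
            seen.add(seq_id)
            yield ">" + seq_id
        elif line:
            # Sequence lines pass through unchanged; blank lines are dropped.
            yield line


records = [
    ">Seq1 [organism=Streptomyces lavendulae] [strain=456A] mcrA gene",
    "ATGCGATCGATGCATGCA",
]
print(list(clean_fasta(records)))  # ['>Seq1', 'ATGCGATCGATGCATGCA']
```

Reducing each header to the bare ID discards the source modifiers, so it is worth keeping the original file alongside the cleaned copy.
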
#### The Sequence

The sequence section of the file requires the use of [IUPAC](https://iupac.org/) symbols. Do not use `?` or `-`: they will be cleaned out, which changes your sequence. For ambiguous sequence it is better to use `N`.

It is also recommended to insert a return (new line) every 80 characters. This isn't mandatory, but it helps keep everything standardised and looking clean.

More information on IUPAC symbols can be found here: [IUPAC codes](https://www.bioinformatics.org/sms/iupac.html).

##### References

- IUPAC Codes - https://www.bioinformatics.org/sms/iupac.html - Accessed November 2023
- IUPAC Site - https://iupac.org/ - Accessed November 2023
- NCBI Fasta Format - https://www.ncbi.nlm.nih.gov/genbank/fastaformat/ - Accessed November 2023
- NCBI Source modifiers - https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html - Accessed November 2023
- Wikipedia - https://en.wikipedia.org/wiki/FASTA - Accessed November 2023