
How To Split Fasta File

This code extracts and splits the sequences from a FASTA file:

outfile = open('outf', 'w')
for line in open('input'):
    if line[0] == '>':
        outfile.write('\n')
    else:

Solution 1:

There are some issues in your code.

First, you keep only certain lines of the file, discard the rest, and then write the kept lines to another file. I'm not sure why that last step is needed; processing the lines directly is more efficient.

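def processLines(inputname):
    all_codons = []
    with open(inputname) as infile:
        for line in infile:
            # header lines start with ">"; only the other lines hold sequence data
            if line[0] != ">":
                seq = line.strip()
                # slice the line into complete 3-letter codons
                codons = [seq[i:i+3] for i in range(0, len(seq), 3)
                          if len(seq[i:i+3]) == 3]
                all_codons.append(codons)
    return all_codons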
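One caveat with the line-by-line approach: FASTA sequences are often wrapped over several lines, so a codon can straddle a line break. Here is a minimal sketch, assuming your files are wrapped that way, that joins the lines of each record before slicing. The names split_codons and codons_per_record are made up for illustration and are not part of the original code.

```python
def split_codons(seq):
    """Return the complete 3-letter codons of seq; a trailing partial codon is dropped."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codons_per_record(lines):
    """Yield one codon list per FASTA record, joining wrapped sequence lines first.

    Records without any sequence lines are skipped.
    """
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if it had any sequence.
            if chunks:
                yield split_codons("".join(chunks))
            chunks = []
        elif line:
            chunks.append(line)
    # Emit the last record, which has no header after it.
    if chunks:
        yield split_codons("".join(chunks))


example = [">seq1", "atgca", "ttga", ">seq2", "ccgg"]
result = list(codons_per_record(example))
print(result)
```

Because the function accepts any iterable of lines, you can pass it an open file object directly.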

Also, every call to identical_segment generates a new dictionary that you use as a mapping from characters to scores. This can become expensive as the number of calls grows. To avoid it, you can try one of two ways. Either define the dictionary once, outside the function:

code = {"a": 0, "c": 1, "g": 2, "t": 3}

def identical_segment(input_string):
    ...  # what you have written

or create a class whose instance contains the dictionary.
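A rough sketch of the class variant follows. The body of identical_segment is not shown in the question, so the class name SegmentScorer and its encode method are placeholders for illustration only; the point is just that the dictionary is built once, when the instance is created.

```python
class SegmentScorer:
    """Hypothetical wrapper: the mapping is built once per instance, not once per call."""

    def __init__(self):
        self.code = {"a": 0, "c": 1, "g": 2, "t": 3}

    def encode(self, input_string):
        """Translate each base to its number.

        Stands in for the real body of identical_segment, which is not shown.
        Raises KeyError for characters outside a, c, g, t.
        """
        return [self.code[base] for base in input_string.lower()]


scorer = SegmentScorer()
encoded = scorer.encode("ACGT")
print(encoded)
```

Every later call on scorer reuses the same dictionary.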

In order to process multiple files, do:

output = [processLines(filename) for filename in filenames]
# filenames is an iterable

or if you want to map the input name to output:

outputDict = {filename: processLines(filename) for 
              filename in filenames}

Finally, call your analysis function on each output and write the results to an output file.
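The three steps can be wired together roughly as follows. This is only a sketch: run_pipeline is a made-up name, and the demo uses a simple stand-in (count_sequence_lines) where you would pass processLines and your own analysis function.

```python
import os
import tempfile


def run_pipeline(filenames, process, analyze, output_path):
    """Process each input file, analyze the result, and write one line per file."""
    with open(output_path, "w") as out:
        for filename in filenames:
            result = analyze(process(filename))
            out.write("{}\t{}\n".format(filename, result))


def count_sequence_lines(path):
    """Stand-in for the processing step: count non-empty lines that are not headers."""
    with open(path) as infile:
        return sum(1 for line in infile
                   if line.strip() and not line.startswith(">"))


# Demo on a small temporary FASTA file.
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "demo.fasta")
with open(fasta_path, "w") as handle:
    handle.write(">seq1\natgc\nttga\n>seq2\nccgg\n")

report_path = os.path.join(workdir, "report.tsv")
# The analysis step here just passes the count through unchanged.
run_pipeline([fasta_path], count_sequence_lines, lambda n: n, report_path)

with open(report_path) as handle:
    report = handle.read()
print(report)
```

Only the final results touch the disk; the intermediate data stays in memory.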

To summarize what you should pick up in this post:

  1. Intermediate output files may not be the best option, since file I/O is expensive. If you write the data to a file, you have to read it back in again, which doubles the cost.

  2. The same object should not be created over and over again. Proofread your code to make sure this does not happen.

  3. Partition your main task into several small tasks, then start with a simple, intuitive approach for each one. In this example, we have: process files -> analysis -> output result

  4. Comprehensions are a useful and readable way to build collections while iterating in Python. You can search for list comprehension and dictionary comprehension to learn more.

Try something out yourself. I'll be more than happy to read your improved code here.

Solution 2:

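Try using Biopython to extract the nucleotide sequences from your FASTA file. Its SeqIO module reads the file one record at a time (AlignIO is meant for alignments, not for plain sequence files):

from Bio import SeqIO

# print the ID and sequence of every record
for record in SeqIO.parse('filename.fasta', 'fasta'):
    print(record.id, record.seq)

# or store the sequences in a new file, one per line
seqs = []
for record in SeqIO.parse('filename.fasta', 'fasta'):
    seqs.append(str(record.seq) + '\n')

# outfile is the path of the output file
with open(outfile, 'w') as out:
    out.writelines(seqs)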
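If Biopython is not available, or if "split" means one output file per record, the standard library is enough. The sketch below makes that assumption; split_fasta and the record_N.fasta naming scheme are invented for illustration and are not part of either answer.

```python
import os
import tempfile


def split_fasta(path, out_dir):
    """Write each record of a FASTA file to its own file; return the new paths.

    Lines that appear before the first header are ignored.
    """
    written = []
    out = None
    try:
        with open(path) as infile:
            for line in infile:
                if line.startswith(">"):
                    # A header starts a new record: close the previous file, open the next.
                    if out is not None:
                        out.close()
                    name = "record_{}.fasta".format(len(written) + 1)
                    out_path = os.path.join(out_dir, name)
                    out = open(out_path, "w")
                    written.append(out_path)
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written


# Demo on a small temporary FASTA file.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "demo.fasta")
with open(source, "w") as handle:
    handle.write(">seq1 first\natgc\nttga\n>seq2\nccgg\n")

parts = split_fasta(source, workdir)
print(parts)
```

Numbered file names avoid problems with record IDs that contain characters not allowed in file names.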
