Python: How to print out sequences with length n from sliding window in FASTA file?

python,python-2.7,python-3.x,biopython,fasta

So I guess "seq_record.seq" is the whole DNA sequence, like "ATCGCGTC" for human1. You can write it like this: from Bio import SeqIO with open("test1_out.txt","w") as f: for seq_record in SeqIO.parse("test1.fasta", "fasta"): for i in range(len(seq_record.seq) - 4) : f.write(str(seq_record.id) + "\n") f.write(str(seq_record.seq[i:i+5]) + "\n") #first 5 base positions ...
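The sliding-window idea can also be tried on a plain string, without Biopython. This is only a minimal sketch: the function name `sliding_windows` and its `size` parameter are made up for illustration and are not part of the original answer.

```python
def sliding_windows(sequence, size=5):
    """Yield every overlapping window of `size` characters, left to right."""
    # A sequence of length L has L - size + 1 windows (none if it is shorter).
    for start in range(len(sequence) - size + 1):
        yield sequence[start:start + size]


# "ATCGCGTC" is the example sequence quoted in the answer.
for window in sliding_windows("ATCGCGTC", size=5):
    print(window)
```

With `size=5` this matches the `range(len(seq) - 4)` bound used in the answer, since `len - 5 + 1` equals `len - 4`.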

Subset sequence data in fasta file based on IDs stored in listed data frames

r,subsetting,fasta,seq

This can be done with the following code: split(fastafile[GOI$ID], rep(1:3,each=2)) $`1` $`1`$r1 [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac" $`1`$r2 [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag" $`2` $`2`$r3 [1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca" $`2`$r4 [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg" $`3` $`3`$r5 [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg" $`3`$r6 [1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg" As to why your lapply code is not working. One reason is...

Remove a specific pattern in fasta sequences

python,fasta

You could use the re.sub function. import re with open('myfile.fasta') as f: with open('outfile.fasta', 'w') as out: for line in f: if line.startswith('>'): out.write(line) else: out.write(re.sub(r'[\[\]]|/.', '', line)) /. matches / and also the character following the forward slash. [\[\]] is a character class which matches the [ or ] symbols. | is called the alternation operator or logical...
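To see what that pattern does on a single line, here is a small sketch. The helper name `clean_sequence_line` and the sample input are assumptions for illustration; only the regular expression comes from the answer.

```python
import re

# Pattern from the answer: remove "[" and "]", and remove "/" together
# with the single character that follows it.
PATTERN = re.compile(r'[\[\]]|/.')


def clean_sequence_line(line):
    """Return the line with brackets and "/x" pairs stripped out."""
    return PATTERN.sub('', line)


# Hypothetical sequence line containing an ambiguity marker.
print(clean_sequence_line("AC[G/T]AA"))
```

Note that `.` does not match a newline, so a `/` at the very end of a line is left in place.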

validate text box input (for fasta format) before submission

javascript,php,fasta

In order to validate it using JavaScript, you can use the following function: /* * Validates (true/false) a single fasta sequence string * param fasta the string containing a putative single fasta sequence * returns boolean true if string contains single fasta sequence, false * otherwise */ function validateFasta(fasta) {...

Python: How to extract DNA sequence based on a text file with binary content?

python,python-2.7,bioinformatics,biopython,fasta

For this it is better to use Biopython: from Bio import SeqIO mask = ["1"==_.strip() for _ in open("mask.txt")] seqs = [seq for seq in SeqIO.parse(open("input.fasta"), "fasta")] seqs_filter = [seq for flag, seq in zip(mask, seqs) if flag] for seq in seqs_filter: print seq.format("fasta") You get: >human2 GCTTGCGCTAG >human3 TTCGCTAG explanation...
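The mask logic itself does not depend on Biopython and can be sketched on plain tuples. The function name `filter_by_mask`, the human1 record, and the mask values below are assumptions chosen so that the result matches the output quoted in the answer.

```python
def filter_by_mask(records, mask_lines):
    """Keep each record whose corresponding mask line is exactly "1"."""
    flags = [line.strip() == "1" for line in mask_lines]
    # zip pairs the first flag with the first record, and so on.
    return [record for flag, record in zip(flags, records) if flag]


# Hypothetical stand-ins for input.fasta and mask.txt.
records = [
    ("human1", "ATCGCGTC"),
    ("human2", "GCTTGCGCTAG"),
    ("human3", "TTCGCTAG"),
]
mask_lines = ["0\n", "1\n", "1\n"]

for name, sequence in filter_by_mask(records, mask_lines):
    print(">" + name)
    print(sequence)
```

Because `zip` stops at the shorter input, a mask file with fewer lines than there are records silently drops the trailing records.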

Trim first N bases in multi fasta file with awk and print with max width format

awk,gawk,fasta

To answer your specific questions, you can specify the width of an output field using the * format modifier: $ awk 'BEGIN{printf "%s\n", "foo"}' foo $ awk 'BEGIN{printf "%*s\n", 10, "foo"}' foo and no, there is no join function to put arrays back together into a string (the opposite of...

Memory limit in converting FASTA file string to list

python,string,list,python-2.7,fasta

I believe this line str_Reading_Frame1=open("Ychromosome.fa", "r").read() is the problem: it reads a huge string into memory at once. And the recursion you are doing definitely doesn't help with performance. As well as the stack frames for each recursive call, you are slicing a huge string N times, which should be O(N^2)...
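One possible way to avoid both the single huge read and the recursion is to pull the file in fixed-size pieces with a loop. This is a sketch under assumptions: `iter_chunks` and `chunk_size` are illustrative names, and FASTA header and newline handling are left out.

```python
import io


def iter_chunks(handle, chunk_size=1024):
    """Yield successive pieces of at most chunk_size characters from an open file."""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # empty string means end of file
            break
        yield chunk


# io.StringIO stands in for open("Ychromosome.fa", "r") in this sketch.
handle = io.StringIO("ACGTACGTAC")
chunks = list(iter_chunks(handle, chunk_size=4))
print(chunks)
```

Only one chunk is held in memory at a time while iterating, and the loop adds no stack frames.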

Want to add random string to identifier line in fasta file

perl,random,add,identifier,fasta

Your program isn't working because the regex ^(\S+)\s+(.*) matches every line in the input file. For instance, \S+ matches CTTCATCGCACATGGATAACTGTGTACCTGACT; the newline at the end of the line matches \s+; and nothing matches .*. Here's how I would encode your solution. It simply appends $current_id to the end of any...

How to generate matrix from fasta files

python,numpy,bioinformatics,biopython,fasta

I use Biopython to parse FASTA files: from Bio import SeqIO #change by path fasta files list fasta_files = [ "test.fasta", "test2.fasta" ] m_out = {} #store matrix divergence_out = {} #store divergence result for name_file in fasta_files: memory = set() m_out[name_file] = {} divergence_out[name_file] = {} for seq1 in...

How to randomly extract FASTA sequences using Python?

python,extract,bioinformatics,extraction,fasta

If you are working with FASTA files, use Biopython; to get n sequences, use random.sample: from Bio import SeqIO from random import sample with open("foo.fasta") as f: seqs = SeqIO.parse(f,"fasta") print(sample(list(seqs), 2)) Output: [SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021',...
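For readers without Biopython installed, the same random.sample approach can be sketched with a tiny hand-written parser. `parse_fasta` is a hypothetical helper, not a Biopython function, and the third record below is invented so there is something to choose from.

```python
import random


def parse_fasta(lines):
    """Minimal FASTA parser: return a list of (header, sequence) tuples."""
    records = []
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


lines = [
    ">chr1:1154147-1154167", "GAGATCGTCCGGGACCTGGGT",
    ">chr1:983001-983021", "GTCCGCTTGCGGGACCTGGGG",
    ">hypothetical_record", "ACGTACGT",  # made-up entry
]
picked = random.sample(parse_fasta(lines), 2)
print(picked)
```

`random.sample` draws without replacement, so the same record is never returned twice, and it raises ValueError if fewer than n records exist.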

Editing Uniref FASTA header ID

sed,fasta

Try this with GNU sed to replace first _ by | and first whitespace by |: sed 's/_/|/;s/ /|/' file > new_file or this to edit file: sed -i 's/_/|/;s/ /|/' file ...

Python: How to find coordinates of short sequences in a FASTA file?

python,python-2.7,bioinformatics,biopython,fasta

Using Biopython: from Bio import SeqIO for long_sequence_record in SeqIO.parse(open('long_sequences.fasta'), 'fasta'): long_sequence = str(long_sequence_record.seq) for short_sequence_record in SeqIO.parse(open('short_sequences.fasta'), 'fasta'): short_sequence = str(short_sequence_record.seq) if short_sequence in long_sequence: start = long_sequence.index(short_sequence) + 1 stop = start + len(short_sequence) - 1 print short_sequence_record.id, start, stop ...
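The coordinate arithmetic in that loop can be checked in isolation on plain strings. This is a sketch: `find_coordinates` and the sample sequences are illustrative assumptions, and only the first occurrence is reported, as in the answer.

```python
def find_coordinates(long_sequence, short_sequence):
    """Return 1-based inclusive (start, stop) of the first match, or None."""
    index = long_sequence.find(short_sequence)  # 0-based; -1 when absent
    if index == -1:
        return None
    start = index + 1                       # convert to 1-based
    stop = start + len(short_sequence) - 1  # inclusive end position
    return start, stop


# Hypothetical sequences: "CGT" occupies positions 4 to 6 of the long one.
print(find_coordinates("AAACGTTT", "CGT"))
```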

How can I remove first line from fasta file? [duplicate]

python,string,python-2.7,fasta

If you need the whole file's content, why not read all lines at once and immediately slice away the first line? with open('path','r') as f: content = f.readlines()[1:] output="".join(content) ...
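A variation on the same idea, sketched with itertools.islice so that no intermediate list of all lines is built. The helper name `skip_first_line` is an assumption, and io.StringIO stands in for the real file.

```python
import io
from itertools import islice


def skip_first_line(handle):
    """Return the file content as one string, minus its first line."""
    # islice(handle, 1, None) iterates over the lines starting at the second.
    return "".join(islice(handle, 1, None))


# Hypothetical file content; replace with open('path', 'r') for a real file.
handle = io.StringIO(">header\nACGT\nTTGA\n")
output = skip_first_line(handle)
print(output, end="")
```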

Deduplicate FASTA, keep a seq id

command-line,formatting,bioinformatics,fasta

linearize and sort/uniq -c awk '/^>/ {if(N>0) printf("\n"); ++N; printf("%s ",$0);next;} {printf("%s",$0);} END { printf("\n");}' input.fa | \ sort -t ' ' -k2,2 | uniq -f 1 -c |\ awk '{printf("%s_%s\n%s\n",$2,$1,$3);}' >seqID_2_2 AGGGCACGCCTGCCTGGGCGTCACGC >seqID_1_1 CCCGGCCGTCGAGGC >seqID_3_3 CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA ...

Change the identifier line name to random shortened name in fasta file

perl,random,identifier,short,fasta

Your problem can simply be fixed by resetting $string to an empty string just inside the while loop. But this is needlessly complex (and also inefficient -- you generate and throw away random identifiers when you are not looking at a line starting with >); I would go with just...

(biostrings)writeXStringSet-Error message - 'x' must be an XStringSet object

r,fasta,writetofile

Ask questions about Bioconductor packages on the Bioconductor support site. You have a character vector, but want a DNAStringSet (X=DNA in this case, but could also be AA if this were an amino acid sequence). dna = DNAStringSet(seq) Likely you intend to have names on your sequence, c(foo="AAA", bar="ATCG") or...

How to extract short sequence using window with specific step size?

python,extract,extraction,biopython,fasta

You can use a for loop with range, using the third step parameter for range. This way, it's a bit cleaner than using a while loop. If the data can not be divided by the chunk size, then the last chunk will be smaller. data = "ACCCGATTT" step = 2...
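A sketch of the range-with-step approach using the values quoted above. The excerpt is cut off, so this assumes the chunk size equals the step (non-overlapping chunks); the original answer may have used a separate window size.

```python
data = "ACCCGATTT"
step = 2

# range(start, stop, step) visits 0, 2, 4, ... so each slice begins
# where the previous one ended.
chunks = [data[i:i + step] for i in range(0, len(data), step)]
print(chunks)
```

Slicing past the end of a string is safe in Python, which is why the final chunk simply comes out shorter.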

Using Bio.SeqIO to write single-line FASTA

python,python-2.7,bioinformatics,biopython,fasta

BioPython's SeqIO module uses the FastaIO submodule to read and write in FASTA format. The FastaIO.FastaWriter class can output a different number of characters per line but this part of the interface is not exposed via SeqIO. You would need to use FastaIO directly. So instead of using: from Bio...

regular expression to find certain bases in a sequence

python,fasta

You need to use a character set. re.findall(r"[ATGCUN]", self.fastAsequence) Your code looks for a LITERAL "A,T,G,C,U,N", and outputs all occurrences of that. Character sets in regex allow for a search of the type: "Any of the following: A,T,G,C,U,N" rather than "The following: A,T,G,C,U,N "...
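The difference between the two patterns is easy to see side by side. The sample string below is made up for illustration.

```python
import re

sequence = "ATGXCUN-Z"  # hypothetical input with some invalid characters

# Character set: matches any single one of A, T, G, C, U, N.
valid = re.findall(r"[ATGCUN]", sequence)

# Without brackets the pattern is the literal text "A,T,G,C,U,N".
literal = re.findall(r"A,T,G,C,U,N", sequence)

print(valid)
print(literal)
```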

How to retrieve FASTA sequences according to coordinate information using Python?

python,bioinformatics,regular-language,fasta

In your code, positions is a defaultdict which has as keys the names from the BED file: >>> print positions.keys() ['chr10', 'chr6_apd_hap1'] And records is a dictionary which has as keys the headers of the FASTA file, minus the > at the beginning, but they still include the colon and...

Count GC content of fasta using python without error

python,string,fasta,dna-sequence

Here this should be just about all the code you need. from collections import Counter chrome_list=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, "x", "y"] for i in chrome_list: file_ = open('{}.fa'.format(i), 'r') broken_file = file_.read().split('\n\n')...
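The counting step at the heart of this can be sketched as a small function. `gc_content` is a hypothetical name, and dividing by the A/C/G/T total only (so that N and other symbols are ignored) is an assumption, not something the excerpt states.

```python
from collections import Counter


def gc_content(sequence):
    """Return the fraction of G and C among the A, C, G, T bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        # Avoid ZeroDivisionError on empty or all-N input.
        return 0.0
    return (counts["G"] + counts["C"]) / total


print(gc_content("ATGC"))
```

Guarding against a zero total is one common source of the errors the question title alludes to.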

Extracting the longest sequence from the tab delim file

r,sequence,fasta,longest-substring

Maybe there is a more elegant way... l <-list(ss_23_122_0_1 = "MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS", ss_23_167_0_1 = "WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW", ss_23_167_0_1 = "MAASDASDWEPWERIWERIWER", ss_23_167_0_1 = "QWEKCKLSDOIEOWIOWEUWWEUWEZURZEWURZUWEUZUQZUWZUE", ss_45_201_0_1 = "HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER", ss_45_201_0_1 = "ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE", ss_89_10_0_2 = "NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP") res <- split(l, names(l)) ind...

Convert/transform an abundance (OTU) table/data.frame (to a fasta file) in R

r,fasta

Try this, it goes through the dataframe line by line and concatenates repetitions of sequences : fasta_seq<-apply(df,1,function(x){ p<-x[1] paste(unlist(mapply(function(x,y,z){ if(as.numeric(y)>0) {paste(">",x,"_",(z+1):(z+y),"\n",p,"\n",sep="")} },colnames(df)[-1],as.numeric(x[-1]),c(0,lag(cumsum(as.numeric(x[-1])))[-1]),USE.NAMES=F)),collapse="") }) write(paste(fasta_seq,collapse=""),"your_file.txt") ...

Extract sequences from a FASTA file to multiple files, file based on header_IDs in a separate file

python,regex,biopython,fasta

A couple brief suggestions: If all your headers follow the same pattern, then you can extract the unique elements: record.description.split("_")[1] (yields "2040" from "CAP357_2040_011wpi_v1v3_1_008_00006_001.1") If you use a dict you can assemble collections of records: collected = {} for record in records: descr = record.description.split("_")[1] try: collected[descr].append(record) except KeyError: collected[descr]...
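The try/except KeyError pattern can also be written with collections.defaultdict. A sketch on plain header strings: `group_by_header_field` is an illustrative name, and the second and third headers are invented.

```python
from collections import defaultdict


def group_by_header_field(descriptions, field=1, sep="_"):
    """Group header strings by one separator-delimited field."""
    groups = defaultdict(list)  # missing keys start out as empty lists
    for description in descriptions:
        key = description.split(sep)[field]
        groups[key].append(description)
    return dict(groups)


headers = [
    "CAP357_2040_011wpi_v1v3_1_008_00006_001.1",  # from the answer
    "CAP357_2040_011wpi_v1v3_1_008_00007_001.1",  # hypothetical
    "CAP357_3000_011wpi_v1v3_1_008_00001_001.1",  # hypothetical
]
grouped = group_by_header_field(headers)
print(sorted(grouped))
```

Each key could then be used to name one output file, with its list of records written to that file.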

How to filter out sequences based on a given data using Python?

python,filter,filtering,bioinformatics,fasta

Take a look at Biopython. Here is a solution using that: from Bio import SeqIO input_file = 'a.fasta' merge_file = 'original.fasta' output_file = 'results.fasta' exclude = set() fasta_sequences = SeqIO.parse(open(input_file),'fasta') for fasta in fasta_sequences: exclude.add(fasta.id) fasta_sequences = SeqIO.parse(open(merge_file),'fasta') with open(output_file, 'w') as output_handle: for fasta in fasta_sequences: if fasta.id not...
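The set-based exclusion is the key step, and it can be sketched without file I/O. The function name `exclude_by_id` and the records below are assumptions for illustration.

```python
def exclude_by_id(records, excluded_ids):
    """Return the records whose ID is not in excluded_ids."""
    excluded = set(excluded_ids)  # set membership tests are O(1) on average
    return [(seq_id, seq) for seq_id, seq in records if seq_id not in excluded]


# Hypothetical stand-ins for original.fasta and the IDs found in a.fasta.
original = [("seq1", "ACGT"), ("seq2", "TTGA"), ("seq3", "GGCC")]
to_exclude = ["seq2"]

kept = exclude_by_id(original, to_exclude)
print(kept)
```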

matching and appending a string to headers

awk,sed,fasta

This should do: awk -F">|_" 'NF>2 {$0=$0" |"$2}1' file >uce-101_seqname |uce-101 GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA Set the field separator to > or _. If a line contains more than two fields, recreate the line. Print all lines. If you need to test for uce, then this should do: awk -F">|_" '$2~/^uce/ {$0=$0" |"$2}1' file ...

Bash: how to optimize/parallelize a search through two large files to replace strings?

bash,perl,sed,parallel-processing,fasta

#!/usr/bin/perl use strict; my $file1=shift; my %dic=(); open(F1,$file1) or die("can't find replacement file\n"); while(<F1>){ # slurp File1 to dic if(/(.*)\s*(.*)/){$dic{$2}=$1} } while(<>){ # for all File2 lines s/(?<=>)(.*)/ $dic{$1} || $1/e; # sub ">id" by >dic{id} print } I prefer @cyrus's solution, but if you need to do that often...

Splitting the data.frame into 2 columns

r,split,data.frame,fasta

Try lines <- readLines('deena.fasta') indx <- grepl('>', lines) Sequence <- tapply(seq_along(indx),cumsum(indx), FUN=function(x) paste(lines[tail(x,-1)], collapse="")) d1 <- data.frame(names=lines[indx], Sequence, stringsAsFactors=FALSE) head(d1,2) # names #1 >tm_sd_1256_2_1 #2 >tm_sd_5672_1_2 # Sequence # 1 MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFDASDJASDJ # 2 AIZZTQBCSKLKDSHDADBCMSJHKQUWIRJHJJKKDLJSGDHASGDZGDHGHAGSDZASDASDVASGASDHGCAHGSSADASDA[sample.fasta file][1] ...

How to change the coordinates format according to BED file format using Python?

python,bioinformatics,fasta

Well, you are adding 1 to the index at which you are finding the shorter sequence - start = long_sequence.index(short_sequence) + 1 <--- notice the +1 Don't do that and it should be fine. Also do not do -1 for the stop variable. You should instead add the starting sequence...
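BED intervals use a 0-based start and an end that is not included in the feature, so dropping the +1 and the -1 gives coordinates in that convention. A sketch, with `bed_coordinates` and the sample sequences as illustrative assumptions:

```python
def bed_coordinates(long_sequence, short_sequence):
    """Return BED-style (start, end) of the first match, or None."""
    start = long_sequence.find(short_sequence)  # already 0-based
    if start == -1:
        return None
    end = start + len(short_sequence)  # end is exclusive in BED
    return start, end


# Hypothetical sequences: "CGT" covers 0-based positions 3, 4 and 5.
print(bed_coordinates("AAACGTTT", "CGT"))
```

With this convention `end - start` equals the length of the matched sequence.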

grep, Extracting A Subset Of Sequences from fasta file based on word in id line

perl,grep,extract,fasta

BioPerl is nice for doing such things. This little script will do the job : #!/usr/bin/perl -w use strict; use diagnostics; use warnings; use Bio::SeqIO; my $seqIOin = Bio::SeqIO->new(-format => 'fasta', -file => "<fasta_to_filter.fa"); my $seqIOout = Bio::SeqIO->new(-format => 'fasta', -file => ">selected_sequences.fa"); while (my $seq = $seqIOin->next_seq){ $seqIOout->write_seq($seq) if...

Error vcf-consensus script

bioinformatics,fasta,vcf,consensus,variants

Your VCF only contains chromosome '7' in column 1, but your FASTA header is >gi|157696558|ref|NW_001838997.1| Homo sapiens chromosome 7 genomic scaffold, alternate assembly HuRef SCAF_1103279187418, whole genome shotgun sequence tabix would work if your FASTA header was just: >7 ...

Reading at three different frames

python,bioinformatics,biopython,fasta,dna-sequence

First of all, you cannot assign a value to a name like start+1, start+2 and so on; those are expressions, not valid variable names. Next, as this is related to bioinformatics, you can tag your question as bioinformatics. Also, you are repeating much of the code three times, which is poor practice for a programmer....
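One way to avoid writing the same code three times is to loop over the frame offset. This is a sketch, not the answer's own code: `reading_frames` is a hypothetical name, only the three forward frames are handled, and incomplete trailing codons are dropped.

```python
def reading_frames(sequence):
    """Return a list of three codon lists, one per forward reading frame."""
    frames = []
    for offset in range(3):
        # Stop at len - 2 so that every slice is a complete 3-base codon.
        codons = [sequence[i:i + 3]
                  for i in range(offset, len(sequence) - 2, 3)]
        frames.append(codons)
    return frames


# Hypothetical sequence for illustration.
for offset, codons in enumerate(reading_frames("ATGCCCTAA")):
    print(offset, codons)
```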

how to rename fasta file headers using sed

sed,fasta

You can use only sed with its substitute command, checking if the line begins with > character, group the whole line and append your string at the end, like: sed 's/^\(>.*\)$/\1 Brassica rapa/' infile It yields: >Bra000001 Brassica rapa CTTATTTTCTCCTTCACCACCGTACCACAGAAAAAAACTGTGATTTTAAA AGCCACATTTACTTCTTTTTTTGTTGGGTCTAAATGTTAAAATAACATGT >Bra000002 Brassica rapa TTTATGTAGTACTGGACTAATCGGGTAGGGAAACAATCTTGATTTAGCAA TACAGTGTAATAACTAATAATCATATTCATATTCCATAAATCCAAATGTT ...

Python for loop skips over '>' symbol fasta format

python,for-loop,fasta

I think the reason that you're both getting the IndexError and the lines with > are not causing a break is that you're modifying data while iterating through it with the pop() calls. Another way to iterate through data is to just do it directly: # ... data = data.split()...
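A sketch of iterating directly over the split data without calling pop() inside the loop. The function name `split_headers` and the sample input are assumptions; the excerpt's own code is cut off.

```python
def split_headers(tokens):
    """Separate FASTA header tokens (starting with ">") from sequence tokens."""
    headers, sequences = [], []
    for token in tokens:  # the list is never modified during iteration
        if token.startswith(">"):
            headers.append(token)
        else:
            sequences.append(token)
    return headers, sequences


# Hypothetical input; split() breaks on any whitespace, including newlines.
data = ">seq1\nACGT\n>seq2\nTTGA".split()
headers, sequences = split_headers(data)
print(headers)
print(sequences)
```

Building new lists instead of popping from `data` keeps the loop index and the list contents in step, which is what avoids the IndexError.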

Input FASTA file required after local BLAST database is built?

fasta,blast

I removed the fasta file and the blast still ran fine. I was initially worried about deleting it as the fasta files are big. But I was able to find a small blast database to test it on. Thanks Llopis!