python,python-2.7,python-3.x,biopython,fasta
So I guess "seq_record.seq" is the whole DNA sequence, as in human1 "ATCGCGTC". You can write it like this: from Bio import SeqIO with open("test1_out.txt","w") as f: for seq_record in SeqIO.parse("test1.fasta", "fasta"): for i in range(len(seq_record.seq) - 4) : f.write(str(seq_record.id) + "\n") f.write(str(seq_record.seq[i:i+5]) + "\n") #first 5 base positions ...
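The sliding window in that loop can be sketched on a plain string, without Biopython. The function name `kmers` and the default `k=5` are illustrative assumptions, not part of the original answer.

```python
def kmers(seq, k=5):
    """Return every overlapping window of length k in seq, left to right."""
    # len(seq) - k + 1 windows fit; for k=5 this is the same as range(len(seq) - 4).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    for window in kmers("ATCGCGTC"):
        print(window)
```

A sequence shorter than `k` simply yields an empty list, which matches the behaviour of the original `range(len(seq_record.seq) - 4)` loop.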
This can be done with the following code: split(fastafile[GOI$ID], rep(1:3,each=2)) $`1` $`1`$r1 [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac" $`1`$r2 [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag" $`2` $`2`$r3 [1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca" $`2`$r4 [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg" $`3` $`3`$r5 [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg" $`3`$r6 [1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg" As to why your lapply code is not working: one reason is...
You could use the re.sub function: import re with open('myfile.fasta') as f: with open('outfile.fasta', 'w') as out: for line in f: if line.startswith('>'): out.write(line) else: out.write(re.sub(r'[\[\]]|/.', '', line)) Here /. matches / and also the character following the forward slash. [\[\]] is a character class which matches the [ or ] symbols. | is called the alternation operator or logical...
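To see what the pattern removes, here is a small self-contained sketch that applies it to a single string. The helper name `clean_line` and the sample input are assumptions made for illustration only.

```python
import re

# Same pattern as in the answer: a literal [ or ], or a / plus the one character after it.
BRACKETS_OR_SLASH_PAIR = re.compile(r'[\[\]]|/.')


def clean_line(line):
    """Strip square brackets and any '/x' pair from a sequence line."""
    return BRACKETS_OR_SLASH_PAIR.sub('', line)


if __name__ == "__main__":
    print(clean_line("AC[G/T]A"))
```

For the sample input, the `[`, the `/T` pair, and the `]` are removed, leaving the first alternative inside the brackets.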
In order to validate it using JavaScript, you can use the following function: /* * Validates (true/false) a single fasta sequence string * param fasta the string containing a putative single fasta sequence * returns boolean true if string contains single fasta sequence, false * otherwise */ function validateFasta(fasta) {...
python,python-2.7,bioinformatics,biopython,fasta
For this it is better to use Biopython: from Bio import SeqIO mask = ["1"==_.strip() for _ in open("mask.txt")] seqs = [seq for seq in SeqIO.parse(open("input.fasta"), "fasta")] seqs_filter = [seq for flag, seq in zip(mask, seqs) if flag] for seq in seqs_filter: print seq.format("fasta") You get: >human2 GCTTGCGCTAG >human3 TTCGCTAG Explanation...
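The mask-and-zip idea can be tried without Biopython by treating each record as an `(id, sequence)` tuple. This is only a sketch: the function name `filter_by_mask` and the mask values in the demo are assumptions, chosen so that the second and third records are kept.

```python
def filter_by_mask(records, mask_lines):
    """Keep the records whose matching mask line is exactly '1'."""
    # One boolean per mask line; surrounding whitespace and newlines are ignored.
    keep = [line.strip() == "1" for line in mask_lines]
    # zip pairs the n-th flag with the n-th record and stops at the shorter input.
    return [record for flag, record in zip(keep, records) if flag]


if __name__ == "__main__":
    records = [("human1", "ATCGCGTC"), ("human2", "GCTTGCGCTAG"), ("human3", "TTCGCTAG")]
    for rec_id, seq in filter_by_mask(records, ["0\n", "1\n", "1\n"]):
        print(">" + rec_id)
        print(seq)
```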
To answer your specific questions, you can specify the width of an output field using the * format modifier: $ awk 'BEGIN{printf "%s\n", "foo"}' foo $ awk 'BEGIN{printf "%*s\n", 10, "foo"}' foo and no, there is no join function to put arrays back together into a string (the opposite of...
python,string,list,python-2.7,fasta
I believe this line str_Reading_Frame1=open("Ychromosome.fa", "r").read() is the problem: it reads a huge string into memory at once. The recursion you are doing definitely doesn't help with performance either. As well as creating a stack frame for each recursive call, you are slicing a huge string N times, which should be O(N^2)...
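The answer's own fix is cut off here, so the following is only one possible way to avoid both the single huge read and the recursion: a generator that pulls fixed-size pieces from an open file in a loop. The name `iter_chunks` and the default size are assumptions, and newline and header handling for a real FASTA file is deliberately left out.

```python
import io


def iter_chunks(handle, size=3):
    """Yield successive pieces of at most `size` characters from an open text file."""
    while True:
        chunk = handle.read(size)  # reads only `size` characters, not the whole file
        if not chunk:              # an empty string signals end of file
            break
        yield chunk


if __name__ == "__main__":
    # io.StringIO stands in for a real file opened with open(...).
    for piece in iter_chunks(io.StringIO("ATGCCGT"), 3):
        print(piece)
```

Because the loop is iterative, no stack frames pile up, and each piece is sliced once by `read` instead of re-slicing the remaining string on every step.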
perl,random,add,identifier,fasta
Your program isn't working because the regex ^(\S+)\s+(.*) matches every line in the input file. For instance, \S+ matches CTTCATCGCACATGGATAACTGTGTACCTGACT; the newline at the end of the line matches \s+; and nothing matches .*. Here's how I would encode your solution. It simply appends $current_id to the end of any...
python,numpy,bioinformatics,biopython,fasta
I use Biopython to parse FASTA files: from Bio import SeqIO #change to the list of paths to your fasta files fasta_files = [ "test.fasta", "test2.fasta" ] m_out = {} #store matrix divergence_out = {} #store divergence result for name_file in fasta_files: memory = set() m_out[name_file] = {} divergence_out[name_file] = {} for seq1 in...
python,extract,bioinformatics,extraction,fasta
If you are working with FASTA files, use BioPython; to get n sequences, use random.sample: from Bio import SeqIO from random import sample with open("foo.fasta") as f: seqs = SeqIO.parse(f,"fasta") print(sample(list(seqs), 2)) Output: [SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021',...
Try this with GNU sed to replace first _ by | and first whitespace by |: sed 's/_/|/;s/ /|/' file > new_file or this to edit file: sed -i 's/_/|/;s/ /|/' file ...
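For comparison, the same per-line substitution can be sketched in Python with `str.replace` and a count of 1. The function name and the sample header are hypothetical, and like the sed command this only targets a literal space, not other whitespace.

```python
def first_underscore_and_space_to_pipe(line):
    """Replace the first '_' and the first ' ' in a line with '|'."""
    # The third argument to str.replace limits the number of replacements to one.
    return line.replace("_", "|", 1).replace(" ", "|", 1)


if __name__ == "__main__":
    print(first_underscore_and_space_to_pipe("sp_P12345_HUMAN some description"))
```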
python,python-2.7,bioinformatics,biopython,fasta
Using Biopython: from Bio import SeqIO for long_sequence_record in SeqIO.parse(open('long_sequences.fasta'), 'fasta'): long_sequence = str(long_sequence_record.seq) for short_sequence_record in SeqIO.parse(open('short_sequences.fasta'), 'fasta'): short_sequence = str(short_sequence_record.seq) if short_sequence in long_sequence: start = long_sequence.index(short_sequence) + 1 stop = start + len(short_sequence) - 1 print short_sequence_record.id, start, stop ...
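The coordinate arithmetic in that loop can be checked on plain strings. This sketch assumes the same convention as the snippet (1-based start, inclusive stop); the function name `find_span` is illustrative, and `str.find` is used in place of `index` so a missing match returns `None` rather than raising.

```python
def find_span(long_sequence, short_sequence):
    """Return (start, stop) as 1-based inclusive positions, or None if absent."""
    offset = long_sequence.find(short_sequence)  # 0-based index, -1 when not found
    if offset == -1:
        return None
    start = offset + 1                        # convert to 1-based
    stop = start + len(short_sequence) - 1    # inclusive end position
    return start, stop


if __name__ == "__main__":
    print(find_span("AAACGTAA", "CGT"))
```

Only the first occurrence is reported, as in the original code.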
python,string,python-2.7,fasta
If you need the whole file's content, why not read all lines at once and immediately slice away the first line? with open('path','r') as f: content = f.readlines()[1:] output="".join(content) ...
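A self-contained version of that idea, using `io.StringIO` in place of a real file so it can be run as-is. The function name and the `keep_newlines` switch are assumptions; note that the snippet above keeps the line breaks, and the second mode is an optional variation for when one continuous sequence string is wanted.

```python
import io


def body_without_header(handle, keep_newlines=True):
    """Return everything after the first line of an open text file."""
    lines = handle.readlines()[1:]  # read all lines, then drop the header line
    if keep_newlines:
        return "".join(lines)
    # Variation: strip the trailing newline of each line before joining.
    return "".join(line.rstrip("\n") for line in lines)


if __name__ == "__main__":
    print(body_without_header(io.StringIO(">chrY\nACGT\nTTGA\n"), keep_newlines=False))
```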
command-line,formatting,bioinformatics,fasta
linearize and sort/uniq -c awk '/^>/ {if(N>0) printf("\n"); ++N; printf("%s ",$0);next;} {printf("%s",$0);} END { printf("\n");}' input.fa | \ sort -t ' ' -k2,2 | uniq -f 1 -c |\ awk '{printf("%s_%s\n%s\n",$2,$1,$3);}' >seqID_2_2 AGGGCACGCCTGCCTGGGCGTCACGC >seqID_1_1 CCCGGCCGTCGAGGC >seqID_3_3 CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA ...
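The counting half of that pipeline can be sketched in Python, assuming the records have already been linearized into `(header, sequence)` pairs. The function name and sample data are hypothetical, and the result order follows Python's string sort, which may differ from what `sort` produces under a given locale.

```python
def count_duplicate_sequences(records):
    """Return (header, count, sequence) for each distinct sequence."""
    first_header = {}
    counts = {}
    for header, seq in records:
        # Remember the first header seen for this sequence, like uniq keeps one line.
        first_header.setdefault(seq, header)
        counts[seq] = counts.get(seq, 0) + 1
    return [(first_header[seq], counts[seq], seq) for seq in sorted(counts)]


if __name__ == "__main__":
    demo = [(">seqID_1", "CCC"), (">seqID_2", "AGGGCA"), (">seqID_3", "AGGGCA")]
    for header, count, seq in count_duplicate_sequences(demo):
        # Same layout as the final awk step: header, underscore, count, then the sequence.
        print("%s_%d\n%s" % (header, count, seq))
```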
perl,random,identifier,short,fasta
Your problem can simply be fixed by resetting $string to an empty string just inside the while loop. But this is needlessly complex (and also inefficient -- you generate and throw away random identifiers when you are not looking at a line starting with >); I would go with just...
Ask questions about Bioconductor packages on the Bioconductor support site. You have a character vector, but want a DNAStringSet (X=DNA in this case, but could also be AA if this were an amino acid sequence). dna = DNAStringSet(seq) Likely you intend to have names on your sequence, c(foo="AAA", bar="ATCG") or...
python,extract,extraction,biopython,fasta
You can use a for loop with range, using the third step parameter for range. This way, it's a bit cleaner than using a while loop. If the data can not be divided by the chunk size, then the last chunk will be smaller. data = "ACCCGATTT" step = 2...
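As a minimal sketch of the stepped-range approach described here, using the same sample values for `data` and `step` (the function name `chunks` is an assumption):

```python
def chunks(data, step):
    """Split data into consecutive pieces of length step; the last may be shorter."""
    # range(start, stop, step) visits 0, step, 2*step, ... below len(data).
    return [data[i:i + step] for i in range(0, len(data), step)]


if __name__ == "__main__":
    print(chunks("ACCCGATTT", 2))
```

Slicing past the end of a string is safe in Python, which is why the final, shorter chunk needs no special case.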
python,python-2.7,bioinformatics,biopython,fasta
BioPython's SeqIO module uses the FastaIO submodule to read and write in FASTA format. The FastaIO.FastaWriter class can output a different number of characters per line but this part of the interface is not exposed via SeqIO. You would need to use FastaIO directly. So instead of using: from Bio...
You need to use a character set. re.findall(r"[ATGCUN]", self.fastAsequence) Your code looks for a LITERAL "A,T,G,C,U,N", and outputs all occurrences of that. Character sets in regex allow for a search of the type: "Any of the following: A,T,G,C,U,N" rather than "The following: A,T,G,C,U,N "...
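The difference between the two patterns is easy to see side by side. In this sketch the function names and the sample string are illustrative only.

```python
import re


def find_with_character_set(sequence):
    """Match any single character that is one of A, T, G, C, U or N."""
    return re.findall(r"[ATGCUN]", sequence)


def find_with_literal(sequence):
    """Match only the exact text 'A,T,G,C,U,N', commas included."""
    return re.findall(r"A,T,G,C,U,N", sequence)


if __name__ == "__main__":
    print(find_with_character_set("ATXGN"))
    print(find_with_literal("ATXGN"))
```

The character set returns one match per valid base and skips the `X`, while the literal pattern finds nothing unless the comma-separated text itself appears in the input.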
python,bioinformatics,regular-language,fasta
In your code, positions is a defaultdict which has as keys the names from the BED file: >>> print positions.keys() ['chr10', 'chr6_apd_hap1'] And records is a dictionary which has as keys the headers of the FASTA file, minus the > at the beginning, but they still include the colon and...
python,string,fasta,dna-sequence
This should be just about all the code you need: from collections import Counter chrome_list=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, "x", "y"] for i in chrome_list: file_ = open('{}.fa'.format(i), 'r') broken_file = file_.read().split('\n\n')...
r,sequence,fasta,longest-substring
Maybe there is a more elegant way... l <-list(ss_23_122_0_1 = "MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS", ss_23_167_0_1 = "WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW", ss_23_167_0_1 = "MAASDASDWEPWERIWERIWER", ss_23_167_0_1 = "QWEKCKLSDOIEOWIOWEUWWEUWEZURZEWURZUWEUZUQZUWZUE", ss_45_201_0_1 = "HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER", ss_45_201_0_1 = "ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE", ss_89_10_0_2 = "NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP") res <- split(l, names(l)) ind...
Try this, it goes through the dataframe line by line and concatenates repetitions of sequences : fasta_seq<-apply(df,1,function(x){ p<-x[1] paste(unlist(mapply(function(x,y,z){ if(as.numeric(y)>0) {paste(">",x,"_",(z+1):(z+y),"\n",p,"\n",sep="")} },colnames(df)[-1],as.numeric(x[-1]),c(0,lag(cumsum(as.numeric(x[-1])))[-1]),USE.NAMES=F)),collapse="") }) write(paste(fasta_seq,collapse=""),"your_file.txt") ...
A couple brief suggestions: If all your headers follow the same pattern, then you can extract the unique elements: record.description.split("_")[1] (yields "2040" from "CAP357_2040_011wpi_v1v3_1_008_00006_001.1") If you use a dict you can assemble collections of records: collected = {} for record in records: descr = record.description.split("_")[1] try: collected[descr].append(record) except KeyError: collected[descr]...
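The grouping step can be sketched on plain description strings, with `collections.defaultdict` standing in for the try/except pattern. The function name is an assumption, and every description in the demo other than the first is made up for illustration.

```python
from collections import defaultdict


def group_by_field(descriptions, index=1, sep="_"):
    """Group description strings by one of their separator-delimited fields."""
    groups = defaultdict(list)  # missing keys start as an empty list
    for descr in descriptions:
        groups[descr.split(sep)[index]].append(descr)
    return dict(groups)


if __name__ == "__main__":
    demo = [
        "CAP357_2040_011wpi_v1v3_1_008_00006_001.1",
        "CAP357_2040_hypothetical_record",
        "CAP357_3000_hypothetical_record",
    ]
    for key, members in group_by_field(demo).items():
        print(key, len(members))
```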
python,filter,filtering,bioinformatics,fasta
Take a look at BioPython. Here is a solution using that: from Bio import SeqIO input_file = 'a.fasta' merge_file = 'original.fasta' output_file = 'results.fasta' exclude = set() fasta_sequences = SeqIO.parse(open(input_file),'fasta') for fasta in fasta_sequences: exclude.add(fasta.id) fasta_sequences = SeqIO.parse(open(merge_file),'fasta') with open(output_file, 'w') as output_handle: for fasta in fasta_sequences: if fasta.id not...
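The core of this approach is a set of IDs to exclude plus a membership test. Here is a sketch on `(id, sequence)` tuples rather than Biopython records; the function name and the sample IDs are assumptions.

```python
def exclude_ids(records, excluded_records):
    """Return the records whose id does not appear in excluded_records."""
    # A set gives constant-time membership checks, which matters for large files.
    exclude = {rec_id for rec_id, _ in excluded_records}
    return [(rec_id, seq) for rec_id, seq in records if rec_id not in exclude]


if __name__ == "__main__":
    original = [("seq1", "ACGT"), ("seq2", "TTGA"), ("seq3", "GGCC")]
    to_remove = [("seq2", "TTGA")]
    print(exclude_ids(original, to_remove))
```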
This should do: awk -F">|_" 'NF>2 {$0=$0" |"$2}1' file >uce-101_seqname |uce-101 GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA Set the field separator to > or _. If the line contains more than two fields, recreate the line. Print all lines. If you need to test for uce, then this should do: awk -F">|_" '$2~/^uce/ {$0=$0" |"$2}1' file ...
bash,perl,sed,parallel-processing,fasta
#!/usr/bin/perl use strict; my $file1=shift; my %dic=(); open(F1,$file1) or die("can't find replacement file\n"); while(<F1>){ # slurp File1 to dic if(/(\S+)\s+(.*)/){$dic{$2}=$1} } while(<>){ # for all File2 lines s/(?<=>)(.*)/ $dic{$1} || $1/e; # sub ">id" by >dic{id} print } I prefer @cyrus's solution, but if you need to do that often...
Try lines <- readLines('deena.fasta') indx <- grepl('>', lines) Sequence <- tapply(seq_along(indx),cumsum(indx), FUN=function(x) paste(lines[tail(x,-1)], collapse="")) d1 <- data.frame(names=lines[indx], Sequence, stringsAsFactors=FALSE) head(d1,2) # names #1 >tm_sd_1256_2_1 #2 >tm_sd_5672_1_2 # Sequence # 1 MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFDASDJASDJ # 2 AIZZTQBCSKLKDSHDADBCMSJHKQUWIRJHJJKKDLJSGDHASGDZGDHGHAGSDZASDASDVASGASDHGCAHGSSADASDA ...
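The same header-then-concatenate logic can be sketched in Python for readers who are not working in R. This is not a translation of the R code line by line; the function name and the short sample sequences are assumptions, and headers are kept with their leading `>` as in the data frame above.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            records.append([line, ""])   # start a new record at each header
        elif records and line:
            records[-1][1] += line       # append wrapped sequence lines to the current record
    return [(header, seq) for header, seq in records]


if __name__ == "__main__":
    demo = [">tm_sd_1256_2_1\n", "MJAK\n", "DHRZ\n", ">tm_sd_5672_1_2\n", "AIZZ\n"]
    for header, seq in parse_fasta(demo):
        print(header, seq)
```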
Well, you are adding 1 to the index at which you find the shorter sequence - start = long_sequence.index(short_sequence) + 1 <--- notice the +1 Don't do that and it should be fine. Also, do not subtract 1 for the stop variable. You should instead add the starting sequence...
BioPerl is nice for doing such things. This little script will do the job : #!/usr/bin/perl -w use strict; use diagnostics; use warnings; use Bio::SeqIO; my $seqIOin = Bio::SeqIO->new(-format => 'fasta', -file => "<fasta_to_filter.fa"); my $seqIOout = Bio::SeqIO->new(-format => 'fasta', -file => ">selected_sequences.fa"); while (my $seq = $seqIOin->next_seq){ $seqIOout->write_seq($seq) if...
bioinformatics,fasta,vcf,consensus,variants
Your VCF only contains the chromosome '7' in column 1, but your FASTA header is >gi|157696558|ref|NW_001838997.1| Homo sapiens chromosome 7 genomic scaffold, alternate assembly HuRef SCAF_1103279187418, whole genome shotgun sequence tabix would work if your FASTA header were just: >7 ...
python,bioinformatics,biopython,fasta,dna-sequence
First of all, you cannot use names like start+1, start+2 and so on as variables and assign values to them. Next, as this is related to bioinformatics, you can tag your question as bioinformatics. Also, you are repeating much of the same code three times, which is poor practice....
You can use sed alone with its substitute command: check whether the line begins with a > character, group the whole line, and append your string at the end, like: sed 's/^\(>.*\)$/\1 Brassica rapa/' infile It yields: >Bra000001 Brassica rapa CTTATTTTCTCCTTCACCACCGTACCACAGAAAAAAACTGTGATTTTAAA AGCCACATTTACTTCTTTTTTTGTTGGGTCTAAATGTTAAAATAACATGT >Bra000002 Brassica rapa TTTATGTAGTACTGGACTAATCGGGTAGGGAAACAATCTTGATTTAGCAA TACAGTGTAATAACTAATAATCATATTCATATTCCATAAATCCAAATGTT ...
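The same substitution can be sketched in Python with `re.sub` in multiline mode, which may help when the step is part of a larger script. The function name, the `suffix` parameter, and the shortened sample sequences are assumptions.

```python
import re


def annotate_headers(text, suffix=" Brassica rapa"):
    """Append suffix to every line that starts with '>'."""
    # re.MULTILINE makes ^ and $ match at the start and end of each line.
    # A function is used as the replacement so the suffix is never parsed for backslash escapes.
    return re.sub(r"^(>.*)$", lambda m: m.group(1) + suffix, text, flags=re.MULTILINE)


if __name__ == "__main__":
    print(annotate_headers(">Bra000001\nCTTATT\n>Bra000002\nTTTATG\n"), end="")
```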
I think the reason that you're both getting the IndexError and the lines with > are not causing a break is that you're modifying data while iterating through it with the pop() calls. Another way to iterate through data is to just do it directly: # ... data = data.split()...
I removed the fasta file and the blast still ran fine. I was initially worried about deleting it as the fasta files are big. But I was able to find a small blast database to test it on. Thanks Llopis!