The rle function (run length encoding) can be combined with the diff function to identify the locations where the differences are 1 and have length greater than or equal to 2. The diff-operation actually shifts the names by 1 so there's no need to add 1 to the names()-result. That...
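As a rough sketch of the rle/diff idea, assuming a made-up vector x (the original data is not shown here): runs of TRUE in diff(x) == 1 mark stretches of consecutive values, and the run lengths tell you which stretches are long enough.

```r
# Hypothetical input; not taken from the original question.
x <- c(1, 2, 3, 7, 9, 10, 15)

# TRUE where an element is exactly 1 larger than its predecessor,
# then run-length encode those TRUE/FALSE values.
r <- rle(diff(x) == 1)

# Start and end positions of each run, in terms of the diff() vector.
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1L

# Keep only runs of TRUE that span at least 2 differences.
keep <- r$values & r$lengths >= 2

# A run of differences from position s to e covers x[s] through x[e + 1].
runs <- Map(function(s, e) x[s:(e + 1L)], starts[keep], ends[keep])
runs
```

Here the only qualifying run is 1, 2, 3; the pair 9, 10 is dropped because it spans a single difference.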

You were close. Try this: mean(iris$Sepal.Length[which(iris$Species != 'setosa')]) or mean(iris$Sepal.Length[iris$Species != 'setosa']) or mean(iris[iris$Species != "setosa", "Sepal.Length"]) ...

You can use grep to find those rows whose names start with "BN". Using x for the object instead of df (df is a function in R): x[grep("^BN", row.names(x)),] ## c1 c2 c3 ## BN4 9 1 2 ## BN8 4 6 8 ...

lapply(your_list, function(x) split(x, ceiling(seq_along(x)/10))) ...
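To see what this one-liner does, here is a small sketch with a hypothetical your_list (the real list is not shown): each vector is cut into pieces of at most 10 elements.

```r
# Hypothetical list of vectors standing in for your_list.
your_list <- list(a = 1:25, b = 1:7)

# ceiling(seq_along(x) / 10) labels positions 1-10 as 1, 11-20 as 2, and so on;
# split() then groups the elements of x by that label.
chunks <- lapply(your_list, function(x) split(x, ceiling(seq_along(x) / 10)))

# Sizes of the pieces of the first vector: 10, 10 and 5.
lengths(chunks$a)
```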

You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]] as our regular expression. This will find all the strings with punctuation symbols in the column names. d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3)) names(d) <- c("FullColName1", "FullColName2", "FullColName3", "PartString1()","PartString2()") d[grepl("[[:punct:]]", names(d))]...

You can use the dcast function from reshape2: library(reshape2) df2 <- dcast(df1, Site_code ~ Species_code, fill = 0) df2 # Site_code AMP MRN TFP XNP # 1 0 50 100 0 # 2 15 5 0 20 Short and simple. You could also use reshape from the stats package, which...

This will do it: colnames(x)[c(1:4,6:8)] #[1] "col1" "col2" "col3" "col4" "col6" "col7" "col8" ...

Here is an option using "data.table". Convert the dataset ("df") to "data.table" using setDT. Set the "year" column as "key" (setkey(..)). Subset the rows that have "2004/2005" in the "year" column (J(c(2004,..))), and select the first two columns (1:2). library(data.table) # data.table_1.9.5 DT1 <- setkey(setDT(df),year)[J(c(2004,2005)), 1:2, with=FALSE] DT1 # idnr...

r,subset,dataframes,lm,subsetting

With your fit list, you can extract the coefficients and r-squared values with fit<-apply(aa,2,function(x) lm(x~aa$Tiempo)) mysummary <- t(sapply(fit, function(x) { ss<-summary(x); c(coef(x), r.square=ss$r.squared, adj.r.squared=ss$adj.r.squared) })) We use sapply to go over the list you created and extract the coefficients from the model and the r-squared values from the summary. The...

You could try res <- aggregate(cbind(P.value=Length) ~ Fish + DATE, Ehen, FUN = function(x) shapiro.test(x)$p.value) head(res,3) # Fish DATE P.value #1 C 2012-09-19 0.25510132 ##### #2 S 2012-09-19 0.11941675 #3 C 2012-09-20 0.04459457 shapiro.test(Ehen$Length[Ehen$DATE=='2012-09-19' & Ehen$Fish=='C']) # Shapiro-Wilk normality test #data: Ehen$Length[Ehen$DATE == "2012-09-19" & Ehen$Fish == "C"] # W...

r,loops,apply,subsetting,multiple-conditions

You can try to use a self-defined function in aggregate sum1sttwo<-function (x){ return(x[1]+x[2]) } aggregate(count~id+group, data=df,sum1sttwo) and the output is: id group count 1 2 A 14 2 8 A 11 3 10 B 12 4 11 B 11 5 16 C 8 6 18 C 7 04/2015 edit: dplyr...

r,ggplot2,stata,graphing,subsetting

Very nicely done question. If you are still interested in a base solution: transactionID <- c(1, 2, 3, 4) date <- as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05")) cost <- as.integer(c(1208, 23820, 402, 89943)) company <- c("ACo", "BInc", "CInd", "DOp") thedata <- data.frame(transactionID, date, cost, company) par(mar = c(5,7,3,2), tcl = .2, las...

One way is m1 <- apply(ar1, 2, `[`) m1[m1[,2]%in% 1:2 & m1[,3]==3 & m1[,4]==4,] # [,1] [,2] [,3] [,4] #[1,] 1 1 3 4 #[2,] 1 1 3 4 #[3,] 1 2 3 4 #[4,] 1 1 3 4 #[5,] 1 1 3 4 #[6,] 1 2 3 4 Or...

I think you intended to have this: d[,2:length(d)] ...

As mentioned in the comments, the data.frame requested is rather big to fit in memory of a reasonable desktop machine, and perhaps R is not the tool for this job. In any case, for a data.frame 1000 times smaller than requested, here is one way to do it. First simulate...

This kind of magic is called matrix indexing. If you had rows and cols be the ones you didn't want, or if matrix indexing allowed for negative values, it would be even easier. y <- matrix(0, nrow=5, ncol=2) y[cbind(rows,cols)] <- x[cbind(rows,cols)] y ## [,1] [,2] ## [1,] 65 345 ##...
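A minimal sketch of matrix indexing, assuming made-up values for x, rows and cols (the originals are truncated above): a two-column matrix used as an index picks one cell per row of that matrix.

```r
# Hypothetical data; the x, rows and cols from the question are not shown in full.
x <- matrix(1:10, nrow = 5, ncol = 2)
rows <- c(1, 3)
cols <- c(1, 2)

# Each row of idx is one (row, column) pair: (1, 1) and (3, 2).
idx <- cbind(rows, cols)

y <- matrix(0, nrow = 5, ncol = 2)
y[idx] <- x[idx]   # copy only those two cells; every other cell stays 0
y
```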

Macro approach: I just provided a rough code. Optimize it according to your requirements. proc sql; select count(*) into :total from source_data; quit; %macro create_subsets(count,ds); %let cnt=%sysfunc(ceil(%sysevalf(&count/8))); %let num=1; %do i=1 %to &cnt; %if(&i = &cnt) %then %do; %let toread=&count; %end; %else %do; %let toread=&num+7; %end; data &ds._&i; set &ds(firstobs=&num...

Here, split is your friend. splitbycol <- function(df, colname) { split(df, df[[colname]]) } splitbycol(df, "Flag") ## $`0` ## Id Flag value1 value2 ## 3 125 0 19 8.4 ## 5 127 0 17 6.5 ## ## $`1` ## Id Flag value1 value2 ## 1 123 1 10 3.4 ## 2...

r,data.table,subset,subsetting

set.seed(400) library(data.table) DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT 1) DT[J("E"), .I] is a data table in which .I is a variable representing the row numbers of E in the original dataset DT # x .I # 1: E 18 # 2: E 19 # 3: E...

This is what worked. prism.dd <- prism.d[(prism.d$fips %in% fips) ,] ...

r,date,for-loop,vector,subsetting

Using ave : transform(dat,Sum.A1.Daily=ave(dat$Axis1,dat$Date,FUN=sum)) Date Time Axis1 Day Sum.A1.Daily 1 6/12/10 5:00:00 20 1 110 2 6/12/10 5:01:00 40 1 110 3 6/12/10 5:02:00 50 1 110 4 6/13/10 5:03:00 10 2 60 5 6/13/10 5:04:00 20 2 60 6 6/13/10 5:05:00 30 2 60 ...

When working with data.table, you need to use one of the following to choose a column by number: df2[, 2, with=FALSE] df2[, .SD, .SDcols=2] These still return a data.table, not a vector. As with any list, you can also use the following to return a vector: df2[[2]] ...

You can use relist: relist(seq_along(unlist(rec_list)), skeleton = rec_list) # [[1]] # [[1]][[1]] # [1] 1 2 3 4 5 # # [[1]][[2]] # [1] 6 # # # [[2]] # [[2]][[1]] # [1] 7 8 9 10 # # [[2]][[2]] # [1] 11 12 13 14 15 16 ...

matlab,image-processing,matrix,mean,subsetting

Method 1 Using mat2cell and cellfun AC = mat2cell(A, repmat(2,size(A,1)/2,1), repmat(2,size(A,2)/2,1)); out = cellfun(@(x) mean(x(:)), AC); Method 2 using im2col out = reshape(mean(im2col(A,[2 2],'distinct')),size(A)./2); Method 3 Using simple for loop out(size(A,1)/2,size(A,2)/2) = 0; k = 1; for i = 1:2:size(A,1) l = 1; for j = 1:2:size(A,2) out(k,l) = mean(mean(A(i:i+1,j:j+1)));...

r,variables,paste,ls,subsetting

I really don't understand why you're trying to use paste() here. This function is typically only for concatenating character values. Method 2 returns an error because paste(data[c(1, 4:7)], sep="") returns c("1", "4", "5", "6", "7") which is a character vector and when you index with a character vector R looks...

r,count,subsetting,memory-efficient

Assuming your data.frame is called df, using data.table: library(data.table) setDT(df)[ , .(Freq = .N), by = .(id, value)] using dplyr: library(dplyr) group_by(df, id, value) %>% summarise(Freq = n()) You should choose one of those two packages (dplyr or data.table) and learn it really thoroughly. In the long run you will...

How about just using which.max since that would select just one value (i.e. the index of the first occurrence of the maximum value): as.data.frame(lapply(df, function(x) x[-which.max(x)])) # t1 t2 t3 # 1 1 7 1 # 2 2 3 1 # 3 3 1 1 ...

You can split the data.frame using split(), and as long as your IDs are of the format letters-then-numbers you can strip off the trailing numbers using gsub, as in: stringsPart <-gsub('[0-9]*$','',myData$Sample.ID) listOfSubDataFrames <- split(myData,stringsPart) By the way, the regular expression matches zero or more (*) digits ([0-9]) that appear at...
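A small sketch of the gsub-then-split approach, assuming a hypothetical myData (only the column name Sample.ID comes from the answer):

```r
# Hypothetical data frame with letters-then-numbers IDs.
myData <- data.frame(Sample.ID = c("AB1", "AB2", "CD10", "CD11", "EF3"),
                     value = 1:5)

# Remove the trailing digits, leaving only the letter prefix.
stringsPart <- gsub("[0-9]*$", "", myData$Sample.ID)

# One sub-data-frame per prefix.
listOfSubDataFrames <- split(myData, stringsPart)
names(listOfSubDataFrames)
```

The resulting list has one element each for "AB", "CD" and "EF".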

Try indx <- grep('line', names(DG)) DG[indx[as.numeric(sub('.*_', '', names(DG)[indx])) %in% dup$Number]] # line_59 #1 0 ...

r,logic,aggregate,subset,subsetting

The problem is that == is alternating between the values of "honda" and "harley" and comparing with the value in the relevant position of your "manufacturer" variable. On the other hand, %in% (as suggested by MrFlick) and | are checking across the entire "manufacturer" variable before deciding which values to...
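The difference is easy to see on a toy vector. This is only a sketch, assuming a made-up manufacturer vector rather than the data from the question:

```r
# Hypothetical data (length 6 so that recycling happens without a warning).
manufacturer <- c("honda", "harley", "harley", "honda", "bmw", "honda")

# == recycles c("honda", "harley") along the vector: position 1 is compared
# with "honda", position 2 with "harley", position 3 with "honda", and so on.
by_equals <- manufacturer == c("honda", "harley")

# %in% asks, for every element, whether it appears anywhere in the set.
by_in <- manufacturer %in% c("honda", "harley")

# | combines two full-length comparisons and agrees with %in%.
by_or <- manufacturer == "honda" | manufacturer == "harley"

rbind(by_equals, by_in, by_or)
```

Only the first two positions survive the == version, while %in% and | keep every "honda" and "harley".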

You can use idxmax(), which returns the indices of the max values: maxes = df2.groupby('B')['C'].idxmax() df2.loc[maxes] Output: Out[11]: A B C D 0 test1 one 311 9 3 test4 two 41 6 ...

javascript,function,arguments,subsetting

JavaScript passes arrays and objects to functions by passing a copy of the reference to the array/object, so it is likely to be fine to pass the whole thing. Just know that if you mutate it in the function, those changes will affect your original array/object! If this is browser...

Here is my solution, using the grep function: data <- Country[-grep("LUX|MLT", Country)] ...

From your error output in your previous question here, you have " Friday", " Monday" as inputs. The leading spaces are being stripped as people here try to reproduce the problem; you need to use dput(Unt) rather than pasting the data so things like this don't happen. I guess your saturday and sunday columns...

df$reg_country is a factor variable, which contains the information of all possible levels in the levels attribute. Check levels(df_subset$reg_country). Factor levels only have a significant impact on data size if you have a huge number of them. I wouldn't expect that to be the case. However, you could use droplevels(df_subset$reg_country)...
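A short sketch of the behaviour described, assuming a hypothetical df (only the column name reg_country comes from the answer): subsetting keeps all the original levels until you drop the unused ones.

```r
# Hypothetical data.
df <- data.frame(reg_country = factor(c("DE", "FR", "US", "DE")), n = 1:4)
df_subset <- df[df$reg_country != "US", ]

# "US" no longer occurs in the rows, but it is still listed as a level.
before <- levels(df_subset$reg_country)

# droplevels() removes levels that no longer occur in the data.
df_subset$reg_country <- droplevels(df_subset$reg_country)
after <- levels(df_subset$reg_country)

before
after
```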

Try this: Subs1<-subset(DATA, (!is.na(DATA[,2])) & (!is.na(DATA[,3]))) The second parameter of subset is a logical vector of length nrow(DATA), indicating whether to keep the corresponding row....
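For illustration, here is the same filter on a hypothetical DATA with missing values in columns 2 and 3. The complete.cases() comparison is an addition of this sketch, not part of the answer.

```r
# Hypothetical data with some NAs in columns 2 and 3.
DATA <- data.frame(id = 1:4, a = c(1, NA, 3, 4), b = c(NA, 2, 3, 4))

# One TRUE/FALSE per row: TRUE only where both columns are present.
keep <- !is.na(DATA[, 2]) & !is.na(DATA[, 3])

Subs1 <- subset(DATA, keep)
Subs1

# complete.cases() on the same two columns builds the same logical vector.
identical(keep, complete.cases(DATA[, 2:3]))
```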

We can accomplish this using data.table as follows: setkeyv(setDT(D1), "k") cols = c("D", "E", "A") D1[D2, (cols) := D2[, cols]] setDT() converts a data.frame to data.table by reference (without actually copying the data). We want D1 to be a data.table. setkey() sorts the data.table by the column specified (here k)...

python,pandas,dataframes,subsetting

You could use X['var2'].iloc[[0,1]]: In [280]: X['var2'].iloc[[0,1]] Out[280]: 0 NaN 4 9 Name: var2, dtype: float64 Since X['var2'] is a view of X, X['var2'].iloc[[0,1]] is safe for both access and assignments. But be careful if you use this "chained indexing" pattern (such as the index-by-column-then-index-by-iloc pattern used here) for assignments,...

r,dataframes,subset,subsetting

How about NameList <- c("Fire","Earth") idx <- match(NameList, names(MAINDF)) idx <- sort(c(idx-1, idx)) NewDF <- MAINDF[,idx] Here we use match() to find the index of the desired column, and then we can use index subtraction to grab the column before it...
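A worked sketch, assuming a made-up MAINDF whose column names other than "Fire" and "Earth" are invented for illustration:

```r
# Hypothetical data frame; each target column has a neighbour to its left.
MAINDF <- data.frame(a = 1, Fire = 2, b = 3, Earth = 4, c = 5)

NameList <- c("Fire", "Earth")
idx <- match(NameList, names(MAINDF))   # positions of the named columns: 2 and 4
idx <- sort(c(idx - 1, idx))            # add the column just before each one
NewDF <- MAINDF[, idx]
names(NewDF)
```

Note that this assumes none of the matched columns is the first column; idx - 1 would otherwise produce a 0 index, which R silently drops.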

The easiest answer is m3<-m2 m3[lower.tri(m3)] <- NA Or you can do m3<-`[<-`(m2, lower.tri(m2), NA) which looks very awkward but it works. You're basically intercepting the value from `[<-` that would normally replace the variable in the assignment and sending it to a new variable. And your strategy of m3...
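A quick check that the two forms agree, using a hypothetical 3 x 3 matrix for m2:

```r
# Hypothetical input matrix.
m2 <- matrix(1:9, nrow = 3)

# Two-step version: copy, then blank out the cells below the diagonal.
m3 <- m2
m3[lower.tri(m3)] <- NA

# One-step version: call the replacement function `[<-` directly and
# assign its return value to a new name.
m3_alt <- `[<-`(m2, lower.tri(m2), NA)

identical(m3, m3_alt)
m3
```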

You don't want ifelse. You want if and else. ifelse is used if you have a condition vector. You only have a single condition value. myfunction = function(MyCondition="low"){ # special criteria lowRandomNumbers=c(58,61,64,69,73) highRandomNumbers=c(78,82,83,87,90) # subset the data based on MyCondition mydata <- if(MyCondition=="low") mydata[mydata$mydata %in% lowRandomNumbers==TRUE,] else mydata mydata <-...
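The difference shows up clearly with a single condition. A minimal sketch, reusing the lowRandomNumbers values from the answer; the other names are illustrative:

```r
lowRandomNumbers <- c(58, 61, 64, 69, 73)
MyCondition <- "low"

# ifelse() returns a result with the same length as its test (here length 1),
# so only the first element of the vector comes back.
from_ifelse <- ifelse(MyCondition == "low", lowRandomNumbers, NA)

# if/else evaluates one branch and returns that branch's whole value.
from_if <- if (MyCondition == "low") lowRandomNumbers else NA

from_ifelse
from_if
```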

Here's how I would do it. Use lubridate to create a month variable to group by in dplyr and then get means. library(lubridate) library(dplyr) df %>% group_by(month = month(df$V1)) %>% summarize(mean = mean(velocity)) month mean 1 4 95.36530 2 5 95.53038 3 6 95.35810 4 7 95.24634 5 8 95.81465...

You can simply use the apply function as follows: ans <- compositions(9, m = 2) ans[,apply(ans,2,function(x) all(x<=6))] with output: [,1] [,2] [,3] [,4] [1,] 6 5 4 3 [2,] 3 4 5 6 Alternatively you can use dplyr which would look cleaner and might be more intuitive: library(dplyr) expand.grid(d1=1:6, d2=1:6)...

Maybe this: ind <- Reduce(`|`,lapply(data,function(x) x %in% c(-99,-999))) > data[!ind,] a b 1 100 23 3 322 25 4 155 25 ...

If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast: library(dplyr) # For each indivID, calculate percent of days in each month with...

You can use split for this. > dat ## Group Element Value Note ## 1 1 AAA 11 Good ## 2 1 ABA 12 Good ## 3 1 AVA 13 Good ## 4 2 CBA 14 Good ## 5 2 FDA 14 Good ## 6 3 JHA 16 Good ##...

python,numpy,multidimensional-array,subsetting

You've gotten a handful of nice examples of how to do what you want. However, it's also useful to understand what's happening and why things work the way they do. There are a few simple rules that will help you in the future. There's a big difference between "fancy"...

Just pasting from the comments so that the question remains answered. You can use split (suggested by @docendo discimus) and difftime(from @Laurik) to get the expected dataset. Assuming that "time1" is the "time" column in your dataset ("dat"), convert "time1" to "POSIXlt" class using strptime, use difftime to get the...

What you're describing is more a problem of finding all subsequences. You can't really do that by two nested for loops. It's easier to solve recursively as follows: // Find all subsets public List<String> helper(String input) { if (input.isEmpty()) return Collections.singletonList(""); List<String> result = new ArrayList<>(); List<String> subResult = helper(input.substring(1));...

This can be done with the following code: split(fastafile[GOI$ID], rep(1:3,each=2)) $`1` $`1`$r1 [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac" $`1`$r2 [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag" $`2` $`2`$r3 [1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca" $`2`$r4 [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg" $`3` $`3`$r5 [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg" $`3`$r6 [1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg" As to why your lapply code is not working. One reason is...

Try library(dplyr) df1 %>% group_by(ID) %>% filter(Point_A < median(Point_A) - 7.5, Point_B < median(Point_B) - 7.5) Or, as per @Frank suggestion in the comments: mycond <- function(x) x < median(x) - 7.5 df1 %>% group_by(ID) %>% filter(mycond(Point_A), mycond(Point_B)) Which gives: #Source: local data frame [2 x 3] #Groups: ID #...

You can split the 'df2' and use lapply Filter(Negate(is.null), lapply(split(df2, df2$type), function(x) { x1 <- subset(df, number==x$number) if(nrow(x1)>0) { transform(x1, grade=x$type[1]) } })) ...

r,list,vectorization,subsetting

You mean something like this? x <- lapply(mon, `[[`, 'x') str(x) # List of 1428 # $ N1402: Time-Series [1:50] from 1990 to 1994: 2640 2640 2160 4200 3360 2400 3600 1920 4200 4560 ... # $ N1403: Time-Series [1:50] from 1990 to 1994: 1680 1920 120 1080 840 1440...

Assuming your Tree looks kind of like this Tree<-list( "Error", "<p>Hello</p>", "<h1>Heading</h1>", "Error", "<strong>Bold</strong>" ) then this should work: Tree[Tree != "Error"] ...

r,loops,for-loop,max,subsetting

This is a straightforward grouped calculation. The difficult part has already been done for us. We can use aggregate. aggregate(. ~ Day, per1, max) # Day Stat1 Stat2 Stat3 # 1 10 2.12 1.92 2.11 # 2 11 1.90 1.87 1.93 ...

I will assume you want to work on the first half of your matrix and replace only the values which suit a certain criterion. In your case > 0. set.seed(357) alpha <- matrix(0,10) beta <- matrix(rnorm(5),5) beta [,1] [1,] -1.2411173 [2,] -0.5832050 [3,] 0.3947471 [4,] 1.5042111 [5,] 0.7667997 Only the...

r,data.table,vectorization,subsetting

In data.table: dt[, prob2 := pnorm(val, mean(val), sd(val), lower.tail=FALSE), by=diag] Seems to match what you want: head(dt) # med diag val prob prob2 #1: p E 91 0.04713131 0.04713131 #2: f E 3 0.92991675 0.92991675 #3: o B 26 0.83792988 0.83792988 #4: t C 38 0.70877125 0.70877125 #5: g E...

You can try which.min with(df, X[which.min(Y)]) #[1] "A" Or suppose you have duplicate minimum values in Y, you can use ==. For example df$Y[3] <- 1 with(df,X[ Y == min(Y)]) #[1] "A" "C" which.min returns only the index of first minimum value with(df, X[which.min(Y)]) #[1] "A" data df <- structure(list(X...

python,list,filtering,subset,subsetting

You need to implement the __hash__ and __eq__ methods on the class: class A: def __init__(self, a): self.attr1 = a def __hash__(self): return hash(self.attr1) def __eq__(self, other): return self.attr1 == other.attr1 def __repr__(self): return str(self.attr1) Example: l = [A(5), A(4), A(4)] print(list(set(l))) print(list(set(l))[0].__class__) # ==> __main__.A. It's an object of class...

library(dplyr) df %>% group_by(Id) %>% filter("Yes"%in% Status & "No" %in% Status) ...

The index argument in subsetting is allowed to be "missing" (see ?"["): ff1 = function(x, i) x[i] ff2 = function(x, i = TRUE) x[i] ff3 = function(x, i = seq_along(x)) x[i] ff4 = function(x, i = substitute()) x[i] a = sample(10) a # [1] 3 8 2 6 9 7...

r,sorting,data-mining,stata,subsetting

I tried to answer your questions at the end. First, an example data frame to play around with: set.seed(123) df <- data.frame(id=c(paste0(letters[1:10], 1:10)), matrix(sample(1:20, 500, replace=T), nrow=100, ncol=5)) colnames(df)[2:6] <- paste0("var", 1:5) 1. Count values of a variable For the first question, I'm not sure why you wouldn't do this...

You're not subsetting using equality, you are coercing the numerics 1:10 to logical--and any numeric other than 0 is coerced to TRUE. Run, e.g., !(1:10) # [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE You get 10 FALSEs, so when you subset any vector of length...

A data frame has two indices: the first is for the rows, the second for the columns. So, in order to get the 8 out of your data frame, you would have to type df[2,3], since it is the element in row 2 and column 3. Now regarding your...
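As a small illustration of the two indices, assuming a hypothetical df built so that the value 8 sits in row 2, column 3 (the data frame from the question is not shown):

```r
# Hypothetical data frame.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

cell <- df[2, 3]   # row 2, column 3: a single value
row2 <- df[2, ]    # row index only: the whole second row, still a data frame
col3 <- df[, 3]    # column index only: the whole third column, as a vector

cell
```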

You can try library(dplyr) inner_join(small.a, small.b, by=c('V2'='V1')) Or using merge merge(small.a, small.b, by.x='V2', by.y='V1') ...
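A base-R sketch of the merge() variant, assuming made-up contents for small.a and small.b (only the column names V1 and V2 come from the answer; V3 is invented):

```r
# Hypothetical inputs: V2 in small.a holds the keys that V1 in small.b holds.
small.a <- data.frame(V1 = 1:3, V2 = c("x", "y", "z"))
small.b <- data.frame(V1 = c("y", "z", "w"), V3 = c(10, 20, 30))

# Keep only rows whose key appears in both; the key column takes its name from by.x.
merged <- merge(small.a, small.b, by.x = "V2", by.y = "V1")
merged
```

Only the keys "y" and "z" occur in both inputs, so the result has two rows.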

Working through an example shows where it is going wrong: a <- data.frame(VAL=c(1,1,1,23,24)) a # VAL #1 1 #2 1 #3 1 #4 23 #5 24 These work: a$VAL %in% c(23,24) #[1] FALSE FALSE FALSE TRUE TRUE a$VAL==23 | a$VAL==24 #[1] FALSE FALSE FALSE TRUE TRUE The following doesn't work...

You could try (assuming that rows are the rownames of the dataset) df1[colSums(df2)!=0,,drop=FALSE] # something #n1 34 #n2 62 #n3 15 #n5 93 Suppose, if colSums are not 0, this gets all the rows, df2$n4[1] <- 3 df1[colSums(df2)!=0,,drop=FALSE] # something #n1 34 #n2 62 #n3 15 #n4 29 #n5 93...

You need to reference the rownames of your data.frame: dfsub[rownames(dfsub) == 271,] #where dfsub is your subsetted data.frame EDIT: as @koekenbakker commented, there is a shorthand to reference the rownames by using ''. So this would be: dfsub['271',] #where dfsub is your subsetted data.frame and 271 the rowname ...
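To make the name-versus-position distinction concrete, here is a sketch with a hypothetical data frame (df, v and w are invented names):

```r
# Hypothetical data; subsetting keeps the original row names "269" to "300".
df <- data.frame(v = 1:300, w = 300:1)
dfsub <- df[df$v > 268, ]

by_name     <- dfsub["271", ]                    # row whose name is "271"
by_rownames <- dfsub[rownames(dfsub) == 271, ]   # same row, via a comparison
by_position <- dfsub[271, ]                      # 271st row: does not exist, so all NA

by_name
by_position
```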

r,data.frame,assignment-operator,subsetting

If speed is a concern I would look to data.table (I normally look there anyway). library(data.table) setDT(P2Y12R_binding_summary)[SYSTEM=="4NTJ" & VARIANT=="D294N", Ki_expt := 42.2 ] An example using diamonds (from ggplot2): library(data.table) dummydf <- diamonds setDT(dummydf)[cut =="Premium" & color =="J", example := 42.2 ] dummydf[!is.na(example)] carat cut color clarity depth table price x y...

r,timestamp,posixct,subsetting,posixlt

You misunderstand a critical difference between POSIXlt and POSIXct: POSIXlt is a 'list type' with components you can access as you do; POSIXct is a 'compact type' that is essentially just a number. You almost always want POSIXct for comparison and efficient storage (e.g. in a data.frame, or to index...
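A minimal sketch of the two classes side by side, using an arbitrary timestamp chosen for illustration:

```r
# Arbitrary example time, fixed to UTC so the result does not depend on locale.
ct <- as.POSIXct("2015-03-01 12:30:00", tz = "UTC")
lt <- as.POSIXlt(ct)

# POSIXct is a single number underneath: seconds since 1970-01-01 UTC.
secs <- as.numeric(ct)

# POSIXlt is a list of components.
lt$hour          # 12
lt$min           # 30
lt$year + 1900   # 2015; POSIXlt stores years since 1900

# Comparisons work directly on POSIXct values.
later <- as.POSIXct("2015-03-02 00:00:00", tz = "UTC")
ct < later       # TRUE
```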

uniprotconvert[ uniprotconvert$V2 %in% genelist, ] Will do the job....