python,pandas,machine-learning,missing-data

What is the best way to handle missing values in a data set? There is NO best way; each solution/algorithm has its own pros and cons (and you can even mix some of them together to create your own strategy, tuning the related parameters to come up with the one that best satisfies...
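As a small illustration of weighing strategies against each other, here is a hedged pandas sketch (the frame and column names are invented, purely for demonstration) comparing two common approaches, dropping incomplete rows versus mean imputation:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (invented data)
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

dropped = df.dropna()           # strategy 1: discard any row with a NaN
imputed = df.fillna(df.mean())  # strategy 2: impute with column means

print(len(dropped))              # dropping loses most of this tiny frame
print(imputed["a"].tolist())     # imputation keeps every row
```

Dropping is unbiased but wasteful; imputation keeps the sample size but shrinks variance, which is exactly the kind of trade-off the answer above is pointing at.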

You can do try[try <= .8] <- 5 and the NA values will remain as NA. Or create a logical condition to exclude the NA values: try[try <= .8 & !is.na(try)] <- 5 ...

r,if-statement,average,missing-data

If you are sure you don't have any consecutive NA values and the first and last elements are never NA, then you can do df1<-c(2,2,NA,10, 20, NA,3) idx<-which(is.na(df1)) df1[idx] <- (df1[idx-1] + df1[idx+1])/2 df1 # [1] 2.0 2.0 6.0 10.0 20.0 11.5 3.0 This should be more efficient than a...

r,row,interpolation,na,missing-data

You could also use the na.approx function from the zoo package. Note that this has a slightly different behavior (than the solution by @flodel) when you have two consecutive NA values. For the first and last row you could then use na.locf. y <- na.approx(x) y[nrow(y), ] <- na.locf(y[(nrow(y)-1):nrow(y), ])[2,...
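The na.approx-plus-na.locf combination, linear interpolation over interior NAs with the nearest observation carried to the edges, can be sketched in pandas as well (this is an analogy with an invented series, not a translation of the R code):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 6.0, np.nan])

# interior NAs: linear interpolation, like zoo::na.approx
y = s.interpolate(method="linear", limit_area="inside")
# edge NAs: carry the nearest observation outward, like zoo::na.locf
y = y.ffill().bfill()
print(y.tolist())   # [2.0, 2.0, 4.0, 6.0, 6.0]
```

The `limit_area="inside"` flag keeps the interpolation from extrapolating past the first and last observations, so the edge handling stays a separate, explicit step.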

You can try merge/expand.grid res <- merge( expand.grid(group=unique(df$group), time=unique(df$time)), df, all=TRUE) res$data[is.na(res$data)] <- 0 res # group time data #1 A 1 5 #2 A 2 6 #3 A 3 0 #4 A 4 7 #5 B 1 8 #6 B 2 9 #7 B 3 10 #8 B 4...
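The same expand.grid-then-merge pattern can be sketched in pandas (a rough analogue with invented data, not a translation of the R code): build every group/time combination, left-join the observed data onto it, and fill the gaps with 0.

```python
import itertools
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "B"], "time": [1, 2, 1], "data": [5, 6, 8]})

# all group/time combinations, like expand.grid
full = pd.DataFrame(
    list(itertools.product(df["group"].unique(), df["time"].unique())),
    columns=["group", "time"],
)
res = full.merge(df, how="left").fillna({"data": 0})
print(res)
```

The left join keeps every combination from the full grid, so missing observations surface as NaN and can be zero-filled in one step.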

c++,algorithm,vector,counting,missing-data

So I thought I'd post an answer. I don't know anything in std::algorithm that accomplishes this directly, but in combination with vector<bool> you can do this in O(2N). template <typename T> T find_missing(const vector<T>& v, T elem){ vector<bool> range(v.size()); elem++; for_each(v.begin(), v.end(), [&](const T& i){if((i >= elem && i -...

parameters,null,sparse-matrix,missing-data,ampl

Instead of specifying a default value, you can work with a sparse matrix. For example: param m integer > 0; set C within {1..m,1..m}; param A{C}; data; param m := 4; param: C: A: 1 2 3 4 := 1 36 . . -2 2 . 7 3 . 3...

Wouldn't na.omit(data) do? It seems the cleanest and fastest way to me. By the way, your code does not work because you cannot do != NA. Use is.na() instead (but na.omit() is better): data[!is.na(data[,2]),] ...

python,for-loop,count,pandas,missing-data

The best way to do this is with the count method of DataFrame objects: In [18]: data = randn(1000, 3) In [19]: data Out[19]: array([[ 0.1035, 0.9239, 0.3902], [ 0.2022, -0.1755, -0.4633], [ 0.0595, -1.3779, -1.1187], ..., [ 1.3931, 0.4087, 2.348 ], [ 1.2746, -0.6431, 0.0707], [-1.1062, 1.3949, 0.3065]]) In...
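A minimal, self-contained version of that count idea (data invented with a fixed NaN pattern so the result is predictable):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3), columns=list("abc"))
df.iloc[::10, 0] = np.nan   # knock out every 10th value in column a

# count() reports the number of non-missing values per column
print(df.count())   # a: 900, b: 1000, c: 1000
```

`count()` is the complement of counting NaNs: `len(df) - df.count()` gives the per-column missing counts directly.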

You could use np.sum to get the counts for each column: import numpy as np import pandas as pd df = pd.DataFrame({'c1':[1, np.nan, np.nan], 'c2':[2, 2, np.nan]}) np.sum(df.isnull()) Out[4]: c1 2 c2 1 dtype: int64 ...

r,machine-learning,classification,missing-data,predict

There are a number of ways to go about this, but here is one. I also tried using it on your dataset, but it's either too small, has too many linear combinations, or something else, because it's not converging. Amelia - http://fastml.com/impute-missing-values-with-amelia/ data(mtcars) mtcars1<-mtcars[rep(row.names(mtcars),10),] #increasing dataset #inserting NAs into dataset...

python,numpy,statistics,scipy,missing-data

You cannot mathematically perform a test statistic based on nan. Unless you find proof/documentation of special treatment of nan, you cannot rely on that. My experience is that in general, even numpy does not treat nan specially, for example for the median. Instead the results are whatever they happen...
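A quick check of that behaviour: plain numpy reductions propagate nan, while the explicitly nan-aware variants drop it first.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])

print(np.median(x))      # nan: ordinary reductions do not skip missing values
print(np.nanmedian(x))   # 2.0: the nan-aware variant ignores them
```

So any special treatment of nan has to be asked for explicitly (np.nanmean, np.nanmedian, np.nanstd, ...); relying on the plain functions to "do the right thing" will silently give nan.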

You can use the subset function: subset(list_outdegrees,Followers!=0 | Friends!=0 | Statuses!=0) ...

Based on the data shown, it is not very clear whether you have other values in 'first' for each combination of 'second' and 'third'. If there is only a single value and you need to replace the '' with that, then you could try library(data.table) setDT(df1)[, replace(first, first=='', first[first!='']), list(second,...

c,header,syntax-error,missing-data

Change: typedef struct Portion{ char *pName; float pPrice; float pQuantity; Portion *next; }Portion; to: typedef struct Portion{ char *pName; float pPrice; float pQuantity; struct Portion *next; // <<< }Portion; Also note that you shouldn't cast the result of malloc in C, so change e.g. Portion *newPortion = (Portion*)malloc(sizeof(Portion)); to: Portion...

matlab,octave,pca,missing-data,least-squares

Thanks for all your help. I went through the references and was able to find the Matlab code for the ALS algorithm in two of them. For anybody wondering, the source code can be found at these two links: 1) http://research.ics.aalto.fi/bayes/software/index.shtml 2) https://www.cs.nyu.edu/~roweis/code.html...

I suspect that your 'missing values' are blanks (""). If you code them as NA instead, you make life easier. A small example (of what I guess is going on) # sample data with some 'missing values' x <- c("high", "", "low", "", "high", "") x table(x) # high low...

Try something like this d1[is.na(d2)] <- NA ...

I would just do something like this (although I'm not sure whether your example is real or just a toy, in which case it probably won't fit your requirements). Assuming that dat is your data: dat2 <- data.frame(R = rep(seq_len(3), each = 5), Y = rep(seq_len(5), 3), N = 0) dat2$N[paste(dat2$R,...

For the second count I think just subtract the number of rows from the number of rows returned from dropna: In [14]: from numpy.random import randn df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three']) df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) df Out[14]:...
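The subtraction itself is a one-liner; a self-contained sketch with an invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0, np.nan],
                   "two": [1.0, 2.0, np.nan, np.nan]})

# rows containing at least one NaN = total rows minus fully complete rows
n_incomplete = len(df) - len(df.dropna())
print(n_incomplete)   # 3
```

This counts rows with *any* missing value; `df.isnull().sum().sum()` would instead count individual missing cells, which is a different number when a row has several NaNs.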

This is not the prettiest thing, but here's how I handle problems like this: library(data.table) data <- data.table(data) data[, rowid:=1:.N, keyby = id] ## flag_a data[, flag_a_min:=min(rowid[!is.na(flag_a)]), keyby = id] data[, flag_a_max:=flag_a_min+2] data[rowid <=flag_a_max & rowid >= flag_a_min, flag_a:=min(na.omit(flag_a))] ## flag_b data[, flag_b_min:=min(rowid[!is.na(flag_b)]), keyby = id] data[, flag_b_max:=flag_b_min+2] data[rowid <=flag_b_max...

This is a variation on a problem documented since 2000 as an FAQ: see here The variation lies in limiting how far non-missing values are copied. But it falls easily to the same idea. The last known value was recorded in certain years which we can copy down the dataset:...

fitted gives in-sample one-step forecasts. The "right" way to do what you want is via a Kalman smoother. A rough approximation good enough for most purposes is obtained using the average of the forward and backward forecasts for the missing section. Like this: x <- AirPassengers x[90:100] <- NA fit...

Find min and max C2 per C1 from test in a CTE (does not have to be a CTE) and join that to OtherTable using between. Fetch the Data value correlated sub query using top(1) ordered by C2 desc with C as ( select C1, min(C2) minC2, max(C2) maxC2 from...

You can also use addNA, x <- c(1, 1, 2, 2, 3, NA) addNA(x) # [1] 1 1 2 2 3 <NA> # Levels: 1 2 3 <NA> which is basically a convenience function for factoring with exclude = NULL...

ios,error-handling,itunesconnect,missing-data,leaderboard

Yi Yuan's answer is correct; unfortunately I am too much of a noob to upvote it. That's exactly what Apple recommended to me in the same situation by pointing out the Game Center configuration manual. I thought they were mad, as I did set the default Leaderboard before, and was unable to make...

ios,localization,xcode6,missing-data,xliff

The strings were not included in the XLIFF, because they were attributed strings. It makes sense that Xcode would not automatically localize attributed strings, because different parts of the string may have different attributes. How is the translator to know which parts of the translated string get which attributes? This...

You can directly set the value rather than relying on loops: data[data[, 3] == 1, 1] <- 0 This is setting the value in column 1 to 0 when the value in column 3 is 1. I think your function is replacing [1, 1] with 0 because when s is...

First, we will define the two thresholds you specified. (I set the second one to 4 so we can work consistently with "<" and ">", instead of the error-prone "<" and ">="). threshold.data <- 10 threshold.NA <- 4 Now, the key is to work with run length encoding on is.na(y)....
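Run-length encoding a missingness mask is easy to sketch outside R as well, e.g. with itertools.groupby in Python (an analogy to rle(is.na(y)) on invented data, not the R code):

```python
import math
from itertools import groupby

y = [1.0, float("nan"), float("nan"), 3.0, float("nan")]

# run-length encode the missingness mask: (is_missing, run_length) pairs
mask = [math.isnan(v) for v in y]
runs = [(key, sum(1 for _ in grp)) for key, grp in groupby(mask)]
print(runs)   # [(False, 1), (True, 2), (False, 1), (True, 1)]
```

Once the runs are in hand, gaps longer than a threshold can be handled differently from short ones, which is exactly the role the rle output plays in the answer above.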

You can use dat[order(rowSums(is.na(dat))), ] where dat is the name of your data frame....

Thank you John for your help. I found a solution which works for me. It looks quite different, but I learned a lot from it. If you are interested, here is how it looks now. SELECT DISTINCTROW Base_Customer_RevenueYearQ.SalesRep, Base_Customer_RevenueYearQ.CustomerId, Base_Customer_RevenueYearQ.Customer, Base_Customer_RevenueYearQ.RevenueYear, CustomerPaymentPerYearQ.[Sum Of PaymentPerYear] FROM Base_Customer_RevenueYearQ LEFT JOIN CustomerPaymentPerYearQ...

r,missing-data,logistic-regression

If you have missing data (NAs), R will strip these when the modelling functions do formula -> model.frame -> model.matrix() etc., because the default in all these functions is na.action = na.omit. In other words, rows with NAs are deleted before the actual computations are performed. This deletion...

r,function,subset,mean,missing-data

That's the way I fixed it: pollutantmean <- function(directory, pollutant, id = 1:332) { #set the path path = directory #get the file List in that directory fileList = list.files(path) #extract the file names and store as numeric for comparison file.names = as.numeric(sub("\\.csv$","",fileList)) #select files to be imported based on...

This gives you the first column with the linearly-extrapolated values filled in for NA. You can adapt for the last column. firstNAfill <- function(x) { ans <- ifelse(!is.na(x[1]), x[1], ifelse(sum(!is.na(x))<2, NA, 2*x[which(!is.na(x[1, ]))[1]] - x[which(!is.na(x[1, ]))[2]] ) ) return(ans) } dat$Year1 <- unlist(lapply(seq(1:nrow(dat)), function(x) {firstNAfill(dat[x, ])})) Result: Year1 Year2 Year3...

php,mysql,sql,result,missing-data

What is the purpose of the two while loops (except stealing your first row – in the inner loop you reassign the value of the $row variable from the outer loop's first iteration)? Move all your assignments/output into the first while loop and remove the second (you are missing curly...

Convert your factor to character then run the previous line again: df$GROUPE <- as.character(df$GROUPE) df$GROUPE[is.na(df$GROUPE)] <- "OTHER" You can refactor the df$GROUPE variable after: df$GROUPE=as.factor(df$GROUPE) ...

sql,compare,records,missing-data

You can generate the missing records by doing a cross join to generate all combinations and then removing those that are already there. For example: select fd.fulldate, c.D_ACTIECODE from (select distinct fulldate from fact_quantity_tmp) fd cross join (select D_ACTIECODE from C_DS_BD_AP_A) c left join fact_quantity_tmp fqt on fqt.fulldate = fd.fulldate...

Gregory Jefferis, who maintains the package, cleared up the mystery: preprocessing with knnImpute silently removes any columns containing NA values, so both of your variables x and y are lost before predict(). Here is an example that works without crashing. The trick is to have a few more related columns without NAs....

You can simply list the two variables in complete.cases() and sum() the output. x <- c(1, 2, 3, NA, NA, NA, 5) y <- c(1, NA, 3, NA, 3, 2, NA) complete.cases(x, y) #[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE sum(complete.cases(x, y)) #[1] 2 The sum of a logical...
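The pandas analogue of summing complete.cases(x, y), using the same invented values as the R example (a sketch, not the R code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, np.nan, np.nan, np.nan, 5],
                   "y": [1, np.nan, 3, np.nan, 3, 2, np.nan]})

# a row is "complete" when every listed column is non-missing
complete = df[["x", "y"]].notna().all(axis=1)
print(int(complete.sum()))   # 2, matching the R output
```

As in R, the sum of a boolean vector is the count of True values, so no explicit loop is needed.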

Try this: Subs1<-subset(DATA, (!is.na(DATA[,2])) & (!is.na(DATA[,3]))) The second parameter of subset is a logical vector of length nrow(DATA), indicating whether to keep the corresponding row....

I am inventing some data here; I hope I got your question right. data chickens; do tag=1 to 560; output; end; run; data registered; input date mmddyy8. antenna tag; format date date7.; datalines; 01012014 1 1 01012014 1 2 01012014 1 6 01012014 1 8 01022014 1 1 01022014 1...

python,arrays,python-3.x,numpy,missing-data

I think you're looking for numpy.in1d, something like: a[a == 0] = b[~np.in1d(b, a)] For boolean arrays, ~ acts as a not....
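A runnable version of that in1d pattern (arrays invented; note the replacement assumes the number of zeros in a matches the number of fill values):

```python
import numpy as np

a = np.array([5, 0, 7, 0])
b = np.array([5, 7, 1, 2])

# elements of b not already present in a, used to fill the zero slots of a
fill = b[~np.in1d(b, a)]
a[a == 0] = fill
print(a.tolist())   # [5, 1, 7, 2]
```

`np.in1d(b, a)` returns a boolean mask over b, and `~` negates it, so `fill` is exactly the values of b missing from a, in b's order.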

python,pandas,missing-data,calculated-columns

The new column 'z' gets its values from column 'y' via df['z'] = df['y']. This brings over the missing values, so fill them in with fillna using column 'x'. Chain these two actions: >>> df['z'] = df['y'].fillna(df['x']) >>> df x y z 0 1 NaN 1 1 2 8 8...

I believe this should do it: null_data = df[df.isnull().any(axis=1)] ...
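For completeness, a tiny runnable check of that one-liner on an invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# select the rows that contain at least one NaN in any column
null_data = df[df.isnull().any(axis=1)]
print(null_data.index.tolist())   # [1, 2]
```

`isnull()` gives a boolean frame, `any(axis=1)` collapses it to a per-row mask, and indexing with that mask keeps only the incomplete rows.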

This short example illustrates one of possible ways of introducing a new level into a factor: x <- factor(c(NA, NA, "a", "b", NA, "b")) x[is.na(x)] <- "c" # this won't work, no such level as "c" in levels(x) ## Warning message: ## In `[<-.factor`(`*tmp*`, is.na(x), value = "c") : ##...

r,data.frame,apply,na,missing-data

x <- sample.df[ lapply( sample.df, function(x) sum(is.na(x)) / length(x) ) < 0.1 ] ...

database-design,modeling,missing-data,representation

If information about an entity is not needed to understand another entity, these are not cognitively dependent on each other and can be normalised. What this means in general practice is that you should make separate tables for the two entities and use foreign keys to refer between them. Imagine...

python,csv,pandas,missing-data

I found a way to get it more or less working. I just don't know why I need to specify dtype=type(None) to get it working... Comments on this piece of code are very welcome! import re import pandas as pd import numpy as np # clear quoting characters def filterTheField(s):...

time-series,gfortran,missing-data

I eventually found this out. I can convert the year, month, day, hour, minute, and second to a Julian Day number. I then convert the expected time difference to a Julian Day value (I say value as it's not technically a Julian Day number). Subtract two consecutive Julian Day numbers....