I am working with a CSV data set of around 1 million records. I need to perform two operations on it:
- Prepare a data set that excludes every row containing a missing (blank) value.
- Prepare another data set in which empty values are replaced with 'Unknown'.
I tried to do this in Excel, but it takes too much time. Can someone show me how this can be done in R?
Best How To:
To get complete cases, use this:
complete_df <- df[complete.cases(df), ]
complete.cases() returns a logical vector that is TRUE for each row of the data frame df with no NA values, and you can use that vector to subset the data.
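One thing to check first is how the blanks were read in. With read.csv's defaults, a blank field in a character column comes in as an empty string rather than NA, so complete.cases() would not flag it. Below is a minimal sketch, assuming the blanks should count as missing; the sample data, column names and csv_text variable are made up for illustration and stand in for the real file.

```r
# Hypothetical sample data standing in for the real CSV file.
csv_text <- "id,name,score
1,alice,90
2,,85
3,carol,
4,dave,70"

# na.strings = c("", "NA") makes blank fields read as NA in every column.
# By default, blank fields in character columns are read as empty strings.
df <- read.csv(text = csv_text, na.strings = c("", "NA"),
               stringsAsFactors = FALSE)

# Keep only the rows that have no missing values.
complete_df <- df[complete.cases(df), ]
```

For a real file you would pass the file path as the first argument instead of text =. For a file of this size, a faster reader may also be worth trying, but that is a separate choice from the NA handling shown here.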
To replace the NAs, you can use this:
new_df <- df
new_df[is.na(new_df)] <- 'Unknown'
Be aware that this can change the data types of the columns that have missing data. For example, if a numeric column has its missing values replaced with 'Unknown', the whole column becomes a character column.
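If the numeric columns need to stay numeric, one option is to apply the replacement only to character columns and leave NA in place elsewhere. This is a sketch of that alternative, not part of the original answer; the data frame and the is_chr helper variable are hypothetical.

```r
# Hypothetical data frame with gaps in a character and a numeric column.
df <- data.frame(
  name  = c("alice", NA, "carol"),
  score = c(90, 85, NA),
  stringsAsFactors = FALSE
)

new_df <- df
# Flag the character columns; only these receive the 'Unknown' label.
is_chr <- vapply(new_df, is.character, logical(1))
new_df[is_chr] <- lapply(new_df[is_chr], function(col) {
  col[is.na(col)] <- "Unknown"
  col
})
```

Here the name column gets 'Unknown' in place of its missing entry, while score remains numeric and keeps its NA.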