How to reshape and summarise categorical data from long to wide?

r,data,reshape,categorical-data

The xtabs function is reasonably simple to use. The only cognitive jump is to realize that there is no LHS unless you want to do summation of a third variable:
> xtabs( ~var2+var3, data=db)
    var3
var2 G H K
   X 1 1 0
   Y 2 0 1
You don't want...

Merging pandas categorical Series with renaming

python,python-3.x,pandas,categorical-data

One way of doing this is just to assign the category to all elements in the series of the category you want to rename:
In [59]: N
Out[59]:
0    a
1    b
2    c
3    a
Name: NEW_TEST, dtype: category
Categories (3, object): [a < b < c]
In [60]:...
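The idea described above can be sketched roughly as follows. This is a minimal illustration, not the original answer's code: the Series contents mirror the output shown in the snippet, while the choice to merge "c" into "a" and the variable names `s` and `merged` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data mirroring the snippet: an ordered categorical Series.
s = pd.Series(
    pd.Categorical(["a", "b", "c", "a"], categories=["a", "b", "c"], ordered=True),
    name="NEW_TEST",
)

# Merge category "c" into "a" (an assumed choice for illustration):
# assign "a" to every element currently labelled "c" ...
merged = s.copy()
merged[merged == "c"] = "a"

# ... then drop the now-empty "c" category.
merged = merged.cat.remove_unused_categories()

print(merged.tolist())
print(list(merged.cat.categories))
```

Assigning an existing category is allowed on a categorical Series, which is why the target category has to be present before the assignment.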

Extract unique strings from a factor string variable

r,unique,categorical-data

If you wanted to extract the unique names of actors, you can get the indicated actors with the as.character function, split it on the commas with strsplit, combine together all vectors in the resulting list with unlist, and grab the unique names with unique: (all.actors <- unique(unlist(strsplit(as.character(actor), ", ")))) #...

R - set reference level of factor to NA

r,r-factor,categorical-data

You have to specify the correct NA type, which is NA_character_, but that then throws out the NA, which is presumably a bug. A workaround is to specify levels directly yourself:
# throw out NA's to begin with
egg_size = factor(c("large","small",NA), exclude = NA)
# but then add them back...

Regression gives error on one of the input variables “contrasts can be applied only to factors with 2 or more levels”

r,regression,categorical-data

It objects to one of your variables (as gvrocha said); you might have a factor with only one level, or a string. A quick way to track down the offending variable(s) is interval bisection: increase or decrease the column indices until you trigger the error. Best to use the...

One-Hot Encoding in [R] | Categorical to Dummy Variables [duplicate]

r,categorical-data

dd <- read.table(text="
RACE AGE.BELOW.21 CLASS
HISPANIC 0 A
ASIAN 1 A
HISPANIC 1 D
CAUCASIAN 1 B", header=TRUE)
with(dd, data.frame(model.matrix(~RACE-1,dd), AGE.BELOW.21,CLASS))
##   RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS
## 1         0             0            1            0     A
## 2         1             0            0            1     A
## 3         0             0            1            1     D
##...

Histogram color fills with categorical variables in R

r,histogram,categorical-data

What you're describing doesn't sound like a histogram (which is a specific plot that estimates the density of a continuous variable); it sounds like you just want a bar chart. I believe this is what you're looking for:
myDF <- data.frame(
  ref=c("A","A","A","C","C","C","G","G","G","T","T","T"),
  alt=c("C","G","T","A","G","T","A","C","T","A","C","G"),
  Freq=c(5,2,3,6,9,6,8,6,7,4,6,4)
)
library(ggplot2)
ggplot(myDF, aes(ref, Freq,...

Clustering using gower distance in R

r,cluster-analysis,categorical-data

You can use the vegan package to generate your Gower matrix, and then create your clusters using the cluster package.
gow.mat <- vegdist(dataframe, method="gower")
Then you can feed that matrix into the pam function. The example below will use the Gower distance to generate 5 clusters:
clusters <- pam(x =...

Heatmap with categorical variables and with phylogenetic tree in R

r,heatmap,categorical-data,phylogeny

I actually figured it out by myself. For those that are interested, here is my script:
#load packages
library("ape")
library(gplots)
#retrieve tree in newick format with three species
mytree <- read.tree("sometreewith3species.tre")
mytree_brlen <- compute.brlen(mytree, method="Grafen") #so that branches have all same length
#turn the phylo tree to a dendrogram object...

Creating factor variables 'weekend' and 'weekday' from date

r,categorical-data

You can use base R:
df1$date <- as.Date(df1$date)
#create a vector of weekdays
weekdays1 <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
#Use `%in%` and `weekdays` to create a logical vector
#convert to `factor` and specify the `levels/labels`
df1$wDay <- factor((weekdays(df1$date) %in% weekdays1), levels=c(FALSE, TRUE), labels=c('weekend', 'weekday'))
#Or
df1$wDay <- c('weekend', 'weekday')[(weekdays(df1$date)...

Fill multiple nulls for categorical data

python,pandas,null,categorical-data

You could use stats.rv_discrete:
from scipy import stats
counts = df.category.value_counts()
dist = stats.rv_discrete(values=(counts.index, counts/counts.sum()))
fill_values = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = fill_values
EDIT: For general data (not restricted to integers) you can do:
import numpy as np
dist = stats.rv_discrete(values=(np.arange(counts.shape[0]), counts/counts.sum()))
fill_idxs = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] =...
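The same idea, filling each missing value with a draw from the observed category frequencies, can be sketched without SciPy. Note that this swaps stats.rv_discrete for NumPy's `Generator.choice`, so it is an alternative under stated assumptions rather than the answer's own code; the sample frame, the seed, and the names `rng`, `probs` and `n_missing` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries in a string-valued column.
df = pd.DataFrame({"category": ["x", "y", None, "x", None, "x"]})

# Empirical distribution of the observed (non-null) categories.
counts = df["category"].value_counts()
probs = (counts / counts.sum()).to_numpy()

# Draw one replacement per missing row, weighted by observed frequency.
# The fixed seed is only there to make the sketch reproducible.
rng = np.random.default_rng(0)
mask = df["category"].isnull()
n_missing = int(mask.sum())
fill_values = rng.choice(counts.index.to_numpy(), size=n_missing, p=probs)

df.loc[mask, "category"] = fill_values
print(df)
```

Because the draws are weighted by the observed counts, frequent categories are proportionally more likely to be used as fill values, which roughly preserves the column's distribution.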

Factor variables in r

r,statistics,glm,categorical-data

Don't convert your categorical variable into numeric variables - this will create a very different model [your attempts would not have worked anyway]. There is no such thing as a "regression" estimate for the entire variable. If a categorical variable has n categories, the standard approach will create n-1...

How can I produce a MATLAB bar graph of categorical responses?

matlab,categorical-data

Try this. First I created some sample data, for 10 students answering 5 questions:
choices=['A' 'B' 'C' 'D'] %// I'll use this variable later as well
for m=1:10
    for n=1:5
        answers{m,n}=choices(randi(4));
    end
end
answers
Then to find the frequency of each answer to each question, freq, I looked at each...

How to create a continuous variable from a categorical variable

r,variables,categorical-data

This samples uniformly between the minimum and maximum within each group, returning the same number of values as your original dataframe:
df = read.table(file='clipboard', header=TRUE)
library(plyr)
ddply(df, .(Class_age), function(x) {
  level = x$Class_age[1]
  min_max = as.numeric(strsplit(as.character(level), '-')[[1]])
  x$age = runif(nrow(x), min=min_max[1], max=min_max[2])
  return(x)
})
Example output:
  Class_age age
1 20-22...

R: making a difference between categorical and numeric predictors

r,summary,categorical-data

The problem is that you first create a matrix, whose elements must all be of the same type, and then convert it to a data frame. You have to create the data frame from the beginning:
alles <- data.frame(isNoun = as.factor(isNoun), isVerb = as.factor(isVerb), length, labels = as.factor(labels)) ...

rolling majority on non-numeric data

pandas,categorical-data

Here is one way to do it by defining your own rolling apply function.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['b'] = np.where(df.a == 1, 'A', 'B')
print(df)
Out[60]:
   a  b
0  1  A
1  1  A
2  1  A
3  1  A
4  1  A
5...
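Since the answer's own rolling function is cut off here, the following is only one possible sketch of a rolling majority, not the original implementation. It assumes a window of 3 and works around the fact that rolling windows operate on numbers by encoding the labels as integer codes first; the names `majority`, `rolled` and `b_majority` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2]})
df["b"] = np.where(df.a == 1, "A", "B")

# Rolling windows need numeric input, so encode the labels as integer codes.
codes, uniques = pd.factorize(df["b"])

def majority(window):
    """Return the most frequent code in the window (ties go to the lowest code)."""
    return np.bincount(window.astype(int)).argmax()

# Window size 3 is an assumption; the first two rows have no full window.
rolled = pd.Series(codes, index=df.index).rolling(window=3).apply(majority, raw=True)

# Map the winning codes back to labels, leaving incomplete windows as NaN.
df["b_majority"] = rolled.map(lambda c: uniques[int(c)], na_action="ignore")
print(df)
```

With an odd window and two labels there are no ties; for more labels or even windows, the tie-breaking rule in `majority` would need to be chosen deliberately.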

How to use biserial to calculate correlation between continuous and categorical variable?

r,correlation,continuous,categorical-data

Since y is not dichotomous, it doesn't make sense to use biserial(). From the documentation: "The biserial correlation is between a continuous y variable and a dichotomous x variable, which is assumed to have resulted from a dichotomized normal variable." Instead use polyserial(), which allows more than 2 levels. polyserial()...

Merging data with text identifiers

categorical-data,dataset

In Excel you could use a Levenshtein distance function, as in the following link: http://stackoverflow.com/questions/4243036/levenshtein-distance-in-excel ...

Converting factors to numeric values in R

r,categorical-data

For converting the currency:
# data
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" , "$25,000"), educ = c("High School Diploma", "Current Undergraduate", "PhD"),stringsAsFactors=FALSE)
# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)
# remove text
temp <- gsub("[[:alpha:]]","", temp)
# get average over range
df$ave.sal <-...

Creating Dummy Sets in MATLAB for statistics

matlab,correlation,categorical-data

That's pretty simple to do. In terms of two-dimensional problems, data that exhibit a positive correlation are those data points where there is clearly an increase in the output for every increase in the independent value r. Similarly, for negative correlation, there is a decrease in the output for every...

Pandas Ordinal Variable Treatment in Similarity Calculation

python,pandas,distance,similarity,categorical-data

If you're concerned about relying on categoricals, another approach is to define your categories in an ordered list, use it to create a dict that maps each category to its position in that order, and pass this dict to map: In [560]: df = pd.DataFrame({'id' : range(1,9), 'rate' : ['bad',...
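A minimal sketch of that list-to-dict-to-map approach might look like this. The original data is truncated, so the rating levels other than 'bad', the four example rows, and the names `order`, `rank` and `rate_rank` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ordinal levels, listed from lowest to highest.
order = ["bad", "neutral", "good"]

# Map each label to its position in the ordering.
rank = {label: i for i, label in enumerate(order)}

# Hypothetical data; the original frame is not fully shown.
df = pd.DataFrame({"id": range(1, 5), "rate": ["bad", "good", "neutral", "good"]})

# Series.map looks each label up in the dict; unknown labels would become NaN.
df["rate_rank"] = df["rate"].map(rank)
print(df)
```

The resulting integer column preserves the ordering, so it can be fed into a distance or similarity calculation where plain string labels could not.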

R: find frequencies over 3rd quartile in table

r,table,frequency,categorical-data,contingency

I think you can speed this up by writing a bit more precise function and then using aggregate to get the results. You could also use by if you want a more list-based approach, which might be more useful for your next use. I think it will still be slow,...

R - convert from categorical to numeric for KNN

r,knn,r-caret,categorical-data

When data are read in via read.table, the data in the first column are factors. Then data$iGender = as.integer(data$Gender) would work. If they are character, a detour via factor is easiest: data$iGender= as.integer(as.factor(data$Gender)) ...

Combining factor levels in data frame column

r,taxonomy,categorical-data,merging-data

The problem is that the number of levels does not equal the number of labels in the factor creation, nor is it length 1. From ?factor: labels either an optional character vector of labels for the levels (in the same order as levels after removing those in exclude), or a...

How to plot categorical vs numerical data with vioplot in R?

r,boxplot,categorical-data

You misspelled "purple" when you subsetted. Also, in the vioplot function, your first four arguments need to be vectors, not data frames. This code should work.
blue <- subset(beeflowers4, colors=="blue", select=c(pollen, colors))
green <- subset(beeflowers4, colors=="green", select=c(pollen, colors))
purple <- subset(beeflowers4, colors=="purple", select=c(pollen, colors))
red <- subset(beeflowers4, colors=="red", select=c(pollen, colors))...

formatting data in R categorical

r,data,data.frame,categorical-data

You can do it with dplyr very simply:
library(dplyr)
df %>%
  group_by(reason,location) %>%
  summarize(zeros = sum(zero_ones==0), ones = sum(zero_ones==1))
#  reason location zeros ones
#1      s        c     1    2
#2      s        h     1    1
#3      v        c     1    4
#4      v        h     2    1
...