Try this. First I created some sample data, for 10 students answering 5 questions:

choices=['A' 'B' 'C' 'D'] %// I'll use this variable later as well
for m=1:10
    for n=1:5
        answers{m,n}=choices(randi(4));
    end
end
answers

Then to find the frequency of each answer to each question, freq, I looked at each...
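
For readers who don't use MATLAB, the same counting idea can be sketched in Python. This is not the original answer's code: the fixed seed and the names rng and counts are assumptions made for illustration.

```python
import random
from collections import Counter

choices = ["A", "B", "C", "D"]
rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable

# answers[m][n] is student m's answer to question n: 10 students, 5 questions
answers = [[rng.choice(choices) for _ in range(5)] for _ in range(10)]

# freq[n][c] is how many students gave answer c to question n
freq = []
for n in range(5):
    counts = Counter(row[n] for row in answers)
    freq.append({c: counts.get(c, 0) for c in choices})

for n, row in enumerate(freq, start=1):
    print(f"Q{n}: {row}")
```

Each row of freq sums to the number of students, which is a quick sanity check on the tally.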

r,taxonomy,categorical-data,merging-data

The problem is that the number of levels does not equal the number of labels in the factor creation, nor is it length 1. From ?factor: labels either an optional character vector of labels for the levels (in the same order as levels after removing those in exclude), or a...

You misspelled "purple" when you subsetted. Also, in the vioplot function, your first four arguments need to be vectors, not data frames. This code should work. blue <- subset(beeflowers4, colors=="blue", select=c(pollen, colors)) green <- subset(beeflowers4, colors=="green", select=c(pollen, colors)) purple <- subset(beeflowers4, colors=="purple", select=c(pollen, colors)) red <- subset(beeflowers4, colors=="red", select=c(pollen, colors))...

You can use base R: df1$date <- as.Date(df1$date) #create a vector of weekdays weekdays1 <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday') #Use `%in%` and `weekdays` to create a logical vector #convert to `factor` and specify the `levels/labels` df1$wDay <- factor((weekdays(df1$date) %in% weekdays1), levels=c(FALSE, TRUE), labels=c('weekend', 'weekday')) #Or df1$wDay <- c('weekend', 'weekday')[(weekdays(df1$date)...
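
The same weekday/weekend split can be sketched in Python with only the standard library. The function name day_type and the sample dates are made up for illustration.

```python
from datetime import date

def day_type(d: date) -> str:
    """Return 'weekday' for Monday to Friday and 'weekend' otherwise."""
    # date.weekday() numbers Monday as 0 and Sunday as 6
    return "weekday" if d.weekday() < 5 else "weekend"

# Hypothetical dates: a Friday, a Saturday, a Sunday and a Monday
dates = [date(2015, 6, 12), date(2015, 6, 13), date(2015, 6, 14), date(2015, 6, 15)]
labels = [day_type(d) for d in dates]
print(labels)
```

Unlike matching on day names, comparing the weekday number does not depend on the locale.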

You have to specify the correct NA type, which is NA_character_, but that then throws out the NA, which is presumably a bug. A workaround is to specify levels directly yourself: # throw out NA's to begin with egg_size = factor(c("large","small",NA), exclude = NA) # but then add them back...

r,data,reshape,categorical-data

The xtabs function is reasonably simple to use. The only cognitive jump is to realize that there is no LHS unless you want to do summation of a third variable:

> xtabs( ~var2+var3, data=db)
    var3
var2 G H K
   X 1 1 0
   Y 2 0 1

You don't want...
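
To make the idea of a two-way count table concrete, here is a rough Python equivalent of the cross-tabulation. The rows are invented so that they reproduce the counts in the xtabs output; the original db is not shown in the answer.

```python
from collections import Counter

# Made-up (var2, var3) pairs chosen to give the counts in the xtabs output
rows = [("X", "G"), ("X", "H"), ("Y", "G"), ("Y", "G"), ("Y", "K")]

counts = Counter(rows)
row_levels = sorted({r for r, _ in rows})
col_levels = sorted({c for _, c in rows})

# table[r][c] is the number of rows with var2 == r and var3 == c
table = {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}

print("var2 " + " ".join(col_levels))
for r in row_levels:
    print(f"{r:>4} " + " ".join(str(table[r][c]) for c in col_levels))
```

A Counter returns 0 for pairs that never occur, so empty cells are filled in without extra code.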

python,python-3.x,pandas,categorical-data

One way of doing this is just to assign the category to all elements in the series of the category you want to rename: In [59]: N Out[59]: 0 a 1 b 2 c 3 a Name: NEW_TEST, dtype: category Categories (3, object): [a < b < c] In [60]:...

r,table,frequency,categorical-data,contingency

I think you can speed this up by writing a bit more precise function and then using aggregate to get the results. You could also use by if you want a more list-based approach, which might be more useful for your next use. I think it will still be slow,...

r,heatmap,categorical-data,phylogeny

I actually figured it out by myself. For those that are interested, here is my script: #load packages library("ape") library(gplots) #retrieve tree in newick format with three species mytree <- read.tree("sometreewith3species.tre") mytree_brlen <- compute.brlen(mytree, method="Grafen") #so that branches have all same length #turn the phylo tree to a dendrogram object...

r,knn,r-caret,categorical-data

When data are read in via read.table, the data in the first column are factors. Then data$iGender = as.integer(data$Gender) would work. If they are character, a detour via factor is easiest: data$iGender= as.integer(as.factor(data$Gender)) ...
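
What as.integer(as.factor(...)) does can be sketched in Python as follows. The helper name factor_codes and the sample data are assumptions; R's level ordering also depends on the locale, which this sketch ignores.

```python
def factor_codes(values):
    """Map each distinct value to a 1-based integer code, with levels sorted."""
    levels = sorted(set(values))
    code_of = {level: i for i, level in enumerate(levels, start=1)}
    return [code_of[v] for v in values], levels

gender = ["male", "female", "female", "male"]  # hypothetical column
codes, levels = factor_codes(gender)
print(levels)
print(codes)
```

The codes are only labels: their order comes from sorting the level names, not from any meaning in the data.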

r,statistics,glm,categorical-data

Don't convert your categorical variable into numeric variables; this will create a very different model (your attempts would not have worked anyway). There is no such thing as a "regression" estimate for the entire variable. If a categorical variable has n categories, the standard approach will create n-1...
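
To illustrate the n-1 point, here is a small Python sketch of treatment (dummy) coding, where the first level in sorted order is taken as the reference. The function name treatment_code and the colour data are hypothetical; a modelling library would normally build these columns for you.

```python
def treatment_code(values, reference=None):
    """Build n-1 indicator columns for a categorical variable with n levels."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first sorted level is the baseline
    columns = [lv for lv in levels if lv != reference]
    # Each row gets a 1 in the column of its own level; the reference level is all zeros
    rows = [[1 if v == lv else 0 for lv in columns] for v in values]
    return columns, rows

colour = ["red", "blue", "green", "blue"]  # hypothetical data with 3 levels
columns, rows = treatment_code(colour)
print(columns)
for row in rows:
    print(row)
```

Three levels give two indicator columns, and each coefficient is then read as a difference from the reference level.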

It objects to one of your variables (as gvrocha said); you might have a factor with only one level, or a string. A tip to quickly track down the offending variable(s) is to do interval bisection and increase/decrease the col indices till you trigger the error. Best to use the...
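
The bisection tip can be sketched in Python like this. It assumes that exactly one column triggers the error on its own. The fails callback is a stand-in for running your model call on a subset of columns, and the data are made up.

```python
def find_offending_column(columns, fails):
    """Bisect columns to find the single one that makes fails() return True."""
    lo, hi = 0, len(columns)
    if not fails(columns[lo:hi]):
        return None  # nothing fails, so there is nothing to find
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(columns[lo:mid]):
            hi = mid  # the offender is in the first half
        else:
            lo = mid  # otherwise it must be in the second half
    return columns[lo]

# Hypothetical data: "site" has only one level
data = {
    "age": [21, 34, 56],
    "sex": ["f", "m", "f"],
    "site": ["a", "a", "a"],
    "dose": [1.0, 2.5, 0.5],
}

def fails(cols):
    # Stand-in for "the model call errors": any column with one distinct value
    return any(len(set(data[c])) < 2 for c in cols)

offender = find_offending_column(list(data), fails)
print(offender)
```

With several offending columns this finds only one of them, so rerun it after fixing each one.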

python,pandas,null,categorical-data

You could use stats.rv_discrete: from scipy import stats counts = df.category.value_counts() dist = stats.rv_discrete(values=(counts.index, counts/counts.sum())) fill_values = dist.rvs(size=df.shape[0] - df.category.count()) df.loc[df.category.isnull(), "category"] = fill_values EDIT: For general data (not restricted to integers) you can do: import numpy as np dist = stats.rv_discrete(values=(np.arange(counts.shape[0]), counts/counts.sum())) fill_idxs = dist.rvs(size=df.shape[0] - df.category.count()) df.loc[df.category.isnull(), "category"] =...
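
If SciPy is not available, the same idea (fill missing values by sampling in proportion to the observed frequencies) can be sketched with the standard library alone. This swaps random.choices in for rv_discrete; the function name fill_missing, the sample column and the seed are assumptions.

```python
import random
from collections import Counter

def fill_missing(values, rng):
    """Replace None entries with draws weighted by the observed category counts."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    # One weighted draw per missing entry
    n_missing = len(values) - len(observed)
    draws = iter(rng.choices(categories, weights=weights, k=n_missing))
    return [v if v is not None else next(draws) for v in values]

category = ["a", "b", None, "a", None, "c", "a"]  # hypothetical column with gaps
filled = fill_missing(category, random.Random(42))  # the seed is an arbitrary choice
print(filled)
```

Because it samples category labels directly, this works for non-integer categories without an index lookup step.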

For converting the currency # data df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" , "$25,000"), educ = c("High School Diploma", "Current Undergraduate", "PhD"),stringsAsFactors=FALSE) # Remove comma and dollar sign temp <- gsub("[,$]","", df$sal) # remove text temp <- gsub("[[:alpha:]]","", temp) # get average over range df$ave.sal <-...
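
The cleaning steps (drop commas and dollar signs, drop the text, average what is left) can be sketched in Python as below. How the original answer computes the average is cut off, so taking the mean of every number found in the string is an assumption, as is the function name average_salary.

```python
import re

def average_salary(text):
    """Strip ',' and '$', then average all numbers left in the string."""
    cleaned = re.sub(r"[,$]", "", text)
    numbers = [float(tok) for tok in re.findall(r"\d+", cleaned)]
    # Assumption: a single number stands for itself, a range for its midpoint
    return sum(numbers) / len(numbers)

sal = ["$100,001 - $150,000", "over $150,000", "$25,000"]
ave_sal = [average_salary(s) for s in sal]
print(ave_sal)
```

Note that an open-ended value such as "over $150,000" collapses to its lower bound here, which understates it.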

r,cluster-analysis,categorical-data

You can use the vegan package to generate your Gower distance matrix, and then create your clusters using the cluster package. gow.mat <- vegdist(dataframe, method="gower") Then you can feed that matrix into the pam function. The example below uses the Gower distance to generate 5 clusters: clusters <- pam(x =...
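
For intuition about what a Gower distance measures, here is a minimal Python sketch of the general definition for mixed data: a range-scaled difference for numeric variables and a simple mismatch for categorical ones. It is not a reimplementation of vegdist, and the records and the name gower_distance are made up.

```python
def gower_distance(a, b, ranges):
    """Average per-variable dissimilarity between records a and b.

    ranges[k] is the observed range of numeric variable k, or None if
    variable k is categorical.
    """
    total = 0.0
    for x, y, span in zip(a, b, ranges):
        if span is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        elif span > 0:
            total += abs(x - y) / span       # numeric: range-scaled difference
        # a numeric variable with zero range contributes nothing
    return total / len(ranges)

records = [(10.0, "red"), (20.0, "red"), (30.0, "blue")]  # hypothetical data
ranges = [30.0 - 10.0, None]  # first variable numeric, second categorical
matrix = [[gower_distance(a, b, ranges) for b in records] for a in records]
for row in matrix:
    print(row)
```

Every entry lies between 0 and 1, which is what lets numeric and categorical variables share one distance matrix.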

Here is one way to do it by defining your own rolling apply function. import numpy as np import pandas as pd df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]}) df['b'] = np.where(df.a == 1, 'A', 'B') print(df) Out[60]: a b 0 1 A 1 1 A 2 1 A 3 1 A 4 1 A 5...

In Excel you could use a Levenshtein distance function, as in the following link: http://stackoverflow.com/questions/4243036/levenshtein-distance-in-excel ...
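
For reference, the Levenshtein distance itself is short to write down. This is a standard dynamic-programming sketch in Python, not the Excel function from the link.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```

Keeping only two rows of the table uses memory proportional to the length of the second string.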

matlab,correlation,categorical-data

That's pretty simple to do. In terms of two-dimensional problems, data that exhibit a positive correlation are those data points where there is clearly an increase in the output for every increase in the independent value r. Similarly, for negative correlation, there is a decrease in the output for every...

If you wanted to extract the unique names of actors, you can get the indicated actors with the as.character function, split it on the commas with strsplit, combine together all vectors in the resulting list with unlist, and grab the unique names with unique: (all.actors <- unique(unlist(strsplit(as.character(actor), ", ")))) #...

r,data,data.frame,categorical-data

You can do it with dplyr very simply:

library(dplyr)
df %>%
  group_by(reason, location) %>%
  summarize(zeros = sum(zero_ones==0), ones = sum(zero_ones==1))
#   reason location zeros ones
# 1      s        c     1    2
# 2      s        h     1    1
# 3      v        c     1    4
# 4      v        h     2    1
...
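
The grouped count of zeros and ones can also be sketched in plain Python. The rows below are a small invented sample, not the data behind the dplyr output.

```python
from collections import defaultdict

# Hypothetical (reason, location, zero_ones) rows
rows = [
    ("s", "c", 0), ("s", "c", 1), ("s", "c", 1),
    ("s", "h", 0), ("s", "h", 1),
    ("v", "c", 1),
]

# summary[(reason, location)] holds the number of 0s and 1s in that group
summary = defaultdict(lambda: {"zeros": 0, "ones": 0})
for reason, location, value in rows:
    summary[(reason, location)]["zeros" if value == 0 else "ones"] += 1

for (reason, location), tally in sorted(summary.items()):
    print(reason, location, tally["zeros"], tally["ones"])
```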

The problem is that you first create a matrix, whose elements must be of the same type, and then you convert to data frame. You have to create the data frame from the beginning: alles <- data.frame(isNoun = as.factor(isNoun), isVerb = as.factor(isVerb), length, labels = as.factor(labels)) ...

dd <- read.table(text=" RACE AGE.BELOW.21 CLASS HISPANIC 0 A ASIAN 1 A HISPANIC 1 D CAUCASIAN 1 B", header=TRUE) with(dd, data.frame(model.matrix(~RACE-1,dd), AGE.BELOW.21,CLASS)) ## RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS ## 1 0 0 1 0 A ## 2 1 0 0 1 A ## 3 0 0 1 1 D ##...
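
As a cross-check of what model.matrix(~RACE-1, dd) produces, here is the same indicator expansion sketched in Python, using the four rows from the read.table call. The variable names are my own.

```python
# (RACE, AGE.BELOW.21, CLASS) rows, copied from the read.table input
rows = [
    ("HISPANIC", 0, "A"),
    ("ASIAN", 1, "A"),
    ("HISPANIC", 1, "D"),
    ("CAUCASIAN", 1, "B"),
]

levels = sorted({race for race, _, _ in rows})

# One indicator column per RACE level (no reference level is dropped)
encoded = [
    {**{f"RACE{lv}": int(race == lv) for lv in levels},
     "AGE.BELOW.21": age, "CLASS": cls}
    for race, age, cls in rows
]

for row in encoded:
    print(row)
```

Because no intercept is kept, every level gets its own column, and each row has exactly one 1 among them.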

python,pandas,distance,similarity,categorical-data

If you're concerned about or reliant on categorical, then another approach is to define your categories in a list and an order, use this to create a dict to map the order to the categories and pass this dict to map: In [560]: df = pd.DataFrame({'id' : range(1,9), 'rate' : ['bad',...

r,correlation,continuous,categorical-data

Since y is not dichotomous, it doesn't make sense to use biserial(). From the documentation: The biserial correlation is between a continuous y variable and a dichotomous x variable, which is assumed to have resulted from a dichotomized normal variable. Instead use polyserial(), which allows more than 2 levels. polyserial()...

This samples uniformly between the minimum and maximum within each group, returning the same number of values as your original dataframe: df = read.table(file='clipboard', header=TRUE) library(plyr) ddply(df, .(Class_age), function(x) { level = x$Class_age[1] min_max = as.numeric(strsplit(as.character(level), '-')[[1]]) x$age = runif(nrow(x), min=min_max[1], max=min_max[2]) return(x) }) Example output: Class_age age 1 20-22...
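
The per-group sampling step can be sketched in Python as well. The helper name sample_age and the seed are assumptions, and only the "20-22" class appears in the example output; "23-25" is made up.

```python
import random

def sample_age(class_age, rng):
    """Draw one age uniformly between the bounds of a 'low-high' class label."""
    low, high = (float(part) for part in class_age.split("-"))
    return rng.uniform(low, high)

rng = random.Random(1)  # the seed is an arbitrary choice
classes = ["20-22", "20-22", "23-25"]
ages = [sample_age(c, rng) for c in classes]
print(ages)
```

One value is drawn per input row, so the result has the same length as the original column.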

What you're describing doesn't sound like a histogram (which is a specific plot that estimates the density of a continuous variable); it sounds like you just want a bar chart. I believe this is what you're looking for: myDF <- data.frame( ref=c("A","A","A","C","C","C","G","G","G","T","T","T"), alt=c("C","G","T","A","G","T","A","C","T","A","C","G"), Freq=c(5,2,3,6,9,6,8,6,7,4,6,4) ) library(ggplot2) ggplot(myDF, aes(ref, Freq,...