
Starting an application when starting Apache Server

java,apache,statistics,server

On Windows: create a batch file that starts both the Apache and Tomcat services one after another. See below: sc start MyService ...

Detect two data stream deviated from each other

algorithm,math,statistics,dynamic-programming

I think you are looking for the correlation coefficient: remember the last N values from both streams (cyclic buffers are ideal for this), compute the correlation coefficient, and detect the similarity drop, for example like this: if (correlation_coefficient < 0.997) return "drop below 99.7%"; ...
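
A minimal Python sketch of that recipe, assuming two synchronized streams (the names, window length and threshold are illustrative, not from the original answer):

    from collections import deque
    import numpy as np

    N = 100                                          # window length
    buf_a, buf_b = deque(maxlen=N), deque(maxlen=N)  # cyclic buffers

    def update(a, b, threshold=0.997):
        """Push one sample from each stream; report when correlation drops."""
        buf_a.append(a)
        buf_b.append(b)
        if len(buf_a) < N:
            return None                              # not enough history yet
        r = np.corrcoef(np.array(buf_a), np.array(buf_b))[0, 1]
        return "drop below 99.7%" if r < threshold else None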

Lasso Generalized linear model in Python

python,statistics,scikit-learn,statsmodels,cvxopt

statsmodels has had for some time a fit_regularized for the discrete models, including NegativeBinomial: http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html, which doesn't have a docstring (I just noticed). The docstring for Poisson has the same information: http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html, and there should be some examples available in the documentation or unit tests. It uses an interior...
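
A minimal sketch of calling the regularized fit on a Poisson model (the data and the alpha value are illustrative; check the statsmodels docs for the full signature):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.RandomState(0)
    X = sm.add_constant(rng.normal(size=(200, 3)))
    y = rng.poisson(lam=np.exp(X @ np.array([0.5, 0.3, 0.0, -0.2])))

    # L1 (lasso-type) penalty; alpha controls its strength
    result = sm.Poisson(y, X).fit_regularized(method='l1', alpha=0.1)
    print(result.params)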

python consecutive counts of an occurrence with length

python,statistics

This is similar to a run length encoding problem, so I've borrowed some ideas from that Rosetta Code page:

    import itertools
    a = [0,0,1,1,1,1,0,0,1,0,1,1,1,0]
    b = []
    for item, group in itertools.groupby(a):
        size = len(list(group))
        for i in range(size):
            if item == 0:
                b.append(0)
            else:
                b.append(size)

    b
    Out[8]: [0, 0, 4, 4,...

“Icon” (ISOTYPE) charts in R shiny with Javascript

javascript,r,plot,statistics,shiny

Answering my own question since, after all, I have found some resources that fit my use case and they seem viable for development. Hopefully it'll come in handy for the community later down the road :) After further investigation, I found the name of "pictogram charts" as an alternative way...

stdtr in python giving nan for p-value while doing t-test

python,statistics,scipy,p-value

The degrees of freedom you are passing to the formula are negative.

    In [6]: import numpy as np
            from scipy.special import stdtr

            dof = -2176568
            tf = -11.374250
            2*stdtr(dof, -np.abs(tf))
    Out[6]: nan

If positive:

    In [7]: import numpy as np
            from scipy.special import stdtr

            dof = 2176568
            tf...

Comparing datasets to nonstandard probability distributions in Python

python,statistics,scipy,probability

(1) "Is it from distribution X" is generally a question which can be answered a priori, if at all; a statistical test for it will only tell you "I have a large sample / not a large sample", which may be true but not too useful. If you are trying...

Random Forest Classifier Matlab v/s Python

python,matlab,machine-learning,statistics,random-forest

The Problem There are many reasons why the implementation of a random forest in two different programming languages (e.g., MATLAB and Python) will yield different results. First of all, note that results of two random forests trained on the same data will never be identical by design: random forests often...

Access variables of a regression model in R

r,statistics

Thanks to Roland. You can get the model string by accessing the summary attributes (have a look at str(s)):

    s <- summary(fit)
    mymod <- paste(attr(s$terms, "term.labels"), collapse=" + ")
    mymod
    [1] "x1 + x2 + x1:x2"

However, you can get the data by passing the model fit to model.matrix:

    model.matrix(fit)...

Angular statistics doesn't make sense

statistics,mean,angle,direction

The values don't look right. Angles that differ by 360 are equivalent, so -28.8551147 == 331.145, which is the arithmetic mean of the two values you provided. If you would like to ensure that your values are always in [0,360), you should add 360 if the values are less...
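
A one-line Python way to keep angles in [0, 360), shown as a sketch (Python's % operator already returns a non-negative result for a positive modulus):

    def normalize_angle(deg):
        """Map an angle in degrees into [0, 360)."""
        return deg % 360.0

    print(normalize_angle(-28.8551147))  # 331.1448853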

Weighted mean in numpy/python

python,numpy,statistics,mean,weighted

You actually have 2 different questions: how to make data discrete, and how to make a weighted average. It's usually better to ask 1 question at a time, but anyway. Given your specification:

    xmin = -100
    xmax = 100
    binsize = 20

First, let's import numpy and make some data:...
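
For the weighted-average half of the question, numpy has this built in; a minimal sketch (the arrays are illustrative):

    import numpy as np

    values = np.array([1.0, 2.0, 3.0])
    weights = np.array([3.0, 1.0, 1.0])

    # weighted mean: sum(w*x) / sum(w)
    print(np.average(values, weights=weights))  # 1.6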

How does /proc/diskstats present those values? And what about /proc/stat and meminfo?

linux,unix,statistics,monitoring,proc

There surely is a .c file, and it's part of the Linux kernel. If you really want to see how it's done, you can start unwinding it e.g. from here: http://lxr.free-electrons.com/source/block/genhd.c?v=3.8. Reading from procfs is not the worst method to get the stats; actually, that's what it's made for....
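
A minimal Python sketch of reading the per-device counters from /proc/diskstats (the field layout is documented in the kernel's Documentation/iostats.txt; which counters you keep is up to you):

    # Parse /proc/diskstats into {device name: list of counters}
    stats = {}
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            # fields[0:3] are major, minor, device name; the rest are counters
            stats[fields[2]] = [int(v) for v in fields[3:]]

    print(stats.get('sda'))  # reads completed, sectors read, ... (device-dependent)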

Find previous hour and next hour in R

r,datetime,statistics

Assuming your time were a variable "X", you can use round or trunc. Try:

    round(X, "hour")
    trunc(X, "hour")

This would still require some work to determine whether the values had actually been rounded up or down (for round). So, if you don't want to have to think about that, you...

How to choose a value between the two, with given probability in Excel?

excel,statistics,probability

As I stated in the comments: generate a random number with RAND, from 0 to 1, and compare it with the probability. If it is bigger, the result is 0; otherwise 1.

    =IF(RAND()>=A1,0,1)

...

C# Linq Dictionary IO

c#,linq,dictionary,io,statistics

I'd suggest creating a custom class which can hold/store the related data. Let it be Statisztika, with the following fields/properties: Day, PersonId, Visitor and CountOfVisits. Statisztika class definition:

    public class Statisztika
    {
        private int iday = 0;
        private int ipersonid = 0;
        private int ivisitor = 0;
        private int icount = 0;
        //class...

Matlab: What are the ways to determine the distribution of the data

matlab,statistics,distribution

If I understand correctly, you are asking how to decide which distribution to choose once you have a few fits. There are three major metrics (IMO) for measuring goodness-of-fit: Chi-Squared, Kolmogorov-Smirnov, and Anderson-Darling. Which to choose depends on a large number of factors; you can randomly pick one or read the...

How to calculate KNN Variable Importance in R

r,machine-learning,statistics,classification

Basically, your question boils down to having some variables (Word1, Word2, and Word3 in your example) and a binary outcome (Author in your example) and wanting to know the importance of different variables in determining that outcome. A natural approach would be training a regression model to predict the outcome...

Rolling average pairwise correlation - code doesn't work as expected

r,statistics,time-series,correlation,xts

What about using rollapply in a different way? As you don't supply the complete dataset, here is a demonstration of what I mean:

    set.seed(123)
    m <- matrix(rnorm(100), ncol = 10)
    rollapply(1:nrow(m), 5, function(x) cor.mean(m[x,]))
    [1] -0.080029692 -0.038168840 -0.058443824  0.005699772 -0.014459878 -0.021569173

As I just figured out, you can also use the function...

Detecting significant changes in data

c#,algorithm,data,graph,statistics

A simple way is to calculate the difference between every two neighbouring samples, e.g. diff = abs(y[x point 1] - y[x point 0]), and calculate the standard deviation of all the differences. This will rank the differences in order for you and also help eliminate random noise, which you get if...
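
A minimal numpy sketch of this approach (the 2-standard-deviation threshold is an illustrative choice, not from the original answer):

    import numpy as np

    y = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0])  # toy samples

    diffs = np.abs(np.diff(y))          # |y[i+1] - y[i]| per neighbour pair
    significant = np.where(diffs > 2 * diffs.std())[0]
    print(significant)                  # [3]: the jump between samples 3 and 4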

Weighted Average by Year in ragged data frame in R

r,statistics,weighted-average

There are many ways. I would prefer via a data.table. First convert your data into a data.table:

    require(data.table)  # tested in data.table 1.9.4
    setDT(mydata)
    > mydata
       Fruit.Type Year Primary.Wgt Primary.Loss.PCT Retail.Wgt Retail.Loss.PCT
    1:  Oranges.F 1970       16.16                3      15.68            11.6
    2:  Oranges.F 1971       15.73                3      15.26            11.6
    3:  Oranges.F 1972       14.47                3...

How to perform operation on each row in R (apply()?)

r,statistics

You could use apply to generate a list of the Fisher test results:

    tests <- apply(data2, 1, function(x) fisher.test(matrix(na.omit(x), nrow=2, byrow=TRUE)))

Then you could access row-specific tests with standard list indexing:

    tests[[4]]
    # Fisher's Exact Test for Count Data
    #
    # data: matrix(na.omit(x), nrow = 2, byrow = TRUE)
    #...

R optimization with equality and inequality constraints

r,statistics,mathematical-optimization,minimization

On this occasion optim will not work, obviously, because you have equality constraints. constrOptim will not work either, for the same reason (I tried converting the equality into two inequalities, i.e. greater and less than 15, but this didn't work with constrOptim). However, there is a package dedicated to this...

R: how to use the bpower function to calculate 2-sample binomial test power

r,statistics

The signature for the function is:

    args(bpower)
    # function (p1, p2, odds.ratio, percent.reduction, n, n1, n2, alpha = 0.05)

so if unnamed, the third parameter will be interpreted as the odds ratio. So yes, you made a mistake in your code. You should explicitly name your parameters to avoid this...

Precipitation history of a city

statistics,weather

Weather Underground has free historical data. Here's data for Helsinki from January 1, 2015 through June 18, 2015 (for example). You can customize the data range and manually download as a CSV file.

Java generate random number based on boxplot

java,statistics,boxplot,random-sample

Supposing that the min, a, median, b, max values separate the quartiles of the distribution (http://en.wikipedia.org/wiki/Quartile):

    static public double next(Random rnd, double median, double a, double b, double min, double max) {
        double d = -3;
        while (d > 2.698 || d < -2.698) {
            d = rnd.nextGaussian();
        }
        if (Math.abs(d) < 0.6745)...

walking through two tables (matrices) matching columns and applying a function in R

r,statistics

Try:

    mapply(function(x,y) t.test(x,y)$p.value, as.data.frame(controls), as.data.frame(patients))
    #       V1        V2        V3        V4        V5        V6        V7        V8
    #0.8481788 1.0000000 0.4605294 1.0000000 0.6436604 1.0000000 1.0000000 1.0000000
    #       V9       V10       V11
    #1.0000000 1.0000000 1.0000000

assuming that "controls" and "patients" are matrix data:

    controls <- structure(c(1253, 2311.3, 1314.83, 9.88, 12.74, 11.39, 20.8, 6.82, 18.12, 17.88, 17.88,...

Calculating Kendall's tau using scipy and groupby

python,pandas,statistics,scipy

One way to calculate this is to use apply on the groupby object:

    >>> import scipy.stats as st
    >>> df.groupby(['station_id']).apply(lambda x: st.kendalltau(x['year'], x['Sum']))
    station_id
    210018    (-0.2, 0.62420612399)
    215400     (0.4, 0.327186890661)
    dtype: object

Cannot generalize my Genetic Algorithm to new Data

statistics,genetic-algorithm,prediction,generalization

The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines: A GA will learn to do exactly...

R by() round stat.desc

r,statistics,rounding

You can explicitly pass the value to the function by defining it in FUN=function(). Using the mtcars dataset:

    library(pastecs)

Note that the first line is an abbreviated version of the second:

    by(mtcars$mpg, mtcars$am, stat.desc, norm = TRUE, basic = TRUE)
    by(mtcars$mpg, mtcars$am, function(X) stat.desc(X, norm = TRUE, basic = TRUE))...

Memory-efficient Benjamini-Hochberg FDR correction using numpy/h5py

python,numpy,statistics,hdf5,h5py

In case anybody else stumbles across this: the way I solved this was to first extract all p-values that had a chance of passing the FDR correction threshold (I used 1e-5). Memory consumption was not an issue for this, since I could just iterate through the list of p-values on disk....
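
A sketch of that two-pass idea in Python (the file name, dataset name and chunk size are illustrative; the 1e-5 pre-filter is the one mentioned above):

    import numpy as np
    import h5py

    CUTOFF = 1e-5      # pre-filter: only candidates below this can pass BH
    CHUNK = 1000000

    # Pass 1: stream through the on-disk p-values, keeping only candidates
    candidates = []    # (index, p-value) pairs
    with h5py.File('pvalues.h5', 'r') as f:     # hypothetical file/dataset
        pvals = f['pvals']
        n = pvals.shape[0]                      # total number of tests
        for start in range(0, n, CHUNK):
            chunk = pvals[start:start + CHUNK]
            idx = np.where(chunk < CUTOFF)[0]
            candidates.extend(zip(idx + start, chunk[idx]))

    # Pass 2: Benjamini-Hochberg on the candidates, correcting with the full n
    candidates.sort(key=lambda t: t[1])
    alpha, k = 0.05, 0
    for rank, (i, p) in enumerate(candidates, 1):
        if p <= rank * alpha / n:
            k = rank                            # largest rank passing BH
    rejected = candidates[:k]                   # hypotheses rejected at FDR alpha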

Regression loop in R for data frames

r,loops,statistics,data.frame,regression

Here is a quick rewrite of your code; this should give you what you are looking for. Assigning a value for each column is unnecessary, since myData should be a data.frame; as such, you can access each column by its column name.

    rm(list=ls())
    myData <- read.csv(file="C:/Users/Documents/myfile.csv", header=TRUE, sep=",")
    for(i in names(myData)) {...

Fill in missing rows with R data.table

r,statistics,data.table

With help from @akrun and @eddi, here's the idiomatic (?) way:

    mycols = c("description","date","location")
    setkeyv(DT0,mycols)
    DT1 <- DT0[J(do.call(CJ,lapply(mycols,function(x)unique(get(x)))))]
    # alternately:
    DT1 <- DT0[DT0[,do.call(CJ,lapply(.SD,unique)),.SDcols=mycols]]

The identifier column is missing for the new rows, but can be filled:

    setkey(DT1,description)
    DT1[unique(DT0[,c("description","identifier"),with=FALSE]),identifier:=i.identifier]

Multiply Probability Distribution Functions

r,statistics,probability-density

The efficient way to do this sort of operation is to use a convolution:

    convolve(a, rev(b), type="open")
    # [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05

This is efficient both because it's less typing than computing each value individually and also because it's implemented in an efficient way (using the...
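
The same idea as a Python sketch: np.convolve(a, b) corresponds to R's convolve(a, rev(b), type="open"), and convolving two PMFs gives the PMF of the sum of the two independent variables (the example PMFs are illustrative):

    import numpy as np

    a = np.array([0.1, 0.3, 0.6])    # PMF of X
    b = np.array([0.25, 0.5, 0.25])  # PMF of Y

    pmf_sum = np.convolve(a, b)      # PMF of X + Y
    print(pmf_sum, pmf_sum.sum())    # the result still sums to 1.0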

Arima.sim issues in R

r,math,statistics,time-series,forecasting

You seem to be confused between modelling and simulation. You are also wrong about auto.arima(). auto.arima() does allow exogenous variables via the xreg argument. Read the help file. You can include the exogenous variables for future periods using forecast.Arima(). Again, read the help file. It is not clear at all...

Automatically truncating a curve to discard outliers in matlab

matlab,plot,statistics,outliers

This is a fairly general problem with lots of approaches, usually you will use some a priori knowledge of the underlying system to make it tractable. So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast...

Include values to the barplot and pie charts in R

r,statistics,bar-chart,pie-chart

Here is a general approach: pie has a labels parameter, so you can just use that; use \n for a line break, and add text under the name. barplot has a return value, which is the x-coordinates of each bar, so just use text along with the data (for the...

how to: Separate a continuous variable by % proportions?

statistics,stata

First, make good use of Stata's help files: e.g., search percentiles returns a list of possible commands. Two commands that will likely be of use are summarize (with the detail option; note that you can use return list afterwards to view/store results [regardless of whether the detail option was specified])...

Draw geom_smooth only for fits that are significant

r,ggplot2,statistics

You can calculate the p-values by group and then subset in geom_smooth (per the commenters):

    # Determine p-values of regression
    p.vals = sapply(unique(d$z), function(i) {
      coef(summary(lm(y ~ x, data=d[z==i, ])))[2,4]
    })

    plt <- ggplot(d) + aes(x=x, y=y, color=z) + geom_point()

    # Select only values of z for which regression p-value...

R dtw package: cumulative cost matrix decreases at some points along the path?

r,statistics

This is due to the use of a "multi-step" recursion like asymmetricP05. Such a pattern allows the warping path to be composed of long segments, e.g. knight's moves. To verify the monotonicity, you should only consider the starting positions of each of the "knight's moves" - not all of the...

Removing outliers in one step

r,data,statistics,analytics,outliers

I believe that "outlier" is a very dangerous and misleading term. In many cases it means a data point which should be excluded from analysis for a specific reason. Such a reason could be that a value is beyond physical boundaries because of a measurement error, but not that "it...

P-value, significance level and hypothesis

statistics,p-value

You are mistaken about what the significance level means in terms of the p-value. I will try to explain below. Let's assume a test of whether the means of two populations are equal. We will perform a t-test by drawing one sample from each population and calculating the p-value. The...

Removing a prior sample while using Welford's method for computing single pass variance

algorithm,math,statistics,variance,standard-deviation

Given the forward formulas

    M_k = M_{k-1} + (x_k - M_{k-1}) / k
    S_k = S_{k-1} + (x_k - M_{k-1}) * (x_k - M_k),

it's possible to solve for M_{k-1} as a function of M_k, x_k and k:

    M_{k-1} = M_k - (x_k - M_k) / (k - 1)....
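
A small Python sketch of both directions (the class name is illustrative; remove() assumes x was actually one of the samples added earlier and that at least two samples remain):

    class RunningStats:
        def __init__(self):
            self.k = 0      # sample count
            self.M = 0.0    # running mean
            self.S = 0.0    # running sum of squared deviations

        def add(self, x):
            self.k += 1
            old_M = self.M
            self.M += (x - old_M) / self.k
            self.S += (x - old_M) * (x - self.M)

        def remove(self, x):
            # invert the forward update: recover the previous mean first
            old_M = self.M - (x - self.M) / (self.k - 1)
            self.S -= (x - old_M) * (x - self.M)
            self.M = old_M
            self.k -= 1

        def variance(self):
            return self.S / (self.k - 1)   # sample variance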

R - Need help simulating estimators

r,statistics,normal-distribution

Since both T1 and T2 rely on X1, X2, Y1, and Y2, you should first simulate those four random variables:

    X1 <- rnorm(1e4, mu1, sigma)
    X2 <- rnorm(1e4, mu1, sigma)
    Y1 <- rnorm(1e4, mu2, sigma)
    Y2 <- rnorm(1e4, mu2, sigma)

Then you can run your code to get all simulated...

Statsmodels - Wald Test for significance of trend in coefficients in Linear Regression Model (OLS)

python,statistics,linear-regression,statsmodels

A linear hypothesis has the form R params = q, where R is the matrix that defines the linear combination of parameters and q is the hypothesized value. In the simple case where we want to test whether some parameters are zero, the R matrix has a 1 in the...
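
A sketch of how this looks with statsmodels OLS (the data and the tested restriction are illustrative):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.RandomState(0)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)

    res = sm.OLS(y, X).fit()

    # one row of R per restriction; here: test that the third coefficient is 0
    R = np.array([[0.0, 0.0, 1.0]])
    q = np.array([0.0])
    print(res.wald_test((R, q)))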

How do you know if a data set is right for linear regression if it has multiple features?

machine-learning,statistics,linear-regression

So linear regression assumes your data is linear, even in multiple dimensions. It won't be possible to visualize high-dimensional data unless you use some method to reduce the dimensionality. PCA can do that, but bringing it down to 2 dimensions won't be helpful. You should do Cross...

Splitting strings from integers in R

r,statistics

You could do:

    df <- data.frame(V1 = c("adad131341", "adadar45365", "cavsbsb425", "daadvsv46567567"))

    library(dplyr)
    library(stringr)

    df %>% mutate(V2 = str_extract(V1, "[0-9]+"),
                  V3 = str_extract(V1, "[a-zA-Z]+"))

Which gives:

    #                V1       V2      V3
    #1      adad131341   131341    adad
    #2     adadar45365    45365  adadar
    #3      cavsbsb425      425 cavsbsb
    #4 daadvsv46567567 46567567 daadvsv

Pandas TimeGrouper: Drop “non full groups”

python,pandas,statistics

I don't think the issue here really concerns options available with TimeGrouper, but rather, how you want to deal with uneven data. You basically have 4 options that I can think of: 1) Drop enough observations (at the start or end) such that you have a multiple of 2 years...

Is R able to compute contingency tables on big file without putting the whole file in RAM?

r,file-io,statistics,contingency

You can use the package ff for this, which uses the hard disk drive instead of RAM, but it is implemented in a way that doesn't make it (significantly) slower than the normal way R uses RAM. This is from the package description: The ff package provides data structures...

OLS Breusch Pagan test in Python

python,statistics,canopy,pysal

The problem is that the regression results instance of statsmodels is not compatible with the one in pysal. You can use breushpagan from statsmodels (spelled het_breuschpagan in current releases), which takes OLS residuals and candidate explanatory variables for the heteroscedasticity, and so it does not rely on a specific model or implementation of a...
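
A sketch of the statsmodels route (the data is illustrative; in current statsmodels the function lives at statsmodels.stats.diagnostic.het_breuschpagan):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.RandomState(0)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    # heteroscedastic noise: variance grows with |x1|
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=200) * (1 + np.abs(X[:, 1]))

    res = sm.OLS(y, X).fit()

    # test the residuals against the candidate explanatory variables
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
    print(lm_pvalue, f_pvalue)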

Error with applying data type in method to variables in main tester class

java,arrays,methods,statistics

There are two errors. First: when you do test.findMin(num), you are trying to pass the parameter num, but num is not an array! It is a number. You probably want to do this: test.findMin(array). Second: you can implicitly convert an integer to a double, because you can be...

How to explain a higher percentage of point variability using kmeans clustering? [closed]

r,statistics,cluster-analysis,k-means

The amount of variance explained is related to the two principal components calculated to visualize your data. This has nothing to do with the type of clustering algorithm or the accuracy of the algorithm that you're using (kmeans in this case). To understand how accurate your clustering algorithm is at...

Python p value from t-statistic giving nan

python,statistics,scipy,p-value

Re what happens here internally: the Student t distribution is defined for dof > 0, at least in scipy.stats (http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.t.html). Hence a nan:

    In [11]: stats.t.sf(-11, df=10)
    Out[11]: 0.99999967038443183

    In [12]: stats.t.sf(-11, df=-10)
    Out[12]: nan

How to compare two products based on their ratings?

statistics

See these links:

http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
http://www.evanmiller.org/bayesian-average-ratings.html
http://www.evanmiller.org/ranking-items-with-star-ratings.html

On closer inspection of the last link: he is just calculating mean - standard error of the mean and using that for ranking. ...
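
The first link ranks items by the lower bound of the Wilson score confidence interval of the positive-rating proportion; a Python sketch (z = 1.96 gives a 95% interval):

    import math

    def wilson_lower_bound(pos, n, z=1.96):
        """Lower bound of the Wilson score interval for a proportion."""
        if n == 0:
            return 0.0
        phat = pos / n
        centre = phat + z*z / (2*n)
        margin = z * math.sqrt((phat*(1 - phat) + z*z/(4*n)) / n)
        return (centre - margin) / (1 + z*z/n)

    # 90/100 positive ratings ranks above 9/10, despite equal averages
    print(wilson_lower_bound(90, 100))  # ~0.826
    print(wilson_lower_bound(9, 10))    # ~0.596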

Java: Probability distribution support

java,math,statistics

You'd be looking at java.lang.Math, and from the docs: "The class Math contains methods for performing basic numeric operations such as the elementary exponential, logarithm, square root, and trigonometric functions." So, in short, no. For distributions other than Gaussian you're going to have to look elsewhere. In terms of third...

Calculate variance in bash

linux,bash,awk,statistics,variance

The standard deviation formula is described at http://www.mathsisfun.com/data/standard-deviation.html. So basically you need to say:

    for each item: sum += (item - average)^2
    variance = sum / #items

Doing it on your sample input:

    5   av = 5/1 = 5       var = (5-5)^2 / 1 = 0
    5   av = 10/2 = 5      var = [(5-5)^2 + (5-5)^2] / 2 = 0
    5   av = 15/3 = 5      var = 3*(5-5)^2 / 3 = 0
    10  av = 25/4 = 6.25   var = [3*(5-6.25)^2 + (10-6.25)^2] / 4 = 4.6875

So in awk we can say:

    $ awk 'BEGIN {FS=OFS=","}...

How to compute Minitab-equivalent quartiles using NumPy

python,numpy,statistics,minitab

Here's an attempt to implement Minitab's algorithm. I've written these functions assuming that you've already dropped missing observations from the series a:

    # Drop missing obs
    x = df.aquatic[~pd.isnull(df.aquatic)]

    def get_quartile1(a):
        a = a.sort(inplace=False)
        pos1 = (len(a) + 1) / 4.0
        round_pos1 = int(np.floor((len(a) + 1) / 4.0))
        first_part...

How do I calculate the standard deviation between weighted measurements?

statistics,standard-deviation,weighted-average

I just found this Wikipedia page discussing data of equal significance vs weighted data. The correct way to calculate the biased weighted estimator of variance is

    sigma^2 = sum_i[ w_i * (x_i - mu*)^2 ] / sum_i[ w_i ]

(with mu* the weighted mean), though this on-the-fly implementation is more efficient computationally as it does not require calculating the weighted average before looping over the sum on...
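
A direct numpy sketch of that estimator (the arrays are illustrative):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 5.0])   # measurements
    w = np.array([1.0, 2.0, 1.0, 1.0])   # weights

    mu = np.average(x, weights=w)                      # weighted mean
    var_biased = np.average((x - mu)**2, weights=w)    # biased weighted variance
    print(np.sqrt(var_biased))                         # weighted standard deviation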

Generate random numbers with pre-defined median in Matlab ?

matlab,random,statistics,median

The Matlab function rand generates (pseudo)random numbers uniformly distributed on the interval [0,1]. The median of this distribution is 0.5. You can make the median be m by adding m-0.5 to each number. The function

    function array = generateNumbers(m, n, medianValue)
        array = rand(m,n) - 0.5 + medianValue;
    end

returns a random...

Generalized additive models for calibration

r,statistics,probability,prediction,calibration

The warning is telling you that predict.gam doesn't recognize the value you passed to the type parameter. Since it didn't understand, it decided to use the default value of type, which is "terms". Note that predict.gam with type="terms" returns information about the model terms, not probabilities. Hence the output values...

How can I reformat a table in R?

r,statistics,reformatting

Using dcast from reshape2:

    library(reshape2)
    dcast(dat, V1~V2, fill=0)
        V1 1 2 3 4
    1 pat1 2 0 1 2
    2 pat2 0 0 3 0
    3 pat3 4 3 0 0

Where dat is:

    dat <- read.table(text='V1 V2 V3
    pat1 1 2
    pat1 3 1
    pat1 4 2
    pat2 3...

Which statistical measures for 4 class classification?

machine-learning,statistics,classification,multilabel-classification

There are several available metrics, described in the following paper: Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437. See Table 3 on page 4 (430) - it contains brief description and formula for 8 metrics; choose the...

What is the relationship between marks and covariates in point process

process,statistics,point,spatstat

For point processes, the distinction between marks and covariates is: marks are values attached to each point (settlement), and often a mark is not meaningful at other locations; covariates are conceptually meaningful/available throughout the entire survey region (observation window). Mark values can in principle be anything, but basically only two...

Test for statistical difference in proportions using R

r,statistics,glm

Well, you should first look a bit into regression analysis, as has been commented; you have some issues in understanding there. But this is what you want:

    obsGroupA <- round(runif(40, 240, 63535))
    obsGroupB <- round(runif(40, 2478, 95063))
    obsGroupC <- round(runif(40, 3102, 104799))

    propGroupA <- obsGroupA/(obsGroupA + obsGroupB + obsGroupC)
    propGroupB...

Histogram-like summary for interval data

r,statistics,histogram

Using IRanges, you should use findOverlaps or mergeByOverlaps instead of countOverlaps. By default it doesn't return non-matches, though; I'll leave that to you. Instead, I will show an alternate method using foverlaps() from the data.table package:

    require(data.table)
    subject <- data.table(interval = paste("int", 1:4, sep=""),
                          start = c(2,10,12,25),
                          end = c(7,14,18,28))
    query...

Which scaling technique does it use?

matlab,statistics,matlab-guide,matlab-deployment

That scaling comes from linear algebra. That's what we call normalizing by producing a unit vector. Assuming that each row is an observation and each column is a feature, what's happening here is that we are going through every observation that you collected and normalizing each feature value over all...

python: finding the value of a random variable for a cdf

python,statistics,scipy,normal-distribution,cdf

Edit: you actually need to import norm from scipy.stats. I found the answer. You need to use ppf in scipy.stats, which stands for "percent point function". So let's say you have a normal distribution with stdDev = 1 and mean = 0, and you want to find the value at which...
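
A minimal sketch (the 95th percentile is an illustrative choice):

    from scipy.stats import norm

    # the value x such that P(X <= x) = 0.95 for X ~ N(0, 1)
    x = norm.ppf(0.95, loc=0, scale=1)
    print(x)             # ~1.6449

    # ppf is the inverse of cdf
    print(norm.cdf(x))   # 0.95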

How the function auto.arima() in R determines d?

python,r,statistics,statsmodels

This link says it is determined using repeated KPSS tests. I see no reason why it couldn't be implemented in Python; it would just need to be written. Otherwise, you could use rpy2 and just call auto.arima from Python:

    from rpy2 import *
    import rpy2.robjects as RO
    RO.r('library(forecast)')
    # use...

The confidence interval by the intercept with linear regression in R

r,statistics

The broom package can return confidence intervals for regression model estimates:

    require(broom)
    A <- c(12,11,12,15,13,16,13,18,11,14)
    B <- c(50,51,62,45,63,76,53,68,51,74)
    model <- lm(A~B)
    tidy(model, conf.int = TRUE, conf.level = 0.99)

             term  estimate  std.error statistic   p.value    conf.low conf.high
    1 (Intercept) 6.8153948 3.75608761  1.814493 0.1071515 -5.78773401 19.418524
    2           B 0.1127252 0.06240674  1.806299 0.1085031 -0.09667358...

Python, beautifulsoup scraping specific or exact numbers from a stat table

python,table,statistics,beautifulsoup,screen-scraping

I think this is more along the lines of what you are looking for. You can't filter the year the way you were trying to; you have to have an if statement and filter it out yourself.

    from bs4 import BeautifulSoup
    from urllib import urlopen

    url = 'http://www.nfl.com/player/tombrady/2504211/careerstats'
    html =...

Subtract mean from a variable by group in bigmemory in R

r,functional-programming,statistics

The simplest approach would probably be to use the bigsplit function and a for loop for in-place modification:

    idx <- bigsplit(data, 1)
    for(i in seq(length(idx))){
      data[idx[[i]],2] <- data[idx[[i]],2] - mean_var1[i]
    }

It appears like you will want the former, but if you wanted a subset returned of a reasonable size...

scipy.mstats.theilslopes error in confidence limit if data have missing values

python,statistics,scipy

In the mstats module of scipy.stats, "missing values" are handled using a masked array; nan does not indicate a missing value. The following shows how you can convert your array y (which uses nan for missing values) into a masked array my:

    In [48]: x = np.arange(12)
    In [49]: y...
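
A sketch of that conversion followed by the theilslopes call (the arrays are illustrative):

    import numpy as np
    from scipy.stats import mstats

    x = np.arange(12, dtype=float)
    y = 2.0 * x + 1.0
    y[[3, 7]] = np.nan                 # pretend two values are missing

    my = np.ma.masked_invalid(y)       # mask the nans rather than passing them
    slope, intercept, lo, hi = mstats.theilslopes(my, x)
    print(slope, lo, hi)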

R: How should I specify the “init” and “fixed” parameters in the arima function in R?

r,statistics

Please read the documentation carefully. help(arima) clearly tells you that init relates to the initial values of parameters:

    init: optional numeric vector of initial parameter values. Missing values
          will be filled in, by zeroes except for regression coefficients.
          Values already specified in fixed will be ignored.

Similarly, fixed also relates...

Statistics of region of numpy array

python,numpy,statistics

You can do this easily with pandas:

    import numpy as np
    import pandas as pd

    data = np.random.random(20)
    stds = pd.rolling_std(data, window=7, center=True, min_periods=1)  # min_periods to get the edges

(In modern pandas the equivalent is pd.Series(data).rolling(7, center=True, min_periods=1).std().)

Disaggregate one row of data to multiple rows

r,excel,statistics,dataset,google-adwords

Try:

    library(data.table)
    setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
                      Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)))
               , Keyword]
    #       Keyword Clicked Converted
    #  1: SampleName       1         1
    #  2: SampleName       1         1
    #  3: SampleName       1         0
    #  4: SampleName       1         0
    #  5: SampleName       1         0
    #  6: SampleName       0         0
    #  7: SampleName       0         0...

How many degrees of freedom R package: 'fGarch'

r,statistics

garchFit uses 4 degrees of freedom by default, specified in the shape parameter, so I'd say if you didn't set it to be estimated with include.shape=TRUE, the degrees of freedom are the default, fixed at 4.

How scipy.stats handles nans?

python,numpy,statistics,scipy,missing-data

You cannot mathematically perform a test statistic based on nan. Unless you find proof/documentation of special treatment of nan, you cannot rely on it. My experience is that, in general, even numpy does not treat nan specially, for example for median; instead the results are whatever they happen...

Count occurrence of unique variables

r,count,statistics

Tabulating and collapsing. Your example vector is:

    vec <- letters[c(1,2,2,2,3,3,4,5,6)]

To get a tabulation, use:

    tab <- table(vec)

To collapse infrequent items (say, with counts below two), use:

    res <- c(tab[tab>=2], other=sum(tab[tab<2]))
    #     b     c other
    #     3     2     4

Displaying in two columns:

    resdf <- data.frame(count=res)
    # count
    # b...

Mathematica: difficulty using Multinormal Distribution and InverseCDF functions

statistics,wolfram-mathematica,normal-distribution,cdf

1) MultinormalDistribution is now built in, so don't load MultivariateStatistics unless you are running version 7 or older. If you do, you'll see MultinormalDistribution colored red, indicating a conflict. 2) This works:

    sig = .5; u = .5;
    dist = MultinormalDistribution[{0, 0}, sig IdentityMatrix[2]];
    delta = CDF[dist, {xx, yy}]...

How to convert a list of tables into one big table in R

r,list,table,statistics,do.call

This produces your desired output, if I understand it correctly:

    sink("output.txt")
    for (i in seq_along(z)) {
      cat(names(z)[i], '\n') # print out the header
      write.table(z[[i]], row.names = FALSE, col.names = FALSE)
    }
    sink()

I open a connection to a text file with sink, then loop over your list of tables and...

How to specify gamma distribution using shape and rate in Python?

python,statistics,scipy

The inverse scale (1/scale) is the rate parameter. So if you have shape and rate, you can create a gamma rv with this code:

    >>> from scipy.stats import gamma
    >>> rv = gamma(shape, scale = 1.0/rate)

Read more about the different parametrizations of the Gamma distribution on Wikipedia: http://en.wikipedia.org/wiki/Gamma_distribution...

Sequence following percentage

algorithm,statistics

This is a problem known as Longest Increasing Subsequence, and there are O(n log n) algorithms for it. To find the percentage, you just have to compute LIS(V)/length(V). Here's an example implementation (O(n^2)) in Python. EDIT: changed the code to clearly point out where an O(n) step can be turned into...
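
For reference, a compact O(n log n) length-only version using bisect (a sketch of the standard patience-sorting technique, not the answer's original O(n^2) code; it counts strictly increasing subsequences):

    import bisect

    def lis_length(v):
        """Length of the longest strictly increasing subsequence of v."""
        tails = []  # tails[k]: smallest tail of an increasing subsequence of length k+1
        for x in v:
            i = bisect.bisect_left(tails, x)
            if i == len(tails):
                tails.append(x)
            else:
                tails[i] = x
        return len(tails)

    v = [3, 1, 4, 1, 5, 9, 2, 6]
    print(lis_length(v) / len(v))   # 0.5: half the points follow the sequence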

Generate random numbers with logarithmic distribution and custom slope

javascript,algorithm,math,statistics,distribution

You mention a logarithmic distribution, but it looks like your code is designed to generate a truncated geometric distribution instead, although it is flawed. There is more than one distribution called a logarithmic distribution, and none of them is that common. Please clarify whether you really do mean one of...

Pruning rule based classification tree (PART algorithm)

r,statistics,classification,decision-tree,rweka

In order to prune the tree with PART, you need to specify it in the control argument of the function. There is a complete list of the commands you can pass into the control argument here; I quote some of the options which are relevant to pruning: Valid options...

SAS statistic parameters of specific variables from data with conditions

variables,statistics,sas,conditional-statements,proc

I wouldn't create a separate data step as suggested by Alex A. That can be a bad habit to develop as, with large datasets, it can be extremely costly in terms of CPU. Rather, I would subset the Proc Means call, but slightly differently from Alex A.'s suggestion, since you...

Get a variables value from one dataset if falling in a range defined by two variables in another dataset in R

r,date,statistics,dataset

Here is a solution based on the excellent foverlaps of the data.table package:

    library(data.table)
    ## coerce characters to dates (numeric)
    setDT(x)[,c("date1","date2"):=list(as.Date(date1,"%d/%m/%Y"),
                                       as.Date(date2,"%d/%m/%Y"))]
    ## add a dummy date since foverlaps looks for start,end columns
    setDT(y)[,c("date1"):=as.Date(date,"%d/%m/%Y")][,date:=date1]
    ## y must be keyed
    setkey(y,id,date,date1)
    foverlaps(x,y,by.x=c("id","date1","date2"))[, list(id,i.date1,date2,date,price)]

       id    i.date1      date2       date price
    1:  A...

Can't use scipy stats function on nested list

python,numpy,statistics,scipy,nested-lists

You need to apply it to a numpy array reflecting the nested lists:

    from scipy import stats
    import numpy as np

    dataset = np.array([[1.5,3.3,2.6,5.8],[1.5,3.2,5.6,1.8],[2.5,3.1,3.6,5.2]])
    stats.mstats.zscore(dataset)

works fine.

Why equation (5) is equal to equation (6)? [closed]

math,statistics

This is not an example of a Lagrange multiplier, and the two equations are not equivalent. However, the paper doesn't claim this: the text states that formula (5) is "modified" to get formula (6). Using a Lagrange multiplier would lead to a coupled system of two equations. Note how formula...

How to work with linear regression and confidence intervals in R?

r,statistics

Try this:

    > m <- lm(B~A)
    > predict(m, newdata=data.frame(A=14), interval='confidence', level=0.9)
           fit      lwr      upr
    1 60.58495 54.72854 66.44135

R plot 'Heat map' of set of draws

r,plot,statistics,confidence-interval

So, the hard part of this is transforming your data into the right shape, which is why it's nice to share something that really looks like your data, not just a single column. Let's say your data is a matrix with 10,000 rows and 10 columns. I'll just use...

Quantifying importance of variables in Canonical Correspondence Analysis using R? (x-post from researchgate)

r,statistics,correlation,vegan

You want the anova() method that vegan provides for cca(), the function that does CCA in the package, if you want to test effects in a current model. See ?anova.cca for details and perhaps the by = "margin" option to test marginal terms. To do stepwise selection you have two...

Matlab: trying to estimate multifractal spectrum from time series by histogram box-counting

matlab,statistics,time-series,histogram,fractals

Your code seems to be generally bug-free, but I made some changes, since you perform needless repetitions over loops (I moved the outer loop inside and "vectorized" it, since all moment calculations can be performed simultaneously for a given histogram; also, it is building the histogram that takes longest). intel...

ehCache Statistics with spring boot

statistics,spring-boot,ehcache

If you can afford to use a snapshot release of Spring Boot, this feature is being added in 1.3.0. Right now you won't get it in 1.2.X.

Normal probability density function - GSL equivalent in Haskell

haskell,statistics,gsl

Here's an example which uses random-fu:

    import Data.Random    -- for randomness
    import Text.Printf    -- for printf
    import Data.Foldable  -- for the for_ loop

    -- pdf and cdf are basically "Distribution -> Double -> Double"

    main = do
      -- defining normal distribution with mean = 10 and variation = 2
      let...

Caret package for R. Which samples are held out?

r,statistics,r-caret

Let's see this through an example. First of all, you need to specify another argument in the trainControl function, returnResamp='all', so that it returns info on all resamples. Example data:

    # classification example
    y <- rep(c(0,1), c(25,25))
    x1 <- runif(50)
    x2 <- runif(50)
    df <- data.frame(y,x1,x2)

Solution: your code should be...

Integrating Power pdf to get energy pdf?

statistics,probability

Thanks for the clarification. From what I can tell, you don't have enough information to solve this problem properly. Specifically, you need to have some estimate of the dependence of power from one time step to the next. The longer the time step, the less the dependence; if the steps...