python,machine-learning,scikit-learn,data-analysis

1) Buy a machine with such a stupidly huge amount of RAM that you can cache the whole Gram matrix. Cache size has the biggest performance impact on LibSVM, which is what scikit-learn uses. 2) Use a different algorithm. Scikit-learn is already calling out to LibSVM, which is...

python,excel,algorithm,pandas,data-analysis

The problem is the use of the built-in max function; np.maximum works (perhaps np.ma.max as well, per the NumPy documentation). Apparently the regular max cannot deal with arrays (easily). Replacing baseline=max(frame['level'],frame['level'].shift(1)) # doesn't work with baseline=np.maximum(frame['level'],frame['level'].shift(1)) does the trick. I removed the other part to make it easier to read: In [23]: #q 1 analysis def...
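
A minimal sketch of the difference, assuming a small made-up frame with a numeric 'level' column (the data is illustrative, not taken from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the asker's frame.
frame = pd.DataFrame({"level": [1.0, 3.0, 2.0, 5.0]})
previous = frame["level"].shift(1)  # NaN, 1.0, 3.0, 2.0

# The built-in max compares the two Series as whole objects, which needs
# a single True/False answer and therefore raises for multi-element Series.
try:
    max(frame["level"], previous)
    builtin_max_failed = False
except ValueError:
    builtin_max_failed = True

# np.maximum compares element by element and returns a Series.
baseline = np.maximum(frame["level"], previous)
```

The first element of baseline is NaN because shift(1) leaves nothing to compare against; np.fmax would return the non-NaN value there instead, if that is what you want.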

python,numpy,pandas,scipy,data-analysis

Pandas is not particularly revolutionary and does use the NumPy and SciPy ecosystem to accomplish its goals, along with some key Cython code. It can be seen as a simpler API to the functionality with the addition of key utilities like joins and simpler group-by capability that are particularly useful...

You can specify the level in the mean method: s.mean(level=0) # or: s.mean(level='deviceid') This is equivalent to grouping by the first level of the index and taking the mean of each group: s.groupby(level=0).mean()...
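
A small sketch of the groupby form, assuming a Series with a two-level index named 'deviceid' and 'time' (the values are made up). As far as I know, the level argument of mean was removed in pandas 2.0, so the groupby spelling is the one that works across versions:

```python
import pandas as pd

# Hypothetical two-level index: device id, then a time step.
index = pd.MultiIndex.from_tuples(
    [("dev1", 0), ("dev1", 1), ("dev2", 0), ("dev2", 1)],
    names=["deviceid", "time"],
)
s = pd.Series([10.0, 20.0, 30.0, 50.0], index=index)

# Group by the first index level and average each group.
per_device = s.groupby(level="deviceid").mean()
```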

Every time you call np.vstack NumPy has to allocate space for a brand new array. So if we say 1 row requires 1 unit of memory np.vstack([container, container2]) requires an additional 900+5000 units of memory. Moreover, before the assignment occurs, Python needs to hold space for the old mergedContainer (if...
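
One way around the repeated allocation, sketched under the assumption that the pieces can be kept in a list until the end (chunk sizes here are made up):

```python
import numpy as np

# Hypothetical chunks: 10 blocks of 500 rows x 3 columns.
chunks = [np.full((500, 3), i, dtype=float) for i in range(10)]

# Costly pattern: every np.vstack call allocates a new array and copies
# everything accumulated so far.
merged_slow = chunks[0]
for chunk in chunks[1:]:
    merged_slow = np.vstack([merged_slow, chunk])

# Cheaper pattern: keep references in a list and stack once at the end,
# so the final array is allocated a single time.
merged_fast = np.vstack(chunks)
```

Both give the same array; the second just avoids the intermediate copies.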

python,database-design,data-analysis

For simple database stuff, SQLAlchemy is your friend: http://www.sqlalchemy.org/ The documentation includes a fairly comprehensive tutorial that goes through the high-level concepts involved in working with a database, as well as how to design and work with the tables directly in Python. Here's a sample, showing how you can define...

python,scipy,curve-fitting,data-analysis

You didn't take the order of the parameters to curve_fit into account: Definition: curve_fit(f, xdata, ydata, p0=None, sigma=None, **kw) Docstring: Use non-linear least squares to fit a function, f, to data. Assumes ydata = f(xdata, *params) + eps Parameters f : callable The model function, f(x, ...). It must take...
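
To make the argument order concrete, here is a sketch with a made-up exponential model and noise-free synthetic data (the model and the parameter values are assumptions for illustration only):

```python
import numpy as np
from scipy.optimize import curve_fit


def model(x, a, b):
    # The independent variable comes first, then the parameters to fit.
    return a * np.exp(-b * x)


xdata = np.linspace(0.0, 4.0, 50)
ydata = model(xdata, 2.5, 1.3)  # synthetic data with known parameters

# Argument order: function, xdata, ydata, then the optional initial guess p0.
popt, pcov = curve_fit(model, xdata, ydata, p0=[1.0, 1.0])
```

popt holds the fitted a and b in the same order as they appear in the model signature.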

python,json,pandas,data-analysis,kwargs

I don't really see anything wrong with your approach. If you wanted to use df.query you could do something like this, although I'd argue it's less readable. expr = " and ".join(k + "=='" + v + "'" for (k,v) in kwargs.items()) readJson = readJson.query(expr) ...
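
A runnable sketch of the df.query variant, assuming all filter values are strings as in the expression above (the frame contents and the filter_frame helper name are hypothetical):

```python
import pandas as pd


def filter_frame(df, **kwargs):
    # Build "col == 'value' and col2 == 'value2'" from the keyword arguments.
    expr = " and ".join(f"{k} == {v!r}" for k, v in kwargs.items())
    return df.query(expr)


# Made-up data standing in for the frame loaded from JSON.
read_json = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Oslo"], "kind": ["a", "a", "b"], "n": [1, 2, 3]}
)
subset = filter_frame(read_json, city="Oslo", kind="a")
```

This only works when the column names are valid Python identifiers, which is one more reason the plain boolean-mask approach may be easier to live with.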

mysql,python-2.7,data-analysis

I'd use Python: 1) set up Python to work with MySQL (loads of tutorials online) 2) define: from collections import defaultdict tokenDict = defaultdict(lambda: 0) This is a simple dictionary which returns 0 if there is no value with the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0) 3) read each...
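
A sketch of the counting part, assuming the goal is to tally whitespace-separated tokens and using a plain list in place of the rows fetched from MySQL (both are assumptions, the database step is left out):

```python
from collections import defaultdict

# Hypothetical rows standing in for text fetched from the database.
rows = ["the cat sat", "the dog sat", "the end"]

# defaultdict(int) behaves like defaultdict(lambda: 0): missing keys start at 0.
token_counts = defaultdict(int)
for text in rows:
    for token in text.split():
        token_counts[token] += 1
```

Because missing keys default to 0, there is no need to check whether a token has been seen before incrementing it.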

One way is to groupby and apply function to take list, and then convert to dict. In [92]: df.groupby(['A']).apply(lambda x: x['B'].tolist()).to_dict() Out[92]: {10: [100, 101], 20: [102]} ...

business-intelligence,data-analysis,qlikview,qliksense

You can find good videos on YouTube (like this one: https://www.youtube.com/watch?v=Ef_BigFXCis). When you install QlikView, you get a QlikView tutorial directory. Follow the PDF file and do everything in it. It is very extensive, but after that you can start developing intermediate solutions. The PDF actually covers...

c#,sql-server,algorithm,data-analysis

Think this should satisfy your requirements: create function dbo.analyze(@qualification varchar(50), @date date) returns int as begin declare @result int; with cte as ( select t.*, rank() over (partition by t.[User] order by t.DateOfQualificationAssignment desc) r from theTable t -- no clue what the real table is named where t.DateOfQualificationAssignment <...

You could use a data.table like this: library(data.table) ## dt <- data.table(df) ## R> dt[condition==1, .(Mean=mean(value),Sd=sd(value)), by=group] group Mean Sd 1: 1 17.33333 4.932883 2: 2 35.33333 1.527525 ...

This answer by @GarethD helped solve the problem. Thanks! The key is to unpivot your data so that you have one row per patient per symptom, then join this data to itself to get pairs of symptoms, then pivot the joined data back up to get your counts. Since I...

postgresql,pivot,crosstab,data-analysis

If the number of possible test_id is fixed and known the easiest way to do this is to use conditional expressions like this: select id, max(case when test_id = 'a' then 1 else 0 end) as a, max(case when test_id = 'b' then 1 else 0 end) as b, max(case...
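
The conditional-expression pivot can be tried out quickly from Python. This sketch uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table name, the sample rows, and the from/group by part of the query are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table results (id integer, test_id text)")
conn.executemany(
    "insert into results values (?, ?)",
    [(1, "a"), (1, "b"), (2, "a"), (3, "c")],
)

# One max(case ...) column per known test_id; 1 if the id has that test.
pivot = conn.execute(
    """
    select id,
           max(case when test_id = 'a' then 1 else 0 end) as a,
           max(case when test_id = 'b' then 1 else 0 end) as b,
           max(case when test_id = 'c' then 1 else 0 end) as c
    from results
    group by id
    order by id
    """
).fetchall()
conn.close()
```

Each id collapses to a single row with one 0/1 flag per test_id.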

One solution I can think of involves a self-join shifting one year. I will be using data.table for the simplicity of both joining and the grouping required. I'll also be changing some names and the year format for convenience. I have saved your data in a data.frame called dd: names(dd)...

python,machine-learning,scikit-learn,data-mining,data-analysis

You can fit the StandardScaler instance on a large enough subset that fits in RAM at once (say a couple of GB of data) and then re-use the same fixed instance of the scaler to transform the rest of the data one batch at a time. You should be able...
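
A sketch of that workflow, with a small random array standing in for data that would not fit in RAM (the sizes and the batch length are arbitrary assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for a dataset that would normally be read from disk in pieces.
data = rng.normal(loc=5.0, scale=2.0, size=(10_000, 3))

# Fit once on a subset that fits in memory.
scaler = StandardScaler()
scaler.fit(data[:2_000])

# Re-use the same fitted scaler on every batch.
batch_size = 1_000
scaled_batches = [
    scaler.transform(data[start:start + batch_size])
    for start in range(0, len(data), batch_size)
]
scaled = np.vstack(scaled_batches)
```

In real use each batch would be written back out rather than stacked in memory. StandardScaler also has a partial_fit method if you would rather accumulate the statistics over all batches instead of relying on a subset.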

r,data-analysis,data-manipulation

This is a job for sweep (type ?sweep in R for the documentation) Hi <- data.frame(replicate(9,sample(0:20,20,rep=TRUE))) Hi_SD <- apply(Hi,2,sd) Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD) ...

I think you're looking for rank where you mention sorting. Given your example, add: frame['sum_order'] = frame['sum'].rank() frame['sum_sq_order'] = frame['sum_sq'].rank() frame['max_order'] = frame['max'].rank() frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1) To get: a b c sum sum_sq max sum_order sum_sq_order max_order mean_order 0 1 4 2 7 21 4 1 1 1...

python,pandas,indexing,data-analysis

So let's dissect this: df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date') OK, the first problem here is that you have specified that the index column should be 'Date'; this means that you will not have a 'Date' column anymore. start = datetime(2010, 2, 5) end = datetime(2012, 10, 26) df_train_fly = pd.date_range(start, end, freq="W-FRI") df_train_fly =...

python,pandas,dataframes,data-analysis

For simple arithmetic like unit conversion, this will do it: dataframe['Compression Velocity'] *= 0.0254 ...

Here's one way to do it using the package venneuler: df <- read.table(header = TRUE, text = "user has_1 has_2 has_3 3431 true false true 3432 false true false 3433 true false false 3434 true false false 3435 true false false 3436 true false false", colClasses = c("numeric", rep("logical", 3)))...

python,python-2.7,scipy,data-analysis

It's a data-analysis problem. The FFT works with complex numbers, so the spectrum of real input data is symmetric: restrict the plot with xlim(0,max(freqs)). The sampling period is not well chosen: increasing the period while keeping the same total number of input points will lead to a better-quality spectrum...

You can try df$Folium.id <- do.call(paste0, df) df # Folium.ten Folium.hom Folium.id #1 091001 00 09100100 #2 091001 01 09100101 #3 091002 00 09100200 #4 091003 00 09100300 #5 091003 01 09100301 #6 091003 02 09100302 ...

python,pandas,data-analysis,matplotlib-basemap,proj

This is resolved by changing m(cat_data.LONGITUDE, cat_data.LATITUDE) to m(cat_data.LONGITUDE.values, cat_data.LATITUDE.values), thanks to Alex Messina's finding. Looking into it a little further, pandas changed so that Series data from a DataFrame (derived from NDFrame) should be passed with .values to a Cython function like basemap/proj since v0.13.0, released on 31 Dec 2013...

You can use the pandas groupby-apply combo. Group the dataframe by "Item" and apply a function that calculates the process time. Something like: import pandas as pd def calc_process_time(row): ts = row["ProcessA_Timestamp"].values if len(ts) == 1: return pd.NaT else: return ts[-1] - ts[0] #last time - first time df.groupby("Item").apply(calc_process_time) ...
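
A self-contained variant to experiment with, assuming made-up timestamps that are already sorted within each item. Unlike the snippet above, it selects the timestamp column before applying, so the function receives a Series rather than a sub-frame:

```python
import pandas as pd

# Hypothetical data: item A passes through three times, item B only once.
df = pd.DataFrame(
    {
        "Item": ["A", "A", "A", "B"],
        "ProcessA_Timestamp": pd.to_datetime(
            [
                "2015-01-01 10:00",
                "2015-01-01 10:10",
                "2015-01-01 10:30",
                "2015-01-01 11:00",
            ]
        ),
    }
)


def calc_process_time(timestamps):
    # A single timestamp gives no duration to measure.
    if len(timestamps) == 1:
        return pd.NaT
    return timestamps.iloc[-1] - timestamps.iloc[0]  # last time - first time


process_time = df.groupby("Item")["ProcessA_Timestamp"].apply(calc_process_time)
```

If the rows are not guaranteed to be in time order, sort by the timestamp column first or use max() minus min() instead.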

Prior questions show you have NumPy installed. So using NumPy, you could set the zeros to NaN and then call np.nanmean to take the mean, ignoring NaNs: import numpy as np data = np.genfromtxt('data') data[data == 0] = np.nan means = np.nanmean(data[:, 1:], axis=1) yields array([ 22.1 , 22.08 ,...

python,numpy,pandas,time-series,data-analysis

In [34]: pd.Series(*zip(*((b,a) for a,b in data))) Out[34]: 2008-07-15 15:00:00 0.134 2008-07-15 16:00:00 0.000 2008-07-15 17:00:00 0.000 2008-07-15 18:00:00 0.000 2008-07-15 19:00:00 0.000 2008-07-15 20:00:00 0.000 2008-07-15 21:00:00 0.000 2008-07-15 22:00:00 0.000 2008-07-15 23:00:00 0.000 2008-07-16 00:00:00 0.000 dtype: float64 Or, eschewing the insane desire to make one-liners: dates, vals...

Since total_sent column is to be summed, it shouldn't be within the groupby keys. You can try the following: data1.groupby(['deviceid', 'time']).agg({'total_sent': sum}) which will sum the total_sent column for each group, indexed by deviceid and time....
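
For instance, with a small made-up frame (column names follow the answer, the values are assumptions):

```python
import pandas as pd

data1 = pd.DataFrame(
    {
        "deviceid": ["d1", "d1", "d1", "d2"],
        "time": [1, 1, 2, 1],
        "total_sent": [10, 5, 7, 3],
    }
)

# Group only by the keys; the column to be summed stays out of the key list.
totals = data1.groupby(["deviceid", "time"]).agg({"total_sent": "sum"})
```

The two rows for device d1 at time 1 collapse into a single row with total_sent equal to 15.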

matlab,data-analysis,neuroscience

I'm not aware of a simple existing Matlab function/toolbox that directly does what you want, but all the elements are there and you should be able to program something like that yourself. The information which voxel in an image belongs to which anatomical brain region is provided by so-called atlases,...

First of all, use lowercase for variable names. Now, to take the 5th element of each list, from a list of lists, you can use map, which applies a function to each item and returns the list of results: list_of_fifths = map(lambda x: x[4], Pressure) An example of running this:...
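
A quick sketch with made-up data. Note that in Python 3 map returns a lazy iterator rather than a list, so it is wrapped in list() here; a list comprehension gives the same result:

```python
# Hypothetical list of lists; each inner list has at least five elements.
pressure = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]]

# map applies the function to each inner list; list() materialises the result.
fifths_map = list(map(lambda row: row[4], pressure))

# Equivalent list comprehension.
fifths_comp = [row[4] for row in pressure]
```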

That is the same result as far as I can see. Pandas has been changing the default display of dataframes: the example in the book is the summary display, and the display you got is the newer format that shows the beginning and end. Read up on display options... In the...

I can list a number of projects that can help you in this process: d3js, which you already mentioned. It is not limited to visualization; looking at the API section of the library, it also offers many manipulation methods. DataSet offers a good way to manipulate data. TableTop is...

javascript,json,for-loop,data-analysis

I would recommend using a .json file since you are working with JSON objects. I've made a demo of the file here and I've made a demo of the data analysis here. Note: the demo file is hosted on a personal server, so it may not work later on. Now the file...

You can use concat to concatenate a list of DataFrames: df_list = [pd.read_csv(file_name, delimiter=',', names=header_row) for file_name in file_list] #opens your csv df = pd.concat(df_list) Then you calculate the mean via df.sales.mean(). A small example: a = pd.DataFrame({'sales' : [2,4,6] , 'other' : [1,2,1]}) b = pd.DataFrame({'sales' : [7,4,7] ,...

dt[dt['year']==2001].dropna(axis=1) ...

matlab,data-analysis,numerical-analysis,probability-density

I hope I get you right: assume you have a vector a, and you would like to get a normal distribution that represents it. [correct me if I am wrong] >> a = randn(100,1); % your vector >> x = [-10:0.1:10]; >> y = gaussmf(x,[std(a) mean(a)]); % std(a)=sigma, mean(a)=mu >> plot(x,y); ...

r,linear-regression,data-analysis

If the columns are exactly the same, you should be able to do something like this: data_2013_and_2014 <- rbind(data2013, data2014) new_model <- lm(x ~ y, data = data_2013_and_2014) Look up ?rbind for more details....

full-text-search,text-mining,data-analysis,word-count,text-analysis

So it doesn't look like this is something anyone else really needs, but just in case, here's how I solved my problem. I used 2 different tools: Hermetic Word Frequency Advanced (http://www.hermetic.ch/wfca/wfca.htm) RapidMiner Studio (https://rapidminer.com/) with the Text Processing extension added through RapidMiner Marketplace The RapidMiner text processing tools worked...

python,pandas,statistics,data-analysis,statsmodels

Compliments to @behzad.nouri who penned this answer originally. I chose to post the entire answer rather than a link to it because links can sometimes change or be deleted. Perhaps you could detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very low eigenvalue shows that the...
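
A minimal NumPy sketch of the eigenvalue check, using synthetic data in which the third column is an exact sum of the first two (the data and variable names are assumptions, not the original answer's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2  # exact linear combination, so the columns are collinear
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the columns (rowvar=False: variables are in columns).
corr = np.corrcoef(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
smallest = eigenvalues[0]
culprit = eigenvectors[:, 0]  # eigenvector belonging to the smallest eigenvalue
```

A smallest eigenvalue close to zero signals collinearity, and the large entries of the matching eigenvector point at the variables involved.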

You want a dictionary: stages = {'18-24': 'young adult', '25-34': 'adult', ...} for age in age_group: lifestage = stages[age] This is the canonical replacement for a lot of elifs in Python. ...
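
A runnable version of the same idea. The third label and the 'unknown' fallback are made up for illustration; dict.get avoids a KeyError when an age group has no entry:

```python
# The first two entries follow the answer; the third is a hypothetical addition.
stages = {"18-24": "young adult", "25-34": "adult", "35-44": "middle-aged"}

age_group = ["25-34", "18-24", "65+"]

# Look each age up in the dictionary, falling back to a default label.
lifestages = [stages.get(age, "unknown") for age in age_group]
```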

python,matplotlib,pandas,data-analysis

This works for me: data_df.plot(x='Compression Velocity', y='Compression Force', xticks=d['Compression Velocity']) ...