Replace improper commas in CSV file

Tag: regex,r,csv

This may have been asked before, but I couldn't find it. I have a set of CSV files (439 or so) where, in a few of the files, someone also used commas in the editorial comments. As a result, I can't put the files into a data frame, because the lines no longer split into the same number of elements. The problem I'm facing looks like this:

vec1 <- paste("484,1213,0,62.0006,1,go -- late F1 max, but glide?")
vec2 <- paste("467,1387,0,62.0026,1,goes2")

ls <- list(vec1, vec2)

What I want is a data frame with six columns. If there weren't a comma in the editorial comments for vec1, I could use the following (ldply comes from the plyr package), as I had been doing until I found this problematic example:

df <- ldply(ls, function(x)unlist(strsplit(x[1], split = ",")))

However, I'm getting the obvious error message that the results do not all have the same length. Is there any way of getting rid of that comma, turning it into a semicolon, or ensuring that, if a line splits into 7 elements, elements 6 and 7 are combined?

If it helps, this is how I'm reading the files into R. (I'm using scan because there is other information in the files that I want. There are some odd encoding issues going on here as well, but this seems to work.)

data <- scan(file, fileEncoding="latin1", blank.lines.skip = FALSE, what = "list", sep = "\n", quiet = TRUE)   

Best How To :

If you need the comments, you can still replace the 6th comma with a semicolon and use your previous solution:

gsub("((?:[^,]*,){5}[^,]*),", "\\1;", vec1, perl=TRUE)

Regex explanation:

  • ((?:[^,]*,){5}[^,]*) - a capturing group, referenced as Group 1 via \\1 in the replacement pattern, matching
    • (?:[^,]*,){5} - 5 sequences of non-comma characters followed by a comma
    • [^,]* - 0 or more non-commas
  • , - the comma we'll turn into a ; in the replacement
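As a minimal sketch of how this fits together, the regex above can be applied to both sample lines before splitting. The names lines, fixed, and parts are illustrative only, and base R's do.call(rbind, ...) stands in here for the ldply call from the question:

```r
vec1 <- "484,1213,0,62.0006,1,go -- late F1 max, but glide?"
vec2 <- "467,1387,0,62.0026,1,goes2"
lines <- c(vec1, vec2)

# Turn the 6th comma into a semicolon; lines with only 5 commas do not match
# the pattern and are returned unchanged.
fixed <- gsub("((?:[^,]*,){5}[^,]*),", "\\1;", lines, perl = TRUE)

# Every line now splits into exactly six fields.
parts <- strsplit(fixed, ",", fixed = TRUE)

# Bind the fields row-wise into a six-column data frame (columns V1..V6).
df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
```

All columns come back as character, so numeric columns would still need converting, for example with as.numeric.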

Or (as @CathG pointed out, the \\K operator can also be used with Perl-like expressions):

sub("^([^,]+,){5}[^,]+\\K,", ";", vec1, perl=TRUE)

From PCRE documentation:

The escape sequence \K causes any previously matched characters not to be included in the final matched sequence.

However, neither approach will "normalize" any other commas that might follow.
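If a comment may contain several commas, one possible workaround is to split first and then fold everything from the sixth field onwards back into one field. This is only a sketch: split_fixed_fields and its n_fields argument are hypothetical names, and vec3 is made-up sample data with two commas in the comment.

```r
# Hypothetical helper: split a line on commas, then merge any surplus
# fields into the last expected field, joined with semicolons.
split_fixed_fields <- function(x, n_fields = 6) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  if (length(parts) > n_fields) {
    tail_field <- paste(parts[n_fields:length(parts)], collapse = ";")
    parts <- c(parts[seq_len(n_fields - 1)], tail_field)
  }
  parts
}

# Made-up example line with two commas inside the comment
vec3 <- "484,1213,0,62.0006,1,go -- late, F1 max, but glide?"
res3 <- split_fixed_fields(vec3)
```

Lines that already split into six or fewer fields are returned as they are, so short lines would still need separate handling.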

copy a list of data.tables

r,data.table

copy() is for copying data.tables. You are using it to copy a list. Try: zz <- lapply(z,copy) zz[[1]][ , newColumn := 1 ] Using your original code, you will see that applying copy() to the list does not make a copy of the original data.table. They are still referenced by...

Aggregating data in R

r

Using data.table library(data.table) setDT(df1)[, list(pages=paste(page, collapse="_")), list(user_id, date=as.Date(date, '%m/%d/%Y'))] Or using dplyr library(dplyr) df1 %>% group_by(user_id, date=as.Date(date, '%m/%d/%Y')) %>% summarise(pages=paste(page, collapse='_')) ...

Sleep Shiny WebApp to let it refresh… Any alternative?

r,shiny,sleep

some reproducible code would allow me to give you some example code, but in the absence of that... wrap what you currently have in another if(), checking for length = 0 (or just && it, with the NULL check first), and display your favorite placeholder message....

How to build a 'for' loop with input$i in R Shiny

r,loops,for-loop,shiny

Use [[ or [ if you want to subset by string names, not $. From Hadley's Advanced R, "x$y is equivalent to x[["y", exact = FALSE]]." ## Create input input <- `names<-`(lapply(landelist, function(x) sample(0:1, 1)), landelist) filterland <- c() for (landeselect in landelist) if (input[[landeselect]] == TRUE) # use `[[`...

Return Column Names when True in R

r

You could loop through the rows of your data, returning the column names where the data is set with an appropriate number of NA values padded at the end: `colnames<-`(t(apply(dat == 1, 1, function(x) c(colnames(dat)[x], rep(NA, 4-sum(x))))), paste("Impair", 1:4)) # Impair1 Impair2 Impair3 Impair4 # 1 "A" NA NA NA...

Find multiple consecutive empty lines

r

Here's a solution for extracting the article lines only. Turned out much more complex and cryptic than I'd been hoping, but I'm pretty sure it works. Also, thanks to akrun for the test data. v1 <- c('ard','b','','','','rr','','fr','','','','','gh','d'); ind <-...

ggplot2 & facet_wrap - eliminate vertical distance between facets

r,ggplot2

Change the panel.margin argument to panel.margin = unit(c(-0.5,0,-0.5,0), "lines"). For some reason the top and bottom margins need to be negative to line up perfectly. Here is the result: ...

Get all prices with $ from string into an array in Javascript

javascript,regex,currency

It’s quite trivial: RegEx string.match(/\$((?:\d|\,)*\.?\d+)/g) || [] That || [] is for no matches: it gives an empty array rather than null. Matches $99 $.99 $9.99 $9,999 $9,999.99 Explanation / # Start RegEx \$ # $ (dollar sign) ( # Capturing group (this is what you’re looking for) (?: #...

Keep the second occurrence in a column in R

r,conditional,subset,find-occurrences

Here's another possible data.table solution library(data.table) setDT(df1)[, list(Value = c("uncensored", "censored"), Time = c(Time[match("uncensored", Value)], Time[(.N - match("uncensored", rev(Value))) + 2L])), by = ID] # ID Value Time # 1: 1 uncensored 3 # 2: 1 censored 5 # 3: 2 uncensored 2 # 4: 2 censored 5 Or similarly,...

optimization algorithm for circular data

r,optimization,circular,maximization

I would compute all the pairs of rows in df: (pairs <- cbind(1:nrow(df), c(2:nrow(df), 1))) # [,1] [,2] # [1,] 1 2 # [2,] 2 3 # [3,] 3 4 # [4,] 4 5 # [5,] 5 6 # [6,] 6 1 You can find the best pairing with which.max:...

ggplot equivalent for matplot

r,ggplot2

You can create a similar plot in ggplot, but you will need to do some reshaping of the data first. library(reshape2) #ggplot needs a dataframe data <- as.data.frame(data) #id variable for position in matrix data$id <- 1:nrow(data) #reshape to long format plot_data <- melt(data,id.var="id") #plot ggplot(plot_data, aes(x=id,y=value,group=variable,colour=variable)) + geom_point()+ geom_line(aes(lty=variable))...

how to get values from selectInput with shiny

r,shiny

You can simply use input$selectRunid like this: content(GET("http://stats", path="gentrap/alignments", query=list(runIds=input$selectRunid, userId="dev"), add_headers("X-SENTINEL-KEY"="dev")), as = "parsed") It is probably wise to add some kind of action button and trigger download only on click....

Finding embeded xpaths in a String

java,regex

Use {} instead of () because {} are not used in XPath expressions and therefore you will not have confusions.

How to quickly read a large txt data file (5GB) into R(RStudio) (Centrino 2 P8600, 4Gb RAM)

r,large-data

If you only have 4 GB of RAM you cannot put 5 GB of data 'into R'. You can alternatively look at the 'Large memory and out-of-memory data' section of the High Performance Computing task view in R. Packages designed for out-of-memory processes such as ff may help you. Otherwise...

REGEX python find previous string

python,regex,string

Updated: This will check for the existence of a sentence followed by special characters. It returns false if there are no special characters, and your original sentence is in capture group 1. Updated Regex101 Example r"(.*[\w])([^\w]+)" Alternatively (without a second capture group): Regex101 Example - no second capture group r"(.*[\w])(?:[^\w]+)"...

Regex pass dynamic values with boundry

c#,regex,string,boundary

Your first regular expression has a backslash followed by the letter b because of that @. The second one has the character that represents backspace. Just put an @ in front: string bound = @"\b"; ...

Regex that allow void fractional part of number

c#,regex

Just move the dot outside the capturing group and then make it optional. @"[+-]?\d+\.?\d*" Use anchors if necessary. @"^[+-]?\d+\.?\d*$" ...

Subtract time in r, forcing unit of results to minutes [duplicate]

r,posix,posixct

You can try with difftime df1$time.diff <- with(df1, difftime(time.stamp2, time.stamp1, unit='min')) df1 # time.stamp1 time.stamp2 time.diff #1 2015-01-05 15:00:00 2015-01-05 16:00:00 60 mins #2 2015-01-05 16:00:00 2015-01-05 17:00:00 60 mins #3 2015-01-05 18:00:00 2015-01-05 20:00:00 120 mins #4 2015-01-05 19:00:00 2015-01-05 20:00:00 60 mins #5 2015-01-05 20:00:00 2015-01-05 22:00:00 120...

Regular Expression for whole world

regex,c#-4.0,vb6

You can use: Public\s+Const\s+g(?<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?<Value>False|True) demo ...

MySQL substring match using regular expression; substring contain 'man' not 'woman'

mysql,regex

A variant of n-dru pattern since you don't need to describe all the string: SELECT '#hellowomanclothing' REGEXP '(^#.|[^o]|[^w]o)man'; Note: if a tag contains 'man' and 'woman' this pattern will return 1. If you don't want that Gordon Linoff solution is what you are looking for....

how to call Java method which returns any List from R Language? [on hold]

java,r,rjava

You can do it with rJava package. install.packages('rJava') library(rJava) .jinit() jObj=.jnew("JClass") result=.jcall(jObj,"[D","method1") Here, JClass is a Java class that should be in your ClassPath environment variable, method1 is a static method of JClass that returns double[], [D is a JNI notation for a double array. See that blog entry for...

Count number of rows meeting criteria in another table - R PRogramming

r

Using dplyr for your first problem: left_join(contacts, listings, by = c("id" = "id")) %>% filter(abs(listing_date - contact_date) < 30) %>% group_by(id) %>% summarise(cnt = n()) %>% right_join(listings) And the output is: id cnt city listing_date 1 6174 2 A 2015-03-01 2 2175 3 B 2015-03-14 3 9176 1 B 2015-03-30...

Get number from string

regex

Use \d+ to match one or more digits. \b(?:http:\/\/)?(?:www\.)?example\.com\/g\/(\d+)\/\w Put http:// and www. inside a capturing or non-capturing group and then make it optional by adding the ? quantifier after that group. For both http and https, it would be (?:https?:\/\/)? DEMO...

How many characters are visible like a space, but are not space characters?

php,regex

You can make use of a Unicode category \p{Zs}: Zs    Space separator $string = preg_replace('~\p{Zs}~u', ' ', $string); The \p{Zs} Unicode category class will match these space-like symbols: Character Name U+0020 SPACE U+00A0 NO-BREAK SPACE U+1680 OGHAM SPACE MARK U+2000 EN QUAD U+2001 EM QUAD U+2002 EN SPACE U+2003 EM SPACE...

Swing regular expression for phone number validation

java,regex

To only allow digits, comma and spaces, you need to remove (, ) and -. Here is a way to do it with Matcher.find(): Pattern pattern = Pattern.compile("^[0-9, ]+$"); ... if (!m.find()) { evt.consume(); } And to allow an empty string, replace + with *: Pattern pattern = Pattern.compile("^[0-9, ]*$");...

Regex with whitespaces and preceding zeros

regex,sas

You can use this simplified regex: /^[\s0]*11\s*$/ ...

How to Match a string with the format: “20959WC-01” in php?

php,regex

$pattern = '! ^ # start of string \d{5} # five digits [[:alpha:]]{2} # followed by two letters - # followed by a dash \d{2} # followed by two digits $ # end of string !x'; $matches = preg_match($pattern, $input); ...

How to write RegEx for inserting line break for line length more than 30 characters?

regex

Find what: ^(.{30}) Replace with: \1\n ...

How to create the javascript regular expression for number with some special symbols

javascript,regex

This matches all given examples as well: ^\$?\d+(?:[.,:]\d+)?%?$ See it in action: RegEx101 Please comment, if adjustment / further detail is required....

How to split a text into two meaningful words in R

r,string-split,stemming,text-analysis

Given a list of English words you can do this pretty simply by looking up every possible split of the word in the list. I'll use the first Google hit I found for my word list, which contains about 70k lower-case words: wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1 check.word <- function(x, wl) {...

How to plot data points at particular location in a map in R

r,google-maps,ggmap

This should get you headed in the right direction, but be sure to check out the examples pointed out by @Jaap in the comments. library(ggmap) map <- get_map(location = "Mumbai", zoom = 12) df <- data.frame(location = c("Airoli", "Andheri East", "Andheri West", "Arya Nagar", "Asalfa", "Bandra East", "Bandra West"), values...

Please can someone help me understand the exec method for regular expressions?

javascript,regex

I don't understand why it would give me two hellos back? Because the first entry in the array is the overall match for the expression, which is then followed by the content of any capture groups the expression defines. Since the expression defines one capture group, you get back...

PHP Regular Expressions Counting starting consonants in a string

php,regex

This is one way to do it, using preg_match: $string ="SomeStringExample"; preg_match('/^[b-df-hj-np-tv-z]*/i', $string, $matches); $count = strlen($matches[0]); The regular expression matches zero or more (*) case-insensitive (/i) consonants [b-df-hj-np-tv-z] at the beginning (^) of the string and stores the matched content in the $matches array. Then it's just a matter...

R — frequencies within a variable for repeating values

r,count,duplicates

You can try library(data.table)#v1.9.4+ setDT(yourdf)[, .N, by = A] ...

regex - Match filename with or without extension

regex,logstash-grok

This is about as simple as I can get it: \b\w+\.?\w* See demo...

Skip some lines with fread

r,fread

In linux, you could use awk with fread or it can be piped with read.table. Here, I changed the delimiter to , using awk pth <- '/home/akrun/file.txt' #change it to your path v1 <- sprintf("awk '/^(ID_REF|LMN)/{ matched = 1} matched {$1=$1; print}' OFS=\",\" %s", pth) and read with fread library(data.table)...

Appending a data frame with for if and else statements or how do put print in dataframe

r,loops,data.frame,append

It's generally not a good idea to try to add rows one-at-a-time to a data.frame. it's better to generate all the column data at once and then throw it into a data.frame. For your specific example, the ifelse() function can help list<-c(10,20,5) data.frame(x=list, y=ifelse(list<8, "Greater","Less")) ...

Select / subset spatial data in R

r,dictionary,spatial

I'm going with the assumption you meant "to the right" since you said "Another solution might be to drawn a polygon around the Baltic Sea and only to select the points within this polygon" # your sample data pts <- read.table(text="lat long 59.979687 29.706236 60.136177 28.148186 59.331383 22.376234 57.699154 11.667305...

Highlighting specific ranges on a Graph in R

r,graph,highlight

Or you could place a rectangle on the region of interest: rect(xleft=1994,xright = 1998,ybottom=range(CVD$cvd)[1],ytop=range(CVD$cvd)[2], density=10, col = "blue") ...

Reg ex matching a word

regex

You could use a negative lookahead which will exclude those having _FX following the initial alpha string ^ABD_DEF_GHIJ(?!_FX)(?:_\d{8})?$ see example here...

Identify that a string could be a datetime object

python,regex,algorithm,python-2.7,datetime

What about fuzzyparsers: Sample inputs: jan 12, 2003 jan 5 2004-3-5 +34 -- 34 days in the future (relative to todays date) -4 -- 4 days in the past (relative to todays date) Example usage: >>> from fuzzyparsers import parse_date >>> parse_date('jun 17 2010') # my youngest son's birthday datetime.date(2010,...

Linear multivariate regression in R

r

multivariate multiple regression can be done by lm(). This is very well documented, but here follows a little example: rawMat <- matrix(rnorm(200), ncol=2) noise <- matrix(rnorm(200, 0, 0.2), ncol=2) B <- matrix( 1:4, ncol=2) P <- t( B %*% t(rawMat)) + noise fit <- lm(P ~ rawMat) summary( fit )...

Histogram-like summary for interval data

r,statistics,histogram

Using IRanges, you should use findOverlaps or mergeByOverlaps instead of countOverlaps. By default, though, it doesn't return the non-matches. I'll leave that to you. Instead, I'll show an alternate method using foverlaps() from the data.table package: require(data.table) subject <- data.table(interval = paste("int", 1:4, sep=""), start = c(2,10,12,25), end = c(7,14,18,28)) query...

Subsetting rows by passing an argument to a function

r,subset

The problem is that you pass the condition as a string and not as a real condition, so R can't evaluate it when you want it to. if you still want to pass it as string you need to parse and eval it in the right place for example: cond...

Regex to remove `.` from a sub-string enclosed in square brackets

c#,.net,regex,string,replace

To remove all the dots present inside the square brackets. Regex.Replace(str, @"\.(?=[^\[\]]*\])", ""); DEMO To remove dot or ?. Regex.Replace(str, @"[.?](?=[^\[\]]*\])", ""); ...

Serial modification of objects in R

r,oop

I would create a list of all your matrices using mget and ls (with a regular expression matching the names of your matrices) and then modify them all at once using lapply and the colnames<- and rownames<- replacement functions. Something along these lines: l <- mget(ls(pattern = "m\\d+.m")) lapply(l, function(x)...

How (in a vectorized manner) to retrieve single value quantities from dataframe cells containing numeric arrays?

r,dataframes,vectorization

It looks like you're trying to grab summary functions from each entry in a list, ignoring the elements set to -999. You can do this with something like: get_scalar <- function(name, FUN=max) { sapply(mydata[,name], function(x) if(all(x == -999)) NA else FUN(as.numeric(x[x != -999]))) } Note that I've changed your function...

Match a pattern preceded by a specific pattern without using a lookbehind

regex,eclipse,lookahead

A work-around for the lack of variable-length lookbehind is available in situations when your strings have a relatively small fixed upper limit on their length. For example, if you know that strings are at most 100 characters long, you could use {0,100} in place of * or {1,100} in place...

Remove quotes to use result as dataset name

r,string

You can get the values with get or mget (for multiple objects): lst <- mget(myvector) lapply(seq_along(lst), function(i) write.csv(lst[[i]], file=paste(myvector[i], '.csv', sep=''))) ...

match line break except line begin with spcific word or blank line

regex,notepad++

Try this regex: (?<=[a-zA-Z])(\n) I used parentheses to capture the newline character. https://regex101.com/r/zS9pB4/3...