
how to extract set of lines using separator from file in UNIX

unix,awk,sed,pattern-matching,text-extraction

This awk one-liner does it:

awk -v RS="" '{print > "query"(++i)".sql"}' file

With -v RS="" we define each record as a paragraph (a blank-line-separated block). Then it is a matter of printing each record to queryNUMBER.sql; ++i increments on every record, so the files are numbered sequentially. See the created files:

$ cat query1.sql
select...
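
The same paragraph-splitting idea can be sketched in Python; the function names and the query file-name prefix below are illustrative, not from the original answer:

```python
import re

def split_paragraphs(text):
    """Return the blank-line-separated blocks of text, like awk's RS=""."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def write_query_files(text, prefix="query"):
    """Write each paragraph to prefix1.sql, prefix2.sql, ... and return the names."""
    names = []
    for i, para in enumerate(split_paragraphs(text), start=1):
        name = "%s%d.sql" % (prefix, i)
        with open(name, "w") as out:
            out.write(para + "\n")
        names.append(name)
    return names
```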

How to extract data from a file in C

c,data,extract,extraction,text-extraction

Try this:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define MAX_ROW 256

int main() {
    FILE *file;
    float data[MAX_ROW][6];
    int index;
    float *row;

    file = fopen("file.dat", "r");
    if (file == NULL) {
        perror("fopen()");
        return 1;
    }
    index = 0;
    row = data[0];
    while (fscanf(file, "%f%f%f%f%f%f",
                  &row[0], &row[1], &row[2], &row[3], &row[4], &row[5])...
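
The same row-parsing idea can be sketched in Python (the function name and column count are illustrative): read whitespace-separated rows of six floats, stopping at the first row that does not parse, much as the fscanf loop stops when a full row cannot be read:

```python
def read_float_rows(lines, ncols=6):
    """Parse whitespace-separated rows of ncols floats; stop on a bad row."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != ncols:
            break  # incomplete row: stop, like the fscanf loop
        try:
            rows.append([float(p) for p in parts])
        except ValueError:
            break  # non-numeric field: stop as well
    return rows
```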

Extract numbers and decimal from string in EXCEL

excel,feature-extraction,text-extraction

With data in A1, use: =MID(A1,FIND("(",A1)+1,FIND(")",A1)-FIND("(",A1)-1) ...
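
The formula pulls out the text between the first "(" and the first ")". The same logic, sketched in Python for clarity (function name is illustrative):

```python
def between_parens(s):
    """Mimic =MID(A1, FIND("(",A1)+1, FIND(")",A1)-FIND("(",A1)-1)."""
    start = s.index("(") + 1  # FIND("(",A1)+1
    end = s.index(")")        # FIND(")",A1)
    return s[start:end]
```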

how to retrieve a line of text from ACE editor?

javascript,ace-editor,text-extraction

You need to call line0 = editor.session.getLine(0);. Using getElementsByClassName("ace_line") or similar won't work, since Ace creates DOM elements only for visible lines. Also, the text content of a DOM element differs from the text in the document, since tabs and some other characters are replaced by spaces. ...

Logic behind goose extractor

python-2.7,text-extraction,boilerplate,goose

There is a pretty straightforward description of how goose works at https://github.com/jiminoc/goose/wiki

Stanford NER 3.4.1 questions

java,stanford-nlp,text-extraction

(1) Yes, absolutely. The tokens (class: CoreLabel) returned each store begin and end character offsets for each token. The easiest way to get at the offsets for whole entities is with the classifyToCharacterOffsets() method. See the example below. (2) Yes, but there is some subtlety in interpreting these. That is,...

ColdFusion extract values from text file

regex,coldfusion,extract,text-parsing,text-extraction

You don't want REMatch, you want REFind (docs):

REFind(reg_expression, string [, start, returnsubexpressions ])

returnsubexpressions is what you need, so...

<cfset str = "request.config.MY_PARAM_NAME = 'The parameter VALUE!!';">
<cfset match = REFind("^request\.config\.(\S+) = (.*);", str, 1, "Yes")>
<cfdump var="#match#">

match will be a Struct with two keys (POS and...

Go to line number and extract text in following matching brackets

python,regex,bash,perl,text-extraction

Use the regex extension module to check for the balancing of brackets.

import regex
m = input("Enter the line number:\n")
with open('file') as f:
    fil = f.readlines()[int(m)-1:]
print(regex.search(r'{(?:(?0)|[^{}])*}', ''.join(fil)).group())

Example of how the above code works:

$ cat file
foo
while(j<N_sim) {
Vect_F[j]=0;
for (k=0; ((k<N_col) & ((j-k)>=0)); k++)
Vect_F[j]+=F[i][k]*Vect_Up[j-k];...
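
If the third-party regex module (which supports the recursive pattern (?0)) is not available, the same balanced-brace extraction can be done with a plain counter in the standard library. A sketch, with an illustrative function name:

```python
def first_braced_block(text):
    """Return the first brace-balanced {...} block in text, or None.

    A stdlib alternative to the recursive pattern {(?:(?0)|[^{}])*}.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # braces never balanced
```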

How to export tags from Stackoverflow to a excel file?

tags,extract,stack-overflow,export-to-csv,text-extraction

Run an appropriate query on data.stackexchange.com and click the "Download CSV" link. You can also download a complete DB dump (the link is on the DB.stackexchange help page)...

Repeat text extraction with Python

python,xml,bash,loops,text-extraction

You could use Beautiful Soup's CSS selectors.

>>> from bs4 import BeautifulSoup
>>> s = "foo <font color='#FF0000'> foobar </font> bar"
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
 foobar ...

How to extract urls from an html file stored on my computer using java?

java,io,text-extraction

You can actually do this with the Swing HTML parser. Though the Swing parser only understands HTML 3.2, tags introduced in later HTML versions will simply be treated as unknown, and all you actually want are links anyway.

static Collection<String> getLinks(Path file)
        throws IOException, MimeTypeParseException, BadLocationException {
    HTMLEditorKit htmlKit =...
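
The same "parse the HTML, keep only the links" idea can be sketched with Python's standard-library parser (class and function names here are illustrative, not part of the original Java answer):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```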

Extract relevant attributes from postal addresses data in order to do PCA on those Data (using R) [closed]

r,data-mining,text-mining,pca,text-extraction

I don't think PCA is the right tool for that task. I would first try to introduce some kind of distance measure between two addresses. You can either use the entire address as a single feature; then there are plenty of general-purpose string similarity measures, for example Levenshtein distance. There's a method in utils...
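
For concreteness, the Levenshtein distance mentioned above is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch:

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                   # delete ca
                           cur[j - 1] + 1,                # insert cb
                           prev[j - 1] + (ca != cb)))     # substitute
        prev = cur
    return prev[-1]
```

Two addresses could then be compared by levenshtein(addr1, addr2), possibly normalized by length.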

Extract a substring by using regex doesn't work

java,regex,string,text-extraction

You need to escape the brackets:

Pattern p = Pattern.compile("\\[(.*?)\\]"); ...
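
Brackets are regex metacharacters (they open a character class) in most engines, which is why they must be escaped to match literally. The equivalent pattern in Python, for comparison:

```python
import re

def bracketed(s):
    """Return the substrings found between literal [ and ] pairs."""
    return re.findall(r'\[(.*?)\]', s)
```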

Extract Numbers from a string in R

regex,r,text-extraction,stringr

The code below covers the cases in your question. Hopefully, you can generalize it if you find other character combinations in the data.

# Phone numbers (I've added an additional number with the "/" character)
strings <- c("87225324","65-62983211","65-6298-3211","8722 5324",
             "(65) 6296-2995","(65) 6660 8060","(65) 64368308","+65 9022 7744",
             "+65 6296-2995","+65-6427 8436","+65 6357...

Extract specific numbers from string in R

r,string,text-extraction

Here's one way using stringr, where d is your dataframe:

library(stringr)
m <- str_extract(d$V1, '\\d{11}')
na.omit(data.frame(V1=m, V2=d$V2))
#             V1   V2
# 5  48808032191    5
# 6  85236250110    6
# 7  92564593100    7
# 8  03643936451  331
# 9  22723200159 3894
# 10 98476963300 3895
# 11 15239136149 3896
# 12...
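
The same extraction in Python, for readers not using R (function name is illustrative): \d{11} grabs the first run of 11 digits, and rows with no match come back as None, playing the role of NA:

```python
import re

def first_11_digits(s):
    """Return the first 11-digit match in s, or None (like str_extract's NA)."""
    m = re.search(r'\d{11}', s)
    return m.group() if m else None
```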

bag-of-words approach / tools / library for C++?

c++,machine-learning,text-processing,text-extraction,lda

Umm... surely it should be easy enough to code? The stupidest, yet guaranteed-to-work, approach is to iterate over all the documents twice. During the first iteration, create a hashmap mapping each word to a unique index (a structure like HashMap), and during the second iteration, you do...
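
The two-pass scheme described above can be sketched as follows; this is a plain-Python illustration (not C++, and with illustrative names), since the logic is the point:

```python
def bag_of_words(documents):
    """Two passes: (1) assign each word a unique index, (2) count per document."""
    vocab = {}
    for doc in documents:                 # pass 1: build word -> index map
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    vectors = []
    for doc in documents:                 # pass 2: count into fixed-length vectors
        counts = [0] * len(vocab)
        for word in doc.split():
            counts[vocab[word]] += 1
        vectors.append(counts)
    return vocab, vectors
```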

Object extraction for a low contrast image

matlab,image-processing,text-extraction

Here is a way to do it. First apply a median filter with a large kernel on the image to remove outliers and then apply a threshold to convert to a binary image. Note that playing with the kernel size of the filter alters the threshold level you need to...
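
The median-filter-then-threshold pipeline described above can be sketched in pure Python on a list-of-lists "image" (a toy stand-in for the Matlab functions, with illustrative names; edge pixels are simply copied through):

```python
def median_filter(img, k=3):
    """Naive 2D median filter with a k-by-k window; edges are left unfiltered."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [row[:] for row in img]
    for y in range(r, h - r):
        for x in range(r, w - r):
            window = sorted(img[yy][xx]
                            for yy in range(y - r, y + r + 1)
                            for xx in range(x - r, x + r + 1))
            out[y][x] = window[len(window) // 2]  # the median removes outliers
    return out

def to_binary(img, t):
    """Threshold the image into a binary 0/1 image."""
    return [[1 if p > t else 0 for p in row] for row in img]
```

A lone bright outlier pixel disappears under the median, so the subsequent threshold no longer picks it up.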

scrapy isn't working right in extracting the title

scrapy,web-crawler,text-extraction

You need to change item['title'] to this:

item['title'] = ''.join(site.xpath('//table[@width="100%"]//span[text() = "Delhivery"]/parent::*//text()').extract()[0])

Also edit sites to this, to extract only the required links (the ones with Delhivery in them):

sites = response.xpath('//table//span[text()="Delhivery"]/ancestor::div')

EDIT: So I understand now that you need to add a pagination rule to your code. It should...

search a specific word in BeautifulSoup python

python,string,python-2.7,beautifulsoup,text-extraction

This should get you started with extracting the ssid for each entry. Note that some of those links don't have an ssid, so you'll have to account for that with some error catching. No need for the re or urllib2 modules here.

import feedparser
import requests
from...

Set minimum confidence to ocr in Matlab

matlab,ocr,text-extraction,matlab-cvst,confidence-interval

The easiest way would be to create a logical index based on your threshold value:

bestWordsIdx = ocrtxt.WordConfidence > 0.8;
bestWords = ocrtxt.Words(bestWordsIdx)

And the same for Text:

bestTextIdx = ocrtxt.CharacterConfidence > 0.8
bestText = ocrtxt.Text(bestTextIdx)

...
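
The logical-indexing idea translates directly to filtering paired lists in other languages; a Python sketch with illustrative names:

```python
def filter_by_confidence(words, confidences, threshold=0.8):
    """Keep only the words whose confidence exceeds the threshold."""
    return [w for w, c in zip(words, confidences) if c > threshold]
```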

perform gsub in a data frame with 2 columns

regex,r,string,gsub,text-extraction

You overwrite the whole data frame instead of only one column. Try this:

Data_edited_txt2$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("@\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
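
For comparison, the same three substitutions applied to a single string in Python (function name is illustrative; Python's re has no [[:punct:]] class, so [^\w\s] stands in for it):

```python
import re

def clean_tweet(text):
    """Strip retweet markers, @mentions, and punctuation, like the gsub chain."""
    text = re.sub(r'(RT|via)((?:\b\W*@\w+)+)', '', text)  # retweet markers
    text = re.sub(r'@\w+', ' ', text)                      # remaining @mentions
    text = re.sub(r'[^\w\s]', '', text)                    # punctuation
    return text
```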