Where to download multi-language word lists from Wiktionary?

I was wondering if there was a place to download multi-language word lists from Wiktionary? Lists of the words contained in the various Wiktionary versions can all be constructed by parsing the XML/database dump files found at dumps.wikimedia.org. [XXX]
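
As a minimal sketch of that dump-parsing step (my own illustration, not from the answer: the dump file name is assumed, and the namespace filter simply keeps main-namespace entries), you can stream a pages-articles XML dump and collect the page titles as a word list:

    import xml.etree.ElementTree as ET

    # Assumed local file name; any *wiktionary-*-pages-articles.xml dump from
    # dumps.wikimedia.org works the same way once decompressed.
    DUMP = 'enwiktionary-pages-articles.xml'

    def iter_titles(path):
        # iterparse streams the file, so the whole dump never sits in memory.
        for _, elem in ET.iterparse(path, events=('end',)):
            if elem.tag.endswith('}page'):
                ns_val, title = None, None
                for child in elem:
                    if child.tag.endswith('}ns'):
                        ns_val = child.text
                    elif child.tag.endswith('}title'):
                        title = child.text
                # Namespace 0 is the main namespace, i.e. actual dictionary entries.
                if ns_val == '0' and title:
                    yield title
                elem.clear()  # free memory as we go

    with open('wordlist.txt', 'w', encoding='utf-8') as out:
        for t in iter_titles(DUMP):
            out.write(t + '\n')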

Is there a more efficient way to find the most common n-grams?

I'm trying to find the k most common n-grams in a large corpus. I've seen lots of places suggesting the naïve approach: simply scanning through the entire corpus and keeping a dictionary of the counts of all n-grams. Is there a better way to do this? In Python, using NLTK:

    $ wget http://norvig.com/big.txt
    $ python
    >>> from collections import Counter
    >>> from nltk import ngrams
    >>> bigtxt = open('big.txt').read()
    >>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
    >>> ngram_counts.most_common(10)
    [(('of', 'the'), 12422), (('in', 'the')...
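
One incremental improvement over the snippet above (a sketch of the same counting idea, not taken from the original answer) is to stream tokens instead of reading the whole corpus into memory at once; note the Counter of n-grams itself still has to fit in memory:

    from collections import Counter, deque

    def stream_ngrams(path, n):
        # Slide a fixed-size window over the token stream; the deque persists
        # across lines, so n-grams spanning line breaks are still counted.
        window = deque(maxlen=n)
        with open(path) as f:
            for line in f:
                for token in line.split():
                    window.append(token)
                    if len(window) == n:
                        yield tuple(window)

    ngram_counts = Counter(stream_ngrams('big.txt', 2))
    print(ngram_counts.most_common(10))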

Methods of calculating text string similarity?

Let's say I have an array of strings and I need to sort them into clusters. I am currently doing the analysis using n-grams, e.g.:

Cluster 1:
Pipe fixing
Pipe fixing in Las Vegas
Movies about Pipe fixing

Cluster 2:
Classical music
Why classical music is great
What is classical music

etc. Let's say within this array I have these two strings of text (among others):

Japanese students
Students from Japan

Now, the N-gram method will obviously not put these two strings together, as they do not share the same tokenized structure. I tried using Damerau-Levenshtein distance calculation and TF/ID...
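
As one concrete illustration of a similarity measure beyond whole-word n-gram overlap (a sketch only; scikit-learn, the character n-gram ranges, and the example phrases are my own choices, not from the question), character n-gram TF-IDF vectors with cosine similarity tend to score paraphrases like the two "Japan" strings as close:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    phrases = ['Pipe fixing', 'Pipe fixing in Las Vegas', 'Movies about Pipe fixing',
               'Classical music', 'Why classical music is great', 'What is classical music',
               'Japanese students', 'Students from Japan']

    # Character n-grams are more forgiving of word order and inflection than word n-grams.
    vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
    X = vec.fit_transform(phrases)

    sim = cosine_similarity(X)
    print(sim[6, 7])  # similarity between 'Japanese students' and 'Students from Japan'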

How to combine the strengths of PCFG (sentence structure) and n-gram models (lexical co-occurrence)?

How to combine the strengths of PCFG (sentence structure) and n-gram models (lexical co-occurrence)? Have a look at Dan Klein's paper. [XXX]

Stanford CoreNLP: top K n-grams with counts

How do I use Stanford CoreNLP to get the top K n-grams with their counts? I know I can write this code using a HashMap or a Trie, but my corpus is pretty large (200K articles, each around 30KB on average) and I want 5-grams, so the memory requirement will be huge. Thus I was wondering if I can use CoreNLP for this purpose. Given a corpus, it should return only the top K n-grams in this format:

    word1 word2 word3 word4 word5 : frequency

I don't want any probabilistic model. CoreNLP doesn't have anything to help you store n-grams efficiently. All it could help you with here would be tokenizing the tex...
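
As a rough illustration of the counting step itself (in Python rather than CoreNLP, with a whitespace split standing in for a real tokenizer; the one-article-per-file layout is an assumption), the requested top-K output format could be produced like this, keeping in mind the full 5-gram Counter still has to fit in memory:

    from collections import Counter
    import glob

    N, K = 5, 100
    counts = Counter()

    # Assumed layout: one plain-text article per file under articles/.
    for path in glob.glob('articles/*.txt'):
        with open(path, encoding='utf-8') as f:
            tokens = f.read().split()  # stand-in for a real tokenizer
        # zip over N shifted copies of the token list yields the N-grams.
        counts.update(zip(*(tokens[i:] for i in range(N))))

    for gram, freq in counts.most_common(K):
        print(' '.join(gram), ':', freq)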

word2vec: how to predict the most likely words and sentences?

Word2vec by Google has been shown to be powerful in NLP tasks. I am quite new to the tool and unclear about what exactly it can do. Say we have a sentence: "I will go to New York this weekend." Based on this sentence, we have many options to transform it. For example:

1) "I will go to New York with my Dad this weekend."
2) "I will go to LA this weekend."
3) "I will not go to New York this weekend."

What I want from word2vec is this: given the basic sentence, how can I predict one of its transformations to be the most likely one? Or, given one transformation, we can calculate the ...
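
As one rough illustration (my own sketch, not from the answer; it assumes you already have a word-to-vector mapping, e.g. gensim KeyedVectors loaded from a pretrained word2vec file), candidate transformations can be ranked by the cosine similarity of averaged word vectors. Note this measures semantic closeness to the base sentence, not true likelihood, since word2vec by itself is not a language model:

    import numpy as np

    def sentence_vector(sentence, vectors):
        # Average the vectors of the words we have embeddings for.
        words = [w for w in sentence.lower().split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0)

    def rank_candidates(base, candidates, vectors):
        base_v = sentence_vector(base, vectors)
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        return sorted(candidates,
                      key=lambda c: cos(base_v, sentence_vector(c, vectors)),
                      reverse=True)

    # 'vectors' is assumed to be a dict-like word -> numpy array mapping.
    # ranked = rank_candidates("I will go to New York this weekend.",
    #                          [sent1, sent2, sent3], vectors)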

Is TF-IDF necessary when using SVM?

I'm using Support Vector Machines to classify phrases. Before using the SVM, I understand I should do some kind of normalization on the phrase vectors. One popular method is TF-IDF. The terms with the highest TF-IDF score are often the terms that best characterize the topic of the document. But isn't that exactly what an SVM does anyway, giving the highest weight to the terms that best characterize the document? Thanks in advance :-) The weight of a term (as assigned by an SVM classifier) may or may not be directly proportional to the relevance of that term to a particular cla...
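
To see the two weighting schemes side by side in practice (a sketch; scikit-learn, the toy phrases, and the cross-validation setup are my own choices, not from the question or answer), you can swap raw counts for TF-IDF in front of a linear SVM and compare held-out accuracy:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Toy data; in practice use your own labeled phrases.
    phrases = ['cheap pipe fixing', 'pipe fixing service', 'fix my pipe today',
               'emergency pipe repair', 'pipe repair quote', 'broken pipe help',
               'classical music concert', 'great classical music', 'what is classical music',
               'classical music radio', 'music history lecture', 'symphony music tonight']
    labels = ['plumbing'] * 6 + ['music'] * 6

    for vec in (CountVectorizer(), TfidfVectorizer()):
        clf = make_pipeline(vec, LinearSVC())
        scores = cross_val_score(clf, phrases, labels, cv=3)
        print(type(vec).__name__, scores.mean())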

Applied NLP: how to score a document against a lexicon of multi-word terms?

This is probably a fairly basic NLP question, but I have the following task at hand: I have a collection of text documents that I need to score against an (English) lexicon of terms that could be 1, 2, 3, up to N words long. N is bounded by some "reasonable" number, but the distribution of the various terms in the dictionary across the values n = 1, ..., N might be fairly uniform. This lexicon can, for example, contain a list of devices of a certain type, and I want to see if a given document is likely about any of these devices. So I would want to score a document high(er) if it...
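
A minimal sketch of one possible scoring scheme (my own illustration, not from the question; it simply counts lexicon hits among the document's n-grams, weighting longer matched terms by their length):

    def score_document(text, lexicon, max_n):
        # lexicon: a set of lowercase terms, each 1..max_n words long.
        tokens = text.lower().split()
        score = 0
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                candidate = ' '.join(tokens[i:i + n])
                if candidate in lexicon:
                    score += n  # longer matched terms count for more
        return score

    lexicon = {'router', 'wireless router', 'usb hub', 'network interface card'}
    print(score_document('I bought a new wireless router and a usb hub', lexicon, 3))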

Concept extraction from a Stanford parse tree (NLP)

I am an NLP research beginner. I want to extract concepts from text. For example, for "The Thing Things albums" the concept is "The Thing Things". I am using the parse tree to extract noun phrases, but in this example the tree extracts "The Thing Things" and "albums". Another example: for "Who started the handset alliance?" I expected "handset alliance", but the noun phrase is "the handset alliance". How can I solve this? Your definition of "concepts" is not perfectly clear to me, but the Illinois Shallow Parser might be useful for this. See a demo here: http://cogcomp.cs.illinois.edu/demo...

How to split an NLP parse tree to clauses (independent and subordinate)?

Given an NLP parse tree like:

    (ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))

The original sentence is "You could say that they regularly catch a shower, which adds to their exhilaration and joie de vivre." How could the clauses be extracted and reverse-engineered? We would be splitting at S and SBAR (to preserve the ty...
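
A rough sketch of the splitting step (my own illustration, using NLTK's Tree class on the bracketed string above; the S/SBAR criterion is the one mentioned in the question):

    from nltk import Tree

    parse_str = """
    (ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that)
      (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch)
        (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which))
          (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration))
            (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))
    """

    tree = Tree.fromstring(parse_str)

    # Every S and SBAR subtree corresponds to a clause; join its leaves back into text.
    for sub in tree.subtrees(filter=lambda t: t.label() in ('S', 'SBAR')):
        print(' '.join(sub.leaves()))

Note that nested clauses are printed more than once (each outer S also contains its embedded SBARs), so you may still want to subtract child clauses from their parents depending on how you define the split.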

How to add word-level features to Mallet SimpleTagger?

I have been going through this blog post which contains a SimpleTagger example. It says: given an input file "sample" as follows:

    CAPITAL Bill noun
    slept non-noun
    here non-noun

where all but the last token on each line is a binary feature, and the last token on the line is the label name. So, how do I add word-level features here? For example: the number of syllables in the word, the length of the word, etc. Everything before the last token is treated as a feature. You should be able to add arbitrary features before this:

    CAP SYL1 CHAR4 Bill noun
    SYL3 CHAR9 respond...
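
A small sketch of generating such feature lines (in Python; the crude vowel-group syllable estimate and the SYLn/CHARn feature names are just illustrative, not anything Mallet requires):

    import re

    def syllables(word):
        # Very rough syllable estimate: count groups of consecutive vowels.
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

    def feature_line(word, label):
        feats = []
        if word[0].isupper():
            feats.append('CAP')
        feats.append('SYL%d' % syllables(word))
        feats.append('CHAR%d' % len(word))
        feats.append(word)    # the word itself is just another feature
        feats.append(label)   # the label must come last on the line
        return ' '.join(feats)

    print(feature_line('Bill', 'noun'))        # CAP SYL1 CHAR4 Bill noun
    print(feature_line('responded', 'non-noun'))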

Request timeout in API.AI

I have an API.ai agent that sends a request (coming from the user) to a webhook which needs a lot of processing (more than 5 seconds) to produce the answer. As far as I know, there is no way to increase the response timeout in API.ai. So, I have created two intents. The first one simply calls my webhook to start processing the result, and at the same time the webhook replies to the user, "Your request is under processing...". The second intent has an event and an action; the purpose of the new event is just to display the result to the user. Once the result is ready, my back...
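
A rough sketch of the first webhook's shape (my own illustration using Flask; the payload field names, the response keys, and the follow-up mechanism are assumptions, since the exact API.ai request/response format isn't shown in the question):

    import threading
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def long_running_job(session_id, query):
        # Do the slow work here, then trigger the second intent's event
        # to deliver the final answer back to the user.
        pass

    @app.route('/webhook', methods=['POST'])
    def webhook():
        payload = request.get_json(force=True)
        session_id = payload.get('sessionId')                     # assumed field name
        query = payload.get('result', {}).get('resolvedQuery')    # assumed field name
        # Kick off the slow processing without blocking the response.
        threading.Thread(target=long_running_job, args=(session_id, query)).start()
        # Reply immediately so API.ai does not hit its ~5 second timeout.
        return jsonify({'speech': 'Your request is under processing...',
                        'displayText': 'Your request is under processing...'})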

What are the "-P"s in the Berkeley Aligner's output format?

I want to use the Berkeley Aligner for some MT research I'm doing, since, apparently, it beats GIZA++ pretty handily (a 32% alignment error reduction in some reported results). For the most part the outputs in the Berkeley Aligner "examples" directory look like what Moses does to GIZA++ output files (i.e., paired aligned word indices), but there are some funny looking "-P"s after certain pairs. I can't for the life of me find any documentation of what these "-P" annotations are supposed to signify (certainly not in the Berkeley Aligner "documentation" directory).For clar...

IOB format as output of parser

Is there any function in Stanford NLP or OpenNLP to get the output of parsing in IOB format? I need to use a parser for NP chunking of sentences. If you only need NP chunks, use the OpenNLP chunker instead of a parser. It sounds like it might help you to read more about the differences between chunking and parsing, for example in the NLTK docs on partial parsing. Although you could extract NPs from the output of a parser if you wanted, a normal parse couldn't be represented in IOB format or converted to IOB format. [XXX]
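
As an aside (my own sketch, using NLTK rather than OpenNLP, since the answer points at NLTK's partial-parsing docs), an NP chunk tree can be turned into IOB triples like this:

    import nltk
    from nltk.chunk import RegexpParser, tree2conlltags

    # Requires the punkt and averaged_perceptron_tagger NLTK data packages.
    sentence = 'The quick brown fox jumped over the lazy dog'
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # A toy NP grammar: optional determiner, any adjectives, then one or more nouns.
    chunker = RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')
    tree = chunker.parse(tagged)

    # tree2conlltags yields (word, POS, IOB-chunk-tag) triples.
    for word, pos, iob in tree2conlltags(tree):
        print(word, pos, iob)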

NLP POS Tree understanding

How does the tree get formed in NLP parts-of-speech tagging? What is the algorithm behind this?

    (S (NP Alice) (VP (V chased) (NP (Det the) (N rabbit))))

For instance, how can Det "the" and N "rabbit" become grouped under NP? What is the algorithm behind tree formation and the aggregation of nodes? What you mean here is basically called parsing, not POS tagging. POS tagging only cares about assigning the right POS tag to a word (i.e., DT to 'the' or NN to 'dog'). In parsing, this information is used to parse a sentence. There are dependency parsers and co...
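
To make the distinction concrete (a sketch with a toy grammar of my own; real parsers learn their rules from treebanks rather than having them hand-written), a tiny CFG plus a chart parser reproduces the tree above:

    import nltk

    # A toy grammar that covers exactly this sentence.
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> 'Alice' | Det N
    VP -> V NP
    Det -> 'the'
    N -> 'rabbit'
    V -> 'chased'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(['Alice', 'chased', 'the', 'rabbit']):
        print(tree)   # (S (NP Alice) (VP (V chased) (NP (Det the) (N rabbit))))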

Writing code for "A Neural Probabilistic Language Model" (Bengio, 2003): not able to understand the model

I'm trying to write code for A Neural Probabilistic Language Model (Yoshua Bengio, 2003), but I'm not able to understand the connections between the input layer and the projection matrix, and between the projection matrix and the hidden layer. I'm not able to get how exactly the learning of the word-vector representation takes place. Have a look at this answer here. It explains the difference between the hidden layer and the projection layer, referring to this thesis. Also, do read this paper by Tomas Mikolov and go through this tutorial. This will really improve your understanding. Hope...
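
A bare-bones sketch of the forward pass (numpy, with made-up sizes; this is my own illustration of the projection step, not code from the paper, and it omits the paper's optional direct connections and bias terms): each of the n-1 context words indexes a row of the shared projection matrix C, the rows are concatenated, and that concatenation feeds a tanh hidden layer and the output softmax.

    import numpy as np

    V, m, h, context = 10000, 50, 100, 3   # vocab size, embedding dim, hidden units, n-1

    rng = np.random.default_rng(0)
    C = rng.normal(0, 0.1, (V, m))            # projection matrix: one m-dim vector per word
    H = rng.normal(0, 0.1, (h, context * m))  # projection -> hidden weights
    U = rng.normal(0, 0.1, (V, h))            # hidden -> output weights

    def forward(word_ids):
        # "Projection" is just a table lookup: no multiplication, only row selection.
        x = np.concatenate([C[i] for i in word_ids])   # shape (context*m,)
        hidden = np.tanh(H @ x)                        # shape (h,)
        logits = U @ hidden                            # shape (V,)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()                     # P(next word | context)

    p = forward([12, 7, 420])   # three context word indices
    print(p.shape, p.sum())

The word vectors are the rows of C; they are learned because gradients from the prediction loss flow back through the concatenation into exactly the rows that were looked up.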
