data mining - postgresql phrase extraction & ranking


Keywords:postgresql 


Question: 

From selected rows in a table, how can one extract and rank phrases based on how often they occur?

example 1:

example 2:

INPUT:
CREATE TABLE phrases (
    id  BIGSERIAL,
phrase VARCHAR(10000)
);

INSERT INTO phrases (phrase) VALUES (‘Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.’)
INSERT INTO phrases (phrase) VALUES (‘Andrea Bolgi was an italian sculptor’)

DESIRED OUTPUT:
phrase | weight
italian sculptor  |  5
virgin mary | 2
painters | 1
renaissance | 1
inspiration | 1
Andrea Bolgi | 1

To find just words, not phrases, one could use

SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', phrase) FROM phrases')
ORDER BY nentry DESC, ndoc DESC, word;

Some notes:

  • phrases could contain “stop words”, e.g. “easy to answer”
  • ideally, english language variations and synonyms would be automatically grouped.

Could pg_trgm help? (it’s ok if only 2 and 3 word phrases are found). How exactly?

Related questions:


1 Answer: 

I agree with Craig that this is certainly way beyond the scope of what tsearch2 was intended to do as well as any other existing PostgreSQL tools. However, I do think that this might not be too bad to do in the db engine. One of the strengths of PostgreSQL is programmability and this strength gives you some very underutilized options.

As Craig notes, this is the domain of natural language processing, not of SQL per se, so the first thing you want to do is settle on a natural language processing toolkit with support for a stored procedure language that PostgreSQL supports. In other words, you want something that supports Perl, Python, C, etc. Whatever PostgreSQL supports and you feel comfortable working in.

The second step is to create functional interfaces for this toolkit in stored procedure languages. This should take text in, and output the phrase breakdown in some sort of type PostgreSQL can handle reasonably well. You need to pay attention to the type carefully because that affects things like GIN indexing.

From there you can incorporate it into your database interfaces and queries.