From selected rows in a table, how can one extract and rank phrases based on how often they occur?
INPUT: CREATE TABLE phrases ( id BIGSERIAL, phrase VARCHAR(10000) ); INSERT INTO phrases (phrase) VALUES (‘Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.’) INSERT INTO phrases (phrase) VALUES (‘Andrea Bolgi was an italian sculptor’) DESIRED OUTPUT: phrase | weight italian sculptor | 5 virgin mary | 2 painters | 1 renaissance | 1 inspiration | 1 Andrea Bolgi | 1
To find just words, not phrases, one could use
SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', phrase) FROM phrases') ORDER BY nentry DESC, ndoc DESC, word;
- phrases could contain “stop words”, e.g. “easy to answer”
- ideally, english language variations and synonyms would be automatically grouped.
Could pg_trgm help? (it’s ok if only 2 and 3 word phrases are found). How exactly?