python - Finding Patterns and Useful Information From Survey Text [xlsx]



I want to find patterns and extract useful information from a large amount of survey data. The data is stored in an .xlsx spreadsheet with 4 columns corresponding to particular questions, with each row filled with the text responses from one respondent.

How can I use python and openpyxl to extract patterns from the data, such as frequency of words or phrases, connections between answers across the four questions, or anything else I should look for?

I have limited experience in data/text mining, so if there is some documentation, useful tutorials, or another StackOverflow question I should look at, please let me know. I did a fair amount of searching here and elsewhere, but haven't found what I'm looking for.

So far I have taken a shot at word frequency based on the survey question, but it has proved difficult to navigate the openpyxl documentation to do something like this. Is there an easy way to do this in python?

Sample array [600x4]:

    [['this is an example of an answer to question 1 by respondent 1', 'answer to Q2 by R1', 'ans to Q3 by R1', 'ans to Q4 by R1'],
     ['ans to Q1 by R2', 'ans to Q2 by R2', 'ans to Q3 by R2', 'ans to Q4 by R2'],
     [etc, etc, etc, etc...]]

1 Answer: 

The Excel file format is not particularly suited to this kind of task. You would do much better to copy the data out of the file and into a tool built for text analysis, such as a relational database with full-text search or a specialised text engine.
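As a rough illustration of the database route, here is a minimal sketch that uses Python's built-in `sqlite3` module. The table name `responses`, the column names `q1` to `q4`, and the helper `respondents_mentioning` are made up for this example, and the rows are the sample data from the question. Note that it uses a plain `LIKE` substring match as a stand-in, which is not real full-text search; a proper full-text index (for example SQLite's FTS5 extension, where your build includes it) would be the next step.

```python
import sqlite3

# Sample rows from the question; in practice these would come from the spreadsheet.
rows = [
    ("this is an example of an answer to question 1 by respondent 1",
     "answer to Q2 by R1", "ans to Q3 by R1", "ans to Q4 by R1"),
    ("ans to Q1 by R2", "ans to Q2 by R2", "ans to Q3 by R2", "ans to Q4 by R2"),
]

# Hypothetical schema: one row per respondent, one text column per question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses ("
    "id INTEGER PRIMARY KEY, q1 TEXT, q2 TEXT, q3 TEXT, q4 TEXT)"
)
conn.executemany(
    "INSERT INTO responses (q1, q2, q3, q4) VALUES (?, ?, ?, ?)", rows
)

ALLOWED_COLUMNS = {"q1", "q2", "q3", "q4"}


def respondents_mentioning(term, column="q1"):
    """Return ids of respondents whose answer in `column` contains `term`.

    Uses a LIKE substring match, not a full-text index.
    """
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    # The column name is checked against a whitelist above; the term is bound
    # as a parameter so it is never spliced into the SQL text.
    sql = f"SELECT id FROM responses WHERE {column} LIKE ? ORDER BY id"
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]


matches = respondents_mentioning("example")
```

Once the answers are in a table like this, connections across questions (for instance, respondents who mention one term in Q1 and another in Q3) become ordinary SQL queries.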

openpyxl is a library designed for manipulating Excel files, so in this case it can help you extract the data and pass it to another application.
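To show what "extract and pass on" might look like, here is a minimal sketch that reads the rows with openpyxl and hands them to a simple word counter built on `collections.Counter`. The function names `load_rows` and `word_frequencies` are made up for this example, and the sketch assumes the answers are on the first worksheet with one header row; adjust `header_rows` if your file differs. The tokenising regex is deliberately crude.

```python
import re
from collections import Counter


def load_rows(path, header_rows=1):
    """Read the active worksheet and return its rows as tuples of cell values.

    Assumes `header_rows` leading rows are headers and should be skipped.
    """
    # Imported here so the counting code below can run without openpyxl.
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True)
    ws = wb.active
    rows = list(ws.iter_rows(min_row=header_rows + 1, values_only=True))
    wb.close()
    return rows


def word_frequencies(rows, n_columns=4):
    """Return one Counter per question column, counting lowercased words."""
    counters = [Counter() for _ in range(n_columns)]
    for row in rows:
        for counter, cell in zip(counters, row):
            if cell is None:  # empty cell: respondent skipped the question
                continue
            counter.update(re.findall(r"[a-z0-9']+", str(cell).lower()))
    return counters


# Sample rows from the question, standing in for load_rows("survey.xlsx").
sample = [
    ("this is an example of an answer to question 1 by respondent 1",
     "answer to Q2 by R1", "ans to Q3 by R1", "ans to Q4 by R1"),
    ("ans to Q1 by R2", "ans to Q2 by R2", "ans to Q3 by R2", "ans to Q4 by R2"),
]
counts = word_frequencies(sample)
print(counts[0].most_common(5))
```

With a real file you would call `word_frequencies(load_rows("survey.xlsx"))`, where the filename is a placeholder. For anything beyond raw counts (stop-word removal, phrases, stemming) a dedicated text library is likely a better fit than extending this by hand.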