I am a biologist trying to validate an experiment. In my experiment, I have found 71 mutations after a particular treatment. To determine if these mutations are truly due to my treatment, I want to compare them to a set of randomly generated mutations. It was suggested to me that I might try to generate a million sets of 71 random mutations for statistical comparison.
To start with, I have a dataframe with the 7000 genes in the genome of interest. I know their start and end positions. The first five rows of the dataframe look like this:
```
  transcript_id  protein_id  start   end                               kogClass
0         g2.t1      695054      1  1999  Replication, recombination and repair
1         g3.t1      630170   2000  3056       General function prediction only
2         g5.t1      695056   3057  4087         Signal transduction mechanisms
3         g6.t1      671982   4088  5183                                    N/A
4         g7.t1      671985   5184  8001       Chromatin structure and dynamics
```
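In case anyone wants to reproduce this locally, here is the sample above rebuilt as a dataframe (values copied straight from the table; the real table has ~7000 rows):

```python
import pandas as pd

# first five rows of the gene table, rebuilt from the sample above
df = pd.DataFrame({
    'transcript_id': ['g2.t1', 'g3.t1', 'g5.t1', 'g6.t1', 'g7.t1'],
    'protein_id': [695054, 630170, 695056, 671982, 671985],
    'start': [1, 2000, 3057, 4088, 5184],
    'end': [1999, 3056, 4087, 5183, 8001],
    'kogClass': [
        'Replication, recombination and repair',
        'General function prediction only',
        'Signal transduction mechanisms',
        'N/A',
        'Chromatin structure and dynamics',
    ],
})
```

Note that the transcripts tile the coordinate space contiguously: each `start` is the previous `end + 1`.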
Now about the million sets of 71 random mutations: I have written a function that I call a million times, and it is far too slow: after 4 hours it was only a tenth of the way through. Here is my code. If anyone can suggest a way to speed things up I would owe you a beer! And my appreciation.
```python
from random import randint

def get_71_random_genes(df, outfile):
    # how many nucleotides are there in all transcripts?
    end_pos_last_gene = df.iloc[-1, 3]
    # this loop will go 71 times
    for i in range(71):
        # generate a number from 1 to the end of all transcripts
        random_number = randint(1, end_pos_last_gene)
        # boolean condition: which gene does the random position fall within?
        mask = (df['start'] <= random_number) & (df['end'] >= random_number)
        # collect the rows that match
        data = df.loc[mask]
        # append to file (without mode='a' each iteration would overwrite the last)
        data.to_csv(outfile, sep='\t', index=False, header=False, mode='a')
```
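For context, here is a vectorized sketch of the same lookup that I have been considering, in case it helps frame the question. It assumes (as in my table) that the transcripts tile the coordinate space contiguously, so `np.searchsorted` on the sorted `end` column finds the containing gene for every random position in one call, instead of one boolean scan per draw; the function name and `seed` parameter are just illustrative:

```python
import numpy as np
import pandas as pd

def get_random_gene_sets(df, n_sets=1_000_000, set_size=71, seed=None):
    """Draw n_sets sets of set_size random positions and map each position
    to the gene it falls in with a single vectorized lookup."""
    rng = np.random.default_rng(seed)
    total_len = df['end'].iloc[-1]
    # positions in [1, total_len], one row per set (high is exclusive, hence +1)
    positions = rng.integers(1, total_len + 1, size=(n_sets, set_size))
    # genes are contiguous and 'end' is sorted, so searchsorted gives the
    # index of the gene whose interval contains each position
    gene_idx = np.searchsorted(df['end'].to_numpy(), positions, side='left')
    return gene_idx  # shape (n_sets, set_size), row indices into df
```

The gene rows for one set could then be fetched with `df.iloc[gene_idx[i]]`, or the indices used directly to count per-gene hits, which avoids writing a million CSVs.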