
Merging Dataframes On Multiple Conditions - Not Specifically On Equal Values

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already. I am trying to join (merge) together two dat

Solution 1:

I've just thought of a way to solve this - by combining my two methods:

First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This approach needs no SQL queries. I've also included a step that immediately discards redundant genes, i.e. genes that cannot have any SNPs falling within their range. It uses a nested for-loop, which I normally try to avoid, but in this case it works quite well.

import pandas as pd

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes   = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes: keep only genes whose range
    # overlaps [min_bp, max_bp], the span of SNPs on this chromosome.
    # The bounds are inclusive so they match the SNP test below.
    min_bp     = this_chr_snp['BP'].min()
    max_bp     = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[(this_genes['chr_start'] <= max_bp) &
                                (this_genes['chr_stop'] >= min_bp)]

    for _, info in this_genes.iterrows():
        # SNPs whose position lies inside this gene (bounds inclusive)
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if not this_snp.empty:
            # assign() returns a new frame, so the slice is never modified in place
            this_snp = this_snp[['SNP']].assign(feature_id=info['feature_id'])
            all_dfs.append(this_snp)

# Note: pd.concat raises ValueError if no SNP fell inside any gene
all_genic_snps = pd.concat(all_dfs)

While this doesn't run spectacularly quickly, it does finish, so I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently, though.
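One possible way to drop the inner loop over genes is to sort the SNP positions on each chromosome once and locate every gene's range with `numpy.searchsorted`. The following is only a sketch under the assumption that the column names match the code above (`chromosome`, `SNP`, `BP`, `chr_start`, `chr_stop`, `feature_id`). The function name `genic_snps_searchsorted` and the small demo frames are made up for illustration, and I have not timed it against real data.

```python
import numpy as np
import pandas as pd


def genic_snps_searchsorted(snp_df, gene_df):
    """Pair each SNP with every gene whose [chr_start, chr_stop] contains its BP."""
    pieces = []
    for chromosome, snps in snp_df.groupby('chromosome'):
        genes = gene_df.loc[gene_df['chromosome'] == chromosome]
        if genes.empty:
            continue
        snps = snps.sort_values('BP')
        bp = snps['BP'].to_numpy()
        # For each gene, the SNPs at sorted positions lo..hi-1 lie inside it
        lo = np.searchsorted(bp, genes['chr_start'].to_numpy(), side='left')
        hi = np.searchsorted(bp, genes['chr_stop'].to_numpy(), side='right')
        counts = np.maximum(hi - lo, 0)
        total = int(counts.sum())
        if total == 0:
            continue
        # Expand every (lo, hi) range into explicit row positions
        gene_pos = np.repeat(np.arange(len(genes)), counts)
        range_start = np.repeat(np.cumsum(counts) - counts, counts)
        snp_pos = np.repeat(lo, counts) + (np.arange(total) - range_start)
        pieces.append(pd.DataFrame({
            'SNP': snps['SNP'].to_numpy()[snp_pos],
            'feature_id': genes['feature_id'].to_numpy()[gene_pos],
        }))
    if not pieces:
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.concat(pieces, ignore_index=True)


# Hypothetical demo data, not taken from the question
demo_snps = pd.DataFrame({
    'chromosome': [1, 1, 2],
    'SNP': ['rsA', 'rsB', 'rsC'],
    'BP': [150, 500, 150],
})
demo_genes = pd.DataFrame({
    'chromosome': [1, 1, 2],
    'chr_start': [100, 140, 300],
    'chr_stop': [200, 150, 400],
    'feature_id': ['G1', 'G2', 'G3'],
})
demo_result = genic_snps_searchsorted(demo_snps, demo_genes)
```

The rows come out grouped by gene within each chromosome, as in the loop version, but the index is reset, so the two results should be compared on their values rather than on their index.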

Solution 2:

You can use the following to accomplish what you're looking for:

merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) & (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
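Be aware that merging on `chromosome` alone first builds every SNP-gene pair that shares a chromosome, and only then filters, so the intermediate frame can get large. If that becomes a problem, the same merge and filter can be applied one chromosome at a time. This is a sketch of that idea only. The helper name `merge_by_chromosome` and the demo frames are hypothetical.

```python
import pandas as pd


def merge_by_chromosome(snp_df, gene_df):
    """Apply the merge-then-filter step to one chromosome at a time."""
    pieces = []
    for chromosome, snps in snp_df.groupby('chromosome'):
        genes = gene_df.loc[gene_df['chromosome'] == chromosome]
        # All SNP-gene pairs for this chromosome only
        pairs = snps.merge(genes, on='chromosome', how='inner')
        inside = (pairs['BP'] >= pairs['chr_start']) & (pairs['BP'] <= pairs['chr_stop'])
        pieces.append(pairs.loc[inside, ['SNP', 'feature_id']])
    if not pieces:
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.concat(pieces, ignore_index=True)


# Hypothetical demo data, not taken from the question
demo_snps = pd.DataFrame({
    'chromosome': [1, 2],
    'SNP': ['rsA', 'rsB'],
    'BP': [100, 100],
})
demo_genes = pd.DataFrame({
    'chromosome': [1, 2],
    'chr_start': [50, 200],
    'chr_stop': [150, 300],
    'feature_id': ['G1', 'G2'],
})
demo_merged = merge_by_chromosome(demo_snps, demo_genes)
```

Only one chromosome's pairs are held in memory at a time, at the cost of a Python-level loop over chromosomes.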

Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:

snp_df
Out[193]: 
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

gene_df
Out[194]: 
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

merged_df
Out[195]: 
         SNP        feature_id
8  rs3131972  GeneID:100302278
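To try this yourself, the modified dataframes can be rebuilt by hand from the printed output above and passed through the same two steps. This is a minimal sketch. The values are copied from the output shown, and the intermediate name `in_gene` is my own.

```python
import pandas as pd

# Modified example frames, typed in from the printed output
snp_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814', 'rs3115859', 'rs3131956'],
    'BP': [752566, 30400, 753474, 754503, 758144],
})
gene_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'chr_start': [10954, 12190, 14362, 30366, 34611],
    'chr_stop': [11507, 13639, 29370, 30503, 36081],
    'feature_id': ['GeneID:100506145', 'GeneID:100652771', 'GeneID:653635',
                   'GeneID:100302278', 'GeneID:645520'],
})

# Step 1: pair every SNP with every gene on the same chromosome
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')

# Step 2: keep only pairs where the SNP position lies inside the gene
in_gene = (merged_df['BP'] >= merged_df['chr_start']) & (merged_df['BP'] <= merged_df['chr_stop'])
merged_df = merged_df.loc[in_gene, ['SNP', 'feature_id']]

print(merged_df)
```

Only rs3131972 (BP 30400) lies inside a gene range here, namely 30366 to 30503 for GeneID:100302278, which is why a single row survives the filter.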
