This will be a tricky one to explain, but bear with me.
I have a database for example:
Field_1 Field_2 Field_3
a 1 True
c 5 False
d 1 False
a 1 False
b 2 True
And I have a list of values for each field from which I'd like to find the number of present possible permutations:
Field_1_Values = [c, d]
Field_2_Values = [1, 2, 5]
Field_3_Values = [True]
I'd like to get a count of all the possible permutations between different fields (order does not matter), for all the possible permutations lengths, present within the database.
The permutations to check in this example would therefore be:
Length_1: c, d, 1, 2, 5, True
Length_2: c-1, d-1, c-2, d-2, c-5, d-5, c-True, d-True, 1-True, 2-True, 5-True
Length_3: c-1-True, c-2-True, c-5-True, d-1-True, d-2-True, d-5-True,
And the result I'm looking for is the total number of records for each of the different permutation lengths:
Length_1_Total = 9
Length_2_Total = 4
Length_3_Total = 0
Currently, I'm using Python and looping over the lists, i.e.
import psycopg2
import itertools
conn = psycopg2.connect(host=database_host, port=database_port, dbname=database_name, user=database_user, password=database_password)
cursor = conn.cursor()
Field_1_Values = ["field_1:c", "field_1:d"]
Field_2_Values = ["field_2:1", "field_2:2", "field_2:5"]
Field_3_Values = ["field_3:True"]
Length_1_Total = 0
for combo in itertools.chain(Field_1_Values,Field_2_Values,Field_3_Values):
query = "select exists(select 1 from dataset where " + combo.split(":")[0] + " = %s)"
cursor.execute(query, [combo.split(":")[1]])
Length_1_Total += int([row[0] for row in cursor.fetchall()][0])
Length_2_Total = 0
for combo in itertools.chain(itertools.product(Field_1_Values, Field_2_Values), itertools.product(Field_1_Values, Field_3_Values), itertools.product(Field_2_Values, Field_3_Values)):
query = "select exists(select 1 from dataset where " + combo[0].split(":")[0] + " = %s and " + combo[1].split(":")[0] + " = %s)"
cursor.execute(query, [combo[0].split(":")[1], combo[1].split(":")[1]])
Length_2_Total += int([row[0] for row in cursor.fetchall()][0])
Length_3_Total = 0
for combo in itertools.chain(itertools.product(Field_1_Values, Field_2_Values, Field_3_Values)):
query = "select exists(select 1 from dataset where " + combo[0].split(":")[0] + " = %s and " + combo[1].split(":")[0] + " = %s and " + combo[2].split(":")[0] + " = %s)"
cursor.execute(query, [combo[0].split(":")[1], combo[1].split(":")[1], combo[2].split(":")[1]])
Length_3_Total += int([row[0] for row in cursor.fetchall()][0])
In the above code -
itertools.product
gives me the possible permutations between the lists that I pass it.
itertools.chain
allows me to concatenate the generators such that the permutations are created as the loop runs rather than storing them all in memory.
Now in reality I have 4000 records with 5 fields in the dataset. I also perform the above process 50 times for different lists of values (as these lists are the result of some prior work and I wish to see the effect of the prior work on the possible permutation result set).
This code then becomes extremely slow and so I was wondering if there is a way to get the database to do all the work directly in a single query, perhaps in which I can benefit from some database optimisations.
My Ideas:
I did think to query for the distinct combinations of values from the database, and then check for the intersection between this result set and my length-1 permutations to be checked in my Python code, i.e. for length-1 permutations:
query = """select distinct(Field_1)
union all
select distinct(Field_2)
union all
select distinct(Field_3)"""
cursor.execute(query)
Present_Length_1_Permutations = set([row[0] for row in cursor.fetchall()])
Length_1_Permutations_To_Check = set([c, d, 1, 2, 5, True])
Matching_Permutations = Length_1_Permutations_To_Check.intersection(Present_Length_1_Permutations)
Length_1_Total = len(Matching_Permutations)
Repeating the process for length-2 and length-3 permutations respectively. However, on my real dataset, I get an 'out of memory error' for some of the longer length permutations as the result set returned by the database query is too large.
Is there a way I can perform the permutation check directly within the database? So that at no point do I return all the distinct permutations to my code?
i.e. something like (I know this is incorrect code, just trying to present my idea), assuming Length_1_Values is my list to check the presence of:
select permutation from (select distinct(Field_1) union all select distinct(Field_2) union all select distinct(Field_3)) inner join permutation from Length_1_Values
Aucun commentaire:
Enregistrer un commentaire