SQL or Programming (vs) Hybrid Approach: How we reduced processing time from 24 hours to 24 minutes

Pradeep Vijayakumar
3 min read · Dec 5, 2020

There is a constant tug-of-war between SQL and programming languages when it comes to extracting insights from data. Even for simple operations such as aggregations or integrity checks, there is always a debate about which is faster: querying the database, or loading the data into a dataframe (in memory) and performing the operation programmatically. Most of the time, the answer is “It depends on the volume of data”.

Well, why not a Hybrid approach!

Scenario: We were dealing with a similar debate (SQL vs. Python) while trying to solve a slowness problem involving a huge volume of transactional data. One of our financial customers had an important legacy application with its master and transactional data stored in a database, and SQL operations were performed on these tables to generate a dashboard report. The process was very slow, taking more than 24 hours, and the delay in this report was making critical decision making a challenge for the leadership.

Initial Approach: We were asked to identify the bottleneck behind the slowness, fix the issue or re-write the whole process (as needed), and make the application consumable. As data engineers, when handed a slow SQL-based operation, our immediate thought was “query optimization”. We were informed that there were 50 queries running behind the scenes in the application. We identified some of the most expensive ones and looked at their execution (query) plans. We created a few missing indexes, included some mandatory join filters, and made additional improvements to the queries to make them run faster. These changes reduced the processing time from 24 hours to almost 15 hours.

Though we had brought down the processing time, we were still not satisfied with the improvement, because the end user still had to wait 15 hours to view the dashboard, and we wanted to explore more options. That is when we decided, rather than tuning the existing queries further, to look at how the application was running these queries behind the scenes. We found a major bottleneck: the application ran the queries sequentially, as part of a single transaction. So the immediate step was to identify whether there were dependencies among the 50 queries, i.e. whether a query consumed or depended on the results of another query. Though there were a few dependencies, most of the queries were independent.
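As a rough illustration of that dependency check (the query names and dependency map below are hypothetical, not from the actual project), one can record which queries consume another query’s output and separate the independent ones:

```python
# Hypothetical dependency map: query name -> set of queries it depends on.
deps = {
    "Query1": set(),
    "Query2": set(),
    "Query3": {"Query1"},  # Query3 consumes results produced by Query1
}

# Queries with no dependencies can safely run in parallel;
# dependent ones must stay in the same batch (or be combined).
independent = [q for q, d in deps.items() if not d]
dependent = [q for q, d in deps.items() if d]
```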

Solution: Hybrid Approach

a) Since there were few dependencies between the queries, we had the option of running them in parallel wherever possible. That is when we adopted a Hybrid approach using SQL and Python.

b) We grouped the queries into multiple smaller batches, i.e. we broke the 50 queries into 5 batches of 10 queries each. Dependent queries (if any) were placed in the same batch or combined into one query.
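As a sketch of this grouping step (the helper and query names are our own, for illustration), a mapping of independent queries can be sliced into fixed-size batches:

```python
def make_batches(queries, batch_size=10):
    """Split a mapping of {name: sql} into fixed-size batches (dicts)."""
    items = list(queries.items())
    return [dict(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]

# Example: 50 queries -> 5 batches of 10 queries each
all_queries = {f"Query{i}": f"-- SQL for query {i}" for i in range(1, 51)}
batches = make_batches(all_queries)
```

In the real project the grouping was done by hand so that dependent queries landed in the same batch; a helper like this only covers the independent ones.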

c) The next step was to create dictionary objects in Python (e.g. Dict_batch1, Dict_batch2, Dict_batch3), similar to the example syntax below, to hold each batch of queries. In the dictionary, each query key (e.g. Query1, Query2) holds the corresponding SQL code as its value.

Dict_batch1 = {
    "Query1": "Update table1 Set column1 = 'xyz' where cond1 = 'abc';",
    "Query2": "…",
    "Query3": "…",
    …
    "Query10": "…"
}

d) Once the 5 dictionary objects were created, the next step was to invoke them in parallel using a Python package. We used “multiprocessing”; the invocation syntax is given below for your reference. It loops through the queries in the dictionary, kicks off each execution, and moves on to the next query. It does not wait for a query to complete before proceeding to the next one in the dictionary, i.e. it is not sequential as it would be inside a single DB transaction.

import multiprocessing

processLists = []
for name, text in Dict_batch1.items():
    # Start each query in its own process; don't wait for it to finish
    process = multiprocessing.Process(target=execute_query, args=(engine, text))
    process.start()
    processLists.append(process)
# Wait for every query in the batch to complete
for proc in processLists:
    proc.join()
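The execute_query worker itself is not shown above. A minimal self-contained sketch might look like the following; note that the stdlib sqlite3 module stands in here for the actual database engine (which the article does not name), and the function names are our own. Each process opens its own connection, since database connections generally cannot be shared safely across processes:

```python
import multiprocessing
import sqlite3


def execute_query(db_path, sql):
    """Run one SQL statement, opening a fresh connection in this process."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        conn.execute(sql)
        conn.commit()
    finally:
        conn.close()


def run_batch(batch, db_path):
    """Start every query in the batch in parallel, then wait for all of them."""
    processes = []
    for name, sql in batch.items():
        p = multiprocessing.Process(target=execute_query, args=(db_path, sql))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

Passing a path (or connection string) instead of a live connection object is a deliberate choice: arguments to multiprocessing.Process must be picklable, and most DB connection objects are not.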

Conclusion: The debate between SQL and programming languages will continue, with people vouching for one over the other. But when we combine the intelligence of both worlds (query optimization and parallelism), the performance improvement obtained is substantial. Using the Hybrid approach above, we were able to complete the entire execution (all 50 queries) in under 24 minutes. Another learning from this project: rather than trying to solve the problem by tuning within the current design or architecture, it helps to step back, look at the design, and re-work the flow to bring significant improvements.
