Solving Reporting Problems Simply With A Little SQL
[I originally wrote this for engineering.beenverified.com. Just bringing it over here.]
I am always satisfied when I am able to solve a complex problem using SQL. This is because SQL allows breaking down problems much easier than writing a program. Being based on relational algebra helps, in that you can reduce your problem as a relationship between different sets of data.
At one point in my career, I was writing a lot of SQL. Since writing reports is not much of a thrill, I made it worthwhile by stretching my imagination on how to solve problems with it. Even today, I was impressed what I came up with; the data requests sound intimidating, but if it were a test, I’d at least get a A- (I didn’t care much about how the queries performed.)
I am not writing much SQL at this point. However, I know that the tool is there at my disposal when I really need to solve thorny data requests. I am writing this because I wanted to address one such scenario, with the hope that people can see what I see when it comes to the expressive power of SQL.
The problem that brought me to write about this.
Basically, if I make a phone call, I record when it starts and when it ends. I need to find how many calls happen concurrently, and in particular the maximum number of calls that happen concurrently.
Finding out how many calls happen concurrently was the request. I initially freaked out: what does this mean? The day before I started this, however, I came up with an idea on how to start progressing with it. The kernel of a solution involved two sets of time ranges.
PostgreSQL 9.2 introduces range types, which make these types of queries easier. Unfortunately, I had 9.1 installed on the server and on my dev computer. So I had to start off by building a new version of PostgreSQL and then dumping the data in a new database to just start on this project. Of course, nothing comes easy.
I started out by defining what data I am working with. Earlier, I said I wanted to start with two sets of time ranges. For the first iteration, I defined them as:
A. The context of time ranges that I am working off of, B. The time range that happened in context to each item within the first set.
My first solution was for both sets to be the same exact pieces of data. It made sense at first. The logic was that for a call in set A, is there any calls in set B that happen within the A call?
For example, a call lasted 5 minutes between 1pm and 1:05pm. What calls happened during that time?
This can be addressed in psuedo-SQL as:
select A.time_range, count(B.time_range) from call_session_time_range A inner join call_session_time_range B on A.time_range @> B.time_range and A.time_range <> B.time_range group by A.time_range;
(Note: call_session_time_range is a view over a table, that creates a time range from each row with their created_at and updated_at fields. For our purposes, it doesn’t matter how its implemented.)
The query says, get two sets, A and B, where A = B, and for each row in A, find a row in B where the time range is in the row from A.
What was interesting was that I saw this pattern develop:
A has (B, C, D) B has (A, C, D) C has (A, D)
or, in English, A had 3 calls in its context, B had 3 calls in its context and C had 2 calls in its context. It also means that A can have B and B can have A. I was going to work with that, but it looked weird. Really weird. Like, there was something not right about it but I couldn’t put my finger on it. The real problem is that this told me nothing useful in relation to the original question.
So great, B happened during A. What does this prove?
In fact, when I ran this query over a large data set I got a maximum result of 199 calls in one time period. Initially I thought this was interesting. However, as I thought about it, it was actually garbage data. That A call could have lasted an hour!
I decided to change the way I thought about the first set of data. The word “concurrent” was what drove the first iteration, but “concurrent” is not possible with my data set. What is a better set of data to work with then? I decided, instead, to create a table of “timeslices”, with each range one minute in length.
In this instance, we are seeing instead: How many calls there were in a single minute? Not exactly “concurrent”, but can produce more usable data, I think.
So, to generate that first set, the context set, I wrote a PL/pgSQL function to fill a temporary table:
create temporary table timeslices ( timeslice tsrange ); create or replace function gen_timeslices(start_time timestamp, increment interval) returns void as $func$ declare new_date timestamp := start_time; begin while new_date <= now() loop insert into timeslices (timeslice) values(tsrange(new_date, new_date + increment)); new_date := new_date + increment; end loop; end; $func$ language plpgsql; select gen_timeslices((now() - interval '10 days')::timestamp, interval '1 minute');
So, when the last line runs, we generate a time range for each minute from 10 days ago until now.
Now the query changes to replace the first table with the new “timeslices” table.
select max(concurrent_count) from (select A.timeslice, count(B.time_range) as concurrent_count from timeslices A inner join call_session_time_range B on A.timeslice @> br.time_range and ar.timeslice <> B.time_range group by A.timeslice) res;
Our new count is really “how many calls happened within a given minute?” and our final answer is “What is the maximum number of calls that happened in a given minute for our given time range of 10 days?”
So, the answer for my query was: 6. Yay! That’s actually reasonable given the parameters we are considering.
The point of this post was to demonstrate how a relatively complex problem can be solved with some SQL really nicely. We are using set-like operations to formally define the problem and to generate useful answers. Addressing the problem by wrapping it in a framework to think about the problem really does help and is extremely powerful.
As a side note, its nice to see that that actual SQL code for it is really simple. If I wanted the thing to go faster, I can carefully drop indices in certain places. However, it’s not important and I have my answer now.