Monday, June 08, 2009

SQL for Date Spanned Data

I was recently presented with a sql challenge. Although I came up with something that works, I'm not convinced it is the best solution. This is a very long post that is mostly question, but possibly a helpful example. The details are below.

I was given a table with the following columns
ID -- Primary Key
DIVISION -- Identifier of a Division
START_DATE -- Date when the Division starts
END_DATE -- Date when the Division ends

If you want to try this yourself, I added all the create statements and sample data creation to the end of this note.

The data has an implicit concept of a "span." A Division may have many spans. The start date of a new span is typically one day after the end date of the previous span. In this case a Division has a contiguous span. Spans never overlap; that is, a single division never has a row with a start date between the start and end date of another row of the same division (and never has an end date between the start and end of another row of the same division).

Below is some sample data:

Id, Division, Start_date, End_date

10 1 09-JUN-2009 10-JUN-2009
11 1 11-JUN-2009 12-JUN-2009
12 1 13-JUN-2009 14-JUN-2009 -- Note the following row is not contiguous
14 1 17-JUN-2009 18-JUN-2009
15 1 19-JUN-2009 20-JUN-2009


If this were the total data set then Division 1 would have 5 spans, but only two contiguous spans (09 - 14 Jun and 17 - 20 Jun).

The challenge was to create a sql statement to only return contiguous spans by division. I immediately thought of a CONNECT BY but the challenge was to determine the START WITH condition.

[Update: In response to this post Alex Nuijten came up with a better method than any described below. Take a look here:
http://nuijten.blogspot.com/2009/06/analytic-function-finding-gaps.html ]
I came up with two ways to do this. I'm still not certain if there is a better solution. The first is far better and uses some SQL analytics to determine the START WITH condition. If you were to use a WHERE clause, it should go in the WITH section as identified. My example is below.

with span_info as (select s.division, s.start_date, s.end_date
, (s.start_date -lag(end_date, 1) over (partition by s.division order by start_date)) new_span
-- null or > 1 indicates a new contiguous span
, (lead(start_date, 1) over (partition by s.division order by start_date) - s.end_date) end_span
-- the end_span is not needed, just left for example
-- null or > 1 indicates the end of a contiguous span
from spantest s
-- a WHERE clause would go HERE, not in the main SELECT below
order by 1, 2 )
select division
, connect_by_root start_date span_start -- start of a contiguous span
-- , start_date -- start of the last span of a contiguous span
, end_date -- end of the contiguous span
-- , new_span, end_span, connect_by_isleaf
from span_info
where connect_by_isleaf = 1
-- gets only "leaf" recodes, i.e. the end of a contiguous span
start with nvl(new_span,1000) > 1
-- gets beginning of a contiguous span
connect by
prior end_date + 1 = start_date
-- connects spans, the 1 here and in the "start with" can be changed together
and prior division = division
-- ensures spans are for the same division
order by 1, 2

My second idea is to use an exists command to determine if the the first there exists a row with an end_date equal to start_date - 1 for the division. I can write the statement, but I'm not sure how to get the combination of the DIVISION and the START_DATE in to the START WITH. I could do some concatenation, but I don't like that at all. The code is below, but DO NOT RUN THE QUERY; it is very slow.

-- ************ DO NOT RUN THIS QUERY
select division
, connect_by_root start_date span_start -- start of a contiguous span
-- , start_date -- start of the last span of a contiguous span
, end_date -- end of the contiguous span
-- , new_span, end_span, connect_by_isleaf
from spantest
where connect_by_isleaf = 1
-- gets only "leaf" recodes, i.e. the end of a contiguous span
start with to_char(start_date,'yyyymmdd') ||'~' || division in
(select to_char(st2.start_date,'yyyymmdd') ||'~' || st2.division
from spantest st2
where not exists
(select 1 from spantest st3
where st3.division = st2.division
and st3.end_date = st2.start_date + 1
and st3.id != st2.id)
)
-- gets beginning of a contiguous span
connect by
prior end_date + 1 = start_date
-- connects spans, the 1 here and in the "start with" can be changed together
and prior division = division
-- ensures spans are for the same division
order by 1, 2


This is a lot of code and no-one other than me may ever read it. If you do, though, and have other ideas, I would love to hear them.

-- ********* Create Statements and sample data

CREATE TABLE "SPANTEST"
( "ID" NUMBER NOT NULL ENABLE,
"DIVISION" NUMBER NOT NULL ENABLE,
"START_DATE" DATE NOT NULL ENABLE,
"END_DATE" DATE NOT NULL ENABLE,
CONSTRAINT "SPANTEST_PK" PRIMARY KEY ("ID") ENABLE
)
/

CREATE INDEX "SPANTEST_IDX3" ON "SPANTEST" ("END_DATE")
/
CREATE INDEX "SPANTEST_IDX2" ON "SPANTEST" ("START_DATE")
/
CREATE INDEX "SPANTEST_IDX1" ON "SPANTEST" ("DIVISION")
/

CREATE SEQUENCE "SPANTEST_SEQ" MINVALUE 1 MAXVALUE 999999999999999999999999999 INCREMENT BY 1 START WITH 1441 CACHE 20 NOORDER NOCYCLE
/

CREATE OR REPLACE TRIGGER "BI_SPANTEST"
before insert on "SPANTEST"
for each row
begin
if :NEW."ID" is null then
select "SPANTEST_SEQ".nextval into :NEW."ID" from dual;
end if;
end;

/
ALTER TRIGGER "BI_SPANTEST" ENABLE
/


-- create sample data for 10 Divisions
begin
for i in 1..20 loop
insert into spantest (division, start_date, end_date)
select 1, trunc(sysdate + (rownum*2 - 1)),trunc(sysdate + (rownum*2)) from all_tables;
end loop;
end;
/

-- create span gaps on the 15th of each month
delete from spantest where
to_char(start_date,'dd') = '15'
/

5 comments:

Roel said...

Hi Anton,
Maybe you can make your life easier when you change your datamodel. The combination of start- and end-date with the no-overlap-requirement needs additional business rules coding. What if you either make the END_DATE optional and only fill that when there is a gap, or drop that column and add a 'type' column that indicates Span or Gap. Then I expect your query (and additional business rules) will be easier to code - and faster to run...
Cheers,
Roel

Mal said...

Hi Anton,
On a similar vein to Roel - most date-range stuff I've come across such as prices needs to be contiguous, and if that's the case, I only hold the start date for each entry. It's simple to fabricate a view showing the appropriate "end date" if it's needed (for reporting) and it does mean that you cane never get a gap in the data or an overlap.
Malcolm

Anton Nielsen said...

Roel * Mal,

Thanks for the ideas. In short, it's not my table and I can't change the way it works. For the sake of this exercise, let's assume it's a packaged application.

Anton

Alex Nuijten said...

Hi Anton, about a week ago I left a comment (a query statement) but never saw it here. So I decide to write a blog about it.. :)
http://nuijten.blogspot.com/2009/06/analytic-function-finding-gaps.html

Anton Nielsen said...

Alex,

I never expected such a fantastic response to this query. Your solution is far superior to mine--just what I was looking for.

Thanks,

Anton