A Simple Example of Oracle Analytics: Running Totals


Analytics are magic. But as with most software products, I seem to use only one percent of the features ninety-nine percent of the time. Indeed, having built warehouses and reporting systems for the last eighteen months, I look back and see that I got a lot done with only three Analytics: SUM, LAG, and LEAD. Knowing how intimidating Analytics can look to those who have not used them, I figured I’d show the uninitiated how to get in through the back door, with a reduced look at the capability. You can do a lot with very little. We are going to discuss just one Analytic: SUM, used to create running totals.

Before we get started, please let me point out that there are already several good articles here on OraFAQ that cover Oracle Analytics. Some people a lot smarter than me took the time to write this stuff, so you should read it. There are also other great websites besides OraFAQ with articles on Analytics, so check those out too.

Next, if you have not figured it out yet, I am not providing a treatment of every Analytic in the book. Heck, I am waiting like everyone else for Tom Kyte of ASKTOM fame (and the King of Oracle Analytics) to write the book on Analytics we all want him to write. I am only going to talk about SUM used to compute a running total. But I will show you quite a bit of this little gem. Remember, the premise of this article is that for all the wealth of Analytics available, this one function will see the most use. OK, with that out of the way, let’s get crack-a-lackin’.

One good way to get a grasp of what Analytics are about is to follow the coding evolution of solutions to the problems Analytics are meant to help with. The most obvious one is the problem of constructing a running total. Consider the following data about a fictitious employee of some company, his name is JIM, and the data shows the history of raises Jim has received during his employment at the company:

drop table emp_raise_hist
/

create table emp_raise_hist
(
 ename varchar2(10) not null
,raise_date date not null
,raise_amount number not null
)
/

alter table emp_raise_hist
   add constraint uk_emp_raise_hist_1 unique (ename,raise_date)
/

insert into emp_raise_hist
(
 ename
,raise_date
,raise_amount
)
select 'Jim',add_months(trunc(sysdate,'mm'),-rowno*6),4000
from (
      select rownum rowno
      from dual
      connect by level <= 8
     )
/

SQL> desc emp_raise_hist
 Name                                Null?    Type
 ----------------------------------- -------- ----------------
 ENAME                               NOT NULL VARCHAR2(10)
 RAISE_DATE                          NOT NULL DATE
 RAISE_AMOUNT                        NOT NULL NUMBER

SQL> 
SQL> select *
  2  from emp_raise_hist
  3  order by
  4           ename
  5          ,raise_date
  6  /

ENAME      RAISE_DAT RAISE_AMOUNT
---------- --------- ------------
Jim        01-JUN-03         4000
Jim        01-DEC-03         4000
Jim        01-JUN-04         4000
Jim        01-DEC-04         4000
Jim        01-JUN-05         4000
Jim        01-DEC-05         4000
Jim        01-JUN-06         4000
Jim        01-DEC-06         4000

8 rows selected.

Jim here is getting a four thousand dollar raise every six months. Not too bad I suppose. I wonder how he does it. I am sure Jim’s HR director wonders how he does it too. Indeed, Jim’s HR director is tasked, as part of his job, with “keeping salaries competitive” (we all know what that means). So this HR director is keen to see how fast salaries in the company are growing over time, which is maybe why this historical table exists in their database to begin with. What can we show him? Well, we all know we can show him this:

SQL> select
  2         ename
  3        ,sum(raise_amount) total_raise_amount
  4  from emp_raise_hist
  5  group by ename
  6  /

ENAME      TOTAL_RAISE_AMOUNT
---------- ------------------
Jim                     32000

This is nice. It tells the HR guy that Jim has had thirty-two thousand dollars in raises since he started with the company. But the HR guy is really going to want to know more than that. One thing he is going to want to see is the total raise amount after each raise event. We all know this as the RUNNING TOTAL. What this director of HR wants to see is this data, but how do we get it for him?

ENAME      RAISE_DAT RAISE_AMOUNT RUNNING_RAISE_AMOUNT
---------- --------- ------------ --------------------
Jim        01-JUN-03         4000                 4000
Jim        01-DEC-03         4000                 8000
Jim        01-JUN-04         4000                12000
Jim        01-DEC-04         4000                16000
Jim        01-JUN-05         4000                20000
Jim        01-DEC-05         4000                24000
Jim        01-JUN-06         4000                28000
Jim        01-DEC-06         4000                32000

8 rows selected. 

We see here the same eight rows as before, but we have added another column to the report: the running raise amount, which shows the inception-to-date total raise amount relative to the raise event row it lives on. Back in the Stone Age, we used hammer-and-chisel techniques. To produce this number, we would have created a view like the one below and joined to it in our query.

create or replace
view vw_emp_raise_hist_1
as
select
       a.ename
      ,a.raise_date
      ,sum(b.raise_amount) running_raise_amount
from emp_raise_hist a
    ,emp_raise_hist b
where b.ename = a.ename
and b.raise_date <= a.raise_date
group by
         a.ename
        ,a.raise_date
/

With this view, we could write this query to create the report:

SQL> select
  2         a.ename
  3        ,a.raise_date
  4        ,a.raise_amount
  5        ,b.running_raise_amount
  6  from emp_raise_hist a
  7      ,vw_emp_raise_hist_1 b
  8  where b.ename = a.ename
  9  and b.raise_date = a.raise_date
 10  order by
 11           a.ename
 12          ,a.raise_date
 13  /

ENAME      RAISE_DAT RAISE_AMOUNT RUNNING_RAISE_AMOUNT
---------- --------- ------------ --------------------
Jim        01-JUN-03         4000                 4000
Jim        01-DEC-03         4000                 8000
Jim        01-JUN-04         4000                12000
Jim        01-DEC-04         4000                16000
Jim        01-JUN-05         4000                20000
Jim        01-DEC-05         4000                24000
Jim        01-JUN-06         4000                28000
Jim        01-DEC-06         4000                32000

8 rows selected.

There we go, one satisfied HR director. Of course, this solution is far from ideal. In particular, it has the potential to be a very poor performer. First, the query above joins the table to a view that itself self-joins the table, so you have to look at the same table three times to get your answer. Additionally, and more importantly, this solution is not linear in terms of the amount of data it has to examine.

Consider this:

to get the running raise amount for the first row, dated 01-JUN-03, we must look at that one row
to get the amount for the second row, 01-DEC-03, we have to look at it and the row before it; that is two rows
to get the running raise amount for the third row, we have to look at three rows

And so on. To get the sums for all rows on our Jim report we have to do the following row lookups:

Rows to sum up for target row                     rows visited

row1                                                         1
row1 + row2                                                  2
row1 + row2 + row3                                           3
row1 + row2 + row3 + row4                                    4
row1 + row2 + row3 + row4 + row5                             5
row1 + row2 + row3 + row4 + row5 + row6                      6
row1 + row2 + row3 + row4 + row5 + row6 + row7               7
row1 + row2 + row3 + row4 + row5 + row6 + row7 + row8        8
                                                        ------
                                                            36

For those who are math minded, the formula is N*(N+1)/2 = 8*9/2 = 72/2 = 36. Yep, to produce just this little report we need to do 36 row lookups for the summing alone. That does not include the initial row lookups in our main query and the corresponding initial row lookups in our view, which add 16 more, bringing the total to 52 rows visited to create this report. Rather costly eh? Now we see why in the old days people were quick to create summary tables for this kind of data.

But when one solution is pretty much all you’ve got, that is what you use. So, whether you were slipping on a pair of Penny Loafers, working Shigeru Miyamoto’s “Jumpman” Mario across the screen, or bouncing to Tone Loc’s “Funky Cold Medina”, you created the view, and if it performed way too slowly, you built the summary table and declared it done.

But we are always moving forward in IT. The next generation of Oracle Developers paid outrageous money for Reebok Pumps, sported Tamagotchi pets on their key chains, did the Macarena, and reveled in their ability to do procedural programming in Oracle with a new language called PL/SQL. With PL/SQL we could trade in our view for a function like this one:

create or replace
function f_emp_raise_hist_rra (
                               ename_p in varchar2
                              ,raise_date_p in date
                              ) return number is
   running_raise_amount_v number;
begin
   select sum(raise_amount)
   into running_raise_amount_v
   from emp_raise_hist
   where ename = ename_p
   and raise_date <= raise_date_p
   ;
   return (running_raise_amount_v);
end;
/
show errors

Which allowed us to write this query to get our report:

SQL> select
  2         ename
  3        ,raise_date
  4        ,raise_amount
  5        ,f_emp_raise_hist_rra(ename,raise_date) running_raise_amount
  6  from emp_raise_hist
  7  order by
  8           ename
  9          ,raise_date
 10  /

ENAME      RAISE_DAT RAISE_AMOUNT RUNNING_RAISE_AMOUNT
---------- --------- ------------ --------------------
Jim        01-JUN-03         4000                 4000
Jim        01-DEC-03         4000                 8000
Jim        01-JUN-04         4000                12000
Jim        01-DEC-04         4000                16000
Jim        01-JUN-05         4000                20000
Jim        01-DEC-05         4000                24000
Jim        01-JUN-06         4000                28000
Jim        01-DEC-06         4000                32000

8 rows selected.

The same answer, just a different way of getting it. But is it really progress? Well, that is debatable I suppose. Looking at the function, we see that it does not require us to look up a priming row before looking up the actual rows we want to sum, as the view before it did. Thus we are in fact doing eight fewer row lookups this way. But this benefit is offset by the fact that we now have to do cursor management, opening and closing a cursor with each invocation of this function, and we have to incur the somewhat expensive penalty of a context switch from SQL to PL/SQL to make the call, and the reverse when the function call ends. So it is not at all clear there is a real performance benefit with the PL/SQL function over the older VIEW solution.

Still, there is one important aspect to the PL/SQL solution; it simplifies the main query significantly in a philosophical way. Conceptually the query is much easier to understand. It is easier to see what the basic thing is you are creating with the query. So to that end, this PL/SQL solution is a far better solution than the older VIEW solution. As long as it performs well, we will likely go with it.

All was good for a while; then the new millennium arrived with Y2K ending in a whimper instead of a meltdown. This generation of Oracle Developers stopped walking and started riding Razor scooters, got their entertainment playing Texas Hold’em, and went looking for their music superstars not from record companies, but from TV’s American Idol. And Oracle gave us the SCALAR SUB-SELECT, and with it once again another way to rewrite our query, hopefully for the better.

SQL> select
  2         ename
  3        ,raise_date
  4        ,raise_amount
  5        ,(
  6          select sum(raise_amount)
  7          from emp_raise_hist b
  8          where b.ename = a.ename
  9          and b.raise_date <= a.raise_date
 10         ) running_raise_amount
 11  from emp_raise_hist a
 12  order by
 13           ename
 14          ,raise_date
 15  /

ENAME      RAISE_DAT RAISE_AMOUNT RUNNING_RAISE_AMOUNT
---------- --------- ------------ --------------------
Jim        01-JUN-03         4000                 4000
Jim        01-DEC-03         4000                 8000
Jim        01-JUN-04         4000                12000
Jim        01-DEC-04         4000                16000
Jim        01-JUN-05         4000                20000
Jim        01-DEC-05         4000                24000
Jim        01-JUN-06         4000                28000
Jim        01-DEC-06         4000                32000

8 rows selected.

Once again, the same thing, only different. And once again we have to ask how it is better. We point out that this method does nothing to address the excessive row lookups we must do to compute sums for each raise event. And we still have to open and close a cursor with each invocation (i.e. with each row from the main query). We can say there are no context switches with this solution because it is all SQL in nature. I suppose that is something. And it still lets us focus on the main thing we are selecting, so that is good. But we might also say it is not reusable like the function, so it loses something there. All in all, this looks at first blush to be convenient in some ways and otherwise a wash.

But today, Oracle sports Analytic functions. One of the benefits of using Analytics is that they are supposed to perform at least as well as any of the previous solutions we have seen over the years, and potentially much faster. On the intangible side, we will also see that they again allow us to think about our problem solution less in terms of lines of code we have to write, and more in terms of what kinds of transformations we need to apply to our data to arrive at our final destination.

The Analytic solution to this running_raise_amount report is this:

SQL> select
  2         ename
  3        ,raise_date
  4        ,raise_amount
  5        ,sum(raise_amount) over (
  6                                 partition by ename
  7                                 order by raise_date
  8                                ) running_raise_amount
  9  from emp_raise_hist
 10  order by 
 11           ename
 12          ,raise_date
 13  /

ENAME      RAISE_DAT RAISE_AMOUNT RUNNING_RAISE_AMOUNT
---------- --------- ------------ --------------------
Jim        01-JUN-03         4000                 4000
Jim        01-DEC-03         4000                 8000
Jim        01-JUN-04         4000                12000
Jim        01-DEC-04         4000                16000
Jim        01-JUN-05         4000                20000
Jim        01-DEC-05         4000                24000
Jim        01-JUN-06         4000                28000
Jim        01-DEC-06         4000                32000

8 rows selected.

As you can see, the same result, just a different way of getting it. But don’t think that the Analytic gets its data the same way the other methods did, because it does not. Let us recap what we have said so far, and then look at some query plans for a more in-depth peek.

We started with a simple need: how to produce a running total as part of the output of a list of rows. We had a history lesson on how coding to solve this problem has changed over the years as Oracle has provided new abilities in its database. And although we have not yet discussed how an Analytic operation actually gets its job done, we should at least be comfortable with the idea that Analytics can be understood in part by an initial examination of the problems they are trying to solve. Further, if you accept the premise that only a few Analytics are used most of the time, then our back door into the Analytics world can be opened through an examination of the most commonly used Analytic (and also one of the easiest to understand), the running total.

Let’s take a look at the query plans of each of our four methods, just for ha-ha’s, and then we can move on to the innards of Analytic processing. Here are the query plans and statistics via autotrace for our four methods.

View Based Execution Plan
----------------------------------------------------------
Plan hash value: 2297955720

--------------------------------------------------------------------------------------------
| Id  | Operation            | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                     |    10 |   860 |     8  (25)| 00:00:01 |
|   1 |  SORT GROUP BY       |                     |    10 |   860 |     8  (25)| 00:00:01 |
|*  2 |   HASH JOIN          |                     |    10 |   860 |     7  (15)| 00:00:01 |
|   3 |    NESTED LOOPS      |                     |    25 |  1425 |     3   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP_RAISE_HIST      |     8 |   328 |     3   (0)| 00:00:01 |
|*  5 |     INDEX UNIQUE SCAN| UK_EMP_RAISE_HIST_1 |     3 |    48 |     0   (0)| 00:00:01 |
|   6 |    TABLE ACCESS FULL | EMP_RAISE_HIST      |     8 |   232 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"."ENAME"="A"."ENAME")
       filter("B"."RAISE_DATE"<="A"."RAISE_DATE")
   5 - access("A"."ENAME"="A"."ENAME" AND "A"."RAISE_DATE"="A"."RAISE_DATE")

Note
-----
   - dynamic sampling used for this statement


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         16  consistent gets
          0  physical reads
          0  redo size
        791  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          8  rows processed

SQL> 

Function Based Execution Plan
----------------------------------------------------------
Plan hash value: 1057306416

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     8 |   232 |     4  (25)| 00:00:01 |
|   1 |  SORT ORDER BY     |                |     8 |   232 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMP_RAISE_HIST |     8 |   232 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement


Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
         63  consistent gets
          0  physical reads
          0  redo size
        791  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          8  rows processed

SQL> 

SCALAR SUB-SELECT Based Execution Plan
----------------------------------------------------------
Plan hash value: 1319989325

----------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                     |     8 |   232 |     4  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE              |                     |     1 |    29 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| EMP_RAISE_HIST      |     1 |    29 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | UK_EMP_RAISE_HIST_1 |     1 |       |     2   (0)| 00:00:01 |
|   4 |  SORT ORDER BY               |                     |     8 |   232 |     4  (25)| 00:00:01 |
|   5 |   TABLE ACCESS FULL          | EMP_RAISE_HIST      |     8 |   232 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("B"."ENAME"=:B1 AND "B"."RAISE_DATE"<=:B2)

Note
-----
   - dynamic sampling used for this statement


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         17  consistent gets
          0  physical reads
          0  redo size
        791  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          8  rows processed

SQL> 

Oracle Analytics Based Execution Plan
----------------------------------------------------------
Plan hash value: 3594581820

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     8 |   232 |     3   (0)| 00:00:01 |
|   1 |  WINDOW SORT       |                |     8 |   232 |     3   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMP_RAISE_HIST |     8 |   232 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        791  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          8  rows processed

SQL>

At the risk of going out on a limb, let me make some careless statements on the performance aspects of each coding method, given these plans. The amount of data we are dealing with is trivial, so we should not read too much into these results. However, they are not without value.

We can see the plans of each coding alternative. We see that the PL/SQL method has a high cost, though we can’t tell what that cost is by looking only at this plan, because the SQL inside the function is executed as a recursive call and thus only shows itself as another plan somewhere else. We see that, surprisingly, the view method performs the best of all our pre-Analytic methods, though again this is a pitifully small amount of data, so such generalizations are suspect. We see that the SCALAR SUB-SELECT method appears to perform similarly to the VIEW method, and its slight plan variation reflects the code differences of the method.

Most striking, though, is that the Analytic did something completely different, as witnessed by its very simple plan. It did not have to visit rows in a non-linear fashion as the other methods did. Instead it relied on a different approach to adding up the data: sorting. This has significant ramifications, for even our simple analysis here is misleading. In a more realistic scenario, we would likely be generating multiple running totals from the same data, not just one. In the insurance industry, for example, there could easily be fifty or one hundred such sums in a typical insurance dataset that need computing. Think for a moment how each method above would solve this problem.

The view method would add additional columns to its layout. This turns out not to be so bad, because it means the view will construct all the aggregates it needs without having to revisit rows for each aggregate. But the other methods are not so lucky. The PL/SQL method, unless you are willing to venture into objects (which is another great thing), would require us to construct one function for each aggregate we want to compute. Every invocation of one of these functions has the potential to incur the same overheads of open/close cursor, recursive SQL execution, and non-linear visiting of the data. If there were fifty rolling totals to compute, then our eight-row example would mean four hundred function calls doing, in total, fifty times thirty-six, or eighteen hundred, row visits for the summing (oh man!). The SCALAR SUB-SELECT method, though it does not have the recursive call problem of the PL/SQL method, otherwise suffers from the same issues of open/close cursor and lots of row visits, which only go up as the number of aggregates increases. Once again, the charm of old school coding shines (the VIEW method). Just ’cause it’s old don’t mean it ain’t useful.

But the Analytic is smart. We can note at least three optimizations that the Analytic is capable of:

1) It uses a sort (which we will discuss in a minute) as a way of organizing the data in order to add up its rolling totals. Basically, the Analytic is doing a CONTROL-BREAK report; remember that phrase from college comp-sci class? As the amount of data gets larger, this technique just gets better, because sorting is one of the things the Oracle database does really well. And as Oracle’s investment in sort technology pays dividends in the future, Analytics will get faster.

2) Although we don’t see it here (we will in a moment), I can tell you that the Oracle Optimizer understands that all Analytics in a query that use the same OVER expression can be bundled, or piggybacked, on the same sort operation. Thus to accommodate the insurance example of fifty rolling totals, only one sort is needed.

3) We see that, as a coincidence, our query sorts its final results using the same sort needed by the Analytic expression. Since the Analytic is one of the last things done in a query (more or less), there was no need to re-sort the data after the Analytic for output, because the Analytic window sort had already put the data in the necessary order. So the optimizer skipped a sort. Consider what this means. In my experience, about 50% of the time your analytics will be followed by an ORDER BY that uses the same sort needed by the analytics you are computing. With or without the analytics, you would still be doing that sort for the query. Thus in reality, all the analytics for such a query are free: adding them has not caused you to do any significant new work in the query plan. How is that for cheap and fast! It is the ultimate in savings. A small sketch of points 2 and 3 together follows below.
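
To make points 2 and 3 a bit more concrete, here is a sketch against our emp_raise_hist table. Treat it as an illustration of what I would expect rather than a traced test case: the three analytics below all share the same OVER expression, so they should be able to piggyback on a single window sort, and because the final ORDER BY asks for that same ordering, I would not expect a separate SORT ORDER BY step on top of the WINDOW SORT.

select
       ename
      ,raise_date
      ,raise_amount
       -- three different analytics, one shared OVER expression
      ,sum(raise_amount) over (partition by ename order by raise_date) running_raise_amount
      ,count(*)          over (partition by ename order by raise_date) raise_events_so_far
      ,avg(raise_amount) over (partition by ename order by raise_date) running_avg_raise
from emp_raise_hist
order by
         ename        -- same ordering the window sort already produced
        ,raise_date
/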

OK, if you are ready, let’s take a look at what the Analytic really does, or at least what I understand it to do.

Analytics are all about sorting. An Analytic function needs to know that the data coming to it is in some specific order, mostly for two reasons: 1) it needs to know that it will see rows of the same group in one contiguous set of rows, so that it knows when it has read the last row of the group, and 2) for each row in a group, it needs to understand that row’s position relative to the other rows in the same group.

ENAME      RAISE_DAT RAISE_AMOUNT RUNNING_RAISE_AMOUNT
---------- --------- ------------ --------------------
Jim        01-JUN-03         4000                 4000
Jim        01-DEC-03         4000                 8000
Jim        01-JUN-04         4000                12000
Jim        01-DEC-04         4000                16000
Jim        01-JUN-05         4000                20000
Jim        01-DEC-05         4000                24000
Jim        01-JUN-06         4000                28000
Jim        01-DEC-06         4000                32000

8 rows selected.

Data in our sample case happens to be sorted by ENAME and RAISE_DATE. Our pre-Analytic query coding methods made sure that every time they were called (or joined to, in the case of the view), they were looking at one group of rows as defined by ENAME (… and a.ENAME = b.ENAME …), as in all rows for JIM, and within this group of JIM rows, each coding method made sure to select rows on or before the main query’s current row’s RAISE_DATE (… and b.RAISE_DATE <= a.RAISE_DATE …). This of course implies a sense of ordering to the rows, such that knowing a row’s date, you know if it appears before or after some other row.

It is convenient to think about how to construct the running total we need for each row in the main query by visualizing how our pre-Analytic methods do it, but that is not the only way to approach the problem. The pre-Analytic coding methods do row-by-row processing of the main query. They get a row from the main query and either 1) join to the view, or 2) call a function, or 3) execute a scalar sub-select. But row-by-row processing is more properly referred to as slow-by-slow processing.

Enter the Analytic. The Analytic takes a different approach. It uses sorting to organize the data into an order that facilitates CONTROL-BREAK style processing of the data. Thus the Analytic gets the data in such a way that it can simply add up numbers as it reads rows because it knows the rows are coming in the right sorted order for it to do this. Consider the Analytic query again.

select
       ename
      ,raise_date
      ,raise_amount
      ,sum(raise_amount) over (
                               partition by ename
                               order by raise_date
                              ) running_raise_amount
from emp_raise_hist
order by 
         ename
        ,raise_date
/

It uses this form:

Analytic_Function(<Any-Legal-Expression usually just a column>)
over (
  <Partition-By-Clause>
  <Order-By-Clause>
  <Window-Clause>
)

The real key is the OVER expression. You see it contains a PARTITION BY clause, an ORDER BY clause, and a WINDOW clause. Forget about the window clause for now; it is not used most of the time anyway. If you really want to know why it exists, think about what a smoothed average is (we will see a small sketch of one a little later). Removing the WINDOW clause leaves us with this form:

Analytic_Function(<Any-Legal-Expression usually just a column>)
over (
  <Partition-By-Clause>
  <Order-By-Clause>
)

Which as you can see is what our query used:

      ,sum(raise_amount) over (
                               partition by ename
                               order by raise_date
                              ) running_raise_amount

Given this OVER expression, Oracle will sort by ENAME and then RAISE_DATE. Let’s say that again. To do this Analytic, Oracle does the main query first, and then takes all rows that come out of the query and sorts them by ENAME, RAISE_DATE. As the sort operation outputs rows, the Analytics are computed. This is possible because the ordering of rows after the sort makes the calculations easy for Oracle to do.

And that is what our query plan for the Analytic says:

Oracle Analytics Based Execution Plan
----------------------------------------------------------
Plan hash value: 3594581820

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     8 |   232 |     3   (0)| 00:00:01 |
|   1 |  WINDOW SORT       |                |     8 |   232 |     3   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMP_RAISE_HIST |     8 |   232 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Step 2 reads all the rows from emp_raise_hist.
Step 1 sorts the data as needed to do one or more Analytics; as the sort outputs rows, the Analytic(s) are computed.
Step 0: done.

This clues us in to the purpose of the PARTITION BY clause and the ORDER BY clause of the Analytic OVER expression. The PARTITION BY clause declares what makes a group of rows; the ORDER BY clause defines the relative position of rows within each group. Much of the efficiency of this method of calculating Analytic values comes from the fact that the sort makes it possible to compute the Analytics as the rows are output from the sort, using CONTROL-BREAK logic.

As you may recall, a CONTROL-BREAK report was a report that produced totals every time a grouping control changed. This works because a CONTROL-BREAK report requires the input data to be sorted on the CONTROL-BREAK groups. Thus an employee listing by department, which would naturally be sorted by Department Name and Employee Name, could add up the number of employees in each department by simply incrementing a counter by one until a new department name was found, at which point the report would put out a summary row for the department and reset its counter to zero so that it could start counting employees in the next department.

All this CONTROL-BREAK logic is possible only because the data is delivered to the report process in sorted order so that the rule “when I see a new control value, I know the control group I was just looking at is done and there won’t be any more rows for it” can easily be applied. This same logical concept is used by Oracle Analytics.
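
To make the CONTROL-BREAK idea concrete, here is a minimal PL/SQL sketch that computes Jim’s running raise total the control-break way: read the rows in sorted order, reset the accumulator whenever the ENAME changes, and add as you go. This is only an illustration of the concept, not a claim about how Oracle implements the WINDOW SORT internally.

set serveroutput on

declare
   last_ename  emp_raise_hist.ename%type;
   running_amt number := 0;
begin
   -- rows arrive in control-break order: ename, then raise_date
   for r in (select ename, raise_date, raise_amount
             from emp_raise_hist
             order by ename, raise_date)
   loop
      if last_ename is null or r.ename <> last_ename then
         running_amt := 0;        -- control break: new group, reset the accumulator
         last_ename  := r.ename;
      end if;
      running_amt := running_amt + r.raise_amount;  -- one pass, no re-visiting of rows
      dbms_output.put_line(r.ename||'  '||to_char(r.raise_date,'DD-MON-YY')||'  '||running_amt);
   end loop;
end;
/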

The PARTITION BY clause defines what makes a control group. We can count, sum, and otherwise do math on the rows in the group. The ORDER BY clause provides an ordering of rows within the group. This gives meaning to the first row in the group, the last row in the group, the next row after the current row, the previous row before the current row, and so on. So in the end, every unique OVER expression in a query will cause a sort of all rows (seen as a WINDOW SORT in a query plan), with all Analytics dependent upon the ordering the sort provides being computed as the sort outputs rows.
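
Before moving on, here is the promised peek at the WINDOW clause we set aside earlier. A smoothed (moving) average is the classic use: the sketch below, run against our raise history, averages each raise with the two raises before it. Treat it as an illustration of the syntax, not something the HR director asked for.

select
       ename
      ,raise_date
      ,raise_amount
      ,avg(raise_amount) over (
                               partition by ename
                               order by raise_date
                               rows between 2 preceding and current row
                              ) smoothed_raise_amount
from emp_raise_hist
order by
         ename
        ,raise_date
/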

For more clarification, look at this query and execution plan.

select
 claim_test.*
,sum(total_paid) over (partition by accnt_id order by accident_date) sum1a
,sum(total_paid) over (partition by accnt_id order by reported_date) sum2a
from claim_test
/

------------------------------------------------------------
| Id | Operation                        | Rows | Bytes
------------------------------------------------------------
|  0 | SELECT STATEMENT                 |  4   |  124
|  1 |  WINDOW SORT                     |  4   |  124
|  2 |   WINDOW SORT                    |  4   |  124
|  3 |    TABLE ACCESS FULL CLAIM_TEST  |  4   |  124
------------------------------------------------------------

As you can see, there are two Analytics being computed. We also see that the OVER expression is different for each, so we should expect two sorts in our query plan. And indeed the plan shows that, because there are two distinct OVER expressions, two window sorts are necessary. Each OVER expression required the data to be sorted its own way so that its Analytic could be computed.

This plan reads like this:

1) full table scan the claim_test table which results in four rows
2) sort these four rows for an Analytic, and compute the Analytic
3) sort these four rows again for another Analytic and compute that Analytic
4) done

If there had been, say, ten million rows coming out of the full table scan (plan Id 3), then the plan would say we sorted ten million rows and computed the first Analytic, and then sorted the output of that, which was ten million rows, and computed the second Analytic. So you should be getting the idea now. Each distinct Analytic OVER expression requires a sort of the data. All rows in the result are sorted once for each distinct OVER expression.

If you think that sounds like a lot of work, then read the first half of this paper again. You will recall that all of the pre-Analytic methods of getting the running total suffered from several performance problems, including the common need to visit N*(N+1)/2 rows to compute the sums, where N is the number of rows in a group. In general, the CONTROL-BREAK approach of sorting first and then computing is much less costly, largely because, although sorts are themselves technically not linear in cost, there are many sort variations that perform vastly better than N*(N+1)/2. Add to this that the Oracle Optimizer has many tricks to streamline the process, above and beyond access to the world’s best sort algorithms, and you have a real winner. Have a look at this second query.

select
 claim_test.*
,sum(total_paid) over (partition by accnt_id order by accident_date) sum1a
,sum(total_paid) over (partition by accnt_id order by reported_date) sum2a
,sum(total_lost) over (partition by accnt_id order by accident_date) sum1b
,sum(total_lost) over (partition by accnt_id order by reported_date) sum2b
,sum(total_lost) over (partition by accnt_id order by accident_date,cmpny_id) sum3a
,sum(total_lost) over (partition by accnt_id order by reported_date,cmpny_id) sum3b
from claim_test
/
------------------------------------------------------------
| Id | Operation                        | Rows | Bytes
------------------------------------------------------------
|  0 | SELECT STATEMENT                 |  4   |  124
|  1 |  WINDOW SORT                     |  4   |  124
|  2 |   WINDOW SORT                    |  4   |  124
|  3 |    TABLE ACCESS FULL CLAIM_TEST  |  4   |  124
------------------------------------------------------------

This query has six Analytics to compute, but requires only two sort operations. Why? Because the Analytics engine has piggybacked Analytic calculations wherever it could find compatible OVER expressions. Some of the OVER expressions, although not identical to each other, are in fact “compatible” with each other.

If we do the sorting for:

over (partition by accnt_id order by accident_date,cmpny_id)

then have we not also done a sort that satisfies:

over (partition by accnt_id order by accident_date)

Of course we have. Oracle knows this, so it computes Analytics dependent upon both sort variations at the same time, because they are compatible. That is pretty smart. Not even the VIEW method we liked so much originally can do that; you would have had to define multiple views to compute these running totals.

So it should be pretty clear that Analytics have the potential to be substantially faster than any previous method available to us for doing these kinds of calculations. And we can put this running total Analytic to good use in warehouses and reporting systems. All the warehouse and reporting systems I have participated in building over the last eighteen months have had some common characteristics:

1) they were based more or less on a Kimball Design
2) they needed STAR TRANSFORMATION to get their data quickly
3) they all wanted to summarize their fact table to a monthly level
4) they wanted INCEPTION TO DATE numbers in a second monthly summary

It was the monthly INCEPTION TO DATE summaries that these systems pinned their performance hopes on. They were not satisfied with queries taking three minutes to answer; they wanted three seconds if they could get it. Naturally there were questions about how to generate such a summary. I was partial to using Analytics; a sketch of the idea follows.
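
As an illustration only (the table and column names below are made up for the example, not taken from any real system), an inception-to-date monthly summary can be layered on top of a plain monthly summary using the same running-total SUM we have been working with all along:

-- Hypothetical monthly summary: one row per account per month.
-- The running-total analytic turns it into an inception-to-date summary.
select
       accnt_id
      ,summary_month
      ,monthly_paid_amount
      ,sum(monthly_paid_amount) over (
                                      partition by accnt_id
                                      order by summary_month
                                     ) inception_to_date_paid_amount
from monthly_fact_summary
order by
         accnt_id
        ,summary_month
/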

Well, hope this helps someone get into the use of Analytics. It really is worth the time to get to know them.

Kevin

Comments

Honestly, Analytics have held pride of place in my Mental Blocks table for a long time, but this article has helped clear a lot of cobwebs! Excellent piece. I've said so here, too: http://www.dizwell.com/prod/node/830 and the comments there go some way to explaining why I think this has been superbly written. Thanks a lot.

Kevin,

Thank you a lot for your superb post! Your explanation technique is really perfect.

All the best !

Analytics have since been a staple method in tuning some of our queries.
I remember how getting the latest result set, which could really hurt performance, can still be improved a lot. And we have a lot of batch jobs doing this. So imagine the load.

-----
select * from a
where {some condition}
and date_col = (select max(date_col) from a where {some condition})
-----

ok, this is a badly formed sql but that's the work of a newbie back then :P
we have learned Analytics and you know the trick (i.e. get the max for each partition then compare with the current row's date_col)
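
In other words, something along these lines (just a sketch; grp_col stands in for whatever the real partitioning column is, and {some condition} mirrors the placeholder from the original snippet):

select *
from (
      select a.*
            ,max(date_col) over (partition by grp_col) max_date_col  -- latest date per partition
      from a
      where {some condition}
     )
where date_col = max_date_col  -- keep only the row(s) carrying that latest date
/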

and btw, we call it "window function" hehe :D

You should go back to school to pass your skills and experience on to students after retirement ^ ~ ^. Students need professors just like you, with hands-on knowledge and experience.

Your posts are easy to understand, and the skills and methods can be applied in practice easily as well. Looking forward to your new posts.

Thank you.

Hi Kevin,

Thanks a lot for these hints - I tried them and they are much faster than my old selects!
But I have a problem which is a little bit similar to your topic - maybe you can help me?

I have a table with many rows. The attributes of the table are code, month and value.
For each code there are 12 months and 12 values.

Now I want to add up the gaps between the months...
Is it possible to count the following gaps between the different rows...?

For example:

Original table1:
code, month, value
1,1,20
1,2,0
1,3,30
1,4,0
1,5,40
1,6,0
1,7,0
1,8,20
1,9,0
1,10,10
1,11,0
1,12,0
5,1,0
5,2,20
5,3,10
...

description:
january value = 20
february value = 0 (=>count this gap => new value 1 )
march value = 30
april value = 0 (=>count this gap => new value 1 )
may value = 40
june value = 0
july value = 0 (=>count this two following gaps => new value 2 )
august value = 20
september value = 0 (=>count this gap => new value 1 )
october value = 10
november value = 0
december value = 0 (=>count this two following gaps => new value 2 )

New target table:
code, month, value
1,1,20
1,2,1
1,3,30
1,4,1
1,5,40
1,6,0
1,7,2
1,8,20
1,9,1
1,10,10
1,11,0
1,12,0
5,1,3
5,2,20
5,3,10
...

I tried this code:

select code, month,
sum(value) over (
order by month
rows between unbounded preceding and current row
) as final_value
from table1 order by month;

This adds all the following fields cumulatively from the beginning to the current row. But I need this adding only for the consecutive gaps... then start counting from 0 again.
I need only the result like in the example on top. Maybe there is another function, like decode, to count only the consecutive gaps...!?

A function instead of unbounded preceding....like "Last field with value=0" or something... ?

Best regards,

Tim


Many people want to back up their data into a flat file and vice versa;
writing sample utl_file package code would be a boost.

Thanks a lot. Analytics are really amazing.


select
ename, raise_date, raise_amount
-- Forward
, sum(raise_amount) over(partition by ename order by raise_date asc) as running_total1a
, sum(raise_amount) over(partition by ename order by raise_date rows between unbounded preceding and current row) as running_total1b
-- Backward
, sum(raise_amount) over(partition by ename order by raise_date desc) as running_total2a
, sum(raise_amount) over(partition by ename order by raise_date rows between current row and unbounded following) as running_total2b
-- Parent group
, sum(raise_amount) over(partition by ename order by raise_date rows between unbounded preceding and unbounded following) as total
-- Percent of Total
, (sum(raise_amount) over(partition by ename order by raise_date asc) / sum(raise_amount) over(partition by ename order by raise_date rows between unbounded preceding and unbounded following))*100 || ' %' as market_share
from emp_raise_hist
order by ename, raise_date;