contact

Subscribe via E-mail

Your email:

Posts by tag

KPI Partners News & Blog

Current Articles | RSS Feed RSS Feed

Statistical Analysis Using Linear Regression in OBIEE

 

by Kurt Wolff

Linear regression is a statistical technique for drawing a line through a set of data points that “best fits” the data. It is a useful and standard technique for quantifying trends.

For example, consider the following sample data set, graphed as a simulated dot plot in OBI EE.

describe the image

 

 

 

 

 

 

 

 

 

Fitting a regression line is a two-step process: finding the slope of the line and finding the y-intercept (where the regression line would cross the y axis).

If you don’t remember the formula for determining the slope from the statistics class you took 10 to 40 years ago (!), you can easily find it with a search on the web. This site has this:

Linear   Picture2

 

 

 

 

 

 

 

 

You can calculate this formula directly in Answers. I recommend doing it term by term, using a different calculated column for each term in the formula and then combining terms at the end.

Click on any column in the left panel to get a column to write the first term of the formula in. Here, the column clicked was Revenue.

Linear   Picture3

 

 

 

The first term will be x*y. I label the column headings to correspond to the formula I am building.

Linear   Picture5

 

 

 

 

describe the image

 

 

 

 

 

 

 

Then add another column to continue building the formula, using the Column button in the formula window to pick the x*y column already built. This technique saves a lot of headaches down the road as the formula becomes more complex and you start hunting down unmatched parentheses.

Linear   Picture6

 

 

 

 

 

 

 

Clicking on x*y gives this formula.

describe the image

 

 

 

 

 

 

It is useful to look at results for each step along the way. In this case, the results are not what are needed.

Linear   Picture10

 

 

 

 

 

 

 

 

 

 

 

 

So the formula needs to be modified. I added the “by” clause to get the sum for all rows.

Linear   Picture9

State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. [nQSError: 15036] Logical table Facts needs a primary key to support count aggregates on that table. (HY000)

However, instead of creating fact table keys, calculate the count another way. The following formula appears to work (but oddly, as we’ll see, only for a while).

describe the image

 

 

 

 

 

 

describe the image

 

 

 

 

 

 

Continue building.

Linear   Picture13

 

 

 

Linear   Picture15

 

 

 

 

 

Tip: Use the Power() function in your formulas when you can.

Linear   Picture16

Now, the odd thing mentioned above happens. All these columns compute.

Linear   Picture17

 

 

However, when you subtract the last column from the next to last to get the denominator in the formula, Answers throws a puzzling error.

Linear   Picture18

 

 

 

State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. [nQSError: 22023] An arithmetic operation is being carried out on a non-numeric type. (HY000)

At least one of the two numbers in the final column is now regarded as non-numeric?

A solution to this seems to be to change the formula for N, which was max(count(rsum(1))). Both max(rsum(1)) or sum(1 by) work.

Linear   Picture19

 

 

 

When you have the numerator(the first column in the following screen shot) and denominator (next to last column) built, calculate the slope.

Linear   Picture20

 

 

 

 

 

To plot the regression line, you also need the y intercept. The Criteria tab already has all the terms you need to plug into the intercept formula.

Once you have the slope and intercept, the regression line (values of y vs x) are easily calculated with the formula y=mx+b explained above (m is the slope, b is the y-intercept).

Linear   Picture21

 

 

 

 

The plot looks like this:

Linear   Picture22

 

 

 

 

 

 

 

This is a line chart. Revenue is shown with blue round  symbols and a 0 pixel line width. The regression line is a red dotted line with symbols turned off.

Linear   Picture23

 

 

 

Another statistic that is important is the R-squared statistic, which measures how well the regression line fits the data. The formula for R (the correlation coefficient) is:

Linear   Picture24

 

 

 

 

 

 

The Criteria tab already contains the formula for the numerator and several terms of the denominator (circled in this picture).

Linear   new24

 

 

 

Build the rest needed for the denominator, then calculate R and finally R-squared.

describe the image

 

 

 

 

 

 

 

The final value, 0.58, is the goodness of fit (goodness ranging from 0, bad, to 1, perfect).

At this point all the unneeded columns can be hidden or deleted.

It is always good to translate this into business (i.e. not statistical terms) by renaming things (“Regression” –> “Trend”) and using the narrative view to spell things out as explicitly as possible. Perhaps you can come up with a good business translation for R-square.

Linear   Picture26

 

 

 

 

 

 

 

 

 

Comments

Thanks! I may have a use for this at some point. R-squared of .58 would mean that the regression line explains 58% of the data points.
Posted @ Tuesday, March 27, 2012 1:39 PM by John Tesson
This is very useful. Oracle should look into this and bring in more statistical and analytic functions in obiee
Posted @ Tuesday, March 27, 2012 11:25 PM by joe
Thanks for this - mind boggling, but useful! What would be really useful is to be able to project forward for future weeks once the line has been established - is that a step too far in BIEE?
Posted @ Friday, January 04, 2013 12:35 PM by Andy
Summarizing, 
 
use this expression replacing CAMPO_FECHA with a time field and CAMPO_METRICA with the measure: 
 
(((((max(rsum(1))) *(sum(CAMPO_FECHA*CAMPO_METRICA)))  
- ((max(rsum(CAMPO_FECHA))) * (sum(CAMPO_METRICA))))  
/ ((max(rsum(1)) * max(rsum(power(CAMPO_FECHA,2))))  
- (power(max(rsum(CAMPO_FECHA)),2)))) * CAMPO_FECHA)  
+ ((sum(CAMPO_METRICA))- (((((max(rsum(1))) *(sum(CAMPO_FECHA*CAMPO_METRICA)))  
- ((max(rsum(CAMPO_FECHA))) * (sum(CAMPO_METRICA))))  
/ ((max(rsum(1)) * max(rsum(power(CAMPO_FECHA,2))))  
- (power(max(rsum(CAMPO_FECHA)),2))))  
* (max(rsum(CAMPO_FECHA))))) / max(rsum(1))
Posted @ Thursday, August 21, 2014 9:21 AM by Pepe Llavori
Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics