outer table predicate in ON-clause #my take

(INNER table predicate in where-clause vs ON-clause is well covered online and also in my own http://bigblog.tanbin.com/2011/05/whats-different-between-these-2-sqls.html)

Note intermediate table is unaffected by where-clause. Query processor always uses on-clause exclusively to build an intermediate joined table before applying where-clause. This is the logical view we can safely assume. Physically, Query processor could optimize away the where/on distinction, but output is always consistent with the logical view.

Q1: can we _always_ assume that LL-outerjoin-RR intermediate table _always_ includes all LL rows if no outer table predicate in ON-clause?
A: yes with the big “if”
%%A: If you need to filter on outer table, do it in where-clause — better understood. Avoid ON-clause.

Q2: Boldly dropping the big “if”, can we _always_ assume that LL-outerjoin-RR intermediate table _always_ includes all LL rows _regardless_ of outer table predicate in ON-clause?
%%A: probably yes.

If you really really want outer table predicate in ON-clause, I assume you have no good reason and just feels adventurous.
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc32300.1502/html/sqlug/sqlug169.htm shows that even with outer table predicate in ON-clause, we still get “… all LL rows”

select title, titles.title_id, price, au_id
from titles left join titleauthor
on titles.title_id = titleauthor.title_id
and titles.price > $20.00
title                  title_id   price   au_id       
--------------------   ---------  ------  ---------------
The Busy Executive’s   BU1032     19.99   NULL
Cooking with Compute   BU1111     11.95   NULL
You Can Combat Compu   BU2075      2.99   NULL
Straight Talk About    BU7832     19.99   NULL
Silicon Valley Gastro  MC2222     19.99   NULL
The Gourmet Microwave  MC3021      2.99   NULL
The Psychology of Com  MC3026      NULL   NULL
But Is It User Friend  PC1035     22.95   238-95-7766

join/case as alternative to union

Assuming all animals with/without names are wanted, I'd write a unioned outer join — animal left outer join cats union animal left outer join dogs. This would give me a complete intermediate table for further processing.

I don't use union often but I usually find no good alternatives when I need a union. However,

http://stackoverflow.com/questions/9187243/sql-union-optimized-into-a-left-join-faster-but-query-plan-says-cost-more-i-o shows a join/Case as an alternative to union. I hope it's usable here. The intermediate table probably has the right number of rows (as in animals) but more columns than wanted. Case expression acts as a multiplexer to pick either cats column or dogs column.

———-

Sent: Thursday, May 17, 2012 9:11 PM

To: Kit Cheng; Bin TAN; Song Ding; Xin Zhuang; Sister; Hai Yi

a table of animals (id, name);

a table of cats (id, color);

a table of dogs (id, color);

write a sql to list all the animals whose color is red.

[12]case/group-by/self-join: #2 ] G9

(Simplest solution is at the end, which also returns #2 alone ….)
I have seen many tricks in SELECT queries (Most using joins – Fine.), but now if I must name one keyword to be the most powerful yet unappreciated keyword, it has to be
1) CASE.

2) The combined power of case and group-by is even more impressive.
3) Yet more unthinkable is the combination of case/group/self-join.
4) correlated subquery (slow?) in SELECT, FROM, WHERE and HAVING.

http://www.informit.com/articles/article.aspx?p=26856 explains “In addition to subqueries in the WHERE and HAVING clauses, the ANSI standard allows a subquery in the FROM clause, and some relational database management systems (RDBMSs) permit a subquery in the SELECT clause.”

Key to understanding case_group_self-join is the intermediate work table. Mind you, intermediate table is the key to _every_ tricky self-join. P30 [[ transact-sql cookbook]] among many chapters, has a wonderful solution to find the top 5 values in a table without a 5-way self-join. Based on that, here’s a simplified but full example, showing several distinct solutions. These solutions can be adapted to find 5th largest value in a table, too.

— setting up the data —-
drop table public.students
create table public.students(
student varchar(44),
score decimal(4,1),
primary key (student)
)
insert into students values( ‘Andrew’, 15.6)
insert into students values( ‘Becky’, 13)
insert into students values( ‘Chuck’, 12.2)
insert into students values( ‘Dan’, 25.6)
insert into students values( ‘Eric’, 15.6)
insert into students values( ‘Fred’, 5.6)
insert into students values( ‘Greg’, 5.6)
select * from students


Solution 1
—– top 3 scores
— intermediate table
select * from students h right join students L
on h.score > L.score

select L.student, ‘is lower than’, count(h.student), ‘competitors’
from students h right join students L
on h.score > L.score
group by L.student
having count(L.student) < 3

—– lowest 4 scores
— intermediate table
select * from students h left join students L
on h.score > L.score

select h.student, ‘defeats’, count(L.student), ‘competitors’
from students h left join students L
on h.score > L.score
group by h.student
having count(L.student) < 4


—- Solution 2, using case
— intermediate table
select *, (case when h.score > L.score then 1 else 0 end)
from students h left join — inner join ok since cartesian
students L on 1=1

select h.student, ‘defeats’, sum(case when h.score > L.score then 1 else 0 end), ‘competitors’
from students h left join — inner join ok since cartesian
students L on 1=1
group by h.student
having sum(case when h.score > L.score then 1 else 0 end) < 4

–same solution tested on http://sqlzoo.net/howto/source/z.dir/tip915069/sqlserver
select h.name, h.area, ‘is smaller than’, sum(case when h.area < L.area then 1 else 0 end), ‘countries’
from cia h, cia L
group by h.name, h.area
having sum(case when h.area < L.area then 1 else 0 end) < 4


—– (concise) Solution 3, using correlated sub-select, P295 [[sql hacks]] — without intermediate table
select * from students o where  (select count(1) from students where score < o.score) < 4

–Same solution tested in http://sqlzoo.net/howto/source/z.dir/tip915069/sqlserver:
select * from cia o where  (select count(1) from cia where area > o.area) < 4 — shows top 4

…and (select count(1) from cia where area > o.area) > =3 — returns #4 alone
…and (select count(1) from cia where area > o.area) > =2 — returns #4 #3 exactly

–This technique shows its power when you want top 2 in each continent, without group-by
select * from cia o where  (select count(1) from cia where area > o.area and region=o.region) < 2
order by o.region, o.area

I feel for a student/practitioner, it pays to think in terms of CASE. This strikingly simple solution can be rewritten using (messy) CASE.

What’s the different between these 2 sql’s — a condition to put into WHERE or ON

Q: What’s the different between these 2 queries?

left outer join CodeLegend c on spv.StatusUid=c.CodeLegendUid
where c.FullName=’Active’
———————————————————————— ———-
left outer join CodeLegend c on spv.StatusUid=c.CodeLegendUid
and c.FullName=’Active’

%%A: all rows from outer table will show up in intermediate table. The “Active” condition becomes “and (c.FullName=’Active’ or c.FullName is null)” if put into ON clause

If in doubt, I usually put conditions in WHERE (not ON) because WHERE is less implicit and more explicit. Intermediate table tends to show more meaningful data this way. Not sure about this case though.

See http://searchoracle.techtarget.com/answer/LEFT-OUTER-JOIN-with-ON-condition-or-WHERE-condition

http://blog.sqlauthority.com/2009/03/15/sql-server-interesting-observation-of-on-clause-on-left-join-how-on-clause-effects-resultset-in-left-join/ explains right-table predicate in ON-clause

The more obscure issue is Outer table predicate in ON-clause, explained in other blog post…

seeing no nulls from self outer join@@

–drop table students
–go
–create table students(
—    student varchar(44),
—    score decimal(4,1),
—    ident int identity
–)
–insert students select ‘Andrew’, 15.6
–insert students select ‘Becky’, 13
–insert students select ‘Chuck’, 12.2
–insert students select ‘Dan’, 25.6
–insert students select ‘Eric’, 15.6
–insert students select ‘Fred’, 5.6
–select * from students
select * from students h left join students l
  on h.score >= l.score
order by h.score, l.score
–- won’t show the nulls expected of an outer join, so I remove the “=”. Now Fred (and Greg) the poor student doesn’t score higher than anyone and therefore shows nulls.
Another way to get the nulls is adding the not-equal join condition
select * from students h left join students l
  on h.score >= l.score
and h.student != l.student –- this condition is essential to self outer joins
Now, this topic is relevant to me because whenever I construct an outer join, I expect to see nulls. An outer join without nulls is a emergency exit seat without legroom.
Q: if an outer join returns no nulls, is it equivalent to an inner join?
A: Usually yes. I’m yet to find an exception.

count(nullable_field) can return 0 – tricky in outer join

Opening quiz: how do you get count() to return a 0? I believe most novice SQL developers are unfamiliar.

See the vmware questions in pearl blog. Tested in sybase.

select count(field_allowing_nulls) — would not count the /null-cells/. See P 91 [[ introduction to sql ]]. Note you can think of the table as a spreadsheet with named and numbered columns, numbered rows, and numbered cells.

select count(*), count(1234) — would simply count all rows in the entire table without filtering

This null-behavior is esp. important in outer joins.

Generalize -> Just like count(), many aggregate functions like agg_func(filed_allowing_nulls) would filter out null values before any calculation. Best understood with average(), which must discard all null values before computation.

outer join generalized

A outer join B on
condition1_involving_both and
condition2_involving_both and
condition3_involving_both

In the worktable, system generates at least one result row for each A row.

System takes the A row and evalutes the condition set. If 4 B rows match, then 4 result rows. If no B row matches, then still one result row with (A.*, null as B.col1 , null as B.col2..)

Note that conditions may be any predicate, not only A.id=B.id

index selection in correlated subquery

http://www.onlamp.com/pub/a/onlamp/2004/09/30/from_clauses.html has a simple but good illustration.

select a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a
where a.ZIP_CODE in ...
and exists (select NULL from ORDERS b
where b.CUSTOMER_ID = a.CUSTOMER_ID and b.ORDERED_DATE >= ...)
order by a.CUSTOMER_NAME

is a good example of a simple correlated subquery. When the rdbms runs the subquery *for each* value of a.customer_id, this value is
taken as a constant. Therefore, the Orders index to use is the index on b.customer_id (perhaps in conjuction with b.ordered_date).

Remember for each table accessed, optimizer chooses one index. I don’t think it can use 2 indices.

listing customers with 0 payment history

listing customers with 0 payment history (according to the Payment table) — A simple anti-join, well covered in literature.

Customer_table (cid…) with primary key on cid
Payment_table (cid, date, amount ..) with non-unique index on foreign-key cid

S1 (uncorrelated): select C.cid
from C
where C.cid not in
(select distinct cid from P)

S2 (correlated) — thanks to index, no FTS:
select C.cid
from C
where not exists
(select * from P
where C.cid = P.cid)

[1] One of the first key observations is the subset relationship between the prikey C.cid and forkey P.cid, and consequently the referential integrity.

[2] Second key observation is that P.cid is (perhaps heavily) repetitive, if customers pay a few times a day (for a subway pass). I think it’s imperative to use the index on P.cid.

[3] Third key observation is the absense of additional criteria => have to process every C.cid. In most real life queries, another where-criterion restricts us to a fraction of the C.cid values. In that case P.cid FTS is bad.

S1 subquery runs only once and reads every Payment_table page exactly once. S2 subquery runs once for each [4] row in Customer_table, reads every index entry but never reads the Payment_table. Due to caching, S2 loads every index data block exactly once.

[4] I think the repetitive run of the subquery is not as bad as unnecessary disk reads. Number of disk reads dominates SELECT performance. What’s the disk vs memory speed ratio? Orders of magnitude! S1 reads all P.cid, S2 loads the entire index on P.cid. Which reads less? Also see http://www.onlamp.com/pub/a/onlamp/2004/09/30/from_clauses.html.

In any case, the correlated S2 always reads every distinct P.cid that matches C.cid, which is every[1] distinct P.cid /value/, but I think S2 reads only index, not the table. In most real-world queries, a index scan precedes a table read.

It is possible that FTS on P requires almost the same number of disk reads as a full index scan, if P.cid is close to unique. Without [2], disk reads related to P are identical between S1 and S2.

S3 (using distinct): — no FTS on P, probably beats the anti-join recommendation in [[ oracle sql tuning ]].
select distinct C.cid
from C outer join P on cid
where P.cid = null

S4 (using count): select C.cid
from C outer join P on cid
having count(date) = 0
group by C.cid

S5 (using sum): select C.cid
from C outer join P on cid
having sum(amount) = NULL — not 0. P 93 [[ intro to sql ]]
group by C.cid

S6 (using minus):

join order ] oracle – brief notes

* n-way
* Oracle can take an hour to choose a join order
* To limit it, you can tell the Oracle optimizer how many permutations to look into. P406 [[ oracle sql tuning ]]
* “ordered” hint tells the optimizer to follow the join order specified in the FROM clause

But what join order would you, the developer, prefer? I feel we should avoid re-scanning a large table

sub-query as column expression (sybase

— sybase
my $subselect=qq[ select min(ApprovalDate) from MtgAccts where …];
my $sql = “select *, convert(varchar, Day1ofMonth, 112) as ‘numeric_month’, ($subselect) as ‘approval_asof’…

http://sqlzoo.net/1.htm (tested on Oracle, MSSQL, Mysql)

* To list number of population “superpowers” by region

SELECT region, count(*),
(select count(*) from bbc a where population > 100123000 and a.region=b.region) as superpowers
FROM bbc b group by region

common 1-table sql:increments between adjacent rows

eg: savings (datetime, balance)

For a live example, let’s use 
http://sqlzoo.net/1.htm table countries (population, area…)

To simplify display, we will limit ourselves to the top 5 populations. 
 
* rank populations and list the pop decrement
select a.name,a.population, min(a.population-b.population) as decr from bbc a, bbc b where
a.population > b.population and b.population > 150000000
group by a.name, a.population
order by a.population desc

* Rank populations and list the pop decrement along with the 2 countries compared:

SELECT a.name, a.population, a.population-b.population as decr, b.name FROM bbc a, bbc b where b.population > 160000000 and
a.population-b.population = ( select min(a.population-c.population) from bbc c where a.population>c.population)
order by a.population desc

 

left join == left outer join

“The ANSI outer-join syntax begins an outer join with the LEFT JOIN, LEFT OUTER JOIN, RIGHT JOIN, or RIGHT OUTER JOIN keywords. The OUTER keyword is optional.”  ==> LEFT implies LEFT-OUTER

“ANSI join syntax also allows the dominant or subordinate part of an outer join to be the result set of another join, when you begin the join with a left parenthesis.”
right join is lg2
 

null forkey ^ unmatched prikey

–simple illustration:
emplyee {employee_id, department_id, age, salary…}
department {department_id, size, average age…}

null-forkey: employee without department_id
unmatched prikey: department without employee

— null forkey — perhaps not a case for outer join

“null in forkey column”. referential integrity violated.

http://sqlzoo.net/4.htm

Q 2a: One of the MSP (member of parliment) has no party (null forkey, violating referential integrity). How do you include him in a member-and-party listing?

A: select msp.name, party.name from msp, party where msp.party=party.code
union
select msp.name, null from msp where
msp.party is null

Neither makes sense:
wrong: msp.party=party.code(+) — no effect. How can party.code (prikey) be a SUBSET of msp.party (forkey)? referential integrity violated

wrong: msp.party(+)=party.code — showing empty parties

— unmatched prikey — #1 common scenario for outer join

“forkey = subset@prikey”

Heart of every outer join is a subset relationship. If the 2 join-columns are prikey-forkey, then the smaller set has to be the forkey — referential integrity. An outer join will show ALL the rows from the prikey side, whether or not there’s a matching row.


where employee.employee_id = bonus.employee_id(+) — “+” on the forkey side

nested loop join

for now, focus on 1-to-1 mapping between tables.
  • oracle takes 1st row from driving-table
  • oracle looks for a matching row in 1st joined-to table
  • oracle builds a combined row containing fields from both tables
  • the combined row becomes input to the 3rd loop scanning the 3rd table
  • a final combined row is returned

one-word intro 2 outer join

Outer join (amidst other joins) can be confusing for beginners. Different authors use their favorite anchor keywords. Best one-word intro: subset

Sometimes, one join-column is a subset of the other join-column, where outer join and inner join show their difference. Focus on subset scenario for now. Once you get your mind around this scenario, you are ready to look at the generic scenario where one join column is not a subset of another join column

–2nd word after “subset”? union

In some SQL implementations, u need a union query to do outer join

I believe unions solve more problem than outer joins.

–simplest example: bonus
show employees with and without bonus
“bonus” table {employee_id (forkey), bonus}

joins^sub^group-by^agg

Let’s take the base sql vocabulary as a starting point
without joins
without sub queries
without grouping
without agg ie aggregates
without union

Q: which addition is “troublesome” for users?

$ Join is natural to sql. Even outer join is natural.
$ Union is not as natural but simple to understand
$ Sub query is an unnatural addition to sql. ugly.
$ correlated sub query is complex.
$ Group-by imposes restrictions on other parts of a select-statement, such as “select expressions must be …”
$ Agg imposes restrictions, such as “other select expressions must be …”

forkey ^ join ^ cartesian-product

referential-integrity ^ forkey ^ any_type_of_join ^ cartesian-product — at the heart of the relational paradigm.

* most if not all joins (including self-join, outer join) are Cartesian in nature, and produce a intermediate Cartesian table (icart) initially. (No need to explain “initially”)
* forkeys exist primarily (if not always) as join-columns
* relational model relies on forkeys at its heart
* normalization usually (if not always) create forkeys

common 1-table query: list buses pass`both K&&T

http://sqlzoo.net/6.htm

question 2a:
The table: route(num,company,pos, stop) ie (service_num, bus_company, position, bus_stop)

select a.num, a.company from route a, route b 
where a.stop=53
and b.stop=149
and a.num=b.num
and a.company=b.company

The self-join creates a 8-column icart (ie intermediate cartesian). The rows form a cartesian product. The where-clause filters the 8-field rows based solely on the 8 fields.

I feel some of the conditions should be join conditions.

common 1-table query:list biggest countries]each region

http://sqlzoo.net/1a.htm 1 table only: bbc(name, region, area, population, gdp) Surprisingly, such a simple thing need a complicated solution in sql!

–You can use a correlated sub query with ALL

select name, region, area from bbc a where area >=
all (select area from bbc b where a.region=b.region) — equal sign needed

–You can use a correlated sub query with max()

select name, region, area from bbc a where area =
(select max(area) from bbc b where a.region=b.region)

I guess you may also use a self-join? Maybe “biggest” type of query unnatural for self-join?

correlated^uncorrelated subquery

#1 difference

Q: “how many times does the subquery run during a complete run of the /enclosing/ query?”
A: exactly once for uncorrelated. You can see it is much simpler — the sub-select is complete and ready to run on its own just like a regular select, and doesn’t mention anything from the enclosing query.

Look at the example in http://www.onlamp.com/pub/a/onlamp/2004/09/30/from_clauses.html
* uncorrelated often has the form “… IN (select …”.
* correlatd often has the form “… EXISTS (select … “

"referential-integrity" explained by outer join, subset

see posts on “unmatched prikey”

DB designers often create healthy subset relationships between Table A’s prikey and Table B’s forkey. Any forkey’s #1 purpose is a join column. Subset => outer join.

Why prikey is never a subset of forkey? Such a freak subset means one of the forkey entries is unmatched in the prikey column — breaking referential integrity.

(un)correlated subquery^join

Correlated/uncorrelated sub-queries and joins are blood brothers and often simultaneously usable to solve a given problem. A few non-obvious differences.

* subquery — clearer than joins
sub query may look unfriendly compared to joins, but actually the logic is easier to make out.

* uncorrelated — most readable
Uncorrelated is usually *simpler* than joins or correlated.

* uncorrelated — fast
b’cos one nested scan only. Correlated (not uncorrelated) can hurt performance because it could involve one scan per value “passed in”

Uncorrelated requires just one execution of the sub query.