Explore the strategy of Local Predicate Generation for Queries within DB2 UDB for iSeries V5R3.
With the availability of V5R3 i5/OS and DB2 UDB for
iSeries, the SQL Query Engine now supports a powerful strategy for minimizing
query input/output (I/O) and maximizing query performance. This new strategy
appears as magic, given the Query Optimizer's ability to rewrite the query using
generated local selection predicates where none were specified. This article
explores the technique and benefits of look-ahead predicate generation
(LPG).
In the world of query optimization, minimizing or even eliminating
the reading and processing of unnecessary data is the key to good performance
because I/O operations are relatively slow operations. The job of the query
optimizer is to choose the appropriate methods and build a strategy that will
access and process the rows as quickly and efficiently as possible. With a high
number of strategies available, the optimizer will normally be able to build a
query plan that meets the user response time requirements.
Within a
query, the local selection predicates are used to specify which rows are to be
selected for processing. One technique for eliminating rows from further
processing requires reading the data from the table(s) and testing values.
Another technique for eliminating rows is accomplished without actually reading
the data from the table(s) but instead relies upon mathematic principles and
other database objects to avoid testing rows in the table(s). For example, given
a query with local selection that identifies one row out of a one-billion-row
table, the database engine can read and test every row (one billion tests), or
it can use an index to identify only the one matching row, eliminating all the
other rows without reading and testing them. While the result will be the same,
the difference in query performance between these two strategies will be significant.
Another query scenario that can cause a lot of reading (and
poor performance) is joining of tables. Specifying a join usually causes the
database engine to select rows from one table and perform a read operation to
find any matching rows in another table. By definition, the join can reduce the
result set, but only by testing for the existence of the row. The potential for
reading and ultimately rejecting many rows makes this type of query problematic.
Again, if the query optimizer can employ specific strategies to eliminate rows
before the join, query performance can be increased and use of database server
resources can be reduced.
DB2
UDB for iSeries supports many methods and employs many strategies to
substantially reduce the reading of rows and the processing of data. One such
strategy is called look-ahead predicate generation (LPG). This powerful and
elegant feature can significantly reduce query times.
A simple join
scenario will help explain why the LPG support can have such a positive affect
on a query's performance.
A Couple of Methods
Assume two tables: one relatively small (SmallTableA)
and one relatively large (LargeTableB). A join query is issued referencing both
tables. Local selection is defined against SmallTableA, but not LargeTableB. The
selectivity of the local selection predicate is very high (i.e., it identifies
two rows in SmallTableA).
SELECT * FROM SmallTableA A, LargeTableB B WHERE A.Join_Col = B.Join_Col AND A.Col1 IN (112358, 132134)
With no indexes defined on LargeTableB, you have various table access and
join processing combinations:
One strategy is to join LargeTableB to
SmallTableA. Select all rows via a full table scan of LargeTableB and join every
row to SmallTableA via an index or a hash table. Figure 1 illustrates this
join.
Figure 1: Here, LargeTableB is joined to SmallTableA. (Click images to
enlarge.)
Given no
local selection, every row selected from LargeTableB will be used to try the
join to SmallTableA. This combination will produce a large number of
reads.
Another strategy is to join SmallTableA to LargeTableB. Select all
rows via full table scan of LargeTableB to create a temporary indexed list or
temporary hash table, as shown in Figure 2.
Figure 2: Here, SmallTableA is joined to LargeTableB.
Given no
local selection to reduce the rows, the temporary indexed list or temporary hash
table on LargeTableB will have to represent all the rows from LargeTableB, and
both temporary data structures will take a relatively long time to populate.
These structures will be relatively large as well.
A Couple of Better Methods
With DB2's LPG and query rewrite techniques, the join
scenarios can be enhanced considerably.
If LargeTableB has some local
selection defined and this local selection is somewhat selective (i.e.,
narrowing down the rows to be selected and processed), then the database engine
can use methods that will reduce the number of rows accessed. This is where LPG
comes in! The query optimizer can "look ahead" to generate local selection from
other tables in the query and then transfer that local selection to another
table. When this local selection is available, methods that take advantage of
indexing technology can be employed to speed up the query request. (Click here
for more information about indexing strategies.)
Using the example above,
the local selection specified for SmallTableA will be used to identify not only
the rows in SmallTableA but also the corresponding join column values. For
example, the SmallTableA row containing A.Col1 = 112358 also contains A.Join_Col
= Value1 (i.e., the corresponding join column value).
These distinct join
column values derived from SmallTableA are used to generate local selection on
LargeTableB. Now that LargeTableB has local selection, additional techniques can
be applied to minimize or eliminate reading large sets of rows.
Here's a
representation of the rewritten query:
SELECT * FROM SmallTableA A, LargeTableB B WHERE A.Join_Col = B.Join_Col AND A.Col1 IN (112358, 132134) AND B.Join_Col IN (Value1, Value2)
The new local selection allows the optimizer to consider additional join
strategies. If the (generated) local selection on LargeTableB significantly
reduces the number of rows accessed, LargeTableB may be placed in the first
(primary) join position and accessed via an index.
Another strategy has
LargeTableB placed in the second join position. If the join on LargeTableB is
implemented via a temporary hash table, the amount of data placed in the hash
will also be reduced by the (generated) local selection. Populating the
temporary hash table can be much faster if an index is created for the
(generated) local selection. Figure 3 illustrates this strategy.
Figure 3: Here, LargeTableB is placed in the second
join position.
The real magic of LPG
comes into play when there is more than one table joined to the larger
table. Assume four tables: three relatively small (SmallTableA,
SmallTableB, SmallTableC) and one relatively large (LargeTableD). A join query
is issued referencing both small and large tables. Local selection is defined
against the small tables, but not the larger LargeTableD. The selectivity of the
local selection predicates is very high (i.e., it identifies few rows in
SmallTableA, SmallTableB, and SmallTableC).
SELECT * FROM SmallTableA A, SmallTableB B, SmallTableC C, LargeTableD D WHERE A.Join_Col = D.Join_Col1 AND B.Join_Col = D.Join_Col2 AND C.Join_Col = D.Join_Col3 AND A.Col1 IN (112358, 132134) AND B.Col6= 'ABC' AND C.Col4= 2005 AND C.Col5= 'January'
These are the various table access and join processing combinations for
the first join pair:
- SmallTableA joined to LargeTableD
- SmallTableB joined to LargeTableD
- SmallTableC joined to LargeTableD
Select all rows via full
table scan of LargeTableB to create a temporary indexed list or a temporary hash
table. For all rows joined to LargeTableD, join to other two SmallTables, as
shown in Figure 4.
Figure 4: Here, all rows joined to LargeTableD join to other two
SmallTables.
Now, consider these table access and join processing combinations:
- LargeTableD joined to SmallTableA
- LargeTableD joined to SmallTableB
- LargeTableD joined to SmallTableC
Select all rows via full
table scan of LargeTableD and join every row to SmallTableA via an index or a
hash table. For all rows from LargeTableD, join to the other three SmallTables,
as shown in Figure 5.
Figure 5: Here, all rows from LargeTableD join to the other three
SmallTables.
Using the four-table query example above, the local
selection specified for SmallTableA, SmallTableB, and SmallTableC will be used
not only to identify the rows in the respective tables, but also to identify the
corresponding join column values. These distinct join column values (derived
from SmallTableA, SmallTableB, and SmallTableC) are used to generate local
selection on LargeTableD using LPG. See Figure 6.
Figure 6: The distinct join column values use LPG to generate local
selection on LargeTableD.
The rewritten query would be represented
like this:
SELECT * FROM SmallTableA A, SmallTableB B, SmallTableC C, LargeTableD D WHERE A.Join_Col = D.Join_Col1 AND B.Join_Col = D.Join_Col2 AND C.Join_Col = D.Join_Col3 AND A.Col1 IN (112358, 132134) AND B.Col6= 'ABC' AND C.Col4= 2005 AND C.Col5= 'January' AND D.Join_Col1 IN (Value1, Value2) AND D.Join_Col2 IN (Value1, Value2, ... Value n) AND D.Join_Col3 IN (Value1, Value2, ... Value n)
Now that LargeTableD has local selection defined, an appropriate set of
indexes can be created and applied to minimize (or eliminate) reading large sets
of rows.
Yet another DB2 UDB for iSeries feature can be employed to
handle the newly generated local selection--namely, index ANDing. This is the
ability of the query optimizer and database engine to use more than one index to
specifically identify and access rows within a given table. Using
single-column-key indexes defined for each join column in LargeTableD, the
database engine can merge the list of rows identified by each respective index
and use the final list to access the table. The intersection (ANDing) of all
three local selection predicates should result in a much narrower set of rows to
be accessed and processed in LargeTableD. Using the indexes allows the database
engine to avoid reading a large set of rows, resulting in a more efficient and
faster query. See Figure 7.
Figure 7: Index ANDing allows the database engine to avoid reading a large
set of rows.
The optimizer can also use LPG with common join requests
that contain a one-to-one relationship. In this case, LPG is used to recursively
look ahead and generate local selection from the neighboring table "downstream."
This has the tremendous benefit of allowing the optimizer much more latitude in
setting the join order. Furthermore, this strategy reduces the risk of any one
particular join order delivering poor performance.
Assume four tables all
roughly the same size: TableA, TableB, TableC, and TableD. A join query is
issued referencing all four tables with a one-to-one relationship (A to B, B to
C, C to D). Local selection is defined against only one table, TableA.
SELECT * FROM TableA A, TableB B, TableC C, TableD D WHERE A.Join_Col1 = B.Join_Col1 AND B.Join_Col2 = C.Join_Col2 AND C.Join_Col3 = D.Join_Col3 AND A.Col5 = 'ABC'
Determining the best-performing query plan can be problematic, depending
on the selectivity of the local selection predicates defined for TableA and the
relative position of TableA in the join order. If all the tables have some local
selection defined, the optimizer has many more choices for the query join order
and consequently more opportunities to provide multiple high-performance query
plans.
Using the four-table query example above, the local selection
specified for TableA will be used not only to identify the rows in TableA, but
also to identify the corresponding join column values. These distinct join
column values (derived from TableA) are used to generate local selection on
TableB.
The generated local selection specified for TableB will be
used to identify the rows in TableB and identify the corresponding join column
values. These distinct join column values (derived from TableB) are used to
generate local selection on TableC.
The generated local selection
specified for TableC will be used to identify the rows in TableD and identify
the corresponding join column values. These distinct join column values (derived
from TableC) are used to generate local selection on TableD.
Now the
query has been rewritten and optimized with local selection on all four tables
(Figure 8).
Figure 8: The query is optimized with local selection on all four
tables.
Here's a representation of the rewritten query:
SELECT * FROM TableA A, TableB B, TableC C, TableD D WHERE A.Join_Col1 = B.Join_Col1 AND B.Join_Col2 = C.Join_Col2 AND C.Join_Col3 = D.Join_Col3 AND A.Col5 = 'ABC' AND B.Join_Col1 IN (Value1, Value2, ... Value n) AND C.Join_Col2 IN (Value1, Value2, ... Value n) AND D.Join_Col3 IN (Value1, Value2, ... Value n)
Being aware of and planning for LPG will allow the creation of indexes to
effectively support join queries. Without this knowledge, and the corresponding
indexes, better query performance will be squandered. The strategy for creating
indexes for the generated local selection predicates is identical to creating
indexes for user-supplied local predicates. Identifying the join columns is a
clue to identifying which columns to create indexes on.
Query Magic
Like a good magician, the query optimizer uses
techniques that are hidden from the audience. Only the final result is obvious.
Given that query rewrite and LPG occurs out of sight, it will be helpful to get
some feedback from the optimizer when this strategy is employed. One simple
technique is to observe the estimated selectivity of each table in the query
when that table has no user-supplied local selection predicates. Without LPG,
the estimated number of rows selected should be equal to the total number of
rows in the table. If the estimated number of rows selected is less than
the total, then some selection is being applied.
Another more accurate
technique is to rely on iSeries Navigator - Visual Explain to render the query
graph. Once the query graph is displayed, any nodes that are under the influence
of LPG can be highlighted (in green). To identify the use of LPG, select the
"Highlight LPG" option on the View menu within the Visual Explain
window.
Figure 9: iSeries Navigator - Visual Explain renders the query
graph.
The key to fully utilizing an optimization capability is in
integrating that capability into the normal flow of query optimization--in
effect, making that capability pervasive. Traditionally, sophisticated
optimization techniques rely on interrogation of the query or query environment
to recognize when and where to apply the specific technique. If this
identification is successful, then the technique can be successful; otherwise,
the technique is useless or even detrimental.
While the DB2 UDB for
iSeries LPG may seem like a specialized and rarely needed technique, it really
is a pervasive and powerful tool within the optimizer's bag of tricks. The
simple join examples illustrated here are but one area in which this capability
can be applied. Other areas include queries against star
schema and snowflake schema data models, as well as SQL requests with
subqueries, common table expressions, or derived
tables.
Rob Bestgen is a senior technical staff
member and query optimizer design leader on the IBM DB2 for iSeries team in
Rochester, Minnesota. Rob can be reached at
This e-mail address is being protected from spam bots, you need JavaScript enabled to view it
.
Mike Cain is a senior technical staff
member and leader of the DB2 for iSeries Center of Competency in Rochester,
Minnesota. Mike can be reached at
This e-mail address is being protected from spam bots, you need JavaScript enabled to view it
.
|