sql - Performance and Readability of REGEXP_SUBSTR vs INSTR and SUBSTR

Question

Welcome To Ask or Share your Answers For Others

sql - Performance and Readability of REGEXP_SUBSTR vs INSTR and SUBSTR

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

From My other question, Using REGEXP_SUBSTR with Strings Qualifier, I'm trying to decide which approach would be better to use.

The Resulting Data set should only show the strings before the delimters PLE, # and ALL in proper order. The Current Query already in the package is something like this (DDL and DML is in the Bottom of the Post):

SELECT  DATA1
      , DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1)) GET_DATA_TILL_FIRST_PLE
      , DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1)) GET_DATA_TILL_FIRST_NUM_SIGN
      , DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1)) GET_DATA_TILL_FIRST_ALL
      , NVL(DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1), 0,
        DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1), 0,
        DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1), DATA1), DATA1), DATA1), DATA1) PUT_THEM_ALL_TOGETHER    
FROM    table_x;

Which Results into a Data Set below:

DATA1                   | GET_DATA_TILL_FIRST_PLE | GET_DATA_TILL_FIRST_#_SIGN  | GET_DATA_TILL_FIRST_ALL    |  PUT_THEM_ALL_TOGETHER
----------------------- | ----------------------- | --------------------------- | -------------------------- |  ----------------------
STRING_EXAMPLE          | STRING_EXAM             |                             |                            |  STRING_EXAM
TREE_OF_APPLES          | TREE_OF_AP              |                             |                            |  TREE_OF_AP
FIRST_EXAMPLE           | FIRST_EXAM              |                             |                            |  FIRST_EXAM
IMPLEMENTATION          | IM                      |                             |                            |  IM
PARIS                   |                         |                             |                            |  PARIS
PLEONASM                |                         |                             |                            |  PLEONASM
XXXX 1                  |                         |                             |                            |  XXXX 1 
XXXX YYYYYY 2 FFFFFFFFF |                         |                             |                            |  XXXX YYYYYY 2 FFFFFFFFF
XXXX YYYYYY 5FFFFFFFFF  |                         |                             |                            |  XXXX YYYYYY 5FFFFFFFFF
OPOPOPOPO #09090 APPLE  | OPOPOPOPO #09090 AP     | OPOPOPOPO                   | OPOPOPOPO #                |  OPOPOPOPO #09090 AP
OPOPOPOPO BALL#         |                         | OPOPOPOPO BALL              | OPOPOPOPO B                |  OPOPOPOPO BALL
BALL IS #LIFE           |                         | BALL IS                     | B                          |  BALL IS

PS. I only need column PUT_THEM_ALL_TOGETHER but i included the other columns as well do add context.

I find the query bit confusing and hard to read so i tried using REGEXP_SUBSTR and with the suggestion of @vkp, i came up with the below query that results into the same data set above.

SELECT  DATA1
  , REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1) GET_DATA_TILL_FIRST_PLE
  , REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1) GET_DATA_TILL_FIRST_#_SIGN
  , REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1) GET_DATA_TILL_FIRST_ALL
  , COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1),
             REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1),
             REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1),
             DATA1) PUT_THEM_ALL_TOGETHER
FROM    table_x;

However, From @MathGuy's Answer, it seems that INSTR and SUBSTR is much more efficient. i tested this to some extent and here's what i got:

Using INSTR and SUBSTR:

SET TIMING ON;    
BEGIN
    UPDATE  table_x
    SET     DATA2 = NVL(DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1), 0,
                    DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1), 0,
                    DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1), DATA1), DATA1), DATA1), DATA1);    
    ROLLBACK;        
END;
/

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.234

Using REGEXP_SUBSTR:

SET TIMING ON;  
BEGIN    
    UPDATE  table_x
    SET     DATA2 = COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
                            ,DATA1);
    ROLLBACK;        
END;
/

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.236

Although this is a very limited test data, it shows that the combination of INSTR and SUBSTR is a little bit faster than REGEXP_SUBSTR. Would it be negligible to use REGEXP_SUBSTR instead of INSTR and SUBSTR for readability's sake?

DML and DDL:

create table table_x 
(
    data1 varchar2(100)    
   ,data2 varchar2(100)
);

INSERT INTO table_x (DATA1) VALUES ('STRING_EXAMPLE');
INSERT INTO table_x (DATA1) VALUES ('TREE_OF_APPLES');
INSERT INTO table_x (DATA1) VALUES ('FIRST_EXAMPLE');  
INSERT INTO table_x (DATA1) VALUES ('IMPLEMENTATION');   
INSERT INTO table_x (DATA1) VALUES ('PARIS');            
INSERT INTO table_x (DATA1) VALUES ('PLEONASM');        

INSERT INTO table_x (DATA1) VALUES ('XXXX 1');   
INSERT INTO table_x (DATA1) VALUES ('XXXX YYYYYY 2 FFFFFFFFF'); 
INSERT INTO table_x (DATA1) VALUES ('XXXX YYYYYY 5FFFFFFFFF'); 

INSERT INTO table_x (DATA1) VALUES ('OPOPOPOPO #09090 APPLE'); 
INSERT INTO table_x (DATA1) VALUES ('OPOPOPOPO BALL#'); 
INSERT INTO table_x (DATA1) VALUES ('BALL IS #LIFE');

Thank you.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

458 views

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:42:06+0000

I already posted an answer showing how to solve this problem using INSTR and SUBSTR the right way.

In this "Answer" I address the other question - which solution is more efficient. I will explain the test below, but here is the bottom line: the REGEXP solution takes 40 times longer than the INSTR/SUBSTR solution.

Setup: I created a table with 1.5 million random strings (all exactly eight characters long, all upper-case letters). Then I modified 10% of the strings to add the substring 'PLE', another 10% to add a '#' and another 10% to add 'ALL'. I did this by splitting an original string at position mod(rownum, 9) - that is a number between 0 and 8 - and concatenating 'PLE' or '#' or 'ALL' at that position. Granted, not the most efficient or elegant way to get the kind of test data we needed, but that is irrelevant - the point is just to create the test data and use it in our tests.

So: we now have a table with just one column, data1, with some random strings in 1.5 million rows. 10% each have the substring PLE or # or ALL in them.

The test consists in creating the new string data2 as in the original post. I am not inserting the result back in the table; regardless of how data2 is calculated, the time to insert it back in the table should be the same.

Instead, I put the main query inside an outer one that computes the sum of the lengths of the resulting data2 values. This way I guarantee the optimizer can't take shortcuts: all data2 values must be generated, their lengths must be measured, and then summed together.

Below are the statements needed to create the base table, which I called table_z, then the queries I ran.

create table table_z as
select dbms_random.string('U', 8) as data1 from dual
connect by level <= 1500000;

update table_z 
set data1 = case
when rownum between      1 and 150000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'PLE' || substr(data1, mod(rownum, 9) + 1)
when rownum between 150001 and 300000 then substr(data1, 1, mod(rownum, 9)) 
                               || '#'   || substr(data1, mod(rownum, 9) + 1)
when rownum between 300001 and 450000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'ALL' || substr(data1, mod(rownum, 9) + 1)
          end
where rownum <= 450000;

commit;

INSTR/SUBSTR solution

select sum(length(data2))
from (
select data1, 
       case 
         when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
         when instr(data1, '#'  , 2) > 0 then substr(data1, 1, instr(data1, '#'  , 2) - 1)
         when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
         else data1 end
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:00.73

REGEXP solution

select sum(length(data2))
from (
select data1, 
       COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
                            ,DATA1)
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:30.75

Before anyone suggests these things: I repeated both queries several times; the first solution always runs in 0.75 to 0.80 seconds, the second query runs in 30 to 35 seconds. More than 40 times slower. (So it is not a matter of the compiler/optimizer spending time to compile the query; it is really the execution time.) Also, this has nothing to do with reading the 1.5 million values from the base table - that is the same in both tests, and it takes far less time than the processing. In any case, I ran the INSTR/SUBSTR query first, so if there was any caching, the REGEXP query would have been the one to benefit.

Edit: I just figured out one inefficiency in the proposed REGEXP solution. If we anchor the search pattern to the beginning of the string (for example '^(.+?)PLE', notice the ^ anchor), the runtime for the REGEXP query drops from 30 seconds to 10 seconds. Apparently the Oracle implementation isn't smart enough to recognize this equivalence and tries searches from the second character, from the third, etc. Still the execution time is almost 15 times longer; 15 < 40 but that is still a very large difference.

Categories

sql - Performance and Readability of REGEXP_SUBSTR vs INSTR and SUBSTR

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags