2022/06/22

MySQLだけでWindow関数を使って@rowとか使わずに95%ileを計算したい

やり方があってるかどうかわからないので違ってたら教えてほしい。

サンプルデータこんな感じ。


mysql80 209534> SELECT * FROM t1 LIMIT 3;

+---------------------+-----------+
| dt                  | rows_read |
+---------------------+-----------+
| 2022-05-31 17:16:00 |         0 |
| 2022-05-31 17:16:01 |         6 |
| 2022-05-31 17:16:03 |         0 |
+---------------------+-----------+

3 rows in set (0.00 sec)

まずは全期間でrows_readの95%ileを計算してみたい。

パーセンタイルを一発で求めるなにかは無さそうなので、まずはおとなしくRANK()で並べ替える。

mysql80 209534> SELECT dt, rows_read, RANK() OVER (ORDER BY rows_read) AS _rank FROM t1;
+---------------------+-----------+-------+
| dt                  | rows_read | _rank |
+---------------------+-----------+-------+
| 2022-05-31 17:16:00 |         0 |     1 |
| 2022-05-31 17:16:03 |         0 |     1 |
| 2022-05-31 17:16:04 |         0 |     1 |

..
| 2022-05-31 18:26:54 |   1652200 |  5297 |
| 2022-05-31 18:20:12 |   1656600 |  5298 |
| 2022-05-31 18:40:09 |   1668900 |  5299 |
+---------------------+-----------+-------+
5299 rows in set (0.00 sec)

要はこの _rank / 5299 がパーセンタイルになるので割ってやればいいんだけど、5299の部分はもちろん動的に作りたい……が、COUNT取らないといけないので一旦WITH句に逃がす。

mysql80 209534> WITH
    -> _count AS (
    ->   SELECT COUNT(*) AS c FROM t1
    -> ),
    -> _ranked AS (
    ->   SELECT dt, rows_read, RANK() OVER (ORDER BY rows_read) AS _rank FROM t1
    -> )
    -> SELECT dt, rows_read, _rank, c, (_rank / c) * 100 AS percentile
    -> FROM _ranked JOIN _count;

+---------------------+-----------+-------+------+------------+
| dt                  | rows_read | _rank | c    | percentile |
+---------------------+-----------+-------+------+------------+
| 2022-05-31 17:16:00 |         0 |     1 | 5299 |     0.0189 |
| 2022-05-31 17:16:03 |         0 |     1 | 5299 |     0.0189 |
| 2022-05-31 17:16:04 |         0 |     1 | 5299 |     0.0189 |

..
| 2022-05-31 18:26:54 |   1652200 |  5297 | 5299 |    99.9623 |
| 2022-05-31 18:20:12 |   1656600 |  5298 | 5299 |    99.9811 |
| 2022-05-31 18:40:09 |   1668900 |  5299 | 5299 |   100.0000 |
+---------------------+-----------+-------+------+------------+
5299 rows in set (0.01 sec)

この結果セットの WHERE percentile >= 95 ORDER BY percentile ASC LIMIT 1 が95%ile値ってことで合ってるかしらん。

mysql80 209534> WITH
    -> _count AS (
    ->   SELECT COUNT(*) AS c FROM t1
    -> ),
    -> _ranked AS (
    ->   SELECT dt, rows_read, RANK() OVER (ORDER BY rows_read) AS _rank FROM t1
    -> ),
    -> _percentiled AS (
    ->   SELECT dt, rows_read, (_rank / c) * 100 AS percentile
    ->   FROM _ranked JOIN _count
    -> )
    -> SELECT rows_read, percentile
    -> FROM _percentiled
    -> WHERE percentile >= 95
    -> ORDER BY percentile ASC
    -> LIMIT 1;
+-----------+------------+
| rows_read | percentile |
+-----------+------------+
|   1562614 |    95.0179 |
+-----------+------------+
1 row in set (0.01 sec)

とりあえず出てきた。

mysql80 209534> WITH
    -> _count AS (
    ->   SELECT COUNT(*) AS c FROM t1
    -> ),
    -> _ranked AS (
    ->   SELECT dt, rows_read, RANK() OVER (ORDER BY rows_read) AS _rank FROM t1
    -> )
    -> SELECT dt, rows_read, _rank, c, (_rank / c) * 100 AS percentile
    -> FROM _ranked JOIN _count;

..
| 2022-05-31 18:06:07 |   1562400 |  5031 | 5299 |    94.9424 |
| 2022-05-31 18:22:27 |   1562435 |  5032 | 5299 |    94.9613 |
| 2022-05-31 18:25:30 |   1562506 |  5033 | 5299 |    94.9802 |
| 2022-05-31 18:23:04 |   1562587 |  5034 | 5299 |    94.9991 |
| 2022-05-31 18:44:19 |   1562614 |  5035 | 5299 |    95.0179 |   <--- ちゃんとここの行
| 2022-05-31 18:09:39 |   1562900 |  5036 | 5299 |    95.0368 |
| 2022-05-31 18:34:26 |   1562900 |  5036 | 5299 |    95.0368 |
| 2022-05-31 18:44:26 |   1563000 |  5038 | 5299 |    95.0745 |

..

合ってるかなこれで。

さて、更にこれを「dtを1分単位で丸めて、その枠の中の95%ile値」にしたい。

RANK()を取るところまではいける。

mysql80 209534> SELECT DATE_FORMAT(dt, '%Y-%m-%d %H:%i:00') AS _dt, rows_read, RANK() OVER (PARTITION BY DATE_FORMAT(dt, '%Y-%m-%d  %H:%i:00') ORDER BY rows_read) AS _rank FROM t1;
+---------------------+-----------+-------+
| _dt                 | rows_read | _rank |
+---------------------+-----------+-------+
| 2022-05-31 17:16:00 |         0 |     1 |
| 2022-05-31 17:16:00 |         0 |     1 |
| 2022-05-31 17:16:00 |         0 |     1 |

..
| 2022-05-31 18:45:00 |   1563137 |    57 |
| 2022-05-31 18:45:00 |   1573000 |    58 |
| 2022-05-31 18:45:00 |   1598500 |    59 |
| 2022-05-31 18:46:00 |         0 |     1 |
+---------------------+-----------+-------+
5299 rows in set (0.02 sec)

さっきはCOUNTで割る数の5299が得られたけど、今度はパーティションごとにCOUNTの値が違うので MAX(_rank) で代用するのをWITHに備え付ける。

mysql80 209534> WITH
    -> _ranked AS (
    ->   SELECT DATE_FORMAT(dt, '%Y-%m-%d %H:%i:00') AS _dt, rows_read, RANK() OVER (PARTITION BY DATE_FORMAT(dt, '%Y-%m-%d %H:%i:00') ORDER BY rows_read) AS _rank FROM t1
    -> ),
    -> _count_max AS (
    ->   SELECT _dt, MAX(_rank) AS c FROM _ranked GROUP BY _dt
    -> )
    -> SELECT _dt, rows_read, _rank, c, (_rank / c) * 100 AS percentile
    -> FROM _ranked JOIN _count_max USING(_dt);
+---------------------+-----------+-------+------+------------+
| _dt                 | rows_read | _rank | c    | percentile |
+---------------------+-----------+-------+------+------------+
| 2022-05-31 17:16:00 |         0 |     1 |   59 |     1.6949 |
| 2022-05-31 17:16:00 |         0 |     1 |   59 |     1.6949 |
| 2022-05-31 17:16:00 |         0 |     1 |   59 |     1.6949 |

..
| 2022-05-31 18:45:00 |   1563137 |    57 |   59 |    96.6102 |
| 2022-05-31 18:45:00 |   1573000 |    58 |   59 |    98.3051 |
| 2022-05-31 18:45:00 |   1598500 |    59 |   59 |   100.0000 |
| 2022-05-31 18:46:00 |         0 |     1 |    1 |   100.0000 |
+---------------------+-----------+-------+------+------------+
5299 rows in set (0.05 sec)

ここからWHEREと…フレームごとにORDER BY percentile ASC LIMIT 1は難しいので、フレームごとのMIN(rows_read)でいけるかしら。

mysql80 209534> WITH
    -> _ranked AS (
    ->   SELECT DATE_FORMAT(dt, '%Y-%m-%d %H:%i:00') AS _dt, rows_read, RANK() OVER (PARTITION BY DATE_FORMAT(dt, '%Y-%m-%d %H:%i:00') ORDER BY rows_re
ad) AS _rank FROM t1
    -> ),
    -> _count_max AS (
    ->   SELECT _dt, MAX(_rank) AS c FROM _ranked GROUP BY _dt
    -> ),
    -> _percentiled AS (
    ->   SELECT _dt, rows_read, _rank, c, (_rank / c) * 100 AS percentile
    ->   FROM _ranked JOIN _count_max USING(_dt)
    -> )
    -> SELECT _dt, MIN(rows_read)
    -> FROM _percentiled
    -> WHERE percentile >= 95
    -> GROUP BY _dt;
+---------------------+----------------+
| _dt                 | MIN(rows_read) |
+---------------------+----------------+
| 2022-05-31 17:16:00 |           1600 |
| 2022-05-31 17:17:00 |            822 |
| 2022-05-31 17:18:00 |           8022 |
| 2022-05-31 17:19:00 |          79000 |

..
| 2022-05-31 18:43:00 |        1594800 |
| 2022-05-31 18:44:00 |        1578900 |
| 2022-05-31 18:45:00 |        1563137 |
| 2022-05-31 18:46:00 |              0 |
+---------------------+----------------+
91 rows in set (0.05 sec)

18:43近辺の前のクエリの結果で調べてみると

| 2022-05-31 18:43:00 |   1586000 |    52 |   59 |    88.1356 |
| 2022-05-31 18:43:00 |   1588800 |    53 |   59 |    89.8305 |
| 2022-05-31 18:43:00 |   1589332 |    54 |   59 |    91.5254 |
| 2022-05-31 18:43:00 |   1594400 |    55 |   59 |    93.2203 |
| 2022-05-31 18:43:00 |   1594600 |    56 |   59 |    94.9153 |
| 2022-05-31 18:43:00 |   1594800 |    57 |   59 |    96.6102 |
| 2022-05-31 18:43:00 |   1610900 |    58 |   59 |    98.3051 |
| 2022-05-31 18:43:00 |   1615500 |    59 |   59 |   100.0000 |
| 2022-05-31 18:44:00 |         0 |     1 |   58 |     1.7241 |
| 2022-05-31 18:44:00 |         0 |     1 |   58 |     1.7241 |
| 2022-05-31 18:44:00 |         0 |     1 |   58 |     1.7241 |

うん、合ってそう。
合ってますかね?

0 件のコメント :

コメントを投稿