《Postgresql 內(nèi)幕探索》讀書筆記 - 第一章：集簇、數(shù)據(jù)庫、數(shù)據(jù)表

2023-06-27 09:28 作者:懶時小窩 0人讀過 | 我要投稿

《Postgresql 內(nèi)幕探索》讀書筆記 - 第一章：集簇、表空間、元組

引言

個人建議本章節(jié)自己搭建一個Postgresql數(shù)據(jù)庫邊實戰(zhàn)邊閱讀更容易理解。

思維導(dǎo)圖

《Postgresql 內(nèi)幕探索》讀書筆記 - 第一章：集簇、表空間、元組.png

一、數(shù)據(jù)庫集群的邏輯結(jié)構(gòu)

1.1 天然集群

PostgreSQL天然集群，多個集群可以組成集簇，有點類似軍隊的連、團(tuán)、旅這樣的組織規(guī)則。對于我們?nèi)粘W(xué)習(xí)使用的單節(jié)點則是單個集簇單個集群，自己就是集群。

PostgreSQL如何管理這種集群規(guī)則？答案是通過一個無符號4個字節(jié)的標(biāo)識進(jìn)行管理，一個對象就是集群里的一個數(shù)據(jù)庫。

1.2 數(shù)據(jù)庫對象和對象符號標(biāo)識

數(shù)據(jù)庫對象和對象符號標(biāo)識可以通過 pg database和 pg classs 查詢，代表數(shù)據(jù)庫和對象之間映射。

另外集群在物理磁盤中通過文件目錄形式展示，一個目錄對應(yīng)一個數(shù)據(jù)庫，也就是一個base下子目錄中有一個目錄就是有一個數(shù)據(jù)庫。

base 目錄一個文件對應(yīng)一個數(shù)據(jù)庫，個人實驗的映射如下： 1：template1 14485：template0 14486：postgres

數(shù)據(jù)庫和堆表的OIDs分別存儲在pg_database和pg_class中，可以利用下面的SQL語句查詢OIDs。

數(shù)據(jù)庫的OIDs

select * from pg_database;postgres=# select * from pg_database; ?oid ?| ?datname ?| datdba | encoding | datcollate ?| ?datctype ? | datistemplate | datallowconn | datconnlimit | datlastsysoid | datfrozenxid | datminmxid | dattablespace | ? ? ? ? ? ? ?datacl ? ? ? ? ? ? ? ? -------+-----------+--------+----------+-------------+-------------+---------------+--------------+--------------+---------------+--------------+------------+---------------+------------- ------------------------ 14486 | postgres ?| ? ? 10 | ? ? ? ?6 | en_US.UTF-8 | en_US.UTF-8 | f ? ? ? ? ? ? | t ? ? ? ? ? ?| ? ? ? ? ? -1 | ? ? ? ? 14485 | ? ? ? ? ?727 | ? ? ? ? ?1 | ? ? ? ? ?1663 | ? ? 1 | template1 | ? ? 10 | ? ? ? ?6 | en_US.UTF-8 | en_US.UTF-8 | t ? ? ? ? ? ? | t ? ? ? ? ? ?| ? ? ? ? ? -1 | ? ? ? ? 14485 | ? ? ? ? ?727 | ? ? ? ? ?1 | ? ? ? ? ?1663 | {=c/postgres ,postgres=CTc/postgres} 14485 | template0 | ? ? 10 | ? ? ? ?6 | en_US.UTF-8 | en_US.UTF-8 | t ? ? ? ? ? ? | f ? ? ? ? ? ?| ? ? ? ? ? -1 | ? ? ? ? 14485 | ? ? ? ? ?727 | ? ? ? ? ?1 | ? ? ? ? ?1663 | {=c/postgres ,postgres=CTc/postgres} (3 rows)

堆表的OIDs

select relname,oid from pg_class;postgres=# select relname,oid from pg_class; ? ? ? ? ? ? ? ? ? ?relname ? ? ? ? ? ? ? ? ? ?| ?oid ? -----------------------------------------------+------- pg_statistic ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?| ?2619 pg_type ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?1247 pg_toast_1255 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2836 pg_toast_1255_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2837 pg_toast_1247 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?4171 pg_toast_1247_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?4172 pg_toast_2604 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2830 pg_toast_2604_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2831 pg_toast_2606 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2832 pg_toast_2606_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2833 pg_toast_2612 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?4157 pg_toast_2612_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?4158 pg_toast_2600 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?4159 pg_toast_2600_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?4160 pg_toast_2619 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2840 pg_toast_2619_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?2841 pg_toast_3381 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?3439 pg_toast_3381_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?3440 pg_toast_3429 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ?3430 pg_toast_3429_index ? ? ? ? ? ? ? ? ? ? ? ? ? | ?3431

1.3 一切皆文件

觀察PostgreSQL的目錄結(jié)構(gòu)就能發(fā)現(xiàn)和Linux有相似的地方，就是一切細(xì)節(jié)都藏在物理文件。根據(jù)數(shù)據(jù)庫充當(dāng)一個目錄的規(guī)則，Postgresql根據(jù)數(shù)據(jù)目錄、配置文件和端口號文件來創(chuàng)建實例。

其中包含版本號，日志，索引，事務(wù)狀態(tài)等等一切相關(guān)信息，對于Postgresql來說都有相關(guān)文件進(jìn)行管理和標(biāo)識，所以可以說Postgresql的底層細(xì)節(jié)全部展示在數(shù)據(jù)目錄文件當(dāng)中。

二、數(shù)據(jù)庫集群的物理結(jié)構(gòu)

Postgresql數(shù)據(jù)庫集群都有叫做基礎(chǔ)目錄的目錄，通常在安裝Postgresql之后執(zhí)行 initdb 命令可以初始化生成新的數(shù)據(jù)庫集群。

初始化通常生成在 PGDATA 目錄。

sudo?/usr/pgsql-14/bin/postgresql-14-setup?initdb

以Postgresql-14版本為例，初始化之后的基礎(chǔ)目錄生成在下面的位置。

[root@localhost?14]#?pwd /var/lib/pgsql/14

這里使用ll觀察一下數(shù)據(jù)文件排列。

[root@localhost?14]#?ll?data/ total?68 drwx------?5?postgres?postgres????41?Jun?22?02:41?base -rw-------?1?postgres?postgres????30?Jun?22?02:41?current_logfiles drwx------?2?postgres?postgres??4096?Jun?22?02:44?global drwx------?2?postgres?postgres????32?Jun?22?02:41?log drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_commit_ts drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_dynshmem -rw-------?1?postgres?postgres??4577?Jun?22?02:41?pg_hba.conf -rw-------?1?postgres?postgres??1636?Jun?22?02:41?pg_ident.conf drwx------?4?postgres?postgres????68?Jun?22?02:46?pg_logical drwx------?4?postgres?postgres????36?Jun?22?02:41?pg_multixact drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_notify drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_replslot drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_serial drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_snapshots drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_stat drwx------?2?postgres?postgres????63?Jun?22?03:30?pg_stat_tmp drwx------?2?postgres?postgres????18?Jun?22?02:41?pg_subtrans drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_tblspc drwx------?2?postgres?postgres?????6?Jun?22?02:41?pg_twophase -rw-------?1?postgres?postgres?????3?Jun?22?02:41?PG_VERSION drwx------?3?postgres?postgres????60?Jun?22?02:41?pg_wal drwx------?2?postgres?postgres????18?Jun?22?02:41?pg_xact -rw-------?1?postgres?postgres????88?Jun?22?02:41?postgresql.auto.conf -rw-------?1?postgres?postgres?28776?Jun?22?02:41?postgresql.conf -rw-------?1?postgres?postgres????58?Jun?22?02:41?postmaster.opts -rw-------?1?postgres?postgres???103?Jun?22?02:41?postmaster.pid

2.1 數(shù)據(jù)庫集簇的布局

書中的版本為 Postgresql-9，這里用比較新的 Postgresql-14 版本實驗。

PostgreSQL: Documentation: 14: 70.1.?Database File Layout

Postgresql-14的官方文檔中的各個文件含義如下。

表?70.1.?PGDATA 的內(nèi)容`

ItemDescription中文簡介PG_VERSIONA file containing the major version number of?PostgreSQLPostgreSQL主要版本號baseSubdirectory containing per-database subdirectories包含每個數(shù)據(jù)庫子目錄的子目錄current_logfilesFile recording the log file(s) currently written to by the logging collector記錄當(dāng)前由日志收集器寫入的日志文件的文件globalSubdirectory containing cluster-wide tables, such as?pg_database包含集群范圍的表的子目錄，如pg_databasepg_commit_tsSubdirectory containing transaction commit timestamp data包含事務(wù)提交時間戳數(shù)據(jù)的子目錄pg_dynshmemSubdirectory containing files used by the dynamic shared memory subsystem動態(tài)共享內(nèi)存子系統(tǒng)使用的文件的子目錄pg_logicalSubdirectory containing status data for logical decoding邏輯解碼的狀態(tài)數(shù)據(jù)的子目錄pg_multixactSubdirectory containing multitransaction status data (used for shared row locks)子目錄包含多事務(wù)狀態(tài)數(shù)據(jù)（用于共享行鎖）pg_notifySubdirectory containing LISTEN/NOTIFY status dataLISTEN/NOTIFY狀態(tài)數(shù)據(jù)的子目錄pg_replslotSubdirectory containing replication slot data復(fù)制槽數(shù)據(jù)的子目錄pg_serialSubdirectory containing information about committed serializable transactions已提交的可序列化事務(wù)信息的子目錄pg_snapshotsSubdirectory containing exported snapshots導(dǎo)出的快照的子目錄pg_statSubdirectory containing permanent files for the statistics subsystem統(tǒng)計子系統(tǒng)的永久文件的子目錄pg_stat_tmpSubdirectory containing temporary files for the statistics subsystem統(tǒng)計子系統(tǒng)的臨時文件的子目錄pg_subtransSubdirectory containing subtransaction status data子交易狀態(tài)數(shù)據(jù)的子目錄pg_tblspcSubdirectory containing symbolic links to tablespaces表空間符號鏈接的子目錄pg_twophaseSubdirectory containing state files for prepared transactions準(zhǔn)備好的事務(wù)的狀態(tài)文件的子目錄pg_walSubdirectory containing WAL (Write Ahead Log) filesWAL（提前寫入日志）文件的子目錄pg_xactSubdirectory containing transaction commit status data交易提交狀態(tài)數(shù)據(jù)的子目錄postgresql.auto.confA file used for storing configuration parameters that are set by?ALTER SYSTEM存儲由ALTER SYSTEM設(shè)置的配置參數(shù)的文件postmaster.optsA file recording the command-line options the server was last started with服務(wù)器最后啟動時的命令行選項的文件postmaster.pidA lock file recording the current postmaster process ID (PID), cluster data directory path, postmaster start timestamp, port number, Unix-domain socket directory path (could be empty), first valid listen_address (IP address or?*, or empty if not listening on TCP), and shared memory segment ID (this file is not present after server shutdown)記錄當(dāng)前postmaster進(jìn)程ID（PID）、集群數(shù)據(jù)目錄路徑、postmaster啟動時間戳、端口號、Unix域套接字目錄路徑（可以為空）、第一個有效的listen_address（IP地址或*，如果不在TCP上監(jiān)聽則為空）和共享內(nèi)存段ID的鎖文件（服務(wù)器關(guān)閉后此文件不存在）

2.2 數(shù)據(jù)庫布局

根據(jù)上表可以知道數(shù)據(jù)表存儲在base目錄下。

2.3 表和索引相關(guān)的文件的布局

2.3.1 oid 和 relfilenode

大小小于1GB的表或索引是單獨的文件，存儲在它所屬的數(shù)據(jù)庫目錄下。

數(shù)據(jù)庫內(nèi)部表和索引作為數(shù)據(jù)庫對象是通過OID來管理的，而里面的具體內(nèi)容則是通過變量 relfilenode 產(chǎn)生關(guān)聯(lián)，大部分情況下oid和relfilenode 通常會相等，但是也有例外，比如表和索引的relfilenode值會被一些命令（例如TRUNCATE，REINDEX，CLUSTER）所改變。

比如 TRUNCATE 一個表會重新分配 relfilenode。下面使用案例驗證重新分配 relfilenode，我們先創(chuàng)建一個測試表：

create table db_test(id int primary key, name varchar(50), age int);postgres=# select * from pg_class where relname ='db_test'; ?oid ?| relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | reltoastrelid | relhasindex | relisshared | relpersistence | relkind | relnatts | relchecks | relhasrules | relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity | relispopulated | relreplident | relispartition | relrewrite | relfrozenxid | re lminmxid | relacl | reloptions | relpartbound -------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+--------------- -+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+--- ---------+--------+------------+-------------- 16384 | db_test | ? ? ? ? 2200 | ? 16386 | ? ? ? ? 0 | ? ? ? 10 | ? ? 2 | ? ? ? 16384 | ? ? ? ? ? ? 0 | ? ? ? ?0 | ? ? ? ?-1 | ? ? ? ? ? ? 0 | ? ? ? ? ? ? 0 | t ? ? ? ? ? | f ? ? ? ? ? | p ? ? ? ? ? ? | r ? ? ? | ? ? ? ?3 | ? ? ? ? 0 | f ? ? ? ? ? | f ? ? ? ? ? ? ?| f ? ? ? ? ? ? ?| f ? ? ? ? ? ? ?| f ? ? ? ? ? ? ? ? ? | t ? ? ? ? ? ? ?| d ? ? ? ? ? ?| f ? ? ? ? ? ? ?| ? ? ? ? ?0 | ? ? ? ? ?734 | ? ? ? ? 1 | ? ? ? ?| ? ? ? ? ? ?| (1 row)

可以看到這里的 relfilenode 為 16384。下面我們執(zhí)行 truncate 命令查看這個值是否改變。

postgres=# truncate db_test; TRUNCATE TABLEpostgres=# select * from pg_class where relname ='db_test'; ?oid ?| relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | reltoastrelid | relhasindex | relisshared | relpersistence | relkind | relnatts | relchecks | relhasrules | relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity | relispopulated | relreplident | relispartition | relrewrite | relfrozenxid | re lminmxid | relacl | reloptions | relpartbound -------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+--------------- -+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+--- ---------+--------+------------+-------------- 16384 | db_test | ? ? ? ? 2200 | ? 16386 | ? ? ? ? 0 | ? ? ? 10 | ? ? 2 | ? ? ? 16389 | ? ? ? ? ? ? 0 | ? ? ? ?0 | ? ? ? ?-1 | ? ? ? ? ? ? 0 | ? ? ? ? ? ? 0 | t ? ? ? ? ? | f ? ? ? ? ? | p ? ? ? ? ? ? | r ? ? ? | ? ? ? ?3 | ? ? ? ? 0 | f ? ? ? ? ? | f ? ? ? ? ? ? ?| f ? ? ? ? ? ? ?| f ? ? ? ? ? ? ?| f ? ? ? ? ? ? ? ? ? | t ? ? ? ? ? ? ?| d ? ? ? ? ? ?| f ? ? ? ? ? ? ?| ? ? ? ? ?0 | ? ? ? ? ?735 | ? ? ? ? 1 | ? ? ? ?| ? ? ? ? ? ?| (1 row)

可以看到 oid 雖然還是 16384，但是 relfilenode 變成了 16389。

2.3.2 內(nèi)建函數(shù)`pg_relation_filepath`

內(nèi)建函數(shù)pg_relation_filepath能夠根據(jù)OID或名稱返回關(guān)系對應(yīng)的文件路徑。比如下面的案例：

select oid,relname,relfilenode from pg_class limit 10;oid ?| ? ? ? relname ? ? ? | relfilenode ------+---------------------+------------- 2619 | pg_statistic ? ? ? ?| ? ? ? ?2619 1247 | pg_type ? ? ? ? ? ? | ? ? ? ? ? 0 2836 | pg_toast_1255 ? ? ? | ? ? ? ? ? 0 2837 | pg_toast_1255_index | ? ? ? ? ? 0 4171 | pg_toast_1247 ? ? ? | ? ? ? ? ? 0 4172 | pg_toast_1247_index | ? ? ? ? ? 0 2830 | pg_toast_2604 ? ? ? | ? ? ? ?2830 2831 | pg_toast_2604_index | ? ? ? ?2831 2832 | pg_toast_2606 ? ? ? | ? ? ? ?2832 2833 | pg_toast_2606_index | ? ? ? ?2833

使用pg_relation_filepath函數(shù)返回關(guān)系對應(yīng)的文件路徑：

select pg_relation_filepath('pg_statistic') from pg_class limit 20;pg_relation_filepath ---------------------- base/14486/2619 base/14486/2619 base/14486/2619 base/14486/2619 base/14486/2619 base/14486/2619 base/14486/2619

2.3.3 `relfilenode.1` 規(guī)則

如果一個數(shù)據(jù)文件的大小超過1GB，PostgreSQL會創(chuàng)建并使用一個名為relfilenode.1的新文件，如新文件再次被寫滿，則創(chuàng)建下一個名為relfilenode.2的新文件。注意這條規(guī)則適用于索引文件和數(shù)據(jù)文件，目的是防止單個文件過度擴(kuò)張。

這個例子需要實操一下，為了在本地驗證這一點，這里需要構(gòu)建一個千萬數(shù)據(jù)的表。

下面構(gòu)建千萬表的資料來自于網(wǎng)絡(luò)。

首先創(chuàng)建序列：

CREATE SEQUENCE upms_log_id_seq START 10;

執(zhí)行結(jié)果如下：

postgres=# CREATE SEQUENCE upms_log_id_seq START 10; CREATE SEQUENCE

創(chuàng)建測試表

CREATE TABLE "public"."t_user" ( "id" int8 NOT NULL DEFAULT nextval( 'upms_log_id_seq' :: regclass ), "name" VARCHAR ( 255 ) COLLATE "pg_catalog"."default", "phone" VARCHAR ( 255 ) COLLATE "pg_catalog"."default", "birthday" TIMESTAMP ( 6 ), "sex" VARCHAR ( 64 ) COLLATE "pg_catalog"."default", CONSTRAINT "t_user_pkey" PRIMARY KEY ( "id" ) );

同時修改表的歸屬用戶。

ALTER TABLE "public"."t_user" OWNER TO "postgres";

設(shè)置字段隨機(jī)值：

select?substr(?'abcdefghijklmnopqrstuvwxyz',?1,?(?random(?)?*?26?)?::?INTEGER?)；

hone使用11位字符串：

SELECT lpad( ( random( ) * 138 ) :: INT :: TEXT, 11, '0' );

birthday 使用字符串日期：

SELECT date(generate_series(now(), now() + '1 week', '1 day'));

sex 使用0，1表示男女。

SELECT lpad( ( random( ) * 1 ) :: INT :: text, 1, '0' );

使用explain插入數(shù)據(jù)：

EXPLAIN ANALYZE INSERT INTO t_user SELECT ?generate_series ( 1, 10000000 ), substr( 'abcdefghijklmnopqrstuvwxyz', 1, ( random( ) * 26 ) :: INTEGER ), lpad( ( random( ) * 138 ) :: INT :: TEXT, 11, '0' ), DATE ( generate_series ( now( ), now( ) + '1 week', '1 day' ) ), lpad( ( random( ) * 1 ) :: INT :: TEXT, 1, '0' );

運行結(jié)果如下：

postgres=# EXPLAIN ANALYZE INSERT INTO t_user postgres-# SELECT ?generate_series ( 1, 10000000 ), postgres-# substr( 'abcdefghijklmnopqrstuvwxyz', 1, ( random( ) * 26 ) :: INTEGER ), postgres-# lpad( ( random( ) * 138 ) :: INT :: TEXT, 11, '0' ), postgres-# DATE ( generate_series ( now( ), now( ) + '1 week', '1 day' ) ), postgres-# lpad( ( random( ) * 1 ) :: INT :: TEXT, 1, '0' ); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? QUERY PLAN ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ----------------------------------------------------------------------------------------------------------------------------------------- Insert on t_user ?(cost=0.00..925000.03 rows=0 width=0) (actual time=53327.150..53327.162 rows=0 loops=1) ? -> ?Subquery Scan on "*SELECT*" ?(cost=0.00..925000.03 rows=10000000 width=1194) (actual time=0.067..19638.272 rows=10000000 loops=1) ? ? ? ? -> ?Result ?(cost=0.00..700000.03 rows=10000000 width=104) (actual time=0.059..16112.101 rows=10000000 loops=1) ? ? ? ? ? ? ? -> ?ProjectSet ?(cost=0.00..50000.03 rows=10000000 width=12) (actual time=0.017..2277.500 rows=10000000 loops=1) ? ? ? ? ? ? ? ? ? ? -> ?Result ?(cost=0.00..0.01 rows=1 width=0) (actual time=0.003..0.010 rows=1 loops=1) Planning Time: 1.396 ms Execution Time: 53335.404 ms (7 rows)

上面的指令執(zhí)行之后，我們先使用函數(shù)查找數(shù)據(jù)文件的路徑：

postgres=# select pg_relation_filepath('t_user'); pg_relation_filepath ---------------------- base/14486/16398 (1 row)

觀察relfilenode.1文件，發(fā)現(xiàn)并沒有找到，這里懷疑數(shù)據(jù)量不夠大又造了一千萬數(shù)據(jù)。執(zhí)行之后查看數(shù)據(jù)目錄發(fā)現(xiàn)了.1這個文件。

[root@localhost?14486]#?pwd /var/lib/pgsql/14/data/base/14486 -rw-------?1?postgres?postgres?1073741824?Jun?22?17:35?16398 -rw-------?1?postgres?postgres??282771456?Jun?22?17:36?16398.1 -rw-------?1?postgres?postgres?????352256?Jun?22?17:35?16398_fsm -rw-------?1?postgres?postgres??????24576?Jun?22?17:35?16398_vm

可以通過啟動參數(shù) --with-segsize更改表和索引的最大文件大小。

2.3.4 `_fsm` 和 `_vm` 文件

仔細(xì)觀察目錄列表，會發(fā)現(xiàn)很多文件都會帶有 ?_fsm and _vm 為后綴的相關(guān)文件，這些文件叫做 free space map （空閑空間映射）和 visibility map（可見性映射）。數(shù)據(jù)文件或者索引文件存在下面的差別。

數(shù)據(jù)文件：

空閑空間映射 free space map ：存儲 free space capacity（表文件每個頁面上的空閑空間信息）。
可見性映射 visibility map：存儲表文件中每一頁的可見性信息。

索引文件：

只有單獨的free space map（空閑空間映射），沒有可見性映射 visibility map。

上面的例子中就有類似的文件產(chǎn)生：

主體數(shù)據(jù)文件，空閑空間映射文件，可見性映射文件等這些文件在Postgresql的術(shù)語中被叫做“分支”，通常這些分支的排布規(guī)則如下：

數(shù)據(jù)文件分支編號為 1。
空閑空間映射/索引數(shù)據(jù)文件分支的第一個編號為 1。
可見性映射表為數(shù)據(jù)文件第二個分支 2。

這些規(guī)則概念比較復(fù)雜，只需要知道1號分支fsm保存了main分支中空閑空間的信息，2號分支vm保存了main分支中可見性的信息即可。

此外3號分支init是很少見的特殊分支，主要存儲不被日志記錄（unlogged）的表與索引。同時為了防止單個分支文件過大，PostgreSQL會將過大的分支文件切分為若干段，段的大小為 1GB，也就是類似上面數(shù)據(jù)文件的 relfilenode 分隔方式。

三、表空間

Postgresql的表空間可以看作是外部數(shù)據(jù)文件，和很多常見的RDBMS的設(shè)計理念不一樣。表空間有點類似基礎(chǔ)數(shù)據(jù)的一個映射，在基礎(chǔ)數(shù)據(jù)中建立映射會按照版本和文件夾命名規(guī)則建立對應(yīng)的表空間映射，用于存儲基礎(chǔ)數(shù)據(jù)以外的內(nèi)容。

數(shù)據(jù)庫集簇的表空間結(jié)構(gòu)圖如下：

3.1 創(chuàng)建表空間

如何創(chuàng)建表空間？答案是使用 CREATE TABLESPACE 語句，這個語句會在特定的目錄下面創(chuàng)建表空間，并且會構(gòu)建特定的子目錄。構(gòu)建規(guī)則如下：

PG_主版本號_目錄版本號

構(gòu)建表空間并且指定特定位置命令如下，需要注意指定位置之前需要確保對應(yīng)位置存在，同時還需要注意權(quán)限問題：

postgres=# create tablespace tbs_test owner postgres location '/opt/postgres/tbs_test'; ERROR: ?directory "/opt/postgres/tbs_test" does not exist postgres=# create tablespace tbs_test owner postgres location '/opt/postgres/tbs_test'; ERROR: ?could not set permissions on directory "/opt/postgres/tbs_test": Operation not permitted

創(chuàng)建對應(yīng)表空間目錄以及權(quán)限設(shè)置：

[root@localhost?14486]#?mkdir?-p?/opt/postgres/tbs_test [root@localhost?14486]#?chown?postgres:postgres?/opt/postgres/ [root@localhost?14486]#?ll?/opt total?0 drwxr-xr-x?3?postgres?postgres?22?Jun?22?18:20?postgres

創(chuàng)建完成之后可以在對應(yīng)的data目錄下面看到一個新增的目錄：

postgres=# create tablespace tbs_test owner postgres location '/opt/postgres/tbs_test'; CREATE TABLESPACE[root@localhost?data]#?pwd? /var/lib/pgsql/14/data [root@localhost?data]#?ll total?72 ...... drwx------?2?postgres?postgres????19?Jun?22?18:22?pg_tblspc .....

新增的目錄pg_tblspc下有一個連接文件16408，指向到/usr/local/pgdata下，這里用tree命令觀察結(jié)果。

[root@localhost data]# tree pg_tblspc/ pg_tblspc/ └── 16408 -> /opt/postgres/tbs_test

我們訪問/opt/postgres/tbs_test ?查看具體文件內(nèi)容。

[root@localhost?data]#?ll?/opt/postgres/tbs_test total?0 drwx------?2?postgres?postgres?6?Jun?22?18:22?PG_14_202107181

這里的文件確實對應(yīng)上面提到的PG_主版本號_目錄版本號規(guī)則，這里的202107181個人認(rèn)為是發(fā)布時間（最后一個1個人認(rèn)為為版本號）。

3.2 新建表到表空間

特別注意，如果在該表空間內(nèi)創(chuàng)建一個新表，但新表所屬的數(shù)據(jù)庫卻創(chuàng)建在基礎(chǔ)目錄下，那么PG會首先在版本特定的子目錄下創(chuàng)建名稱與現(xiàn)有數(shù)據(jù)庫OID相同的新目錄，然后將新表文件放置在剛創(chuàng)建的目錄下。

比如下面的方法構(gòu)建一個新表并且表空間指向tbs_test。

postgres=# create table newtab(id int) tablespace tbs_test; CREATE TABLE

通過下面的指令可以看到新表被創(chuàng)建在之前創(chuàng)建的表空間下面。

postgres=# select pg_relation_filepath('newtab'); pg_relation_filepath ? ? ? ? ? ? --------------------------------------------- pg_tblspc/16408/PG_14_202107181/14486/16409 (1 row)

可以查找 pg_class 表有對應(yīng)的oid進(jìn)一步驗證。

postgres=# select relname,oid from pg_class where relname='newtab'; relname | ?oid ? ---------+------- newtab ?| 16409 (1 row)

3.3 刪除表空間

刪除表空間前必須要刪除該表空間下的所有數(shù)據(jù)庫對象，否則會有下面的報錯：

ERROR: ?tablespace "tbs_test" is not empty

刪除數(shù)據(jù)表對象之后，再刪除對應(yīng)的表空進(jìn)啊

postgres=# drop table if exists newtab; DROP TABLE postgres=# drop tablespace if exists tbs_test; DROP TABLESPACE

通過下面的命令查看發(fā)現(xiàn)數(shù)據(jù)物理文件已經(jīng)被刪除了。

[root@localhost?data]#?ll?/opt/postgres/tbs_test/ total?0

四、堆表文件的內(nèi)部布局

4.1 堆表和索引組織表對比

Postgresql 的數(shù)據(jù)組織方式和Mysql完全不同，初次接觸可能比較蒙圈。這里簡單總結(jié)一下兩者設(shè)計上的區(qū)別：

堆表：

數(shù)據(jù)存儲在表中，索引存儲在索引里，兩者分開的。
數(shù)據(jù)在堆中是無序的，索引讓鍵值有序，但數(shù)據(jù)還是無序的。
堆表中主鍵索引和普通索引一樣的，都是存放指向堆表中數(shù)據(jù)的指針。

索引組織表：

數(shù)據(jù)存儲在聚簇索引中，數(shù)據(jù)按照主鍵的順序來組織數(shù)據(jù)，兩者合二為一。
主鍵索引，葉子節(jié)點存放整行數(shù)據(jù)。
其他索引稱為輔助索引(二級索引），葉子節(jié)點存放鍵值和主鍵值。

兩者數(shù)據(jù)結(jié)構(gòu)的主要區(qū)別為：堆表索引和實際數(shù)據(jù)分開，索引組織表則通常非葉子節(jié)點為索引，葉子節(jié)點為數(shù)據(jù)，所以數(shù)據(jù)和索引是直接在一塊存儲的。

4.2 堆表基礎(chǔ)結(jié)構(gòu)介紹

在堆表，索引，也包括空閑空間映射和可見性映射內(nèi)部結(jié)構(gòu)包含下面幾項。

頁（pages） 或者叫 塊（block）：默認(rèn)大小 8192字節(jié)（8KB）
頁按照 0 編號，這些數(shù)字可以叫做 區(qū)塊號（block numbers），如果一個區(qū)塊頁面被寫滿，則會自動追加一個新的空頁面來存儲增長文件。

上圖中包含三種類型的數(shù)據(jù)：

堆元組（heap tuples）：也就是數(shù)據(jù)本身，類似棧結(jié)構(gòu)從底部開始堆疊。數(shù)據(jù)庫內(nèi)部是用元組標(biāo)識符（tuple identifier, TID） 標(biāo)識堆元組。

TID 有多個值組成： 區(qū)塊號 + 行指針偏移號。（用于索引）。

行指針（line pointer）：也叫做項目指針（item pointer）。每個行指針占用4個字節(jié)，這些指針都是指向堆元組的。

行指針的結(jié)構(gòu)是簡單的線性數(shù)組設(shè)計，充當(dāng)堆元組的索引，注意索引是從1開始不是0開始，這些索引被叫做偏移號（offset number），偏移號和堆元組意義對應(yīng)。

首部數(shù)據(jù)（header data）：頁面的起始位置是PageHeaderData 首部數(shù)據(jù)，固定大小為24個字節(jié)，首部數(shù)據(jù)組成如下：

pd_lsn：8字節(jié)的無符號整數(shù)，代表當(dāng)前頁面最后一次更新XLOG記錄的LSN，主要和WAL機(jī)制有關(guān)。
pd_checksum：校驗和，在 9.3 版本之前存儲時間線標(biāo)識。
pd_lower，pd_upper：分別代表行指針的末尾和最新堆元組的起始位置。從結(jié)構(gòu)圖可以看出，它用來標(biāo)識空閑空間的的范圍。（空余空間稱為空閑空間（free space）或空洞（hole））
pd_special：索引頁中會用到該字段（指向特殊空間的起始位置）。而堆表頁中則指向頁尾。

特殊空間指的是索引使用的特殊區(qū)域，具體內(nèi)容根據(jù)索引類型而定，如B樹，GiST，GiN。

理解堆元組結(jié)構(gòu)對于理解PostgreSQL并發(fā)控制與WAL機(jī)制是必須的。

4.3 源碼解讀

這部分設(shè)計可以閱讀 postgres/src/include/storage/bufpage.h at master · postgres/postgres · GitHub 源碼了解。

4.3.1 基礎(chǔ)結(jié)構(gòu)介紹

我們根據(jù)堆表的結(jié)構(gòu)圖以及源碼注釋了解基礎(chǔ)結(jié)構(gòu)，首先從頭部結(jié)構(gòu)開始：

disk page organization：磁盤頁面布局space management information generic to any page：對任何頁面都適用的通用空間管理信息pd_lsn： identifies xlog record for last change to this page：最近變更對應(yīng)xlog記錄的標(biāo)識。pd_checksum：如果設(shè)置則為校驗和。pd_flags：標(biāo)記位。pd_lower ：行指針的末尾。pd_upper ：最新堆元組的起始位置。pd_special ：堆表頁中則指向頁尾。索引中代表特殊空間開始位置。pd_pagesize_version ：頁面的大小，以及頁面布局的版本號pd_prune_xid ：可以修剪的最老的元組中的XID（MVCC使用）

下面介紹關(guān)鍵參數(shù)的作用。

LSN值

The LSN is used by the buffer manager to enforce the basic rule of WAL thou shalt write xlog befor data". A dirty buffer cannot be dumped to disk until xlog has been flushed at least as far as the page'

xlog至少被刷到該頁的LSN甚至操作才允許緩沖區(qū)臟頁刷新到磁盤
緩沖區(qū)管理器使用LSN來執(zhí)行WAL的基本規(guī)則

校驗和 pd_checksum

pd_checksum stores the page checksum, if it has been set for this page; zero is a valid value for a checksum.

0是合法的校驗和值，pd_checksum 存儲著頁面的校驗和。

If a checksum is not in use then we leave the field unset.

為了向前兼容，沒有使用校驗和這個字段不會有值。

This will typically mean the field is zero though non-zero values may also be present if databases have been pg_upgraded from releases prior to 9.3, when the same byte offset was used to store the current timelineid when the page was last updated.

這樣的原因是因為 9.3 版本之前存在非0的“校驗和”，因為這個字段在9.3之前是最后更新時的時間線標(biāo)識。

Note that there is no indication on a page as to whether the checksum is valid or not, a deliberate design choice which avoids the problem of relying on the page contents to decide whether to verify it. Hence there are no flag bits relating to checksums

注意頁面上沒有顯示校驗和是否有效，所以也就沒有與校驗和有關(guān)的標(biāo)志位，這里故意這樣設(shè)計是避免依靠校驗和決定是否驗證這一個問題。

pd_prune_xid is a hint field that helps determine whether pruning will be useful. It is currently unused in index pages.

pd_prune_xid

pd_prune_xid 是一個提示字段，有助于確定修剪是否有用。（注意索引頁暫時沒有使用此字段）

The page version number and page size are packed together into a single uint16 field. This is for historical reasons: before PostgreSQL 7.3, there was no concept of a page version number, and doing it this way lets us pretend that pre-7.3 databases have page version number zero. We constrain page sizes to be multiples of 256, leaving the low eight bits available for a version number.

在PostgreSQL 7.3之前，沒有頁面版本號的概念，為了兼容假設(shè)版本號為0。
頁面版本號和頁面大小被打包到一個uint16字段中。
約束頁面的尺寸必須為256的倍數(shù)，留下低8位用于頁面版本編號。

Minimum possible page size is perhaps 64B to fit page header, opaque space and a minimal tuple; of course, in reality you want it much bigger, so the constraint on pagesize mod 256 is not an important restriction. On the high end, we can only support pages up to 32KB because lp_off/lp_len are 15 bits.

最小的可行頁面大小可能是64字節(jié)，能放下頁的首部，空閑空間，以及一個最小的元組。
pagesize mod 256的限制并不是一個重要的限制。
只能支持最大32KB的頁面，因為lp_off/lp_len是15位

4.3.2 PageHeaderData 結(jié)構(gòu)

本部分是接著緩沖頁結(jié)構(gòu)介紹的，PageHeaderData 的結(jié)構(gòu)定義網(wǎng)址如下：postgres/src/include/storage/bufpage.h at master · postgres/postgres · GitHub

typedef struct PageHeaderData { // ?XXX LSN是任何塊的成員，不僅是頁面組織的成員。 /* XXX LSN is member of *any* block, not only page-organized ones */ //本頁面最近變更對應(yīng)xlog記錄的標(biāo)識 // 用于記錄該頁的最后一次變化 PageXLogRecPtr pd_lsn; /* LSN: next byte after last byte of xlog * record for last change to this page */ // 校驗和 uint16 pd_checksum; /* checksum */ // 標(biāo)記位 uint16 pd_flags; /* flag bits, see below */ // 空閑空間起始位置 LocationIndex pd_lower; /* offset to start of free space */ // 空閑空間終止位置 LocationIndex pd_upper; /* offset to end of free space */ // 特殊用途空間的開始位置 LocationIndex pd_special; /* offset to start of special space */ // 頁面版本編號（尺寸必須為256的倍數(shù)，留下低8位用于頁面版本編號） uint16 pd_pagesize_version; // 最老的可修剪XID, 如果沒有設(shè)置為0 TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */ // 行指針數(shù)組 ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */ } PageHeaderData; typedef PageHeaderData *PageHeader;

4.3.3 ItemIdData 結(jié)構(gòu)

在 PageHeaderData 單中有一個 ItemIdData 的結(jié)構(gòu)體定義。它的結(jié)構(gòu)如下：

// 緩沖區(qū)頁中的項目指針(item pointer)，也被稱為行指針(line pointer) typedef struct ItemIdData ItemIdData // 元組偏移量 (相對頁面起始處) unsigned lp_off:15 // 行指針的狀態(tài) unsigned lp_flags:2 // 元組的字節(jié)長度 // lp_len == 0 表示該行指針沒有關(guān)聯(lián)存儲。獨立于其lp_flags的狀態(tài) unsigned lp_len:15

lp_flags ?這個字段一共有四種取值：

#define LP_UNUSED ? ? ? 0 ? ? ? // 空閑行指針 (lp_len必需始終為0) #define LP_NORMAL ? ? ? 1 ? ? ? // 行指針被使用，指向一個元組 (lp_len必需始終>0) #define LP_REDIRECT ? ? 2 ? ? ? // HOT技術(shù)標(biāo)識 (lp_len必需為0) #define LP_DEAD ? ? ? ? 3 ? ? ? // 行指針對應(yīng)的元組為死元組

這些內(nèi)容和得知呢？因為 ItemIdData 的結(jié)構(gòu)在上面的鏈接并沒有找到任何有關(guān)ItemIdData的代碼，說明被定義到其他地方。

授人以魚不如授人以漁，這里解釋下這部分源碼如何層層遞進(jìn)查找：PostgreSQL Source Code: src/include/storage/itemid.h File Reference

此鏈接中找到如下頁面內(nèi)容：

我們點擊 ItemIdData 跳轉(zhuǎn)：

點擊對應(yīng)內(nèi)容我們會訪問到下面的 Git master 的頁面定義。

PostgreSQL Source Code: ItemIdData Struct Reference

從上面的截圖可以看到它們各自都有相關(guān)參考頁面。

lp_flags：Referenced by?mask_lp_flags().
lp_len：Referenced by?PageIndexTupleOverwrite().
lp_off：Referenced by?compactify_tuples(),?PageIndexTupleDelete(),?PageIndexTupleDeleteNoCompact(), and?PageIndexTupleOverwrite().

這里挑選前面介紹的 lp_flags 字段介紹，于是就有了下面的相關(guān)源代碼，這些代碼寫的非常工整，根據(jù)單詞含義不難猜出邏輯：

/* ?* mask_lp_flags ?* ?* In some index AMs, line pointer flags can be modified on the primary ?* without emitting any WAL record. ?*/ void mask_lp_flags(Page page) { ? ? OffsetNumber offnum, ? ? ? ? ? ? ? ? maxoff; // 獲取頁面最大偏移量 ? ? maxoff = PageGetMaxOffsetNumber(page); ? ? // 交易當(dāng)前的偏移量是否合法 ? ? for (offnum = FirstOffsetNumber; ? ? ? ? ?offnum <= maxoff; ? ? ? ? ?offnum = OffsetNumberNext(offnum)) ? ? { ? ? // 根據(jù)偏移量取出對應(yīng)的編號 ? ? ? ? ItemId ? ? ?itemId = PageGetItemId(page, offnum); ? ? ? ? // 檢查當(dāng)前的編號是否被使用，根據(jù)結(jié)果更新lp_flags標(biāo)記。 ? ? ? ? if (ItemIdIsUsed(itemId)) ? ? ? ? ? ? itemId->lp_flags = LP_UNUSED; ? ? } }

這里的 LP_UNUSED 可以點擊（不得不夸一下Postgresql的源碼閱讀網(wǎng)站做的有點好）

再次點擊 "itemid.h" 我們可以訪問到具體的源代碼位置、

/* ?* A line pointer on a buffer page. ?See buffer page definitions and comments ?* for an explanation of how line pointers are used. ?* 緩沖區(qū)頁面上的一個行指針。關(guān)于行指針的使用方法，請參見緩沖區(qū)頁面的定義和注釋。 ?* ?* In some cases a line pointer is "in use" but does not have any associated ?* storage on the page. ?By convention, lp_len == 0 in every line pointer ?* that does not have storage, independently of its lp_flags state. ?* 在某些情況下，行指針是 "使用中"z狀態(tài)，但在頁面上沒有任何相關(guān)的存儲。 ?* 根據(jù)慣例，在每一個沒有存儲空間的行指針中，lp_len == 0。都沒有存儲空間，這與它的lp_flags狀態(tài)無關(guān)。 ?*/ typedef struct ItemIdData { ? ? unsigned ? ?lp_off:15, ? ? ?/* offset to tuple (from start of page)從頁面的開始到元組的偏移量*/ ? ? ? ? ? ? ? ? lp_flags:2, ? ? /* state of line pointer, see below 行指針狀態(tài) */ ? ? ? ? ? ? ? ? lp_len:15; ? ? ?/* byte length of tuple 元組的字節(jié)長度 ?*/ } ItemIdData; ? typedef ItemIdData *ItemId; ? /* ?* lp_flags has these possible states. ?An UNUSED line pointer is available ?* for immediate re-use, the other states are not. ?*/ #define LP_UNUSED ? ? ? 0 ? ? ? /* unused (should always have lp_len=0) */ #define LP_NORMAL ? ? ? 1 ? ? ? /* used (should always have lp_len>0) */ #define LP_REDIRECT ? ? 2 ? ? ? /* HOT redirect (should have lp_len=0) */ #define LP_DEAD ? ? ? ? 3 ? ? ? /* dead, may or may not have storage */

五、元祖CRUD操作詳解

5.1 寫入方式

假設(shè)我們有一個表，僅僅由一個頁面組成，頁面只包含一個堆元組，此時的 pd_lower 會指向第一個行指針，pd_upper 指向第一個堆元組。

第二個元組會放到第一個元祖后面，第二個行指針被插入到第一個行指針的后面，pd_lower 會改為指向第二個行指針，pd_upper 更改指向第二個堆元組，然后更新頭部的 pd_lsn，pg_checksum，pg_flag 等相關(guān)參數(shù)。

從上面的步驟可以看到，寫入方式比較好理解，就是在行指針后面插入新的數(shù)據(jù)，以及在末端元組加入新數(shù)據(jù)，之后更新指針引用以及更新頭部信息即可。

5.2 刪除方式

刪除方式在源代碼中對應(yīng)方法PageIndexTupleDelete，這里不對源代碼做講解，而是主要提一下思路：

首先刪除行指針，然后把后面的位置向前填充補齊空位，如果刪除pd_lower指向位置則不需要移動，只需要對應(yīng)更新為上一個行指針即可。
刪除實際的堆元組。對應(yīng)的也需要進(jìn)行填補空缺，如果刪除pd_upper指向位置則不需要移動，只需要更新為后一個即可。
數(shù)據(jù)的存儲位置發(fā)生移動，更新數(shù)據(jù)指針的 offset 屬性。

負(fù)責(zé)刪除指定位置的數(shù)據(jù)，刪除數(shù)據(jù)后，會將需要將空閑的數(shù)據(jù)指針和數(shù)據(jù)進(jìn)行壓縮合并。

/*?負(fù)責(zé)刪除指定位置的數(shù)據(jù)，刪除數(shù)據(jù)后，會將需要將空閑的數(shù)據(jù)指針和數(shù)據(jù)進(jìn)行壓縮合并?*/ void?PageIndexTupleDelete(Page?page,?OffsetNumber??offnum?)? { ?????PageHeader??phdr?=?(PageHeader)?page; ?????char???????*addr; ?????ItemId??????tup; ?????Size????????size; ?????unsigned????offset; ?????int?????????nbytes; ?????int?????????offidx; ?????int?????????nline; ?? ?????/* ??????*?As?with?PageRepairFragmentation,?paranoia?seems?justified. ??????*/ ?????if?(phdr->pd_lower?<?SizeOfPageHeaderData?|| ?????????phdr->pd_lower?>?phdr->pd_upper?|| ?????????phdr->pd_upper?>?phdr->pd_special?|| ?????????phdr->pd_special?>?BLCKSZ?|| ?????????phdr->pd_special?!=?MAXALIGN(phdr->pd_special)) ?????????ereport(ERROR, ?????????????????(errcode(ERRCODE_DATA_CORRUPTED), ??????????????????errmsg("corrupted?page?pointers:?lower?=?%u,?upper?=?%u,?special?=?%u", ?????????????????????????phdr->pd_lower,?phdr->pd_upper,?phdr->pd_special))); ?? ?????nline?=?PageGetMaxOffsetNumber(page); ?????if?((int)?offnum?<=?0?||?(int)?offnum?>?nline) ?????????elog(ERROR,?"invalid?index?offnum:?%u",?offnum); ?? ?????/*?change?offset?number?to?offset?index?*/ ?????offidx?=?offnum?-?1; ?? ?????tup?=?PageGetItemId(page,?offnum); ?????Assert(ItemIdHasStorage(tup)); ?????size?=?ItemIdGetLength(tup); ?????offset?=?ItemIdGetOffset(tup); ?? ?????if?(offset?<?phdr->pd_upper?||?(offset?+?size)?>?phdr->pd_special?|| ?????????offset?!=?MAXALIGN(offset)) ?????????ereport(ERROR, ?????????????????(errcode(ERRCODE_DATA_CORRUPTED), ??????????????????errmsg("corrupted?line?pointer:?offset?=?%u,?size?=?%u", ?????????????????????????offset,?(unsigned?int)?size))); ??? ?????/*?Amount?of?space?to?actually?be?deleted?*/ ?????size?=?MAXALIGN(size); ???//?首先刪除行指針，然后把后面的位置向前填充補齊空位，如果刪除**pd_lower**指向位置則不需要移動，只需要對應(yīng)更新為上一個行指針即可。 ?????/* ??????*?First,?we?want?to?get?rid?of?the?pd_linp?entry?for?the?index?tuple.?We ??????*?copy?all?subsequent?linp's?back?one?slot?in?the?array.?We?don't?use ??????*?PageGetItemId,?because?we?are?manipulating?the?_array_,?not?individual ??????*?linp's. ??????*/ ?????nbytes?=?phdr->pd_lower?- ?????????((char?*)?&phdr->pd_linp[offidx?+?1]?-?(char?*)?phdr); ?? ?????if?(nbytes?>?0) ?????????memmove((char?*)?&(phdr->pd_linp[offidx]), ?????????????????(char?*)?&(phdr->pd_linp[offidx?+?1]), ?????????????????nbytes); ???//?刪除實際的堆元組。對應(yīng)的也需要進(jìn)行填補空缺，如果刪除**pd_upper**指向位置則不需要移動，只需要更新為后一個即可 ?????/* ??????*?Now?move?everything?between?the?old?upper?bound?(beginning?of?tuple ??????*?space)?and?the?beginning?of?the?deleted?tuple?forward,?so?that?space?in ??????*?the?middle?of?the?page?is?left?free.??If?we've?just?deleted?the?tuple ??????*?at?the?beginning?of?tuple?space,?then?there's?no?need?to?do?the?copy. ??????*/ ?? ?????/*?beginning?of?tuple?space?*/ ?????addr?=?(char?*)?page?+?phdr->pd_upper; ?? ?????if?(offset?>?phdr->pd_upper) ?????????memmove(addr?+?size,?addr,?offset?-?phdr->pd_upper); ?? ?????/*?adjust?free?space?boundary?pointers?*/ ?????phdr->pd_upper?+=?size; ?????phdr->pd_lower?-=?sizeof(ItemIdData); ??//?3.??數(shù)據(jù)的存儲位置發(fā)生移動，更新數(shù)據(jù)指針的?offset?屬性。 ?????/* ??????*?Finally,?we?need?to?adjust?the?linp?entries?that?remain. ??????* ??????*?Anything?that?used?to?be?before?the?deleted?tuple's?data?was?moved ??????*?forward?by?the?size?of?the?deleted?tuple. ??????*/ ?????if?(!PageIsEmpty(page)) ?????{ ?????????int?????????i; ?? ?????????nline--;????????????????/*?there's?one?less?than?when?we?started?*/ ?????????for?(i?=?1;?i?<=?nline;?i++) ?????????{ ?????????????ItemId??????ii?=?PageGetItemId(page,?i); ?? ?????????????Assert(ItemIdHasStorage(ii)); ?????????????if?(ItemIdGetOffset(ii)?<=?offset) ?????????????????ii->lp_off?+=?size; ?????????} ?????} ?}

5.3 修改數(shù)據(jù)

修改數(shù)據(jù)方法對應(yīng)PageIndexTupleOverwrite，它對應(yīng)的源代碼如下：

bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize); { ? ? PageHeader ?phdr = (PageHeader) page; ? ? ItemId ? ? ?tupid; ? ? int ? ? ? ? oldsize; ? ? unsigned ? ?offset; ? ? Size ? ? ? ?alignednewsize; ? ? int ? ? ? ? size_diff; ? ? int ? ? ? ? itemcount; ? ? ? /* ? ? ?* As with PageRepairFragmentation, paranoia seems justified. ? ? ?*/ ? ? if (phdr->pd_lower < SizeOfPageHeaderData || ? ? ? ? phdr->pd_lower > phdr->pd_upper || ? ? ? ? phdr->pd_upper > phdr->pd_special || ? ? ? ? phdr->pd_special > BLCKSZ || ? ? ? ? phdr->pd_special != MAXALIGN(phdr->pd_special)) ? ? ? ? ereport(ERROR, ? ? ? ? ? ? ? ? (errcode(ERRCODE_DATA_CORRUPTED), ? ? ? ? ? ? ? ? ?errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u", ? ? ? ? ? ? ? ? ? ? ? ? phdr->pd_lower, phdr->pd_upper, phdr->pd_special))); ? ? ? itemcount = PageGetMaxOffsetNumber(page); ? ? if ((int) offnum <= 0 || (int) offnum > itemcount) ? ? ? ? elog(ERROR, "invalid index offnum: %u", offnum); ? ? ? tupid = PageGetItemId(page, offnum); ? ? Assert(ItemIdHasStorage(tupid)); ? ? oldsize = ItemIdGetLength(tupid); ? ? offset = ItemIdGetOffset(tupid); ? ? ? if (offset < phdr->pd_upper || (offset + oldsize) > phdr->pd_special || ? ? ? ? offset != MAXALIGN(offset)) ? ? ? ? ereport(ERROR, ? ? ? ? ? ? ? ? (errcode(ERRCODE_DATA_CORRUPTED), ? ? ? ? ? ? ? ? ?errmsg("corrupted line pointer: offset = %u, size = %u", ? ? ? ? ? ? ? ? ? ? ? ? offset, (unsigned int) oldsize))); ? ? ? /* ? ? ?* Determine actual change in space requirement, check for page overflow. ? ? ?*/ ? ? oldsize = MAXALIGN(oldsize); ? ? alignednewsize = MAXALIGN(newsize); ? ? if (alignednewsize > oldsize + (phdr->pd_upper - phdr->pd_lower)) ? ? ? ? return false; // 重新定位現(xiàn)有數(shù)據(jù)并更新行指針，除非新的元組與舊元組的大小相同（對齊后）,要重新定位的是目標(biāo)元組之前的數(shù)據(jù) ? ? /* ? ? ?* Relocate existing data and update line pointers, unless the new tuple ? ? ?* is the same size as the old (after alignment), in which case there's ? ? ?* nothing to do. ?Notice that what we have to relocate is data before the ? ? ?* target tuple, not data after, so it's convenient to express size_diff ? ? ?* as the amount by which the tuple's size is decreasing, making it the ? ? ?* delta to add to pd_upper and affected line pointers. ? ? ?*/ ? ? size_diff = oldsize - (int) alignednewsize; ? ? if (size_diff != 0) ? ? { ? ? ? ? char ? ? ? *addr = (char *) page + phdr->pd_upper; ? ? ? ? int ? ? ? ? i; ? ? ? ? ? /* relocate all tuple data before the target tuple */ ? ? ? ? memmove(addr + size_diff, addr, offset - phdr->pd_upper); ? ? ? ? ? /* adjust free space boundary pointer */ ? ? ? ? phdr->pd_upper += size_diff; ? ? ? ? ? /* adjust affected line pointers too */ ? ? ? ? for (i = FirstOffsetNumber; i <= itemcount; i++) ? ? ? ? { ? ? ? ? ? ? ItemId ? ? ?ii = PageGetItemId(page, i); ? ? ? ? ? ? ? /* Allow items without storage; currently only BRIN needs that */ ? ? ? ? ? ? if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset) ? ? ? ? ? ? ? ? ii->lp_off += size_diff; ? ? ? ? } ? ? } ? ? ? /* Update the item's tuple length without changing its lp_flags field */ ? ? tupid->lp_off = offset + size_diff; ? ? tupid->lp_len = newsize; ? ? ? /* Copy new tuple data onto page */ ? ? memcpy(PageGetItem(page, tupid), newtup, newsize); ? ? ? return true; }

上面的邏輯大致如下：

如果原有數(shù)據(jù)的大小和新數(shù)據(jù)相同，那么直接修改對應(yīng)的數(shù)據(jù)指針和實際的數(shù)據(jù)。
如果不一致，需要先將數(shù)據(jù)進(jìn)行刪除。
將刪除的空間進(jìn)行壓縮合并，并且更新所有數(shù)據(jù)指針的 offset 屬性。最后才完成添加數(shù)據(jù)。

5.4 常用讀取方式

讀取方式分兩種：順序掃描與B樹索引掃描。

順序掃描：是通過行指針數(shù)組遍歷，O(1) 的查找速度。

BTree掃描：鍵存儲被索引的列值，值存儲的是堆元組的tid。查找的先按照Key搜索，找到之后根據(jù)值的TID讀取對應(yīng)堆元祖。TID這個屬性記錄堆元組偏移量和長度信息，可以直接通過掃描堆元組找到。

5.5 其他讀取方式

除了上面兩種經(jīng)典讀取方式之外，Postgresql還支持下面的讀取方式。

TID掃描
僅索引掃描
位圖掃描
GIN索引掃描

5.5.1 TID掃描

TID掃描是通過使用所需元組的TID直接訪問元組的方法。我們可以通過explain命令的tid scan確認(rèn)是否為tid掃描。

sampledb=# SELECT ctid, data FROM sampletbl WHERE ctid = '(0,1)'; ctid ?| ? data ? ? -------+----------- (0,1) | AAAAAAAAA (1 row) sampledb=# EXPLAIN SELECT ctid, data FROM sampletbl WHERE ctid = '(0,1)'; ? ? ? ? ? ? ? ? ? ? ? ?QUERY PLAN ---------------------------------------------------------- Tid Scan on sampletbl ?(cost=0.00..1.11 rows=1 width=38) ? TID Cond: (ctid = '(0,1)'::tid)

元組標(biāo)識符（tuple identifier, TID）包含區(qū)塊號和行指針偏移量

5.5.2 僅索引掃描

和索引組織表的構(gòu)建思路一樣，建立index時包含的字段集合囊括了需要查詢的字段，這樣就只需在索引中取數(shù)據(jù)，就不必回表了。

僅索引掃描是幾乎所有的關(guān)系型數(shù)據(jù)庫查詢的必備方式。

上面的案例分析，下面是分析過程：

我們假設(shè)有下面的表和索引。

id - integer
name text
data text

索引

"tbl_idx" btree (id, name)

查詢語句

select id,key from tbl where id between 18 and 19

元組數(shù)據(jù)分析

id=18, name = 'Queen'的Tuple_18存儲在0號數(shù)據(jù)頁中。
id=19, name='BOSTON'的Tuple_19存儲在1號數(shù)據(jù)頁中。

可見性分析

0號頁面中的元組永遠(yuǎn)可見
可見性映射（visibility map）

可見性映射根本作用是幫助VACUUM確定是否包含死元組，提高死元組的掃描效率

僅索引查詢優(yōu)化

某一頁中存儲所有的元組都是可見的，PostgreSQL就會使用索引元組。
如果存在不可見元祖，則PostgreSQL讀取索引元組指向的數(shù)據(jù)元組并檢查元組可見性。

由于存在不可見的元組，所以本查詢的僅索引查詢優(yōu)化需要二次檢查可見性。

5.5.3 位圖掃描

位圖掃描最初是為了Greenplum的Bizgres系統(tǒng)（專業(yè)操作系統(tǒng)）開發(fā)，之后被Postgresql列入標(biāo)準(zhǔn)實現(xiàn)。

參考：https://wiki.postgresql.org/wiki/Bitmap_Indexes#Index_Scan

bitmap scan的作用就是通過建立位圖的方式，將回表過程中對標(biāo)訪問隨機(jī)性IO的轉(zhuǎn)換為順行性行為，從而減少查詢過程中IO的消耗。

注意頁面位圖是為每個查詢動態(tài)創(chuàng)建的，并在位圖索引掃描結(jié)束時被丟棄。

位圖掃描的過程如下：

掃描滿足條件的TID。
TID按照頁面訪問順序構(gòu)建位圖。
讀取記錄對應(yīng)的頁面只需要讀取一次。

相關(guān)文章閱讀：

位圖掃描應(yīng)用場景不多，具體可以看這篇文章介紹：

第一篇：PostgreSQL中的位圖索引掃描（bitmap index scan） - MSSQL123 - 博客園 (cnblogs.com)
第二篇：PostgreSQL技術(shù)內(nèi)幕（七）索引掃描_數(shù)據(jù)庫_HashData_InfoQ寫作社區(qū)
第三篇：PostgreSQL優(yōu)化器之從一個關(guān)于掃描方式選擇引發(fā)的思考 - 掘金 (juejin.cn)

5.5.4 GIN索引掃描

也叫做Generalized Inverted Index，通用倒排索引。

GIN索引掃描（Generalized Inverted Index, 通用倒排索引）

GIN索引特別適用于支持全文搜索。內(nèi)部使用了倒排索引的數(shù)據(jù)結(jié)構(gòu)，存儲結(jié)構(gòu)為(key, posting list)，意味著key是關(guān)鍵字，posting list 是一組出現(xiàn)過key的位置。

GIN 最大的問題是不能頻繁插入，并且插入效率很低，因為倒排索引的設(shè)計特性，增加一個索引需要更多索引項。

為了優(yōu)化GIN索引插入性能，Postgresql引入了插入模式進(jìn)行優(yōu)化，主要思路是將GIN索引插入分為兩類模式。

正常模式：基表元組產(chǎn)生的新的GIN索引立即插入GIN索引。
fastupdate(快速更新)模式：基表元組產(chǎn)生的新的GIN索引會以追加的方式被插入到pending list列表中。

fastupdate(快速更新)模式這種優(yōu)化思路和Mysql的插入緩沖類似，就把大量的GIN插入合并為一次插入并且一次刷新到磁盤。需要注意GIN索引的pending list代價要大，因為posting list 是一組出現(xiàn)過key的位置，一次大批量插入會導(dǎo)致掃描效率低。

注意：通過 create index 的WITH FASTUPDATE = OFF參數(shù)來關(guān)閉fastupdate模式

為什么GIN 不使用正排索引？答案是類似鏈表模式進(jìn)行構(gòu)建，雖然構(gòu)建索引的方式簡單，但是每次查找最壞需要O(n)的時間。倒排索引則記錄該文檔的ID和字符在該文檔中出現(xiàn)的位置情況，只需要掃描一次即可查找到所需的信息。

Postgresql的GIN索引具備一定的擴(kuò)展性，代碼上只需要實現(xiàn)三個用戶定義方法即可。

比較兩個鍵（不是被索引項）并且返回一個整數(shù)。

int compare(Datum a, Datum b)

根據(jù)參數(shù)inputValue生成一個鍵值數(shù)組

Datum * extractValue(Datum itemValue, int32 * nkeys, bool ** nullFlags)

根據(jù)參數(shù)query生成一個用于查詢的鍵值數(shù)組，并返回其指針。

** pmatch, Pointer ** extra_data, bool ** nullFlags, int32 * searchMode)

標(biāo)簽：postgresql 內(nèi)幕探索

国产精品天干天干,亚洲毛片在线,日韩gay小鲜肉啪啪18禁,女同Gay自慰喷水

《Postgresql 內(nèi)幕探索》讀書筆記 - 第一章：集簇、數(shù)據(jù)庫、數(shù)據(jù)表

《Postgresql 內(nèi)幕探索》讀書筆記 - 第一章：集簇、表空間、元組

引言