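Background
Anyone familiar with PostgreSQL will have noticed that snapshot acquisition easily becomes a performance bottleneck under high concurrency. The reason is that PG keeps the state of all transactions in a global array in shared memory, and taking a snapshot requires a lock to keep that data consistent. A backend taking a snapshot holds ProcArrayLock in shared mode while it scans the active transactions in the ProcArray array; meanwhile, every committing or aborting transaction must acquire ProcArrayLock in exclusive mode to remove itself from the array. Under high concurrency, contention on ProcArrayLock therefore becomes the bottleneck of the database. To overcome this problem, PolarDB introduces a CSN (Commit Sequence Number) transaction snapshot mechanism that avoids acquiring ProcArrayLock.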
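1 CSN Mechanism
1.1 How CSN Works
At the transaction layer, PolarDB replaces PG's native snapshot with a CSN snapshot.
As the figure shows, every non-read-only transaction is assigned an xid while it runs and advances the CSN when it commits; at the same time, the mapping between the current CSN and the transaction's XID is saved.
The solid vertical line in the figure marks the moment the snapshot is taken; the snapshot gets the value that follows the most recently committed CSN, which is 4. TX1, TX3 and TX5 have already committed, with CSNs 1, 2 and 3. TX2, TX4 and TX6 are still running, and TX7 and TX8 are future transactions that have not started yet. For the current snapshot, the results of all transactions whose CSN is strictly less than 4 are visible; the remaining transactions have not committed and are not visible.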
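The rule in the figure can be written as a one-line comparison. The following is a minimal sketch, not PolarDB code: the type alias, the `SKETCH_CSN_NONE` marker and the `csn_visible` function are hypothetical names chosen for illustration, and the sketch assumes that a transaction without a CSN is represented by 0.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;

/* Hypothetical marker meaning "no CSN yet": the transaction has not committed. */
#define SKETCH_CSN_NONE ((CommitSeqNo) 0)

/*
 * A transaction's result is visible to a snapshot only if the transaction
 * committed with a CSN strictly smaller than the snapshot's CSN.
 */
bool
csn_visible(CommitSeqNo xact_csn, CommitSeqNo snapshot_csn)
{
    if (xact_csn == SKETCH_CSN_NONE)
        return false;           /* still running, or not started */
    return xact_csn < snapshot_csn;
}
```

With the snapshot CSN of 4 from the figure, the CSNs 1, 2 and 3 of TX1, TX3 and TX5 pass the check, while a running transaction such as TX2 does not.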
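1.2 CSN Implementation
The mapping between a CSN (Commit Sequence Number) and an XID (transaction ID) is also persisted, so that a transaction can be tied to its visibility; this mapping is stored in the CSNLog. In the example, transaction IDs 2048, 2049, 2050, 2051, 2052 and 2053 have the CSNs 5, 4, 7, 10, 6 and 8 respectively, which means the transactions committed in the order 2049, 2048, 2052, 2050, 2053, 2051.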
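To make the relation between the mapping and the commit order concrete, here is a small sketch that sorts XID/CSN pairs by CSN. The `XidCsn` struct and both function names are hypothetical and exist only for this illustration; the CSNLog itself is not stored this way.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair type: one CSNLog entry seen as (xid, csn). */
typedef struct
{
    uint32_t xid;   /* transaction id */
    uint64_t csn;   /* commit sequence number recorded for that xid */
} XidCsn;

/* qsort comparator: ascending CSN, which is the commit order. */
int
cmp_by_csn(const void *a, const void *b)
{
    const XidCsn *x = a;
    const XidCsn *y = b;

    return (x->csn > y->csn) - (x->csn < y->csn);
}

/* Sort the entries in place so that they appear in commit order. */
void
sort_by_commit_order(XidCsn *entries, size_t n)
{
    qsort(entries, n, sizeof(XidCsn), cmp_by_csn);
}
```

Applied to the six pairs from the example, the sorted xids come out as 2049, 2048, 2052, 2050, 2053, 2051.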
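Correspondingly, PolarDB allocates an 8-byte uint64 CSN for every transaction ID, so one 8 kB page holds the CSNs of 1k (1024) transactions. Once the CSNLOG reaches a certain size it is split into segments, and each CSNLOG segment file is 256 kB. As with xids, a few CSN values are reserved for special purposes; they are defined together with the rest of the CSNLOG code.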
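The page and segment arithmetic follows directly from those sizes. The sketch below only reproduces the numbers stated above (8-byte entries, 8 kB pages, 256 kB segments); all macro and function names are hypothetical, and the assumption that entries are laid out densely by xid is mine, not taken from the PolarDB source.

```c
#include <stdint.h>

/* Sizes taken from the text; the names are hypothetical. */
#define SKETCH_PAGE_SIZE         8192u             /* 8 kB page */
#define SKETCH_SEGMENT_SIZE      (256u * 1024u)    /* 256 kB segment file */
#define SKETCH_CSNS_PER_PAGE     ((uint32_t) (SKETCH_PAGE_SIZE / sizeof(uint64_t)))
#define SKETCH_PAGES_PER_SEGMENT ((uint32_t) (SKETCH_SEGMENT_SIZE / SKETCH_PAGE_SIZE))

/* Page that holds the CSN of xid, assuming a dense layout indexed by xid. */
uint32_t
xid_to_page(uint32_t xid)
{
    return xid / SKETCH_CSNS_PER_PAGE;
}

/* Slot of xid inside its page. */
uint32_t
xid_to_entry(uint32_t xid)
{
    return xid % SKETCH_CSNS_PER_PAGE;
}

/* Segment file that holds the page of xid. */
uint32_t
xid_to_segment(uint32_t xid)
{
    return xid_to_page(xid) / SKETCH_PAGES_PER_SEGMENT;
}
```

Under these assumptions a page holds 1024 entries and a segment holds 32 pages, so xid 2049 lands in slot 1 of page 2 in segment 0.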
2 CSN Snapshots and Visibility Checks
2.1 CSN-Related Data Structures
The polar_csn_mvcc_var_cache struct maintains the oldest active transaction xid, the next CSN to be assigned, and the xid of the most recently completed transaction.
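As a rough picture of that struct, the following sketch models the three fields with C11 atomics. It is not the PolarDB definition: the struct name, field names and the init helper are hypothetical, and only the field widths (32-bit xids, 64-bit CSN) are inferred from the atomic reads used in GetSnapshotDataCSN.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared CSN cache; not the real struct. */
typedef struct
{
    _Atomic uint32_t oldest_active_xid;     /* oldest xid still running */
    _Atomic uint64_t next_csn;              /* CSN the next commit will get */
    _Atomic uint32_t latest_completed_xid;  /* newest xid that has finished */
} SketchCsnVarCache;

/* Initialise all three fields; callers pass whatever starting values apply. */
void
sketch_csn_cache_init(SketchCsnVarCache *cache, uint32_t oldest_active_xid,
                      uint64_t next_csn, uint32_t latest_completed_xid)
{
    atomic_init(&cache->oldest_active_xid, oldest_active_xid);
    atomic_init(&cache->next_csn, next_csn);
    atomic_init(&cache->latest_completed_xid, latest_completed_xid);
}
```

Keeping each field individually atomic is what lets a reader fetch them without taking a lock, at the cost of having to order the reads with barriers.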
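When another transaction needs the CSN status of a transaction that is currently in its committing phase, it waits for that commit to finish by acquiring CommitSeqNoLock in exclusive mode.
CSNLogControlLock protects writes to the csnlog files.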
2.2 Acquiring a CSN Snapshot
In PolarDB the function that takes a CSN snapshot is GetSnapshotDataCSN. It works as follows:
1. Read polar_shmem_csn_mvcc_var_cache->polar_next_csn and use it as snapshot->polar_snapshot_csn.
2. snapshot->xmin = polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid
3. snapshot->xmax = polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid + 1
4. Depending on the GUC parameter old_snapshot_threshold, decide whether snapshot->lsn and snapshot->whenTaken need to be set.
5. Finally, the GUC parameter polar_csn_xid_snapshot determines whether an xid snapshot is generated from the CSN snapshot.
static Snapshot
GetSnapshotDataCSN(Snapshot snapshot)
{
    TransactionId xmin;
    TransactionId xmax;
    CommitSeqNo snapshotcsn;

    Assert(snapshot != NULL);

    /*
     * The ProcArrayLock is not needed here. We only set our xmin if
     * it's not already set. There are only a few functions that check
     * the xmin under exclusive ProcArrayLock:
     * 1) ProcArrayInstallRestored/ImportedXmin -- can only care about
     * our xmin long after it has been first set.
     * 2) ProcArrayEndTransaction is not called concurrently with
     * GetSnapshotData.
     */

    /* Anything older than oldestActiveXid is surely finished by now. */
    xmin = pg_atomic_read_u32(&polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid);

    /* If no performance issue, we try best to maintain RecentXmin for xid based snapshot */
    RecentXmin = xmin;

    /* Announce my xmin, to hold back GlobalXmin. */
    if (!TransactionIdIsValid(MyPgXact->xmin))
    {
        TransactionId oldest_active_xid;

        MyPgXact->xmin = xmin;
        TransactionXmin = xmin;

        /*
         * Recheck, if oldestActiveXid advanced after we read it.
         *
         * This protects against a race condition with GetRecentGlobalXmin().
         * If a transaction ends runs GetRecentGlobalXmin(), just after we fetch
         * polar_oldest_active_xid, but before we set MyPgXact->xmin, it's possible
         * that GetRecentGlobalXmin() computed a new GlobalXmin that doesn't
         * cover the xmin that we got. To fix that, check polar_oldest_active_xid
         * again, after setting xmin. Redoing it once is enough, we don't need
         * to loop, because the (stale) xmin that we set prevents the same
         * race condition from advancing RecentGlobalXmin again.
         *
         * For a brief moment, we can have the situation that our xmin is
         * lower than RecentGlobalXmin, but it's OK because we don't use that xmin
         * until we've re-checked and corrected it if necessary.
         */

        /*
         * memory barrier to make sure that setting the xmin in our PGPROC entry
         * is made visible to others, before the read below.
         */
        pg_memory_barrier();

        oldest_active_xid = pg_atomic_read_u32(&polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid);
        if (oldest_active_xid != xmin)
        {
            /*no cover begin*/
            xmin = oldest_active_xid;
            RecentXmin = xmin;
            MyPgXact->xmin = xmin;
            TransactionXmin = xmin;
            /*no cover end*/
        }
    }

    /*
     * Get the current snapshot CSN. This
     * serializes us with any concurrent commits.
     */
    snapshotcsn = pg_atomic_read_u64(&polar_shmem_csn_mvcc_var_cache->polar_next_csn);

    /*
     * Also get xmax. It is always latestCompletedXid + 1.
     * Make sure to read it after CSN (see TransactionIdAsyncCommitTree())
     */
    pg_read_barrier();
    xmax = pg_atomic_read_u32(&polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid);
    Assert(TransactionIdIsNormal(xmax));
    TransactionIdAdvance(xmax);

    snapshot->xmin = xmin;
    snapshot->xmax = xmax;
    snapshot->polar_snapshot_csn = snapshotcsn;
    snapshot->polar_csn_xid_snapshot = false;
    snapshot->xcnt = 0;
    snapshot->subxcnt = 0;
    snapshot->suboverflowed = false;
    snapshot->curcid = GetCurrentCommandId(false);

    /*
     * This is a new snapshot, so set both refcounts are zero, and mark it as
     * not copied in persistent memory.
     */
    snapshot->active_count = 0;
    snapshot->regd_count = 0;
    snapshot->copied = false;

    if (old_snapshot_threshold < 0)
    {
        /*
         * If not using "snapshot too old" feature, fill related fields with
         * dummy values that don't require any locking.
         */
        snapshot->lsn = InvalidXLogRecPtr;
        snapshot->whenTaken = 0;
    }
    else
    {
        /*
         * Capture the current time and WAL stream location in case this
         * snapshot becomes old enough to need to fall back on the special
         * "old snapshot" logic.
         */
        snapshot->lsn = GetXLogInsertRecPtr();
        snapshot->whenTaken = GetSnapshotCurrentTimestamp();
        MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
    }

    /*
     * We get RecentGlobalXmin/RecentGlobalDataXmin lazily in polar csn.
     * In master mode, we reset it when end transaction;
     * In hot standby mode, wal replayed by startup backend, we has to reset
     * it when get snapshot,
     * because RecentGlobalXmin/RecentGlobalDataXmin are backend variables.
     */
    if (RecoveryInProgress())
        resetGlobalXminCacheCSN();

    /*
     * We need xid snapshot, should generate it from csn snapshot.
     * The logic is:
     * 1. Scan csnlog from xmin(inclusive) to xmax(exclusive)
     * 2. Add xids whose status are in_progress or committing or
     * committed csn >= snapshotcsn to xid array
     * Like hot standby, we don't know which xids are top-level and which are
     * subxacts. So we use subxip to store xids as more as possible.
     */
    if (polar_csn_xid_snapshot)
    {
        if (TransactionIdPrecedes(xmin, xmax))
            polar_csnlog_get_running_xids(xmin, xmax, snapshotcsn,
                                          GetMaxSnapshotSubxidCount(),
                                          &snapshot->subxcnt, snapshot->subxip,
                                          &snapshot->suboverflowed);
        snapshot->polar_csn_xid_snapshot = true;
    }

    return snapshot;
}
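The last comment in the function describes how an xid snapshot is derived from a CSN snapshot: scan the csnlog from xmin to xmax and collect every xid that is in progress, committing, or committed with a CSN at or above the snapshot CSN. The sketch below illustrates that scan over a plain in-memory array. It is not polar_csnlog_get_running_xids: the function name, the status constants and their numeric values are hypothetical, and xid wraparound is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status encoding; the real reserved CSN values may differ. */
#define SKETCH_CSN_IN_PROGRESS  UINT64_C(0)
#define SKETCH_CSN_ABORTED      UINT64_C(1)
#define SKETCH_CSN_COMMITTING   UINT64_C(2)
#define SKETCH_CSN_FIRST_NORMAL UINT64_C(3)

/*
 * csnlog[i] holds the CSN of xid (base_xid + i). Collect into out[] the xids
 * in [xmin, xmax) that the snapshot must treat as still running. At most
 * max_xids are stored; *overflowed is set if more would have been needed.
 */
size_t
collect_running_xids(const uint64_t *csnlog, uint32_t base_xid,
                     uint32_t xmin, uint32_t xmax, uint64_t snapshot_csn,
                     uint32_t *out, size_t max_xids, bool *overflowed)
{
    size_t   count = 0;
    uint32_t xid;

    *overflowed = false;
    for (xid = xmin; xid < xmax; xid++)
    {
        uint64_t csn = csnlog[xid - base_xid];
        bool     running = csn == SKETCH_CSN_IN_PROGRESS ||
                           csn == SKETCH_CSN_COMMITTING ||
                           (csn >= SKETCH_CSN_FIRST_NORMAL && csn >= snapshot_csn);

        if (!running)
            continue;           /* aborted, or committed before the snapshot */
        if (count == max_xids)
        {
            *overflowed = true; /* no room left for this xid */
            break;
        }
        out[count++] = xid;
    }
    return count;
}
```

Aborted transactions are skipped because they can never become visible, so they do not need to appear in the running list.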
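2.3 MVCC Visibility Check Flow
The MVCC check combines the tuple header (its xmin and xmax) with the Clog and the CSNLOG mapping described above. The overall check is implemented by HeapTupleSatisfiesMVCC, and the visibility of an xid within a CSN snapshot is decided by XidVisibleInSnapshotCSN.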
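The core of the check can be sketched in a few lines: a tuple is visible when its inserting transaction is visible to the snapshot and its deleting transaction, if any, is not. This is a heavily simplified illustration, not the logic of HeapTupleSatisfiesMVCC or XidVisibleInSnapshotCSN. All types, names and status values are hypothetical, the csnlog is a plain array indexed by xid, and the sketch ignores hint bits, the current transaction's own changes, row locks, waiting on committing transactions, and xid wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status encoding for this sketch only. */
#define SKETCH_CSN_IN_PROGRESS  UINT64_C(0)
#define SKETCH_CSN_ABORTED      UINT64_C(1)
#define SKETCH_CSN_FIRST_NORMAL UINT64_C(2)

typedef struct
{
    uint32_t xmin;              /* inserting xid */
    uint32_t xmax;              /* deleting xid, 0 if the tuple is not deleted */
} SketchTuple;

typedef struct
{
    uint32_t        xmax;       /* first xid that had not completed at snapshot time */
    uint64_t        csn;        /* snapshot CSN */
    const uint64_t *csnlog;     /* csnlog[xid] is the CSN recorded for xid */
} SketchSnapshot;

/* Is the effect of xid visible to the snapshot? */
bool
sketch_xid_visible(const SketchSnapshot *snap, uint32_t xid)
{
    uint64_t csn;

    if (xid >= snap->xmax)
        return false;           /* not completed when the snapshot was taken */
    csn = snap->csnlog[xid];
    if (csn < SKETCH_CSN_FIRST_NORMAL)
        return false;           /* in progress or aborted */
    return csn < snap->csn;     /* committed before the snapshot */
}

/* Is the tuple version visible to the snapshot? */
bool
sketch_tuple_visible(const SketchSnapshot *snap, const SketchTuple *tup)
{
    if (!sketch_xid_visible(snap, tup->xmin))
        return false;           /* the insert is not visible */
    if (tup->xmax == 0)
        return true;            /* never deleted */
    return !sketch_xid_visible(snap, tup->xmax);
}
```

The xmax comparison comes first so that xids beyond the snapshot range are rejected without a csnlog lookup.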
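2.4 How Commit and Abort Update the CSN
Taking a CSN snapshot relies mainly on the member variables maintained in polar_shmem_csn_mvcc_var_cache; see the section on acquiring a CSN snapshot above.
The focus here is therefore on how a transaction updates the members of polar_shmem_csn_mvcc_var_cache when it commits or aborts.
The AdvanceOldestActiveXidCSN function advances polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid:
polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid has to be advanced on process exit, after a transaction commits or rolls back, and when commit and abort records are replayed on a standby. The value is advanced only when the transaction's xid equals polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid; otherwise the function returns immediately.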
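The early-return condition can be illustrated as follows. Only the equality check comes from the text; how the new value is found is an assumption of this sketch, which simply scans forward to the next xid that is still in progress. The function name and the status constant are hypothetical, the csnlog is a plain array indexed by xid, and the atomics and wraparound handling a real implementation needs are left out.

```c
#include <stdint.h>

/* Hypothetical marker for "still in progress" in this sketch. */
#define SKETCH_CSN_IN_PROGRESS UINT64_C(0)

/*
 * Return the new oldest active xid after my_xid has finished.
 * csnlog[xid] is the recorded CSN of xid; next_xid is the first xid that
 * has not been assigned yet.
 */
uint32_t
sketch_advance_oldest_active_xid(uint32_t oldest_active_xid, uint32_t my_xid,
                                 const uint64_t *csnlog, uint32_t next_xid)
{
    uint32_t xid;

    /* An older transaction is still running: nothing to advance. */
    if (my_xid != oldest_active_xid)
        return oldest_active_xid;

    /* Assumed strategy: skip forward over xids that have already finished. */
    xid = my_xid + 1;
    while (xid < next_xid && csnlog[xid] != SKETCH_CSN_IN_PROGRESS)
        xid++;
    return xid;
}
```

Because only the transaction that currently is the oldest does any work, most commits and aborts return from this step at once.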
polar_xact_abort_tree_csn sets the transaction's CSN to POLAR_CSN_ABORTED when the transaction rolls back, and advances polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid.
polar_xact_commit_tree_csn sets the transaction's CSN when the transaction commits, and advances both polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid and polar_shmem_csn_mvcc_var_cache->polar_next_csn.
polar_shmem_csn_mvcc_var_cache->polar_next_csn is advanced only by committing transactions; a rollback does not advance it.
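The difference between the two paths can be shown with a single-threaded model. This is a sketch under stated assumptions, not the code of polar_xact_commit_tree_csn or polar_xact_abort_tree_csn: the struct, the function names, the array size and the numeric value of the aborted marker are hypothetical, and subtransactions, the intermediate committing state and atomic updates are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_XID     64           /* size of the toy csnlog */
#define SKETCH_CSN_ABORTED UINT64_C(1)  /* hypothetical "aborted" marker */

typedef struct
{
    uint64_t next_csn;                  /* CSN the next commit will receive */
    uint32_t latest_completed_xid;      /* newest xid that has finished */
    uint64_t csnlog[SKETCH_MAX_XID];    /* csnlog[xid] is the CSN of xid */
} SketchCsnState;

/* Both paths advance latest_completed_xid, but never move it backwards. */
static void
sketch_mark_completed(SketchCsnState *st, uint32_t xid)
{
    if (xid > st->latest_completed_xid)
        st->latest_completed_xid = xid;
}

/* Commit: record the current next_csn for xid, then advance next_csn. */
void
sketch_commit(SketchCsnState *st, uint32_t xid)
{
    assert(xid < SKETCH_MAX_XID);
    st->csnlog[xid] = st->next_csn++;
    sketch_mark_completed(st, xid);
}

/* Abort: record the aborted marker; next_csn is left untouched. */
void
sketch_abort(SketchCsnState *st, uint32_t xid)
{
    assert(xid < SKETCH_MAX_XID);
    st->csnlog[xid] = SKETCH_CSN_ABORTED;
    sketch_mark_completed(st, xid);
}
```

Since a snapshot takes next_csn as its CSN, a commit that consumed the previous value is visible to every snapshot taken afterwards, while an abort changes nothing that a snapshot compares against.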
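Once the CSN feature is enabled, only ShmemVariableCache->nextXid is still updated among the members of ShmemVariableCache, the global variable PG originally uses to maintain xid allocation; it is still needed to allocate xids. The original ShmemVariableCache->latestCompletedXid and similar members have been replaced by polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid, so they no longer need to be maintained when a transaction changes state.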