
PureFlash Space Calculation Problem Analysis

Problem: analyze why alloc_size in the database ends up larger than total_size and free_size goes negative; analyze how alloc_size is derived in the code, i.e. walk through how shard_size is calculated.

From the query against v_store_free_size inside the function select_suitable_store_tray, we can see that the conductor uses only this database SELECT to filter on remaining free space when choosing storage nodes for allocation.

if(hostId == -1){
  list = S5Database.getInstance()
      .sql("select * from v_store_free_size as s "
          + "where s.status='OK' "
          + " order by free_size desc limit ? ", replica_count ).transaction(trans)
      .results(HashMap.class);

} else {
  if (replica_count > 1 ) {
    list = S5Database.getInstance()
        .sql("select * from v_store_free_size as s "
            + "where s.status='OK' and store_id!=?"
            + " order by free_size desc limit ? ", hostId, replica_count - 1).transaction(trans)
        .results(HashMap.class);
  }
  List<HashMap> list2 = S5Database.getInstance()
      .sql("select * from v_store_free_size as s "
          + "where s.status='OK' "
          + " and store_id=? ", hostId ).transaction(trans)
      .results(HashMap.class);
  if(list2 == null || list2.size() == 0){
    throw new InvalidParamException("Can't find specified store ID: " + hostId);
  }
  list.add(0, list2.get(0));
}
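
For reference, the v_store_free_size view itself is not quoted in these notes; it is presumably defined analogously to the v_tray_free_size view quoted further below (total minus allocated, per store). The following is a sketch of that assumed definition only, not the actual statement from init_s5metadb.sql:

-- Assumed sketch of v_store_free_size, modeled on v_tray_free_size below;
-- the real definition lives in jconductor/res/init_s5metadb.sql.
create view v_store_free_size as
select t.store_id as store_id,
       t.total_size as total_size,
       COALESCE(a.alloc_size, 0) as alloc_size,
       t.total_size - COALESCE(a.alloc_size, 0) as free_size,
       s.status as status
from v_store_total_size as t
join t_store as s on s.id = t.store_id
left join v_store_alloc_size as a on t.store_id = a.store_id;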
  • Per the statement in jconductor/res/init_s5metadb.sql that creates the v_store_alloc_size view:
    create view v_store_alloc_size as  select store_id, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;
    /* i.e. v_store_alloc_size is equivalent to: */
    select store_id, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;
    

    v_store_alloc_size is built from the two tables t_volume and t_replica; it is the sum of the volumes' shard_size, grouped by store.

Note: the docker code differs from the conductor code here; the docker version sums the volume size instead:

create view v_store_alloc_size as  select store_id, sum(size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;
  • The statement that creates v_tray_alloc_size is:
    select t_replica.store_id as store_id, tray_uuid, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id = t_replica.volume_id group by t_replica.tray_uuid, t_replica.store_id;
    

    You can see that here, too, the tray-level alloc_size is the sum of shard_size (grouped by tray_uuid and store_id).

  • The statement that creates v_tray_total_size is:
    select store_id, uuid as tray_uuid, raw_capacity as total_size, status from t_tray;
    
  • AI's view of potential flaws here:
    1. The view computes the capacity allocated on each tray via sum(t_volume.shard_size), but overlooks a key point: a volume's (t_volume) shards may be replicated to multiple replicas (t_replica), so shard_size gets counted once per replica (see the sketch after this list).
    2. v_tray_total_size takes t_tray.raw_capacity directly as total_size; if raw_capacity does not subtract system-reserved space, metadata space, etc., total_size could be overestimated (but in our case it is alloc_size that is larger, so this is not the main cause).
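
A minimal reproduction of point 1, using hypothetical throwaway tables rather than the real schema: a single 64 GiB shard replicated to three trays of one store produces three join rows, so its shard_size is summed three times at the store level (the same multiplication the live verification query later in this note confirms with replica_count = 30).

-- Hypothetical demo tables, not the real t_volume / t_replica schema.
create temporary table demo_volume (id bigint, shard_size bigint);
create temporary table demo_replica (id bigint, volume_id bigint, store_id int, tray_uuid char(36));

-- One 64 GiB shard, replicated to three trays of the same store.
insert into demo_volume values (1, 68719476736);
insert into demo_replica values (1, 1, 1, 't1'), (2, 1, 1, 't2'), (3, 1, 1, 't3');

-- Same join shape as v_store_alloc_size: each replica row contributes shard_size once,
-- so the store-level alloc_size is 3 * 68719476736 = 206158430208.
select r.store_id, sum(v.shard_size) as alloc_size
from demo_volume v, demo_replica r
where v.id = r.volume_id
group by r.store_id;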

From the following statement in /home/flyslice/yangxiao/cocalele/PureFlash/pfs/src/pf_flash_store.cpp, we can see that when recovery_replica allocates store space, pfs follows the rule that 'the first N-1 shards all use the standard size (64G); the last shard, because the total capacity may not be an integer multiple of the standard size, takes the remaining space as its size, so that the shards' capacities sum to the volume's total capacity'.

int64_t shard_size = std::min<int64_t>(SHARD_SIZE, vol->size - rep_id.shard_index()*SHARD_SIZE);
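
A worked example of this formula with hypothetical numbers (SHARD_SIZE = 64 GiB, vol->size = 100 GiB): shard 0 gets the full 64 GiB and shard 1 only the remaining 36 GiB. Expressed here in GiB, with MariaDB's LEAST standing in for std::min:

-- Hypothetical numbers in GiB, mirroring the std::min formula above.
select least(64, 100 - 0 * 64) as shard0_gib,   -- 64
       least(64, 100 - 1 * 64) as shard1_gib;   -- 36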
Running the v_tray_alloc_size query by hand against the current environment gives:

MariaDB [s5]> select t_replica.store_id as store_id, tray_uuid, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id = t_replica.volume_id group by t_replica.tray_uuid, t_replica.store_id;
+----------+--------------------------------------+---------------+
| store_id | tray_uuid                            | alloc_size    |
+----------+--------------------------------------+---------------+
|        1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 3092376453120 |
|        2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2680059592704 |
|        2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 | 1168231104512 |
|        3 | b4f726d1-fffb-444a-89f9-814314acc680 | 2336462209024 |
|        1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 3092376453120 |
+----------+--------------------------------------+---------------+
5 rows in set (0.001 sec)

The v_tray_alloc_size view itself returns the same rows:

MariaDB [s5]> select * from v_tray_alloc_size;
+----------+--------------------------------------+---------------+
| store_id | tray_uuid                            | alloc_size    |
+----------+--------------------------------------+---------------+
|        1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 3092376453120 |
|        2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2680059592704 |
|        2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 | 1168231104512 |
|        3 | b4f726d1-fffb-444a-89f9-814314acc680 | 2336462209024 |
|        1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 3092376453120 |
+----------+--------------------------------------+---------------+

The v_tray_total_size result in the current environment is as follows:

MariaDB [s5]> select * from v_tray_total_size where status = 'OK';
+----------+--------------------------------------+---------------+--------+
| store_id | tray_uuid                            | total_size    | status |
+----------+--------------------------------------+---------------+--------+
|        1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 8001563222016 | OK     |
|        2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2048408248320 | OK     |
|        2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 |  500107862016 | OK     |
|        3 | b4f726d1-fffb-444a-89f9-814314acc680 | 1000204886016 | OK     |
|        1 | cd7d26e6-9d99-4ea9-b31d-bc57b2a6c43c | 2048408248320 | OK     |
|        1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 8001563222016 | OK     |
+----------+--------------------------------------+---------------+--------+

Comparing with v_tray_alloc_size above, the trays in rows 1, 2, 3, 4 and 6 of this table are the ones in use (they hold replicas); the problem is visible directly: for rows 2, 3 and 4 the total size is already smaller than the allocated size.

  • Recall that the statement creating v_tray_total_size is:
    select store_id, uuid as tray_uuid, raw_capacity as total_size, status from t_tray;
    

From this SQL we can see that raw_capacity is used directly as the total size.

Per the following code in jconductor/src/com/netbric/s5/conductor/handler/StoreHandler.java, in the add_storenode function, when a new tray is defined its raw_capacity is set to 8T:

Tray t = new Tray();
for (int i = 0; i < 20; ++i)
{
    t.device = "Tray-" + i;
    t.status = Status.OK;
    t.raw_capacity = 8L << 40;   // 8 TiB
    t.store_id = n.id;
    S5Database.getInstance().insert(t);
}
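
A quick way to spot trays that still carry this hard-coded default rather than a ZooKeeper-reported capacity (a diagnostic sketch, not a query from the repo; 8L << 40 equals 8796093022208 bytes):

-- Trays whose raw_capacity is still the hard-coded 8 TiB default from add_storenode.
select store_id, uuid as tray_uuid, raw_capacity
from t_tray
where raw_capacity = 8796093022208;   -- 8L << 40

In the v_tray_total_size output above none of the trays show this exact value (they show real disk sizes such as 8001563222016), which is consistent with updateStoreTrays overwriting the default from ZooKeeper, as described next.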

In the updateStoreTrays function of /home/flyslice/yangxiao/cocalele/jconductor/src/com/netbric/s5/cluster/ClusterManager.java, the tray's raw_capacity is updated with the following statement:

tr.raw_capacity = Long.parseLong(new String(zk.getData(zkBaseDir + "/stores/"+store_id+"/trays/"+t+"/capacity", false, null)));

From the context this should come from the actual disk capacity; the concrete value of tr.raw_capacity depends entirely on the numeric string stored at the corresponding ZooKeeper node, that is, the value held on the leader conductor node. (?)

From ps -ef | grep zookeeper we can see that this environment's ZooKeeper writes its log to zookeeper-root-server-node1.out under /opt/apache-zookeeper-3.7.2-bin/bin/../logs, and its config file is /opt/apache-zookeeper-3.7.2-bin/bin/../conf/zoo.cfg:

flyslice@node1:~/yangxiao$ cat /opt/apache-zookeeper-3.7.2-bin/bin/../conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# list of cluster servers
server.1=192.168.61.229:2888:3888
server.2=192.168.61.143:2888:3888
server.3=192.168.61.122:2888:3888

However, AI points out that this file is ZooKeeper's own runtime configuration; the tray-related capacity data should live under znode paths in the ZooKeeper cluster. The tray total space is computed from raw_capacity, and that value appears to be obtained by reading capacity from zk.

How to inspect ZooKeeper paths:

# /opt/apache-zookeeper-3.7.2-bin/bin/zkCli.sh -h
/opt/apache-zookeeper-3.7.2-bin/bin/zkCli.sh -server 127.0.0.1:2181

After connecting to zk you can inspect the current paths and so on 251017-image1

[zk: 127.0.0.1:2181(CONNECTED) 2]  /pureflash
ZooKeeper -server host:port -client-configuration properties-file cmd args
	addWatch [-m mode] path # optional mode is one of [PERSISTENT, PERSISTENT_RECURSIVE] - default is PERSISTENT_RECURSIVE
	addauth scheme auth
	close 
	config [-c] [-w] [-s]
	connect host:port
	create [-s] [-e] [-c] [-t ttl] path [data] [acl]
	delete [-v version] path
	deleteall path [-b batch size]
	delquota [-n|-b|-N|-B] path
	get [-s] [-w] path
	getAcl [-s] path
	getAllChildrenNumber path
	getEphemerals path
	history 
	listquota path
	ls [-s] [-w] [-R] path
	printwatches on|off
	quit 
	reconfig [-s] [-v version] [[-file path] | [-members serverID=host:port1:port2;port3[,...]*]] | [-add serverId=host:port1:port2;port3[,...]]* [-remove serverId[,...]*]
	redo cmdno
	removewatches path [-c|-d|-a] [-l]
	set [-s] [-v version] path data
	setAcl [-s] [-v version] [-R] path acl
	setquota -n|-b|-N|-B val path
	stat [-w] path
	sync path
	version 
	whoami 

So the command to look up capacity is as follows: 251017-image2

get /pureflash/cluster1/stores/1/trays/211558c6-c024-4bb9-9a0c-398f0959dbf7/capacity

Analysis of the SQL statement

create view v_tray_free_size as
select t.store_id as store_id, t.tray_uuid as tray_uuid, t.total_size as total_size,
       COALESCE(a.alloc_size, 0) as alloc_size,
       t.total_size - COALESCE(a.alloc_size, 0) as free_size,
       t.status as status
from v_tray_total_size as t
left join v_tray_alloc_size as a on t.store_id = a.store_id and t.tray_uuid = a.tray_uuid
order by free_size desc;

The statement creates a view named v_tray_free_size that reports, for each tray, its total capacity, allocated capacity and remaining capacity:

  • v_tray_total_size : t
  • v_tray_alloc_size : a
  • The two views are joined on their shared store_id / tray_uuid columns, and the result is sorted by free_size in descending order 251021-image2
  • The LEFT JOIN keeps every record of the left table v_tray_total_size (the right table's extra column alloc_size is merged in only when a record matches the left table on both store_id and tray_uuid; the left table's other columns total_size and status are kept, and store_id / tray_uuid are the join keys shared by both)
  • When the right table has a record matching the left table on both store_id and tray_uuid, its alloc_size (allocated capacity) is merged into the result; when there is no matching record (e.g. a tray that has never had space allocated), alloc_size would be NULL, but COALESCE(a.alloc_size, 0) converts it to 0 so the subsequent free_size calculation stays correct (total capacity - 0 = total capacity).

An example to aid understanding:

  • The left table v_tray_total_size (t) has two records:

    store_id  tray_uuid  total_size  status
    s1        t1         1000        OK
    s1        t2         2000        OK

  • The right table v_tray_alloc_size (a) has only one record (t1 has an allocation, t2 does not):

    store_id  tray_uuid  alloc_size
    s1        t1         300

  • After the LEFT JOIN, the result is:

    store_id  tray_uuid  total_size  alloc_size (merged from right table)   free_size  status
    s1        t1         1000        300  (matched in right table)          700        OK
    s1        t2         2000        0    (no match; COALESCE applied)      2000       OK
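
The same toy example as runnable SQL, with hypothetical throwaway tables (demo_total / demo_alloc) standing in for the two views:

-- Hypothetical demo tables standing in for v_tray_total_size / v_tray_alloc_size.
create temporary table demo_total (store_id varchar(8), tray_uuid varchar(8), total_size bigint, status varchar(8));
create temporary table demo_alloc (store_id varchar(8), tray_uuid varchar(8), alloc_size bigint);
insert into demo_total values ('s1', 't1', 1000, 'OK'), ('s1', 't2', 2000, 'OK');
insert into demo_alloc values ('s1', 't1', 300);

-- Same shape as v_tray_free_size: the LEFT JOIN keeps t2 even though it has no alloc row,
-- and COALESCE turns the missing alloc_size into 0, so t2's free_size equals its total_size.
select t.store_id, t.tray_uuid, t.total_size,
       coalesce(a.alloc_size, 0) as alloc_size,
       t.total_size - coalesce(a.alloc_size, 0) as free_size,
       t.status
from demo_total as t
left join demo_alloc as a on t.store_id = a.store_id and t.tray_uuid = a.tray_uuid
order by free_size desc;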

Where alloc_size and shard_size are involved

  • Besides database initialization (init), this space accounting is used when adding/removing storage nodes (add_storenode) and when creating a new volume; where else is it involved?
  • The shard_size calculation only seems to appear in do_create_volume and do_create_pfs2:
    ...
    v.shard_size = Config.DEFAULT_SHARD_SIZE;
    ...
    long shardCount = (v.size + v.shard_size - 1) / v.shard_size;
    ...
    assert(v.shard_size == 1L<<reply.shard_lba_cnt_order);
    ...
    

    The system appears to use pre-allocation; there is a prepare_volume method (a worked shard-count example follows below).
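
A worked example of the shardCount formula above, with hypothetical numbers (DEFAULT_SHARD_SIZE = 64 GiB, a 130 GiB volume): the rounding-up division yields 3 shards.

-- Hypothetical numbers in GiB: shardCount = (size + shard_size - 1) / shard_size, integer division.
select (130 + 64 - 1) div 64 as shard_count;   -- 3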

It looks like the allocation scheme always allocates a fixed 64G regardless of how much space is actually occupied, in both the bare-metal and the docker setup; it is never based on how much data has actually been written.

Verifying the AI's guess

Guess: a volume's (t_volume) shards may be replicated to multiple replicas (t_replica), causing shard_size to be counted repeatedly.

-- v_tray_total_size (quoted again for reference)
select store_id, uuid as tray_uuid, raw_capacity as total_size, status from t_tray;

Multi-replica itself is not the mistake; the error is that replica placement does not cap the total capacity allocated on a single tray, while the v_tray_alloc_size view faithfully counts the physical usage of every replica on each tray, so alloc_size ends up exceeding total_size.

MariaDB [s5]>
SELECT v.id AS volume_id, v.shard_size, COUNT(r.id) AS replica_count,  -- total number of replicas of this volume
 v.shard_size * COUNT(r.id) AS total_allocated  -- the accumulated total size
 FROM t_volume v
 JOIN t_replica r ON v.id = r.volume_id
 WHERE v.id = '2030043136'  GROUP BY v.id, v.shard_size;
+------------+-------------+---------------+-----------------+
| volume_id  | shard_size  | replica_count | total_allocated |
+------------+-------------+---------------+-----------------+
| 2030043136 | 68719476736 |            30 |   2061584302080 |
+------------+-------------+---------------+-----------------+

251021-image3

Where the space numbers actually come from

Space is only physically consumed once data is actually written.

At initial allocation time no data has been written yet; only a 64G shard has been allocated (see e.g. the related code in do_create_volume).

In zk, every object_size is 64G 251021-image4

Conclusion (25.10.23)

Problem description: in the figure below, v_store_total_size is smaller than v_store_alloc_size 251023-image1

Relevant code:

create view v_store_alloc_size as  select store_id, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;

create view v_store_total_size as  select s.id as store_id, sum(t.raw_capacity) as total_size from t_tray as t, t_store as s where t.status="OK" and t.store_id=s.id group by store_id;

Conclusion: v_store_alloc_size is derived from the sizes pre-allocated earlier and does not reflect the actual allocation, while v_store_total_size only counts trays whose status is OK and excludes offline disks. So when a node goes offline, the allocated space stays the same but the total space shrinks, and the problem appears.

Note: because the v_store_total_size calculation excludes offline trays, this problem is unrelated to the 'redundant tray entries in the t_tray table' issue.
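
A quick check of this conclusion (a diagnostic sketch against the existing tables, not code from the repo): compare each store's raw capacity over all trays with the capacity over only 'OK' trays; a store whose free space went negative should show a gap here.

-- Per-store capacity over all trays vs. only trays with status 'OK'
-- (the latter is what v_store_total_size counts).
select store_id,
       sum(raw_capacity) as total_all_trays,
       sum(case when status = 'OK' then raw_capacity else 0 end) as total_ok_trays
from t_tray
group by store_id;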

Fix plan