Analysis of the PureFlash Space Accounting Problem

Problem: analyze why alloc_size in the database can become larger than total_size and why free_size goes negative; analyze in the code where alloc_size comes from, i.e. how shard_size is calculated.

From the query against v_store_free_size inside the function select_suitable_store_tray, we can see that the conductor relies only on this database SELECT to filter stores by remaining free space when picking storage nodes for allocation.

if(hostId == -1){
  list = S5Database.getInstance()
      .sql("select * from v_store_free_size as s "
          + "where s.status='OK' "
          + " order by free_size desc limit ? ", replica_count ).transaction(trans)
      .results(HashMap.class);

} else {
  if (replica_count > 1 ) {
    list = S5Database.getInstance()
        .sql("select * from v_store_free_size as s "
            + "where s.status='OK' and store_id!=?"
            + " order by free_size desc limit ? ", hostId, replica_count - 1).transaction(trans)
        .results(HashMap.class);
  }
  List<HashMap> list2 = S5Database.getInstance()
      .sql("select * from v_store_free_size as s "
          + "where s.status='OK' "
          + " and store_id=? ", hostId ).transaction(trans)
      .results(HashMap.class);
  if(list2 == null || list2.size() == 0){
    throw new InvalidParamException("Can't find specified store ID: " + hostId);
  }
  list.add(0, list2.get(0));
}
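
To see exactly what the conductor sees here, the view can be queried directly in the s5 database. The sketch below is not project code; it assumes a MariaDB JDBC driver on the classpath and uses hypothetical connection credentials, and simply dumps the free_size values that the selection query above orders by.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Diagnostic sketch (hypothetical host/user/password): print what
// select_suitable_store_tray would see in v_store_free_size.
public class FreeSizeProbe {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mariadb://127.0.0.1:3306/s5";   // assumed connection string
        try (Connection c = DriverManager.getConnection(url, "root", "password");
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery(
                 "select store_id, free_size, status from v_store_free_size"
                 + " where status='OK' order by free_size desc")) {
            while (rs.next()) {
                System.out.printf("store %d free_size %d%n",
                        rs.getLong("store_id"), rs.getLong("free_size"));
            }
        }
    }
}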
  • The statement in jconductor/res/init_s5metadb.sql that creates the v_store_alloc_size view is:
    create view v_store_alloc_size as  select store_id, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;

    v_store_alloc_size is thus built from the t_volume and t_replica tables: per store, it is the sum of the volumes' shard_size over the matching replica rows.

Note: the Docker build's code differs from the conductor code here; the Docker variant sums the volume size instead:

create view v_store_alloc_size as  select store_id, sum(size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;
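
A quick worked example (not project code) of how these two definitions count space, assuming t_replica holds one row per shard replica (consistent with rep_id.shard_index() on the pfs side): the conductor view adds shard_size once per replica row, while the Docker variant adds the whole volume size once per replica row, so both count pre-allocated space rather than data actually written, and the Docker variant appears to over-count multi-shard volumes even further. The 200 GiB volume and its placement below are made up.

public class AllocSizeExample {
    public static void main(String[] args) {
        long GiB = 1L << 30;
        long shardSize = 64 * GiB;                  // Config.DEFAULT_SHARD_SIZE
        long volSize   = 200 * GiB;                 // hypothetical 200 GiB volume
        // ceiling division, as in do_create_volume
        long shards = (volSize + shardSize - 1) / shardSize;           // 4 shards
        long rowsOnThisStore = shards;              // assume one replica of every shard lands on this store

        // conductor view: sum(t_volume.shard_size) over those replica rows
        long conductorAlloc = rowsOnThisStore * shardSize;             // 4 * 64 GiB = 256 GiB
        // docker variant:  sum(t_volume.size) over the same rows
        long dockerAlloc    = rowsOnThisStore * volSize;               // 4 * 200 GiB = 800 GiB

        System.out.println("conductor alloc_size: " + conductorAlloc);
        System.out.println("docker    alloc_size: " + dockerAlloc);
    }
}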
  • The statement that creates v_tray_alloc_size is:
    select  t_replica.store_id as store_id, tray_uuid, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id = t_replica.volume_id group by t_replica.tray_uuid
    , t_replica.store_id;
    

    As can be seen, v_tray_alloc_size is likewise a sum of shard_size, here grouped per tray (and store).

From the following statement in /home/flyslice/yangxiao/cocalele/PureFlash/pfs/src/pf_flash_store.cpp, we can see that when pfs sizes the store space for a replica (recovery_replica), it follows the rule that the first N-1 shards all use the standard size (64 GB), while the last shard, because the total capacity may not be an integer multiple of the standard size, takes whatever space remains as its size, so that the sum of all shard sizes equals the volume's total size.

int64_t shard_size = std::min<int64_t>(SHARD_SIZE, vol->size - rep_id.shard_index()*SHARD_SIZE);
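
Re-expressed in Java as a small sketch (the 100 GiB volume is a made-up input; SHARD_SIZE is the 64 GiB standard size mentioned above), this shows the pfs-side behaviour: only the last shard is trimmed to the remainder, while the shard_size column that the views sum stays at the flat standard size.

public class PfsShardSizeExample {
    static final long SHARD_SIZE = 64L << 30;      // standard shard size (64 GiB)

    // mirror of the C++ expression above
    static long shardSize(long volSize, int shardIndex) {
        return Math.min(SHARD_SIZE, volSize - (long) shardIndex * SHARD_SIZE);
    }

    public static void main(String[] args) {
        long volSize = 100L << 30;                                     // hypothetical 100 GiB volume
        long shardCount = (volSize + SHARD_SIZE - 1) / SHARD_SIZE;     // 2 shards
        for (int i = 0; i < shardCount; i++) {
            System.out.println("shard " + i + ": " + (shardSize(volSize, i) >> 30) + " GiB");
        }
        // prints 64 GiB for shard 0 and 36 GiB for shard 1,
        // while t_volume.shard_size stays at the flat 64 GiB that the views sum
    }
}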
MariaDB [s5]> select  t_replica.store_id as store_id, tray_uuid, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id = t_replica.volume_id group by t_replica.tray_uuid
, t_replica.store_id;
+----------+--------------------------------------+---------------+
| store_id | tray_uuid                            | alloc_size    |
+----------+--------------------------------------+---------------+
|        1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 3092376453120 |
|        2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2680059592704 |
|        2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 | 1168231104512 |
|        3 | b4f726d1-fffb-444a-89f9-814314acc680 | 2336462209024 |
|        1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 3092376453120 |
+----------+--------------------------------------+---------------+
5 rows in set (0.001 sec)

MariaDB [s5]> select * from v_tray_alloc_size;
+----------+--------------------------------------+---------------+
| store_id | tray_uuid                            | alloc_size    |
+----------+--------------------------------------+---------------+
|        1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 3092376453120 |
|        2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2680059592704 |
|        2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 | 1168231104512 |
|        3 | b4f726d1-fffb-444a-89f9-814314acc680 | 2336462209024 |
|        1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 3092376453120 |
+----------+--------------------------------------+---------------+

The v_tray_total_size view in the current environment looks like this:

MariaDB [s5]> select * from v_tray_total_size where status = 'OK';
+----------+--------------------------------------+---------------+--------+
| store_id | tray_uuid                            | total_size    | status |
+----------+--------------------------------------+---------------+--------+
|        1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 8001563222016 | OK     |
|        2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2048408248320 | OK     |
|        2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 |  500107862016 | OK     |
|        3 | b4f726d1-fffb-444a-89f9-814314acc680 | 1000204886016 | OK     |
|        1 | cd7d26e6-9d99-4ea9-b31d-bc57b2a6c43c | 2048408248320 | OK     |
|        1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 8001563222016 | OK     |
+----------+--------------------------------------+---------------+--------+

The rows of the table above are the disks in use as store trays; comparing them with v_tray_alloc_size, the problem is plain to see: for several trays (88c73032-…, ae6582a5-…, b4f726d1-…) the total size is already smaller than the allocated size.

  • The statement that creates v_tray_total_size is:
    select store_id, uuid as tray_uuid, raw_capacity as total_size, status from t_tray;
    

From the SQL above we can see that raw_capacity is used directly as the total size.

In the following code from jconductor/src/com/netbric/s5/conductor/handler/StoreHandler.java, the add_storenode function sets raw_capacity to 8 TiB (8L << 40) when it defines a new tray:

Tray t = new Tray();
for (int i = 0; i < 20; ++i)
{
    t.device = "Tray-" + i;
    t.status = Status.OK;
    t.raw_capacity = 8L << 40;
    t.store_id = n.id;
    S5Database.getInstance().insert(t);
}

In the updateStoreTrays function of /home/flyslice/yangxiao/cocalele/jconductor/src/com/netbric/s5/cluster/ClusterManager.java, a tray's raw_capacity is updated with the following statement:

tr.raw_capacity = Long.parseLong(new String(zk.getData(zkBaseDir + "/stores/"+store_id+"/trays/"+t+"/capacity", false, null)));

Judging from the context, this should be derived from the actual disk capacity: the concrete value of tr.raw_capacity is entirely determined by the numeric string stored at the corresponding ZooKeeper node, i.e. the value seen by the leader conductor. (?)
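
A quick arithmetic check supports that reading (an inference, not something confirmed in the code): the hard-coded default in add_storenode is 8 TiB, but the total_size values observed in v_tray_total_size above are 8,001,563,222,016 bytes, which is roughly 8 TB in decimal units and looks like a real disk capacity reported through the ZooKeeper capacity node rather than the 8L << 40 default.

public class RawCapacityCheck {
    public static void main(String[] args) {
        long hardCodedDefault = 8L << 40;        // add_storenode default: 8,796,093,022,208 bytes (8 TiB)
        long observedTotal    = 8001563222016L;  // value seen in v_tray_total_size above
        System.out.println("default   : " + hardCodedDefault);
        System.out.println("observed  : " + observedTotal);
        System.out.println("difference: " + (hardCodedDefault - observedTotal));
    }
}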

From ps -ef | grep zookeeper we can see that ZooKeeper in this environment logs to zookeeper-root-server-node1.out under /opt/apache-zookeeper-3.7.2-bin/bin/../logs, and its configuration file is /opt/apache-zookeeper-3.7.2-bin/bin/../conf/zoo.cfg.

flyslice@node1:~/yangxiao$ cat /opt/apache-zookeeper-3.7.2-bin/bin/../conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# list of cluster servers
server.1=192.168.61.229:2888:3888
server.2=192.168.61.143:2888:3888
server.3=192.168.61.122:2888:3888

However (per an AI answer), this file is only ZooKeeper's own runtime configuration; the tray capacity data should live under node paths inside the ZooKeeper cluster. The tray total size is computed from raw_capacity, and that value appears to be obtained by reading the capacity node from zk.

How to browse ZooKeeper paths:

# /opt/apache-zookeeper-3.7.2-bin/bin/zkCli.sh -h
/opt/apache-zookeeper-3.7.2-bin/bin/zkCli.sh -server 127.0.0.1:2181

After connecting to zk you can browse the current paths and so on (screenshot: 251017-image1). The built-in commands are listed below:

[zk: 127.0.0.1:2181(CONNECTED) 2]  /pureflash
ZooKeeper -server host:port -client-configuration properties-file cmd args
	addWatch [-m mode] path # optional mode is one of [PERSISTENT, PERSISTENT_RECURSIVE] - default is PERSISTENT_RECURSIVE
	addauth scheme auth
	close 
	config [-c] [-w] [-s]
	connect host:port
	create [-s] [-e] [-c] [-t ttl] path [data] [acl]
	delete [-v version] path
	deleteall path [-b batch size]
	delquota [-n|-b|-N|-B] path
	get [-s] [-w] path
	getAcl [-s] path
	getAllChildrenNumber path
	getEphemerals path
	history 
	listquota path
	ls [-s] [-w] [-R] path
	printwatches on|off
	quit 
	reconfig [-s] [-v version] [[-file path] | [-members serverID=host:port1:port2;port3[,...]*]] | [-add serverId=host:port1:port2;port3[,...]]* [-remove serverId[,...]*]
	redo cmdno
	removewatches path [-c|-d|-a] [-l]
	set [-s] [-v version] path data
	setAcl [-s] [-v version] [-R] path acl
	setquota -n|-b|-N|-B val path
	stat [-w] path
	sync path
	version 
	whoami 

So the command to look up a tray's capacity is as follows (screenshot: 251017-image2):

get /pureflash/cluster1/stores/1/trays/211558c6-c024-4bb9-9a0c-398f0959dbf7/capacity
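
The same value can be read programmatically, mirroring the zk.getData call quoted from updateStoreTrays above. This is a sketch, not project code; it assumes the ZooKeeper client library is on the classpath and uses the node path found with zkCli above.

import org.apache.zookeeper.ZooKeeper;

// Read one tray's capacity node the same way updateStoreTrays does.
public class TrayCapacityReader {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {});
        byte[] data = zk.getData(
            "/pureflash/cluster1/stores/1/trays/211558c6-c024-4bb9-9a0c-398f0959dbf7/capacity",
            false, null);
        long rawCapacity = Long.parseLong(new String(data));
        System.out.println("raw_capacity from zk: " + rawCapacity);
        zk.close();
    }
}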

Analysis of the SQL statement

create view v_tray_free_size as select t.store_id as store_id, t.tray_uuid as tray_uuid, t.total_size as total_size,
 COALESCE(a.alloc_size,0) as alloc_size , t.total_size-COALESCE(a.alloc_size,0) as free_size, t.status as status from v_tray_total_size as t left join v_tray_alloc_size as a on t.store_id=a.store_id and t.tray_uuid=a.tray_uuid order by free_size desc;

This creates a view named v_tray_free_size that reports, per tray, the total capacity, the allocated capacity and the remaining free capacity.

  • v_tray_total_size is aliased as t
  • v_tray_alloc_size is aliased as a
  • The two views are joined on their shared store_id / tray_uuid columns, and the result is ordered by free_size in descending order (screenshot: 251021-image2)
  • The LEFT JOIN keeps every record of the left view v_tray_total_size (the right view's extra column alloc_size is merged in when a record matches the left view on both store_id and tray_uuid; the left view's extra column status is kept, as are the shared columns total_size, store_id, tray_uuid)
  • When the right view contains a record matching the left view on both store_id and tray_uuid, its alloc_size (allocated capacity) is merged into the result; when there is no matching record (for example a tray that has never had space allocated), alloc_size would be NULL, but COALESCE(a.alloc_size, 0) turns it into 0, so that free_size is still computed correctly (total capacity - 0 = total capacity).

An example to aid understanding:

  • The left view v_tray_total_size (t) has two records:

    store_id | tray_uuid | total_size | status
    s1       | t1        | 1000       | OK
    s1       | t2        | 2000       | OK

  • The right view v_tray_alloc_size (a) has only one record (t1 has an allocation, t2 does not):

    store_id | tray_uuid | alloc_size
    s1       | t1        | 300

  • After the LEFT JOIN, the result is:

    store_id | tray_uuid | total_size | alloc_size (merged from right) | free_size | status
    s1       | t1        | 1000       | 300 (matched)                  | 700       | OK
    s1       | t2        | 2000       | 0 (no match, COALESCE)         | 2000      | OK
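
Returning to the real data: applying the same total_size - COALESCE(alloc_size, 0) arithmetic to one of the rows shown earlier (tray 88c73032-… on store 2) makes the original symptom concrete; once alloc_size exceeds total_size, free_size simply goes negative. A minimal sketch:

public class NegativeFreeSizeExample {
    public static void main(String[] args) {
        long totalSize = 2048408248320L;    // v_tray_total_size for tray 88c73032-...
        Long allocSize = 2680059592704L;    // v_tray_alloc_size for the same tray (null would become 0 via COALESCE)
        long freeSize = totalSize - (allocSize == null ? 0L : allocSize);
        System.out.println("free_size = " + freeSize);   // -631651344384
    }
}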

Where alloc_size and shard_size come into play

  • Besides database initialization (init), this space calculation is used when store nodes are added or removed (add_storenode) and when a new volume is created; are there any other places where it is involved?
  • The shard_size computation only seems to appear in do_create_volume and do_create_pfs2:
    ...
    v.shard_size = Config.DEFAULT_SHARD_SIZE;
    ...
    long shardCount = (v.size + v.shard_size - 1) / v.shard_size;
    ...
    assert(v.shard_size == 1L<<reply.shard_lba_cnt_order);
    ...
    

    The system appears to use pre-allocation; there is a prepare_volume method.

It looks like the allocation scheme reserves a fixed 64 GB per shard up front, regardless of how much space is actually occupied, and this holds for both the bare-metal and Docker cases; the accounted size never reflects how much data has actually been written (see the sketch below).
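
A final sketch pulling the pieces together (hypothetical volume sizes; again assuming one t_replica row per shard replica): each replica of a volume is accounted in whole 64 GiB shards via the ceiling division from do_create_volume, so alloc_size grows in 64 GiB steps per shard replica no matter how little data is written, and a single moderately sized volume can already exceed the raw_capacity of a small tray such as the ~500 GB one seen in v_tray_total_size above.

public class PreallocationExample {
    static final long SHARD_SIZE = 64L << 30;    // Config.DEFAULT_SHARD_SIZE (64 GiB)

    // alloc_size contribution of one replica of a volume, as the views count it
    static long perReplicaAlloc(long volSize) {
        long shardCount = (volSize + SHARD_SIZE - 1) / SHARD_SIZE;  // ceiling division from do_create_volume
        return shardCount * SHARD_SIZE;
    }

    public static void main(String[] args) {
        long smallTray = 500107862016L;                  // the ~500 GB tray in v_tray_total_size above
        long alloc65   = perReplicaAlloc(65L << 30);     // 65 GiB volume  -> 2 shards  -> 128 GiB accounted
        long alloc600  = perReplicaAlloc(600L << 30);    // 600 GiB volume -> 10 shards -> 640 GiB accounted
        System.out.println("65 GiB volume accounts for  " + alloc65  + " bytes per replica");
        System.out.println("600 GiB volume accounts for " + alloc600 + " bytes per replica");
        System.out.println("640 GiB > 500 GB tray? " + (alloc600 > smallTray));   // true
    }
}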