PureFlash Space Calculation Problem Analysis
Problem: analyze why alloc_size in the database can become larger than total_size and why free_size can go negative; analyze how alloc_size is derived in the code, i.e. the shard_size calculation flow.
From the query against v_store_free_size inside the function select_suitable_store_tray, the conductor relies solely on this database SELECT to filter nodes by remaining free space when choosing storage nodes for allocation:
if (hostId == -1) {
    list = S5Database.getInstance()
            .sql("select * from v_store_free_size as s "
                    + "where s.status='OK' "
                    + " order by free_size desc limit ? ", replica_count).transaction(trans)
            .results(HashMap.class);
} else {
    if (replica_count > 1) {
        list = S5Database.getInstance()
                .sql("select * from v_store_free_size as s "
                        + "where s.status='OK' and store_id!=?"
                        + " order by free_size desc limit ? ", hostId, replica_count - 1).transaction(trans)
                .results(HashMap.class);
    }
    List<HashMap> list2 = S5Database.getInstance()
            .sql("select * from v_store_free_size as s "
                    + "where s.status='OK' "
                    + " and store_id=? ", hostId).transaction(trans)
            .results(HashMap.class);
    if (list2 == null || list2.size() == 0) {
        throw new InvalidParamException("Can't find specified store ID: " + hostId);
    }
    list.add(0, list2.get(0));
}
- According to the statement in jconductor/res/init_s5metadb.sql that creates the v_store_alloc_size view:
create view v_store_alloc_size as select store_id, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;
v_store_alloc_size is built from the two tables t_volume and t_replica: per store, it is the sum of the volumes' shard_size over the replicas placed on that store.
** Note: the code in the docker image differs from the conductor source here; the docker version sums the volume size instead:
create view v_store_alloc_size as select store_id, sum(size) as alloc_size from t_volume, t_replica where t_volume.id=t_replica.volume_id group by t_replica.store_id;
- The statement that creates v_tray_alloc_size is:
select t_replica.store_id as store_id, tray_uuid, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id = t_replica.volume_id group by t_replica.tray_uuid, t_replica.store_id;
So tray_alloc_size is likewise a sum of shard_size.
According to the following line in /home/flyslice/yangxiao/cocalele/PureFlash/pfs/src/pf_flash_store.cpp, when pfs allocates storage space for a replica (recovery_replica), the first N-1 shards all use the standard shard size (64G), while the last shard may take only the remaining space if the total volume size is not a multiple of the standard size, so that the shard sizes sum exactly to the volume's total size:
int64_t shard_size = std::min<int64_t>(SHARD_SIZE, vol->size - rep_id.shard_index()*SHARD_SIZE);
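To illustrate the quoted formula, here is a minimal sketch (illustrative only, not PureFlash code; SHARD_SIZE = 64 GiB per the standard size mentioned above):

// Minimal sketch of the quoted shard-size formula: each shard gets
// min(SHARD_SIZE, remaining), so only the last shard can be smaller.
public class ShardSplitSketch {
    static final long SHARD_SIZE = 64L << 30; // 64 GiB, the standard shard size

    public static void main(String[] args) {
        long volSize = 200L << 30; // example: a 200 GiB volume (not a multiple of 64 GiB)
        long shardCount = (volSize + SHARD_SIZE - 1) / SHARD_SIZE; // ceil division -> 4 shards
        for (long i = 0; i < shardCount; i++) {
            long shardSize = Math.min(SHARD_SIZE, volSize - i * SHARD_SIZE);
            System.out.println("shard " + i + " size = " + shardSize);
        }
        // prints 64 GiB, 64 GiB, 64 GiB, 8 GiB; the sizes sum to exactly 200 GiB
    }
}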
MariaDB [s5]> select t_replica.store_id as store_id, tray_uuid, sum(t_volume.shard_size) as alloc_size from t_volume, t_replica where t_volume.id = t_replica.volume_id group by t_replica.tray_uuid, t_replica.store_id;
+----------+--------------------------------------+---------------+
| store_id | tray_uuid | alloc_size |
+----------+--------------------------------------+---------------+
| 1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 3092376453120 |
| 2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2680059592704 |
| 2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 | 1168231104512 |
| 3 | b4f726d1-fffb-444a-89f9-814314acc680 | 2336462209024 |
| 1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 3092376453120 |
+----------+--------------------------------------+---------------+
5 rows in set (0.001 sec)
MariaDB [s5]> select * from v_tray_alloc_size;
+----------+--------------------------------------+---------------+
| store_id | tray_uuid | alloc_size |
+----------+--------------------------------------+---------------+
| 1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 3092376453120 |
| 2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2680059592704 |
| 2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 | 1168231104512 |
| 3 | b4f726d1-fffb-444a-89f9-814314acc680 | 2336462209024 |
| 1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 3092376453120 |
+----------+--------------------------------------+---------------+
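Note that every alloc_size above is an exact multiple of the 64 GiB standard shard size (e.g. 3092376453120 = 45 x 64 GiB), which is consistent with alloc_size being a sum of shard_size values rather than of data actually written. A quick check (illustrative snippet, not project code):

// Verify that the captured alloc_size values are exact multiples of 64 GiB.
public class AllocSizeCheck {
    public static void main(String[] args) {
        long shard = 64L << 30; // 68719476736 bytes
        long[] alloc = {3092376453120L, 2680059592704L, 1168231104512L, 2336462209024L};
        for (long a : alloc) {
            System.out.println(a + " = " + (a / shard) + " x 64GiB, remainder " + (a % shard));
        }
        // remainders are all 0: 45, 39, 17 and 34 shards respectively
    }
}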
The v_tray_total_size view in the current environment looks like this:
MariaDB [s5]> select * from v_tray_total_size where status = 'OK';
+----------+--------------------------------------+---------------+--------+
| store_id | tray_uuid | total_size | status |
+----------+--------------------------------------+---------------+--------+
| 1 | 7054366f-c8f1-404f-925e-09bb04288df1 | 8001563222016 | OK |
| 2 | 88c73032-9b6b-493c-b826-2fccb4e245dd | 2048408248320 | OK |
| 2 | ae6582a5-fd11-446c-a8de-22eb7a8a6540 | 500107862016 | OK |
| 3 | b4f726d1-fffb-444a-89f9-814314acc680 | 1000204886016 | OK |
| 1 | cd7d26e6-9d99-4ea9-b31d-bc57b2a6c43c | 2048408248320 | OK |
| 1 | d46569d5-a82a-47a3-b34d-608c8afb5e06 | 8001563222016 | OK |
+----------+--------------------------------------+---------------+--------+
The in-use disks in the table above (those that also appear in v_tray_alloc_size, i.e. the trays serving as storage) are 7054366f, 88c73032, ae6582a5, b4f726d1 and d46569d5; the current problem is visible right here: for several of these trays the total size is smaller than the already-allocated size.
- The statement that creates v_tray_total_size is:
select store_id, uuid as tray_uuid, raw_capacity as total_size, status from t_tray;
From this SQL it is clear that raw_capacity is taken directly as the total size.
According to the following code in jconductor/src/com/netbric/s5/conductor/handler/StoreHandler.java, in the add_storenode function, when a new tray is defined its raw_capacity is set to 8T:
Tray t = new Tray();
for (int i = 0; i < 20; ++i)
{
    t.device = "Tray-" + i;
    t.status = Status.OK;
    t.raw_capacity = 8L << 40;
    t.store_id = n.id;
    S5Database.getInstance().insert(t);
}
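For reference, 8L << 40 = 8 * 2^40 = 8796093022208 bytes, i.e. 8 TiB; each of the 20 trays created in this loop gets this fixed raw_capacity, independent of the size of any physical disk.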
In the updateStoreTrays function of /home/flyslice/yangxiao/cocalele/jconductor/src/com/netbric/s5/cluster/ClusterManager.java, the tray's raw_capacity is updated with the following statement:
tr.raw_capacity = Long.parseLong(new String(zk.getData(zkBaseDir + "/stores/"+store_id+"/trays/"+t+"/capacity", false, null)));
From the context this should be defined by the actual disk capacity: the concrete value of tr.raw_capacity depends entirely on the numeric string stored at the corresponding ZooKeeper node, i.e. the value as seen by the leader conductor node. (?)
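As a minimal sketch of that read (assuming the znode layout shown later, /pureflash/<cluster>/stores/<id>/trays/<uuid>/capacity, and using the plain ZooKeeper client API; this is illustrative, not the jconductor code):

import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: read a tray's capacity string from ZooKeeper and parse it,
// mirroring the updateStoreTrays statement quoted above.
public class ReadTrayCapacity {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, null);
        // example path following the layout seen later in zkCli
        String path = "/pureflash/cluster1/stores/1/trays/211558c6-c024-4bb9-9a0c-398f0959dbf7/capacity";
        long rawCapacity = Long.parseLong(new String(zk.getData(path, false, null)));
        System.out.println("raw_capacity from zk = " + rawCapacity);
        zk.close();
    }
}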
From ps -ef | grep zookeeper, the ZooKeeper instance in this environment logs to zookeeper-root-server-node1.out under /opt/apache-zookeeper-3.7.2-bin/bin/../logs, and its config file is /opt/apache-zookeeper-3.7.2-bin/bin/../conf/zoo.cfg:
flyslice@node1:~/yangxiao$ cat /opt/apache-zookeeper-3.7.2-bin/bin/../conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# list of cluster servers
server.1=192.168.61.229:2888:3888
server.2=192.168.61.143:2888:3888
server.3=192.168.61.122:2888:3888
Note, however, that this file is only ZooKeeper's own runtime configuration; the tray capacity data should live under the znode paths of the ZooKeeper cluster. The tray total space is computed from raw_capacity, and that value appears to be obtained by reading capacity from zk.
How to inspect the ZooKeeper paths:
# /opt/apache-zookeeper-3.7.2-bin/bin/zkCli.sh -h
/opt/apache-zookeeper-3.7.2-bin/bin/zkCli.sh -server 127.0.0.1:2181
After connecting to zk you can browse the current paths and so on.

[zk: 127.0.0.1:2181(CONNECTED) 2] /pureflash
(Entering a bare path like this is not a valid command, so zkCli prints its full usage help; the commands relevant here are ls [-R] path and get path.)
Therefore, the command to look up a capacity value is as follows:

get /pureflash/cluster1/stores/1/trays/211558c6-c024-4bb9-9a0c-398f0959dbf7/capacity
Analysis of the SQL statement:
create view v_tray_free_size as select t.store_id as store_id, t.tray_uuid as tray_uuid, t.total_size as total_size,
COALESCE(a.alloc_size,0) as alloc_size , t.total_size-COALESCE(a.alloc_size,0) as free_size, t.status as status from v_tray_total_size as t left join v_tray_alloc_size as a on t.store_id=a.store_id and t.tray_uuid=a.tray_uuid order by free_size desc;
This creates a view named v_tray_free_size that reports, for each tray, its total capacity, allocated capacity and remaining capacity.
- v_tray_total_size is aliased as t and v_tray_alloc_size as a; the two are joined on their common store_id/tray_uuid columns, and the result is ordered by free_size in descending order.
- The left join keeps every record of the left table v_tray_total_size (the right table's extra column alloc_size is merged in when a record matches on both store_id and tray_uuid; the left table's extra column status is kept, as are the common columns total_size, store_id and tray_uuid). When the right table has a record matching both store_id and tray_uuid, its alloc_size (allocated capacity) is merged into the result; when there is no matching record (for example a tray that has never had space allocated), alloc_size would be NULL, but COALESCE(a.alloc_size, 0) converts it to 0, so free_size is still computed correctly (total capacity - 0 = total capacity).
An example to aid understanding:
- The left table v_tray_total_size (t) has two records:
| store_id | tray_uuid | total_size | status |
| s1 | t1 | 1000 | OK |
| s1 | t2 | 2000 | OK |
- The right table v_tray_alloc_size (a) has only one record (t1 has an allocation, t2 does not):
| store_id | tray_uuid | alloc_size |
| s1 | t1 | 300 |
- After the LEFT JOIN, the result is:
| store_id | tray_uuid | total_size | alloc_size (merged from right table) | free_size | status |
| s1 | t1 | 1000 | 300 (matched in right table) | 700 | OK |
| s1 | t2 | 2000 | 0 (no match, handled by COALESCE) | 2000 | OK |
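Applying the same formula to the data captured earlier shows how free_size goes negative: for tray ae6582a5 the total_size is 500107862016 while the alloc_size is 1168231104512, so free_size = 500107862016 - 1168231104512 = -668123242496. The same holds for trays 88c73032 and b4f726d1, which is exactly the "alloc_size greater than total_size / negative free_size" symptom stated at the top.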
Where alloc_size and shard_size are involved
- Besides database initialization (init), this space accounting is used when adding/removing store nodes (add_storenode) and when creating new volumes; are there any other places it is involved?
- The shard_size calculation seems to appear only in do_create_volume and do_create_pfs2: ... v.shard_size = Config.DEFAULT_SHARD_SIZE; ... long shardCount = (v.size + v.shard_size - 1) / v.shard_size; ... assert(v.shard_size == 1L<<reply.shard_lba_cnt_order); ... The system appears to use pre-allocation; there is a prepare_volume method.
- It looks like the allocation accounting always reserves a fixed 64G per shard regardless of how much space is actually occupied, in both the source build and the docker build; in neither case is it based on how much data has actually been written. A sketch of this accounting follows below.
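A minimal sketch of that accounting (assuming DEFAULT_SHARD_SIZE = 64 GiB as stated above, and one t_replica row per shard replica, which is how the alloc views appear to sum; names and structure here are illustrative, not the actual jconductor code):

// Sketch: how shard-based accounting can report far more allocated space
// than the data a volume actually holds, since every shard is counted at
// the full 64 GiB shard_size and summed once per replica row.
public class AllocAccountingSketch {
    static final long DEFAULT_SHARD_SIZE = 64L << 30; // 64 GiB

    public static void main(String[] args) {
        long volSize = 10L << 30;      // a 10 GiB volume
        int replicaCount = 3;          // three replicas per shard (assumption for illustration)
        long shardSize = DEFAULT_SHARD_SIZE;
        long shardCount = (volSize + shardSize - 1) / shardSize;   // ceil -> 1 shard
        long accounted = shardCount * shardSize * replicaCount;    // 3 x 64 GiB = 192 GiB
        System.out.println("volume size     = " + volSize);
        System.out.println("accounted alloc = " + accounted);
        // alloc_size grows by a full 64 GiB per replica row even though the volume is only
        // 10 GiB, so on small trays the summed alloc_size can exceed total_size,
        // which is how free_size ends up negative.
    }
}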