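pfconductor code analysis
Goal/task: Review com/netbric/s5/conductor/handler/S5RestfulHandler.java to see which commands it exposes; for any command that pfcli does not implement yet, work out how it should be implemented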
S5RestfulHandler.java
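This is a service interface through which the backend can operate on volumes, shards, replicas, and so on
Trying to add a new interface 'list_port'
Add the following code to the CliMain file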
static void cmd_list_port(Namespace cmd, Config cfg) throws Exception {
    // ListNodePortReply r = SimpleHttpRpc.invokeConductor(cfg, "list_port", "192.168.61.143", ListNodePortReply.class);
    ListNodePortReply r = SimpleHttpRpc.invokeConductor(cfg, "list_port", ListNodePortReply.class);
    if (r.retCode == RetCode.OK)
        logger.info("Succeed list_port");
    else
        throw new IOException(String.format("Failed to list_port, code:%d, reason:%s", r.retCode, r.reason));
    // Print one table row per port returned by the conductor
    String[] header = { "IP Address", "Store Id", "Status" };
    String[][] data = new String[r.Ports.size()][];
    for (int i = 0; i < r.Ports.size(); i++) {
        data[i] = new String[]{ r.Ports.get(i).ip_addr, Long.toString(r.Ports.get(i).store_id), r.Ports.get(i).status };
    }
    ASCIITable.getInstance().printTable(header, data);
}
This requires adding the corresponding Java class ListNodePortReply
/home/flyslice/yangxiao/cocalele/jconductor/src/com/netbric/s5/conductor/rpc/ListNodePortReply.java
package com.netbric.s5.conductor.rpc;

import com.netbric.s5.orm.Port;
import java.util.List;

public class ListNodePortReply extends RestfulReply {
    public List<Port> Ports;

    public ListNodePortReply(String op) {
        super(op);
    }

    public ListNodePortReply(String op, int retCode, String reason) {
        super(op, retCode, reason);
    }

    public ListNodePortReply(String op, List<Port> Ports) {
        super(op);
        this.Ports = Ports;
    }
}
With these changes, the build succeeds
ant -f jconductor.xml
But running it still fails (10.15, to be resolved):
root@node2:/home/flyslice/yangxiao/cocalele/jconductor# ./pfcli list_port
[main] ERROR com.netbric.s5.conductor.rpc.SimpleHttpRpc - Failed http GET http://192.168.61.229:49180/s5c/?op=list_port
java.io.IOException: Failed RPC invoke, code:2, reason:Invalid argument: node_name
at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeGET(SimpleHttpRpc.java:60)
at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeConductor(SimpleHttpRpc.java:90)
at com.netbric.s5.cli.CliMain.cmd_list_port(CliMain.java:357)
at com.netbric.s5.cli.CliMain$4.run(CliMain.java:134)
at com.netbric.s5.cli.CliMain.main(CliMain.java:211)
[main] ERROR com.netbric.s5.cli.CliMain - Failed: Failed RPC invoke, code:2, reason:Invalid argument: node_name
Approach: add the node_name parameter
ListNodePortReply r = SimpleHttpRpc.invokeConductor(cfg, "list_port", ListNodePortReply.class,
"node_name", "192.168.61.143");
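It then turned out that the server-side code in jconductor/src/com/netbric/s5/conductor/handler/StoreHandler.java looks the store up by node_name, but no node names are configured in the current environment (the name column in t_store is NULL, see below). So I tried switching the lookup to id. Using ip would also work, but id is what uniquely identifies a machine, so id is the more accurate choice.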

MariaDB [s5]> select * from t_store;
+----+------+------+-------+----------------+--------+
| id | name | sn | model | mngt_ip | status |
+----+------+------+-------+----------------+--------+
| 1 | NULL | NULL | NULL | 192.168.61.229 | OK |
| 2 | NULL | NULL | NULL | 192.168.61.143 | OK |
| 3 | NULL | NULL | NULL | 192.168.61.122 | OK |
+----+------+------+-------+----------------+--------+
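The effect of the NULL name column can be illustrated with a small in-memory sketch. This is not the StoreHandler code: the `Store` class, `findByName` and `findById` are hypothetical names used only for illustration, and the rows are copied from the t_store output above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class StoreLookupSketch {
    /** Hypothetical in-memory stand-in for one row of t_store. */
    static final class Store {
        final long id;
        final String name;   // NULL in the current environment
        final String mngtIp;
        final String status;

        Store(long id, String name, String mngtIp, String status) {
            this.id = id;
            this.name = name;
            this.mngtIp = mngtIp;
            this.status = status;
        }
    }

    // Rows copied from the "select * from t_store" output in these notes.
    static final List<Store> STORES = Arrays.asList(
            new Store(1, null, "192.168.61.229", "OK"),
            new Store(2, null, "192.168.61.143", "OK"),
            new Store(3, null, "192.168.61.122", "OK"));

    /** Lookup by name: finds nothing while every name is NULL. */
    static Optional<Store> findByName(String name) {
        return STORES.stream()
                .filter(s -> s.name != null && s.name.equals(name))
                .findFirst();
    }

    /** Lookup by id: works regardless of whether names are configured. */
    static Optional<Store> findById(long id) {
        return STORES.stream().filter(s -> s.id == id).findFirst();
    }
}
```

With this data, a name-based lookup for "192.168.61.143" comes back empty, while `findById(2)` returns the row whose mngt_ip is 192.168.61.143.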
- Add the id parameter
String id = cmd.getString("i");
ListNodePortReply r = SimpleHttpRpc.invokeConductor(cfg, "list_port", ListNodePortReply.class, "id", id);
- Corresponding CLI argument
./pfcli list_port -i 1
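To make the mapping from the invokeConductor arguments to the HTTP request easier to see, here is a minimal sketch of how alternating key/value arguments could end up in the query string. It is not the real SimpleHttpRpc implementation: `ConductorUrlSketch` and `buildUrl` are hypothetical names, and the port 49180 and the /s5c/ path are simply taken from the URLs printed in the error logs in these notes.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ConductorUrlSketch {
    /**
     * Builds a GET URL from an op name plus alternating key/value arguments.
     * Hypothetical helper; host, port and path mirror the logged URLs.
     */
    static String buildUrl(String conductorIp, String op, String... kv) {
        if (kv.length % 2 != 0) {
            throw new IllegalArgumentException("arguments must be key/value pairs");
        }
        StringBuilder sb = new StringBuilder("http://")
                .append(conductorIp)
                .append(":49180/s5c/?op=")
                .append(encode(op));
        for (int i = 0; i < kv.length; i += 2) {
            sb.append('&').append(encode(kv[i])).append('=').append(encode(kv[i + 1]));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

Under these assumptions, `buildUrl("192.168.61.229", "list_port", "id", "1")` yields the same URL shape that appears in the log once the id parameter is passed, which is a quick way to confirm on the client side which parameters actually reach the conductor.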
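The changes to the server-side file jconductor/src/com/netbric/s5/conductor/handler/StoreHandler.java did not take effect. I suspected that, because this code runs inside the pfc service, the pfconductor service has to be restarted before the changes apply.
- After restarting the pfconductor server side, the error was still reported
The error is as follows: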
root@node2:/home/flyslice/yangxiao/cocalele/jconductor# ./pfcli list_port
[main] ERROR com.netbric.s5.conductor.rpc.SimpleHttpRpc - Failed http GET http://192.168.61.229:49180/s5c/?op=list_port&id=1
java.io.IOException: Failed RPC invoke, code:2, reason:Invalid argument: node_name
at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeGET(SimpleHttpRpc.java:60)
at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeConductor(SimpleHttpRpc.java:90)
at com.netbric.s5.cli.CliMain.cmd_list_port(CliMain.java:357)
at com.netbric.s5.cli.CliMain$4.run(CliMain.java:134)
at com.netbric.s5.cli.CliMain.main(CliMain.java:211)
[main] ERROR com.netbric.s5.cli.CliMain - Failed: Failed RPC invoke, code:2, reason:Invalid argument: node_name
- Start pfconductor
source /home/flyslice/yangxiao/cocalele/jconductor/env-pfc.sh
nohup pfc -c /etc/pureflash/pfc.conf &
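Current problem: pfcli keeps failing with [main] ERROR com.netbric.s5.cli.CliMain - Failed: Failed RPC invoke, code:2, reason:Invalid argument: node_name. I modified the only place in the code that contains the string "Invalid argument: node_name" and recompiled, yet the error did not change, and I could not find where it was coming from. The log does show that the HTTP request now passes id (because CliMain was changed), but the corresponding server-side code still appears to be the old version.
Root cause: the change had not been made on the leader conductor server
Steps to update the server side:
- Update the backend code and recompile
- Kill the conductor server process on all three servers
- Restart the service, starting on 229. By default there is only one leader, and the node that starts the service first becomes the leader; if the process on the leader is killed, leadership moves to another server
- To keep 229 as the leader, the steps above have to be carried out on all three servers together
For example: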
root@node2:/home/flyslice/yangxiao/cocalele/jconductor# ps -ef | grep pfc
root 1168411 1128129 0 15:44 pts/0 00:00:01 java -classpath /home/flyslice/yangxiao/cocalele/jconductor/out/production/jconductor:/home/flyslice/yangxiao/cocalele/jconductor/lib/* -Dorg.slf4j.simpleLogger.showDateTime=true -Dorg.slf4j.simpleLogger.dateTimeFormat=[yyyy/MM/dd H:mm:ss.SSS] -XX:+HeapDumpOnOutOfMemoryError com.netbric.s5.conductor.Main -c /etc/pureflash/pfc.conf
root 1168483 1128129 0 15:59 pts/0 00:00:00 grep --color=auto pfc
kill 1168411
# Start pfconductor: run on each of the three nodes
source /home/flyslice/yangxiao/cocalele/jconductor/env-pfc.sh
nohup pfc -c /etc/pureflash/pfc.conf &
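The leader behaviour described in the update steps (first conductor to start leads, killing the leader hands leadership to another node) can be modelled with a small in-memory sketch. This is only a model under the assumption that leadership goes to the longest-running live conductor; it does not use ZooKeeper and is not the ClusterManager implementation. `LeaderElectionSketch`, `start`, `kill` and `leader` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class LeaderElectionSketch {
    // Live conductors in the order they were started; the head is the leader.
    private final Deque<String> liveConductors = new ArrayDeque<>();

    /** A conductor process starts and joins the back of the queue. */
    void start(String ip) {
        if (!liveConductors.contains(ip)) {
            liveConductors.addLast(ip);
        }
    }

    /** A conductor process is killed and leaves the queue. */
    void kill(String ip) {
        liveConductors.remove(ip);
    }

    /** Current leader, or null when no conductor is running. */
    String leader() {
        return liveConductors.peekFirst();
    }
}
```

In this model, killing and restarting only 229 leaves leadership with whichever node took over, so requests keep hitting code that was never updated. Killing all three and starting 229 first makes 229 the leader again, which matches the procedure in the steps above.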
- After updating 229 the change took effect, and a new problem appeared
(2025.10.16)
flyslice@node1:~/yangxiao/cocalele/jconductor$ ./pfcli list_port -i 1
cmd_list_port
[main] ERROR com.netbric.s5.conductor.rpc.SimpleHttpRpc - Failed http GET http://192.168.61.229:49180/s5c/?op=list_port&id=1
java.io.IOException: Failed RPC invoke, code:4, reason:Cannot run program "c:/eclipse/plink.exe" (in directory "."): error=2, 没有那个文件或目录
at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeGET(SimpleHttpRpc.java:60)
at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeConductor(SimpleHttpRpc.java:90)
at com.netbric.s5.cli.CliMain.cmd_list_port(CliMain.java:359)
at com.netbric.s5.cli.CliMain$4.run(CliMain.java:135)
at com.netbric.s5.cli.CliMain.main(CliMain.java:212)
[main] ERROR com.netbric.s5.cli.CliMain - Failed: Failed RPC invoke, code:4, reason:Cannot run program "c:/eclipse/plink.exe" (in directory "."): error=2, 没有那个文件或目录
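StoreHandler.java / TenantHandler.java / VolumeHandler.java under handler
The comment in StoreHandler.java says "backend handler of CLI s5_add_store_node.py", so it was presumably adapted from the original s5_add_store_node.py. It is the backend for adding, deleting, and checking store nodes
Likewise, TenantHandler.java handles tenants and VolumeHandler.java handles volumes
Handling of the recovery flow
Recovery works as follows: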
./pfcli recovery_volume -v test_v1
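For volume test_v1, find the shards whose status is not OK
Iterate over the replicas in each of those shards and pick the slave replicas whose status is not OK
Send recovery_replica to the pfstore on the machine that hosts each such slave replica, so that it copies data from the primary and restores that slave replica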
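The three recovery steps above can be sketched as a plain loop over an in-memory model. This is a sketch under assumptions, not the conductor code: the `Shard` and `Replica` classes, the `planRecovery` method and the "ERROR" status value are hypothetical, and instead of sending an RPC to pfstore the method only returns a description of each recovery_replica request it would send.

```java
import java.util.ArrayList;
import java.util.List;

final class RecoverySketch {
    /** Hypothetical replica model: hosting store, role, and status. */
    static final class Replica {
        final String storeIp;
        final boolean primary;
        final String status;

        Replica(String storeIp, boolean primary, String status) {
            this.storeIp = storeIp;
            this.primary = primary;
            this.status = status;
        }
    }

    /** Hypothetical shard model: index within the volume, status, replicas. */
    static final class Shard {
        final int index;
        final String status;
        final List<Replica> replicas;

        Shard(int index, String status, List<Replica> replicas) {
            this.index = index;
            this.status = status;
            this.replicas = replicas;
        }
    }

    /** Lists the recovery_replica requests that would be sent for one volume. */
    static List<String> planRecovery(List<Shard> shards) {
        List<String> requests = new ArrayList<>();
        for (Shard shard : shards) {
            if ("OK".equals(shard.status)) {
                continue; // step 1: only shards that are not OK
            }
            for (Replica replica : shard.replicas) {
                // step 2: only slave replicas that are not OK
                if (!replica.primary && !"OK".equals(replica.status)) {
                    // step 3: the store hosting this replica would receive the request
                    requests.add("recovery_replica store=" + replica.storeIp
                            + " shard=" + shard.index);
                }
            }
        }
        return requests;
    }
}
```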
251020
Code analysis (2)
Main flow walkthrough
The conductor initialization flow is as follows:

ClusterManager.zkBaseDir = "/pureflash/"+clusterName;
ClusterManager.registerAsConductor(managmentIp, zkIp);
ClusterManager.waitToBeMaster(managmentIp);
S5Database.getInstance().init(cfg);
ClusterManager.zkHelper.createZkNodeIfNotExist(ClusterManager.zkBaseDir + "/stores", null);
ClusterManager.watchStores();
ClusterManager.updateStoresFromZk();
ClusterManager.zkHelper.createZkNodeIfNotExist(ClusterManager.zkBaseDir + "/shared_disks", null);
ClusterManager.watchSharedDisks();
ClusterManager.updateSharedDisksFromZk();
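- Explanation of the code above:
Set the zk base directory zkBaseDir to "/pureflash/" + clusterName
Register this process as a conductor
Wait to become the leader (master) conductor
S5Database.getInstance().init(cfg): calls init on the instance returned by getInstance(), which initializes the database
Create the stores node in zk if it does not exist
Add a zk watch on stores (watchStores)
Load the current stores from zk (updateStoresFromZk)
(shared) Create the shared_disks node in zk if it does not exist
(shared) Add a zk watch on shared disks (watchSharedDisks)
(shared) Load the current shared disks from zk (updateSharedDisksFromZk)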
The getInstance() method returns an S5Database instance, that is, an instance of the S5Database class (jconductor/src/com/netbric/s5/orm/S5Database.java).
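The getInstance().init(cfg) call shape is typical of a singleton. The sketch below shows that pattern in isolation, assuming S5Database follows it; the notes only confirm that getInstance() returns an S5Database instance, so whether it is always the same instance is an assumption. `DatabaseSingletonSketch` is a hypothetical class, and its `init` takes a plain string rather than the real Config object.

```java
final class DatabaseSingletonSketch {
    // The single shared instance, created when the class is loaded.
    private static final DatabaseSingletonSketch INSTANCE = new DatabaseSingletonSketch();

    private boolean initialized;
    private String dbName;

    // Private constructor: callers must go through getInstance().
    private DatabaseSingletonSketch() {
    }

    /** Returns the shared instance; every caller gets the same object. */
    static DatabaseSingletonSketch getInstance() {
        return INSTANCE;
    }

    /** Stand-in for init(cfg): records the settings and marks the instance ready. */
    DatabaseSingletonSketch init(String dbName) {
        this.dbName = dbName;
        this.initialized = true;
        return this;
    }

    boolean isInitialized() {
        return initialized;
    }

    String dbName() {
        return dbName;
    }
}
```

If the real class works this way, init only needs to run once during conductor startup, and later callers of getInstance() see the already initialized object.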