I am writing this post to share my experience of scheduling a maintenance activity on a NetApp FAS3270 running Clustered Data ONTAP. I had to reboot one node that was hosting more than 500 virtual machines across eight ESXi hosts.
When a storage administrator has to schedule a maintenance activity that requires a reboot, such as a firmware or hardware upgrade, he has the following options:
Work hard with Traditional Storage
- Spend several minutes trying to shut down the VMs on all eight ESXi hosts.
- Make sure all VMs are powered off and there is no active I/O, to avoid any application-specific issues.
- Reboot the controller.
- Spend several more hours powering on all 500 virtual machines.
- Spend hours of your weekend trying to complete this maintenance.
Work Smart with Clustered Data ONTAP
- Use Clustered Data ONTAP with LIF migration and SFO (Storage Failover).
- Perform a takeover/giveback of the controller.
- No changes are required in the vSphere infrastructure.
- Migrate the LIFs back to their home node.
- Complete the maintenance within 10-15 minutes during production hours.
This is the procedure that I followed to perform this activity.
My cluster is configured with 515 VMs.
IMPORTANT: You don’t have to make any changes in your vSphere Infrastructure. You do NOT need any downtime for VMs.
The following steps have to be performed on your NetApp storage:
Make sure that the cluster is healthy.
lab-f3270::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
lab-filer1            true    true
lab-filer2            true    true
lab-filer3            true    true
lab-filer4            true    true
4 entries were displayed.
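If you want to script this health check instead of reading the table by eye, here is a minimal Python sketch. It assumes you have already captured the text of `cluster show` (admin privilege, three columns as shown above) by whatever means you use, for example an SSH session. The function name `unhealthy_nodes` is my own and is not part of any NetApp tooling.

```python
def unhealthy_nodes(cluster_show_output: str) -> list[str]:
    """Return the nodes whose Health or Eligibility column is not 'true'.

    Assumes the three-column layout shown in this post
    (Node, Health, Eligibility); other layouts are not handled.
    """
    bad = []
    for line in cluster_show_output.splitlines():
        fields = line.split()
        # Data rows have exactly three fields and a boolean in column two;
        # this skips the header, the dashed rule and the "entries" footer.
        if len(fields) != 3 or fields[1] not in ("true", "false"):
            continue
        node, health, eligibility = fields
        if health != "true" or eligibility != "true":
            bad.append(node)
    return bad
```

An empty list means every node reported `true` in both columns, which is the state you want before starting the maintenance.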
Check the Storage Failover settings
lab-f3270::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
lab-filer1     lab-filer2     true     Connected to lab-filer2
lab-filer2     lab-filer1     true     Connected to lab-filer1
lab-filer3     lab-filer4     true     Connected to lab-filer4
lab-filer4     lab-filer3     true     Connected to lab-filer3
4 entries were displayed.
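The same idea works for the failover table. The sketch below, again in Python and again operating on captured text, lists every node whose "Takeover Possible" column is anything other than `true`, together with its state description. It assumes each node occupies a single line as in the output above; wrapped or multi-line state descriptions are not handled. The name `takeover_blockers` is hypothetical.

```python
def takeover_blockers(failover_show_output: str) -> dict[str, str]:
    """Map node name -> state description for rows where takeover is not possible.

    Assumes one row per node with the columns
    Node, Partner, Takeover Possible, State Description.
    """
    blockers = {}
    for line in failover_show_output.splitlines():
        # Split into at most four parts so the description keeps its spaces.
        fields = line.split(None, 3)
        # Accept only rows whose third column looks like a takeover flag;
        # this skips the two header lines, the dashed rule and the footer.
        if len(fields) < 4 or fields[2] not in ("true", "false", "-"):
            continue
        node, _partner, possible, state = fields
        if possible != "true":
            blockers[node] = state.strip()
    return blockers
```

If the returned dictionary is empty, every HA pair reported that takeover is possible, and it is reasonable to continue.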
Enable Advanced mode
lab-f3270::> set adv
Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y
Check how many LIFs are currently on this node.
lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Lab_Vserver
            nfs_lif04  up/up      192.168.40.244/24  lab-filer4    i0a-400 true
Migrate all LIFs on this node to another node in the cluster, then confirm that none remain on it.
lab-f3270::*> network interface migrate-all -node lab-filer4
lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
There are no entries matching your query.
IMPORTANT: Create LIF failover groups so that LIFs migrate seamlessly during a link failure or a takeover. In this post I have shown the steps to migrate the LIFs manually in case you have not configured failover groups. I encourage you to configure them; refer to the Clustered Data ONTAP® 8.2 High-Availability Configuration Guide for detailed information.
Initiate the takeover of the controller to reboot it.
lab-f3270::*> storage failover takeover -ofnode lab-filer4
The controller now reboots
lab-filer4% Waiting for PIDS: /usr/sbin/ypbind 722.
Waiting for PIDS: /usr/sbin/rpcbind 688.
Terminated
.
Uptime: 112d2h54m45s
Top Shutdown Times (ms): {if_reset=1161, shutdown_wafl=223(multivol=0, sfsr=0, abort_scan=0, snapshot=0, start=62, sync1=77, sync2=4, mark_fs=80), wafl_sync_tagged=148, shutdown_raid=28, iscsimgt_notify_shutdown_appliance=22, shutdown_fm=15}
Shutdown duration (ms): {CIFS=2607, NFS=2607, ISCSI=2585, FCP=2585}
HALT: HA partner has taken over (ic) on Fri Jan 24 04:08:38 EST 2014
System rebooting...
Once the reboot is complete and the storage is ready for giveback, initiate the giveback for this controller.
lab-f3270::*> storage failover giveback -ofnode lab-filer4
Info: Run the storage failover show-giveback command to check giveback status.
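Knowing when the node is "ready for giveback" is the one step that involves waiting. Here is a hedged Python sketch of a polling loop. It assumes that the rebooted node's State Description in `storage failover show` contains the phrase "Waiting for giveback" once it is ready; that phrase is my assumption and is not shown in this post, so check it against your own output. `fetch_output` is a hypothetical callable that you supply and that returns the current text of `storage failover show`.

```python
import time
from typing import Callable, Optional


def node_state(failover_show_output: str, node: str) -> Optional[str]:
    """Return the State Description for `node`, or None if it has no row."""
    for line in failover_show_output.splitlines():
        fields = line.split(None, 3)  # Node, Partner, Possible, Description
        if len(fields) == 4 and fields[0] == node:
            return fields[3].strip()
    return None


def wait_for_giveback_ready(
    fetch_output: Callable[[], str],
    node: str,
    marker: str = "Waiting for giveback",  # assumed state text, verify it
    timeout_s: float = 900.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the node's state contains `marker`; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = node_state(fetch_output(), node)
        if state is not None and marker in state:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays. The 15-minute default timeout simply mirrors the maintenance window mentioned in this post.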
Revert the LIF to its home node.
lab-f3270::*> network interface revert -vserver Lab_Vserver -lif nfs_lif04
lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Lab_Vserver
            nfs_lif04  up/up      192.168.40.244/24  lab-filer4    i0a-400 true
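With more than a handful of LIFs it is easy to miss one that did not go home. The following Python sketch scans captured `network interface show` output and returns every LIF whose "Is Home" column is `false`. It assumes the column layout shown above, with the LIF row ending in address, node, port and the home flag. The name `lifs_not_home` is hypothetical.

```python
def lifs_not_home(interface_show_output: str) -> list[tuple[str, str]]:
    """Return (lif, current_node) pairs for LIFs that are not on their home node.

    Works whether the Vserver name is on its own line (as in this post)
    or on the same line as the LIF, because only the last six columns are read.
    """
    away = []
    for line in interface_show_output.splitlines():
        fields = line.split()
        # LIF rows end with a true/false "Is Home" flag; headers, the
        # dashed rule and the Vserver-only line do not.
        if len(fields) < 6 or fields[-1] not in ("true", "false"):
            continue
        lif, _status, _address, node, _port, is_home = fields[-6:]
        if is_home == "false":
            away.append((lif, node))
    return away
```

For the output above the result is an empty list, since `nfs_lif04` is back on lab-filer4 with Is Home set to `true`.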
Make sure that the cluster is healthy again.
Within 10-15 minutes, the entire maintenance activity of rebooting the controller and making sure that it was back online was complete.
IMPORTANT: Set up the cluster as per best practices; refer to the Clustered Data ONTAP 8.2 documentation for more information.
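As a recap, the whole procedure can be written down as an ordered command list. The Python sketch below only builds the command strings used in this post for a given node; it does not connect to the cluster or run anything. The function name and its parameters are hypothetical.

```python
def maintenance_commands(node: str, lifs: list[tuple[str, str]]) -> list[str]:
    """Return the CLI commands from this post, in order, for one node.

    `lifs` is a list of (vserver, lif) pairs to revert after giveback.
    """
    show_lifs = (
        "network interface show -data-protocol nfs|iscsi|fcp "
        f"-curr-node {node}"
    )
    commands = [
        "cluster show",                                 # cluster healthy?
        "storage failover show",                        # takeover possible?
        "set adv",                                      # advanced privilege
        show_lifs,                                      # LIFs on the node
        f"network interface migrate-all -node {node}",  # move LIFs away
        show_lifs,                                      # expect no entries
        f"storage failover takeover -ofnode {node}",    # node reboots
        # Run the giveback only once the node is ready for it.
        f"storage failover giveback -ofnode {node}",
    ]
    commands += [
        f"network interface revert -vserver {vserver} -lif {lif}"
        for vserver, lif in lifs
    ]
    commands += [show_lifs, "cluster show"]             # final verification
    return commands
```

For example, `maintenance_commands("lab-filer4", [("Lab_Vserver", "nfs_lif04")])` reproduces the sequence I ran by hand. Keeping it as data makes it easy to review the plan before touching production.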