
File systems

ZFS

Zettabyte File System. It is not part of most Linux distributions, but you can use the user-space version zfs-fuse, which seems to work fine (or use OpenZFS).
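
For reference, a minimal way to get zfs-fuse running, assuming your distribution packages it (it existed in the Debian/Ubuntu repositories of that era); the init script name may differ per release:

install zfs-fuse
apt-get install zfs-fuse
/etc/init.d/zfs-fuse start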

iscsi with zfs

You can use zfs-fuse on iSCSI disks, however I had problems with the array after rebooting. If you see that all or most disks are reporting corrupt data or missing labels, what probably happened is that you created the pool like :

create pool
zpool create mypool raidz sdb sdc sdd sde 

However iSCSI does not guarantee that sdb will still be sdb after a reboot. As a matter of fact, you can have exactly the same problem with your BIOS !! The solution is to use the alternate paths for the disks. For instance, your disks are also listed in /dev by id, path or uuid.

disk id

list disks by id
>ls /dev/disk/by-id
....
ata-SAMSUNG_HD322HJ_S17AJ90Q409427-part1
ata-SAMSUNG_HD322HJ_S17AJ90Q409427-part2
scsi-SATA_HDS728080PLA380_PFDB30S2TEAE8M
scsi-SATA_HDS728080PLA380_PFDB30S2TEAE8M-part1
scsi-SATA_HDS728080PLA380_PFDB30S2TEAE8M-part2
....

disk path

list disks by path
>ls /dev/disk/by-path
pci-0000:00:11.0-scsi-0:0:0:0        
pci-0000:00:11.0-scsi-0:0:0:0-part1  
pci-0000:00:11.0-scsi-0:0:0:0-part2 
pci-0000:00:11.0-scsi-0:0:0:0-part3  
pci-0000:00:11.0-scsi-1:0:0:0    

disk uuid

this is a truly unique disk id

list disks by uuid
>ls /dev/disk/by-uuid
3e9b7634-4769-4188-817a-155d434842d4  a37d8d46-3bcf-4f43-871a-34c306031d39
6b2fc0c6-e83a-4053-bbcd-07f938999a55  ae96e8a3-a8d4-44d6-be07-4f6dda206cf4

disk label

Sometimes the disks are also listed by label

show disk label
>ls /dev/disk/by-label
Ubuntu-Serverx2010.10x20i386

I used by-path for the iSCSI disks; they looked like this

output
ip-10.10.1.14:3260-iscsi-iqn.2010-11.net.almende:storage1.disk1-lun-0
ip-10.10.1.14:3260-iscsi-iqn.2010-11.net.almende:storage1.disk1-lun-0-part1
ip-10.10.1.14:3260-iscsi-iqn.2010-11.net.almende:storage1.disk2-lun-1
ip-10.10.1.14:3260-iscsi-iqn.2010-11.net.almende:storage1.disk2-lun-1-part1
ip-10.10.1.15:3260-iscsi-iqn.2010-11.net.almende:storage2.disk1-lun-0
ip-10.10.1.15:3260-iscsi-iqn.2010-11.net.almende:storage2.disk1-lun-0-part1
ip-10.10.1.15:3260-iscsi-iqn.2010-11.net.almende:storage2.disk2-lun-1
...

So that's specific enough not to get switched around.
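
So a pool created with those by-path names survives a reboot. A sketch using the four LUNs listed above (in practice you would list every disk of your array):

create pool with stable names
zpool create mypool raidz \
  /dev/disk/by-path/ip-10.10.1.14:3260-iscsi-iqn.2010-11.net.almende:storage1.disk1-lun-0 \
  /dev/disk/by-path/ip-10.10.1.14:3260-iscsi-iqn.2010-11.net.almende:storage1.disk2-lun-1 \
  /dev/disk/by-path/ip-10.10.1.15:3260-iscsi-iqn.2010-11.net.almende:storage2.disk1-lun-0 \
  /dev/disk/by-path/ip-10.10.1.15:3260-iscsi-iqn.2010-11.net.almende:storage2.disk2-lun-1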

iscsi

iSCSI is SCSI over a network. For Linux admins this says it all: 'you can fdisk a disk on another machine'. And of course you can do almost anything else you would do with anything you can 'fdisk'. I myself used it to create a zfs(-fuse) raidz array over multiple machines ;-). So it might be best to describe how to do that.

setting up iscsi

Of course you need an operating system to run the software, so I chose a simple Debian Lenny netinst. From here on I will be very terse and aim to give just the commands issued, but one thing in advance about the installation: iscsitarget is the server-side software, used to provide iSCSI disks; open-iscsi is the client side, used to access iSCSI disks. In iSCSI the client is often called the initiator, and the server the target.

server

The server is the iSCSI Enterprise Target Daemon, ietd. The admin command is therefore called ietadm.

host info
uname -a

The result of that command will give you a clue about which modules package to install; for instance mine was :

install command
apt-get install iscsitarget iscsitarget-modules-2.6.26-2-686

It will warn about not being started, because it is disabled in /etc/default/iscsitarget. Solve that first by editing the file, or if you're really lazy :

iscsitarget
echo "ISCSITARGET_ENABLE=true" > /etc/default/iscsitarget

At your own risk: it is currently the only line in that file, so the overwrite works fine, but you have been warned. You also have to alter /etc/ietd.conf; it has some reasonable defaults, but of course your disks have to be defined. Take the example entries and alter/copy them. I did it without authentication, so my only alterations to the default section were :

ietd.conf
...
Target iqn.2010-11.net.almende:storage1.disk1
...
Lun 0 Path=/dev/hda1,Type=fileio
...
Target iqn.2010-11.net.almende:storage1.disk2
 Lun 1 Path=/dev/hdb1,Type=fileio

The iqn is the iSCSI qualified name; it has to be globally and chronologically unique. So it was made to look like this :

iqn
iqn.yyyy-mm.reversed.domain:anyidentifier
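
Concretely, the target names used on this setup break down like this:

iqn example
iqn.2010-11.net.almende:storage1.disk1
    2010-11          -> year-month in which the domain was owned
    net.almende      -> the domain name, reversed (almende.net)
    storage1.disk1   -> any identifier of your own choosing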

The yyyy-mm date is any date at which you owned the domain you use. The idea is that you get a unique identifier from an Internet domain at a certain point in time: the domain could be sold to someone else later, in which case the date keeps the name unique. I just used the current date because I know we still own almende.net. Play around with the options if you like, I did not. Of course you decide which partitions (or even files, it seems ?) you want to export as a target. I made two equally sized partitions because they are going to act as part of a zfs array later on. Now restart the target and watch for any errors

restart iscsitarget
/etc/init.d/iscsitarget restart
Removing iSCSI enterprise target devices: succeeded.
Stopping iSCSI enterprise target service: succeeded.
Removing iSCSI enterprise target modules: succeeded.
Starting iSCSI enterprise target service: succeeded.

On the server you can issue some commands like:

view status
cat /proc/net/iet/session 
 tid:2 name:iqn.2010-11.net.almende:storage1.disk2 
 tid:1 name:iqn.2010-11.net.almende:storage1.disk1

/home/kees# cat /proc/net/iet/volume 
 tid:2 name:iqn.2010-11.net.almende:storage1.disk2
       lun:1 state:0 iotype:fileio iomode:wt path:/dev/hdb1
 tid:1 name:iqn.2010-11.net.almende:storage1.disk1
       lun:0 state:0 iotype:fileio iomode:wt path:/dev/hda1

To view the status. Or go on to the client section to access the disks over the network.
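
For completeness, the ietadm command mentioned earlier can also add a target at runtime without restarting the daemon. This is only a sketch based on the IET documentation (the tid 3, target name and /dev/hdc1 are made-up examples), and changes made this way are not written back to ietd.conf:

add a target at runtime
ietadm --op new --tid=3 --params Name=iqn.2010-11.net.almende:storage1.disk3
ietadm --op new --tid=3 --lun=0 --params Path=/dev/hdc1,Type=fileio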

client

The client is also called the initiator, and since you tend to be either a client or a server, there is a different package to install :

install
apt-get install open-iscsi

Now, to spare you some work, set the startup in /etc/iscsi/iscsid.conf to 'automatic', see the auto startup section below. Then run this command to find iSCSI targets :

find scsi targets
iscsiadm -m discovery -t sendtargets -p 10.10.1.14
  • -m means mode, so discovery mode
  • -t is type, sendtargets or st for short
  • -p is portal, so give the ip address of a target

It will return :

output
10.10.1.14:3260,1 iqn.2010-11.net.almende:storage1.disk1
10.10.1.14:3260,1 iqn.2010-11.net.almende:storage1.disk2

After this, the targets are recorded by the initiator, and running discovery mode without arguments will return :

discover
iscsiadm -m discovery 
10.10.1.14:3260 via sendtargets

After doing this for all 3 machines it looks like :

discover
10.10.1.16:3260 via sendtargets
10.10.1.15:3260 via sendtargets
10.10.1.14:3260 via sendtargets

With 'node' mode you can view all nodes :

view nodes
iscsiadm -m node
10.10.1.16:3260,1 iqn.2010-11.net.almende:storage3.disk1
10.10.1.15:3260,1 iqn.2010-11.net.almende:storage2.disk1
10.10.1.14:3260,1 iqn.2010-11.net.almende:storage1.disk2
10.10.1.14:3260,1 iqn.2010-11.net.almende:storage1.disk1
10.10.1.16:3260,1 iqn.2010-11.net.almende:storage3.disk2
10.10.1.15:3260,1 iqn.2010-11.net.almende:storage2.disk2

The settings for each node are also made permanent in /etc/iscsi/nodes; there is a directory there for each of the nodes. But we still need to log in to the remote target, so use :

login
iscsiadm -m node --targetname "iqn.2010-11.net.almende:storage1.disk1" --portal "10.10.1.14:3260" --login
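
If you want to log in to every discovered target in one go, iscsiadm also has a login-all switch; a shortcut that should work on most open-iscsi versions, check your man page:

login to all targets
iscsiadm -m node --loginall=all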

If you use no authentication, like I did above, you can now do :

list
fdisk -l 

And hey, you get an extra hard disk /dev/sdb !! You can do this for all targets, but to automate it at boot time see the next section.
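
To check which local device each target ended up as, session mode with a verbose print level lists the attached SCSI disks per session (assuming your iscsiadm supports the -P option):

show sessions and attached disks
iscsiadm -m session -P 3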

auto startup

Alter /etc/iscsi/iscsid.conf :

iscsid.conf
# To request that the iscsi initd scripts startup a session set to "automatic".
node.startup = automatic
# 
# To manually startup the session set to "manual". The default is manual.
# node.startup = manual

Change the default from manual to automatic. This only affects targets discovered 'after' you set it, so here is a manual way of doing it for already discovered nodes. You could edit the node files yourself, but this is the correct way of doing it :

set automatic
iscsiadm -m node --targetname "iqn.2010-11.net.almende:storage2.disk1" --portal "10.10.1.15:3260" --op update -n node.conn[0].startup -v automatic
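
To avoid repeating that for every node, you can loop over the node list; a sketch, assuming the default 'portal,tpgt targetname' output format of iscsiadm -m node shown above:

set all nodes to automatic
iscsiadm -m node | while read portal target; do
  iscsiadm -m node --targetname "$target" --portal "${portal%,*}" \
    --op update -n 'node.conn[0].startup' -v automatic
done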

You may notice there is also a 'node.startup = manual' line in the node files, but that does not seem to affect startup, and nowhere in the documentation or on the Internet does anyone explain the difference between the two, so I gave up. See [[Zfs_linux]] for how to create a useful system out of these. But here is the command I used for 3 machines with 2 disks each :

create zpool
zpool create -f storage raidz2 sdb sdc sdd sde sdf sdg

rename pool

I named my test pool 'share', which is not very recognizable as a zfs pool. So I wanted to rename it to 'tank', because most examples use that name and it is therefore instantly recognizable. There is no rename command, so exporting it and importing it under a new name is the fastest way :

zpool export share
zpool import share tank

restoring a failed disk

Note that the first time I tried this it failed, because I had not created the pool correctly.

zpool status
  pool: share
 state: ONLINE
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sun Apr 14 00:24:02 2024
config:

        NAME                                          STATE     READ WRITE CKSUM
        share                                         ONLINE       0     0     0
          ata-Hitachi_HDP725032GLAT80_GE2330RC1165AB  ONLINE       0     0     0
          ata-SAMSUNG_HD321KJ_S0ZEJ1MP807222          ONLINE       0     0     0
          ata-SAMSUNG_HD322HJ_S17AJ90Q409426          ONLINE       0     0     0
          ata-SAMSUNG_HD322HJ_S17AJ90Q409427          ONLINE       0     0     0
          ata-SAMSUNG_HD322HJ_S17AJ9BS104161          ONLINE       0     0     0

errors: No known data errors

Note that share does not mention raidz anywhere. If you unhook a disk (or lose one) you cannot recover the data.

zpool status
no pools available
The output of zpool import is more informative :

zpool import
   pool: tank
     id: 17581964764773003445
  state: UNAVAIL
status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        tank                                          UNAVAIL  insufficient replicas
          ata-Hitachi_HDP725032GLAT80_GE2330RC1165AB  ONLINE
          ata-SAMSUNG_HD321KJ_S0ZEJ1MP807222          ONLINE
          ata-SAMSUNG_HD322HJ_S17AJ90Q409426          ONLINE
          ata-SAMSUNG_HD322HJ_S17AJ90Q409427          ONLINE
          ata-SAMSUNG_HD322HJ_S17AJ9BS104161          UNAVAIL

Note that it says insufficient replicas, because we just created a 'striped' pool with 5 disks. Reboot with the disk attached and you will see that the size is

cd /tank
df -h .
Filesystem      Size  Used Avail Use% Mounted on
tank            1.5T  170M  1.5T   1% /tank

So that's 5 times 320 GB (1600), not 4 times 320 GB (1280).

Recreate the pool like this

zpool destroy tank
find /dev/disk/by-id | grep -v part | grep ata > cmd
vim cmd

The -v means exclude all lines having 'part' in them: the partition lines. There are also 'wwn-' lines in there, which the 'grep ata' filters out, so this command leaves :

/dev/disk/by-id/ata-Hitachi_HDP725032GLAT80_GE2330RC1165AB
/dev/disk/by-id/ata-_NEC_DVD_RW_ND-3520A
/dev/disk/by-id/ata-SAMSUNG_HD321KJ_S0ZEJ1MP807222
/dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ90Q409426
/dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ90Q409427
/dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ9BS104161
/dev/disk/by-id/ata-SanDisk_SDSSDP064G_144632401890

Also remove the root-system disk (the SanDisk SSD) and the DVD drive, and the complete command becomes :

zpool create tank raidz \
/dev/disk/by-id/ata-Hitachi_HDP725032GLAT80_GE2330RC1165AB \
/dev/disk/by-id/ata-SAMSUNG_HD321KJ_S0ZEJ1MP807222 \
/dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ90Q409426 \
/dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ90Q409427 \
/dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ9BS104161
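
The same list can also be built on the fly instead of via an editor; a sketch, assuming the DVD drive and the SanDisk root SSD are the only devices to exclude:

build the disk list on the fly
zpool create tank raidz $(ls /dev/disk/by-id/ata-* | grep -v part | grep -v DVD | grep -v SanDisk)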

raidz has 1 disk of redundancy; raidz2 and raidz3 have 2 and 3. The disk space is now correct :

df -h .
Filesystem      Size  Used Avail Use% Mounted on
tank            1.2T  128K  1.2T   1% /tank

replace a disk

If you now switch the SATA cables with another disk, it will complain but still say you can continue.

 zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            ata-Hitachi_HDP725032GLAT80_GE2330RC1165AB  ONLINE       0     0     0
            ata-SAMSUNG_HD321KJ_S0ZEJ1MP807222          ONLINE       0     0     0
            ata-SAMSUNG_HD322HJ_S17AJ90Q409426          ONLINE       0     0     0
            ata-SAMSUNG_HD322HJ_S17AJ90Q409427          ONLINE       0     0     0
            6263933613393558152                         UNAVAIL      0     0     0  was /dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ9BS104161-part1

errors: No known data errors

Also note that the missing disk is now identified by the number 6263933613393558152. So use that in the zpool replace command :

zpool replace tank 6263933613393558152 /dev/disk/by-id/ata-Hitachi_HDP725032GLA360_GEA434RF1LP2GG

The pool is now working again and has been 'resilvered' :

zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 89K in 00:00:02 with 0 errors on Fri Apr 26 12:21:19 2024
config:
...

I just left it that way, and now the old disk is the 'spare' disk.
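
If you want to make that explicit, ZFS can register the leftover disk as a hot spare; a sketch, assuming the replaced Samsung disk is the one you want to reuse:

add hot spare
zpool add tank spare /dev/disk/by-id/ata-SAMSUNG_HD322HJ_S17AJ9BS104161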

reimport

To be sure we can also reconstruct the pool on another machine, just export and reimport the pool :

zpool export tank
zpool import tank
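
A quick check afterwards confirms the pool and its datasets came back:

verify
zpool status tank
zfs list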