If you are managing your own infrastructure in a private data center, you are bound to work through a range of different storage offerings. Selecting a storage solution depends largely on your requirements, and a little understanding of the technology is always helpful before finalizing a particular storage option for your use case.
I was actually going to write an article about object storage (currently the hottest storage option in the cloud). But before getting into that part of the storage arena, I thought it's better to discuss the two main storage methods that have coexisted for a very long time and that companies use internally for their needs.
The decision about your storage type will depend on many factors, such as the ones below.
Type of data that you want to store
Usage pattern
Scaling concerns
Finally, your budget
When you begin your career as a system administrator, you will often hear your colleagues talking about different storage methods like SAN, NAS, and DAS. Without a little bit of digging, you are bound to get confused by the different terms in storage. The confusion often arises because of the similarities between the different approaches to storage. The only hard and fast rule for staying up to date with technical terms is to keep reading (especially the concepts behind a given technology).
Today we will be discussing two different methods that define the structure of storage in your environment. Your choice between the two for your architecture should depend only on your use case and the type of data you store.
By the end of this tutorial, I hope you will have a clear picture of the two main storage methods and which to select for your needs.
SAN (Storage Area Network) and NAS (Network Attached Storage)
The main things that differentiate each of these technologies are mentioned below.
How the storage is connected to a system. In short, how the connection is made between the accessing system and the storage component (directly attached or network attached)
The type of cabling used to connect. In short, this is the type of cabling used to connect a system to the storage component (e.g. Ethernet or Fibre Channel)
How input and output requests are done. In short, this is the protocol used to carry input and output requests (e.g. SCSI, NFS, CIFS)
Related: How to monitor IO on linux
Let's discuss SAN first and then NAS, and at the end compare the two technologies to make the differences between them clear.
SAN (Storage Area Network)
Today's applications are very resource intensive, due to the number of requests that need to be processed simultaneously every second. Take the example of an e-commerce website, where thousands of people place orders every second, all of which need to be stored properly in a database for later retrieval. The storage technology behind such high-traffic databases must be fast at serving requests and responding (in short, it should be fast at input and output).
Related: Web server Performance test
In such cases (where you need high performance and fast I/O) we can use a SAN.
A SAN is nothing but a high-speed network that makes connections between storage devices and servers.
Traditionally, application servers used to have their own storage devices attached to them. Servers talk to these devices using a protocol known as SCSI (Small Computer System Interface). SCSI is nothing but a standard used for communication between servers and storage devices. Ordinary hard disks, tape drives, and so on all use SCSI. In the beginning, the storage needs of a server were fulfilled by storage devices included inside the server (the server talked to those internal storage devices using SCSI, very much like the way a normal desktop talks to its internal hard disk).
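As a quick aside, on a Linux server you can see the SCSI devices the kernel has discovered by reading sysfs. The sketch below is a minimal, illustrative example; it assumes a Linux host, and the entries it prints (if any) will differ from machine to machine.

```python
# Minimal sketch: list the SCSI devices a Linux server sees via sysfs.
# Assumes a Linux host; the entries and attributes vary per machine.
import os

SCSI_SYSFS = "/sys/class/scsi_device"

def read_attr(path):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

if os.path.isdir(SCSI_SYSFS):
    for hctl in sorted(os.listdir(SCSI_SYSFS)):   # e.g. "0:0:0:0" = host:channel:target:lun
        dev = os.path.join(SCSI_SYSFS, hctl, "device")
        vendor = read_attr(os.path.join(dev, "vendor"))
        model = read_attr(os.path.join(dev, "model"))
        print(f"{hctl}  vendor={vendor}  model={model}")
else:
    print("No SCSI sysfs tree found (not a Linux host, or no SCSI devices attached).")
```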
Devices like compact disc drives are also attached to the server (as part of the server) using SCSI. The main advantage of SCSI for connecting devices to a server was its high throughput. Although this architecture is sufficient for low-end requirements, it has a few limitations, such as the ones mentioned below.
The server can only access data on devices that are directly attached to it.
If something happens to the server, access to data will fail (because the storage device is part of the server and is attached to it using SCSI)
There is a limit to the number of storage devices the server can access. If the server needs more storage space, no more devices can be attached, because the SCSI bus can accommodate only a finite number of devices.
Also, the server using SCSI storage has to be near the storage device, because parallel SCSI (the normal implementation in most computers and servers) has distance limitations; it can work up to about 25 meters.
Some of these limitations can be overcome using DAS (Directly Attached Storage). The medium used to directly connect storage to the server can be any of SCSI, Ethernet, Fibre Channel, etc. Low complexity, low investment, and simplicity of deployment led DAS to be adopted by many for normal requirements. The solution was good even performance-wise when used with faster mediums like Fibre Channel.
Even an external USB drive attached to a server is a DAS (well, conceptually it's DAS, as it's directly attached to the server's USB bus). But USB drives are normally not used, due to the speed limitation of the USB bus. For heavy and large DAS storage solutions, the medium normally used is SAS (Serial Attached SCSI). Internally, the storage device can use RAID (which is normally the case) or anything else to provide storage volumes to servers. SAS storage options provide 6 Gb/s speeds these days.
An example of a DAS storage device is Dell's MD1220.
To the server, DAS storage will appear very much like its own internal drive or an external drive that you have plugged in.
Although DAS is good for normal needs and gives good performance, it has limitations, such as the number of servers that can access it. The DAS storage device also has to be near the server (in the same rack, or within the accepted distance limits of the medium used).
It can be argued that directly attached storage (DAS) is faster than any other storage method, because it does not involve the overhead of data transfer over a network (all data transfer occurs on a dedicated connection between the server and the storage device, mostly Serial Attached SCSI, or SAS). However, thanks to recent improvements in Fibre Channel and various caching mechanisms, SAN now provides speeds similar to DAS, and in some cases surpasses the speed provided by DAS.
Before getting inside SAN, let's understand the media types and methods used to interconnect storage devices (when I say storage devices, please don't think of a single hard disk; think of an array of disks, probably in some RAID level, something like Dell's MD1200).
What are SAS (Serial Attached SCSI), FC (Fibre Channel), and iSCSI (Internet Small Computer System Interface)?
Traditionally, SCSI devices like internal hard disks are connected to a shared parallel SCSI bus. This means all attached devices use the same bus to send and receive data. But shared parallel connections are not good for high accuracy and create issues at high transfer speeds. A serial connection between the device and the server, however, can increase the overall throughput of the data transfer. SAS connections between storage devices and servers give each disk a dedicated 300 MB/s link (in the first generation), whereas a parallel SCSI bus shares the same bandwidth across all connected devices.
SAS uses the same SCSI commands to send and receive data from a device. Also, please do not think that SCSI is only used for internal storage; it is also used to connect external storage devices to the server.
If data transfer performance and reliability are what matter, then SAS is the best solution. In terms of reliability and error rate, SAS disks are much better than the old SATA disks. SAS was designed with performance in mind, which is why it is full-duplex: data can be sent and received simultaneously from a device using SAS. Also, a single SAS host port can connect to multiple SAS drives using expanders. SAS uses point-to-point data transfer, with serial communication between devices (storage devices such as disk drives and disk arrays) and hosts.
The first generation of SAS provided around 3 Gb/s of speed. The second generation improved this to 6 Gb/s. And the third generation (which is currently used by many organizations for extremely high throughput) improved it to 12 Gb/s.
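As a rough back-of-the-envelope illustration (my own arithmetic, not a vendor figure), those line rates translate to usable per-link bandwidth as shown below, assuming the 8b/10b encoding used by these SAS generations and ignoring protocol overhead. Remember that, unlike a shared parallel SCSI bus, each SAS link delivers this bandwidth to its own device.

```python
# Rough, illustrative calculation: SAS line rate (Gb/s) to approximate usable MB/s.
# Assumes 8b/10b encoding (SAS generations 1-3) and ignores protocol overhead.
def sas_usable_mb_per_s(line_rate_gbps):
    usable_bits_per_s = line_rate_gbps * 1e9 * 8 / 10   # 8b/10b: 80% of raw bits carry data
    return usable_bits_per_s / 8 / 1e6                   # bits -> bytes -> MB

for gen, rate in [("SAS-1", 3), ("SAS-2", 6), ("SAS-3", 12)]:
    print(f"{gen}: {rate} Gb/s line rate -> ~{sas_usable_mb_per_s(rate):.0f} MB/s per link")
```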
Fibre Channel Protocol
Fibre Channel is a relatively new interconnection technology used for fast data transfer. The main purpose of its design is to enable data transport at high rates with very little or negligible delay. It can be used to interconnect workstations, peripherals, storage arrays, etc.
The major factor that distinguishes Fibre Channel from other interconnection methods is that it can manage both networking and I/O communication over a single channel, using the same adapters.
ANSI (American National Standards Institute) standardized Fibre Channel in 1988. When we say "fibre" (in Fibre Channel), do not think that it only supports optical fiber media; "fibre" here is a term for any medium used to interconnect under the Fibre Channel protocol. You can even use copper wire for lower cost.
Please note that the Fibre Channel standard from ANSI supports networking, storage, and data transfer. Fibre Channel is not aware of the type of data you transfer. It can send SCSI commands encapsulated inside a Fibre Channel frame (it does not have its own I/O commands for storage). The main advantage is that it can carry widely adopted protocols like SCSI and IP inside it.
The components needed to make a Fibre Channel connection are mentioned below. This is the bare minimum required to achieve a point-to-point connection; typically it can be used for a direct connection between a storage array and a host.
An HBA (Host Bus Adapter) with a Fibre Channel port
Driver for the HBA card
Cables to connect the devices to the HBA's Fibre Channel port
As mentioned earlier, the SCSI protocol is encapsulated inside Fibre Channel, so SCSI data normally has to be mapped into a different format that Fibre Channel can deliver to the destination. When the destination receives the data, it translates it back to SCSI.
You might be wondering why we need this mapping and re-mapping, and why we can't directly use SCSI to deliver the data. The reason is that SCSI cannot deliver data across great distances or to a large number of devices (or a large number of hosts).
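To make the idea of "SCSI wrapped inside a transport frame" a bit more concrete, here is a toy sketch. The 10-byte command descriptor block roughly follows the SCSI READ(10) layout, but the outer header is invented purely for illustration; it is not the real Fibre Channel frame format.

```python
# Toy illustration of encapsulation: a SCSI-style command wrapped in an outer
# transport header. The CDB follows the general SCSI READ(10) layout; the
# outer "frame" header is invented and is NOT the real Fibre Channel format.
import struct

def build_read10_cdb(lba, num_blocks):
    # READ(10): opcode 0x28, flags, 4-byte LBA, group, 2-byte transfer length, control
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, num_blocks, 0)

def wrap_in_toy_frame(payload, src_id, dst_id):
    # Hypothetical outer header: source id, destination id, payload length
    header = struct.pack(">HHH", src_id, dst_id, len(payload))
    return header + payload

cdb = build_read10_cdb(lba=2048, num_blocks=8)        # "read 8 blocks starting at LBA 2048"
frame = wrap_in_toy_frame(cdb, src_id=0x0001, dst_id=0x00EF)
print(f"CDB is {len(cdb)} bytes, framed message is {len(frame)} bytes: {frame.hex()}")
```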
Fibre Channel can be used to interconnect systems as far as 10 km apart (when used with optical fiber; you can increase this distance further by placing repeaters in between). You can also transfer data over distances of up to 30 m using copper wire for lower cost in Fibre Channel.
With the emergence of Fibre Channel switches from a variety of major vendors, connecting a large number of storage devices and servers has become an easy task (provided you have the budget to invest). The networking ability of Fibre Channel led to the wide adoption of the SAN (Storage Area Network) for faster, long-distance, and reliable data access. Most high-performance computing environments (which require fast, high-volume data transfers) use a Fibre Channel SAN with optical fiber cables.
The current Fibre Channel standard (called 16GFC) can transmit data at 1600 MB/s (don't forget that this standard was released in 2011). Upcoming standards in the coming years are expected to provide 3200 MB/s and 6400 MB/s speeds.
iSCSI (Internet Small Computer System Interface)
iSCSI is nothing but an IP-based standard for interconnecting storage arrays and hosts. It is used to carry SCSI traffic over IP networks. It is the simplest and cheapest solution (although not the best) for connecting to a storage device.
This is a nice technology for location-independent storage, because it can establish a connection to a storage device over local area networks or wide area networks. It is a Storage Area Network interconnection standard, and it does not require the special cabling and equipment that a Fibre Channel network does.
To the system using a storage array over iSCSI, the storage appears as a locally attached disk. This technology came after Fibre Channel and was widely adopted due to its low cost.
iSCSI is a networking protocol built on top of TCP/IP. You can guess that it is not great performance-wise compared with Fibre Channel (simply because everything runs over regular TCP, with no special hardware or modifications to your architecture).
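To illustrate the point that iSCSI rides on ordinary TCP/IP, here is a small, self-contained sketch that sends a command-like payload over a loopback TCP socket. The length-prefixed framing is invented for illustration (it is not the real iSCSI PDU format), but it shows how every storage request ends up as regular TCP traffic that the server's CPU and network stack must handle.

```python
# Conceptual sketch only: a command-like payload sent over plain TCP, the way
# iSCSI carries SCSI traffic over ordinary IP networks. The 4-byte length
# prefix framing is invented and is NOT the real iSCSI PDU format.
import socket
import struct
import threading

HOST = "127.0.0.1"
# Real iSCSI targets listen on TCP port 3260; here we let the OS pick a free port.
srv = socket.create_server((HOST, 0))
port = srv.getsockname()[1]

def toy_target(server_sock):
    """A stand-in 'storage target': read one length-prefixed message and acknowledge it."""
    conn, _ = server_sock.accept()
    with conn:
        size = struct.unpack(">I", conn.recv(4))[0]
        payload = conn.recv(size)
        conn.sendall(f"target received a {len(payload)}-byte command over plain TCP".encode())

t = threading.Thread(target=toy_target, args=(srv,), daemon=True)
t.start()

# The "initiator" side: every read/write becomes regular TCP traffic.
command = b"\x28" + b"\x00" * 9        # placeholder 10-byte command (READ(10)-style opcode)
with socket.create_connection((HOST, port)) as sock:
    sock.sendall(struct.pack(">I", len(command)) + command)
    print(sock.recv(1024).decode())

t.join()
srv.close()
```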
iSCSI also introduces some CPU load on the server, because the server has to do extra processing for all storage requests over the network, using regular TCP.
Related: Linux CPU performance Monitoring
iSCSI has the following disadvantages compared to Fibre Channel:
iSCSI introduces a little more latency than Fibre Channel, due to the overhead of IP headers
Database applications involve many small read and write operations, which introduce even more latency when carried over iSCSI
When iSCSI runs on the same LAN that carries other normal infrastructure traffic (traffic other than iSCSI), it will suffer read/write lag, i.e. low performance
The maximum speed/bandwidth is limited by your Ethernet and network speed. Even if you aggregate multiple links, it does not scale to the level of Fibre Channel (a rough comparison is sketched below)
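Here is the rough comparison mentioned above. These are my own back-of-the-envelope theoretical ceilings, ignoring encoding, protocol overhead, and real-world conditions; the 16GFC figure is the one quoted earlier in this article.

```python
# Rough, illustrative ceilings only: ignores encoding, protocol overhead and
# real-world conditions. Gigabit figures are simple bits-to-bytes conversions.
links = {
    "1 GbE (iSCSI on gigabit Ethernet)": 1e9 / 8 / 1e6,        # ~125 MB/s
    "10 GbE (iSCSI on 10-gigabit Ethernet)": 10e9 / 8 / 1e6,   # ~1250 MB/s
    "16GFC Fibre Channel (2011 standard)": 1600,               # MB/s, figure quoted earlier
}
for name, mb_per_s in links.items():
    print(f"{name}: ~{mb_per_s:.0f} MB/s theoretical ceiling")
```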
NAS (Network Attached Storage)
The simplest definition of NAS is: any server that shares its own storage with others on the network and acts as a file server is, in its simplest form, NAS.
Please note that Network Attached Storage shares files over the network, not a storage device over the network.
NAS uses an Ethernet connection to share files over the network. The NAS device has an IP address and is accessible over the network through that IP address. When you access files on a file server from your Windows system, that is basically NAS.
The main difference lies in how your computer or server treats a particular piece of storage. If the computer treats the storage as part of itself (similar to the way you attach DAS to your server), in other words, if the server's processor is responsible for managing the attached storage, it is a form of DAS. If the computer/server treats the attached storage as another computer that is sharing its data over the network, then it is NAS.
Directly attached storage (DAS) can be viewed like any other peripheral device, such as a mouse or keyboard, because to the server/computer it is simply a directly attached storage device. NAS, however, is another server, or say a piece of equipment with its own computing capabilities, that can share its own storage with others.
Even SAN storage can be considered a piece of equipment with its own processing/computing power. So the main difference between NAS, SAN, and DAS is how the accessing server/computer sees the storage. A DAS storage device appears to the server as part of itself; the server sees it as its own physical component. Although the DAS device might not be inside the server (it is normally a separate device with its own storage array), the server still sees it as its own internal storage.
When we talk about NAS, we should call them shares rather than storage devices, because NAS appears to a server as a shared folder over the network instead of a shared device. Please do not forget that NAS devices are computers in themselves, which can share their storage space with others. When you share a folder with access control using SAMBA, that is NAS.
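The difference also shows up at the application level: a NAS share is consumed as files under an ordinary mounted path, while SAN/DAS storage is consumed as a raw block device on which the server runs its own filesystem. The sketch below is illustrative only; the paths /mnt/nas_share and /dev/sdb are hypothetical examples and will differ on your systems.

```python
# Illustrative only: a NAS share is consumed as files on a mounted path,
# while SAN/DAS storage is consumed as a block device. The paths below
# (/mnt/nas_share, /dev/sdb) are hypothetical examples.
import os

NAS_MOUNT = "/mnt/nas_share"     # e.g. an NFS or SAMBA share mounted here
BLOCK_DEVICE = "/dev/sdb"        # e.g. a LUN presented by a SAN, or a DAS disk

# File-level access (NAS): you work with paths and files; the NAS box owns the filesystem.
if os.path.isdir(NAS_MOUNT):
    print("Files on the NAS share:", os.listdir(NAS_MOUNT)[:5])
else:
    print(f"{NAS_MOUNT} is not mounted on this machine (expected in this sketch).")

# Block-level access (SAN/DAS): the server sees raw blocks and runs its own filesystem on top.
if os.path.exists(BLOCK_DEVICE):
    with open(BLOCK_DEVICE, "rb") as dev:   # usually needs root privileges
        first_sector = dev.read(512)
    print(f"Read {len(first_sector)} raw bytes from {BLOCK_DEVICE}")
else:
    print(f"{BLOCK_DEVICE} does not exist on this machine (expected in this sketch).")
```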
Although NAS is a cheaper option for your storage needs, it really does not suit enterprise-level, high-performance applications. Never think of putting database storage (which needs to perform well) on NAS. The main downsides of NAS are its performance issues and its dependency on the network (most of the time, the LAN used for normal traffic is also used for sharing storage with NAS, which makes it more congested).
Related: Linux Network Performance Tuning