IBM Spectrum Scale! Haven’t heard of it yet? Spectrum Scale is IBM’s General Parallel File System (GPFS), widely used for large-scale enterprise clustered file systems that need to scale to petabytes of storage, thousands of nodes, billions of files, and thousands of users concurrently accessing data. Spectrum Scale supports a wide variety of data warehouse and business analytics applications.
Most traditional Big Data clusters use the Hadoop Distributed File System (HDFS) as the underlying file system for storing data. This blog briefly discusses some IBM Spectrum Scale features that can benefit large-scale Big Data clusters, features that many Big Data architects overlook and that HDFS does not support at present.
Spectrum Scale is a POSIX-compliant file system; HDFS is not.
Applications run as-is, or with minimal changes, on a Hadoop cluster that uses Spectrum Scale as the underlying file system instead of HDFS. Using GPFS minimizes new application development and testing costs, so your Big Data cluster is production-ready in the least amount of time.
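To make the difference concrete, here is a minimal sketch of ordinary POSIX file I/O against a Spectrum Scale mount. The mount point and file path are hypothetical; the point is that standard system calls, including in-place updates, just work, whereas HDFS files are append-only once written and require a separate client API.

```python
import os

# Hypothetical Spectrum Scale mount point and file.
path = "/gpfs/bigdata/events.log"

# Ordinary buffered write, exactly as on any local POSIX file system.
with open(path, "w") as f:
    f.write("record-001\n")

# In-place update via seek(): legal under POSIX, but not possible on
# HDFS, where files are append-only once written.
with open(path, "r+") as f:
    f.seek(7)          # position over the "001" suffix
    f.write("999")     # file now reads "record-999"

# Standard stat() metadata calls work as well.
print(os.path.getsize(path))
```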
Spectrum Scale also provides seamless integration between Hadoop clusters and other data warehouse environments: because both sides share the same file system, data moves easily between your Hadoop cluster and your traditional analytics environments. This offers high flexibility when integrating your Big Data environment with existing analytics infrastructure.
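Here is a hedged sketch of what that integration can look like in practice, assuming both environments mount one shared Spectrum Scale file system; the paths below are hypothetical. Moving a warehouse extract into the Hadoop cluster’s landing area is an ordinary file copy, with no hdfs put/get step:

```python
import shutil

# Hypothetical paths on a Spectrum Scale file system mounted by both
# the warehouse servers and the Hadoop cluster nodes.
WAREHOUSE_EXPORT = "/gpfs/warehouse/exports/daily_extract.csv"
HADOOP_LANDING = "/gpfs/hadoop/landing/daily_extract.csv"

# A plain POSIX copy is all the "transfer" that is needed; both
# environments see the same namespace.
shutil.copyfile(WAREHOUSE_EXPORT, HADOOP_LANDING)
```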
Spectrum Scale is a highly available file system.
Managing large clusters with thousands of nodes and petabytes of storage is a complex task, and high availability is a key requirement in such environments. Spectrum Scale provides data and metadata replication with up to three copies, file system replication across multiple sites, multiple failure groups, both node-based and disk-based quorum, automated node recovery, automated data striping and rebalancing, and more. In my view, these high-availability features make Spectrum Scale a better choice than HDFS for enterprise production data.
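As a rough illustration of how the replication settings above are administered, the sketch below drives two standard GPFS commands, mmchfs and mmrestripefs, from Python. The file system name bigfs is hypothetical, and the exact flags should be verified against the documentation for your installed release:

```python
import subprocess

FS = "bigfs"  # hypothetical file system name

# Raise the default number of data (-r) and metadata (-m) replicas.
subprocess.run(["mmchfs", FS, "-r", "3", "-m", "3"], check=True)

# Rewrite existing files so they honor the new replication settings.
subprocess.run(["mmrestripefs", FS, "-R"], check=True)
```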
Security Compliance
Security compliance of business-critical data is another key requirement for any enterprise, but it is often overlooked during the development phase of Big Data proofs of concept. Because many Big Data PoCs use a wide variety of open source components, achieving the required security compliance can become a daunting task, and a PoC cannot go to production until it meets every compliance requirement. When selecting a file system for Big Data clusters, consider Spectrum Scale’s security features: file system encryption, NIST SP 800-131A compliance, NFSv4 ACL support, and SELinux compatibility. It is much easier to implement these operating system security features with Spectrum Scale than with HDFS.
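As one small example of the NFSv4 ACL support mentioned above, the sketch below round-trips an ACL using the standard GPFS commands mmgetacl and mmputacl. The path is hypothetical, and in practice the ACL text would be edited between the two steps:

```python
import subprocess

# Hypothetical file on a Spectrum Scale mount.
path = "/gpfs/bigdata/secure/report.csv"

# Dump the current ACL in its text form.
acl = subprocess.run(
    ["mmgetacl", path],
    check=True, capture_output=True, text=True,
).stdout
print(acl)

# Re-apply the (possibly edited) ACL; mmputacl reads entries from
# standard input when no input file is given.
subprocess.run(["mmputacl", path], input=acl, text=True, check=True)
```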
Information Life-cycle Management
Spectrum Scale offers extensive Information Life-cycle Management (ILM) features, which are essential when working with large Big Data clusters holding petabytes of storage. Using Spectrum Scale ILM policies, aging data can be automatically archived, deleted, or moved to lower-performance disk. This is a major advantage of Spectrum Scale over HDFS, and it keeps continuously growing storage costs in check.
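For a flavor of what such policies look like, here is a hedged sketch that installs two illustrative rules with the standard mmchpolicy command. The file system name bigfs, the capacity pool nearline, and the scratch path are all hypothetical, and the rule text should be validated with mmchpolicy’s test mode, as shown, before activation:

```python
import subprocess

# Illustrative ILM rules in the GPFS policy language. The 'nearline'
# pool and the scratch path are hypothetical.
POLICY = """
/* Move files not accessed for 90 days to slower, cheaper disk. */
RULE 'archive_cold' MIGRATE FROM POOL 'system' TO POOL 'nearline'
  WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '90' DAYS

/* Delete scratch data untouched for a year. */
RULE 'expire_scratch' DELETE
  WHERE PATH_NAME LIKE '/gpfs/bigfs/scratch/%'
    AND CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '365' DAYS
"""

policy_file = "/tmp/ilm.pol"
with open(policy_file, "w") as f:
    f.write(POLICY)

# Validate the rules without installing them, then activate.
subprocess.run(["mmchpolicy", "bigfs", policy_file, "-I", "test"], check=True)
subprocess.run(["mmchpolicy", "bigfs", policy_file], check=True)
```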
Additional References:
For a detailed comparison of HDFS vs. GPFS, please visit:
https://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/over_filesystem_comparison.html
Detailed White Paper on deploying Spectrum Scale for Big Data clusters:
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=DCW03051USEN#loaded