Securing any system requires implementing layers of protection. Access Control Lists (ACLs) are typically applied to data to restrict access to approved entities. Applying ACLs at every layer of access is critical to securing a system. In this post we will cover the lowest layer of access control in Hadoop: ACLs for HDFS.
This is part of the HDFS Developer Trail series. Other posts in this series include:
- Heterogeneous Storages in HDFS
- HDFS 2.0 Next Generation Architecture
- NameNode High Availability in HDP 2.0
- Protecting your Enterprise Data with HDFS Snapshots
- 3 Minutes on Apache Hadoop HDFS with Sanjay Radia
- Understanding NameNode Startup Operations in HDFS
- Simplifying Data Management: NFS Access to HDFS
Background
For several years, HDFS has supported a permission model equivalent to traditional Unix permission bits [5]. For each file or directory, permissions are managed for a set of 3 distinct user classes: owner, group, and others. There are 3 different permissions controlled for each user class: read, write, and execute. When a user attempts to access a file system object, HDFS enforces permissions according to the most specific user class applicable to that user. If the user is the owner, then HDFS checks the owner class permissions. If the user is not the owner, but is a member of the file system object’s group, then HDFS checks the group class permissions. Otherwise, HDFS checks the others class permissions.
This model is sufficient to express a large number of security requirements. For example, consider a sales department that wants a single user, the department manager, to control all modifications to sales data. Other members of the department need to view the data, but must not be able to modify it. Everyone else in the company outside the sales department must not be able to view the data. This requirement can be implemented by running chmod 640 on the file, with the following outcome:
-rw-r----- 3 bruce sales 0 2014-03-04 16:31 /sales-data
Only bruce may modify the file, only members of the sales group may read the file, and no one else may access the file in any way.
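For reference, a minimal command sequence that would produce this state (assuming an HDFS superuser sets the owner and group, since only the superuser may run chown in HDFS; the owner can then set the permission bits himself):
> hdfs dfs -chown bruce:sales /sales-data
> hdfs dfs -chmod 640 /sales-data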
Now suppose there are new requirements. The sales department has grown such that it’s no longer feasible for the manager, bruce, to control all modifications to the file. Instead, the new requirement is that bruce, diana, and clark are allowed to make modifications. Unfortunately, there is no way for permission bits to express this requirement, because there can be only one owner and one group, and the group is already used to implement the read-only requirement for the sales team. A typical workaround is to set the file owner to a synthetic user account, such as salesmgr, and allow bruce, diana, and clark to use the salesmgr account via sudo or similar impersonation mechanisms.
Also suppose that in addition to the sales staff, all executives in the company need to be able to read the sales data. This is another requirement that cannot be expressed with permission bits, because there is only one group, and it’s already used by sales. A typical workaround is to set the file’s group to a new synthetic group, such as salesandexecs, and add all users of sales and all users of execs to that group.
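As a rough sketch, the combined workaround might look like the following. The synthetic names salesmgr and salesandexecs come from the text above; the group-management commands assume group membership is sourced from local OS accounts, purely for illustration, since the NameNode's group mapping could just as easily be backed by LDAP.
> groupadd salesandexecs
> usermod -aG salesandexecs bruce        # repeat for every sales and execs user
> hdfs dfs -chown salesmgr:salesandexecs /sales-data
> hdfs dfs -chmod 640 /sales-data
> sudo -u salesmgr hdfs dfs -appendToFile new-orders.csv /sales-data   # writers impersonate the synthetic owner; new-orders.csv is just an illustrative local file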
Both of these workarounds incur significant drawbacks, though. They force complexity onto cluster administrators, who must manage additional users and groups, and onto end users, who must switch between different accounts for different actions.
ACLs Applied!
In general, plain Unix permissions aren’t sufficient when you have permission requirements that don’t map cleanly to an enterprise’s natural hierarchy of users and groups. Working in collaboration with the Apache community, we developed the HDFS ACLs feature to address this shortcoming. HDFS ACLs will be available in Apache Hadoop 2.4.0 and Hortonworks Data Platform 2.1.
HDFS ACLs give you the ability to specify fine-grained file permissions for specific named users or named groups, not just the file’s owner and group. HDFS ACLs are modeled after POSIX ACLs [4]. If you’ve ever used POSIX ACLs on a Linux file system, then you already know how ACLs work in HDFS. Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules.
To use ACLs, first you’ll need to enable ACLs on the NameNode by adding the following configuration property to hdfs-site.xml and restarting the NameNode.
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
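After the restart, a quick sanity check is to apply and read back an ACL on a throwaway file using the setfacl and getfacl commands described below; if ACLs are still disabled, the NameNode rejects the setfacl call. (The path is just an example.)
> hdfs dfs -touchz /tmp/acl-check
> hdfs dfs -setfacl -m user:diana:r-- /tmp/acl-check
> hdfs dfs -getfacl /tmp/acl-check
> hdfs dfs -rm /tmp/acl-check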
Most users will interact with ACLs using 2 new commands added to the HDFS CLI: setfacl and getfacl. Let’s look at several examples of how HDFS ACLs can help implement complex security requirements.
Example 1: Granting Access to Another Named Group
Going back to our original example, let’s set an ACL that grants read access to sales-data for members of the execs group.
- Set the ACL.
> hdfs dfs -setfacl -m group:execs:r-- /sales-data
- Check results by running getfacl.
> hdfs dfs -getfacl /sales-data
# file: /sales-data
# owner: bruce
# group: sales
user::rw-
group::r--
group:execs:r--
mask::r--
other::---
- Additionally, the output of ls has been modified to append ‘+’ to the permissions of a file or directory that has an ACL.
> hdfs dfs -ls /sales-data
Found 1 items
-rw-r-----+ 3 bruce sales 0 2014-03-04 16:31 /sales-data
The new ACL entry is added to the existing permissions defined by the permission bits. User bruce has full control as the file owner. Members of either the sales group or the execs group have read access. All others have no access.
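The mask::r-- entry shown by getfacl caps the effective permissions of every named user entry, named group entry, and the unnamed group entry. As in POSIX ACLs, the mask occupies the group portion of the classic permission bits, so a later chmod also moves the mask. A quick illustration, reusing the file from this example:
> hdfs dfs -chmod 600 /sales-data
> hdfs dfs -getfacl /sales-data
# group:execs:r-- is still listed, but with the mask now set to --- it grants
# no effective access; restoring the group bits restores the mask:
> hdfs dfs -chmod 640 /sales-data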
Example 2: Using a Default ACL for Automatic Application to New Children
In addition to an ACL enforced during permission checks, there is also a separate concept of a default ACL. A default ACL may be applied only to a directory, not a file. Default ACLs have no direct effect on permission checks and instead define the ACL that newly created child files and directories receive automatically.
Suppose we have a monthly-sales-data directory, further sub-divided into separate directories for each month. Let’s set a default ACL to guarantee that members of the execs group automatically get access to new sub-directories, as they get created for each month.
- Set the default ACL on the parent directory.
> hdfs dfs -setfacl -m default:group:execs:r-x /monthly-sales-data
- Make sub-directories.
> hdfs dfs -mkdir /monthly-sales-data/JAN
> hdfs dfs -mkdir /monthly-sales-data/FEB
- Verify that HDFS has automatically applied the default ACL to the sub-directories.
> hdfs dfs -getfacl -R /monthly-sales-data
# file: /monthly-sales-data
# owner: bruce
# group: sales
user::rwx
group::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---
# file: /monthly-sales-data/FEB
# owner: bruce
# group: sales
user::rwx
group::r-x
group:execs:r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---
# file: /monthly-sales-data/JAN
# owner: bruce
# group: sales
user::rwx
group::r-x
group:execs:r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---
The default ACL is copied from the parent directory to the child file or child directory at time of creation. Subsequent changes to the parent directory’s default ACL do not alter the ACLs of existing children.
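One way to observe this copy-on-create behavior, continuing the example above (the MAR directory is just an illustration):
# Change the parent's default ACL after JAN and FEB already exist:
> hdfs dfs -setfacl -m default:group:execs:r-- /monthly-sales-data
# Existing children keep the ACL they received at creation time:
> hdfs dfs -getfacl /monthly-sales-data/JAN     # still shows group:execs:r-x
# Children created from now on pick up the updated default:
> hdfs dfs -mkdir /monthly-sales-data/MAR
> hdfs dfs -getfacl /monthly-sales-data/MAR     # shows group:execs:r--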
Example 3: Blocking Access to a Sub-Tree for a Specific User
Suppose there is an emergency need to block access to an entire sub-tree for a specific user. Applying a named user ACL entry to the root of that sub-tree is the fastest way to accomplish this without accidentally revoking permissions for other users.
- Add an ACL entry to block all access to monthly-sales-data by user diana.
> hdfs dfs -setfacl -m user:diana:--- /monthly-sales-data
- Check results by running getfacl.
> hdfs dfs -getfacl /monthly-sales-data
# file: /monthly-sales-data
# owner: bruce
# group: sales
user::rwx
user:diana:---
group::r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---
It’s important to keep in mind the order of evaluation for ACL entries when a user attempts to access a file system object:
- If the user is the file owner, then the owner permission bits are enforced.
- Else if the user has a named user ACL entry, then those permissions are enforced.
- Else if the user is a member of the file’s group or any named group in an ACL entry, then the union of permissions for all matching entries is enforced. (The user may be a member of multiple groups.)
- If none of the above applies, then the other permission bits are enforced.
In this example, the named user ACL entry accomplished our goal, because the user is not the file owner, and the named user entry takes precedence over all other entries.
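A quick way to confirm the effect, and to lift the block once the emergency has passed (running the command via sudo is just one way to act as diana on a simple, non-Kerberized test cluster):
# Any access attempt by diana should now fail with a permission error, even if
# a group entry would otherwise grant her access:
> sudo -u diana hdfs dfs -ls /monthly-sales-data
# When the block is no longer needed, remove just that named user entry:
> hdfs dfs -setfacl -x user:diana /monthly-sales-data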
Development
This feature was addressed in issue HDFS-4685 [1]. Development and testing were a joint effort over several months among multiple active Apache Hadoop contributors. The scope of the effort required coding across multiple layers of HDFS: new APIs, new shell commands, new file system metadata persisted in the NameNode, and enhancements to permission enforcement logic.
During the initial planning, I expected our greatest challenge would be efficient storage management for the new metadata. This is always an important consideration for new features in the NameNode, where file system metadata consumes precious RAM at runtime and long-term persistence consumes disk for the fsimage and edits. One of our goals was that existing deployments that do not wish to use ACLs must not suffer increased RAM consumption after introduction of the feature. An ACL is associated with an inode, and it would be unacceptable to introduce an O(n) (n = number of inodes) increase in RAM consumption even if ACLs were not used. This immediately ruled out the naive implementation of adding a new nullable field to the inode data structure. (Even if set to null, memory is still consumed by the pointer.) Early revisions of the design document proposed optimization techniques that involved repurposing unused bits in the permission data structure to act as an index into a shared ACL table. I expected this would be tricky code requiring detailed edge-case testing.
Fortunately, development of ACLs benefited from another change happening in the HDFS codebase at the same time. HDFS-5284 [2] provided support for inode features. An inode feature is a generalized concept of optional attributes associated with a specific inode. If a particular inode does not have a feature, then that feature does not consume additional RAM. This was a perfect fit for ACLs! Much of the early design was discarded and simplified by simply defining a new AclFeature class and attaching it to the inode data structure on only the inodes that require it. We are also starting to use inode features as a building block for many other NameNode metadata needs, such as snapshots and quotas.
We also benefited from the work in HDFS-5698 [3], which converted the fsimage file from a custom binary format to a much more flexible format utilizing Protocol Buffers. With that change in place, we no longer needed to write error-prone custom serialization and deserialization logic for the new ACL metadata. Instead, we simply defined new protobuf messages to represent the new metadata.
After those 2 changes simplified the design, I was quite surprised to find that the real challenge of the project was correctly implementing the core logic for ACL manipulation and enforcement. We wanted to match existing implementations on Linux as closely as possible to make the feature familiar and easy to use for system administrators. We also wanted to make sure that ACLs would compose well with other existing HDFS features, like snapshots, the sticky bit and WebHDFS. We added more than 200 new tests to HDFS and wrote a comprehensive system test plan to cover these scenarios.
This was a community effort. I want to thank all of the Apache contributors who participated in design, coding, testing and review of the feature: Arpit Agarwal, Dilli Arumugam, Vinayakumar B, Sachin Jose, Renil Joseph, Brandon Li, Haohui Mai, Colin Patrick McCabe, Kevin Minder, Sanjay Radia, Suresh Srinivas, Tsz-Wo Nicholas Sze, Yesha Vora and Jing Zhao.
Have you ever considered contributing to HDFS? We still need help on libHDFS API bindings for the new ACL APIs. Patches welcome!
References
1. HDFS-4685. Implementation of ACLs in HDFS.
2. HDFS-5284. Flatten INode hierarchy.
3. HDFS-5698. Use protobuf to serialize / deserialize FSImage.
4. Gruenbacher, A. (2003). POSIX Access Control Lists on Linux.
5. Wikipedia contributors (2013). File system permissions – Traditional Unix permissions.