
Announcing Apache Slider 0.80.0


Last week, the Apache Slider community released Apache Slider 0.80.0. Although there are many new features in Slider 0.80.0, a few innovations are particularly notable:

  • Containerized application onboarding
  • Seamless zero-downtime application upgrade
  • Adding co-processors to app packages without reinstallation
  • Simplified application onboarding without any packaging requirement

Below are some details about these important features. For the complete list of features, improvements, and bug fixes, see the release notes.

Notable Changes:

Containerized application onboarding

This release of Apache Slider provides a way to deploy containerized applications on YARN and leverage YARN’s resource management capabilities. Now cluster administrators can deploy and manage long-running applications, which are containerized using Docker, on YARN without any changes. This allows consolidation of all edge clusters into a single, unified Hadoop cluster.

Seamless zero-downtime application upgrade

From this release onwards, Slider supports rolling upgrades of applications from one version to the next. Application admins can now upgrade to a newer version of their application binaries and/or configuration, without any downtime, by following a list of atomic steps in a planned fashion.

Adding co-processors to app packages without reinstallation

Many big data applications allow plugins/co-processors that are essentially a set of additional jar files on the classpath, along with a set of configuration files or changes. This feature enables the app package to support such plugins without recreating the app package. The app packaging and create commands have been modified so that the app admin can dynamically specify additional jars, configuration files, and so on. For example, with this support in Slider, HBase-on-YARN can now add support for Phoenix or Ranger to the HBase application without any change to the base package.

Simplified application onboarding without any packaging requirement

This enhancement enables certain types of simple applications to run on YARN without any packaging requirement beyond providing some basic metadata about the application. It eliminates the application packaging and installation steps altogether and applies to applications whose dependencies are already pre-installed.

What’s Next?

The Slider community plans to develop exciting new capabilities for the next release, including:

  • Complete management of containerized applications on YARN
  • Tighter integration of YARN applications with Ambari through Slider
  • More new applications on YARN via Slider
  • Application relocation for Backup & Recovery and Test-to-Production scenarios

We want to thank the Apache Kafka and Apache Tajo communities for their work to enable these powerful long-running applications to run on YARN via Slider. And we want to thank the Apache Slider community for this release.

Learn More

 



New in HDP 2.3: Enterprise Grade HDFS Data At Rest Encryption


Apache Hadoop has emerged as a critical data platform to deliver business insights hidden in big data. Because Hadoop is a relatively new technology, system administrators hold it to higher security standards. There are several reasons for this scrutiny:

  • The external ecosystem of data repositories and operational systems that feed Hadoop deployments is highly dynamic and can introduce new security threats on a regular basis.
  • A Hadoop deployment contains large volumes of diverse data stored over long periods of time. Any breach of this enterprise-wide data can be catastrophic.
  • Hadoop enables users across multiple business units to access, refine, explore and enrich data using different methods, thereby raising the risk of a potential breach.

Security Pillars in Hortonworks Data Platform (HDP)

HDP is the only Hadoop platform offering comprehensive security and centralized administration of security policies across the entire stack. At Hortonworks we take a holistic view of enterprise security requirements and ensure that Hadoop can not only define but also apply a comprehensive policy. HDP leverages Apache Ranger for centralized security administration, authorization and auditing; Kerberos and Apache Knox for authentication and perimeter security; and native/partner solutions for encrypting data over the wire and at rest.


Data-at-Rest Encryption – State of the Union

In addition to authentication and access control, data protection adds a robust layer of security, by making data unreadable in transit over the network or at rest on a disk.

Compliance regulations, such as HIPAA and PCI, stipulate the use of encryption to protect sensitive patient information and credit card data. Federal agencies and enterprises in compliance-driven industries, such as healthcare, financial services and telecom, leverage data-at-rest encryption as a core part of their data protection strategy. Encryption helps protect sensitive data in case of an external breach or unauthorized access by privileged users.

There are several encryption methods, varying in their degree of protection. Disk- or OS-level encryption is the most basic form and protects against stolen disks. Application-level encryption, on the other hand, provides a higher level of granularity and prevents rogue admin access; however, it adds a layer of complexity to the architecture.

Traditional Hadoop users have been using disk encryption methods such as dm-crypt as their choice for data protection. Although OS-level encryption is transparent to Hadoop, it adds a performance overhead and does not prevent admin users from accessing sensitive data. Hadoop users are now looking to identify and encrypt only sensitive data, a requirement that involves delivering finer-grained encryption at the data level.

Certifying HDFS Encryption

The HDFS community worked together to build and introduce transparent data encryption in HDFS. The goal was to encrypt specific HDFS files by writing them to HDFS directories known as encryption zones (EZ). The solution is transparent to applications leveraging the HDFS file system, such as Apache Hive and Apache HBase. In other words, no major code change is required for existing applications already running on top of HDFS. One big advantage of encryption in HDFS is that even privileged users, such as the "hdfs" superuser, can be blocked from viewing encrypted data.
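As a rough sketch of the administrator workflow (the key name "ezkey" and the path "/secure" below are purely illustrative), an encryption zone is typically created by first generating a key through the KMS and then marking an empty HDFS directory as a zone backed by that key:

# Create an encryption key in the KMS
hadoop key create ezkey

# Create an empty directory and turn it into an encryption zone backed by that key
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName ezkey -path /secure

# Verify the zone; files written under /secure are now encrypted and decrypted transparently
hdfs crypto -listZones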

As with any other Hadoop security initiative, we have adopted a phased approach to introducing this feature to customers running HDFS in production environments. After the technical preview announcement earlier this year, the Hortonworks team has worked with a select group of customers to gather use cases and perform extensive testing against those use cases. We have also devoted significant development effort to building secure key storage in Ranger by leveraging the open source Hadoop KMS. Ranger now provides centralized policy administration, key management and auditing for HDFS encryption.

We believe that HDFS encryption, backed by Ranger KMS, is now enterprise ready for specific use cases. We will introduce support for these use cases as part of the HDP 2.3 release.

HDFS encryption in HDP – Components and Scope


The HDFS encryption solution consists of three components (more details are available on the Apache website here):

  • HDFS encryption/decryption enforcement: HDFS client level encryption and decryption for files within an Encryption Zone
  • Key provider API: API used by HDFS client to interact with KMS and retrieve keys
  • Ranger KMS: The open source Hadoop KMS is a proxy that retrieves keys for a client. Working with the community, we have enhanced the Ranger GUI to securely store keys in a database and to centralize policy administration and auditing. (Please refer to the screenshots below.)

[Screenshots: Ranger KMS key management and policy administration in the Ranger UI]

We have extensively tested HDFS data-at-rest encryption across the HDP stack and, as part of the HDP 2.3 release, will provide a detailed set of best practices for using it across various use cases.

We are also working with key encryption partners so that they can integrate their own enterprise ready KMS offerings with HDFS encryption. This offers a broader choice to customers looking to encrypt their data in Hadoop.

Summary

In summary, to encrypt sensitive data, protect against privileged access and go beyond OS-level encryption, enterprises can now use HDFS transparent encryption. Both HDFS encryption and Ranger KMS are open source, enterprise-ready, and satisfy compliance-sensitive requirements. As such, they facilitate Hadoop adoption among compliance-conscious enterprises.


Introducing Hortonworks SmartSense


The components in a modern data architecture vary from one enterprise to the next and the mix changes over time. Many of our Hortonworks subscribers need support ensuring that their Hortonworks Data Platform (HDP) clusters are optimally configured. This means that they need proactive, intelligent cluster analysis.

As businesses onboard new workloads to the platform, the resources of Hadoop operators are increasingly taxed. Our customers have therefore asked Hortonworks for guidance and best practices to reduce their operational risk and to staff their Hadoop operations efficiently.

Proactive Support

Many of the best practices that Hortonworks has developed over the years of working with Apache Hadoop take into account a number of cluster diagnostic variables that take time to collect and analyze.

Apache Ambari helps analyze these variables and allows our customers to understand the health of their cluster through Ambari's single pane of glass for managing its configuration and services. As Open Enterprise Hadoop becomes ever more critical to enterprise data management, companies need a more proactive approach to achieving the optimal configuration.

Late last year, Hortonworks began to outline objectives for a new proactive support service that would add value for customers with:

  • Rapid collection of cluster diagnostic information
  • Concise and actionable recommendations for resource-constrained Hadoop operations staff
  • Proactive views of configuration problems before they result in cluster degradation or downtime
  • Dashboards on cluster configuration that help the ecosystem keep pace with changing cluster topologies and workloads

So Hortonworks decided to focus our efforts in areas with the biggest potential impact. This included providing tools to quickly capture cluster diagnostics and display them in one central location, both for support case resolution, and as input to an analytical service that can produce configuration-related recommendations to improve cluster performance, security, and operations.

This was the genesis of Hortonworks SmartSense, which is a collection of tools and services that help Hortonworks Data Platform’s operators quickly resolve issues, and also act on proactive recommendations that help avoid future issues.


Hortonworks SmartSense – An Insider’s View

The first step in the process is to quickly capture cluster diagnostic information. To accomplish this, we’ve created a tool called the Hortonworks Support Tool, or HST for short. HST plugs into Ambari and allows Hadoop Operators to quickly combine and display cluster diagnostic information in a single bundle that can be attached to a support case for troubleshooting, or analyzed by Hortonworks SmartSense.

Hortonworks SmartSense then analyzes that diagnostic information and produces recommended configurations affecting performance, security, and operations. For the upcoming release of Hortonworks SmartSense, we plan to deliver recommendations for the following components:

  • The Operating System
  • HDFS
  • YARN
  • MapReduce2
  • Apache Hive and Apache Tez

 


Hortonworks research shows that recommendations related to the components mentioned above will provide the most value to our support customers, because these components are the most heavily used and they cause the largest number of configuration-related support cases. In fact, our analysis has shown that across all components in HDP, 25%-30% of support cases are created when the configuration of a component has not kept pace with that component's actual use. That is when Hortonworks SmartSense recommends changes for optimization.

As customers mature in their use of Open Enterprise Hadoop, they use HDP's components in more complex ways, they add new users with different data types and workloads, and they need to update their cluster configuration to maintain optimal performance, security, and operations. Hortonworks SmartSense facilitates these updates with dynamic recommendations that evolve along with configuration calculations, best practices, field experience, and real-world operating conditions.

Once HST collects the data and analyzes the bundle, new recommendations are displayed in the Hortonworks Support Portal. Each recommendation includes:

  • A description of the recommended change, with a justification of that recommendation
  • Specific steps to follow in order to apply the recommendation. Commonly, this is done using Apache Ambari
  • A list of the components and services that are affected
  • A description of associated risks or potential side effects of implementing the recommendation
  • An indication of the hosts that will be affected by the recommendation

All of this information enables Hadoop operators to quickly evaluate the proposed recommendation and then either apply it or defer the change if the timing is not right.

Conclusion

In summary, Hortonworks SmartSense enables customers to take advantage of a new proactive service in any Hortonworks subscription that provides faster support case resolution by easily capturing log files and metrics for insight into the root causes of issues. Hortonworks SmartSense also provides proactive cluster configuration via an intelligent stream of cluster analytics and data-driven recommendations.

Our goal at Hortonworks is to continue providing the world's best support experience for Hadoop, and Hortonworks SmartSense is the next step in that journey. We have further enhanced the value of our support by making Hortonworks SmartSense part of every HDP subscription.

Learn More


Announcing Apache Pig 0.15.0


The Apache community released Apache Pig 0.15.0 last week. Although there are many new features in Apache Pig 0.15.0, we would like to highlight two major improvements:

  • Pig on Tez enhancements
  • Using Hive UDFs inside Pig

Below are some details about these important features. For the complete list of features, improvements, and bug fixes, please see the release notes.

Notable Changes

1. Pig on Tez enhancements

Scalability of Pig on Tez

Yahoo! recently deployed Pig on Tez to a production cluster and found certain issues at large scale. As a result, the Pig community has made improvements to Tez AM scalability as well as to Pig on Tez internals to address these issues.

Tez UI and Tez local mode

The community worked closely with the Tez team to get the Tez UI and Tez local mode working for Pig on Tez.

The Tez UI is now fully functional: you can view the Pig plan DAG, vertices, tasks, and task attempts both at runtime and after the job finishes.

[Screenshot: Tez UI]

Thanks to the Tez team's effort to make Tez local mode stable, we were able to migrate more than 2,000 Pig unit tests, originally designed to run in MapReduce local mode, to Tez local mode. This drastically increases the test coverage for Pig on Tez.
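For reference, switching between execution engines is done with Pig's -x flag; a quick sketch, assuming a script named myscript.pig:

# Run a script on the cluster using the Tez execution engine
pig -x tez myscript.pig

# Run the same script with Tez local mode (useful for unit tests and debugging)
pig -x tez_local myscript.pig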

Tez Grace auto-parallelism

The degree of parallelism used for processing a query has implications for both latency and cluster resource utilization. Pig on Tez tries to pick the sweet spot for the user.

Though Tez can do auto-parallelism at runtime based on the input size of each vertex, it suffers from two issues. First, auto-parallelism can only decrease parallelism, not increase it: the upstream data is already partitioned, and increasing parallelism would require repartitioning the incoming data, which is complex and which Tez has not implemented. Second, even when decreasing parallelism, Tez needs to merge smaller partitions into bigger ones, at some cost.

In this release, we developed grace auto-parallelism to alleviate this problem. The idea is that, as the DAG progresses, we can adjust the parallelism of a downstream vertex before it even starts, which lets us partition the upstream data freely. With grace auto-parallelism, Pig on Tez can run each vertex with a more accurate degree of parallelism.

2. Using Hive UDF inside Pig

We can use all types of Hive UDFs (UDF/GenericUDF/UDAF/GenericUDAF/GenericUDTF) inside Pig with the newly introduced HiveUDF, HiveUDAF, and HiveUDTF UDFs in Pig.

Here is one example:

define upper HiveUDF('upper');
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = foreach A generate upper(name);
store B into 'output';
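Hive aggregate functions follow the same pattern via HiveUDAF; here is a small, hedged sketch (the alias names and output path are arbitrary) that computes the average gpa with Hive's built-in avg:

define avg_gpa HiveUDAF('avg');
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = group A all;
C = foreach B generate avg_gpa(A.gpa);
store C into 'avg_output';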

One tested use case for HiveUDF is Hivemall; documentation on invoking Hivemall inside Pig is available on GitHub.

Learn More


Multihoming on Hadoop YARN clusters


Introduction

Multihoming is the practice of connecting a host to more than one network. It is frequently used to provide network-level fault tolerance: if hosts are able to communicate on more than one network, the failure of one network will not render the hosts inaccessible. There are other use cases for multihoming as well, including traffic segregation to isolate congestion and support for different network media optimized for different purposes. Multihoming can be physical, with multiple network interfaces; logical, with multiple IP addresses on one interface; or a combination of the two.


Configuring Hadoop YARN for Multihomed Environments

The Apache Hadoop YARN project has recently enhanced support for multihoming by completing the rollout of the bind-host parameter across the YARN suite of services, available in Apache Hadoop release 2.6 and the Hortonworks Data Platform (HDP) 2.2. The implementation is tracked at the Apache JIRA ticket YARN-1994 and complements similar functionality available in HDFS. At a low level, for every YARN server a new configuration property, "bind-host," lets an administrator control the binding argument passed to the Java socket listener and the underlying operating system, so that a YARN service can be made to listen on a different interface than it would based on its client-facing endpoint address. Typical usage is to make YARN services listen on all of the interfaces of a multihomed host by setting bind-host to the wildcard value "0.0.0.0". Please note that there is no port component to the bind-host parameter; the port a service listens on is still determined by the service's address configuration, meaning it uses the port configured for the service if there is one and falls back to the in-code default otherwise. For example, if yarn.resourcemanager.address is rm.mycluster.internal:9999 and yarn.resourcemanager.bind-host is set to 0.0.0.0, the ResourceManager will listen on all of the host's addresses on port 9999.

For convenience, bind-host parameters cover all listeners for a given daemon. For example, inserting the below into the yarn-site.xml will configure all of the ResourceManager services and web applications to listen on all interfaces, typical for a multihomed configuration:

<property>
	<name>yarn.resourcemanager.bind-host</name>
	<value>0.0.0.0</value>
</property>

Entries for other daemons will typically have the same value; here is a list of the relevant property names grouped by configuration file:

  • yarn-site.xml
    • yarn.resourcemanager.bind-host
    • yarn.nodemanager.bind-host
    • yarn.timeline-service.bind-host
  • mapred-site.xml
    • mapreduce.jobhistory.bind-host

Setting all of these values to 0.0.0.0 as in the example above will cause the core YARN and MapReduce daemons to listen on all addresses and interfaces of the hosts in the cluster.
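Put together, a typical multihomed configuration therefore contains entries like the following:

<!-- yarn-site.xml -->
<property>
	<name>yarn.resourcemanager.bind-host</name>
	<value>0.0.0.0</value>
</property>
<property>
	<name>yarn.nodemanager.bind-host</name>
	<value>0.0.0.0</value>
</property>
<property>
	<name>yarn.timeline-service.bind-host</name>
	<value>0.0.0.0</value>
</property>

<!-- mapred-site.xml -->
<property>
	<name>mapreduce.jobhistory.bind-host</name>
	<value>0.0.0.0</value>
</property>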

Client connections to a service are not aware of the bind-host configuration; they continue to connect to the service based on its address configuration. Different clients may connect to different interfaces based on name resolution and their network location, but any given client will only connect to a single interface of the service at a time.

Distributed File System’s network considerations

Processes running on nodes in a Hadoop cluster (think YARN containers) typically need access to a distributed file system such as HDFS. Strictly speaking, it is not required that the DFS be accessible from all networks the physical cluster may be listening on via multihoming. As long as all the nodes resolve the DFS endpoints using reachable addresses, they will be able to use the storage. However, clients will generally want to access the storage as well, so more often than not HDFS will be configured for multihoming if YARN is also configured for multihoming. A guide to configuring HDFS for multihomed operation is available here.

(It is recommended that the parameters dfs.client.use.datanode.hostname and dfs.datanode.use.datanode.hostname also be set to true for best results, in addition to the HDFS bind-host parameters.)
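As a sketch of the corresponding hdfs-site.xml entries (the exact set of bind-host properties should be verified against the HDFS multihoming guide linked above):

<property>
	<name>dfs.namenode.rpc-bind-host</name>
	<value>0.0.0.0</value>
</property>
<property>
	<name>dfs.namenode.servicerpc-bind-host</name>
	<value>0.0.0.0</value>
</property>
<property>
	<name>dfs.namenode.http-bind-host</name>
	<value>0.0.0.0</value>
</property>
<property>
	<name>dfs.client.use.datanode.hostname</name>
	<value>true</value>
</property>
<property>
	<name>dfs.datanode.use.datanode.hostname</name>
	<value>true</value>
</property>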

Host Name Resolution

Properly managing name resolution is essential to a successful multihomed Hadoop rollout. This management will generally occur via DNS, where different clients resolve cluster names to appropriate addresses depending on their network of origin, although hosts files may be an option for smaller clusters, ad-hoc clusters, or environments where names are managed through the automated distribution of local files. Whatever the hostname management strategy is, here are the ground rules to follow for a multihomed setup:

  • All addressing must be name-based in a multihomed cluster: all nodes should always be referred to by their hostnames during configuration and/or access, never directly by their IP addresses.
  • A host must resolve from the same hostname on all networks for all clients (even though the name may resolve to a different IP address for the host depending on what network the client is coming from).
  • For any given client, the address the host resolves to must be reachable by that client; this determines which network and interface the client will use.

It's important to note that the same rules apply to the cluster nodes themselves (among other things, they also function as clients to the services running in the cluster). Although they may resolve to several addresses for clients coming into the cluster, hosts within the cluster should resolve one another to addresses on the network intended to handle cluster traffic, for example a high-capacity network dedicated to the purpose. The name resolution of cluster nodes to one another determines which network handles the cluster's own traffic (e.g., between YARN nodes for map output and to/from HDFS).
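To make this split-horizon resolution concrete, here is a purely illustrative pair of hosts-file excerpts (the name app1.mycluster.internal is hypothetical; the 10.2/10.3 subnets match the walkthrough below):

# /etc/hosts on cluster nodes: peers resolve on the 10.2 cluster network
10.2.0.44   rm.mycluster.internal
10.2.0.55   app1.mycluster.internal

# /etc/hosts on client workstations: the same names resolve on the 10.3 client network
10.3.0.44   rm.mycluster.internal
10.3.0.55   app1.mycluster.internal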

Walkabout

Let's take an example of a two-network multihomed cluster designed to give the cluster a high-performance backplane along with a separate, dedicated network for client access.

We'll use one subnet, 10.2.x.x, for the cluster itself, and a second, 10.3.x.x, for client access. In this case the separation will be physical as well as logical: the 10.2 network will use high-bandwidth media on a dedicated physical interface on each host, and the 10.3 network will use more standard networking technology for a general-purpose network. We'll consider one job and a handful of nodes (from within a larger cluster). For simplicity, in this example we'll let the final octet of the address be the same on each network interface for the same host, although there is no requirement that this be the case (only name resolution has to remain consistent for the host across clients, networks, and interfaces; the details of the IP addresses are not important).

[Diagram: example two-network cluster with the CLIENT, APP1, APP2, DATA, and JOBHIST hosts]

  1. ResourceManager listens on 10.2.0.44 on the internal cluster network and 10.3.0.44 on the external client network
  2. A client (CLIENT) at 10.3.0.99 submits a wordcount job to the Resourcemanager, connecting to it on its 10.3.0.44 interface
  3. The ResourceManager schedules the application and starts a MapReduce ApplicationMaster on an application host, which we'll call APP1, by connecting to a NodeManager listening on the cluster network address for that host, its internal IP 10.2.0.55
  4. The ApplicationMaster connects to the ResourceManager on 10.2.0.44 and is allocated a container on its neighbor, APP2, which is contacted on 10.2.0.66 to start the map task.
  5. In the meantime, the user on CLIENT wants to see how things are going, so he/she navigates to the ResourceManager's web interface, which resolves to 10.3.0.44 for the user. He/she clicks on the application and is taken to the ApplicationMaster on APP1, reaching it via its 10.3.0.55 address
  6. The map task completes, and a reduce task is started on APP1, which pulls the data in using its 10.2.0.55 interface and writes the results back out to a node, DATA, via its internal IP 10.2.0.77
  7. At job completion the ApplicationMaster uploads its jobsummary to HDFS / DATA and JOBHIST loads the job details via its internal 10.2.0.88 address
  8. Since the job is complete, the next time the user checks on status he/she is redirected from the RM (which he/she reached on the 10.3 network) to the JobHistoryServer, JOBHIST, which he/she will connect to on its 10.3.0.88 address
  9. Finally, CLIENT pulls down the reducer output from HDFS using the second interface on DATA, 10.3.0.77, and the activity is complete

Kicking the Tires

If you have a local development environment which you use for testing purposes, you can try out multihoming easily just by using your existing network interface and the built-in localhost interface. If you configure one or more service addresses, e.g. yarn.resourcemanager.webapp.address, to your machine's network name, you'll find that the service in question will not be accessible via the localhost address 127.0.0.1 (using the port configured for the service address). If you add the bind-host parameter for the service (yarn.resourcemanager.bind-host set to 0.0.0.0) and restart, you'll find that you can now access the service via the 127.0.0.1 address as well (plus the port configured for the service).
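For example, assuming the default ResourceManager web port of 8088 and a hypothetical hostname of myhost.example.com, the difference is easy to see from the command line:

# yarn.resourcemanager.webapp.address set to myhost.example.com:8088, no bind-host configured
curl http://myhost.example.com:8088/cluster   # works
curl http://127.0.0.1:8088/cluster            # connection refused

# after setting yarn.resourcemanager.bind-host to 0.0.0.0 and restarting the ResourceManager
curl http://127.0.0.1:8088/cluster            # now works as well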

While this is not terribly useful in practice, it’s an easy way to see the feature in operation. You can also manually configure other non-conflicting logical addresses on your network interface, perhaps on a reserved subnet not in use on your network, and try it out that way. You would need to temporarily add an entry to your host file with your new address to fully vet the process. And if you have a mini-cluster of workstations, or a virtualized setup with bridging, you can configure them all and see things in operation on a small scale (they will all require host entries for one another in addition to themselves).

Taking a Test Drive

If you are considering deployment on an in-house cluster, you may want to gain some additional experience before taking the plunge. One option is to try it in the cloud: Amazon Web Services' EC2 platform includes the Virtual Private Cloud (VPC) capability. A default VPC configuration with a public network, a private network, and a gateway server is enough to get started. Deploy a few boxes with virtual interfaces configured for both networks, ensure that the hosts resolve to one another via their private network addresses, install Hadoop as you normally would, set the bind-host parameters as described above, and you now have a place to explore a multihomed YARN setup in more depth. Due to the isolation provided by VPC, you may want to spin up a couple of workstations inside the VPC to which you can remote-desktop for full exploration; otherwise you'll need to perform port pass-through of some sort from the gateway node (SSH tunneling is an option) in order to reach the web applications from a browser and so on.

Multihome and Fault Tolerance in Hadoop

Outside of Hadoop, multihoming is frequently used as a fault tolerance strategy, but for YARN its applicability to this purpose is not as strong.

Unlike many legacy or client-server systems, Hadoop availability is based on the fact that nodes have redundancy within the cluster by design. Failure of a network interface is, from the cluster’s perspective, simply a particular type of node failure.

Using multiple NICs and networks to ensure the availability of individual key hosts is not, generally, a necessity for fault tolerance in YARN. At the network level, as well, clusters tend to live inside data centers where the need for backup networks to preserve network availability between hosts is somewhat less important, and network redundancy for remote clients which may access the network over unreliable links is generally handled at the level of the externally routed network and does not tend to require multihomed Hadoop hosts as such.

That’s nice, what’s it good for?

In Hadoop YARN’s case, multihoming is a strategy useful to assure network performance and predictability for both applications and management purposes and, to a lesser extent, a tool with use in the security and fault-tolerance domains. The ability to expose a YARN cluster directly over multiple networks enables monitoring to occur on a network which cannot be saturated by cluster traffic, ensuring that failure messages will not fail to reach management stations. It also enables dedicated high-speed interfaces to be established between nodes without having to compete with other traffic, and the introduction of intrusion detection and other safeguards between a cluster and client traffic without impact to cluster traffic. For these and other reasons, many organizations already employ multihomed networking with their data management services. With the general availability of the bind-host configuration within YARN in HDP 2.2, they can now enjoy these same advantages with their YARN Hadoop services.


Better SLAs via Resource-preemption in YARN’s CapacityScheduler


Mayank Bansal, of eBay, is a guest contributing author of this collaborative blog.

This is the 4th post in a series that explores the theme of enabling diverse workloads in YARN. See the introductory post to understand the context around all the new features for diverse workloads as part of Apache Hadoop YARN in HDP.

Background

In Hadoop YARN's CapacityScheduler, resources are shared by setting capacities on a hierarchy of queues.

A queue's configured capacity ensures the minimum resources it can get from the ResourceManager. The capacities of all the leaf queues under a parent queue at any level sum to 100%. As shown below, queue-A has a 20% share of the cluster, queue-B has 30% and queue-C has 50%, which together add up to 100%.

[Figure: queue hierarchy with queue-A at 20%, queue-B at 30% and queue-C at 50%]
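As a hedged sketch, such a layout is typically expressed in capacity-scheduler.xml along these lines (assuming queues A, B and C sit directly under root):

<property>
	<name>yarn.scheduler.capacity.root.queues</name>
	<value>A,B,C</value>
</property>
<property>
	<name>yarn.scheduler.capacity.root.A.capacity</name>
	<value>20</value>
</property>
<property>
	<name>yarn.scheduler.capacity.root.B.capacity</name>
	<value>30</value>
</property>
<property>
	<name>yarn.scheduler.capacity.root.C.capacity</name>
	<value>50</value>
</property>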

To enable elasticity in a shared cluster, CapacityScheduler can allow queues to use more resources than their configured capacities when a cluster has idle resources.

Without preemption, let's say queue-A uses more resources than its configured capacity, taking advantage of the elasticity feature of CapacityScheduler. At the same time, say another queue, queue-B, currently under-satisfied, starts asking for more resources. Queue-B will now have to wait for a while before queue-A relinquishes the resources it is currently using. Because of this, applications in queue-B will see high delays, and there isn't any way we can meet the SLAs of applications submitted to queue-B in this situation.

Support for preemption is a way to respect elasticity and SLAs together: when no resources are unused in a cluster, and some under-satisfied queues ask for new resources, the cluster will take back resources from queues that are consuming more than their configured capacities.

How preemption works

All the information related to queues is tracked by the ResourceManager. A component called the PreemptionMonitor in the ResourceManager is responsible for performing preemption when needed. It runs in a separate thread, making relatively slow passes to decide when to rebalance queues. The regular scheduling itself happens in a thread different from the PreemptionMonitor.

We will now explain how preemption works in practice taking an example situation occurring in a Hadoop YARN cluster. Below is an example queue state in a cluster:

[Figure: example queue state with queues A and B over capacity and queue-C under-satisfied]

Let's say that we have three queues in the cluster with four applications running. Queues A and B have already used more resources than their configured minimum capacities. Queue-C is under-satisfied and asking for more resources. Two applications are running in queue-A and one in each of queues B and C.

We now describe the steps involved in the preemption process that tries to balance back the resource utilization across queues over time.

Preemption Step #1: Get containers to-be-preempted from over-used queues

The PreemptionMonitor checks the queues' status every X seconds and figures out how many resources need to be taken back from each queue/application so as to fulfill the needs of the under-satisfied queues. As a result, it arrives at a list of containers to be preempted, as demonstrated below:

[Figure: containers from App1 and App2 marked to be preempted]

For example, containers C5, C6, C7 from App1 and containers C2 and C3 from App2 are marked to-be-preempted. We will explore internal algorithms of how we select containers to be preempted in a separate detailed post shortly.

Preemption Step #2: Notifying ApplicationMasters so that they can take actions

Instead of immediately killing the marked containers to free resources, the PreemptionMonitor inside the ResourceManager notifies the ApplicationMasters (AMs) so that AMs can take action before the ResourceManager itself commits to a hard decision. More details on what information AMs obtain and how they can act on it are described in the section "Impact of preemption on applications" below. This gives ApplicationMasters a chance to do a second-level pass (similar to scheduling) and decide which containers to free up for the right set of resources.

Preemption Step #3: Wait before forceful termination

If, after some containers get added to the to-be-preempted list, the over-capacity queues don't shrink down (through actions from applications) to the targeted capacity within an admin-configured interval (see yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill in the Configurations section below), such containers will be forcefully killed by the ResourceManager to ensure that the SLAs of applications in under-satisfied queues are met.

Advantages of preemption

We now demonstrate the advantages of preemption. The following example scenario should give you an intuitive understanding of how preemption can achieve a balance between elasticity (making use of fallow resources) and application SLAs.

Consider a simple case. Queues Q1 and Q2 each have a configured minimum capacity of 50%, and each of them has exactly one application running: App1 in Q1 and App2 in Q2. App1 starts running first and occupies all the resources in the cluster (elasticity). Let's say that while App1 is running at full cluster capacity, a user submits App2 to queue Q2.

The following is the timeline of the above events:

  • App1 starts running at time 0
  • T1: Time of App2’s submission
  • T2: Time when App1 can complete
  • T3: Time when App2 can complete

When preemption is not enabled

 

[Figure: timeline when preemption is not enabled]

When preemption is enabled

 

[Figure: timeline when preemption is enabled]

With preemption enabled, the ResourceManager will start preempting resources shortly after App2 is submitted, irrespective of whether App1 voluntarily releases containers or not. The apps in Q1 and Q2 will now share cluster resources; each of them gets 50% of the cluster, just as they were originally promised via their minimum configured capacities. T2' is the new time at which App1 completes in this setup, and T3' is the new time at which App2 completes.

With preemption enabled, App2 usually finishes faster than it would without preemption (T3' < T3). It is possible that App1 now finishes slower (T2' > T2). This is acceptable, though, given that App1 originally got far more resources than it was guaranteed. This way the system achieves a balance between using idle capacity and promising reasonable SLAs to new applications.

Preemption across a hierarchy of queues

 So far, we have only talked about how preemption works across two queues at the same level. CapacityScheduler also supports the notion of a hierarchy of queues. Let’s see how preemption works in this context with an example.

[Figure: preemption across a hierarchy of queues]

Queues C1 and C2 are under-satisfied and start asking to reclaim their originally promised resources. Queues A and B are over-allocated. In this case, the PreemptionMonitor will preempt containers in queues A and B so that the next scheduling cycle can allocate the freed-up resources to applications in queues C1 and C2.

This way, preemption works in conjunction with the scheduling flow to make sure that resources freed up at any level in the hierarchy are given back to the right queues in the right level.

Impact of preemption on applications

As hinted before, ResourceManager sends preemption messages to ApplicationMasters in the AM-RM heartbeats. The preemption messages contain the list of containers about to be preempted – see the API record org.apache.hadoop.yarn.api.records.PreemptionMessage.

A PreemptionMessage may return two different types of ‘contracts’:

  • The first type is a StrictPreemptionContract, which contains a list of containers that will definitely be taken back. The AM cannot do anything beyond creating checkpoints of the work done by such containers.
  • The second type is a PreemptionContract, which describes the resources (and a candidate set of containers) that the ResourceManager wants back; the AM is free to satisfy it by releasing any of its containers of equivalent value, not necessarily the ones listed.

As of the writing of this post, the PreemptionMonitor in the CapacityScheduler supports only the latter.

Acting on about-to-be-preempted-containers

With the information present in the PreemptionMessage, AMs can better handle preemption events:

  • Instead of letting the RM forcefully kill the marked containers, AMs themselves can kill other containers belonging to the same application. The RM is agnostic to a specific application's life cycle, whereas the AM knows much more about the characteristics of its containers. Some containers may be more important than others from the AM's perspective; they may simply be further along in terms of their work in progress. AMs can release other 'cheaper' containers before their more "important" containers get forcefully killed.
  • Before the RM pulls the trigger, applications can checkpoint the state of the containers marked to be preempted. Using such checkpointed state, AMs can resume the work done by the killed containers once there is scope for obtaining a new wave of resources.

Handling already preempted containers

If applications in a queue do not act on the preemption messages, after a while the containers get forcefully killed by the RM. In this case, the RM will set a special exit code (ContainerExitStatus#Preempted) for such preempted containers, and AMs are notified of this in the next AM-RM heartbeat. ApplicationMasters should handle these preempted containers specially: they don't really count towards failure of the container process itself.

An example of this is the MapReduce AM (MAPREDUCE-5900, MAPREDUCE-5956), which doesn't count preempted containers towards task failures. So even if a task gets killed due to preemption a number of times, the AM will continue to launch a fresh task attempt for that task.

Configurations

To enable preemption in the CapacityScheduler, administrators have to set the following properties in the configuration file yarn-site.xml:

(1) yarn.resourcemanager.scheduler.monitor.enable

This tells the ResourceManager to enable the monitoring policies specified by the next configuration property. Set this to true to enable preemption.

(2) yarn.resourcemanager.scheduler.monitor.policies

Specifies a list of monitoring-policies that ResourceManager should load at startup if monitoring-policies are enabled through the previous property. To enable preemption, this has to be set to org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
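Putting the two properties together, the corresponding yarn-site.xml entries look like this:

<property>
	<name>yarn.resourcemanager.scheduler.monitor.enable</name>
	<value>true</value>
</property>
<property>
	<name>yarn.resourcemanager.scheduler.monitor.policies</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>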

The following is a list of advanced configuration properties that administrators can use to further tune the preemption mechanism:

(3) yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval

Time in milliseconds between successive invocations of the preemption policy, defaulting to 3000 (3 seconds). Each invocation of the preemption policy scans the queues, running applications and containers, and decides which containers will be preempted (the steps described above in the section "How preemption works").

If an admin wants to set a fast-paced preemption, together with other properties below, he/she should set it to a low value. Similarly a higher value can be set for slower preemption.

(4) yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill

Time in milliseconds between the starting time when a container is marked to-be preempted and the final time when the preemption-policy forcefully kills the container. By default, preemption-policy will wait for 15 seconds for AMs to release resources before a forceful-kill of containers.

Administrators can adjust this parameter to a smaller value if they want to reclaim resources faster, or set it to a higher value for a more gradual reclamation of resources.

(5) yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round

This option controls the pace at which containers-marked-for-preemption are actually preempted in each period (see monitoring_interval above). It is defined as a percentage of cluster-resource, defaulting to 0.1 (read as 10%). For example, if it is set to 0.1 and the total memory allocated for YARN in the cluster is 100G, at most 10G worth of containers can be preempted in each period.

This way, even if there are times during which there is a need to preempt a large share of the cluster suddenly, the process is spread out over time to smoothen the impact of preemption.

(6) yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity

Similar to total_preemption_per_round above, this configuration specifies a threshold to avoid resource thrashing and overly aggressive preemption. When configured to a value x >= 0, the RM will wait until a queue uses resources amounting to configured_capacity * (1 + x) before starting to preempt containers from it. A real-life analogy is a freeway speed limit: it may be posted as X (65) MPH, but you will usually only be considered to be speeding if you are more than Y% (say 10%) over the limit (you are advised to refer to your state's driving rules). By default, it is 0.1, which means the RM will start preemption for a queue only when the queue goes 10% above its guaranteed capacity; this avoids sharp oscillations of the allocated capacity.

(7) yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor

Similar to total_preemption_per_round, we can apply this factor to slow down resource preemption after the preemption target is computed for each queue (e.g., "give me 5GB back from queue-A"). By default, it is 0.2, meaning that at most 20% of the target capacity is preempted in a cycle.

For example, if 5GB is needed back, in the first cycle preemption takes back 1GB (20% of 5GB), 0.8GB (20% of the remaining 4GB) in the next, 0.64GB (20% of the remaining 3.2GB) after that, and so on, a sort of geometrically smoothed reclamation. You can increase this value to make resource reclamation faster.
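For convenience, here is a sketch of all the advanced knobs together, set to the default values described above (15 seconds expressed as 15000 milliseconds):

<property>
	<name>yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval</name>
	<value>3000</value>
</property>
<property>
	<name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
	<value>15000</value>
</property>
<property>
	<name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
	<value>0.1</value>
</property>
<property>
	<name>yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity</name>
	<value>0.1</value>
</property>
<property>
	<name>yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor</name>
	<value>0.2</value>
</property>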

Conclusion & future work

In this post, we have given an overview of CapacityScheduler’s preemption mechanism, how it works and how user-land applications interact with the preemption process.

For more details, please see the Apache JIRA ticket YARN-45 (YARN preemption support umbrella). We’re also continuously improving preemption support in YARN’s CapacityScheduler. In the near future, we plan to add:

  • Support preemption of containers within a queue while also respecting user-limit (YARN-2069)
  • Support preemption of containers while also respecting node labels (YARN-2498)

More efforts are in progress. Given that preemption is tied to the core of scheduling, any new scheduling functionality potentially needs corresponding changes to the preemption algorithm, so improving preemption is always an ongoing effort!

Acknowledgements

 We would like to thank all those who contributed patches to CapacityScheduler’s preemption support: Carlo Curino, Chris Douglas, Eric Payne, Jian He and Sunil G (besides the authors of this post). Thanks also to Alejandro Abdelnur, Arun C Murthy, Bikas Saha, Karthik Kambatla, Sandy Ryza, Thomas Graves, Vinod Kumar Vavilapalli for their help with reviews! And thanks to Tassapol Athiapinya from Hortonworks who helped with testing the feature in great detail.


Easy Steps to Create Hadoop Cluster on Microsoft Azure


In his blog, Tim Hall wrote, “Enterprises are embracing Apache Hadoop to enable their modern data architectures and power new analytic applications. The freedom to choose the on-premises or cloud environments for Hadoop that best meets the business needs is a critical requirement.”

One of the choices for deploying Hadoop in a cloud environment is Microsoft Azure using Cloudbreak. Other choices include Google Cloud Platform, OpenStack, and AWS.

In this blog, I'll show how you can deploy Hadoop on Azure with a few clicks by running a multi-node HDP cluster on Azure Linux VMs using Cloudbreak.

Microsoft Azure

Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed datacenters.

Cloudbreak is a RESTful Hadoop as a Service API. Once Cloudbreak is deployed in your favorite servlet container, it exposes a REST API, allowing provisioning of Hadoop clusters of arbitrary sizes on your selected cloud provider.

Provisioning Hadoop has never been easier. Cloudbreak is built on the foundation of cloud providers' APIs (Microsoft Azure, Amazon AWS, Google Cloud Platform, OpenStack), Apache Ambari, Docker containers, Swarm and Consul.

Prerequisites

Before you get started, you must set up two accounts and understand Ambari Blueprints:

  • Setup Azure account (trial account)
  • Setup Cloudbreak account (free account)
  • Understand Ambari Blueprints

Action on Azure Portal

First, log in to your Azure portal and create a network manually.

Then create an X.509 certificate with a 2048-bit RSA key pair by running the command shown below on your local machine. You can choose the file names as you like.

openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout azuretest.key -out azuretest.pem

As in the example shown below, accept the default values at the prompts.

[Screenshot: openssl certificate creation prompts]

In the directory where you executed the openssl command, you will see two files created as listed below.

-rw-r--r--   1 nsabharwal  staff         1346 May  7 17:00 azuretest.pem -> We need this file to create credentials in Cloudbreak.

-rw-r--r--   1 nsabharwal  staff         1679 May  7 17:00 azuretest.key -> We need this file to log in to the hosts after cluster deployment.

To avoid bad permission and security compliance errors, chmod the key file as shown below:

chmod 400 azuretest.key

For example: ssh -i azuretest.key ubuntu@IP/FQDN

You may face an issue where using the .key file prompts for a passphrase. In this case, you need to check your openssl version.

Details

Check your openssl version:

openssl version
OpenSSL 0.9.8zc 15 Oct 2014

Newer versions of openssl create keys with the header

-----BEGIN PRIVATE KEY-----

while older versions create RSA keys (which is what we need here):

-----BEGIN RSA PRIVATE KEY-----

If your key has the newer header, convert it and use azuretest_login.key to log in:

openssl rsa -in azuretest.key -out azuretest_login.key

Action on Cloudbreak

Log in to the Cloudbreak portal and create an Azure credential. Once you fill in the information and hit create credential, you will get a file from Cloudbreak that needs to be uploaded to the Azure portal.


 

Save the file as azuretest.cert on your local machine.

Uploading the Certificate to Azure

Log in to the Azure portal (switch to classic mode in case you are using the new portal).


Click Settings -> Manage Certificates, then upload the certificate using the option at the bottom of the screen.


There are two more actions that you must perform before creating your cluster on Azure.

In Cloudbreak

1) Create a template

You can change the instance type and volume type as per your setup.


2) Create an Ambari Blueprint: you can grab sample Blueprints here (you may have to reformat the Blueprint in case there is any issue).

After successfully creating a template and a Blueprint in Cloudbreak, you are ready to deploy a cluster that reflects your Blueprint.

Deploying Cluster

From the Cloudbreak UI, select the credential and hit create cluster.


 


 

  1. Open the create cluster window.
  2. Provide your cluster name and select the Region and Blueprint (you can choose the default network).
  3. Choose templates for cbgateway, master and slaves. You can define the number of nodes in this section.
  4. Click create and start cluster.


 

Handy Docker commands

Log in to your host, where you can try some of these commands.

To log in (get the FQDN from the Azure portal):

ssh -i azuretest.key ubuntu@fqdn or ssh -i azuretest.key cloudbreak@fqdn

To switch to the root user:

sudo su -

To list the running containers:

docker ps

To get a shell inside a container:

docker exec -it <container id> bash

[root@azuretest ~]# docker ps

CONTAINER ID        IMAGE                                               COMMAND               CREATED             STATUS              PORTS               NAMES

f493922cd629        sequenceiq/docker-consul-watch-plugn:1.7.0-consul   "/start.sh"           2 hours ago         Up 2 hours                              consul-watch

100e7c0b6d3d        sequenceiq/ambari:2.0.0-consul                      "/start-agent"        2 hours ago         Up 2 hours                              ambari-agent

d05b85859031        sequenceiq/consul:v0.4.1.ptr                        "/bin/start -adverti  2 hours ago         Up 2 hours                              consul

[root@test~]# docker exec -it 100e7c0b6d3d bash

Learn More



Announcing Apache Ranger 0.5.0


As YARN drives Hadoop's emergence as a business-critical data platform, the enterprise requires more stringent data security capabilities. Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides a platform for centralized security policy administration across the core enterprise security requirements of authorization, audit and data protection.

On June 10th, the community announced the release of Apache Ranger 0.5.0. With this release, the community took major steps to extend security coverage for the Hadoop platform and deepen its existing security capabilities. Apache Ranger 0.5.0 addresses over 194 JIRA issues and delivers many new features, fixes and enhancements. Among these improvements, the following features are notable:

  • Centralized administration, authorization and auditing for Solr, Kafka and YARN
  • Apache Ranger key management store (KMS)
  • Hooks for dynamic policy conditions
  • Metadata protection in Hive
  • Support queries for audit data stored in HDFS using Solr
  • Optimization of auditing at source
  • Pluggable architecture for Apache Ranger (Ranger Stacks)

This blog provides an overview of the new features and how they integrate with other Hadoop services, as well as a preview of the focus areas that the community has planned for upcoming releases.

Centralized Administration, Authorization and Auditing for Solr, Kafka and YARN

Administrators can now use Apache Ranger's centralized platform to manage access policies for Solr (collection level), Kafka (topic level) and YARN (CapacityScheduler queues). These centralized authorization and auditing capabilities add to what was previously available for HDFS, HBase, Hive, Knox and Storm. As a precursor to this release, the Hortonworks security team worked closely with the community to build authentication support (Kerberos) and authorization APIs in Apache Solr and Apache Kafka.

Administrators can now apply security policies to protect topics in Kafka and ensure that only authorized users are able to publish to or consume from a Kafka topic. Similarly, Ranger can be used to control query access at the Solr collection level, ensuring that sensitive data in Apache Solr is secured in production environments. Apache Ranger's integration with the YARN ResourceManager enables administrators to control which applications can be submitted to a queue and prevent rogue applications from using YARN.

Apache Ranger Key Management Store (KMS)

In this release, HDP takes a major step forward in meeting enterprises' requirements for security and compliance by introducing transparent data encryption for HDFS files, combined with the open source Hadoop KMS embedded in Ranger. Ranger now gives security administrators the ability to manage keys and authorization policies for the KMS.

This encryption feature in HDFS, combined with KMS access policies maintained by Ranger, prevents rogue Linux or Hadoop administrators from accessing data and supports segregation of duties for both data access and encryption. You can find more details on TDE in this blog.

Hooks for dynamic policy conditions

As enterprises' Hadoop deployments mature, there is a need to move from static role-based access control to access based on dynamic rules. An example would be to provide access based on time of day (9am to 5pm), geography (access only if logged in from a particular location), or even data values.

In Apache Ranger 0.5.0, the community took the first step towards a true ABAC (attribute-based access control) model by introducing hooks to manage dynamic policies, thereby providing a framework for users to control access based on dynamic rules. Users can now specify their own conditions and rules (similar to a UDF) as part of service definitions, and these conditions can vary by service (HDFS, Hive, etc.). In the future, based on community feedback, Apache Ranger might include some of these conditions out of the box.

Metadata Protection in Hive

Apache Ranger 0.5.0 provides the ability to protect metadata listing in Hive based on the underlying permissions. This functionality is especially relevant for multi-tenant environments where users should not be able to view other tenants' metadata (tables, columns).

The following commands related to Hive metadata will now return information based only on the user's privileges:

  • Show Databases
  • Show Tables
  • Describe table
  • Show Columns

Support Queries for Audit Data Using Solr

Currently, the Apache Ranger UI provides the ability to perform interactive queries against audit data stored in an RDBMS. In this release, we are introducing support for storing and querying audit data in Solr. This functionality removes the dependency on a database for auditing and gives users visibility into the audit data in Solr using dashboards built on the Banana UI. We recommend that users enable audit writing to both Solr and HDFS, and purge the data in Solr at regular intervals.

Optimization of Auditing at Source

Auditing all events or jobs in Hadoop generates a high volume of audit data. Apache Ranger 0.5.0 provides the ability to summarize audit data at the source for a given time period, by user, resource accessed and action, thereby reducing the audit data volume and noise, and the impact on underlying storage, for improved performance.

Pluggable Architecture for Apache Ranger (Ranger Stacks)

As part of this release, the Ranger community worked extensively to revamp the Apache Ranger architecture. As a result of this effort, Apache Ranger 0.5.0 now provides a pluggable architecture for policy administration and enforcement. Using a “single pane of glass,” end-users can configure and manage their security across all components of their Hadoop stack and extend it to their entire big data environment.

Apache Ranger 0.5.0 enables customers and partners to easily add a new “service” to support a new component or data engine. Each service is defined and configured in JSON.

Users can create a custom service as a plug-in to any data store and build and manage services centrally for their big data BI applications.

Preview of Features to Come

The Apache Ranger release would not have been possible without contributions from the dedicated community members who have done a great job understanding the needs of the user community and delivering on them. Based on demand from the user community, we will continue to focus our efforts in three primary areas:

  • Global data classification, “tags,” based security policies
  • Expanding encryption support to HBase and Hive
  • Ease of installation and use, through better Apache Ambari integration

Read the Apache Ranger 0.5.0 Release Notes for the complete list of changes.

The post Announcing Apache Ranger 0.5.0 appeared first on Hortonworks.


Hotels.com Announces CORC 1.0.0


Hortonworks is always pleased to see new contributions come into the open-source community. We worked with our customer, Hotels.com, to help them develop libraries and utilities around Apache Hive, the Apache ORC format and Cascading. It’s great to see the results released for the community. In this guest blog, Adrian Woodhead, Big Data Engineering Team Lead at Hotels.com, discusses the CORC project.

Hotels.com is pleased to announce the open source release of Corc, a library for reading and writing files in the Apache ORC file format using Cascading. Corc provides a Cascading Scheme and various other classes that allow developers to access the full range of unique features provided by ORC from within Cascading applications. Corc is freely available on GitHub under the Apache 2.0 license.

Notable features

Column projection

The ability to read only the columns required by a job (as opposed to reading in all data and subsequently filtering out any unneeded columns) is a key feature of the ORC file format. Corc exposes this functionality to Cascading jobs so that a subset of Fields can be passed into a Tap and then only the respective columns on HDFS will be read. This can lead to significant performance improvements in Cascading applications as the amount of data read from HDFS is reduced.
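
As a rough illustration of column projection, the sketch below wires a Corc scheme into a Cascading tap so that only two columns are read. Cascading's Fields and Hfs are standard classes, but the OrcFile scheme name and its builder methods are assumptions about Corc's API rather than verified signatures, and the field names and HDFS path are purely illustrative.

import cascading.tap.hadoop.Hfs
import cascading.tuple.Fields
import com.hotels.corc.cascading.OrcFile  // assumed package and class name

// Declare only the columns the job needs; Corc reads just these from the ORC files.
val projectedFields = new Fields("user_id", "event_time")

// Assumed builder-style construction of the ORC scheme with column projection enabled.
val scheme = OrcFile.source().declaredFields(projectedFields).schemaFromFile().build()

// Wrap the scheme in a Cascading tap pointing at the (illustrative) ORC data on HDFS,
// then wire the tap into a pipe assembly as usual.
val tap = new Hfs(scheme, "/data/events/orc")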

Full support for ORC types

Corc provides the ability to read and write the full set of types supported by ORC and maps them to the standard Java types used by Cascading. Types supported include: STRING, BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, TIMESTAMP, DATE, BINARY, CHAR, VARCHAR, DECIMAL, ARRAY, MAP, STRUCT, and UNION. This allows Cascading applications to take advantage of ORC’s self-describing nature, indexes, and column encoding optimisations. Corc also provides an extension point so that these mappings can be customised.

Predicate pushdown

Corc provides the ability to access ORC’s underlying predicate pushdown functionality. This provides Cascading applications with the ability to skip stripes of data that do not contain pertinent values by supplying criteria to determine what data can be skipped. This in turn can lead to performance gains.

ACID data set support

Corc supports the reading of ACID datasets that underpin transactional Hive tables. For this to work effectively, you must provide your own lock management and coordinate with the Hive metastore. We intend to make this functionality available via changes to the cascading-hive project in the near future.

What’s Next

We aim to closely follow future developments in the ORC file format and expose new features as they are released. We will also closely monitor the upcoming 3.0.0 release of Cascading and ensure Corc can be used with this soon after it is released. We also intend to continue work on adding ACID support to Corc and related Cascading projects so that Cascading applications can seamlessly read and write data using Hive transactions.

Learn more

The post Hotels.com Announces CORC 1.0.0 appeared first on Hortonworks.

Announcing Apache Falcon 0.6.1


Early this year, Apache™ Falcon™ became a Top Level Project (TLP) in the Apache Software Foundation.

The project continues to mature as a framework for simplifying and orchestrating data lifecycle management in Hadoop by offering out-of-the-box data management policies. The Apache Falcon 0.6.1 release builds on this foundation by providing simplified mirroring functionality and a new user interface (UI).

The community worked very diligently to offer more than 150 product enhancements and over 30 new features and improvements. Among these improvements, the following stand out as particularly important:

  • Intuitive Web-based User Interface
  • Replication of Hive Assets while Preserving Metadata
  • Simplified forms driven UI to create HDFS and Hive Replication

Improved Web-based User Interface

Falcon Views surfaces rich API functionality in an intuitive, streamlined web interface for creating and managing feed, process, cluster and mirror entities and their instances. This release also features integrated search and lineage capabilities. Apache Falcon 0.6.1 removes the need for users to write XML scripts to create entities for feeds, processes and clusters. The forms-driven management UI introduced in this release greatly improves user productivity and reduces errors. The UI also offers an interactive search interface with a domain specific language (DSL).


Replication of Hive Assets While Preserving Metadata

Apache Falcon 0.6.1 now enables complete replication of Hive assets while preserving metadata, such as views, annotations and user roles. Starting with a bootstrap process to set the baseline, Falcon orchestrates orderly replication of transactions, in the proper sequence, to the target.

Simplified forms driven UI to create HDFS and Hive Replication

Apache Falcon 0.6.1 features a simplified forms-driven UI to create HDFS and Hive replication, enhancing productivity and accuracy.

Preview of Features to Come

The Apache Falcon release would not have been possible without contributions from the dedicated and talented community members who have done a great job understanding the needs of the user community and delivering on them. Based on demand from the user community, we will continue to focus our efforts in three primary areas:

  • Integration with Apache Atlas
  • Smarter search capabilities
  • Improved dashboards

Downloads

The post Announcing Apache Falcon 0.6.1 appeared first on Hortonworks.

Announcing Hortonworks Gallery


Drink from Elephant’s Well Of Knowledge

Developer success starts with open and reusable code, and a community that allows for both consumption of code and contribution of updates to the code base. This success engenders a thriving and evolving community.

To that end, today we are announcing the Hortonworks Gallery for developers. Located on GitHub, the Gallery brings together Hortonworks’ Apache Hadoop code, Apache Ambari Views and extensions, as well as related resources into a single view for developers to use within the familiar context of Git and open source software.

The Hortonworks Gallery brings together all of the code, tutorials and sample apps that help new and experienced Hadoop developers reduce the time to success, whether it’s learning about Hadoop, Spark, Storm or other HDP components, or delivering apps for Internet of Things, predictive analytics or data warehousing solutions. Additionally, over the coming weeks we will be moving the source files for our Hortonworks Tutorials into GitHub for individuals and organizations to use, modify, and enhance under a Creative Commons license.

Hortonworks has been investing the resources of our team into creating a complete set of learning resources for getting started on HDP. By releasing these under the CC license on the Hortonworks Gallery, our goal is for the community and academia to build an even better and broader set of learning around Hadoop. And in the spirit of open collaboration, we encourage pull requests!

Share Elephant’s Wealth of Knowledge

Hortonworks Gallery GitHub projects are organized at http://hortonworks-gallery.github.io. As we add new projects, tutorials, sample applications and code, they will be added to the Hortonworks Gallery repo, all to make it easier to discover, clone, and consume the content in your own projects.

The Gallery is the latest addition to the rich set of Hortonworks resources for developers.

I want to thank all members of the team who contributed to this project.

For the latest information on our Hadoop resources for developers, visit http://developer.hortonworks.com.

The post Announcing Hortonworks Gallery appeared first on Hortonworks.

Available Now: HDP 2.3


We are very pleased to announce that Hortonworks Data Platform (HDP) Version 2.3 is now generally available for download. HDP 2.3 brings numerous enhancements across all elements of the platform spanning data access to security to governance. This version delivers a compelling new user experience, making it easier than ever before to “do Hadoop” and deliver transformational business outcomes with Open Enterprise Hadoop.

As we announced at Hadoop Summit in San Jose, there are a number of significant innovations as part of this release including:

HDP 2.3 represents the very latest innovation from across the Hadoop ecosystem. Literally, hundreds of developers have been collaborating with us to evolve each of the individual Apache Software Foundation (ASF) projects from the broader Apache Hadoop ecosystem. The various project teams have coalesced these new facets into a comprehensive and open Hortonworks Data Platform (HDP), delivering both new features and closing out a wide variety of issues across Apache Hadoop and its related projects.

In conjunction with the HDP 2.3 general availability, Apache Ambari 2.1 is now also generally available. Aside from delivering a breakthrough configuration and customization experience, Ambari 2.1 includes support for installing, managing and monitoring Apache Accumulo and Apache Atlas, along with expanded high-availability support for Apache Ranger and Apache Storm.

Here is the up-to-date view of all components and versions that comprise HDP 2.3:

[Table: HDP 2.3 components and versions]

Thank you to everyone within and across the open source community who worked to deliver the staggering amount of innovation contained within these projects!

Delivering Transformational Outcomes

This release is a big step forward, and we’re excited that more and more companies are transforming their businesses using HDP’s unique capabilities. While many early adopters were drawn to Hadoop based on its ability to process and store data cost-effectively at scale, the continued innovation within the Hadoop ecosystem that makes up HDP now delivers so much more than simple cost savings through the use of commodity hardware and descriptive reporting. HDP is being used in an increasing number of mission-critical environments and is fueling entirely new businesses based on analytics and data. Every day, HDP powers these new businesses, as Jim Walker, vice president of marketing at EverString, attests:

For EverString, HDP serves as the backbone of our Predictive Marketing business. Our company is the realization of an entire business fundamentally built on HDP, not simply an application on Hadoop. Our customers rely on us to deliver the true value of Hadoop as a service, and our success is predicated on the reliability of Hortonworks and enterprise readiness of HDP.

What’s Next?

This blog is first in a series of three posts. Look for the next two posts this week as we explore all the new capabilities of HDP 2.3.

Learn More

The post Available Now: HDP 2.3 appeared first on Hortonworks.

Introducing Availability of HDP 2.3 – Part 2


On July 22nd, we introduced the general availability of HDP 2.3. In part 2 of this blog series, we explore notable improvements and features related to Data Access.

We are especially excited about what these data access improvements mean for our Hortonworks subscribers.

Russell Foltz-Smith, Vice President of Data Platform at TrueCar, summed up the data access impact on his business using earlier versions of HDP, and his enthusiasm for the innovation in this latest release:

“TrueCar is in the business of providing truth and transparency to all the parties in the car-buying process,” said Foltz-Smith. “With Hortonworks Data Platform, we went from being able to report on 20 terabytes of vehicle data once a day to doing the same every thirty minutes–even as the data grew to more than 600 terabytes. We’re excited about HDP 2.3.”

Register for Webinar on HDP 2.3

SQL on Hadoop

SQL is the Hadoop user community’s most popular way to access data, and Apache Hive is the de facto standard for SQL on Hadoop. I spoke with many of our customers at Hadoop Summit in San Jose, and a recurring theme emerged. They asked us to push harder toward SQL:2011 analytics compliance.

While we started with HiveQL, a subset of the functions available within ANSI standard SQL, the request clearly highlights the need to improve the breadth of SQL semantics available to Hive.

In fact, one of the more satisfying, if not surprising, comments that we heard had to do with performance. We are hearing that the performance improvements made over the past few years through the Stinger Initiative have made such a significant difference that additional performance boosts can wait until the SQL breadth is improved.

As organizations move to use Hive & Hadoop, they do not want to perform a “SQL rewrite” for existing applications being ported onto Hadoop. The effort to reshape queries and re-test them is expensive. With that in mind, Apache Hive 1.2 was released in late May, and with HDP 2.3 it further simplifies SQL development on Hadoop with a number of new SQL features.

In addition, Hive continues to become more reliable and more scalable with:

  • Cross-Datacenter Replication powered by a direct integration between Hive and Falcon (FALCON-1188)
  • Grace Hash Join (HIVE-9277) lets you use high-performance, memory intensive hash joins without needing to worry about queries crashing due to running out of memory.
  • And the new Vectorized Hash Join (HIVE-10072) improves hash join performance up to 5x.

But, most importantly, we’ve added a number of new tools to make Hive easier to use, deploy and administer:

  • Hive Guided Configs in Ambari 2.1 simplify setting up and tuning Hive.
  • The Hive View lets you develop and run queries directly in your web browser and…
  • The integrated Tez Debugging View gives you detailed insight into jobs, helping you optimize and tune queries.

We will continue our focus on SQL breadth to help customers ease the transition of their existing analytic applications onto HDP and to make that transition as simple as possible.

Spark 1.3.1

HDP 2.3 includes support for Apache Spark 1.3.1. The Spark community continues to innovate at an extraordinarily rapid pace. Given our leadership in Open Enterprise Hadoop, we are eager to provide our customers with the latest and most stable versions of the various Apache projects that make up HDP.

We focused the bulk of our testing on Spark 1.3.1 to ensure its features and capabilities provide the best experience on Apache Hadoop YARN. The Spark community released Spark 1.4.1 just last week. While it provides additional capabilities and improvements, we plan to test 1.4.1 to harden it and fix any issues before we graduate the technical preview version of Spark to GA with inclusion in HDP.

Some of the new features of Spark 1.3.1 release are:

  • DataFrame API (Tech Preview)
  • ML Pipeline API in python
  • Direct Kafka support in Spark Streaming (see the sketch after this list)
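
To make the direct Kafka integration listed above concrete, here is a minimal Scala sketch. It assumes a spark-shell session (so sc already exists), the spark-streaming-kafka artifact on the classpath, and an illustrative broker list and topic name; none of these specifics come from the release notes themselves.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:6667,broker2:6667")  // assumed brokers
val topics = Set("clickstream")                                               // assumed topic

// createDirectStream reads offsets directly from Kafka, without a receiver or write-ahead log.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()  // count messages in each 10-second batch
ssc.start()
ssc.awaitTermination()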

Spark is a great tool for Data Science. It provides data-parallel machine learning (ML) libraries and an ML pipeline API that make machine learning across all of the data easier and deliver insights faster.
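
As a brief illustration, here is a minimal sketch of the Scala ML Pipeline API (the Python bindings called out above are analogous). It assumes a spark-shell session where sc and sqlContext already exist; the toy documents and labels below are purely illustrative.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A tiny labeled training set, built as a DataFrame with "text" and "label" columns.
val training = sqlContext.createDataFrame(Seq(
  (0L, "spark hadoop yarn", 1.0),
  (1L, "cats and dogs", 0.0),
  (2L, "hive on tez", 1.0),
  (3L, "cooking recipes", 0.0)
)).toDF("id", "text", "label")

// Chain feature extraction and a classifier into a single reusable pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fit the whole pipeline in one call and score documents with the resulting model.
val model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()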

We also plan to provide a Notebook experience to make data science easier and more intuitive.

Recently we worked with Databricks to deliver full ORC support with Spark 1.4, and for the foreseeable future we plan to focus on contributing within the Spark community to enhance its YARN integration, security, operational experience, and machine learning capabilities. It is certainly a very exciting time for Spark and the community as a whole!

Stream Processing

As more devices and sensors join the Internet of Things (IoT), they emit growing streams of data in real time. The need to analyze this data drives adoption of Apache Storm as the distributed stream processing engine. HDP is an excellent platform for IoT — for storing, analyzing and enriching real-time data. Hortonworks is eager to help customers adopt HDP for their IoT use cases, and we made a big effort in this release to increase the enterprise readiness of both Apache Storm and Apache Kafka.

Further, we simplified the developer experience by expanding connectivity of other sources of data, including support for data coming from Apache Flume. Storm 0.10.0 is a significant step forward.

Here is a brief summary of all the stream processing improvements:

  • Enterprise Readiness: Security & Operations
    • Security
      • Addressing Authentication and Authorization for Kafka — including integration with Apache Ranger (KAFKA-1682)
      • User Impersonation when submitting a Storm topology (STORM-741)
      • SSL support for Storm user interface, log viewer, and DRPC (Distributed Remote Procedure Call) (STORM-721)
    • Operations
      • Foundation for rolling upgrades with Storm (STORM-634)
      • Easier deployment of Storm topologies with Flux (STORM-561)
    • Simplification
      • Declarative Storm topology wiring with Flux (STORM-561)
      • Reduced dependency conflicts when submitting a Storm topology (STORM-848)
      • Partial Key Groupings (STORM-637)
      • Expanded connectivity:
        • Microsoft Azure Event Hubs Integration — working in conjunction with Microsoft and a solid demonstration of our continued partnership (STORM-583)
        • Redis Support (STORM-609, STORM-849)
        • JDBC/RDBMS integration (STORM-616)
        • Kafka-Flume integration (FLUME-2242)

Twitter recently announced the Heron project, which claims to provide substantial performance improvements while maintaining 100% API compatibility with Storm. The Heron project is based on Twitter’s private fork of Storm prior to Storm being contributed to Apache and before Storm’s underlying Netty-based transport was introduced.

The key point here is that the new transport layer has delivered dramatic performance improvements over the previous 0mq-based transport. The corresponding Heron research paper provides additional details regarding other architectural improvements made, but the fact that Twitter chose to maintain API compatibility with Storm is a testament to the power and flexibility of that API. Twitter has also expressed a desire to share their experiences and work with the Apache Storm community.

A number of concepts expressed in the Heron paper were already in the implementation stage within the Storm community even before it was published, and we look forward to working with Twitter to bring those and other improvements to Storm. We are also eager to continue our collaboration with Yahoo! for Storm at extreme scale.

While the 0.10.0 release of Storm is an important milestone in the evolution of Apache Storm, the Storm community is actively working on new improvements, both near and long term, continuously exploring the realm of the possible, and helping to accelerate a wide variety of IoT use cases being requested by our customers.

Systems of Engagement that Scale

The concept of Systems of Engagement has been attributed to author Geoffrey Moore. Traditional IT systems have mostly been Systems of Record that log transactions and provide the authoritative source for information. In these kinds of systems, the primary focus is on the business process and not the people involved. As a result, analytics becomes an afterthought that describes and summarizes the transactions and processes into neat reports labeled “Business Intelligence”.

In contrast to Systems of Record, Systems of Engagement are focused on people and their goal is to bring the analytics to the forefront — moving business intelligence from the back-office & descriptive mode into proactive, predictive, and ultimately prescriptive models.


The constantly-connected world powered by the web, mobile and social data has changed how customers expect to interact with businesses. Now they demand interactions that are relevant and personal. To meet this expectation, IT must move beyond the classic Systems of Record that store only business transactions and evolve into the emerging Systems of Engagement that understand users and are capable of delivering a context-rich and personalized experience.

Successful Systems of Engagement are those that manage to combine the massive volumes of customer interaction data with deep and diverse analytics. This allows Systems of Engagement to build customer profiles and give users an experience tailored to their needs through personalized recommendations. Of course, that means that Systems of Engagement must scale!

Hortonworks Data Platform gives developers the power to build scalable Systems of Engagement by combining limitless storage, deep analytics and real-time access in one integrated whole, rather than forcing developers to stitch these pieces together by hand.


Of course, all of this starts with HDFS as a massively-scalable data store. On this foundation a wide diversity of analytical solutions has been built, from Hive to Spark to Storm and many more.

Finally, applications need a way to get data out of Hadoop in real time, in a highly available way. For this, we have Apache HBase and Apache Phoenix, which allow data to be read from Hadoop in milliseconds using a choice of NoSQL or SQL interfaces.

HBase development continues to focus on the key attributes of scalability, reliability and performance. Notable new additions in HDP 2.3 include:

  • Upgraded to Apache HBase 1.1.1.
  • API Stability: HBase 1.0+ stabilizes APIs and guarantees compatibility with future releases.
  • Performance: Multi-WAL substantially improves HBase write performance.
  • Multi-Tenancy: Provision one cluster for multiple apps with multiple queues and IPC throttling controls.
  • More reliable cluster scale-out.

Apache Phoenix is an ANSI SQL layer on HBase, which makes developing big data applications much easier. With Phoenix, complex logic like joins is handled for you, and performance is improved by pushing processing to the server. Having a real SQL interface is a key advantage that HBase has over other scalable database options.

Apache Phoenix continues to improve rapidly:

  • Upgraded to Apache Phoenix 4.4.
  • Increased SQL Support: UNION ALL, Correlated Subqueries, Date/Time Functions further simplify application development.
  • Phoenix / Spark connector: Lets you seamlessly integrate advanced analytics with data stored in Phoenix (see the sketch after this list).
  • Custom UDFs: So you can embed your custom business logic in SQL queries.
  • Phoenix Query Server: Lets you query Phoenix from non-Java environments like .NET using a simple web-based protocol.
  • Query Tracing
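
As a rough sketch of the Phoenix / Spark connector mentioned above, the snippet below loads part of a Phoenix table into a Spark DataFrame. It assumes a spark-shell session with the phoenix-spark jar on the classpath; the phoenixTableAsDataFrame call, its parameters, the table name and the ZooKeeper URL are assumptions about the connector's API rather than verified details.

import org.apache.phoenix.spark._

// Load two columns of an (assumed) Phoenix table into a DataFrame via the ZooKeeper quorum.
val stats = sqlContext.phoenixTableAsDataFrame(
  "WEB_STAT",                      // assumed Phoenix table name
  Seq("HOST", "ACTIVE_VISITOR"),   // columns to project
  zkUrl = Some("zkhost:2181"))     // assumed ZooKeeper quorum

// From here, ordinary Spark SQL operations apply.
stats.groupBy("HOST").count().show()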

HBase is also unique in that it is a true community-driven open source database, and in 2015 we continue to see a vibrant and robust community of innovation in both HBase and Phoenix. In addition to strong contributions from Hadoop vendors, we’ve seen tremendous community contributions from companies such as:

  • Bloomberg
  • Cask
  • eBay
  • Facebook
  • Intel
  • Interset
  • Salesforce
  • Xiaomi
  • Yahoo!

We at Hortonworks thank everyone who contributes to making HBase and Phoenix great.

HDP Search

More and more customers are asking about search with Hadoop, and search is becoming a critical part of a number of our customer deployments. We see HDP Search being deployed in conjunction with HBase and Storm with increasing frequency. In HDP 2.3, HDP Search is powered by Solr 5.2.

Recent security authorization work allows Ranger to protect Solr collections, and authentication enhancements mean Solr now works seamlessly on a Kerberized cluster. Other critically important optimization work was completed as well, including allowing administrators to define the HDFS replication factor. Previously the index size was effectively 2x larger; now, through additional rules that can be defined, replica creation at the shard and collection level can be controlled as desired. In addition, query results are returned nearly twice as fast as with Solr 4.x.

As customer demand for HDP Search increases, so does the need for ease of use, enterprise readiness, and simplification. This release has pushed forward on all these fronts. We want to thank our partners at Lucidworks for the close collaboration and engagement on these innovations.

Final Thoughts on Data Access

As you can see, there has been a tremendous amount of work that has gone into each of these areas over the past six to eight months. The arrival of all these capabilities broadens the ability for organizations to build new, unique and compelling applications on top of HDP — with YARN at its core. We are truly excited by the possibilities and very thankful for all the contributions from the Apache community that fuel this innovation.

Learn More:

The post Introducing Availability of HDP 2.3 – Part 2 appeared first on Hortonworks.

Running Operational Applications (OLTP) on Hadoop using Splice Machine and Hortonworks


On August 4th at 10:00 am PST, Eric Thorsen, General Manager Retail/CP at Hortonworks and Krishnan Parasuraman, VP Business Development at Splice Machine, will be talking about how Hadoop can be leveraged as a scale-out relational database to be the System of Record and power mission critical applications.

In this blog, they provide answers to some of the most frequently asked questions they have heard on the topic.

Register Now

  1. Hadoop is primarily known for running batch-based, analytic workloads. Is it ready to support real-time, operational and transactional applications?

Although Hadoop’s heritage and initial success were in batch-based applications and analytic workloads, today the platform has evolved to support real-time, highly interactive applications. The introduction of HBase into the ecosystem enabled real-time, incremental writes on top of the immutable Hadoop file system. With Splice Machine, companies can now support ACID transactions on top of data resident in HBase. As a full-featured Hadoop RDBMS that supports ANSI standard SQL, secondary indexes, constraints, complex joins and highly concurrent transactions, the Splice Machine database and the Hortonworks Data Platform enable enterprises to power real-time OLTP applications and analytics, especially as they approach big data scale.

  2. How can enterprises, specifically in the Retail industry, take advantage of a Hadoop RDBMS?

With an increasing number of channels and customer interactions across each one of them, retailers are looking for opportunities to better harness this data to drive real-time decision-making – be it in personalizing their marketing activities and delivering targeted campaigns, or optimizing their assortment and merchandising decisions, or improving the efficiency of their supply chains.

A retail enterprise has multiple data repositories that require RDBMS capabilities but, at the same time, is challenged with the need to scale them. For example, a Demand Signal Repository is a common System of Record that houses point-of-sale data, inventory information, forecasts, promotions and shipments. This data needs to be harmonized and maintained in a consistent state. It needs to support operational reporting such as stock-outs and also complex analytics such as forecasts. We also hear from those enterprises that their existing traditional databases such as Oracle, SQL Server or DB2 that house this data are unable to scale beyond a few terabytes and become too cumbersome to maintain. This clearly spells out the need for a data platform that can scale effortlessly to manage massive volumes of data and, at the same time, provide RDBMS capabilities with feature-function parity with their existing systems.

  3. Can we run mixed workloads – transactional (OLTP) and analytical (OLAP) – on the same Hadoop cluster?

In retail, there are various processes that encompass both transactional and analytical workloads. For example, a campaign management system needs to ingest real-time customer data from multiple sources and potentially deliver personalized messages to those individuals. This is a highly transactional process with customer profile lookups and real-time updates. It requires concurrent system access that can scale effortlessly, especially during peak shopping seasons. That same system also needs to be able to run fairly complex analytics such as audience segmentation, look-alike modeling and offer optimization.

Retailers typically run the transactional process via a campaign management or CRM application on top of a traditional database such as Oracle or SQL Server, and run their analytic processing on a different data warehouse or an MPP data mart. They have had to maintain separate databases for these two different workloads and move data back and forth. With the Hadoop RDBMS, they can run both the transactional (OLTP) and analytic (OLAP) workloads on the same data platform, eliminating the need to duplicate data and deal with ETL bottlenecks. This also enables their entire process to scale up affordably with increasing data volumes.

  4. Can you give us an example of an enterprise that has modernized their data platform with the Hadoop RDBMS and the ROI they have achieved?

A good example is Harte Hanks. They are replacing the Oracle RAC database powering their campaign management solution with the Splice Machine Hadoop RDBMS. Harte Hanks is a global marketing services provider and serves some of the largest retailers in the market. They provide a 360-degree view of the customer through a customer relationship management system and enable cross-channel campaign analytics with real-time and mobile access. Their biggest challenge was that their customer queries were getting slower, in some cases taking over half an hour to complete. Expecting 30-50% future data growth, Harte Hanks was concerned that database performance issues would become increasingly worse. Harte Hanks evaluated whether to continue scaling up to larger and more expensive proprietary servers or to seek solutions that could affordably scale out on commodity hardware. Splice Machine and Hadoop now support Harte Hanks’ mixed-workload applications (OLAP and OLTP). They have been able to gain a 75% cost saving with a 3-7x increase in query speeds.

Overall, they have experienced a 10-20x improvement in price/performance without significant application, BI or ETL rewrites.

The post Running Operational Applications (OLTP) on Hadoop using Splice Machine and Hortonworks appeared first on Hortonworks.

Introducing Availability of HDP 2.3 – Part 3


Last week, on July 22nd, we announced the general availability of HDP 2.3. Of the three part blog series, the first blog summarized the key innovations in the release—ease of use & enterprise readiness and how those are helping deliver transformational outcomes—while the second blog focused on data access innovation. In this final part, we explain cloud provisioning, proactive support, and other general improvements across the platform.

Automated Provisioning with Cloudbreak

Since Hortonworks’ acquisition of SequenceIQ, the integrated team has been working hard to complete the deployment automation for public clouds including Microsoft Azure, Amazon EC2, and Google Cloud. We are pleased to deliver Cloudbreak 1.0 along with HDP 2.3. Support and guidance are available to all Hortonworks customers who have an active Enterprise Plus support subscription, and we’ve published an initial set of installation and administrative documentation.

Cloudbreak is a cloud agnostic tool for provisioning, managing and monitoring on-demand Hadoop clusters. For administrators, it provides scripting functionality to automate tasks. Through its easy user interface, administrators can manage services for any configuration.

Cloudbreak can be used to provision Hadoop across the major cloud providers: Microsoft Azure, Amazon Web Services, and Google Cloud Platform. It enables efficient usage of cloud platforms via policy-based autoscaling that can expand and contract the cluster based on Hadoop usage metrics and defined policies. And it provides a centralized, secure user experience for Hadoop clusters through a rich web interface, as well as a REST API and CLI shell, across all cloud providers. It is fundamentally integrated with Apache Ambari and heavily leverages the Ambari Blueprints functionality, allowing users to reliably and repeatedly stand up clusters based on their needs.


While Cloudbreak’s primary role is to launch on-demand Hadoop clusters in the cloud, the underlying technology actually does more. It can, for example, launch on-demand Hadoop clusters in any environment that supports Docker – in a dynamic way. Because all the setup, orchestration, networking, and cluster membership are done dynamically, there is no need for a predefined configuration.

While we are focused initially on the public cloud deployment options and flexibility, we are excited about future possibilities of leveraging Docker and Cloudbreak to deliver the maximum deployment choice for our customers within public clouds and within their data centers.

Proactive Support with Hortonworks SmartSense™

As we’ve seen the tremendous appetite for the adoption of Hadoop over the past 2 years, we have also observed more and more mission critical applications and workloads being placed on top of Hadoop. Not surprisingly, our rapidly growing base of customers look to Hortonworks for guidance and best practices to minimize their operational risk and maximize their resources and staff for Hadoop operations. To meet that demand, we have developed Hortonworks SmartSense. It enriches our already world-class support offering for Hadoop by:

  • Providing proactive insights and recommendations to customers about their cluster utilization and its health.
  • Quickly and easily capturing log files and metrics for faster support case resolution.
  • Delivering ongoing recommendations, suggestions and analytics to proactively prevent configuration problems.

Today, we are delivering a new user experience for SmartSense via the Ambari Views Framework in addition to completing the integration of the corresponding recommendations through our support portal. The SmartSense View plugs seamlessly into Ambari and allows for Hadoop operators to easily configure and manage how the information is gathered from the cluster.



SmartSense’s capabilities, says Cheolho Minale, vice president of technology at The Mobile Majority, will allow his Hadoop team to optimize its HDP cluster’s ad performance:

At The Mobile Majority, we have been using Hortonworks Data Platform to optimize ad performance on behalf of our customers. We’re excited to look into Hortonworks SmartSense as a way to continuously optimize our HDP cluster as it grows over time.

This is only the beginning for Hortonworks SmartSense. We believe that we can share valuable insights with our customers as we gain a deeper understanding of how they use HDP within their environments, how their performance and usage peaks and ebbs, and how they optimize their HDP clusters using SmartSense.

General Platform Improvements

Finally, I wanted to wrap up the HDP 2.3 blog series with a set of selected improvements to key components of HDP. Each of these improvements makes a difference in terms of ease of use, enterprise readiness, and simplification. Notable enhancements made in this release that we haven’t yet touched on elsewhere are described below:

Apache Hadoop 2.7.0 was released back in April, and with HDP 2.3 we are shipping Hadoop 2.7.1. The engineering work completed as part of Hadoop 2.7.1 ensures that it is stable and ready-to-use. Across its many components, here are some notable enhancements:

YARN

  • Non-exclusive Node Labels – where applications are given preference for the label they specify, but not exclusive access (YARN-3214). This allows for greater resource sharing within a single cluster and is particularly useful for organizations where workload types shift at different times of day. With non-exclusive labels, nodes that might typically be dedicated to interactive workloads during the day can now be used to support nightly batch processing as well.
  • Fair sharing across apps for the same user in the same queue, with per-queue scheduling policies (YARN-3306). This allows the same user to submit multiple jobs within the same queue and fairly share the resources allocated to her across the jobs she has submitted.

HDFS

  • Improve distcp efficiency: reduced time and processing power needed to mirror datasets across clusters (HDFS-7535, MAPREDUCE-6248)
  • Support for variable-length blocks (HDFS-3689)
  • Provide storage quotas per heterogeneous storage types (HDFS-7584)

PIG

Pig had a number of improvements as well, including the ability to call Hive UDFs directly from Pig.

  • Ability to call Hive UDFs from Pig (PIG-3294)
  • Dynamic Parallelism via Tez (PIG-4434)

SQOOP

Sqoop is used to move data between existing structured sources and Hadoop. Two enhancements related to mainframe datasets and Netezza have been delivered in Sqoop 1.4.6:

  • Import sequential datasets from mainframe (SQOOP-1272) – eases movement of data between mainframes and HDFS. Thank you to Mariappan Asokan of SyncSort for the contribution!
  • Netezza enhancements: skip control codes, write logs to HDFS (SQOOP-2164)

OOZIE

Oozie is the de facto job scheduler for Hadoop. In Oozie 4.2.0, two additional actions have been added, enabling users to define workflows that include HiveServer2 and Spark. In addition, a key enhancement for stopping (both killing and suspending) jobs by their coordinator name has been added.

What’s Next?

This year Hortonworks has focused on three key themes: ease of use, enterprise readiness, and simplification. We want to make HDP easy to use for all types of users. This means continuing to deliver breakthrough user experiences for cluster administrators, developers, and data workers, from data scientists to architects. We want to increase the adoption of HDP within the enterprise, and this means improving ease of operations, increasing security, and providing comprehensive data governance.

Lastly, as we bring together all of the various Apache projects that make up HDP, we want to ensure that they work together as a seamless, integrated, and simple-to-use data processing platform.

We are excited about the progress we’ve made with the arrival of HDP 2.3 and we hope you enjoy the results of all of the open source developers within the community who made this possible.

Learn More

The post Introducing Availability of HDP 2.3 – Part 3 appeared first on Hortonworks.


Fault tolerant Nimbus in Apache Storm


Every day, more and more new devices—smartphones, sensors, wearables, tablets, home appliances—connect together by joining the “Internet of Things.” Cisco predicts that by 2020, there will be 50 billion devices connected to the Internet of Things. Naturally, they all will emit streams of data at short intervals, and these data streams will have to be stored, processed, and analyzed in real time.

Apache Storm is the scalable, fault-tolerant, real-time distributed processing engine that allows you to handle massive streams of data in real time, in parallel, and at scale. Handling massive data streams at scale is one thing; handling them in a fault-tolerant way is another.

In this blog, I examine how Storm’s master component, Nimbus, handles failures by examining its fault tolerant architecture.

Problem Statement

Currently, only a single instance of Nimbus runs in a Storm cluster. Nimbus is stateless, and therefore, in most cases, the failure of Nimbus is transient. If your deployment has Nimbus running under supervision (e.g. using a watchdog process like supervisord), Nimbus can simply be restarted without major consequences. However, in certain scenarios, like disk failures, restarting Nimbus is not an option, and Nimbus becomes unavailable. During Nimbus’ unavailability, the following problems arise:

  • Existing topologies run unaffected because supervisors only talk to Nimbus when new tasks are assigned and Worker processes never talk to Nimbus.
  • No new topologies can be submitted. Existing topologies cannot be killed, deactivated, activated, or rebalanced.
  • If a supervisor node fails, task reassignment is not performed, resulting in performance degradation or topology failures.
  • When Nimbus is restarted, it will delete all existing topologies, as it does not have the topology code available locally. Users are required to resubmit their topologies, which results in downtime.

To solve these problems, you can now run Nimbus in primary and standby mode, ensuring that if the primary fails, one of the standby nodes takes over. This fault-tolerant, failover-capable Nimbus architecture has obvious advantages:

  • It increases the overall availability of Nimbus.
  • It allows Nimbus hosts to leave and join the cluster transparently. After joining, a new Nimbus instance will go through the sequence of steps necessary to join the list of nodes that are eligible to become leaders. During failover no topology resubmissions are required.
  • It prevents loss of active topologies.

Fault Tolerant Nimbus Architecture

The architecture has three main parts: Leader election, distributed state storage, and leader discovery.

Leader Election

In order to elect a primary Nimbus, we have a ZooKeeper-based leader election mechanism. In particular, we use Curator’s LeaderLatch recipe to perform the leader election. This scheme takes care of keeping the leader status in memory and re-queuing the node for the leader lock in case the ZooKeeper connection is intermittently lost. Only an elected leader Nimbus can activate, deactivate, rebalance or kill topologies, or perform reassignments for existing topologies. Only Nimbus nodes that have the code for all active topologies will contend for the leader lock, and if a node that does not have the code for all active topologies is chosen as leader, it will relinquish leadership.
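
To make the recipe concrete, here is a minimal, self-contained Scala sketch of Curator's LeaderLatch, independent of Storm's actual implementation; the ZooKeeper connect string and latch path are illustrative assumptions.

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}
import org.apache.curator.retry.ExponentialBackoffRetry

// Connect to ZooKeeper (illustrative connect string) with an exponential-backoff retry policy.
val client = CuratorFrameworkFactory.newClient("zk1:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

// Every Nimbus host creates a latch on the same path; Curator elects exactly one leader.
val latch = new LeaderLatch(client, "/storm/nimbus-leader-lock")
latch.addListener(new LeaderLatchListener {
  // Called when this participant acquires leadership. A real Nimbus would first verify it
  // has the code for all active topologies locally and relinquish the latch otherwise.
  override def isLeader(): Unit = println("this Nimbus is now the leader")
  // Called when leadership is lost, e.g. due to ZooKeeper session problems.
  override def notLeader(): Unit = println("this Nimbus is no longer the leader")
})
latch.start()

// ... perform leader-only work while latch.hasLeadership is true ...

latch.close()
client.close()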

To illustrate how a leader is elected during a failure, let’s assume: a) four topologies are running, b) there are three Nimbus nodes, c) the code-replication factor is two, d) the leader Nimbus has the code for all the topologies locally, and e) each of the two non-leader nodes (nonLeader_1, nonLeader_2) has code for a subset of two topologies.

With these assumptions, we can describe the Nimbus failover in the following steps:

  1. Leader has catastrophic failure and becomes unavailable.
  2. nonLeader_1 receives a notification from ZooKeeper indicating that it is now the new leader. Before accepting the leadership role, it first checks if it has the code available for all 4 active topologies. It realizes that it only has code for 2 topologies, so it relinquishes the lock and looks up in ZooKeeper to find out from where it can download the code for the missing topologies. This lookup returns entries for the leader Nimbus and nonLeader_2. It will try downloading from both as part of its retry mechanism.
  3. nonLeader_2’s code sync thread also realizes that it is missing code for 2 topologies, and follows the same missing-topology discovery process described above.

When this process is complete, at least one of the Nimbus nodes will have all the code locally and will become the leader. The following sequence diagram illustrates how leader election and failover work across multiple components.

[Sequence diagram: Nimbus leader election and failover]

Distributed State Storage

Storm preserves most of its state in ZooKeeper. When a user submits a topology jar, Storm stores it on Nimbus’ local disk, not in ZooKeeper, because the jar may be large. This means that when running Nimbus in primary/standby mode, unless we use replicated storage for all topology jars, there is no way for a standby node to become primary.

A possible solution to this problem is to use replicated storage. To avoid requiring that users have replicated storage in their cluster setup, Storm includes a pluggable storage interface. Out of the box, Storm supports two implementations of replicated storage: one uses each Nimbus host’s local disk, and the other uses HDFS.

The following steps describe topology code replication among nimbus hosts:

  • When a client submits a topology, the leader Nimbus calls the code distributor’s upload function, which creates a metafile locally on the leader Nimbus. The leader Nimbus then writes new entries in ZooKeeper to notify all non-leader Nimbus nodes that they should download the newly submitted topology code.
  • The leader Nimbus waits until at least N non-leader Nimbus nodes have the code replicated, or a user-configurable timeout expires. N is the user-configurable replication factor.
  • When a non-leader Nimbus receives the notification that a new topology has been submitted, it downloads the metafile from the leader Nimbus and then downloads the code binaries by calling the code distributor’s download function with the metafile as input.
  • Once a non-leader node finishes downloading the code, it writes an entry in ZooKeeper to indicate that it is now one of the hosts from which the code for this topology can be downloaded, in case the leader Nimbus dies.

The leader Nimbus can then proceed with the usual tasks that are part of the submit topology action, e.g. assigning nodes for the newly activated topology and marking the topology as active.

[Diagram: topology code replication across Nimbus hosts]

Leader Discovery

We originally exposed a library that read leader information from ZooKeeper, and all clients used this library for Nimbus discovery. However, ZooKeeper is one of the main bottlenecks when scaling Storm to large clusters with many topologies, so our goal was to reduce the write load on ZooKeeper as much as possible. Robert Joseph Evans (thank you!) from Yahoo! pointed out that in our approach each client connection would be considered an extra write to ZooKeeper, which is undesirable. To overcome this, a Nimbus summary section was added to the existing cluster summary API. Any client can call this API on any Nimbus host to get the list of current Nimbus hosts and the leader among them. Nimbus hosts still read this information from ZooKeeper, but now they can cache this value for a reasonable amount of time, hence reducing the load on ZooKeeper.

Quick Start Guide

Set the values for the following configuration properties according to your configuration needs, and start Nimbus processes on multiple hosts.

  • codedistributor.class : The fully qualified class name of a class that implements "backtype.storm.codedistributor.ICodeDistributor". The default is "backtype.storm.codedistributor.LocalFileSystemCodeDistributor", which leverages the local file system to store both meta files and code/configs.
  • min.replication.count : The minimum number of Nimbus hosts to which the topology code must be replicated before the leader Nimbus can mark the topology as active and create assignments. The default is 1. For topologies that need high availability, we recommend setting this value to floor(number_of_nimbus_hosts/2 + 1); for example, with three Nimbus hosts this works out to floor(3/2 + 1) = 2.
  • max.replication.wait.time.sec : The maximum time to wait for replication to reach the minimum replication count. Once this time has elapsed, Nimbus proceeds with topology activation tasks even if the required replication count has not been achieved. The default is 60 seconds; -1 indicates that Nimbus must wait forever, and this is the value that topologies requiring high availability should use.
  • code.sync.freq.secs : The frequency at which the Nimbus background thread responsible for syncing code for locally missing topologies runs. The default is 5 minutes.

Some Storm Facts and Features

Acknowledgements:

Special thanks to Robert Joseph Evans, Taylor Goetz, Sriharsha and the entire Storm community, who helped with reviewing the design and code and with testing the final feature.

The post Fault tolerant Nimbus in Apache Storm appeared first on Hortonworks.

Hortonworks Sandbox with HDP 2.3 is now available on Microsoft Azure Gallery


We are excited to announce the general availability of Hortonworks Sandbox with HDP 2.3 on Microsoft Azure Gallery. Hortonworks Sandbox is already a very popular environment for developers, data scientists and administrators to learn and experiment with the latest innovations in Hortonworks Data Platform.

The hundreds of innovations span Apache Hadoop, Kafka, Storm, Spark, Hive, Pig, YARN, Ambari, Falcon, Ranger and the other components that make up the HDP platform. We also provide tutorials to help you get a jumpstart on how to use HDP to implement Open Enterprise Hadoop in your organization.

Every component is updated, including some of the key technologies we added in HDP 2.3.

This guide walks you through using the Azure Gallery to quickly deploy Hortonworks Sandbox on Microsoft Azure.

Prerequisite:

  • A Microsoft Azure account – you can sign up for an evaluation account if you do not already have one.

Guide

Start by logging into the Azure Portal with your Azure account: https://portal.azure.com/

Navigate to the MarketPlace

Search for Hortonworks. Click on the Hortonworks Sandbox icon.

To go directly to the Hortonworks Sandbox on Azure page navigate to http://azure.microsoft.com/en-us/marketplace/partners/hortonworks/hortonworks-sandbox/

This will launch the wizard to configure Hortonworks Sandbox for deployment.

You will need to note down the hostname and the username/password that you enter in the next steps to be able to access the Hortonworks Sandbox once it is deployed. Also ensure you select an Azure instance of size A4 or larger for an optimal experience.


Click Buy if you agree with everything on this page.

At this point it should take you back to the Azure portal home page where you can see the deployment in progress.


You can see the progress in more detail by clicking on Audit.

Once the deployment completes, you will see this page with the configuration and status of your VM. Again, it is important to note down the DNS name of your VM, which you will use in the next steps.

If you scroll down you can see the Estimated spend and other metrics for your VM.

Let’s navigate to the home page of your Sandbox by pointing your browser to the URL: http://<hostname>.cloudapp.net:8888, where <hostname> is the hostname you entered during configuration.

By navigating to port 8080 of your Hortonworks Sandbox on Azure you can access the Ambari interface for your Sandbox.

If you want a full list of tutorials that you can use with your newly minted Hortonworks Sandbox on Azure, go to http://hortonworks.com/tutorials.

HDP 2.3 leverages the Ambari Views Framework to deliver new user views and a breakthrough user experience for both cluster operators and data explorers.

Happy Hadooping with Hortonworks Sandbox!

The post Hortonworks Sandbox with HDP 2.3 is now available on Microsoft Azure Gallery appeared first on Hortonworks.

Where is Hadoop and YARN technology going in the next 10 years?


In a world that creates 2.5 quintillion bytes of data every year, it is extremely cheap to collect, store and curate all the data you will ever care about. Data is de facto becoming the largest untapped asset. So how can organizations take advantage of unprecedented amounts of data? The answer is new innovations and new applications. We are clearly entering a new era of modern data applications.

I would like to take the opportunity to share my Hadoop journey in the past 10 years, and discuss where I see the Hadoop technology going in the next decade.

To celebrate ten years of Hadoop with me, please go to http://hortonworks.com/10yearsofhadoop/ for more information.

The post Where is Hadoop and YARN technology going in the next 10 years? appeared first on Hortonworks.

Magellan: Geospatial Analytics on Spark


Geospatial data is pervasive—in mobile devices, sensors, logs, and wearables. This data’s spatial context is an important variable in many predictive analytics applications.

To benefit from spatial context in a predictive analytics application, we need to be able to parse geospatial datasets at scale, join them with target datasets that contain point in space information, and answer geometrical queries efficiently.

Unfortunately, if you are working with geospatial data and big data sets that need spatial context, there are limited open source tools that make it easy for you to parse and efficiently query spatial datasets at scale. This poses significant challenges for leveraging geospatial data in business intelligence and predictive analytics applications.

This is the problem that Magellan sets out to solve. Magellan is an open source library for Geospatial Analytics that uses Apache Spark as the underlying execution engine. Magellan facilitates geospatial queries and builds upon Spark to solve hard problems of dealing with geospatial data at scale.

In this blog post, we will introduce the problem of geospatial analytics and show how Magellan allows users to ingest geospatial data and run spatial queries at scale.

To do so, we will analyze the problem of using Uber data to examine the flow of Uber traffic in the city of San Francisco.

Mapping the flow of Uber traffic in San Francisco with Magellan

Uber has published a dataset of GPS coordinates of all trips within San Francisco.

Our goal in this example is to join the Uber dataset with the San Francisco neighborhoods dataset to obtain some interesting insights into the patterns of Uber trips in San Francisco.

Magellan has both Scala and Python bindings. In this blog post we use the Scala APIs.
Magellan is a Spark Package, and can be included while launching the spark shell as follows:

            bin/spark-shell --packages harsha2010:magellan:1.0.3-s_2.10

The following imports are needed:

 

import magellan.{Point, Polygon, PolyLine}
import magellan.coord.NAD83
import org.apache.spark.sql.magellan.MagellanContext
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

First, we need to read the Uber dataset. We assume the dataset has been downloaded and that the path to the dataset is uber.path.
Let us create a case class to attach the schema to this Uber Dataset so we can use the DataFrame abstraction to deal with the data.

case class UberRecord(tripId: String, timestamp: String, point: Point)

Now we can read the dataset into a dataframe and cache the resulting dataframe.

// Note: ${uber.path} is a placeholder for the location of the downloaded Uber dataset
val uber = sc.textFile(${uber.path}).map { line =>
  val parts = line.split("\t")
  val tripId = parts(0)
  val timestamp = parts(1)
  // The file lists latitude before longitude; Point takes (x = longitude, y = latitude)
  val point = Point(parts(3).toDouble, parts(2).toDouble)
  UberRecord(tripId, timestamp, point)
}.
  repartition(100).
  toDF().
  cache()

This dataset contains the trip id, the timestamp and the latitude and longitude of each point on the trip coalesced into a Point data structure.

A Point is the simplest geometric data structure available in Magellan. It represents a two-dimensional point with x and y coordinates. In this case, as is standard in geospatial analysis, the x coordinate refers to the longitude and the y coordinate to the latitude.
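
For instance, a Point for a location in downtown San Francisco (purely illustrative coordinates) is constructed with the longitude first:

val downtownSF = Point(-122.4194, 37.7749)  // x = longitude, y = latitude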

Since this dataset is not interesting in itself, we need to enrich it by determining which neighborhood each of these points lies in.

To do so, we will convert the neighborhood dataset into a dataframe as well, assuming the dataset has been downloaded and the path to the dataset is neighborhoods.path.

This dataset is in what is known as the ESRI Shapefile format.

This is one of the most common formats in which geospatial data is stored. Magellan has a Data Source implementation that understands how to parse ESRI Shapefiles into Shapes and Metadata.

val magellanContext = new MagellanContext(sc)
val neighborhoods = magellanContext.read.format("magellan").
  load(${neighborhoods.path}).
  select($"polygon", $"metadata").
  cache()

There are two columns in this DataFrame: a shape representing the neighborhood, which happens to be polygonal, and metadata, which is a map of String keys and String values.

Magellan has a Polygon data structure to capture the spatial geometry of a Polygon. A Polygon in Magellan stands for a polygonal object with zero or more holes.
Map columns can be exploded into their keys and values to yield the following dataframe:

neighborhoods.select(explode($"metadata").as(Seq("k", "v"))).show(5)

 

+----------+--------------------+
|         k|                   v|
+----------+--------------------+
|neighborho|Twin Peaks       ...|
|neighborho|Pacific Heights  ...|
|neighborho|Visitacion Valley...|
|neighborho|Potrero Hill     ...|
|neighborho|Crocker Amazon   ...|
+----------+--------------------+

Now we are getting somewhere: we are able to parse the San Francisco neighborhoods dataset, extracting its metadata as well as the polygon shapes that represent each neighborhood. The natural next step is to join this dataset with the Uber dataset so that each point on an Uber trip can be associated with its corresponding neighborhood.

Here we run into an important spatial query: how do we compute whether a given point (an Uber location) lies within a given polygon (a neighborhood)?

Magellan implements the within predicate, as well as other spatial operators like intersects, intersection, contains, and covers, making such queries easy to express.
In Magellan, to join the Uber dataset with the San Francisco neighborhood dataset, you would issue the following Spark SQL query:

neighborhoods.
  join(uber).
  where($"point" within $"polygon").
  select($"tripId", $"timestamp", explode($"metadata").as(Seq("k", "v"))).
  withColumnRenamed("v", "neighborhood").
  drop("k").
  show(5)
+------+---------+------------+
|tripId|timestamp|neighborhood|
+------+---------+------------+
+------+---------+------------+

This is interesting: According to our calculation, the GPS coordinates representing the Uber dataset do not fall in any of the San Francisco neighborhoods. How can this be?

This is a good point to pause and think about coordinate systems. We have been using GPS coordinates for the Uber dataset, but haven’t verified the coordinate system that the San Francisco neighborhood dataset has been encoded in.

It turns out that most datasets published by US government agencies use what are called State Plane coordinates.

Magellan supports translating between different coordinate systems by implementing a transformer interface which takes in Points and outputs Points.

This covers all conformal transformations, that is, the set of all transformations that preserve angles.
In particular, to translate between WGS84, the standard GPS coordinate system used in the Uber dataset, and NAD83 Zone 403 (the State Plane zone covering Northern California), we can use the following built-in transformer:

val transformer: Point => Point = (point: Point) => {
  val from = new NAD83(Map("zone" -> 403)).from()
  val p = point.transform(from)
  new Point(3.28084 * p.x, 3.28084 * p.y)
}

Here we have defined a new transformer that applies the NAD83 transformation for Zone 403 (Northern California) and further scales the points to have units in feet instead of meters.
This allows us to enhance the Uber dataset by adding a new column containing the coordinates in the NAD83 State Plane coordinate system:

val uberTransformed = uber.
  withColumn("nad83", $"point".transform(transformer)).
  cache()

Now we are ready to perform the join again:

val joined = neighborhoods.
  join(uberTransformed).
  where($"nad83" within $"polygon").
  select($"tripId", $"timestamp", explode($"metadata").as(Seq("k", "v"))).
  withColumnRenamed("v", "neighborhood").
  drop("k").
  cache()
joined.show(5)
 
+------+--------------------+--------------------+
|tripId|           timestamp|        neighborhood|
+------+--------------------+--------------------+
| 00002|2007-01-06T06:23:...|Marina           ...|
| 00006|2007-01-04T01:04:...|Marina           ...|
| 00008|2007-01-03T00:59:...|Castro/Upper Mark...|
| 00011|2007-01-06T09:08:...|Russian Hill     ...|
| 00014|2007-01-02T05:18:...|Mission          ...|
+------+--------------------+--------------------+

OK, this looks much more reasonable!
One interesting question we are now ready to ask is: which neighborhoods do the most Uber trips pass through?

joined.
  groupBy($"neighborhood").
  agg(countDistinct("tripId").as("trips")).
  orderBy(col("trips").desc).
  show(5)
+--------------------+-----+
|        neighborhood|trips|
+--------------------+-----+
|South of Market  ...| 9891|
|Western Addition ...| 6794|
|Downtown/Civic Ce...| 6697|
|Financial Distric...| 6038|
|Mission          ...| 5620|
+--------------------+-----+

There are about 24664 trips for which we have neighborhood information, and close to 40% of them pass through South of Market (SOMA). So if you are an Uber driver, you may just want to hang out around SOMA.
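The post does not show how this trip count was obtained, but a minimal sketch, using only the joined DataFrame defined above, would be:

// Count the distinct trips that fall inside at least one neighborhood.
joined.select($"tripId").distinct().count()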
Breaking down this analysis by the neighborhood where trips originate reveals similar interesting insights.
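The original post does not include the code for this breakdown, so the following is only a hedged sketch of one way to compute it. The names firstPings and tripStarts are illustrative, and the approach assumes the timestamps are ISO-8601 strings (so min() picks each trip's earliest ping) and only sees pings that fall inside some neighborhood:

import org.apache.spark.sql.functions._  // usually already in scope in spark-shell

// Earliest ping of each trip.
val firstPings = joined.
  groupBy($"tripId").
  agg(min($"timestamp").as("timestamp"))

// Treat the neighborhood of that earliest ping as the trip's starting neighborhood.
val tripStarts = joined.as("j").
  join(firstPings.as("f"),
    col("j.tripId") === col("f.tripId") && col("j.timestamp") === col("f.timestamp")).
  select(col("j.tripId").as("tripId"), col("j.neighborhood").as("start_neighborhood"))

tripStarts.
  groupBy($"start_neighborhood").
  count().
  orderBy(col("count").desc).
  show(10)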

+--------------------+-----+
|  start_neighborhood|count|
+--------------------+-----+
|South of Market  ...| 5697|
|Financial Distric...| 3542|
|Downtown/Civic Ce...| 3258|
|Western Addition ...| 2632|
|Mission          ...| 2332|
|Marina           ...| 1003|
|Nob Hill         ...|  960|
|Pacific Heights  ...|  853|
|Castro/Upper Mark...|  749|
|Russian Hill     ...|  512|
+--------------------+-----+

Out of 24664 trips, 5697 originate in SOMA. That is, 23% of all the Uber trips start in SOMA.
Another interesting question to ask is: what fraction of the Uber trips that originate in SOMA also end up in SOMA?
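Again, this is only a hedged sketch rather than the original code, reusing the tripStarts DataFrame from the previous snippet; lastPings and tripEnds are illustrative names:

// Latest ping of each trip.
val lastPings = joined.
  groupBy($"tripId").
  agg(max($"timestamp").as("timestamp"))

val tripEnds = joined.as("j").
  join(lastPings.as("l"),
    col("j.tripId") === col("l.tripId") && col("j.timestamp") === col("l.timestamp")).
  select(col("j.tripId").as("tripId"), col("j.neighborhood").as("end_neighborhood"))

// Keep trips that start in South of Market and count where they end.
tripStarts.
  filter($"start_neighborhood".startsWith("South of Market")).
  join(tripEnds, "tripId").
  groupBy($"end_neighborhood").
  count().
  orderBy(col("count").desc).
  show(20)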

+--------------------+-----+
|    end_neighborhood|count|
+--------------------+-----+
|South of Market  ...| 2259|
|Mission          ...|  911|
|Financial Distric...|  651|
|Downtown/Civic Ce...|  396|
|Western Addition ...|  380|
|Castro/Upper Mark...|  252|
|Potrero Hill     ...|  247|
|Nob Hill         ...|  101|
|Pacific Heights  ...|   56|
|Marina           ...|   51|
|Bernal Heights   ...|   48|
|North Beach      ...|   47|
|Haight Ashbury   ...|   45|
|Russian Hill     ...|   38|
|Chinatown        ...|   35|
|Noe Valley       ...|   32|
|Bayview          ...|   29|
|Treasure Island/Y...|   24|
|Golden Gate Park ...|   15|
|Inner Richmond   ...|   14|
+--------------------+-----+

That is, nearly 39% of all the trips that originate from SOMA end up in SOMA: as far as Uber is concerned, what happens in SOMA stays in SOMA!

As we see, once we add geospatial context to the Uber dataset, we end up with a fascinating array of questions we can ask about the nature of Uber trips in the city of San Francisco.

Summary

In this blog post, we have shown how to use Magellan to perform geospatial analysis on Spark.

Hopefully this short introduction has demonstrated how easy and elegant it is to incorporate geospatial context in your applications using Magellan.

In the next blog post, we will go under the hood to examine how Magellan leverages Spark SQL, Data Frames and Catalyst to provide elegant and simple user APIs while ensuring that spatial queries can execute efficiently.

The post Magellan: Geospatial Analytics on Spark appeared first on Hortonworks.

Apache Ambari Hackfest on a Serene Saturday


Hackathons, hackfests, and codefests can have an initial air of invincibility: they challenge participants, even veterans. But when attendees work together, and when the community collaborates and innovates together, that air of invincibility quickly dissipates.

Last Saturday, because of such camaraderie and collaboration, a harmony of innovative ideas flourished and came to fruition at an Ambari Hackfest.


Open Data Platform Initiative (ODPi) founding partners Hortonworks and Pivotal co-hosted and co-sponsored an Ambari Hackfest at the Pivotal site near the scenic Foothills in Palo Alto.

One attendee said, “This location is so beautiful, I saw a coyote, horses, and cows as I drove up Deer Creek Road. Now I’m ready to code.”

Our goal was to foster an environment not of competition but of collaboration, where Ambari team experts mentored attendees as they implemented their ideas, using Ambari's extensible framework to create Views and Services.

With that goal in mind, the all-day Hackfest attracted more than 30 attendees, with groups of two, three or four working together, creating Ambari Views and Services.

After a short introductory lecture on Apache Ambari's extensible framework, the attendees quickly and comfortably started hacking for five intense hours, taking short breaks for lunch and snacks.


Of the entries submitted at the end, the judges chose three winners as the most complete, functional, usable, and presentable:

  1. Ambari Cassandra Service by Greg Hill
  2. Catalog Service for Ambari by Juanjo Marron & Tuong Truong
  3. Ambari Service Deployer by Jesus Alvarez

Others, though incomplete, will eventually be submitted on Devpost. You can view some of the submissions here, with links to their sources on GitHub, and peruse some of the pictures here and more here.

Team Hortonworks created a Hadoop Log Search, which contains a service and view for searching, querying, filtering, and displaying log statistics (by combining Solr, Logstash, and Banana), while the Pivotal team implemented a valuable Install View extension that lets the Ambari agent install RPMs by adding additional repository configuration.

We want to thank our ODPi partners Pivotal for co-hosting and co-sponsoring this event. Like the ODPi and Apache Ambari Meetup held this summer where our ODPi partners shared with the community how they provision, manage, and monitor their large Hadoop clusters using Apache Ambari and its extensible framework, this Hackfest too was an exemplary showcase of the power and potential of Ambari’s extensible framework.

Also, we want to congratulate the winners and thank all the attendees who participated in this first Apache Ambari Hackfest. All participants did a commendable job in just a few hours of hacking. Their contributions will be valuable additions to the community galleries for Ambari Views and Ambari Extensions and on Devpost.

Stay tuned as more weekend-long Hackathons are coming soon…

Resources

The post Apache Ambari Hackfest on a Serene Saturday appeared first on Hortonworks.
