Hadoop on OpenStack Swift: experiments

Some time has passed since our initial post on implementing Hadoop over OpenStack Swift. A couple of things have changed (Rackspace finally implemented range requests in their Cloudfiles library); others have stayed the same (still no built-in support for Hadoop in OpenStack / CloudFiles).

We received a lot of feedback and questions about the integration but did not always have the time or patience to address them properly; sorry for that. One of our readers, Zheng Xu, did a great job putting together a slide deck on the exact procedure.

There are still some points I need to address regarding the procedure he assembled. It mostly boils down to Cloudfiles: although the current Cloudfiles implementation has HTTP range support, our Hadoop code relies on our own range implementation. I therefore strongly encourage either using our Cloudfiles distribution (with patches) or changing our Hadoop code to use the new Rackspace one. Simple filesystem tasks will work as expected either way, but any MapReduce job that works with big files will fail without correct HTTP range support.
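To make the range requirement concrete, here is a minimal sketch of what a ranged read looks like from the client side. It is not code from our Swift adapter or from the Rackspace library: the class and method names (`RangeReadSketch`, `rangeHeader`, `rangedGet`) are hypothetical, and the object store is simulated with an in-memory byte array. The only real-world fact it relies on is the standard HTTP `Range: bytes=first-last` header, where both ends are inclusive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of the Swift adapter or Cloudfiles.
public class RangeReadSketch {

    // Builds the value of an HTTP Range header for `length` bytes starting at `offset`.
    // HTTP byte ranges are inclusive on both ends, hence the "- 1".
    static String rangeHeader(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    // Stand-in for an object store answering a ranged GET:
    // returns only the requested bytes, clipped to the object's size.
    static byte[] rangedGet(byte[] object, long offset, long length) {
        int from = (int) Math.min(offset, object.length);
        int to = (int) Math.min(offset + length, object.length);
        return Arrays.copyOfRange(object, from, to);
    }

    public static void main(String[] args) {
        // Assumed sample data: three 6-byte lines in one stored object.
        byte[] object = "line1\nline2\nline3\n".getBytes(StandardCharsets.US_ASCII);
        long splitSize = 6; // assumed split size, chosen to align with the lines

        // Each "task" asks only for its own slice of the object.
        for (long offset = 0; offset < object.length; offset += splitSize) {
            String header = rangeHeader(offset, splitSize);
            byte[] slice = rangedGet(object, offset, splitSize);
            String body = new String(slice, StandardCharsets.US_ASCII).trim();
            System.out.println(header + " -> " + body);
        }
    }
}
```

The point of the sketch is that a job reading a big file does not read it front to back in one stream: each task starts at its own offset. If the underlying library ignores or mishandles the range, a task that asked for the middle of the object gets the wrong bytes, which is why plain filesystem commands can work while MapReduce jobs do not.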

I want to thank Zheng Xu for the effort and congratulate him on the success of his small experiment.

12 thoughts on “Hadoop on OpenStack Swift: experiments”

  1. Constantine,

    I’ve downloaded your Swift filesystem adapter as well as your cloudfiles distribution and built a standalone JAR that you can place on the HADOOP_CLASSPATH to allow simple file commands (using the hadoop fs shell) between Swift and HDFS. However, I can’t seem to get Hadoop’s distcp command to work. Have you had success using your code to perform a distributed copy between Swift and HDFS?

    Any insights would be most helpful. Thanks loads for the work you’ve done.


  2. Hi Constantine,

    Yes, I’ve looked at the presentation, however I’m trying a slightly different approach.

    I don’t want to have to modify our installed version of Hadoop, as it’s managed using Juju on our internal OpenStack cluster. Therefore, I’ve built standalone libraries containing the Swift file system (your code), including the cloud files classes and all dependencies. The various file system property values, together with my Swift credentials, are placed into a user/application configuration file (i.e. properties include fs.swift.impl, Swift credentials, etc.).

    Now, adding this jar to the HADOOP_CLASSPATH allows use of all hadoop command line file commands; I can copy data between filesystems, create/delete files/directories, etc.

    For a Hadoop job that wishes to pull data directly from Swift, I’m a little less certain. The job’s jar file is built to also include the Swift file system jar and dependencies in a /lib subdirectory within the jar. Now, for distcp use, I agree, since I’m not building the swift extensions into our version of Hadoop directly, distcp can’t see the extensions (since the extension must also be seen by each task comprising the job). I’m just wondering if the approach I’m attempting works for a user’s job to allow direct Swift access on either input or output paths?


    1. Your assumptions seem perfectly sane. Theoretically it should work.
      But it could be that distcp uses some HDFS-specific things to move the data, for example data locality and direct node-to-node access. Swift obviously cannot support a data locality interface directly (although I have an idea how to implement that), and there are no shortcuts between object storage nodes.

  3. Hi Constantine,

    Agreed. I’m writing a simple map/reduce job using identity mappers and reducers to see if I can have my input path point at Swift and output path point to HDFS. Let’s see what happens.


  4. Constantine,

    Quick followup. The simple job described previously is configured to read from Swift and copy to HDFS. The input file on Swift has 385 records. Running the job, only the first record of the file is read and copied to HDFS. All others are ignored.

    I’ve not dug into the file system mechanics too much. Any suggestions? I’m building against a cloud files distribution that supports Keystone v2.0 authentication. I’m wondering if range support is not implemented correctly (a SWAG). Likely I need to port the keystone support to either your version of cloudfiles or the latest Rackspace release.


    1. I think it’s better to port our Swift fs implementation to the new rackspace library.
      They have everything inside now (range support also), but the API is not the same as in our version.

      And I don’t quite understand “input file records”; what do you mean? There is nothing I can call a “record” in the Swift architecture.

  5. Constantine,

    I simply mean lines in the data input file. The job only reads up to the first \n in the file.

    I’ll grab the Rackspace code and have a look. All for now,


  6. Ross,

    I am working with the same configuration as yourself, with hadoop instances managed by juju on openstack. I am wondering if it would be possible for you to detail how you built the standalone JAR for copying data between filesystems? And if you’ve made any progress in regards to pulling data directly from Swift.

    1. Devon,

      Sorry for the late response.

      My code & build files are somewhat in disarray, and other priorities have pulled me in other directions. Have you been able to copy from Swift to HDFS yet? If not, once I get the chance, I can send you my Maven configs to build the stand alone jars.

      All I have working is the ability to copy from Swift into HDFS. Distributed copy (i.e. distcp) doesn’t work, as this is essentially a map/reduce job and I don’t have the swift file adapter working within a map reduce job yet. However, use of the hadoop command line (e.g. hadoop fs -cp) works using swift/keystone with hadoop (version 1.0.3).

