Azure Pipelines for Shogun

This year I attended the NumFocus Summit 2018 as the representative of the Shogun.ML project at the Microsoft Technical Center, NYC. I met many of the amazing maintainers and core contributors of NumFocus-sponsored projects. During the summit I not only learned how important it is for an outsider that a project has a roadmap (or rather a wish-list) for the project (stay tuned for more on our github page), but, thanks to Steve Dower, I also learned about Azure Pipelines (for DevOps).

Finally, a CI system that gives me the same flexibility as buildbot, which we use extensively for Shogun: dependencies between jobs, conditional tasks, templating, etc. Most importantly, it supports Linux (including Docker), Windows and macOS builds. That is basically almost every OS that we usually test for in a PR or commit; only one thing is still missing: the ARMv7 architecture.

Currently, in our CI mess, we use Travis for Linux and macOS tests and AppVeyor for Windows builds. They have served us quite well in the past, but in both cases the ability to define dependencies among jobs is something I've missed for years. In Shogun.ML, as you might know, thanks to SWIG we support many different language interfaces (e.g. Python, R, Ruby, JVM, C#, etc.). These interfaces are essentially shared libraries that depend on the C++ (core) library of Shogun.ML. This means that to compile any of the interface libraries, the C++ (shared) library needs to be available. With our CMake setup, one can simply compile the C++ library (or install the binary):

mkdir build && cd build
cmake ..
make install

Once the shared library is available, using the INTERFACE_<LANGUAGE> CMake flag one can compile a language interface library against the precompiled C++ library, without having to re-compile the whole C++ library from scratch (note: the major versions should be the same). For example, to compile the Python language interface against a precompiled C++ library, you would run the following commands:

mkdir build-python && cd build-python
cmake -DINTERFACE_PYTHON=ON ..
make install

The CMake script will try to detect libshogun on your system. If it cannot find it, it will throw an error; in that case the prefix path of the libshogun installation can be set with -DCMAKE_INSTALL_PREFIX=<path>.

Now back to our CI problem: since defining dependencies between jobs is not possible in Travis, every one of our CI jobs for every interface had to compile libshogun from scratch (see here). As you can see from the link, this meant that we had 2 jobs for compiling and testing libshogun alone (gcc and clang), and another 6 jobs to compile (and test) the language interfaces: Python, R, Java, C#, Ruby and Octave. But each of the 6 interface jobs had to compile libshogun as well, not only the interface library. This is a waste of time and resources: if one could define a dependency between jobs, the interface jobs could simply reuse the shared library created by one of the libshogun tasks and compile only the interface library. In Azure Pipelines, a simple dependsOn attribute lets me define exactly this dependency! The only tiny thing left to solve was where to store the compiled shared libraries so that the SWIG jobs could access them. Azure Pipelines has a storage for artifacts, intended more for release artifacts, but it served perfectly for passing the shared libraries of the libshogun jobs to the SWIG jobs. I've created two small templates:
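A minimal sketch of what such a dependency looks like in the pipeline YAML (the job names and build steps here are illustrative, not our actual pipeline definition):

```yaml
jobs:
- job: libshogun
  steps:
  - script: |
      mkdir build && cd build
      cmake ..
      sudo make install
    displayName: build and install the core C++ library

- job: python
  dependsOn: libshogun      # runs only after the libshogun job finished
  condition: succeeded()    # ...and only if it succeeded
  steps:
  - script: |
      mkdir build-python && cd build-python
      cmake -DINTERFACE_PYTHON=ON ..
      make
    displayName: build only the Python interface
```

Note that each job runs on a fresh agent, which is exactly why the compiled libshogun has to be shipped between jobs as an artifact.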

  • archive-deploy: simply takes a directory to be archived and uploads it to the artifact storage under artifactName.
  • download-extract: as its name says, this template downloads an artifact and extracts its content to a specific location. Since there are two scenarios in which we use this template, it got a bit more complex than one would expect at first. For the case explained so far, i.e. downloading an artifact that was generated by a triggering job (libshogun), this section of the script is the relevant one.
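For illustration, invoking such a template from a job could look like this (the parameter names are my assumption for the sketch, not necessarily the templates' real interface):

```yaml
steps:
# archive a directory and upload it to the artifact storage
- template: archive-deploy.yml
  parameters:
    directory: $(Build.BinariesDirectory)/opt
    artifactName: libshogun-gcc
```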

Using these templates, I've defined libshogun jobs that compile and test the core C++ library using gcc and clang, and if all the tests pass, they publish the ccache-gcc and ccache-clang artifacts respectively. Once the libshogun jobs finish successfully, they trigger the SWIG jobs, which start by downloading the published libshogun artifacts and then compile only the language interfaces.

Back to the second use-case of the download-extract template: both on our buildbot system and when developing shogun locally, we rely extensively on ccache, which can significantly reduce the compilation time of the library. In both our Travis and AppVeyor jobs we use ccache (or clcache) to keep our CI jobs as fast as possible, since compiling shogun from scratch can take some time. Currently Azure Pipelines does not have an explicit task definition for adding ccache support to a job, hence I needed to define it myself, which was rather easy, as I just needed to:

  1. setup the ccache environment (CCACHE_DIR)
  2. archive and publish the ccache directory content if the tests passed successfully
  3. in the new job download the previously published ccache.
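The first two steps above could be sketched roughly like this (the ccache path and artifact name are illustrative):

```yaml
variables:
  CCACHE_DIR: $(Build.SourcesDirectory)/.ccache   # step 1: point ccache to a known directory

steps:
- script: |
    mkdir build && cd build
    cmake .. && make && ctest
  displayName: build and test (compilations go through ccache)

# step 2: publish the ccache directory once the tests passed
- task: PublishBuildArtifacts@1
  condition: succeeded()
  inputs:
    pathToPublish: $(CCACHE_DIR)
    artifactName: ccache-gcc
```

Step 3, downloading the previously published ccache in a new job, uses the DownloadBuildArtifacts task.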

Archiving and publishing the ccache directory is exactly the same as archiving and publishing libshogun. Downloading the ccache artifact is different, though. Remember, in the case of the SWIG interfaces, the jobs are triggered by the libshogun jobs; hence we can use the specificBuildWithTriggering variable in the DownloadBuildArtifacts task to identify the right artifact to download. In the case of the ccache artifacts, things are a bit different: this section of the template specifies that the jobs should try to download the ccache artifact of the latest successful run. So far, developing this pipeline was rather straightforward; only a couple of google/github searches were required to figure out how all the tasks work. This was the point where I ran into trouble that is not yet fixed at the time of writing this article. Namely, when using the buildType: 'specific' and buildVersionToDownload: 'latest' attributes in the DownloadBuildArtifacts task, it will simply fail if there has not been a single pipeline that finished with success. Note both of the stressed words:

  • pipeline: not a single job, like libshogun, but the whole pipeline. So even though in case the compilation and test of the C++ library finished successfully, hence a ccache artifact is published, if any subsequent SWIG job (or the windows build) fails in the pipeline, this artifact will not be available for the DownloadBuildArtifacts task as it requires the whole pipeline to finish successfully.
  • success: if any of the tasks ends with warnings, the whole pipeline (Agent.JobStatus variable) will be marked SucceededWithIssues, which is not Succeeded.
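The download step in question looks roughly like this (the pipeline id and artifact name are illustrative):

```yaml
- task: DownloadBuildArtifacts@0
  inputs:
    buildType: 'specific'              # download from another pipeline run...
    buildVersionToDownload: 'latest'   # ...namely the latest one that *succeeded*
    project: $(System.TeamProject)
    pipeline: '1'                      # the build definition id
    artifactName: ccache-gcc
    downloadPath: $(System.ArtifactsDirectory)
```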

Azure Pipelines Flow

Note that DownloadBuildArtifacts simply fails if the specified artifact is not available (see the definition above of what counts as 'available'). If a task fails, by default it simply stops the execution of the job itself. That is actually a problem until the very first ccache artifact has ever been published successfully, as until then the whole job will fail. Of course, there is a way to keep executing a job when a task fails: setting the continueOnError attribute of the task to true. Note though, that although the job will be executed even if a task with continueOnError failed, the whole pipeline will be marked SucceededWithIssues. Now this is becoming a catch-22. Although all the jobs ran successfully and published their ccache artifacts, since DownloadBuildArtifacts causes the JobStatus to be SucceededWithIssues, no subsequent run of the pipeline will ever pick up those artifacts. In fact, to be able to use the ccache artifacts at all, I had to disable the DownloadBuildArtifacts task once, so that the pipeline would finish with Succeeded. There are two ways this issue could be fixed:

  1. patch the DownloadBuildArtifacts so that it accepts artifacts from SucceededWithIssues jobs as well
  2. create a new task that basically downloads the latest artifact regardless of the pipeline's status that produced it
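The continueOnError workaround described above is a one-line change on the task, at the price of the SucceededWithIssues status:

```yaml
- task: DownloadBuildArtifacts@0
  continueOnError: true   # don't fail the job when no artifact exists yet,
                          # but note: the run becomes SucceededWithIssues
  inputs:
    buildType: 'specific'
    buildVersionToDownload: 'latest'
    artifactName: ccache-gcc
    downloadPath: $(System.ArtifactsDirectory)
```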

As for the results, let the numbers speak for themselves:

  • A pipeline without ccache ran for 1:03:32, i.e. more than an hour.
  • A pipeline with ccache ran for 13:28.

So with ccache added to our jobs, the pipeline runs 4.7 times faster than without!

There are still two things missing to make this pipeline 100% complete:

  1. Finish up the MS Windows build, that is being blocked at the moment by:
  2. Extend the pipeline to support PRs; namely, we want a job that runs our clang-formatter script to check for style problems, and all the other jobs should depend on it.

Azure Pipelines has basically taken all the flexibility of BuildBot and uses YAML for describing jobs, which in my past experience has proven so many times to be the best format for describing pipeline jobs. As a bonus, Azure Pipelines is based on TypeScript, which in my opinion is the best combination of a statically typed language and an interpreted one.

i still don't see why my 18 years old sister would use your product?

or a story about how we failed to get accepted into a start-up accelerator program in europe.

well, it has been some weeks now since we received the news that although our project was selected among the top 50 teams that applied to the startup accelerator program for this summer, we did not make it into the Top 10 teams.

It happens, of course. there are so many stories and blog posts out there on this topic, so why this one would be different, i don't know; you tell me at the end. although we feel (naturally) disappointed, the reason behind writing this post is completely different: namely, procrastinating from another task of mine. and that i really wanted to do it.

so where do i start? right in the middle, and hopefully it'll unfold into something readable and understandable; otherwise it's just another waste in the air.

So suddenly the email popped into my inbox that we'd been selected into the top 50 teams among all the applicants, so they'd like to do an interview with us, via skype. let's choose a date for it. no worries, let's pick a date and do it. but hey, coool, we've got in! top 50 teams… but wait, how many applicants were there in the first place? don't know why, but somehow i really wanted to know how many teams applied, just to see, as it's a bit different if you make it into the top 50 out of 100 or 1000, or into the top 50 among the 13 that applied ;) well anyhow, without going any further: till today (2 weeks after the announcement of the teams, with the programme soon starting) they have not released any statistics, e.g. how many teams applied in the first place.

never mind, numbers are for… so the interview day came, we fired up skype and did a 3-way interview. a short intro on who we are, and then let's get down to business: what do we want to create with our product. after a short description, of course, came question after question, more or less about: who's gonna use it? who's our main target? and how are we going to make money out of it? after describing our ideas (which we had of course already covered in 2 previous emails, i.e. who our target is, what the possible applications are, and how it could make money), the question came back again in a different form, now a bit more aggressively: there are already solutions like this. (no, there are NO SOLUTIONS like this, just similar ones. what do i mean by similar? well, in the sense that linux and windows are similar, since they are both operating systems, no? ;) so without being more sarcastic, i just wanna say that after trying out all the competition and writing a loooooooooong email on why our system is different from all of those, and why it would be better, the person asks again why, why a customer would choose us. well, i had two options here: option a) ask him whether he had read the emails we'd been sending them at all, where we answered all these questions in great detail; option b) say nothing about the emails, and just repeat the whole thing we'd written.

i’ve went with option a+b: told him that this has been very well answered in one of our emails—that he assured he read it—and retold the hole thing again. Well, not so surprising he did not buy our answer. Still he kept on asking the same thing (i.e. why would somebody choose our product) in different ways. we’ve tried to answer and tell him that we have proposed various applications of our product, how to make it a business, and then who our target could be, but we thought that one of the thing of an accelerator program is that people there do know their business, and help us in developing a solid business plan. as we obviously did not and do not have a business plan, just various feasible ideas about it. and then came THE question. the ultimate question that has struck my ears so deeply that for more than a minute i couldn’t even react:

“i still don’t see why my 18 years old sister would use your product?”

hell yeah! that’s it! the ultimate question for all the start-up companies out there. so why? well while i was completely mute, my partner tried to tackle this question from left and right, but obviously because of the nature of this question, he just couldn’t answer in the way that would a) make any sense b) would satisfy him

in the meanwhile i was just sitting in front of my computer and started to wonder: why would his sister use an application where she can post max 140 characters and subscribe to other people's 140-character streams as well? and why would his sister ever use a software where she can add people to a list as friends, upload pictures, tag her friends, message them and so on? and then i realised that maybe the answer to this question is really simple: his sister DOES use a computer. she owns one and works with it. and for sure his sister is not an 18-year-old computer geek who cannot wait for the next release of the linux kernel and enjoys setting up and compiling her custom kernel. but still, she does own a computer, and she for sure uses it on a daily basis.

anyhow, that interview just went baaaaaaaaaad. which happens; we are not so experienced in situations like this, especially since we look at problems from a very different perspective than he does.

the guy on the other side realised that something really went wrong with this conversation so he said that hey, let’s have another round, as maybe he is just really tired since he was doing this for the whole day! he will email us… but in any case, if we are going to be rejected we’ll get a very detailed feedback from them!

while we were waiting for the next round of 1 hour interview, we’ve talked a bit with my partner about the previous one and had the same conclusion: WTF? :> but hey, we’ve made it into the top 50 (again, out of how many?)!

so the second round arrived, and things were much more relaxed than the previous one. moreover, people were even quite excited about our idea. of course, some issues like licensing came up, along with the same questions that they had already asked us in emails and that we had answered… but never mind, we repeated the whole thing again. then came the question: do you maybe have a demo? mmm, yes, we do have a demo, but it is in a very, very alpha stage, as we were hoping to come up with a real beta while doing the start-up accelerator program. well, no worries (they said), just send us a video where we can see that little demo, just to see where you guys are at. we agreed, and worked on that demo quite a lot. at the end of our interview we agreed that we'd send the demo and a very detailed answer to some of the questions that had been raised during our conversation but could not be answered right away because of their nature.

2 days later we sent the email with our answers and a video-based demo where they can see the basic functionality of our software.

and we started to wait and wait. but in the meanwhile we already had the feeling that this is just not it. this program is just not the program we thought it was. we have an idea, some proof-of-conceptish demo, and ideas about how we could monetize this whole thing. but we do NOT have an already existing user base, and we certainly do not have a working application that we could hand out to try. and no, we do not have a solid business plan either, just options for how it could be done. but somebody with more expertise should be able to tell us which of these options are really feasible at all, and how exactly.

one week later we got THE letter: bla bla bla great, great, great, but… there were better ones. it does happen. so we did not get into the top 10 teams. well, yeah, it wasn't a surprise for us of course, taking into consideration the whole interview and how it went. well, next time, some other place, with some other people… maybe… hopefully.

all in all, just two things: a) we are still waiting for the email with the detailed feedback on our project, and b) we are still waiting for the email from the person from the first interview to schedule another time for a re-interview.

maybe the next time…

matlab on cluster

it’s like 1.2.3. As i’ve said i’ve got 9 machines for creating a cluster with it. Last week i’ve put together a hadoop cluster and mahout on top of it. Yesterday i’ve installed matlab 2010b on those machines to be able to use parallel toolbox and run code on a matlab cluster. Well it was easier than i thought. Simply followed the manual by matlab, and TADA it all works very nicely. If you wanna know how to do it, just simply fill out this form and you’ll get the right manuals for your system.

getting on with hadoop

So yesterday i was given a little farm of servers (only 8 nodes) to do whatever i want with. As lately i've been working on computer vision and machine learning algorithms that take not only a lot of computation time but a lot of space as well (one of the datasets is 100 gigs, the other is 200 gigs), i'm getting into distributed computing more and more. Namely, hadoop and mahout.

All 8 servers were running debian etch, so first things first, i decided to upgrade them to squeeze. As I don't really trust dist-upgrade that much, I decided not to skip lenny, so the chain of distribution upgrades was: etch -> lenny -> squeeze.

Although I’ve spent some time googling for a utility that could help me to do the upgrade at once on all 8 nodes, i couldn’t find any (if you know one please let me know for the next time). I could have written a script to do this job for me, but there were just waaaay to many questions from the dist-upgrade script each time, especially when i was upgrading from etch to lenny. To be honest the whole upgrade procedure went rather smoothly. never had to go down to the console to restart a machine, it always came back after a reboot. \o/

so finally hadoop got into focus. I went with 0.21.0. It has quite a lot of silly bugs, like the start and stop scripts that won't work unless you define the HADOOP_HOME env variable. Other than that, i followed hadoop's short documentation on how to set up the cluster. I ran into some troubles with permissions, but now finally everything is up and running.

now it’ll be mahout’s turn. I’ve ran it already on my localhost for quite some time now, as I was using some of the algorithms implemented in it. but i’m really excited to see what it’ll do on a real cluster.

mahout has some great algorithms implemented in it, but for instance it does not have an SVM classifier yet. There have been several attempts to implement one, like MAHOUT-227, MAHOUT-232 and MAHOUT-334, but none of them is in a state where it could be included in the HEAD of the repository (see my patches for MAHOUT-232).

This motivated me to start implementing an SVM classifier for mahout from scratch. It's still on its way, and as you can see i got distracted by some little obstacles, but now it's time to write up the code and try to push it to HEAD. let's see what will happen…

my other big interest is bringing computer vision algorithms to hadoop, as i'm working with heaps of images and most of the algorithms could be parallelized. some people have implemented one or two algorithms in the mapreduce framework, but it would be great to see a framework like mahout (or like opencv and itk on the CV side) for hadoop.