
Wednesday, 9 September 2015

Random lessons learnt from building large scale distributed systems, some of them on Azure.

This post is based on more than one project I’ve been involved in, but it was triggered by a recent project where I helped build an IoT system capable of processing and storing tens of thousands of messages a second. The system was built on top of Azure Event Hubs, Azure Web Jobs, Azure Web Sites, the HDInsight implementation of HBase, Blob Storage and Azure Queues.

As the title suggests it is a rather random set of observations, but hopefully, taken together, they form some kind of whole :).

Keep development environment isolated, even in the Cloud

A fully isolated development environment is a must have. Without it developers step on each other’s toes, which leads to unnecessary interruptions, and we all know that nothing kills productivity faster than context switching. When you build an application that uses SQL Server as a backend you would not think about creating one database shared by all developers, because it’s relatively simple to have an instance of SQL Server running locally. When it comes to services that live only in the Cloud this is not that simple, and the first obvious solution often seems to be to share them within the team, especially in places that are not big on automation.

Another solution to this problem is to mock the Cloud services. I strongly recommend against it, as mocks hide a lot of very important details, and discovering those details only in CI (or UAT or Production) is not an efficient way of working.

If sharing is bad then each developer could get a separate instance of each Cloud service. This is great in theory but it can get expensive very quickly, e.g. the smallest HBase cluster on Azure consists of 6 VMs.

The sweet spot seems to be a single instance of a given Cloud service with isolation applied at the next logical level.

  • HBase – table(s) per developer
  • Service Bus – event hub per developer
  • Azure Search – index(es) per developer
  • Blob Storage – container per developer
  • Azure Queues – queue(s) per developer

Each configuration setting that is developer specific can be encapsulated in a class, so the right value is supplied to the application automatically based on some characteristic of the environment in which the application is running, e.g. the machine name.
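
A minimal sketch of what such a class can look like (the class, the resource names and the environment variable are illustrative assumptions, not the project’s actual code):

using System;

// Supplies developer/environment specific resource names. Locally the machine name is used
// as a suffix, so every developer gets their own table, event hub and queue automatically.
public class EnvironmentSettings
{
    public string HBaseTableName { get { return WithSuffix("telemetry"); } }
    public string EventHubName { get { return WithSuffix("ingest"); } }
    public string QueueName { get { return WithSuffix("notifications"); } }

    private static string WithSuffix(string baseName)
    {
        // In CI/UAT/Production the suffix comes from deployment configuration;
        // on a dev machine we fall back to the machine name.
        var suffix = Environment.GetEnvironmentVariable("ENVIRONMENT_SUFFIX")
                     ?? Environment.MachineName.ToLowerInvariant();
        return baseName + "-" + suffix;
    }
}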

Everything can fail

The number of moving parts increases with scale, and so do the chances of one or more of them failing. This means that every logical component of the system needs to be able to recover automatically from an unexpected crash of one or more of its physical components. Most PaaS offerings come with this feature out of the box. In our case the only part we had to take care of was the processing component. This was relatively easy, as Azure lets you continuously run N instances of a given console app (Web Job) and restarts them when they crash, or redeploys them when the underlying hardware is not healthy.

When the system recovers from a failure it should start where it stopped, which means it may need to retry the last operation. In a distributed system transactions are not available, so if an operation modifies N resources then all of them should be idempotent so that retries don’t corrupt the state of the system. If there is a resource that doesn’t provide such a guarantee then it needs to be modified last. This is not a bulletproof solution but it works most of the time.

  • HBase – writes are done as upserts
  • Event Hub – checkpointing is idempotent
  • Blob Storage – writes can be done as upserts
  • Azure Queues – sending messages is NOT idempotent, so de-duplication needs to be done on the client side (a minimal sketch follows below)
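
A minimal sketch of what client-side de-duplication can look like (the message id and the in-memory store are assumptions for illustration; a real system might keep the seen ids in HBase or Blob Storage instead):

using System;
using System.Collections.Concurrent;

// Remembers which message ids have already been processed so that a retried
// (and therefore duplicated) queue message is handled only once.
public class MessageDeduplicator
{
    private readonly ConcurrentDictionary<Guid, DateTime> _seen =
        new ConcurrentDictionary<Guid, DateTime>();

    // Returns true only the first time a given message id is observed.
    public bool TryAccept(Guid messageId)
    {
        return _seen.TryAdd(messageId, DateTime.UtcNow);
    }

    // Should be called periodically so the dictionary doesn't grow without bound.
    public void Evict(TimeSpan olderThan)
    {
        var cutoff = DateTime.UtcNow - olderThan;
        foreach (var pair in _seen)
        {
            if (pair.Value < cutoff)
            {
                DateTime removed;
                _seen.TryRemove(pair.Key, out removed);
            }
        }
    }
}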

Sometimes failure is not immediate and manifests itself as an operation running much longer than it should. This is why it is important to have relatively short timeouts for all IO related operations and to have a strategy that deals with them when they occur. E.g. if latency is important then a valid strategy might be to drop/log the data currently being processed and continue. Another option is to retry the failed operation. We found that the default timeouts in the client libraries were longer than what we were comfortable with, e.g. the default timeout in the HBase client is 100 seconds.
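
A rough sketch of the “drop/log and continue” strategy (the 5 second budget and the delegate shape are assumptions for illustration, not the project’s actual code):

using System;
using System.Threading;
using System.Threading.Tasks;

public static class IoWithTimeout
{
    // Runs an IO operation with a short, explicit timeout instead of relying on
    // library defaults (which can be as long as 100 seconds).
    public static async Task RunAsync(Func<CancellationToken, Task> ioOperation)
    {
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5)))
        {
            try
            {
                await ioOperation(cts.Token);
            }
            catch (OperationCanceledException)
            {
                // Latency matters more than this single batch: log it, drop it, move on.
                // Retrying the operation is the other valid strategy mentioned above.
                Console.Error.WriteLine("IO operation timed out; dropping the current batch.");
            }
        }
    }
}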

Don't lose sight of the big picture

It is crucial to have access to an aggregated view of logs from the whole system. I can’t imagine having tail running for multiple components and doing the merge in my head. Tools like Seq are worth every cent you pay for them. Seq is so easy to use that we used it for local development as well. Once the logging is set up I recommend spending 15 minutes a day watching logs. A well instrumented system tells a story, and when the story does not make sense you know you have a problem. We have found several important but hard to track bugs just by watching logs.

And log errors, always log errors. A new component should not leave your dev machine unless its error logging is configured.
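
A minimal sketch of what the logging setup can look like (assuming Serilog with its Seq sink, which is one way to do it; the server address and property names are illustrative, not the project’s actual configuration):

using System;
using Serilog;

public static class LoggingSetup
{
    public static void Configure()
    {
        // Requires the Serilog and Serilog.Sinks.Seq packages.
        Log.Logger = new LoggerConfiguration()
            .Enrich.WithProperty("Component", "Processor") // easy to filter per component in Seq
            .WriteTo.Seq("http://logs.example.local:5341") // hypothetical Seq server address
            .CreateLogger();
    }

    public static void Example()
    {
        Log.Information("Processed {MessageCount} messages in {Elapsed} ms", 1500, 230);

        try
        {
            throw new InvalidOperationException("demo failure");
        }
        catch (Exception ex)
        {
            // Errors are always logged, with enough context to investigate later.
            Log.Error(ex, "Failed to store batch {BatchId}", Guid.NewGuid());
        }
    }
}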

Ability to scale is a feature

Every application has performance requirements, but they are rarely explicitly stated, and so performance problems are discovered and fixed when the application is already in Production. As bad as it sounds, in most cases this is not a big deal: the application needs only a few tweaks, or we can simply throw more hardware at it.

This is not the case when the application needs to deal with tens of thousands of messages a second. Here performance is a feature and needs to be constantly checked. Ideally a proper load test would be run after each commit, the same way we run CI builds, but from experience once a day is enough to quickly find regressions and limit the scope of changes that caused them.

Predictable load testing is hard

Most CI builds run a set of tests that check the correctness of the system, and it doesn’t matter whether the tests run for 0.5 or 5 or 50 seconds as long as all assertions pass. This means that the performance of the underlying hardware doesn’t affect the outcome of the build. This is not true when it comes to load testing, where slower than usual hardware can lead to false negatives, which translate to time wasted investigating problems that don’t exist.

In an ideal world the test would run on isolated hardware, but this is not really possible in the Cloud, which is a shared environment, and Azure is not an exception here. What we noticed is that using high spec VMs and running the load test at the same time of day (in the morning) helped keep the performance of the environment consistent. This is a guess, but it looks like the bigger the VM the higher the chance of that VM being the only VM on its host. Even with all those tweaks in place, test runs with no code changes would still differ by around 10%. Having load testing set up from the beginning of the project helps spot outliers and reduces the amount of wasted time.

Generating enough of the right load is hard

We initially started with Visual Studio Load Test but we didn’t find a way to fully control the data it uses to generate load. All we could do was generate all the WebRequests up front, which is a bit of a problem at this scale. We couldn’t re-use requests as each of them had to contain different data.

JMeter doesn’t suffer from this problem and was able to generate enough load from just one large VM. Its templating language is a bit unusual but it is pretty powerful, and at the end of the day that is what matters most.

JMeter listeners can significantly slow down the load generator. After some experimentation we settled on the Summary Report and Save Responses to a file (only failures) listeners. They gave us enough information and had a very small impact on the overall performance of JMeter.

The default batch file limits the JMeter heap size to 512MB, which is not a lot. We simply removed the limit and let the JVM pick one for us. We used a 64 bit JVM which was more than happy to consume 4GB of memory.

Don’t trust average values

A surprising number of tools built for load testing show the average as the main metric. Unfortunately the average hides outliers; it is a bit like looking at a 3D world through 2D glasses. Percentiles are a much better approach, as they show the whole spectrum of results and help you make an informed decision about whether it’s worth investing more in performance optimization. This is important because performance work never finishes and can consume an infinite amount of resources.
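
A small sketch of the difference (nearest-rank percentiles over raw latency samples; the numbers are made up for illustration):

using System;
using System.Linq;

public static class LatencyReport
{
    // Nearest-rank percentile: sort the samples and pick the value at the requested rank.
    public static double Percentile(double[] samples, double percentile)
    {
        var sorted = samples.OrderBy(x => x).ToArray();
        var rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length) - 1;
        return sorted[Math.Max(rank, 0)];
    }

    public static void Main()
    {
        var latenciesMs = new double[] { 12, 14, 15, 13, 11, 16, 14, 13, 950, 12 };

        // The average alone doesn't show whether one request was slow or all of them were;
        // the percentiles expose the tail explicitly.
        Console.WriteLine("avg = {0:F1} ms", latenciesMs.Average());
        Console.WriteLine("p50 = {0:F1} ms", Percentile(latenciesMs, 50));
        Console.WriteLine("p99 = {0:F1} ms", Percentile(latenciesMs, 99));
    }
}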

Feedback cycle on performance work is looooooooooooong

It is good to start load testing early, as in most cases it is impossible to make a change and re-run the whole load test locally. This means each change takes quite a bit of time to verify (in our case it was 45 minutes). In such a situation it is tempting to test multiple changes at once, but that is a very slippery slope. In a complex system it is hard to predict the impact a given change will have on each component, so making assumptions is very dangerous. As an example, I sped up a component responsible for processing data, which in turn put more load on the component responsible for storing data, which in turn made the whole system slower. And by change I don’t always mean a code change; in our case quite a few of them were changes to the configuration of Event Hub and HBase.

Testing performance is not cheap

Load testing might require a lot of resources for a relatively short period of time, so generally it is a good idea to create them automatically before the test run and dispose of them automatically once the test run is finished. In our case we needed the resources for around 2 hours a day. This can be done with Azure, though having to use three different APIs to create all the required resources was not fun. I hope more and more services will be exposed via the Resource Manager API.

On top of that it takes a lot of time to set everything up in a fully automated fashion, but if the system needs to handle significant load to be considered successful then performance related work needs to be part of the regular day-to-day development cycle.

Automate ALL THE THINGS

I’m adding this paragraph for completeness but I hope that by now it is obvious that automation is a must when building a complex system. We have used Team City, Octopus Deploy and a thousand or so lines of custom PowerShell code. This setup worked great.

Event Hubs can scale

Event Hubs are partitioned and the throughput of each partition is limited to 1k messages a second. This means that an Event Hub shouldn’t be perceived as a single fat pipe that can accept any load thrown at it. For example, an Event Hub that can in theory handle 50k messages a second might start throwing exceptions when the load reaches 1k messages a second, because all messages happen to be sent to the same partition. That’s why it’s very important that the data used to compute the PartitionKey is distributed as evenly as possible. To achieve that, try a few hashing algorithms; we used SHA1 and it worked great.
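
A rough sketch of computing such a PartitionKey (the device id input and the helper name are assumptions for illustration):

using System;
using System.Security.Cryptography;
using System.Text;

public static class PartitionKeys
{
    // Hashing the natural key (e.g. a device id) means that keys which are textually
    // similar still end up spread evenly across Event Hub partitions.
    public static string FromDeviceId(string deviceId)
    {
        using (var sha1 = SHA1.Create())
        {
            var hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(deviceId));
            return BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}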

Each Event Hub can have up to 20 independent consumer groups. Each consumer group represents a set of pointers to the message stream in the Event Hub. There is one pointer per partition. This means that there can be only one Event Hub reader per partition per consumer group.

Let’s consider an Event Hub with 50 partitions. If the readers need to do some data processing then a single VM might not be enough to handle the load and the processing needs to be distributed. When this happens the cluster of VMs needs to figure out which VM will run readers for which partitions and what happens when one of the readers and/or VMs disappears. Reaching consensus in a distributed system is a hard problem to solve. Fortunately, Microsoft provides the Microsoft.Azure.ServiceBus.EventProcessorHost package which takes care of this problem. It can take a couple of minutes for the system to reach consensus but other than that it just works.
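
The reader side ends up looking roughly like this (a minimal sketch against the EventProcessorHost API; the class name, hub name and checkpointing policy are illustrative assumptions):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public class TelemetryProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context)
    {
        // Called when this host instance acquires the lease for a partition.
        return Task.FromResult(0);
    }

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var message in messages)
        {
            // ... process/store the message ...
        }

        // Checkpointing is idempotent, so it is safe to replay events after a crash.
        await context.CheckpointAsync();
    }

    public Task CloseAsync(PartitionContext context, CloseReason reason)
    {
        return Task.FromResult(0);
    }
}

// Somewhere in the Web Job startup:
// var host = new EventProcessorHost("host-1", "ingest", EventHubConsumerGroup.DefaultGroupName,
//                                   eventHubConnectionString, storageConnectionString);
// await host.RegisterEventProcessorAsync<TelemetryProcessor>();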

HBase can scale

HBase is a distributed database where data is stored in lexicographical order. Each node in the cluster stores an ordered subset of the overall data. This means that data is partitioned, and HBase can suffer from the same problem as the one described earlier in the context of Event Hubs. HBase partitions data using a row key that is supplied by the application. As long as row keys written at more or less the same time don’t share the same prefix, the data should not all be sent to the same node and there should be no hot spots in the system.

To achieve that, the prefix can be based on a hash of the whole key or part of it. A successful strategy needs to take into account the way the data will be read from the cluster. For example, if the key has the form PREFIX_TIMESTAMP_SOMEOTHERDATA and in most cases the data needs to be read based on a range of dates, then the prefix values need to belong to a predictable and ideally small set of values (e.g. 00 to 15). The query API takes a start and end key as input, so to read all values between DATE1 and DATE2 we need to send as many queries to HBase as there are possible prefix values. In the example above that would be 16 queries.
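
A sketch of this row key strategy (the 16 buckets match the example above; the key layout and the hash choice are illustrative, not the project’s actual scheme):

using System;
using System.Security.Cryptography;
using System.Text;

public static class RowKeys
{
    private const int BucketCount = 16;

    // Builds PREFIX_TIMESTAMP_SOMEOTHERDATA where the prefix is a stable hash bucket (00..15),
    // so rows written at the same time are spread across region servers instead of
    // all landing on the server holding the "current" part of the key space.
    public static string Build(string someOtherData, DateTime timestampUtc)
    {
        return string.Format("{0:D2}_{1:yyyyMMddHHmmss}_{2}",
            Bucket(someOtherData), timestampUtc, someOtherData);
    }

    // Reading a date range means issuing one scan per bucket: 16 (start, end) key pairs here.
    public static Tuple<string, string>[] ScanRangesFor(DateTime fromUtc, DateTime toUtc)
    {
        var ranges = new Tuple<string, string>[BucketCount];
        for (var bucket = 0; bucket < BucketCount; bucket++)
        {
            ranges[bucket] = Tuple.Create(
                string.Format("{0:D2}_{1:yyyyMMddHHmmss}", bucket, fromUtc),
                string.Format("{0:D2}_{1:yyyyMMddHHmmss}", bucket, toUtc));
        }
        return ranges;
    }

    // A stable hash (unlike string.GetHashCode) so the same value always maps to the same bucket.
    private static int Bucket(string value)
    {
        using (var sha1 = SHA1.Create())
        {
            var hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(value));
            return hash[0] % BucketCount;
        }
    }
}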

HBase exposes multiple APIs. The easiest one to use is the REST API, called Stargate. This API works well but its implementation on Azure is far from perfect. Descriptions of most of the problems can be found on the GitHub page of the SDK project. Fortunately, the Java API works very well and it is actually the API recommended by the documentation when the highest possible level of performance is required. From the tests we have done the Java API is fast, reliable and offers predictable latency. The only drawback is that it is not exposed externally, so HBase and its clients need to be part of the same VNet.

BTW Java 8 with lambdas and streams is not that bad :) even though the lack of a var keyword and checked exceptions are still painful.

Batch size matters a lot

Sending, reading and processing data in batches can significantly improve the overall throughput of the system. The important thing is to find the right batch size. If the size is too small then the gains are not big, and if it is too big then the change can actually slow down the whole system. It took us some time to find the sweet spot in our case.

  • A single Event Hub message has a size limit of 256 KB, which might be enough to squeeze multiple application messages into it. On top of that, Event Hub accepts multiple messages in a single call, so another level of batching is possible (see the sketch after this list). Batching can also be specified on the read side of things using the EventProcessorHost configuration.
  • HBase accepts multiple rows at a time and its read API can return multiple rows at a time. Be careful if you use its REST API for reading, as it will return all available data at once unless the batch parameter is specified. Another behaviour that took us by surprise is that batch represents the number of cells (not rows) to return, which means a single batch can contain incomplete data. More details about the problem and a possible solution can be found here.
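
A sketch of the sending side of the first point (assuming the classic Microsoft.ServiceBus.Messaging client; the helper and the message shape are illustrative):

using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.ServiceBus.Messaging;

public static class BatchedSender
{
    // Sends a set of already serialized application messages in a single call to the Event Hub.
    // The caller is responsible for keeping the batch under the service's size limits.
    public static void Send(EventHubClient client, IEnumerable<string> serializedMessages)
    {
        var batch = serializedMessages
            .Select(m => new EventData(Encoding.UTF8.GetBytes(m)))
            .ToList();

        client.SendBatch(batch);
    }
}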

Web Jobs can scale

Azure makes it very easy to scale the number of running instances up or down. The application code doesn’t have to be aware of this process, but it can be if that is beneficial. Azure notifies jobs before it shuts them down by placing a file in their working folder. The Azure Web Jobs SDK comes with useful code that does the heavy lifting, and all our custom code needed to do was respond to the notification. We found this very useful, because Event Hub readers lease partitions for a specific amount of time, so if they are not shut down gracefully the process of reaching distributed consensus takes significantly more time.
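
A minimal sketch of reacting to that notification (using WebJobsShutdownWatcher from the Web Jobs SDK; what exactly happens on shutdown, e.g. unregistering the Event Hub processor, is an assumption for illustration):

using Microsoft.Azure.WebJobs;

public class Program
{
    public static void Main()
    {
        using (var shutdownWatcher = new WebJobsShutdownWatcher())
        {
            // The token is cancelled when Azure drops the shutdown file into the
            // job's working folder.
            var shutdownToken = shutdownWatcher.Token;

            // Start processing here (e.g. register the EventProcessorHost), then block
            // until a shutdown is requested.
            shutdownToken.WaitHandle.WaitOne();

            // Graceful shutdown: release the Event Hub partition leases here
            // (e.g. host.UnregisterEventProcessorAsync().Wait()) so that other
            // instances can pick them up without waiting for the leases to expire.
        }
    }
}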

Blob Storage can scale

We used Blob Storage as a very simple document database. It was fast, reliable and easy to use.

Sunday, 6 May 2012

Quick IO performance test of VirtualBox, Parallels and VMware

I’ve been using VirtualBox for over a year and I’m pretty happy with it but I wanted to see if new releases of VMware and Parallels can give me even better IO performance. This is a very simple and coarse test but it clearly indicates that I should give VMware a try.

Test setup:

  • Guest: Windows 7 64bit SP1
  • Guest: 8 CPUs and 4GB of RAM
  • Guest: Vendor specific additions installed
  • VMware: Workstation 8.0.3
  • Parallels: Workstation 6.0.13950
  • VirtualBox: 4.1.12
  • CrystalDiskMark 3.0.1c was run 3 times for each app + host

Thursday, 22 December 2011

YOW 2011 - loose thoughts

The talks I attended at this year’s conference were focused on Agile, functional programming and low level performance. It turned out to be quite an interesting mix.

Mary Poppendieck talked about ways of supporting continuous design. One piece of advice was to create an environment that facilitates fair fights. To achieve this you need to hire good, diverse people. Their diversity lets them solve complex problems because they can approach them from multiple angles. Once you have a team like that, leave them alone and simply keep supporting them.

I haven’t learnt much new at the Agile Architecture workshop delivered by Rebecca Wirfs-Brock, but it was good to hear that I’m not some kind of weirdo who demands the impossible :).

Simon Peyton Jones delivered 2 amazing sessions, full of passion, about functional programming. I’ve done a bit of this type of programming before, but I was surprised by Haskell’s type system and its ability to ensure the lack of side effects at compile time. On top of that, on my way back to Sydney I had a good chat with Tony Morris, who told me about a project where he used functional programming to create a composable data access layer. Composability and strict control over side effects are enough to push Haskell to the top of my list of languages to play with :).

Inspired by Coderetreat Brisbane organized by Mark Ryall, I’ve decided to use Conway’s Game of Life as the problem I will try to solve every time I learn a new language. It worked for CoffeeScript and I hope it will work for Haskell.

The end of the conference was filled with performance tips from the .NET (Joel Pobar and Patrick Cooney) and Java (Martin Thompson) lands. Both sessions emphasized that computation is cheap and memory access is slow. Simply keeping all required data in the L1/L2 CPU cache can cut execution time in half. Obviously this was presented using a micro benchmark, but it is still something worth keeping in mind.

Functional languages rely on immutable data structures to isolate processing and in this way control side effects. Unfortunately this means that a lot of memory is pushed around, and this has a significant influence on overall performance. Again, it’s all about trade-offs :).

Martin talked about “mechanical sympathy”, which boils down to having at least a basic understanding of the hardware your software runs on. Without it your software not only doesn’t take full advantage of the underlying hardware but often works against it, which has a severe impact on overall performance. It’s one of those pieces of advice that most of us won’t use on a daily basis, as most of our infrastructure is in the Cloud, but it’s good to keep in mind.

We’ve been told multiple times that the “free lunch is over” and CPUs are not getting any faster. Martin proved that this is not correct. It’s true that we are not getting more MHz (we are actually getting fewer MHz than we used to), but CPUs keep getting faster because every year engineers design better chips. The message Martin was trying to send is that we should prove or disprove assumptions based on evidence, not beliefs.

All in all it was another good conference and I will try to attend it next year.

Saturday, 11 December 2010

YOW 2010 - loose thoughts

It doesn’t happen often that nearly every single talk at a conference is great and on top of that half of them are actually funny. That’s YOW 2010 for you summarized in one sentence :).

Justin Sheehy explained how to quickly narrow down the choice of database technologies that might be useful in a particular case. His method is based on a simple matrix of operational requirements (local, single server, distributed, etc.) by data model (relational, column families, key/value, etc.). Once this is done and there are only a few solutions on the table, a more sophisticated and time consuming investigation can be conducted to choose the right one. Every single NoSQL solution is different and a generic SQL/NoSQL split doesn’t really make sense. It’s all about trade-offs. It’s amazing how often this simple fact needs to be repeated.

Eric Evans’ talk focused on the idea of bounded contexts. In other words, a single enterprise model is an anti-pattern and one of software engineering’s fallacies. Eric also mentioned a few disadvantages of doing big design upfront (AKA let’s build a great framework that less skilled devs can use) and of postponing the initial release for a long time. Nothing really new, but it was well delivered.

Gregor Hohpe talked about the trade-off decisions Google had to make to reach its current scale. He covered the whole spectrum of optimizations, from data access at the disk level to minimize heat generation, to skipping longer than expected parts of map reduce executions to make sure results are delivered in a timely manner. When I asked Gregor whether Google uses regular Pub/Sub or transactions, he said that if there is a technology out there, Google has built something on top of it :). Just use the right tool for the job.

The second day started with Erik Meijer explaining coSQL (AKA NoSQL). It was a funny presentation about what NoSQL really is and how it relates to SQL. The two complement each other, even in a mathematical sense, hence the co part of coSQL. Additionally, co is more positive than no and this makes Erik happy :).

Jim Webber talked passionately about how much he hates (sorry, dislikes) ESBs and how rarely an ESB is the right tool for the job. His presentation was extremely funny but still full of useful information. The main point was that a custom built system can be cheaper (though not cheap) and less risky to deploy than an out of the box ESB, which often comes with a substantial up-front cost.

Dave Farley took us into the world of <1ms latency and a throughput of 100k per second. According to Dave this is achievable on commodity servers. The main enablers seem to be avoiding synchronization, keeping as few threads per core as possible, keeping all the data in memory and keeping methods very short. One CPU can execute 1 billion instructions a second. That’s a lot, and as long as we don’t waste it, today’s hardware should be more than enough for the needs of most consumers. The main message was that we underestimate what we can get from today’s hardware. I suppose this is only partially true, because nowadays we rarely deploy apps on real hardware. In most cases all we see is a VM that shares the host with a gazillion other VMs. This might be the main reason why the perception of current hardware capabilities is skewed.

After the conference there were 2 days of workshops. I spent the first day with Ian Robinson and Jim Webber learning about REST. What I believed constituted a fully blown RESTful service was actually a very basic RESTful service that scores only 1 out of 3 points on the Richardson maturity model. Each of the levels has its place, but obviously the higher you get the more you take advantage of the Web, and that’s the whole purpose of using REST. REST is CRUDish as it mostly relies on GET, POST, PUT and DELETE. My initial thought was that this is very limiting, but it turned out that it doesn’t have to be. The same applies to the lack of transactions. This can be worked around with a proper structure of resources, meaningful response codes and proper use of HTTP idioms. Another important thing to keep in mind is that the domain model shouldn’t be exposed directly. What you want to expose instead are resources that represent client–server interactions (use cases). In most cases O(resources) > O(domain classes) – notation by Jim Webber :). The Web is inherently based on polling (request/response), thus REST is not suitable for apps which require very low latency. In that case you might want to use Pub/Sub.

The next day I attended a workshop with Corey Haines. This was a true hands-on workshop: I spent at least half a day on code retreats, code katas and coding dojos. Going back to the very basics was surprisingly refreshing. I spent two 45 minute sessions constantly refactoring maybe 15 lines of code until most of the if statements were gone and the code read properly. You wouldn’t do this at work, but the whole point of the exercise was to go over the line and try to come up with the best possible code without feeling time pressure.

Last but not least, the attendees were fantastic and every coffee/lunch break was full of valuable conversations.

I had an amazing time and YOW 2010 is the best conference I’ve ever been to.

Monday, 30 June 2008

A setting that can boost performance of any heavily network-dependent application

By default .NET allows only 2 connections to a given network address per AppDomain. In most cases this works fine, but if your app makes a couple of dozen network calls a second then this value might be too small and might actually cause a bottleneck that is very hard to diagnose. I increased this setting to the value recommended by Microsoft (number_of_cores x 12) and one of my services sped up significantly. Having said that, I have to stress that there is no guarantee this setting will work in your case. Remember: measure, measure and once again measure when you optimize.
And the setting is:
<system.net>
  <connectionManagement>
    <add address="*" maxconnection="96"/>
  </connectionManagement>
</system.net>
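
If editing the config file is not an option, the same limit can also be raised in code; this is the programmatic equivalent (an addition for illustration, not from the original post):

using System.Net;

public static class ConnectionLimit
{
    public static void Apply()
    {
        // Equivalent to the configuration entry above: raises the per-host
        // connection limit for the current AppDomain.
        ServicePointManager.DefaultConnectionLimit = 96;
    }
}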

Tuesday, 15 January 2008

LINQ query versus compiled LINQ query

Rico Mariani changed his position a few months ago but he still manages to publish his Performance Quizzes from time to time. Today he explains the difference between LINQ queries and compiled LINQ queries. It's obvious that one should cache as much as possible, because in most cases it is better to pay a high price once than a slightly reduced one many times. But the interesting part of the post is how Microsoft came up with the current implementation of compiled LINQ queries. Nearly every good performance solution is a matter of realizing that being very generic doesn't pay off. Crazy about performance topics? Go and solve Rico's latest quiz. Once you are done, check if you are right.
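
For reference, a compiled LINQ to SQL query looks roughly like this (the entity and data context are hypothetical, for illustration only):

using System;
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Linq;

// Hypothetical LINQ to SQL entity and data context.
[Table(Name = "Customers")]
public class Customer
{
    [Column(IsPrimaryKey = true)] public int Id { get; set; }
    [Column] public string City { get; set; }
}

public class NorthwindDataContext : DataContext
{
    public NorthwindDataContext(string connectionString) : base(connectionString) { }
    public Table<Customer> Customers { get { return GetTable<Customer>(); } }
}

public static class CompiledQueries
{
    // The query is translated to SQL once; every execution after that reuses the result.
    public static readonly Func<NorthwindDataContext, string, IQueryable<Customer>> CustomersByCity =
        CompiledQuery.Compile((NorthwindDataContext db, string city) =>
            db.Customers.Where(c => c.City == city));
}

// Usage:
// using (var db = new NorthwindDataContext(connectionString))
// {
//     var sydneyCustomers = CompiledQueries.CustomersByCity(db, "Sydney").ToList();
// }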

Tuesday, 17 July 2007

DTrace - an interesting tracing solution

I've just come across an interesting article about DTrace, a tracing solution developed by Sun. I still need to dive into it deeper, but its key features are:
  • zero overhead when it's turned off
  • a C based query language that allows you to query the available probes
  • it allows you to analyze the whole system at a very low level (kernel operations) and/or a high level (e.g. number of garbage collections)
  • it gathers probes only when it's safe for the system
  • it's built into Solaris 10 and will be part of the next operating system from Apple, called Leopard
It has won many awards and it really looks great because it provides a very precise picture of the system.

Wednesday, 16 May 2007

64 bits doesn't come for free

Nothing comes for free. This is obvious, but I still see a lot of people thinking that the 64 bit architecture is going to solve all their performance problems, which is not true. Maoni explains this in .NET terms.

Sunday, 13 May 2007

SQL Server and lock escalation

A few weeks ago Kevin Kline gave a talk in Dublin about SQL Server performance and how to make the most of it. The talk was very interesting because Kevin touched a few times on SQL Server internals. The most surprising part was related to how SQL Server escalates locks. Kevin mentioned that once SQL Server has acquired around 4000 locks within a table, it escalates them into a table level lock. What is even more surprising is the fact that this value is hardcoded. I've tried a few times to prevent SQL Server from escalating locks and I've always failed. Now I know why :)