Wednesday, January 24, 2024

Scheduling tasks and sharing state with streams

Recently we built a system that needs to perform two tasks. Task 1 runs every 15 minutes, task 2 runs every 2 minutes. Task 1 kicks off some background jobs (an upload to BigQuery); task 2 checks on the results of these background jobs and does some cleanup when they are done (deleting the uploaded files). The two tasks need to share information back and forth about which jobs are running in the background.

Now think to yourself (ignore the title for now 😉): what would be the most elegant way to implement this? I suspect that most developers will come up with a solution that involves some locking, synchronisation and global state, for example sharing the information through a transactional database, or using a semaphore to prevent the two tasks from running at the same time plus a global variable to share information. This is understandable: most programming environments simply do not offer better techniques for these kinds of problems!

However, if your environment supports streams and has some kind of scheduling, here are two tricks you can use: one for scheduling the tasks, the other for sharing information without a global variable.

Here is an example of the first trick, written in Scala using the ZIO streams library. Read on for an explanation.

import zio._
import zio.stream._

def performTask1: Task[Unit] = ???
def performTask2: Task[Unit] = ???

// An enumeration (scala 2 style) for our task.
sealed trait BusinessTask
object Task1 extends BusinessTask
object Task2 extends BusinessTask

ZStream.mergeAllUnbounded()(
  ZStream.fromSchedule(Schedule.fixed(15.minutes)).as(Task1),
  ZStream.fromSchedule(Schedule.fixed(2.minutes)).as(Task2)
)
  .mapZIO {
    case Task1 => performTask1
    case Task2 => performTask2
  }
  .runDrain

We create two streams; each stream emits sequential numbers on a schedule. As you can see, the schedules correspond directly to the requirements. We do not really care about the sequential numbers, so we use the stream operator as to convert each emitted value into a value from the BusinessTask enumeration.

Then we merge the two streams. We now have a stream that emits the two enumeration values at the time the corresponding task should run. This is already a big win! Even when the two schedules produce an item at the same time, the tasks will run sequentially. This is because by default streams are evaluated without parallelism.

We are not there yet though. The tasks need to share information. They could access a shared variable, but then we would still have tightly coupled components and no guarantee that the shared variable is used correctly.

Also, wouldn't it be great if performTask1 and performTask2 were functions that can be tested in isolation? With streams this is possible.

Here is the second part of the idea. Again, read on for an explanation.

case class State(...)

val initialState = State(...)

def performTask1(state: State): Task[State] = ???
def performTask2(state: State): Task[State] = ???

ZStream.mergeAllUnbounded()(
  ZStream.fromSchedule(Schedule.fixed(15.minutes)).as(Task1),
  ZStream.fromSchedule(Schedule.fixed(2.minutes)).as(Task2)
)
  .scanZIO(initialState) { (state, task) =>
    task match {
      case Task1 => performTask1(state)
      case Task2 => performTask2(state)
    }
  }
  .runDrain

We have changed the signatures of the performTask* methods. Also, the mapZIO operator has been replaced with scanZIO. The stream operator scanZIO works much like foldLeft on collections. Like foldLeft, it accepts an initial state and a function that combines the accumulated state with the next stream element (of type BusinessTask) to produce the next state.

Stream operator scanZIO also emits the new states. This allows for further common processing. For example, we can persist the state to disk, or collect custom metrics about the state.
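As a minimal sketch of that idea (persistState and recordMetrics are hypothetical helpers, not part of the example above), such processing can be appended to the pipeline with tap:

def persistState(state: State): Task[Unit] = ???
def recordMetrics(state: State): Task[Unit] = ???

ZStream.mergeAllUnbounded()(
  ZStream.fromSchedule(Schedule.fixed(15.minutes)).as(Task1),
  ZStream.fromSchedule(Schedule.fixed(2.minutes)).as(Task2)
)
  .scanZIO(initialState) { (state, task) =>
    task match {
      case Task1 => performTask1(state)
      case Task2 => performTask2(state)
    }
  }
  // Every state emitted by scanZIO passes through here.
  .tap(state => persistState(state) *> recordMetrics(state))
  .runDrain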

Conclusion

Using libraries with higher-level constructs like streams, we can express straightforward requirements in a straightforward way. With a few lines of code we have solved the scheduling requirement, and shown an elegant way of sharing information between tasks without global variables.

Sunday, November 26, 2023

Discovering scala-cli while fixing my digital photo archive

Over the years I built up a nice digital photo library with my family. It is a messy process. Here are some of the things that can go wrong:

  • Digital cameras that add incompatible exif metadata.
  • Some files have exif tag CreateDate, others DateTimeOriginal.
  • Images shared via Whatsapp or Signal do not have an exif date tag at all.
  • Wrong rotation.
  • Fuzzy, yet memorable jpeg images which take 15 MB because of their resolution and high quality settings.
  • Badly supported ancient movie formats like 3gp and RIFF AVI.
  • Old movie formats that need 3 times more disk space than h.265.
  • Losing almost all your photos because you thought you could copy an iPhoto library using tar and cp (hint: you can’t). (This took a low-level hard disk scan and months of manual de-duplication work to recover the photos.)
  • Another low-level scan of an SD card to find accidentally deleted photos.
  • Date in image file name corresponds to import date, not creation date.
  • Weird file names that order the files differently than from creation date.
  • Images from 2015 are stored in the folder for 2009.
  • etc.

I wrote countless bash scripts to mold the collection into order, unfortunately with varying success. However, now that I am ready to import the library into Immich (please, do sponsor them, they are building a very nice product!), I decided to start cleaning up everything.

So there I was, writing yet another bash script, struggling with parsing a date response from exiftool. And then I remembered the recent articles about scala-cli and decided to try it out.

The experience was amazing! Even without proper IDE support, I was able to crank out scripts that did more, and did it more accurately and faster, than I could ever have accomplished in bash.

Here are some of the things that I learned:

  • Take the time to learn os-lib.
  • If the scala code gets harder to write, open a proper IDE and use code completion. Then copy the code over to your .sc file.
  • One well-placed .par (using scala-parallel-collections) can more than quadruple the performance of your script!
  • You will still spend a lot of time parsing the output from other programs (like exiftool); see the sketch after this list.
  • Scala-cli scripts also run very well from GitHub Actions.
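To give an impression, here is roughly what such a script looks like (a minimal sketch; the dependency versions, the photos directory and the exiftool invocation are assumptions, adjust them to your own setup):

//> using dep "com.lihaoyi::os-lib:0.9.3"
//> using dep "org.scala-lang.modules::scala-parallel-collections:1.0.4"

import scala.collection.parallel.CollectionConverters._

// Find all jpeg files below the (hypothetical) photos directory.
val photos = os.walk(os.pwd / "photos").filter(_.ext.toLowerCase == "jpg")

// Ask exiftool for the original date of each photo, in parallel thanks to .par.
val dates = photos.par.map { photo =>
  val result = os.proc("exiftool", "-s3", "-DateTimeOriginal", photo).call()
  photo -> result.out.text().trim
}.seq

dates.foreach { case (photo, date) => println(s"$photo: $date") }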

Conclusions

Next time you open your editor to write a bash file, think again. Perhaps you should really write some scala instead.

Sunday, October 8, 2023

Dependabot, Gradle and Scala

Due to a series of unfortunate circumstances we have to deal with a couple of projects at work that use Gradle as their build tool. For these projects we wanted automatic PR generation for updated dependencies. Since we use Github Enterprise, using Dependabot seemed logical. However, this turned out not to be very straightforward. This article documents one way that works for us.

As we were experimenting with Dependabot, we discovered the following rules:

  1. The scala version in the artifact name must not be a variable.
  2. A variable for the artifact version is fine, but it must be declared in the same file in the ext block.
  3. Versions should follow the Semver specification.
  4. You must not use Gradle’s + version range syntax anywhere; Maven’s version range syntax is fine.

In our projects the scala version comes from a plugin. In addition, we sometimes need to cross-build for different scala versions, which is very much at odds with rule no. 1. We solved this with a switch statement.

With these rules and constraints we discovered that the following structure works for us and Dependabot:

ext {
    jacksonVersion = '2.15.2'
    scalaTestVersion = '3.0.8'
}

dependencies {
    switch(scalaMainVersion) {
        case "2.12":
            implementation "com.fasterxml.jackson.module:jackson-module-scala_2.12:$jacksonVersion"
            testImplementation "org.scalatest:scalatest_2.12:$scalaTestVersion"
            break
        case "2.13":
            implementation "com.fasterxml.jackson.module:jackson-module-scala_2.13:$jacksonVersion"
            testImplementation "org.scalatest:scalatest_2.13:$scalaTestVersion"
            break
        default:
            break
    }

    // implementation 'com.example:library:0.8+'      // Don't do this
    implementation 'com.example:library:[0.8,1.0['     // This is fine
}

It took 3 people a month to slowly discover this solution (thank you!). I hope that you, dear reader, will spend your time more productively.

Thursday, April 20, 2023

Zio-kafka hacking day

Not long ago I contacted Steven (a committer of the zio-kafka library) to get a better understanding of how the library works. On April 12, not more than 2 months later, I was a committer myself, sitting in a room together with Steven, Jules Ivanic (another committer) and wildcard Pierangelo Cecchetto (a contributor), hacking on zio-kafka.

The meeting was Jules’ idea; he was ‘in the neighborhood’ while traveling from Australia for his company (Conduktor). We were able to get a nice room in the Amsterdam office of my employer (Adevinta). Amsterdam turned out to be a nice middle ground for Steven, me and Pierangelo. (Special thanks to Go Data Driven, who also had room for us.)

In the morning we spoke about current and new ideas for improving the library. We also shared detailed knowledge about ZIO and about what Kafka expects from its users. After lunch we started hacking. Having someone at hand to start an ad hoc discussion with turned out to be very productive; we were able to move some tough issues forward.

Here are some highlights.

PR #788 — Wait for stream end in rebalance listener is important to prevent duplicates during a rebalance. This PR had been mostly finished for quite some time, but many small details made the extensive test suite fail. We were able to solve many of these issues.

In the area of performance we implemented an idea to replace buffering (pre-fetching a fixed number of polls) with pre-fetching based on the stream’s queue size. This resulted in PR #803 — Alternative backpressure mechanism.

We also laid the seeds for another performance improvement implementation: PR #908 — Optimistically resume partitions early.

These last two PRs showed great performance improvements, bringing us much closer to direct usage of the Java Kafka client. All 3 PRs are now in review.

All in all it was a lot of fun to meet fellow enthusiasts and hack on the complex machinery that is inside zio-kafka.

Sunday, January 29, 2023

Kafka is good for transport, not for system boundaries

Over the last few years I have learned that you should not use Kafka as a system boundary. A system boundary, in this article, is the place where messages are passed from one autonomy domain to another.

Now why is that? Let’s look at two classes of problems: connecting to Kafka and the long feedback loop. To prove my points, I am going to bore you with long stories from my personal experience. You may be in a different situation, YMMV!

Problem 1: Connecting to Kafka is hard

Compared to calling an HTTP endpoint, sending messages to Kafka is much, much harder.

Don’t agree? Watch out for observation bias! During our holidays we often have long highway drives through unfamiliar countries. After looking at a highway for several hours non-stop, you might be inclined to believe that the entire country is covered by a dense highway network. In reality though, the next highway might be 200 km away. A similar thing can happen at work. My part of the company offers Kafka as a service. We also run several services that invariably use Kafka in some way. We have deep knowledge and experience. It would be easy to think that Kafka is simple for everyone. However, for the rest of the company this Kafka thing is just another faraway system that they have to integrate with, and their knowledge will be spotty and incomplete.

Let’s look at some of the problems that you have to deal with.

Partitioning is hard

It is easier to deal with partitioning problems when you control both the producer and the broker. We once had a problem where our systems could not keep up with the inflow of Kafka messages from one of the producers. The weird thing was that most of the machines were just idling. The problem grew slowly, so it took us some time before we realized it was caused by a few partitions receiving most of the traffic. Producers of Kafka events do not always realize the effect of poorly chosen key values: when many messages have the same key, they all end up in the same partition. It took some time before we got the message across that they needed to change the message key.
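To illustrate (a minimal sketch with a hypothetical topic and key, not our actual code): with the default partitioner the target partition is derived from a hash of the key, so records that share a key always land in the same partition.

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

def send(producer: KafkaProducer[String, String], customerId: String, payload: String): Unit = {
  // Default partitioner: roughly partition = hash(key) % numberOfPartitions.
  // If most records use the same customerId as key, one partition gets most of the traffic.
  val record = new ProducerRecord[String, String]("uploads", customerId, payload)
  producer.send(record)
}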

When you run an HTTP endpoint, spreading traffic and partitioning are handled by the load balancer and are therefore under the control of the receiver, not the sender.

Cross network connections are hard

Producers and the Kafka brokers need to have the same view of the network. This is because the brokers tell a producer which broker (by DNS name or IP address) it needs to connect to for each partition. This might go wrong when the producers and brokers use a different DNS server, or when they are on networks with colliding IP address ranges. Getting this right is a lot easier when you’re running everything in a single network you control.

This is not a problem with HTTP endpoints. Producers only need one hostname and, optionally, an HTTP proxy.

We haven’t even talked about authentication and encryption yet. Kafka is very flexible; it has many knobs and settings in this area, and the producers have to be configured exactly right or else it just won’t work. And don’t expect good error messages. Good documentation and cooperation are required to make this work across different teams.

With HTTP endpoints, encryption is very well supported through HTTPS. Authentication is straightforward with HTTP’s basic authentication.

Problems that have been solved

Just for completeness here are some problems from around 2019 that have since been solved.

Around 2019 Kafka did not support authentication and TLS out of the box. Crossing untrusted networks was quite cumbersome.

Also around that time you had to be very careful about versioning. The client and server had to be upgraded in a very controlled order. Today this looks much better; you can combine almost any client and server version.

The default partitioner would give slow brokers more, instead of less, work. This was solved a few months ago.

Problem 2: Long feedback loop

When messages are given to you via Kafka, you cannot reject them. They are send-and-forget; the producer no longer cares. Dealing with invalid messages is now your responsibility.

In one of our projects we used to set invalid messages apart and send Slack alerts so that the producers knew they had to look at the validation errors. Unfortunately, it didn’t work well. The feedback loop was simply too long and the number of invalid messages stayed high.

Later we introduced an HTTP endpoint that rejects invalid messages with a 400 response. This simple change was nothing less than a miracle. For every producer that switched, the vast majority of invalid messages disappeared. The number of invalid messages has remained very low since then.

Because we were able to reject invalid messages, the feedback loop shortened and became much more effective.

Conclusions

Kafka within your own autonomy domain can be a great solution for message transport. However, Kafka as a boundary between autonomy domains will hurt.

Footnotes

  1. Though at high enough volume, HTTP is not easy either; you’ll need proper connection pooling and an endpoint that accepts batches or else deploy a huge server park.
  2. Many load balancers offer sticky sessions which is a weak form of partitioning.
  3. We suffered both.
  4. When your authentication settings are wrong, the Kafka command line tools tell you that by showing an OutOfMemoryError. My head still hurts from this one.
  5. Though unfortunately, many architects will make this complex by using oauth or other such systems.
  6. Most invalid messages could be fixed with a few minutes of coding time.

Sunday, December 4, 2022

ZIO service layer pattern

While I was reading about ZIO-config in 2.0.4, the following pattern for creating services caught my eye. I am copying it here for easy lookup. Enjoy.

val myLayer: ZLayer[PaymentRepo, Nothing, MyService] =
  ZLayer.scoped {
    for {
      repo   <- ZIO.service[PaymentRepo]
      config <- ZIO.config(MyServiceImpl.config)
      ref    <- Ref.make(MyState.Initial)
      impl   <- ZIO.succeed(new MyServiceImpl(config, ref, repo))
      _      <- impl.initialize
      _      <- ZIO.addFinalizer(impl.destroy)
    } yield impl
  }

Saturday, November 5, 2022

Speed up ZIOs with memoization

TLDR: You can do ZIO memoization in just a few lines; however, use zio-cache for more complex use cases.

Recently I was working on fetching Avro schemas from a schema registry. Avro schemas are immutable and therefore perfectly cacheable. Also, the number of possible schemas is limited, so cache evictions are not needed. We can simply cache every schema forever in a plain hash map. So, we are doing memoization.

Since this was the first time I did this in a ZIO-based application, I looked around for existing solutions. What I wanted was something like this:

def fetchSchema(schemaId: String): Task[Schema] = {
  val fetchFromRegistry: Task[Schema] = ???
  fetchFromRegistry.memoizeBy(key = schemaId)
}

Frankly, I was a bit disappointed that ZIO does not already support this out of the box. However, as you'll see in this article, the proposed syntax only works for simple use cases. (Actually, there is ZIO.memoize but that is even simpler and only caches the result for a single ZIO instance, not for any instance that gives the same value.)

Let's continue anyway and implement it ourselves.

The idea is that the method memoizeBy first looks in a map using the given key. If the value is not present, we get the result from the original ZIO and store it in the map. If the value is present, it is used and the original ZIO is not executed.

A map, yes, we also need to give the method a map! The map might be read and updated concurrently. I chose to wrap an immutable map in a Ref, but you could also use a ConcurrentMap.

Here we go:

import zio._
import scala.collection.immutable.Map

implicit class ZioMemoizeBy[R, E, A](zio: ZIO[R, E, A]) {
  def memoizeBy[B](cacheRef: Ref[Map[B, A]], key: B): ZIO[R, E, A] = {
    for {
      cache <- cacheRef.get
      value <- cache.get(key) match {
                 case Some(value) => ZIO.succeed(value)
                 case None        => zio.tap(value => cacheRef.update(_.updated(key, value)))
               }
    } yield value
  }
}

That is it, just a few lines of code to put in some corner of your application.

Here is a full example with memoizeBy using Service Pattern 2.0:

import org.apache.avro.Schema
import zio._
import scala.collection.immutable.Map

trait SchemaFetcher {
  def fetchSchema(schemaId: String): Task[Schema]
}

object SchemaFetcherLive {
  val layer: ZLayer[Any, Throwable, SchemaFetcher] =
    ZLayer {
      for {
        // Create the Ref and the Map:
        schemaCacheRef <- Ref.make(Map.empty[String, Schema])
      } yield SchemaFetcherLive(schemaCacheRef)
    }
}

case class SchemaFetcherLive(
  schemaCache: Ref[Map[String, Schema]]
) extends SchemaFetcher {
  def fetchSchema(schemaId: String): Task[Schema] = {
    val fetchFromRegistry: Task[Schema] = ???
    // Use memoizeBy to make fetchFromRegistry more efficient!
    fetchFromRegistry.memoizeBy(schemaCache, schemaId)
  }
}

Discussion

Note how we're using the default immutable Map. Because it is immutable, all threads can read from the map at the same time without synchronization. We only need some synchronization using Ref, to atomically replace the map after a new element was added.

When multiple requests for the same key come in at roughly the same time, both are executed, and both lead to an update of the map. This is not as advanced as e.g. zio-cache, which detects multiple simultaneous requests for the same key. In the presented use case this is not a problem and very unlikely to happen often anyway.

Can we improve further? Yes, we can! If you look at the method fetchSchema in the example, you see that a fetchFromRegistry ZIO is constructed, but it is not used when the value is already present. Even worse, the value already being present is the common case! This is not very efficient. If efficiency is a problem, another API is needed. Zio-cache does not have this problem: in zio-cache the cache is aware of how to look up new values (it is a loading cache). So here is a trade-off: efficiency with a more complex API, or highly readable code.

Using zio-cache

For completeness, here is (almost) the same example using zio-cache:

import org.apache.avro.Schema
import zio._
import zio.cache.{Cache, Lookup}

trait SchemaFetcher {
  def fetchSchema(schemaId: String): Task[Schema]
}

object ZioCacheSchemaFetcherLive {
  val layer: ZLayer[SomeService, Throwable, SchemaFetcher] =
    ZLayer {
      for {
        someService <- ZIO.service[SomeService]
        // the fetching logic can use someService:
        fetchFromRegistry: String => Task[Schema] = ???
        // create the cache:
        cache <- Cache.make(
                   capacity = 1000,
                   timeToLive = Duration.Infinity,
                   lookup = Lookup(fetchFromRegistry)
                 )
      } yield ZioCacheSchemaFetcherLive(cache)
    }
}

case class ZioCacheSchemaFetcherLive(
  cache: Cache[String, Throwable, Schema]
) extends SchemaFetcher {
  def fetchSchema(schemaId: String): Task[Schema] = {
    // use the loader cache:
    cache.get(schemaId)
  }
}

We now need a reference to fetchFromRegistry while constructing the layer. This complicates the code a bit; we can no longer define fetchFromRegistry in the case class. In the example we pull in a SomeService so that we can put the definition of fetchFromRegistry into the for comprehension and stick to Service Pattern 2.0. Perhaps we should move it completely to another service so that we can write lookup = Lookup(someService.fetchFromRegistry). That I'll leave as an exercise for the reader.
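For completeness, here is a hypothetical sketch of how such a layer could be wired and used; someServiceLayer is an assumed layer providing SomeService, it is not defined above.

// Application code only depends on the SchemaFetcher interface:
val program: ZIO[SchemaFetcher, Throwable, Schema] =
  ZIO.serviceWithZIO[SchemaFetcher](_.fetchSchema("some-schema-id"))

// Wiring, assuming someServiceLayer: ZLayer[Any, Throwable, SomeService]:
// program.provide(ZioCacheSchemaFetcherLive.layer, someServiceLayer)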

Conclusion

For simple use cases like fetching Avro schemas, this article presents an appropriately lightweight way to do memoization. If you need more features, such as eviction and detection of concurrent invocations, I recommend zio-cache.

Update 2024-01-24

Here is a version of cachedBy that only fetches a value once, even when two fibers request it concurrently. The second fiber is semantically blocked until the first fiber has produced the value.

import zio._

object ZioCaching {
  implicit class ZioCachedBy[R, E, A](zio: ZIO[R, E, A]) {
    def cachedBy[B](cacheRef: Ref[Map[B, Promise[E, A]]], key: B): ZIO[R, E, A] = {
      for {
        newPromise <- Promise.make[E, A]
        actualPromise <- cacheRef.modify { cache =>
                           cache.get(key) match {
                             case Some(existingPromise) => (existingPromise, cache)
                             case None                  => (newPromise, cache + (key -> newPromise))
                           }
                         }
        _ <- ZIO.when(actualPromise eq newPromise) {
               zio.intoPromise(newPromise)
             }
        value <- actualPromise.await
      } yield value
    }
  }
}
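As a small usage sketch (fetchFromRegistry and the String-typed schema are placeholders, in the spirit of the earlier examples): create a Ref holding a map of promises and route every lookup through cachedBy.

import zio._
import ZioCaching._

def fetchFromRegistry(schemaId: String): Task[String] = ???

val program: Task[(String, String)] =
  for {
    cacheRef <- Ref.make(Map.empty[String, Promise[Throwable, String]])
    // Both calls use the same key: the registry is only hit once,
    // the second call simply awaits (or reuses) the promise created by the first.
    a <- fetchFromRegistry("schema-1").cachedBy(cacheRef, "schema-1")
    b <- fetchFromRegistry("schema-1").cachedBy(cacheRef, "schema-1")
  } yield (a, b)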