Monday, December 21, 2015

scala, existential type and the old precog blog post

There was a nice scala article on the precog blog before precog was purchased. The blog seems to no longer be available.
I recently had a need to compose some REST services and needed a configuration object that represented the configuration properties for each REST layer. I used the precog existential type approach for this. Since the blog is no longer available, I thought I would write down what I did and offer it up as a replacement for the original precog blog entry.
The approach is nothing more than what was described a while ago in a paper on the expression problem here.
The basic idea is that you can use an existential type to define the configuration parameters needed for different layers in your "REST" service. While the traditional cake solutions may restrict your flexibility, a small variation may be helpful in certain circumstances. While I won't claim that all situations call for the approach that precog used in the original blog, I am reproducing it, to some degree, here.

The Problem

The basic problem to solve is how to create a configuration object that is the union of all configuration information needed for each component in your application. While you could, and probably should, use typesafe's config library to read configuration information from flat files deployed with your application, you want a strongly typed representation of the configuration information to use in your layers.
For example, if you have two layers with configuration parameters:
trait Layer1 {
  val l1param1: String
  val l1param2: Int
  ...
}
trait Layer2 {
  val l2param1: String
  val l2param2: String
  ...
}
you could envision creating a config object like
trait Layer1 {
  class Layer1Config(l1param1: String, l1param2: Int)
  val l1config: Layer1Config
  ...
}
trait Layer2 { 
  class Layer2Config(l2param1: String, l2param2: String)
  val l2config: Layer2Config
  ...
}
and so on to create your configuration objects. If you now mix these together into a REST service trait
trait REST extends Layer1 with Layer2 {
  val l1config: ...
  val l2config: ...
}
The fundamental problem here is the "expression" problem: how do you extend the configuration information in the dimension of data variants as well as new processors? The precog blog, I seem to recall, was thinking about data extensibility. The other aspect of the "expression" problem is how to extend these things independently of other modules. At some point, for real-world sized problems, the approach above (using separate config objects all scattered about) will not work, especially as the traits start becoming mixed together. While this may not be apparent in this simple "configuration" example, it's a real problem in software.

The Adaptable Type

The story goes something like this. Suppose you want each of the components to have a configuration object.
trait Configuration { 
  type Config
  def config: Config
}
Now, we want each layer to be forced to create a configuration object
trait Layer1 extends Configuration {
  type Config <: Layer1Config

  trait Layer1Config { 
     val l1param1: String
     val l1param2: Int
   }
}

trait Layer2 extends Configuration { 
   type Config <: Layer2Config

  trait Layer2Config { 
     val l2param1: String
     val l2param2: String
  }
}
Now we want to combine these layers into a REST client that can be instantiated (hence we make it a class):
class REST extends Layer1 with Layer2 with (...more layers here...) {
   type Config = config.type

  object config extends Layer1Config with Layer2Config {
     val l1param1: String = ...
     val l1param2: Int = ...
     val l2param1: String = ...
     val l2param2: String = ...
   }
}
By defining an object in the REST layer that integrates the configuration information from the upper layers, we can finally declare, at the end of the world, the type for the Config object. A path dependent type is used: since config is an object, its type is available to declare the Config type required by the Configuration trait.
Essentially, each layer does not know about the other layer's configuration information but we can combine them together at exactly the point we need the configuration centralized into one place.
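Putting it all together, here is a complete, compiling sketch of the pattern; the parameter values are made up purely for illustration:

```scala
trait Configuration {
  type Config
  def config: Config
}

trait Layer1 extends Configuration {
  type Config <: Layer1Config
  trait Layer1Config {
    val l1param1: String
    val l1param2: Int
  }
}

trait Layer2 extends Configuration {
  type Config <: Layer2Config
  trait Layer2Config {
    val l2param1: String
    val l2param2: String
  }
}

// The end of the world: one object satisfies every layer's Config bound,
// and its singleton type fills in the abstract type member.
class REST extends Layer1 with Layer2 {
  type Config = config.type
  object config extends Layer1Config with Layer2Config {
    val l1param1: String = "http://localhost" // made-up values
    val l1param2: Int = 8080
    val l2param1: String = "user"
    val l2param2: String = "secret"
  }
}
```

Instantiating `new REST().config` then yields a single object carrying all four parameters, with each layer still seeing only its own slice.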
You may scratch your head because it may seem like it's just easier to have each configuration object declared in the REST layer, and in some ways you are right for some problems. It's also easy to see that if you use subclasses of a layer, such as Layer3 extends Layer2, then standard OO techniques for extending the Config object can be used. In other words, we can extend downward through a subclass, which is fairly common, and we can extend by combining together independent extensions, which is less common and harder to do in some languages.
That's it! I used this recently, but could not find the precog blog anymore so wrote this one. I use typesafe's config library to grab some of these parameters from configuration files but I also sometimes just type them in since recompilation is easy and meets the needs for some deployments.
There are other ways to formulate your problem and other techniques but this one comes in handy once in awhile in a variety of areas.

Sunday, November 15, 2015

Mathematica 10 geo mapping

I needed to plot some data based on the state code that was contained in my data. But to use GeoRegionValuePlot I needed to have a mapping between my state abbreviation and the full state name.
To use GeoRegionValuePlot you need data in the form:
GeoRegionValuePlot[{Entity["AdministrativeDivision", {"Texas", "UnitedStates"}] -> 10, 
  Entity["AdministrativeDivision", {"Virginia", "UnitedStates"}] -> 20}]
You can see this form of input if you take one of the GeoRegionValuePlot examples and convert the cell to InputForm so it is not displayed in StandardForm. InputForm gives you the textual form of what is displayed in the input cell as if you typed it in yourself.
Hence, given a set of data like:
yourdata = <| "TX"->10, "VA"->20|>
you can use the following to generate the mapping:
(* Create a map from state abbreviations to state full names *)
abbrevmapping = 
 Association @@ ((AdministrativeDivisionData[{#, "UnitedStates"}, 
        "StateAbbreviation"] -> #) & /@ 
    CountryData[Entity["Country", "UnitedStates"], "Regions"])
And then your plotting is as easy as:
counts = 
  KeyMap[Entity["AdministrativeDivision", {abbrevmapping[#], "UnitedStates"}] &, yourdata] // Normal;
GeoRegionValuePlot[counts]       

Thursday, October 8, 2015

Creating your microservices REST endpoint

Microservices are a frequent topic these days because they promise to help reduce the impact of complex problems in a services oriented architecture.

If we skip through the whole story around architecture, devops and what's good for the world, we often arrive at a few architecture and design decisions that need to be thought through to implement services, and in particular, microservices.

Operation Instrumentation

One of these areas is operational instrumentation, monitoring and management. Microservices need to be instrumented with counters and other measurement methods to allow devops to monitor the microservice. Counting how many REST calls were made, how long they took and other such measurements is also needed to help implement throttling and SLAs, if those are important to your microservice. Common libraries for this include:

  • metrics
  • finagle (includes metrics but is also a library with reactive, service development support)
  • qbit (includes metrics but like finagle, is also a library with reactive, service development support)


and various ways to aggregate microservice metrics, such as a statsd server. You can then visualize the processing via Graphite or Ganglia. Newer companies, such as AppDynamics, are pushing to be a SaaS or on-prem solution in this space.

Interestingly enough, the instrumentation concept is very similar to that of trying to measure mobile device activities. For example, in order to obtain front end UI application insight into usage behavior, you instrument your javascript so that when a page is loaded or a view changed, a counter is incremented. In many cases, that counter is sent to a receiving server and put into a nice dashboard. Adobe (analytics), Google (analytics) and Microsoft have mobile management software stacks and capabilities to do exactly this. Overall though, it's really quite basic functionality.
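To make the counter idea concrete, here is a minimal sketch of a call counter/timer in plain Scala, not tied to any of the metrics libraries above (all names are made up):

```scala
import java.util.concurrent.atomic.LongAdder

// Thread-safe call counter and timer; a real metrics library layers
// histograms, rates and reporting on top of exactly this idea.
class CallStats {
  private val calls = new LongAdder
  private val totalNanos = new LongAdder

  // Run a block, counting the call and accumulating its elapsed time.
  def time[A](body: => A): A = {
    val start = System.nanoTime()
    try body
    finally {
      calls.increment()
      totalNanos.add(System.nanoTime() - start)
    }
  }

  def count: Long = calls.sum()
  def meanMillis: Double =
    if (count == 0) 0.0 else totalNanos.sum() / 1e6 / count
}
```

Wrapping each REST handler in `stats.time { ... }` is enough to feed throttling or SLA checks.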

Endpoint

The other area that needs to be thought through is your REST (or SOAP) endpoint. Today, with lambda architectures being the rage, you need endpoints that can plug into a variety of different backends to support batch and real-time processing.

Choices for the endpoint are varied and include:

  • akka rest & streams
  • spray.io (which is really built on akka)
  • finagle
  • play
  • finch (on top of finagle)
  • remotely (see one of the scala talks)
  • scalaz-streams
  • qbit
  • netty, Grizzly and a bunch of other lower-level plumbing libraries
  • ...anything else that can mimic an HTTP server...

You really only need support to create an HTTP server and process HTTP requests. In some cases the microservice will directly access a database or create a message to send to a Kafka queue or some other clever messaging architecture.
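As an illustration that nothing exotic is required, even the JDK's built-in com.sun.net.httpserver is enough to stand up an endpoint; the /health route and response body here are made up:

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

// Hypothetical health-check endpoint using only the JDK's built-in HTTP server.
// Port 0 asks the OS for any free port.
val server = HttpServer.create(new InetSocketAddress(0), 0)
server.createContext("/health", new HttpHandler {
  override def handle(ex: HttpExchange): Unit = {
    val body = """{"status":"ok"}""".getBytes("UTF-8")
    ex.sendResponseHeaders(200, body.length.toLong)
    ex.getResponseBody.write(body)
    ex.close()
  }
})
server.start()
val port = server.getAddress.getPort
```

The libraries listed above add routing, backpressure and instrumentation on top, but this is the irreducible core.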

The entry point into your environment may be an API gateway, similar to the one that Amazon (AWS) provides, or a load balancer system that has some routing smarts, or even a combination of the two for large enterprises (a load balancer/proxy that proxies out to an API gateway). Note that a load balancer may need access to operational metrics to do a good job.

Load balancers and API gateways include:
  • HAProxy (general purpose load balancer)
  • tyk (API gateway)
  • strongloop (API gateway)
  • LoopBack (API gateway built on node.js)
  • nginx (can do both but is not full featured for everything)
  • apache httpd (see nginx's note)
  • openrepose (throttling, monitoring)
  • Amazon's API Gateway & Microsoft's Azure service of the same (managed service)
  • ...custom...you can take an endpoint solution and build a simple load balancer/api gateway...(custom)
There is a big list of software packages, mostly open source, on the microservices list on GitHub.

Note: My software package examples are a bit scala and JVM biased. As you would expect, the Microsoft Azure cloud PaaS also has equivalent functionality.

Sunday, September 27, 2015

spark 1.5, scala 2.11, data import with csv and converting strings to java.sql.Timestamp

When importing data from CSV files in spark, you may need to convert the textual data to a different type.
I had an especially hard time with date time values. The spark SQL module supports java.sql.Timestamp. So the question becomes, how does one create a java.sql.Timestamp from a string given the need to specify its format explicitly?
The spark-datetime package uses joda time and a special serializable SparkDateTimeUDT (composed of a long millis and a TZ) to make date time serializable. That is good but spark-datetime only supports 2.10 out of the box and I wanted to use 2.11--I had compiled spark 1.5 to support scala 2.11.
In Java 1.8, the java.time package matches most of joda time, but DateTimeFormatter is not serializable, so when you try to define a date time formatter for parsing:
val f = DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss")
val toDT = udf { (v: String) => java.sql.Timestamp.valueOf(LocalDateTime.parse(v, f)) }
you get the dreaded "not serializable" error when you try to use it in a sql query via sqlContext.sql("...").
However, the real issue is that the DateTimeFormatter is not serializable and it is captured from outside the closure that is created and shipped off to the executors running on the other nodes. When you build up the code that runs your logic, everything it captures needs to be serializable so that the spark infrastructure can get the code to the nodes for you.
The answer is quite easy though you may sacrifice performance to get it to work. Here's what you can define that can be serialized:
sqlContext.udf.register("toTimestamp", (v: String) =>
  java.sql.Timestamp.valueOf(
    LocalDateTime.parse(v, DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss"))))
The DateTimeFormatter is still used, but now it is constructed inside the closure. We can check that it works with a quick test:
scala> sqlContext.sql("""select toTimestamp("01-Jan-2015 00:00:00")""").show()
+--------------------+
|                 _c0|
+--------------------+
|2015-01-01 00:00:...|
+--------------------+


scala> sqlContext.sql("""select toTimestamp("01-Jan-2015 00:00:00")""").printSchema()
root
 |-- _c0: timestamp (nullable = true)
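A variant of the same idea, if you want to avoid re-creating the formatter on every call, is to park it in a serializable holder behind a @transient lazy val, so it is rebuilt once per executor JVM rather than shipped with the closure. A sketch (the object name is made up):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Serializable holder: the @transient lazy val is excluded from serialization;
// each JVM rebuilds the formatter once on first use.
object ToTimestamp extends Serializable {
  @transient lazy val fmt = DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss")
  def apply(v: String): java.sql.Timestamp =
    java.sql.Timestamp.valueOf(LocalDateTime.parse(v, fmt))
}
```

Registering it would then look like `sqlContext.udf.register("toTimestamp", ToTimestamp.apply _)`.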

Friday, April 24, 2015

scala PrettyPrinter

I typically need to output some descriptive information along with my analytical programs. I use a simple scala pretty printer for most of it. It's on github.

Tuesday, March 24, 2015

coloring graph vertices in Mathematica using VertexLabelStyle

The graph capabilities of Mathematica seem to use a slightly different programming model than the rest of Mathematica. I was loading a graphml file from some information I had and I wanted to label the vertices with small blue text. I wanted to use VertexLabelStyle but I could not figure out how to use it based on the documentation and searching.

Here's how you can use it:

g = Import["mygraph.graphml"];
g = SetProperty[g, {VertexLabels->Placed["Name", {StatusArea, Above}], VertexLabelStyle -> Directive[Blue, FontSize->8]}]

That's about it. The key is to recognize that VertexLabels and VertexLabelStyle are both properties and options. They are properties for the graph, and options for other elements such as vertices. This took me a long time to figure out because of the "property" nature of manipulating graphs. I think the property model is used based on the historical development of the graph manipulation features. A Graph is just a data structure (which in Mathematica, just means a head symbol) with contents that have specific meanings. For example, the first value in the Graph "structure" is the vertices, the second is the edges, etc. You have to use the property model to set various aspects of the "data structure".

Sunday, March 22, 2015

typesafe slick and config for data processing applications

I use slick 3.0.0 in my applications. A big improvement in 3.0 is the introduction of database actions that allow you to decouple the creation of a "query" from running it using an API that allows you to more easily choose async/sync patterns. It also includes a streaming interface in order to bound memory usage.
My applications are often a bit smaller than a large enterprise application. I typically load data into a database and then pull it back out for analysis. Slick makes this easy. Here's how I typically setup my application.
I create layers for the database module. The layers include:
  • a basic config for the profile and database
  • a schema layer
  • a queries layer
  • other layers that are application design/architecture specific e.g. integrating db calls into a stream based processing library
Here's what they look like for a hypothetical application that needs to work with application "events":
trait HasDBConfig[P <: JdbcProfile] {
  val config: slick.backend.DatabaseConfig[P]
}

trait AppSchemas[P <: JdbcProfile] {
  self: HasDBConfig[P] =>

  def entityName(name: String) = name.toUpperCase()
  type ID = Long
  import config.driver.api._

  implicit val MyTimestampTypeMapper =
    MappedColumnType.base[LocalDateTime, java.sql.Timestamp](
      java.sql.Timestamp.valueOf(_),
      _.toLocalDateTime)

  class AppEvents(tag: Tag) extends Table[AppEvent](tag, entityName("AppEvents")) {
    def id = column[ID]("ID", O.AutoInc, O.PrimaryKey)
    def eventId = column[Int]("EVENTID")
    def message = column[Option[String]]("MESSAGE")

    def * = (id.?, eventId, message) <> (AppEvent.tupled, AppEvent.unapply)
  }
}

trait LogQueries[P <: JdbcProfile] {
  self: AppSchemas[P] with HasDBConfig[P] =>

  import slick.dbio.DBIO
  import config.driver.api._
  import config.driver.DriverAction

  lazy val Events = TableQuery[AppEvents]

  def delete(events: Seq[AppEvent]) =
    Events.filter(e => e.id inSet events.map(_.id).flatten[ID].toSet).delete

....more queries here....
}

case class DataModule[P <: JdbcProfile](val config: DatabaseConfig[P]) extends AppSchemas[P] with LogQueries[P] with HasDBConfig[P]
I use a case class for DataModule to make instantiation easy and make sure that any args that are needed in the traits are available immediately. I also used self types above but you can use inheritance.
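The MappedColumnType above relies on LocalDateTime and java.sql.Timestamp round-tripping without loss; a quick standalone check of the two conversion functions it uses:

```scala
import java.sql.Timestamp
import java.time.LocalDateTime

// The same pair of functions handed to MappedColumnType.base above.
val toDb: LocalDateTime => Timestamp = Timestamp.valueOf(_)
val fromDb: Timestamp => LocalDateTime = _.toLocalDateTime

val ldt = LocalDateTime.of(2015, 3, 22, 10, 30, 0)
assert(fromDb(toDb(ldt)) == ldt) // lossless round trip
```

Both types carry nanosecond precision, so nothing is dropped in either direction.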
My apps have a fixed number of databases they can use. I'll usually define a class that has a val with the proper data module:
case class DatabaseModules(val configsource: Config = ConfigFactory.load()) {

  if (!configsource.hasPath("driver"))
    throw new IllegalArgumentException("Invalid configuration. No driver property specified. Use JVM option -Dconfig.trace=loads to view config settings")

  val DatabaseAccess = configsource.getString("driver") match {
    case "slick.driver.H2Driver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.H2Driver]("", configsource))
    case "slick.driver.MySQLDriver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.MySQLDriver]("", configsource))
    case "slick.driver.PostgresDriver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.PostgresDriver]("", configsource))
    case x@_ =>
      throw new IllegalArgumentException(s"Invalid configuration. Unknown driver: $x. Use JVM -Dconfig.trace=loads to view config settings.")
  }
}
I do not need fancy user messages for the app so I'll just take the exception. I believe an exception is appropriate here because if I returned an Option or Try I would have to make the rest of my application test for it. Since I do not want my application to continue if there is a problem with the configuration, this works for me.
The DatabaseModules class takes a config. My config looks like:
driver = "slick.driver.H2Driver$"
db {
  url = "jdbc:h2:tcp://localhost/~/dbs/appeventsdb;MULTI_THREADED=1"
  password = sa
  user = sa
  connectionTimeout=2000
}
In my application, you can use the data module:
    val config = ConfigFactory.parseFile(Paths.get("appevents.conf").toFile).withFallback(ConfigFactory.load())
    val DataModule = DatabaseModules(config).DatabaseAccess
    val moreDbStuff = SomeDBClassWithStuffInIt(DataModule)

    import DataModule._
    import DataModule.config.driver.api._
    import DataModule.config.db
    ...
    import moreDbStuff._
That's about it. You could easily extend this to allow prod/dev/test type configurations. Note that there are some issues in slick 3.0.0 RC1 that require the db sub-object for the moment.