Monday, December 21, 2015

scala, existential type and the old precog blog post

There was a nice scala article on the precog blog before precog was purchased. The blog seems to no longer be available.
I recently had a need to compose some REST services and needed a configuration object that represented the configuration properties for each REST layer. I used the precog existential type approach for this. Since the blog is no longer available, I thought I would write down what I did and offer it up as a replacement for the original precog blog entry.
The approach is nothing more than what was described in a paper on the expression problem a while ago here.
The basic idea is that you could use an existential type to define the configuration parameters needed for different layers in your "REST" service. While the traditional cake solutions may restrict your flexibility, a small variation may be helpful in certain circumstances. While I won't claim that all situations call for the approach that precog used in the original blog, I am reproducing it, to some degree, here.

The Problem

The basic problem to solve is how to create a configuration object that is the union of all configuration information needed for each component in your application. While you could, and probably should, use typesafe's config library to read configuration information from flat files deployed with your application, you want a strongly typed representation of the configuration information to use in your layers.
For example if you have two layers with configuration parameters:
trait Layer1 {
  val l1param1: String
  val l1param2: Int
  ...
}
trait Layer2 {
  val l2param1: String
  val l2param2: String
  ...
}
you could envision creating a config object like
trait Layer1 {
  class Layer1Config(l1param1: String, l1param2: Int)
  val l1config: Layer1Config
  ...
}
trait Layer2 { 
  class Layer2Config(l2param1: String, l2param2: String)
  val l2config: Layer2Config
  ...
}
and so on to create your configuration objects. If you now mix these together into a REST service trait
trait REST extends Layer1 with Layer2 {
  val l1config: ...
  val l2config: ...
}
The fundamental problem is expressed as the "expression" problem: how do you extend the configuration information in the dimension of data variants as well as new processors? The precog blog, I seem to recall, was thinking about data extensibility. The other aspect of the "expression" problem is how to extend these things independently of other modules. At some point, for real-world sized problems, the approach above (using separate config objects all scattered about) will not work, especially as the traits start becoming mixed together. While this may not be apparent in this simple "configuration" example, it's a real problem in software.

The Adaptable Type

The story goes something like this. Suppose you want each of the components to have a configuration object.
trait Configuration { 
  type Config
  def config: Config
}
Now, we want each layer to be forced to create a configuration object
trait Layer1 extends Configuration {
  type Config <: Layer1Config

  trait Layer1Config { 
     val l1param1: String
     val l1param2: Int
   }
}

trait Layer2 extends Configuration { 
   type Config <: Layer2Config

  trait Layer2Config { 
     val l2param1: String
     val l2param2: String
  }
}
Now we want to combine these layers into a REST client that can be instantiated (hence we make it a class):
class REST extends Layer1 with Layer2 /* ...more layers here... */ {
   type Config = config.type

  object config extends Layer1Config with Layer2Config {
     val l1param1: String = ...
     val l1param2: Int = ...
     val l2param1: String = ...
     val l2param2: String = ...
   }
}
By defining an object in the REST layer that integrates the configuration information from the upper layers, we can finally declare, at the end of the world, the type for the Config object. A path-dependent type is used. Since config is an object, its type is available to fill in the abstract type Config required by the Configuration trait.
Essentially, each layer does not know about the other layers' configuration information, but we can combine them together at exactly the point we need the configuration centralized into one place.
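Putting the pieces together, here is a complete, compilable sketch of the pattern (the parameter values are hypothetical placeholders):

```scala
// Self-contained sketch of the full pattern; values are hypothetical.
trait Configuration {
  type Config
  def config: Config
}

trait Layer1 extends Configuration {
  type Config <: Layer1Config
  trait Layer1Config {
    val l1param1: String
    val l1param2: Int
  }
}

trait Layer2 extends Configuration {
  type Config <: Layer2Config
  trait Layer2Config {
    val l2param1: String
    val l2param2: String
  }
}

class REST extends Layer1 with Layer2 {
  // config.type is a subtype of both Layer1Config and Layer2Config,
  // so it satisfies both inherited upper bounds on Config at once.
  type Config = config.type

  object config extends Layer1Config with Layer2Config {
    val l1param1: String = "localhost"
    val l1param2: Int = 8080
    val l2param1: String = "user"
    val l2param2: String = "secret"
  }
}
```

Code using `new REST().config` sees a single object typed with every layer's parameters, while each layer trait compiles on its own.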
You may scratch your head because it may seem like it's just easier to have each configuration object declared in the REST layer, and in some ways you are right for some problems. It's also easy to see that if you use subclasses of a layer, such as Layer3 extends Layer2, then standard OO techniques for extending the Config object can be used. In other words, we can extend downward through a subclass, which is fairly common, and we can extend by combining together independent extensions, which is less common and harder to do in some languages.
That's it! I used this recently, but could not find the precog blog anymore, so wrote this one. I use typesafe's config library to grab some of these parameters from configuration files, but I also sometimes just type them in since recompilation is easy and meets the needs for some deployments.
There are other ways to formulate your problem and other techniques, but this one comes in handy once in a while in a variety of areas.

Sunday, November 15, 2015

Mathematica 10 geo mapping

I needed to plot some data based on the state code that was contained in my data. But to use GeoRegionValuePlot I need to have a mapping between my state abbreviation and the full state name.
To use GeoRegionValuePlot you need data in the form:
GeoRegionValuePlot[{Entity["AdministrativeDivision", {"Texas", "UnitedStates"}] -> 10, 
  Entity["AdministrativeDivision", {"Virginia", "UnitedStates"}] -> 20}]
You can see this form of input if you take one of the GeoRegionValuePlot examples and convert the cell to InputForm so it's not displaying in StandardForm. InputForm gives you the textual form of what is displayed in the input cell as if you typed it in yourself.
Hence, given a set of data like:
yourdata = <| "TX"->10, "VA"->20|>
you can use the following to generate the mapping:
(* Create a map from state abbreviations to state full names *)
abbrevmapping = 
 Association @@ ((AdministrativeDivisionData[{#, "UnitedStates"}, 
        "StateAbbreviation"] -> #) & /@ 
    CountryData[Entity["Country", "UnitedStates"], "Regions"])
And then your plotting is as easy as:
counts = 
  KeyMap[Entity["AdministrativeDivision", {abbrevmapping[#], "UnitedStates"}] &, yourdata] // Normal;
GeoRegionValuePlot[counts]       

Thursday, October 8, 2015

Creating your microservices REST endpoint

Microservices are a frequent topic these days because they promise to help reduce the impact of complex problems in a services oriented architecture.

If we skip through the whole story around architecture, devops and what's good for the world, we often arrive at a few architecture and design decisions that need to be thought through to implement services, and in particular, microservices.

Operation Instrumentation

One of the areas is operational instrumentation, monitoring and management. Microservices need to be instrumented with counters and other measurement methods to allow devops to monitor the microservice. Counting REST calls, measuring how long they took and other such measurements are also needed to help implement throttling and SLAs, if those are important to your microservice. Common libraries for this include:

  • metrics
  • finagle (includes metrics but is also a library with reactive, service development support)
  • qbit (includes metrics but like finagle, is also a library with reactive, service development support)


and various ways to aggregate microservice metrics, such as a statsd server. You can then visualize the processing via graphite or ganglia. New companies are pushing to be a SaaS or on-prem solution, such as AppDynamics.

Interestingly enough, the instrumentation concept is very similar to that of trying to measure mobile device activities. For example, in order to obtain front end UI application insight into usage behavior, you instrument your javascript so that when a page is loaded or a view is changed, a counter is incremented. In many cases, that counter is sent to a receiving server and put into a nice dashboard. Adobe (analytics), google (analytics) and Microsoft have mobile management software stacks and capabilities to do exactly this. Overall though, it's really quite basic functionality.

Endpoint

The other area that needs to be thought through is your REST (or SOAP) endpoint. Today, with lambda architectures being the rage, you need endpoints that can plug into a variety of different backends to support batch and real-time processing.

Choices for the endpoint are varied and include:

  • akka rest & streams
  • spray.io (which is really built on akka)
  • finagle
  • play
  • finch (on top of finagle)
  • remotely (see one of the scala talks)
  • scalaz-streams
  • qbit
  • netty, Grizzly and a bunch of other lower-level plumbing libraries
  • ...anything else that can mimic an HTTP server...

You really only need the ability to create an HTTP server and process HTTP requests. In some cases the microservice will directly access a database or send a message to a kafka queue or some other clever messaging architecture.

The entry point into your environment may be an API gateway, similar to the one that Amazon (AWS) provides or a load balancer system that has some routing smarts or even a combination of the two for large enterprises (a load balancer/proxy that proxies out to an API gateway). Note that a load balancer may need access to operation metrics to do a good job. 

Load balancers and API gateways include:
  • HAProxy (general purpose load balancer)
  • tyk (API gateway)
  • strongloop (API gateway)
  • LoopBack (API gateway built on node.js)
  • nginx (can do both but is not full featured for everything)
  • apache httpd (see nginx's note)
  • openrepose (throttling, monitoring)
  • Amazon's API Gateway & Microsoft's Azure equivalent (managed services)
  • ...custom...you can take an endpoint solution and build a simple load balancer/api gateway...(custom)
There is a big list of software packages, mostly open source, on the microservices list on github.

Note: My software package examples are a bit scala and JVM biased. As you would expect, the Microsoft Azure cloud PaaS also has equivalent functionality.

Sunday, September 27, 2015

spark 1.5, scala 2.11, data import with csv and converting strings to java.sql.Timestamp

When importing data from CSV files in spark, you may need to convert the textual data to a different type.
I had an especially hard time with date time values. The spark SQL module supports java.sql.Timestamp. So the question becomes, how does one create a java.sql.Timestamp from a string given the need to specify its format explicitly?
The spark-datetime package uses joda time and a special serializable SparkDateTimeUDT (composed of a long millis and a TZ) to make date time serializable. That is good but spark-datetime only supports 2.10 out of the box and I wanted to use 2.11--I had compiled spark 1.5 to support scala 2.11.
In java 1.8, the java.time package matches most of joda time, but DateTimeFormatter is not serializable so when you try to define a date time formatter for parsing:
val f = DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss")
val toDT = udf { (v: String) => java.sql.Timestamp.valueOf(LocalDateTime.parse(v, f)) }
you get the dreaded "not serializable" error when you try to use it in a sql query via sqlContext.sql("...").
However, the real issue is that the DateTimeFormatter is defined outside the closure that is created and shipped off to the executors running on the other nodes, so it gets captured and must itself be serialized. When you build up the code that runs your logic, everything it captures needs to be serializable so that the spark infrastructure can ship the code to the nodes for you.
The answer is quite easy though you may sacrifice performance to get it to work. Here's what you can define that can be serialized:
sqlContext.udf.register("toTimestamp", (v:String) => {java.sql.Timestamp.valueOf(LocalDateTime.parse(v, DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss"))) })
The DateTimeFormatter is still used but it is used in the closure. We can test that it works by doing a quick test:
scala> sqlContext.sql("""select toTimestamp("01-Jan-2015 00:00:00")""").show()
+--------------------+
|                 _c0|
+--------------------+
|2015-01-01 00:00:...|
+--------------------+


scala> sqlContext.sql("""select toTimestamp("01-Jan-2015 00:00:00")""").printSchema()
root
 |-- _c0: timestamp (nullable = true)
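If rebuilding the formatter on every call bothers you, an alternative is a small serializable wrapper that carries only the pattern string and rebuilds the formatter lazily on each JVM. This is a sketch of a general serialization idiom (names are hypothetical), not something from the original approach:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// Serializable wrapper: only the pattern String crosses the wire; the
// formatter is rebuilt lazily (and marked @transient) in each JVM.
class SerializableFormatter(pattern: String) extends Serializable {
  @transient private lazy val fmt =
    DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH)

  def parse(s: String): java.sql.Timestamp =
    java.sql.Timestamp.valueOf(LocalDateTime.parse(s, fmt))
}
```

You could then register the wrapper's parse method as the UDF, so the formatter is constructed once per executor rather than once per row.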

Friday, April 24, 2015

scala PrettyPrinter

I typically need to output some descriptive information along with my analytical programs. I use a simple scala pretty printer for most of it. It's on github.

Tuesday, March 24, 2015

coloring graph vertices in Mathematica using VertexLabelStyle

The graph capabilities of Mathematica seem to use a slightly different programming model than the rest of Mathematica. I was loading a graphml file from some information I had and I wanted to label the vertices with small blue text. I wanted to use VertexLabelStyle but I could not figure out how to use it based on the documentation and searching.

Here's how you can use it:

g = Import["mygraph.graphml"];
g = SetProperty[g, {VertexLabels->Placed["Name", {StatusArea, Above}], VertexLabelStyle -> Directive[Blue, FontSize->8]}]

That's about it. The key is to recognize that VertexLabels and VertexLabelStyle are both properties and options. But they are properties for the graph, and options for other elements such as vertices. This took me a long time to figure out because of the "property" nature of manipulating graphs. I think the property model is used based on the historical development of the graph manipulation features. A Graph is just a data structure (which in Mathematica, just means a head symbol) with contents that have specific meanings. For example, the first value in the Graph "structure" is the vertices, the second is the edges, etc. You have to use the property model to set various aspects of the "data structure".

Sunday, March 22, 2015

typesafe slick and config for data processing applications

I use slick 3.0.0 in my applications. A big improvement in 3.0 is the introduction of database actions that allow you to decouple the creation of a "query" from running it using an API that allows you to more easily choose async/sync patterns. It also includes a streaming interface in order to bound memory usage.
My applications are often a bit smaller than a large enterprise application. I typically load data into a database and then pull it back out for analysis. Slick makes this easy. Here's how I typically setup my application.
I create layers for the database module. The layers include:
  • a basic config for the profile and database
  • a schema layer
  • a queries layer
  • other layers that are application design/architecture specific e.g. integrating db calls into a stream based processing library
Here's what they look like for a hypothetical application that needs to work with application "events":
trait HasDBConfig[P <: JdbcProfile] {
  val config: slick.backend.DatabaseConfig[P]
}

trait AppSchemas[P <: JdbcProfile] {
  self: HasDBConfig[P] =>

  def entityName(name: String) = name.toUpperCase()
  type ID = Long
  import config.driver.api._

  implicit val MyTimestampTypeMapper =
    MappedColumnType.base[LocalDateTime, java.sql.Timestamp](
      java.sql.Timestamp.valueOf(_),
      _.toLocalDateTime)

  class AppEvents(tag: Tag) extends Table[AppEvent](tag, entityName("AppEvents")) {
    def id = column[ID]("ID", O.AutoInc, O.PrimaryKey)
    def eventId = column[Int]("EVENTID")
    def message = column[Option[String]]("MESSAGE")

    def * = (id.?, eventId, message) <> (AppEvent.tupled, AppEvent.unapply)
  }
}

trait LogQueries[P <: JdbcProfile] {
  self: AppSchemas[P] with HasDBConfig[P] =>

  import slick.dbio.DBIO
  import config.driver.api._
  import config.driver.DriverAction

  lazy val Events = TableQuery[AppEvents]

  def delete(events: Seq[AppEvent]) =
    Events.filter(e => e.id inSet events.map(_.id).flatten[ID].toSet).delete

....more queries here....
}

case class DataModule[P <: JdbcProfile](val config: DatabaseConfig[P]) extends AppSchemas[P] with LogQueries[P] with HasDBConfig[P]
I use a case class for DataModule to make instantiation easy and make sure that any args that are needed in the traits are available immediately. I also used self types above but you can use inheritance.
My apps have a fixed number of databases they can use. I'll usually define a class that has a val with the proper data module:
case class DatabaseModules(val configsource: Config = ConfigFactory.load()) {

  if (!configsource.hasPath("driver"))
    throw new IllegalArgumentException("Invalid configuration. No driver property specified. Use JVM option -Dconfig.trace=loads to view config settings")

  val DatabaseAccess = configsource.getString("driver") match {
    case "slick.driver.H2Driver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.H2Driver]("", configsource))
    case "slick.driver.MySQLDriver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.MySQLDriver]("", configsource))
    case "slick.driver.PostgresDriver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.PostgresDriver]("", configsource))
    case x@_ =>
      throw new IllegalArgumentException(s"Invalid configuration. Unknown driver: $x. Use JVM -Dconfig.trace=loads to view config settings.")
  }
}
I do not need fancy user messages for the app so I'll just take the exception. I believe an exception is appropriate here because if I returned an Option or Try I would have to have the rest of my application test for it. Since I do not want my application to continue if there is a problem with the configuration, this works for me.
The DatabaseModules class takes a config. My config looks like:
driver = "slick.driver.H2Driver$"
db {
  url = "jdbc:h2:tcp://localhost/~/dbs/appeventsdb;MULTI_THREADED=1"
  password = sa
  user = sa
  connectionTimeout=2000
}
In my application, you can use the data module:
    val config = ConfigFactory.parseFile(Paths.get("appevents.conf").toFile).withFallback(ConfigFactory.load())
    val DataModule = DatabaseModules(config).DatabaseAccess
    val moreDbStuff = SomeDBClassWithStuffInIt(DataModule)

    import DataModule._
    import DataModule.config.driver.api._
    import DataModule.config.db
    ...
    import moreDbStuff._
That's about it. You could easily extend this to allow prod/dev/test type configurations. Note that there are some issues in slick 3.0.0 RC1 that require the db sub-object for the moment.
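For example, a prod/dev/test setup could use one config file per environment, each falling back to the shared defaults via the same parseFile/withFallback pattern already shown. The file name and values below are hypothetical:

```
# dev.conf (hypothetical): loaded via ConfigFactory.parseFile, with
# ConfigFactory.load() supplying any keys not overridden here
driver = "slick.driver.H2Driver$"
db {
  url = "jdbc:h2:mem:devdb;DB_CLOSE_DELAY=-1"
  user = sa
  password = sa
  connectionTimeout = 2000
}
```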

typesafe config in small data processing applications

I use typesafe's config for my applications. Typically, my applications involve a database connection used for data analysis. That means that each application really does not have a fixed, one and only one db connection. I need to specify different connections as I cycle through different database types for the analysis.
I use typesafe's config for specifying the database connection and other application information. I want some defaults for various parameters, but I also require a mandatory config file for the database that is provided by the user. To do this, I use a slightly different form of the standard ConfigFactory.load() pattern:
   val config = ConfigFactory.parseFile(Paths.get("myapp.conf").toFile).withFallback(ConfigFactory.load())
That helps me ensure that an application config file is used while still relying on the default machinery of the config module to find internal config files, e.g. my app's default parameters or akka actors.
To force it to be mandatory, I require myapp.conf to define parameters and throw an exception (I don't need commercial grade user messages for this) if a setting is not found:
case class MyConfig(config: Config) {
  val myArg = config.getString("myConfigProperty")
}

Monday, March 16, 2015

spark 1.3 log settings

I do not like all the log info being dumped to the console when I start the spark shell or other programs. I've changed the log settings in conf/log4j.properties to the below values:

log4j.rootCategory=INFO,app
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.app=org.apache.log4j.FileAppender
log4j.appender.app.file=app.log
log4j.appender.app.append=true
log4j.appender.app.layout=org.apache.log4j.PatternLayout
log4j.appender.app.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n


Each time I run, a file app.log is created that captures the log output. This setting appends to the previous log file but you could change that by setting append=false.

Wednesday, March 11, 2015

compiling spark 1.3 RC3 and GA on fc21/linux

When I built spark 1.3 RC3 on FC21, I needed to make a few adjustments.

  • First, I had installed scala 2.11.6 from scala-lang.org
  • Modified pom.xml and commented out (using <!-- -->) the compiler plugins for quasi-quotes, which are built into scala 2.11's libraries unlike in 2.10. Check out sql/catalyst/pom.xml.
  • Excluded kafka, which does not have maven presence, for 2.11, from the build command line.
  • Ran dev/change-version-to-2.11.sh

I built spark with:

mvn -Phadoop-2.4 -Phive -Pyarn -Pscala-2.11 -pl \!external/kafka,\!external/kafka-assembly,\!examples -DskipTests clean package 

The modules you can exclude are listed in the pom directly under the modules section. "-pl" is a maven command line option that selects the list of projects to build; prefixing a module with "!" excludes it (that's an el, not a one).

You may want to use -Phadoop-provided if you are going to run on yarn directly as the AM in that deployment model will already contain the hadoop jars you need. I included yarn so I could run on yarn, but startup is very slow with anything hadoop so you may want to just use the spark master model for everything.


Update for GA:

In the GA release, it appears that they have set a flag to exclude quasi-quotes for scala 2.11. The build did not work for me, so I still had to comment out the dependency in the sql/catalyst/pom.xml file. The kafka modules are supposed to be used only for the scala-2.10 profile; however, the pom.xml did not work. Essentially, I still had to do everything that I listed above even for GA.

Note for 1.5.x
You need to read the instructions at: http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211.

Spark now comes with its own maven distribution so you always use the version that the spark team uses. Look for build/mvn. I had an older maven install so I had to set the M2_HOME variable to the explicit spark distribution maven directory to get the spark supplied maven to run correctly.