The graph capabilities of Mathematica use a slightly different programming model than the rest of Mathematica. I was loading a graphml file built from some data I had, and I wanted to label the vertices with small blue text. I wanted to use VertexLabelStyle, but I could not figure out how from the documentation or from searching.
Here's how you can use it:
g = Import["mygraph.graphml"];
g = SetProperty[g, {VertexLabels->Placed["Name", {StatusArea, Above}], VertexLabelStyle -> Directive[Blue, FontSize->8]}]
That's about it. The key is to recognize that VertexLabels and VertexLabelStyle are both properties and options: they are properties of the graph, and options for other elements such as vertices. This took me a long time to figure out because of the "property" nature of manipulating graphs. I think the property model reflects the historical development of the graph manipulation features. A Graph is just a data structure (which in Mathematica just means a head symbol) whose contents have specific meanings: for example, the first value in the Graph "structure" is the vertex list, the second is the edge list, and so on. You have to use the property model to set various aspects of the "data structure".
Tuesday, March 24, 2015
Sunday, March 22, 2015
typesafe slick and config for data processing applications
I use slick 3.0.0 in my applications. A big improvement in 3.0 is the introduction of database actions, which decouple the creation of a "query" from running it and make it easier to choose between async and sync patterns. It also includes a streaming interface to bound memory usage.
My applications are often a bit smaller than a large enterprise application. I typically load data into a database and then pull it back out for analysis. Slick makes this easy. Here's how I typically setup my application.
I create layers for the database module. The layers include:
- a basic config for the profile and database
- a schema layer
- a queries layer
- other layers that are application design/architecture specific e.g. integrating db calls into a stream based processing library
Here's what they look like for a hypothetical application that needs to work with application "events":
trait HasDBConfig[P <: JdbcProfile] {
  val config: slick.backend.DatabaseConfig[P]
}
import java.time.LocalDateTime

// The row type the table maps to; its shape is inferred from the * projection below.
case class AppEvent(id: Option[Long], eventId: Int, message: Option[String])

trait AppSchemas[P <: JdbcProfile] {
  self: HasDBConfig[P] =>

  def entityName(name: String) = name.toUpperCase()

  type ID = Long

  import config.driver.api._

  implicit val MyTimestampTypeMapper =
    MappedColumnType.base[LocalDateTime, java.sql.Timestamp](
      java.sql.Timestamp.valueOf(_),
      _.toLocalDateTime)

  class AppEvents(tag: Tag) extends Table[AppEvent](tag, entityName("AppEvents")) {
    def id = column[ID]("ID", O.AutoInc, O.PrimaryKey)
    def eventId = column[Int]("EVENTID")
    def message = column[Option[String]]("MESSAGE")
    def * = (id.?, eventId, message) <> (AppEvent.tupled, AppEvent.unapply)
  }
}
trait LogQueries[P <: JdbcProfile] {
  self: AppSchemas[P] with HasDBConfig[P] =>

  import config.driver.api._

  lazy val Events = TableQuery[AppEvents]

  def delete(events: Seq[AppEvent]) =
    Events.filter(_.id inSet events.flatMap(_.id).toSet).delete

  // ...more queries here...
}
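A trait like LogQueries only builds actions; nothing touches the database until an action is passed to db.run. Here is a minimal sketch of executing a query, assuming slick 3.0.0 on the classpath and the DataModule/DatabaseModules wiring shown below (the 10-second timeout is an arbitrary choice for illustration):

```scala
import scala.concurrent.Await
import scala.concurrent.duration._

val dm = DatabaseModules().DatabaseAccess
import dm._
import dm.config.driver.api._

// Building the action is pure; no database I/O happens here.
val action = Events.filter(_.eventId === 42).result
// Running it returns a Future; block only at the edges of the app.
val events = Await.result(dm.config.db.run(action), 10.seconds)
```

This is the decoupling mentioned at the top of the post: the same action value can be run, composed with other actions, or streamed, without rebuilding the query.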
case class DataModule[P <: JdbcProfile](val config: DatabaseConfig[P]) extends AppSchemas[P] with LogQueries[P] with HasDBConfig[P]
I use a case class for DataModule to make instantiation easy and to ensure that any arguments needed by the traits are available immediately. I used self types above, but you could use inheritance instead.
My apps have a fixed number of databases they can use. I'll usually define a class that has a val with the proper data module:
case class DatabaseModules(val configsource: Config = ConfigFactory.load()) {
  if (!configsource.hasPath("driver"))
    throw new IllegalArgumentException("Invalid configuration. No driver property specified. Use JVM option -Dconfig.trace=loads to view config settings.")

  val DatabaseAccess = configsource.getString("driver") match {
    case "slick.driver.H2Driver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.H2Driver]("", configsource))
    case "slick.driver.MySQLDriver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.MySQLDriver]("", configsource))
    case "slick.driver.PostgresDriver$" =>
      DataModule(DatabaseConfig.forConfig[slick.driver.PostgresDriver]("", configsource))
    case x =>
      throw new IllegalArgumentException(s"Invalid configuration. Unknown driver: $x. Use JVM option -Dconfig.trace=loads to view config settings.")
  }
}
I do not need fancy user messages for this app, so I just let the exception propagate. An exception is appropriate here: if I returned an Option or Try, the rest of my application would have to test for it. Since I do not want the application to continue when the configuration is broken, failing fast works for me.
The DatabaseModules class takes a config. My config looks like:
driver = "slick.driver.H2Driver$"
db {
url = "jdbc:h2:tcp://localhost/~/dbs/appeventsdb;MULTI_THREADED=1"
password = sa
user = sa
connectionTimeout=2000
}
In my application, you can use the data module:
val config = ConfigFactory.parseFile(Paths.get("appevents.conf").toFile).withFallback(ConfigFactory.load())
val DataModule = DatabaseModules(config).DatabaseAccess
val moreDbStuff = SomeDBClassWithStuffInIt(DataModule)
import DataModule._
import DataModule.config.driver.api._
import DataModule.config.db
...
import moreDbStuff._
That's about it. You could easily extend this to allow prod/dev/test type configurations. Note that there are some issues in slick 3.0.0 RC1 that require the db sub-object for the moment.
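One sketch of that prod/dev/test extension (the env key and the block names here are hypothetical): keep one block per environment in the config file and select it at startup with something like config.getConfig(config.getString("env")).withFallback(config):

```
env = dev

dev {
  driver = "slick.driver.H2Driver$"
  db { url = "jdbc:h2:mem:appevents" }
}

prod {
  driver = "slick.driver.PostgresDriver$"
  db { url = "jdbc:postgresql://dbhost/appevents" }
}
```

The withFallback keeps any keys defined at the top level visible to whichever environment block is selected.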
typesafe config in small data processing applications
I use typesafe's config for my applications. Typically, an application involves a database connection used for data analysis, so there is no single fixed connection: I need to specify different connections as I cycle through different database types during the analysis.
I use typesafe's config for specifying the database connection and other application information. I want defaults for various parameters, but I also require a mandatory config file for the database, provided by the user. To do this, I use a slightly different form of the standard ConfigFactory.load() pattern:

val config = ConfigFactory.parseFile(Paths.get("myapp.conf").toFile).withFallback(ConfigFactory.load())
That ensures the application config file is read while still relying on the default machinery of the config module to find internal config files, e.g. defaults for my app's parameters or for akka.
To make it mandatory, I require myapp.conf to define the parameters and throw an exception (I don't need commercial-grade user messages for this) if a setting is not found:

case class MyConfig(config: Config) {
  val myArg = config.getString("myConfigProperty")
}
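Constructing MyConfig once at startup is enough to enforce this: typesafe config's getString throws ConfigException.Missing when the key is absent, so a bad config kills the app immediately. A sketch:

```scala
val config = ConfigFactory.parseFile(Paths.get("myapp.conf").toFile)
  .withFallback(ConfigFactory.load())
// Throws ConfigException.Missing here if myConfigProperty is absent.
val myConfig = MyConfig(config)
```

All the eager vals in MyConfig are evaluated at construction, so every mandatory key is checked up front rather than at first use.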
Monday, March 16, 2015
spark 1.3 log settings
I do not like all the log info being dumped to the console when I start the spark shell or other programs. I've changed the log settings in conf/log4j.properties to the below values:
log4j.rootCategory=INFO,app
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.app=org.apache.log4j.FileAppender
log4j.appender.app.file=app.log
log4j.appender.app.append=true
log4j.appender.app.layout=org.apache.log4j.PatternLayout
log4j.appender.app.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
Each time I run, a file app.log is created that captures the log output. This setting appends to the previous log file but you could change that by setting append=false.
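If the ever-appending file grows too large, log4j's RollingFileAppender keeps it bounded; a sketch of that variant (only the appender class and the two size properties change from the settings above):

```
log4j.appender.app=org.apache.log4j.RollingFileAppender
log4j.appender.app.file=app.log
log4j.appender.app.MaxFileSize=10MB
log4j.appender.app.MaxBackupIndex=5
log4j.appender.app.layout=org.apache.log4j.PatternLayout
log4j.appender.app.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```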
Wednesday, March 11, 2015
compiling spark 1.3 RC3 and GA on fc21/linux
When I built spark 1.3 RC3 on FC21, I needed to make a few adjustments.
- First, I had installed scala 2.11.6 from scala-lang.org
- Modified the pom.xml and commented out (using <!-- -->) the compiler plugins for quasi-quotes, which are built into scala 2.11's libraries unlike in 2.10. See sql/catalyst/pom.xml.
- Excluded kafka, which does not have a maven presence for 2.11, from the build command line.
- Ran dev/change-version-to-2.11.sh
I built spark with:
mvn -Phadoop-2.4 -Phive -Pyarn -Pscala-2.11 -pl \!external/kafka,\!external/kafka-assembly,\!examples -DskipTests clean package
The list of modules you can exclude is in the pom, directly under the modules section. "-pl" is maven's projects-list option; prefixing a module with ! excludes it from the build (that's an el, not a one).
You may want to use -Phadoop-provided if you are going to run on yarn directly as the AM in that deployment model will already contain the hadoop jars you need. I included yarn so I could run on yarn, but startup is very slow with anything hadoop so you may want to just use the spark master model for everything.
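With that profile added, the build command above becomes (a sketch; same module exclusions as before):

```
mvn -Phadoop-2.4 -Phadoop-provided -Phive -Pyarn -Pscala-2.11 -pl \!external/kafka,\!external/kafka-assembly,\!examples -DskipTests clean package
```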
Update for GA:
In the GA release, it appears they set a flag to exclude the quasi-quotes dependency for scala 2.11, but the build still did not work for me, so I had to comment out the dependency in sql/catalyst/pom.xml anyway. The kafka modules are supposed to be used only for the scala-2.10 profile, but that part of the pom.xml did not work either. Essentially, I still had to do everything listed above, even for GA.
Note for 1.5.x
You need to read the instructions at: http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211.
Spark now comes with its own maven distribution so you always use the version that the spark team uses. Look for build/mvn. I had an older maven install so I had to set the M2_HOME variable to the explicit spark distribution maven directory to get the spark supplied maven to run correctly.