December 31, 2013

2013 in review

This blog has been pretty quiet since April, but that's because there's been an amazing amount of stuff happening. So to sum up the year since then:

  • May 4: Attended my brother's college graduation from the University of Michigan.

  • May 7: Celebrated 2-year anniversary of marrying my wonderful wife.

  • May 10: Defended my thesis. (And passed!)

  • May 11: Started a cross-country drive with my wife to San Francisco.

  • May 11-18: Saw Akron, OH (Swensons!!), experienced Wisconsin squeaky cheese curds, saw the sun rise and set in the Badlands, walked the lava flats and climbed a cinder cone at Craters of the Moon (thanks to the wonderful suggestion of my former high school physics teacher), saw beautiful, amazing sights at Yellowstone, including my first bison calf and Artist's Point (seriously, not to be missed).

  • May 18: Arrived in San Francisco, moved into a temporary place.

  • May 21: Started work at Dropbox, Inc. as a software engineer.

  • May 26: Attended the wedding of two of my closest friends.

  • June 11: Moved into a permanent apartment in the Castro.

  • July 9: Attended my first professional conference, but as one of the people helping out rather than as an attendee.

  • September ??: My graduate school's board of trustees met and approved degrees granted over the summer, including mine. (Whoohoo! Now I'm actually a Doctor.) I got my diploma in the mail, but they asked for it back, saying they misspelled my name (they forgot to include my middle name). Maybe it's a trick...

  • September 22-27: Took a trip to Boston to help Dropbox with on-campus recruiting.

  • October 7: 10-year anniversary of starting to date my above-mentioned wife.

  • November 22: Wielded a hammer and anvil for the first time at the Crucible in the East Bay.

  • December 31: Realized it's New Year's Eve and scrambled to write this post.

It's kind of strange now to be on the West Coast after a lifetime in the East, especially in combination with no longer being a student (after 22 years of schooling!–enough is enough). Life is simultaneously more sedate in California and more exciting for being out of school, especially now that I'm working at a fast-changing tech company. I'm pretty fortunate to be extremely happy with where I am at the end of this year, professionally and personally, and I'm quite looking forward to the next year and what it'll bring.

April 30, 2013

Physician career satisfaction vs. salary

My friend @lesterleung posted a link to this really interesting report by Medscape on physician compensation in the USA.

The whole set of slides is pretty interesting, but I was curious in particular about how much career satisfaction correlated with salary, at least when physicians are broken up into different specialties. I plotted each specialty's average response to "if you had to do it all over again, would you choose your own specialty again?" against its average salary in 2012. The line of best fit isn't too bad:

[Figure: scatter plot of each specialty's % willing to repeat specialty vs. mean 2012 salary, with line of best fit. Mean 2012 salary explains 27% of the variation in % willing to repeat specialty.]

Mean 2012 salary does a half-decent job explaining a chunk of the variation in specialty self-satisfaction.
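
That 27% is just r², the squared Pearson correlation between the two variables. As a minimal sketch in plain Python (with made-up numbers, not the actual Medscape data):

```python
# Hypothetical (mean salary in $K, % who would repeat specialty) pairs.
# These numbers are invented for illustration, not the Medscape data.
data = [(316, 41), (221, 48), (188, 51), (270, 44), (156, 55), (402, 40)]

def r_squared(pairs):
    """Squared Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy * sxy / (sxx * syy)

print(r_squared(data))
```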

March 29, 2013

Quick BibTeX capitalization-preserving one-liner

While writing my thesis, I realized that I needed to preserve the capitalization of certain words, like proper nouns or gene names, in the titles of references I had imported from the PubMed database. But manually combing through my BibTeX file and surrounding everything with curly braces was not my idea of fun. vim to the rescue!

This regex is a bit hairy, but it works, and it's idempotent: every once in a while, I just rerun the command to make sure all the capitalizations in the titles are "protected". One side effect is that it also adds braces around the first word of every title. That seemed relatively harmless and not worth the effort to fix, since reliably distinguishing ordinary first words from names at the beginning of a title is itself hairy.

Anyway, here is the one-liner:


I limited its substitutions to the title field, but you can substitute any other field name for Title (note that it appears twice in the command), as long as that field sits entirely on one line, as it does in my BibTeX file.
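
If vim isn't your tool of choice, the same idea is easy to express in a short script. Here's a rough Python sketch of the logic (my own sketch, not the vim one-liner itself): brace any word containing a capital letter inside a single-line Title field, skipping words that are already braced so the operation stays idempotent.

```python
import re

def protect_caps(line):
    """Brace capitalized words in a BibTeX 'Title = {...}' line.

    A rough Python sketch of the vim substitution's logic; lines
    that aren't single-line Title fields pass through unchanged.
    """
    m = re.match(r'(\s*[Tt]itle\s*=\s*\{)(.*)(\},?\s*)$', line)
    if not m:
        return line
    head, body, tail = m.groups()
    # Wrap words containing a capital, unless already wrapped in braces.
    body = re.sub(r'(?<!\{)\b(\w*[A-Z]\w*)\b(?!\})', r'{\1}', body)
    return head + body + tail
```

Like the command described above, this also braces a capitalized first word, and running it twice is a no-op.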

March 26, 2013

Switching from R to Python

I've lately switched from R to Python and its companion scientific libraries, such as numpy, scipy, pandas, and matplotlib, and after a few months of being really immersed in it, I have to agree with John Cook: "I'd rather do math in a general-purpose language than try to do general-purpose programming in a math language."

Some of the advantages I've seen in using Python for my data analysis have been:

  • Increased speed (BLAS/LAPACK routines, of course, run at the same speed)
  • Better memory usage
  • A sane object system
  • Less "magic" in the language and standard library syntax (YMMV, of course)

R's main advantages are its huge library of statistical packages (including a great graphing package in ggplot), its nice language introspection, and its convenient (in a relative sense) syntax for working with tabular data. I have to admit, the package library is quite the advantage, but I've found replacements for many of those packages, and I'm content to write ports of the few esoteric things I've needed, like John Storey's qvalue package.

R has better language introspection due to its basis in Scheme, allowing for wholesale invention of new syntax and domain-specific languages, but Python's language introspection is good enough for what I use day-to-day. I can monkey-patch modules and objects, do dynamic method dispatching, access the argspec of functions, walk up live stack traces, and eval strings if I really, really need to. I can live with only having that much power.
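
To illustrate what "good enough" looks like, here are a few contrived lines (modern Python 3 syntax, not from any library) that read a function's signature, monkey-patch a live class, and eval a string:

```python
import inspect

def greet(name, punctuation="!"):
    return "Hello, " + name + punctuation

# Read a function's signature (argspec) at runtime.
params = list(inspect.signature(greet).parameters)
print(params)  # ['name', 'punctuation']

# Monkey-patch a method on a live class; every instance sees the change.
class Greeter(object):
    def greet(self):
        return "hi"

g = Greeter()
Greeter.greet = lambda self: "hello"
print(g.greet())  # hello

# And eval, if you really, really need it.
print(eval("1 + 2"))  # 3
```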

For tabular data, pandas is not a bad replacement for R data frames. There's an initial hurdle in learning how indexes work and interact with the rest of the data frame, but once you get over that, it isn't so bad. Of course, in some respects "getting over that" is the worst part of pandas, because the documentation is more a set of tutorials than a complete API reference. But after some trial-and-error learning, I feel like I've gotten to the point where I'm about as productive munging data sets in pandas as I was in R. More so, in fact, since for working with raw strings and numbers I can always fall back on Python's native data structures and numpy, which are pretty great.

In the past, even when I was using R for most of my analysis, I would still do data processing, cleanup, and munging in Python. Now I don't have to switch mental and computer contexts anymore, and I gain the advantages I listed above.

The documentation issues can really bite you, though. numpy and scipy are great, but pandas's lack of documentation, combined with its attempts to be cute and magical, makes it a huge pain to debug sometimes. For example, running the following contrived example gives a difficult-to-understand error:

>>> import pandas
>>> x = pandas.DataFrame({'a': range(6), 'b': range(3,9),
...     'c': [0,0,1,1,2,2]})
>>> def f(df):
...     a = df['a'] + 'string'
...     raise Exception('Error')
>>> x.groupby('c').aggregate(f)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/lib/python2.7/site-packages/pandas/core/", line 1591, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/", line 1644, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/", line 1669, in _aggregate_item_by_item
    result[item] = colg.aggregate(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/", line 1309, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/", line 1391, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "<stdin>", line 2, in f
  File "[...]/lib/python2.7/site-packages/pandas/core/", line 470, in __getitem__
    return self.index.get_value(self, key)
  File "[...]/lib/python2.7/site-packages/pandas/core/", line 678, in get_value
    return self._engine.get_value(series, key)
  File "engines.pyx", line 81, in pandas.lib.IndexEngine.get_value
  File "engines.pyx", line 89, in pandas.lib.IndexEngine.get_value
  File "engines.pyx", line 135, in pandas.lib.IndexEngine.get_loc
KeyError: 'a'

KeyError?? The data frame obviously has a column labeled a, so what's going on? Well, if you delve into the source code with a debugger, pandas catches any and all exceptions when calling f and just tries to execute it in several completely different ways until something works, even if that's not what you wanted. The exception you finally see is thrown far past the original bug (the exception raised inside f itself), and any helpful information from that original exception is completely masked.
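
The failure mode is easy to reproduce outside pandas. Here's a toy sketch (my own code, not pandas internals) of a dispatcher that swallows the real exception and retries a different strategy, so the error you finally see points somewhere unrelated:

```python
def f(table):
    col = table['a']                        # fine when table is a dict
    raise RuntimeError('the real bug in f')

def aggregate(func, table):
    """Toy stand-in for a catch-all dispatcher (not pandas code)."""
    try:
        return func(table)                  # strategy 1: whole table at once
    except Exception:
        # The real RuntimeError is swallowed here. The fallback calls
        # func on each column (a plain list), so table['a'] becomes
        # list['a'] and raises an unrelated TypeError instead.
        return {k: func(v) for k, v in table.items()}

try:
    aggregate(f, {'a': [1, 2], 'b': [3, 4]})
except Exception as e:
    print(type(e).__name__)  # TypeError -- the RuntimeError never surfaces
```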

In this sense, pandas is really non-Pythonic, in that it feels very magical and non-explicit. It would be better for pandas to have multiple methods covering different ways of dispatching functions on the DataFrame than to have one function that can potentially misinterpret what you're trying to do. To quote from the Zen of Python, "explicit is better than implicit".

But that's a small cost to pay for a dramatically better, faster language to program in.

February 27, 2013

Publication Bias

I thought this post by Justin Esarey was a fun exploration of "publication bias", which is the bias academics feel in favor of publishing new and interesting results over negative results:

As far as I know, no one’s really tried to formally assess the impact of these phenomena or to propose any kind of diagnostic of how susceptible any particular result is to these threats to inference.

Check it out, it's pretty neat.
