casedCamels and underscores

Ah, coding styles. A favorite holy war. Are your braces one true or K&R? Is your indenting tabbed or spaced? Are your variables underscored or camelCased? In my case, respectively K&R, spaced and camelCased — but that's just my preference. It's not like any one style is objectively better than any other. Let's all just get along, people.

Unless you're CompSci researchers in Maryland, in which case you should do a study:

A family of studies investigating the impact of program identifier style on human comprehension is presented. Two popular identifier styles are examined, namely camel case and underscore. The underlying hypothesis is that identifier style affects the speed and accuracy of comprehending source code.

Essentially, they're asking if getValue() is easier to read than get_value(), which somehow, despite decades of flame wars, has never been formally asked. Here's the really interesting part:

If reading source code and reading natural language are substantially the same undertakings, then a significant body of foundational research on natural language can be used as a basis for program comprehension studies.

Whether myValue is better than my_value is interesting enough on its own, but the real value of the study is in tying the comparison to existing principles for reading in general. If reading code is like reading prose, programmers could theoretically borrow the techniques teachers use to improve prose comprehension and apply them to code. Maybe you're awesome and reading others' code comes naturally to you; I'm not, and it doesn't, so I'll take all the help I can get.
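To make the comparison concrete, here's a minimal Python sketch (mine, not the researchers') showing that the two styles carry exactly the same words and differ only in how the word boundary is marked. The function names are made up for illustration, and the regex assumes plain ASCII identifiers with no runs of capitals (something like HTTPServer would need more care).

```python
import re


def camel_to_snake(name):
    """Mark each lowercase-to-uppercase boundary with an underscore, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def snake_to_camel(name):
    """Drop the underscores and capitalize every word after the first."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)


def split_words(name):
    """Return the lowercase words of an identifier written in either style."""
    return camel_to_snake(name).split("_")


if __name__ == "__main__":
    print(camel_to_snake("getValue"))   # get_value
    print(snake_to_camel("my_value"))   # myValue
    print(split_words("rowSum") == split_words("row_sum"))  # True
```

Either spelling round-trips to the same list of words, so whatever difference the study measures comes from how readers perceive the boundary, not from the information in the name.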

The researchers expected that existing natural language comprehension principles would apply to code comprehension; specifically, that text with some kind of spacing between words (like_this) would be easier to read than un-spaced text (likeThis). They ran a series of related experiments on about 170 college-age programmers (mostly male, with a roughly even mix of style preferences), gauging response time and correctness. All the experiments were modeled after natural language predecessors, which fits with the overall attempt to correlate prose and code comprehension:

  1. Find an identifier in a cloud of similar identifiers.
  2. Find all the occurrences of an identifier in a code fragment.
  3. Answer SAT-style questions about a prose snippet displayed with underscores or camelCasing.
  4. Do the same as #3, but with code instead of prose.
  5. Read and verbally summarize code fragments while eye movements are tracked.
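
The second task is easy to describe in code, which gives a feel for what the subjects were doing by eye. This is only a rough sketch under my own assumptions: the function name find_occurrences and the sample fragment are invented for illustration and are not material from the study. The point is that a near-miss like row_summary must not count as a hit for row_sum.

```python
import re


def find_occurrences(identifier, fragment):
    """Return 1-based (line, column) pairs where the whole identifier appears."""
    # Reject matches that are embedded in a longer identifier on either side.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    hits = []
    for line_no, line in enumerate(fragment.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((line_no, match.start() + 1))
    return hits


# Hypothetical fragment with one distractor (row_summary).
FRAGMENT = """\
row_sum = 0
for row_summary in rows:
    row_sum += row_summary.total
print(row_sum)
"""

if __name__ == "__main__":
    print(find_occurrences("row_sum", FRAGMENT))  # [(1, 1), (3, 5), (4, 7)]
```

A human reader has to perform the same boundary check visually, and that is where the choice between an underscore and a capital letter could plausibly help or hurt.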

In general, the results were unexpected. Naming style didn't really affect comprehension for experienced programmers, while beginners seemed to do better with camel case than underscores (the opposite of prose comprehension):

In particular, the visual effort for short identifier names (e.g., rowSum) appears to be greater when using underscores. One possible explanation is that programmers chunk such short phrases into one concept because they are common concepts in the problem or solution domain (e.g., rowSum, xAxis). Thus, the use of underscore gets in the way of understanding these identifiers. ... the result of comprehension is a program plan with the specifics of the syntactic features filtered out rather than a rote memorization of the exact program text.

The theory is that since code already has a fairly structured form, comprehension is less about reading the words and more about recognizing the conceptual units. (Those of us who use i, j and k for array indices probably have no trouble swallowing that idea.) The studies demonstrate handily that reading code is not a parallel for reading prose — which, though it means we don't get to automatically apply all the cool natural language research to what we do, opens up massive potential for future research.