Code repositories are in constant change. Code is added and removed to introduce new features and to resolve bugs and performance issues. That made me curious about the life of a line: how long is a line expected to "live" before it is deleted or changed?

Thanks to GitHub, a large amount of repository history is now available to study. The results are interesting but unsurprising. The probability that a line will change is inversely proportional to the time it has lived: the longer a line has lived, the less likely it is to change.

The Ant life of a line

The Cassandra life of a line



The results were generated with a combination of git diff and git blame through the JGit API, ignoring whitespace-only changes.
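
The diff-and-blame bookkeeping can be sketched roughly as follows. This is a minimal Python sketch and not the JGit code behind the plots: it assumes the history of a single file is already available as a list of timestamped snapshots, and it uses the standard difflib module as a stand-in for git diff. The function name line_lifetimes and the snapshot format are hypothetical, chosen only for illustration.

```python
import difflib


def _normalize(line):
    # Drop all whitespace so that whitespace-only edits do not count as changes.
    return "".join(line.split())


def line_lifetimes(snapshots):
    """Track when each line was introduced and when it was changed or deleted.

    snapshots: list of (timestamp, list_of_lines), oldest first (assumed format).
    Returns a list of (line_text, born, died); died is None for lines that are
    still present in the last snapshot.
    """
    finished = []
    prev_time, prev_lines = snapshots[0]
    births = [prev_time] * len(prev_lines)  # "blame": birth time of every line

    for time, lines in snapshots[1:]:
        matcher = difflib.SequenceMatcher(
            None,
            [_normalize(l) for l in prev_lines],
            [_normalize(l) for l in lines],
            autojunk=False,
        )
        new_births = [time] * len(lines)  # default: line is new in this snapshot
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep their original birth time.
                new_births[j1:j2] = births[i1:i2]
            elif tag in ("replace", "delete"):
                # Changed or removed lines end their life at this snapshot.
                for i in range(i1, i2):
                    finished.append((prev_lines[i], births[i], time))
        prev_lines, births = lines, new_births

    # Lines that survive to the last snapshot have no death time yet.
    finished.extend((l, b, None) for l, b in zip(prev_lines, births))
    return finished
```

A changed line is treated here as one death plus one birth, which matches the "deleted or changed" definition of a line's life. Lines with died set to None are still alive, so any lifetime statistic computed from this output has to treat them as censored rather than as short-lived.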

Given these results, it is no coincidence that approaches like BugCache, which rely on recently changed code being the most likely to need fixing again, work well.

As part of my PhD I have collected a large corpus of Java projects from GitHub, and at some point I decided to run a topic model (LDA) on the identifier names, after splitting them with a simple regex that separates camelCase and names_with_underscores. The resulting topics are interesting. Where I have one, my own interpretation of the topic is given in parentheses.

0 (images, graphics) 0.5 image color width height font style g graphics size y paint draw bounds rect rectangle background x chart awt
1 (gui, swing) 0.5 j panel layout add set label button swing javax text box pane group awt java border component size constraints
2 (emf) 0.5 e package is notification object feature eclipse emf new create instance class ecore reference id impl msgs basic org
3 0.5 type object py generic self primitive objects lookup from subtype call convert create parameterized not actual implemented raw union
4 0.5 event state listener change changed events listeners mouse remove update is fire on add drag focus drop question transition
5 0.5 tag my metadata com psi null get delegate intellij not util tags project openapi nullable is jetbrains sc virtual
6 (tokenizers?) 0.5 token jj tokens cur cursor kind scanner st tokenizer active consume add choice scan first r char literal parse
7 0.5 get edit eclipse org policy adapter i diagram label command figure element create direct visual request gmf parser part
8 (math) 0.5 integer number int double long of big float to decimal one short boolean value math curve two zero opengamma
9 0.5 properties edition props object e views get editing repository part event instance component bot value org references eef filter
10 0.5 rule access group follow state bit get in grammar call input size stack re assignment la exception x recognition
11 0.5 status get id env region the packet quest player com sm object npc gameserver send inner skill instance utility
12 (email internet) 0.5 uri account get person string contact dto email id space customer mime attachment mail amount external note payment scheme
13 (threads) 0.5 thread run remote start sync runnable stop do timeout exception interrupted time current wait system delay e sleep lookup
14 0.5 provider configuration org registry get identifier string factory locator admin sf stub information net plan patient util carbon cfg
15 0.5 types element add part uml edit target constraint flow cn composite specification type pin realization event interaction dependency signal
16 0.5 base get process val generator gen generate workflow track standard org util bridge audit processing dc generation ad capture
17 (messaging jms) 0.5 message session msg messages body consumer org exchange topic create send destination jms apache camel payload string broker exception
18 0.5 result instance operation definition cs exp case validate ocl chain org object issue null operations definitions instances pivot collection
19 0.5 arg args function member pointer native arguments usage main argument single is members groovy non check checker apply util
20 0.5 name string names description equals desc full qualified required descriptions named canonical substring digester starts suffix rename unique deprecated
21 (file processing) 0.5 file dir io directory files java folder string exception path exists filename jar get delete util zip e absolute
22 (repositories scm revision control) 0.5 version repository get org artifact strategy execution maven css dependency repo apache string branch jcr git dependencies scm hudson
23 (ast tools) 0.5 binding reference unit declaration validation compiler validator ast completion scope type problem internal is fragment org source compilation decl
24 0.5 target filter ref owner card uuid effect add filters mage ability alfresco association cycle targets refs aspect filtered constants
25 (linear algebra) 0.5 x y point gl d z math texture points geo tile vector vertex matrix angle vec scale coord center
26 0.5 helper get as org import transform string endpoint transformer export cf sheet mule ct me family to country create
27 (i/o, console) 0.5 stream input out output system io in println exception java print read err write close e array os buffered
28 (maps) 0.5 map key put hash util java object keys string contains values remove integer default hashtable clear collections maps empty
29 (trees?) 0.5 node s tree nodes root mps operations jetbrains concept retval quoted behavior child adaptor from smodel create leaf util
30 0.5 value field values is boolean quickfix default not found of underlying security no leg required fields encoded desc or
31 (time) 0.5 time date format calendar locale of day millis zone year month timestamp get duration period to util interval java
32 (file reading) 0.5 line reader writer io java exception write string read buffered print lines indent close format util pw tools csv
33 (server internet) 0.5 server address channel port host socket connection exception net ip transport connector addr protocol inet java io connect ssl
34 (android media) 0.5 app settings profile activity media preference os device preferences intent string android audio setting extra pref receiver prefs tag
35 (security) 0.5 key security pair get exception algorithm public java digest cert certificate signature asn private md cipher spec sp secret
36 (exception handling) 0.5 exception error check not illegal argument ex e no found invalid throwable cause null errors unsupported valid err runtime
37 (svn) 0.5 path root get src copy svn is to dest revision diff paths include string depth merge relation dst relative
38 (vector ops) 0.5 max num min size vector sequence total random math sample matrix real sum per score double seq weight interval
39 0.5 resource annotation org jboss resources annotations java pdf javax feed deployment as ejb dictionary archive example phase inject retention
40 (wiki) 0.5 content search api get string results title language site custom wiki org identity details summary creator author created vo
41 (string processing) 0.5 string append to sb str equals length substring case of trim ignore with s buffer strings utils lower parse
42 0.5 page selenium runtime replace variables for load and at save click wait second to screen text is present source
43 (sloppy? tests) 0.5 a b v n d r k s h w u c g tmp j q times ii hi
44 (tests/ unit tests) 0.5 test assert equals true junit mock null expected suite not fail false case with org tests create before framework
45 0.5 context editor cell ctx create get set style provider default macro collection operation role mps manager contexts jetbrains constant
46 (serialization, json) 0.5 source json object module string serializer array foo to exception make modules serialization java mapper from sources serialize codehaus
47 0.5 component container form ui application org get render faces components renderer comp detail apache interceptor exoplatform markup app wicket
48 0.5 command player get cmd plugin sender send chat bukkit org commands message color world args ignore string case equals
49 0.5 task job graph get edge history work progress tasks id tracker status priority vertex trigger scheduler attempt edges dart
50 0.5 attribute constants attributes schema attr op enum string get ldap attrs any naming add default att dn equals enumeration
51 (eclipse plugins) 0.5 selection eclipse swt org text control grid ui composite layout set viewer label i dialog widget button shell page
52 (gui elements) 0.5 action menu window icon tab item display bar selected tool get show actions popup ui gui add title is
53 0.5 override org impl abstract common selector create warnings suppress xtext inject predicate immutable iterable named collect injector bind jvm
54 (android graphics) 0.5 m get flag flags animation bitmap android mode drawable count stage touch scroll span to is width height size
55 (searching) 0.5 o stack print trace term sort compare comparator get to lucene sorted analysis solr analyzer string apache object doc
56 (databases) 0.5 query sql connection statement string exception rs execute def conn update driver set jdbc close stmt java create ps
57 (web services) 0.5 xml namespace q prefix javax java org apache axis writer lang stream element name uri converter reader soap bind
58 0.5 set new old add use original auto clone allow sets orig hidden clear only existing fib cloned allowed boolean
59 (memory management?) 0.5 info get step memory is stat chunk used bucket to cp infos free rc mem steps size gc ci
60 0.5 core i project eclipse org bundle plugin monitor java internal get marker workspace runtime progress platform delta ui status
61 (db table) 0.5 table column row db record database col columns get rows book cursor join tables index alias count jooq records
62 0.5 data get ds dataset structure format string wire bs nc set cd marshal initial vis boolean cayenne submission ucar
63 (cloud) 0.5 domain engine storage network net vm get virtual cloud machine statistics id snapshot volume api jclouds core sourceforge com
64 0.5 id by com exception portal liferay util portlet kernel find order ids system model finder java long comparator company
65 0.5 f start end pos offset position char length ch character is at word segment chars begin mark curr zz
66 0.5 class entity loader dao interface clazz classes hibernate get persistence many super javax find em is criteria java load
67 (collections) 0.5 list add array util collection all java size empty remove collections linked to arrays contains as clear lists warnings
68 0.5 builder descriptor build extension from get com google protobuf java new has extensions generated default descriptors string proto accessor
69 (rdf semantic web) 0.5 model property get prop object create owl models rdf default util edu properties ontology triple add jena factory hp
70 (science) 0.5 get is equals atom bond cdk single instance openscience chem molecule mol fld reaction atoms interfaces flag contains smiles
71 (hadoop/ clusters) 0.5 apache org conf hadoop fs get exception cluster io writable job mapper counter configuration facet split master num format
72 0.5 client com proxy google callback gwt shared async ui bus console place on widget rpc presenter dev override user
73 0.5 item block par var items world entity mod material stack slot add id blocks infi minecraft inventory j random
74 0.5 meta field fields that org t apache set protocol is thrift exception id agent get success struct string this
75 (parallel) 0.5 entry count lock mode queue pool entries size executor concurrent util timer future atomic get java initial capacity long
76 (transaction) 0.5 manager get transaction environment holder commit register open global tx org pre pm mgr persistent create rollback xa txn
77 (java beans caching) 0.5 service cache local persistence set get services reference com bean finder layout portal cached resource liferay kaleo tracker org
78 0.5 code mapping visit visitor constant accept instruction pc label reg stack program opcode mappings local const acc insn codes
79 (spring) 0.5 factory config bean controller springframework support org beans get spring management create relationship managed javax neo jmx pipe utils
80 0.5 l res get stats limit de peer download transfer asset upload tr azureus actor dm registration torrent core piece
81 0.5 user group role password get permission auth string cms login username groups security authentication access principal users subject dao
82 0.5 child parent get handle children report spec birt metrics metric add util design measure hierarchy create decorator is dimension
83 0.5 c options option script scope comment contents js yy opt string default comments get bug cx scriptable has javascript
84 (url ops) 0.5 url string pattern link html match matcher get browser matches regex java net util links compile find malformed br
85 0.5 method parameter param java params parameters lang return call methods constructor library invoke invocation modifier static object reflect class
86 0.5 obj template get other category order hash code product ret equals object section java price elem prime util cat
87 0.5 location get temp uid serial raw serializable route loc to direction lat version stop lon distance string locations from
88 (logging) 0.5 log logger debug level get e enabled is logging processor util error wrapper factory j warn org java apache
89 (parsing?) 0.5 parser parse left right ast top symbol java cup rules stack runtime push at pop side php sym pull
90 (android views) 0.5 view on android text id r set activity dialog layout intent override button adapter click item listener widget content
91 0.5 handler store org get resolver sax tuple drools working handle handlers memory partition sink cdo activation fact descr default
92 (iterators) 0.5 next current iterator last has first sub it iter is to previous size remove prev head more empty from
93 0.5 simple feature layer org get multi features complex geometry ac coordinate create geoid catalog geotools factory band xs crs
94 0.5 expression expr range variable statement literal condition operator or if in formatter for binary var body evaluator eval qualifier
95 (documents) 0.5 element text document doc elements dom get elem create el w root xpath nl org nuxeo rice inline kuali
96 (indexes array operations) 0.5 i index length j size array count idx of at arr indices system arraycopy bounds tmp new indexes indexed
97 (web protocols) 0.5 request response http servlet header web javax org exception post req headers string get apache rest status ws cookie
98 (buffers) 0.5 buffer byte read bytes buf length write size len array offset int encoding io to charset off exception bits
99 (game graphics) 0.5 t p frame game move s jc sprite board ai robot pp translate util frames tp throwable play ent
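
The identifier splitting used to prepare the input for the topic model can be sketched like this. The exact regex I used is not reproduced here, so the pattern below is an assumption: a small Python sketch that splits on camelCase boundaries, underscores and digit runs, and lowercases the pieces so that they look like the words in the topics above. The name split_identifier is hypothetical.

```python
import re

# Assumed pattern, tried left to right at each position:
#   [A-Z]+(?![a-z])  a run of capitals not followed by a lowercase letter ("HTTP")
#   [A-Z]?[a-z]+     an optionally capitalized lowercase word ("parse", "Response")
#   [0-9]+           a run of digits
# Underscores and any other characters match nothing and act as separators.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(name):
    """Split a Java-style identifier into lowercase word tokens."""
    return [token.lower() for token in _TOKEN.findall(name)]


print(split_identifier("camelCase"))               # ['camel', 'case']
print(split_identifier("names_with_underscores"))  # ['names', 'with', 'underscores']
print(split_identifier("parseHTTPResponse"))       # ['parse', 'http', 'response']
```

Each source file (or project) then becomes a bag of such tokens, which is the kind of input an LDA implementation expects. Note that identifiers written without any separator, such as arraycopy, stay as one token, which is why words like that show up in the topics.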

You can recognize many interesting topics. Some of them seem very project-specific (e.g. eclipse), while others represent nice general concepts, such as database-related code or error handling. These results are similar to the LSA analysis of code in this paper.

This could be used to visualize codebases and source code files, and maybe for a few more things too. For a vaguely related paper on LDA in StackOverflow see here.

I've decided to create a blog on my personal page. It should be nothing big or frequent.


Let's see how it goes....