13 Jul 2015
Mulling over incarceration rates among females prompts some interesting questions. Over time, how many released female convicts relapse into their former behavior? How do such rates compare against their male counterparts? Are younger females more prone to relapse than older ones? And so on. Scores of papers have been published over the past few years and with more or less the same conclusions. Happily, The Bureau of Justice Statistics has some really fine data analysis tools enabling exploration of such data. Though a tad dated (this data comes from 1994) and only expressed in percentages (understandably so), there some very good insights that can be derived from this data.
The following are the packages used while plotting the data.
library(ggplot2)
library(reshape2)
library(RColorBrewer)
library(gridExtra)
Next I import the data that the BJS site provides as a downloadable file. This tool provides data on five variables describing recidivist behavior in 3 years following the release of female prisoners in the US in 1994. The tool combines data obtained from across 15 state prisons making it fairly comprehensive.
The first dataset reports cumulative data as measured monthly, over 3 years.
rec_female_c = read.csv("recidivism_all_female_cumulative.csv",header=TRUE)
rec_female2_c <- melt(rec_female_c, id.vars=1:1)
rec_female3_c<-rec_female2_c[rec_female2_c$Months %% 3==0,]
p1<-ggplot(rec_female3_c, aes(x = factor(Months), y=value, fill = variable)) +
geom_bar(width = 1, color="dark grey", stat="identity") + scale_fill_manual(values = rev(brewer.pal(5, "OrRd")))+ guides(fill=FALSE)+xlab("Time (in months)")+ylab("Value (Cumulative, in %)")
In the next dataset, I retrieve monthly percentages from the prior cumulative data, in order to express reconviction rates by month.
rec_female = read.csv("recidivism_all_female.csv",header=TRUE)
rec_female2 <- melt(rec_female, id.vars=1:1)
rec_female3<-rec_female2[rec_female2$Months %% 3==0,]
p2<-ggplot(rec_female3, aes(x = factor(Months), y=value, fill = variable)) +
geom_bar(width = 1, color="dark grey", stat="identity") + scale_fill_manual(values = rev(brewer.pal(5, "OrRd")))+coord_polar()+xlab("Quarter")+ylab("Value (in %)")
The first graph represents cumulative percentages of female former inmates relapsing into repeat-offender behavior over time. As can be seen,
- By the end of the first year, around 35% of the former inmates (female) have been re-arrested. This number rises to ~57% by the end of three years.
- 18% of the rearrests eventually lead to reimprisonment by the end of the third year.
- The rearrest numbers rise rapidly in the first 15 months, gradually tapering down by the third year.
The second graph is better suited to understanding quarterly recidivism volumes. As is clear from the graph most relapses occur right at the beginning and gradually peter out over time.
grid.arrange(p1,p2, nrow=1)
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0

Furthermore, instructive insights arise when the data is sliced by different variables. The following graphs do just that - The cumulative re-arrest rates for different cohorts are compared against rearrest rates for the overall group.
g1<-ggplot(rec_female_allvar, aes(Months)) +
geom_line(aes(y = Rearrest_All_Male,colour="Rearrested(M)"),size=0.75) +
geom_line(aes(y = Rearrest_All_Female,colour="Rearrested(F)"), size=0.75) +
geom_line(aes(y = Rearrest_all,colour="Rearrested(All)"), size=0.75) +
scale_colour_manual(name='', values=c('Rearrested(M)'='seagreen', 'Rearrested(F)'='steelblue4', 'Rearrested(All)'='gray40')) +
theme(legend.position="bottom")+
coord_cartesian(xlim=c(1,37))+
xlab("Month") + ylab("Cumulative %age")
It is clear right away that:
Graph 1: The rearrest rates are higher among men(68.5% in 3 months) as compared to women(57.7% in 3 months). Also note that the overall rearrest rates are highly skewed towards the male population indicating that the number of male inmates are far higher than the number of female inmates.
Graph 2: Previously imprisoned females are far more likely to get rearrested as compared to women who haven't been imprisoned before.
Graph 3: In the same vein, female former inmates with less than or equal to 2 previous arrests are far less likely to be rearrested, as against women arrested more than twice in the past. The contrast is especially stark here as by the end of the third year, 63.2% of women with more than 2 arrest records had been rearrested and only 29.9% of the other group had been rearrested.
Graph 4: Women who have previously spent less than a year in prison are marginally more likely to get rearrested, compared to women who have previously spent more than a year in prison.
Graph 5: Age doesn't appear to be much of a differentiator as can be seen in the graph where the cumulative rearrest rates for the groups below 25 years and above 25 years are almost in consonance.
Graph 6: Again unsurprisingly, males below 25 years(75.7% in 3 years) are more likely to get rearrested compared to females below 25 years of age(59.1% in 3 years).
grid.arrange(g1,g2,g3,g4,g5,g6,ncol=2)

None of the interpretations are very surprising or counter-intuitive, except that I had presumed there would be somewhat more differentiation based on age. However, it does demonstrate the marked differences in recidivism due to factors such as gender, previous arrests, etc.
Resources:
1. The Bureau of Justice Statistics
2. The BJS Prisoner Recidivism Tool
21 Jun 2015
Metal Bands - Analysis II: Mapping the origins of metal bands
Continuing with my analysis on metal bands, this time I look at the provenance of these bands as I attempt to graph - by country - where most of our metal music originates from. Posts such as this one have already elaborated on this question. I merely attempt to replicate the findings with the data I obtained from this site. In this post, I use the rworldmap package to help me represent the number of metal bands by country on a world map.
The following code shows all the packages this post utilizes. Using the same data that I had scraped for the earlier analysis, I now have a dataset detailing the country of origin of 77,317 bands.
library(reshape2)
library(rworldmap)
library(RColorBrewer)
colourPalette <- brewer.pal(9,'YlOrRd')
tables<-readRDS(file="ppndata.Rda")
df1 = read.csv("heavy.csv", header = TRUE)
df2 = read.csv("black.csv", header = TRUE)
df3 = read.csv("death.csv", header = TRUE)
Next, I scrape the current population (as on 18th June, 2015) from this site which gives me the population by country. As expected, some of the names needed to be tweaked so as to allow the merging of the two data frames, all the steps of which are listed below. Once the datasets have been merged, I create a new column that stores the number of bands per capita for each country. A cursory look at this data shows such extreme outliers that I have to take the logarithmic values for this data to be reasonably represented on a graph.
ppndat<-tables
colnames(ppndat)<-gsub('\\t','',colnames(ppndat))
colnames(ppndat)<-gsub('\\n','',colnames(ppndat))
colnames(ppndat)<-gsub(',','',colnames(ppndat))
colnames(ppndat)[3]<-"Population"
colnames(ppndat)[2]<-"country"
ppndat$Population<-gsub(',','',ppndat$Population)
ppndat$Population<-as.numeric(as.character(ppndat$Population))
ppndat$country<-as.character(ppndat$country)
ppndat$country[ppndat$country=="United States of America"]<-"United States"
ppndat$country[ppndat$country=="U. K."]<-"United Kingdom"
df_all<-rbind(df1,df2,df3)
country<-(df_all$Country)
dat<-sort(table(country),decreasing=TRUE)
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat<-melt(dat)
dat$country<-trim(gsub("\\b([a-z])([a-z]+)", "\\U\\1\\L\\2" ,dat$country, perl=TRUE))
final<-merge(dat,ppndat,by="country")
final$Pop1<-final$Population/1000000
final$num<-log(final$value)
final$npc<-log(final$value/final$Pop1)
spdf <- joinCountryData2Map(final, joinCode="NAME", nameJoinColumn="country", verbose='FALSE')
## 122 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 122 codes from the map weren't represented in your data
Now I can move on to plotting the data on a world map. First I graph the number of metal bands by the country of origin. As can be seen, the United States leads with 3107 bands, followed by Germany(1460 bands), Italy(886 bands), United Kingdom(810 bands), Brazil(781 bands) and so on.
mapParams<-mapCountryData(spdf, nameColumnToPlot="npc", catMethod="pretty",numCats=20,colourPalette=colourPalette,missingCountryCol="light grey", addLegend=FALSE)
do.call(addMapLegend
,c(mapParams
,legendLabels="all"
,legendWidth=0.5
,legendIntervals="data"
,legendMar = 2))

The above graph kind of highlights why we need to redo the analysis with the numbers drawn per capita; some of the leading countries are also the countries with very high populations and countries like Finland, Norway etc don't figure anywhere near the top. So now I use the bands-per-capita column to redraw the same graph. The graph below does just that.
This time the graph does the expected, the countries Finland, Sweden, Denmark, Norway figure at the top, with Greece making a surprise entry at the number 3 spot.
The end result of this analysis did not surprise me much - most of the bands I've listened to so far happen to originate from Northern Europe. However, I did have a few surprises while doing this analysis; I was glad to see India featured so high up the order in the number of metal bands. Also, I was surprised to find Greece wedged in right between the Scandinavian nations in the bands per capita rank order. I should give them a listen sometime.
References:
1. Encyclopaedia Metallum
2. http://www.reddit.com/r/Metal/comments/q2flb/mapofcountriesbymetalbandsper_capita/
3. http://www.worldometers.info/world-population/population-by-country/
15 Jun 2015
Metal Bands - Analysis I: What's in a metal band's name?
Metal bands, since forever, have appeared to follow particularly homogenous themes, esp. when it comes to naming themselves. Metal band names - to my mind - have a tendency to be excessively gory, stark and graphic like Cannibal Corpse, Morbid Angel or even just Gojira (my favourite, BTW). So what are the most commonly (mis?)used words by metal bands? Do different kinds of metal bands such as death, black etc. elect to follow different motifs in naming themselves? The Encyclopaedia Metallum is a good place to find out. I had previously heard plenty of good things about this site, but it was only when I started to analyse the data within that I realized really how comprehensive this site was. Seriously, if you are a metal fan you absolutely must check this site out.
I start the analysis by attempting to read the genre pages pertaining to heavy, black and death metal. Grabbing the HTML from the webpages using the XML package proved to be of no use as the data was displayed using javascript. I mulled using RSelenium or phantomjs but eventually opted to take the lazy way out by getting the JSON version of the data from this URL. I particularly liked this method because it let me right at the data, without having to deal with the accompanying cruft that comes with scraping the HTML from a website. However if you ever need to scrape data from pages that utilize Javascript, any of the aforementioned packages would be a good way to go.
Once I was done reading the data, I saved these into three separate dataframes - one each for heavy, black and death metal and loaded the required packages. The data comprises of 14,932 heavy metal bands, 26,368 black metal bands, and 36,017 death metal bands.
library(reshape2)
library(ggplot2)
library(gridExtra)
## Loading required package: grid
df1 = read.csv("heavy.csv", header = TRUE)
df2 = read.csv("black.csv", header = TRUE)
df3 = read.csv("death.csv", header = TRUE)
I would obviously want to ignore such words as the, in and so on, so I make a list of stopwords to eliminate before proceeding with the analysis.
myStopwords<-c("a","is", "to", "by", "i", "the", "in", "of", "for", "de", "from", "my", "no", "and", "metal")
pattern <- paste0("\\b(", paste0(myStopwords, collapse="|"), ")\\b")
Next, I take all the heavy metal band names, collapse them into a single row, convert them to lowercase and remove all the stopwords. Then I split them again into rows, each containing a single word, so it becomes easy for me to find the word frequency. Some of these rows contain empty strings so I remove all such rows as shown below. Finally, I use the sort function to get the top 15 commonly used words sorted in the descending order. I repeat the same for black and death metal separately. You can have a look at the words in the output below.
#heavy metal
heavydat<-tolower(paste(df1$Band,collapse=" "))
heavydat1<-gsub(pattern, "", heavydat)
words_heavy <- strsplit(heavydat1, " ")
final.list.heavy <- unlist(words_heavy)
not.blanks <- which(final.list.heavy != "")
final.list.heavy <- final.list.heavy[not.blanks]
dat.heavy<-sort(table(final.list.heavy),decreasing=TRUE)[1:15]
dat.heavy<-melt(dat.heavy)
dat.heavy1 <- transform(dat.heavy, final.list.heavy = reorder(final.list.heavy, value))
p1<-ggplot(dat.heavy1, aes(x = final.list.heavy, y = value,)) + geom_bar(stat="identity", color="cornsilk4", fill="cornsilk4")+coord_flip()
#Now do the same for black metal and death metal.
It almost makes me smile to look at the words thrown up. There also seem to be slight differences in the nomenclature of the bands of the three genres. Heavy Metal bands use somewhat reasonable words like "black", "steel", "fire", "angel", etc.; words that describe typical metal themes. Black Metal bands names have a more ominous tinge - words like "death", "funeral", "winter", "infernal" etc. make an appearance. Death Metal bands clearly take things to an extreme with words like "blood", "hate", "flesh", "decay", "corpse" etc. featuring prominently in their names.
dat.heavy$final.list.heavy
## [1] black steel dark fire angel force death iron
## [9] hell wild project x evil white cross
## 15 Levels: black steel dark fire angel force death iron hell ... cross
dat.black$final.list.black
## [1] black dark death blood darkness funeral forest
## [8] eternal infernal winter nocturnal lord throne moon
## [15] goat
## 15 Levels: black dark death blood darkness funeral forest ... goat
dat.death$final.list.death
## [1] death dead blood dark hate black flesh dawn
## [9] human decay chaos corpse eternal terror pain
## 15 Levels: death dead blood dark hate black flesh dawn human ... pain
grid.arrange(p1,p2,p3, ncol=1)
Many interesting words like "cross", "goat", "terror" etc. show up in the above lists; I would be interested in delving into the bands that named themselves so, if I ever have the time. Most names are in keeping with the spirit of the typical metal scene - despondent and dreary. Indeed I wonder if any metal band ever named themselves after something cheery.. if you know, I would love to hear about it.
Resources:
1. Encyclopaedia Metallum: http://www.metal-archives.com/
2. http://stackoverflow.com/questions/9249940/scraping-javascript-with-r
25 May 2015
Just this weekend, I was mucking around with the Monte Carlo method for estimating π and stumbled across quite a few interesting articles. It was intriguing to see the diversity of situatons that Monte-Carlo methods were adapted across and though I'll never get my head around most of them, I've - in this post - tried my hand at some classic problems just to get a sense of how well it works.
I don't suppose even for a second that Monte Carlo experiments find much favor with mathematical purists. Indeed anyone, through mere brute force, might come up with the right solutions to perplexing thought experiments like the Monty-Hall problem or the Boy-or-girl paradox. That said, there is no denying the efficacy of Monte Carlo methods in deriving quick estimates or in situations entailing less elegant solutions. Also, some of these estimates/solutions are quite fun to pull off :).
Say we wanted to estimate the value of π(pi). Imagine a circular dartboard of radius r against a square background of side length 2r. You aim a dart at this setup a looooot of times - the more the better - with probability of the dart landing anywhere within the square being uniform. Then count the number of the times the dart landed within the circle, and that should give you a fair estimate of the value of π. Cool, no?
$$\left( \frac{Area of Circle}{Area of Square} \right)= \frac{\pi r^2}{(2r)^2}= \frac{\pi}{4}$$
$$\pi=\left( \frac{Area of Circle}{Area of Square} \right)*4$$
Find the R code below. As you can see, I start off with a measly 1000 runs and then repeat the analysis for an increasing number of runs. I've provided the plots in a grid for ease of comparison.
n<-1000
x<-runif(n,min=0,max=1)
y<-runif(n,min=0,max=1)
circle <- (x-0.5)^2 + (y-0.5)^2 <= (0.5)^2
pi <- (sum(circle)/n)*4
plot(x,y,pch='.',cex=2,col=ifelse(circle,"tomato","grey")
,xlab='',ylab='',asp=1,
main=paste("Approximate Value of Pi =",pi," (n=",n,")"))



18 May 2015
This weekend, in a fit of boredom I thought I'd give Brainf**k a go. One of the most popular and recognized esoteric programming languages, I can't think of any real advantages that Brainf**k provides (If you can, let me know :) ). Richie Cotton has very kindly coded an interpreter for Brainf**k; you may find the code for the same here. I use his interpreter to write a HelloWorld program and a program that prints the integers 0 to 9.
if(!exists("foo", mode="function")) source("brainfuck.R")
Here goes the HelloWorld code:
helloworld<-"
+++++ +++++[>+++++ +++++<-] >++++.---.+++++ ++..+++.>>>+++[>+++++ +++++<-]
>++.<<<<+++++ +++.----- ---.+++.----- -.----- ---.>>+++++ +++++.
"
bfi <- fuckbrain()
bfi$interpret(helloworld)
And this beauty prints the integers 0 to 9:
print0to9 <- "++++[>+++++ +++++<-]+++++ +++++
>+++++ ++<[>+.<-]
"
bfi <- fuckbrain()
bfi$interpret(print0to9)
In keeping with the spirit of these languages, I haven't included any comments to help understand the code. If you need more understandable code, well-commented Brainf**k HelloWorld codes are present dime a dozen online. This Stackoverflow post gives a very engaging introduction to the language.
The codes here are just something I threw together to see if I could get the hang of this language and are neither optimised for length nor efficiency. I would love to know how they can be improved on both counts, if you have any thoughts.
*Resources: *
1. http://4dpiecharts.com/2013/04/24/a-brainfuck-interpreter-for-r/
2. http://stackoverflow.com/questions/16836860/how-does-the-brainfuck-hello-world-actually-work
3. http://en.wikipedia.org/wiki/Brainfuck