Module 04: Markov Inequality Simulation

Background

This simulation demonstrates upper bounds on probabilities, as given by the Markov inequality. As an example of an upper-bound probability, imagine the following scenario:

You own a grocery business and find that the first 50 customers today spent an average of $100 each. What is the maximum possible probability (P) that a customer spent $100 or more?

The answer is P = 1, because it is possible that all 50 customers spent exactly $100 each. This is not likely, but it is the largest value that probability can take, so it is the upper bound.

Rationale: The mean amount is $100. The mean would be $100 if two people spent $100 each, and it would also be $100 if 1000 people each spent that amount. In both situations, 100% of the population spent exactly the mean. The only way for 100% of customers to spend the mean or more is for everyone to spend exactly the mean, because anyone spending more would pull the mean above $100.

The maximum probability decreases as the spending threshold increases. For example, given that the average is $100, what proportion of customers could have spent $250 or more? Not all of them: customers who spend $250 each must be a minority, or the mean would exceed $100. The upper bound is therefore well below 1.
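
As a minimal sketch of this example in R, assuming spending can never be negative, the Markov bound for a $250 threshold is the mean divided by the threshold. The variable names and the 20/30 split of customers below are hypothetical, chosen only to show that the bound can actually be reached.

```r
#hypothetical values taken from the grocery example
meanSpend <- 100   #average spend per customer ($)
threshold <- 250   #spending level of interest ($)

#Markov bound: P(X >= threshold) <= mean / threshold
bound <- meanSpend / threshold

#an extreme (hypothetical) day that reaches the bound:
#20 customers spend exactly $250 and the other 30 spend nothing
spend <- c(rep(250, 20), rep(0, 30))

mean(spend)               #average is still $100
mean(spend >= threshold)  #share of customers at or above $250
bound                     #the same value as the share above
```

Any other spending pattern with a $100 mean puts a smaller share of customers at or above $250.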

Data

This dataset contains funding information for Indian startups from January 2015 to August 2017. It includes columns for the funding date, the city where the startup is based, the names of the investors, and the amount invested (in USD).

Source: https://www.kaggle.com/sudalairajkumar/indian-startup-funding/

Goal

Use the Markov inequality to compute and plot the upper bound on the proportion of startups that raise at least \(a\) dollars, where \(a\) ranges from the mean funding level to $100,000,000.

Note

The Markov inequality states that, for a nonnegative random variable \(X\) and any \(a > 0\), \(P(X \geq a) \leq \frac{E[X]}{a}\). We will calculate this upper bound for funding values (\(a\)) from the mean (\(E[X]\)) to $100,000,000.
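
For reference, the bound follows in one line from the definition of the expected value, assuming \(X\) is nonnegative and \(a > 0\). Here \(\mathbf{1}\{X \geq a\}\) is an indicator (notation introduced for this sketch) that equals 1 when \(X \geq a\) and 0 otherwise:

\[
E[X] \;\geq\; E\big[X \cdot \mathbf{1}\{X \geq a\}\big] \;\geq\; E\big[a \cdot \mathbf{1}\{X \geq a\}\big] \;=\; a \, P(X \geq a)
\]

Dividing both sides by \(a\) gives \(P(X \geq a) \leq \frac{E[X]}{a}\). At \(a = E[X]\) the bound equals 1, and it falls toward 0 as \(a\) grows.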

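#load the packages used in this module
library(tibble)   #as_tibble()
library(ggplot2)  #ggplot()
library(scales)   #dollar_format()
library(plotly)   #ggplotly()
library(shiny)    #div()

#import the data from a .csv file
data <- as_tibble(read.csv("startup_funding.csv", header=TRUE))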

#check header for info
colnames(data)
##  [1] "Sr.No"                   "Date.ddmmyyyy"          
##  [3] "Startup.Name"            "Industry.Vertical"      
##  [5] "SubVertical"             "City..Location"         
##  [7] "Investorsxe2x80x99.Name" "InvestmentnType"        
##  [9] "Amount.in.USD"           "Remarks"
head(data)
dim(data)
## [1] 3009   10
#investigate data types
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3009 obs. of  10 variables:
##  $ Sr.No                  : int  0 1 2 3 4 5 6 7 8 10 ...
##  $ Date.ddmmyyyy          : Factor w/ 1013 levels "\\\\xc2\\\\xa010/7/2015",..: 170 132 132 132 132 132 132 132 96 22 ...
##  $ Startup.Name           : Factor w/ 2439 levels "\"BYJU\\\\'S\"",..: 729 325 511 25 30 1424 140 829 2284 307 ...
##  $ Industry.Vertical      : Factor w/ 812 levels "\\\\xc2\\\\xa0Casual Dining restaurant Chain",..: 262 262 16 707 384 371 125 303 13 181 ...
##  $ SubVertical            : Factor w/ 1913 levels "\"Women\\\\'s Fashion Clothing Online Platform\"",..: 581 824 405 488 188 386 310 1882 1039 113 ...
##  $ City..Location         : Factor w/ 112 levels "","\\\\xc2\\\\xa0Bangalore",..: 88 67 67 45 25 19 67 65 19 39 ...
##  $ Investorsxe2x80x99.Name: Factor w/ 2387 levels "","\"Kedaraa Capital, Ontario Teachers\\\\'\"",..: 1261 1768 2128 620 417 606 38 337 912 1825 ...
##  $ InvestmentnType        : Factor w/ 52 levels "","Angel","Angel / Seed Funding",..: 17 40 26 21 40 30 40 43 40 43 ...
##  $ Amount.in.USD          : Factor w/ 502 levels "\\\\xc2\\\\xa010,000,000",..: 311 233 391 353 134 49 50 339 353 168 ...
##  $ Remarks                : Factor w/ 75 levels "","\\\\xc2\\\\xa0Late Stage",..: 1 1 1 1 1 1 1 1 1 33 ...
#create vector
funding<-data$Amount.in.USD

#try this first
#mean(funding)
#returns NA with a warning: In mean.default(funding) : argument is not numeric or logical: returning NA
#Translation: the values are not numeric -- you can't get the mean of non-numbers

#Now you have to clean the data to be able to find the mean

#create data frame for easier manipulation/organization
df<-data.frame(funding)

#the raw values contain commas and plus signs, which block conversion to numeric. You need to remove them.

#remove the + signs, have to escape the + because it is a special character
df$fundingNP<-gsub("\\+","",df$funding)   

#remove the commas
df$fundingNC<-gsub(",","",df$fundingNP) 

#convert to numeric
#values that still cannot be parsed become NA (R warns: NAs introduced by coercion)
df$fundingClean<-as.numeric(df$fundingNC) 

#check for rows containing NA values
df[!complete.cases(df),] 
#remove rows with NA values
df<- na.omit(df)

#The data are clean and numeric.
#Now you can find the mean
m<-mean(df$fundingClean)

#print it to see the value
#the cat() function (concatenate) allows you to mix text and variables 
cat("Mean funding value ($USD):", m)
## Mean funding value ($USD): 16975984
#define a sequence of funding levels (USD), in steps of $1M
#min is m that was calculated before
fSeq<-seq(m,100000000, by = 1000000)

#define a function
#the function calculates the Markov upper bound: "P(X >= a) <= E[X]/a"
#will take a sequence of input values and calculate the bound for each
#each calculation uses the same expected value (mean)  
probability <- function(avgVal,input) {
  p <- avgVal/input
  return(p)
}

#run the function for values of the sequence
output<-probability(m,fSeq)

#set up scale factor
#this is so our x axis doesn't have huge labels
million<-1000000

#create data frame with scaled funding values and associated probabilities
df<-data.frame(fSeq/million,output)

#rename column names
colnames(df)<-c("funding","probability")

#define colors for the fill and border 
dataFillColor<-"#149AB5"
dataBorderColor<-"#10798f"

#set up the plot
p <- ggplot(data = df, aes(x = funding, y = probability)) +
  theme_minimal()+
  scale_y_continuous(breaks = seq(0,1,by=0.1), 
                     limits = c(0,1))+
  scale_x_continuous(labels=dollar_format(),
                     breaks = seq(0,100,by=10),
                     limits=c(0,max(df$funding)))+ 
  geom_area(col=dataBorderColor, 
            fill=dataFillColor)+
  labs(x="Startup Funding (millions)", 
       y="Upper Bound on P(X \u2265 a)",
       title="Upper Bound Probability for Indian Startup Business Funding",
       subtitle="Markov Inequality Model",
       caption="Data Source: https://www.kaggle.com/sudalairajkumar/indian-startup-funding/ ")+
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5))+
  theme(axis.text=element_text(size=14),
        axis.title=element_text(size=14,
                                face="plain"),
        plot.title = element_text(size=16,
                                  face="bold"))+
  geom_vline(xintercept=min(df$funding), 
             col="#333333", 
             linetype = "longdash")+
  annotate("text", 
           x=min(df$funding)-1,
           y=1,
           size=5,
           hjust=1, 
           label = expression(paste("Mean Funding Level ",symbol('\256')))
           )

#draw the plot
p

Here is an interactive version. Use your mouse to hover over the upper boundary line or the x axis values. A label will appear with additional detail.

#draw the plotly version
plotlyPlot <- ggplotly(p)

#use a div tag (shiny library allows this) to center the plot on the html page
div(plotlyPlot, align = "center")
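
The Markov bound uses only the mean, so it is usually far above the proportion actually observed. The following sketch illustrates this with simulated data. The log-normal sample, its parameters, and the variable names are assumptions made for illustration and are not drawn from the startup dataset.

```r
#simulate a hypothetical right-skewed, nonnegative sample (NOT the startup data)
set.seed(42)
x <- rlnorm(3000, meanlog = 14, sdlog = 2)

#thresholds from the sample mean up to five times the mean
mx <- mean(x)
a <- seq(mx, 5 * mx, length.out = 20)

#Markov upper bound at each threshold
bound <- mx / a

#proportion of the sample actually at or above each threshold
empirical <- sapply(a, function(thr) mean(x >= thr))

#compare the two side by side
comparison <- data.frame(threshold = a, bound = bound, empirical = empirical)
head(comparison)

#the bound is never below the observed proportion
all(empirical <= bound)
```

The gap between the two columns shows how conservative the bound can be when nothing but the mean is known.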

Conclusions:

At most,

  • 100% of Indian Startup Businesses (ISB) receive the average funding amount (~$17M) or more.
  • ~50% of ISBs receive ~$34M or more.
  • ~25% of ISBs receive ~$68M or more.
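
These figures can be checked directly from the mean reported earlier ($16,975,984). The short sketch below, with hypothetical variable names, evaluates the bound at a few funding levels (the mean, $35M, and $70M, chosen for illustration) and also inverts it to find the level at which the bound equals a chosen proportion.

```r
#mean funding value printed earlier in this module (USD)
m <- 16975984

#Markov bound at a few illustrative funding levels
fundingLevels <- c(m, 35e6, 70e6)
round(m / fundingLevels, 3)
## [1] 1.000 0.485 0.243

#invert the bound: the funding level a where E[X]/a equals a target proportion
targets <- c(1, 0.5, 0.25)
m / targets
## [1] 16975984 33951968 67903936
```

Because the bound is \(E[X]/a\), doubling the funding level always halves the bound.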

Also,

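  • ggplotly() removed some formatting from my plot but added some interactivity.
  • Data point 5 looks questionable (10,00,000), though the grouping matches the Indian numbering system, where 10,00,000 is 10 lakh (1,000,000).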
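
A value such as 10,00,000 is handled by the cleaning step like any other comma-grouped number: once the commas are removed it reads as 1,000,000 (10 lakh in Indian digit grouping). A small sketch, using a hypothetical vector of raw strings, shows that the cleaning steps from this module parse both grouping styles to the same number:

```r
#hypothetical raw values: Indian grouping, Western grouping, and a trailing plus sign
raw <- c("10,00,000", "1,000,000", "1,000,000+")

#same cleaning steps as in this module: drop plus signs, then commas
noPlus <- gsub("\\+", "", raw)
noComma <- gsub(",", "", noPlus)
clean <- as.numeric(noComma)

clean == 1e6  #TRUE for all three values
```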