Two readings, some months apart, inspired this post. The first was just a couple of days ago: while browsing, I found a talk from Etsy, the e-commerce website, given at the “Business of APIs Conference 2012”. In it, they argued for thinking of an API (Application Programming Interface) as a product and not simply as an interface for connecting to a data repository. They also explained that developing a new product involves people from many disciplines or areas within the company, such as designers, project managers and marketers. In contrast, the development of an API tends to be restricted to the IT department, with the risk of losing focus on its usability and functionality, i.e. of ending up with a vision that is perhaps too technical. I don’t mean that this always happens, and surely many things have changed since then, but it’s clear that technical and commercial insights sometimes pull in opposite directions. Today an API is a great opportunity to do business, to connect with customers more easily, and to enhance the visibility of the business.
The second reading is older and goes back a few months, to March to be exact, when I read an interesting blog post on the Beautiful Data site in which the authors analyzed Big Data investment in technology in 2014 using data gathered from the CrunchBase portal. I remember that I liked how they presented the topic and how they used the CrunchBase API (a RESTful interface) to find relationships between companies and investors related to Big Data. APIs weren’t alien to me then, because I had already done some development with the Facebook and Flickr APIs for sentiment analysis, but I found their approach very interesting, with great potential for evaluating and analyzing investments in a startup ecosystem. So I checked out the CrunchBase API documentation (version 1) and wrote a series of Python scripts that produced some results comparing, for instance, the startup ecosystems of Barcelona, London and Paris. At this point I must say that version 1 of the API wasn’t very robust: some fields were somewhat ambiguous and the requests rather limited.
Back to the present and considering both “revelations”, I tried to dust off my old code in order to publish some results on my blog, but unfortunately the CrunchBase API had migrated to version 2 and I was forced to rebuild the scripts completely. Finally, after some hours fighting with Python, R and MongoDB, I reached my goal. I used MongoDB because CrunchBase limits usage to 2,500 calls per day and 50 calls per minute, so I needed to save the results of my requests in order to reduce the traffic. Moreover, MongoDB is a robust NoSQL database that is very easy to configure and works fairly well with JSON-like documents. As a tip, there is a package called Python Crunchbase 1.0.2 (uploaded 17/09/14) that “in theory” works well with version 2, but I haven’t tried it personally. In any case, most of the scripts that I used are in this IPython Notebook.
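To illustrate the caching idea, here is a minimal sketch in Python. It is not taken from the notebook: the class name, the `fetch` callable and the in-memory dictionary are assumptions made for the example, and in a real setup the dictionary would be replaced by a MongoDB collection keyed by URL. Only the two quotas (50 calls per minute, 2,500 per day) come from the text above.

```python
import time
from collections import deque


class RateLimitedCache:
    """Cache API responses and throttle real calls to respect two quotas."""

    def __init__(self, fetch, per_minute=50, per_day=2500,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch          # hypothetical callable: url -> parsed JSON
        self.per_minute = per_minute
        self.per_day = per_day
        self.clock = clock          # injectable so the class can be tested
        self.sleep = sleep
        self.cache = {}             # stand-in for a MongoDB collection
        self.calls = deque()        # timestamps of real calls, oldest first

    def _wait_for_slot(self):
        now = self.clock()
        # Forget calls that are more than 24 hours old.
        while self.calls and now - self.calls[0] >= 86400:
            self.calls.popleft()
        if len(self.calls) >= self.per_day:
            raise RuntimeError("daily quota exhausted")
        # Calls made during the last 60 seconds.
        recent = [t for t in self.calls if now - t < 60]
        if len(recent) >= self.per_minute:
            # Sleep until enough of them have left the one-minute window.
            self.sleep(60 - (now - recent[-self.per_minute]))

    def get(self, url):
        """Return the cached response, calling the API only on a miss."""
        if url in self.cache:
            return self.cache[url]
        self._wait_for_slot()
        self.calls.append(self.clock())
        data = self.fetch(url)
        self.cache[url] = data
        return data
```

With this shape, repeated runs of an analysis only spend quota on URLs that have never been requested before, which is the whole point of storing the responses.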
According to Wikipedia, CrunchBase is “a database of companies and startups, which comprises around 500,000 data points profiling companies, people, funds, fundings and events. The company claims to have more than 50,000 active contributors. Members of the public, subject to registration, can make submissions to the database; however, all changes are subject to review by a moderator before being accepted. Data are constantly reviewed by editors to ensure they are up to date. CrunchBase says it has 2 million users accessing its database each month”.
Now, it’s important to remember that each CrunchBase member fills in only the information that they want or consider appropriate, so it isn’t unusual to find companies with very little relevant information. On the other hand, version 2 now includes IDs (uuids) for all objects, which is an advantage over version 1, although there are still some details to debug. Also, when you write a script you must be careful to include exception handling, because requests are very sensitive to a field that is empty or doesn’t exist, maybe because it wasn’t filled in properly or was removed intentionally by the system. Depending on your app, you may also want to search for companies in a specific city, and in this case a first answer to the query could be too broad, i.e. it could include companies from other cities. For example, if the query is for Barcelona (by uuid), the companies gathered include Veeva, whose global headquarters is in the USA but which has an office in Barcelona that serves as its European headquarters. In this case it would be necessary to apply some filter to avoid confusion, especially when you want to analyze strictly the startups of a city. Even so, all this depends on how a company defines itself, I mean, on its category.
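A small sketch of both precautions, in Python. The record layout used here (`headquarters`, `offices`, `city`) and the sample companies are invented for the example and do not reproduce the actual CrunchBase response format; the helper names `dig` and `headquartered_in` are hypothetical too.

```python
def dig(record, *keys, default=None):
    """Walk nested dicts/lists, returning `default` when any step is missing."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # Missing key, empty list, or a None where a container was expected.
            return default
    return default if current is None else current


def headquartered_in(companies, city):
    """Keep only companies whose first listed headquarters is in `city`."""
    wanted = city.casefold()
    return [
        company for company in companies
        if str(dig(company, "headquarters", 0, "city", default="")).casefold() == wanted
    ]


# Made-up records: one local company, one foreign company with a local
# office, and two with incomplete data.
sample = [
    {"name": "LocalStartup", "headquarters": [{"city": "Barcelona"}]},
    {"name": "GlobalCorp", "headquarters": [{"city": "San Francisco"}],
     "offices": [{"city": "Barcelona"}]},
    {"name": "NoAddress", "headquarters": []},
    {"name": "Removed", "headquarters": None},
]

print([company["name"] for company in headquartered_in(sample, "Barcelona")])
# prints ['LocalStartup']
```

Returning a default instead of raising keeps a long crawl from stopping at the first incomplete profile; whether an office should count as "being in the city" is a decision for each analysis.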
In Barcelona, for instance, if you filter by location (Barcelona, uuid: eead2c0cb178ad334e6d6c813c955e99) and category (“Startups”, uuid: 568e63721763cf41d3f05a985edc3220) you get just 15 startups, when in reality there are many more companies on the list that could currently be considered startups. In any case, data quality depends on what a company wants to show: funding rounds, funds raised, whether the funding is undisclosed, private equity, etc. For this reason the data gathered should be considered only as illustrative.
Some results for the Barcelona ecosystem
A) The following figure shows a VC fundraising graph in which it is possible to see the relationships between companies (690 orange circles, 145 of them connected) and investors (188 blue circles). The representation is an undirected, unweighted graph. Some pairs of nodes could be joined by parallel links, because there are investors that participate in different funding rounds of the same company, so strictly it is a multigraph, although visually we see only one link.
Also, as a remarkable detail, there are 42 connected components, i.e. 42 subgraphs in each of which any two nodes are connected to each other by a path but not to any other node in the “supergraph”. This last point is key because there is one big component made up of 220 nodes, 98 companies and 122 investors, forming a significant investment group for the city. Depending on the information you want to get, it’s possible to define this scheme as a weighted directed graph (see the IPython notebook), for instance to apply the PageRank algorithm or in/out-degree centrality.
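The notion of connected components can be sketched in a few lines of plain Python. This is only an illustration with a toy edge list, not the notebook code (NetworkX provides `connected_components` for the same purpose); the function below and its sample data are assumptions of the example. Note that nodes without any link never appear in an edge list, so isolated companies are not counted here.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the components of an undirected graph as node sets, largest first."""
    adjacency = defaultdict(set)  # a set collapses parallel links between a pair
    for company, investor in edges:
        adjacency[company].add(investor)
        adjacency[investor].add(company)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy (company, investor) pairs; the repeated pair mimics two funding rounds.
edges = [("A", "x"), ("A", "x"), ("B", "x"), ("C", "y")]
print([len(component) for component in connected_components(edges)])
# prints [3, 2]
```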
B) Total Fundraised vs Year
C) Top 10 Companies
D) Top 10 Investors
E) Total Fundraised vs Pagerank
According to NetworkX, “PageRank computes a ranking of the nodes in the graph G based on the structure of the incoming links. It was originally designed as an algorithm to rank web pages”.
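For readers who want to see what is behind that definition, here is a minimal power-iteration sketch in plain Python. It is a simplified illustration rather than the NetworkX implementation: it assumes that edges point from investor to company, uses a damping factor of 0.85, and spreads the rank of nodes without outgoing links evenly over all nodes. The node names are made up.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Approximate PageRank for a directed edge list by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)  # a repeated edge counts as extra weight

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy data: two investors, two companies.
ranks = pagerank([("inv1", "A"), ("inv2", "A"), ("inv2", "B")])
print(sorted(ranks, key=ranks.get, reverse=True)[0])
# prints A
```

In this toy case company A ranks first because it receives links from both investors, which is the intuition behind plotting total fundraised against PageRank.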
F) Total Fundraised vs Degree Centrality (in-degree)
According to NetworkX, “the in-degree centrality for a node v is the fraction of nodes its incoming edges are connected to. The degree centrality values are normalized by dividing by the maximum possible degree in a simple graph n-1 where n is the number of nodes in G.”
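The normalization in that definition is easy to check by hand. The sketch below, in plain Python, is an illustration under two assumptions: edges point from investor to company, and repeated investor-company pairs are collapsed first so that the graph is simple and the values stay between 0 and 1. The function and the toy data are not from the notebook.

```python
def in_degree_centrality(edges):
    """In-degree of each node divided by n - 1, on the simple directed graph."""
    simple_edges = set(edges)  # collapse repeated (source, target) pairs
    nodes = {node for edge in simple_edges for node in edge}
    if len(nodes) < 2:
        return {node: 0.0 for node in nodes}
    in_degree = {node: 0 for node in nodes}
    for _source, target in simple_edges:
        in_degree[target] += 1
    scale = 1.0 / (len(nodes) - 1)
    return {node: degree * scale for node, degree in in_degree.items()}


# Toy data: 4 nodes, so every in-degree is divided by 3.
centrality = in_degree_centrality(
    [("inv1", "A"), ("inv2", "A"), ("inv2", "B"), ("inv2", "B")]
)
print(round(centrality["A"], 3), round(centrality["B"], 3))
# prints 0.667 0.333
```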
G) Percentage Category
An increase in the value of both PageRank and degree centrality also indicates a better position within the fundraising ecosystem, i.e. more favourable conditions for getting funds. So it isn’t necessary to reinvent the wheel to know that having more and better connections to big investors gives a company an advantage when it seeks more funding. As Arthur Conan Doyle wrote in The Great Keinplatz Experiment: “Knowledge begets knowledge, as money bears interest”, and this makes even more sense when we think, for instance, of a startup that is well connected in the network and can possibly get access to better mentors, new funding opportunities, etc., although it’s also true that this is no guarantee of success, because many other factors surely have to be considered. On the other hand, from the investor’s point of view, the position of a startup in the network could perhaps be a parameter in the decision to invest.
Anyway, my intention with this post was to dust off my code and also to share some thoughts about APIs: better APIs attract more developers, and that makes it possible to develop better products and new businesses, visualizations, data analysis and so on. As a tip, this website gathers links to 531 reference APIs. Finally, given the data, many other conclusions can be drawn, such as that fashion is the industry that has received the most investment (total fundraised), followed by e-commerce, or that 2014 is turning out to be the year with the biggest investment. But as I mentioned before, these data should be compared with other sources in order to validate any trends. My Barcelona example maybe wasn’t very representative, because I think CrunchBase doesn’t yet have a critical mass in Spain, but it’s just a matter of changing the city and running another analysis. Also, there are many other investor matchmaking websites; AngelList, for example, could be an interesting case to analyze. Furthermore, there are companies like SiSense that develop analytics dashboard software with this type of data, and websites like Startup Genome, Foundum, etc.
The CSV files are in my GitHub repository, so anyone can try the VC fundraising graph with R.
library(RCurl)
library(d3Network)
library(igraph)
library(rCharts)

d1 <- read.csv("barcelona_link.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
d2 <- read.csv("barcelona_role.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
d1 <- unique(d1)
d2 <- unique(d2)

links <- data.frame(d1$company, d1$investor)
colnames(links) <- c('source', 'target')
nodes <- unique(data.frame(d2$name))
colnames(nodes) <- c('name')
m = match(links$source, nodes$name)

g <- graph.data.frame(links, directed = TRUE, nodes)
dat1 <- (data.frame(get.edgelist(g, names = FALSE)) - 1)
links$source <- dat1$X1
links$target <- dat1$X2
nodes['group'] <- 1
nodes$group[m] <- 2

d3ForceNetwork(Links = links, Nodes = nodes, Source = "source", Target = "target",
               NodeID = "name", Group = "group", width = 550, height = 400,
               opacity = 2, linkColour = "#000000", zoom = TRUE, file = "b_new.html")