The modern researcher’s toolbox

We all know that Moore’s law has pushed technology a long way since ENIAC. The way we do research has changed unquestionably, but so has the way we write up our findings. I have the feeling, though, that the latter change mostly comes down to library-roaming being replaced by Google Scholar-ing, especially if you think about the last 15-20 years. The way we compile articles hasn’t really changed: create a Word doc (or, if you’re techy, a LaTeX file) and start right off. You reach a point when you think it is reasonable to share it with your co-authors, and off goes the email, attachment, track changes, and then it’s back to you. Sure, there are some collaborative initiatives, some even with online editing and some of which have been around for quite a while (e.g. Google Docs, Microsoft OneDrive), and some advancements in the way we share files, such as Dropbox. But it’s not only about the writing. The figures are not just Excel or Matlab anymore. They are Python. Or R. And research requires an ever larger amount of code. Then, there is the formatting… You submit, get rejected, reformat and submit again. And repeat until forever.

This got me thinking that there must be some way of streamlining this process. And there is. I just had to put together the right tools. Meet my integrated research environment (much like an integrated development environment, used by coders):

Tools for the integrated research environment of the modern researcher

I’ve stumbled upon this marvellous tool called Authorea, to find that it connects to everything that I was doing before and it ticks the box which was the most annoying for me: automatic reformatting for journals. It also made possible a move that I have been wanting to take for a long time – migrating to writing research in LaTeX from Word. And, in order to smooth that transition, you can first try Markdown.

So here are the steps to take towards writing a research paper, the modern way:

  1. Come up with your amazing new idea
  2. Search for data online, then fire up a Jupyter notebook (formerly IPython) running Python (or R)
  3. Get and massage your data with pandas
  4. Create some nice plots with matplotlib
  5. Save the data into JSON format, then fire up your favorite text editor
  6. Pull the data from the JSON files and create an awesome interactive visualization with D3.js
  7. Create project website in HTML5 and CSS3, host it on GitHub
  8. Create a new Authorea article, set it up to push automatically to a GitHub repository
  9. Invite your collaborators to the article and work simultaneously on the text, with Git version control
  10. Find your inner muse and write up that article in HTML, Markdown or LaTeX
  11. Put in some fancy equations in LaTeX
  12. Paste those graphs you created with matplotlib, then put a link to your source code – this way your readers can actually fire up a live Jupyter notebook on the Authorea server and even play with your code
  13. If you want to step up your game, go ahead and paste that interactive visualization you created in D3.js from the website hosted on GitHub
  14. Paste those references directly from CrossRef, without the need of a citation manager
  15. Chat with your co-authors, review and finalize your article, all in the browser
  16. Go online again and search for the most awesome journal of your choice
  17. Export your article from Authorea in the formatting requirements of your chosen journal – just a click
  18. Get some sleep man!
  19. Repeat steps 15-18 until accepted 🙂
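
Steps 3 to 5 above might look roughly like the following minimal sketch. It assumes pandas is installed; the column names `year` and `value` and the helper `tidy_to_json` are hypothetical and only serve as an illustration, not as the notebook actually used for any project here.

```python
import json

import pandas as pd


def tidy_to_json(records):
    """Clean a list of raw records with pandas and return a JSON string.

    The 'year' and 'value' column names are assumptions for this example.
    """
    df = pd.DataFrame(records)
    # Drop rows with a missing measurement, then convert text to numbers.
    df = df.dropna(subset=["value"])
    df["value"] = df["value"].astype(float)
    # Sort chronologically so the visualization can read rows in order.
    df = df.sort_values("year").reset_index(drop=True)
    # One JSON object per row, a shape that d3.json() can consume directly.
    return df.to_json(orient="records")


sample = [
    {"year": 2014, "value": "3.5"},
    {"year": 2013, "value": None},
    {"year": 2012, "value": "1.0"},
]
payload = tidy_to_json(sample)
rows = json.loads(payload)
```

Writing `payload` to a file gives the JSON that step 6 would then load into D3.js.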

And the best part? All of the above are open source, free tools.

Good luck!

Insurgent Dynamics: A systematic analysis of social unrest using the GDELT Event database

Click here if you prefer to read this post on Medium.com (5 minute read).
Click here to look at the data visualization only on visualizing.org.

In this post I revisit one of my data analysis and visualization projects from last year: an exploration of conflict and insurgency dynamics using data from the GDELT event dataset and a simple epidemiological SIR model. The data visualization was done in Matlab, so it is a bit clunky, but please go ahead and check it out here.

Last year I started writing a paper about the results of my exploration, but it is not ready yet. Meanwhile, here are the findings in brief.

While the social dynamics that may drive social unrest events have been extensively studied recently, and the general patterns in the distribution of event sizes and timings are well known, I tried to delve deeper into the problem and gain insight into the dynamics of individual events. Using an event classification based on news reports from the Global Database of Events, Language and Tone (GDELT), I looked at social unrest events of different types across different scales and timelines and found an underlying repetitive pattern in their dynamics. Using this information, I postulated a simple SIR system dynamics model and simulated it for various types of social unrest for the period covered by GDELT, including all armed conflicts and major protests between 1979 and 2014. I found that the great majority of unrest events are characterized by very similar diffusion and decay rates, independent of their place, time or duration, thus implying a scale-free structure. What is even more interesting is that the variation of these parameters is also small when comparing across different unrest types, such as conflicts and nonviolent protests. The Achilles heel of the analysis is establishing the correlation between actual events and the news reports covering them, for which there is limited literature. I tried to demonstrate the validity of this conjecture through semantic mining of Wikipedia and the BBC Country Profiles databases. So far, I have found that there might be a universality in the dynamics, and this could extend to the dynamics of disaster relief programs or social gatherings.
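
For readers unfamiliar with SIR models, here is a minimal sketch of the textbook version, integrated with a simple forward Euler step. Mapping the "diffusion rate" to `beta` and the "decay rate" to `gamma` is my assumption for illustration, and the parameter values below are hypothetical; they are not the fitted values from the analysis.

```python
def simulate_sir(beta, gamma, s0, i0, r0=0.0, days=100, dt=0.1):
    """Integrate the classic SIR equations with forward Euler steps.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    Returns a list of (time, S, I, R) tuples.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population stays constant
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_active = beta * s * i / n * dt   # susceptible -> active
        new_recovered = gamma * i * dt       # active -> recovered
        s -= new_active
        i += new_active - new_recovered
        r += new_recovered
        history.append((k * dt, s, i, r))
    return history


# Hypothetical parameters, chosen only to produce a visible rise and decay.
trajectory = simulate_sir(beta=0.5, gamma=0.1, s0=990, i0=10)
peak_active = max(i for _, _, i, _ in trajectory)
```

In this reading, the "active" compartment would stand for the daily volume of unrest events, which rises while the unrest spreads and then decays.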

Insurgent Dynamics – A visual exploratory of GDELT events

How is a D3.js visualization made? – the road from CSV to SVG

People tell me that they would like to make visualizations in D3.js, but that it is too complicated: the learning curve is too steep. Even crafting code with C3.js or Vega, which simplify D3.js, seems too complicated. In this post, we will examine the road the data takes from the database or website to the drawing canvas, that is, your computer screen.

In my previous post I explained how to load data with D3.js from the Quandl database aggregator directly into NVD3, an easy-to-use graphing library for D3.js. If you want to visualize just one set of data and don’t worry too much about customization, this is a valid option. However, if you want to add your own touch, combine or extend the data with additional fields, or add your own comments, you would usually have to do some additional data processing.

The global center of mass of higher education: university rankings mapped

Click here if you prefer to read this post on Medium.com (2 minute read).
Click here to look at the data visualization only on visualizing.org.

It is often touted that the world has been shifting towards Asia (on all fronts, even Formula 1 🙂). Indeed, innovation has clearly gained a good foothold in the East and higher education has been no exception: in the last 10 years the global center of mass of the top 500 ranked universities has been constantly shifting eastward, by 650 kilometers to be precise. To visualize this, I have created a dynamic map that tracks the top 500 universities and the global center of mass (geocenter) of higher education over the past 10 years!

Dynamic map of the global center of mass of higher education between 2003-2014 – click for interactive

The geocenter of the top 500 universities in 2003 lay just off the coast of Portugal. In the past decade, this point (red triangles in the map) has been constantly moving eastward, hovering over the border between Morocco and Algeria by 2014. This phenomenon can be attributed to the appearance of many Chinese universities in the list, as well as some from other Asian countries (Malaysia and Saudi Arabia in particular), and the strengthening of Korean and Japanese entries.

The above calculations are the result of an unweighted arithmetic average of the geographic coordinates of the universities that made the top 500 list, meaning that each university that made the list had an equal weight. However, it is fair to calculate the geocenter taking into account the rankings of the universities. When accounting for this (using 1/sqrt(rank) as weights, green triangles in the map) we observe an (expected) shift towards North America, and the United States in particular, where most of the top 100 universities are located, including 8 of the top 10. The weighted geocenter has been hovering further out in the Atlantic, above the Azores, for the past 10 years, but its eastward migration has been smoother than that of the unweighted one. This means that while it is true that many new Asian entries appeared on the list, they have also managed to move up the rankings a bit.
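
Both averages can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic described above, not the notebook used for the map: the function name `geocenter` and the sample coordinates are hypothetical. Note that it averages latitude and longitude directly, as described, rather than computing a centroid on the sphere.

```python
import math


def geocenter(universities, weighted=False):
    """Average the coordinates of (rank, latitude, longitude) tuples.

    With weighted=True each university gets the weight 1/sqrt(rank);
    otherwise every university counts equally.
    """
    total_weight = lat_sum = lon_sum = 0.0
    for rank, lat, lon in universities:
        weight = 1.0 / math.sqrt(rank) if weighted else 1.0
        total_weight += weight
        lat_sum += weight * lat
        lon_sum += weight * lon
    return lat_sum / total_weight, lon_sum / total_weight


# Two made-up entries: rank 1 at (0, 0) and rank 4 at (30, 60).
sample = [(1, 0.0, 0.0), (4, 30.0, 60.0)]
plain = geocenter(sample)                    # plain arithmetic mean
by_rank = geocenter(sample, weighted=True)   # pulled towards the rank 1 entry
```

In the made-up sample, the rank 1 entry has twice the weight of the rank 4 entry, so the weighted geocenter sits a third of the way between them instead of halfway.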

The data source for this visualization was the Academic Ranking of World Universities. The data has been processed into JSON format with this IPython notebook, with help from pandas. The universities have been placed on the world map using topojson, after being geocoded by geopy using the MapQuest and Google Maps V3 APIs. The visualizations have been done entirely in d3.js and SVG. The main outcome is a dynamic map of the global center of mass of higher education between 2003 and 2014 (static, interactive). If you liked this post or have any questions or thoughts, Like, Share, Comment, and Subscribe!

Religious diversity in Romania visualized on colorwheels

Click here if you prefer to read this post on Medium.com (9 minute read).
Click here to look at the data visualization only on visualizing.org.
A localized Hungarian version of this post also exists.

In this post we will visualize and examine the religious breakdown of the country of Romania and its historical regions. We find that the 4 regions exhibit 4 different patterns and various levels of diversity. A good way to do this comparison is via RGB colorwheels. You can read more about these in previous posts here and here. We will use the colorwheel for world religions presented in a previous post, adapted for the regions and dominant religions of Romania.

As part of this adaptation, we extend the 3-axis, red-green-blue colorwheel to cater for the dominant religions of the country. The result is a 6-axis colorwheel with the sequence red-yellow-green-cyan-blue-violet. Then, to each of these colors we attach a religion. Using the statistics of the Romanian National Census Bureau, we can define the dominant religions and aggregate the ones with a smaller number of followers. This process yields the following color-coding:

  • Orthodox
  • Catholic
  • Reformed
  • Unitarian | Other protestant
  • Other religion | Atheist
  • Adventist | Pentecostal 

Technical details:

Using this coding, we plot each of the settlements (at commune-level administrative resolution) onto the colorwheel. The color (hue) of a data point, equivalent to its angle on the colorwheel, indicates the dominant religion of the settlement. The brightness of the color (radius) gives its relative dominance compared to the other religions. (There is a caveat when using colorwheels with more than 3 dimensions: while on a 3-dimensional colorwheel a 40%-40% share of the two dominant colors tells us the exact share of the third color, in higher dimensions this needs extra information.) All points will lie within the indicated hexagon. For visual guidance, we have shown the diagonals. For points at the vertices, one religion has full relative dominance: 100% of the population follows it. A fully red point indicates a 100% orthodox settlement, while a fully yellow point marks a purely catholic commune. Halfway along the line connecting the red and yellow vertices (orange region) lie settlements whose population is half orthodox and half catholic. The presence of other religions will move a point away from this line. When reading the colorwheel, it is important to check the tooltips showing the detailed breakdown of the religions and to monitor the points placed along the edges and diagonals.
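
One way to compute such a position is to treat each religion as a vertex of the hexagon and take the share-weighted average of the vertices. The sketch below assumes exactly that; it is an illustration consistent with the description above, not necessarily the computation behind the infographic. The text only states that red is orthodox and yellow is catholic, so the order of the remaining axes (taken from the list above) is an assumption, as is the function name `colorwheel_point`.

```python
import math

# Axis order: only red = Orthodox and yellow = Catholic are stated in the
# text; assigning the other four in list order is an assumption.
AXES = [
    "Orthodox",                    # red,    0 degrees
    "Catholic",                    # yellow, 60 degrees
    "Reformed",                    # green,  120 degrees (assumed)
    "Unitarian/Other protestant",  # cyan,   180 degrees (assumed)
    "Other religion/Atheist",      # blue,   240 degrees (assumed)
    "Adventist/Pentecostal",       # violet, 300 degrees (assumed)
]


def colorwheel_point(shares):
    """Map religion shares of a settlement to (x, y, hue_degrees, radius).

    Each axis is a hexagon vertex on the unit circle; the point is the
    share-weighted average of the vertices, so it stays inside the hexagon.
    """
    total = sum(shares.get(axis, 0.0) for axis in AXES)
    if total <= 0:
        raise ValueError("shares must contain at least one positive value")
    x = y = 0.0
    for index, axis in enumerate(AXES):
        angle = math.radians(60 * index)
        weight = shares.get(axis, 0.0) / total
        x += weight * math.cos(angle)
        y += weight * math.sin(angle)
    hue = math.degrees(math.atan2(y, x)) % 360
    radius = math.hypot(x, y)
    return x, y, hue, radius


pure = colorwheel_point({"Orthodox": 100})
mixed = colorwheel_point({"Orthodox": 50, "Catholic": 50})
```

A purely orthodox commune lands on the red vertex, while the half-and-half commune lands at a hue of 30 degrees (orange), midway between the red and yellow vertices.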

Let us look at the colorwheel of the religions in Romania (interactive infographic, fullscreen):

Religions of Romania Colorwheel

Religions of Romania Colorwheel – click for interactive

The religions of Romania seem to be clustered into 3 groups. It is clear that the orthodox church is by far the largest, followed by a roughly equal number of catholic and reformed believers. There are also smaller groups of other protestant (mainly unitarian) and adventist/pentecostal (mainly pentecostal) followers, while there are almost no settlements with a dominant religion other than these 5.
