The modern researcher’s toolbox

We all know that Moore’s law has pushed technology a long way since ENIAC. The way we do research has changed unquestionably, but so has the way we write up our findings. I have the feeling, though, that this latter change mostly comes down to library-roaming having been replaced with Google-Scholaring, especially if you think about the last 15-20 years. The way we compile the articles hasn’t really changed… Create a Word doc (or, if you’re techy, a LaTeX file) and start right off. You reach a point when you think it is reasonable to share it with the rest of the co-authors, and off goes the email: attachment, track changes, and then it’s back to you. Sure, there are some collaborative tools with online editing, some of which have been around for quite a while (e.g. Google Docs, Microsoft OneDrive), and some advancements in the way we share files, such as Dropbox. But it’s not only about the writing. The figures are not just Excel or Matlab anymore. They are Python. Or R. And research requires an ever larger amount of code. Then there is the formatting… You submit, get rejected, reformat and submit again. And repeat until forever.

This got me thinking that there must be some way of streamlining this process. And there is. I just had to put together the right tools. Meet my integrated research environment (much like an integrated development environment, used by coders):


Tools for the integrated research environment of the modern researcher

I stumbled upon a marvellous tool called Authorea and found that it connects to everything I was doing before, and it ticks the box that annoyed me the most: automatic reformatting for journals. It also made possible a move that I had been wanting to make for a long time: migrating from writing research in Word to writing it in LaTeX. And, to smooth that transition, you can first try Markdown.

So here are the steps to take towards writing a research paper, the modern way:

  1. Come up with your amazing new idea
  2. Search for data online, then fire up a Jupyter notebook (formerly IPython) running Python (or R)
  3. Get and massage your data with pandas
  4. Create some nice plots with matplotlib
  5. Save the data into JSON format, then fire up your favorite text editor
  6. Pull the data from the JSON files and create an awesome interactive visualization with D3.js
  7. Create project website in HTML5 and CSS3, host it on GitHub
  8. Create a new Authorea article, set it up to push automatically to a GitHub repository
  9. Invite your collaborators to the article – and work simultaneously on the text with Git version control
  10. Find your inner muse and write up that article in HTML, Markdown or LaTeX
  11. Put in some fancy equations in LaTeX
  12. Paste those graphs you created with matplotlib, then put a link to your source code – this way your readers can actually fire up a live Jupyter notebook on the Authorea server and even play with your code
  13. If you want to step up your game, go ahead and paste that interactive visualization you created in D3.js from the website hosted on GitHub
  14. Paste those references directly from CrossRef, without the need of a citation manager
  15. Chat with your co-authors, review and finalize your article – in the browser
  16. Go online again and search for the most awesome journal of your choice
  17. Export your article from Authorea in the formatting requirements of your chosen journal – just a click
  18. Get some sleep man!
  19. Repeat steps 15-18 until accepted 🙂
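Steps 3 to 6 hinge on one handoff: a tidy table leaves Python as JSON, and D3 reads it back in the browser. Here is a minimal sketch of that handoff. It uses only the standard library so that it runs anywhere; with pandas, `DataFrame.to_json(orient="records")` gives the same list-of-records shape. The column names and numbers are made up for illustration.

```python
import csv
import io
import json

# Made-up sample standing in for a downloaded CSV; not real data.
RAW_CSV = """country,year,value
A,2000,1.5
A,2001,2.0
B,2000,3.0
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts with numeric year/value fields."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {
            "country": row["country"],
            "year": int(row["year"]),
            "value": float(row["value"]),
        }
        for row in rows
    ]


records = csv_to_records(RAW_CSV)

# One JSON array of objects: the shape D3 gets when it loads a JSON file.
json_text = json.dumps(records, indent=2)
print(json_text)
```

Writing `json_text` to a file next to your web page is then enough for `d3.json` to pick it up.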

And the best part? All of the above are open source, free tools.

Good luck!

Quandl + NVD3 = Interactive Data Plotter

During my daily work, I do a lot of quick data checks on various (energy and development) indicators. The classic way is to go to EIA, BP, the World Bank or one of the other large, established databases, copy the data into Excel or similar, and start staring. Nowadays, I find it more convenient, and sometimes faster, to load the data directly from the databank into a pandas dataframe and do a prettier-than-Excel plot with matplotlib. But if I want to share the data with somebody or make it interactive, this doesn’t cut it. Then I turn to D3 and NVD3. However, loading the data into D3 is fairly cumbersome, and that is not even the hardest part: if you want to eliminate the intermediary step of processing and formatting with pandas, you have some serious work to do.

Most of the online databases give you the data as an Excel file, and only rarely as CSV or XML. The data almost never comes in native JSON format. And if mining the data was not cumbersome enough, converting it into a D3-readable format will take up a significant amount of your time. Until now: meet Quandl. Quandl is a data aggregator that takes numeric data from the large online databases (or from individuals), normalizes it and gives you access in all formats, including JSON. This means it is enough to write a not-so-complicated JavaScript data parser once; after that, all you have to do is change the Quandl database codes to get pretty plots – a data-blogger’s dream 🙂 This is exactly what I did.

Quandl

Quandl data plotter

Quandl uses short database codes, similar to those of the World Bank, to reference its data. These usually contain the country in 2- or 3-letter ISO format. So first, you need to grab the country names and their corresponding ISO codes from a public CSV file with D3. Then, after some background searching among the Quandl data, you can define the short code that you want to load and observe how the country code is embedded in the link structure. Then, using asynchronous JavaScript requests, you can load the desired indicator for the desired country/year. Voilà. As a bonus, the great thing about Quandl is that you can also upload (via plugins) and use your own datasets for your visualizations, as long as you follow their conventions.
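The lookup-then-substitute idea can be sketched in a few lines. The plotter itself does this in JavaScript with D3; the Python below only illustrates the logic. The country table, the `EXAMPLEDB` prefix and the `_INDICATOR` suffix are all hypothetical placeholders, not real Quandl codes, so you would have to replace them with a pattern you have checked on the data provider’s site.

```python
import csv
import io

# Hypothetical stand-in for the public CSV of country names and ISO codes.
COUNTRIES_CSV = """name,iso3
Germany,DEU
Japan,JPN
Brazil,BRA
"""

# Hypothetical code pattern: the real database prefix and indicator suffix
# are assumptions and must be looked up for the dataset you want.
CODE_TEMPLATE = "EXAMPLEDB/{iso}_INDICATOR"


def load_iso_lookup(text):
    """Map country names to their ISO codes."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: row["iso3"] for row in reader}


def dataset_code(country, lookup, template=CODE_TEMPLATE):
    """Embed the country's ISO code into the dataset code pattern."""
    return template.format(iso=lookup[country])


lookup = load_iso_lookup(COUNTRIES_CSV)
codes = [dataset_code(name, lookup) for name in ("Germany", "Brazil")]
print(codes)
```

Once the pattern is known, switching country or indicator is just a matter of changing the arguments, which is what makes the plotter reusable.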

 

Made with d3, nvd3 and Quandl. The main outcome is the NVD3 + Quandl block.

 

From now on, I will slowly start shifting to d3plus, developed by Alexander Simoes at the MIT Media Lab, because it handles multiple visualization types better than nvd3.

Sankey Diagram Generator

Sankey Demo

Check out the Sankey Diagram Generator I have just made. It supports self loops, moving around nodes in both horizontal and vertical directions and loading and saving diagrams! You can also change the opacity and the density of the links.

Use the Load/Save button to edit/create complex Sankeys.

The source code for the Sankey displayed above is:

{"nodes":[{"name":"Oil"},{"name":"Natural Gas"},{"name":"Coal"},{"name":"Fossil Fuels"},{"name":"Electricity"},{"name":"Energy"}],"links":[{"source":0,"target":3,"value":15},{"source":1,"target":3,"value":20},{"source":2,"target":3,"value":25},{"source":2,"target":4,"value":25},{"source":3,"target":5,"value":60},{"source":4,"target":5,"value":25},{"source":4,"target":4,"value":5}]} 
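To read this format: each link refers to its source and target by their position in the `nodes` array, and `value` is the size of the flow. A small sketch (my own helper, not part of the generator) that parses the string above and sums what goes into and out of each node:

```python
import json
from collections import defaultdict

# The save string of the demo diagram, split over lines for readability.
SANKEY_JSON = (
    '{"nodes":[{"name":"Oil"},{"name":"Natural Gas"},{"name":"Coal"},'
    '{"name":"Fossil Fuels"},{"name":"Electricity"},{"name":"Energy"}],'
    '"links":[{"source":0,"target":3,"value":15},'
    '{"source":1,"target":3,"value":20},{"source":2,"target":3,"value":25},'
    '{"source":2,"target":4,"value":25},{"source":3,"target":5,"value":60},'
    '{"source":4,"target":5,"value":25},{"source":4,"target":4,"value":5}]}'
)


def node_flows(diagram):
    """Return {node name: (total inflow, total outflow)}."""
    names = [node["name"] for node in diagram["nodes"]]
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for link in diagram["links"]:
        # source/target are indices into the nodes array
        outflow[names[link["source"]]] += link["value"]
        inflow[names[link["target"]]] += link["value"]
    return {name: (inflow[name], outflow[name]) for name in names}


flows = node_flows(json.loads(SANKEY_JSON))
for name, (total_in, total_out) in flows.items():
    print(f"{name}: in={total_in} out={total_out}")
```

Note that the last link has Electricity as both source and target; that is the self loop, and it counts towards both totals of that node.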

You can also use the keyword layer to fix the position of nodes along the x-axis. In the above example, if you use "layer":3 for the node Fossil Fuels and "layer":4 for Electricity, they will no longer be aligned: Electricity is placed to the right of Fossil Fuels.

{"nodes":[{"name":"Oil"},{"name":"Natural Gas"},{"name":"Coal"},{"name":"Fossil Fuels","layer":3},{"name":"Electricity","layer":4},{"name":"Energy"}],"links":[{"source":0,"target":3,"value":15},{"source":1,"target":3,"value":20},{"source":2,"target":3,"value":25},{"source":2,"target":4,"value":25},{"source":3,"target":5,"value":60},{"source":4,"target":5,"value":25},{"source":4,"target":4,"value":5}]} 


You can also fix the size of a node using the keyword value.

{"nodes":[{"name":"Oil"},{"name":"Natural Gas"},{"name":"Coal"},{"name":"Fossil Fuels","layer":3,"value":10},{"name":"Electricity","layer":4},{"name":"Energy"}],"links":[{"source":0,"target":3,"value":15},{"source":1,"target":3,"value":20},{"source":2,"target":3,"value":25},{"source":2,"target":4,"value":25},{"source":3,"target":5,"value":60},{"source":4,"target":5,"value":25},{"source":4,"target":4,"value":5}]} 


UPDATE: Using the keyword fill, you can also color the nodes (and automatically their links).

{"nodes":[{"name":"Oil"},{"name":"Natural Gas"},{"name":"Coal","fill":"black"},{"name":"Fossil Fuels","layer":3,"value":10},{"name":"Electricity","layer":4},{"name":"Energy"}],"links":[{"source":0,"target":3,"value":15},{"source":1,"target":3,"value":20},{"source":2,"target":3,"value":25},{"source":2,"target":4,"value":25},{"source":3,"target":5,"value":60},{"source":4,"target":5,"value":25},{"source":4,"target":4,"value":5}]} 
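For larger diagrams, editing these keys by hand gets tedious. Below is a minimal sketch of setting them from a script before pasting the result into the Load dialog. The function `annotate_node` is a hypothetical helper of mine, not something the generator provides; only the keys `layer`, `value` and `fill` come from the tool.

```python
import json


def annotate_node(diagram, name, **attrs):
    """Return a copy of the diagram with extra keys set on the named node."""
    updated = json.loads(json.dumps(diagram))  # cheap deep copy via JSON
    for node in updated["nodes"]:
        if node["name"] == name:
            node.update(attrs)
            return updated
    raise KeyError(f"no node named {name!r}")


# A cut-down version of the demo diagram, to keep the example short.
base = {
    "nodes": [{"name": "Coal"}, {"name": "Fossil Fuels"}, {"name": "Energy"}],
    "links": [
        {"source": 0, "target": 1, "value": 25},
        {"source": 1, "target": 2, "value": 25},
    ],
}

styled = annotate_node(base, "Fossil Fuels", layer=3, value=10)
styled = annotate_node(styled, "Coal", fill="black")

# Compact separators give the same dense form as the save strings above.
print(json.dumps(styled, separators=(",", ":")))
```

The original `base` dictionary is left untouched, so you can generate several styled variants from the same structure.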



UPDATE 2: Today the Sankey Diagram Generator got a major update: I have been working on the load and save functions to include the layout. Another minor update is that you can now toggle the node labels (both the text and values) on/off. On top of giving you the option to save the Sankey structure and layout, you can now also save the diagram as a PNG image.

Now, when you save the Sankey code, a checkbox shows up next to the Done button, giving you the option to save the Sankey layout for loading later. This includes the node and link positions, as well as the settings for opacity and density. This is a major milestone, as it has been a headache to redesign Sankeys: previously only the structure was saved, not the layout. On the save screen you now also have the option to download the diagram as an image.

As a result of these modifications, the structure of the Sankey save string changed a little. In order to preserve backward compatibility, the Sankey structure code, which made up the entirety of the save string up until now, was put under the key "sankey", the parameters for label display, density and opacity under the key "params" and finally the layout, if selected, under "fixedlayout". Subsequently, when loading a Sankey save string back, you are given the option to try to read the layout from the string. If you do not provide the layout, or you choose to ignore it (via a checkbox next to the Done button), the algorithm computes the layout for you automatically, as before.
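If you process save strings outside the tool, you will meet both the old and the new shape. Here is a sketch of how a script could accept either one. This is my own illustration of the key layout described above, not the generator’s actual loading code, and the function name is hypothetical.

```python
import json


def normalize_save_string(text):
    """Accept an old or new style save string and return the new shape.

    Old style: the structure ({"nodes": ..., "links": ...}) is the whole string.
    New style: the structure sits under "sankey", with optional
    "params" and "fixedlayout" keys next to it.
    """
    data = json.loads(text)
    if "sankey" in data:
        return {
            "sankey": data["sankey"],
            "params": data.get("params"),
            "fixedlayout": data.get("fixedlayout"),
        }
    # Old style: wrap the bare structure; no params or layout were saved.
    return {"sankey": data, "params": None, "fixedlayout": None}


old_style = '{"nodes":[{"name":"a"},{"name":"b"}],"links":[{"source":0,"target":1,"value":10}]}'
new_style = '{"sankey":{"nodes":[{"name":"a"}],"links":[]},"params":[0.5,0.25,0,0]}'

from_old = normalize_save_string(old_style)
from_new = normalize_save_string(new_style)
print(from_old["fixedlayout"], from_new["params"])
```

A missing "fixedlayout" comes back as `None`, which matches the case where the layout has to be computed automatically.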

UPDATE 3: Some users have requested the ability to create Sankeys with multiple flows between the same two nodes. This can be interpreted as having parallel links. With a small number of parallel links this is not a problem, but with many of them the relaxation algorithm fails to lay out the links correctly. This algorithm is at the core of the rendering and is therefore hard to change, as it is designed to optimize the layout in general (minimize the total link path length in the connected component). However, I have included an experimental feature to correctly display Sankeys with many parallel links, although it will not necessarily offer the best layout for regular Sankeys. In sankey.js, there is a function called computeLinkDepths which sorts the links in ascending order at the sources and subsequently at the targets. This leads to some ordering conflicts (two source nodes going to the same target, with links of different value, will not both have their top link at the top at the target), which are then resolved so that as few conflicts as possible remain. In the case of parallel links, this creates a messy layout. To solve this, I have included a toggle for parallel rendering. This is a bit of an advanced feature and I encourage you to try to understand the sankey.js code structure before you turn it on. How to turn it on? First, make sure that in your input string all parallel links are sorted by value and grouped by source node. An example of this would be:

{"sankey":{"nodes":[{"name":"a"},{"name":"b"},{"name":"c"},{"name":"d"}],"links":[{"source":0,"target":1,"value":10},{"source":0,"target":1,"value":70},{"source":0,"target":1,"value":80},{"source":0,"target":1,"value":100},{"source":0,"target":1,"value":200},{"source":0,"target":1,"value":700},{"source":0,"target":1,"value":800},{"source":0,"target":2,"value":20},{"source":0,"target":2,"value":30},{"source":0,"target":2,"value":40},{"source":0,"target":2,"value":300},{"source":0,"target":2,"value":400},{"source":0,"target":3,"value":60},{"source":0,"target":3,"value":600},{"source":2,"target":3,"value":50}]},"params":[0.5,0.25,0,0]} 

Then open up a console in your browser and turn on parallel rendering by typing the following:

parallelrendering=true

Then hit the Draw Sankey button to redraw the diagram and you should be seeing your Sankey with many parallel links loaded correctly. You can turn it off either by refreshing the page or setting it back to false in the console. Have fun!
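If your links come out of a script in arbitrary order, you can put them into the required order before pasting the string. A minimal sketch, assuming the ordering seen in the example above: grouped by source, then by target, with values ascending within each group. The grouping by target is my reading of that example rather than a documented rule.

```python
import json


def sort_parallel_links(links):
    """Order links by source node, then target node, then ascending value."""
    return sorted(
        links,
        key=lambda link: (link["source"], link["target"], link["value"]),
    )


# Unordered links between a few nodes, including parallel ones (0 -> 1).
unsorted_links = [
    {"source": 0, "target": 2, "value": 30},
    {"source": 0, "target": 1, "value": 700},
    {"source": 2, "target": 3, "value": 50},
    {"source": 0, "target": 1, "value": 10},
    {"source": 0, "target": 2, "value": 20},
]

ordered = sort_parallel_links(unsorted_links)
print(json.dumps(ordered, separators=(",", ":")))
```

The sorted list goes under "links" inside the "sankey" key of the save string; the nodes do not need to be touched.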

UPDATE 4: Added an option for adjusting decimals for nodes and links, as well as a counter for the node editor on the right, making it easier to create larger diagrams.


Made with D3.js & Dragdealer. If you would like to show your support for my work, please consider a small donation. For a more advanced, applied implementation of this tool, see the Food Energy Flows Exploratorium.
