By Gabriele Girelli

R's need for speed:
plotting millions of points in seconds!

Have you ever had to generate a scatterplot with one million points, or more? As a bioinformatician working in academia, and specifically on large datasets, this happens to me almost daily.

My tool of choice for plotting is always R, and more specifically the ggplot2 package, which implements the grammar of graphics. But when handling such large amounts of data, I always hit quite the bottleneck: plotting can take forever.

Until now, I usually plotted just a few randomly selected points while fixing the figure style. Only then would I generate the final plot using all data points, sometimes waiting more than 5 min for it to be generated and exported to a png file.
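For reference, that subsampling workflow can be sketched in a few lines of base R. This is only an illustration: the names `full_data`, `preview_idx` and `preview_data`, as well as the subset size of 10,000, are assumptions made for the example and are not taken from my actual scripts.

```r
# Hypothetical stand-in for a large dataset (names are illustrative)
set.seed(42)
full_data <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Keep 10,000 randomly chosen rows (sampled without replacement)
preview_idx <- sample(nrow(full_data), size = 1e4)
preview_data <- full_data[preview_idx, ]

# Tune symbols, sizes, and labels on the small subset first...
plot(preview_data$x, preview_data$y, pch = 20, cex = 0.5,
     xlab = "x", ylab = "y", main = "Preview (10,000 points)")

# ...then rerun the same call on full_data for the final figure.
```

The subset renders quickly, so the styling loop stays interactive; only the very last call pays the full plotting cost.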

Today, I finally got tired of it and went down a rabbit hole of DDG searches (yes, I use DuckDuckGo, and you should too!). Here is what I unearthed.

How fast is plotting with R and ggplot2?

Let’s start by generating a dataset of 1 million X and Y coordinates, normally distributed:

library(data.table)
# One million (x, y) pairs drawn from a standard normal distribution
pdata = data.table(x=rnorm(1e6), y=rnorm(1e6))

How long would the default R plot() and ggplot methods take to plot this?

system.time(with(pdata, plot(x, y)))

user system elapsed
11.481 0.048 11.530

library(ggplot2)
system.time(print(ggplot(pdata, aes(x=x, y=y)) + geom_point()))

user system elapsed
13.331 0.220 13.552

And here is our starting point: base R takes around 11.5 s and ggplot2 even longer, at ~13.6 s.

Using pch='.' is fast (!!!)

One of the tips I found on the web comes from a StackOverflow answer, which recommends the pch='.' option to plot each data point as a single pixel, without anti-aliasing.

system.time(print(ggplot(pdata, aes(x=x, y=y)) + geom_point(pch='.')))

user system elapsed
2.688 0.100 2.787

This provides a ~5x speed up, from 13.6 s to less than 3 s!
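Since pch is a standard graphical parameter, the same trick should also apply to base R's plot(). The snippet below is a minimal sketch and not something I benchmarked for this post: it uses fewer points (1e5) to stay quick, the variable names are made up for the example, and the timings will depend on your machine and graphics device.

```r
set.seed(1)
x <- rnorm(1e5)  # fewer points than the 1e6 used above, to keep the sketch quick
y <- rnorm(1e5)

# Default plotting symbol (open circles)
t_default <- system.time(plot(x, y))[["elapsed"]]

# pch = "." draws each point as a tiny rectangle instead
t_dot <- system.time(plot(x, y, pch = "."))[["elapsed"]]

# Elapsed seconds for both variants, side by side
c(default = t_default, dot = t_dot)
```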

scattermore is faster

Then, I found another StackOverflow answer, in which a user recommends his new R package scattermore (at the time of writing this post, the last commit to the package was on Jan 31st, 2021). It uses C code to rasterize the dots into a bitmap, which is then plotted with R.

library(scattermore)
system.time(print(ggplot(pdata, aes(x=x, y=y)) +
    geom_scattermore()))

user system elapsed
0.987 0.060 1.047

The overall speed-up is now ~13x: from 13.55 s to ~1 s!
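As a quick sanity check, the speed-up factors can be recomputed from the elapsed times reported above. The vector names below are just labels chosen for this example.

```r
# Elapsed times (in seconds) reported in the three ggplot2 benchmarks above
elapsed <- c(ggplot_default = 13.552, ggplot_pch_dot = 2.787, scattermore = 1.047)

# Speed-up of each approach relative to the default geom_point()
speedup <- elapsed[["ggplot_default"]] / elapsed
round(speedup, 1)
# ggplot_default ggplot_pch_dot    scattermore
#            1.0            4.9           12.9
```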

So, if you have to plot a huge number of points in a scatterplot, as I often do, I would highly recommend using scattermore. And a huge shout-out to exaexa for implementing and sharing this amazingly fast package!
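If the default look is too coarse for your figure, geom_scattermore() accepts a couple of arguments that control the rasterization. The sketch below is based on my reading of the package documentation and was not part of the benchmarks above; the data frame `pts` and the chosen values are purely illustrative, so please check ?geom_scattermore for the version you have installed.

```r
library(ggplot2)
library(scattermore)

set.seed(1)
pts <- data.frame(x = rnorm(1e5), y = rnorm(1e5))  # illustrative data

p <- ggplot(pts, aes(x = x, y = y)) +
  geom_scattermore(
    pointsize = 2,          # radius of the rasterized points (0 = single pixels)
    pixels = c(1024, 1024)  # resolution of the bitmap, X and Y
  )
print(p)
```

Larger points and a finer bitmap should cost some speed, so it is probably worth timing a few combinations with system.time() on your own data.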

If you had already heard of this package, good for you. Otherwise, I hope this piece helped you somehow ☺️ Peace out ✌️

📷 Cover picture by @crisovalle, on Unsplash.com ❤️