R's need for speed: plotting millions of points in seconds!
Have you ever had to generate a scatterplot with one million points or more? As a bioinformatician working in academia, specifically on large datasets, I have to do this almost daily.
My tool of choice for plotting is always R, and more specifically the ggplot2 package with its grammar of graphics. But when handling such large amounts of data, I always hit the same bottleneck: plotting can take forever.
Until now, I usually plotted just a few randomly selected points while fixing the figure style. Only then would I generate the final plot using all data points, sometimes waiting more than 5 min for it to be generated and exported to a PNG file.
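The preview-then-render workflow described above can be sketched in a few lines of base R. This is only an illustration under assumptions: the data frame `full_data`, the helper name `sample_rows`, and the preview size of 10,000 rows are all made up for the example. The same row indexing should also work on a data.table.

```r
# Hypothetical stand-in for a large dataset (names and sizes are assumptions).
set.seed(42)
full_data <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Hypothetical helper: draw n random rows (without replacement) for previews.
sample_rows <- function(d, n) {
  n <- min(n, nrow(d))                     # never ask for more rows than exist
  d[sample(nrow(d), n), , drop = FALSE]
}

preview <- sample_rows(full_data, 1e4)

# Tune colours, labels and point sizes on the small preview first...
plot(preview$x, preview$y, pch = 20, cex = 0.3, xlab = "x", ylab = "y")

# ...then swap `preview` for `full_data` in the final, slow render.
```

Fixing the seed keeps the preview reproducible between styling iterations, so visual changes come from the style and not from a different random subset.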
Today, I finally got tired of it and went down a rabbit hole of DDG searches (yes, I use DuckDuckGo, and you should too!). Here is what I unearthed.
How fast is plotting with R and ggplot2?
Let’s start by generating a dataset of 1 million X and Y coordinates, normally distributed:
library(data.table)
pdata <- data.table(x = rnorm(1e6), y = rnorm(1e6))
How long would the default R and ggplot2 methods take to plot this?
system.time(with(pdata, plot(x, y)))
   user  system elapsed
 11.481   0.048  11.530

library(ggplot2)
system.time(print(ggplot(pdata, aes(x = x, y = y)) + geom_point()))
   user  system elapsed
 13.331   0.220  13.552
And here is our starting point: base R takes around 11.5 s, and ggplot2 even longer, at ~13.6 s.
pch='.' is fast.
One of the tips I found on the web comes from a StackOverflow answer, which recommends the pch='.' option to plot data points as non-aliased single pixels.
system.time(print(ggplot(pdata, aes(x = x, y = y)) + geom_point(pch = '.')))
   user  system elapsed
  2.688   0.100   2.787
This provides a ~5x speed-up, from 13.6 s to less than 3 s!
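The same option also exists in base graphics, where `pch = "."` draws each point as a tiny rectangle instead of a circle. Below is a minimal sketch for checking its effect on your own machine. It assumes a smaller sample (`n = 1e5`, chosen only to keep the run short) and uses a hypothetical helper named `time_base_plot`. No timings are quoted, since they depend on the hardware and on the graphics device in use.

```r
set.seed(1)
n <- 1e5                                   # assumption: smaller than the post's 1e6
d <- data.frame(x = rnorm(n), y = rnorm(n))

# Hypothetical helper: elapsed seconds for one base-R scatterplot.
time_base_plot <- function(d, ...) {
  unname(system.time(plot(d$x, d$y, ...))["elapsed"])
}

t_default <- time_base_plot(d)             # default symbol (open circles)
t_dot     <- time_base_plot(d, pch = ".")  # tiny rectangles, at least one pixel

cat(sprintf("default: %.2f s, pch='.': %.2f s\n", t_default, t_dot))
```

Raising `n` to `1e6` should make the comparison closer to the one in this post, at the cost of a longer run.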
scattermore is faster!
Then I found another StackOverflow answer, in which a user recommends his new R package scattermore (last commit to the package was on Jan 31st, 2021, at the time of writing this post). It uses C code to rasterize the dots into a bitmap, which is then plotted with R.
library(scattermore)
system.time(print(ggplot(pdata, aes(x = x, y = y)) + geom_scattermore()))
   user  system elapsed
  0.987   0.060   1.047
The overall speed-up is now ~13x: from 13.55 s to ~1 s!
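The speed-up figures quoted in this post can be rechecked with a little arithmetic. The sketch below simply divides the elapsed time of the default `geom_point()` call by each of the other elapsed times reported above; the vector names are labels made up for this example.

```r
# Elapsed times in seconds, copied from the system.time() outputs in this post.
elapsed <- c(
  base_plot   = 11.530,
  geom_point  = 13.552,
  pch_dot     = 2.787,
  scattermore = 1.047
)

# Speed-up of each method relative to the default geom_point() call.
speedup <- elapsed["geom_point"] / elapsed
print(round(speedup, 1))
```

This gives roughly 4.9x for `pch='.'` and 12.9x for scattermore, in line with the rounded ~5x and ~13x figures.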
So, if you have to plot a huge number of points in a scatterplot, as I often do, I would highly recommend using scattermore. And a huge shout-out to exaexa for implementing and sharing this amazingly fast package!
If you had already heard about this package, good for you. Otherwise, I hope this piece helped you somehow. Peace out!
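For anyone following that recommendation, `geom_scattermore()` also takes a couple of rendering arguments that are worth knowing about. The sketch below uses `pointsize` and `pixels`; the specific values are arbitrary picks for illustration, not settings tested in this post, and a plain data frame stands in for the data.table used earlier.

```r
library(ggplot2)
library(scattermore)

set.seed(1)
pdata <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# pointsize and pixels are geom_scattermore() arguments; the values below
# are assumptions chosen for illustration only.
p <- ggplot(pdata, aes(x = x, y = y)) +
  geom_scattermore(
    pointsize = 2,            # radius of each rasterized point, in pixels
    pixels = c(1000, 1000)    # resolution of the intermediate bitmap
  )

print(p)
```

A larger bitmap should give a sharper image when the figure is exported at high resolution, presumably at some cost in rendering time.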