
R's need for speed:
plotting millions of points in seconds!


Have you ever had to generate a scatterplot with one million points or more? As a bioinformatician working in academia, specifically on large datasets, this happens to me almost daily.

My tool of choice for plotting is always R, and more specifically the grammar-of-graphics package ggplot2. But when handling such large amounts of data, I always hit the same bottleneck: plotting can take forever.

Until now, I usually plotted just a few randomly selected points while fixing the figure style. Only then would I generate the final plot using all data points, sometimes waiting more than 5 minutes for it to be rendered and exported to a PNG file.
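That preview workflow can be sketched roughly as follows. This is only a minimal illustration in base R: the names `full`, `preview` and `n_preview`, the subset size and the styling options are all assumptions made up for the example.

```r
# Hypothetical example data: 1 million normally distributed x/y pairs.
set.seed(1)  # fixed seed so the preview subset is reproducible
full <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Draw 10,000 row indices without replacement and keep only those rows.
n_preview <- 1e4
preview <- full[sample(nrow(full), n_preview), ]

# Iterate on the figure style using the small subset...
plot(preview$x, preview$y, pch = 20, col = "steelblue")
# ...and swap `preview` for `full` only for the final figure.
```

The subset renders almost instantly, so the slow full-size plot is needed only once, at the very end.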

Today, I finally got tired of it and went down a rabbit hole of DDG searches (yes, I use DuckDuckGo, and you should too!). Here is what I unearthed.

How fast is plotting with R and ggplot2?

Let’s start by generating a dataset of 1 million X and Y coordinates, normally distributed:

library(data.table)
pdata <- data.table(x = rnorm(1e6), y = rnorm(1e6))

How long would the default R plot() and ggplot methods take to plot this?

system.time(with(pdata, plot(x, y)))

   user  system elapsed 
 11.481   0.048  11.530

library(ggplot2)
system.time(print(ggplot(pdata, aes(x = x, y = y)) + geom_point()))

   user  system elapsed 
 13.331   0.220  13.552 

And here is our starting point: base R takes around 11.5 s, and ggplot2 even longer at ~13.6 s.
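The timings above measure drawing to the current device. If what matters to you is the time to get a PNG file on disk, a small wrapper along these lines might help. It is only a sketch: `time_png` is a hypothetical helper name, the 800x800 size is arbitrary, and it assumes your R build has a working `png()` device.

```r
# Hypothetical helper: time how long it takes to draw a plot into a PNG file.
time_png <- function(draw, file = tempfile(fileext = ".png")) {
  timing <- system.time({
    png(file, width = 800, height = 800)  # open the PNG device
    draw()                                # run the plotting code
    dev.off()                             # closing the device writes the file
  })
  timing[["elapsed"]]
}

# Smaller example data so the sketch runs quickly.
x <- rnorm(1e5)
y <- rnorm(1e5)
elapsed <- time_png(function() plot(x, y))
```

For a ggplot2 plot `p`, the call would be `time_png(function() print(p))`, because ggplot objects are drawn only when printed.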

Using pch='.' is fast.

One of the tips I found on the web comes from a StackOverflow answer recommending the pch='.' option, which draws each data point as a tiny square, as small as a single pixel, instead of the default circle.

system.time(print(ggplot(pdata, aes(x = x, y = y)) + geom_point(pch = '.')))

   user  system elapsed 
  2.688   0.100   2.787 

This provides a ~5x speed-up, from 13.6 s to less than 3 s!
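Since pch is a standard graphical parameter, the same trick should also apply to base R's plot(). Here is a minimal sketch; no timing is shown because this variant was not part of the benchmark above, and the names `x`, `y` and `t_dot` are just for illustration.

```r
# Example data: 1 million normally distributed x/y pairs.
x <- rnorm(1e6)
y <- rnorm(1e6)

# Same idea as above, but with base graphics instead of ggplot2.
t_dot <- system.time(plot(x, y, pch = "."))
t_dot
```

If the dots turn out too small to see, the `cex` argument can scale them up, most likely at some cost in speed.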

scattermore is faster!

Then I found another StackOverflow answer, in which a user recommends his new R package, scattermore (the last commit to the package was on Jan 31st, 2021, at the time of writing this post). It uses C code to rasterize the points into a bitmap, which R then draws.

library(scattermore)
system.time(print(ggplot(pdata, aes(x = x, y = y)) + geom_scattermore()))

   user  system elapsed 
  0.987   0.060   1.047

The overall speed-up is now ~13x: from 13.55 s to ~1 s!
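For the record, here is the arithmetic behind those speed-up figures, using the elapsed times reported above. The vector names are made up for this sketch.

```r
# Elapsed times (in seconds) as reported by system.time() in this post.
timings <- c(
  base_plot          = 11.530,
  ggplot_default     = 13.552,
  ggplot_pch_dot     = 2.787,
  ggplot_scattermore = 1.047
)

# Speed-up of each method relative to the default ggplot2 geom_point().
speedup <- timings[["ggplot_default"]] / timings
round(speedup, 1)
# base_plot: 1.2, ggplot_default: 1.0, ggplot_pch_dot: 4.9, ggplot_scattermore: 12.9
```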

So, if you have to plot a huge number of points in a scatterplot, as I often do, I would highly recommend using scattermore. And a huge shout-out to exaexa for implementing and sharing this amazingly fast package!
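If the default look is too sparse or too coarse for your figure, geom_scattermore() also takes, as far as I understand the package documentation, a `pointsize` and a `pixels` argument. The values below are arbitrary and were not benchmarked in this post, so treat this as a sketch to experiment with rather than a recommendation.

```r
library(ggplot2)
library(scattermore)

# Smaller example data so the sketch runs quickly.
pdata_small <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

p <- ggplot(pdata_small, aes(x = x, y = y)) +
  geom_scattermore(
    pointsize = 2,             # assumed: larger values draw bigger dots
    pixels = c(1000, 1000)     # assumed: resolution of the rasterized bitmap
  )
print(p)
```

A higher bitmap resolution should look sharper in large figures, probably at the price of some extra rendering time.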

If you had already heard of this package, good for you. Otherwise, I hope this piece helped you somehow :smile: Peace out :v:


:heart: :camera: Cover picture by @crisovalle, on Unsplash.com

