I have a dataframe in Pandas, and I want to do some statistics on it using R functions. No problem! rpy2 makes it easy to send a dataframe from Pandas into R:
```python
import pandas as pd
from rpy2 import robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # enable automatic pandas <-> R dataframe conversion

df = pd.DataFrame(index=range(100000), columns=range(100))
ro.globalenv['df'] = df
```
And if we're in IPython:
```python
%load_ext rmagic  # in newer IPython, the extension is rpy2.ipython
%R -i df
```
For some reason the `ro.globalenv` route is slightly slower than the `rmagic` route, but no matter. What matters is this: the dataframe I will ultimately be using is ~100 GB. This presents a few problems:
- Even with just 1GB of data, the transfer is rather slow.
- If I understand correctly, this creates two copies of the dataframe in memory: one in Python, and one in R. That means I'll have just doubled my memory requirements, and I haven't even gotten to running statistical tests!
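To make the memory-doubling concern concrete, here is a rough, pure-pandas sketch (no rpy2 required) that estimates the footprint of one copy versus two. The 100000 x 100 shape is just the toy example from above, not the real 100 GB frame:

```python
import pandas as pd

# Build a float dataframe matching the toy example's shape.
df = pd.DataFrame(0.0, index=range(100000), columns=range(100))

# memory_usage(deep=True) sums the per-column buffers plus the index.
mb = df.memory_usage(deep=True).sum() / 1e6
print(f"one copy: ~{mb:.0f} MB; two copies: ~{2 * mb:.0f} MB")
# → one copy: ~80 MB; two copies: ~160 MB
```

Scale that up by three orders of magnitude and keeping a second full copy alive on the R side clearly stops being an option.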
Is there any way to:
- transfer a large dataframe between Python and R more quickly?
- access the same object in memory from both languages? I suspect this is asking for the moon.