I have been using numpy/scipy for data analysis. I recently started to learn Pandas.
I have gone through a few tutorials and I am trying to understand what are the major improvement of Pandas over Numpy/Scipy.
It seems to me that the key idea of Pandas is to wrap up different numpy arrays in a Data Frame, with some utility functions around it.
Is there something revolutionary about Pandas that I just stupidly missed?
Best How To :
Pandas is not particularly revolutionary and does use the NumPy and SciPy ecosystem to accomplish it's goals along with some key Cython code. It can be seen as a simpler API to the functionality with the addition of key utilities like joins and simpler group-by capability that are particularly useful for people with Table-like data or time-series. But, while not revolutionary, Pandas does have key benefits.
For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):
- Array of Structures (independent-storage of disparate types instead of the contiguous storage of structured arrays in NumPy) --- this will allow faster processing in many cases.
- Simpler interfaces to common operations (file-loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
- Index arrays which mean that operations are always aligned instead of having to keep track of alignment yourself.
- Split-Apply-Combine is a powerful way of thinking about and implementing data-processing
However, there are downsides to Pandas:
- Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into repeatedly using them even when you don't need to and slowing down code that gets called over and over again.
- Pandas typically takes up more memory as it is generous with the creation of object arrays to solve otherwise sticky problems of things like string handling.
- If your use-case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realms of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.