I would like to select a random row from a very large table (10 mil records). So the strategy that are most common such as
NEWID() doesn't seem to be practical.
I have tried the following strategy and would like to know if this is the most ideal way.
Create a new field called 'RandomSort' as UniqueIdentified
At the end of each hour/day will do a
Update RandomSort = NewID() to the entire table
- Each time I need to query, I can do a
Top 10 Order by RandomSort
It does get the job done (better than
ORDER BY NewID), but not sure if this is the best practice so far?
Best How To :
Add an identity column 'rowid' (
bigint depending on your table size) and create a unique non-clustered index on it.
The following query uses the
NEWID() function to return approximately one percent of the rows of the table:
SELECT * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), rowID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
The rowId column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression
CAST(CHECKSUM(NEWID(), rowid) & 0x7fffffff AS float / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
In fact you could use any column indexed column in your table (I believe).
If you just want to pick a single random row:
SELECT TOP 1 * FROM table
WHERE rowid >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(rowid) FROM table)
This works in constant time, provided the
rowid column is indexed. Note: this assumes that
rowid is uniformly distributed in the range
0..MAX(rowid), hence the suggested identity column addition. If your dataset has some other distribution, your results will be skewed (i.e. some rows will be picked more often than others).