Consider a character vector, `pool`

, whose elements are (zero-padded) binary numbers with up to `max_len`

digits.

```
max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x)
do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
pool
## [1] "0" "1" "00" "10" "01" "11" "000" "100" "010" "110"
## [11] "001" "101" "011" "111" "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"
```

I'd like to sample `n`

of these elements, with the constraint that none of the sampled elements are prefixes of any of the other sampled elements (i.e. if we sample `1101`

, we ban `1`

, `11`

, and `110`

, while if we sample `1`

, we ban those elements beginning with `1`

, such as `10`

, `11`

, `100`

, etc.).

The following is my attempt using `while`

, but of course this is slow when `n`

is large (or when it approaches `2^max_len`

).

```
set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
paste0('^', chosen[!prefixes]), pool)) == 0]
chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}
chosen
## [1] "0100" "0101" "0001" "0011" "1000" "111" "0000" "0110" "1100" "0111"
```

This can be improved slightly by initially removing from `pool`

those elements whose inclusion would mean there are insufficient elements left in `pool`

to take a total sample of size `n`

. E.g., when `max_len = 4`

and `n > 9`

, we can immediately remove `0`

and `1`

from `pool`

, since by including either, the maximum sample would be 9 (either `0`

and the eight 4-character elements beginning with `1`

, or `1`

and the eight 4-character elements beginning with `0`

).

Based on this logic, we could then omit elements from `pool`

, prior to taking the initial sample, with something like:

```
pool <- pool[
nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]
```

Can anyone think of a better approach? I feel like I'm overlooking something much more simple.

**EDIT**

To clarify my intentions, I'll portray the pool as a set of branches, where the junctions and tips are the nodes (elements of `pool`

). Assume the yellow node in the following figure (i.e. 010) was drawn. Now, the entire red "branch", which consists of nodes 0, 01, and 010, is removed from the pool. This is what I meant by forbidding sampling of nodes that "prefix" nodes already in our sample (as well as those already *prefixed* by nodes in our sample).

If the sampled node was half-way along a branch, such as 01 in the following figure, then all red nodes (0, 01, 010, and 011) are disallowed, since 0 prefixes 01, and 01 prefixes both 010 and 011.

I don't mean to sample *either* 1 *or* 0 at each junction (i.e. walking along the branches flipping coins at forks) - it's fine to have both in the sample, as long as: (1) parents (or grand-parents, etc.) or children (grandchildren, etc.) of the node aren't already sampled; and (2) upon sampling the node there will be sufficient nodes remaining to achieve the desired sample of size `n`

.

In the second figure above, if 010 was the first pick, all nodes at black nodes are still (currently) valid, assuming `n <= 4`

. For example, if `n==4`

and we sampled node 1 next (and so our picks now included 01 and 1), we would subsequently disallow node 00 (due to rule 2 above) but could still pick 000 and 001, giving us our 4-element sample. If `n==5`

, on the other hand, node 1 would be disallowed at this stage.