(Analysis by Benjamin Qi)

For $N\le 20,$ any reasonable polynomial-time solution should work. One possible approach is to calculate the result for all $n\le N, k\le \binom{n}{2}$ in $O(N^7).$

For additional points, we should find a way to compute $d_i(a)$ without explicitly constructing the tree. The key condition is that $j$ is an ancestor of $i$ if $a[j]=\min(a[i\ldots j]),$ so it follows that

$$d_i(a)=1+\sum_{1\le j<i}(a[j] == \min(a[j\ldots i]))+\sum_{i<j\le n}(a[j] == \min(a[i\ldots j])).$$
Let's focus on counting the number of permutations $a$ such that $a[j] == \min(a[i\ldots j])$ for some fixed pair $(i,j)$ satisfying $i<j.$ We'll do this by constructing $a$ one element at a time.

First, we start with a sequence consisting of $a[i]$ only. Then $a[i+1]$ can be either greater than $a[i]$ or less than $a[i],$ contributing $0$ or $1$ inversion. Then $a[i+2]$ can take on any of three different values relative to $a[i]$ and $a[i+1],$ contributing anywhere from $0$ to $2$ inversions. Continuing in this fashion, the possible numbers of inversions in the sub-permutation $a[i\ldots j-1]$ can be represented by the polynomial product

$$\prod_{t=1}^{j-i}\left(\sum_{u=0}^{t-1}x^u\right).$$
This is known as a generating function because we are encoding a sequence using a polynomial. If we expand it and group together the terms with the same power of $x,$ then a term in the form $cx^d$ means that there are exactly $c$ permutations with $d$ inversions.

Adding $a[j]$ contributes $j-i$ inversions regardless of how many inversions $a[i\ldots j-1]$ has. Then we should add the remaining elements of the permutation, each of which can go anywhere in the sorted order. Thus, the final result is given by the generating function

$$\prod_{t=1}^{n}\left(\sum_{u=0}^{t-1}x^u\right)\cdot \frac{1}{\sum_{u=0}^{j-i}x^u}\cdot x^{j-i}.$$
The first part of this product does not depend on $i$ or $j,$ and we can calculate it in $O(N^3)$ time with prefix sums. We can divide it by $\sum_{u=0}^{j-i}x^u$ in $O(N^2)$ time by reversing the process we used to multiply.

After dividing, all we need is the coefficient of $x^{k-(j-i)}.$ Since the product depends only on $j-i,$ we only need to do $N$ different divisions. Alternatively, we can maintain prefix and suffix products without needing to do division. The process for $i>j$ is almost exactly the same, except $a[j]$ contributes $0$ inversions rather than $i-j.$

The whole solution runs in $O(N^3)$ time and $O(N^2)$ memory. My code follows. It turns out that $M$ being prime is irrelevant ...

#include <bits/stdc++.h>
using namespace std;

typedef long long ll;
typedef vector<int> vi;

#define FOR(i,a,b) for (int i = (a); i < (b); ++i)
#define F0R(i,a) FOR(i,0,a)
#define ROF(i,a,b) for (int i = (b)-1; i >= (a); --i)
#define R0F(i,a) ROF(i,0,a)
#define trav(a,x) for (auto& a: x)

#define pb push_back
#define rsz resize
#define sz(x) int(x.size())

void setIO(string name) {
ios_base::sync_with_stdio(0); cin.tie(0);
freopen((name+".in").c_str(),"r",stdin);
freopen((name+".out").c_str(),"w",stdout);
}

int MOD;
int n,k;

typedef int T;
struct mi {
T val;
mi() { val = 0; }
mi(const ll& v) {
val = (-MOD <= v && v <= MOD) ? v : v % MOD;
if (val < 0) val += MOD;
}
mi& operator+=(const mi& m) {
if ((val += m.val) >= MOD) val -= MOD;
return *this; }
mi& operator-=(const mi& m) {
if ((val -= m.val) < 0) val += MOD;
return *this; }
};
typedef vector<mi> vmi;

void ad(vmi& a, int b) { // multiply by (x^0+x^1+...+x^{b-1})
a.rsz(sz(a)+b-1);
R0F(i,sz(a)-b) a[i+b] -= a[i];
FOR(i,1,sz(a)) a[i] += a[i-1];
}
void sub(vmi& a, int b) {
ROF(i,1,sz(a)) a[i] -= a[i-1];
F0R(i,sz(a)-b) a[i+b] += a[i];
a.rsz(sz(a)-b+1);
}
mi get(vmi& a, int b) {
if (b < 0 || b >= sz(a)) return 0;
return a[b];
}

int main() {
setIO("treedepth");
cin >> n >> k >> MOD;
vmi v = {1}; FOR(i,1,n+1) ad(v,i);
vmi ans(n,v[k]);
FOR(dif,1,n) {
sub(v,dif+1);
mi x = get(v,k-dif), y = get(v,k);