itpdp/api/PARTY_ANALYSIS_AND_QUESTION_GENERATION.md
Daniel Bulant 6c7854edd4
gen docs
2026-06-20 22:51:51 +02:00


# Party Analysis and Question Generation Algorithm
This document describes how the API analyzes a party's Spotify data and turns that analysis into quiz questions.
Relevant implementation files:
- `src/workflows/party-analysis.ts` — computes and stores `party.analysisData`.
- `src/workflows/quiz.ts` — starts analysis, runs the quiz loop, and scores answers.
- `src/party/question-generator.ts` — chooses a question type and attaches a song.
- `src/party/audio-question-generator.ts` — builds audio metadata choice questions.
- `src/party/social-question-generator.ts` — builds social choice questions.
- `src/party/numeric-question-generator.ts` — builds numeric questions.
- `src/party/question-utils.ts` — shared fairness, deduplication, option, and song selection logic.
## High-level flow
```mermaid
flowchart TD
A[Quiz starts] --> B[Analyze party]
B --> C[Store party.analysisData]
C --> D[Initialize quiz state]
D --> E[Generate next question]
E --> F[Publish current question in party data]
F --> G[Wait for player answers or timeout]
G --> H[Score round]
H --> I[Review period]
I --> J{More questions?}
J -->|Yes| E
J -->|No| K[Show results and mark party ended]
```
When a quiz starts, `QuizWorkflow.startQuiz` first runs `partyAnalysisWorkflow.analyzeParty(partyId)`. The generated analysis is saved to the `party.analysisData` JSON column and then reused for every question in the quiz.
The quiz currently asks up to `TOTAL_QUESTIONS = 5` questions. Each question has a 60-second answer window, followed by a 5-second review period.
## Party analysis
Party analysis converts each member's listening data into comparable track, artist, and genre scores. The result is a compact party-level summary designed for fast question generation.
### 1. Minimum party size
If a party has fewer than 2 members, analysis is saved as empty:
- `storyClusters: []`
- `pairwise: []`
- `groupSummary.totalMembers`
- `groupSummary.mostSharedGenres: []`
- `groupSummary.mostDiverseMember: null`
- `groupSummary.mostAlignedPair: null`
- `memberProfiles: []`
The workflow returns `analyzed: false` in that case.
### 2. Per-member scoring
For each party member, the analysis workflow fetches several Spotify-derived tables and accumulates scores into three maps:
- tracks
- artists
- genres
The score inputs are:
| Source | Scoring |
| --- | --- |
| Medium-term top tracks | `MAX_POSITION - position + 1`, with `MAX_POSITION = 50` |
| Saved tracks | track `+10`, artists `+5`, genres `+2.5` |
| Playback history | track `+5` if played in the last 24h, `+3` if played in the last week, otherwise `+1`; artists get half of the track score, genres get a quarter |
| Medium-term top artists | `MAX_POSITION - position + 1` |
| Followed artists | artist `+10`, genres `+10` |
| Saved albums | album artists `+5`, genres `+2.5` |
Top track and top artist scores are position-weighted, so rank 1 contributes more than rank 50. Saved and followed items add fixed preference signals. Playback history adds recency-weighted listening signals.
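The scoring rules in the table can be sketched as follows. This is a minimal illustration, not the code in `src/workflows/party-analysis.ts`: the helper names `positionScore`, `playbackTrackScore`, and `addScore` are hypothetical, and positions are assumed to be 1-based.

```typescript
// MAX_POSITION comes from the table above; everything else is illustrative.
const MAX_POSITION = 50;

/** Position-weighted score for a top-track or top-artist entry (assumes 1-based positions). */
function positionScore(position: number): number {
  return MAX_POSITION - position + 1;
}

/** Recency-weighted track score for a single playback-history entry. */
function playbackTrackScore(playedAt: Date, reference: Date): number {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const ageMs = reference.getTime() - playedAt.getTime();
  if (ageMs <= DAY_MS) return 5; // played in the last 24h
  if (ageMs <= 7 * DAY_MS) return 3; // played in the last week
  return 1; // anything older
}

/** Adds `amount` to the running score stored under `id`. */
function addScore(scores: Map<string, number>, id: string, amount: number): void {
  scores.set(id, (scores.get(id) ?? 0) + amount);
}

// One member: a track that is ranked #1 and was also played four hours ago.
const trackScores = new Map<string, number>();
const artistScores = new Map<string, number>();
const genreScores = new Map<string, number>();

const referenceTime = new Date("2026-06-20T12:00:00Z");
const recentPlay = playbackTrackScore(new Date("2026-06-20T08:00:00Z"), referenceTime);

addScore(trackScores, "track:1", positionScore(1)); // top-track signal
addScore(trackScores, "track:1", recentPlay); // playback signal
addScore(artistScores, "artist:1", recentPlay / 2); // artists get half
addScore(genreScores, "genre:rock", recentPlay / 4); // genres get a quarter
```

Because every source adds into the same per-member maps, a track that shows up in several sources simply ends up with a higher combined score.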
### 3. Party-level entity maps
After all member scores are computed, the workflow builds one map per entity type:
- `TrackEntityScore`
- `ArtistEntityScore`
- `GenreEntityScore`
Each entity contains:
- entity id and display name
- track artist names and album name, for tracks
- `memberScores`, the list of members who contributed to the entity and their score
- `memberCount`, the number of party members represented by that entity
These maps make it possible to tell which songs, artists, and genres are shared by multiple people and which are strongly associated with one person.
### 4. Story clusters
Story clusters group entities by the exact subset of party members that share them.
For example, if Alice and Bob both have the same genre, that genre goes into the cluster keyed by `Alice|Bob`. If all party members share a track, that track goes into the all-members cluster.
Clusters are sorted by:
1. all-members cluster first
2. larger `memberCount`
3. total track score in the cluster
Within each cluster, tracks, artists, and genres are sorted by total score descending.
The stored analysis is compacted to:
- top 8 story clusters
- top 20 tracks, artists, and genres per cluster
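The grouping step can be illustrated with a small sketch. The types below are simplified, hypothetical stand-ins for the stored entity shapes, and sorting member ids before joining them is an assumption made here so that the key does not depend on member order.

```typescript
// Hypothetical, simplified shapes; the stored entities carry more fields.
type MemberScore = { userId: string; score: number };
type ScoredEntity = { id: string; memberScores: MemberScore[] };

/** Builds a key from the exact set of members that share an entity. */
function clusterKey(entity: ScoredEntity): string {
  // Sorting makes the key independent of member order (an assumption of this sketch).
  return entity.memberScores.map((m) => m.userId).sort().join("|");
}

/** Total score of an entity across all contributing members. */
function entityTotal(entity: ScoredEntity): number {
  return entity.memberScores.reduce((sum, m) => sum + m.score, 0);
}

/** Groups entities by member subset and sorts each group by total score, descending. */
function groupByMemberSubset(entities: ScoredEntity[]): Map<string, ScoredEntity[]> {
  const clusters = new Map<string, ScoredEntity[]>();
  for (const entity of entities) {
    const key = clusterKey(entity);
    const bucket = clusters.get(key) ?? [];
    bucket.push(entity);
    clusters.set(key, bucket);
  }
  clusters.forEach((bucket) => bucket.sort((a, b) => entityTotal(b) - entityTotal(a)));
  return clusters;
}

const genreClusters = groupByMemberSubset([
  { id: "genre:indie", memberScores: [{ userId: "Bob", score: 4 }, { userId: "Alice", score: 6 }] },
  { id: "genre:jazz", memberScores: [{ userId: "Alice", score: 20 }, { userId: "Bob", score: 15 }] },
  { id: "genre:metal", memberScores: [{ userId: "Carol", score: 30 }] },
]);
// genreClusters has two keys: "Alice|Bob" (jazz, then indie) and "Carol" (metal).
```

Ordering the clusters themselves (all-members first, then `memberCount`, then total track score) and truncating to the top 8 would follow as a separate sort over the resulting groups.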
### 5. Pairwise similarity
For every pair of party members, the workflow computes:
- `sharedTracks`
- `sharedArtists`
- `sharedGenres`
- `similarity`
Similarity uses a weighted Jaccard-style score across tracks, artists, and genres:
```text
similarity = sum(min(scoreA, scoreB) for shared entities)
           / sum(max(scoreA, scoreB) for all entities in either profile)
```
This rewards members who share high-scoring music preferences, not just raw overlap counts.
The most similar pair becomes `groupSummary.mostAlignedPair`.
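A minimal sketch of the weighted Jaccard-style formula, assuming each member's scores are available as a map from entity id to score. How the workflow combines tracks, artists, and genres is not spelled out here, so this example treats them as one map with prefixed ids; the function name `weightedJaccard` is hypothetical.

```typescript
type ScoreMap = Map<string, number>;

/**
 * Weighted Jaccard-style similarity between two score maps.
 * Entities missing from one profile count as score 0 there, so they add
 * nothing to the numerator but still add to the denominator.
 */
function weightedJaccard(a: ScoreMap, b: ScoreMap): number {
  const ids = new Set<string>(Array.from(a.keys()).concat(Array.from(b.keys())));
  let minSum = 0;
  let maxSum = 0;
  ids.forEach((id) => {
    const scoreA = a.get(id) ?? 0;
    const scoreB = b.get(id) ?? 0;
    minSum += Math.min(scoreA, scoreB);
    maxSum += Math.max(scoreA, scoreB);
  });
  // Two empty profiles are treated as having no similarity.
  return maxSum === 0 ? 0 : minSum / maxSum;
}

const aliceScores: ScoreMap = new Map([
  ["track:1", 50],
  ["track:2", 10],
]);
const bobScores: ScoreMap = new Map([
  ["track:1", 30],
  ["track:3", 20],
]);

// Shared: track:1 -> min 30. All entities: max 50 + 10 + 20 = 80.
const aliceBobSimilarity = weightedJaccard(aliceScores, bobScores); // 30 / 80 = 0.375
```

The result is always between 0 and 1, and it only reaches 1 when both members have identical scores for every entity.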
### 6. Member profiles and genre diversity
Each member profile stores:
- `userId`
- `totalScore`, based on track and artist scores
- `genreScores`
- `trackCount`
- `artistCount`
Genre diversity is calculated as entropy over the member's genre score distribution:
```text
entropy = -sum(p * ln(p))
```
where `p` is a genre's score divided by the member's total genre score, so the values of `p` sum to 1. The member with the highest entropy becomes `groupSummary.mostDiverseMember`.
Stored member profiles keep only the top 20 genres by score.
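The entropy calculation can be sketched as below. This assumes that `p` is normalized by the sum of the member's genre scores, so that the probabilities sum to 1; the function name `genreEntropy` is hypothetical.

```typescript
/** Shannon entropy (natural log) of a member's genre score distribution. */
function genreEntropy(genreScoreValues: number[]): number {
  const total = genreScoreValues.reduce((sum, score) => sum + score, 0);
  if (total <= 0) return 0; // no genre data means no measurable diversity

  let entropy = 0;
  for (const score of genreScoreValues) {
    if (score <= 0) continue; // zero-probability genres contribute nothing
    const p = score / total;
    entropy -= p * Math.log(p);
  }
  return entropy;
}

const focusedListener = genreEntropy([9, 1]); // mostly one genre
const balancedListener = genreEntropy([5, 5]); // evenly split, equals ln(2)
```

A member whose listening is spread evenly over many genres gets a higher value than one dominated by a single genre, which is why the maximum is used to pick the most diverse member.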
### 7. Most shared genres
The workflow aggregates genre scores across members, then sorts genres by:
1. `memberCount` descending
2. total genre score descending
It keeps the top 10 genres that are shared by at least 2 members as `groupSummary.mostSharedGenres`.
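As a rough sketch, the filter, sort, and limit can be expressed in a few lines. The `SharedGenre` shape and its `totalScore` field are hypothetical simplifications of the aggregated genre entity.

```typescript
// Hypothetical, simplified aggregate; the stored entity has more fields.
type SharedGenre = { id: string; memberCount: number; totalScore: number };

/** Keeps genres shared by at least 2 members, ordered by reach and then by score. */
function pickMostSharedGenres(genres: SharedGenre[], limit = 10): SharedGenre[] {
  return genres
    .filter((genre) => genre.memberCount >= 2) // filter() copies, so the input is not mutated
    .sort((a, b) => b.memberCount - a.memberCount || b.totalScore - a.totalScore)
    .slice(0, limit);
}

const sharedGenreSample = pickMostSharedGenres([
  { id: "genre:metal", memberCount: 1, totalScore: 90 },
  { id: "genre:pop", memberCount: 2, totalScore: 40 },
  { id: "genre:rock", memberCount: 3, totalScore: 25 },
  { id: "genre:jazz", memberCount: 2, totalScore: 60 },
]);
// Order: rock (3 members), jazz (2 members, 60), pop (2 members, 40); metal is dropped.
```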
## Generated analysis shape
The saved `party.analysisData` contains:
```ts
type PartyAnalysisResult = {
  storyClusters: StoryCluster[];
  pairwise: PairwiseComparison[];
  groupSummary: {
    totalMembers: number;
    mostSharedGenres: GenreEntityScore[];
    mostDiverseMember: GenreDiversity | null;
    mostAlignedPair: PairwiseComparison | null;
  };
  memberProfiles: MemberProfile[];
};
```
This JSON is intentionally denormalized and compact so the question generators can work without recomputing party analytics during each round.
## Question generation
Each quiz round calls `generatePartyQuestion`, passing:
- database client
- party id
- current `QuizState`
- saved `party.analysisData`
- question index
The generator fetches current party members, chooses a question type order, asks each question builder for a valid question, then attaches a suitable song.
### 1. Question type ordering
The possible question types are:
- `audio-metadata`
- `social`
- `numeric`
For each round, the generator randomizes their priority using base weights and recent-history penalties:
| Type | Base weight |
| --- | ---: |
| `audio-metadata` | `1` |
| `numeric` | `0.55` |
| `social` | `0.1` |
A random value between `0` and `0.35` is added to each base weight. Each occurrence of the same type in the last 3 questions subtracts `0.45`.
This means audio metadata questions are preferred by default, but the generator avoids repeating the same category too often.
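The priority calculation might look like the following sketch. The weights, jitter, and penalty come from the description above; the function name `orderQuestionTypes` and the injectable `random` parameter are assumptions added here to make the example deterministic in tests.

```typescript
type QuestionType = "audio-metadata" | "social" | "numeric";

const BASE_WEIGHTS: Record<QuestionType, number> = {
  "audio-metadata": 1,
  numeric: 0.55,
  social: 0.1,
};
const MAX_JITTER = 0.35;
const REPEAT_PENALTY = 0.45;
const HISTORY_WINDOW = 3;

/** Returns question types ordered from highest to lowest priority for this round. */
function orderQuestionTypes(
  previousTypes: QuestionType[],
  random: () => number = Math.random,
): QuestionType[] {
  const recent = previousTypes.slice(-HISTORY_WINDOW);
  const types: QuestionType[] = ["audio-metadata", "social", "numeric"];
  return types
    .map((type) => {
      const repeats = recent.filter((t) => t === type).length;
      const priority = BASE_WEIGHTS[type] + random() * MAX_JITTER - repeats * REPEAT_PENALTY;
      return { type, priority };
    })
    .sort((a, b) => b.priority - a.priority)
    .map((entry) => entry.type);
}

// With no jitter and no history, the base weights decide the order.
const freshOrder = orderQuestionTypes([], () => 0);
// After three audio questions in a row, audio drops to 1 - 3 * 0.45 = -0.35.
const afterAudioStreak = orderQuestionTypes(
  ["audio-metadata", "audio-metadata", "audio-metadata"],
  () => 0,
);
```

Since the maximum jitter (`0.35`) is smaller than the gap between `audio-metadata` and `numeric` (`0.45`), randomness alone never reorders those two; only the repeat penalty does.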
### 2. Candidate generation
Each question builder creates multiple candidates, then `pickQuestionCandidate` selects one.
Candidates contain:
- `key` — unique question identity
- `subjectKey` — the entity being asked about, such as `track:...`, `artist:...`, `genre:...`, `member:...`, or `pair:...`
- optional fairness metadata
- the partial question object
A candidate is rejected if the quiz history already contains the same normalized:
- question key
- subject key
- question text
This prevents repeated questions and repeated subjects across the quiz.
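A sketch of the rejection check, under the assumption that normalization means trimming, lowercasing, and collapsing whitespace. The actual normalization in `src/party/question-utils.ts` may differ, and the names `AskedQuestion`, `normalizeForComparison`, and `isRepeat` are hypothetical.

```typescript
// Hypothetical minimal shape of what is remembered about each asked question.
type AskedQuestion = { key: string; subjectKey: string; text: string };

/** Assumed normalization: trim, lowercase, collapse runs of whitespace. */
function normalizeForComparison(value: string): string {
  return value.trim().toLowerCase().replace(/\s+/g, " ");
}

/** True when the candidate matches any earlier question on key, subject, or text. */
function isRepeat(candidate: AskedQuestion, asked: AskedQuestion[]): boolean {
  const fields: (keyof AskedQuestion)[] = ["key", "subjectKey", "text"];
  return asked.some((previous) =>
    fields.some(
      (field) =>
        normalizeForComparison(previous[field]) === normalizeForComparison(candidate[field]),
    ),
  );
}

const askedSoFar: AskedQuestion[] = [
  { key: "audio:performer:track:1", subjectKey: "track:1", text: 'Who performs "Song A"?' },
];

// Same subject, different question: rejected.
const sameSubject = isRepeat(
  { key: "audio:album:track:1", subjectKey: "track:1", text: '"Song A" appears on which album?' },
  askedSoFar,
);
// New key, subject, and text: accepted.
const brandNew = isRepeat(
  { key: "audio:performer:track:2", subjectKey: "track:2", text: 'Who performs "Song B"?' },
  askedSoFar,
);
```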
### 3. Fairness weighting
Tracks and artists are sorted by fairness before they are used as question subjects.
Fairness is derived from an entity's `memberScores`:
- `memberIds` — party members connected to the entity
- `memberCount` — how many members are connected
- `score` — total member score for shared entities
For single-member entities, the fairness score is the negated number of times that member has already been used in the quiz history, so members who have been featured less rank higher. This prevents the quiz from repeatedly focusing on one person when only single-member subjects are available.
Question candidate weight is:
```text
if no fairness data:
    weight = 8
else:
    weight = 8 + memberCount * 20 + clamp(score, 0, 100) / 20
```
Weighted random selection is then used. Shared, high-scoring entities are therefore more likely, but not guaranteed, to be selected.
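The weight formula and a standard weighted random pick can be sketched as follows. The selection loop is a generic roulette-wheel technique and is not necessarily how `pickQuestionCandidate` is written; the type and function names are hypothetical, and `random` is injectable only to make the example testable.

```typescript
type Fairness = { memberCount: number; score: number };
type WeightedCandidate = { key: string; fairness?: Fairness };

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

/** Candidate weight as given by the formula above. */
function candidateWeight(candidate: WeightedCandidate): number {
  if (!candidate.fairness) return 8;
  const { memberCount, score } = candidate.fairness;
  return 8 + memberCount * 20 + clamp(score, 0, 100) / 20;
}

/** Roulette-wheel selection: each candidate is picked with probability weight / total. */
function pickWeighted<T extends WeightedCandidate>(
  candidates: T[],
  random: () => number = Math.random,
): T | undefined {
  const total = candidates.reduce((sum, c) => sum + candidateWeight(c), 0);
  let roll = random() * total;
  for (const candidate of candidates) {
    roll -= candidateWeight(candidate);
    if (roll < 0) return candidate;
  }
  // Guards against floating-point leftovers; undefined when the list is empty.
  return candidates[candidates.length - 1];
}

const plainCandidate: WeightedCandidate = { key: "q:plain" }; // weight 8
const sharedCandidate: WeightedCandidate = {
  key: "q:shared",
  fairness: { memberCount: 3, score: 250 }, // weight 8 + 60 + 100 / 20 = 73
};
```

With these two candidates the shared one is chosen about 73 times out of 81, which matches the "more likely, but not guaranteed" behaviour described above.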
### 4. Audio metadata questions
`buildAudioMetadataQuestion` produces choice questions about party music metadata. It uses:
- most shared genres
- fair tracks from story clusters
- fair artists from story clusters
- detailed track rows from the database for album, artist, release date, and duration metadata
Examples of generated questions include:
- `What song is currently playing?`
- `Which genre is shared by the most party members?`
- `Which genre is ranked #<rank> in the party's shared genres?`
- `Which artist is ranked highest in the shared audio data?`
- `Which artist is ranked #<rank> in the shared audio data?`
- `Which track is ranked highest across the party?`
- `Which track is ranked #<rank> in the party analysis?`
- `Which artist appears on "<album>"?`
- `Which of these tracks came out first?`
- `Which of these tracks came out most recently?`
- `What's the longest track by <artist>?`
- `Who performs "<track>"?`
- `What is the name of this track by <artist>?`
- `"<track>" appears on which album?`
Options are built from relevant candidate pools, deduplicated, shuffled, and only emitted when there are enough valid options.
### 5. Social questions
`buildSocialQuestion` produces choice questions about players and relationships in the party.
Examples include:
- `Who is leading the quiz right now?`
- `Who looks like the most diverse listener in the party?`
- `Who listens the most to "<track>"?`
- `Which two players share the most musical taste?`
Social questions require enough party members for the question to make sense:
- leader/diverse/top-listener questions need at least 2 members
- most-aligned-pair questions need at least 3 members
The top-listener question prefers shared tracks when shared tracks exist, so the quiz does not unnecessarily focus on solo-only data.
### 6. Numeric questions
`buildNumericQuestion` produces numeric-answer questions. Numeric questions are scored by closeness during the quiz rather than by exact choice index.
Examples include:
- `What's the release year of <album or track>?`
- `What year did "<track>" come out?`
- `What year did <artist>'s first party track come out?`
- `For how many players in the party is "<track>" a top track?`
- `How many players in the party have "<artist>" as a favourite artist?`
Release-year questions use a range around the correct year, capped at the current year and widened to a minimum span. Count questions use a range from `0` to the current party size.
### 7. Question timing
Every generated question is wrapped by `buildQuestionWindow`, which sets:
```ts
// Each question gets a 60-second answer window.
const startTimestamp = Date.now();
const endTimestamp = startTimestamp + 60_000;
```
These timestamps are used by the quiz workflow to decide when answer collection times out.
## Song selection
After a question candidate is selected, `selectQuestionSong` chooses an audio track to attach to the question.
Song candidates come from:
1. the song already attached by the question builder
2. the question subject, if the subject is a track or artist
3. people mentioned by the question, when member-specific subjects can imply a relevant song
4. fair tracks from the story clusters
5. top party songs queried from member top tracks
The selector avoids reusing songs from previous quiz rounds when possible by checking prior `song.platform_id` values.
Some question types should keep or prefer their subject song:
- Questions where hearing the exact song is necessary keep the subject song.
- Questions where the song helps but is not mandatory prefer a relevant fresh song.
- Other questions prefer fair, fresh, adjacent party songs so audio does not reveal the answer too directly.
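The freshness preference can be sketched as a simple "first unused, else first" pick over an already prioritized candidate list. This is only an illustration of the idea: the `QuizSong` shape and `pickFreshSong` are hypothetical, and the real `selectQuestionSong` also applies the keep/prefer rules listed above.

```typescript
// Hypothetical minimal song shape; only platform_id matters for freshness.
type QuizSong = { platform_id: string; title: string };

/**
 * Returns the first candidate not used in an earlier round.
 * Falls back to the first candidate when every song was already used,
 * and to undefined when there are no candidates at all.
 */
function pickFreshSong(
  orderedCandidates: QuizSong[],
  usedPlatformIds: Set<string>,
): QuizSong | undefined {
  const fresh = orderedCandidates.find((song) => !usedPlatformIds.has(song.platform_id));
  return fresh ?? orderedCandidates[0];
}

const songCandidates: QuizSong[] = [
  { platform_id: "sp:1", title: "Subject song" },
  { platform_id: "sp:2", title: "Fair cluster track" },
  { platform_id: "sp:3", title: "Top party song" },
];

const freshPick = pickFreshSong(songCandidates, new Set(["sp:1"])); // skips the reused song
const exhaustedPick = pickFreshSong(songCandidates, new Set(["sp:1", "sp:2", "sp:3"])); // reuses the best candidate
```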
## Quiz response and scoring
For each question, the quiz workflow waits until all current party members answer or the question deadline is reached. Missing answers are recorded with `selected: -1` and a score of 0.
Choice questions are scored exactly:
```text
pointsGained = question.points if selected option index equals question.correct
pointsGained = 0 otherwise
```
Numeric questions are scored relatively by answer distance:
1. Ignore no-answer responses for ranking.
2. Compute absolute distance from the correct numeric value.
3. Group equal distances together.
4. Award the closest group full points and linearly decrease points for later distance groups.
5. If all numeric answers are equally distant, only exact answers receive points.
After each round, scores are added to `quizState.scores`, the quiz enters `review`, then continues to the next question. After the final question, the quiz status becomes `results` and the party status is marked `ended`.
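The numeric scoring steps above can be sketched as follows. The exact linear falloff is not specified in this document, so the sketch assumes that group `i` of `n` distance groups receives `points * (n - i) / n`, rounded. Representing a missing answer as `null` and the names `NumericAnswer` and `scoreNumericRound` are also assumptions.

```typescript
// Hypothetical answer shape; null stands in for "no answer".
type NumericAnswer = { userId: string; value: number | null };

/** Scores one numeric round by closeness to the correct value. */
function scoreNumericRound(
  answers: NumericAnswer[],
  correct: number,
  points: number,
): Map<string, number> {
  const result = new Map<string, number>();
  answers.forEach((answer) => result.set(answer.userId, 0)); // everyone starts at 0

  // Steps 1 and 2: drop missing answers, compute absolute distances.
  const ranked = answers
    .filter((a): a is { userId: string; value: number } => a.value !== null)
    .map((a) => ({ userId: a.userId, distance: Math.abs(a.value - correct) }));

  // Step 3: each distinct distance is one group, closest first.
  const groups = Array.from(new Set(ranked.map((r) => r.distance))).sort((x, y) => x - y);

  // Step 5: when everyone is equally distant, only exact answers score.
  if (groups.length === 1) {
    if (groups[0] === 0) ranked.forEach((r) => result.set(r.userId, points));
    return result;
  }

  // Step 4: full points for the closest group, assumed linear falloff afterwards.
  ranked.forEach((r) => {
    const groupIndex = groups.indexOf(r.distance);
    const share = (groups.length - groupIndex) / groups.length;
    result.set(r.userId, Math.round(points * share));
  });
  return result;
}

const roundScores = scoreNumericRound(
  [
    { userId: "alice", value: 2015 }, // exact
    { userId: "bob", value: 2014 }, // off by 1
    { userId: "carol", value: 2016 }, // off by 1, same group as bob
    { userId: "dave", value: 2000 }, // off by 15
    { userId: "erin", value: null }, // no answer
  ],
  2015,
  100,
);
// alice: 100, bob: 67, carol: 67, dave: 33, erin: 0
```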
## Design goals
The current algorithm is optimized for:
- **Shared relevance:** Prefer content that represents multiple party members.
- **Personal variety:** Avoid repeatedly targeting the same member or subject.
- **Freshness:** Avoid repeated question keys, subjects, text, and songs.
- **Playable trivia:** Only emit questions with enough options, valid text, and usable metadata.
- **Low round latency:** Do expensive aggregation once in party analysis, then use compact JSON during quiz rounds.