3 changes: 3 additions & 0 deletions .gitattributes
@@ -1,5 +1,7 @@
# Auto detect text files and perform LF normalization
* text=auto
*.sqlite filter=lfs diff=lfs merge=lfs -text
*.bson filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.db filter=lfs diff=lfs merge=lfs -text
*.sql filter=lfs diff=lfs merge=lfs -text
@@ -9,3 +11,4 @@ query_crmarenapro/query_dataset/*.sql filter=lfs diff=lfs merge=lfs -text
query_crmarenapro/query_dataset/hidden/*.db filter=lfs diff=lfs merge=lfs -text
query_crmarenapro/query_dataset/hidden/*.duckdb filter=lfs diff=lfs merge=lfs -text
query_crmarenapro/query_dataset/hidden/*.sql filter=lfs diff=lfs merge=lfs -text
query_krama/query_dataset/misc_files/* filter=lfs diff=lfs merge=lfs -text
13 changes: 13 additions & 0 deletions .gitignore
@@ -3,6 +3,10 @@
.DS_Store
# general py
__pycache__/
.history/

# runtime
claude_projects/

# results
results/
@@ -35,6 +39,15 @@ common_scaffold/tools/backup/
common_scaffold/*_backup/

# datasets
query_imdb/query_dataset/imdb*
query_imdb/data_raw/
query_imdb/job_queries/
query_imdb/job_results/
query_imdb/query*/query.sql
query_imdb/scripts/
query_krama/query_dataset/topic_files/topic_files_db/files.bson.bak
query_krama/query*/ground_truth.py
query_krama/scripts/
query_civic_unstructured/
query_paper_unstructured/
query_notice_unstructured/
8 changes: 8 additions & 0 deletions query_imdb/db_config.yaml
@@ -0,0 +1,8 @@
db_clients:
  movies_database:
    db_type: postgres
    db_name: movies_db
    sql_file: query_dataset/movies.sql
  people_database:
    db_type: sqlite
    db_path: query_dataset/people.sqlite
182 changes: 182 additions & 0 deletions query_imdb/db_description.txt
@@ -0,0 +1,182 @@
You are working with two databases to solve this query.

Here are the descriptions of these two databases:

1. movies_database
   - This database is stored in a PostgreSQL database and contains movie-centric information. It covers titles, production companies, keywords, genres, ratings, and structural relationships between titles.
   - This database consists of 15 tables:
     - title
       - This table is the central table of the movies_database. Each row represents a unique title entry (movie, TV series, episode, video game, etc.).
       - Fields:
         - id (str): Unique identifier for the title.
         - title (str): The title string of the movie or show.
         - imdb_index (str): Disambiguation suffix used by IMDB (e.g., Roman numerals).
         - kind_id (int): Foreign key referencing kind_type.id; indicates the type of title.
         - production_year (int): Year in which the title was produced or released.
         - imdb_id (int): Original numeric IMDB identifier.
         - phonetic_code (str): Phonetic encoding of the title for search purposes.
         - episode_of_id (int): For TV episodes, references the id of the parent series in this table.
         - season_nr (int): Season number (for TV episodes).
         - episode_nr (int): Episode number within a season.
         - series_years (str): Year span for TV series (e.g., "1990-1995").
         - md5sum (str): MD5 checksum of the record.
     - aka_title
       - This table stores alternate or foreign-language titles for movies and shows.
       - Fields:
         - id (int): Unique identifier for the alternate title record.
         - movie_id (int): Foreign key referencing title.id.
         - title (str): The alternate title string.
         - imdb_index (str): Disambiguation suffix.
         - kind_id (int): Foreign key referencing kind_type.id.
         - production_year (int): Production year for this alternate release.
         - phonetic_code (str): Phonetic encoding of the alternate title.
         - episode_of_id (int): Parent series ID for episode alternate titles.
         - season_nr (int): Season number.
         - episode_nr (int): Episode number.
         - note (str): Notes on context (e.g., country, dubbed version).
         - md5sum (str): MD5 checksum.
     - kind_type
       - This lookup table enumerates the types of titles (e.g., movie, TV series, episode).
       - Fields:
         - id (int): Unique identifier.
         - kind (str): Label for the title type (e.g., "movie", "tv series", "episode").
     - movie_info
       - This table stores free-text metadata about movies, such as genres, languages, countries, plot summaries, and technical specifications.
       - Fields:
         - id (int): Unique identifier for the info record.
         - movie_id (int): Foreign key referencing title.id.
         - info_type_id (str): Foreign key referencing info_type.id.
         - info (str): The metadata value (e.g., "English", "USA", "Drama").
         - note (str): Supplemental notes.
     - movie_info_idx
       - This table stores indexed numeric metadata about movies, primarily ratings and vote counts.
       - Fields:
         - id (int): Unique identifier for the index record.
         - movie_id (int): Foreign key referencing title.id.
         - info_type_id (int): Foreign key referencing info_type.id.
         - info (str): The numeric metadata value (e.g., a rating score).
         - note (str): Supplemental notes.
     - info_type
       - This lookup table enumerates the categories of metadata stored in movie_info and movie_info_idx (e.g., "genres", "rating", "votes").
       - Fields:
         - id (str): Unique identifier.
         - info (str): Description of the info category (e.g., "genres", "rating", "languages").
     - movie_keyword
       - This table records the association between movies and descriptive keywords.
       - Fields:
         - id (int): Unique identifier for the movie-keyword association.
         - movie_id (int): Foreign key referencing title.id.
         - keyword_id (str): Foreign key referencing keyword.id.
     - keyword
       - This lookup table contains all descriptive keywords (tags) used to annotate movies.
       - Fields:
         - id (str): Unique identifier.
         - keyword (str): The keyword text (e.g., "murder", "based-on-novel").
         - phonetic_code (str): Phonetic encoding of the keyword.
     - movie_companies
       - This table links movies to the production or distribution companies involved.
       - Fields:
         - id (str): Unique identifier.
         - movie_id (int): Foreign key referencing title.id.
         - company_id (int): Foreign key referencing company_name.id.
         - company_type_id (int): Foreign key referencing company_type.id.
         - note (str): Additional notes on the company's role.
     - company_name
       - This lookup table contains names and metadata for production and distribution companies.
       - Fields:
         - id (str): Unique identifier.
         - name (str): Full name of the company.
         - country_code (str): Country code of the company (e.g., "[us]", "[gb]").
         - imdb_id (int): Original IMDB numeric identifier.
         - name_pcode_nf (str): Phonetic code for the company name.
         - name_pcode_sf (str): Alternate phonetic code.
         - md5sum (str): MD5 checksum.
     - company_type
       - This lookup table enumerates the roles a company can have (e.g., production, distribution).
       - Fields:
         - id (int): Unique identifier.
         - kind (str): Company role label (e.g., "production companies", "distributors").
     - movie_link
       - This table records directional relationships between titles (e.g., sequel, remake, spin-off).
       - Fields:
         - id (int): Unique identifier.
         - movie_id (int): Foreign key referencing title.id (the source title).
         - linked_movie_id (int): Foreign key referencing title.id (the related title).
         - link_type_id (int): Foreign key referencing link_type.id.
     - link_type
       - This lookup table enumerates the types of relationships between titles.
       - Fields:
         - id (int): Unique identifier.
         - link (str): Relationship description (e.g., "follows", "sequel", "remake of").
     - complete_cast
       - This table records the completeness status of cast and crew information for a movie.
       - Fields:
         - id (int): Unique identifier.
         - movie_id (int): Foreign key referencing title.id.
         - subject_id (int): Foreign key referencing comp_cast_type.id; indicates whether cast or crew is being described.
         - status_id (int): Foreign key referencing comp_cast_type.id; indicates the completeness status.
     - comp_cast_type
       - This lookup table enumerates the subjects and statuses used in complete_cast.
       - Fields:
         - id (int): Unique identifier.
         - kind (str): Label for the subject or status (e.g., "cast", "crew", "complete", "complete+verified").

2. people_database
   - This database is stored in a SQLite database and contains people-centric information from IMDB. It covers individuals who worked on movies (actors, directors, writers, etc.), their alternate names, roles, biographical information, and their casting associations with titles.
   - This database consists of 6 tables:
     - name
       - This table is the central table of the people_database. Each row represents a unique person in the IMDB database.
       - Fields:
         - id (str): Unique identifier.
         - name (str): Full name of the person (last name, first name format).
         - imdb_index (str): Disambiguation suffix for persons sharing the same name.
         - imdb_id (int): Original IMDB numeric identifier.
         - gender (str): Gender of the person ("m" or "f").
         - name_pcode_cf (str): Phonetic code for the full name.
         - name_pcode_nf (str): Alternate phonetic code.
         - surname_pcode (str): Phonetic code for the surname.
         - md5sum (str): MD5 checksum.
     - aka_name
       - This table stores alternate names or pseudonyms for people.
       - Fields:
         - id (int): Unique identifier.
         - person_id (int): Foreign key referencing name.id.
         - name (str): The alternate name or pseudonym.
         - imdb_index (str): Disambiguation suffix.
         - name_pcode_cf (str): Phonetic code.
         - name_pcode_nf (str): Alternate phonetic code.
         - surname_pcode (str): Phonetic code for the surname.
         - md5sum (str): MD5 checksum.
     - cast_info
       - This table records the association between a person and a movie in a specific role. It is the central join table linking people_database to movies_database.
       - Fields:
         - id (int): Unique identifier.
         - person_id (str): Foreign key referencing name.id.
         - movie_id (int): Foreign key referencing title.id in movies_database.
         - person_role_id (int): Foreign key referencing char_name.id; the character played.
         - note (str): Additional credit notes (e.g., "uncredited").
         - nr_order (int): Billing order of the credit.
         - role_id (int): Foreign key referencing role_type.id.
     - char_name
       - This table contains character names that appear in movie credits.
       - Fields:
         - id (int): Unique identifier.
         - name (str): The character name.
         - imdb_index (str): Disambiguation suffix.
         - imdb_id (int): Original IMDB numeric identifier.
         - name_pcode_nf (str): Phonetic code.
         - surname_pcode (str): Phonetic code for the surname portion.
         - md5sum (str): MD5 checksum.
     - person_info
       - This table stores biographical and career metadata about people (e.g., birth date, birthplace, trivia).
       - Fields:
         - id (int): Unique identifier.
         - person_id (int): Foreign key referencing name.id.
         - info_type_id (int): Foreign key referencing info_type.id in movies_database.
         - info (str): The metadata value.
         - note (str): Supplemental notes.
     - role_type
       - This lookup table enumerates the types of roles a person can have in a movie (e.g., actor, director, writer).
       - Fields:
         - id (str): Unique identifier.
         - role (str): Role description (e.g., "actor", "director", "writer", "producer").
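
Because cast_info lives in the SQLite people_database while title lives in the PostgreSQL movies_database, a single SQL join cannot span both; the join has to happen in application code. The following is only a rough sketch of that idea, not part of this PR. It assumes identifiers have already been reduced to plain integers, uses an in-memory SQLite table with made-up rows, and stands in for the PostgreSQL side with a plain Python list (`title_rows`). The helper name `titles_for_person` is hypothetical.

```python
import sqlite3

# Hypothetical stand-in for (id, title) rows fetched from movies_database.title.
title_rows = [(1, "Example Title A"), (2, "Example Title B")]


def titles_for_person(conn, person_id, title_rows):
    """Return the sorted titles a person is credited on, joining in Python."""
    cur = conn.execute(
        "SELECT DISTINCT movie_id FROM cast_info WHERE person_id = ?",
        (person_id,),
    )
    movie_ids = {row[0] for row in cur}
    # Keep only titles whose id appears among the person's credits.
    return sorted(title for (tid, title) in title_rows if tid in movie_ids)


# Toy people_database with made-up credits (not real IMDB data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cast_info ("
    "id INTEGER PRIMARY KEY, person_id INTEGER, movie_id INTEGER)"
)
conn.executemany(
    "INSERT INTO cast_info (person_id, movie_id) VALUES (?, ?)",
    [(10, 2), (10, 1), (11, 2)],
)
```

Fetching the key set from one side and filtering the other keeps the sketch short; for tables of realistic size, pushing the key list into a `WHERE ... IN` clause on the second database would likely be preferable.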
2 changes: 2 additions & 0 deletions query_imdb/db_description_withhint.txt
@@ -0,0 +1,2 @@
HINTS:
- Many identifier columns in both databases use non-standard string encodings rather than plain integers. To join across tables, you must extract the embedded numeric ID by stripping the alphabetic prefix, any punctuation padding characters, and any leading zeros. For example, `tt0000042` → `42`, `nm001` → `1`, `InfT~~7` → `7`. This applies to the following columns: title.id, company_name.id, movie_companies.id, name.id, cast_info.person_id, keyword.id, movie_keyword.keyword_id, info_type.id, movie_info.info_type_id, and role_type.id.
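
The stripping rule in the hint can be sketched in Python as follows. This is an illustration rather than code from this PR: the function name `extract_numeric_id` is hypothetical, and it assumes the numeric part always sits at the end of the encoded identifier, as in the three examples given in the hint.

```python
import re


def extract_numeric_id(raw):
    """Return the integer embedded at the end of an encoded identifier.

    Assumption: the digits are a suffix of the value, preceded by an
    alphabetic prefix and optional punctuation padding.
    """
    match = re.search(r"(\d+)$", str(raw).strip())
    if match is None:
        raise ValueError(f"no numeric suffix in identifier: {raw!r}")
    # int() discards the leading zeros.
    return int(match.group(1))
```

Applying the same function to both sides of a join (for example title.id and cast_info.movie_id) gives comparable integer keys, and plain integer values pass through unchanged.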
2 changes: 2 additions & 0 deletions query_imdb/query1/ground_truth.csv
@@ -0,0 +1,2 @@
movie_kind,complete_us_internet_movie
movie,Dirt Merchant
1 change: 1 addition & 0 deletions query_imdb/query1/query.json
@@ -0,0 +1 @@
"Among movies, TV movies, video movies, and video games produced after 1990 that are tagged with at least one keyword, are associated with a US company, have a US internet release date recorded in the 1990s or 2000s, and have a complete and verified cast listing — what is the alphabetically first title, and what type of title is it?"
20 changes: 20 additions & 0 deletions query_imdb/query1/validate.py
@@ -0,0 +1,20 @@
import re


def normalize(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def validate(llm_output: str):
    llm_norm = normalize(llm_output)

    # movie_kind: 'movie'
    if "movie" not in llm_norm:
        return False, "Kind 'movie' not found in LLM output."

    # complete_us_internet_movie: 'Dirt Merchant'
    if normalize("Dirt Merchant") not in llm_norm:
        return False, "Title 'Dirt Merchant' not found in LLM output."

    return True, "Ground truth found in LLM output."
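
The matching strategy shared by these validators (lowercase, replace punctuation with spaces, collapse whitespace, then test for a substring) can be tried out in isolation. The sketch below copies `normalize` from the file above; `contains_all` is a hypothetical helper added only for illustration and is not part of this PR.

```python
import re


def normalize(text):
    # Lowercase, turn every non-alphanumeric character into a space,
    # then collapse runs of whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def contains_all(llm_output, expected_values):
    """Hypothetical helper: True if every expected value appears after normalization."""
    llm_norm = normalize(llm_output)
    return all(normalize(value) in llm_norm for value in expected_values)


print(normalize("Z'Dar, Robert"))        # z dar robert
print(normalize("...And Then I..."))     # and then i
print(contains_all("The answer is **Dirt Merchant** (a movie).",
                   ["movie", "Dirt Merchant"]))  # True
```

Note that this is plain substring matching, so a short expected value such as "movie" also matches inside longer words like "movies".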
2 changes: 2 additions & 0 deletions query_imdb/query10/ground_truth.csv
@@ -0,0 +1,2 @@
cast_member,complete_dynamic_hero_movie
"Abell, Alistair",...And Then I...
1 change: 1 addition & 0 deletions query_imdb/query10/query.json
@@ -0,0 +1 @@
"Among cast members who played a character whose name is not null and contains 'man' or 'Man', in titles of kind 'movie' produced after 2000, tagged with at least one of the keywords 'superhero', 'marvel-comics', 'based-on-comic', 'tv-special', 'fight', 'violence', 'magnet', 'web', 'claw', or 'laser', and with a cast listing whose completeness status contains 'complete', where the keyword tag, cast credit, and cast completeness record all refer to the same movie — what are the alphabetically first cast member name and the alphabetically first title?"
20 changes: 20 additions & 0 deletions query_imdb/query10/validate.py
@@ -0,0 +1,20 @@
import re


def normalize(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def validate(llm_output: str):
    llm_norm = normalize(llm_output)

    # cast_member: 'Abell, Alistair'
    if normalize("Abell, Alistair") not in llm_norm:
        return False, "Cast member 'Abell, Alistair' not found in LLM output."

    # complete_dynamic_hero_movie: '...And Then I...', which normalizes to 'and then i'
    if normalize("And Then I") not in llm_norm:
        return False, "Title '...And Then I...' not found in LLM output."

    return True, "Ground truth found in LLM output."
2 changes: 2 additions & 0 deletions query_imdb/query2/ground_truth.csv
@@ -0,0 +1,2 @@
writer_pseudo_name,movie_title
"""A.J.""",#1 Cheerleader Camp
1 change: 1 addition & 0 deletions query_imdb/query2/query.json
@@ -0,0 +1 @@
"Among writers who have a registered pseudonym and are credited on movies associated with a US company, what is the alphabetically first pseudonym and the alphabetically first movie title?"
21 changes: 21 additions & 0 deletions query_imdb/query2/validate.py
@@ -0,0 +1,21 @@
import re


def normalize(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def validate(llm_output: str):
    llm_norm = normalize(llm_output)
    llm_lower = llm_output.lower()

    # writer_pseudo_name: '"A.J."'; check case-insensitively, preserving dots
    if "a.j." not in llm_lower and "aj" not in llm_norm:
        return False, "Pseudonym 'A.J.' not found in LLM output."

    # movie_title: '#1 Cheerleader Camp'
    if normalize("#1 Cheerleader Camp") not in llm_norm:
        return False, "Title '#1 Cheerleader Camp' not found in LLM output."

    return True, "Ground truth found in LLM output."
2 changes: 2 additions & 0 deletions query_imdb/query3/ground_truth.csv
@@ -0,0 +1,2 @@
voicing_actress,jap_engl_voiced_movie
"Aaron, Caroline",$9.99
1 change: 1 addition & 0 deletions query_imdb/query3/query.json
@@ -0,0 +1 @@
"Among female actresses credited with a voice role — including general voice, uncredited voice, Japanese dubbed version, or English dubbed version — in movies produced after 2000 that are associated with a US company and have a release date on record, where the actress has a registered alternate name and played a named character, what is the alphabetically first actress name and the alphabetically first movie title?"
21 changes: 21 additions & 0 deletions query_imdb/query3/validate.py
@@ -0,0 +1,21 @@
import re


def normalize(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def validate(llm_output: str):
    llm_norm = normalize(llm_output)

    # voicing_actress: 'Aaron, Caroline'
    if normalize("Aaron, Caroline") not in llm_norm:
        return False, "Actress name 'Aaron, Caroline' not found in LLM output."

    # jap_engl_voiced_movie: '$9.99'; check for '9.99' as a float
    matches = re.findall(r"\d+\.\d+", llm_output)
    if not any(abs(float(m) - 9.99) < 0.01 for m in matches):
        return False, "Movie title '$9.99' not found in LLM output."

    return True, "Ground truth found in LLM output."
2 changes: 2 additions & 0 deletions query_imdb/query4/ground_truth.csv
@@ -0,0 +1,2 @@
member_in_charnamed_movie,a1
"Z'Dar, Robert","Z'Dar, Robert"
1 change: 1 addition & 0 deletions query_imdb/query4/query.json
@@ -0,0 +1 @@
"Among cast and crew members whose name starts with 'Z' and who are credited in movies tagged with the keyword 'character-name-in-title' — indicating the movie's title contains a character's name — that are associated with at least one company, what is the alphabetically first name?"
16 changes: 16 additions & 0 deletions query_imdb/query4/validate.py
@@ -0,0 +1,16 @@
import re


def normalize(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def validate(llm_output: str):
    llm_norm = normalize(llm_output)

    # member_in_charnamed_movie / a1: "Z'Dar, Robert", which normalizes to 'z dar robert'
    # (the apostrophe and comma each become a space)
    if normalize("Z'Dar, Robert") not in llm_norm:
        return False, "Name \"Z'Dar, Robert\" not found in LLM output."

    return True, "Ground truth found in LLM output."
2 changes: 2 additions & 0 deletions query_imdb/query5/ground_truth.csv
@@ -0,0 +1,2 @@
producing_company,rating,movie
"""O"" Films",1.0,#54 Meets #47
1 change: 1 addition & 0 deletions query_imdb/query5/query.json
@@ -0,0 +1 @@
"Among titles of kind 'movie' (excluding TV movies, video movies, and other title types) that have a release date and a rating on record and are associated with a US production company, what are the alphabetically first company name, the alphabetically first rating value, and the alphabetically first title?"