songs.csv contains the original Song table. It has 961,593 tuples.

tracks.csv contains the original Track table. It has 734,485 tuples.

sample_A.csv contains the downsampled Song table, with 4,192 tuples. It has the
same schema as songs.csv.

sample_B.csv contains the downsampled Track table, with 5,000 tuples. It has
the same schema as tracks.csv.

C.csv contains the candidate pairs that survived blocking, with 5,223 tuples.
The first column is a unique id. The next two columns are ltable_id (the id of
the song) and rtable_id (the id of the track). The remaining columns are the
non-id columns from sample_A.csv, followed by the non-id columns from
sample_B.csv.
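
The column layout of C.csv can be sketched in Python as follows. This is only
an illustration of the ordering described above. The id column name "id", the
unique-id column name "_id", the "ltable_"/"rtable_" prefixes on the non-id
columns, and the example column names are all assumptions, not taken from the
files themselves.

```python
def candidate_header(a_columns, b_columns, id_col="id"):
    """Return the C.csv column order for the given sample_A / sample_B headers.

    Assumptions (not confirmed by the data files): the id column is called
    `id_col` in both tables, the unique pair id is called "_id", and non-id
    columns are prefixed to keep names from the two tables apart.
    """
    a_rest = [c for c in a_columns if c != id_col]
    b_rest = [c for c in b_columns if c != id_col]
    return (
        ["_id", "ltable_id", "rtable_id"]      # pair id, song id, track id
        + ["ltable_" + c for c in a_rest]      # non-id columns of sample_A.csv
        + ["rtable_" + c for c in b_rest]      # non-id columns of sample_B.csv
    )


# Hypothetical column names, for illustration only.
header = candidate_header(["id", "title", "artist"], ["id", "title", "year"])
print(header)
```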

G.csv contains 500 candidate tuples sampled from C.csv and labeled. The schema
is the same as that of C.csv, with an additional column gold_label, whose
value is 0 (not a true match) or 1 (true match).

I.csv contains the training set. It has 350 candidate tuples from G.csv, and
the same schema as G.csv.

J.csv contains the testing set. It has the remaining 150 candidate tuples from
G.csv, and the same schema as G.csv.
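
One way to check that I.csv and J.csv really split G.csv into disjoint training
and testing sets is sketched below. It uses only the Python standard library.
The helper names are hypothetical, and the unique-id column name "_id" is an
assumption; substitute the actual name of the first column.

```python
import csv


def read_ids(path, id_col="_id"):
    """Read the unique-id column from a CSV file (column name is an assumption)."""
    with open(path, newline="") as f:
        return [row[id_col] for row in csv.DictReader(f)]


def is_partition(g_ids, i_ids, j_ids):
    """True if I and J have no duplicates, do not overlap, and together equal G."""
    i_set, j_set = set(i_ids), set(j_ids)
    return (
        len(i_set) == len(i_ids)                  # no duplicate ids in I
        and len(j_set) == len(j_ids)              # no duplicate ids in J
        and i_set.isdisjoint(j_set)               # training and testing do not overlap
        and (i_set | j_set) == set(g_ids)         # together they cover G
        and len(g_ids) == len(i_ids) + len(j_ids)
    )
```

With the files in the current directory, the check would be
is_partition(read_ids("G.csv"), read_ids("I.csv"), read_ids("J.csv")), and the
three lists should have 500, 350, and 150 entries respectively.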