Datasets

All datasets are available for download at https://drive.google.com/file/d/1df5PyaN8bQ8C_rDezLJhdjlWn5f1K0b4/view?usp=sharing or through the tables and links below.

Number of datasets in each category:

| Depth | Weight change | Low variance, low ins./del. | Low variance, regular ins./del. | Low variance, spiky ins./del. | High variance, low ins./del. | High variance, regular ins./del. | High variance, spiky ins./del. |
|---|---|---|---|---|---|---|---|
| Single level | Low | 155 | 15 | 3 | 70 | 2 | 87 |
| Single level | Regular | 6 | 125 | 50 | 138 | 36 | 183 |
| Single level | Spiky | 49 | 10 | 10 | 47 | 8 | 161 |
| 2 or 3 levels | Low | 155 | 15 | 7 | 70 | 2 | 105 |
| 2 or 3 levels | Regular | 6 | 127 | 57 | 138 | 35 | 198 |
| 2 or 3 levels | Spiky | 49 | 11 | 10 | 47 | 9 | 200 |
| 4+ levels | Low | 2 | 1 | 4 | 6 | 5 | |
| 4+ levels | Regular | 2 | 7 | 1 | 9 | | |
| 4+ levels | Spiky | 1 | 1 | 7 | 8 | | |

Dataset extraction

GitHub

The datasets named gh-* were selected by scraping the first 10 pages of gitmostwanted.com, a website that lists popular GitHub repositories.

For each selected URL (the list can be found here), we ran a Python script that clones revisions of the repository at a given periodicity. For this batch we used monthly extractions (-m), but the script also accepts flags to clone yearly (-y), daily (-d), or all revisions (-a).
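The monthly selection can be sketched as follows (a hypothetical helper, not the actual script): walk the commit history in chronological order and keep the last commit of each calendar month.

```python
from datetime import datetime

def last_commit_per_month(commits):
    """commits: iterable of (sha, iso_timestamp) pairs, oldest first.
    Returns the last commit of each calendar month, in chronological order."""
    latest = {}  # (year, month) -> sha
    for sha, ts in commits:
        t = datetime.fromisoformat(ts)
        latest[(t.year, t.month)] = sha  # later commits overwrite earlier ones
    return [latest[key] for key in sorted(latest)]

history = [
    ("a1", "2018-01-03T10:00:00"),
    ("b2", "2018-01-28T09:30:00"),
    ("c3", "2018-02-14T16:45:00"),
]
print(last_commit_per_month(history))  # ['b2', 'c3']
```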

For each cloned revision, the script invokes the CLOC tool to list the source files and count their lines of code. CLOC is a faster, free alternative to Understand.
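For illustration, the per-language counts can be read from cloc's machine-readable report (produced by its `--json` flag); the summing helper below is a hypothetical sketch, not the actual script:

```python
def total_loc(cloc_json):
    """Sum the 'code' lines over all languages in a `cloc --json` report,
    skipping the 'header' and 'SUM' bookkeeping entries."""
    return sum(v["code"] for k, v in cloc_json.items()
               if k not in ("header", "SUM"))

# In the real pipeline this dict would come from parsing the output of
# `cloc --json <revision_dir>`; here we use a hand-made sample report.
report = {"header": {"cloc_version": "1.81"},
          "Python": {"nFiles": 3, "blank": 40, "comment": 12, "code": 310},
          "JavaScript": {"nFiles": 1, "blank": 5, "comment": 2, "code": 88},
          "SUM": {"nFiles": 4, "blank": 45, "comment": 14, "code": 398}}
print(total_loc(report))  # 398
```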

Datasets are named as gh-<name_of_repository>-<periodicity>.

Ex.: gh-svgo-m is the github.com/svg/svgo repository extracted at the last commit of each month (-m).

The scripts and notebooks used for scraping, metric collection, and file generation can be found here.

The datasets named GitHub* were borrowed from previous works ([1] and [2]) that evaluated dynamic treemaps. The gh* datasets were extracted on July 16, 2018; the GitHub* datasets were extracted in May 2017.

WorldBank

The WorldBank Group offers an open database with hundreds of indicators of global development. The measurements can be downloaded as a single large CSV file.

Based on that, we turn each indicator with a minimum number of observations into a dataset for our application. Each row in the original data represents one region/country entry, and the columns give the indicator value for that entry in a given year, with one column for each year since the indicator was first measured.
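The reshaping step can be sketched as below, assuming the standard WorldBank column names ("Country Name", "Indicator Code", one column per year); the helper and the minimum-observation threshold are illustrative:

```python
import csv, io

def split_by_indicator(csv_text, min_observations=2):
    """Turn the wide WorldBank CSV (one row per country/indicator, one column
    per year) into {indicator_id: [(country, year, value), ...]}, keeping
    only indicators with at least `min_observations` non-empty values."""
    datasets = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        indicator = row["Indicator Code"]
        for col, val in row.items():
            if col.isdigit() and val:  # year columns such as "1990", "1991", ...
                datasets.setdefault(indicator, []).append(
                    (row["Country Name"], int(col), float(val)))
    return {k: v for k, v in datasets.items() if len(v) >= min_observations}

raw = ("Country Name,Indicator Code,1990,1991\n"
       "Brazil,EN.URB.MCTY.TL.ZS,35.1,35.8\n"
       "Japan,EN.URB.MCTY.TL.ZS,44.2,\n")
data = split_by_indicator(raw)
print(sorted(data))  # ['EN.URB.MCTY.TL.ZS']
```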

Datasets are named wb-<indicator_id>, where the indicator_id is a unique code that maps to one global indicator.

For example: wb-TX.VAL.MMTL.ZS.UN corresponds to “Ores and metals exports (% of merchandise exports)” and wb-EN.URB.MCTY.TL.ZS corresponds to “Population in urban agglomerations of more than 1 million (% of total population)”.

The mapping of indicator ids to dataset descriptions can be found here.

Additional information about the datasets can be found here, with descriptions, periodicity, aggregation method, limitations and exceptions, statistical concepts and methodology, source, license type, etc. These datasets were extracted on July 4, 2018.

TMDB

The raw dataset contains over 20 million reviews of 27,278 movies. Each user review has a rating (a value from 0 to 5) and a timestamp of when it was posted. The first reviews date from January 1995 and the last from March 2015. Each movie has an id, a year of release, and an alphabetically sorted list of genres. To extract 100+ datasets from this source, we defined four degrees of freedom (hierarchy, cell weight, time aggregation, filters). The bracketed keys are used to identify the datasets.

  • Hierarchy
    • Using genre information. 0 to 7 levels (some movies don’t have this info). [genre] - Ex.: Adventure/Animation/Children/Comedy/ToyStory
    • Using year information. Always two levels. [year] - Ex.: 1995/ToyStory
  • Cell weight
    • Number of reviews in a given period. [count]
    • Average rating in a given period. [average]
    • Review standard deviation in a given period. [std]
  • Time aggregation by review timestamp
    • Aggregate monthly (~240 months or revisions). [monthly]
    • Aggregate yearly (~20 years or revisions). [yearly]
  • Filters
    • Only action movies. [action]
    • Only children movies. [children]
    • Only documentaries. [documentary]
    • Only movies from the 60s, 70s, and 80s. [60sto80s]
    • Only movies from 90s. [90s]
    • Only movies from 2000 onwards. [00stonow]
    • Only movies with 4 or more genres. [4plusngeres]
    • Only movies without genre info. [nogenre]
    • 2000 randomly selected movies. [2krand]

This gives us 2 * 3 * 2 * 9 = 108 datasets. Datasets are named tmdb-<periodicity>-<weight_value>-<hierarchy_type>-<filter>.
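Under these conventions, the full grid of names can be enumerated from the bracketed keys (a sketch using the keys listed above):

```python
from itertools import product

periodicities = ["monthly", "yearly"]
weights = ["count", "average", "std"]
hierarchies = ["genre", "year"]
filters = ["action", "children", "documentary", "60sto80s",
           "90s", "00stonow", "4plusngeres", "nogenre", "2krand"]

# tmdb-<periodicity>-<weight_value>-<hierarchy_type>-<filter>
names = ["tmdb-{}-{}-{}-{}".format(p, w, h, f)
         for p, w, h, f in product(periodicities, weights, hierarchies, filters)]
print(len(names))  # 108
```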

For example: tmdb-monthly-average-genre-documentary gives the average monthly rating for documentaries using the genre hierarchy, and tmdb-yearly-count-year-90s counts the number of reviews 90s titles received each year using the hierarchy given by Year/Title.

The code that generated the datasets is available here. The datasets were generated on July 14, 2018.

Movielens

The MovieLens database contains 45,000 movies, 750,000 keywords attached to these movies, and 26 million time-stamped 0-to-5-star ratings spanning roughly 22 years. The naming convention for these files is as follows:

Movies<hierarchical><cumulative><collection_period><aggregation_period>

If the name contains the H of <hierarchical>, the dataset has a fixed hierarchy, constructed as follows.

We first partitioned on the genres Crime, Adventure and Drama; second on the movie release date (before vs after 2010); third on the tags Ford, Pitt, Depp, Hanks, Stewart, Cooper, Grant, and Flynn; fourth on whether the movie title contains the word Act, War, Love, Time, Spirit, Night, or any other word; fifth on whether tags contain Friend, Enemy or neither/both; and finally on whether tags contain Past, Future or neither/both. This results in a very deep hierarchy with a large imbalance in the tree depths.
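A sketch of this partitioning as a classification function; the field names, the pre/post-2010 split direction, and the tie-breaking for movies matching several genres or tags are assumptions:

```python
def hierarchy_path(movie):
    """Assign a movie to its path in the hierarchy sketched above.
    `movie` is a hypothetical dict with 'genres', 'year', 'tags', 'title'."""
    path = []
    # 1) genre, 2) release date, 3) actor tag, 4) title word
    path.append(next((g for g in ("Crime", "Adventure", "Drama")
                      if g in movie["genres"]), "Other"))
    path.append("post2010" if movie["year"] >= 2010 else "pre2010")
    actors = ("Ford", "Pitt", "Depp", "Hanks", "Stewart", "Cooper", "Grant", "Flynn")
    path.append(next((t for t in actors if t in movie["tags"]), "OtherTag"))
    words = ("Act", "War", "Love", "Time", "Spirit", "Night")
    path.append(next((w for w in words if w in movie["title"]), "OtherWord"))
    # 5) Friend/Enemy tags, 6) Past/Future tags (neither/both -> "Neither")
    friend, enemy = "Friend" in movie["tags"], "Enemy" in movie["tags"]
    path.append("Friend" if friend and not enemy
                else "Enemy" if enemy and not friend else "Neither")
    past, future = "Past" in movie["tags"], "Future" in movie["tags"]
    path.append("Past" if past and not future
                else "Future" if future and not past else "Neither")
    return "/".join(path)

m = {"genres": ["Drama"], "year": 1998,
     "tags": ["Hanks", "Friend"], "title": "Love Letter"}
print(hierarchy_path(m))  # Drama/pre2010/Hanks/Love/Friend/Neither
```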

If the name does not contain the H of <hierarchical>, the dataset consists of a single level. In that case we only considered movies with the genres Action and Adventure to trim down the sheer number of movies.

If the name contains the C of <cumulative>, we consider all ratings from the start of the collection period up to the current sample. Otherwise we only consider the ratings in the current aggregation period.

The <collection_period> indicates over how long a period we collected the data. The <aggregation_period> indicates how much time passes between two samples. Y indicates a year, M a month, D a day, and H an hour.

A few examples are: Movies15M1D, Movies3Y3M, MoviesC4Y90H, MoviesH22Y7M, MoviesHC10Y2M.
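This convention can be decoded with a small parser (a sketch; `parse_name` is a hypothetical helper, not part of the published scripts):

```python
import re

def parse_name(name):
    """Parse Movies[H][C]<collection><aggregation> names, e.g. 'MoviesHC10Y2M'.
    A period is a number followed by a unit letter: Y, M, D, or H."""
    m = re.fullmatch(r"Movies(H?)(C?)(\d+)([YMDH])(\d+)([YMDH])", name)
    if m is None:
        raise ValueError("not a MovieLens dataset name: " + name)
    h, c, col_n, col_u, agg_n, agg_u = m.groups()
    return {"hierarchical": h == "H", "cumulative": c == "C",
            "collection": (int(col_n), col_u), "aggregation": (int(agg_n), agg_u)}

print(parse_name("MoviesHC10Y2M"))
```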

The datasets were generated February 10, 2018.

Flattened datasets

Datasets prefixed by f-* have been flattened. They come from the same sources mentioned previously but have no hierarchy in the data; that is, the tree has a depth of 1.
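The flattening can be sketched as collapsing every subtree into a single level of leaves (a hypothetical helper; the real f-* files may key leaves differently, e.g. by full path):

```python
def flatten(tree):
    """Collapse a nested {name: subtree-or-weight} hierarchy into one
    level keyed by leaf names, keeping the leaf weights."""
    flat = {}
    for name, child in tree.items():
        if isinstance(child, dict):
            flat.update(flatten(child))  # recurse into subtrees
        else:
            flat[name] = child           # keep leaves as-is
    return flat

tree = {"src": {"main.py": 120, "util": {"io.py": 45}}, "README.md": 30}
print(flatten(tree))  # {'main.py': 120, 'io.py': 45, 'README.md': 30}
```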