Human-Readable ◉‿◉

UDisc Golf Project: Scorecard Analysis

Background

For those less familiar with disc golf ("flying disc sport"), think ball golf with a few exceptions. For example, instead of striking a golf ball with a club, each player throws a disc toward the hole. In this variation of the sport the target is a metal basket with chains (side note - many people mistake these baskets for garbage containers!). Similar to ball golf the first attempt is performed from a tee box, whereas each subsequent toss may be performed from various terrain.

The typical length of a course is 9 or 18 holes. Ultimately, the goal is to score in the basket with as few attempts as possible. One this goal is achieved scores are tallied and then added to the scorecard.

This dataset consists of user entered values collected by a mobile app: UDisc. We'll also include a secondary dataset imported from 'Historical Climate Data' to gather weather data. Check out the data dictionary in Github to learn more about each dataset.

Preview of Udisc.scorecards

Preview of Udisc.climate_2021

To replicate the results shown throughout follow you can reference saved files on Github. Once the schema is set-up run each code snippet on your computer (copy & paste).

The questions featured in the analysis resulted from brainstorming sessions accompanied by a friend. This friend (alias 'Cvance') generously shared the data for which I am grateful for his generosity and friendship. Thank you!

Diving Into The Data

Q: How many scorecards were created for each disc golfer between 2020-01-01 and 2021-07-07?

For this question we'll include the number of rounds played, average score, and standard deviation. Start by grouping PlayerName then applying multiple aggregate functions to calculate average scores and standard deviation about the mean.

SELECT 
    PlayerName,
    ROUND(AVG(net_score),1) AS Avg_Net_Score,
    ROUND(STD(net_score),1) AS Avg_Net_Score,
    COUNT(net_score) AS Rnds_Played
FROM scorecards
GROUP BY PlayerName
ORDER BY ROUND(AVG(net_score),1)

MySql Workbench

As expected, Cvance played considerable more rounds (34) over a two year period than the other players. Pink Leopard was a distant second, followed by Spence, with 21, and 11 scorecards submitted, respectively.

The average net scores for each player are shown in the bar chart below. Keep in mind lower scores signal higher performance.

Tableau Public

Q: Were there any weeks where Cvance played more than one game of disc golf? Show the week and frequency in the output.

Let's use date values as a unique identifier to aggregate results.

Start by using MySql YEAROFWEEK() function within the SELECT statement to compute year+week. This transformation results in integer values that are both unique and will not be counted more than once. COUNT() is used to count the number of times each date appears in the rows that are filtered by Cvance.

SELECT *
FROM (   
    SELECT 
        YEARWEEK(DATE) AS Week_of_Year, 
        COUNT(Date) AS Freq_Per_Wk
    FROM scorecards
    WHERE PlayerName= 'Cvance'
    GROUP BY Week_of_Year
    ORDER BY COUNT(Date) DESC
    ) Table1
WHERE Freq_Per_Wk > 1

MySql Workbench

Cvance played more than one round of disc golf on 10 occasions. In April 2021 he played a whopping 8 rounds over a two week period. Way to go slinger! But what was so special about the month of April that facilitated greater involvement in the sport?

If we map the timeline to the week of April 12th in Ontario, Canada lockdown restrictions were in place. What better way to lift the "COVID-19 lockdown blues" than to spend time playing outside!

Source: Google via Wikipedia, government health ministries, The New York Times, and other authoritative sources, as attributed.

Q: Is there a relationship between time of day and performance? Produce a query that shows descriptive statistics comparing scorecards between morning, afternoon, and evening. Include average net score (standard deviation), and rounds played. Filter results by 'Cvance'.

We'll need to use a subquery to get the job done. Table1 is used to query the time of day and net score. Here the Time() function is useful to parse values from the Datetime to Time datatype.

The next step involves sorting and aggregating values. Subquery Table1 with a CASE statement to categorize the timestamps into three 'parts of the day' formatted as string values. Other fields are calculated with aggregate functions. After defining which fields to include in the SELECT clause the records selected are then grouped by the (new) parts of day categories.

SELECT 
    CASE
        WHEN start_time BETWEEN '08:00:00' AND '12:00:00' 
        THEN 'Morning   (08:00:00-12:00:00)' 
        WHEN start_time BETWEEN '12:00:01' AND '17:00:00' 
        THEN 'Afternoon (12:00:01-17:00:00)'
        ELSE 'Evening   (17:00:00-24:00:00)'
    END AS Parts_of_Day,
    ROUND(AVG(Net_score),1) AS Avg_Net_Score,
    ROUND(STD(Net_score),1) AS STDDEV,
    COUNT(Net_score) AS Num_Rounds
FROM (
    SELECT
        TIME(Date) AS start_time,
        Net_score
    FROM scorecards
    WHERE PlayerName = 'Cvance'
    ) AS Table1
GROUP BY Parts_of_Day
ORDER BY Avg_Net_Score

MySQL WorkBench

The results show that most scorecards were submitted during the afternoon followed by morning then evening. Without performing further analysis the results suggest Cvance prefers playing in the afternoon. However, on average higher performances were achieved in the evening. Note that a larger sample size for the evening might help strengthen our confidence when interpreting these findings.

Q: List Cvance's top 5 best rounds of disc golf. Filter for courses with 18 holes or greater. Include in the output net score and location. Where applicable, create a list of other players with their respective scores for each round.

The .scorecards table contains all the info we need to produce the intended output. Starting from the inside then work outwards. Within the subquery create 4 new columns with function: GROUP_CONCAT. New resulting fields return a string with concatenated non-NULL value from the group. The WHERE clause filters scorecards with blank values on the 10th hole as a filter criteria.

Using the subquery a series of string functions are applied to isolate the desired string value from Player_Group that reflects Cvance's score. The CAST statement is used to convert the resulting string value (e.g. '-1') to an integer. With this transformation, we can take advantage of the fact that numerical values (unlike strings) can be ordered by the size of the quantity of the value. Lastly, because the question asks for the top 5 scores the LIMIT clause is set to 5.

SELECT 
    Date, 
    Course_Name, 
    Layout_Name, 
    Player_Group, 
    CAST(REPLACE(SUBSTRING(Player_Group, LOCATE('Cvance',   
    Player_Group),11),'Cvance (','') AS SIGNED)
        AS Cvance_Net_Score
FROM
    (SELECT
        Date, 
        GROUP_CONCAT(DISTINCT PlayerName, CONCAT(' (', Net_Score,
        ') ')) AS Player_Group,
        GROUP_CONCAT(DISTINCT Net_Score, ' ') AS Net_Scores,
        GROUP_CONCAT(DISTINCT CourseName, ' ') AS Course_Name,
        GROUP_CONCAT(DISTINCT LayoutName, ' ') AS Layout_Name
    FROM scorecards
    WHERE Hole10 IS NOT NULL
    GROUP BY Date) AS Table2
WHERE LOCATE('Cvance', Player_Group)
ORDER BY Cvance_Net_Score
LIMIT 5

MySQL WorkBench

The results tell us that Cvance's best score was achieved at Riverside Park DGC during the Guelph Icebowl 2021 - an annual charity event hosted in March, 2021. I find it interesting to note that 4/5 top scores were achieved in groups. At face value the results suggest Cvance performs better when competing under pressure and within groups. These conditions often go hand-in-hand.

The histogram shown below is a distribution of Cvance's net scores derived from 34 rounds of disc golf. As a reminder, for those less familiar with how golf is scored, the lower the net score the better the performance.

Tableau Public

Q: Does Cvance perform better in a group or solo? In the output, include the number of rounds played for each category (group vs. solo) along with the average and standard deviation for net scores.

The beauty of code is that we can more or less recycle the previous query (RE: top 5 scores). We'll make a few minor tweaks though. The first edit is to remove the WHERE clause from Table2 (RE: scores greater than >9 holes) because the question doesn't specify to filter by the number of holes played. A CASE statement creates a new field that will capture Group_Status as string type. Lastly, the GROUP_CONCAT function is used in combination with DISTINCT statement to label records so it's clear which values belong to each group.

SELECT
    GROUP_CONCAT(DISTINCT Group_Status) AS Num_Players,
    COUNT(Group_Status) AS Freq,
    ROUND(AVG(Cvance_Net_Score,1) AS Average_Net_Score,
    ROUND(STD(Cvance_Net_Score,1) AS Stdev_Net_Score
FROM
    (SELECT
        Date,
        Player_Group,
        CAST(REPLACE(SUBSTRING(Player_Group, LOCATE('Cvance',
        Player_Group),11),'Cvance (','') AS SIGNED)
            AS Cvance_Net_Score,
        CASE WHEN LOCATE(',',Player_Group)
            THEN 'Group' ELSE 'Solo'
            END AS GROUP_Status
    FROM 
        (SELECT
            Date,
            GROUP_CONCAT(DISTINCT PlayerName, CONCAT(' (',
            Net_Score, ') ')) AS Player_Group,
            GROUP_CONCAT(DISTINCT Net_Score, ' ') AS Net_Scores,
            GROUP_CONCAT(DISTINCT CourseName, ' ') AS Course_Name,
            GROUP_CONCAT(DISTINCT LayoutName, ' ') AS Layout_Name
        FROM scorecards
        GROUP BY Date) AS Table2
    WHERE LOCATE('Cvance', Player_Group)
    ORDER BY Cvance_Net_Score) AS Table1
GROUP BY Group_Status

Preview of Table2

Preview of Table1 (MySql Workbench)

The data suggests Cvance prefers to play in groups rather than solo (77.1% vs. 22.9%). Moreover, despite the fact that 4 of the top 5 scores fall into the group category (RE: Top 5 scores), on average Cvance performed similarly regardless of group status. In this case we could employ other descriptive statistics to determine whether results are statistically significant (e.g. Chi-Squared analysis). For the time being I will omit this additional step.

Q: How much disc golf did 'Cvance' play in the spring, summer, fall, and winter?

If you'll recall we grouped performance based on time of day using somewhat arbitrary cut-offs in a previous question. We will use a similar approach here.

A CASE statement will return string values that correspond to date ranges within the scorecards table (Table1). If a date value in field 'Dates' falls within a given date range return season name as string ('spring', 'summer'...). The SELECT statement (subquery) is executed prior to a evaluating our new query is using aggregate function COUNT() to group the total of rounds played for each season.

SELECT 
    Season, 
    COUNT(Season) AS Freq_Rounds
FROM (SELECT
    date,
    CASE
        WHEN date BETWEEN '2021-04-20' AND '2021-06-19'
            THEN 'Spring'
        WHEN date BETWEEN '2021-06-20' AND '2021-09-21'
            THEN 'Summer'
        WHEN date BETWEEN ('2020-09-22' AND '2020-12-20') AND
        ('2021-09-22' AND '2021-12-20')
            THEN 'Fall 2020-2021'
            ELSE 'Winter'
        END AS 'Season'
    FROM scorecards
    WHERE PlayerName = 'Cvance') AS Table1
GROUP BY Season
ORDER BY Freq_Rounds DESC

As shown below the majority of rounds played were during colder winter months. That's 81.4% of scorecards to be precise. In addition, no scorecards were submitted throughout the Fall seasons in 2020 or 2021.

MySql Workbench

Pro tip: when playing in snowy conditions, not only should you dress warmly with layers, consider taping a ribbon to the underside of each disc. In this way you'll be able to more easily find it after the disc is submerged in snow.

Speaking of Canadian winters, there are three tables in the UDisc database which presents an opportunity to try a table join by joining relevant rows and fields from climate_2020 and climate_2021. The results could provide added insight into the types of weather conditions that Cvance and his peers were willing to play in at the time.

As was implied in a previous discussion Cvance is not fair-weather. Lets see what the results show.

Q: Produce a query that shows summary data for maximum total precipitation, minimum/maximum mean temperature, and maximum snow on the ground. Only include weather conditions corresponding for dates with scorecards submitted.

The first step is to append rows from climate_2021 to climate_2020 then select for all relevant fields. The UNION operator is used to combine the result sets from two select statements as a subquery 'Table1' in preparation for a right join with the scorecards table.

Note the reason why the DATE function wraps the JOIN statements is to account for unique data types (Datetime vs. Date) prior to performing an inner join. Lastly, we'll use our primary query to aggregate functions to select the largest and smallest values for each weather category.

SELECT
	MAX(Total_precip_mm),
	MIN(Mean_Temp_C),
    MAX(Mean_Temp_C),
	MAX(Snow_On_Grnd_cm)
FROM scorecards
JOIN 
	(SELECT 
		Date,
		Total_precip_mm,
		Mean_Temp_C,
		Snow_On_Grnd_cm
	FROM climate_2020
	UNION
	SELECT
		Date,
		Total_precip_mm,
		Mean_Temp_C,
		Snow_On_Grnd_cm
	FROM climate_2021) AS Table1
ON DATE(scorecards.Date) = DATE(Table1.Date)
WHERE PlayerName = 'Cvance'

MySql Workbench

Preview of Table1 (MySql Workbench)

MySql Workbench

Q: Of the eight distinct players listed in the PlayerName field (i.e. having played at least 1 round with Cvance), produce an output that shows the number of rounds played together. Do not differentiate between pairs (1:1) and participation in larger groups. Include the grand sum in the output.

Start by using the combination of GROUP_CONCAT() and GROUP BY clauses to create a list of players who started the round at the same date and time. Table1 will serve as a subquery to derive seven new fields each representing the presence or absence of specified PlayerName combinations.

The result is another subquery consisting of a matrix of 0s and 1s which is then transformed into aggregate values using the SUM() function for each column as shown in Table2.

This is one of those queries where I was satisfied that I answered the question sufficiently however its apparent to me that combining SUM() with CASE statements could result in cleaner code that is more readable and more easily changed should the code need to be updated. Suggestions to this end are welcome.

SELECT 
	SUM(PL) AS PinkLeo, SUM(CT) AS Chris, SUM(D) AS Dumb, SUM(G) 
		AS Garret, 
	SUM(M) AS Maria, SUM(MP) AS Mark, SUM(S) AS Spence, SUM(L) 
		AS Leo,
    SUM(PL+CT+D+G+M+MP+S+L) AS Total
FROM (
	SELECT 
		PlayerGroup,
		CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
		LIKE '%Pink Leopard%' THEN 1 ELSE 0 
			END AS PL,
		CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
		LIKE '%Chris Thompson%' THEN 1 ELSE 0 
			END AS CT,
		CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
		LIKE '%Dumb%' THEN 1 Else 0 
			END AS D,
		CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
		LIKE '%Garret%' THEN 1 ELSE 0
			END AS G,
		CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
		LIKE '%Maria%' THEN 1 ELSE 0
			END AS M,
		CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
		LIKE '%Mark Petruk%' THEN 1 ELSE 0
			END AS MP,
		CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
		LIKE '%Spence%' THEN 1 Else 0
			END AS S,
        CASE WHEN PlayerGroup LIKE '%Cvance%' AND PlayerGroup 
        LIKE '%Leo Zdanov%' THEN 1 ELSE 0
			END AS L
	FROM
		(SELECT
			Date,
			GROUP_CONCAT(PlayerName,'')  AS PlayerGroup
		FROM scorecards
		GROUP BY Date
		) AS Table1 
) AS Table2

Preview of Table1 (MySql Workbench)

MySql Workbench

As shown in the output above nearly half of scorecards submitted by Cvance include PinkLeopard (45.6%) in those groups followed by Spence (19.5%), and last but not least, a two-way tie between Maria and Garret for third (13.9%).

Conclusion

Brainstorming sessions with my friend 'Cvance' gave rise to interesting questions that I could ask of the dataset such as comparing performance on the front-9 vs. back-9 for courses with 18 holes. This line of thinking while constructive (creative even) led to the realization that the dataset lacks the level of specificity required to answer this question among others. Two subtotal fields are needed that correspond to totals scores for the front and back nine. But hey, that being said this was a simple dataset perfect for a beginner such as myself. No complaints.

This concludes our analysis and discussion of the Udisc database. I hope you enjoyed the write up. Thank you for your time and attention. If you learned something new along the way even better. Disc golf is a great little hobby. The sport is affordable, accessible, and promotes time spent in the great outdoors.

Although it's early days in my endevour to create and share a data science blog; it occurred to me that this is a neat medium to structure learning. By making data science projects public to anyone for their reading pleasure, the focus for self-learning can go far beyond learning hard skills in data analytics to include soft skills and networking. The blog format also encouraged me to write code that is clean and straightforward to replicate.

"The number of ‘doing’ iterations drives the learning curve" Naval Ravikant