The State of Haskell is a survey whose goal is to sample the Haskell community’s demographics, tooling preferences, opinions on the language and more. It was devised by Taylor Fausak, who has run it annually since 2017. Every November, the results are published online (2017, 2018, 2019, 2020) in the form of descriptive statistics and plots.
This year I decided to attempt cluster analysis on the data.
(If you are only interested in the results, skip to the paragraph «From clusters to opinions».)
The analysis will proceed in two steps:
1. find distinct segments (clusters) among the survey takers;
2. examine how the distinct segments differ in their opinions on the community/language (does Haskell’s performance meet the needs of academics? Do Windows users think Haskell libraries are well documented?).
As the input for the algorithm, I selected 13 questions/variables from the survey which I feel capture various elements of what «being a Haskeller» means.
☞ NB: due to a bug in the 2020 survey, some entries (763) were not correctly recorded. I excluded those and worked with the rest (N=603).
Since most of the variables are categorical or ordinal, we first compute the dissimilarity matrix (taking the surveys two at a time, how different are they?). The computation allows us to plot a hierarchical dendrogram, where we start at the top with the set of all surveys and progressively reach a finer granularity at the bottom.
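For the curious, here is a minimal Python sketch of how such a dissimilarity can be computed. It assumes a Gower-style coefficient (a common choice for mixed categorical/ordinal data, not necessarily the exact measure behind the plots), and the respondents, field names and proficiency levels are made up for illustration.

```python
from itertools import combinations

# Hypothetical ordinal scale; the real survey levels may differ.
PROFICIENCY = ["Beginner", "Intermediate", "Advanced", "Expert"]


def gower(a, b, categorical, ordinal):
    """Gower-style dissimilarity between two respondents (dicts).

    Categorical fields contribute 0 if equal and 1 otherwise; ordinal
    fields contribute the rank difference scaled to [0, 1]. Fields that
    are missing (None) in either record are skipped.
    """
    total, used = 0.0, 0
    for field in categorical:
        if a[field] is None or b[field] is None:
            continue
        total += 0.0 if a[field] == b[field] else 1.0
        used += 1
    for field, levels in ordinal.items():
        if a[field] is None or b[field] is None:
            continue
        span = len(levels) - 1
        total += abs(levels.index(a[field]) - levels.index(b[field])) / span
        used += 1
    return total / used if used else 0.0


def dissimilarity_matrix(rows, categorical, ordinal):
    """Symmetric matrix of pairwise dissimilarities, zero on the diagonal."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = gower(rows[i], rows[j], categorical, ordinal)
    return d


# Made-up respondents, for illustration only.
respondents = [
    {"student": "No", "devOnWin": False, "proficiency": "Advanced"},
    {"student": "No", "devOnWin": True, "proficiency": "Intermediate"},
    {"student": "Yes, full time", "devOnWin": False, "proficiency": "Beginner"},
]
matrix = dissimilarity_matrix(
    respondents, ["student", "devOnWin"], {"proficiency": PROFICIENCY}
)
```

Each entry lies between 0 (identical answers) and 1 (different on every compared field), which is what lets categorical and ordinal questions share one matrix.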
The higher the vertical distance from the parent node, the bigger the dissimilarity. We have a fairly well-spaced tree; splitting it into five branches seems a natural decision.
The green and teal groups hang a long distance from their parents, which means they are most likely distinct segments, quite different from their neighbours. We will temporarily label the clusters with Roman numerals: I, II, III, IV and V.
groups
  I  II III  IV   V
224  40  76 200  63
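Cutting the tree into a fixed number of groups and counting their members can be sketched as follows. This is a naive Python illustration under stated assumptions: it uses average linkage (the linkage method actually used is not specified here), and the `toy` matrix is invented.

```python
from collections import Counter


def agglomerate(dist, k):
    """Merge the two closest clusters (average linkage) until k remain.

    `dist` is a symmetric matrix of pairwise dissimilarities. Returns one
    label (0 .. k-1) per observation; stopping at k clusters is the same
    as cutting the dendrogram where it has k branches.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average dissimilarity between all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [None] * len(dist)
    for label, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = label
    return labels


# Toy dissimilarities: 0 and 1 are close, 2 and 3 are close, 4 stands apart.
toy = [
    [0.0, 0.1, 0.8, 0.9, 0.7],
    [0.1, 0.0, 0.9, 0.8, 0.7],
    [0.8, 0.9, 0.0, 0.2, 0.6],
    [0.9, 0.8, 0.2, 0.0, 0.6],
    [0.7, 0.7, 0.6, 0.6, 0.0],
]
groups = agglomerate(toy, 3)
sizes = Counter(groups)  # members per cluster, like the count table above
```

The loop is cubic or worse in the number of observations, which is fine for a few hundred surveys but not for large data sets.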
Now that we have sliced the clusters, we need to inspect them to see what is inside and which variables actually helped to discriminate between them. As an example, Haskellers who use Haskell at work «most of the time» are almost all captured in one cluster (IV):
but it is difficult to see any obvious differences in the region division:
As an exploratory tool, let us print the statistical mode for each question/cluster pair:
* How old are you?
I 25 to 34 years old
II 25 to 34 years old
III 18 to 24 years old # ← slightly younger
IV 25 to 34 years old
V 25 to 34 years old
* What is your gender?
I Male
II Male
III Male
IV Male
V Male
* region
I Europe & Central Asia
II Europe & Central Asia
III Europe & Central Asia
IV Europe & Central Asia
V Europe & Central Asia
# As expected, the demographics of Haskell users are not that varied.
# Cluster III skews younger. Are they students by any chance?
* Are you a student?
I No
II No
STU Yes, full time # Yes they are, we will relabel the cluster
IV No # `STU` for easier recognition.
V No
* Do you use Haskell?
I Yes
NON No, but I used to # Former users are the majority in cluster II,
STU Yes # we will relabel it to `NON`.
IV Yes
V Yes
* Do you use Haskell at work?
I No, but I'd like to
NON No, but I'd like to
STU No, but I'd like to
WRK Yes, most of the time # Another defining feature, another relabel
V No, but I'd like to # (Cluster IV ⟶ WRK)
* How long have you been using Haskell?
I 1 year to 2 years # Cluster I has somewhat experienced users.
NON 1 month to 1 year
STU 1 month to 1 year
WRK 10 years to 15 years
V 3 years to 4 years
* What is the total size of all the Haskell projects you contribute to?
I <NA>
NON <NA>
STU <NA>
WRK Between 10,000 and 99,999 lines of code
V <NA>
# Unsurprisingly, professional Haskellers have been using the language
# for longer and on bigger codebases.
* How would you rate your proficiency in Haskell?
I Intermediate
NON Intermediate
STU Intermediate
WRK Advanced
V Intermediate
* devOnWin
I FALSE
NON FALSE
STU FALSE
WRK FALSE
V TRUE
* targetWin
I FALSE
NON FALSE
STU FALSE
WRK FALSE # Cluster V users target and develop on Windows.
WIN TRUE # We will relabel the cluster to `WIN`.
* cabalOrStack
I Stack only
NON Stack only
STU Stack only
WRK Cabal only
WIN Stack only
* academia
I FALSE
NON FALSE
STU FALSE
WRK FALSE
WIN FALSE
* Have you contributed to any open source projects?
RES TRUE
NON TRUE
STU TRUE # Not finding any evident quality for Cluster I,
WRK TRUE # I will relabel it as RES (for «residual»).
WIN TRUE
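The mode listing above can be reproduced with a few lines of Python. This is only a sketch: the row layout and the `cluster` key are assumptions, and blank answers are counted as `<NA>` so that they can win the vote, as they do for the project-size question.

```python
from collections import Counter, defaultdict


def mode_per_cluster(rows, question, cluster_key="cluster"):
    """Most frequent answer to `question` within each cluster.

    Blank answers (None) are tallied under "<NA>" rather than dropped.
    """
    answers = defaultdict(Counter)
    for row in rows:
        value = row.get(question)
        answers[row[cluster_key]]["<NA>" if value is None else value] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in answers.items()}


# Made-up rows, for illustration only.
rows = [
    {"cluster": "III", "student": "Yes, full time"},
    {"cluster": "III", "student": "Yes, full time"},
    {"cluster": "III", "student": "No"},
    {"cluster": "IV", "student": "No"},
    {"cluster": "IV", "student": None},
    {"cluster": "IV", "student": "No"},
]
modes = mode_per_cluster(rows, "student")
```

Keep in mind that the mode hides how close the runner-up was, so a cluster can look more homogeneous than it is.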
The first coarse classification leaves us with five buckets: professional Haskell users (WRK), students (STU), Windows devs (WIN), former users (NON) and a residual category (RES) of slightly experienced users.
The mode does not tell the whole story, so here are bar plots for every question/cluster combination:
The Residual group loves Stack.
The Residual group has some experience with Haskell at work, but not as much as the WRK cluster.
It is obvious from looking at the Residual cluster that it captures multiple traits; we could tentatively think of it as Haskellers who are employed and would love to introduce some more Haskell in their job (but have not managed to yet).
Now that we have our clusters and we have put labels on them, it is time to check what each segment thinks about various Haskell topics. Let us recall the groups once again for anyone starting to read from here:
former users (NON); students (STU); professional Haskell users (WRK); Windows devs (WIN); and a residual group (RES) composed of programmers with different backgrounds who would like to introduce Haskell in their $DAYJOB.
The survey contains a section named «Feelings» (statements from «I would prefer to use Haskell for my next new project» to «As a candidate, I can easily find Haskell jobs»), where the survey taker indicates how much they agree on a five-point scale (1: Strongly Disagree, … 5: Strongly Agree).
Here is a table with the mean for each question/group. The table is ordered by descending intergroup variance, so the questions where the community is split are listed at the top, while the questions where everyone agrees are placed at the bottom.
Statement                                              NON   STU   WRK   WIN   RES
Haskell is critical to my company's success.           2.11  2.50  3.86  2.16  2.72
Haskell is working well for my team.                   2.74  3.43  4.36  3.16  3.41
I have a good understanding of H. best practices.      2.56  2.97  3.71  2.67  2.99
As a hiring manager, I can easily find
  qualified Haskell candidates.                        2.15  2.65  3.28  2.31  2.72
As a candidate, I can easily find Haskell jobs.        1.86  2.10  2.79  1.73  2.12
I would recommend using Haskell to others.             3.68  4.53  4.49  3.96  4.27
I think that software written in Haskell is
  easy to maintain.                                    3.57  4.03  4.40  3.71  4.13
I would prefer to use Haskell for my next project.     3.77  4.42  4.61  4.16  4.34
I am satisfied with Haskell as a language.             3.55  4.27  4.26  3.88  4.18
I think that Haskell libraries are easy to use.        2.67  3.22  3.36  2.90  2.97
I think that Haskell libraries work well together.     3.00  3.50  3.66  3.15  3.43
I think Haskell libraries are well documented.         2.50  3.00  2.91  2.47  2.66
I think Haskell libraries are high quality.            3.31  3.79  3.92  3.67  3.81
I can easily compare competing Haskell libraries.      2.58  2.59  2.92  2.34  2.50
I am satisfied with Haskell's compilers.               3.88  4.15  4.16  3.76  4.25
I can find H. libraries for the things I need.         3.34  3.66  3.75  3.26  3.51
I am satisfied with Haskell's build tools.             3.12  3.38  3.54  3.12  3.45
I can easily reason about the performance of my code.  2.43  2.68  2.80  2.43  2.65
Once my Haskell program compiles, it generally does
  what I intended.                                     4.00  4.18  3.94  3.79  4.06
I think that Haskell libraries provide a stable API.   3.21  3.52  3.44  3.43  3.47
Haskell's performance meets my needs.                  3.87  4.09  4.09  3.87  4.05
I feel welcome in the Haskell community.               3.90  4.07  4.08  4.04  3.88
I am satisfied with Haskell's package repositories.    3.69  3.93  3.80  3.72  3.79
I think that Haskell libraries perform well.           3.68  3.75  3.86  3.62  3.68
(Philoustic provided superb visualisations for this table; see Bonus.)
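The ordering by intergroup variance can be sketched in Python as below. The data layout is an assumption, the scores are invented, and I use the population variance of the group means; with the same number of groups for every statement, the sample variance would give the same ordering.

```python
from statistics import mean, pvariance


def rank_by_intergroup_variance(scores):
    """Sort statements by how much their group means disagree.

    `scores` maps statement -> {group: list of 1..5 answers, None = blank}.
    Returns (statement, group_means, variance) tuples, most divisive first.
    """
    table = []
    for statement, by_group in scores.items():
        means = {
            group: mean([v for v in values if v is not None])
            for group, values in by_group.items()
        }
        table.append((statement, means, pvariance(list(means.values()))))
    return sorted(table, key=lambda row: row[2], reverse=True)


# Invented answers, for illustration only.
scores = {
    "I feel welcome in the Haskell community.": {
        "NON": [4, 4, None],
        "WRK": [4, 4, 4],
    },
    "Haskell is critical to my company's success.": {
        "NON": [2, 2, 2],
        "WRK": [4, 4, None],
    },
}
ranking = rank_by_intergroup_variance(scores)
```

A group whose answers are all blank would make `mean` raise an error, so real data needs a guard for that case.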
There are some differences, but in most questions the averages are not that distant from each other; this is expected when using a 5-point scale.
To make comparison between clusters easier, I will use relative percentages and not absolute values in the graphs (thanks to Philoupap for suggesting this). E.g.:
This means that around 40% of the «Students» segment and around 40% of the «Workers» segment strongly agree with «I am satisfied with Haskell as a language», although obviously the latter cluster, and hence its absolute count, is bigger than the former.
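As a small illustration of the difference between relative and absolute values, here is a Python sketch with invented answers. The level names other than «Neutral», «Agree», «Strongly Agree» and «Strongly Disagree» are my assumption, as is the choice to drop blank answers before computing shares.

```python
from collections import Counter

# Assumed level names for the five-point scale.
LEVELS = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]


def relative_shares(answers_by_cluster):
    """Percentage of each answer level within each cluster (blanks dropped)."""
    shares = {}
    for cluster, answers in answers_by_cluster.items():
        given = [a for a in answers if a is not None]
        if not given:
            continue  # nothing to normalise for this cluster
        counts = Counter(given)
        shares[cluster] = {lvl: 100 * counts[lvl] / len(given) for lvl in LEVELS}
    return shares


# Invented answers: the clusters differ in size but not in share.
answers = {
    "STU": ["Strongly Agree", "Strongly Agree", "Agree", "Neutral", "Agree"],
    "WRK": ["Strongly Agree"] * 4 + ["Agree"] * 5 + ["Disagree"],
}
shares = relative_shares(answers)
```

Here two students and four workers strongly agree, yet both amount to 40% of their own cluster, which is exactly what makes clusters of different sizes comparable.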
Selected plots and comments:
I do not consider the top two questions (by intergroup variance) to be meaningful: some people picked «Neutral» instead of leaving the answer blank when the question did not apply to them. Still, it is encouraging to see some «Agree» and «Strongly Agree» in the RES group (a portion of which uses some Haskell at work).
The biggest pain points for people who tried Haskell and then abandoned it are understanding what «Best practices» are and the ability to find a Haskell job.
Windows users are less satisfied than other users with Haskell tooling. Sometimes (satisfaction with Haskell compilers and library documentation) they are even less happy than people who abandoned Haskell!
Every group finds it difficult to reason about Haskell code performance; every group finds it difficult to compare competing Haskell libraries. In both cases, professionals fare slightly better, but still below the pass mark.
On the bright side, every group — even ex Haskellers — feels welcome in the Haskell community.
Even the RES group («onboarding» users) is positive about the language and would like to use it in the future. Windows and ex-users trail.
If you want to check all the graphs, search the construction folder (feelings).
The goal of this analysis was to find groups among Haskell users and examine their preferences. Clustering was successful and led to sensible (if a bit coarse) buckets. Sentiments vary between groups, though not as much as I expected: this could be because of how the sampling was conducted (2020 bug, coarse 5-point scale) or because the community genuinely holds similar ideas.
Philoustic provided six more overview visualisations for the «Feelings» section. In his own words:
NA scores are ignored.
Score values span between 2 ("strongly agree") and -2 ("strongly disagree")
with 0 for neutral.
I've appended a dashed line at the general or cluster mean value.
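The rescaling described above amounts to shifting the 1 to 5 answers by three. A minimal Python sketch, with invented answers and a hypothetical helper name:

```python
from statistics import mean


def centred_mean(scores):
    """Shift 1..5 answers to -2..2 and average them, ignoring blanks (None)."""
    shifted = [s - 3 for s in scores if s is not None]
    return mean(shifted) if shifted else None


# Invented answers: one blank, which is skipped.
overall = centred_mean([5, 4, None, 3, 1])
```

A positive mean then reads as net agreement and a negative one as net disagreement, which is the position where a dashed mean line would be drawn.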
First general:
And then one for each cluster: