A process to document a GHC API #66


Open · wants to merge 4 commits into main

Conversation

facundominguez

facundominguez commented Mar 11, 2025

Latest abstract:

This proposal defines a process for specifying and documenting a GHC API. The process involves tooling authors, GHC developers, and the Haskell Foundation in defining and validating pieces of a GHC API in an incremental fashion.

Rendered


Old abstract:

This proposal is to build tools to define and maintain a GHC API. Some automation is necessary to monitor the needs of projects using GHC as a library and make GHC developers aware when their changes affect these projects. With this knowledge, the involved parts of GHC can be better defined and documented.

Old rendered

@noughtmare
Contributor

noughtmare commented Mar 13, 2025

Rendered

@hsyl20

hsyl20 commented Mar 14, 2025

To the best of my knowledge, no project has tried before to improve the documentation of the GHC implementation, though there have been efforts to refactor the implementation itself to make it easier to maintain and reuse. This author thinks that accessible documentation amplifies the benefits of any code changes.

It is a bit unfair to the refactorings we've done with @Ericson2314, @doyougnu, and others. Documentation was a major reason why I personally started doing this. From our 2022 "Modularizing GHC" white paper:

[screenshot: excerpt from the 2022 "Modularizing GHC" white paper]

In my experience documenting the code in its current state is often difficult because you end up documenting accidental complexity that shouldn't be here in the first place. For an actual example, see https://docs.google.com/document/d/1mQEpV3fYz1pHi64KTnlv8gifh9ONQ-jytk5sIHqnV9U/edit?tab=t.0#heading=h.xp3xd558qgs7 which was an attempt last year to document the big picture of cabal and ghc interaction: it's already a mess (and that's without documenting backpack).

About the proposal itself: I fear that making some part of the accidentally complex code now dubbed "GHC API" more difficult to change will mean that the accidental complexity will stay forever. But it depends on the indexing phase and it might also lead to fixing the code instead of ossifying it, so let's see.

@simonpj
Contributor

simonpj commented Mar 14, 2025

I wonder whether building tooling is the first thing to do? It's a bit of a meta-thing. We want a house, but instead of building a house, we first spend time building tools to help us build a house. It can be the right thing to do. But there is a danger that you spend lots of time building tools, only to discover, when using them, that they aren't quite the right tools after all.

I wonder if it would instead be better to spend that effort to:

  • Simply define and document an initially-small GHC API, perhaps the interface exported by GHC.API.

Some thoughts about this

  • It's a human endeavour, not a tool-ish one.
  • It is immediately useful:
    • Clients know that they can rely (more) on functions exported by GHC.API
    • GHC authors know to avoid modifying the type or semantics of functions exported by GHC.API unless absolutely necessary.
  • By starting small we can deliver something almost immediately. Then we can grow that thing.
  • It allows us to think of the API as a whole: does it make sense? We need to do more than take the union of all the functions that clients have ever called: we need to think what a well-designed API to that capability would look like. This is a design task that involves dialogue and interaction.
  • GHC.API will initially be inadequate: it probably won't fully satisfy any clients. But it gives the basis for a continuing dialogue: a client can say "I need X, and currently I'm forced to call functions f and g from GHC's internal modules; could GHC.API export X somehow?". That suggests a design conversation about what the API for capability X might look like. (e.g. typechecker-solver plugins need to construct "evidence" for constraints. What does the API to this evidence construction look like?)
  • Perhaps as GHC.API grows, there will indeed be issues of scale that suggest tooling support. Tooling is great when it is driven by "I'm doing task X every day; if I took 3 days to build a tool I could automate much of X; that's a win".

@facundominguez
Author

facundominguez commented Mar 14, 2025

It is a bit unfair to the refactorings

Thanks @hsyl20. Happy to include a reference to the anecdote. If there is a more extensive discussion of documentation activities elsewhere I'd like to link them too.

For an actual example, see https://docs.google.com/document/d/1mQEpV3fYz1pHi64KTnlv8gifh9ONQ-jytk5sIHqnV9U/edit?tab=t.0#heading=h.xp3xd558qgs7

This one looks good to link too. 👍

@hsyl20

hsyl20 commented Mar 14, 2025

Thanks @hsyl20. Happy to include a reference to the anecdote. If there is a more extensive discussion of documentation activities elsewhere I'd like to link them too.

Thanks. Module hierarchy introduction was discussed in https://gitlab.haskell.org/ghc/ghc/-/issues/13009 and ghc-proposals/ghc-proposals#57. Increased modularity in https://gitlab.haskell.org/ghc/ghc/-/issues/17957

This comment from @bgamari is particularly relevant to the current discussion.

@facundominguez
Author

facundominguez commented Mar 14, 2025

Answering Simon,

Simply define and document an initially-small GHC API, perhaps the interface exported by GHC.API.

The question from GHC developers that triggered the current direction of the proposal is: how do we know which functions need good documentation?

If the tooling effort is excessive, a middle ground is to do only the indexing, which I think is simple enough, and manually build the GHC API modules. This will still help to prevent some accidental breakage even if not all of the functions in the API are documented.

But it looks to me like you prefer clients of the ghc library to open a ticket for missing functions that they use. That can also work. I guess that if the community wanted to spend some money on it, you could accumulate a few documentation issues first. The only comparative disadvantage I see is that it will take longer to reach the coverage that would allow GHC devs to know they are breaking clients' code. But if you cared about it, we could arrange for an indexing solution disentangled from the API definition.
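To make the indexing idea above concrete: conceptually, indexing aggregates, across client packages, which ghc names each one reaches into, so that a GHC developer touching one of those names can see which tools are affected. The following is only a toy sketch over pre-extracted usage records; the package, module, and name strings are invented examples, and real data would come from a compiler plugin as the proposal describes.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- (client package, GHC module, name) occurrences, as a plugin might emit them.
-- All entries are invented for illustration.
usages :: [(String, String, String)]
usages =
  [ ("hls",   "GHC.Tc.Types", "TcGblEnv")
  , ("hlint", "GHC.Hs",       "HsModule")
  , ("hls",   "GHC.Hs",       "HsModule")
  ]

-- For each GHC (module, name) pair, list the clients that depend on it.
-- A GHC developer changing GHC.Hs.HsModule could then see that both
-- hls and hlint would be affected.
index :: [(String, String, String)] -> Map (String, String) [String]
index = foldr add Map.empty
  where
    add (client, modu, name) = Map.insertWith (++) (modu, name) [client]

main :: IO ()
main = print (Map.toList (index usages))
```

A real index would of course be persisted and queried from CI rather than rebuilt in memory, but the shape of the data is the same.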

@doyougnu

doyougnu commented Mar 14, 2025

In general I am in agreement with @simonpj, but it occurs to me that we already have, in GHC, an example of a solution to one of the proposal's goals:

A solution should make it easy for GHC developers to know when they are about to change parts of the GHC implementation that are used in other packages

Isn't this exactly the reason for the base-exports test? I can imagine adding a ghc-api-test that checks the surface area of the API for changes, just as those export tests already do. Adding this to CI would force GHC devs to be aware of changes to the API that affect downstream users.

But in the short term that is not tractable because the API is too large and knows too much about the compiler. So a test like this would hamper development speed too much and therefore is more of a goal than a step on the path towards the goal.
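The shape of such a surface check can be sketched independently of GHC itself: diff a frozen snapshot of exported names against the current build's exports, and fail CI on any difference. Every name below is invented for illustration; the base-exports test works from generated interface dumps rather than hand-written lists.

```haskell
import Data.List ((\\))

-- A frozen snapshot of the API surface, checked into the repository,
-- in the spirit of GHC's base-exports test. All names are invented.
recordedExports :: [String]
recordedExports =
  [ "GHC.API.parseModule"
  , "GHC.API.typecheckModule"
  , "GHC.API.desugarModule"
  ]

-- The surface of the current build, as an export-dumping step would emit it.
currentExports :: [String]
currentExports =
  [ "GHC.API.parseModule"
  , "GHC.API.typecheckModule"
  , "GHC.API.compileToCore"
  ]

-- Removals break clients outright; additions silently grow the stable surface.
-- A CI check would fail on either until the snapshot is consciously updated.
surfaceDiff :: [String] -> [String] -> ([String], [String])
surfaceDiff recorded current = (recorded \\ current, current \\ recorded)

main :: IO ()
main = do
  let (removed, added) = surfaceDiff recordedExports currentExports
  putStrLn ("removed: " ++ show removed)
  putStrLn ("added:   " ++ show added)
```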

To migrate the code base to a state where a test like that is feasible we would need to do more modularity work and take inspiration from the base and the ghc-internal/ghc-experimental hierarchy.

Imagine if for each level of the module hierarchy we had a module called Foo.Interface, for example StgToCmm.Interface, just like we have for the Config records, e.g. StgToCmm.Config. Then at the top level of the module hierarchy we would have:

module GHC.API
  ( module StgToCmm.Interface
  , module StgToJS.Interface
    -- ...
  ) where

import StgToCmm.Interface
import StgToJS.Interface

and so on.

The basic idea is to migrate the code base to a state where we can create a shim module just as I did for the RTS flags here (in particular see this comment). The analogue of GHC.Internal would be the levels of the module hierarchy, for example StgToCmm.Interface, and the shim module would be either GHC.API, or we could split this out more if needed. Then with the shim modules we can begin shaving down the surface area of the API through selective exports, just as we do in base, ghc-internal and ghc-experimental.

This would make the API explicit and obvious in the file system, give GHC developers finer-grained control over the API, lay the foundation to begin to test the API for changes, and centralize the entry point to the API. Perhaps we could even split out the GHC.API module into another internal package just as we split ghc-internal from base.

@adamgundry
Contributor

Perhaps we could even split out the GHC.API module into another internal package just as we split ghc-internal from base.

This sounds like a good place to start! Why not define the GHC.API.* module hierarchy in a ghc-api package which is not a boot package? It can depend on ghc and selectively re-export things, adding shims/documentation as appropriate. Potentially it can even introduce CPP compatibility code so as to support multiple ghc versions from a single ghc-api version. @sheaf has done some excellent work along these lines for the type-checker plugin fragment of the API: https://hackage.haskell.org/package/ghc-tcplugin-api
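The CPP-compatibility idea might look like the following minimal sketch, built on GHC's built-in __GLASGOW_HASKELL__ macro. apiGreeting is an invented placeholder, not a real ghc-api function; an actual shim would instead guard MIN_VERSION_ghc(..) conditionals around real re-exports from the ghc package.

```haskell
{-# LANGUAGE CPP #-}
module Main where

-- Hypothetical compatibility shim: the same ghc-api facility is implemented
-- differently depending on the compiler version it is built with, so a single
-- ghc-api release can support several ghc versions.
apiGreeting :: String
#if __GLASGOW_HASKELL__ >= 906
apiGreeting = "code path for GHC >= 9.6"
#else
apiGreeting = "compatibility code path for older GHCs"
#endif

main :: IO ()
main = putStrLn apiGreeting
```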

Of course it could be helpful to have indexing information to know which parts of the existing ghc package to target. But equally important, as @simonpj said, is to think about what a well-designed API should look like. Then it'll be necessary to port one or more key client libraries to go through ghc-api rather than directly depending on ghc, thereby validating that the API is sufficient for its use case.

@mpickering

I think that a big overall issue is that the compiler internals are just quite messy and entangled. It is expert-level work to refactor these parts of the code base, as it can be very difficult to understand sufficient context to imagine a design which accounts for all the different situations. Refactoring needs to maintain existing (underspecified) behaviour most of the time. As well as being tricky, it is necessary to proceed in a very incremental fashion, since it is a live project with many people interacting with the code every day.

There are quite well specified abstraction boundaries in a few places (in particular how GHC/Cabal interact via package databases). The memory usage behaviour of the compiler pipeline is also better specified than it used to be. I'm sure there are other examples. It has been something that we have worked on for quite a few years now but it's slow progress.

@AndreasPK

The estimate for developing a plugin which simply reports which modules/names/types from the ghc package are used seems to me to be a bit on the high end.


I agree with @adamgundry that introducing a designed, and not auto-generated, API under GHC.API would be more sensible.

I believe auto-generating a GHC API based on used functions might be useful for an initial version, but it is unlikely to be useful in the long run. Ideally a GHC API exposes functionality from GHC without necessarily mirroring the same structure. So the investment for such a tool sounds quite high compared to the benefit.

As @mpickering and @hsyl20 allude to, simply going with the currently used functions as the definition of the GHC API is likely to result in an interface not meaningfully better than what we have today.

I think ideally someone would go over the functionality currently used by these packages and decide, for each item:

  • Is the existing interface:
    • A reasonable interface for the functionality it provides?
    • Relatively stable?
  • If yes, export it as-is from the GHC API.
  • If not, then do one of:
    • Provide a reasonable and stable interface as part of the API, if it is reasonably simple to do so.
    • If that's not reasonable, consider whether or not the relevant exports should be re-exported from the GHC.API namespace for the sake of tracking changes only, without any guarantee of stability.

This would ensure that users of the ghc API can tell if what they are using is:

  • A stable part of GHC that should be easy to adapt going forward.
  • An unstable part of GHC which is considered useful enough to still export via the API, which means:
    • Changes can be tracked and potentially guidance can be provided for breaking changes.
    • It makes it less likely that breaking changes happen accidentally.
    • It provides an obvious place for user-facing documentation.
  • True internals for which there are no guarantees at all.

However I can also see that the cost of doing this properly far exceeds the estimated effort required for the auto-generated GHC API.

@facundominguez
Author

facundominguez commented Mar 14, 2025

Answering Andreas,

The estimate for developing a plugin which simply reports which modules/names/types from the ghc package are used seems to me to be a bit on the high end.

I'll add here that such a plugin already exists and is linked in the proposal. Additional effort is required to tweak the output for API generation, documentation, automated tests, optimizations, etc. I'm not very attached to the actual estimate; I think it is fair for it to be revised.

However I can also see that the cost of doing this properly far exceeds the estimated effort required for the auto generate GHC API.

Your considerations are interesting. At this point, I feel your notion of an unstable part of the API could make a real difference to users.

@Bodigrim
Collaborator

I'm somewhat uneasy about spending 100+ hours just to prepare tooling before even starting to deliver any practical value. I mean, if we are looking to fund 1000+ hours, it would make sense to spend 10% on tooling. But realistically at this stage we are probably talking about 200 hours in total at best, and spending the better half of it on tooling for API autoselection feels underwhelming. Surely there are enough easily identifiable parts of the GHC API that writing even cursory documentation for them would fill the entire budget. (That's accepting the premise that documentation is the bottleneck.)

@Ericson2314
Contributor

Ericson2314 commented Mar 16, 2025

I had written a far-too-long comment which I canned; I'm glad that, since I wrote it, many other people have echoed my sentiments!


Firstly, I strongly agree with @hsyl20

In my experience documenting the code in its current state is often difficult because you end up documenting accidental complexity that shouldn't be here in the first place. [...]

About the proposal itself: I fear that making some part of the accidentally complex code now dubbed "GHC API" more difficult to change will mean that the accidental complexity will stay forever. [...]

To quote the original proposal

However, the general sentiment is that documentation is still lacking. This could be due to documentation being not easy to navigate and discover, for instance if there are relevant cross references that are missing. And secondly, it could be due to documentation being written for an audience with a shared context about GHC internals, which does not always include the authors of Haskell tooling.

IMO, based on our experience, if we're blaming the docs here, we're merely "blaming the messenger". The problem is that the interfaces themselves are not modular, and given this unfortunate fact, documentation can't be terse and self-contained either. "Being written for an audience with a shared context about GHC internals" is fundamentally a problem with the code itself, which merely reappears as a documentation problem.


Secondly, I strongly agree with @simonpj and the other GHC devs who chimed in, and also @Bodigrim, that this is "too meta" and "too automated too soon". Frankly it sounds like the group behind the proposal is at an impasse over what to do, and is hoping that the new tooling's output will provide a clear vision instead of humans doing so. I think that is doomed to fail.

I also agree with Simon et al's counterproposal, that we should simply pick some small portion of the API to manually audit and refactor, in human/qualitative ways.

I agree with @mpickering also that because things are currently quite tangled, we can't just stick another layer on top to do this. We need to actually untangle something to be able to make good interfaces that are possible to document well. And yes, that's hard! But it doesn't need to be fatally hard --- we just need to carefully pick where we begin so we don't spend all our time untangling, and we instead have time left over to clean up the thing after it is separated from the rest.


To that end, I put forth my #56 as a counterproposal. Merely splitting out the AST and Parser as separate packages was never supposed to be the final step. Rather the idea is once you have a "clean workbench" of a fully-separated component, you can then bring in all the stakeholders --- GHC and 3rd-party tools --- and have a productive discussion refactoring and documenting interfaces (with much less effort!) until everyone is happy.

At the point this is done, we should have the vision and consensus we lack today --- this is as important as the refactored interfaces themselves! The experience of making the AST and parser a nice-to-use component (at the level of docs and Haskell interfaces alike) will inform everyone involved of what we are aiming for for the rest of GHC, and how much effort might be involved to get there.

Only at that point should we pause, take a step back, and consider the sort of empirical analysis that this proposal proposes, because after the AST and parser cleanup, we will have the shared qualitative vision to guide us. These empirics are theory-laden, so we need a good theory first.

@facundominguez
Author

facundominguez commented Mar 17, 2025

Thanks, all, for your thoughts so far.

At this point I think most participants would agree to slash the API generation phase, and probably most of the indexing phase if we only require some indexing to inform what parts of the GHC library are of interest to some client libraries. Thus we are mostly left with the documentation review phase.

Some people proposed gradually collecting and documenting an API in specific modules or in a dedicated package. That would be easy to add to the proposal. There is also the question of what client packages to serve first, but I think almost any package considered interesting would do to evaluate the approach.

Some comments seem to argue that helpful documentation is very difficult to write without refactoring GHC first (please correct me if I'm wrong). If this were what GHC developers think in general, then there would be little point in pressing on with this proposal just yet. Agreement would be necessary to keep the produced documentation up to date as GHC evolves.

To that end, I put forth my #56 as a counterproposal.

Thanks @Ericson2314. I do think your proposal is very useful, even if I don't consider it in principle a prerequisite to improve documentation.

@hasufell
Contributor

If this were what GHC developers think in general, then there would be little point in pressing on with this proposal just yet.

It was my perception that this is exactly what this project should be helping out with... not exactly the refactoring, but figuring out what the constraints are, what possible self-contained parts of the API should exist, etc.

@simonpj
Contributor

simonpj commented Mar 17, 2025

Some comments seem to argue that helpful documentation is very difficult to write without refactoring GHC first (please correct me if I'm wrong). If this were what GHC developers think in general, then there would be little point in pressing on with this proposal just yet.

My hope is that this project will lead to a fruitful dialogue:

  • GHC API clients want to access capability X in GHC. For example, in a typechecker plugin, the ability to build "evidence terms".
  • We write down the existing API: the functions GHC currently defines.
  • We notice that, seen from outside, it's a bit of a mess -- inconsistent naming, say, or something a bit deeper.
  • We propose a refined API that looks simple and consistent from the outside. There is a lot of dialogue here, because it might be that the "simple, consistent API" is, for some reason, difficult to implement. So there is to-and-fro.
  • The GHC team refactors GHC to adopt that API.

This human design conversation is what I'd love to see. It starts from a concrete need (capability X) and a draft, albeit unsatisfactory, API, and works from there. I'm agreeing with @hasufell here.

@facundominguez
Author

facundominguez commented Mar 17, 2025

This human design conversation is what I'd love to see.

Here's a possible rendering of the responsibilities. There is a project developer who does the work, is mentored by a GHC developer, and is possibly funded by the Haskell Foundation. The project developer might be a GHC developer herself, if there is someone available.

  • The Haskell Foundation and GHC developers select some tool to serve first. The availability of the tool authors needs to be checked at this stage. Some assessment of project size also needs to be done at this time.
  • The project developer studies the functions and types that the tool uses from the GHC library, and engages with the tool authors where the purpose of using them is unclear.
  • The project developer makes a proposal for an API that suits the tool use case, or improves the documentation if the used functions and types are already reasonable.
  • The tool authors provide feedback on the proposal.
  • Iterate on the API proposal/documentation until it is satisfactory.

It is looking to me now like such a process could happen with @Ericson2314's #56, if some tool is selected to drive the dialogue. Alternatively, perhaps smaller projects are within reach for small tools like om-plugin-imports or print-api. Beware that a small tool does not necessarily mean a small project; I'm thinking of the hi-file-parser case.

@Ericson2314
Contributor

Ericson2314 commented Mar 17, 2025

It is looking to me now like such a process could happen with @Ericson2314's #56, if some tool is selected to drive the dialogue.

@shayne-fletcher coauthored that proposal, with the idea that HLint could be used as such a tool to evaluate the work. I also solicited these reviews #56 (comment) and #56 (comment) from tool authors.

It's now been some time, but I'd hope that these people, or some of the developer tools they work on, would want to be involved.

@soulomoon

soulomoon commented Mar 18, 2025

As an HLS developer and a new GHC contributor, I completely agree with @simonpj. In fact, Simon’s approach is already happening naturally with HieAst. Here’s a concrete example from my own experience:
1. While developing the semantic tokens feature for HLS, I needed entity information for names. Initially, I turned to HieAst, but I found that it didn’t provide enough details. As a result, I had to rely on deeper internal parts of GHC to obtain the necessary data.
2. To address this limitation, I contributed back to GHC by improving HieAst (GHC Issue #24544). This enhancement removed the need to depend on those internal GHC components.

This is exactly how GHC.API is meant to evolve—through an iterative feedback loop where tooling needs drive improvements. I believe this process will benefit more tooling users and continuously refine GHC.API over time.

In the meantime, we are upgrading HLS to GHC 9.12.2 (haskell/haskell-language-server#4517). We have roughly 30 plugins in HLS besides the core ghcide. This should be a perfect opportunity to identify some parts of GHC that should be put into GHC.API.

@facundominguez
Author

👋 I have reoriented the proposal with the bits I collected from the discussion. Most notably:

  • the section on prior art is now based on the material provided by @hsyl20,
  • the tooling phases have been eliminated, and
  • roles and steps are defined to document and refactor where necessary in a structured dialog between GHC developers and tool maintainers.

As before, the proposal leaves it to the Haskell Foundation and GHC developers to decide which tools to support first.

I didn't include the ideas about how to organize the code and how to identify the API modules. But if there is consensus to do it one way or another, I'm happy to accept amendments.

facundominguez changed the title from "Tooling for maintaining a GHC API" to "A process to document a GHC API" on Mar 19, 2025
@eyeinsky

Some thoughts on this topic:

  • evolving a complete-ish API is a long process and is going to take (IMO) years;
  • thus this proposal should just get the starting point right (a direction and a process), choosing some way to evolve the GHC API, e.g. having a package in the GHC repo and adding things into it;
  • GHC developers should be very much on board with the chosen method, otherwise it's unlikely to succeed;

... and then drive this direction with one or more client libraries using the GHC API as far as it can go within the allocated time. I'm not sure there is a way to draw a line to be reached.

@gbaz
Collaborator

gbaz commented Apr 25, 2025

My initial view is that this proposal has too much in it. It states that it wants to document the API, but then it includes a process of refactoring as well. I think that is part of the goal -- but it shouldn't be in this proposal. The part I like most is discovering what is already used by others.

I think it is more reasonable to survey how the GHC API is used (portion by portion, as this proposal suggests, which is good), and to then document what has been discovered as a snapshot (i.e. not tracking changes over time) -- i.e. which calls are invoked, and with what arguments. A work product can then also be an example layer on top with a streamlined interface that satisfies current needs. This should not be released as a library, but should just remain a "cookbook" taken at a point in time.

This work-product (the documentation in "cookbook" form), in turn can be used by the GHC team, stability team, etc to inform how they propose to evolve the API over time, and perhaps how to refactor the internals more broadly. However, it is better to not try to mandate that process within this proposal.

@facundominguez
Author

Thanks @gbaz.

This work-product (the documentation in "cookbook" form), in turn can be used by the GHC team, stability team, etc to inform how they propose to evolve the API over time, and perhaps how to refactor the internals more broadly.

If the final deliverable is a cookbook, a follow-up proposal to use it to act on GHC will need to be assembled promptly, or the value of the cookbook will diminish as GHC and the tools evolve. If there is such a commitment from stakeholders, it looks to me like the cookbook approach can be effective at improving documentation.
