Andrey Prokopenko's Blog

Servant plugin for SEO

Intro

I have a website built with the servant framework, and I need to add SEO support to it. Here we go.

One day I started with a task where I had a choice:

  • To make a solution that could be copy-pasted across several projects. It would be fast to implement but error-prone.
  • Or to take a step back and extend the framework with a new plugin.

Problem statement

I am talking about two handlers, /robots.txt and /sitemap.xml, which are useful for SEO optimisation in a project with servant as a dependency. Unlike yesod, the servant ecosystem has no such extensions.

So I had to add or modify the code by hand every time I needed to test some idea.

  • Pros: it is quite fast to implement each time.
  • Cons:
    • maintenance cost: a trigger on API modification has to be set up to keep the handlers up-to-date.
    • it cannot be reused elsewhere.

The alternative was to write a plugin for servant.

  • Pros:
    • more compile-time checks.
    • simpler usage.
    • could be re-used heavily and even shared via OSS.
  • Cons:
    • harder to implement (it requires type-level programming, which I was not familiar with).
    • as a consequence, more time has to be spent on implementation.

Community opinion about servant is controversial (based on what I have heard). From the user's point of view it is simple, the separation of API and handlers is great, and the docs and tutorials are awesome! From the development side, servant is considered type-level fancy stuff. That's what I heard.

I had no hard issues using servant, so I became curious: is it really true about the development side? The only way to find out was to solve my task this way and to share my thoughts and findings with you.

I had only one hour per day (evenings, mostly). So I started.

Robots.txt

Robots.txt is a plaintext file that can be served statically or dynamically by a web server via the /robots.txt handler. It tells robots what should be indexed and what should not.

User-agent: *
Disallow: /static

Sitemap: https://example.com/sitemap.xml

The snippet above is an example of the content I want to receive from the API. It means that robots with any user agent are not allowed to index endpoints starting with /static, and that the sitemap is located at the given URL.

The robots specification is not limited to these directives, but for a start this is enough. As a user, I want to specify Disallow as a keyword somewhere in the API.

Sitemap.xml

Sitemap.xml is an XML file that contains either a list of URLs of pages that should be indexed by robots, or a list of nested sitemap URLs if the website index is large enough.

There is a /urlset/url list with a URL inside and some optional parameters. I want to use Frequency (i.e. /urlset/url/changefreq) and Priority (/urlset/url/priority) as keywords in the API as well. And the XML should be rendered somehow from the API.
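
For example, a minimal urlset entry from the sitemap specification for a page that changes daily with priority 0.7 looks like this (the URL is illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/1</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>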

There is a /sitemapindex/sitemap list with the sitemap index. Each URL (//loc) in this index points to a single part of the whole sitemap. If the sitemap is large enough, a sitemap index should be built instead. Each indexed location should contain no more than fifty thousand target locations (HTML page URLs). The sitemap specification contains more details.
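
A sitemap index that refers to the individual parts looks roughly like this (the URLs are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap/1/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap/2/sitemap.xml</loc>
  </sitemap>
</sitemapindex>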

Proposal for servant

To start the implementation, I asked myself: how should the final API look? What should it be like? And I came up with something like this:
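
(The sketch below is illustrative: how the Frequency and Priority parameters are encoded at the type level is an assumption of mine, and HomePage, BlogPost, PostId and the HTML content type come from the application.)

type ExampleApi =
       Get '[HTML] HomePage
  :<|> Frequency 'Daily :> Priority '(7, 10)   -- '(7, 10) standing for 0.7
         :> "blog" :> Capture "post" PostId :> Get '[HTML] BlogPost
  :<|> Disallow "static" :> Raw                -- hide /static from robots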

It looks like I need to encode Disallow, Frequency and Priority into the API without any impact on serving.

And I need to find a way to derive both robots.txt and sitemap.xml with as little code as possible from the user's perspective; e.g. only one function, serveWithSeo, should need to be called.
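
Usage could then look like the following hypothetical sketch (the argument order and exact signature of serveWithSeo are assumptions):

main :: IO ()
main = run 8080 $                              -- run comes from warp
  serveWithSeo "https://example.com"           -- base URL for absolute sitemap links
               (Proxy :: Proxy ExampleApi)
               exampleServer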

Design and Implementation

The next step was to ask: what really should be implemented, and how? The only way to become comfortable with a task is to decompose it to the degree where each subtask in the tree becomes transparent to you.

Functional requirements

I grouped all the requirements into several components:

  1. Servant combinators. Disallow, Frequency and Priority are the starting point in my journey.
  2. Robots data type. There should be a way to transform the final API into an intermediate Robots representation.
  3. Sitemap data type. The same as the previous one, but for Sitemap.
  4. UI part. The user API should be automatically extended with a sub-API for both Robots and Sitemap. Handlers with default renderers from the intermediate data types to the target content types should be provided.

I started to notice how organically the types were driving me through the gathering of functional requirements.

Servant combinators

  1. Disallow should wrap a path piece of the URL (i.e. a Symbol).
  2. Disallow should not affect servant-server functionality (see the sketch after this list).
  3. Disallow should be gathered to Robots intermediate data type for each API branch.
  4. Disallow should invalidate Sitemap for whole API branch if it is present in API branch.
  5. Frequency should be used with :> combinator.
  6. Frequency should have (at the type level) one parameter with one of the following values (according to the sitemap spec): never, yearly, monthly, weekly, daily, hourly, always.
  7. Frequency should not affect servant-server functionality.
  8. Frequency should be gathered for each URL in contained API branch.
  9. In case of several Frequency values in one API branch, the outermost one should be used (overwrite rule).
  10. Priority should be used with :> combinator.
  11. Priority should have (at the type level) one parameter with a value representing the priority according to the sitemap spec, i.e. in the range between 0.1 and 1.0.
  12. Priority should not affect servant-server functionality.
  13. Priority should be gathered for each URL in contained API branch.
  14. In case of several Priority values in one API branch, the outermost one should be used (overwrite rule).
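
To illustrate requirements 2, 7 and 12, here is a minimal sketch (not the library's actual code) of a combinator that is visible at the type level but completely transparent to servant-server; Frequency and Priority can follow the same pattern:

{-# LANGUAGE DataKinds             #-}
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE KindSignatures        #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE ScopedTypeVariables   #-}
{-# LANGUAGE TypeFamilies          #-}
{-# LANGUAGE TypeOperators         #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (Symbol)
import Servant.API ((:>))
import Servant.Server.Internal (HasServer (..))

-- An uninhabited type: it only carries information in the API type.
data Disallow (sym :: Symbol)

-- Serving is delegated to the rest of the API, so handlers are unaffected.
instance HasServer api ctx => HasServer (Disallow sym :> api) ctx where
  type ServerT (Disallow sym :> api) m = ServerT api m
  route _ = route (Proxy :: Proxy api)
  hoistServerWithContext _ = hoistServerWithContext (Proxy :: Proxy api)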

Robots data type

I called it RobotsInfo.

  1. RobotsInfo should contain information about every API branch where Disallow appears, i.e. a list of disallowed path pieces.
  2. RobotsInfo should contain knowledge about sitemap presence in the API (present or not). A small sketch of the type follows this list.
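
A minimal sketch of such a data type (the field names here are assumptions, not the library's actual record):

import Data.Monoid (Any (..))
import Data.Text (Text)

data RobotsInfo = RobotsInfo
  { robotsDisallowedPaths :: [Text]  -- every path piece wrapped in Disallow
  , robotsSitemapPresent  :: Any     -- whether a sitemap is served at all
  }

instance Semigroup RobotsInfo where
  RobotsInfo ps s <> RobotsInfo ps' s' = RobotsInfo (ps <> ps') (s <> s')

instance Monoid RobotsInfo where
  mempty = RobotsInfo [] (Any False)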

Sitemap data type

I called it SitemapInfo.

  1. SitemapInfo should contain list of sitemap entries (i.e. [SitemapEntry]).
  2. Each SitemapEntry should represent particular API branch.
  3. SitemapEntry should contain information about all path pieces from which the list of URLs can be constructed.
  4. SitemapEntry should contain information about all query parameters from which the list of URLs can be constructed.
  5. SitemapEntry might contain Priority or not (optional parameter).
  6. SitemapEntry might contain Frequency or not (optional parameter).
  7. SitemapInfo should be gathered automatically once Get '[HTML] a is present in the API.
  8. Other methods and content types should be ignored; no SitemapInfo should be created for such API branches.
  9. It should be possible for the user to specify how to retrieve a list of possible values for userType from Capture' mods sym userType (we will discuss it later).
  10. It should be possible for the user to specify how to retrieve a list of possible values for userType from QueryParam' mods sym userType (we will discuss it later).
  11. When the list of URLs is longer than 50000, SitemapInfo should be split into a sitemap index, where every key in the index has a reproducible set of corresponding URLs (a sketch of SitemapInfo follows this list).
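
A sketch of the intermediate representation these requirements describe (the names and field types are assumptions):

import Data.Text (Text)

data SitemapEntry = SitemapEntry
  { sitemapPathPieces  :: [[Text]]         -- possible values for every path piece
  , sitemapQueryParams :: [(Text, [Text])] -- possible values for every query parameter
  , sitemapFrequency   :: Maybe Text       -- e.g. "daily", if Frequency is present
  , sitemapPriority    :: Maybe Text       -- e.g. "0.7", if Priority is present
  }

newtype SitemapInfo = SitemapInfo { sitemapEntries :: [SitemapEntry] }

instance Semigroup SitemapInfo where
  SitemapInfo xs <> SitemapInfo ys = SitemapInfo (xs <> ys)

instance Monoid SitemapInfo where
  mempty = SitemapInfo []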

UI: built-in API and handlers

  1. Provide a way to retrieve information from the API to get RobotsInfo.
  2. Provide a way to retrieve information from the API to get SitemapInfo.
  3. Provide a way to extend the API with /robots.txt.
  4. Provide a way to extend the API with /sitemap.xml.
  5. Provide a way to extend the API with /sitemap/:sitemapIndex/sitemap.xml.
  6. Provide default handler for Robots (i.e. rendering from RobotsInfo to its textual representation).
  7. Provide default handler for sitemap.
  8. Provide default handler for sitemap index.
  9. If sitemap URL list length is no more than 50000 then this list should be rendered.
  10. If sitemap URL list length is greater than 50000 then index list should be rendered.
  11. If sitemap URL list length is no more than 50000 then sitemap index should be empty (i.e. client should receive HTTP 404 NOT FOUND while querying /sitemap/:sitemapIndex/sitemap.xml).
  12. If sitemap URL list length is greater than 50000 then sitemap index should be accessible (somehow).

I decided to use at least one key per API branch, and if one particular branch has more than 50000 URLs, to split it into several keys/parts.
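
Put together, the built-in sub-API from requirements 3-5 could look roughly like the sketch below (the endpoint shapes follow the requirements above; the content types are placeholders, with OctetStream standing in for a proper XML content type):

{-# LANGUAGE DataKinds     #-}
{-# LANGUAGE TypeOperators #-}

import Data.ByteString.Lazy (ByteString)
import Data.Text (Text)
import Servant.API

type SeoApi userApi =
       "robots.txt"  :> Get '[PlainText] Text
  :<|> "sitemap.xml" :> Get '[OctetStream] ByteString
  :<|> "sitemap" :> Capture "sitemapIndex" Int
                 :> "sitemap.xml" :> Get '[OctetStream] ByteString
  :<|> userApi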

Diagram

There are several levels of context. For simplicity I drew them in the diagram below.

Figure 1: The diagram contains several levels of context: the API level (i.e. types), the handler level, and two extra levels: one for RobotsInfo, another one for SitemapInfo.

Implementation details

One strange thing happened during the design: I started to see a pattern for how to implement these requirements even without previous type-level experience, but I was not sure about it. So I stepped back and read the brilliant book Thinking with Types by Sandy Maguire, and used the servant-server and servant-swagger sources together with Restricting servant query parameters by Alexander Vershilov as guidelines. These insights showed me how to deal with the task.

  • Each servant combinator is an uninhabited type.
  • RobotsInfo and SitemapInfo should both be a Monoid and a Semigroup, so that they can be gathered and combined into their final representation.
  • Defining a constraint (i.e. a type class over these uninhabited types and its instances) is the way to gather small units of information that can then be combined with each other.
  • Proxy is the way to bring information from the type level down to the value level.
  • User-supplied values in Capture and QueryParam can be handled with separate type classes (Sitemap requirements 9 and 10); see the sketch after this list. The user implements instances of the corresponding type classes for a function that looks like MonadIO m => Proxy a -> app -> m [a]. The lists of possible values are then propagated in the Capture' and QueryParam' instances.
  • GHC itself was helpful along the way: it told me which extensions to enable and produced useful messages. The only time I felt I was doing something really wrong was with PolyKinds: when I needed types of different kinds, GHC did not tell me anything about it.
  • GHC User Guide is an awesome reference!
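
For Sitemap requirements 9 and 10, the user-facing part could look like this sketch (the class, method and type names here are assumptions, not the library's API):

{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE MultiParamTypeClasses #-}

import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Proxy (Proxy)

-- The user explains how to enumerate every possible value of a type used in
-- Capture'/QueryParam', so that it can be expanded into concrete sitemap URLs.
class HasPossibleValues app a where
  possibleValues :: MonadIO m => Proxy a -> app -> m [a]

-- Hypothetical application environment and captured type, for illustration only.
newtype PostId = PostId Int
data App = App { appPostIds :: IO [PostId] }

instance HasPossibleValues App PostId where
  possibleValues _ app = liftIO (appPostIds app)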

Summary

  • 4 uninhabited data types.
  • 10 types.
  • 5 type classes.
  • 70 instances.
  • 0 type families.
  • 18 different language extensions.
  • goal achieved.

Conclusion

I have heard a lot about Simple Haskell, Boring Haskell and Fancy Haskell. I realized that the restrictions and limitations are introduced for good reasons: time to market, benefits vs. costs, risks and so on.

On the other hand, there are some awesome explanations from the problem-solving point of view. They lead to specific technologies or even to “type-level magic” stuff.

These are different mindsets. They all share the idea that problems have to be solved; the difference is in the way they solve them. Maybe different problems. Maybe higher-order problems. Maybe a single task from a long-awaited todo-list.

One day you may find yourself in a situation where you have to deliver faster than usual. Another day there are no hard time boundaries and you feel relaxed. We live in the endless uncertainty of estimations and expectations. We live in chaos where nothing can be predicted.

Sometimes we know what should be done and we have to do the things we have done before. Sometimes a task feels extremely new: new domain, interfaces, requirements, specs. Abstractions are quite a good way of dealing with that. And it is only one side of the coin.

When you take the team into consideration, the situation can become even more complicated. In large projects it is absolutely normal to spend only 10% of business hours on documenting, analysing and coding. The rest of the time goes mostly to communication: with business partners, with BA folks, with the development team, with management, with the QA team, with executives, with users.

I feel like I am living in several worlds simultaneously. And it is hard to express my concerns to somebody, because I have to understand their preconditions, which context suits the people I interact with, what goals they are pursuing, and what agenda is on their tables. And it is impossible. In the village where I grew up, neighbours used to say: “Going into politics? Good riddance!”

And I start to feel unhappy from the very beginning of such interactions. So the best solution for me in these situations is to remain silent, unless I am experiencing the mentioned trade-offs myself.

There is a force bigger than us. Nothing can stop it. We cannot resist it. We can only accept it and let it drive us all the way.

I have struggled a lot since the beginning; there were:

  • health issues during hard self-isolation,
  • surgical interventions,
  • contact with COVID-19,
  • dealing with the police as a consequence of that contact,
  • and even a hardware display issue on my laptop.

Despite all of that, the goal was achieved. It has finally let me go.

The results are available as the servant-seo library (on GitHub and on Hackage):

I encourage you to build a useful website with servant and hundreds of thousands of pages, and to have it successfully indexed by robots!

If you have notes to add or you see issues in the spec/code, please contact me, e.g. on GitHub.

Thank you and have a good day!

Links (in order of appearance)

  1. Robots specification.
  2. Sitemap specification.
  3. Thinking with Types.
  4. servant-server.
  5. servant-swagger.
  6. Restricting servant query parameters.
  7. GHC User Guide.
  8. Simple Haskell.
  9. Boring Haskell.
  10. Fancy Haskell.
  11. Interview with Edward Kmett.
  12. https://github.com/swamp-agr/servant-seo.
  13. https://hackage.haskell.org/package/servant-seo.

Posted on 2020-07-13 by agr. Powered by Hakyll. Inspired by Yann Esposito.