I’m expecting a grueling work week starting tomorrow, so I put away all my work gear late Friday evening and decided to improve my mood by reading a bit, catching up on TV (Final Space is great fun, by the way) and polishing some of my projects, starting with an old nemesis: an RSS feed topic analyzer.
I’ve been coding variations on it for a long while now, because it’s a tricky problem:
- You need to be able to scale both fetching and parsing horizontally
- Parsing and analyzing items is a dark art (especially in title-only feeds, which force you to be creative and crawl original pages if you want to do any sort of meaningful NLP)
- Storing the data (and resulting metadata) efficiently requires some planning
- Divining relevancy, “hot” topics and item relationships is really hard if you want to do it above “demo”/undergrad level
Add to that the fact that I still power through well over 200 individual feeds daily at breakfast in Reeder, and it’s no wonder I’ve been poking at this for a while–having intelligently curated items that cater to my interests and raise awareness of specific topics (what has the competition been up to? who has announced a new project? what are today’s topic outliers?) would be a great way to save time, and there are many, many ways to derive useful insights from that data.
The High-Level Take
The gauche way to do this is to grab everything, toss it into a data layer that can do TF-IDF, and surface the most relevant keywords from the index, as well as “close” items. I’ve done that in the past, and it is useless for my case–regular newsfeeds drown out the really interesting stuff every single time.
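For reference, that approach boils down to something like this (a rough sketch using scikit-learn, with made-up sample documents standing in for feed items):

```python
# TF-IDF over a handful of fake "feed items", then surfacing the top-weighted
# terms per document. Purely illustrative; documents and parameters are made up.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Azure announces new Kubernetes features for container workloads",
    "Weekly newsletter: assorted links and housekeeping",
    "Python asyncio improvements land in the latest beta",
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
matrix = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

for i, row in enumerate(matrix.toarray()):
    top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:3]
    print(docs[i][:40], "->", [term for term, score in top if score > 0])
```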
A slightly better way is to apply some form of natural language processing and do topic extraction. I’ve tried doing that with cognitive APIs of various kinds, but a) that doesn’t work well for the (significant) minority of (continental) Portuguese feeds, and b) I have no way to train most of those APIs.
So I built my own RAKE keyword extractor, and am fiddling about with ways to do topic extraction, but need to have the thing running on a more or less permanent basis again.
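The core of RAKE is simple enough to fit in a few lines. Here is a toy version of the scoring, not my actual extractor, with a deliberately tiny stopword list:

```python
# A stripped-down RAKE-style scorer: candidate phrases are runs of words between
# stopwords/punctuation, each word is scored by degree/frequency, and a phrase
# scores the sum of its word scores. The stopword list is a tiny stand-in.
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "for", "with"}

def candidate_phrases(text):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    phrases, phrase = [], []
    for word in words:
        if word in STOPWORDS:
            if phrase:
                phrases.append(phrase)
            phrase = []
        else:
            phrase.append(word)
    if phrase:
        phrases.append(phrase)
    return phrases

def rake_keywords(text, top=5):
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase) - 1
    word_score = {w: (degree[w] + freq[w]) / freq[w] for w in freq}
    scored = [(" ".join(p), sum(word_score[w] for w in p)) for p in phrases]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top]

print(rake_keywords("Rapid automatic keyword extraction for RSS feed items in Python"))
```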
Loosely-Coupled Components
I’m a fan of CSP in various guises, and it was a natural thing for me to split this off into a scheduler, a fetcher and various kinds of workers, deployed as independent processes that talk to each other via Redis and stuff data into MongoDB[^2]. I’ve thought of doing everything in Go more than once, but `asyncio` and my own custom Python builds are more than fast enough, and have all the NLP tooling I need, so the current iteration runs under Docker Compose inside a small VM.
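The worker pattern is roughly this (a minimal sketch using redis.asyncio, with a made-up queue name and a stub processor rather than my actual worker code):

```python
# A bare-bones worker: block on a Redis list for work items and hand each one to a
# processing coroutine. Queue name, payload format and process() are illustrative only.
import asyncio
import json

import redis.asyncio as redis

async def process(item: dict):
    # stand-in for the real parsing/NLP work
    print("processing", item.get("url"))

async def worker(queue: str = "fetch:done"):
    conn = redis.Redis()
    while True:
        _, raw = await conn.blpop(queue)  # blocks until another process pushes an item
        await process(json.loads(raw))

if __name__ == "__main__":
    asyncio.run(worker())
```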
But I don’t want to maintain a VM for this, so I’ve been patiently waiting for bits of the Azure roadmap to land. Azure Functions would be perfect and I built a proof-of-concept version for it, but it doesn’t let me run the libraries I need (yet). So I’ve been looking at running it inside Kubernetes, but with “native” (i.e., cheap) Azure services.
Right now, it’s working fine with Azure Redis (zero changes, as expected) and Cosmos DB (a few tweaks for `asyncio` compatibility, but nothing special). The UI is non-existent, but I don’t really need it yet–I’m more concerned with having a way to store the data and try different strategies for processing it in the short term.
Flattening Data Costs
As it happens, Cosmos DB is meant to scale up and out to planetary scale and very high transaction rates, but not necessarily “down” to personal projects. Running it with the current absolute minimum (400 RUs[^1]) means I have to spend €20/month for each collection, which (despite my having a “free” personal MSDN Azure subscription) is a bit much for handling persistent storage for a couple of thousand RSS feed items (and largely the same applies to Redis, although I have more options there).
So I decided to move to raw Azure Table Storage, a partition/key-oriented store for arbitrary data structures somewhat like Google Cloud Datastore (which I used back when I developed for App Engine), and a nice advantage Azure has over the likes of AWS S3.
However, nearly all of my modern Python code uses `asyncio`, and the Azure SDKs aren’t quite there yet, so I decided to roll my own Azure Storage library yesterday, and it’s gone swimmingly–in a few idle minutes here and there, I’ve already gotten nearly complete (for my purposes) table storage support and have been working on Storage Queues (another Azure nicety).
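Under the hood it is all plain REST: an entity insert is just a JSON POST against the table endpoint. Something like the sketch below, which uses a SAS token to sidestep the SharedKey signing dance; the account, table and token values are placeholders:

```python
# Inserting an entity into Azure Table Storage over its REST API with aiohttp.
# ACCOUNT, TABLE and SAS are placeholders; real code would also need error handling.
import asyncio
import json

import aiohttp

ACCOUNT = "mystorageaccount"    # placeholder storage account
TABLE = "feeditems"             # placeholder table name
SAS = "?sv=2017-07-29&sig=..."  # placeholder SAS token with table permissions

async def insert_entity(session: aiohttp.ClientSession, entity: dict):
    url = f"https://{ACCOUNT}.table.core.windows.net/{TABLE}{SAS}"
    headers = {
        "Accept": "application/json;odata=nometadata",
        "Content-Type": "application/json",
    }
    async with session.post(url, data=json.dumps(entity), headers=headers) as resp:
        resp.raise_for_status()  # 201 Created on success

async def main():
    async with aiohttp.ClientSession() as session:
        await insert_entity(session, {
            "PartitionKey": "feed-42",
            "RowKey": "item-0001",
            "title": "Hello, Table Storage",
        })

asyncio.run(main())
```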
The only reason I’m not done yet is that Storage Queues use XML payloads, which means my nice `async` generators need to be a fair bit uglier than expected… So I put that chore off for a bit.
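For the curious, this is the sort of thing the Queue service returns from a GET on a queue’s messages endpoint, and the kind of parsing shim it forces on an otherwise JSON-only codebase (the sample message is abridged and made up):

```python
# Parsing the XML returned by the Queue service "Get Messages" operation.
# The sample payload is abridged and entirely made up.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <PopReceipt>YzQ4Yzg1MDI=</PopReceipt>
    <DequeueCount>1</DequeueCount>
    <MessageText>eyJ1cmwiOiAiaHR0cDovL2V4YW1wbGUuY29tIn0=</MessageText>
  </QueueMessage>
</QueueMessagesList>"""

def parse_messages(xml_body: str):
    root = ET.fromstring(xml_body)
    return [{child.tag: child.text for child in msg} for msg in root.findall("QueueMessage")]

print(parse_messages(SAMPLE))
```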
But what is there already is very fast when run on Azure–`aiohttp` connection pooling and `uvloop` are great, and I don’t even need to dip into `Cython` to go up to 500 table transactions/s (faster and cheaper than the 400 RU Cosmos DB tier for key/value interactions). Altogether, not bad.
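The wiring for that is nothing exotic (a generic sketch with an arbitrary pool size, not my exact setup): set uvloop as the event loop policy and reuse a single aiohttp session with a capped connector across all requests:

```python
# uvloop as the event loop, plus a single shared aiohttp session with a bounded
# connection pool. The URL and pool size are arbitrary.
import asyncio

import aiohttp
import uvloop

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

async def main():
    connector = aiohttp.TCPConnector(limit=100)  # cap on pooled connections
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://example.com") as resp:
            print(resp.status)

asyncio.run(main())
```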
Not as deep as I’d wish (I’ve actually been fiddling with another, more interesting API, which I’ll get to next week, if I’m lucky), but not a bad way to take my mind off things.
[^1]: Request Units, which are a metric for throughput and transactions/s.

[^2]: Which is OK for playing around with, but not something I want to maintain either.