Creating the Novem Example Index

Posted January 5, 2024 by Sondov Engen  ‐ 22 min read

One of the biggest challenges with building a datavisualisation platform is access to realistic data. This article explains how we created our financial data set – the Novem Example Index.

Creating the Novem Example Index Creating the Novem Example Index


One of the things we set out to do with the novem platform was to create a good user experience (UX). The idea is to place our data visualization platform where the user is already productive, such as in their programming language, or integrated with their tools.

In our world, a good UX is not only defined by being easy to use, but also by how it handles errors and exceptions. Error handling should be graceful and informative. Existing workflows should not break at the first sign of a problem, but provide a best effort result and inform the user of any problems.

To be able to create this experience it’s important to do a lot of testing with real-world data, as real-world data tends to break in ways no developer could imagine. While fuzzing is a very useful tool for security, it’s not really applicable when trying to replicate actual workflows.

To help with development and testing as well as product demonstration, we decided to try and create a few relevant real-world datasets. This article explains how we created the first of these datasets – the Novem Example Index (NEI).

The objective

The objective of the NEI is to provide a rich and detailed dataset to use across our different visualizations and workflows:

  • Line, pie, bar, map, scatter and treemap charts.
  • Daily, weekly, monthly and on-demand e-mail examples
  • Grids and documents
  • Dashboard screens

In addition we want our dataset to be open and reproducible, that means:

  • Pick sources that are openly accessible, without api keys or subscriptions
  • Provide source code and utilities to replicate all the derived values
  • Do not require any novem proprietary or hosted information outside the scope of visualization

Finally we hope this blog post can serve as an introduction to data processing and analysis in python, to illustrate the power of the platform, and serve as inspiration and material in a quest for learning.

The Index

The name, Novem Example Index, was chosen to be illustrative of the dataset. The NEI is built to create a dataset that looks like a regular market index, but created bottom up from individual securities and price series.

The index is also extremely simplified. We currently make no effort to deal with the myriad of technicalities that comes with providing a real market index such as liquidity constraints, corporate actions, rebalancing, voting rights, share structure, etc. We also make no attempt at creating a historically accurate constituent set.

We do however desire some features from the NEI:

  • It should be possible to create on demand (though we don’t require, or expect, it to produce the same exact numbers each time)
  • It should resemble the overall market performance
  • It should have a reasonable diversity in geography and sector

If you follow the steps outlined in this guide you should be able to source the underlying data, and calculate the weights and construct a table of returns similar to the example below.

Creating our index

To start off we need a list of securities that make up our index. In addition we’d want a bit of metadata such as sector, industry, country and market cap. This is so we can create some more exciting illustrations as well as a more representative aggregate return.

To make it simple, the Nasdaq stock screener has a download csv function that provides a pretty decent starting point (a list of over 8000 global equities, though only ones that are either directly, or through derivatives, listed on the NASDAQ).

import pandas as pd

df = pd.read_csv('nasdaq_screener_xx.csv')

# Index(['Symbol', 'Name', 'Last Sale', 'Net Change', 
#        '% Change', 'Market Cap', 'Country', 'IPO Year', 
#        'Volume', 'Sector', 'Industry'],
#        dtype='object')

As you can see we get most of our key attributes here:

  • Symbol - a short representation of the security which we can use with yahoo finance to get price history
  • Country - to create a geographic distribution/visualization
  • Industry/Sector - nice category values for data visualization
  • Market Cap - lets us calculate shares outstanding and index weight
  • IPO Year - allows us to filter out young companies (if we want)

One obvious problem here is that this information is point in time. For a real index setup we’d want to track all of these variables through time, however for our simplified demo data, this is sufficient.

We should also do a quick check of data quality, to see if all the important fields are populated and clean up the data if it’s not.

# in the real world we'd strive to enrich this 
# information  so we had access to as wide as 
# dataset as possible, but in this case, we'll 
# just filter it out

# our total securities before filter
# 8307

# only keep assets that has all of the following
df = df.loc[
    ~(pd.isna(df["Country"]))    # country
  & ~(pd.isna(df["Sector"]))     # sector
  & ~(pd.isna(df["Market Cap"])) # market cap
  & ~(pd.isna(df["Last Sale"]))  # price

# our total securities after filter
# 6720


While we want our dataset to be representative of the overall market, we also want it to be visually interesting to look at. This might require making some adjustments to bring out underrepresented aspects of the data.

# get top 10 countries by sum of market cap in billion USD
df.groupby('Country')['Market Cap'].sum().sort_values(

# Country
# United States     41245.053434
# United Kingdom     1299.244579
# Switzerland         612.885696
# Netherlands         607.707868
# China               579.752448
# Japan               516.300279
# Canada              431.550941
# Taiwan              417.590556
# Ireland             413.421136
# Brazil              369.924693
# Name: Market Cap, dtype: float64

# get the largest 500 companies by market cap (mcap)
def get_500_cty(df):
  top500 = df.sort_values(
    by='Market Cap', 

  # group by country and create count and mcap aggs
  gp = top500.groupby('Country').agg({
    'Market Cap':'sum'
  }).sort_values(by='Market Cap', ascending=False)

  # calculate percentages
  gp['pct_mcap'] = gp['Market Cap'] / \
    gp['Market Cap'].sum()*100
  gp['pct_names'] = gp['Symbol'] / \

  return gp

# have a look
gp = get_500_cty(df)

#                  pct_mcap  pct_names
# Country
# United States   83.091668       76.8
# United Kingdom   2.909212        3.8
# Switzerland      1.409058        2.0
# Netherlands      1.320883        1.2
# Japan            1.223538        1.6

As you can see the US is dominating our dataset, making any geography visualization pretty boring. Even if you filter for the top 500 largest companies the us dominance is still over 80% by mcap.

Let’s create and apply a factor for US companies and see if our dataset is a bit more diverse (this is an arbitrary factor, but it’s a nice way to play around with the numbers).

# create a new dataframe with country factors
factor = pd.DataFrame([
  {'Country': 'United States', 'factor':0.15},

# add factor to our screener using a left join 
# (preserving all members of our original dataset, 
# introducing NA for missing country factors)
dfact = pd.merge(df, factor, on='Country', how='left')

# assign 1.0 to all countries without an explicit 
# factor (no adjustment)
dfact.loc[:, 'factor'] = dfact.loc[
  :, 'factor'

# create our factor adjusted market cap
dfact['Market Cap'] = dfact['Market Cap'] * \

# have a look
gp = get_500_cty(dfact)

#                  pct_mcap  pct_names
# Country
# United States   38.281250       56.4
# United Kingdom   9.997520        5.4
# Switzerland      4.781284        2.6
# Netherlands      4.735530        2.2
# China            4.225116        3.8
# Japan            4.076380        1.8
# Taiwan           3.253116        0.6
# Ireland          3.250415        2.4
# Canada           3.064369        3.2
# France           2.816903        0.8

Much better, the US is still dominating (as it should) but the rest of the world is getting a bit of representation as well. The last thing we should check is that we haven’t skewed our sector distributions too bad.

# let's check out our sectors too
def get_500_sec(df):
  top500 = df.sort_values(
    by='Market Cap', 

  # group by sector and create count and mcap aggs
  gp = top500.groupby('Sector').agg({
    'Market Cap':'sum'
  }).sort_values(by='Market Cap', ascending=False)

  # calculate percentages
  gp['pct_mcap'] = gp['Market Cap'] / \
    gp['Market Cap'].sum()*100
  gp['pct_names'] = gp['Symbol'] / \

  return gp

# get adjusted sectors
gp = get_500_sec(dfact)[['pct_mcap','pct_names']]

# get original sectors
ogp = get_500_sec(df)[['pct_mcap','pct_names']]

# compute diff
gp['pct_mcap_old'] = ogp['pct_mcap']
gp['diff_pct'] = ((gp-ogp)/ogp * 100)['pct_mcap']

# have a look

#                      pct_mcap  pct_mcap_old    diff_pct
# Sector
# Technology          23.280741     28.604865  -18.612651
# Consumer Discre...  19.571802     20.617996   -5.074175
# Health Care         15.332663     14.986135    2.312330
# Finance             13.685485     11.158983   22.640965
# Energy               9.387199      6.193034   51.576750
# Industrials          5.512074      6.401306  -13.891403
# Consumer Staples     4.203575      3.465928   21.282815
# Telecommunications   3.237112      2.040436   58.648081
# Basic Materials      2.791836      1.116991  149.942517
# Utilities            2.235667      3.485694  -35.861636
# Real Estate          0.761844      1.928634  -60.498240

Not perfect, but still decent. This point in time snapshot along with the above factors will be used for constructing our NEI.

Price data

The last step on our constituent construction is to make sure that we have access to price history for our members. For the NEI we’ll focus on 2022, both to keep the data manageable and because our constituent set is a point in time snapshot.

# get the pandas_datareader utility
from pandas_datareader import data

# get 500 largest companies of our new factor
top500 = dfact.sort_values(
  by='Market Cap', 

# Get price history for our 500 largest companies
# N.B. This takes a long time to run (30min+), consider 
# getting the entire price history once and to cache the 
# result
px = data.DataReader(
  start='2021-11-18',   # get a few days more than we need
  data_source='yahoo' # use yahoo finance
)['Adj Close']

# unstack the data for easier indexing
upx = px.unstack().reset_index().rename(columns={
  0: 'px'

# now that you have the price data, lets see if any are
# missing data and if so how may days
miss = upx.loc[

# let's have a look at the latest price for the symbols
# with missing price. If they have a price on the last
# day it's probably either listed this year or an 
# intermittent data issue

# Symbols
# CEG     95.139999
# MBLY    28.090000
# Name: 2022-11-10 00:00:00, dtype: float64

As you can see we have obtained all the necessary data to start constructing our index.

Index construction

500 individual stock prices aren’t very useful when it comes to data visualization, what we really want is the aggregate behavior of these stocks grouped across interesting dimensions.

Now the easiest way to calculate this would probably be simple averages, but that would not provide a very good representation of the market behavior (and one of our goals was to create an index that behaved similar to the market).

In the real world companies aren’t all worth the same, so $100 invested across the broader market will buy you more of some companies (say Apple) than others (Snap). To create an index that mirrors this market, the index performance also has to be more affected by the movement of AAPL prices than that of SNAP.

This is where the Market Capitalization (mcap) that we downloaded above comes into play. We can assign each company in the index a weight according to the pct mcap of the company compared to the rest of the market.

# assign a weight [wt] value 
top500['wt'] = top500['Market Cap'] / \
  top500['Market Cap'].sum()

# have a look

# Symbol  Name                      factor  wt
#   AAPL                Apple Inc.  0.15    0.030162
#    TSM    Taiwan Semiconducto...  1.00    0.029011
#   MSFT     Microsoft Corporation  0.15    0.021455
#    NVO          Novo Nordisk A/S  1.00    0.020552
#   ASML         ASML Holding N.V.  1.00    0.017590
#    AZN           AstraZeneca PLC  1.00    0.015929
#     TM         Toyota Motor Corp  1.00    0.015188
#   SHEL     Royal Dutch Shell PLC  1.00    0.015104
#   BABA             Alibaba Group  1.00    0.014587
#    NVS               Novartis AG  1.00    0.014460

As you can see, our factor adjustments have had quite a significant impact on a security level (as in the real world Apple is approx. 6.5 times more worth than TSMC). Nevertheless, this is what we wanted, and we now have the individual stock weights as well.

However, while the weight we have calculated is good today, using the same weight yesterday, a week ago or a month ago would be very wrong. After all, if a company doubles in value, its weight should increase as well (all things being equal - which they seldom are).

So what we need to do is calculate the weight, not just for one day, but for every day!

So, how do we calculate the relative weight of each of our constituents for every day so far in 2022? Well, the calculation is pretty simple, we need to get the market cap for each company for each day in 2022. This can be obtained by calculating the number of outstanding shares the company has on each day and multiply it with the price.

Now, the number of shares a company has outstanding can vary through time thanks to stock splits, share buybacks, stock issuance and other. In addition, the historic price of a share can change not just on market moves but through corporate actions as well.

Luckily for us, Yahoo finance has already provided us with a corporate action adjusted price, meaning that todays shares + the adjusted price should correctly reflect the market-cap of the company going back in time.

# clean up our $ price
top500['Last Price'] = top500['Last Sale'].apply(
  lambda x: float(x.replace('$',''))

# calculate outstanding shares today for our "index"
top500['shs'] = top500['Market Cap'] / \
  top500['Last Price']

# keep in mind that while the real world seldom has
# fractional shares, due to our factor, we could have

# Get price history for our index members
ixp = px[top500['Symbol']]

# pivot down our columns so we get date, symbol, price
pixp = px.unstack().reset_index().rename(columns={
  0: 'px'

# enrich our outstanding shares
pp = pd.merge(

# calculate our historic market cap
pp['mcap'] = pp['px'] * pp['shs']

# HINT: This uses a convenient pandas function that lets
#       us calculate the sum of a value by category
#       and then spread that value back across its 
#       respective position, in essence creating a 
#       total mcap for the day across all symbol/date
#       combinations
# calculat our daily wt as a pct of daily mcap
pp['wt'] = pp['mcap'] / pp.groupby('Date')[

# verify that our calculation is correct
# array([1., 1.])

# two values, so not every row sums to 100%
# 0.9999999999999999

# floating points and rounding... but more than good
# enough for our use

So there we go, end of day relative size by market cap for all securities in our index.

Index returns

Now that we have our index constituents and weights, it’s time to calculate our returns. Again, we have some goals we want to achieve:

  • We would like to have some fairly accurate looking returns
  • We want to calculate the returns across various dimensions and time periods
  • We’d want the code doing this to generalize quite well so we can play around with the data

Calculating returns

To achieve our top goal of creating a fairly accurate return number, it’s important to understand a couple of things.

  • Prices aren’t inherently comparable, two companies with a market value of one million can have vastly different share prices. (e.g. one has 100 shares outstanding and a price of 10,000 the other has 1,000,000 shares outstanding and a price of 1)

  • Movement in price however, is comparable. If you invest an equal amount in each company and the prices double, you’ve doubled your money.

Now if we only had one investment, calculating the return would be simple - we would just take the price on the last day and divide it with the price on the first day, easy. However, we do not have one investment, we have 500, and we don’t want to calculate individual returns, but aggregate returns.

The trick to calculating aggregate returns is to realize, as we did above, that different securities contributes differently to aggregate return. If you hold 90% of your investments in Apple, then Apple performance will affect your aggregate return more than companies where you have only invested 1%.

Another important thing to be aware of is the value your return is applied to. If you hold 100,000 dollars in a company, and the company drops 90% then your investment is worth 10,000 dollars. But when doing aggregates it’s important that you apply that 90% loss to the 100,000 dollar investment, not the 10,000 dollar one.

So with all the background out of the way, let’s calculate some returns.

# let's start by making sure we have date, symbol,
# sector, country, price and factor adjusted market 
# cap for all our securities


# Index(['Symbol', 'Date', 'px', 'shs', 'Sector',
#        'Country', 'mcap', 'wt'], dtype='object')

# the following analysis depends on the data being 
# sorted by symbol and ascending date, so let's make 
# sure
pp = pp.sort_values(by=['Symbol','Date'])

# let's change our dollar price into price movements
pp['ret'] = pp.groupby('Symbol').px.pct_change()

# Now we need to decide which aggregate categories 
# we want to calculate. For the NEI we want to show 
# both country and sector returns as well as their 
# contributions to the index.

# The total index and contributions comes the bottom
# most weight we've already calculated, but to do sector
# and country returns we have to create a new weighting 
# regime for each category.

# for each security, each day, calculate individual pct 
# of total sector mcap
pp['swt'] = pp['mcap'] / pp.groupby([

# repeat for cty
pp['cwt'] = pp['mcap'] / pp.groupby([

# Now remember that we have our returns calculated as 
# the change of price between two close prices, it's 
# therefore important that we apply the return to the 
# relative weights at the beginning of the day (before 
# the weight reflects the price)

# We do a little hack here where we shift the previous 
# days wt to an "open" weight on the next day.

# Bottom level open wt
pp['owt'] = pp.groupby(['Symbol'])['wt'].shift()

# Let's repeat the above for sector and country
pp['oswt'] = pp.groupby(['Symbol'])['swt'].shift()
pp['ocwt'] = pp.groupby(['Symbol'])['cwt'].shift()

# OK, finally we're ready to calculate our daily, 
# weighted return contributions. Here we calculate 
# how much each security should contribute to the 
# overall aggregate categories we're creating:
# market, sector and country
pp['mret'] = pp['owt'] * pp['ret']
pp['sret'] = pp['oswt'] * pp['ret']
pp['cret'] = pp['ocwt'] * pp['ret']

Now that we got our individual returns and weights it’s time to create our aggregates. The challenge here is that we have two dimensions we need to produce the product across, the category (sector/country/market) and the time (month /year). If you’re curious why we’re using geometric and not arithmetic functions then FirstLinks has an excellent article.

Mathematically the operations are simple: normalize your returns to deviate from 1 and multiply everything:

$$ \text{period return for } c = \prod_{t=\text{start}}^\text{end}(1+w_{ct}r_{t}) -1 $$ $$ \text{where } w_{ct} \text{ is the asset category weight at time } t \\ \text{ and } r_t \text{ is the corresponding asset return } $$

However, implementing the above in code can be a bit tricky, particularly if you’re using a language that’s missing a convenient $\prod$ function (such as many dialects of SQL).

An alternative is then to use logarithmic returns for all your calculations, the advantage of this approach is that the summation of logarithms is the same as multiplication of the exponent, so by transforming the returns into log space, you can do all your aggregates with simple summation. As an educational example we’ll use the log approach for the NEI.

Let’s start by testing our assumptions and write some code to calculate the end of period price using exp.sum.log.

# we only care about 2022, so let's filter the data year 
# to date
ppx = pp.loc[pp['Date'] >= '2022-01-01']

# let's start by a simple test of our calculation 
aapl = ppx.loc[ppx.Symbol == 'AAPL', ['Date','px','ret']]

# calculate the current apple price using sum of log 
# returns index times first days close price
tstpx = aapl.iloc[0]['px'] * \

# it can be helpful to break the above code down a bit, 
# python doesn't really have good function composition 
# support, so it's best to read the below steps in 
# reverse order (1 -> 2 -> 3 -> 4)
# np.exp(                   4. convert from log back into 
#                              "normal" space
#  sum(                     3. sum all the log returns
#                              (thus getting a single 
#                              value)
#   np.log(                 2. convert all returns to log 
#                              space
#    1+aapl.iloc[1:]["ret"] 1. get all year to date apple
#                              asset returns except the
#                              first date and normalize 
#                              around 1. We're skipping
#                              the first day as the first
#                              day close price already
#                              incorporates that days 
#                              returns
# )))

# compare against last price from yahoo

# 0.9999999999999998
# we're using floats here, so this is good enough

Looks good, so let’s continue.

Building our datasets

As you can see, the log/sum approach works well, so it’s time to define and calculate our NEI aggregates. For our research purposes we want to create the following datasets:

  • A table showing monthly returns as well as YTD and Market returns for sectors
  • A table showing monthly returns as well as YTD and Market returns for countries
  • A table showing return contributions for the sectors/countries to the market (what is the largest contributors to the market return)
  • Daily cumulative returns for the above three categories (nice for line charts)

We’ll start with the YTD Month over month sector/country tables.

# We'll start with our three tables as they all exhibit 
# a similar structure. Categories down the rows, months 
# across the cols and two aggregate columns

def catagg(ref, c="Sector", r="mret"):
  Calculate and return a month by month performance table 
  with Index and YTD aggregates. 

  :param ref: input dataframe used for calculation, needs 
              category column along with "Date" and asset 
              returns `r` and "mret" (market/index returns)
  :param c:   category value, must be a column in input
  :param r:   column name of category return
  :return:    dataframe with aggregate data

  Example of output table strucutre:

             Jan Feb Mar Apr ... YTD
  Category 1 xxx xxx xxx xxx ... xxx
  Category 2 xxx xxx xxx xxx ... xxx
  Category 3 xxx xxx xxx xxx ... xxx
  Index      xxx xxx xxx xxx ... xxx

  # prepare our reference data, include mret for index
  # calculations
  if r == 'mret':
    df = ref.loc[:, ['Date', c, r]]
    df = ref.loc[:, ['Date', c, r, 'mret']]

  # construct our log rets
  df['lret'] = np.log(1+df[r])
  df['lmret'] = np.log(1+df['mret'])

  # construct our total index return by month
  mperf = (
        pd.Grouper(freq='M', key="Date")
  mperf[c] = 'Novem Example Index'

  # compute performance by category and month
  perf = (
        c, pd.Grouper(freq='M', key="Date")

  # construct our performance by category and year
  ytd = (
        c,pd.Grouper(freq='Y', key="Date")

  # clean up the YTD data so it fits with the other
  # numbers
  ytd['Date'] = 'YTD'
  ytd = ytd[[c,'wret']].rename(columns={'wret':'YTD'})

  # merge index and category performance, pivot it with
  # categoires and index on rows and dates on cols
  cperf = pd.pivot_table(
    pd.concat([perf, mperf]),

  # construct a date range for every month of the current
  # year
  index_month = pd.date_range(
    start=df.Date.max() - offsets.YearBegin(),
    end=df.Date.max() + offsets.YearEnd(),

  # Assign our new month index, adding NA columns for the 
  # missing months
  cperf = cperf.reindex(index_month, axis=1)

  # Give short month names as column labels
  cperf.columns = cperf.columns.strftime('%b')

  # merge in YTD returns
  cperf = pd.merge(
    cperf, ytd, on=c, how='left'

  # Add index YTD return (all market returns in period)
  cperf.loc['Novem Example Index','YTD'] = \

  # sort by last column
  cperf = cperf.sort_values(

  # shiftindex market to end of table
  cperf = cperf.reindex(
    [x for x in cperf.index.to_list() 
     if x != 'Novem Example Index'] + 
    ['Novem Example Index']

  # return our new table
  return cperf

# construct performance table for sectors
sperf = catagg(ppx, c='Sector', r='sret')

# construct performance table for countries
cperf = catagg(ppx, c='Country', r='cret')

# construct performance table for sector contriubtion
cntr_sperf = catagg(ppx, c='Sector', r='mret')

# construct performance table for country contriubtion
cntr_cperf = catagg(ppx, c='Country', r='mret')

As you can see quite a bit of code is neccessary to construct the tables we’re looking for, in particular the custom code for YTD and market aggregates. Fortunately the data is interesting to look at and provide insightful heatmaps for our research. We’ve included the Sector performance table below and if you want to see the other tables just click the links to country performance, sector contributions or country contributions.

Now that we got our monthly performance tables it’s time to create the data for our line charts.

phist = ppx.loc[:, [

# create our log returns
phist['lsret'] = np.log(1 + phist['sret'])
phist['lcret'] = np.log(1 + phist['cret'])
phist['lmret'] = np.log(1 + phist['mret'])

# Create our sector aggregates
shist = pd.pivot_table(phist,
    index='Date', columns='Sector',
    values='lsret', aggfunc='sum'

# create our country aggregates
chist = pd.pivot_table(phist,
    index='Date', columns='Country',
    values='lsret', aggfunc='sum'

# for the market we can just sum all of them
mhist = phist.groupby('Date')['lmret'].sum()

# name the market for later joins = 'Market'

# cumulate our returns
shist = shist.cumsum()
chist = chist.cumsum()
mhist = mhist.cumsum()

# shift it back from log space
shist = np.exp(shist)-1
chist = np.exp(chist)-1
mhist = np.exp(mhist)-1

# let's sort it by the last row
shist = shist.sort_values(
  axis=1, ascending=False

chist = chist.sort_values(
  axis=1, ascending=False

# add in the market performance
shist = pd.merge(shist, mhist, on='Date')
chist = pd.merge(chist, mhist, on='Date')

A fair bit simpler than the previous table most of the effort is taken up by duplicating the country and sector performance along with adding in the additional index return. I’ve added the performance chart below, and the observant viewer will notice that the end of the lines correspond with the YTD returns in the previous table.

As a final comparison below is a comparison of the factor and non-factor adjusted performance series for the NEI.


In this article we’ve shown how a psuedo equity index can be created using open source data and python + pandas. We’ve briefly looked at different ways to adjust the index composition as well as how to calculate returns.

For the next chapter we’ll look at how to enrich the datasets with third party information and create some interesting hierarchies and visualisations.