AI For Trading: Feature Engineering and Labeling (113)
Feature Engineering and Labeling
We'll use the price-volume data and generate features that we can feed into a model. We'll use this notebook for all the coding exercises of this lesson, so please open this notebook in a separate tab of your browser.
Please run the following code up to and including "Make Factors." Then continue on with the lesson.
import sys
!{sys.executable} -m pip install --quiet -r requirements.txt
tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14, 8)
Registering data
import os
import project_helper
from zipline.data import bundles
os.environ['ZIPLINE_ROOT'] = os.path.join(os.getcwd(), '..', '..', 'data', 'module_4_quizzes_eod')
ingest_func = bundles.csvdir.csvdir_equities(['daily'], project_helper.EOD_BUNDLE_NAME)
bundles.register(project_helper.EOD_BUNDLE_NAME, ingest_func)
print('Data Registered')
Data Registered
from zipline.pipeline import Pipeline
from zipline.pipeline.factors import AverageDollarVolume
from zipline.utils.calendars import get_calendar
universe = AverageDollarVolume(window_length=120).top(500)
trading_calendar = get_calendar('NYSE')
bundle_data = bundles.load(project_helper.EOD_BUNDLE_NAME)
engine = project_helper.build_pipeline_engine(bundle_data, trading_calendar)
universe_end_date = pd.Timestamp('2016-01-05', tz='UTC')
universe_tickers = engine\
.run_pipeline(
Pipeline(screen=universe),
universe_end_date,
universe_end_date)\
.index.get_level_values(1)\
.values.tolist()
from zipline.data.data_portal import DataPortal
data_portal = DataPortal(
bundle_data.asset_finder,
trading_calendar=trading_calendar,
first_trading_day=bundle_data.equity_daily_bar_reader.first_trading_day,
equity_minute_reader=None,
equity_daily_reader=bundle_data.equity_daily_bar_reader,
adjustment_reader=bundle_data.adjustment_reader)
def get_pricing(data_portal, trading_calendar, assets, start_date, end_date, field='close'):
end_dt = pd.Timestamp(end_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')
start_dt = pd.Timestamp(start_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')
end_loc = trading_calendar.closes.index.get_loc(end_dt)
start_loc = trading_calendar.closes.index.get_loc(start_dt)
return data_portal.get_history_window(
assets=assets,
end_dt=end_dt,
bar_count=end_loc - start_loc,
frequency='1d',
field=field,
data_frequency='daily')
Make Factors
- We'll use the same factors we have been using in the lessons about alpha factor research. Factors can be features that we feed into the model.
from zipline.pipeline.factors import CustomFactor, DailyReturns, Returns, SimpleMovingAverage
from zipline.pipeline.data import USEquityPricing
factor_start_date = universe_end_date - pd.DateOffset(years=3, days=2)
sector = project_helper.Sector()
def momentum_1yr(window_length, universe, sector):
return Returns(window_length=window_length, mask=universe) \
.demean(groupby=sector) \
.rank() \
.zscore()
def mean_reversion_5day_sector_neutral(window_length, universe, sector):
return -Returns(window_length=window_length, mask=universe) \
.demean(groupby=sector) \
.rank() \
.zscore()
def mean_reversion_5day_sector_neutral_smoothed(window_length, universe, sector):
unsmoothed_factor = mean_reversion_5day_sector_neutral(window_length, universe, sector)
return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=window_length) \
.rank() \
.zscore()
class CTO(Returns):
"""
Computes the overnight return, per hypothesis from
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2554010
"""
inputs = [USEquityPricing.open, USEquityPricing.close]
def compute(self, today, assets, out, opens, closes):
"""
The opens and closes matrix is 2 rows x N assets, with the most recent at the bottom.
As such, opens[-1] is the most recent open, and closes[0] is the earlier close
"""
out[:] = (opens[-1] - closes[0]) / closes[0]
class TrailingOvernightReturns(Returns):
"""
Sum of trailing 1m O/N returns
"""
window_safe = True
def compute(self, today, asset_ids, out, cto):
out[:] = np.nansum(cto, axis=0)
def overnight_sentiment(cto_window_length, trail_overnight_returns_window_length, universe):
cto_out = CTO(mask=universe, window_length=cto_window_length)
return TrailingOvernightReturns(inputs=[cto_out], window_length=trail_overnight_returns_window_length) \
.rank() \
.zscore()
def overnight_sentiment_smoothed(cto_window_length, trail_overnight_returns_window_length, universe):
unsmoothed_factor = overnight_sentiment(cto_window_length, trail_overnight_returns_window_length, universe)
return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=trail_overnight_returns_window_length) \
.rank() \
.zscore()
universe = AverageDollarVolume(window_length=120).top(500)
sector = project_helper.Sector()
pipeline = Pipeline(screen=universe)
pipeline.add(
momentum_1yr(252, universe, sector),
'Momentum_1YR')
pipeline.add(
mean_reversion_5day_sector_neutral_smoothed(20, universe, sector),
'Mean_Reversion_Sector_Neutral_Smoothed')
pipeline.add(
overnight_sentiment_smoothed(2, 10, universe),
'Overnight_Sentiment_Smoothed')
all_factors = engine.run_pipeline(pipeline, factor_start_date, universe_end_date)
all_factors.head()
| | | Mean_Reversion_Sector_Neutral_Smoothed | Momentum_1YR | Overnight_Sentiment_Smoothed |
|---|---|---|---|---|
| 2013-01-03 00:00:00+00:00 | Equity(0 [A]) | -0.262769 | -1.207978 | -1.485669 |
| | Equity(1 [AAL]) | 0.099926 | 1.713471 | 0.919350 |
| | Equity(2 [AAP]) | 1.669138 | -1.535061 | 1.507733 |
| | Equity(3 [AAPL]) | 1.698746 | 1.193111 | -1.367992 |
| | Equity(4 [ABBV]) | NaN | NaN | -0.250063 |
Stop here and continue with the lesson section titled "Features".
Universal Quant Features
- Stock volatility: zipline has a custom factor called AnnualizedVolatility. Its source code is pasted below for reference:
class AnnualizedVolatility(CustomFactor):
"""
Volatility. The degree of variation of a series over time as measured by
the standard deviation of daily returns.
https://en.wikipedia.org/wiki/Volatility_(finance)
**Default Inputs:** :data:`zipline.pipeline.factors.Returns(window_length=2)` # noqa
Parameters
----------
annualization_factor : float, optional
The number of time units per year. Default is 252, the number of NYSE
trading days in a normal year.
"""
inputs = [Returns(window_length=2)]
params = {'annualization_factor': 252.0}
window_length = 252
def compute(self, today, assets, out, returns, annualization_factor):
out[:] = nanstd(returns, axis=0) * (annualization_factor ** .5)
from zipline.pipeline.factors import AnnualizedVolatility
AnnualizedVolatility()
AnnualizedVolatility((Returns((USEquityPricing.close::float64,), window_length=2),), window_length=252)
Quiz
We can see that the Returns window_length is 2, because we're dealing with daily returns, which are calculated as the percent change from one day to the next (a 2-day window). The AnnualizedVolatility window_length is 252 by default, because it measures one-year volatility. Try adjusting the call to the AnnualizedVolatility constructor so that it represents one-month volatility (still annualized, but calculated over a window of 20 trading days).
Answer
# TODO
AnnualizedVolatility(window_length=20)
AnnualizedVolatility((Returns((USEquityPricing.close::float64,), window_length=2),), window_length=20)
Quiz: Create one-month and six-month annualized volatility.
Create AnnualizedVolatility objects for 20-day and 120-day (one-month and six-month) time windows. Remember to set the mask parameter to the universe object created earlier (this filters the stocks to match the list in the universe). Convert these to ranks, and then convert the ranks to zscores.
# TODO
volatility_20d = AnnualizedVolatility(window_length=20, mask=universe).rank().zscore()
volatility_120d = AnnualizedVolatility(window_length=120, mask=universe).rank().zscore()
Add to the pipeline
pipeline.add(volatility_20d, 'volatility_20d')
pipeline.add(volatility_120d, 'volatility_120d')
Quiz: Average Dollar Volume feature
We've been using AverageDollarVolume to choose the stock universe based on stocks that have the highest dollar volume. We can also use it as a feature that is input into a predictive model.
Use a 20-day and a 120-day window_length for average dollar volume. Then rank it and convert to a zscore.
"""already imported earlier, but shown here for reference"""
#from zipline.pipeline.factors import AverageDollarVolume
# TODO: 20-day and 120-day average dollar volume
adv_20d = AverageDollarVolume(window_length=20, mask=universe).rank().zscore()
print(adv_20d)
adv_120d = AverageDollarVolume(window_length=120, mask=universe).rank().zscore()
GroupedRowTransform((Rank(AverageDollarVolume((USEquityPricing.close::float64, USEquityPricing.volume::float64), window_length=20), method='ordinal', mask=AssetExists()), Everything((), window_length=0)), window_length=0)
Add average dollar volume features to pipeline
pipeline.add(adv_20d, 'adv_20d')
pipeline.add(adv_120d, 'adv_120d')
Market Regime Features
We are going to try to capture market-wide regimes. Market-wide means we'll look at the aggregate movement of the universe of stocks.
High and low dispersion: dispersion is the standard deviation of the cross-section of all stock returns at each period of time (on each day). We'll inherit from CustomFactor and feed in DailyReturns as the input.
Quiz
If the inputs to our market dispersion factor are the daily returns, and we plan to calculate the market dispersion on each day, what should be the window_length of the market dispersion class?
Answer
The window_length should be 1: the input is DailyReturns, and we calculate the dispersion across all stocks one day at a time.
Quiz: market dispersion feature
Create a class that inherits from CustomFactor. Override the compute function to calculate the population standard deviation of all the stocks over a specified window of time.
Mean returns
$$\mu = \frac{1}{NT}\sum_{t=1}^{T}\sum_{i=1}^{N}r_{i,t}$$
Market Dispersion
$$\sqrt{\frac{1}{T} \sum_{t=1}^{T} \frac{1}{N}\sum_{i=1}^{N}(r_{i,t} - \mu)^2}$$
Use numpy.nanmean to calculate the average market return $\mu$ and to calculate the average of the squared differences.
class MarketDispersion(CustomFactor):
inputs = [DailyReturns()]
window_length = 1
window_safe = True
def compute(self, today, assets, out, returns):
# TODO: calculate average returns
mean_returns = np.nanmean(returns)
#TODO: calculate standard deviation of returns
out[:] = np.sqrt(np.nanmean((returns - mean_returns)**2))
Quiz
Create the MarketDispersion object. Apply two separate smoothing operations using SimpleMovingAverage: one with a one-month (20-day) window, and another with a six-month (120-day) window. Add both to the pipeline.
# TODO: create MarketDispersion object
dispersion = MarketDispersion(mask=universe)
# TODO: apply one-month simple moving average
dispersion_20d = SimpleMovingAverage(inputs=[dispersion], window_length=20)
# TODO: apply 6-month simple moving average
dispersion_120d = SimpleMovingAverage(inputs=[dispersion], window_length=120)
# Add to pipeline
pipeline.add(dispersion_20d, 'dispersion_20d')
pipeline.add(dispersion_120d, 'dispersion_120d')
Market volatility feature
- High and low volatility
We'll also build a class for market volatility, which inherits from CustomFactor. This will measure the standard deviation of the returns of the "market". In this case, we're approximating the "market" as the equal weighted average return of all the stocks in the stock universe.
Market return
$r_{m,t} = \frac{1}{N}\sum_{i=1}^{N}r_{i,t}$ for each day $t$ in window_length.
Average market return
Also calculate the average market return over the window_length $T$ of days:
$$\mu_{m} = \frac{1}{T}\sum_{t=1}^{T} r_{m,t}$$
Standard deviation of market return
Then calculate the standard deviation of the market return
$$\sigma_{m} = \sqrt{252 \times \frac{1}{T} \sum_{t=1}^{T}(r_{m,t} - \mu_{m})^2 }$$
Hints
- Please use numpy.nanmean so that it ignores null values.
- When using numpy.nanmean: axis=0 will calculate one average for every column (think of it like creating a new row in a spreadsheet), while axis=1 will calculate one average for every row (think of it like creating a new column in a spreadsheet). See the short example after this list.
- The returns data in compute has one day in each row, and one stock in each column.
- Notice that we defined a dictionary params that has a key annualization_factor. This annualization_factor can be used as a regular variable, and you'll be using it in the compute function. This is also done in the definition of AnnualizedVolatility (as seen earlier in the notebook).
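As a quick illustration of the axis argument (the numbers below are made up purely for demonstration):
import numpy as np
# Two days (rows) of returns for three stocks (columns), with one missing value.
returns_example = np.array([[0.01, 0.02, np.nan],
                            [0.03, -0.01, 0.02]])
print(np.nanmean(returns_example, axis=0))  # one average per column (per stock): three values
print(np.nanmean(returns_example, axis=1))  # one average per row (per day): two values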
class MarketVolatility(CustomFactor):
inputs = [DailyReturns()]
window_length = 1 # We'll want to set this in the constructor when creating the object.
window_safe = True
params = {'annualization_factor': 252.0}
def compute(self, today, assets, out, returns, annualization_factor):
# TODO
"""
For each row (each row represents one day of returns),
calculate the average of the cross-section of stock returns
So that market_returns has one value for each day in the window_length
So choose the appropriate axis (please see hints above)
"""
mkt_returns = np.nanmean(returns, axis=1)
# TODO
# Calculate the mean of market returns
mkt_returns_mu = np.nanmean(mkt_returns)
# TODO
# Calculate the standard deviation of the market returns, then annualize them.
out[:] = np.sqrt(annualization_factor * np.nanmean((mkt_returns-mkt_returns_mu)**2))
# TODO: create market volatility features using one month and six-month windows
market_vol_20d = MarketVolatility(window_length=20, mask=universe)
market_vol_120d = MarketVolatility(window_length=120, mask=universe)
# add market volatility features to pipeline
pipeline.add(market_vol_20d, 'market_vol_20d')
pipeline.add(market_vol_120d, 'market_vol_120d')
Stop here and continue with the lesson section "Sector and Industry"
Sector and Industry
Add sector code
Note that after we run the pipeline and get the data in a dataframe, we can work on enhancing the sector code feature with one-hot encoding.
pipeline.add(sector, 'sector_code')
Run pipeline to calculate features
all_factors = engine.run_pipeline(pipeline, factor_start_date, universe_end_date)
all_factors.head()
| | | Mean_Reversion_Sector_Neutral_Smoothed | Momentum_1YR | Overnight_Sentiment_Smoothed | adv_120d | adv_20d | dispersion_120d | dispersion_20d | market_vol_120d | market_vol_20d | sector_code | volatility_120d | volatility_20d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-03 00:00:00+00:00 | Equity(0 [A]) | -0.262769 | -1.207978 | -1.485669 | 1.338573 | 1.397411 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 0 | -0.836546 | -1.219809 |
| | Equity(1 [AAL]) | 0.099926 | 1.713471 | 0.919350 | 1.139994 | 1.081155 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 3 | 1.639924 | 1.566220 |
| | Equity(2 [AAP]) | 1.669138 | -1.535061 | 1.507733 | -0.301547 | -0.919350 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 8 | 1.072400 | -1.470404 |
| | Equity(3 [AAPL]) | 1.698746 | 1.193111 | -1.367992 | 1.728377 | 1.728377 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 1 | 1.050289 | 1.617813 |
| | Equity(4 [ABBV]) | NaN | NaN | -0.250063 | -1.728377 | -1.647475 | 0.014595 | 0.014595 | 0.127654 | 0.135452 | 0 | NaN | NaN |
One-hot encode sector
Let's get all the unique sector codes. Then we'll use the == comparison operator to check when the sector code equals a particular value. This returns a series of True/False values. For some functions that we'll use in a later lesson, it's easier to work with numbers instead of booleans, so we can convert the booleans to type int: False becomes 0, and True becomes 1.
sector_code_l = set(all_factors['sector_code'])
sector_0 = all_factors['sector_code'] == 0
sector_0[0:5]
2013-01-03 00:00:00+00:00 Equity(0 [A]) True
Equity(1 [AAL]) False
Equity(2 [AAP]) False
Equity(3 [AAPL]) False
Equity(4 [ABBV]) True
Name: sector_code, dtype: bool
sector_0_numeric = sector_0.astype(int)
sector_0_numeric[0:5]
2013-01-03 00:00:00+00:00 Equity(0 [A]) 1
Equity(1 [AAL]) 0
Equity(2 [AAP]) 0
Equity(3 [AAPL]) 0
Equity(4 [ABBV]) 1
Name: sector_code, dtype: int64
Quiz: One-hot encode sector
Choose column names that look like "sector_code_0", "sector_code_1" etc. Store the values as 1 when the row matches the sector code of the column, 0 otherwise.
# TODO: one-hot encode sector and store into dataframe
for s in sector_code_l:
all_factors[f'sector_code_{s}'] = (all_factors['sector_code'] == s).astype(int)
all_factors.head()
| | | Mean_Reversion_Sector_Neutral_Smoothed | Momentum_1YR | Overnight_Sentiment_Smoothed | adv_120d | adv_20d | dispersion_120d | dispersion_20d | market_vol_120d | market_vol_20d | sector_code | ... | sector_code_2 | sector_code_3 | sector_code_4 | sector_code_5 | sector_code_6 | sector_code_7 | sector_code_8 | sector_code_9 | sector_code_10 | sector_code_-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-03 00:00:00+00:00 | Equity(0 [A]) | -0.262769 | -1.207978 | -1.485669 | 1.338573 | 1.397411 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Equity(1 [AAL]) | 0.099926 | 1.713471 | 0.919350 | 1.139994 | 1.081155 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 3 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Equity(2 [AAP]) | 1.669138 | -1.535061 | 1.507733 | -0.301547 | -0.919350 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| | Equity(3 [AAPL]) | 1.698746 | 1.193111 | -1.367992 | 1.728377 | 1.728377 | 0.013270 | 0.011178 | 0.127654 | 0.135452 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Equity(4 [ABBV]) | NaN | NaN | -0.250063 | -1.728377 | -1.647475 | 0.014595 | 0.014595 | 0.127654 | 0.135452 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 24 columns
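Side note (not part of the original exercise): pandas can build the same indicator columns in a single call with get_dummies. A minimal sketch, assuming the all_factors dataframe created above:
# Hypothetical alternative to the loop above: one column per unique sector code,
# named with the same 'sector_code_<s>' pattern (e.g. 'sector_code_0', 'sector_code_-1').
sector_dummies = pd.get_dummies(all_factors['sector_code'], prefix='sector_code').astype(int)
sector_dummies.head()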
Stop here and continue with the lesson section "Date Parts".
Date Parts
- We will make features that might capture trader/investor behavior due to calendar anomalies.
- We can get the dates from the index of the dataframe that is returned from running the pipeline.
Accessing index of dates
- Note that we can access the date index using DataFrame.index.get_level_values(0), since the date is stored as index level 0, and the asset name is stored as index level 1. The result is of type DatetimeIndex.
all_factors.index.get_level_values(0)
DatetimeIndex(['2013-01-03', '2013-01-03', '2013-01-03', '2013-01-03',
'2013-01-03', '2013-01-03', '2013-01-03', '2013-01-03',
'2013-01-03', '2013-01-03',
...
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05'],
dtype='datetime64[ns, UTC]', length=363734, freq=None)
DatetimeIndex attributes
- The month attribute is a numpy array with 1 for January, 2 for February, ... 12 for December.
- We can use a comparison operator such as == to return True or False.
- It's usually easier to have all data of a similar type (numeric), so we recommend converting booleans to integers. The numpy ndarray has a method .astype() that can cast the data to a specified type. For instance, astype(int) converts False to 0 and True to 1.
# Example
print(all_factors.index.get_level_values(0).month)
print(all_factors.index.get_level_values(0).month == 1)
print( (all_factors.index.get_level_values(0).month == 1).astype(int) )
[1 1 1 ... 1 1 1]
[ True True True ... True True True]
[1 1 1 ... 1 1 1]
Quiz
- Create a numpy array that has 1 when the month is January, and 0 otherwise. Store it as a column in the all_factors dataframe.
- Add another similar column to indicate when the month is December
# TODO: create a feature that indicates whether it's January
all_factors['is_January'] = (all_factors.index.get_level_values(0).month == 1).astype(int)
# TODO: create a feature to indicate whether it's December
all_factors['is_December'] = (all_factors.index.get_level_values(0).month == 12).astype(int)
Weekday, quarter
- Add columns to the all_factors dataframe that specify the weekday, quarter, and year.
- As you can see in the documentation for DatetimeIndex, weekday, quarter, and year are attributes that you can use here.
# we can see that 0 is for Monday, 4 is for Friday
set(all_factors.index.get_level_values(0).weekday)
{0, 1, 2, 3, 4}
# Q1, Q2, Q3 and Q4 are represented by integers too
set(all_factors.index.get_level_values(0).quarter)
{1, 2, 3, 4}
Quiz
Add features for weekday, quarter and year.
# TODO
all_factors['weekday'] = all_factors.index.get_level_values(0).weekday
all_factors['quarter'] = all_factors.index.get_level_values(0).quarter
all_factors['year'] = all_factors.index.get_level_values(0).year
Start-of and end-of period features
- The start and end of the week, month, and quarter may have structural differences in trading activity.
- pandas.date_range takes a start date, an end date, and a frequency.
- The frequency code for the last business day of each month is BM.
# Example
tmp = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BM')
tmp
DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-29', '2013-04-30',
'2013-05-31', '2013-06-28', '2013-07-31', '2013-08-30',
'2013-09-30', '2013-10-31', '2013-11-29', '2013-12-31',
'2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
'2014-05-30', '2014-06-30', '2014-07-31', '2014-08-29',
'2014-09-30', '2014-10-31', '2014-11-28', '2014-12-31',
'2015-01-30', '2015-02-27', '2015-03-31', '2015-04-30',
'2015-05-29', '2015-06-30', '2015-07-31', '2015-08-31',
'2015-09-30', '2015-10-30', '2015-11-30', '2015-12-31'],
dtype='datetime64[ns, UTC]', freq='BM')
Example
Create a DatetimeIndex that stores the dates which are the last business day of each month.
Use the .isin function, passing in these last days of the month, to create a series of booleans.
Convert the booleans to integers.
last_day_of_month = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BM')
last_day_of_month
DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-29', '2013-04-30',
'2013-05-31', '2013-06-28', '2013-07-31', '2013-08-30',
'2013-09-30', '2013-10-31', '2013-11-29', '2013-12-31',
'2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
'2014-05-30', '2014-06-30', '2014-07-31', '2014-08-29',
'2014-09-30', '2014-10-31', '2014-11-28', '2014-12-31',
'2015-01-30', '2015-02-27', '2015-03-31', '2015-04-30',
'2015-05-29', '2015-06-30', '2015-07-31', '2015-08-31',
'2015-09-30', '2015-10-30', '2015-11-30', '2015-12-31'],
dtype='datetime64[ns, UTC]', freq='BM')
tmp_month_end = all_factors.index.get_level_values(0).isin(last_day_of_month)
tmp_month_end
array([False, False, False, ..., False, False, False])
tmp_month_end_int = tmp_month_end.astype(int)
tmp_month_end_int
array([0, 0, 0, ..., 0, 0, 0])
all_factors['month_end'] = tmp_month_end_int
Quiz: Start of Month
Create a feature that indicates the first business day of each month.
Hint: The frequency code for the first business day of each month is BMS.
# TODO: month_start feature
first_day_of_month = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BMS')
all_factors['month_start'] = (all_factors.index.get_level_values(0).isin(first_day_of_month)).astype(int)
Quiz: Quarter end and quarter start
Create features for the last business day of each quarter, and first business day of each quarter.
Hint: Use freq='BQ' for the last business day of each quarter, and freq='BQS' for the first business day of each quarter.
# TODO: qtr_end feature
last_day_qtr = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQ')
all_factors['qtr_end'] = (all_factors.index.get_level_values(0).isin(last_day_qtr)).astype(int)
# TODO: qtr_start feature
first_day_qtr = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQS')
all_factors['qtr_start'] = (all_factors.index.get_level_values(0).isin(first_day_qtr)).astype(int)
View all features
list(all_factors.columns)
['Mean_Reversion_Sector_Neutral_Smoothed',
'Momentum_1YR',
'Overnight_Sentiment_Smoothed',
'adv_120d',
'adv_20d',
'dispersion_120d',
'dispersion_20d',
'market_vol_120d',
'market_vol_20d',
'sector_code',
'volatility_120d',
'volatility_20d',
'sector_code_0',
'sector_code_1',
'sector_code_2',
'sector_code_3',
'sector_code_4',
'sector_code_5',
'sector_code_6',
'sector_code_7',
'sector_code_8',
'sector_code_9',
'sector_code_10',
'sector_code_-1',
'is_January',
'is_December',
'weekday',
'quarter',
'year',
'month_end',
'month_start',
'qtr_end',
'qtr_start']
Note that we can skip the sector_code feature, since we one-hot encoded it into separate features.
features = ['Mean_Reversion_Sector_Neutral_Smoothed',
'Momentum_1YR',
'Overnight_Sentiment_Smoothed',
'adv_120d',
'adv_20d',
'dispersion_120d',
'dispersion_20d',
'market_vol_120d',
'market_vol_20d',
#'sector_code', # removed sector_code
'volatility_120d',
'volatility_20d',
'sector_code_0',
'sector_code_1',
'sector_code_2',
'sector_code_3',
'sector_code_4',
'sector_code_5',
'sector_code_6',
'sector_code_7',
'sector_code_8',
'sector_code_9',
'sector_code_10',
'sector_code_-1',
'is_January',
'is_December',
'weekday',
'quarter',
'year',
'month_start',
'qtr_end',
'qtr_start']
Stop here and continue to the lesson section "Targets"
Targets (Labels)
- We are going to try to predict the forward 1-week return.
- Very important! We quantize the target. Why do we do this? (A small illustrative sketch follows this list.)
- It makes the target market neutral.
- It normalizes changing volatility and dispersion over time.
- It makes the target robust to changes in market regimes.
- The factor we create is the trailing 5-day return.
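Before using zipline's .quantiles(), here is a small self-contained sketch of what quantizing a cross-section of returns does. The returns below are made up for illustration, and pd.qcut stands in for the per-day quantile labeling that the pipeline applies; zipline labels missing values as -1, which we mimic here:
import numpy as np
import pandas as pd
# Hypothetical cross-section of 5-day returns for six stocks on a single day.
returns_1day = pd.Series(
    [0.021, -0.013, 0.004, np.nan, -0.030, 0.015],
    index=['A', 'AAL', 'AAP', 'ABBV', 'ABC', 'ABT'])
# Split into 2 quantiles: 0 = bottom half, 1 = top half of that day's returns.
labels_2q = pd.Series(pd.qcut(returns_1day, 2, labels=False), index=returns_1day.index)
# Stocks with missing returns get no quantile; mark them as -1, as zipline does.
labels_2q = labels_2q.fillna(-1).astype(int)
print(labels_2q)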
# we'll create a separate pipeline to handle the target
pipeline_target = Pipeline(screen=universe)
Example
We'll convert weekly returns into 2-quantiles.
return_5d_2q = Returns(window_length=5, mask=universe).quantiles(2)
return_5d_2q
Quantiles((Returns((USEquityPricing.close::float64,), window_length=5),), window_length=0)
pipeline_target.add(return_5d_2q, 'return_5d_2q')
Quiz
Create another weekly return target that's converted to 5-quantiles.
# TODO: create a target using 5-quantiles
return_5d_5q = Returns(window_length=5, mask=universe).quantiles(5)
# TODO: add the feature to the pipeline
pipeline_target.add(return_5d_5q, 'return_5d_5q')
# Let's run the pipeline to get the dataframe
targets_df = engine.run_pipeline(pipeline_target, factor_start_date, universe_end_date)
targets_df.head()
| | | return_5d_2q | return_5d_5q |
|---|---|---|---|
| 2013-01-03 00:00:00+00:00 | Equity(0 [A]) | 0 | 0 |
| | Equity(1 [AAL]) | 1 | 1 |
| | Equity(2 [AAP]) | 0 | 0 |
| | Equity(3 [AAPL]) | 1 | 1 |
| | Equity(4 [ABBV]) | -1 | -1 |
targets_df.columns
Index(['return_5d_2q', 'return_5d_5q'], dtype='object')
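As an optional sanity check (not in the original notebook), you can count how many rows fall into each quantile label; with 5 quantiles you would expect the labels 0 through 4, plus -1 for assets whose returns are missing:
# Hypothetical check: distribution of the 5-quantile target labels across all rows.
targets_df['return_5d_5q'].value_counts()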