The Improbability of Garrett Wittels’ Streak

Garrett Wittels – Improbable Streak?

Since the beginning of the 2010 NCAA Baseball season, Garrett Wittels (3B – FIU Panthers – Div. I) has had at least one hit in every game. His current 56-game streak is threatening Robin Ventura’s NCAA Div. I record of 58 straight games, set in 1987. If Wittels opts to return to FIU for his junior season next year, he will have an opportunity to set a new NCAA Div. I hitting streak record.

After reading about this streak in an ESPN article, I started to wonder how improbable such a streak really was even given Wittels’ high batting average. I also was curious to see how likely he is to break Ventura’s mark of 58 straight games.  Thankfully, I work at RJMetrics where fun analyses like these are part of my job description.  I soon had some answers.

Assuming Wittels returns for the 2011 NCAA season, and using his .413 batting average from the start of the streak through today, I calculated the following probabilities:

  • Wittels falls short of the record: 34.3%
  • Wittels ties the record: 6.6%
  • Wittels breaks the record streak: 59.1%

Below is a chart displaying the probability that Garrett Wittels’ streak would last as many games as it has (by game number).  I used his current-season batting average and his number of at-bats in each game as the inputs to each statistic.  As you can see, these probabilities compound quickly, making this many consecutive games with at least one hit extremely unlikely.

Wittels' Streak Probability - Pre-Streak Perspective

Some highlights of Wittels’ hitting streak probabilities:

  • Probability of 10+ game hitting streak: 78.18%
  • Probability of 20+ game hitting streak: 34.09%
  • Probability of 30+ game hitting streak: 11.08%
  • Probability of 40+ game hitting streak: 2.45%
  • Probability of 50+ game hitting streak: 0.73%
  • Probability of 56+ game hitting streak: 0.45%
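
For readers who want to see the compounding mechanics, here is a minimal sketch in Python. The real analysis used Wittels’ actual inputs for each game; the figures plugged in below (a flat .413 average and 4 at-bats per game) are illustrative assumptions only.

```python
# A minimal sketch of the compounding described above, using hypothetical
# inputs; the real analysis used Wittels' batting average entering each game
# and his actual at-bats in each game.

def hit_probability(batting_avg, at_bats):
    """Probability of at least one hit in a game: 1 - P(no hit in every at-bat)."""
    return 1 - (1 - batting_avg) ** at_bats

def streak_probability(games):
    """Probability a streak survives every game in `games`,
    where `games` is a list of (batting_avg_entering_game, at_bats) pairs."""
    prob = 1.0
    for avg, at_bats in games:
        prob *= hit_probability(avg, at_bats)
    return prob

# Illustrative placeholder inputs: a flat .413 hitter getting 4 at-bats per game.
sample_games = [(0.413, 4)] * 56
print(f"56-game streak probability: {streak_probability(sample_games):.2%}")
```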

Please note that these probabilities are from the perspective of Wittels and his unique hitting profile:

  • A very high but variable batting average (ranging from .401 to .600 – 34th in the nation)
  • A high number of at-bats per game (ranging from 3 to 6)
  • Probabilities calculated using Wittels’ batting average entering each game and his at-bat opportunities during each game (neither of which is known before a streak begins)

Forecasted Probabilities (assuming a constant batting average of .413 and a constant 4.321 at-bats per game):

  • Probability of 56+ game hitting streak: 100% (Joe DiMaggio – MLB)
  • Probability of 57+ game hitting streak: 90.01%
  • Probability of 58+ game hitting streak: 81.02% (Robin Ventura – Div. I)
  • Probability of 59+ game hitting streak: 72.93% (Damian Constantino – Div. III)
  • Probability of 60+ game hitting streak: 65.65%
  • Probability of 61+ game hitting streak: 59.09% (Joe DiMaggio – AAA)
  • Probability of 70+ game hitting streak: 22.92%
  • Probability of 80+ game hitting streak: 8.00%
  • Probability of 90+ game hitting streak: 2.79%
  • Probability of 100+ game hitting streak: 0.98%

Please note that the forecasted probabilities are calculated given that the streak has already reached 56 games. In all likelihood, Wittels will break Robin Ventura’s 58-game streak (72.93%). In fact, there’s roughly a 50/50 shot he stretches the streak beyond 61 games, successfully setting the hitting streak record across all levels of baseball.
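
Under those stated assumptions, the forecast reduces to raising a single per-game hit probability to the power of the number of additional games needed. The sketch below is my reconstruction of that arithmetic, not the original spreadsheet, and it approximately reproduces the figures listed above.

```python
# Reconstruction of the forecast arithmetic under the stated assumptions:
# a constant .413 batting average and a constant 4.321 at-bats per game,
# conditioned on the 56 games already completed.

BATTING_AVG = 0.413
AT_BATS_PER_GAME = 4.321
CURRENT_STREAK = 56

# Probability of at least one hit in any single future game (~0.90).
p_game = 1 - (1 - BATTING_AVG) ** AT_BATS_PER_GAME

def prob_streak_reaches(target_games):
    """Probability the streak reaches `target_games`, given 56 games already in hand."""
    return p_game ** max(target_games - CURRENT_STREAK, 0)

for target in (57, 58, 59, 61, 70, 80, 90, 100):
    print(f"Probability of {target}+ game hitting streak: {prob_streak_reaches(target):.2%}")
```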

Assuming we’re lucky enough for Wittels to stay for his Junior year, keep your eyes peeled for the 2011 FIU season opening series against the University of Massachusetts on February 18th-20th.

RJMetrics is a MySQL Enterprise Ready Partner

We are proud to announce that we have been selected as a “MySQL Enterprise Ready Partner,” as profiled on MySQL’s website.

As satisfied users of MySQL ourselves, we are pleased to fully support all clients using MySQL as their database platform.  We are looking forward to working closely with MySQL to offer the strongest possible business intelligence solution for our clients.

MySQL Enterprise Ready Partner

If you’re interested in learning more about RJMetrics, check out our website where you can learn more and try out a free demo.

NBA Finals Simulation Predicts “Lakers in 7” (and other insights)

This post is brought to you by RJMetrics.  We provide hosted business intelligence software that helps web-based businesses harness the power of their data to make smarter business decisions.  Check us out!

The Experiment

With the increased use of sports analytics and this year’s NBA Finals underway, I thought it would be interesting to construct a rudimentary model projecting the outcome of the 2010 NBA Finals.

To gather data, I referenced the NBA Encyclopedia – Playoff Edition from NBA.com. I aggregated data points containing the year, game number, home/away teams, and home/away scores from all Finals games since 1946 (363 games).

These data points were fed into a model which ran 10,000 Monte Carlo simulations of the NBA finals.  We can use the results of these simulations to draw insights about the outcome of the series.

Some highlights:

  • In 64% of the simulations, the Lakers won the series
  • The most likely outcome was a Lakers sweep, representing 11% of the simulations
  • The probability of a 7-game series was 29%
  • Given the current state of the series (Lakers up 2-1), the numbers heavily suggest a Lakers win:
    • The chance of the Celtics winning the championship in any number of games is now just 25%
    • The most likely outcome is Lakers in 7 games (31%)
    • The next most likely outcomes are Lakers in 5 games (22%) and Lakers in 6 games (21%)

The Results

After collecting the necessary data, I calculated three key statistics: home court winning percentage by game number, “streaking” momentum probability, and a weighted expected winning percentage based on the regular season stats.

These stats are obviously not fully independent, nor do they represent perfectly clean input data (for example, the rules determining which team plays “home” in which games have changed over the NBA’s history).  However, they each provided some interesting insights into the historical outcomes of NBA Finals match-ups.

The historical home team winning percentage by game is shown below.

Home Team Winning Percentage by Game

Throughout the life of the NBA, games 3 and 4 have been played at the home of the team with the worse regular-season record.  The impact of regular season records is evident in the dip seen during those games in the chart above.

The propensity to have a winning streak is also interesting:

"Streaking" Momentum Probability

As you might expect, with each consecutive win your chances of winning the next game go up.  However, what I found very interesting is how low the historical chances are of winning a second game in a row. I have a few theories on what might be influencing that statistic:

  • The Finals feature the league’s two most competitive teams, so one team is less likely to dominate the other.  This makes a multi-game winning streak less likely in Finals play than some back-and-forth between the teams.
  • Factors such as home court advantage could be influencing the numbers, as the “home” team switches often throughout the series.

My final statistic was a simple weighted average of the two teams’ regular season winning records.  I used this as a baseline probability that the Lakers would win any given game in the series (53.3%).

To simulate the games in the series, I took each of these three inputs and weighted them equally in a 10,000-iteration Monte Carlo simulation of the series.  There are obviously countless other ways I could have approached the problem or chosen to weight these statistics.  I chose this rudimentary “equal weighting” methodology to provide some basic insights into how my inputs would combine to create simulated outcomes.
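
To make the mechanics concrete, here is a minimal sketch of an equal-weighting Monte Carlo like the one described above. It is a reconstruction under stated assumptions, not the actual model: the home-court and momentum tables are illustrative placeholders (the original values came from the historical Finals data), and the 0.533 baseline is the figure quoted earlier.

```python
import random

# A minimal sketch of the equal-weighting Monte Carlo described above.  The
# home-court and momentum tables below are illustrative placeholders; the
# original analysis derived them from every Finals game since 1946.  The 0.533
# baseline is the regular-season figure quoted in the post.

BASELINE_LAKERS = 0.533
HOME_WIN_PCT = {1: 0.70, 2: 0.65, 3: 0.55, 4: 0.55, 5: 0.65, 6: 0.60, 7: 0.70}  # placeholder
STREAK_CONTINUE = {1: 0.45, 2: 0.55, 3: 0.60}  # placeholder P(extend | streak length)
LAKERS_HOME_GAMES = {1, 2, 6, 7}  # 2-3-2 format; the Lakers held home-court advantage

def simulate_series(rng):
    """Simulate one best-of-seven series; return (game-by-game outcome, winner)."""
    lakers_wins = celtics_wins = 0
    last_winner, streak = None, 0
    outcome = []
    for game in range(1, 8):
        # Home-court input, expressed as P(Lakers win this game).
        home = HOME_WIN_PCT[game] if game in LAKERS_HOME_GAMES else 1 - HOME_WIN_PCT[game]
        # Momentum input, expressed as P(Lakers win this game).
        if last_winner is None:
            momentum = 0.5
        elif last_winner == "L":
            momentum = STREAK_CONTINUE[min(streak, 3)]
        else:
            momentum = 1 - STREAK_CONTINUE[min(streak, 3)]
        p_lakers = (home + momentum + BASELINE_LAKERS) / 3  # equal weighting of the 3 inputs
        winner = "L" if rng.random() < p_lakers else "C"
        outcome.append(winner)
        streak = streak + 1 if winner == last_winner else 1
        last_winner = winner
        lakers_wins += winner == "L"
        celtics_wins += winner == "C"
        if lakers_wins == 4 or celtics_wins == 4:
            break
    return "-".join(outcome), "L" if lakers_wins == 4 else "C"

rng = random.Random(2010)
results = [simulate_series(rng) for _ in range(10_000)]
lakers_share = sum(1 for _, winner in results if winner == "L") / len(results)
print(f"Lakers win the series in {lakers_share:.0%} of simulations")
```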

After running the simulations, the following statistics surfaced:

From beginning of NBA Finals:

  • Probability Lakers win series: 64%
  • Probability Celtics win series: 36%
  • Most likely game-by-game series outcome: A Lakers sweep (11% chance)
  • The next four most likely series outcomes were the four permutations of a Lakers championship in 5 games (these, combined with the previous stat, meant a 26% chance of the Lakers winning in 5 games or fewer).
  • Least likely series outcome: C-C-C-L-L-L-C
  • Expected length of series: 5.8 games

Given what has occurred through 3 games of the NBA Finals (with the Lakers up 2-1 as of Tuesday night):

  • Probability Lakers win series: 75%
  • Probability Celtics win series: 25%
  • Most likely series outcome: L-C-L-L-C-L (Lakers in 6)
  • Least likely series outcome: L-C-L-L-C-C-C (Celtics in 7)
  • Expected length of series: 6.2 games

Here are the chances of each possible remaining outcome:

  • Lakers in 7: 31%
  • Lakers in 5: 22%
  • Lakers in 6: 21%
  • Celtics in 7: 15%
  • Celtics in 6: 10%

Interestingly, even if the Celtics win game 4 (tying the series 2-2), the Lakers are still more favored than they were before the series started (60% vs 53%).

The Methodology

I found the process of extracting and analyzing the data to be quite educational.  If you’re curious about how I arrived at these numbers, read on.

Game-by-Game Home Court Advantage Explained

To create the game-by-game home court advantage for the Finals, I used all data points since the NBA finals began in 1946.  The data points were simply the percentage of “home court wins” across the entire data set by game number.
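
A sketch of that aggregation is below, assuming each historical game is represented as a simple record with a game number and the two scores. The field names are illustrative; the underlying data came from NBA.com.

```python
from collections import defaultdict

# Sketch of the aggregation described above, assuming each historical game is a
# record like {"game_number": 3, "home_score": 98, "away_score": 101}.

def home_win_pct_by_game(games):
    """Return {game_number: fraction of those games won by the home team}."""
    wins, totals = defaultdict(int), defaultdict(int)
    for game in games:
        totals[game["game_number"]] += 1
        wins[game["game_number"]] += game["home_score"] > game["away_score"]
    return {n: wins[n] / totals[n] for n in sorted(totals)}
```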

For those of you familiar with the NBA Finals format, you may know that the format changed from 2-2-1-1-1 to 2-3-2 after the 1984 finals. This means that games 1, 2, 6, and 7 have been played at the superior team’s home court for the past 25 years. Prior to 1985, the 2-2-1-1-1 Finals format held steady with a few exceptions.

The older series format held games 1, 2, 5, and 7 at the superior team’s home court from 1946-1984 (39 years). I initially wanted to use only the newer playoff format to avoid inflation in the game 5 home winning percentage (from game 5 being held at the superior team’s home court for 39 years) and deflation in the game 6 home winning percentage (from game 6 being held at the inferior team’s home court for 39 years).  However, to achieve a statistically significant number of data points in all situations, I aggregated the two playoff formats and used all 64 years of Finals data.

Momentum Analysis Explained

Every game that is played (except for the first in the series) is an opportunity to continue a streak.  Streaks can be as short as two games and as long as four (since, after winning four games, the series has ended).  Streaks also end organically when the series ends, so we have to be careful to not count “end of series” games as missed opportunities to continue streaks.

A 4-game sweep (as LA accomplished in this year’s Utah series) is viewed as, and limited strictly to, 3 statistical data points in our momentum analysis:

  • Given one win, what was the outcome of the second game?
    • In this case, the result is a successful conversion of a 1 game streak into a 2 game streak
  • Given two consecutive wins, what was the outcome of the third game?
    • In this case, the result is a successful conversion of a 2 game streak into a 3 game streak
  • Given three consecutive wins, what was the outcome of the fourth game?
    • In this case, the result is a successful conversion of a 3 game streak into a 4 game streak

Note that we do not count mini-streaks within a streak as their own streaks (for example, the third win in a 3-game streak doesn’t also count as the second win in a two-game streak).
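
Here is a minimal sketch of that counting scheme, assuming each series is represented as an ordered list of game winners. It illustrates the rules described above rather than the original analysis code.

```python
from collections import defaultdict

# Sketch of the streak-conversion counting described above.  Each series is
# an ordered list of game winners, e.g. ["L", "L", "C", "L", "L", "L"].

def streak_conversion_rates(series_results):
    """Return {streak length n: fraction of n-game streaks extended to n+1 games}."""
    attempts, conversions = defaultdict(int), defaultdict(int)
    for winners in series_results:
        streak = 1  # length of the most recent winner's streak entering the next game
        for prev, curr in zip(winners, winners[1:]):
            # Every game actually played after the first is an opportunity to
            # extend a streak; games never played because the series ended are
            # not counted as missed opportunities.
            attempts[streak] += 1
            if curr == prev:
                conversions[streak] += 1
                streak += 1  # the longer streak is counted once, not as nested mini-streaks
            else:
                streak = 1
    return {n: conversions[n] / attempts[n] for n in sorted(attempts)}

# A sweep contributes exactly three data points (1->2, 2->3, 3->4), as described.
print(streak_conversion_rates([["L", "L", "L", "L"], ["L", "C", "L", "L", "C", "L"]]))
```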

We chose to exclusively use historical Finals data for “streakiness.” We considered using 2009-2010 regular season data for Lakers and Celtics streakiness but decided against it for two reasons:

  • There were not enough data points from the regular season to provide a good basis for analysis
  • The characteristics of a “streak” in the regular season are quite different, as they can span different teams and stretch far beyond the “4 game” limit of a playoff series.

Head to Head Winning Percentage Explained

To create the head-to-head winning percentages, I simply looked at each team’s regular season record (Celtics 50-32, Lakers 57-25) and determined that the winning percentage of the Lakers was 14% larger than that of the Celtics.

I then constructed a head-to-head winning probability for the Lakers that was 14% better than the complementary winning probability of the Celtics.
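
In other words, the per-game probabilities are chosen so that they sum to one while preserving that 14% gap, which works out to the 53.3% baseline mentioned earlier. A quick sketch of the arithmetic:

```python
# The arithmetic behind that baseline: choose per-game probabilities that sum
# to one while preserving the ~14% gap between the two winning percentages.

lakers_pct = 57 / 82    # ~.695 regular-season winning percentage
celtics_pct = 50 / 82   # ~.610
ratio = lakers_pct / celtics_pct       # ~1.14

p_lakers = ratio / (1 + ratio)         # ~0.533, the baseline used in the simulation
p_celtics = 1 - p_lakers               # ~0.467
print(f"Baseline P(Lakers win a given game) = {p_lakers:.3f}")
```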

Conclusion

I hope you enjoyed learning about my experience simulating the NBA finals using statistics.  There are obviously a number of areas where this model could be expanded and improved, and I hope to explore them in the future.

Thanks to RJMetrics for supporting this small project as part of my summer internship.  If your web-based business needs better insight into its backend data, RJMetrics can help you measure, manage, and monetize better.  Give it a try!

RJMetrics Feature Spotlight: Syndicated Dashboards

We are pleased to announce a new feature that allows our clients to offer centrally-managed dashboard content to their users.  This feature is called “syndicated dashboards.”

Using syndicated dashboards, a company’s administrative user can “syndicate” selected dashboards from her account to any other user in the system.  These dashboards cannot be edited by the recipients, but they automatically update whenever the administrator changes them.

For example, consider an administrative user who wants to share her “Key Performance Indicators” dashboard with other users at her company without allowing those users to edit its contents.  Syndication makes this possible and offers several benefits, including:

  • Future changes to syndicated dashboards are automatically reflected on the recipients’ accounts.
  • End users are prevented from editing the syndicated content, leaving full control in the hands of the administrator.
  • Management can ensure that the logic being used to arrive at certain metrics is consistent across all users.

As always, end users still have the ability to create and edit their own private dashboards.   Only the syndicated dashboards in their accounts are read-only.

Once syndicated dashboards are enabled, the admin user for any company can syndicate any of her dashboards to any of the company’s RJMetrics users.  This is done by going to the Settings page, viewing the “Dashboards”  sub-menu, choosing a dashboard, and selecting the syndicated users from the dropdown list:

Syndication Process

When the admin user then visits her newly syndicated dashboard, she will see a message in the top-left corner saying how many users the dashboard has been syndicated to, and a helpful mouse-over tip explaining what that means.

Syndicated Dashboard - Admin User Perspective

When the end user logs in, the syndicated dashboard will appear in her list of dashboards with a message in the top-right corner explaining that its contents cannot be edited.

Syndicated Dashboard - Syndicated User Perspective

If you’re interested in learning more about RJMetrics, check out our website where you can learn more and try out a free demo.

EnterpriseDB Certified Application Vendor

We are pleased to announce that we are now an EnterpriseDB certified application vendor.  EnterpriseDB is the leader in PostgreSQL-based products and services.  We are proud to have a large and growing customer base using the Postgres platform.

Our Profile on EnterpriseDB's Website

If you’re interested in learning more about RJMetrics, check out our website where you can learn more and try out a free demo.

RJMetrics Feature Spotlight: Logged Values

The RJMetrics Logged Values feature is a great way to track values that change in your database over time.

For example, while most e-commerce companies do record their current inventory, many do not keep a historical record of past inventories.  Using logged values, our clients are able to compare historical snapshots of frequently-changing database tables and build related charts and graphs that provide new insights into their businesses.

Take our favorite fictional company, Vandelay Industries, as an example.  Each time a Vandelay Industries product is sold or inventory is replenished, the inventory value in the company’s database is changed. Our system can log these inventory values over time, allowing access to useful information such as the relationship between inventory characteristics (product, category, size, or other attribute) and other company events.

Some helpful perspectives provided by using logged values might include:

  • Tracking the number of items that are “out of stock” over time (and their “out of stock” durations)
  • Monitoring sales velocity of specific products to identify key replenishment points
  • Determining “overstock” status for certain products or styles
  • Monitoring the total value of merchandise in inventory over time (and by different characteristics)

And logged values aren’t just for inventory.  Any time values are being overwritten in your database (user statuses, preferences, etc), we can log historical values to provide you with insights into how those values change over time.
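
For the technically curious, the underlying idea is a simple snapshot pattern: periodically copy a frequently-overwritten value into a history table keyed by timestamp. The sketch below is purely illustrative, with made-up table names and an in-memory SQLite database; it is not the RJMetrics implementation.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative sketch of the snapshot pattern behind "logged values" (not the
# RJMetrics implementation): copy a frequently-overwritten value into a
# history table so past values can be charted over time.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, quantity INTEGER);
    CREATE TABLE inventory_log (logged_at TEXT, product_id INTEGER, quantity INTEGER);
    INSERT INTO inventory VALUES (1, 40), (2, 0), (3, 125);
""")

def log_inventory_snapshot(conn):
    """Append the current state of `inventory` to `inventory_log`."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO inventory_log SELECT ?, product_id, quantity FROM inventory",
        (now,),
    )
    conn.commit()

log_inventory_snapshot(conn)

# Example question the history can answer: how many products were out of
# stock at the time of each snapshot?
rows = conn.execute(
    "SELECT logged_at, SUM(quantity = 0) FROM inventory_log GROUP BY logged_at"
).fetchall()
print(rows)
```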

Vandelay Industries

If you’re interested in learning more about RJMetrics, check out our website where you can learn more and try out a free demo.

RJMetrics Feature Spotlight: Moving End Dates

Many businesses use the moving date range available in step two of the RJMetrics chart wizard.  This allows users to always view their most recent data without having to edit the properties of a chart.

Displayed below is a 1-month moving line-chart entitled “Sales Generated by Day.”  In this case, the profiled company is an affiliate marketing business whose sales data is reported with a 7-day delay.

Sales Generated Chart without Moving End of Range Restriction

Notice that there are 6 to 7 data points (circled) displaying insignificant or incomplete sales figures at the end of the chart.

This is a normal characteristic of a business that experiences delayed sales reporting in its database.  Companies may have delayed reporting for several reasons, including:

  • Sales data coming from 3rd party sources
  • Manually-entered data from internal sales teams
  • Volatility in “order status” or other qualifying characteristics
  • Payment processing issues or changes

These kinds of short-term post-sale issues can make the most recent several days of data unreliable.

To prevent displaying incomplete data, we allow users to conveniently set a moving end date in the RJMetrics Chart Wizard.  This can be found in step 2 under “End of Range” when choosing to “Show a Moving Date Range”.

Chart Wizard - Step 2 - Moving End of Range Date

Once a moving end of range date is selected, the user can build a chart with only the complete data points as seen below.  We have edited the chart above to make use of the moving end date feature.  We chose to view the past 30 days but exclude the past 7 days, so the chart now contains only data from 30 days prior to the date it is viewed through 7 days prior to that date.  (In the image below, the chart was accessed on May 21st.)
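
As a rough illustration of the date arithmetic behind such a window (this is not the chart wizard’s actual code), the start and end of the range are simply fixed offsets from the date the chart is viewed:

```python
from datetime import date, timedelta

# Illustration of the moving window described above: data from 30 days prior
# to the viewing date through 7 days prior to that date, recomputed each time
# the chart is viewed.

def moving_range(viewed_on, start_days_ago=30, end_days_ago=7):
    """Return the (start, end) dates of the moving window."""
    return (viewed_on - timedelta(days=start_days_ago),
            viewed_on - timedelta(days=end_days_ago))

start, end = moving_range(date(2010, 5, 21))
print(start, "through", end)   # 2010-04-21 through 2010-05-14
```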

As with all moving date range charts, this range automatically adjusts over time so that you are always viewing the most recent data available.

Sales Generated Chart with Moving End of Range Restriction

If you’re interested in learning more about RJMetrics, check out our website where you can learn more and try out a free demo.

RJMetrics Feature Spotlight: Enhanced Dashboard Menus

Here at RJMetrics, we work hard to create a great user experience and provide our customers with easy access to their data.  We keep in constant contact with our customers to identify new opportunities to enhance our product.

We’re proud to say that our users have been creating new dashboards at a rapid pace!  This is often caused by access to multiple data sources (backend databases, Google Analytics, Twitter, etc) or by users who have access to data from multiple companies (investors, advisors, etc).

To manage the high number of dashboards that some users generate, we enhanced our dashboard menu system to save our clients even more valuable time.

The Old System

Traditionally, all dashboards appeared under the main “Dashboards” menu in the top-right of RJMetrics.

The Original Dashboard Menu

The New System

Now, dashboards are automatically organized into subcategories based on the company whose data they contain and the source of that data.

The screenshots below are for a user who has access to data from two companies: Vandelay Industries and Play Now.  Under the enhanced dashboard menu system, dashboards associated with Vandelay Industries’ data are automatically filed under the Vandelay Industries subcategory. The same is true for Play Now data.  Dashboards with data from multiple companies (like the “Mixed Content” dashboard seen below) appear in the first level of the menu.

Enhanced Dashboard Menu Perspective

Dashboards built using data aggregated from Google Analytics’ or Twitter’s API are automatically filed under the corresponding data source.

API Data-sourced Dashboard Categorization

If you’re interested in learning more about RJMetrics, check out our website where you can learn more and try out a free demo.

Meet the Intern: Brent Linsky

Hello readers of The Metric System!  I’m Brent Linsky, the current intern at RJMetrics. I’m a rising senior at the University of Michigan concentrating in Financial Mathematics/Risk Management and Economics.  RJMetrics represents a fusion of my passions for entrepreneurship and data analysis.

Data analysis plays a pivotal role in a company’s decision-making process and is the core value of RJMetrics’ offerings. The design and implementation of a data analytics suite is both impressive and intriguing to me. I’m very excited to expand my understanding of the product’s creation and growth while I’m here.

In the coming months, I’ll be contributing many new blog entries spotlighting RJMetrics features and other topics of interest.  Stay tuned for more!

See Us Tonight At The Startup Love Triangle

Together with the founders of eFlex CMS and Career Intercom, I am presenting at the Startup Love Triangle.  As with most of my favorite Philadelphia events, this is put together by the fine folks at the Philly Startup Leaders. The three startups will be on stage with four potential investors/advisors, including Gabe Weinberg from DuckDuckGo, Ellen Weber from Robin Hood Ventures, Steve Welsh from DreamIt Ventures, and Gabe Zichermann from the Founders Institute.

There are a few (free) tickets left, so grab one and stop by.